Configuration/Base Model¶
The Configuration/Base Model is a fundamental component of your DATAMIMIC project, serving as the foundation for setting up and configuring connected systems, including system-wide settings, environment details, and included external files.
Example Configuration Model¶
Consider the following example of a DATAMIMIC Configuration Model:
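A minimal sketch reconstructed from the description below (connection ids and file names are illustrative):

```xml
<setup multiprocessing="True">
    <database id="source_oracle"/>
    <database id="target_postgres"/>
    <mongodb id="mongodb"/>
    <include uri="1_select_subset.xml"/>
    <include uri="2_obfuscate.xml"/>
</setup>
```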
In this example:
- multiprocessing="True" enables multiprocessing for concurrent data generation.
- The <database> elements define database configurations for the source and target systems, "source_oracle" and "target_postgres".
- The <mongodb> element defines the MongoDB configuration.
- The <include> elements reference external XML files (1_select_subset.xml and 2_obfuscate.xml) that define data generation and processing tasks.
This Configuration Model sets the stage for your DATAMIMIC project, allowing you to configure and connect with different systems, specify environment details, and include external configurations as needed.
It's important to tailor the 'Configuration/Base Model' to your specific project requirements, adapting it to the systems and databases you are working with.
<setup>¶
The <setup> element is the root of the Configuration/Base Model, defining the overall setup for the data generation process.
Attributes:¶
- multiprocessing: Indicates whether multiprocessing is enabled. Values can be "True" or "False". Default is "False".
- defaultSeparator: Defines the default separator used when reading a data source (such as CSV). Common values are "|", ";", or ",". Default is "|".
- defaultDataset: Specifies the default dataset (e.g., "DE", "US"). Default is "US".
- defaultLocale: Specifies the default locale (e.g., "de", "en"). Default is "en".
- numProcess: Defines the number of processes used for multiprocessing. Default is "None", which means all available CPU cores are used.
- defaultLineSeparator: Defines the default line separator (e.g., "\r\n", "\r", "\n"). Default is "\n".
- defaultSourceScripted: Indicates whether values in any data source are treated as scripted expressions (e.g., "{1 + 2}" is evaluated as "3"). Default is "False".
- reportLogging: Indicates whether the timing of generation and export steps is logged. If you have a very large number of generate nodes, setting this to "False" may reduce processing time. Default is "True".
- defaultVariablePrefix: Defines the prefix used for variable substitution in dynamic strings. Default is "__".
- defaultVariableSuffix: Defines the suffix used for variable substitution in dynamic strings. Default is "__".
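For instance, a setup that disables multiprocessing, reads semicolon-separated CSV sources, and uses the German dataset and locale could look like this (a sketch using only the attributes listed above):

```xml
<setup multiprocessing="False"
       defaultSeparator=";"
       defaultDataset="DE"
       defaultLocale="de"
       defaultSourceScripted="True">
    <!-- connections, includes, and generate tasks go here -->
</setup>
```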
Children:¶
- <database>: Defines relational database connections.
- <mongodb>: Defines MongoDB connections.
- <include>: Includes external files or configurations.
- <memstore>: Defines in-memory storage, useful for sharing temporary data between generation tasks.
- <execute>: Executes external scripts or commands.
- <echo>: Outputs text or variables to the log, useful for debugging; see the data-definition documentation for details.
- <variable>: Defines variables used in data generation; see the data-definition documentation for details.
- <data-warehouse>: Defines data warehouse configurations.
- <kafka-exporter>: Defines Kafka producer connections.
- <kafka-importer>: Defines Kafka consumer connections.
- <object-storage>: Defines object store configurations.
<database>¶
The <database> element is used to define database connections within the DATAMIMIC Configuration Model.
Attributes:¶
- id: Specifies the unique identifier for the database connection and the id of the configured system environment.
- system: Specifies the name of your database environment configuration on the platform. If this attribute is not defined, the id is used as the system name.
Example:¶
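A minimal sketch (the id and system names are illustrative):

```xml
<database id="target_postgres" system="postgres_dev"/>
```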
<mongodb>¶
The <mongodb> element defines MongoDB connections within the DATAMIMIC Configuration Model.
Attributes:¶
- id: Specifies the unique identifier for the MongoDB connection and the id of the configured system environment.
- system: Specifies the name of your MongoDB environment configuration on the platform. If this attribute is not defined, the id is used as the system name.
Example:¶
Set up the MongoDB connection
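A single element is typically sufficient when the connection details come from the configured environment (the id is illustrative):

```xml
<mongodb id="mongodb"/>
```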
Insert data into MongoDB
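A sketch of a generate task that writes records into a MongoDB collection; the task name, count, and key definitions are illustrative assumptions:

```xml
<generate name="mongo_func_test" count="10" target="mongodb">
    <key name="id" type="int"/>
    <key name="name" type="string"/>
</generate>
```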
Load data from MongoDB with a selector
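A sketch of reading documents back with a selector; the selector string format shown here is an assumption, so check the data-definition documentation for the exact query syntax:

```xml
<generate name="load_products" source="mongodb"
          selector="find: 'mongo_func_test', filter: {}"/>
```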
Important: If you want to load data from MongoDB and export it as JSON, you must select the data without the '_id' field, because ObjectId is not a standard JSON data type.
The same setup can also update MongoDB data (optionally with upsert set to true) and clear a collection such as mongo_func_test.
<include>¶
The <include> element is used to include DATAMIMIC Model files or configurations in the DATAMIMIC configuration model. This allows large setups to be modularized by referencing external configurations statically or dynamically.
Attributes:¶
- uri: Specifies the URI (Uniform Resource Identifier) of the external file to include. The URI can be a static path or dynamically generated using variables from the configuration.
Example 1: Static Includes¶
In this example, two static XML files, 1_select_subset.xml and 2_obfuscate.xml, are included in the setup:
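A sketch of the setup, using the file names referenced above:

```xml
<setup>
    <include uri="1_select_subset.xml"/>
    <include uri="2_obfuscate.xml"/>
</setup>
```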
- The <include> element here simply loads the contents of 1_select_subset.xml and 2_obfuscate.xml into the configuration. This is useful when you have reusable configurations stored in separate files.
Example 2: Dynamic Includes with Variables¶
This example demonstrates a more advanced use case where includes are dynamically determined based on data from an external CSV file.
Configuration:¶
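A sketch of such a configuration; the attribute names on the generate element are assumptions:

```xml
<setup>
    <generate name="control" source="data/control.ent.csv">
        <include uri="{model}"/>
    </generate>
</setup>
```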
CSV File (data/control.ent.csv):¶
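Contents consistent with the explanation below (only the model column is shown; additional columns may exist):

```
model
include_generate.xml
include_generate.xml
```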
- Explanation:
- The configuration dynamically loads the value of the model field from data/control.ent.csv, replacing {model} with the value found in each row. In this case, include_generate.xml is included for both entries in the CSV file.
- The generate block reads rows from the CSV and, for each row, includes the corresponding file specified in the model column (include_generate.xml).
This dynamic inclusion is useful when different setups or configurations need to be included based on external data sources, making the setup highly flexible and adaptable to various scenarios.
Example 3: Using Multiple Includes with Dynamic Targets¶
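A sketch combining a static include with the dynamic, row-driven include; the file name common_settings.xml, the target database, and the generate attributes are illustrative assumptions:

```xml
<setup>
    <database id="target_postgres"/>
    <include uri="common_settings.xml"/>

    <generate name="control" source="data/control.ent.csv">
        <include uri="{model}"/>
    </generate>
</setup>
```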
- In this setup:
- Each row from data/control.ent.csv is processed, and the corresponding model file is included based on the value of {model}.
- This allows you to modularize the data generation process, pulling in different configurations for different rows and making the setup flexible based on the contents of the CSV file.
Best Practices for Using <include>:¶
- Modularization: Use <include> to split large configurations into smaller, manageable files. This helps in organizing complex setups and reusing common configurations across different models.
- Dynamic Includes: Combine <include> with dynamic variables to conditionally load external configurations based on input data (e.g., from a CSV file, database, or API).
- Error Handling: Ensure the file paths or URIs in the uri attribute are correct, as missing or incorrectly specified files cause failures when the configuration is loaded.
- Documentation and Naming: Keep your include files well documented and use meaningful names so that the intent behind each include is clear and maintainable.
<memstore>¶
The <memstore> element defines in-memory storage for use within the DATAMIMIC configuration model. It is particularly useful for temporarily storing data between different data generation tasks, without requiring access to an external database or file system. The memstore serves as a temporary repository for generated data, allowing you to share data across multiple tasks or reuse the same dataset within a single setup.
Attributes:¶
- id: Specifies the unique identifier for the memstore instance. This identifier is used to reference the in-memory storage in other parts of the configuration.
Example 1: Basic In-Memory Storage¶
In this example, a simple memstore is created and referenced during data generation:
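A sketch matching the explanation below; the key definitions are illustrative assumptions:

```xml
<setup>
    <memstore id="mem"/>

    <generate name="product" count="15" target="mem">
        <key name="id" type="int"/>
        <key name="name" type="string"/>
        <key name="price" type="float"/>
    </generate>
</setup>
```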
- Explanation:
- The <memstore id="mem"/> element defines an in-memory storage instance with the ID "mem".
- The generate block generates 15 records of product data and stores them in the memory store. This data can be referenced later in the same configuration.
Example 2: Reusing Data from memstore Across Multiple Tasks¶
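A sketch of the two tasks; the attributes used to read back from the memstore (source and type) and the key definitions are assumptions:

```xml
<setup>
    <memstore id="mem"/>

    <!-- First task: generate products into the memstore -->
    <generate name="product" count="15" target="mem">
        <key name="id" type="int"/>
        <key name="name" type="string"/>
    </generate>

    <!-- Second task: reuse the stored products to build sales records -->
    <generate name="sales" source="mem" type="product">
        <key name="quantity" type="int"/>
    </generate>
</setup>
```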
- Explanation:
- The first generate block creates a set of product data and stores it in the memstore.
- The second generate block retrieves the product data from the memstore and uses it to generate sales records, adding new keys such as quantity.
Example 3: Dynamic Data Processing with memstore¶
- Explanation:
- The first task generates customer data and stores it in the memstore.
- The second task retrieves the customer data from memory to generate corresponding orders, dynamically linking the orders to customers stored in memory.
Example 4: Combining memstore with Conditions¶
- Explanation:
- This setup generates employees and stores the data in memory.
- When generating performance reviews, a conditional check is applied to the performance_rating, and a promotion recommendation is generated based on the rating.
Example 5: Temporary Storage for Intermediary Data Processing¶
- Explanation:
- This setup shows how the memstore can be used to store intermediary data (e.g., products) that is reused in subsequent generation tasks (e.g., orders and order summaries).
- The temp_memstore is used to temporarily store and reference data across multiple generation tasks without writing to external storage between steps.
Best Practices for Using <memstore>:¶
- Temporary Data Storage: Use memstore to hold temporary data that needs to be reused across different tasks or steps in the configuration.
- Efficient Data Processing: Leverage memstore for in-memory processing when you need to reuse datasets without writing to a file or database, improving processing speed in data pipelines.
- Organized Data Flow: Define multiple memstore instances when working with different datasets to keep data organized and to avoid mixing data between unrelated tasks.
- Dynamic Data Handling: Combine memstore with dynamic variables, conditions, and other elements to handle complex data flows and scenarios, such as conditional data generation or hierarchical structures.
<execute>¶
The <execute> element is used to run external scripts or commands within the DATAMIMIC Configuration Model. This is particularly useful when you need to set up databases, run SQL scripts, or execute external scripts (e.g., Python scripts) before or during the data generation process.
Attributes:¶
- uri: Specifies the URI or path of the script file to execute. This can be a SQL script, a Python script, or another external script type.
- target: Specifies the target on which to execute the script. This is typically a database ID (e.g., dbPostgres) for executing SQL scripts and may be omitted for other types of external scripts.
Example 1: Executing SQL Scripts on a Database¶
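A sketch of the setup; the generate task at the end is an illustrative placeholder:

```xml
<setup>
    <database id="dbPostgres"/>

    <!-- Run the backup script before any data is generated -->
    <execute uri="backup.sql" target="dbPostgres"/>

    <generate name="customer" count="100" target="dbPostgres">
        <key name="id" type="int"/>
        <key name="name" type="string"/>
    </generate>
</setup>
```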
In this example:
- The <execute> element is used to run an external SQL script (backup.sql) against a PostgreSQL database.
- The script is executed before generating data, ensuring that the database is backed up before proceeding with data generation.
Example 2: Executing Multiple Scripts (SQL and Python)¶
In this example:
- Two scripts are executed: one SQL script to set up the database and one Python script to handle additional custom logic.
- After running the scripts, data is fetched from the database and used for generating output records.
Example 3: Using <execute> with a Complex DB Mapping Scenario¶
In this case:
- The <execute> element loads a global script (lib_glob.scr.py) into the context to provide additional logic or utilities needed for the generation process.
- The 1_prepare.xml script sets up the database schema and tables.
Example 4: Executing a Python Script for Custom Logic¶
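A sketch using the script names mentioned below; the selector and CSV exporter shown here are assumptions:

```xml
<setup>
    <database id="dbPostgres"/>

    <!-- Prepare the schema, then run custom Python preprocessing -->
    <execute uri="prepare_database.sql" target="dbPostgres"/>
    <execute uri="my_custom_logic.py"/>

    <!-- Read from the database and export the records as CSV -->
    <generate name="customer_export" source="dbPostgres"
              selector="SELECT * FROM customer" target="CSV"/>
</setup>
```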
In this example:
- A SQL script (prepare_database.sql) sets up the database.
- A Python script (my_custom_logic.py) is then executed, potentially introducing custom logic or preprocessing.
- Data is generated from the database and exported as a CSV.
Best Practices for Using <execute>¶
- Use for External Setup or Logic: The <execute> element is ideal for preparing databases, loading external scripts, or introducing custom logic that needs to run before or during data generation.
- Order of Execution: Ensure that scripts are executed in the correct order, especially when dependencies exist between them (e.g., preparing a database schema before generating data).
- Target Selection: For SQL scripts, always specify the correct database target where the script will be executed.
- Custom Scripting: Leverage Python or other scripting languages to enhance the functionality and logic of the data generation process by including external scripts.
<data-warehouse>¶
The <data-warehouse> element defines data warehouse configurations within the DATAMIMIC Configuration Model.
Attributes:¶
- id: Specifies the unique identifier for the data warehouse and the id of the configured system environment.
Example:¶
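A minimal sketch (the id is illustrative):

```xml
<data-warehouse id="dwh"/>
```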
<kafka-exporter> and <kafka-importer>¶
The <kafka-exporter> element defines Kafka producer connections within the DATAMIMIC Configuration Model.
The <kafka-importer> element defines Kafka consumer connections within the DATAMIMIC Configuration Model.
Common Attributes:¶
- id: Specifies the unique identifier for the Kafka connection.
- bootstrap_servers: Specifies the Kafka bootstrap servers (cluster nodes) required for connecting to the Kafka cluster.
- topic: Specifies the topic name where messages are produced or consumed.
- format: Specifies the format of the messages.
- security_protocol: Defines the security protocol used for communication (e.g., PLAINTEXT, SSL, SASL_SSL).
- system: Defines the system type of the Kafka connection.
- environment: Defines the environment of the Kafka connection.
- schema: Specifies the schema used for serializing messages.
- registry.url: Specifies the URL of the schema registry.
- partition: Specifies the partition number to which messages should be sent.
- allow.auto.create.topics: (True/False) Specifies whether to allow automatic creation of topics if they do not exist.
- request.timeout.ms: Specifies the maximum time to wait for a request to complete.
- client.id: Specifies a name for this client.
- send.buffer.bytes: The size of the TCP send buffer (SO_SNDBUF) to use when sending data.
- receive.buffer.bytes: The size of the TCP receive buffer (SO_RCVBUF) to use when reading data.
- max.in.flight.requests.per.connection: The maximum number of requests pipelined to a Kafka broker per connection.
- reconnect.backoff.ms: The amount of time in milliseconds to wait before attempting to reconnect to a given host.
- reconnect.backoff.max.ms: The maximum amount of time in milliseconds to back off when reconnecting to a broker that has repeatedly failed to connect.
- connections.max.idle.ms: Closes idle connections after this number of milliseconds.
- retry.backoff.ms: Specifies the backoff time before retrying a failed request (in milliseconds).
- metadata.max.age.ms: Specifies the period of time in milliseconds after which a refresh of metadata is forced.
- metrics.num.samples: Specifies the number of samples maintained to compute metrics.
- metrics.sample.window.ms: Specifies the maximum age in milliseconds of samples used to compute metrics.
- api.version: Specifies which Kafka API version to use. If not defined, the client will attempt to infer the broker version by probing various APIs. Different versions enable different functionality (e.g., 0.10.2).
- api.version.auto.timeout.ms: Specifies the number of milliseconds after which a timeout exception is raised when checking the broker API version. Only applies if api.version is not defined.
- ssl.key.password: Specifies the password for the SSL key.
- ssl.truststore.location: Specifies the location of the SSL truststore.
- ssl.truststore.password: Specifies the password for the SSL truststore.
- ssl.truststore.type: Specifies the type of the SSL truststore.
- ssl.truststore.certificate: Specifies the certificates for the SSL truststore.
- ssl.protocol: Specifies the SSL protocol (e.g., TLSv1.2, TLSv1.3).
- ssl.keystore.location: Specifies the location of the SSL keystore.
- ssl.keystore.type: Specifies the type of the SSL keystore.
- ssl.keystore.key: Specifies the key for the SSL keystore.
- ssl.keystore.password: Specifies the password for the SSL keystore.
- ssl.cipher.suites: Specifies the list of ciphers for SSL connections (e.g., DHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA:ECDHE-ECDSA-AES128-GCM-SHA256).
- sasl.mechanism: Specifies the SASL mechanism used for authentication (e.g., PLAIN, SCRAM-SHA-256).
- sasl.jaas.config: Specifies the JAAS configuration for SASL.
- sasl.kerberos.service.name: Specifies the Kerberos service name for the SASL mechanism handshake.
<kafka-exporter> Attributes:¶
- encoding: Specifies the encoding of the messages.
- acks: Specifies the number of acknowledgments the producer requires the leader to have received before considering a request complete (e.g., 0, 1, all).
- compression.type: Specifies the compression type for all data generated by the producer (e.g., gzip, snappy, lz4, None).
- retries: Specifies the number of retry attempts for failed sends (default is 0).
- batch.size: Specifies the batch size in bytes.
- linger.ms: Specifies the time to wait before sending a batch.
- buffer.memory: Specifies the total bytes of memory the producer should use to buffer records waiting to be sent to the server.
- max.request.size: Specifies the maximum size of a request.
- max.block.ms: Specifies the maximum time to block when sending a message.
<kafka-importer> Attributes:¶
- pageSize: Specifies the page size for fetching messages.
- decoding: Specifies the decoding of the messages.
- enable.auto.commit: (True/False) If True, the consumer's offset is periodically committed in the background.
- auto.offset.reset: Specifies the offset reset policy (e.g., earliest, latest).
- group.id: Specifies the consumer group ID.
- heartbeat.interval.ms: Specifies the interval between heartbeats to the Kafka broker (in milliseconds).
- auto.commit.interval.ms: Specifies the number of milliseconds between automatic offset commits.
- check.crcs: (True/False) Specifies whether to check the CRCs of records consumed.
- fetch.max.bytes: Specifies the maximum bytes fetched in a single request.
- fetch.max.wait.ms: Specifies the maximum time to wait for fetching records (in milliseconds).
- fetch.min.bytes: Specifies the minimum bytes to fetch in a single request.
- max.partition.fetch.bytes: Specifies the maximum bytes fetched per partition in a single request.
- max.poll.records: Specifies the maximum number of records returned in a single poll.
- max.poll.interval.ms: Specifies the maximum interval between polls before the consumer is considered dead (in milliseconds).
- exclude.internal.topics: (True/False) Whether records from internal topics (such as offsets) should be exposed to the consumer. If set to True, the only way to receive records from an internal topic is to subscribe to it.
- session.timeout.ms: The timeout used to detect failures when using Kafka's group management facilities. The consumer sends periodic heartbeats to indicate its liveness to the broker. If no heartbeats are received by the broker before this timeout expires, the broker removes the consumer from the group and initiates a rebalance. Note that the value must be within the allowable range configured in the broker by group.min.session.timeout.ms and group.max.session.timeout.ms.
- consumer.timeout.ms: The number of milliseconds to block during message iteration before raising StopIteration.
Example:¶
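A sketch of a Kafka exporter together with a generate task producing to it; the attribute values and the generate details are illustrative assumptions:

```xml
<setup>
    <kafka-exporter id="kafka_out"
                    bootstrap_servers="localhost:9092"
                    topic="customer_events"
                    format="json"
                    security_protocol="PLAINTEXT"/>

    <generate name="customer_event" count="100" target="kafka_out">
        <key name="id" type="int"/>
        <key name="event" type="string"/>
    </generate>
</setup>
```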
<object-storage>¶
The <object-storage> element defines object store configurations within the DATAMIMIC Configuration Model.
Attributes:¶
- id: Specifies the unique identifier for the object store and the id of the configured system environment.
Example:¶
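A minimal sketch; the id and the generate task using the store as a target are illustrative assumptions:

```xml
<setup>
    <object-storage id="my_bucket"/>

    <generate name="report" count="10" target="my_bucket">
        <key name="id" type="int"/>
    </generate>
</setup>
```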