
Configuration/Base Model

The Configuration/Base Model is a fundamental component of your DATAMIMIC project. It serves as the foundation for setting up and configuring connected systems, specifying system-wide settings and environment details, and including external files.

Example Configuration Model

Consider the following example of a DATAMIMIC Configuration Model:

<setup multiprocessing="True">
    <database id="sourceDB" system="source_oracle" />
    <database id="targetDB" system="target_postgres" />
    <mongodb id="target_mongodb" />

    <!-- The following model creates sample records in the MongoDB -->
    <include uri="1_select_subset.xml" />
    <include uri="2_obfuscate.xml" />
</setup>

In this example:

  • multiprocessing="True" indicates the use of multiprocessing for concurrent data generation.
  • <database> elements are used to define database configurations for source and target systems, including "source_oracle" and "target_postgres."
  • <mongodb> is used to define the MongoDB configuration.
  • <include> elements reference external XML files (1_select_subset.xml and 2_obfuscate.xml) for defining data generation and processing tasks.

This Configuration Model sets the stage for your DATAMIMIC project, allowing you to configure and connect with different systems, specify environment details, and include external configurations as needed.

It's important to tailor the 'Configuration/Base Model' to your specific project requirements, adapting it to the systems and databases you are working with.

<setup>

The <setup> element is the root of the Configuration/Base Model, defining the overall setup for the data generation process.

Attributes:

  • multiprocessing: Indicates whether multiprocessing is enabled. Values can be "True" or "False". Default is "False".
  • defaultSeparator: Defines the default separator used when reading data sources (e.g., CSV). Common values are "|", ";", or ",". Default is "|".
  • defaultDataset: Specifies the default dataset (e.g., "DE", "US"). Default is "US".
  • defaultLocale: Specifies the default locale (e.g., "de", "en"). Default is "en".
  • numProcess: Defines the number of processes for multiprocessing. Default is "None", which means all available CPU cores are used.
  • defaultLineSeparator: Defines the default line separator (e.g., "\r\n", "\r", "\n"). Default is "\n".
  • defaultSourceScripted: Indicates whether values read from data sources are treated as scripts (e.g., "{1 + 2}" will be evaluated as "3"). Default is "False".
  • reportLogging: Indicates whether to log timing information for the generation and export processes. If you have a large number of generate nodes, setting this value to "False" may reduce processing time. Default is "True".
  • defaultVariablePrefix: Defines the prefix for variable substitution in dynamic strings. Default is "__".
  • defaultVariableSuffix: Defines the suffix for variable substitution in dynamic strings. Default is "__".
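
As an illustration, a <setup> element that overrides several of these defaults might look like the following sketch (the attribute values are arbitrary sample settings, not recommendations):

<setup multiprocessing="True" numProcess="4"
       defaultSeparator=";" defaultDataset="DE" defaultLocale="de"
       defaultSourceScripted="True" reportLogging="False">
    <!-- connections, includes and generation tasks go here -->
</setup>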

Children:

  • <database>: Defines relational database connections.
  • <mongodb>: Defines MongoDB connections.
  • <include>: Includes external files or configurations.
  • <memstore>: Defines in-memory storage, useful for temporary data storage among generation tasks.
  • <execute>: Executes external scripts or commands.
  • <echo>: Outputs text or variables to the log, useful for debugging; read more in data-definition.
  • <variable>: Defines variables used in data generation; read more in data-definition.
  • <data-warehouse>: Defines data warehouse configurations.
  • <kafka-exporter>: Defines Kafka producer connections.
  • <kafka-importer>: Defines Kafka consumer connections.
  • <object-storage>: Defines object store configurations.

<database>

The <database> element is used to define database connections within the DATAMIMIC Configuration Model.

Attributes:

  • id: Specifies the unique identifier for the database connection; it is also used as the ID of the configured system environment.
  • system: Specifies the name of your database environment configuration on the platform. If this attribute is not defined, the id is used as the system name.

Example:

<database id="sourceDB" />

<mongodb>

The <mongodb> element defines MongoDB connections within the DATAMIMIC Configuration Model.

Attributes:

  • id: Specifies the unique identifier for the MongoDB connection; it is also used as the ID of the configured system environment.
  • system: Specifies the name of your MongoDB environment configuration on the platform. If this attribute is not defined, the id is used as the system name.

Example:

Set up the MongoDB database

<mongodb id="mongodb" system="mongodb"/>

Insert data into MongoDB

<generate name="mongo_func_test" target="mongodb" count="100">
    <key name="user_id" generator="IncrementGenerator"/>
    <key name="user_name" values="'Bob', 'Frank', 'Phil'"/>
</generate>

Load data from MongoDB with a selector

<generate name="load_mongodb" source="mongodb" selector="find: 'mongo_func_test', filter: {'user_name': 'Bob'}}"/>

Important: If you want to load data from MongoDB and export it as JSON, you must select the data without '_id', because ObjectId is not a regular JSON data type.

<generate name="json_mongodb" source="mongodb" 
          selector="find:'mongo_func_test', filter:{'user_name':'Bob'}, projection:{'_id':0}" 
          target="JSON"/>

Update MongoDB data

<generate name="mongo_update" type="mongo_func_test" source="mongodb" target="mongodb.update">
    <key name="addition" values="'Addition 1', 'Addition 2', 'Addition 3'"/>
</generate>

Update MongoDB with upsert enabled

<generate name="mongo_upsert" source="mongodb"
          selector="find: 'mongo_func_test', filter: {'user_name': 'Mary'}"
          target="mongodb.upsert">
    <key name="addition" values="'addition_value1', 'addition_value2', 'addition_value3'"/>
    <key name="second_addition" values="'value1', 'value2', 'value3'"/>
    <key name="other_addition" values="'other_value1', 'other_value2', 'other_value3'"/>
</generate>

Clear the mongo_func_test collection

<generate name="delete" source="mongodb" selector="find: 'mongo_func_test', filter: {}" target="mongodb.delete"/>

<include>

The <include> element is used to include DATAMIMIC Model files or configurations in the DATAMIMIC configuration model. This allows for modularizing large setups by referencing external configurations dynamically or statically.

Attributes:

  • uri: Specifies the URI (Uniform Resource Identifier) of the external file to include. The URI can be a static path or dynamically generated using variables from the configuration.

Example 1: Static Includes

In this example, two static XML files, 1_select_subset.xml and 2_obfuscate.xml, are included in the setup:

<setup>
    <include uri="1_select_subset.xml"/>
    <include uri="2_obfuscate.xml"/>
</setup>
  • The <include> element here simply loads the contents of 1_select_subset.xml and 2_obfuscate.xml into the configuration. This is useful when you have reusable configurations stored in separate files.

Example 2: Dynamic Includes with Variables

This example demonstrates a more advanced use case where includes are dynamically determined based on data from an external CSV file.

Configuration:

<setup>
    <generate name="ctrl" source="data/control.ent.csv">
        <include uri="{model}"/>
    </generate>
</setup>

CSV File (data/control.ent.csv):

id|count|target|model
1|10|output1|include_generate.xml
2|20|output2|include_generate.xml

Explanation:

  • The configuration dynamically loads the value of the model field from data/control.ent.csv, replacing {model} with the value found in each row. In this case, include_generate.xml is included for both entries in the CSV file.
  • The generate block reads rows from the CSV and, for each row, includes the corresponding file specified in the model column (include_generate.xml).

This dynamic inclusion is useful when different setups or configurations need to be included based on external data sources, making the setup highly flexible and adaptable to various scenarios.

Example 3: Using Multiple Includes with Dynamic Targets

<setup>
    <generate name="dynamic_include" source="data/control.ent.csv">
        <key name="id" script="control.id"/>
        <key name="count" script="control.count"/>
        <key name="target" script="control.target"/>
        <include uri="{model}"/>
    </generate>
</setup>

In this setup:

  • Each row from data/control.ent.csv is processed, and the corresponding model file is included based on the value of {model}.
  • This allows you to modularize the data generation process, pulling in different configurations for different rows and making the setup flexible based on the contents of the CSV file.

Best Practices for Using <include>:

  1. Modularization: Use <include> to split large configurations into smaller, manageable files. This helps in organizing complex setups and reusing common configurations across different models.

  2. Dynamic Includes: Combine <include> with dynamic variables to conditionally load external configurations based on input data (e.g., from a CSV, database, or API).

  3. Error Handling: Ensure the file paths or URIs in the uri attribute are correct, as missing or incorrectly specified files can cause failures in the configuration load process.

  4. Documentation and Naming: Keep your include files well-documented and use meaningful names to ensure that the intent behind each include is clear and maintainable.


<memstore>

The <memstore> element defines in-memory storage for use within the DATAMIMIC configuration model. It is particularly useful for temporarily storing data between different data generation tasks, without requiring access to an external database or file system. The memstore serves as a temporary repository for generated data, allowing you to share data across multiple tasks or reuse the same dataset within a single setup.

Attributes:

  • id: Specifies the unique identifier for the memstore instance. This identifier is used to reference the in-memory storage in other parts of the configuration.

Example 1: Basic In-Memory Storage

In this example, a simple memstore is created and referenced during data generation:

<setup>
    <!-- Define an in-memory store with ID "mem" -->
    <memstore id="mem"/>

    <!-- Generate data and store it in the memstore -->
    <generate name="product_list" count="15" target="mem">
        <key name="id" generator="IncrementGenerator"/>
        <key name="name" values="'Product A', 'Product B', 'Product C'"/>
    </generate>
</setup>

Explanation:

  • The <memstore id="mem"/> defines an in-memory storage instance with the ID "mem".
  • The generate block generates 15 records of product data and stores them in the memory store. This data can be referenced later in the same configuration.

Example 2: Reusing Data from memstore Across Multiple Tasks

<setup>
    <!-- Define an in-memory store -->
    <memstore id="mem"/>

    <!-- First task: Generate product data and store it in memstore -->
    <generate name="product_list" count="15" target="mem">
        <key name="id" generator="IncrementGenerator"/>
        <key name="name" values="'Product A', 'Product B', 'Product C'"/>
    </generate>

    <!-- Second task: Use the data from memstore to generate sales data -->
    <generate name="sales_data" count="30" type="product_list" source="mem" target="CSV">
        <key name="order_id" generator="IncrementGenerator"/>
        <key name="product_id" script="id"/>
        <key name="product_name" script="name"/>
        <key name="quantity" generator="IntegerGenerator(min=1,max=10)"/>
    </generate>
</setup>

Explanation:

  • The first generate block creates a set of product data and stores it in the memstore.
  • The second generate block retrieves the product data from the memstore and uses it to generate sales records, adding new keys such as quantity.

Example 3: Dynamic Data Processing with memstore

<setup>
    <!-- Define in-memory storage -->
    <memstore id="mem"/>

    <!-- First: Generate a list of customers and store them in memory -->
    <generate name="customer_list" count="20" target="mem">
        <key name="customer_id" generator="IncrementGenerator"/>
        <key name="customer_name" values="'Alice', 'Bob', 'Charlie'"/>
    </generate>

    <!-- Second: Generate orders based on the customers in memstore -->
    <generate name="order_list" count="50" type="customer_list" source="mem" target="CSV">
        <key name="order_id" generator="IncrementGenerator"/>
        <key name="customer_id" script="customer_id"/>
        <key name="customer_name" script="customer_name"/>
        <key name="order_total" generator="FloatGenerator(min=10, max=500)"/>
    </generate>
</setup>

Explanation:

  • The first task generates customer data and stores it in the memstore.
  • The second task retrieves the customer data from memory to generate corresponding orders, dynamically linking the orders to customers stored in memory.

Example 4: Combining memstore with Conditions

<setup>
    <!-- Define in-memory storage -->
    <memstore id="mem"/>

    <!-- Generate employees and store them in memory -->
    <generate name="employee_list" count="10" target="mem">
        <key name="employee_id" generator="IncrementGenerator"/>
        <key name="employee_name" values="'John', 'Jane', 'Doe'"/>
    </generate>

    <!-- Generate performance reviews based on employee data in memstore -->
    <generate name="performance_reviews" count="10" type="employee_list" source="mem" target="CSV">
        <key name="review_id" generator="IncrementGenerator"/>
        <key name="employee_id" script="employee_id"/>
        <key name="employee_name" script="employee_name"/>
        <key name="performance_rating" generator="IntegerGenerator(min=1, max=5)"/>

        <!-- Conditionally add a promotion recommendation if the performance rating is 5 -->
        <condition>
            <if condition="performance_rating == 5">
                <key name="promotion_recommended" constant="Yes"/>
            </if>
            <else>
                <key name="promotion_recommended" constant="No"/>
            </else>
        </condition>
    </generate>
</setup>

Explanation:

  • This setup generates employees and stores the data in memory.
  • When generating performance reviews, a conditional check is applied to the performance_rating, and a promotion recommendation is generated based on the rating.

Example 5: Temporary Storage for Intermediary Data Processing

<setup>
    <!-- Define in-memory storage -->
    <memstore id="temp_memstore"/>

    <!-- Generate a dataset of products and store them in memory -->
    <generate name="product_list" count="10" target="temp_memstore">
        <key name="product_id" generator="IncrementGenerator"/>
        <key name="product_name" values="'Laptop', 'Mouse', 'Keyboard', 'Monitor'"/>
    </generate>

    <!-- Generate orders and link them to products in the memstore -->
    <generate name="order_list" count="20" type="product_list" source="temp_memstore" target="CSV">
        <key name="order_id" generator="IncrementGenerator"/>
        <key name="product_id" script="product_id"/>
        <key name="product_name" script="product_name"/>
        <key name="quantity" generator="IntegerGenerator"/>
    </generate>

    <!-- Perform a second processing task using memstore for temporary storage -->
    <generate name="order_summaries" count="5" type="product_list" source="temp_memstore" target="ConsoleExporter">
        <key name="summary_id" generator="IncrementGenerator"/>
        <key name="product_summary" script="product_name + ' summary'"/>
    </generate>
</setup>

Explanation:

  • This setup shows how the memstore can be used to store intermediary data (e.g., products) that is reused in subsequent generation tasks (e.g., orders and order summaries).
  • The temp_memstore is used to temporarily store and reference data across multiple generation tasks without writing to external storage between steps.

Best Practices for Using <memstore>:

  1. Temporary Data Storage: Use memstore to hold temporary data that needs to be reused across different tasks or steps in the configuration.

  2. Efficient Data Processing: Leverage memstore for in-memory processing when you need to reuse datasets without writing to a file or database, improving processing speed in data pipelines.

  3. Organized Data Flow: Define multiple memstore instances when working with different datasets to keep data organized and to avoid data mixing between unrelated tasks.

  4. Dynamic Data Handling: Combine memstore with dynamic variables, conditions, and other elements to handle complex data flows and scenarios, such as conditional data generation or hierarchical structures.


<execute>

The <execute> element is used to run external scripts or commands within the DATAMIMIC Configuration Model. This is particularly useful when you need to set up databases, run SQL scripts, or execute external scripts (e.g., Python scripts) before or during the data generation process.

Attributes:

  • uri: Specifies the URI or path of the script file to execute. This can be a SQL script, a Python script, or another external script type.
  • target: Specifies the target on which to execute the script. This is typically a database ID (e.g., dbPostgres) for executing SQL scripts or may be omitted for other types of external scripts.

Example 1: Executing SQL Scripts on a Database

<setup multiprocessing="True">
    <!-- Define the database -->
    <database id="dbPostgres" system="postgres"/>

    <!-- Execute SQL script to back up the database -->
    <execute uri="script/backup.sql" target="dbPostgres"/>

    <!-- Define a variable holding the table name used in the queries below -->
    <variable name="table_name" constant="public.db_postgres_types5"/>

    <!-- Generate data based on a SELECT query from the database -->
    <generate name="generate_selector"
              source="dbPostgres"
              selector="SELECT * FROM __table_name__" target="ConsoleExporter">
    </generate>

    <!-- Generate additional records with a variable selector from the database -->
    <generate name="variable_selector" count="20" target="ConsoleExporter">
        <variable name="query" source="dbPostgres" selector="SELECT id, text FROM __table_name__" />
        <key name="id" script="query.id"/>
        <key name="name" script="query.text"/>
    </generate>
</setup>

In this example:

  • The <execute> element is used to run an external SQL script (backup.sql) against a PostgreSQL database.
  • The script is executed before generating data, ensuring that the database is backed up before data generation proceeds.

Example 2: Executing Multiple Scripts (SQL and Python)

<setup multiprocessing="True">
    <!-- Define the database -->
    <database id="dbPostgres" system="postgres"/>

    <!-- Execute SQL script to set up the database -->
    <execute uri="script/setup_test_variable_source_with_name_only.sql" target="dbPostgres"/>

    <!-- Execute a Python script for custom logic -->
    <execute uri="script/my_python.src.py"/>

    <!-- Define a variable that sources data from the database -->
    <variable name="db_postgres_test_variable_source_with_name_only" source="dbPostgres" distribution="ordered"/>

    <!-- Generate records based on data fetched from the database -->
    <generate name="user" count="10" target="ConsoleExporter">
        <key name="id" script="db_postgres_test_variable_source_with_name_only.id"/>
        <key name="name" script="db_postgres_test_variable_source_with_name_only.text"/>
        <key name="number" script="db_postgres_test_variable_source_with_name_only.number"/>
    </generate>
</setup>

In this example:

  • Two scripts are executed: one SQL script to set up the database and one Python script to handle additional custom logic.
  • After running the scripts, data is fetched from the database and used to generate output records.

Example 3: Using <execute> with a Complex DB Mapping Scenario

<setup>
    <!-- DB Mapping Demo -->
    <include uri="conf/base.properties"/>
    <memstore id="mem"/>

    <!-- Define the database -->
    <database id="mapping" environment="environment"/>

    <!-- Execute a setup script to prepare the schema and tables -->
    <include uri="1_prepare.xml"/>

    <!-- Load global scripts into the context -->
    <execute uri="script/lib_glob.scr.py"/>

    <!-- Include a model that uses multiprocessing with custom generators -->
    <include uri="2_mapping.xml"/>
</setup>

In this case:

  • The <execute> element loads a global script (lib_glob.scr.py) into the context to provide additional logic or utilities needed for the generation process.
  • The included 1_prepare.xml model sets up the database schema and tables.

Example 4: Executing a Python Script for Custom Logic

<setup>
    <!-- Define the database -->
    <database id="sourceDB" system="postgres"/>

    <!-- Run a SQL script to prepare the database -->
    <execute uri="scripts/prepare_database.sql" target="sourceDB"/>

    <!-- Execute a custom Python script -->
    <execute uri="scripts/my_custom_logic.py"/>

    <!-- Generate data based on the prepared database -->
    <generate name="user_data" source="sourceDB" target="CSV">
        <key name="user_id" generator="IncrementGenerator"/>
        <key name="user_name" script="random.choice(['Alice', 'Bob', 'Charlie'])"/>
    </generate>
</setup>

In this example:

  • A SQL script (prepare_database.sql) sets up the database.
  • A Python script (my_custom_logic.py) is then executed, potentially introducing custom logic or preprocessing.
  • Data is generated from the database and exported as a CSV.

Best Practices for Using <execute>

  1. Use for External Setup or Logic: The <execute> element is ideal for preparing databases, loading external scripts, or introducing custom logic that needs to run before or during data generation.
  2. Order of Execution: Ensure that scripts are executed in the correct order, especially when dependencies exist between them (e.g., preparing a database schema before generating data).
  3. Target Selection: For SQL scripts, always specify the correct database target where the script will be executed.
  4. Custom Scripting: Leverage Python or other scripting languages to enhance the functionality and logic of the data generation process by including external scripts.

<data-warehouse>

The <data-warehouse> element defines data warehouse configurations within the DATAMIMIC Configuration Model.

Attributes:

  • id: Specifies the unique identifier for the data warehouse; it is also used as the ID of the configured system environment.

Example:

<data-warehouse id="warehouse1" />

<kafka-exporter> and <kafka-importer>

The <kafka-exporter> element defines Kafka producer connections, and the <kafka-importer> element defines Kafka consumer connections within the DATAMIMIC Configuration Model.

Common Attributes:

  • id: Specifies the unique identifier for the Kafka connection (producer or consumer).
  • bootstrap_servers: Specifies the Kafka bootstrap servers (cluster nodes) required for connecting to the Kafka cluster.
  • topic: Specifies the topic name to which messages are produced or from which they are consumed.
  • format: Specifies the format of the messages.
  • security_protocol: Defines the security protocol to be used for communication (e.g., PLAINTEXT, SSL, SASL_SSL).
  • system: Defines the system type of the Kafka connection.
  • environment: Defines the environment of the Kafka connection.
  • schema: Specifies the schema used for serializing messages.
  • registry.url: Specifies the URL of the schema registry.
  • partition: Specifies the partition number to which messages should be sent.
  • allow.auto.create.topics: (True/False) - Specifies whether to allow automatic creation of topics if they do not exist.
  • request.timeout.ms: Specifies the maximum time to wait for a request to complete.
  • client.id: Specifies a name for this client.
  • send.buffer.bytes: The size of the TCP send buffer (SO_SNDBUF) to use when sending data.
  • receive.buffer.bytes: The size of the TCP receive buffer (SO_RCVBUF) to use when reading data.
  • max.in.flight.requests.per.connection: Requests are pipelined to Kafka brokers up to this maximum number of in-flight requests per broker connection.
  • reconnect.backoff.ms: The amount of time in milliseconds to wait before attempting to reconnect to a given host.
  • reconnect.backoff.max.ms: The maximum amount of time in milliseconds to backoff/wait when reconnecting to a broker that has repeatedly failed to connect.
  • connections.max.idle.ms: Closes idle connections after this number of milliseconds.
  • retry.backoff.ms: Specifies the backoff time before retrying a failed request (in milliseconds).
  • metadata.max.age.ms: Specifies the period of time in milliseconds after which we force a refresh of metadata.
  • metrics.num.samples: Specifies the number of samples maintained to compute metrics.
  • metrics.sample.window.ms: Specifies the maximum age in milliseconds of samples used to compute metrics.
  • api.version: Specifies which Kafka API version to use (e.g., 0.10.2). If not defined, the client will attempt to infer the broker version by probing various APIs. Different versions enable different functionality.
  • api.version.auto.timeout.ms: Specifies the number of milliseconds after which a timeout exception is raised from the constructor while checking the broker API version. Only applies if api.version is not defined.
  • ssl.key.password: Specifies the password for the SSL key.
  • ssl.truststore.location: Specifies the location of the SSL truststore.
  • ssl.truststore.password: Specifies the password for the SSL truststore.
  • ssl.truststore.type: Specifies the type of the SSL truststore.
  • ssl.truststore.certificate: Specifies the certificates for the SSL truststore.
  • ssl.protocol: Specifies the SSL protocol (e.g., TLSv1.2, TLSv1.3).
  • ssl.keystore.location: Specifies the location of the SSL keystore.
  • ssl.keystore.type: Specifies the type of the SSL keystore.
  • ssl.keystore.key: Specifies the key for the SSL keystore.
  • ssl.keystore.password: Specifies the password for the SSL keystore.
  • ssl.cipher.suites: Specifies the list of ciphers for ssl connections (e.g., DHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA:ECDHE-ECDSA-AES128-GCM-SHA256).
  • sasl.mechanism: Specifies the SASL mechanism used for authentication (e.g., PLAIN, SCRAM-SHA-256).
  • sasl.jaas.config: Specifies the JAAS configuration for SASL.
  • sasl.kerberos.service.name: Specifies the Kerberos service name for SASL mechanism handshake.

<kafka-exporter> Attributes:

  • encoding: Specifies the encoding of the messages.
  • acks: Specifies the number of acknowledgments the producer requires the leader to have received before considering a request complete (e.g., 0, 1, all).
  • compression.type: Specifies the compression type for all data generated by the producer (e.g., gzip, snappy, lz4, None).
  • retries: Specifies the number of retry attempts for failed sends (default is 0).
  • batch.size: Specifies the batch size in bytes.
  • linger.ms: Specifies the time to wait before sending a batch.
  • buffer.memory: Specifies the total bytes of memory the producer should use to buffer records waiting to be sent to the server.
  • max.request.size: Specifies the maximum size of a request.
  • max.block.ms: Specifies the maximum time to block when sending a message.

<kafka-importer> Attributes:

  • pageSize: Specifies the page size for fetching messages.
  • decoding: Specifies the decoding of the messages.
  • enable.auto.commit: (True/False) - If True, the consumer’s offset will be periodically committed in the background.
  • auto.offset.reset: Specifies the offset reset policy (e.g., earliest, latest).
  • group.id: Specifies the consumer group ID.
  • heartbeat.interval.ms: Specifies the interval between heartbeats to the Kafka broker (in milliseconds).
  • auto.commit.interval.ms: Specifies the number of milliseconds between automatic offset commits.
  • check.crcs: (True/False) - Specifies whether to check the CRCs of records consumed.
  • fetch.max.bytes: Specifies the maximum bytes fetched in a single request.
  • fetch.max.wait.ms: Specifies the maximum time to wait for fetching records (in milliseconds).
  • fetch.min.bytes: Specifies the minimum bytes to fetch in a single request.
  • max.partition.fetch.bytes: Specifies the maximum bytes fetched per partition in a single request.
  • max.poll.records: Specifies the maximum number of records returned in a single poll.
  • max.poll.interval.ms: Specifies the maximum interval between polls before the consumer is considered dead (in milliseconds).
  • exclude.internal.topics: (True/False) - Whether records from internal topics (such as offsets) should be exposed to the consumer. If set to True, the only way to receive records from an internal topic is to subscribe to it.
  • session.timeout.ms: The timeout used to detect failures when using Kafka’s group management facilities. The consumer sends periodic heartbeats to indicate its liveness to the broker. If no heartbeats are received by the broker before the expiration of this session timeout, then the broker will remove this consumer from the group and initiate a rebalance. Note that the value must be in the allowable range as configured in the broker configuration by group.min.session.timeout.ms and group.max.session.timeout.ms.
  • consumer.timeout.ms: The number of milliseconds to block during message iteration before raising StopIteration.

Example:

<kafka-exporter id="kafkaSsl" system="kafkaSsl" environment="environment"
                client.id="kafka-client-datamimic" acks="1" compression.type="gzip"
                retries="1" batch.size="16384" linger.ms="0" buffer.memory="33554432"
                request.timeout.ms="30000" receive.buffer.bytes="32768" send.buffer.bytes="131072"/>

<kafka-importer id="kafka_importer" system="kafkaSsl" environment="environment" 
                client.id="kafka-client-datamimic" enable.auto.commit="False" auto.offset.reset="earliest"
                group.id="datamimic" decoding="UTF-8" request.timeout.ms="30000" 
                send.buffer.bytes="131072" receive.buffer.bytes="32768"/>

<object-storage>

The <object-storage> element defines object store configurations within the DATAMIMIC Configuration Model.

Attributes:

  • id: Specifies the unique identifier for the object store; it is also used as the ID of the configured system environment.

Example:

<object-storage id="azure" />

<!-- Write the data as JSON to the Azure blob storage into the container test-datasource at path static_dir/static_file.json -->
<generate name="external_source_1" exportUri="/static_dir/static_file.json" container="test-datasource"
              storageId="azure"
              target="ConsoleExporter, JSON" count="3">
    <key name="id" generator="IncrementGenerator"/>
    <key name="name" generator="GivenNameGenerator"/>
    <key name="email" generator="EmailAddressGenerator"/>
</generate>

<!-- Write the data as JSON, CSV, TXT and XML to the Azure blob storage into the container test-datasource, using the default path, which is the UUID of the task -->
<generate name="external_source_2"
              container="test-datasource"
              storageId="azure"
              target="ConsoleExporter,
              JSON, CSV, TXT, XML" count="3">
    <key name="id" generator="IncrementGenerator"/>
    <key name="name" generator="GivenNameGenerator"/>
    <key name="email" generator="EmailAddressGenerator"/>
</generate>

<object-storage id="aws" />

<!-- Read book.template.xml from the AWS S3 bucket datamimic-test-01 and write it to the console -->
<generate name="external_source_xml" bucket="datamimic-test-01" sourceUri="test-datasource/book.template.xml"
              source="aws" target="ConsoleExporter"/>