Data Definition Model - Core Elements¶

Data Definition Models are fundamental to DATAMIMIC's test data generation capabilities. This document covers the essential elements - if you're new to DATAMIMIC, start here. For advanced features, see Advanced Data Definition Elements.

Overview¶

Data Definition Models specify how test data should be generated, transformed, or obfuscated. The core elements allow you to:

Define data generation tasks
Specify key fields and their values
Create and use variables
Generate structured data sets

Basic Elements¶

`<setup>`¶

The <setup> element is the root element for all data generation tasks. It contains one or more <generate> elements that define specific data generation operations. Learn more about its use in Configuration Models.

<setup>
    <generate name="users" count="100">
        <!-- Generation details go here -->
    </generate>
</setup>

`<generate>`¶

The <generate> element is the core of Data Definition Models. It defines a data generation task and includes attributes like name, count, and target. This element is used to create structured data based on the specified configurations.

Attributes¶

name: Specifies the name of the generation task.
count: Specifies the number of records to generate.
source: Specifies the source of the data (e.g., data/active.ent.csv, mongo).
target: Specifies the target output (e.g., CSV, sqliteDB).
type: Specifies the type of data to generate.
cyclic: Enables or disables cyclic generation. Default is False.
selector: Specifies a database query for the generation.
separator: Specifies a separator for the generated data. Default is |.
sourceScripted: Enables or disables scripted source evaluation in the source file (e.g., example.ent.csv, example.json). Default is False.
pageSize: Specifies the page size for data generation.
storageId: Specifies the ID of object storage, defined by the <object-storage> element.
sourceUri: Specifies the URI of the datasource on object storage (e.g., datasource/employees.csv).
exportUri: Specifies the URI for exporting generated data on object storage (e.g., export/product.csv).
container: Specifies the container name for Azure Blob Storage.
bucket: Specifies the bucket name for AWS S3.
distribution: Specifies the distribution of data source iteration (e.g., random, ordered). Default is random.
converter: Specifies a converter to transform values.
variablePrefix: Configurable attribute that defines the prefix for variable substitution in dynamic strings (default is __).
variableSuffix: Configurable attribute that defines the suffix for variable substitution in dynamic strings (default is __).
numProcess: Defines the number of processes for multiprocessing. The value can be inherited from the parent <setup> element. Default is 1.
mpPlatform: Defines the multiprocessing platform used for execution. Accepted values are multiprocessing and ray. Default is multiprocessing.

Children¶

<key>: Specifies key fields within the data generation task.
<variable>: Defines variables used in data generation.
<reference>: Defines references to other generated data.
<nestedKey>: Specifies nested key fields and their generation methods.
<list>: Defines lists of data items.
<condition>: Conditional element to include data based on certain conditions.
<array>: Defines arrays of data items.
<echo>: Outputs text or variables for logging or debugging purposes.

Example 1: Using Object Storage for Data Generation¶

<setup>
    <!-- Define object-storage with ID referring to the environment -->
    <object-storage id="aws"/>
    <!-- Write file to the object-storage -->
    <generate name="external_write" bucket="datamimic-01" storageId="aws" exportUri="/datamimic_exporting_result/" target="JSON, CSV, TXT, XML" count="100">
        <key name="id" generator="IncrementGenerator"/>
        <key name="name" type="string"/>
    </generate>
    <!-- Read file from object-storage -->
    <generate name="external_read" bucket="datamimic-01" sourceUri="datamimic_exporting_result/external_write.json" source="aws" />
</setup>

Example 2: Using `selector` with a Database¶

<generate name="CUSTOMER" source="mongodb" selector="find: 'CUSTOMER', filter: {'age': {'$lt': 30}}" >
    <key name="id" generator="IncrementGenerator"/>
    <key name="name" type="string"/>
</generate>

In this example:

The selector is used to query the MongoDB database to find all customers under 30 years old.
The data is output to the ConsoleExporter.

Example 3: Generating Data with MongoDB and Aggregation¶

<setup >
    <memstore id="mem"/>
    <mongodb id="mongodb"/>

    <!-- Clear collections before generating new data -->
    <generate name="delete_users" source="mongodb" selector="find: 'more_users', filter: {}" target="mongodb.delete"/>
    <generate name="delete_orders" source="mongodb" selector="find: 'more_orders', filter: {}" target="mongodb.delete"/>
    <generate name="delete_products" source="mongodb" selector="find: 'more_products', filter: {}" target="mongodb.delete"/>

    <!-- Generate orders, users, and products collections -->
    <generate name="more_orders" source="script/orders.json" target="mongodb"/>
    <generate name="more_users" source="script/users.json" target="mongodb"/>
    <generate name="more_products" source="script/products.json" target="mongodb"/>

    <!-- Perform an aggregation query to summarize user orders and spending -->
    <generate name="more_summary" count="20" >
        <variable name="result" source="mongodb"
                  selector='aggregate: "more_users",
                            pipeline: [
                              {
                                "$lookup": {
                                  "from": "more_orders",
                                  "localField": "user_id",
                                  "foreignField": "user_id",
                                  "as": "userOrders"
                                }
                              },
                              {
                                "$unwind": "$userOrders"
                              },
                              {
                                "$lookup": {
                                  "from": "more_products",
                                  "localField": "userOrders.order_item",
                                  "foreignField": "product_name",
                                  "as": "orderProducts"
                                }
                              },
                              {
                                "$unwind": "$orderProducts"
                              },
                              {
                                "$group": {
                                  "_id": "$user_id",
                                  "user_name": { "$first": "$user_name" },
                                  "order_items": { "$push": "$userOrders.order_item" },
                                  "quantities": { "$first": "$userOrders.quantity" },
                                  "total_spending": {
                                    "$sum": {
                                      "$multiply": ["$userOrders.quantity", "$orderProducts.price"]
                                    }
                                  }
                                }
                              }
                            ]'/>
        <nestedKey name="users_orders" script="result"/>
    </generate>

    <!-- Clear collections after generation -->
    <generate name="delete_users" source="mongodb" selector="find: 'more_users', filter: {}" target="mongodb.delete"/>
    <generate name="delete_orders" source="mongodb" selector="find: 'more_orders', filter: {}" target="mongodb.delete"/>
    <generate name="delete_products" source="mongodb" selector="find: 'more_products', filter: {}" target="mongodb.delete"/>
</setup>

Example 4: Generating Data with Kafka¶

<setup >
    <kafka-exporter id="kafkaLocal" environment="environment"/>
    <kafka-importer id="kafka_importer" system="kafkaLocal" enable.auto.commit="True" auto.offset.reset="earliest" group.id="datamimic" decoding="UTF-8" environment="environment"/>

    <!-- Reset Kafka topic by consuming all messages -->
    <generate name="reset" source="kafka_importer" type="kafka" count="100" target=""/>

    <!-- Generate data to export to Kafka and Console -->
    <generate name="exported_data" count="10" target="ConsoleExporter, kafkaLocal">
        <variable name="person" entity="Person"/>
        <key name="name" script="person.name"/>
        <key name="email" script="person.email"/>
    </generate>

    <!-- Import data from Kafka -->
    <generate name="imported_data" source="kafka_importer" type="kafka" count="20"  distribution="ordered"/>
</setup>

Example 5: Using Data from a CSV File¶

<setup defaultSeparator="|">
    <generate name="product1" source="data/products.ent.csv" separator=","  distribution="ordered"/>
    <generate name="product2" source="data/products_2.ent.csv"  distribution="ordered"/>
</setup>

In this example:

Two generate tasks are created that source data from CSV files and output it to the ConsoleExporter.

Example 6: Using `cyclic` with Data from Memory Store¶

<setup >
    <memstore id="mem"/>
    <generate name="product" count="15" target="mem">
        <key name="id" generator="IncrementGenerator"/>
        <key name="name" values="'Alice', 'Bob', 'Cameron'"/>
    </generate>

    <!-- Generate 30 non-cyclic and cyclic products from memory -->
    <generate name="non-cyclic-product" type="product" count="30" cyclic="False" source="mem" target="" distribution="ordered"/>
    <generate name="cyclic-product" type="product" count="30" cyclic="True" source="mem" target="" distribution="ordered"/>
    <generate name="big-cyclic-product" type="product" count="100" cyclic="True" source="mem" target="" distribution="ordered"/>
</setup>
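The difference between the three tasks above can be pictured with a small Python sketch. It is not DATAMIMIC's implementation. It assumes, in line with the note in the `<variable>` best practices that non-cyclic sources are exhausted after one pass, that a non-cyclic read stops once the 15 stored records are used up. The helper name `read_source` is hypothetical.

```python
from itertools import cycle, islice


def read_source(records, count, cyclic=False):
    """Return up to `count` records in order; restart from the top when cyclic."""
    iterator = cycle(records) if cyclic else iter(records)
    return list(islice(iterator, count))


# 15 stored products, standing in for the memstore contents (ids are illustrative)
products = [{"id": i} for i in range(1, 16)]

non_cyclic = read_source(products, 30, cyclic=False)   # source runs out first
cyclic_30 = read_source(products, 30, cyclic=True)     # wraps around once
cyclic_100 = read_source(products, 100, cyclic=True)   # wraps around several times

print(len(non_cyclic), len(cyclic_30), len(cyclic_100))
```

Under these assumptions, only the cyclic tasks can reach a count larger than the number of stored records.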

Example 7: Using `sourceScripted` with a JSON Template¶

<setup>
    <generate name="json_data" source="script/data.json" sourceScripted="True" target="">
        <variable name="random_age" generator="IntegerGenerator(min=18, max=65)"/>
        <variable name="street_name" generator="StreetNameGenerator"/>
        <variable name="address_number" generator="IntegerGenerator"/>
    </generate>
</setup>

[
    {
        "id": 1,
        "name": "Alice",
        "age": "{random_age}",
        "address": "__address_number__, __street_name__ St"
    },
    {
        "id": 2,
        "name": "Bob",
        "age": "{random_age}",
        "address": "__address_number__, __street_name__ St"
    },
    {
        "id": 3,
        "name": "Cameron",
        "age": "{random_age}",
        "address": "__address_number__, __street_name__ St"
    }
]

Result:

[
    {
        "id": 1,
        "name": "Alice",
        "age": 23,
        "address": "801538, Walnut Street St"
    },
    {
        "id": 2,
        "name": "Bob",
        "age": 51,
        "address": "680286, View Street St"
    },
    {
        "id": 3,
        "name": "Cameron",
        "age": 29,
        "address": "711086, Forest Street St"
    }
]

In this example:

The sourceScripted="True" attribute is used to evaluate the JSON template with embedded variables.
The JSON template contains placeholders for variables like random_age, street_name, and address_number.
If the whole JSON field value is a variable, enclose it in curly braces {} (e.g., "age": "{random_age}"). The returned value keeps its own type and can be a string, integer, or any other type.
If a variable is embedded within a string, it should be enclosed in double underscores __ (e.g., "address": "__address_number__, __street_name__ St"). Returned value will be a string. You can also customize the prefix and suffix for variable substitution using variablePrefix, variableSuffix, defaultVariablePrefix, and defaultVariableSuffix attributes. For example:

<setup defaultVariablePrefix="-%" defaultVariableSuffix="%-">
    <generate name="json_data" source="script/data.json" sourceScripted="True" target="">
        <variable name="random_age" generator="IntegerGenerator(min=18, max=65)"/>
    </generate>
</setup>
<setup>
    <generate name="json_data" source="script/data.json" sourceScripted="True" target="" variablePrefix="-%" variableSuffix="%-">
        <variable name="random_age" generator="IntegerGenerator(min=18, max=65)"/>
    </generate>
</setup>
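The two placeholder styles described above can be illustrated with a short Python sketch. This is an approximation for explanation only, not DATAMIMIC's actual template engine; the function name `render` and the regular expressions are assumptions.

```python
import re


def render(value, variables, prefix="__", suffix="__"):
    """Resolve placeholders in a single JSON string value."""
    # Whole-value placeholder such as "{random_age}": keep the variable's own type.
    whole = re.fullmatch(r"\{(\w+)\}", value)
    if whole and whole.group(1) in variables:
        return variables[whole.group(1)]
    # Embedded placeholders such as "__street_name__": the result is always a string.
    embedded = re.compile(re.escape(prefix) + r"(\w+?)" + re.escape(suffix))
    # Unknown names are left untouched rather than replaced.
    return embedded.sub(lambda m: str(variables.get(m.group(1), m.group(0))), value)


variables = {"random_age": 23, "street_name": "Walnut Street", "address_number": 801538}

age = render("{random_age}", variables)
address = render("__address_number__, __street_name__ St", variables)
custom = render("age: -%random_age%-", variables, prefix="-%", suffix="%-")

print(repr(age), repr(address), repr(custom))
```

The whole-value form returns the integer itself, while the embedded form converts every substituted value to text, which matches the behaviour described for the two styles.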

Example 8: Using multiprocessing platform `ray`¶

<setup>
    <generate name="json_data" count="1000000" mpPlatform="ray" target="">
        <key name="random_age" generator="IntegerGenerator(min=18, max=65)"/>
        <key name="street_name" generator="StreetNameGenerator"/>
        <key name="address_number" generator="IntegerGenerator"/>
    </generate>
</setup>

In this example:

Generate tasks are executed on the ray platform instead of the default Python multiprocessing.

The <generate> element defines a data generation task. In its simplest form, it uses the following attributes:

name: Identifies the generation task
count: Specifies how many records to generate
target: (Optional) Specifies the output format (e.g., CSV, JSON)

Basic Example¶

<setup>
    <generate name="simple_users" count="10" target="CSV">
        <key name="id" generator="IncrementGenerator"/>
        <key name="name" type="string"/>
        <key name="age" type="int"/>
    </generate>
</setup>

Essential Attributes¶

name: Task identifier
count: Number of records to generate
target: Output format (e.g., CSV, JSON, ConsoleExporter)
source: (Optional) Input data source

`<key>`¶

The <key> element defines key fields within a data generation task and specifies their generation methods. These fields are crucial for creating unique identifiers or structured elements within the generated data. The <key> element allows for dynamic, constant, or conditional data generation and provides several attributes to customize its behavior.

Attributes¶

name: Specifies the name of the key. This is mandatory and will be used as the field name in the generated data.
type: Defines the data type of the key (e.g., string, int, bool). This is optional when using script or generator.
source: Specifies the data source for the key (e.g., a database, a file).
separator: Specifies the separator for a CSV source.
values: Provides a list of static values for the key to choose from.
script: Defines a script for dynamically generating the key's value.
generator: Specifies a generator to automatically create values (e.g., RandomNumberGenerator, IncrementGenerator).
constant: Defines a constant value for the key.
condition: Specifies a condition to determine whether the key will be generated.
converter: Specifies a converter to transform the value (e.g., date conversion, format changes).
pattern: Defines a regex pattern to validate the value of the key.
inDateFormat / outDateFormat: Specifies input and output date formats for converting date values.
defaultValue: Provides a default value if the key’s value is null or not generated.
nullQuota: Defines the probability that the key will be assigned a null value. Default is 0 (never null).
database: Specifies the database used for generating values (e.g., SequenceTableGenerator).
string: Generates complex strings by embedding variables within the string using customizable delimiters (see the <variable> section for details).
variablePrefix: Configurable attribute that defines the prefix for variable substitution in dynamic strings (default is __).
variableSuffix: Configurable attribute that defines the suffix for variable substitution in dynamic strings (default is __).

Example 1: Generating Constant and Scripted Keys¶

<setup>
    <generate name="static_and_scripted_keys" count="5" >
        <key name="static_key" constant="fixed_value"/>
        <key name="dynamic_key" script="random.randint(1, 100)"/>
    </generate>
</setup>

In this example:

static_key is assigned a constant value of "fixed_value" for every record.
dynamic_key generates a random integer between 1 and 100 for each record using a script.

Example 2: Handling `nullQuota` for Nullable Fields¶

<setup>
    <generate name="nullable_keys" count="10">
        <key name="key_always_null" type="string" nullQuota="1"/> <!-- 100% null values -->
        <key name="key_never_null" type="string" nullQuota="0"/> <!-- 0% null values -->
        <key name="key_sometimes_null" type="string" nullQuota="0.5"/> <!-- 50% null values -->
    </generate>
</setup>

In this example:

key_always_null will always have a null value (nullQuota="1").
key_never_null will never have a null value (nullQuota="0").
key_sometimes_null will have a null value 50% of the time (nullQuota="0.5").
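To make the probability semantics concrete, the following Python sketch simulates a null quota over many records. It only illustrates the idea of "null with probability nullQuota"; the function name `apply_null_quota` is hypothetical and the sketch does not reflect DATAMIMIC internals.

```python
import random


def apply_null_quota(value, null_quota, rng):
    """Return None with probability `null_quota`, otherwise the value itself."""
    return None if rng.random() < null_quota else value


rng = random.Random(42)  # seeded only so the sketch is repeatable
samples = {
    quota: [apply_null_quota("x", quota, rng) for _ in range(1000)]
    for quota in (0, 0.5, 1)
}

for quota, values in samples.items():
    # Share of null values observed for each quota
    print(quota, values.count(None) / len(values))
```

With a quota of 0 no value is ever replaced, with 1 every value is, and with 0.5 the observed share of nulls fluctuates around one half.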

Example 3: Using `defaultValue` for Fallback Values¶

<setup>
    <generate name="default_values" count="5">
        <key name="key_with_empty_string" script="" defaultValue="default_value"/> <!-- Fallback to default_value -->
        <key name="key_with_none" script="None" defaultValue="default_value"/> <!-- Fallback to default_value -->
        <key name="key_with_condition" script="" defaultValue="default_value" condition="False"/> <!-- Condition False, no generation -->
    </generate>
</setup>

Here:

The first two keys fall back to their defaultValue when the script generates an empty or None value.
The third key doesn’t generate any value since its condition is False.
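The interplay of `condition` and `defaultValue` described above can be sketched in Python as follows. The dictionary-based key specification and the function `build_record` are hypothetical stand-ins used only for illustration, not part of DATAMIMIC.

```python
def build_record(key_specs):
    """Build one record, skipping keys whose condition is false."""
    record = {}
    for spec in key_specs:
        if not spec.get("condition", True):
            continue  # condition is False: the key is not generated at all
        value = spec["value"]
        if value is None or value == "":
            value = spec.get("defaultValue")  # fall back to the default
        record[spec["name"]] = value
    return record


record = build_record([
    {"name": "key_with_empty_string", "value": "", "defaultValue": "default_value"},
    {"name": "key_with_none", "value": None, "defaultValue": "default_value"},
    {"name": "key_with_condition", "value": "", "defaultValue": "default_value",
     "condition": False},
])
print(record)
# {'key_with_empty_string': 'default_value', 'key_with_none': 'default_value'}
```

Note that the condition is checked first in this sketch, so a default value never brings back a key that the condition excluded.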

Example 4: Conditional Key Generation¶

<setup>
    <generate name="conditional_keys" count="10">
        <key name="conditional_key" script="random.randint(1, 100)" condition="random.randint(1, 100) > 50"/>
        <key name="constant_key" constant="fixed_value" condition="True"/>
    </generate>
</setup>

In this example:

conditional_key is generated only when a random number greater than 50 is produced by the condition script.
constant_key is always generated since its condition="True".

Example 5: Using `pattern` to Validate Keys¶

<setup>
    <generate name="pattern_matching" count="10">
        <key name="email" script="'user@example.com'" pattern="^[\w\.-]+@[\w\.-]+\.\w+$"/>
        <key name="phone_number" script="'123-456-7890'" pattern="^\d{3}-\d{3}-\d{4}$"/>
    </generate>
</setup>

In this example:

The email key’s value must match the regex pattern for a valid email format.
The phone_number key’s value must match the regex pattern for a valid phone number format (123-456-7890).
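The two patterns can be tried out directly with Python's `re` module. The sketch below only checks whether a value matches; how DATAMIMIC reacts to a non-matching value is not covered here. The sample address `user@example.com` and the helper `matches` are illustrative assumptions.

```python
import re

EMAIL_PATTERN = r"^[\w\.-]+@[\w\.-]+\.\w+$"
PHONE_PATTERN = r"^\d{3}-\d{3}-\d{4}$"


def matches(pattern, value):
    """Return True when the value satisfies the regex pattern."""
    return re.match(pattern, value) is not None


print(matches(PHONE_PATTERN, "123-456-7890"))      # True
print(matches(PHONE_PATTERN, "1234567890"))        # False, dashes are required
print(matches(EMAIL_PATTERN, "user@example.com"))  # True
print(matches(EMAIL_PATTERN, "not-an-email"))      # False, no @ present
```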

Example 6: Date Conversion Using `inDateFormat` and `outDateFormat`¶

<setup>
    <generate name="date_format_conversion" count="10">
        <key name="date_of_birth" script="'2023-10-12'" inDateFormat="%Y-%m-%d" outDateFormat="%d-%m-%Y"/>
    </generate>
</setup>

In this example:

The date_of_birth key uses the input date format (inDateFormat="%Y-%m-%d") to parse the date and converts it to the specified output format (outDateFormat="%d-%m-%Y").
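The format strings use the same directives as Python's `datetime` module, so the conversion can be reproduced in plain Python. The helper name `convert_date` is hypothetical and the sketch is not DATAMIMIC's own code.

```python
from datetime import datetime


def convert_date(value, in_format, out_format):
    """Parse `value` with `in_format` and render it with `out_format`."""
    return datetime.strptime(value, in_format).strftime(out_format)


print(convert_date("2023-10-12", "%Y-%m-%d", "%d-%m-%Y"))  # 12-10-2023
```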

Example 7: Key Generation from a `SequenceTableGenerator`¶

<setup>
    <database id="sourceDB" system="postgres"/>
    <generate name="sequence_key_generation" count="10" >
        <key name="user_id" database="sourceDB" generator="SequenceTableGenerator"/>
    </generate>
</setup>

Here:

The user_id key is generated using a SequenceTableGenerator from a PostgreSQL database.
This generator ensures that unique, sequential values are pulled from the database.

Best Practices for Using `<key>`¶

Leverage script for Dynamic Values: Use script to generate complex and dynamic values, such as random numbers, dates, or values based on calculations.
Use nullQuota for Realistic Data: Use nullQuota to simulate real-world scenarios where some keys may have null values.
Fallback with defaultValue: Use defaultValue to ensure that your keys always have a fallback value if a script fails or produces None.
Pattern Matching for Validation: Use the pattern attribute to enforce specific formatting rules, such as email addresses or phone numbers.
Control Key Generation with condition: Use the condition attribute to dynamically determine whether a key should be generated, allowing for more control in complex data generation scenarios.

`<variable>`¶

The <variable> element defines variables used in data generation tasks. Variables can be sourced from databases, datasets, or dynamically generated using scripts. They introduce flexibility in creating dynamic test data by controlling how the data is retrieved or iterated.

Attributes¶

name: Specifies the name of the variable.
type: Defines the data type of the variable (optional).
source: Specifies the data source for the variable (e.g., a database or a file).
selector: Defines a query to retrieve data for the variable from a database (executed once).
iterationSelector: Executes a query on each iteration to retrieve dynamic data for the variable.
separator: Specifies a separator for the variable (e.g., for CSV sources).
cyclic: Enables or disables cyclic iteration of the data source.
entity: Defines the entity for generating data (e.g., a predefined model or object).
script: Specifies a script for dynamically generating the variable's value.
weightColumn: Specifies a column to weight data selection (typically used in CSV or database sources).
sourceScripted: Determines if the source is scripted.
generator: Defines a generator for the variable (e.g., RandomNumberGenerator, IncrementGenerator).
dataset: Specifies the dataset for the variable (usually a file path).
locale: Defines the locale used when generating data.
inDateFormat / outDateFormat: Specifies date format conversion for input and output.
converter: Defines a converter for transforming the variable's value.
constant: Sets a fixed constant value for the variable.
values: Provides a list of values for the variable to choose from.
defaultValue: Sets a default value when no data is available.
pattern: Defines a regex pattern for validating the variable's content.
distribution: Controls how data is distributed when selecting from a source (random, ordered).
database: Specifies the database used for generating data.
string: Attribute to generate complex strings by embedding variables within the string using customizable delimiters.
variablePrefix: Configurable attribute that defines the prefix for variable substitution in dynamic strings (default is __).
variableSuffix: Configurable attribute that defines the suffix for variable substitution in dynamic strings (default is __).

Example 1: Using `generator` for Incrementing Values¶

<setup>
    <generate name="sequential_ids" count="10" >
        <variable name="id" generator="IncrementGenerator"/>
        <key name="generated_id" script="id"/>
    </generate>
</setup>

In this example:

The id variable uses the IncrementGenerator, which generates sequential numbers.
The generated ID is then assigned to the generated_id key for each record.

Example 2: Sourcing Data from a CSV File with `separator`¶

<setup>
    <generate name="person_data" count="5" >
        <variable name="person" source="data/people.csv" separator="," distribution="ordered"/>
        <key name="person_id" script="person.id"/>
        <key name="person_name" script="person.name"/>
        <key name="person_age" script="person.age"/>
    </generate>
</setup>

In this example:

The person variable is sourced from a CSV file, with fields separated by a comma.
The distribution="ordered" ensures that records are processed in the order they appear in the file.

Example 3: Defining a `constant` Variable¶

<setup>
    <generate name="constant_value_example" count="3" >
        <variable name="country" constant="Germany"/>
        <key name="user_country" script="country"/>
    </generate>
</setup>

In this case:

The country variable is defined as a constant with the value "Germany".
This value is applied to every record generated in the user_country key.

Example 4: Generating Dynamic Variables with `script`¶

<setup>
    <generate name="dynamic_variables" count="5" >
        <variable name="random_number" script="random.randint(1, 100)"/>
        <variable name="full_name" script="fake.name()"/>
        <key name="random_number_value" script="random_number"/>
        <key name="full_name_value" script="full_name"/>
    </generate>
</setup>

In this example:

The random_number variable generates a random integer between 1 and 100 using a script.
The full_name variable uses the fake library to generate random names.
These dynamically generated values are then printed for each record.

Example 5: Using `cyclic` Variables with a CSV Source¶

<setup>
    <generate name="cyclic_people" count="8" >
        <variable name="person" source="data/people.csv" cyclic="True" separator=","/>
        <key name="person_id" script="person.id"/>
        <key name="person_name" script="person.name"/>
    </generate>
</setup>

In this example:

The cyclic="True" attribute ensures that once all records from the CSV file are used, the data starts from the beginning again.

Example 6: Using `distribution` to Randomize Data Selection¶

<setup>
    <generate name="random_people" count="10" >
        <variable name="person" source="data/people.csv" separator="," distribution="random"/>
        <key name="person_id" script="person.id"/>
        <key name="person_name" script="person.name"/>
    </generate>
</setup>

Here:

The distribution="random" attribute ensures that the records are selected randomly from the source CSV file.

Example 7: Iterating with `iterationSelector`¶

<setup>
    <generate name="iterate_selector" count="20" >
        <key name="iteration_count" generator="IncrementGenerator"/>
        <variable name="user" source="dbPostgres"
                  iterationSelector="SELECT id, name FROM users WHERE id = __iteration_count__"/>
        <key name="user_id" script="user[0].id"/>
        <key name="user_name" script="user[0].name"/>
    </generate>
</setup>

In this example:

The iterationSelector query retrieves data from a PostgreSQL database for each iteration using the iteration_count value, dynamically fetching user information.

Example 8: Defining Weighted Variables with `weightColumn`¶

<setup>
    <generate name="weighted_people" count="10" >
        <variable name="people" source="data/people_weighted.csv" weightColumn="weight" separator=","/>
        <key name="person_id" script="people.id"/>
        <key name="person_name" script="people.name"/>
    </generate>
</setup>

Here:

The weightColumn="weight" controls how frequently each row is selected. Rows with higher weight values are more likely to be chosen.
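Weighted selection of this kind can be approximated in Python with `random.choices`. The rows and weights below are invented for illustration and are not the contents of `data/people_weighted.csv`; the sketch also assumes that the weight column holds non-negative numbers.

```python
import random

# Hypothetical rows, as they might look after reading a weighted CSV file
rows = [
    {"id": 1, "name": "Alice", "weight": 8},
    {"id": 2, "name": "Bob", "weight": 2},
    {"id": 3, "name": "Cameron", "weight": 0},
]

rng = random.Random(7)  # seeded only so the sketch is repeatable
weights = [row["weight"] for row in rows]
picks = rng.choices(rows, weights=weights, k=1000)

# Count how often each row was selected
counts = {row["name"]: sum(pick is row for pick in picks) for row in rows}
print(counts)
```

In this sketch a row with weight 0 is never picked, and the remaining rows appear roughly in proportion to their weights.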

Example 9: Combining Variables with Nested Keys¶

<setup>
    <generate name="customer_info" count="10" >
        <variable name="customer" source="data/customers.csv" cyclic="True"/>
        <variable name="notification" source="data/notifications.csv" cyclic="True"/>
        <key name="customer_id" script="customer.id"/>
        <key name="customer_name" script="customer.name"/>
        <nestedKey name="notifications" type="list" count="2">
            <key name="notification_type" script="notification.type"/>
            <key name="notification_message" script="notification.message"/>
        </nestedKey>
    </generate>
</setup>

In this case:

The customer and notification variables are both sourced from CSV files.
The nestedKey element generates two notifications for each customer, showcasing how variables can be combined with nested structures.

Example 10: Working with Entities and Locale-Specific Data¶

<setup>
    <generate name="localized_data" count="5" >
        <variable name="person" entity="Person" locale="de_DE"/>
        <key name="person_name" script="person.full_name"/>
        <key name="person_address" script="person.address"/>
    </generate>
</setup>

In this example:

The person variable is generated using the Person entity, with data localized to de_DE (Germany).
This can be used to generate locale-specific data like names, addresses, etc.

Example 11: Using `string` Attribute for dynamic and complex strings¶

<setup defaultVariablePrefix="%%" defaultVariableSuffix="%%">
    <generate name="query_generation" count="1">
        <variable name="collection" constant="'users'" />
        <key name="query" string="find: %%collection%%, filter: {'status': 'active'}" />
    </generate>
</setup>

In this example:

The string attribute allows dynamic insertion of the variable collection into the query.
The custom %% prefix and suffix replace the default __.

Example 12: Default `variablePrefix` and `variableSuffix`¶

<setup>
    <generate name="query_generation" count="1">
        <variable name="collection" constant="'users'" />
        <key name="query" string="find: __collection__, filter: {'status': 'active'}" />
    </generate>
</setup>

In this case:

The default __ delimiters are used for variable substitution.

Key Benefits of the `string` Attribute¶

Simplicity: Embedding variables directly within the string eliminates the need for manual string concatenation or escaping.
Readability: Dynamic strings are easier to read and maintain.
Flexibility: The variablePrefix and variableSuffix attributes allow customization of the delimiters used, providing more flexibility when working with different syntaxes or conventions.

Best Practices for Using `<variable>`¶

Dynamic Data Generation: Use scripts in variables to create dynamic data like random numbers, names, and addresses using libraries like random and fake.
Cyclic vs Non-Cyclic: Use cyclic variables when you want data to repeat once all values are used, while non-cyclic variables are exhausted after one pass.
Weighting and Randomization: Use weightColumn to skew data generation toward certain records and distribution="random" to randomize data selection.
Combining with Nested Keys: Use variables in combination with nestedKey to generate structured, hierarchical data.

`<operate>`¶

The <operate> element is used to perform operations on data before generating the output.

Attributes¶

source: Specifies CSV file path where operations and models are defined.
operation_prefix: Specifies the prefix for operation names in the CSV file. Optional; default is op.
template_not_found_action: Specifies the action to take when a template is not found (e.g., warn, error, ignore). Default is warn.
operation_not_matched_action: Specifies the action to take when an operation is not matched (e.g., warn, error, ignore). Default is warn.

Example 1: Basic Operation with CSV Source¶

<setup>
    <operate source="operate/test1.3.opctl.csv" operation_prefix="XmlOp_"/>
</setup>

id|template|XmlOp_1|XmlOp_2|XmlOp_3
user1|templates/user1.template.xml|delete(/person/id)|set(/person/name, "John")|
user2|templates/user2.template.json|set($.age, 30)|delete($.address)|

In this example:

The operation_prefix is set to "XmlOp_", so the operations in the CSV file should start with this prefix.
The CSV file specifies operations for two users, one in XML and one in JSON.
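The role of operation_prefix can be illustrated by reading the control file with Python's `csv` module. This sketch only shows how prefixed columns might be collected per artifact; it assumes a `|` separator and left-to-right column order, and the function `operations_by_artifact` is hypothetical rather than part of DATAMIMIC.

```python
import csv
import io

CONTROL_CSV = '''id|template|XmlOp_1|XmlOp_2|XmlOp_3
user1|templates/user1.template.xml|delete(/person/id)|set(/person/name, "John")|
user2|templates/user2.template.json|set($.age, 30)|delete($.address)|
'''


def operations_by_artifact(text, operation_prefix="op", separator="|"):
    """Map each artifact id to its non-empty operations, in column order."""
    # QUOTE_NONE keeps the double quotes inside operations as literal characters.
    reader = csv.DictReader(io.StringIO(text), delimiter=separator,
                            quoting=csv.QUOTE_NONE)
    plan = {}
    for row in reader:
        op_columns = [c for c in reader.fieldnames if c.startswith(operation_prefix)]
        plan[row["id"]] = [row[c] for c in op_columns if row[c]]
    return plan


plan = operations_by_artifact(CONTROL_CSV, operation_prefix="XmlOp_")
for artifact_id, operations in plan.items():
    print(artifact_id, operations)
```

Columns that do not start with the prefix, such as id and template, are ignored as operations, and empty cells such as XmlOp_3 are skipped.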

Example 2: Using Operate with Logging and Error Handling¶

<setup>
  <operate
    source="operate/test1.3.opctl.csv"
    operation_prefix="op"
    template_not_found_action="warn"
    operation_not_matched_action="warn"
  />
</setup>

In this example:

The source attribute points to the CSV file with operations.
operation_prefix is set to "op".
Actions for missing templates or unmatched operations are set to "warn".
Results will be exported locally.

Features¶

Extensive Logging: All operations, warnings, and errors are logged for traceability.
Logical Invalidity Detection: The system detects and reports invalid operation sequences (e.g., modifying data that was already deleted).
Flexible Error Handling: Control how missing templates or unmatched operations are handled via attributes.
Supports Both XML and JSON: Use XPATH for XML and JSONPATH for JSON templates.

Best Practices¶

Ensure operation order is logical (avoid modifying nodes after deletion).
Use clear and unique id values for each artifact.
Review logs for warnings about invalid or skipped operations.

`<ml-train>`¶

The <ml-train> element is used to train machine learning models with input data. These trained models can then be used as sources in <generate> elements to enrich original data. The <ml-train> element is a child element of <setup>.

Attributes¶

name: Specifies the name of the trained model. This attribute is mandatory and is used to reference the model in other elements.
source: Specifies the source of the data (e.g., data/active.ent.csv, mongo).
type: Specifies the type of data to generate.
mode: Specifies the training mode. Two modes are currently available: 'default' and 'persist'. 'default' removes the model after all tasks have finished; 'persist' keeps the model after all tasks have finished.
maxTrainingTime: Specifies the maximum time allowed for model training, in minutes (e.g., 1, 5, 10).
separator: Specifies the separator used in the source data file (e.g., ',' for CSV files).

Example 1: Basic Model Training¶

<setup>
    <ml-train name="customer_csv_gen"
              source="data/customer.ent.csv"
              maxTrainingTime="1"
              separator=","/>

    <generate name="csv_customer" count="10000" pageSize="1000" source="customer_csv_gen" target="CSV">
        <key name="id" generator="IncrementGenerator"/>
    </generate>
</setup>

In this example:

The model named "customer_csv_gen" is trained on data from "data/customer.ent.csv".
No mode is specified, so the default applies and the "customer_csv_gen" model is removed after all tasks finish.
The CSV file uses a comma as its separator.
The <generate> element uses the trained "customer_csv_gen" model as its source to create new data.

Example 2: Training with persist mode¶

<setup numProcess="2">

    <ml-train name="customer_csv_gen"
              source="data/customer.ent.csv"
              mode="persist"
              maxTrainingTime="1"/>

    <!-- Generate synthetic CUSTOMER records using the ML generator -->
    <generate name="csv_customer" count="10000" pageSize="1000" source="customer_csv_gen" target="CSV">
        <key name="id" generator="IncrementGenerator"/>
    </generate>
</setup>

In this example:

The mode is set to "persist", so the model is kept even after all tasks finish.
A later run can use the model without retraining, as in the following setup:

<setup numProcess="2">
    <generate name="csv_customer" count="10000" pageSize="1000" source="customer_csv_gen" target="CSV">
        <key name="id" generator="IncrementGenerator"/>
    </generate>
</setup>

Complete Basic Example¶

Here's a complete example combining the core elements:

<setup>
    <generate name="user_data" count="100" target="CSV">
        <!-- Define variables -->
        <variable name="person" entity="Person"/>

        <!-- Define keys -->
        <key name="id" generator="IncrementGenerator"/>
        <key name="first_name" script="person.given_name"/>
        <key name="last_name" script="person.family_name"/>
        <key name="age" type="int" generator="IntegerGenerator(min=18, max=80)"/>
        <key name="status" constant="active"/>
    </generate>
</setup>

This will generate 100 user records with consistent, structured data including IDs, names, ages, and a status field.

Next Steps¶

Once you're comfortable with these core elements, explore the Advanced Data Definition Elements for more complex features like:

Nested data structures
Conditional generation
Complex data patterns
Arrays and lists
Advanced variable usage
