Data Definition Model - Core Elements¶

Data Definition Models are fundamental to DATAMIMIC's test data generation capabilities. This document covers the essential elements. If you're new to DATAMIMIC, start here. For advanced features, see Advanced Data Definition Elements.

Overview¶

Data Definition Models specify how test data should be generated, transformed, or obfuscated. The core elements allow you to:

Define data generation tasks
Specify key fields and their values
Create and use variables
Generate structured data sets

Expression Syntax: `{expr}` vs `{{ expr }}`¶

DATAMIMIC supports two forms of expression syntax with different caching behavior:

Syntax Comparison¶

Syntax	Behavior	Caching
`{expr}`	CACHED - Evaluated once per record iteration, then cached	Faster for repeated references
`{{ expr }}`	DYNAMIC - Evaluated every time it's referenced	Fresh value each time

When to Use Each Form¶

Use {expr} (cached) when:

The expression result should be consistent within a single record
You reference the same expression multiple times
Performance matters for complex expressions

Use {{ expr }} (dynamic) when:

You need fresh values each time (e.g., timestamps that must differ between fields of the same record)
Random values should be unique per usage
The expression has side effects
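The difference between the two forms can be modeled in a few lines of Python. This is a conceptual sketch, not DATAMIMIC's implementation: the `RecordContext` class and the `tick()` function are hypothetical, and a counter stands in for `now()` or `random.random()` so the result is repeatable.

```python
import itertools


class RecordContext:
    """Toy evaluator: caches `{expr}` results per record, never caches `{{ expr }}`."""

    def __init__(self, functions):
        self._functions = functions  # names available to expressions
        self._cache = {}

    def start_record(self):
        # The cache only lives for one record iteration.
        self._cache.clear()

    def evaluate(self, script):
        if script.startswith("{{") and script.endswith("}}"):
            # Dynamic form: evaluate on every reference.
            return eval(script[2:-2].strip(), {}, self._functions)
        if script.startswith("{") and script.endswith("}"):
            # Cached form: evaluate once per record, then reuse.
            expr = script[1:-1].strip()
            if expr not in self._cache:
                self._cache[expr] = eval(expr, {}, self._functions)
            return self._cache[expr]
        raise ValueError(f"not an expression: {script!r}")


# A counter stands in for now() or random.random(): every call yields a new value.
counter = itertools.count(1)
ctx = RecordContext({"tick": lambda: next(counter)})

ctx.start_record()
cached_pair = (ctx.evaluate("{tick()}"), ctx.evaluate("{tick()}"))
dynamic_pair = (ctx.evaluate("{{ tick() }}"), ctx.evaluate("{{ tick() }}"))

ctx.start_record()
next_record_value = ctx.evaluate("{tick()}")

print(cached_pair)        # (1, 1): both references share one value
print(dynamic_pair)       # (2, 3): each reference gets a fresh value
print(next_record_value)  # 4: the cache was reset for the new record
```

In this model the cache key is the expression text, which is why two keys that use the same cached expression see the same value within one record.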

Examples¶

Timestamp Behavior¶

<!-- Cached: All timestamp fields get the SAME value within a record -->
<generate name="audit_log" count="100">
    <key name="created_at" script="{now()}"/>
    <key name="updated_at" script="{now()}"/>  <!-- Same as created_at -->
    <key name="processed_at" script="{now()}"/>  <!-- Same as created_at -->
</generate>

<!-- Dynamic: Each timestamp field gets a FRESH value -->
<generate name="event_stream" count="100">
    <key name="event_time" script="{{now()}}"/>
    <key name="log_time" script="{{now()}}"/>  <!-- Different from event_time -->
    <key name="process_time" script="{{now()}}"/>  <!-- Different from log_time -->
</generate>

Random Value Behavior¶

<!-- Cached: Same random value when referenced multiple times -->
<generate name="data" count="100">
    <key name="value1" script="{random.random()}"/>
    <key name="value2" script="{random.random()}"/>  <!-- Same expression = same cached value -->
</generate>

<!-- Dynamic: Fresh random value each time -->
<generate name="data" count="100">
    <key name="value1" script="{{random.random()}}"/>
    <key name="value2" script="{{random.random()}}"/>  <!-- Different value -->
</generate>

Recommendation Table¶

Use Case	Recommended	Reason
Static/deterministic expressions	`{expr}`	Better performance
Time-sensitive (timestamps)	`{{ expr }}`	If precision matters between fields
Random values needing uniqueness	`{{ expr }}`	Fresh value each evaluation
Complex expressions referenced multiple times	`{expr}`	Consistency + performance
Expressions with side effects	`{{ expr }}`	Ensure side effects execute

Note on `targetEntity`¶

For targetEntity in single-file exporters, {expr} and {{ expr }} behave identically because the expression cache is reset for each record and the value is evaluated once per record. See Dynamic targetEntity for details.

Setup-Time vs Runtime Attributes¶

Some attributes are evaluated at setup time (before generation starts), while others support runtime evaluation (during each record iteration). This distinction affects which expression syntax you can use:

Attribute	Evaluation Time	`{expr}`	`{{ expr }}`
`source`	Setup	Allowed	Not allowed
`target`	Setup	Allowed	Not allowed
`uri` (execute)	Setup	Allowed	Not allowed
`sourceUri`	Setup	Allowed	Not allowed
`exportUri`	Setup	Allowed	Not allowed
`targetEntity`	Runtime	Allowed	Allowed
`count`	Runtime	Allowed	Allowed
`selector`	Runtime	Allowed	N/A

Setup-time attributes require a resolved value before generation begins. Using {{expr}} (DYNAMIC) for these attributes raises error I870.
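The rule in the table can be sketched as a small check. The helper names `classify` and `check_attribute` are hypothetical, and the error message is made up. Only the attribute list and the error code I870 come from this page. The sketch also assumes that a value with braces only in the middle is not recognized as an expression at all.

```python
SETUP_TIME_ATTRIBUTES = {"source", "target", "uri", "sourceUri", "exportUri"}


def classify(value):
    """Return 'dynamic', 'cached', or 'literal' for an attribute value."""
    text = value.strip()
    if text.startswith("{{") and text.endswith("}}"):
        return "dynamic"
    if text.startswith("{") and text.endswith("}"):
        return "cached"
    return "literal"


def check_attribute(name, value):
    """Reject dynamic expressions on setup-time attributes (hypothetical helper)."""
    kind = classify(value)
    if name in SETUP_TIME_ATTRIBUTES and kind == "dynamic":
        # The error code is documented; the message wording is an assumption.
        raise ValueError(
            f"I870: '{name}' is resolved at setup time; "
            f"use {{expr}} instead of {{{{ expr }}}}"
        )
    return kind


print(check_attribute("source", "{data_path}"))              # cached
print(check_attribute("targetEntity", "{{ entity_name }}"))  # dynamic
print(check_attribute("source", "data/{version}/file.csv"))  # literal
```

Calling `check_attribute("source", "{{ data_path }}")` raises the `ValueError` in this sketch, mirroring the documented I870 failure.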

Example: Dynamic Source Path¶

<!-- Define the path in a variable -->
<variable name="data_path" script="f'data/{version}/customers.csv'"/>

<!-- Use {expr} for setup-time evaluation -->
<generate name="customers" source="{data_path}" count="100">
  <key name="id" generator="IncrementGenerator"/>
</generate>

Example: Dynamic Target¶

<!-- Define target based on environment -->
<variable name="output_target" script="'JSONSingle' if single_file else 'JSON'"/>

<!-- Use {expr} for target -->
<generate name="data" count="100" target="{output_target}">
  <key name="id" generator="IncrementGenerator"/>
</generate>

Warning

Inline interpolation (e.g., source="data/{version}/file.csv") is not supported. The entire attribute value must be wrapped in {expr}:

<!-- Correct: pre-build path in variable -->
<variable name="path" script="f'data/{version}/file.csv'"/>
<generate source="{path}" ...>

<!-- Incorrect: inline interpolation not supported -->
<generate source="data/{version}/file.csv" ...>

Basic Elements¶

`<setup>`¶

The <setup> element is the root element for all data generation tasks. It contains one or more <generate> elements that define specific data generation operations. Learn more about its use in Configuration Models.

<setup>
    <generate name="users" count="100">
        <!-- Generation details go here -->
    </generate>
</setup>

`<generate>`¶

The <generate> element is the core of Data Definition Models. It defines a data generation task and includes attributes like name, count, and target. This element is used to create structured data based on the specified configurations.

Note

For nested <generate> blocks, variable lookup is scope-sensitive. If a variable is declared inside the current nested generate, prefer this.variableName instead of relying on an unqualified name. See Variable Scoping in Nested Generates.

Attributes¶

name: Specifies the name of the generation task.
count: Specifies the number of records to generate.
source: Specifies the source of the data (e.g., data/active.ent.csv, mongo).
target: Specifies the target output (e.g., CSV, sqliteDB).
type: Specifies the type of data to generate.
cyclic: Enables or disables cyclic generation. Default is False.
selector: Specifies a database query for the generation.
For top-level <generate>, selector behavior belongs to the datasource/loader read path, not to the variable setup-cache contract.
separator: Specifies a separator for the generated data. Default is |.
sourceScripted: Enables or disables scripted source evaluation in the source file (e.g., example.ent.csv, example.json). Default is False.
pageSize: Specifies the page size for data generation.
For top-level selector reads, pageSize affects generate/source paging and downstream batching as implemented by the loader path.
It is not a universal selector semantic for all selector forms in the DSL.
storageId: Specifies the object-storage client ID, defined by the <object-storage> element. Applies to file exporters (CSV/JSON/XML/Template) and is not the same as targetClient.
sourceUri: Specifies the URI of the datasource on object storage (e.g., datasource/employees.csv).
exportUri: Specifies only the path/prefix for exporting generated data (e.g., export/). The filename is derived from name or targetEntity, and the extension is determined by the exporter.
container: Specifies the container name for Azure Blob Storage.
bucket: Specifies the bucket name for AWS S3.
distribution: Specifies the distribution of data source iteration (e.g., random, ordered). Default is random.
converter: Specifies a converter to transform the value.
variablePrefix: Configurable attribute that defines the prefix for variable substitution in dynamic strings (default is __).
variableSuffix: Configurable attribute that defines the suffix for variable substitution in dynamic strings (default is __).
numProcess: Defines the number of processes for multiprocessing. Can be inherited from the parent <setup> element. Default is 1.
mpPlatform: Defines the multiprocessing platform to use. Accepted values are multiprocessing and ray. Default is multiprocessing.

Overview (Target vs Storage vs Source)¶

Goal	Attribute(s)	Notes
Choose exporter type	`target`	e.g., `CSV`, `JSON`, `XML`, `Template`, `mongodb`
Override DB/Kafka/Warehouse client	`targetClient`	Client-based exporters only; not object storage
Select object-storage client for file exports	`storageId`	References `<object-storage id="...">`
Set export path/prefix	`exportUri`	Path only; filename from `name`/`targetEntity`, extension from exporter
Read from object storage	`source` + `sourceUri`	`source` = client id; `sourceUri` = object key
Set bucket/container	`bucket` / `container`	Optional; defaults from client config

Children¶

<key>: Specifies key fields within the data generation task.
<variable>: Defines variables used in data generation.
<reference>: Defines references to other generated data.
<nestedKey>: Specifies nested key fields and their generation methods.
<list>: Defines lists of data items.
<condition>: Conditional element to include data based on certain conditions.
<array>: Defines arrays of data items.
<echo>: Outputs text or variables for logging or debugging purposes.

Example 1: Using Object Storage for Data Generation¶

<setup>
    <!-- Define object-storage with ID referring to the environment -->
    <object-storage id="aws"/>
    <!-- Write file to the object-storage (exportUri is path/prefix only) -->
    <generate name="external_write" bucket="datamimic-01" storageId="aws" exportUri="/datamimic_exporting_result/" target="JSON, CSV, TXT, XML" count="100">
        <key name="id" generator="IncrementGenerator"/>
        <key name="name" type="string"/>
    </generate>
    <!-- Read file from object-storage -->
    <generate name="external_read" bucket="datamimic-01" sourceUri="datamimic_exporting_result/external_write.json" source="aws" />
</setup>
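How the exported object keys relate to `exportUri`, `name`, and the exporter can be sketched as follows. This is an illustration under assumptions, not the exporter's actual code: the `export_keys` helper and the extension mapping are hypothetical, and the sketch assumes that `targetEntity`, when set, takes precedence over `name`.

```python
import posixpath

# Assumed mapping from exporter name to file extension (illustrative only).
EXTENSIONS = {"JSON": "json", "CSV": "csv", "TXT": "txt", "XML": "xml"}


def export_keys(export_uri, name, targets, target_entity=None):
    """Build one object key per file exporter listed in `targets`."""
    # Assumption: targetEntity, when set, takes precedence over name.
    stem = target_entity or name
    prefix = export_uri.strip("/")
    keys = []
    for target in (t.strip() for t in targets.split(",")):
        keys.append(posixpath.join(prefix, f"{stem}.{EXTENSIONS[target]}"))
    return keys


keys = export_keys("/datamimic_exporting_result/", "external_write", "JSON, CSV, TXT, XML")
print(keys[0])  # datamimic_exporting_result/external_write.json
```

The first key matches the `sourceUri` that the `external_read` task uses in the example above.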

Object storage reads (source + sourceUri)¶

<object-storage id="s3" />

<generate name="read_orders"
          source="s3"
          sourceUri="abc/test/myfile.csv"
          bucket="my-bucket"
          target="CSV"/>

Example 2: Using `selector` with a Database¶

<generate name="CUSTOMER" source="mongodb" selector="find: 'CUSTOMER', filter: {'age': {'$lt': 30}}" >
    <key name="id" generator="IncrementGenerator"/>
    <key name="name" type="string"/>
</generate>

In this example:

The selector is used to query the MongoDB database to find all customers under 30 years old.
The data is output to the ConsoleExporter.

Selector contract notes¶

For top-level <generate> statements:

Form	Execution timing	Cache scope	`pageSize` effect	Notes
`selector="SQL or find: ..."`	Generate read path	No setup-cache contract	Loader-owned paging/export batching	Do not infer variable-style cache-once behavior
`selector="aggregate: ..."`	Generate read path	No setup-cache contract	Backend-specific; do not assume `find`-style paging	Mongo `aggregate` is its own selector kind

Important notes:

Root-level <generate iterationSelector> is not supported.
Parent/global placeholders may be used in selector text when the generate contract allows them. Child-scope placeholders are not visible to a parent selector.
If you need setup-cached reference data semantics, use a database-backed <variable selector="..."> and not a top-level generate selector.

Example 3: Generating Data with MongoDB and Aggregation¶

<setup >
    <memstore id="mem"/>
    <mongodb id="mongodb"/>

    <!-- Clear collections before generating new data -->
    <generate name="delete_users" source="mongodb" selector="find: 'more_users', filter: {}" target="mongodb.delete"/>
    <generate name="delete_orders" source="mongodb" selector="find: 'more_orders', filter: {}" target="mongodb.delete"/>
    <generate name="delete_products" source="mongodb" selector="find: 'more_products', filter: {}" target="mongodb.delete"/>

    <!-- Generate orders, users, and products collections -->
    <generate name="more_orders" source="script/orders.json" target="mongodb"/>
    <generate name="more_users" source="script/users.json" target="mongodb"/>
    <generate name="more_products" source="script/products.json" target="mongodb"/>

    <!-- Perform an aggregation query to summarize user orders and spending -->
    <generate name="more_summary" count="20" >
        <variable name="result" source="mongodb"
                  selector='aggregate: "more_users",
                            pipeline: [
                              {
                                "$lookup": {
                                  "from": "more_orders",
                                  "localField": "user_id",
                                  "foreignField": "user_id",
                                  "as": "userOrders"
                                }
                              },
                              {
                                "$unwind": "$userOrders"
                              },
                              {
                                "$lookup": {
                                  "from": "more_products",
                                  "localField": "userOrders.order_item",
                                  "foreignField": "product_name",
                                  "as": "orderProducts"
                                }
                              },
                              {
                                "$unwind": "$orderProducts"
                              },
                              {
                                "$group": {
                                  "_id": "$user_id",
                                  "user_name": { "$first": "$user_name" },
                                  "order_items": { "$push": "$userOrders.order_item" },
                                  "quantities": { "$first": "$userOrders.quantity" },
                                  "total_spending": {
                                    "$sum": {
                                      "$multiply": ["$userOrders.quantity", "$orderProducts.price"]
                                    }
                                  }
                                }
                              }
                            ]'/>
        <nestedKey name="users_orders" script="result"/>
    </generate>

    <!-- Clear collections after generation -->
    <generate name="delete_users" source="mongodb" selector="find: 'more_users', filter: {}" target="mongodb.delete"/>
    <generate name="delete_orders" source="mongodb" selector="find: 'more_orders', filter: {}" target="mongodb.delete"/>
    <generate name="delete_products" source="mongodb" selector="find: 'more_products', filter: {}" target="mongodb.delete"/>
</setup>

Example 4: Generating Data with Kafka¶

<setup >
    <kafka-exporter id="kafkaLocal" environment="environment"/>
    <kafka-importer id="kafka_importer" system="kafkaLocal" enable.auto.commit="True" auto.offset.reset="earliest" group.id="datamimic" decoding="UTF-8" environment="environment"/>

    <!-- Reset Kafka topic by consuming all messages -->
    <generate name="reset" source="kafka_importer" type="kafka" count="100" target=""/>

    <!-- Generate data to export to Kafka and Console -->
    <generate name="exported_data" count="10" target="ConsoleExporter, kafkaLocal">
        <variable name="person" entity="Person"/>
        <key name="name" script="person.name"/>
        <key name="email" script="person.email"/>
    </generate>

    <!-- Import data from Kafka -->
    <generate name="imported_data" source="kafka_importer" type="kafka" count="20"  distribution="ordered"/>
</setup>

Example 5: Using Data from a CSV File¶

<setup defaultSeparator="|">
    <generate name="product1" source="data/products.ent.csv" separator=","  distribution="ordered"/>
    <generate name="product2" source="data/products_2.ent.csv"  distribution="ordered"/>
</setup>

In this example:

Two generate tasks are created that source data from CSV files and output it to the ConsoleExporter.

Example 6: Using `cyclic` with Data from Memory Store¶

<setup >
    <memstore id="mem"/>
    <generate name="product" count="15" target="mem">
        <key name="id" generator="IncrementGenerator"/>
        <key name="name" values="'Alice', 'Bob', 'Cameron'"/>
    </generate>

    <!-- Generate 30 non-cyclic and cyclic products from memory -->
    <generate name="non-cyclic-product" type="product" count="30" cyclic="False" source="mem" target="" distribution="ordered"/>
    <generate name="cyclic-product" type="product" count="30" cyclic="True" source="mem" target="" distribution="ordered"/>
    <generate name="big-cyclic-product" type="product" count="100" cyclic="True" source="mem" target="" distribution="ordered"/>
</setup>
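The effect of `cyclic` on an ordered source can be pictured with plain Python iterators. This is a sketch that assumes a non-cyclic source simply stops once its rows are exhausted. The `read_rows` helper and the stand-in rows are hypothetical.

```python
from itertools import cycle, islice


def read_rows(rows, count, cyclic):
    """Take up to `count` rows in order; wrap around only when `cyclic` is true."""
    source = cycle(rows) if cyclic else iter(rows)
    return list(islice(source, count))


# 15 stand-in rows, mirroring the size of the "product" set stored in memory.
names = ["Alice", "Bob", "Cameron"]
products = [{"id": i + 1, "name": names[i % 3]} for i in range(15)]

non_cyclic = read_rows(products, 30, cyclic=False)
cyclic_rows = read_rows(products, 30, cyclic=True)
big_cyclic = read_rows(products, 100, cyclic=True)

print(len(non_cyclic), len(cyclic_rows), len(big_cyclic))  # 15 30 100
print(cyclic_rows[15]["id"])  # 1: the source restarted from its first row
```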

Example 7: Using `sourceScripted` with a JSON Template¶

<setup>
    <generate name="json_data" source="script/data.json" sourceScripted="True" target="">
        <variable name="random_age" generator="IntegerGenerator(min=18, max=65)"/>
        <variable name="street_name" generator="StreetNameGenerator"/>
        <variable name="address_number" generator="IntegerGenerator"/>
    </generate>
</setup>

[
    {
        "id": 1,
        "name": "Alice",
        "age": "{random_age}",
        "address": "__address_number__, __street_name__ St"
    },
    {
        "id": 2,
        "name": "Bob",
        "age": "{random_age}",
        "address": "__address_number__, __street_name__ St"
    },
    {
        "id": 3,
        "name": "Cameron",
        "age": "{random_age}",
        "address": "__address_number__, __street_name__ St"
    }
]

Result:

[
    {
        "id": 1,
        "name": "Alice",
        "age": 23,
        "address": "801538, Walnut Street St"
    },
    {
        "id": 2,
        "name": "Bob",
        "age": 51,
        "address": "680286, View Street St"
    },
    {
        "id": 3,
        "name": "Cameron",
        "age": 29,
        "address": "711086, Forest Street St"
    }
]

In this example:

The sourceScripted="True" attribute is used to evaluate the JSON template with embedded variables.
The JSON template contains placeholders for variables like random_age, street_name, and address_number.
If the whole JSON field value is a variable, enclose it in curly braces {} (e.g., "age": "{random_age}"). The returned value can be a string, an integer, or any other type.
If a variable is embedded within a string, enclose it in double underscores __ (e.g., "address": "__address_number__, __street_name__ St"). The returned value is always a string. You can customize the prefix and suffix for variable substitution with the variablePrefix and variableSuffix attributes on an element, or with defaultVariablePrefix and defaultVariableSuffix on <setup>. For example:

<!-- Option 1: set the delimiters globally on the setup element -->
<setup defaultVariablePrefix="-%" defaultVariableSuffix="%-">
    <generate name="json_data" source="script/data.json" sourceScripted="True" target="">
        <variable name="random_age" generator="IntegerGenerator(min=18, max=65)"/>
    </generate>
</setup>

<!-- Option 2: override the delimiters on a single generate element -->
<setup>
    <generate name="json_data" source="script/data.json" sourceScripted="True" target="" variablePrefix="-%" variableSuffix="%-">
        <variable name="random_age" generator="IntegerGenerator(min=18, max=65)"/>
    </generate>
</setup>
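The two substitution forms described above can be modeled with a short Python function. This is a sketch of the documented behavior, not DATAMIMIC's template engine. The `render_field` helper is hypothetical, and the variable values are taken from the first record of the sample result.

```python
import re


def render_field(value, variables, prefix="__", suffix="__"):
    """Resolve one template field against `variables`."""
    if not isinstance(value, str):
        return value
    # Whole-field form: "{name}" returns the variable with its original type.
    whole = re.fullmatch(r"\{(\w+)\}", value)
    if whole:
        return variables[whole.group(1)]
    # Embedded form: prefix + name + suffix is replaced inside the string.
    pattern = re.compile(re.escape(prefix) + r"(\w+?)" + re.escape(suffix))
    return pattern.sub(lambda m: str(variables[m.group(1)]), value)


variables = {"random_age": 23, "street_name": "Walnut Street", "address_number": 801538}
template = {
    "id": 1,
    "name": "Alice",
    "age": "{random_age}",
    "address": "__address_number__, __street_name__ St",
}

row = {key: render_field(value, variables) for key, value in template.items()}
print(row)
# {'id': 1, 'name': 'Alice', 'age': 23, 'address': '801538, Walnut Street St'}

# Custom delimiters, as with variablePrefix="-%" and variableSuffix="%-".
print(render_field("age -%random_age%-", variables, prefix="-%", suffix="%-"))  # age 23
```

Note how `age` keeps its integer type while `address` is always a string.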

Example 8: Using multiprocessing platform `ray`¶

<setup>
    <generate name="json_data" count="1000000" mpPlatform="ray" target="">
        <key name="random_age" generator="IntegerGenerator(min=18, max=65)"/>
        <key name="street_name" generator="StreetNameGenerator"/>
        <key name="address_number" generator="IntegerGenerator"/>
    </generate>
</setup>

In this example:

Generate tasks are executed on the ray platform instead of the default Python multiprocessing.

The <generate> element defines a data generation task. At its most basic, it takes:

name: Identifies the generation task
count: Specifies how many records to generate
target: (Optional) Specifies the output format (e.g., CSV, JSON)

Basic Example¶

<setup>
    <generate name="simple_users" count="10" target="CSV">
        <key name="id" generator="IncrementGenerator"/>
        <key name="name" type="string"/>
        <key name="age" type="int"/>
    </generate>
</setup>

Essential Attributes¶

name: Task identifier
count: Number of records to generate
target: Output format (e.g., CSV, JSON, ConsoleExporter)
source: (Optional) Input data source

`<key>`¶

The <key> element defines key fields within a data generation task and specifies their generation methods. These fields are crucial for creating unique identifiers or structured elements within the generated data. The <key> element allows for dynamic, constant, or conditional data generation and provides several attributes to customize its behavior.

Attributes¶

name: Specifies the name of the key. This is mandatory and will be used as the field name in the generated data.
type: Defines the data type of the key (e.g., string, int, bool). This is optional when using script or generator.
source: Specifies the data source for the key (e.g., a database, a file).
separator: Specifies a separator for a CSV source.
values: Provides a list of static values for the key to choose from.
script: Defines a script for dynamically generating the key's value.
generator: Specifies a generator to automatically create values (e.g., RandomNumberGenerator, IncrementGenerator).
constant: Defines a constant value for the key.
condition: Specifies a condition to determine whether the key will be generated.
converter: Specifies a converter to transform the value (e.g., date conversion, format changes).
pattern: Defines a regex pattern to validate the value of the key.
inDateFormat / outDateFormat: Specifies input and output date formats for converting date values. After outDateFormat is applied, downstream expressions see a string, not a raw datetime.
defaultValue: Provides a default value if the key’s value is null or not generated.
nullQuota: Defines the probability that the key will be assigned a null value. Default is 0 (never null).
database: Specifies the database used for generating values (e.g., SequenceTableGenerator).
string: Generates complex strings by embedding variables within the string using customizable delimiters (see the <variable> section).
variablePrefix: Configurable attribute that defines the prefix for variable substitution in dynamic strings (default is __).
variableSuffix: Configurable attribute that defines the suffix for variable substitution in dynamic strings (default is __).

Example 1: Generating Constant and Scripted Keys¶

<setup>
    <generate name="static_and_scripted_keys" count="5" >
        <key name="static_key" constant="fixed_value"/>
        <key name="dynamic_key" script="random.randint(1, 100)"/>
    </generate>
</setup>

In this example:

static_key is assigned a constant value of "fixed_value" for every record.
dynamic_key generates a random integer between 1 and 100 for each record using a script.

Example 2: Handling `nullQuota` for Nullable Fields¶

<setup>
    <generate name="nullable_keys" count="10">
        <key name="key_always_null" type="string" nullQuota="1"/> <!-- 100% null values -->
        <key name="key_never_null" type="string" nullQuota="0"/> <!-- 0% null values -->
        <key name="key_sometimes_null" type="string" nullQuota="0.5"/> <!-- 50% null values -->
    </generate>
</setup>

In this example:

key_always_null will always have a null value (nullQuota="1").
key_never_null will never have a null value (nullQuota="0").
key_sometimes_null will have a null value 50% of the time (nullQuota="0.5").
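Reading `nullQuota` as a per-record probability, its effect can be simulated in Python. This is a minimal sketch, assuming each record is decided independently. The `apply_null_quota` helper is hypothetical.

```python
import random


def apply_null_quota(value, null_quota, rng):
    """Return None with probability `null_quota`, otherwise the value."""
    return None if rng.random() < null_quota else value


rng = random.Random(42)  # seeded so the sketch is repeatable
samples = 10_000
for quota in (1, 0, 0.5):
    nulls = sum(apply_null_quota("x", quota, rng) is None for _ in range(samples))
    print(f"nullQuota={quota}: {nulls / samples:.2%} null")
```

With a quota of 1 every value is null and with 0 none is. For 0.5 the share only approaches 50% over many records, so a run of 10 records can deviate noticeably.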

Example 3: Using `defaultValue` for Fallback Values¶

<setup>
    <generate name="default_values" count="5">
        <key name="key_with_empty_string" script="" defaultValue="default_value"/> <!-- Fallback to default_value -->
        <key name="key_with_none" script="None" defaultValue="default_value"/> <!-- Fallback to default_value -->
        <key name="key_with_condition" script="" defaultValue="default_value" condition="False"/> <!-- Condition False, no generation -->
    </generate>
</setup>

Here:

The first two keys fall back to their defaultValue when the script generates an empty or None value.
The third key doesn’t generate any value since its condition is False.

Example 4: Conditional Key Generation¶

<setup>
    <generate name="conditional_keys" count="10">
        <key name="conditional_key" script="random.randint(1, 100)" condition="random.randint(1, 100) > 50"/>
        <key name="constant_key" constant="fixed_value" condition="True"/>
    </generate>
</setup>

In this example:

conditional_key is generated only when a random number greater than 50 is produced by the condition script.
constant_key is always generated since its condition="True".

Example 5: Using `pattern` to Validate Keys¶

<setup>
    <generate name="pattern_matching" count="10">
        <key name="email" script="'user@example.com'" pattern="^[\w\.-]+@[\w\.-]+\.\w+$"/>
        <key name="phone_number" script="'123-456-7890'" pattern="^\d{3}-\d{3}-\d{4}$"/>
    </generate>
</setup>

In this example:

The email key’s value must match the regex pattern for a valid email format.
The phone_number key’s value must match the regex pattern for a valid phone number format (123-456-7890).
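To try the two regular expressions outside DATAMIMIC, Python's `re` module is enough. The sample address below is a placeholder, and this page does not state what happens when a value fails its pattern, so the sketch only reports whether a value matches.

```python
import re

EMAIL = re.compile(r"^[\w\.-]+@[\w\.-]+\.\w+$")
PHONE = re.compile(r"^\d{3}-\d{3}-\d{4}$")

print(bool(EMAIL.match("user@example.com")))  # True
print(bool(PHONE.match("123-456-7890")))      # True
print(bool(PHONE.match("1234567890")))        # False: the dashes are required
```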

Example 6: Date Conversion Using `inDateFormat` and `outDateFormat`¶

<setup>
    <generate name="date_format_conversion" count="10">
        <key name="date_of_birth" script="'2023-10-12'" inDateFormat="%Y-%m-%d" outDateFormat="%d-%m-%Y"/>
    </generate>
</setup>

In this example:

The date_of_birth key uses the input date format (inDateFormat="%Y-%m-%d") to parse the date and converts it to the specified output format (outDateFormat="%d-%m-%Y").
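The format strings in this example look like Python `strptime`/`strftime` directives. Assuming they follow the same conventions, the conversion is equivalent to this sketch, where `convert_date` is a hypothetical helper:

```python
from datetime import datetime


def convert_date(text, in_format, out_format):
    """Parse `text` with `in_format`, then render it with `out_format`."""
    parsed = datetime.strptime(text, in_format)
    return parsed.strftime(out_format)


print(convert_date("2023-10-12", "%Y-%m-%d", "%d-%m-%Y"))  # 12-10-2023
```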

Example 7: Keep A Raw Datetime For Arithmetic¶

<setup>
    <generate name="invoice_dates" count="1">
        <variable name="issue_at_raw" generator="DateTimeGenerator(value='2024-01-15 10:30:00')"/>
        <key name="issue_date" script="issue_at_raw" outDateFormat="%Y-%m-%d"/>
        <key name="due_date" script="issue_at_raw.add_days(30)" outDateFormat="%Y-%m-%d"/>
    </generate>
</setup>

In this example:

issue_at_raw stays a raw datetime for downstream arithmetic.
issue_date is formatted for output.
due_date uses the raw datetime value and formats only at the end.
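The reason for formatting last is easy to reproduce in plain Python. In this sketch `timedelta` stands in for the `add_days` call used in the XML above, whose exact semantics are not described on this page.

```python
from datetime import datetime, timedelta

issue_at_raw = datetime(2024, 1, 15, 10, 30, 0)

# Format only at the end: arithmetic happens on the raw datetime.
issue_date = issue_at_raw.strftime("%Y-%m-%d")
due_date = (issue_at_raw + timedelta(days=30)).strftime("%Y-%m-%d")
print(issue_date, due_date)  # 2024-01-15 2024-02-14

# Once formatted, the value is a string and date arithmetic fails.
try:
    issue_date + timedelta(days=30)
except TypeError as exc:
    print("formatted string cannot do date arithmetic:", type(exc).__name__)
```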

Example 8: Key Generation from a `SequenceTableGenerator`¶

<setup>
    <database id="sourceDB" system="postgres"/>
    <generate name="sequence_key_generation" count="10" >
        <key name="user_id" database="sourceDB" generator="SequenceTableGenerator"/>
    </generate>
</setup>

Here:

The user_id key is generated using a SequenceTableGenerator from a PostgreSQL database.
This generator ensures that unique, sequential values are pulled from the database.

Best Practices for Using `<key>`¶

Leverage script for Dynamic Values: Use script to generate complex and dynamic values, such as random numbers, dates, or values based on calculations.
Use nullQuota for Realistic Data: Use nullQuota to simulate real-world scenarios where some keys may have null values.
Fallback with defaultValue: Use defaultValue to ensure that your keys always have a fallback value if a script fails or produces None.
Pattern Matching for Validation: Use the pattern attribute to enforce specific formatting rules, such as email addresses or phone numbers.
Control Key Generation with condition: Use the condition attribute to dynamically determine whether a key should be generated, allowing for more control in complex data generation scenarios.
Keep Raw Datetimes Separate From Formatted Output: If you need datetime arithmetic later, keep a raw variable or unformatted value and format only on the final key.

`<variable>`¶

The <variable> element defines variables used in data generation tasks. Variables can be sourced from databases, datasets, or dynamically generated using scripts. They introduce flexibility in creating dynamic test data by controlling how the data is retrieved or iterated. The storage attribute, new in this release, gives explicit control over how a variable holds its data while remaining fully backward compatible.

Attributes¶

name: Specifies the name of the variable.
type: Defines the data type of the variable (optional). For DB/file sources, use the table/collection/entity name.
source: Specifies the data source for the variable (e.g., a database or a file path).
selector: Defines a database query for the variable. By default, it executes once and exposes the result through the variable's storage/value behavior.
iterationSelector: Executes a database query on each iteration to retrieve dynamic data for the variable.
paged: Optional DB-selector behavior switch. When paged="true", the selector follows the current parent <generate> page window instead of the default setup-cached once-per-run behavior.
storage: Controls how the variable stores/serves data. Options:
value – single static value (default for generators/constants; first row if a query)
data – complete data list in memory (random access)
iterator – cursor/iterator over rows; respects cyclic
separator: Specifies a separator for the variable (e.g., for CSV sources).
cyclic: Enables or disables cyclic iteration of the data source (relevant for storage="iterator").
entity: Defines the entity for generating data (e.g., a predefined model or object).
script: Specifies a script for dynamically generating the variable's value.
weightColumn: Specifies a column to weight data selection (typically used in CSV or database sources).
sourceScripted: Enables per-row template evaluation for file-backed sources (CSV/JSON, weighted sources).
generator: Defines a generator for the variable (e.g., RandomNumberGenerator, IncrementGenerator).
dataset: Specifies the dataset for the variable (usually a file path).
locale: Defines the locale used when generating data.
inDateFormat / outDateFormat: Specifies date format conversion for input and output. After outDateFormat is applied, downstream expressions see a string value, not a raw datetime.
converter: Defines a converter for transforming the variable's value.
constant: Sets a fixed constant value for the variable.
values: Provides a list of values for the variable to choose from.
defaultValue: Sets a default value when no data is available.
pattern: Defines a regex pattern for validating the variable's content.
distribution: Controls how data is distributed when selecting from a source (random, ordered).
database: Specifies the database used for generating data.
string: Attribute to generate complex strings by embedding variables within the string using customizable delimiters (see examples on <key>).
variablePrefix / variableSuffix: Configurable attributes that define the prefix/suffix for variable substitution in dynamic strings (default is __). Can be set globally on <setup> via defaultVariablePrefix / defaultVariableSuffix and overridden per element.

Storage Modes¶

Use storage for explicit, predictable behavior:

value: single static value. Ideal for configuration values, selector scalar queries, constants, generators.
data: full list materialized in memory. Useful for analytics, random access, or joining in scripts. Be mindful of size.
iterator: efficient row-by-row iteration. Honors cyclic. Best for large tables/files.
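The three modes can be pictured as three ways of serving the same rows. This is a conceptual sketch, not DATAMIMIC's implementation. The `make_variable` helper is hypothetical, and returning `None` once a non-cyclic iterator is exhausted is an assumption of the sketch.

```python
from itertools import cycle


def make_variable(rows, storage, cyclic=False):
    """Return a zero-argument accessor that models one storage mode."""
    if storage == "value":
        first = rows[0] if rows else None
        return lambda: first               # same value on every access
    if storage == "data":
        snapshot = list(rows)
        return lambda: snapshot            # whole list, random access
    if storage == "iterator":
        cursor = cycle(rows) if cyclic else iter(rows)
        return lambda: next(cursor, None)  # one row per access
    raise ValueError(f"unknown storage mode: {storage}")


rows = [{"id": 1}, {"id": 2}]

as_value = make_variable(rows, "value")
as_data = make_variable(rows, "data")
as_iterator = make_variable(rows, "iterator", cyclic=True)

print(as_value(), as_value())                      # {'id': 1} {'id': 1}
print(len(as_data()))                              # 2
print([as_iterator()["id"] for _ in range(3)])     # [1, 2, 1]
```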

Selector and Paging Contract¶

Selector behavior on <variable> is intentionally split into distinct contracts:

Form	Execution timing	Cache scope	`pageSize` effect	Notes
`selector="..."`	Once by default	Run-local cached result	No generic source-paging contract	Result exposure still depends on storage/value behavior
`selector="..." paged="true"`	Once per parent page window	Page-scoped cache	Follows the current parent generate page window	Use when the variable must track generate paging
`iterationSelector="..."`	Per iteration	No setup-cache contract	No generic `pageSize` contract	Dynamic per-iteration lookup

Important notes:

selector and iterationSelector always drive the query when present; type and sourceEntity do not override them.
Parent and global placeholders can be resolved in selector text. Child-scope placeholders are not visible to a parent selector.
Mongo aggregate: is a selector-kind exception. Do not assume it shares the same paging semantics as SQL or Mongo find: selectors.

Legacy Automatic Behavior (Backward Compatible)¶

If storage is omitted, DATAMIMIC applies the legacy rules:

Selector-based variables → behave as static single value.
Table/collection variables → cycle over rows (iterator semantics).
Generator/constant variables → single value.
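
The legacy rule for table variables can be made explicit by declaring the storage mode. The following is a minimal sketch, not a confirmed equivalence: db and product_table are placeholder names borrowed from the examples in this section, and it assumes the legacy cycling behavior corresponds to storage="iterator" with cyclic="True".

```xml
<setup>
    <generate name="legacy_vs_explicit" count="5">
        <!-- Legacy: storage omitted, so a table variable cycles over rows -->
        <variable name="product_legacy" source="db" type="product_table"/>
        <!-- Explicit form of the same intent (assumed equivalent):
             iterate row by row and wrap around at the end -->
        <variable name="product_explicit" source="db" type="product_table"
                  storage="iterator" cyclic="True"/>
        <key name="legacy_id" script="product_legacy.id"/>
        <key name="explicit_id" script="product_explicit.id"/>
    </generate>
</setup>
```

Declaring storage explicitly keeps the behavior readable in the descriptor instead of depending on how the variable is sourced.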

Context Levels¶

Root-level variables (declared directly under <setup>): loaded once, shared across the run, stable in multiprocessing.
Nested variables (declared inside <generate>): created per generation scope; can exhaust when cyclic="False".
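
The two context levels can be sketched side by side. This is an illustrative fragment only: the file names data/countries.csv and data/customers.csv are hypothetical, and using storage="data" with a CSV source is an assumption based on the storage mode descriptions above.

```xml
<setup>
    <!-- Root-level variable: declared directly under <setup>,
         loaded once and shared across the run -->
    <variable name="countries" source="data/countries.csv" separator="," storage="data"/>

    <generate name="orders" count="5">
        <!-- Nested variable: created for this generation scope.
             With cyclic="False" it can be exhausted if the file
             has fewer rows than count -->
        <variable name="customer" source="data/customers.csv" separator="," cyclic="False"/>
        <key name="customer_id" script="customer.id"/>
        <key name="country_count" script="len(countries)"/>
    </generate>
</setup>
```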

Multiprocessing Notes¶

storage="data": each worker receives the same snapshot list.
storage="iterator": each worker advances its own cursor. For globally partitioned traversal, partition upstream (e.g., by ID ranges).
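
A minimal sketch of how the two notes combine in one descriptor, assuming numProcess on <setup> (as used in the <ml-train> examples) enables multiple workers. The names db, user_table, and product_table are placeholders taken from the examples in this section.

```xml
<setup numProcess="2">
    <!-- storage="data": every worker sees the same snapshot list -->
    <variable name="all_users" source="db" type="user_table" storage="data"/>

    <generate name="sales" count="50">
        <!-- storage="iterator": each worker advances its own cursor,
             so the same product rows may appear in more than one worker's output -->
        <variable name="products" source="db" type="product_table"
                  storage="iterator" cyclic="True"/>
        <key name="product_id" script="products.id"/>
        <key name="total_users" script="len(all_users)"/>
    </generate>
</setup>
```

If every source row must be consumed exactly once across all workers, the iterator alone does not guarantee this; the traversal has to be partitioned before it reaches the variable, as noted above.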

Example 1: Using `generator` for Incrementing Values¶

<setup>
    <generate name="sequential_ids" count="10" >
        <variable name="id" generator="IncrementGenerator"/>
        <key name="generated_id" script="id"/>
    </generate>
</setup>

In this example:

The id variable uses the IncrementGenerator, which generates sequential numbers.
The generated ID is then assigned to the generated_id key for each record.

Example 2: Sourcing Data from a CSV File with `separator`¶

<setup>
    <generate name="person_data" count="5" >
        <variable name="person" source="data/people.csv" separator="," distribution="ordered"/>
        <key name="person_id" script="person.id"/>
        <key name="person_name" script="person.name"/>
        <key name="person_age" script="person.age"/>
    </generate>
</setup>

In this example:

The person variable is sourced from a CSV file, with fields separated by a comma.
The distribution="ordered" ensures that records are processed in the order they appear in the file.

Example 3: Defining a `constant` Variable¶

<setup>
    <generate name="constant_value_example" count="3" >
        <variable name="country" constant="Germany"/>
        <key name="user_country" script="country"/>
    </generate>
</setup>

In this case:

The country variable is defined as a constant with the value "Germany".
This value is applied to every record generated in the user_country key.

Example 4: Generating Dynamic Variables with `script`¶

<setup>
    <generate name="dynamic_variables" count="5" >
        <variable name="random_number" script="random.randint(1, 100)"/>
        <variable name="full_name" script="fake.name()"/>
        <key name="random_number_value" script="random_number"/>
        <key name="full_name_value" script="full_name"/>
    </generate>
</setup>

In this example:

The random_number variable generates a random integer between 1 and 100 using a script.
The full_name variable uses the fake library to generate random names.
These dynamically generated values are then assigned to the random_number_value and full_name_value keys for each record.

Example 5: Using `cyclic` Variables with a CSV Source¶

<setup>
    <generate name="cyclic_people" count="8" >
        <variable name="person" source="data/people.csv" cyclic="True" separator=","/>
        <key name="person_id" script="person.id"/>
        <key name="person_name" script="person.name"/>
    </generate>
</setup>

In this example:

The cyclic="True" attribute ensures that once all records from the CSV file are used, the data starts from the beginning again.

Example 6: Using `distribution` to Randomize Data Selection¶

<setup>
    <generate name="random_people" count="10" >
        <variable name="person" source="data/people.csv" separator="," distribution="random"/>
        <key name="person_id" script="person.id"/>
        <key name="person_name" script="person.name"/>
    </generate>
</setup>

Here:

The distribution="random" attribute ensures that the records are selected randomly from the source CSV file.

Example 7: Iterating with `iterationSelector`¶

<setup>
    <generate name="iterate_selector" count="20" >
        <key name="iteration_count" generator="IncrementGenerator"/>
        <variable name="user" source="dbPostgres"
                  iterationSelector="SELECT id, name FROM users WHERE id = __iteration_count__"/>
        <key name="user_id" script="user[0].id"/>
        <key name="user_name" script="user[0].name"/>
    </generate>
</setup>

In this example:

The iterationSelector query retrieves data from a PostgreSQL database for each iteration using the iteration_count value, dynamically fetching user information.

Example 8: Using `paged="true"` with a database selector¶

<setup>
    <generate name="paged_customers" count="1000" pageSize="200">
        <variable name="customer_page"
                  source="dbPostgres"
                  selector="SELECT id, name FROM public.customers"
                  paged="True"/>
        <key name="customer_id" script="customer_page.id"/>
        <key name="customer_name" script="customer_page.name"/>
    </generate>
</setup>

In this example:

The selector no longer behaves like a setup-cached once-per-run lookup.
It reloads once for each parent generate page window and reuses that page-local result inside the page.

Example 9: Preserve A Raw Datetime Variable For Arithmetic¶

<setup>
    <generate name="subscription_dates" count="1">
        <variable name="created_at" generator="DateTimeGenerator(value='2024-01-31 10:30:00')"/>
        <key name="renewal_date" script="created_at.add_months(1).end_of_month()" outDateFormat="%Y-%m-%d"/>
    </generate>
</setup>

In this example:

created_at remains a raw datetime variable.
The key performs arithmetic first and formatting second.

Example 10: Defining Weighted Variables with `weightColumn`¶

<setup>
    <generate name="weighted_people" count="10" >
        <variable name="people" source="data/people_weighted.csv" weightColumn="weight" separator=","/>
        <key name="person_id" script="people.id"/>
        <key name="person_name" script="people.name"/>
    </generate>
</setup>

Here:

The weightColumn="weight" controls how frequently each row is selected. Rows with higher weight values are more likely to be chosen.

Example 10: Combining Variables with Nested Keys¶

<setup>
    <generate name="customer_info" count="10" >
        <variable name="customer" source="data/customers.csv" cyclic="True"/>
        <variable name="notification" source="data/notifications.csv" cyclic="True"/>
        <key name="customer_id" script="customer.id"/>
        <key name="customer_name" script="customer.name"/>
        <nestedKey name="notifications" type="list" count="2">
            <key name="notification_type" script="notification.type"/>
            <key name="notification_message" script="notification.message"/>
        </nestedKey>
    </generate>
</setup>

In this case:

The customer and notification variables are both sourced from CSV files.
The nestedKey element generates two notifications for each customer, showcasing how variables can be combined with nested structures.

Example 11: Working with Entities and Locale-Specific Data¶

<setup>
    <generate name="localized_data" count="5" >
        <variable name="person" entity="Person" locale="de_DE"/>
        <key name="person_name" script="person.full_name"/>
        <key name="person_address" script="person.address"/>
    </generate>
</setup>

In this example:

The person variable is generated using the Person entity, with data localized to de_DE (Germany).
This can be used to generate locale-specific data like names, addresses, etc.

Example 12: Using `string` Attribute for Dynamic and Complex Strings¶

<setup defaultVariablePrefix="%%" defaultVariableSuffix="%%">
    <generate name="query_generation" count="1">
        <variable name="collection" constant="'users'" />
        <key name="query" string="find: %%collection%%, filter: {'status': 'active'}" />
    </generate>
</setup>

In this example:

The string attribute allows dynamic insertion of the variable collection into the query.
The custom %% prefix and suffix replace the default __.
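
The delimiters can also be overridden on a single element rather than globally. The following sketch assumes that variablePrefix and variableSuffix are accepted directly on the <key> that carries the string attribute, as the attribute description suggests:

```xml
<setup>
    <generate name="query_generation" count="1">
        <variable name="collection" constant="'users'"/>
        <!-- Per-element override: only this key uses %% as delimiters;
             other elements keep the default __ -->
        <key name="query"
             string="find: %%collection%%, filter: {'status': 'active'}"
             variablePrefix="%%" variableSuffix="%%"/>
    </generate>
</setup>
```

A per-element override is useful when only one string contains text that would collide with the default __ delimiters.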

Example 13: Default `variablePrefix` and `variableSuffix`¶

<setup>
    <generate name="query_generation" count="1">
        <variable name="collection" constant="'users'" />
        <key name="query" string="find: __collection__, filter: {'status': 'active'}" />
    </generate>
</setup>

In this case:

The default __ delimiters are used for variable substitution.

Example 14: Explicit Iterator Storage with Cycling Control¶

<setup>
    <generate name="sales" count="50">
        <variable name="products" source="db" type="product_table" storage="iterator" cyclic="True"/>
        <key name="product_id" script="products.id"/>
    </generate>
</setup>

Example 15: Storing the Complete Dataset in Memory¶

<setup>
    <variable name="all_users" source="db" type="user_table" storage="data"/>
    <generate name="analytics" count="10">
        <key name="total_users" script="len(all_users)"/>
        <key name="random_user" script="all_users[random.randint(0, len(all_users)-1)]"/>
    </generate>
</setup>

Example 16: Forcing Single-Value Behavior¶

<setup>
    <generate name="test" count="5">
        <variable name="first_user" source="db" type="user_table" storage="value"/>
        <key name="template_user" script="first_user.name"/>
    </generate>
</setup>

Best Practices for Using `<variable>`¶

Dynamic Data Generation: Use scripts in variables to create dynamic data like random numbers, names, and addresses using libraries like random and fake.
Cyclic vs Non-Cyclic: Use cyclic variables when you want data to repeat once all values are used; non-cyclic variables are exhausted after one pass.
Weighting and Randomization: Use weightColumn to skew data generation toward certain records and distribution="random" to randomize data selection.
Combining with Nested Keys: Use variables in combination with nestedKey to generate structured, hierarchical data.
Pick the right storage: Use value for scalars/config, iterator for large sources, data for small datasets you need to index.
Prefer explicit type for DB sources: Avoid relying on variable names to infer tables/collections.
Mind multiprocessing: data is shared as a snapshot; iterators advance per worker.

Storage Mode Summary¶

Variable Pattern	Storage Mode	Behavior	Use Case
`selector="..."`	auto → value	Single value	Config, max values, constants
`selector="..." paged="true"`	page-aware	Reloads per parent page window	Page-local DB lookups
`type="table_name"`	auto → iterator	Cycles through data	Entity relationships
`storage="value"`	Explicit	Single value	Force static behavior
`storage="data"`	Explicit	Complete list	Calculations, random access
`storage="iterator"`	Explicit	Cycles/exhausts by `cyclic`	Controlled iteration

`<ml-train>`¶

The <ml-train> element trains machine learning models on input data. The trained models can then be used as sources in <generate> elements to enrich original data. <ml-train> is a child element of <setup>.

Attributes¶

name: Specifies the name of the trained model. This attribute is mandatory and is used to reference the model in other elements.
source: Specifies the source of the data (e.g., data/active.ent.csv, mongo).
type: Specifies the type of data to generate.
mode: Specifies the training mode. Two modes are currently available: 'default' removes the model after all tasks have finished; 'persist' keeps the model after all tasks have finished.
maxTrainingTime: Specifies the maximum time allowed for model training, in minutes (e.g., 1, 5, 10).
separator: Specifies the separator used in the source data file (e.g., ',' for CSV files).

Example 1: Basic Model Training¶

<setup>
    <ml-train name="customer_csv_gen"
              source="data/customer.ent.csv"
              maxTrainingTime="1"
              separator=","/>

    <generate name="csv_customer" count="10000" pageSize="1000" source="customer_csv_gen" target="CSV">
        <key name="id" generator="IncrementGenerator"/>
    </generate>
</setup>

In this example:

The model named "customer_csv_gen" is trained on data from "data/customer.ent.csv".
No mode is specified, so the default mode applies and the "customer_csv_gen" model is removed after all tasks finish.
The CSV file uses a comma as the separator.
The <generate> element uses the trained "customer_csv_gen" model as its source to create new data.

Example 2: Training with persist mode¶

<setup numProcess="2">

    <ml-train name="customer_csv_gen"
              source="data/customer.ent.csv"
              mode="persist"
              maxTrainingTime="1"/>

    <!-- Generate synthetic CUSTOMER records using the ML generator -->
    <generate name="csv_customer" count="10000" pageSize="1000" source="customer_csv_gen" target="CSV">
        <key name="id" generator="IncrementGenerator"/>
    </generate>
</setup>

In this example:

mode="persist" keeps the model after all tasks finish.
A later run can use the model without retraining:

<setup numProcess="2">
    <generate name="csv_customer" count="10000" pageSize="1000" source="customer_csv_gen" target="CSV">
        <key name="id" generator="IncrementGenerator"/>
    </generate>
</setup>

For the Database View -> ML Generator View lifecycle and project-level reuse flow (including source="ml://..." usage), see ML Generator from Database Metadata and ML Generator View.

Complete Basic Example¶

Here's a complete example combining the core elements:

<setup>
    <generate name="user_data" count="100" target="CSV">
        <!-- Define variables -->
        <variable name="person" entity="Person"/>

        <!-- Define keys -->
        <key name="id" generator="IncrementGenerator"/>
        <key name="first_name" script="person.given_name"/>
        <key name="last_name" script="person.family_name"/>
        <key name="age" type="int" generator="IntegerGenerator(min=18, max=80)"/>
        <key name="status" constant="active"/>
    </generate>
</setup>

This will generate 100 user records with consistent, structured data including IDs, names, ages, and a status field.

Next Steps¶

Once you're comfortable with these core elements, explore the Advanced Data Definition Elements for more complex features like:

Nested data structures
Conditional generation
Complex data patterns
Arrays and lists
Advanced variable usage
