Data Definition Model - Core Elements
Data Definition Models are fundamental to DATAMIMIC's test data generation capabilities. This document covers the essential elements - if you're new to DATAMIMIC, start here. For advanced features, see Advanced Data Definition Elements.
Overview
Data Definition Models specify how test data should be generated, transformed, or obfuscated. The core elements allow you to:
- Define data generation tasks
- Specify key fields and their values
- Create and use variables
- Generate structured data sets
Basic Elements
<setup>
The <setup> element is the root element for all data generation tasks. It contains one or more <generate> elements that define specific data generation operations. Learn more about its use in Configuration Models.
| <setup>
<generate name="users" count="100">
<!-- Generation details go here -->
</generate>
</setup>
|
<generate>
The <generate> element is the core of Data Definition Models. It defines a data generation task and includes attributes like name, count, and target. This element is used to create structured data based on the specified configurations.
Attributes
- name: Specifies the name of the generation task.
- count: Specifies the number of records to generate.
- source: Specifies the source of the data (e.g., data/active.ent.csv, mongo).
- target: Specifies the target output (e.g., CSV, sqliteDB).
- type: Specifies the type of data to generate.
- cyclic: Enables or disables cyclic generation. Default is False.
- selector: Specifies a database query for the generation.
- separator: Specifies a separator for the generated data. Default is |.
- sourceScripted: Enables or disables scripted evaluation of the source file (e.g., example.ent.csv, example.json). Default is False.
- pageSize: Specifies the page size for data generation.
- storageId: Specifies the ID of the object storage, defined by the <object-storage> element.
- sourceUri: Specifies the URI of the data source on object storage (e.g., datasource/employees.csv).
- exportUri: Specifies the URI for exporting generated data on object storage (e.g., export/product.csv).
- container: Specifies the container name for Azure Blob Storage.
- bucket: Specifies the bucket name for AWS S3.
- distribution: Specifies the distribution of data source iteration (e.g., random, ordered). Default is random.
- converter: Specifies a converter to transform the value.
- variablePrefix: Defines the prefix for variable substitution in dynamic strings. Default is __.
- variableSuffix: Defines the suffix for variable substitution in dynamic strings. Default is __.
- numProcess: Defines the number of processes for multiprocessing; can be inherited from the parent <setup> element. Default is 1.
- mpPlatform: Defines the multiprocessing platform to use. Accepted values are multiprocessing and ray. Default is multiprocessing.
Children
<key>: Specifies key fields within the data generation task.
<variable>: Defines variables used in data generation.
<reference>: Defines references to other generated data.
<nestedKey>: Specifies nested key fields and their generation methods.
<list>: Defines lists of data items.
<condition>: Conditional element to include data based on certain conditions.
<array>: Defines arrays of data items.
<echo>: Outputs text or variables for logging or debugging purposes.
Example 1: Using Object Storage for Data Generation
| <setup>
<!-- Define object-storage with ID referring to the environment -->
<object-storage id="aws"/>
<!-- Write file to the object-storage -->
<generate name="external_write" bucket="datamimic-01" storageId="aws" exportUri="/datamimic_exporting_result/" target="JSON, CSV, TXT, XML" count="100">
<key name="id" generator="IncrementGenerator"/>
<key name="name" type="string"/>
</generate>
<!-- Read file from object-storage -->
<generate name="external_read" bucket="datamimic-01" sourceUri="datamimic_exporting_result/external_write.json" source="aws" />
</setup>
|
Example 2: Using selector with a Database
| <generate name="CUSTOMER" source="mongodb" selector="find: 'CUSTOMER', filter: {'age': {'$lt': 30}}" >
<key name="id" generator="IncrementGenerator"/>
<key name="name" type="string"/>
</generate>
|
In this example:
- The selector is used to query the MongoDB database to find all customers under 30 years old.
- The data is output to the ConsoleExporter.
Example 3: Generating Data with MongoDB and Aggregation
| <setup>
<memstore id="mem"/>
<mongodb id="mongodb"/>
<!-- Clear collections before generating new data -->
<generate name="delete_users" source="mongodb" selector="find: 'more_users', filter: {}" target="mongodb.delete"/>
<generate name="delete_orders" source="mongodb" selector="find: 'more_orders', filter: {}" target="mongodb.delete"/>
<generate name="delete_products" source="mongodb" selector="find: 'more_products', filter: {}" target="mongodb.delete"/>
<!-- Generate orders, users, and products collections -->
<generate name="more_orders" source="script/orders.json" target="mongodb"/>
<generate name="more_users" source="script/users.json" target="mongodb"/>
<generate name="more_products" source="script/products.json" target="mongodb"/>
<!-- Perform an aggregation query to summarize user orders and spending -->
<generate name="more_summary" count="20" >
<variable name="result" source="mongodb"
selector='aggregate: "more_users",
pipeline: [
{
"$lookup": {
"from": "more_orders",
"localField": "user_id",
"foreignField": "user_id",
"as": "userOrders"
}
},
{
"$unwind": "$userOrders"
},
{
"$lookup": {
"from": "more_products",
"localField": "userOrders.order_item",
"foreignField": "product_name",
"as": "orderProducts"
}
},
{
"$unwind": "$orderProducts"
},
{
"$group": {
"_id": "$user_id",
"user_name": { "$first": "$user_name" },
"order_items": { "$push": "$userOrders.order_item" },
"quantities": { "$first": "$userOrders.quantity" },
"total_spending": {
"$sum": {
"$multiply": ["$userOrders.quantity", "$orderProducts.price"]
}
}
}
}
]'/>
<nestedKey name="users_orders" script="result"/>
</generate>
<!-- Clear collections after generation -->
<generate name="delete_users" source="mongodb" selector="find: 'more_users', filter: {}" target="mongodb.delete"/>
<generate name="delete_orders" source="mongodb" selector="find: 'more_orders', filter: {}" target="mongodb.delete"/>
<generate name="delete_products" source="mongodb" selector="find: 'more_products', filter: {}" target="mongodb.delete"/>
</setup>
|
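To make the pipeline's effect concrete, here is a minimal plain-Python sketch of what the `$lookup`, `$unwind`, and `$group` stages compute together. It does not talk to MongoDB, and the sample users, orders, and products are invented for illustration; the real documents come from the script/*.json files, whose contents are not shown here. The function name `summarize` is hypothetical.

```python
# Hypothetical sample documents; field names follow the pipeline above.
users = [{"user_id": 1, "user_name": "Alice"}, {"user_id": 2, "user_name": "Bob"}]
orders = [
    {"user_id": 1, "order_item": "pen", "quantity": 3},
    {"user_id": 1, "order_item": "book", "quantity": 1},
    {"user_id": 2, "order_item": "pen", "quantity": 2},
]
products = [
    {"product_name": "pen", "price": 2.0},
    {"product_name": "book", "price": 10.0},
]


def summarize(users, orders, products):
    """Join users -> orders -> products and total the spending per user."""
    summary = {}
    for user in users:
        for order in orders:  # $lookup + $unwind on userOrders
            if order["user_id"] != user["user_id"]:
                continue
            for product in products:  # $lookup + $unwind on orderProducts
                if product["product_name"] != order["order_item"]:
                    continue
                row = summary.setdefault(user["user_id"], {
                    "_id": user["user_id"],
                    "user_name": user["user_name"],    # $first
                    "order_items": [],
                    "quantities": order["quantity"],   # $first
                    "total_spending": 0,
                })
                row["order_items"].append(order["order_item"])  # $push
                # $sum over $multiply(quantity, price)
                row["total_spending"] += order["quantity"] * product["price"]
    return list(summary.values())


result = summarize(users, orders, products)
```

As with `$unwind`, a user without any matching order produces no row in the summary.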
Example 4: Generating Data with Kafka
| <setup>
<kafka-exporter id="kafkaLocal" environment="environment"/>
<kafka-importer id="kafka_importer" system="kafkaLocal" enable.auto.commit="True" auto.offset.reset="earliest" group.id="datamimic" decoding="UTF-8" environment="environment"/>
<!-- Reset Kafka topic by consuming all messages -->
<generate name="reset" source="kafka_importer" type="kafka" count="100" target=""/>
<!-- Generate data to export to Kafka and Console -->
<generate name="exported_data" count="10" target="ConsoleExporter, kafkaLocal">
<variable name="person" entity="Person"/>
<key name="name" script="person.name"/>
<key name="email" script="person.email"/>
</generate>
<!-- Import data from Kafka -->
<generate name="imported_data" source="kafka_importer" type="kafka" count="20" distribution="ordered"/>
</setup>
|
Example 5: Using Data from a CSV File
| <setup defaultSeparator="|">
<generate name="product1" source="data/products.ent.csv" separator="," distribution="ordered"/>
<generate name="product2" source="data/products_2.ent.csv" distribution="ordered"/>
</setup>
|
In this example:
- Two generate tasks are created that source data from CSV files and output it to the ConsoleExporter.
Example 6: Using cyclic with Data from Memory Store
| <setup>
<memstore id="mem"/>
<generate name="product" count="15" target="mem">
<key name="id" generator="IncrementGenerator"/>
<key name="name" values="'Alice', 'Bob', 'Cameron'"/>
</generate>
<!-- Generate 30 non-cyclic and cyclic products from memory -->
<generate name="non-cyclic-product" type="product" count="30" cyclic="False" source="mem" target="" distribution="ordered"/>
<generate name="cyclic-product" type="product" count="30" cyclic="True" source="mem" target="" distribution="ordered"/>
<generate name="big-cyclic-product" type="product" count="100" cyclic="True" source="mem" target="" distribution="ordered"/>
</setup>
|
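The difference between the cyclic and non-cyclic reads can be sketched in a few lines of Python. This is an illustration of the idea, not DATAMIMIC's implementation: it assumes a non-cyclic read simply stops once the 15 stored records are exhausted, and the helper name `read_source` is hypothetical.

```python
from itertools import cycle, islice


def read_source(rows, count, cyclic):
    """Return up to `count` rows in order; wrap around only when cyclic."""
    source = cycle(rows) if cyclic else iter(rows)
    return list(islice(source, count))


products = [{"id": i} for i in range(1, 16)]  # 15 records, as stored in "mem"

non_cyclic = read_source(products, 30, cyclic=False)  # stops after 15 rows
cyclic = read_source(products, 30, cyclic=True)       # restarts at id 1 after row 15
```

Under that assumption, asking for 100 records with cyclic="True" walks the 15 stored rows several times, while a non-cyclic request can never return more rows than the source holds.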
Example 7: Using 'sourceScripted' with JSON template
| <setup>
<generate name="json_data" source="script/data.json" sourceScripted="True" target="">
<variable name="random_age" generator="IntegerGenerator(min=18, max=65)"/>
<variable name="street_name" generator="StreetNameGenerator"/>
<variable name="address_number" generator="IntegerGenerator"/>
</generate>
</setup>
|
Template (script/data.json):
| [
{
"id": 1,
"name": "Alice",
"age": "{random_age}",
"address": "__address_number__, __street_name__ St"
},
{
"id": 2,
"name": "Bob",
"age": "{random_age}",
"address": "__address_number__, __street_name__ St"
},
{
"id": 3,
"name": "Cameron",
"age": "{random_age}",
"address": "__address_number__, __street_name__ St"
}
]
|
Result:
| [
{
"id": 1,
"name": "Alice",
"age": 23,
"address": "801538, Walnut Street St"
},
{
"id": 2,
"name": "Bob",
"age": 51,
"address": "680286, View Street St"
},
{
"id": 3,
"name": "Cameron",
"age": 29,
"address": "711086, Forest Street St"
}
]
|
In this example:
- The sourceScripted="True" attribute is used to evaluate the JSON template with embedded variables.
- The JSON template contains placeholders for variables like random_age, street_name, and address_number.
- If the whole JSON field value is a variable, enclose it in curly braces {} (e.g., "age": "{random_age}"). The returned value can be a string, integer, or any other type.
- If a variable is embedded within a string, enclose it in double underscores __ (e.g., "address": "__address_number__, __street_name__ St"). The returned value will be a string. You can also customize the prefix and suffix for variable substitution using the variablePrefix, variableSuffix, defaultVariablePrefix, and defaultVariableSuffix attributes. For example:
| <setup defaultVariablePrefix="-%" defaultVariableSuffix="%-">
<generate name="json_data" source="script/data.json" sourceScripted="True" target="">
<variable name="random_age" generator="IntegerGenerator(min=18, max=65)"/>
</generate>
</setup>
<setup>
<generate name="json_data" source="script/data.json" sourceScripted="True" target="" variablePrefix="-%" variableSuffix="%-">
<variable name="random_age" generator="IntegerGenerator(min=18, max=65)"/>
</generate>
</setup>
|
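The two placeholder styles described above can be sketched in Python as follows. This is only an illustration of the substitution rules as documented here, not DATAMIMIC's actual template engine; the function name `render` and its parameters are hypothetical.

```python
import re


def render(value, variables, prefix="__", suffix="__"):
    """Resolve one template string against a dict of variables."""
    # Whole-value placeholder "{name}": return the variable itself, keeping its type.
    whole = re.fullmatch(r"\{(\w+)\}", value)
    if whole and whole.group(1) in variables:
        return variables[whole.group(1)]
    # Embedded placeholders: substitute each one; the result is always a string.
    pattern = re.compile(re.escape(prefix) + r"(\w+?)" + re.escape(suffix))
    return pattern.sub(
        lambda m: str(variables.get(m.group(1), m.group(0))),  # unknown names stay untouched
        value,
    )


variables = {"random_age": 23, "address_number": 801538, "street_name": "Walnut Street"}

age = render("{random_age}", variables)                                  # stays an int
address = render("__address_number__, __street_name__ St", variables)   # becomes a string
custom = render("age: -%random_age%-", variables, prefix="-%", suffix="%-")
```

Passing a different prefix and suffix mirrors what variablePrefix / variableSuffix do on a single element, while the defaults play the role of defaultVariablePrefix / defaultVariableSuffix on <setup>.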
Example 8: Using mpPlatform to Select the Multiprocessing Platform
| <setup>
<generate name="json_data" count="1000000" mpPlatform="ray" target="">
<key name="random_age" generator="IntegerGenerator(min=18, max=65)"/>
<key name="street_name" generator="StreetNameGenerator"/>
<key name="address_number" generator="IntegerGenerator"/>
</generate>
</setup>
|
In this example:
- The generate task is executed on the ray platform instead of the default Python multiprocessing.
The <generate> element defines a data generation task. At its most basic, it requires:
- name: Identifies the generation task
- count: Specifies how many records to generate
- target: (Optional) Specifies the output format (e.g., CSV, JSON)
Basic Example
| <setup>
<generate name="simple_users" count="10" target="CSV">
<key name="id" generator="IncrementGenerator"/>
<key name="name" type="string"/>
<key name="age" type="int"/>
</generate>
</setup>
|
Essential Attributes
- name: Task identifier
- count: Number of records to generate
- target: Output format (e.g., CSV, JSON, ConsoleExporter)
- source: (Optional) Input data source
<key>
The <key> element defines key fields within a data generation task and specifies their generation methods. These fields are crucial for creating unique identifiers or structured elements within the generated data. The <key> element allows for dynamic, constant, or conditional data generation and provides several attributes to customize its behavior.
Attributes
- name: Specifies the name of the key. This is mandatory and will be used as the field name in the generated data.
- type: Defines the data type of the key (e.g., string, int, bool). This is optional when using script or generator.
- source: Specifies the data source for the key (e.g., a database, a file).
- separator: Specifies a separator for a CSV source.
- values: Provides a list of static values for the key to choose from.
- script: Defines a script for dynamically generating the key's value.
- generator: Specifies a generator to automatically create values (e.g., RandomNumberGenerator, IncrementGenerator).
- constant: Defines a constant value for the key.
- condition: Specifies a condition to determine whether the key will be generated.
- converter: Specifies a converter to transform the value (e.g., date conversion, format changes).
- pattern: Defines a regex pattern to validate the value of the key.
- inDateFormat / outDateFormat: Specifies input and output date formats for converting date values.
- defaultValue: Provides a default value if the key's value is null or not generated.
- nullQuota: Defines the probability that the key will be assigned a null value. Default is 0 (never null).
- database: Specifies the database used for generating values (e.g., with SequenceTableGenerator).
- string: Generates complex strings by embedding variables within the string using customizable delimiters (see the examples in the <variable> section).
- variablePrefix: Defines the prefix for variable substitution in dynamic strings. Default is __.
- variableSuffix: Defines the suffix for variable substitution in dynamic strings. Default is __.
Example 1: Generating Constant and Scripted Keys
| <setup>
<generate name="static_and_scripted_keys" count="5" >
<key name="static_key" constant="fixed_value"/>
<key name="dynamic_key" script="random.randint(1, 100)"/>
</generate>
</setup>
|
In this example:
static_key is assigned a constant value of "fixed_value" for every record.
dynamic_key generates a random integer between 1 and 100 for each record using a script.
Example 2: Handling nullQuota for Nullable Fields
| <setup>
<generate name="nullable_keys" count="10">
<key name="key_always_null" type="string" nullQuota="1"/> <!-- 100% null values -->
<key name="key_never_null" type="string" nullQuota="0"/> <!-- 0% null values -->
<key name="key_sometimes_null" type="string" nullQuota="0.5"/> <!-- 50% null values -->
</generate>
</setup>
|
In this example:
key_always_null will always have a null value (nullQuota="1").
key_never_null will never have a null value (nullQuota="0").
key_sometimes_null will have a null value 50% of the time (nullQuota="0.5").
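One way to picture nullQuota is as a per-record probability check. The Python sketch below illustrates that reading; it is an assumption about the semantics rather than DATAMIMIC's code, and `apply_null_quota` is a hypothetical helper.

```python
import random


def apply_null_quota(value, null_quota, rng=random):
    """Return None with probability `null_quota`, otherwise the value."""
    # rng.random() is in [0, 1), so a quota of 1 always nulls and 0 never does.
    return None if rng.random() < null_quota else value


rng = random.Random(42)  # seeded so the sample is reproducible
sample = [apply_null_quota("x", 0.5, rng) for _ in range(10_000)]
null_share = sample.count(None) / len(sample)  # close to 0.5 for a large sample
```

With only 10 records, as in the example above, the observed share of nulls for nullQuota="0.5" can deviate noticeably from 50%.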
Example 3: Using defaultValue for Fallback Values
| <setup>
<generate name="default_values" count="5">
<key name="key_with_empty_string" script="" defaultValue="default_value"/> <!-- Fallback to default_value -->
<key name="key_with_none" script="None" defaultValue="default_value"/> <!-- Fallback to default_value -->
<key name="key_with_condition" script="" defaultValue="default_value" condition="False"/> <!-- Condition False, no generation -->
</generate>
</setup>
|
Here:
- The first two keys fall back to their defaultValue when the script generates an empty or None value.
- The third key doesn't generate any value since its condition is False.
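The interplay of condition and defaultValue can be sketched as a small Python function. It assumes that a key whose condition is false is left out of the record entirely, which the text above suggests but does not spell out; `build_record` is a hypothetical name.

```python
def build_record(keys):
    """Build one record from (name, generated_value, default_value, condition) tuples."""
    record = {}
    for name, value, default_value, condition in keys:
        if not condition:
            continue  # assumption: a false condition omits the key altogether
        # Empty string and None both trigger the fallback.
        record[name] = default_value if value in (None, "") else value
    return record


record = build_record([
    ("key_with_empty_string", "", "default_value", True),
    ("key_with_none", None, "default_value", True),
    ("key_with_condition", "", "default_value", False),
])
```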
Example 4: Conditional Key Generation
| <setup>
<generate name="conditional_keys" count="10">
<key name="conditional_key" script="random.randint(1, 100)" condition="random.randint(1, 100) > 50"/>
<key name="constant_key" constant="fixed_value" condition="True"/>
</generate>
</setup>
|
In this example:
conditional_key is generated only when a random number greater than 50 is produced by the condition script.
constant_key is always generated since its condition="True".
Example 5: Using pattern to Validate Keys
| <setup>
<generate name="pattern_matching" count="10">
<key name="email" script="'user@example.com'" pattern="^[\w\.-]+@[\w\.-]+\.\w+$"/>
<key name="phone_number" script="'123-456-7890'" pattern="^\d{3}-\d{3}-\d{4}$"/>
</generate>
</setup>
|
In this example:
- The email key's value must match the regex pattern for a valid email format.
- The phone_number key's value must match the regex pattern for a valid phone number format (123-456-7890).
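The two regular expressions can be tried out directly with Python's re module before putting them into a model. The sketch below only checks whether a value matches; how DATAMIMIC reacts to a mismatch is not covered here, and the sample values are illustrative.

```python
import re

# Same patterns as in the example above.
EMAIL = re.compile(r"^[\w\.-]+@[\w\.-]+\.\w+$")
PHONE = re.compile(r"^\d{3}-\d{3}-\d{4}$")


def matches(pattern, value):
    """Return True when the value satisfies the compiled pattern."""
    return pattern.match(value) is not None


phone_ok = matches(PHONE, "123-456-7890")
phone_bad = matches(PHONE, "1234567890")      # missing dashes
email_ok = matches(EMAIL, "user@example.com")
email_bad = matches(EMAIL, "not-an-email")    # no @ and no domain
```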
Example 6: Converting Dates with inDateFormat and outDateFormat
| <setup>
<generate name="date_format_conversion" count="10">
<key name="date_of_birth" script="'2023-10-12'" inDateFormat="%Y-%m-%d" outDateFormat="%d-%m-%Y"/>
</generate>
</setup>
|
In this example:
- The date_of_birth key uses the input date format (inDateFormat="%Y-%m-%d") to parse the date and converts it to the specified output format (outDateFormat="%d-%m-%Y").
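The format strings follow the strptime/strftime directive syntax, so the same conversion can be reproduced with Python's datetime module. This sketch shows the parse-then-format step; `convert_date` is a hypothetical helper name.

```python
from datetime import datetime


def convert_date(value, in_format, out_format):
    """Parse `value` with `in_format` and re-render it with `out_format`."""
    return datetime.strptime(value, in_format).strftime(out_format)


converted = convert_date("2023-10-12", "%Y-%m-%d", "%d-%m-%Y")  # "12-10-2023"
```

If the value does not match the input format, strptime raises a ValueError, which is a quick way to check a format string against sample data.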
Example 7: Key Generation from a SequenceTableGenerator
| <setup>
<database id="sourceDB" system="postgres"/>
<generate name="sequence_key_generation" count="10" >
<key name="user_id" database="sourceDB" generator="SequenceTableGenerator"/>
</generate>
</setup>
|
Here:
- The user_id key is generated using a SequenceTableGenerator from a PostgreSQL database.
- This generator ensures that unique, sequential values are pulled from the database.
Best Practices for Using <key>
- Leverage script for Dynamic Values: Use script to generate complex and dynamic values, such as random numbers, dates, or values based on calculations.
- Use nullQuota for Realistic Data: Use nullQuota to simulate real-world scenarios where some keys may have null values.
- Fallback with defaultValue: Use defaultValue to ensure that your keys always have a fallback value if a script fails or produces None.
- Pattern Matching for Validation: Use the pattern attribute to enforce specific formatting rules, such as email addresses or phone numbers.
- Control Key Generation with condition: Use the condition attribute to dynamically determine whether a key should be generated, allowing for more control in complex data generation scenarios.
<variable>
The <variable> element defines variables used in data generation tasks. Variables can be sourced from databases, datasets, or dynamically generated using scripts. They introduce flexibility in creating dynamic test data by controlling how the data is retrieved or iterated. New in this release is the storage attribute for explicit control while keeping full backward compatibility.
Attributes
- name: Specifies the name of the variable.
- type: Defines the data type of the variable (optional). For DB/file sources, use the table/collection/entity name.
- source: Specifies the data source for the variable (e.g., a database or a file path).
- selector: Defines a query to retrieve data for the variable from a database (executed once).
- iterationSelector: Executes a query on each iteration to retrieve dynamic data for the variable.
- storage: Controls how the variable stores/serves data. Options:
  - value: single static value (default for generators/constants; first row if a query)
  - data: complete data list in memory (random access)
  - iterator: cursor/iterator over rows; respects cyclic
- separator: Specifies a separator for the variable (e.g., for CSV sources).
- cyclic: Enables or disables cyclic iteration of the data source (relevant for storage="iterator").
- entity: Defines the entity for generating data (e.g., a predefined model or object).
- script: Specifies a script for dynamically generating the variable's value.
- weightColumn: Specifies a column to weight data selection (typically used in CSV or database sources).
- sourceScripted: Enables per-row template evaluation for file-backed sources (CSV/JSON, weighted sources).
- generator: Defines a generator for the variable (e.g., RandomNumberGenerator, IncrementGenerator).
- dataset: Specifies the dataset for the variable (usually a file path).
- locale: Defines the locale used when generating data.
- inDateFormat / outDateFormat: Specifies date format conversion for input and output.
- converter: Defines a converter for transforming the variable's value.
- constant: Sets a fixed constant value for the variable.
- values: Provides a list of values for the variable to choose from.
- defaultValue: Sets a default value when no data is available.
- pattern: Defines a regex pattern for validating the variable's content.
- distribution: Controls how data is distributed when selecting from a source (random, ordered).
- database: Specifies the database used for generating data.
- string: Generates complex strings by embedding variables within the string using customizable delimiters (see examples on <key>).
- variablePrefix / variableSuffix: Define the prefix/suffix for variable substitution in dynamic strings (default is __). Can be set globally on <setup> via defaultVariablePrefix / defaultVariableSuffix and overridden per element.
Storage Modes
Use storage for explicit, predictable behavior:
value: single static value. Ideal for configuration values, selector scalar queries, constants, generators.
data: full list materialized in memory. Useful for analytics, random access, or joining in scripts. Be mindful of size.
iterator: efficient row-by-row iteration. Honors cyclic. Best for large tables/files.
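The three storage modes map onto familiar Python access patterns. The sketch below is an analogy under that assumption, not DATAMIMIC's internals; the helper names and the sample rows are hypothetical.

```python
from itertools import cycle

rows = [{"id": 1}, {"id": 2}, {"id": 3}]  # stand-in for a table or file


def as_value(rows):
    """storage="value": keep only a single value (here: the first row)."""
    return rows[0]


def as_data(rows):
    """storage="data": materialize the full list for random access."""
    return list(rows)


def as_iterator(rows, cyclic):
    """storage="iterator": hand out rows one by one; wrap around if cyclic."""
    return cycle(rows) if cyclic else iter(rows)


first = as_value(rows)
everything = as_data(rows)
cursor = as_iterator(rows, cyclic=True)
seen = [next(cursor)["id"] for _ in range(4)]  # wraps back to the first row
```

A non-cyclic iterator raises StopIteration after the last row, which corresponds to the source being exhausted.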
Legacy Automatic Behavior (Backward Compatible)
If storage is omitted, DATAMIMIC applies the legacy rules:
- Selector-based variables: behave as a static single value.
- Table/collection variables: cycle over rows (iterator semantics).
- Generator/constant variables: single value.
Context Levels
- Root-level variables (declared directly under <setup>): loaded once, shared across the run, stable in multiprocessing.
- Nested variables (declared inside <generate>): created per generation scope; can exhaust when cyclic="False".
Multiprocessing Notes
storage="data": each worker receives the same snapshot list.
storage="iterator": each worker advances its own cursor. For globally partitioned traversal, partition upstream (e.g., by ID ranges).
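Partitioning by ID ranges, as suggested for iterator storage, could look like the following Python sketch. It only computes the ranges; feeding each range into a selector is left to the model. The function name `id_ranges` is hypothetical and the approach assumes numeric, roughly contiguous IDs.

```python
def id_ranges(first_id, last_id, workers):
    """Split [first_id, last_id] into contiguous, non-overlapping ranges, one per worker."""
    total = last_id - first_id + 1
    base, extra = divmod(total, workers)
    ranges = []
    start = first_id
    for index in range(workers):
        size = base + (1 if index < extra else 0)  # spread the remainder over the first workers
        if size == 0:
            continue  # more workers than IDs: leave the surplus workers without a range
        ranges.append((start, start + size - 1))
        start += size
    return ranges


partitions = id_ranges(1, 10, 3)  # [(1, 4), (5, 7), (8, 10)]
```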
Example 1: Using generator for Incrementing Values
| <setup>
<generate name="sequential_ids" count="10" >
<variable name="id" generator="IncrementGenerator"/>
<key name="generated_id" script="id"/>
</generate>
</setup>
|
In this example:
- The id variable uses the IncrementGenerator, which generates sequential numbers.
- The generated ID is then assigned to the generated_id key for each record.
Example 2: Sourcing Data from a CSV File with separator
| <setup>
<generate name="person_data" count="5" >
<variable name="person" source="data/people.csv" separator="," distribution="ordered"/>
<key name="person_id" script="person.id"/>
<key name="person_name" script="person.name"/>
<key name="person_age" script="person.age"/>
</generate>
</setup>
|
In this example:
- The person variable is sourced from a CSV file, with fields separated by a comma.
- The distribution="ordered" attribute ensures that records are processed in the order they appear in the file.
Example 3: Defining a constant Variable
| <setup>
<generate name="constant_value_example" count="3" >
<variable name="country" constant="Germany"/>
<key name="user_country" script="country"/>
</generate>
</setup>
|
In this case:
- The country variable is defined as a constant with the value "Germany".
- This value is applied to the user_country key of every generated record.
Example 4: Generating Dynamic Variables with script
| <setup>
<generate name="dynamic_variables" count="5" >
<variable name="random_number" script="random.randint(1, 100)"/>
<variable name="full_name" script="fake.name()"/>
<key name="random_number_value" script="random_number"/>
<key name="full_name_value" script="full_name"/>
</generate>
</setup>
|
In this example:
- The random_number variable generates a random integer between 1 and 100 using a script.
- The full_name variable uses the fake library to generate random names.
- These dynamically generated values are then printed for each record.
Example 5: Using cyclic Variables with a CSV Source
| <setup>
<generate name="cyclic_people" count="8" >
<variable name="person" source="data/people.csv" cyclic="True" separator=","/>
<key name="person_id" script="person.id"/>
<key name="person_name" script="person.name"/>
</generate>
</setup>
|
In this example:
- The cyclic="True" attribute ensures that once all records from the CSV file are used, the data starts from the beginning again.
Example 6: Using distribution to Randomize Data Selection
| <setup>
<generate name="random_people" count="10" >
<variable name="person" source="data/people.csv" separator="," distribution="random"/>
<key name="person_id" script="person.id"/>
<key name="person_name" script="person.name"/>
</generate>
</setup>
|
Here:
- The distribution="random" attribute ensures that the records are selected randomly from the source CSV file.
Example 7: Iterating with iterationSelector
| <setup>
<generate name="iterate_selector" count="20" >
<key name="iteration_count" generator="IncrementGenerator"/>
<variable name="user" source="dbPostgres"
iterationSelector="SELECT id, name FROM users WHERE id = __iteration_count__"/>
<key name="user_id" script="user[0].id"/>
<key name="user_name" script="user[0].name"/>
</generate>
</setup>
|
In this example:
- The iterationSelector query retrieves data from a PostgreSQL database for each iteration using the iteration_count value, dynamically fetching user information.
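The per-iteration query pattern can be tried out with Python's built-in sqlite3 module. This sketch stands in for the PostgreSQL source with an in-memory database and invented rows, uses a bound parameter instead of the __iteration_count__ text substitution, and assumes the counter starts at 1.

```python
import sqlite3

# In-memory stand-in for the dbPostgres source; table contents are made up.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows access by column name
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Cameron")],
)


def run_iteration_selector(conn, iteration_count):
    """Run the query once for the current iteration and return all matching rows."""
    query = "SELECT id, name FROM users WHERE id = ?"
    return conn.execute(query, (iteration_count,)).fetchall()


records = []
for iteration_count in range(1, 4):
    user = run_iteration_selector(conn, iteration_count)  # a list of rows, hence user[0]
    records.append({
        "iteration_count": iteration_count,
        "user_id": user[0]["id"],
        "user_name": user[0]["name"],
    })
```

Because the query returns a list of rows, the keys index the first row with user[0], just as the scripts in the example do. An iteration whose ID has no matching row would return an empty list, so user[0] would fail.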
Example 8: Defining Weighted Variables with weightColumn
| <setup>
<generate name="weighted_people" count="10" >
<variable name="people" source="data/people_weighted.csv" weightColumn="weight" separator=","/>
<key name="person_id" script="people.id"/>
<key name="person_name" script="people.name"/>
</generate>
</setup>
|
Here:
- The weightColumn="weight" attribute controls how frequently each row is selected. Rows with higher weight values are more likely to be chosen.
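Weighted selection of this kind can be reproduced with random.choices from Python's standard library. The rows and weights below are invented for illustration and do not reflect the contents of data/people_weighted.csv; `pick_weighted` is a hypothetical helper.

```python
import random

# Hypothetical rows; the "weight" column drives the selection frequency.
rows = [
    {"id": 1, "name": "Alice", "weight": 8},
    {"id": 2, "name": "Bob", "weight": 1},
    {"id": 3, "name": "Cameron", "weight": 1},
]


def pick_weighted(rows, weight_column, k, rng):
    """Draw `k` rows with replacement, proportionally to the weight column."""
    weights = [float(row[weight_column]) for row in rows]
    return rng.choices(rows, weights=weights, k=k)


rng = random.Random(7)  # seeded so the sample is reproducible
picks = pick_weighted(rows, "weight", 10_000, rng)
alice_share = sum(p["name"] == "Alice" for p in picks) / len(picks)  # roughly 8 / 10
```

Weights are relative: 8, 1, 1 gives the same distribution as 0.8, 0.1, 0.1.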
Example 9: Combining Variables with Nested Keys
| <setup>
<generate name="customer_info" count="10" >
<variable name="customer" source="data/customers.csv" cyclic="True"/>
<variable name="notification" source="data/notifications.csv" cyclic="True"/>
<key name="customer_id" script="customer.id"/>
<key name="customer_name" script="customer.name"/>
<nestedKey name="notifications" type="list" count="2">
<key name="notification_type" script="notification.type"/>
<key name="notification_message" script="notification.message"/>
</nestedKey>
</generate>
</setup>
|
In this case:
- The customer and notification variables are both sourced from CSV files.
- The nestedKey element generates two notifications for each customer, showcasing how variables can be combined with nested structures.
Example 10: Working with Entities and Locale-Specific Data
| <setup>
<generate name="localized_data" count="5" >
<variable name="person" entity="Person" locale="de_DE"/>
<key name="person_name" script="person.full_name"/>
<key name="person_address" script="person.address"/>
</generate>
</setup>
|
In this example:
- The person variable is generated using the Person entity, with data localized to de_DE (Germany).
- This can be used to generate locale-specific data like names, addresses, etc.
Example 11: Using string Attribute for Dynamic and Complex Strings
| <setup defaultVariablePrefix="%%" defaultVariableSuffix="%%">
<generate name="query_generation" count="1">
<variable name="collection" constant="'users'" />
<key name="query" string="find: %%collection%%, filter: {'status': 'active'}" />
</generate>
</setup>
|
In this example:
- The string attribute allows dynamic insertion of the variable collection into the query.
- The custom %% prefix and suffix replace the default __.
Example 12: Default variablePrefix and variableSuffix
| <setup>
<generate name="query_generation" count="1">
<variable name="collection" constant="'users'" />
<key name="query" string="find: __collection__, filter: {'status': 'active'}" />
</generate>
</setup>
|
In this case:
- The default __ delimiters are used for variable substitution.
Example 13: Explicit Iterator Storage with Cycling Control
| <setup>
<generate name="sales" count="50">
<variable name="products" source="db" type="product_table" storage="iterator" cyclic="True"/>
<key name="product_id" script="products.id"/>
</generate>
</setup>
|
Example 14: Storing the Complete Dataset in Memory
| <setup>
<variable name="all_users" source="db" type="user_table" storage="data"/>
<generate name="analytics" count="10">
<key name="total_users" script="len(all_users)"/>
<key name="random_user" script="all_users[random.randint(0, len(all_users)-1)]"/>
</generate>
</setup>
|
Example 15: Forcing Single-Value Behavior
| <setup>
<generate name="test" count="5">
<variable name="first_user" source="db" type="user_table" storage="value"/>
<key name="template_user" script="first_user.name"/>
</generate>
</setup>
|
Best Practices for Using <variable>
- Dynamic Data Generation: Use scripts in variables to create dynamic data like random numbers, names, and addresses using libraries like random and fake.
- Cyclic vs Non-Cyclic: Use cyclic variables when you want data to repeat once all values are used, while non-cyclic variables are exhausted after one pass.
- Weighting and Randomization: Use weightColumn to skew data generation toward certain records and distribution="random" to randomize data selection.
- Combining with Nested Keys: Use variables in combination with nestedKey to generate structured, hierarchical data.
- Pick the right storage: Use value for scalars/config, iterator for large sources, data for small datasets you need to index.
- Prefer explicit type for DB sources: Avoid relying on variable names to infer tables/collections.
- Mind multiprocessing: data is shared as a snapshot; iterators advance per worker.
Storage Mode Summary
| Variable Pattern | Storage Mode | Behavior | Use Case |
| --- | --- | --- | --- |
| selector="..." | auto → value | Single value | Config, max values, constants |
| type="table_name" | auto → iterator | Cycles through data | Entity relationships |
| storage="value" | Explicit | Single value | Force static behavior |
| storage="data" | Explicit | Complete list | Calculations, random access |
| storage="iterator" | Explicit | Cycles/exhausts by cyclic | Controlled iteration |
<ml-train>
The <ml-train> element is used to train machine learning models with input data. These trained models can then be used as sources in <generate> elements to enrich original data.
The <ml-train> element is a child element of <setup>.
Attributes
- name: Specifies the name of the trained model. This is mandatory and is used to reference the model in other elements.
- source: Specifies the source of the data (e.g., data/active.ent.csv, mongo).
- type: Specifies the type of data to generate.
- mode: Specifies the training mode. Currently only 'default' and 'persist' are available. 'default' removes the model after all tasks have finished; 'persist' keeps the model after all tasks have finished.
- maxTrainingTime: Specifies the maximum time allowed for model training, in minutes (e.g., 1, 5, 10).
- separator: Specifies the separator used in the source data file (e.g., ',' for CSV files).
Example 1: Basic Model Training
| <setup>
<ml-train name="customer_csv_gen"
source="data/customer.ent.csv"
maxTrainingTime="1"
separator=","/>
<generate name="csv_customer" count="10000" pageSize="1000" source="customer_csv_gen" target="CSV">
<key name="id" generator="IncrementGenerator"/>
</generate>
</setup>
|
In this example:
- The model named "customer_csv_gen" is trained using data from "data/customer.ent.csv".
- No "mode" is specified, so the default applies and the "customer_csv_gen" model is removed after all tasks finish.
- The CSV file uses a comma as its separator.
- The <generate> element uses the trained "customer_csv_gen" model as its source to create new data.
Example 2: Training with persist mode
| <setup numProcess="2">
<ml-train name="customer_csv_gen"
source="data/customer.ent.csv"
mode="persist"
maxTrainingTime="1"/>
<!-- Generate synthetic CUSTOMER records using the ML generator -->
<generate name="csv_customer" count="10000" pageSize="1000" source="customer_csv_gen" target="CSV">
<key name="id" generator="IncrementGenerator"/>
</generate>
</setup>
|
In this example:
- "mode" is set to "persist", so the model is kept even after all tasks finish.
- The model can later be used without retraining:
| <setup numProcess="2">
<generate name="csv_customer" count="10000" pageSize="1000" source="customer_csv_gen" target="CSV">
<key name="id" generator="IncrementGenerator"/>
</generate>
</setup>
|
Complete Basic Example
Here's a complete example combining the core elements:
| <setup>
<generate name="user_data" count="100" target="CSV">
<!-- Define variables -->
<variable name="person" entity="Person"/>
<!-- Define keys -->
<key name="id" generator="IncrementGenerator"/>
<key name="first_name" script="person.given_name"/>
<key name="last_name" script="person.family_name"/>
<key name="age" type="int" generator="IntegerGenerator(min=18, max=80)"/>
<key name="status" constant="active"/>
</generate>
</setup>
|
This will generate 100 user records with consistent, structured data including IDs, names, ages, and a status field.
Next Steps
Once you're comfortable with these core elements, explore the Advanced Data Definition Elements for more complex features like:
- Nested data structures
- Conditional generation
- Complex data patterns
- Arrays and lists
- Advanced variable usage