Data Definition Model - Core Elements¶
Data Definition Models are fundamental to DATAMIMIC's test data generation capabilities. This document covers the essential elements - if you're new to DATAMIMIC, start here. For advanced features, see Advanced Data Definition Elements.
Overview¶
Data Definition Models specify how test data should be generated, transformed, or obfuscated. The core elements allow you to:
- Define data generation tasks
- Specify key fields and their values
- Create and use variables
- Generate structured data sets
Basic Elements¶
<setup>¶
The `<setup>` element is the root element for all data generation tasks. It contains one or more `<generate>` elements that define specific data generation operations. Learn more about its use in Configuration Models.
<generate>¶
The `<generate>` element is the core of Data Definition Models. It defines a data generation task and includes attributes like `name`, `count`, and `target`. This element is used to create structured data based on the specified configurations.
Attributes:¶
- name: Specifies the name of the generation task.
- count: Specifies the number of records to generate.
- source: Specifies the source of the data (e.g., `data/active.ent.csv`, `mongo`).
- target: Specifies the target output (e.g., `CSV`, `sqliteDB`).
- type: Specifies the type of data to generate.
- cyclic: Enables or disables cyclic generation. Default is `False`.
- selector: Specifies a database query for the generation.
- separator: Specifies a separator for the generated data. Default is `|`.
- sourceScripted: Enables or disables scripted source evaluation in the source file (e.g., `example.ent.csv`, `example.json`). Default is `False`.
- pageSize: Specifies the page size for data generation.
- storageId: Specifies the ID of object storage, defined by the `<object-storage>` element.
- sourceUri: Specifies the URI of the datasource on object storage (e.g., `datasource/employees.csv`).
- exportUri: Specifies the URI for exporting generated data on object storage (e.g., `export/product.csv`).
- container: Specifies the container name for Azure Blob Storage.
- bucket: Specifies the bucket name for AWS S3.
- multiprocessing: Enables or disables multiprocessing for data generation. Default is `False`.
- distribution: Specifies the distribution of data source iteration (e.g., `random`, `ordered`). Default is `random`.
- converter: Specifies a converter to transform the value.
- variablePrefix: Configurable attribute that defines the prefix for variable substitution in dynamic strings (default is __).
- variableSuffix: Configurable attribute that defines the suffix for variable substitution in dynamic strings (default is __).
Children:¶
- `<key>`: Specifies key fields within the data generation task.
- `<variable>`: Defines variables used in data generation.
- `<reference>`: Defines references to other generated data.
- `<nestedKey>`: Specifies nested key fields and their generation methods.
- `<list>`: Defines lists of data items.
- `<condition>`: Conditional element to include data based on certain conditions.
- `<array>`: Defines arrays of data items.
- `<echo>`: Outputs text or variables for logging or debugging purposes.
Example 1: Using Object Storage for Data Generation¶
Example 2: Using selector with a Database¶
In this example:
- The `selector` is used to query the MongoDB database to find all customers under 30 years old.
- The data is output to the `ConsoleExporter`.
Example 3: Generating Data with MongoDB and Aggregation¶
Example 4: Generating Data with Kafka¶
Example 5: Using Data from a CSV File¶
In this example:
- Two `generate` tasks are created that source data from CSV files and output it to the `ConsoleExporter`.
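A model along these lines might look like the following minimal sketch. The task names and the second file path are hypothetical placeholders; only the `source` and `target` attributes and the `ConsoleExporter` target come from the description above.

```xml
<setup>
    <!-- File paths are placeholders; point each source at your own CSV file. -->
    <generate name="active_customers" source="data/active.ent.csv" target="ConsoleExporter"/>
    <generate name="inactive_customers" source="data/inactive.ent.csv" target="ConsoleExporter"/>
</setup>
```

No `count` is given here, on the assumption that each task simply iterates over the rows of its source file.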
Example 6: Using cyclic with Data from Memory Store¶
Example 7: Using 'sourceScripted' with JSON template¶
In this example:
- The `sourceScripted="True"` attribute is used to evaluate the JSON template with embedded variables.
- The JSON template contains placeholders for variables like `random_age`, `street_name`, and `address_number`.
- If a whole JSON field value is a variable, it should be enclosed in curly braces `{}` (e.g., `"age": "{random_age}"`). The returned value can be a string, integer, or any other type.
- If a variable is embedded within a string, it should be enclosed in double underscores `__` (e.g., `"address": "__address_number__, __street_name__ St"`). The returned value will be a string.

You can also customize the prefix and suffix for variable substitution using the `variablePrefix`, `variableSuffix`, `defaultVariablePrefix`, and `defaultVariableSuffix` attributes. For example:

```xml
<setup defaultVariablePrefix="-%" defaultVariableSuffix="%-">
    <generate name="json_data" source="script/data.json" sourceScripted="True" target="">
        <variable name="random_age" generator="IntegerGenerator(min=18, max=65)"/>
    </generate>
</setup>

<setup>
    <generate name="json_data" source="script/data.json" sourceScripted="True" target="" variablePrefix="-%" variableSuffix="%-">
        <variable name="random_age" generator="IntegerGenerator(min=18, max=65)"/>
    </generate>
</setup>
```
The `<generate>` element defines a data generation task. At its most basic, it requires:
- name: Identifies the generation task
- count: Specifies how many records to generate
- target: (Optional) Specifies the output format (e.g., CSV, JSON)
Basic Example¶
Essential Attributes¶
- name: Task identifier
- count: Number of records to generate
- target: Output format (e.g., `CSV`, `JSON`, `ConsoleExporter`)
- source: (Optional) Input data source
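As a minimal sketch of these attributes in use, the model below defines a single task with one key. The task name, key name, and record count are hypothetical; `IncrementGenerator` and `ConsoleExporter` are the names mentioned elsewhere in this document.

```xml
<setup>
    <!-- name identifies the task, count sets the number of records, target selects the output -->
    <generate name="basic_users" count="10" target="ConsoleExporter">
        <key name="id" generator="IncrementGenerator"/>
    </generate>
</setup>
```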
<key>¶
The `<key>` element defines key fields within a data generation task and specifies their generation methods. These fields are crucial for creating unique identifiers or structured elements within the generated data. The `<key>` element allows for dynamic, constant, or conditional data generation and provides several attributes to customize its behavior.
Attributes:¶
- name: Specifies the name of the key. This is mandatory and will be used as the field name in the generated data.
- type: Defines the data type of the key (e.g., `string`, `int`, `bool`). This is optional when using `script` or `generator`.
- source: Specifies the data source for the key (e.g., a database, a file).
- separator: Specifies a separator for a CSV source.
- values: Provides a list of static values for the key to choose from.
- script: Defines a script for dynamically generating the key's value.
- generator: Specifies a generator to automatically create values (e.g., `RandomNumberGenerator`, `IncrementGenerator`).
- constant: Defines a constant value for the key.
- condition: Specifies a condition to determine whether the key will be generated.
- converter: Specifies a converter to transform the value (e.g., date conversion, format changes).
- pattern: Defines a regex pattern to validate the value of the key.
- inDateFormat / outDateFormat: Specifies input and output date formats for converting date values.
- defaultValue: Provides a default value if the key’s value is null or not generated.
- nullQuota: Defines the probability that the key will be assigned a null value. Default is `0` (never null).
- database: Specifies the database used for generating values (e.g., `SequenceTableGenerator`).
- string: Generates complex strings by embedding variables within the string using customizable delimiters (read more in the variable section).
- variablePrefix: Configurable attribute that defines the prefix for variable substitution in dynamic strings (default is `__`).
- variableSuffix: Configurable attribute that defines the suffix for variable substitution in dynamic strings (default is `__`).
Example 1: Generating Constant and Scripted Keys¶
In this example:
- `static_key` is assigned a constant value of `"fixed_value"` for every record.
- `dynamic_key` generates a random integer between 1 and 100 for each record using a script.
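The two keys described above could be declared roughly as follows. This is a sketch, not the original listing: the task name and count are hypothetical, and the script assumes Python's `random` module is available to DATAMIMIC scripts.

```xml
<setup>
    <generate name="key_demo" count="5" target="ConsoleExporter">
        <!-- Same value in every record -->
        <key name="static_key" constant="fixed_value"/>
        <!-- Script is evaluated once per record (assumes random is available to scripts) -->
        <key name="dynamic_key" script="random.randint(1, 100)"/>
    </generate>
</setup>
```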
Example 2: Handling nullQuota for Nullable Fields¶
In this example:
- `key_always_null` will always have a null value (`nullQuota="1"`).
- `key_never_null` will never have a null value (`nullQuota="0"`).
- `key_sometimes_null` will have a null value 50% of the time (`nullQuota="0.5"`).
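A sketch of the three keys, assuming each one would otherwise produce a constant value. The task name, count, and constant are hypothetical; the key names and `nullQuota` values are the ones listed above.

```xml
<setup>
    <generate name="null_quota_demo" count="10" target="ConsoleExporter">
        <key name="key_always_null" constant="value" nullQuota="1"/>
        <key name="key_never_null" constant="value" nullQuota="0"/>
        <!-- Roughly half of the records should carry a null here -->
        <key name="key_sometimes_null" constant="value" nullQuota="0.5"/>
    </generate>
</setup>
```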
Example 3: Using defaultValue for Fallback Values¶
Here:
- The first two keys fall back to their `defaultValue` when the script generates an empty or `None` value.
- The third key doesn’t generate any value since its `condition` is `False`.
Example 4: Conditional Key Generation¶
In this example:
- `conditional_key` is generated only when a random number greater than 50 is produced by the `condition` script.
- `constant_key` is always generated since its `condition="True"`.
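The two keys could be written along these lines. This is a sketch under the assumption that `condition` accepts a script expression and that `random` is available to it; the task name and constant values are placeholders.

```xml
<setup>
    <generate name="condition_demo" count="10" target="ConsoleExporter">
        <!-- Present only in records where the random draw exceeds 50 -->
        <key name="conditional_key" constant="sometimes present" condition="random.randint(1, 100) > 50"/>
        <!-- Present in every record -->
        <key name="constant_key" constant="always present" condition="True"/>
    </generate>
</setup>
```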
Example 5: Using pattern to Validate Keys¶
In this example:
- The `email` key’s value must match the regex pattern for a valid email format.
- The `phone_number` key’s value must match the regex pattern for a valid phone number format (`123-456-7890`).
Example 6: Date Conversion Using inDateFormat and outDateFormat¶
In this example:
- The `date_of_birth` key uses the input date format (`inDateFormat="%Y-%m-%d"`) to parse the date and converts it to the specified output format (`outDateFormat="%d-%m-%Y"`).
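A sketch of such a key, assuming the input date is supplied as a constant. The task name and the date itself are hypothetical; the two format strings are the ones quoted above.

```xml
<setup>
    <generate name="date_demo" count="1" target="ConsoleExporter">
        <!-- Intent: parse 1990-05-17 as year-month-day, then emit it as day-month-year -->
        <key name="date_of_birth" constant="1990-05-17" inDateFormat="%Y-%m-%d" outDateFormat="%d-%m-%Y"/>
    </generate>
</setup>
```

The format codes follow the familiar `strftime` conventions, where `%Y` is the four-digit year, `%m` the month, and `%d` the day.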
Example 7: Key Generation from a SequenceTableGenerator¶
Here:
- The `user_id` key is generated using a `SequenceTableGenerator` from a PostgreSQL database.
- This generator ensures that unique, sequential values are pulled from the database.
Best Practices for Using <key>¶
- Leverage `script` for Dynamic Values: Use `script` to generate complex and dynamic values, such as random numbers, dates, or values based on calculations.
- Use `nullQuota` for Realistic Data: Use `nullQuota` to simulate real-world scenarios where some keys may have null values.
- Fallback with `defaultValue`: Use `defaultValue` to ensure that your keys always have a fallback value if a script fails or produces `None`.
- Pattern Matching for Validation: Use the `pattern` attribute to enforce specific formatting rules, such as email addresses or phone numbers.
- Control Key Generation with `condition`: Use the `condition` attribute to dynamically determine whether a key should be generated, allowing for more control in complex data generation scenarios.
<variable>¶
The `<variable>` element defines variables used in data generation tasks. Variables can be sourced from databases, datasets, or dynamically generated using scripts. They introduce flexibility in creating dynamic test data by controlling how the data is retrieved or iterated.
Attributes:¶
- name: Specifies the name of the variable.
- type: Defines the data type of the variable (optional).
- source: Specifies the data source for the variable (e.g., a database or a file).
- selector: Defines a query to retrieve data for the variable from a database (executed once).
- iterationSelector: Executes a query on each iteration to retrieve dynamic data for the variable.
- separator: Specifies a separator for the variable (e.g., for CSV sources).
- cyclic: Enables or disables cyclic iteration of the data source.
- entity: Defines the entity for generating data (e.g., a predefined model or object).
- script: Specifies a script for dynamically generating the variable's value.
- weightColumn: Specifies a column to weight data selection (typically used in CSV or database sources).
- sourceScripted: Determines if the source is scripted.
- generator: Defines a generator for the variable (e.g., `RandomNumberGenerator`, `IncrementGenerator`).
- dataset: Specifies the dataset for the variable (usually a file path).
- locale: Defines the locale used when generating data.
- inDateFormat / outDateFormat: Specifies date format conversion for input and output.
- converter: Defines a converter for transforming the variable's value.
- constant: Sets a fixed constant value for the variable.
- values: Provides a list of values for the variable to choose from.
- defaultValue: Sets a default value when no data is available.
- pattern: Defines a regex pattern for validating the variable's content.
- distribution: Controls how data is distributed when selecting from a source (`random`, `ordered`).
- database: Specifies the database used for generating data.
- string: Attribute to generate complex strings by embedding variables within the string using customizable delimiters.
- variablePrefix: Configurable attribute that defines the prefix for variable substitution in dynamic strings (default is `__`).
- variableSuffix: Configurable attribute that defines the suffix for variable substitution in dynamic strings (default is `__`).
Example 1: Using generator for Incrementing Values¶
In this example:
- The `id` variable uses the `IncrementGenerator`, which generates sequential numbers.
- The generated ID is then assigned to the `generated_id` key for each record.
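The variable and key described above might be wired together as in this sketch. The task name and count are hypothetical, and referencing the variable by name in `script` is an assumption about how keys read variables.

```xml
<setup>
    <generate name="increment_demo" count="5" target="ConsoleExporter">
        <variable name="id" generator="IncrementGenerator"/>
        <!-- The key reads the current value of the id variable -->
        <key name="generated_id" script="id"/>
    </generate>
</setup>
```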
Example 2: Sourcing Data from a CSV File with separator¶
In this example:
- The `person` variable is sourced from a CSV file, with fields separated by a comma.
- The `distribution="ordered"` ensures that records are processed in the order they appear in the file.
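A sketch of a CSV-backed variable with these settings follows. The file path and the `name` and `age` columns are hypothetical, as is the dot notation used to read a column from the current row.

```xml
<setup>
    <generate name="people" count="3" target="ConsoleExporter">
        <!-- Rows are read in file order; fields are split on commas -->
        <variable name="person" source="data/person.ent.csv" separator="," distribution="ordered"/>
        <key name="name" script="person.name"/>
        <key name="age" script="person.age"/>
    </generate>
</setup>
```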
Example 3: Defining a constant Variable¶
In this case:
- The `country` variable is defined as a constant with the value `"Germany"`.
- This value is applied to every record generated in the `user_country` key.
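In model form, this could look like the sketch below. The task name and count are placeholders; the variable name, key name, and constant value come from the description above.

```xml
<setup>
    <generate name="constant_demo" count="5" target="ConsoleExporter">
        <variable name="country" constant="Germany"/>
        <!-- Every record receives the same country value -->
        <key name="user_country" script="country"/>
    </generate>
</setup>
```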
Example 4: Generating Dynamic Variables with script¶
In this example:
- The `random_number` variable generates a random integer between 1 and 100 using a script.
- The `full_name` variable uses the `fake` library to generate random names.
- These dynamically generated values are then printed for each record.
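A sketch of two scripted variables, assuming `random` and a Faker-style `fake` object are both available to scripts. The key names and task name are hypothetical, and `ConsoleExporter` stands in for the printing step.

```xml
<setup>
    <generate name="script_demo" count="5" target="ConsoleExporter">
        <variable name="random_number" script="random.randint(1, 100)"/>
        <variable name="full_name" script="fake.name()"/>
        <!-- Expose both variables as output fields so they show up per record -->
        <key name="number" script="random_number"/>
        <key name="name" script="full_name"/>
    </generate>
</setup>
```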
Example 5: Using cyclic Variables with a CSV Source¶
In this example:
- The `cyclic="True"` attribute ensures that once all records from the CSV file are used, the data starts from the beginning again.
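The sketch below shows a cyclic variable feeding a task that asks for more records than the file is assumed to contain. The file path, the `name` column, and the count are hypothetical.

```xml
<setup>
    <!-- If the CSV has fewer than 10 rows, cyclic="True" restarts from the first row -->
    <generate name="cyclic_demo" count="10" target="ConsoleExporter">
        <variable name="item" source="data/items.ent.csv" cyclic="True"/>
        <key name="item_name" script="item.name"/>
    </generate>
</setup>
```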
Example 6: Using distribution to Randomize Data Selection¶
Here:
- The `distribution="random"` attribute ensures that the records are selected randomly from the source CSV file.
Example 7: Iterating with iterationSelector¶
In this example:
- The `iterationSelector` query retrieves data from a PostgreSQL database for each iteration using the `iteration_count` value, dynamically fetching user information.
Example 8: Defining Weighted Variables with weightColumn¶
Here:
- The `weightColumn="weight"` controls how frequently each row is selected. Rows with higher weight values are more likely to be chosen.
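A weighted variable might be declared as in this sketch. It assumes a hypothetical CSV file with a numeric `weight` column and a `name` column; only the `weightColumn="weight"` setting comes from the text above.

```xml
<setup>
    <generate name="weighted_demo" count="20" target="ConsoleExporter">
        <!-- Rows with larger values in the weight column should be picked more often -->
        <variable name="product" source="data/products.ent.csv" weightColumn="weight"/>
        <key name="product_name" script="product.name"/>
    </generate>
</setup>
```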
Example 9: Combining Variables with Nested Keys¶
In this case:
- The `customer` and `notification` variables are both sourced from CSV files.
- The `nestedKey` element generates two notifications for each customer, showcasing how variables can be combined with nested structures.
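The structure described above might be sketched as follows. The `<nestedKey>` attributes are not documented in this section, so `type="list"` and `count="2"` are assumptions, as are the file paths and column names; consult the advanced elements documentation for the exact syntax.

```xml
<setup>
    <generate name="customer_notifications" count="3" target="ConsoleExporter">
        <variable name="customer" source="data/customers.ent.csv" cyclic="True"/>
        <key name="customer_name" script="customer.name"/>
        <!-- Assumed syntax: a list of two nested records per customer -->
        <nestedKey name="notifications" type="list" count="2">
            <variable name="notification" source="data/notifications.ent.csv" cyclic="True"/>
            <key name="message" script="notification.message"/>
        </nestedKey>
    </generate>
</setup>
```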
Example 10: Working with Entities and Locale-Specific Data¶
In this example:
- The `person` variable is generated using the `Person` entity, with data localized to `de_DE` (Germany).
- This can be used to generate locale-specific data like names, addresses, etc.
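A sketch of an entity-backed variable with a locale follows. The `entity` and `locale` values come from the description above; the task name is a placeholder, and the `name` field read from the entity is an assumption about what `Person` exposes.

```xml
<setup>
    <generate name="german_people" count="5" target="ConsoleExporter">
        <variable name="person" entity="Person" locale="de_DE"/>
        <!-- person.name is an assumed field; check the entity reference for available fields -->
        <key name="full_name" script="person.name"/>
    </generate>
</setup>
```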
Example 11: Using string Attribute for dynamic and complex strings¶
In this example:
- The `string` attribute allows dynamic insertion of the variable `collection` into the query.
- The custom `%%` prefix and suffix replace the default `__`.
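A sketch of the `string` attribute with custom delimiters is shown below. The `collection` variable and the `%%` delimiters come from the description above; the query text, the `query` variable, and the output key are hypothetical.

```xml
<setup>
    <generate name="string_demo" count="1" target="ConsoleExporter">
        <variable name="collection" constant="users"/>
        <!-- %%collection%% is replaced with the value of the collection variable -->
        <variable name="query" string="find: '%%collection%%', filter: {}" variablePrefix="%%" variableSuffix="%%"/>
        <key name="resolved_query" script="query"/>
    </generate>
</setup>
```

With the default delimiters, the same placeholder would be written `__collection__` and the two prefix and suffix attributes could be dropped.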
Example 12: Default variablePrefix and variableSuffix¶
In this case:
- The default `__` delimiters are used for variable substitution.
Key Benefits of the string Attribute¶
- Simplicity: Embedding variables directly within the string eliminates the need for manual string concatenation or escaping.
- Readability: Dynamic strings are easier to read and maintain.
- Flexibility: The `variablePrefix` and `variableSuffix` attributes allow customization of the delimiters used, providing more flexibility when working with different syntaxes or conventions.
Best Practices for Using <variable>¶
- Dynamic Data Generation: Use scripts in variables to create dynamic data like random numbers, names, and addresses using libraries like `random` and `fake`.
- Cyclic vs Non-Cyclic: Use `cyclic` variables when you want data to repeat once all values are used, while non-cyclic variables are exhausted after one pass.
- Weighting and Randomization: Use `weightColumn` to skew data generation toward certain records and `distribution="random"` to randomize data selection.
- Combining with Nested Keys: Use variables in combination with `nestedKey` to generate structured, hierarchical data.
Complete Basic Example¶
Here's a complete example combining the core elements:
This will generate 100 user records with consistent, structured data including IDs, names, ages, and a status field.
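A model matching that description might look like the following sketch. It is not the original listing: the task name, key names, age range, and status value are hypothetical, and the scripts assume a Faker-style `fake` object is available.

```xml
<setup>
    <generate name="users" count="100" target="CSV">
        <!-- Variable holds a generated name that the key below exposes -->
        <variable name="full_name" script="fake.name()"/>
        <key name="id" generator="IncrementGenerator"/>
        <key name="name" script="full_name"/>
        <key name="age" generator="IntegerGenerator(min=18, max=65)"/>
        <key name="status" constant="active"/>
    </generate>
</setup>
```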
Next Steps¶
Once you're comfortable with these core elements, explore the Advanced Data Definition Elements for more complex features like:
- Nested data structures
- Conditional generation
- Complex data patterns
- Arrays and lists
- Advanced variable usage