# DATAMIMIC

Model-driven, deterministic-first test data.
Open-source library (CE) and on-prem enterprise platform (EE) sharing the same DSL.
Documentation: https://docs.datamimic.io
Community Project: https://github.com/rapiddweller/datamimic
Enterprise Edition: https://datamimic.io
DATAMIMIC generates test data for banks, insurers, and public-sector systems. It combines rule-based generators for regulated data shapes (IBAN, BIC), template exporters for SWIFT, PACS, EDIFACT, and HL7 message formats, and deterministic seed-to-hash contracts so the same input produces byte-identical output across environments. The Enterprise Edition adds auto-regressive ML generators for cases where statistical fidelity matters.
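Regulated data shapes such as IBANs are defined by explicit check rules, which is what makes them amenable to rule-based generation and validation. As an illustrative sketch of the kind of rule involved (generic Python, not DATAMIMIC's own implementation), the ISO 7064 mod-97 check that every structurally valid IBAN must satisfy fits in a few lines:

```python
def iban_is_valid(iban: str) -> bool:
    """ISO 7064 mod-97 check: a structurally valid IBAN yields remainder 1."""
    s = iban.replace(" ", "").upper()
    # Move the country code and check digits to the end,
    # then map letters to numbers (A=10 ... Z=35).
    rearranged = s[4:] + s[:4]
    as_digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(as_digits) % 97 == 1

print(iban_is_valid("GB82 WEST 1234 5698 7654 32"))  # canonical valid example
print(iban_is_valid("GB82 WEST 1234 5698 7654 33"))  # one digit off: invalid
```

Because mod-97 detects every single-character substitution, a generator that emits only remainder-1 strings produces IBANs that downstream systems accept as structurally valid.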
## Two editions, shared DSL, different runtime
CE and EE share most of the DSL surface, so most models run on both. They diverge on runtime engineering, integration coverage, and operational depth.
| | CE (Community Edition) | EE (Enterprise Edition) |
|---|---|---|
| What it is | Python library and `datamimic` CLI | On-prem platform: Web UI, REST API, scheduler, workers (which encapsulate the EE Core engine), optional LSP |
| Processing core | Python | Python and Rust fast path, materially more optimised |
| Datasource scan | Skip/offset (`OFFSET / LIMIT`) SQL pagination executed by the database | Keyset + manifest scan synchronised across worker processes |
| Logging & error handling | Standard | Structured logs, advanced error catalog and recovery |
| Streaming / messaging | – | Apache Kafka source and target |
| ML | – | Auto-regressive ML (TabularARGN, Apache 2.0) |
| Triggered by | Local CLI / Python scripts | UI or CI via REST, dispatched as worker tasks |
| Deployment | `pip install datamimic-ce` | Docker, Podman, Kubernetes, OpenShift, or Helm chart |
| Governance | – | User/group management, audit trail on tasks and schedules, structured per-task logs, deterministic re-runs |
| CE-richer surface | MCP support, richer demographic profiles | – |
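The two scan strategies in the table differ in how each page is located: skip/offset makes the database re-count all previously skipped rows, while keyset pagination seeks directly past the last key seen. A minimal sketch of the contrast (illustrative only, using SQLite; EE's worker-synchronised manifest scan is a different, distributed implementation):

```python
import sqlite3

def offset_scan(conn, page_size=3):
    """Skip/offset pagination: the database re-skips prior rows on every page."""
    offset = 0
    while True:
        page = conn.execute(
            "SELECT id, name FROM users ORDER BY id LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not page:
            return
        yield from page
        offset += page_size

def keyset_scan(conn, page_size=3):
    """Keyset pagination: resume strictly after the last seen key via an index seek."""
    last_id = 0
    while True:
        page = conn.execute(
            "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not page:
            return
        yield from page
        last_id = page[-1][0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user_{i}") for i in range(1, 11)])
assert list(offset_scan(conn)) == list(keyset_scan(conn))
```

Both scans yield the same rows; the difference is cost. `OFFSET n` work grows with `n`, whereas the keyset query stays an O(page) index seek no matter how deep the scan is, which is why it holds up at multi-million-record scale.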
Pages tagged (CE+EE) apply to both editions. Pages tagged (CE) are CE-specific. Pages tagged (EE) require the Enterprise Edition.
## Key features
- Composable Models (CE+EE): Compose descriptors via `<include>`. Reusable fragments declare expected inputs with a root-level `<param>`; call sites pass per-instance values via `<property>` overlays, so one fragment serves many business variants without copy-paste. See the release notes for 3.3.0.
- Multi-Format Output (CE+EE): One model targets SQL databases, JSON, XML, CSV, and EDIFACT in the same run. Kafka source/target is EE-only.
- JSON and XML Handling (CE+EE): Generate and transform deeply nested structures with full control over hierarchy and types.
- Data Anonymisation and Pseudonymisation (CE+EE): Field-level masking configured in the model. The same configuration reproduces the same masked output across runs.
- MCP Support (CE): Model Context Protocol integration, available in the Community Edition for LLM-assisted workflows.
- Demographic Profiles (CE): CE ships richer demographic-profile coverage out of the box.
- Automatic PII Detection (EE): When generating a model from database metadata, fields likely to contain PII are auto-marked with `#SENSITIVE` so they are masked by default. Authors review and adjust before export.
- Privacy Alignment (CE+EE): Aligned with GDPR Art. 25 (privacy by design); supports BCBS 239 and DORA traceability requirements.
- ML-Based Generation, Auto-Regressive (EE): Auto-regressive ML generators (TabularARGN, Apache 2.0). Versioned and quality-graded; outputs are statistically consistent, not byte-identical (only seeded rule-based outputs are).
- Performance (EE): Python + Rust fast-path processing core, used in production to anonymise streaming payment transactions at multi-million-record scale. CE runs on a Python-only core and is materially less optimised.
- Advanced DataSource Scanning (EE): EE reads source databases via keyset pagination with a worker-synchronised manifest, avoiding the latency and consistency cost of `OFFSET / LIMIT` queries at scale. CE uses standard skip/offset pagination executed by the database.
- Advanced Logging and Error Handling (EE): Structured per-task logs, an explicit error catalog, and richer recovery semantics.
- Per-Run Audit Contracts (EE): Every task is logged with task ID, inputs, status, and structured per-task logs. With `rngSeed` set on `<setup>`, re-running the same model produces deterministic, identical output across machines and releases.
- Reproducible Test Data in CI/CD: CE: invoke `datamimic` from any CI runner. EE: trigger generation via the REST API; workers execute the EE Core engine and return artifacts on completion.
- In-Editor Authoring Support, LSP v2 (EE, opt-in): Context-aware completion in `script=` and `condition=` attributes, including visible variables, project script symbols, `.ent.csv` columns, and fragment-property inputs at `<include>` call sites. See Experimental LSP.
- Web UI and Reference Documentation (EE): Project, schedule, and task management in the browser; full reference for the model, generators, and API.
- Generator Library (CE+EE): Built-in generators covering domain, demographic, and protocol-specific data, extendable with custom Python code.
- Integration Surface: CE: Python API and `datamimic` CLI. EE: REST API, Helm chart, database connectors for PostgreSQL, Oracle, and MongoDB, streaming via Apache Kafka, and template exporters for industry message formats (SWIFT, PACS, HL7, EDIFACT).
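The pseudonymisation guarantee above, that the same configuration reproduces the same masked output across runs, is commonly achieved with keyed, deterministic hashing. A minimal sketch of that principle (generic Python; DATAMIMIC's actual masking is configured in the model, and the function name here is illustrative):

```python
import hashlib
import hmac

def pseudonymise(value: str, secret: bytes, width: int = 16) -> str:
    """Keyed, deterministic pseudonym: identical input and secret always
    yield the identical token, so joins across tables survive masking."""
    mac = hmac.new(secret, value.encode("utf-8"), hashlib.sha256)
    return mac.hexdigest()[:width]

secret = b"per-project-secret"
a = pseudonymise("alice@example.com", secret)
assert a == pseudonymise("alice@example.com", secret)   # stable across runs
assert a != pseudonymise("bob@example.com", secret)     # distinct inputs diverge
```

Using an HMAC rather than a plain hash matters: without the secret, an attacker cannot precompute a dictionary of pseudonyms for known PII values.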
## Requirements
- CE: Python 3.11
- EE: a modern web browser; the DATAMIMIC UI is available as SaaS or can be deployed on-premises (Docker, Podman, Kubernetes, OpenShift, Helm)
## Example UI test data generation (EE)
### Log in and check the demo store
- Click Clone on the 'Basic Script' tile to create your first project from the Demo Store.
### Generate it
- Start your first DATAMIMIC task by clicking 'GENERATE'.
- The status window shows the processing progress.
### Check it
- Click 'Previews' to see a preview of the data being created.
- Or click 'Logs' for detailed insights into the task, its processing speed, throughput, and more.
- Or navigate to 'Tasks' for an overview of all task executions in your project and their status.
Switch between the main views (Editor, Schedules, Tasks, Settings) from the project bar.
### Download it
- Navigate to 'Tasks'.
- Click the 'Artifact' icon and select the generated file(s) in the Artifact view for download.
## Example DATAMIMIC model
In DATAMIMIC, every project begins with an XML-based main model. Models can be auto-generated from connected databases, JSON, XML, or other file types. Here we start with a basic script to illustrate how a DATAMIMIC model works.
```xml
<setup>
    <generate name="datamimic_user_list" count="100" target="CSV,JSON,XML">
        <variable name="person" entity="Person(min_age=18, max_age=90, female_quota=0.5)"/>
        <key name="id" generator="IncrementGenerator"/>
        <key name="first_name" script="person.given_name"/>
        <key name="last_name" script="person.family_name"/>
        <key name="gender" script="person.gender"/>
        <key name="birthDate" script="person.birthdate" converter="DateFormat('%d.%m.%Y')"/>
        <key name="email" script="person.family_name + '.' + person.given_name + '@example.com'"/>
        <key name="ce_user" values="'True', 'False'"/>
        <key name="ee_user" values="'True', 'False'"/>
        <key name="datamimic_lover" constant="DEFINITELY"/>
    </generate>
</setup>
```
The above is a DATAMIMIC model for generating a dataset named "datamimic_user_list". The dataset contains 100 records, and its target formats are CSV, JSON, and XML. Let's break down the key components of this script:

- `<setup>`: This node encloses every DATAMIMIC model and can carry more advanced configuration.
- `<generate>`: This node defines the dataset to be generated, specifying its name, record count, and target formats.
- `<variable>`: Here we define a variable named "person" and associate it with the "Person" entity, configured to generate individuals aged between 18 and 90 with a 50% female quota.
- `<key>`: Each key element represents one attribute of a record. For example:
    - "id" is generated with the IncrementGenerator.
    - "first_name" takes the value `given_name` from the Person object stored in the `<variable>`.
    - "last_name" takes the value `family_name` from the Person object.
    - "gender" takes the value `gender` from the Person object.
    - "birthDate" takes the value `birthdate` from the Person object, formatted with the "DateFormat" converter.
    - "email" is built by a script that combines "family_name" and "given_name" into an email address.
    - "ce_user" and "ee_user" draw randomly from the predefined values "True" and "False".
    - "datamimic_lover" is set to the constant value "DEFINITELY".
This minimal example shows how a DATAMIMIC model defines data generation. Real projects typically split logic across multiple files and compose them with `<include>`; see the Tutorial for a worked composition example. Re-running the same seeded model produces byte-identical CSV, JSON, and XML output.
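The byte-identical claim is mechanically checkable: hash the emitted artifact and compare digests between runs. A minimal sketch of that seed-to-hash idea (generic Python with a toy generator, not DATAMIMIC's engine):

```python
import hashlib
import json
import random

def generate_users(seed: int, count: int) -> bytes:
    """Toy seeded rule-based generator: the seed fully determines the bytes."""
    rng = random.Random(seed)
    given = ["Ada", "Grace", "Alan", "Edsger"]
    family = ["Lovelace", "Hopper", "Turing", "Dijkstra"]
    rows = [{"id": i + 1,
             "first_name": rng.choice(given),
             "last_name": rng.choice(family)} for i in range(count)]
    # Stable serialisation (sorted keys) so equal data means equal bytes.
    return json.dumps(rows, sort_keys=True).encode("utf-8")

run_1 = hashlib.sha256(generate_users(seed=42, count=100)).hexdigest()
run_2 = hashlib.sha256(generate_users(seed=42, count=100)).hexdigest()
assert run_1 == run_2  # same seed, byte-identical artifact, identical digest
```

Note that determinism needs both a fixed seed and a stable serialisation; either one alone is not enough to make two environments agree byte for byte.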
For a more complete example including further nodes and features such as database connectivity, complex JSON modelling, and obfuscation scenarios, see the Tutorial - User Guide.
## Further Support
**For new customers**
See how DATAMIMIC fits your test-data and compliance workflow. Demos are run with our architects β bring a real schema or use case.
**For existing customers**
For advanced features or production questions, open a support ticket or book a dedicated session with our team.
**More**
Visit www.datamimic.io for services, use cases, and Enterprise Edition details.