# DATAMIMIC

Model-driven, deterministic-first test data.
Open-source library (CE) and on-prem enterprise platform (EE) sharing the same DSL.
Documentation: https://docs.datamimic.io
Community Project: https://github.com/rapiddweller/datamimic
Enterprise Edition: https://datamimic.io
DATAMIMIC generates test data for banks, insurers, and public-sector systems. It combines rule-based generators for regulated data shapes (IBAN, BIC), template exporters for SWIFT, PACS, EDIFACT, and HL7 message formats, and deterministic seed-to-hash contracts so the same input produces byte-identical output across environments. The Enterprise Edition adds auto-regressive ML generators for cases where statistical fidelity matters.
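Regulated data shapes such as IBANs are defined by explicit check rules, which is what makes them amenable to rule-based generation and validation. As an illustrative sketch of the kind of rule involved (generic Python, not DATAMIMIC's own implementation), the ISO 7064 mod-97 check that every structurally valid IBAN must satisfy fits in a few lines:

```python
def iban_is_valid(iban: str) -> bool:
    """ISO 7064 mod-97 check: a structurally valid IBAN yields remainder 1."""
    s = iban.replace(" ", "").upper()
    # Move the country code and check digits to the end,
    # then map letters to numbers (A=10 ... Z=35).
    rearranged = s[4:] + s[:4]
    as_digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(as_digits) % 97 == 1

print(iban_is_valid("GB82 WEST 1234 5698 7654 32"))  # canonical valid example
print(iban_is_valid("GB82 WEST 1234 5698 7654 33"))  # one digit off: invalid
```

Because mod-97 detects every single-character substitution, a generator that emits only remainder-1 strings produces IBANs that downstream systems accept as structurally valid.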
## Two editions, shared DSL, different runtime
CE and EE share most of the DSL surface, so most models run on both. They diverge on runtime engineering, integration coverage, and operational depth.
| | CE (Community Edition) | EE (Enterprise Edition) |
|---|---|---|
| What it is | Python library and `datamimic` CLI | On-prem platform: Web UI, REST API, scheduler, workers (which encapsulate the EE Core engine), optional LSP |
| Processing core | Python | Python and Rust fast path, materially more optimised |
| Datasource scan | Skip/offset (`OFFSET / LIMIT`) SQL pagination executed by the database | Keyset + manifest scan synchronised across worker processes |
| Logging & error handling | Standard | Structured logs, advanced error catalog and recovery |
| Streaming / messaging | – | Apache Kafka source and target |
| ML | – | Auto-regressive ML (TabularARGN, Apache 2.0) |
| Triggered by | Local CLI / Python scripts | UI or CI via REST, dispatched as worker tasks |
| Deployment | `pip install datamimic-ce` | Docker, Podman, Kubernetes, OpenShift, or Helm chart |
| Governance | – | User/group management, audit trail on tasks and schedules, structured per-task logs, deterministic re-runs |
| CE-richer surface | MCP support, richer demographic profiles | – |
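The two scan strategies in the table differ in how each page is located: skip/offset makes the database re-count all previously skipped rows, while keyset pagination seeks directly past the last key seen. A minimal sketch of the contrast (illustrative only, using SQLite; EE's worker-synchronised manifest scan is a different, distributed implementation):

```python
import sqlite3

def offset_scan(conn, page_size=3):
    """Skip/offset pagination: the database re-skips prior rows on every page."""
    offset = 0
    while True:
        page = conn.execute(
            "SELECT id, name FROM users ORDER BY id LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not page:
            return
        yield from page
        offset += page_size

def keyset_scan(conn, page_size=3):
    """Keyset pagination: resume strictly after the last seen key via an index seek."""
    last_id = 0
    while True:
        page = conn.execute(
            "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not page:
            return
        yield from page
        last_id = page[-1][0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user_{i}") for i in range(1, 11)])
assert list(offset_scan(conn)) == list(keyset_scan(conn))
```

Both scans yield the same rows; the difference is cost. `OFFSET n` work grows with `n`, whereas the keyset query stays an O(page) index seek no matter how deep the scan is, which is why it holds up at multi-million-record scale.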
Pages tagged (CE+EE) apply to both editions. Pages tagged (CE) are CE-specific. Pages tagged (EE) require the Enterprise Edition.
## Key features
- Composable Models (CE+EE): Compose descriptors via `<include>`. Reusable fragments declare expected inputs with a root-level `<param>`; call sites pass per-instance values via `<property>` overlays, so one fragment serves many business variants without copy-paste. See the release notes for 3.3.0.
- Multi-Format Output (CE+EE): One model targets SQL databases, JSON, XML, CSV, and EDIFACT in the same run. Kafka source/target is EE-only.
- JSON and XML Handling (CE+EE): Generate and transform deeply nested structures with full control over hierarchy and types.
- Data Anonymisation and Pseudonymisation (CE+EE): Field-level masking configured in the model. The same configuration reproduces the same masked output across runs.
- MCP Support (CE): Model Context Protocol integration, available in the Community Edition for LLM-assisted workflows.
- Demographic Profiles (CE): CE ships richer demographic-profile coverage out of the box.
- Automatic PII Detection (EE): When generating a model from database metadata, fields likely to contain PII are auto-marked with `#SENSITIVE` so they are masked by default. Authors review and adjust before export.
- Privacy Alignment (CE+EE): Aligned with GDPR Art. 25 (privacy by design); supports BCBS 239 and DORA traceability requirements.
- ML-Based Generation, Auto-Regressive (EE): Auto-regressive ML generators (TabularARGN, Apache 2.0). Versioned and quality-graded; outputs are statistically consistent, not byte-identical (only seeded rule-based outputs are).
- Performance (EE): Python + Rust fast-path processing core, used in production to anonymise streaming payment transactions at multi-million-record scale. CE runs on a Python-only core and is materially less optimised.
- Advanced DataSource Scanning (EE): EE reads source databases via keyset pagination with a worker-synchronised manifest, avoiding the latency and consistency cost of `OFFSET / LIMIT` queries at scale. CE uses standard skip/offset pagination executed by the database.
- Advanced Logging and Error Handling (EE): Structured per-task logs, an explicit error catalog, and richer recovery semantics.
- Per-Run Audit Contracts (EE): Every task is logged with task ID, inputs, status, and structured per-task logs. With `rngSeed` set on `<setup>`, re-running the same model produces deterministic, identical output across machines and releases.
- Reproducible Test Data in CI/CD: CE: invoke `datamimic` from any CI runner. EE: trigger generation via the REST API; workers execute the EE Core engine and return artifacts on completion.
- In-Editor Authoring Support, LSP v2 (EE, opt-in): Context-aware completion in `script=` and `condition=` attributes, including visible variables, project script symbols, `.ent.csv` columns, and fragment-property inputs at `<include>` call sites. See Experimental LSP.
- Web UI and Reference Documentation (EE): Project, schedule, and task management in the browser; full reference for the model, generators, and API.
- Generator Library (CE+EE): Built-in generators covering domain, demographic, and protocol-specific data, extendable with custom Python code.
- Integration Surface: CE: Python API and `datamimic` CLI. EE: REST API, Helm chart, database connectors for PostgreSQL, Oracle, and MongoDB, streaming via Apache Kafka, and template exporters for industry message formats (SWIFT, PACS, HL7, EDIFACT).
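The pseudonymisation guarantee above, that the same configuration reproduces the same masked output across runs, is commonly achieved with keyed, deterministic hashing. A minimal sketch of that principle (generic Python; DATAMIMIC's actual masking is configured in the model, and the function name here is illustrative):

```python
import hashlib
import hmac

def pseudonymise(value: str, secret: bytes, width: int = 16) -> str:
    """Keyed, deterministic pseudonym: identical input and secret always
    yield the identical token, so joins across tables survive masking."""
    mac = hmac.new(secret, value.encode("utf-8"), hashlib.sha256)
    return mac.hexdigest()[:width]

secret = b"per-project-secret"
a = pseudonymise("alice@example.com", secret)
assert a == pseudonymise("alice@example.com", secret)   # stable across runs
assert a != pseudonymise("bob@example.com", secret)     # distinct inputs diverge
```

Using an HMAC rather than a plain hash matters: without the secret, an attacker cannot precompute a dictionary of pseudonyms for known PII values.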
## Requirements
- CE: Python 3.11
- EE: a modern web browser; the DATAMIMIC UI is available as SaaS or can be deployed on-premises (Docker, Podman, Kubernetes, OpenShift, Helm)
## Example UI test data generation (EE)
### Log in and check the demo store
- Click Clone on the 'Basic Script' tile to create your first project from the Demo Store.
### Generate it
- Start your first DATAMIMIC task by clicking 'GENERATE'.
- The status window shows the processing progress.
### Check it
- Click 'Previews' to see a preview of the data being created.
- Or click 'Logs' for detailed insights into the task, its processing speed, throughput, and more.
- Or navigate to 'Tasks' for an overview of all task executions in your project and their status.
Switch between the main views (Editor, Schedules, Tasks, Settings) from the project bar.
### Download it
- Navigate to 'Tasks'.
- Click the 'Artifact' icon and select the generated file(s) in the Artifact view for download.
## Example DATAMIMIC model
In DATAMIMIC, every project begins with an XML-based main model. Models can be auto-generated from connected databases, JSON, XML, or other file types. Here we start with a basic script to illustrate how a DATAMIMIC model works.
```xml
<setup>
    <generate name="datamimic_user_list" count="100" target="CSV,JSON,XML">
        <variable name="person" entity="Person(min_age=18, max_age=90, female_quota=0.5)"/>
        <key name="id" generator="IncrementGenerator"/>
        <key name="first_name" script="person.given_name"/>
        <key name="last_name" script="person.family_name"/>
        <key name="gender" script="person.gender"/>
        <key name="birthDate" script="person.birthdate" converter="DateFormat('%d.%m.%Y')"/>
        <key name="email" script="person.family_name + '.' + person.given_name + '@example.com'"/>
        <key name="ce_user" values="'True', 'False'"/>
        <key name="ee_user" values="'True', 'False'"/>
        <key name="datamimic_lover" constant="DEFINITELY"/>
    </generate>
</setup>
```
The above is a DATAMIMIC model for generating a dataset named "datamimic_user_list". The dataset contains 100 records, and its target formats are CSV, JSON, and XML. Let's break down the key components of this script:

- `<setup>`: This node encloses every DATAMIMIC model and can carry more advanced configuration.
- `<generate>`: This node defines the dataset to be generated, specifying its name, record count, and target formats.
- `<variable>`: Here we define a variable named "person" and associate it with the "Person" entity, configured to generate individuals aged between 18 and 90 with a 50% female quota.
- `<key>`: Each key element represents one attribute of a record. For example:
    - "id" is generated with the IncrementGenerator.
    - "first_name" takes the value `given_name` from the Person object stored in the `<variable>`.
    - "last_name" takes the value `family_name` from the Person object.
    - "gender" takes the value `gender` from the Person object.
    - "birthDate" takes the value `birthdate` from the Person object, formatted with the "DateFormat" converter.
    - "email" is built by a script that combines "family_name" and "given_name" into an email address.
    - "ce_user" and "ee_user" draw randomly from the predefined values "True" and "False".
    - "datamimic_lover" is set to the constant value "DEFINITELY".
This minimal example shows how a DATAMIMIC model defines data generation. Real projects typically split logic across multiple files and compose them with `<include>`; see the Tutorial for a worked composition example. Re-running the same seeded model produces byte-identical CSV, JSON, and XML output.
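The byte-identical claim is mechanically checkable: hash the emitted artifact and compare digests between runs. A minimal sketch of that seed-to-hash idea (generic Python with a toy generator, not DATAMIMIC's engine):

```python
import hashlib
import json
import random

def generate_users(seed: int, count: int) -> bytes:
    """Toy seeded rule-based generator: the seed fully determines the bytes."""
    rng = random.Random(seed)
    given = ["Ada", "Grace", "Alan", "Edsger"]
    family = ["Lovelace", "Hopper", "Turing", "Dijkstra"]
    rows = [{"id": i + 1,
             "first_name": rng.choice(given),
             "last_name": rng.choice(family)} for i in range(count)]
    # Stable serialisation (sorted keys) so equal data means equal bytes.
    return json.dumps(rows, sort_keys=True).encode("utf-8")

run_1 = hashlib.sha256(generate_users(seed=42, count=100)).hexdigest()
run_2 = hashlib.sha256(generate_users(seed=42, count=100)).hexdigest()
assert run_1 == run_2  # same seed, byte-identical artifact, identical digest
```

Note that determinism needs both a fixed seed and a stable serialisation; either one alone is not enough to make two environments agree byte for byte.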
For a more complete example including further nodes and features such as database connectivity, complex JSON modelling, and obfuscation scenarios, see the Tutorial - User Guide.
## Further Support
**For new customers**
See how DATAMIMIC fits your test-data and compliance workflow. Demos are run with our architects β bring a real schema or use case.
**For existing customers**
For advanced features or production questions, open a support ticket or book a dedicated session with our team.
**More**
Visit www.datamimic.io for services, use cases, and Enterprise Edition details.