Features¶

DATAMIMIC features¶

DATAMIMIC combines model-driven generation and deterministic seed-to-hash contracts in both editions. The Enterprise Edition adds auto-regressive ML, a Rust fast path, Kafka integration, advanced datasource scanning, structured logging, and richer error handling on top of the shared DSL surface. The capabilities below describe what each edition does and where they diverge.

Features are tagged by edition:

(CE) — Community Edition, the Python library and datamimic CLI. Includes some features not present in EE (MCP support, richer demographic profiles).
(EE) — Enterprise Edition, the on-prem platform (Web UI, REST API, scheduler, workers that encapsulate the EE Core engine, optional LSP). Adds Rust fast path, Kafka, advanced datasource scanning, structured logging.
(CE+EE) — DSL-level capability shared by both editions.

Model-Driven Approach (CE+EE)¶

DATAMIMIC's model-driven approach centralises test data definition:

Single source of definition: Data models are blueprints for generation. Schema, generator, and constraint changes are made once at the model level and propagate to every output target.
Abstraction over data shapes: Authors work with high-level entities and keys rather than format-specific structures, so the same model can drive SQL, JSON, XML, CSV, and EDIFACT output.
Adaptable: Changes in data formats, new data sources, or modified relationships are made inside the model. Downstream pipelines stay untouched.
Consistent and verifiable: Generation follows declared rules and constraints. Re-running a seeded rule-based model produces the same output across machines and releases.

Composable Models (CE+EE)¶

DATAMIMIC models are designed for composition, not as monolithic files:

<include> — split logic across files (configuration .properties, sub-model XML fragments) and assemble them in a main descriptor.
<param> declarations — reusable fragments declare expected inputs at the root level, making shared model components explicit.
<property> overlays under <include> — call sites pass per-instance values (constant or scripted) so the same fragment serves multiple business variants without copy-paste.

This is the common pattern in enterprise projects. See the 3.3.0 release notes for the include-property and fragment-param features.

Multi-Format Output (CE+EE)¶

One model targets SQL databases, JSON, XML, CSV, and EDIFACT in the same run via multiple <generate target="..."> declarations. Kafka source and target are EE-only.
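The idea of one definition feeding several output formats can be sketched in plain Python. This is a minimal illustration, not the DATAMIMIC DSL or its Python API: `MODEL`, `generate`, `to_json`, and `to_csv` are hypothetical names chosen for the example, and only two of the listed formats are shown.

```python
import csv
import io
import json
import random

# Hypothetical model: field name -> generator that takes a seeded RNG.
MODEL = {
    "id": lambda rng: rng.randint(1000, 9999),
    "country": lambda rng: rng.choice(["DE", "FR", "US"]),
    "amount": lambda rng: round(rng.uniform(1.0, 500.0), 2),
}


def generate(model, count, seed):
    """Produce `count` records from the model using a private seeded RNG."""
    rng = random.Random(seed)
    return [{name: gen(rng) for name, gen in model.items()} for _ in range(count)]


def to_json(records):
    """Render the records as a JSON array."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Render the same records as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


# One generation pass, two renderings of the same records.
records = generate(MODEL, count=3, seed=42)
json_out = to_json(records)
csv_out = to_csv(records)
```

Because both renderers consume the same record list, a change to `MODEL` reaches every format without touching the renderers.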

JSON and XML Handling (CE+EE)¶

Generate and transform deeply nested JSON and XML structures with full control over hierarchy, types, and arrays.

Nested Structure Generation: Define JSON objects and XML elements that contain other objects or elements, arrays, and mixed data types in a single structure.
Hierarchical Data Modelling: Represent complex relationships between data entities, useful for NoSQL databases and XML schemas.
Deep Nesting: Build deeply nested JSON objects and XML elements to mirror real-world interconnected data scenarios.
Customisation: Customise every aspect of the structure, from layout to how each value is generated and formatted.
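As a rough illustration of what a nested structure with objects, arrays, and mixed types looks like in both formats, the following Python sketch builds one record and serialises it to JSON and XML. It uses only the standard library and is not how DATAMIMIC models are written; `make_order` and `to_xml` are hypothetical helpers, and the `<item>` element name for array entries is an assumption of this example.

```python
import json
import random
import xml.etree.ElementTree as ET


def make_order(rng):
    """Build one nested record: an object holding another object and an array."""
    return {
        "order_id": rng.randint(1, 10_000),
        "customer": {
            "name": rng.choice(["Ada", "Grace", "Linus"]),
            "vip": rng.random() < 0.2,
        },
        "items": [
            {"sku": f"SKU-{rng.randint(100, 999)}", "qty": rng.randint(1, 5)}
            for _ in range(rng.randint(1, 3))
        ],
    }


def to_xml(tag, value):
    """Map dicts to child elements and lists to repeated <item> elements."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            elem.append(to_xml(key, child))
    elif isinstance(value, list):
        for child in value:
            elem.append(to_xml("item", child))
    else:
        elem.text = str(value)
    return elem


order = make_order(random.Random(7))
order_json = json.dumps(order, indent=2)
order_xml = ET.tostring(to_xml("order", order), encoding="unicode")
```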

Data Anonymisation and Pseudonymisation (CE+EE)¶

Field-level masking is configured in the model. The same seed and configuration reproduce the same masked output across runs and environments.
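One common way to get repeatable field-level pseudonyms is a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library. It is an illustration of the general technique only: it does not claim to be DATAMIMIC's masking algorithm, and `pseudonymise`, `mask_record`, and the demo secret (standing in for the seed and configuration) are assumptions of this example.

```python
import hashlib
import hmac


def pseudonymise(value: str, secret: bytes, length: int = 12) -> str:
    """Map a value to a stable token with HMAC-SHA256 keyed by `secret`."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def mask_record(record: dict, sensitive: set, secret: bytes) -> dict:
    """Return a copy of `record` with only the sensitive fields replaced."""
    return {
        key: pseudonymise(str(val), secret) if key in sensitive else val
        for key, val in record.items()
    }


# Demo secret only; a real key would come from protected configuration.
SECRET = b"demo-secret-not-for-production"
row = {"id": 1, "email": "ada@example.com", "country": "DE"}
masked = mask_record(row, {"email"}, SECRET)
```

With the same secret, the same input always maps to the same token, so joins across masked tables keep working. Changing the secret changes every token.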

Automatic PII Detection (EE)¶

When generating a DATAMIMIC model from database metadata, the platform automatically detects fields likely to contain Personally Identifiable Information (PII). These fields are marked with #SENSITIVE in the generated model so they are masked by default. Authors review and adjust before generating or exporting synthetic data.
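The source does not describe how detection works internally. As a purely hypothetical illustration of the simplest possible approach, the sketch below flags columns whose names match common PII patterns. The pattern list and `flag_sensitive` are invented for this example and are not the platform's detector.

```python
import re

# Hypothetical name patterns; a real detector would likely also inspect
# column types and sample values.
PII_PATTERNS = [
    r"e?mail",
    r"phone",
    r"(first|last|full)_?name",
    r"birth",
    r"iban",
    r"ssn",
    r"address",
]
PII_RE = re.compile("|".join(PII_PATTERNS), re.IGNORECASE)


def flag_sensitive(columns):
    """Return the column names that match any PII name pattern."""
    return [col for col in columns if PII_RE.search(col)]


columns = ["id", "first_name", "Email", "order_total", "birth_date", "iban"]
flagged = flag_sensitive(columns)
```

Name matching alone yields false positives and misses, which is one reason a review step before generating or exporting data matters.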

Privacy Alignment (CE+EE)¶

Aligned with GDPR Art. 25 (data protection by design and by default). Supports BCBS 239 lineage and DORA traceability requirements through per-run task IDs, model versions, and content hashes.

ML-Based Generation, Auto-Regressive (EE)¶

Auto-regressive ML generators based on TabularARGN (Apache 2.0). Models are versioned and quality-graded. ML outputs are statistically consistent rather than byte-identical — only seeded rule-based outputs are byte-identical. ML generation is an Enterprise Edition capability; the Community Edition covers rule-based generation only.

Per-Run Audit Contracts (EE)¶

Every worker task is logged with task ID, inputs, status, and structured per-task logs. With rngSeed set on <setup>, re-running the same rule-based model produces identical output across machines and releases; ML-based output is statistically consistent rather than byte-identical. This is the evidence layer for compliance and reproducibility reviews.
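The link between a seed and a verifiable content hash can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the EE audit mechanism: `run` and `content_hash` are hypothetical, the generator is a toy, and the choice of SHA-256 over canonical JSON is an assumption of this example.

```python
import hashlib
import json
import random


def run(seed: int, count: int = 5) -> list:
    """Toy rule-based generation driven entirely by a private seeded RNG."""
    rng = random.Random(seed)
    return [{"id": i, "score": rng.randint(0, 100)} for i in range(count)]


def content_hash(records: list) -> str:
    """Hash a canonical JSON encoding so equal records give equal digests."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = content_hash(run(seed=1234))
second = content_hash(run(seed=1234))
other = content_hash(run(seed=4321))
```

Here `first` and `second` are equal because both runs share a seed, so a reviewer can confirm a re-run by comparing digests instead of diffing full outputs.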

MCP Support (CE)¶

The Community Edition ships an integration with the Model Context Protocol (MCP) for LLM-assisted workflows over the DSL. EE does not currently include this surface.

Demographic Profiles (CE)¶

CE ships richer demographic-profile coverage out of the box. The profile catalog is shared at the DSL level, but CE currently includes more pre-built profiles than EE.

Performance (EE)¶

The Enterprise Edition runs a Python + Rust fast-path processing core and is used in production to anonymise streaming payment transactions at multi-million-record scale. The Community Edition runs on a Python-only core without the Rust fast path. It is appropriate for individual projects, scripts, and CI use, but is not designed for the same throughput targets.

Advanced DataSource Scanning (EE)¶

When reading from source databases, EE uses keyset pagination with a worker-synchronised manifest instead of OFFSET / LIMIT queries. Each page seeks directly past the last key read, so the database does not have to read and discard all skipped rows, and concurrent inserts or deletes do not shift rows between pages. This keeps multi-worker scans consistent and keeps page latency stable at scale and under load. CE uses standard skip/offset pagination executed by the database, which is fine for small datasets but does not scale the same way.
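The difference between the two pagination styles can be shown with an in-memory SQLite table. This sketch covers only the query pattern and does not reproduce the worker-synchronised manifest. The table `tx`, its columns, and both helper functions are hypothetical, and the keyset version assumes a positive, unique integer key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tx (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany(
    "INSERT INTO tx (id, amount) VALUES (?, ?)",
    [(i, i * 10) for i in range(1, 11)],
)


def offset_pages(conn, page_size):
    """Skip-and-offset: every page makes the database skip `offset` rows first."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT id, amount FROM tx ORDER BY id LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not rows:
            return
        yield rows
        offset += page_size


def keyset_pages(conn, page_size):
    """Keyset: seek past the last key seen, so each page is a short range scan."""
    last_id = 0  # assumes ids are positive
    while True:
        rows = conn.execute(
            "SELECT id, amount FROM tx WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]


offset_ids = [row[0] for page in offset_pages(conn, 4) for row in page]
keyset_ids = [row[0] for page in keyset_pages(conn, 4) for row in page]
```

On a static table both approaches return the same rows. The keyset version stays correct if rows before the cursor are deleted mid-scan, whereas an offset-based scan would skip rows in that case.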

Advanced Logging and Error Handling (EE)¶

Structured per-task logs, an explicit error catalog, and richer recovery semantics. CE provides standard Python logging.

In-Editor Authoring Support — LSP v2 (EE, opt-in)¶

Context-aware completion in script= and condition= attributes, including visible variables, project script symbols, .ent.csv columns, multi-hop entity members, and fragment-property inputs at <include> call sites. Enabled per project under Project Settings; see Experimental LSP.

Web UI (EE)¶

Project, schedule, and task management in the browser, including database view, demo store, previews, and per-task logs and artifacts.

Integration Surface¶

(CE) Python API and datamimic CLI for local generation and scripting.
(EE) REST API for CI/CD triggers (workers execute the EE Core engine and return artifacts); Helm chart for deployment; database connectors for PostgreSQL, Oracle, and MongoDB; streaming via Apache Kafka; template exporters for industry message formats including SWIFT, PACS, HL7, and EDIFACT.

Scales from Single Schema to Enterprise Platform¶

(CE) Suitable for individual projects, scripts, and CI use of the Python library.
(EE) Platform-wide test data management across teams; deployable on Docker, Podman, Kubernetes, OpenShift, or via Helm chart.

Generator Library (CE+EE)¶

Built-in generators for domain, demographic, and protocol-specific data. Extensible with custom Python code.