VALIDATION API
DATAMIMIC ships a Python validation API that lets you check whether a value or a structured record matches realistic, dataset-aware patterns for a given country. Validation is organized in three layers:
- Token-level: validate a single string (a name, a postal code, an IBAN, …).
- Entity-level: validate a structured record (person, address, company, contact) against its domain schema, including cross-field rules.
- Cross-entity coherence: check whether several already-validated entities agree with each other (e.g. a person's address country matches their phone prefix).
All validators are dataset-aware: the country code drives both the dictionaries that back exact matches (e.g. known German family names) and the structural rules (e.g. German postal-code format).
Token-Level Validation
Use the registry to obtain a validator for a domain (person, address,
contact, company, finance) and call validate() with the raw string.
get_validator(name, dataset="DE") accepts ValidatorName or its string value
(e.g. "person"). Unknown names raise UnknownValidatorError.
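The lookup behaviour described above (an enum member or its string value is accepted, unknown names raise) can be illustrated with a small stand-in registry. This is a sketch only and does not import DATAMIMIC: `ValidatorName`, `UnknownValidatorError` and `get_validator` are local stand-ins that reuse the documented names, while `EchoValidator`, `_REGISTRY` and the exception's base class are assumptions made for the example. How the real registry is implemented is not stated here.

```python
from enum import Enum


class ValidatorName(Enum):
    """Stand-in enum; members follow the domains listed in this section."""
    PERSON = "person"
    ADDRESS = "address"
    CONTACT = "contact"
    COMPANY = "company"
    FINANCE = "finance"


class UnknownValidatorError(Exception):
    """Stand-in exception; the real base class is not documented here."""


class EchoValidator:
    """Hypothetical placeholder that only remembers its configuration."""

    def __init__(self, name: ValidatorName, dataset: str) -> None:
        self.name = name
        self.dataset = dataset


# Hypothetical registry: every domain maps to the placeholder class.
_REGISTRY = {member: EchoValidator for member in ValidatorName}


def get_validator(name, dataset: str = "DE") -> EchoValidator:
    """Resolve an enum member or its string value to a validator instance."""
    try:
        # Enum lookup by value accepts "person" and ValidatorName.PERSON alike.
        key = ValidatorName(name)
    except ValueError:
        raise UnknownValidatorError(name) from None
    return _REGISTRY[key](key, dataset)


by_string = get_validator("person")
by_enum = get_validator(ValidatorName.PERSON, dataset="DE")
```

Both calls resolve to the same domain, so callers can pass whichever form is more convenient.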
Result shape
ValidationResult is a frozen dataclass with the following fields:
| Field | Type | Meaning |
|---|---|---|
| valid | bool | Whether the input passes the validator |
| confidence | float (0..1) | How strong the match is |
| match_kind | MatchKind | EXACT_DATASET, STRUCTURAL_PATTERN, HEURISTIC, NONE |
| code | ValidationCode | Machine-readable reason (e.g. DATASET_MATCH) |
| validator | ValidatorRef | (name, operation, dataset) triple |
| message | str \| None | Human-readable explanation |
| normalized_value | str \| None | Canonicalized input (trimmed, NFKC, …) |
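To make the result shape concrete, the following sketch models the fields from the table as a local frozen dataclass and applies a simple caller-side acceptance policy. It is not the library's class: the types of `code` and `validator` are simplified, the `None` defaults, the `accept` helper, the `0.8` threshold, the `family_name` operation name and the example values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, replace
from enum import Enum, auto
from typing import Optional, Tuple


class MatchKind(Enum):
    # Member names come from the table above; the values are arbitrary here.
    EXACT_DATASET = auto()
    STRUCTURAL_PATTERN = auto()
    HEURISTIC = auto()
    NONE = auto()


@dataclass(frozen=True)
class ValidationResult:
    """Local stand-in mirroring the documented fields (simplified types)."""
    valid: bool
    confidence: float                  # documented range: 0..1
    match_kind: MatchKind
    code: str                          # a ValidationCode enum in the library
    validator: Tuple[str, str, str]    # (name, operation, dataset)
    message: Optional[str] = None
    normalized_value: Optional[str] = None


def accept(result: ValidationResult, min_confidence: float = 0.8) -> bool:
    """Hypothetical policy: valid and at least min_confidence."""
    return result.valid and result.confidence >= min_confidence


# Hand-built example; the operation name and confidence value are made up.
result = ValidationResult(
    valid=True,
    confidence=0.95,
    match_kind=MatchKind.EXACT_DATASET,
    code="DATASET_MATCH",
    validator=("person", "family_name", "DE"),
    normalized_value="Müller",
)

# Frozen dataclasses cannot be mutated; derive a changed copy instead.
annotated = replace(result, message="matched a known family name")
```

Because the dataclass is frozen, results can be shared between threads or used as cache values without defensive copying.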
Listing available validators
The registry helper list_validator_names() enumerates the validators that are currently registered.
Entity-Level Validation
Entity validators accept either a plain dict[str, str] or a typed Pydantic
input model and return an EntityValidationResult containing per-field results
plus structural issues that span multiple fields.
The available entity validators are:
| Validator | Domain entity | Module |
|---|---|---|
| PersonEntityValidator | person | datamimic_ee.domains.validation.entities.person_entity_validator |
| AddressEntityValidator | address | datamimic_ee.domains.validation.entities.address_entity_validator |
| CompanyEntityValidator | company | datamimic_ee.domains.validation.entities.company_entity_validator |
| ContactEntityValidator | contact | datamimic_ee.domains.validation.entities.contact_entity_validator |
All four constructors take the same dataset: str = "DE" argument and expose a
validate(entity) method returning EntityValidationResult.
Result shape
EntityValidationResult (frozen dataclass):
| Field | Type | Meaning |
|---|---|---|
| entity_type | str | "person", "address", "company", "contact" |
| valid | bool | Overall validity (per-field validity + cross-field rules) |
| confidence | float | Aggregated confidence across field results (0 if not valid) |
| field_results | tuple[FieldValidationResult, ...] | Per-field outcome; same shape as token-level ValidationResult minus the validator ref |
| data | Mapping[str, str] | Normalized input data (after schema-level normalizers) |
| structural_issues | tuple[StructuralIssue, ...] | Cross-field problems (e.g. POSTAL_CODE_CITY_MISMATCH) |
StructuralIssue.code is a StructuralIssueCode enum:
MISSING_REQUIRED_FIELD, INVALID_FULL_NAME_STRUCTURE, UNKNOWN_HONORIFIC,
POSTAL_CODE_CITY_MISMATCH, MISSING_COMPANY_LEGAL_FORM, NO_CONTACT_FIELDS,
NO_FINANCE_FIELDS.
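The sketch below shows one way a caller might read such a result: every field passes on its own, yet a cross-field issue makes the entity invalid and the confidence 0. The dataclasses are local stand-ins built from the tables above, not the library's classes. The `field` attribute, the `summarize` helper, the dictionary keys and the hand-built example values are assumptions; only `StructuralIssue.code` and the enum member names come from this section.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Mapping, Tuple


class StructuralIssueCode(Enum):
    # Member names as listed above; the values are arbitrary here.
    MISSING_REQUIRED_FIELD = auto()
    INVALID_FULL_NAME_STRUCTURE = auto()
    UNKNOWN_HONORIFIC = auto()
    POSTAL_CODE_CITY_MISMATCH = auto()
    MISSING_COMPANY_LEGAL_FORM = auto()
    NO_CONTACT_FIELDS = auto()
    NO_FINANCE_FIELDS = auto()


@dataclass(frozen=True)
class StructuralIssue:
    code: StructuralIssueCode


@dataclass(frozen=True)
class FieldValidationResult:
    field: str          # hypothetical attribute name for this sketch
    valid: bool
    confidence: float


@dataclass(frozen=True)
class EntityValidationResult:
    entity_type: str
    valid: bool
    confidence: float
    field_results: Tuple[FieldValidationResult, ...]
    data: Mapping[str, str]
    structural_issues: Tuple[StructuralIssue, ...]


def summarize(result: EntityValidationResult) -> List[str]:
    """Hypothetical helper: list failing fields, then structural issue codes."""
    lines = [f"{fr.field}: invalid" for fr in result.field_results if not fr.valid]
    lines += [issue.code.name for issue in result.structural_issues]
    return lines


# Hand-built result: both fields pass individually, the combination does not.
result = EntityValidationResult(
    entity_type="address",
    valid=False,
    confidence=0.0,  # documented as 0 when the entity is not valid
    field_results=(
        FieldValidationResult("postal_code", True, 0.9),
        FieldValidationResult("city", True, 0.9),
    ),
    data={"postal_code": "10115", "city": "München"},
    structural_issues=(StructuralIssue(StructuralIssueCode.POSTAL_CODE_CITY_MISMATCH),),
)

report = summarize(result)
```

Checking `structural_issues` separately from `field_results` matters because, as here, an entity can fail without any single field failing.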
Cross-Entity Coherence
CoherenceValidator takes a mapping of entity_type → EntityValidationResult
and reports whether they corroborate or contradict each other. It only runs
rules that span at least two entity types.
CoherenceResult.signals is a tuple of CoherenceSignal records; each signal
carries a status (CONSISTENT, INCONSISTENT, INDETERMINATE), a confidence
score, the entities_involved, and an optional human-readable detail.
If any input entity is itself invalid, the corresponding rules emit
INDETERMINATE signals rather than evaluating against unreliable data.
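As an illustration of the signal semantics, here is a minimal sketch of a single coherence rule, loosely based on the address-country versus phone-prefix example mentioned earlier. It is not the library's rule engine: `EntityStub` is a reduced stand-in for an entity result, and the rule function, the `country` and `phone` keys, the prefix table and the confidence values are assumptions. Only the status names and the signal attributes follow the description above.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Mapping, Optional, Tuple


class CoherenceStatus(Enum):
    CONSISTENT = auto()
    INCONSISTENT = auto()
    INDETERMINATE = auto()


@dataclass(frozen=True)
class CoherenceSignal:
    status: CoherenceStatus
    confidence: float
    entities_involved: Tuple[str, ...]
    detail: Optional[str] = None


@dataclass(frozen=True)
class EntityStub:
    """Hypothetical, reduced stand-in for an entity validation result."""
    valid: bool
    data: Mapping[str, str]


# Hypothetical lookup table used only by this example rule.
_PREFIX_BY_COUNTRY = {"DE": "+49", "FR": "+33"}


def country_matches_phone_prefix(entities: Mapping[str, EntityStub]) -> CoherenceSignal:
    """Example rule spanning two entity types: address and contact."""
    involved = ("address", "contact")
    address, contact = entities["address"], entities["contact"]
    # Invalid inputs are not evaluated; the rule reports INDETERMINATE instead.
    if not (address.valid and contact.valid):
        return CoherenceSignal(
            CoherenceStatus.INDETERMINATE, 0.0, involved,
            "an input entity failed validation",
        )
    expected = _PREFIX_BY_COUNTRY.get(address.data["country"])
    if expected is None:
        return CoherenceSignal(
            CoherenceStatus.INDETERMINATE, 0.0, involved, "unknown country",
        )
    if contact.data["phone"].startswith(expected):
        return CoherenceSignal(CoherenceStatus.CONSISTENT, 1.0, involved)
    return CoherenceSignal(
        CoherenceStatus.INCONSISTENT, 1.0, involved,
        f"expected prefix {expected}",
    )


address_de = EntityStub(True, {"country": "DE"})
consistent = country_matches_phone_prefix(
    {"address": address_de, "contact": EntityStub(True, {"phone": "+49 30 1234567"})}
)
inconsistent = country_matches_phone_prefix(
    {"address": address_de, "contact": EntityStub(True, {"phone": "+33 1 23456789"})}
)
indeterminate = country_matches_phone_prefix(
    {"address": address_de, "contact": EntityStub(False, {"phone": "n/a"})}
)
```

Keeping INDETERMINATE distinct from INCONSISTENT lets callers tell "the data disagree" apart from "the data could not be compared".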
Warming Caches
For long-running services that want to pay validator-initialization cost up
front, call warm_validation_resources(). It instantiates every registered
validator (and pre-resolves entity rule bundles) so the first user request does
not incur the cold-start cost.
Omit the datasets= argument of warm_validation_resources() to warm all
registered validators with the default DE dataset.
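The idea behind warming can be sketched with a toy cache: construct every validator once at startup so that later lookups are served from memory. Everything in this block is hypothetical (`SlowValidator`, `get_cached`, `warm`, the cache keyed by name and dataset, and the `US` dataset code); it illustrates the general eager-initialization pattern under those assumptions and says nothing about how warm_validation_resources() is implemented internally.

```python
from typing import Dict, Iterable, Tuple


class SlowValidator:
    """Hypothetical validator whose construction stands in for costly setup."""
    constructed = 0  # counts how many instances were built

    def __init__(self, name: str, dataset: str) -> None:
        SlowValidator.constructed += 1
        self.name = name
        self.dataset = dataset


# Domain names as listed for the token-level registry in this document.
_NAMES = ("person", "address", "contact", "company", "finance")
_CACHE: Dict[Tuple[str, str], SlowValidator] = {}


def get_cached(name: str, dataset: str = "DE") -> SlowValidator:
    """Build a validator on first use, then reuse the cached instance."""
    key = (name, dataset)
    if key not in _CACHE:
        _CACHE[key] = SlowValidator(name, dataset)
    return _CACHE[key]


def warm(datasets: Iterable[str] = ("DE",)) -> int:
    """Eagerly build every validator for each dataset; return the cache size."""
    for dataset in datasets:
        for name in _NAMES:
            get_cached(name, dataset)
    return len(_CACHE)


# At service startup: pay the construction cost for two datasets up front.
warmed = warm(datasets=("DE", "US"))
built_at_startup = SlowValidator.constructed

# A later request hits the cache and constructs nothing new.
validator = get_cached("person", "US")
```

In a web service this kind of call would typically sit in a startup hook, so the cost is paid before the first request arrives.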
Public Surface
The stable import surface is re-exported from
datamimic_ee.domains.validation.
Registry helpers live in
datamimic_ee.domains.validation.validation_registry
(get_validator, list_validator_names, register_validator,
warm_validation_resources, UnknownValidatorError).
The cross-entity coherence layer is exposed from
datamimic_ee.domains.domain_core.coherence
(CoherenceValidator, CoherenceResult, CoherenceSignal,
CoherenceStatus, CoherenceRuleCode).