VALIDATION API

DATAMIMIC ships a Python validation API that lets you check whether a value or a structured record matches realistic, dataset-aware patterns for a given country. Validation is organized in three layers:

  • Token-level — validate a single string (a name, a postal code, an IBAN, …).
  • Entity-level — validate a structured record (person, address, company, contact) against its domain schema, including cross-field rules.
  • Cross-entity coherence — check whether several already-validated entities agree with each other (e.g. a person's address country matches their phone prefix).

All validators are dataset-aware: the country code drives both the dictionaries that back exact matches (e.g. known German family names) and the structural rules (e.g. German postal-code format).

Token-Level Validation

Use the registry to obtain a validator for a domain (person, address, contact, company, finance) and call validate() with the raw string.

from datamimic_ee.domains.validation import ValidatorName
from datamimic_ee.domains.validation.validation_registry import get_validator

person_validator = get_validator(ValidatorName.PERSON, dataset="DE")
result = person_validator.validate("Müller")

result.valid              # True
result.confidence         # 0.7
result.match_kind         # MatchKind.EXACT_DATASET
result.code               # ValidationCode.DATASET_MATCH
result.normalized_value   # "Müller"
result.validator          # ValidatorRef: (name, operation, dataset) triple

get_validator(name, dataset="DE") accepts ValidatorName or its string value (e.g. "person"). Unknown names raise UnknownValidatorError.
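
The "enum member or its string value" behavior can be pictured with a small self-contained sketch. It does not import DATAMIMIC; `ValidatorName` and `UnknownValidatorError` below are local stand-ins, `resolve_name` is a hypothetical helper, and every string value other than "person" is an assumption, as is the exception's base class.

```python
from enum import Enum


class ValidatorName(Enum):
    """Local stand-in; only the value "person" is confirmed by the text above."""
    PERSON = "person"
    ADDRESS = "address"   # assumed value
    CONTACT = "contact"   # assumed value
    COMPANY = "company"   # assumed value
    FINANCE = "finance"   # assumed value


class UnknownValidatorError(KeyError):
    """Local stand-in; the real exception's base class is an assumption."""


def resolve_name(name):
    """Hypothetical helper: accept an enum member or its string value."""
    if isinstance(name, ValidatorName):
        return name
    try:
        return ValidatorName(name)  # Enum lookup by value
    except ValueError as exc:
        raise UnknownValidatorError(name) from exc


print(resolve_name("person") is ValidatorName.PERSON)  # True

try:
    resolve_name("vehicle")
except UnknownValidatorError as err:
    print("unknown validator:", err)
```

In application code, catching the error at the point where a validator name comes from configuration or user input keeps a typo from surfacing later as an unrelated failure.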

Result shape

ValidationResult is a frozen dataclass with the following fields:

| Field | Type | Meaning |
| --- | --- | --- |
| valid | bool | Whether the input passes the validator |
| confidence | float (0..1) | How strong the match is |
| match_kind | MatchKind | EXACT_DATASET, STRUCTURAL_PATTERN, HEURISTIC, NONE |
| code | ValidationCode | Machine-readable reason (e.g. DATASET_MATCH) |
| validator | ValidatorRef | (name, operation, dataset) triple |
| message | str \| None | Human-readable explanation |
| normalized_value | str \| None | Canonicalized input (trimmed, NFKC, …) |
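
To make two points from the table concrete, here is a minimal sketch that does not import DATAMIMIC. `ResultSketch` is a cut-down local stand-in for a frozen result dataclass, the `MatchKind` string values are assumptions, and `normalize` and `is_strong_match` are hypothetical helpers. The normalization shown (trim plus Unicode NFKC) is one plausible reading of "trimmed, NFKC", not the library's actual pipeline.

```python
import unicodedata
from dataclasses import FrozenInstanceError, dataclass
from enum import Enum
from typing import Optional


class MatchKind(Enum):
    """Local stand-in; member names come from the table, values are assumed."""
    EXACT_DATASET = "exact_dataset"
    STRUCTURAL_PATTERN = "structural_pattern"
    HEURISTIC = "heuristic"
    NONE = "none"


@dataclass(frozen=True)
class ResultSketch:
    """Cut-down stand-in carrying a subset of the documented fields."""
    valid: bool
    confidence: float
    match_kind: MatchKind
    normalized_value: Optional[str] = None


def normalize(raw: str) -> str:
    """Trim surrounding whitespace, then apply Unicode NFKC normalization."""
    return unicodedata.normalize("NFKC", raw.strip())


def is_strong_match(result: ResultSketch, min_confidence: float = 0.6) -> bool:
    """Hypothetical policy: accept only dictionary-backed matches above a threshold."""
    return (
        result.valid
        and result.match_kind is MatchKind.EXACT_DATASET
        and result.confidence >= min_confidence
    )


# "u" followed by a combining diaeresis is composed into a single "ü" by NFKC.
value = normalize("  Mu\u0308ller ")
print(value)  # Müller

result = ResultSketch(True, 0.7, MatchKind.EXACT_DATASET, value)
print(is_strong_match(result))  # True

try:
    result.valid = False  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    print("results are immutable")
```

Because the result is frozen, callers that need an adjusted copy would use `dataclasses.replace` rather than mutating the instance.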

Listing available validators

from datamimic_ee.domains.validation.validation_registry import list_validator_names

list_validator_names()
# [ValidatorName.ADDRESS, ValidatorName.COMPANY, ValidatorName.CONTACT,
#  ValidatorName.FINANCE, ValidatorName.PERSON]

Entity-Level Validation

Entity validators accept either a plain dict[str, str] or a typed Pydantic input model and return an EntityValidationResult containing per-field results plus structural issues that span multiple fields.

from datamimic_ee.domains.validation.entities.person_entity_validator import (
    PersonEntityValidator,
)

validator = PersonEntityValidator(dataset="DE")
result = validator.validate({
    "given_name": "Anna",
    "family_name": "Müller",
    "honorific": "Dr.",
})

result.valid                       # True
result.confidence                  # 0.72
for fr in result.field_results:
    print(fr.field_name, fr.valid, fr.confidence, fr.code.value)
result.structural_issues           # ()  — empty when no structural problems

The available entity validators are:

| Validator | Domain entity | Module |
| --- | --- | --- |
| PersonEntityValidator | person | datamimic_ee.domains.validation.entities.person_entity_validator |
| AddressEntityValidator | address | datamimic_ee.domains.validation.entities.address_entity_validator |
| CompanyEntityValidator | company | datamimic_ee.domains.validation.entities.company_entity_validator |
| ContactEntityValidator | contact | datamimic_ee.domains.validation.entities.contact_entity_validator |

All four constructors take the same dataset: str = "DE" argument and expose a validate(entity) method returning EntityValidationResult.

Result shape

EntityValidationResult (frozen dataclass):

| Field | Type | Meaning |
| --- | --- | --- |
| entity_type | str | "person", "address", "company", "contact" |
| valid | bool | Overall validity (per-field validity + cross-field rules) |
| confidence | float | Aggregated confidence across field results (0 if not valid) |
| field_results | tuple[FieldValidationResult, …] | Per-field outcome; same shape as token-level ValidationResult minus the validator ref |
| data | Mapping[str, str] | Normalized input data (after schema-level normalizers) |
| structural_issues | tuple[StructuralIssue, …] | Cross-field problems (e.g. POSTAL_CODE_CITY_MISMATCH) |

StructuralIssue.code is a StructuralIssueCode enum: MISSING_REQUIRED_FIELD, INVALID_FULL_NAME_STRUCTURE, UNKNOWN_HONORIFIC, POSTAL_CODE_CITY_MISMATCH, MISSING_COMPANY_LEGAL_FORM, NO_CONTACT_FIELDS, NO_FINANCE_FIELDS.
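
One way to consume an entity result is to flatten failed fields and structural issues into a single list of messages. The sketch below is self-contained and does not import DATAMIMIC: the dataclasses are local stand-ins that carry only the attributes named in this section, `FieldCode.NO_MATCH` and all enum string values are invented for illustration, and `summarize_problems` is a hypothetical helper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class FieldCode(Enum):
    """Stand-in for ValidationCode; NO_MATCH is a hypothetical member."""
    DATASET_MATCH = "dataset_match"
    NO_MATCH = "no_match"


class StructuralIssueCode(Enum):
    """Stand-in with two of the documented members; values are assumed."""
    MISSING_REQUIRED_FIELD = "missing_required_field"
    UNKNOWN_HONORIFIC = "unknown_honorific"


@dataclass(frozen=True)
class FieldSketch:
    field_name: str
    valid: bool
    code: FieldCode


@dataclass(frozen=True)
class IssueSketch:
    code: StructuralIssueCode


@dataclass(frozen=True)
class EntitySketch:
    valid: bool
    field_results: Tuple[FieldSketch, ...]
    structural_issues: Tuple[IssueSketch, ...] = ()


def summarize_problems(result) -> list:
    """Collect one message per failed field, then one per structural issue."""
    problems = [
        f"field {fr.field_name}: {fr.code.value}"
        for fr in result.field_results
        if not fr.valid
    ]
    problems += [f"structure: {issue.code.name}" for issue in result.structural_issues]
    return problems


result = EntitySketch(
    valid=False,
    field_results=(
        FieldSketch("given_name", True, FieldCode.DATASET_MATCH),
        FieldSketch("family_name", False, FieldCode.NO_MATCH),
    ),
    structural_issues=(IssueSketch(StructuralIssueCode.UNKNOWN_HONORIFIC),),
)
for line in summarize_problems(result):
    print(line)
# field family_name: no_match
# structure: UNKNOWN_HONORIFIC
```

The helper relies only on attribute access, so the same shape of function should carry over to real results as long as they expose `field_results` and `structural_issues` as described above.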

Cross-Entity Coherence

CoherenceValidator takes a mapping of entity_type → EntityValidationResult and reports whether they corroborate or contradict each other. It only runs rules that span at least two entity types.

from datamimic_ee.domains.domain_core.coherence import CoherenceValidator
from datamimic_ee.domains.validation.entities.person_entity_validator import (
    PersonEntityValidator,
)
from datamimic_ee.domains.validation.entities.address_entity_validator import (
    AddressEntityValidator,
)

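# person_payload and address_payload are field-name → value dicts,
# like the person dict in the entity-level example above.
person_result = PersonEntityValidator(dataset="DE").validate(person_payload)
address_result = AddressEntityValidator(dataset="DE").validate(address_payload)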

coherence = CoherenceValidator(dataset="DE")
report = coherence.validate({
    "person": person_result,
    "address": address_result,
})

report.overall_consistent   # True
report.confidence           # 0.85
for signal in report.signals:
    print(signal.code.value, signal.status.value, signal.detail)

CoherenceResult.signals is a tuple of CoherenceSignal records; each signal carries a status (CONSISTENT, INCONSISTENT, INDETERMINATE), a confidence score, the entities_involved, and an optional human-readable detail.

If any input entity is itself invalid, the corresponding rules emit INDETERMINATE signals rather than evaluating against unreliable data.
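
The gating described above can be sketched as a toy rule. Nothing here comes from the DATAMIMIC code base: `CoherenceStatus` is a local stand-in with assumed values, `EntitySketch` keeps only `valid` and `data`, the `country` field name is hypothetical, and `country_agreement` is an invented rule used purely to show when each status would be produced.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Mapping


class CoherenceStatus(Enum):
    """Local stand-in; member names come from the text, values are assumed."""
    CONSISTENT = "consistent"
    INCONSISTENT = "inconsistent"
    INDETERMINATE = "indeterminate"


@dataclass(frozen=True)
class EntitySketch:
    """Cut-down stand-in for an entity result: validity plus normalized data."""
    valid: bool
    data: Mapping[str, str] = field(default_factory=dict)


def country_agreement(person: EntitySketch, address: EntitySketch) -> CoherenceStatus:
    """Toy cross-entity rule comparing a hypothetical 'country' field."""
    # Invalid inputs are not trusted, so the rule declines to judge them.
    if not (person.valid and address.valid):
        return CoherenceStatus.INDETERMINATE
    person_country = person.data.get("country")
    address_country = address.data.get("country")
    # Missing data gives nothing to compare either.
    if person_country is None or address_country is None:
        return CoherenceStatus.INDETERMINATE
    if person_country == address_country:
        return CoherenceStatus.CONSISTENT
    return CoherenceStatus.INCONSISTENT


german = EntitySketch(True, {"country": "DE"})
french = EntitySketch(True, {"country": "FR"})
broken = EntitySketch(False, {"country": "DE"})

print(country_agreement(german, german).value)  # consistent
print(country_agreement(german, french).value)  # inconsistent
print(country_agreement(broken, german).value)  # indeterminate
```

Treating INDETERMINATE as a third outcome, rather than folding it into INCONSISTENT, keeps one bad record from being reported as a contradiction between entities.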

Warming Caches

For long-running services that want to pay validator-initialization cost up front, call warm_validation_resources(). It instantiates every registered validator (and pre-resolves entity rule bundles) so the first user request does not incur the cold-start cost.

from datamimic_ee.domains.validation import ValidatorName
from datamimic_ee.domains.validation.validation_registry import (
    warm_validation_resources,
)

warm_validation_resources(
    datasets={ValidatorName.PERSON: "DE", ValidatorName.ADDRESS: "DE"},
    include_rule_bundles=True,
)

Omit datasets= to warm all registered validators with the default DE dataset.
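
The general idea behind warming can be illustrated with a generic cache, independent of DATAMIMIC. In this sketch `load_validator` is a hypothetical stand-in for an expensive construction step and `warm` is an invented helper; how the library actually caches its validators is not described here.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def load_validator(name: str, dataset: str) -> dict:
    """Hypothetical stand-in for an expensive validator construction."""
    # A real implementation might read dictionaries from disk at this point.
    return {"name": name, "dataset": dataset}


def warm(datasets: dict) -> None:
    """Build every listed validator once so later lookups hit the cache."""
    for name, dataset in datasets.items():
        load_validator(name, dataset)


# At service startup: pay the construction cost before traffic arrives.
warm({"person": "DE", "address": "DE"})

# During a request: the same arguments are served from the cache.
load_validator("person", "DE")

info = load_validator.cache_info()
print(info.misses, info.hits)  # 2 1
```

The two misses are the constructions performed during warm-up; the single hit is the request-time lookup that no longer pays the cold-start cost.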

Public Surface

The stable import surface is re-exported from datamimic_ee.domains.validation:

from datamimic_ee.domains.validation import (
    DomainSchema,
    EntityValidationResult,
    FieldDescriptor,
    FieldValidationResult,
    MatchKind,
    NameClassification,
    NameTokenKind,
    ValidationResult,
    ValidatorName,
)

Registry helpers live in datamimic_ee.domains.validation.validation_registry (get_validator, list_validator_names, register_validator, warm_validation_resources, UnknownValidatorError).

The cross-entity coherence layer is exposed from datamimic_ee.domains.domain_core.coherence (CoherenceValidator, CoherenceResult, CoherenceSignal, CoherenceStatus, CoherenceRuleCode).