The data-provenance layer your AI program skipped

IronParse · field notes

Walk into any mature AI governance program today and you'll find provenance treated as table stakes. Where did this dataset come from? Who touched it? What transformations did it pass through before a model trained on it? Lineage tooling, data catalogs, and model cards all exist to answer those questions, and regulators increasingly expect the answers to be on file. It's a real and welcome shift. But there is one boundary where the entire apparatus goes dark, and it happens to be the boundary where your most consequential data lives: the legacy hop. The moment data leaves a forty-year-old COBOL system on its way to a feature store or a training set, provenance stops being tracked and starts being assumed.

Lineage is not fidelity

The distinction worth getting precise about is the one almost no program draws cleanly. Lineage tells you where data came from and what route it took. Fidelity tells you whether the data survived that route intact. They are not the same property, and proving the first says nothing about the second. You can have flawless lineage — a fully documented path from mainframe to model — wrapped around data that was silently mangled in transit. Lineage records the journey; fidelity records whether the cargo arrived whole.

Every AI governance program tracks where its data came from. Almost none can prove it arrived intact.

This gap is invisible precisely because lineage tooling is so good now. A pipeline that faithfully logs every step it ran will happily log the step that misread a packed-decimal (COMP-3) field or flattened a twelve-element OCCURS array, and report success. The provenance graph is complete and the data underneath it is wrong. Most programs instrument the first property and quietly take the second on faith.
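To make the failure mode concrete, here is a minimal Python sketch of a packed-decimal misread. It assumes a four-byte field declared as PIC S9(5)V99 COMP-3; the function names and the sample bytes are illustrative, not taken from any particular pipeline. The point is that the wrong read raises no error, so a lineage log has nothing to flag.

```python
from decimal import Decimal


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode an IBM packed-decimal (COMP-3) field.

    Each byte holds two 4-bit digits; the last nibble is the sign
    (0xB or 0xD means negative, the other values A-F mean positive).
    """
    if not raw:
        raise ValueError("empty field")
    digits = raw.hex()                    # one hex character per nibble
    body, sign = digits[:-1], digits[-1]
    if not body.isdigit() or sign not in "abcdef":
        raise ValueError(f"not a valid packed-decimal field: {digits}")
    value = int(body)
    if sign in "bd":
        value = -value
    return Decimal(value).scaleb(-scale)  # apply the implied decimal point


def naive_text_read(raw: bytes) -> str:
    """What a careless pipeline step does: treat the bytes as text.

    latin-1 maps every byte to a character, so this never raises.
    """
    return raw.decode("latin-1")


# Hypothetical field: PIC S9(5)V99 COMP-3 holding 123.45
field = bytes.fromhex("0012345c")

correct = unpack_comp3(field, scale=2)    # the value the mainframe meant
misread = naive_text_read(field)          # a 4-character string, no error
```

Both calls return normally. Only the first produces the amount the legacy system stored, and nothing in the second tells the pipeline it went wrong.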

Why model-risk frameworks should demand a fidelity attestation

Model-risk management already insists on documented data sources for anything feeding a production model. The natural and overdue extension is to require, for any source data that crossed a migration, an explicit fidelity attestation — an independent assertion that the migrated records match their legacy originals, field for field. Not "the pipeline ran." Not "the schema validated." A concrete, checkable statement that nothing was dropped, truncated, or silently retyped on the way out of the legacy system.

The reason this belongs in the framework rather than in a runbook is leverage. A corrupted field on a mainframe is one bad record. The same defect feeding a model becomes a bad feature, a shifted decision boundary, and an output — a price, a reserve, an eligibility call — that no auditor can trace back to a conversion script that ran two quarters ago. Catching it at the boundary is cheap; catching it after the model has been making decisions on it is not.

The parity receipt as a drop-in provenance artifact

This is exactly the artifact the parity receipt is built to produce. When legacy records are converted, an independent check asserts that every field survived: the structure parses, the field count matches the source, every PICTURE clause decodes to a concrete type, a decode/re-encode round-trip is byte-identical, and the emitted schema accepts the record. Pass all of it and you can sign a receipt that says, in machine-readable terms, that this migrated data is faithful to the original.
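As a rough illustration of what such a check involves, the Python sketch below runs four of those assertions against one fixed-width record. Everything in it is an assumption made for the example: the Field layout type, the check names, and the two-field record are hypothetical, and this is not IronParse's implementation. Schema validation is left out to keep it short.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Field:
    name: str
    length: int       # bytes occupied in the legacy record
    kind: str         # "text" (EBCDIC, space padded) or "comp3"
    scale: int = 0    # implied decimal places for comp3


def decode_field(f: Field, raw: bytes):
    if f.kind == "text":
        return raw.decode("cp037").rstrip(" ")
    if f.kind == "comp3":
        digits = raw.hex()
        body, sign = digits[:-1], digits[-1]
        if not body.isdigit() or sign not in "cdf":
            raise ValueError(f"{f.name}: bad packed decimal {digits}")
        value = int(body)
        return Decimal(-value if sign == "d" else value).scaleb(-f.scale)
    raise ValueError(f"{f.name}: unknown kind {f.kind}")


def encode_field(f: Field, value) -> bytes:
    if f.kind == "text":
        return value.ljust(f.length).encode("cp037")
    unscaled = int(value.scaleb(f.scale))
    body = str(abs(unscaled)).rjust(f.length * 2 - 1, "0")
    return bytes.fromhex(body + ("d" if unscaled < 0 else "c"))


def check_record(layout: list[Field], raw: bytes) -> dict[str, bool]:
    """Run the checks in order; stop at the first one that cannot proceed."""
    results = {"structure_parses": len(raw) == sum(f.length for f in layout)}
    if not results["structure_parses"]:
        return results
    decoded, offset = {}, 0
    try:
        for f in layout:
            decoded[f.name] = decode_field(f, raw[offset:offset + f.length])
            offset += f.length
    except ValueError:
        results["every_field_decodes"] = False
        return results
    results["every_field_decodes"] = True
    results["field_count_matches"] = len(decoded) == len(layout)
    reencoded = b"".join(encode_field(f, decoded[f.name]) for f in layout)
    results["round_trip_identical"] = reencoded == raw
    return results


# Hypothetical two-field layout and records
layout = [Field("customer", 8, "text"), Field("balance", 4, "comp3", scale=2)]
good = "ACME".ljust(8).encode("cp037") + bytes.fromhex("0012345c")
lossy = "ACME".ljust(8).encode("cp037") + bytes.fromhex("0012345f")

good_result = check_record(layout, good)
lossy_result = check_record(layout, lossy)
```

The second record decodes to the same value as the first, but its unsigned 0xF sign nibble is not preserved by this decoder, so the round-trip comparison fails. That is the kind of silent normalization a byte-identical round-trip is meant to surface.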

What makes it a clean fit for an AI program is its shape. It's signed, so it's tamper-evident and attributable. It's reproducible, so anyone can re-run the check and get the same result rather than trusting a one-time report. It's complete, covering every field rather than a sampled subset. And it's generated in-perimeter — the records never have to leave your environment to be verified; only the receipt does. That last property matters more than it looks: it means you can attach a provenance-and-fidelity artifact to the legacy-to-AI hop without ever exposing regulated source data to a third party. The receipt is the only thing that travels.
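The properties above can be sketched in a few lines of Python. This is an assumption-laden illustration, not the actual receipt format: the field names, the build_receipt and verify_receipt helpers, and the use of HMAC-SHA256 are all choices made for the example. HMAC is a shared-key stand-in here; a receipt meant to be attributable to one signer would more plausibly use an asymmetric signature.

```python
import hashlib
import hmac
import json


def _canonical(body: dict) -> bytes:
    # Sorted keys and fixed separators so the same body always
    # serializes to the same bytes.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_receipt(records: list[bytes], checks: dict[str, bool], key: bytes) -> dict:
    """Summarize a verified batch without including any record content."""
    digest = hashlib.sha256()
    for rec in records:
        # Length-prefix each record so record boundaries affect the hash.
        digest.update(len(rec).to_bytes(4, "big"))
        digest.update(rec)
    body = {
        "record_count": len(records),
        "source_sha256": digest.hexdigest(),
        "checks": checks,
    }
    signature = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return {"body": body, "signature": signature}


def verify_receipt(receipt: dict, key: bytes) -> bool:
    expected = hmac.new(key, _canonical(receipt["body"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, receipt["signature"])


# Hypothetical batch and key, for illustration only
key = b"demo-key-not-for-production"
records = [b"record-one", b"record-two"]
checks = {"round_trip_identical": True, "field_count_matches": True}

receipt = build_receipt(records, checks, key)
```

The receipt carries a hash and the check outcomes, never the records themselves, and it deliberately contains no timestamp, so re-running the check on the same input yields the same receipt.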

Slot it in and the boundary stops being a blind spot. The same governance process that already tracks lineage gains a fidelity record for the one hop it couldn't see, and your risk team, your auditors, and your model-governance committee all get something concrete to point at instead of an assumption.

Prove what you fed it

The fidelity question doesn't go away when you ignore it; it just moves downstream, where it's expensive and untraceable. The organizations moving fastest into AI aren't the ones who skipped this step. They're the ones who can prove what they fed the model, which is exactly what lets them move without flinching. For more on how the same gap plays out at the migration layer, see "feeding AI the mainframe".

Close the gap at the legacy boundary

IronParse generates a signed, reproducible fidelity receipt for migrated source data — the provenance layer your AI program skipped.
