On Messy Data: Auxil Field Notes

The pitch is reassuring. Your data is not as bad as you think; the cleanup is mostly a presentation problem; AI can do the cleanup itself; the AI project can start now rather than after the year of remediation everyone was dreading. The exec hearing it nods. The data team does not.

The split being sold is structure versus quality. Most of the work, the pitch says, is structure, the part the speaker happens to sell. It is a tidy split. It collapses on contact with a real customer table. Anyone who has opened one knows this in the first ten minutes.

Where the distinction breaks

A field where someone typed asdfjk; in 2014 to bypass a required input has no structure problem. The cell is a well-formed string in a column of the correct name. The content is junk. No amount of remapping recovers what was never there.

A column of addresses from 2012 has no structure problem. Schema documented, values well-formed, foreign keys resolve. The people have moved. The data is structurally pristine and operationally wrong; the fix is information the data does not contain.

Two systems disagreeing on whether a customer cancelled on June 1st or June 15th have no structure problem. Each side is internally consistent; they contradict. Reconciling needs a human who knows which system is the source of truth, or an executive decision to make one so.

“Structure” frames every cell as the right answer in the wrong outfit. Real data is partly wrong, partly missing, partly stale, partly contradicted, partly meaningless. A translator does not help.

What AI does fix, and how much

Real tools do real work. LlamaSheets uses vision-capable models to parse spreadsheets whose layout defeats ordinary code: merged cells, multi-level headers, colour-coded meaning, three tables on one worksheet. It produces clean tabular output, and it is hard to do without. Call it 10 to 15 percent of the problem: ingestion, getting a messy artifact into a form you can reason about. Once the form is clean, every quality problem above is still in front of you. A beautifully reconstructed schema over ten thousand rows of junk is junk, neatly arranged.

The broader toolkit

Cleanup is not one tool. It is a stack, where each layer addresses a different kind of bad.

Layer	Examples	What it does
Layout normalisation	LlamaSheets, custom parsers	Gets non-tabular sources into tabular form
Interactive wrangling	OpenRefine, Trifacta	Lets a human cluster typos and bulk-apply transforms across millions of rows
Entity resolution	RecordLinkage, splink	Catches fuzzy duplicates by distance metrics
Quality observability	Great Expectations, Soda, Monte Carlo	Asserts against pipelines; catches regressions when input shifts
Reference data	Address validation APIs, business directories	Brings external truth in to correct stale fields
Logs and lineage	Custom audit tables, OpenLineage	Records who changed what, when, why, with which authority

A team with Great Expectations and no entity resolution catches pipeline regressions and ships duplicates. A team with entity resolution and no reference data merges duplicates into a single rotten record. Describing the stack is easy. Operating it takes engineers who know which tool to reach for when badness shows up, in a deployment that lets them reach for it without rebuilding the pipeline.

Why hand-rolled heuristics give out

Many teams start with a script. Pad short zip codes. Normalise phone numbers. Title-case names. Six months later the script has four hundred rules, rule 400 breaks rule 40, the team is afraid to edit it, and someone is maintaining a wiki page of which rules to comment out for which import.

This is not a discipline failure but a structural one. People invent new ways to break inputs faster than rules can be written to catch them. Rules are also blind to context: special characters are invalid in one product line and valid serial numbers in another. Heuristics work for small, stable inputs; they do not scale to the long tail of how humans corrupt data over decades.

Human in the loop, in practice

“Human in the loop” is a slogan. The thing it names, when it is built, is an exception queue. High-confidence records flow through; low-confidence, contradictory, or anomalous ones divert to a queue. A reviewer opens an item, sees the record, the suggested action, and the reason it was flagged; accepts, rejects, or overrides. The queue is the loop.

Behind it are four unglamorous requirements:

A confidence threshold calibrated against the cost of being wrong either way.
A queue that does not lose items when the worker restarts.
A review UI showing enough context to decide without a research project per item.
A path back into the pipeline so the decision sticks.

Each is a real piece of code someone has to maintain.

The audit trail

For any consequential change, the audit trail is not optional. Every change is a ledger entry: timestamp, original value, new value, the agent (which model, what confidence; or which human, what authorisation), the action taken, the reason. Without this ledger, you cannot reverse a mistake, you cannot explain a decision to a regulator, and you cannot improve the automated path because you do not know which decisions humans overrode. The trail is the only thing between a cleanup project and an irreversible rewrite of the company’s history.

Compliance gives the easy argument for it. SOC 2, and most frameworks like it, will ask who changed what and on what authority, and the ledger is how you answer. But the compliance case understates it. If you are serious about the data, you need the audit trail whether or not anyone audits you, because a cleanup that mutates millions of records without a way to ask “why is this value what it is” is not a cleanup, it is a second corruption you cannot trace. The certificate is a side effect of doing the thing you needed to do anyway.

What no one can fix

Some data is gone. A column corrupted in a 2018 outage with no recoverable backup is gone. AI can impute a plausible value, but plausible is not true, and treating an imputed number as real is its own quality failure. Stale data with no external source of truth is gone. Two systems that contradict with no metadata explaining how can be reconciled only by executive decree, which is a business decision dressed as a data decision.

Some problems look fixable and are not. A field whose meaning lived in the heads of people who have left is not always recoverable, even with the data intact. The schema says “status code,” the codes run 1 through 7, three are documented, the others are anyone’s guess. A model can infer that code 6 correlates with cancellations, but inference is not knowledge, and acting on inference at scale is a way to ship confident wrongness.

Accepting that some data cannot be saved is part of doing this work seriously. The alternative is a cleanup project that never ends.

If you are starting now

Prevention does not eliminate any of the categories above. New systems age, new business requirements misshape old schemas, new operators find new ways to corrupt input. What a new system can do is refuse to manufacture the easy versions of the problem, which is most of them.

Validate at every boundary

Garbage inputs are partly a UI problem. Required fields force values, so the value will be asdfjk; when the operator does not have one; use constrained widgets, and make “not yet known” a real state instead of punishing operators into faking it. But the UI is only the first gate, and it is the one most easily bypassed: imports, integrations, and API writes never see it. Garbage is a validation problem too, and validation belongs at every boundary the data can enter through, not only the screen a human types into. The same constraint that greys out an invalid option in the form has to reject the same value when it arrives in a nightly batch, or the form was theatre.

Timestamp, version, and curate

Temporal decay is fought with timestamps. Every field that can go stale gets an as_of. Distinguish “true when collected” from “true now,” and schedule revalidation for the fields that matter. Decay is not only a property of cells. A whole document decays, and once you keep more than one version of it, you have to answer what counts as the same document across edits. Versioning and document identity are the same question asked at the row and at the file: which writes supersede which, and which thing they are writing to. Get identity wrong and you keep two truths under one name, or split one truth across two names, and no timestamp saves you.

This is where the newest version of the old problem lives. Retrieval-augmented generation and the knowledge bases behind it have created a class of corpus that barely existed before: the uncurated, gigantic dumping ground pointed at a model as context. A decade of SharePoint, a shared drive no one has pruned since 2015, every draft and superseded policy and contradictory memo sitting at equal weight, now retrieved and quoted with the confidence of a fresh answer. The data was always bad; what changed is that a model will now read all of it and synthesise a fluent reply that averages the stale against the current and the draft against the approved. Staleness, entropy, and the absence of curation were survivable when a human searched and judged. They are not survivable when retrieval is automatic and the reader cannot tell a 2015 draft from this morning’s signed version.

Curation is the missing discipline, and part of it is making decay computable. Treat facts and documents as reactive, not inert. A derived fact, a summary, an embedding, a roll-up, depends on inputs, and those dependencies form a DAG. When an upstream input changes, everything downstream of it is stale and needs recompute, sometimes by rerunning code, sometimes by re-asking a model, sometimes by routing to a human to re-judge. A knowledge base that does not know its own dependency graph cannot know what went stale when the source changed, so it serves the old answer forever. The fix is to model the graph and recompute along it, the way a spreadsheet recalculates the cells that depended on the one you edited.

Give every fact one owner

Contradiction is a source-of-truth problem. Each fact has one system that owns it; others hold derived views, fed by events or replication, not by parallel manual entry. The moment two systems both accept writes to the same fact, contradiction is on the schedule.

Make loss recoverable

Destructive loss is fought with backups, point-in-time recovery, and append-only audit tables. Soft delete where it makes sense; hard delete only when the legal answer requires it.

Document meaning at the schema

Undocumented codes are fought by writing the documentation at the schema. Status codes are enums; each value carries a comment; the comment is the contract. New code, new comment, in the same migration.

Make invalid states unrepresentable

Strong typing prevents whole categories of bad data by making them unrepresentable. The maxim “make invalid states unrepresentable” applies wherever the domain is well enough understood to encode its rules (see It’s OCaml’s Time for the language case). A status that depends on another field should be a discriminated union, not two nullables that can disagree. A user is Unverified or Verified { at: Date }, not one record with a nullable verified_at the application reinterprets. Bad combinations never get written, never enter the database, never become someone’s afternoon.

Pushed to its limit, this stops being a modelling convenience and becomes a proof. Dependent types, the kind in Idris or Agda, let the type carry arbitrary facts about the value: a list paired with its length, a date that cannot be the 31st of February, a transition that only typechecks if the prior state allowed it. Type to the Nth degree and “invalid states unrepresentable” stops being a slogan and becomes literal. The compiler will not build a program that could construct the bad value, so the bad value never reaches the database to be cleaned up later. Most production code never goes this far, and should not, because the proof obligation grows faster than the payoff. But the direction is the point: every invariant you move into the type is a class of dirty data the system can no longer produce, checked once at compile time instead of forever at runtime.

The qualifier matters: the domain has to be understood. The reach of the type system is bounded by how well you know the thing you are typing, and typing hard is the opposite of free when you know it poorly. Premature typing in an area still being explored, including most LLM-produced shapes during ontology iteration, calcifies a guess and feeds the producer back its own categories (this is the schema-capture trap discussed in The Batting Stance). A precise type for the wrong model is worse than a loose one, because it is harder to dislodge and more confidently wrong. Where the shape is uncertain, keep the representation loose (JSON, opaque payloads, narrow types only at the application boundary) and tighten when downstream code earns each case. Unstructured representations have a place: wherever the structure is not yet known.

Identify, attribute, observe

Identity is a stable internal ID, never an email or phone number. Provenance is recorded per field: manual, import, API, model inference (with model and version); inferred values flagged. Observability runs from day one, because finding a regression in week one costs much less than finding it in year three. The audit trail above goes in before the first record matters.

Collect less

Do not over-collect. Every optional field gathers more garbage than the required ones, and every field is something someone is expected to maintain. The notes column will accumulate phone numbers, addresses, operator gossip, and three flag-like enum values someone wished were columns. Better fewer columns that hold something than many that hold whatever happened to land there.

None of this prevents the slow erosion of meaning that comes with the years. It buys a system that decays at the rate of the business, not the rate of operator improvisation. That is the difference between cleanup as ongoing maintenance and cleanup as a multi-year program.

How it actually goes

The pitch frames cleanup as smaller than it looks because AI can shoulder most of it. The work goes the other way; the problem is larger and stranger than the diagnosis suggested. AI carries one layer, usually ingestion. The rest stays hard, slow, and tied to context the model cannot recover. There is no universal translator for data, because the problem is not translation.

What works is unromantic. Pick the right tool at each layer. Notice early which records are unrecoverable, and decide what the business does about them on purpose. The model is not going to figure it out, and the slide deck that said it would was written by someone who has not opened the file.

On Messy Data