Validating Supplier CSVs Against FSMA 204 KDE Schemas: Stopping Silent Type Coercion at Ingestion

The exact production failure this page resolves: a supplier CSV loads without raising a single error, yet the Key Data Elements it writes to your traceability ledger are already corrupt. A 14-digit traceability lot code that began with a zero — 00123456789012 — lands in the database as the integer 123456789012, two characters shorter and no longer matching the code the supplier printed on the pallet. A bulk shipment weight exported as 1.5E+04 is committed as the float 15000.0, and a case count of NA becomes the float NaN. None of these transformations throws. They happen inside the CSV reader, before any FSMA 204 (21 CFR Part 1, Subpart S) validation rule has a chance to run, which is why they survive all the way to a recall investigation where the lot chain fails to reconstruct.

This is the flat-file half of the schema gate defined by the parent Schema Validation Rules: every normalized record — CSV, EDI, or REST payload — must pass one strict KDE contract before it reaches the ledger. But a schema check only works on the values it is handed, and by the time a naive pandas.read_csv() hands them over, the damage is done. The fix has to move earlier, into the read itself.

Root Cause: Numeric Inference Runs Before Validation

The defect is not a bug in pandas; it is pandas doing exactly what it is designed to do. pandas.read_csv() runs dtype inference on every column at parse time, and hand-written loaders built on Python’s csv module reproduce the problem whenever they cast numeric-looking fields with int() or float(). A column that looks numeric is coerced to int64 or float64, and three FSMA-relevant fields consistently look numeric to the inference engine even though they are not:

Leading-zero identifiers. Traceability lot codes, Global Location Numbers (GLNs), and GTINs are digit strings, not quantities. Inference reads 00123456789012 as an integer and discards the leading zeros, permanently. The value can never round-trip back to the printed code.
Scientific notation in bulk weights. Suppliers exporting from spreadsheets emit large quantities as 1.5E+04. Parsed as a float, that is 15000.0 — but floats cannot represent every decimal exactly, so a downstream sum of case weights drifts, and the exact quantity KDE required for a shipping Critical Tracking Event is now an approximation.
Null-token coercion. By default pandas maps NA, N/A, NULL, and empty strings to NaN, a float. A single NaN forces the entire column to float64, so even the clean lot codes beside it are re-typed and their zeros stripped.
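The three coercions above can be reproduced without pandas. The sketch below is illustrative only: it uses Python's built-in int, float, and decimal.Decimal, and the ten 0.1 LB case weights are made-up values chosen to make the drift visible.

```python
from decimal import Decimal

# 1. Leading-zero identifier: the integer keeps the magnitude, not the printed form.
lot_code = "00123456789012"
as_int = int(lot_code)
print(len(lot_code), len(str(as_int)))       # 14 12

# 2. Quantities: binary floats drift under repeated addition; Decimal does not.
case_weights = ["0.1"] * 10                  # illustrative values, not supplier data
float_total = sum(float(w) for w in case_weights)
decimal_total = sum(Decimal(w) for w in case_weights)
print(float_total == 1.0)                    # False
print(decimal_total == Decimal("1.0"))       # True

# 3. Null tokens: NaN is a float that does not even equal itself.
missing = float("nan")
print(missing == missing)                    # False
```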

Under Subpart S, a traceability lot code (§ 1.1320) and the quantity and unit KDEs attached to a CTE (§ 1.1340) must be captured, maintained, and transmitted exactly as recorded. A silently truncated lot code is not a formatting nuisance — it is a KDE that no longer identifies the lot it names, the precise failure the rule exists to prevent. The engineering rule that follows: preserve every field as a string through the read, then let an explicit schema decide what each value is allowed to become.

Figure: CSV ingestion validation pipeline.

The KDE Contract This Gate Enforces

Before the code, the fields the validator is responsible for and the exact regulatory source each one answers to:

KDE column	Expected type	Validation rule	Regulatory Source
`traceability_lot_code`	string (digits, zeros preserved)	Non-empty; exact supplier string, never numerically coerced	21 CFR 1.1320
`critical_tracking_event_date`	ISO 8601 string with offset	Parseable, timezone-aware; naive timestamps rejected	21 CFR 1.1330 / 1.1340
`quantity`	fixed-point decimal string	Positive `Decimal`; scientific notation rejected	21 CFR 1.1340
`unit_of_measure`	string	Present; controlled vocabulary	21 CFR 1.1340
`location_id` (GLN)	13-digit string	`^\d{13}$` after strip	21 CFR 1.1320 / 1.1315
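The table calls for a controlled vocabulary on unit_of_measure. One way that rule might be expressed is sketched below; the ALLOWED_UNITS set and the check_unit_of_measure helper are hypothetical names, and the four unit codes are placeholders for whatever list your trading partners actually agree on.

```python
# Hypothetical vocabulary: replace with the unit codes your trading partners agree on.
ALLOWED_UNITS = frozenset({"LB", "KG", "EA", "CS"})


def check_unit_of_measure(raw: str) -> str:
    """Return the normalized unit code, or raise ValueError if it is not allowed."""
    unit = raw.strip().upper()
    if not unit:
        raise ValueError("unit_of_measure cannot be empty")
    if unit not in ALLOWED_UNITS:
        raise ValueError(f"unit_of_measure {unit!r} is not in the controlled vocabulary")
    return unit


print(check_unit_of_measure(" lb "))   # LB
```

Because the input is still a raw string at this point, a literal "NA" token is rejected by name instead of being mistaken for a missing value.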

Minimal Reproducible Example

The failure is easiest to see against the default read path, with a two-row CSV that carries a leading-zero lot code, a scientific-notation weight, and an NA token, the three inputs that trigger coercion:

import io
import pandas as pd

raw = io.StringIO(
    "traceability_lot_code,quantity,unit_of_measure\n"
    "00123456789012,1.5E+04,LB\n"
    "00987654321098,NA,LB\n"
)

df = pd.read_csv(raw)  # the naive default: dtype inference runs on every column

# print() rather than repr(): NumPy 2 reprs scalars as np.int64(...) / np.float64(...).
print(df.loc[0, "traceability_lot_code"])        # -> 123456789012   (int64, zeros gone)
print(df.loc[0, "quantity"])                     # -> 15000.0        (float, precision fiction)
print(df.loc[1, "quantity"])                     # -> nan            (NA became a float)
print(df["traceability_lot_code"].dtype)         # -> int64

Every line runs; nothing raises. Yet the first lot code has lost its leading zeros and can no longer match the printed pallet label, the weight is now a lossy float, and the NA in row two has become a float NaN that nothing distinguishes from a genuinely missing value. A schema check applied after this point validates corrupted values and stamps them as compliant, which is the worst possible outcome because the ledger now looks clean.

Fix: String-Preserving Read, Then an Explicit KDE Schema

The fix is two layers. First, a diagnostic read that forces every column to str and disables null-token coercion, so the reader preserves exactly what the supplier sent. Second, a pydantic v2 model that is the only place a value is allowed to change type — parsing dates, converting quantities to Decimal, and validating GLNs — with every rejection carrying an explicit message. This mirrors the enforcement contract used across Supplier Data Ingestion & Sync Automation: parse permissively, validate strictly, quarantine loudly.

import logging
import pandas as pd

logger = logging.getLogger("fsma204_validator")
logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")


def diagnostic_read(filepath: str) -> pd.DataFrame:
    """Read a supplier CSV with strict string preservation to prevent silent coercion."""
    try:
        df = pd.read_csv(
            filepath,
            dtype=str,               # every column stays a string; no numeric inference
            encoding="utf-8-sig",    # strips a UTF-8 BOM that would corrupt the first header
            low_memory=False,
            keep_default_na=False,   # 'NA'/'NULL'/'' stay as strings, never become NaN floats
        )

        # Log raw headers first: a BOM or hidden whitespace here explains a "missing" column.
        logger.info("Raw headers detected: %s", list(df.columns))

        # Normalize headers deterministically: drop BOM remnants, trim, replace spaces
        # with underscores, lowercase.
        df.columns = [
            c.replace("\ufeff", "").strip().replace(" ", "_").lower()
            for c in df.columns
        ]

        # Structural gate: required KDEs must be present before any value-level check runs.
        required_kdes = {
            "critical_tracking_event_date",
            "lot_code",
            "traceability_lot_code",
            "quantity",
            "unit_of_measure",  # listed in the KDE contract table, so gate on it here too
            "location_id",
        }
        missing = required_kdes - set(df.columns)
        if missing:
            raise ValueError(f"Missing mandatory KDE columns: {missing}")

        logger.info("Diagnostic read successful. Shape: %s", df.shape)
        return df
    except pd.errors.ParserError as e:
        logger.error("CSV parsing failed at file level: %s", e)
        raise

With raw strings safely in memory, the pydantic v2 model becomes the single authority on type:

from pydantic import BaseModel, field_validator, ValidationError
from datetime import datetime
from decimal import Decimal, InvalidOperation
import re


class FSMA204KDESchema(BaseModel):
    critical_tracking_event_date: str
    lot_code: str
    traceability_lot_code: str
    product_description: str
    quantity: str
    unit_of_measure: str
    location_id: str  # GLN or equivalent facility identifier

    @field_validator("critical_tracking_event_date", mode="before")
    @classmethod
    def normalize_event_date(cls, v: str) -> str:
        if not v or not v.strip():
            raise ValueError("critical_tracking_event_date cannot be empty")
        v = v.strip()
        try:
            dt = datetime.fromisoformat(v.replace("Z", "+00:00"))
        except ValueError as e:
            raise ValueError(
                f"Invalid date format. Expected ISO 8601 with timezone. Got: {v}"
            ) from e
        # FSMA CTE timestamps must be unambiguous across facilities in different zones.
        if dt.tzinfo is None:
            raise ValueError(
                "Timestamp must include explicit timezone offset (e.g., +00:00 or Z)"
            )
        return dt.isoformat()

    @field_validator("quantity", mode="before")
    @classmethod
    def normalize_quantity(cls, v: str) -> str:
        if not v or not v.strip():
            raise ValueError("quantity cannot be empty")
        v = v.strip().replace(",", "")

        # Reject scientific notation outright: 1.5E+04 loses precision the instant it
        # is parsed as float(), and the quantity KDE must stay exact.
        if re.search(r"[eE]", v):
            raise ValueError(
                "Scientific notation is prohibited for KDE quantity fields. Use decimal format."
            )

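        try:
            dec_val = Decimal(v)  # Decimal, not float: no binary rounding drift
        except InvalidOperation as e:
            raise ValueError(f"Quantity must be a valid decimal number. Got: {v}") from e
        # Decimal() also accepts "NaN" and "Infinity". Neither is a quantity, and comparing
        # NaN with 0 would raise InvalidOperation rather than a clean validation error.
        if not dec_val.is_finite():
            raise ValueError(f"Quantity must be a finite decimal number. Got: {v}")
        if dec_val <= 0:
            raise ValueError("Quantity must be greater than zero.")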

        # Emit fixed-point notation. Do NOT call Decimal.normalize(): it re-introduces
        # scientific notation for trailing-zero integers, e.g. Decimal("100") -> 1E+2.
        return f"{dec_val:f}"

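    @field_validator("location_id", "traceability_lot_code", mode="before")
    @classmethod
    def strip_identifier(cls, v: str) -> str:
        # Identifiers arrive as strings and stay strings; only surrounding noise is removed.
        v = v.strip()
        # The KDE contract requires a non-empty lot code; a blank cell must not pass as "".
        if not v:
            raise ValueError("Identifier cannot be empty.")
        return v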

    @field_validator("location_id")
    @classmethod
    def validate_gln(cls, v: str) -> str:
        # A GLN is 13 digits; leading zeros are significant and must have survived the read.
        if not re.match(r"^\d{13}$", v):
            raise ValueError("location_id must be a valid 13-digit GLN (digits only).")
        return v

Because the read forced strings, a GLN such as 0086000000012 reaches the validator intact and passes the 13-digit check, and the lot code 00123456789012 keeps both leading zeros. Under the naive path the same GLN arrives as the integer 86000000012, eleven digits, and can never satisfy that check. The coercion is caught either way, but only the string-preserving read keeps the correct value.

For a 50,000-row manifest, wrap the model in an orchestrator that isolates failures per row and trips a circuit breaker before a systemically broken export floods the quarantine. This fail-forward pattern — one bad row never aborts the batch — is the same one the parent gate applies across every ingestion vector:

from typing import Any


class KDEValidationCircuitBreaker:
    def __init__(
        self,
        max_error_rate: float = 0.05,
        max_total_errors: int = 100,
        min_rows_for_rate: int = 20,
    ) -> None:
        self.max_error_rate = max_error_rate
        self.max_total_errors = max_total_errors
        # Warm-up before the rate check applies: without it, a failure on the very first
        # row is a 100% error rate and one bad row would abort an otherwise clean batch.
        self.min_rows_for_rate = min_rows_for_rate
        self.errors: list[dict[str, Any]] = []
        self.processed_count = 0
        self.failed_count = 0

    def validate_batch(
        self, df: pd.DataFrame
    ) -> tuple[pd.DataFrame, list[dict[str, Any]]]:
        valid_rows: list[dict[str, Any]] = []

        for idx, row in df.iterrows():
            self.processed_count += 1
            try:
                validated = FSMA204KDESchema(**row.to_dict())
                valid_rows.append(validated.model_dump())
            except ValidationError as e:
                self.failed_count += 1
                self.errors.append({
                    "row_index": int(idx),
                    "errors": [err["msg"] for err in e.errors()],
                    "raw_data": row.to_dict(),  # raw strings, so the reviewer sees the real value
                })

                # Circuit breaker 1: absolute error ceiling.
                if self.failed_count >= self.max_total_errors:
                    logger.critical(
                        "Circuit breaker triggered: max total errors (%d) reached.",
                        self.max_total_errors,
                    )
                    break

                # Circuit breaker 2: relative error rate, evaluated only after the warm-up,
                # so a wholesale bad export is still caught early.
                current_rate = self.failed_count / self.processed_count
                if (
                    self.processed_count >= self.min_rows_for_rate
                    and current_rate > self.max_error_rate
                ):
                    logger.critical(
                        "Circuit breaker triggered: error rate %.2f%% exceeds %.2f%% threshold.",
                        current_rate * 100, self.max_error_rate * 100,
                    )
                    break

        logger.info(
            "Batch complete. Processed: %d | Valid: %d | Failed: %d",
            self.processed_count, len(valid_rows), self.failed_count,
        )
        return pd.DataFrame(valid_rows), self.errors

Verification Steps

Confirm the fix against the same inputs that broke the naive path. First, prove the read preserved the string:

import io

fixed = io.StringIO(
    "critical_tracking_event_date,lot_code,traceability_lot_code,"
    "product_description,quantity,unit_of_measure,location_id\n"
    "2024-05-12T14:30:00Z,LOT-A,00123456789012,Romaine,1500,LB,0086000000012\n"
    "2024-05-12T14:30:00,LOT-B,00987654321098,Romaine,1.5E+04,LB,0086000000012\n"
)
df = pd.read_csv(fixed, dtype=str, keep_default_na=False)

# The string survived the read — leading zeros intact, no int64 coercion.
assert df.loc[0, "traceability_lot_code"] == "00123456789012"
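# is_string_dtype holds for both the legacy object dtype and the newer pandas string dtype.
assert pd.api.types.is_string_dtype(df["traceability_lot_code"])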

breaker = KDEValidationCircuitBreaker(max_error_rate=0.60)
valid, errors = breaker.validate_batch(df)

# Row 0 is compliant; row 1 fails on BOTH a naive timestamp and scientific-notation quantity.
assert len(valid) == 1
assert valid.iloc[0]["quantity"] == "1500"           # fixed-point, not 1500.0 float
assert errors[0]["row_index"] == 1
assert any("timezone" in m for m in errors[0]["errors"])
assert any("Scientific notation" in m for m in errors[0]["errors"])
print("Verified:", len(valid), "committed,", len(errors), "quarantined")

The assertions establish three things an auditor cares about: the lot code kept its leading zeros through ingestion, the accepted quantity is a fixed-point string rather than a float, and the rejected row was quarantined with both specific reasons attached rather than silently dropped. In production, take the per-row errors list that validate_batch returns and route it to your error-handling workflows so suppliers receive an actionable report rather than a generic bounce.
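As a rough illustration of what an actionable report could look like, the sketch below turns an errors list of the shape returned by validate_batch into one JSON line per rejected row. The to_quarantine_report helper and the supplier ID are hypothetical, and the csv_line arithmetic assumes the default zero-based row index with no multi-line quoted fields.

```python
import json
from typing import Any


def to_quarantine_report(supplier_id: str, errors: list[dict[str, Any]]) -> str:
    """Render one JSON line per rejected row so a supplier can fix and resend."""
    lines = []
    for entry in errors:
        lines.append(json.dumps({
            "supplier_id": supplier_id,            # hypothetical identifier
            "csv_line": entry["row_index"] + 2,    # +1 for the header, +1 for 1-based lines
            "reasons": entry["errors"],
            "submitted": entry["raw_data"],        # raw strings exactly as received
        }, sort_keys=True))
    return "\n".join(lines)


sample_errors = [{
    "row_index": 1,
    "errors": ["Value error, Scientific notation is prohibited for KDE quantity fields."],
    "raw_data": {"traceability_lot_code": "00987654321098", "quantity": "1.5E+04"},
}]
report = to_quarantine_report("SUP-0042", sample_errors)
print(report)
```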

Delimiter and quoting drift. A supplier switches from comma to semicolon delimiters, or embeds an unescaped comma inside product_description, shifting every column one position right. dtype=str preserves the shift silently — validate column count per row and reconcile against the header before trusting positions. This belongs in continuous data-quality monitoring, not just the ingestion gate.
Mixed date formats within one file. Some rows arrive ISO 8601, others as 05/12/2024 or Excel serial numbers like 45424. The validator rejects the non-ISO rows correctly, but a high reject rate signals an upstream export template problem — pair the circuit breaker with alerting so a format regression pages a data steward instead of quietly quarantining a whole batch.
GLN present but check-digit invalid. ^\d{13}$ proves shape, not correctness. A transposed digit passes the regex but points at no real facility. Add a GS1 mod-10 check-digit validation, and keep facility identifiers segregated per the Security Boundaries for Trace Data so location data does not co-mingle with raw KDE stores.
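For the check-digit case, a minimal sketch of GS1 mod-10 validation follows. The function names are hypothetical, and the sample value is simply a 13-digit number with a valid GS1 check digit used for illustration, not a real facility's GLN.

```python
import re


def gs1_check_digit(body: str) -> int:
    """Compute the GS1 mod-10 check digit for the digits that precede it."""
    total = 0
    # Weights alternate 3, 1, 3, ... starting from the rightmost body digit.
    for position, char in enumerate(reversed(body)):
        weight = 3 if position % 2 == 0 else 1
        total += int(char) * weight
    return (10 - total % 10) % 10


def is_valid_gln(value: str) -> bool:
    """True when value is 13 ASCII digits and its last digit matches the check digit."""
    if not re.fullmatch(r"[0-9]{13}", value):
        return False
    return gs1_check_digit(value[:12]) == int(value[12])


print(is_valid_gln("4006381333931"))   # True
print(is_valid_gln("4006381333391"))   # False: two adjacent digits transposed
```

Both values pass the shape regex; only the check digit separates the correct number from the transposed one.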

Frequently Asked Questions

Why not just cast the lot code column back to a zero-padded string after reading?

Because the information is already gone. Once pandas has inferred 00123456789012 as int64, the leading zeros are not stored anywhere — re-padding to a fixed width only works if every code is the same length, which supplier data never guarantees. A 12-digit code and a 14-digit code that both lost leading zeros cannot be distinguished after the fact. The only reliable fix is to never let inference run: read with dtype=str so the original string is preserved from the start.
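A short illustration of that ambiguity, using two made-up codes that differ only in their leading zeros:

```python
original_codes = ["00123456789012", "123456789012"]   # two distinct printed codes

# What numeric inference would store for each one.
stored = [int(code) for code in original_codes]
print(stored[0] == stored[1])            # True: the distinction is gone

# Re-padding to a guessed width "repairs" one code and corrupts the other.
repadded = [str(value).zfill(14) for value in stored]
print(repadded)                          # ['00123456789012', '00123456789012']
print(repadded == original_codes)        # False
```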

What does keep_default_na=False actually change?

By default pandas treats a list of tokens, including NA, N/A, NULL, NaN, and the empty string, as missing values and converts them to the float NaN. A single NaN forces its entire column to float64, which then strips leading zeros from every other value in that column. Setting keep_default_na=False keeps those tokens as literal strings, so your validator sees "NA" and rejects it with an explicit “must be a valid decimal number” error, or “cannot be empty” for a blank cell, instead of a silent NaN that a later float() swallows.

Why reject scientific notation instead of parsing it into a Decimal?

You could parse 1.5E+04 into Decimal("15000") safely, but accepting it hides an upstream problem. Scientific notation in a quantity field almost always means the value passed through a spreadsheet cell formatted as a number, which is the same pipeline that can render 100000000000 as 1E+11 and drop digits beyond roughly 15 significant figures on long values. Rejecting the format forces the supplier to export quantities as plain decimals, closing the door on the class of precision bugs rather than papering over one instance. If a specific trusted supplier genuinely cannot change their export, whitelist them explicitly rather than loosening the global rule.
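A sketch of what an explicit per-supplier exception might look like, assuming a hypothetical allowlist and supplier ID. The parse_quantity helper is illustrative and is not the validator defined earlier on this page; it only shows that Decimal parses E-notation exactly when the rule is deliberately relaxed.

```python
from decimal import Decimal, InvalidOperation

TRUSTED_SCIENTIFIC_SUPPLIERS = frozenset({"SUP-0042"})   # hypothetical supplier ID


def parse_quantity(raw: str, supplier_id: str) -> str:
    """Return a fixed-point quantity string, rejecting E-notation unless allowlisted."""
    text = raw.strip()
    if "e" in text.lower() and supplier_id not in TRUSTED_SCIENTIFIC_SUPPLIERS:
        raise ValueError("Scientific notation is prohibited for KDE quantity fields.")
    try:
        value = Decimal(text)        # exact decimal parse, no float involved
    except InvalidOperation as exc:
        raise ValueError(f"Quantity must be a valid decimal number. Got: {text}") from exc
    if not value.is_finite() or value <= 0:
        raise ValueError("Quantity must be a finite number greater than zero.")
    return f"{value:f}"              # fixed-point text rather than exponent form


print(parse_quantity("1.5E+04", "SUP-0042"))   # 15000
print(parse_quantity("1500", "SUP-0001"))      # 1500
```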

Should validation run inside the CSV reader or as a separate step?

Separate, always. The reader’s only job is to preserve bytes as strings and confirm the required columns exist. Type parsing, format rules, and business logic belong in the pydantic model, which is testable in isolation and shared identically by the CSV path, the EDI path from the CSV/EDI Parser Setup, and any REST payload. Mixing the two means a parser change can silently alter compliance behavior, and a validation change forces a re-read. Keep the boundary sharp: parse permissively, validate strictly.
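A minimal sketch of that reader-side responsibility, using only the standard library's csv module, which returns every field as a string. The REQUIRED set is trimmed for the example and read_rows_as_strings is a hypothetical helper, not part of the pipeline above. It also flags rows whose field count differs from the header, which is the usual symptom of delimiter or quoting drift.

```python
import csv
import io

REQUIRED = {"traceability_lot_code", "quantity", "location_id"}   # trimmed for the example


def read_rows_as_strings(text: str) -> tuple[list[dict[str, str]], list[int]]:
    """Return (rows, ragged_line_numbers); every value stays exactly as sent."""
    reader = csv.reader(io.StringIO(text))
    header = [name.strip().lower() for name in next(reader)]
    missing = REQUIRED - set(header)
    if missing:
        raise ValueError(f"Missing mandatory KDE columns: {sorted(missing)}")

    rows, ragged = [], []
    for fields in reader:
        if len(fields) != len(header):
            ragged.append(reader.line_num)     # shifted columns: do not trust positions
            continue
        rows.append(dict(zip(header, fields)))
    return rows, ragged


sample = (
    "traceability_lot_code,product_description,quantity,location_id\n"
    "00123456789012,Romaine,1500,0086000000012\n"
    "00987654321098,Romaine, chopped,1500,0086000000012\n"   # unescaped comma
)
rows, ragged = read_rows_as_strings(sample)
print(rows[0]["traceability_lot_code"])   # 00123456789012
print(ragged)                             # [3]
```

No type decision is made here; the function only reports structure, leaving every value-level rule to the schema.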

What error rate should trip the circuit breaker for a new supplier?

Tighter than for an established one. A supplier whose CSV export has not been validated warrants a max_error_rate around 0.02, so a systemic problem — an entire column shifted by a delimiter change, every timestamp missing its offset — halts ingestion fast and pages a data steward before it fills the quarantine. Established partners with a clean history can run at 0.05. Always alert on a trip: a silent halt is itself a compliance gap, because it means KDEs stopped reaching the ledger without anyone noticing.
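One way to encode those tiers is sketched below. The tier names, the breaker_should_trip helper, and the min_rows warm-up are assumptions for illustration; only the 0.02 and 0.05 rates come from the answer above.

```python
# Hypothetical tiers; the two rates are the ones suggested in the answer above.
MAX_ERROR_RATE = {"new": 0.02, "established": 0.05}


def breaker_should_trip(failed: int, processed: int, tier: str, min_rows: int = 100) -> bool:
    """Trip once the observed error rate exceeds the tier's threshold.

    min_rows is an assumed warm-up so one early failure is not read as a 100% error rate.
    """
    if processed < min_rows:
        return False
    return failed / processed > MAX_ERROR_RATE[tier]


print(breaker_should_trip(3, 100, "new"))           # True  (3% > 2%)
print(breaker_should_trip(3, 100, "established"))   # False (3% <= 5%)
print(breaker_should_trip(1, 1, "new"))             # False (below warm-up)
```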

Schema Validation Rules — the parent gate whose pydantic v2 contract this CSV path feeds.
CSV/EDI Parser Setup — the routing layer that normalizes flat files and EDI into the same canonical KDE record.
KDE Field Mapping Guide — the full catalog of KDEs and the GLN/GTIN identifiers these columns map onto.
Data Quality Monitoring — continuous checks for delimiter drift and format regressions beyond the ingestion gate.
API Polling Strategies — batch boundaries for records arriving over pollable endpoints rather than file drops.


For the authoritative regulatory text, reference the FDA Food Traceability Final Rule.
