Async Batch Processing for FSMA 204 Supplier Data Ingestion

FSMA 204 (21 CFR Part 1, Subpart S) mandates deterministic capture, mapping, and retention of Key Data Elements (KDEs) across every Critical Tracking Event (CTE). In a multi-tier supply chain, supplier records do not arrive politely. They come as high-volume, asynchronous payloads with heterogeneous schemas, intermittent connectivity, and aggressive rate limits — dozens of harvest logs, advance ship notices, and transformation events landing in the same window. A synchronous ingestion path degrades fast under that load, introducing latency bottlenecks, credential exhaustion, and partial-state failures that directly fracture lot code continuity. Async batch processing is the fix: it decouples data acquisition from schema validation and ledger persistence, so a slow supplier endpoint or a single malformed file never stalls the whole compliance pipeline. This page defines the batch component that sits inside the broader Supplier Data Ingestion architecture — the concurrency model, the KDE data contract it enforces, a runnable Python engine, and the quarantine strategy that keeps every rejected record accounted for.

The Problem: Throughput Without Losing Traceability Events

The engineering trap is treating ingestion as a request-per-record loop. Under real supplier load, that model collapses in three ways. First, an unbounded thread or task pool opens as many outbound connections as there are records, tripping supplier rate limits and earning HTTP 429s or an IP ban that halts ingestion entirely. Second, holding an entire fetch in memory to validate it record-by-record produces unpredictable memory footprints during traffic spikes. Third — and most damaging for compliance — a synchronous path tends to fail as an all-or-nothing unit: one bad timestamp aborts the transaction, and a whole window of otherwise-valid CTEs never reaches the traceability ledger.

The regulatory stakes make that unacceptable. The FDA can require sortable, electronic traceability records within 24 hours of a request during an outbreak investigation. If ingestion lag or a partial-state failure means a lot chain cannot be reconstructed in time, the facility faces an expanded recall scope and regulatory enforcement. The batch processor’s job is therefore not merely to move data quickly. It is to absorb unpredictable supplier behavior while guaranteeing that every valid KDE lands on the ledger exactly once, and that every invalid one is isolated with enough provenance to reconcile it by hand.

The design that satisfies both goals aggregates incoming payloads into configurable batch windows — typically 500 to 2,000 records — then partitions each window into fixed sub-batches before applying strict validation. Concurrency is explicitly bounded with an asyncio.Semaphore so the number of in-flight workers never exceeds what the slowest supplier tier tolerates. Because each window is treated as an atomic unit with a partial-commit contract, a validation failure on a subset of records quarantines only that subset, commits the valid remainder, and schedules reconciliation — the primary stream never blocks.
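The window-partitioning and partial-commit behavior described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the production engine: `partition`, `split_valid`, and the `has_lot_code` predicate are hypothetical names, and the predicate is a deliberately trivial stand-in for full KDE validation.

```python
from typing import Any, Callable, Iterator

Record = dict[str, Any]


def partition(window: list[Record], size: int) -> Iterator[list[Record]]:
    """Yield consecutive sub-batches of at most `size` records."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(window), size):
        yield window[start : start + size]


def split_valid(
    batch: list[Record], is_valid: Callable[[Record], bool]
) -> tuple[list[Record], list[Record]]:
    """Partial-commit contract: separate valid records from quarantined ones."""
    valid: list[Record] = []
    quarantined: list[Record] = []
    for record in batch:
        (valid if is_valid(record) else quarantined).append(record)
    return valid, quarantined


def has_lot_code(record: Record) -> bool:
    # Hypothetical stand-in for the full KDE contract.
    return bool(record.get("lot_code"))


# A hypothetical 250-record window with one malformed record at the end.
window = [{"lot_code": f"LOT-{i}"} for i in range(249)] + [{"lot_code": ""}]

committed: list[Record] = []
quarantined: list[Record] = []
for sub_batch in partition(window, 100):
    ok, bad = split_valid(sub_batch, has_lot_code)
    committed.extend(ok)      # the valid remainder is still committed
    quarantined.extend(bad)   # only the failing subset is isolated

print(len(committed), len(quarantined))
```

The single malformed record is isolated while the other 249 proceed, which is the property the partial-commit contract is meant to guarantee.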

Pipeline Architecture and Concurrency Control

The batch processor runs a fan-out/fan-in topology. A batch window is read from the source, split into sub-batches of a fixed size (100 records is a good default for memory predictability), and each sub-batch is handed to a worker that acquires a semaphore slot before doing any I/O. Validated records fan back in to a single idempotent commit against the traceability ledger; malformed records fan out to a dead-letter queue. Backpressure is implicit: because the semaphore caps concurrent workers, a slow ledger commit naturally throttles how fast new sub-batches acquire slots, so the pipeline self-regulates instead of overwhelming downstream storage.

Figure: A batch window is partitioned into fixed sub-batches that fan out through a semaphore-capped worker pool; validated KDEs fan in to an idempotent ledger commit while malformed records divert to a dead-letter queue, and a dashed backpressure path lets a slow commit throttle new slot acquisition.
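To make the concurrency cap concrete, the following minimal sketch runs more workers than there are semaphore slots and records the peak number in flight. The `run_bounded` helper and the `asyncio.sleep` stand-in for a ledger commit are assumptions made for illustration.

```python
import asyncio


async def run_bounded(n_jobs: int, limit: int) -> int:
    """Run n_jobs workers behind a semaphore; return the peak number in flight."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def worker() -> None:
        nonlocal in_flight, peak
        async with semaphore:  # a worker does no I/O until it holds a slot
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for a slow ledger commit
            in_flight -= 1

    await asyncio.gather(*(worker() for _ in range(n_jobs)))
    return peak


peak = asyncio.run(run_bounded(n_jobs=25, limit=4))
print(peak)
```

However many sub-batches are queued, the observed peak never exceeds the limit, and a slower commit simply holds each slot longer, which is the implicit backpressure described above.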

When supplier APIs enforce aggressive rate limits or return transient 5xx responses, the acquisition stage must adapt without dropping traceability events. This intersects directly with the resilient fetch discipline described in API Polling Strategies, where adaptive intervals, jittered exponential backoff, and token-bucket throttling prevent credential lockouts while preserving ingestion continuity. The batch processor consumes what the poller fetches, so the two components share the same backoff philosophy: a momentary network fault is a signal to wait, never a reason to skip a record. Where suppliers lack modern REST endpoints entirely, records enter this stage only after a CSV/EDI Parser Setup has normalized legacy flat files and EDI transactions into the same canonical payload shape the validator expects.
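As a rough illustration of jittered exponential backoff, the sketch below computes a capped delay schedule. It is similar in spirit to what a retry library does but is not a reproduction of any library's exact algorithm; `backoff_delays` and its parameters are hypothetical names chosen for this example.

```python
import random
from typing import Optional


def backoff_delays(
    attempts: int,
    initial: float = 1.0,
    cap: float = 30.0,
    jitter: float = 1.0,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Return one wait (in seconds) per retry attempt.

    Each delay doubles from `initial`, gains up to `jitter` seconds of random
    noise so that many clients do not retry in lockstep, and is capped at `cap`.
    """
    rng = rng or random.Random()
    delays: list[float] = []
    for attempt in range(attempts):
        exponential = initial * (2 ** attempt)
        delays.append(min(cap, exponential + rng.uniform(0, jitter)))
    return delays


# Seeded only so the example is repeatable; production code would not seed.
delays = backoff_delays(attempts=6, rng=random.Random(204))
print([round(d, 2) for d in delays])
```

The cap matters as much as the exponent: without it, a long outage would push the wait past the point where the window can still be fetched inside the reporting deadline.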

Data Contract: KDE Fields Every Batch Record Must Carry

Before any record in a sub-batch is committed, it must satisfy a strict field contract. The processor validates exactly the KDEs that FSMA 204 requires to establish unbroken traceability at a CTE, plus the composite key it needs to enforce idempotency. Any field not on this list is preserved as raw metadata for audit but is never permitted to satisfy a mandatory KDE slot. The Regulatory Source column cites the Subpart S provision that makes each field load-bearing.

Field	Type	Validation rule	Regulatory Source
`lot_code`	`str`	Non-null, 1–64 chars; supplier-assigned, immutable for the life of the lot	21 CFR 1.1320 (Subpart S)
`product_gtin`	`str`	12–14 digit GS1 GTIN; rejected on format mismatch	21 CFR 1.1340(a)
`facility_id`	`str`	Non-null, 3–32 chars; resolves to a registered location identifier	21 CFR 1.1330 / 1.1340
`harvest_date`	`datetime`	ISO 8601; the originating CTE event instant	21 CFR 1.1325 (growing/harvesting)
`shipping_timestamp`	`datetime`	Timezone-aware; never in the future	21 CFR 1.1340 (shipping CTE)
`quantity`	`float`	Strictly greater than zero	21 CFR 1.1340(a)

Two rules eliminate most batch defects. First, shipping_timestamp is validated as timezone-aware at the boundary — a naive datetime is rejected rather than silently coerced, because a wrong offset shifts the CTE onto the wrong day and corrupts recall scoping. Second, the composite of lot_code, facility_id, and shipping_timestamp becomes the idempotency key, so a network retry or a duplicated supplier submission resolves to a single ledger entry instead of inflating the traceability record count. For the exact field-to-field transformation rules from legacy supplier schemas into this contract, engineering teams follow the KDE Field Mapping Guide; for the deeper enforcement of type coercion and business-rule constraints, see the Schema Validation Rules.
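The two rules can be sketched with the standard library only. One detail here is an assumption of this sketch rather than something the text above states: the timestamp is normalized to UTC before hashing, so the same instant expressed with different offsets yields the same key. The helper names `require_aware` and `idempotency_key` are illustrative.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def require_aware(ts: datetime) -> datetime:
    """Reject naive datetimes instead of silently assuming an offset."""
    if ts.tzinfo is None or ts.utcoffset() is None:
        raise ValueError("shipping_timestamp must be timezone-aware")
    return ts


def idempotency_key(lot_code: str, facility_id: str, shipping_timestamp: datetime) -> str:
    """Hash the composite of lot_code, facility_id, and the shipping instant."""
    # Assumption: normalize to UTC so equal instants produce equal keys.
    instant = require_aware(shipping_timestamp).astimezone(timezone.utc)
    composite = f"{lot_code}|{facility_id}|{instant.isoformat()}"
    return hashlib.sha256(composite.encode("utf-8")).hexdigest()


# The same instant, written once in UTC and once with a -05:00 offset.
utc_ts = datetime(2024, 3, 1, 15, 0, tzinfo=timezone.utc)
offset_ts = datetime(2024, 3, 1, 10, 0, tzinfo=timezone(timedelta(hours=-5)))

key_a = idempotency_key("LOT-42", "FAC-001", utc_ts)
key_b = idempotency_key("LOT-42", "FAC-001", offset_ts)
print(key_a == key_b)

try:
    idempotency_key("LOT-42", "FAC-001", datetime(2024, 3, 1, 15, 0))
    naive_rejected = False
except ValueError:
    naive_rejected = True
print(naive_rejected)
```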

Production-Grade Async Batch Processor in Python

The engine below is async, bounded, and idempotent. It uses pydantic v2 for KDE validation, tenacity for bounded exponential-backoff retries on the supplier fetch, an asyncio.Semaphore to cap concurrency, pipe-delimited logging with ISO 8601 timestamps as the audit trail, and a partial-commit contract that persists the valid subset of every window while routing malformed records to a dead-letter queue. Running it twice over the same window produces the same ledger state, provided the ledger's commit_batch upserts on the composite idempotency key: duplicate submissions then collapse on that key rather than double-writing.

import asyncio
import hashlib
import logging
from datetime import datetime, timezone
from typing import Any

import httpx
from pydantic import BaseModel, Field, ValidationError, field_validator
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential_jitter,
)

# Structured, audit-ready logging: every batch decision is recorded.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S%z",
)
logger = logging.getLogger("fsma204.async_batch")


class TransientFetchError(Exception):
    """Retryable upstream fault (5xx / network) — safe to back off and retry."""


class KDEPayload(BaseModel):
    """Strict schema mapping for FSMA 204 Key Data Elements."""

    lot_code: str = Field(..., min_length=1, max_length=64)
    product_gtin: str = Field(..., pattern=r"^\d{12,14}$")
    harvest_date: datetime
    facility_id: str = Field(..., min_length=3, max_length=32)
    shipping_timestamp: datetime
    quantity: float = Field(..., gt=0)

    @field_validator("shipping_timestamp")
    @classmethod
    def validate_shipping_ts(cls, v: datetime) -> datetime:
        # A naive timestamp makes the CTE date ambiguous; reject at the boundary.
        if v.tzinfo is None or v.tzinfo.utcoffset(v) is None:
            raise ValueError("shipping_timestamp must be timezone-aware")
        if v > datetime.now(timezone.utc):
            raise ValueError("shipping_timestamp cannot be in the future")
        return v

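    def idempotency_key(self) -> str:
        # Composite of the fields that uniquely identify a shipping CTE, so a
        # retry or duplicate submission resolves to one ledger entry. The
        # timestamp is normalized to UTC first so the same instant sent with
        # different offsets hashes to the same key.
        shipped_utc = self.shipping_timestamp.astimezone(timezone.utc)
        composite = f"{self.lot_code}|{self.facility_id}|{shipped_utc.isoformat()}"
        return hashlib.sha256(composite.encode("utf-8")).hexdigest()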


class DeadLetterQueue:
    """Quarantine for non-compliant payloads, preserved with the raw input."""

    def __init__(self) -> None:
        self.records: list[dict[str, Any]] = []

    def push(self, record: dict[str, Any], error: str) -> None:
        self.records.append(
            {
                "quarantined_at": datetime.now(timezone.utc).isoformat(),
                "raw_payload": record,
                "validation_error": error,
            }
        )
        logger.warning("Payload quarantined to DLQ | error=%s", error)


@retry(
    retry=retry_if_exception_type(TransientFetchError),
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=1, max=30),
    reraise=True,
)
async def fetch_supplier_batch(
    client: httpx.AsyncClient,
    endpoint: str,
    batch_size: int = 1000,
) -> list[dict[str, Any]]:
    """Fetch one supplier window; transient 5xx faults are retried with backoff."""
    try:
        response = await client.get(endpoint, params={"limit": batch_size})
        response.raise_for_status()
    except httpx.HTTPStatusError as exc:
        if exc.response.status_code >= 500:
            # Transient server-side fault: raise the retryable type so tenacity
            # backs off with jitter instead of hammering a struggling endpoint.
            raise TransientFetchError(str(exc)) from exc
        logger.error("Non-retryable HTTP error | %s", exc)
        raise
    except (httpx.ConnectError, httpx.ReadTimeout) as exc:
        raise TransientFetchError(str(exc)) from exc
    return response.json().get("records", [])


async def validate_and_persist_batch(
    records: list[dict[str, Any]],
    ledger_client: Any,
    dlq: DeadLetterQueue,
) -> int:
    """Validate KDEs, commit the valid subset idempotently, quarantine failures."""
    valid: list[dict[str, Any]] = []
    for idx, record in enumerate(records):
        try:
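            # model_validate reports a non-dict record as a ValidationError
            # too, so it is quarantined instead of raising TypeError.
            payload = KDEPayload.model_validate(record)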
        except ValidationError as exc:
            # Partial-commit contract: one bad record never aborts the batch.
            dlq.push(record, str(exc))
            logger.error("Sub-batch index %d failed KDE validation | %s", idx, exc)
            continue
        row = payload.model_dump(mode="json")
        row["idempotency_key"] = payload.idempotency_key()
        valid.append(row)

    if valid:
        # commit_batch upserts on idempotency_key, so retries collapse safely.
        await ledger_client.commit_batch(valid)
        logger.info("Committed %d valid KDE records to traceability ledger", len(valid))
    return len(valid)


async def run_async_ingestion_pipeline(
    supplier_endpoint: str,
    ledger_client: Any,
    concurrency_limit: int = 10,
    batch_window_size: int = 500,
    sub_batch_size: int = 100,
) -> dict[str, int]:
    """Orchestrate async batch processing with semaphore-bounded concurrency."""
    semaphore = asyncio.Semaphore(concurrency_limit)
    dlq = DeadLetterQueue()

    async with httpx.AsyncClient(timeout=30.0) as client:
        raw_records = await fetch_supplier_batch(
            client, supplier_endpoint, batch_window_size
        )

        # Partition into sub-batches for predictable memory under traffic spikes.
        sub_batches = [
            raw_records[i : i + sub_batch_size]
            for i in range(0, len(raw_records), sub_batch_size)
        ]

        async def process_sub_batch(batch: list[dict[str, Any]]) -> int:
            async with semaphore:  # backpressure: never exceed the concurrency cap
                return await validate_and_persist_batch(batch, ledger_client, dlq)

        tasks = [process_sub_batch(batch) for batch in sub_batches]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    committed = sum(r for r in results if isinstance(r, int))
    failures = [r for r in results if isinstance(r, Exception)]
    for exc in failures:
        logger.error("Sub-batch raised and was skipped this cycle | %s", exc)

    logger.info(
        "Pipeline complete | committed=%d | quarantined=%d | sub_batch_errors=%d",
        committed, len(dlq.records), len(failures),
    )
    return {
        "committed": committed,
        "quarantined": len(dlq.records),
        "sub_batch_errors": len(failures),
    }


if __name__ == "__main__":
    class MockLedger:
        async def commit_batch(self, records: list[dict[str, Any]]) -> bool:
            await asyncio.sleep(0.1)  # simulate an idempotent upsert
            return True

    asyncio.run(
        run_async_ingestion_pipeline(
            supplier_endpoint="https://supplier-api.example.com/v1/shipments",
            ledger_client=MockLedger(),
            concurrency_limit=8,
            batch_window_size=1000,
        )
    )

The implementation enforces timezone-aware timestamps at the boundary, derives a composite idempotency key before persistence, and never lets a single malformed record abort a sub-batch. The tenacity decorator retries only the retryable TransientFetchError — a 4xx client error is raised immediately because retrying a malformed request wastes quota and delays the whole window. This is the same backoff discipline the core Python async batch sync for supplier APIs walkthrough builds on, extended here with compliance-specific validation gates and partial-batch persistence.
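The idempotent-commit claim depends on the ledger actually upserting on the key. A minimal in-memory stand-in, assuming a hypothetical `InMemoryLedger` that exposes the same `commit_batch` coroutine the engine calls, shows what replaying the same batch should look like:

```python
import asyncio
from typing import Any


class InMemoryLedger:
    """Hypothetical stand-in ledger: upserts rows keyed on idempotency_key."""

    def __init__(self) -> None:
        self.rows: dict[str, dict[str, Any]] = {}

    async def commit_batch(self, records: list[dict[str, Any]]) -> bool:
        for row in records:
            # Upsert: a repeated key overwrites instead of appending.
            self.rows[row["idempotency_key"]] = row
        return True


async def replay_twice() -> tuple[int, int]:
    ledger = InMemoryLedger()
    batch = [
        {"idempotency_key": "k1", "lot_code": "LOT-1"},
        {"idempotency_key": "k2", "lot_code": "LOT-2"},
        {"idempotency_key": "k1", "lot_code": "LOT-1"},  # duplicate submission
    ]
    await ledger.commit_batch(batch)
    after_first = len(ledger.rows)
    await ledger.commit_batch(batch)  # simulated retry of the whole batch
    return after_first, len(ledger.rows)


after_first, after_second = asyncio.run(replay_twice())
print(after_first, after_second)
```

A real ledger would enforce the same behavior with a unique constraint and an upsert statement; the dictionary here only models the observable outcome.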

Error Handling and Quarantine Strategy

Async batch processing does not eliminate data failures; it isolates them so the primary stream keeps flowing. The processor fails closed: any record that cannot be proven compliant is quarantined rather than dropped, and the raw input is preserved alongside a structured error. Three failure classes are handled this way. The first two route individual records to the dead-letter queue; the third defers the whole window:

Schema-invalid payloads — a record that fails pydantic validation (a naive shipping_timestamp, a malformed GTIN, a non-positive quantity) is pushed to the DLQ with the exact ValidationError text. It is never committed on a guess, because a malformed field could mask a record that belongs on the ledger.
Future-dated or logically impossible events — a shipping_timestamp past datetime.now(timezone.utc) is rejected as a data-entry or clock-skew defect. Committing it would place a CTE outside the plausible traceability timeline and distort a downstream recall query.
Exhausted fetch retries — a supplier window that keeps returning 5xx exhausts the tenacity backoff budget and raises; that window is logged and deferred to the next scheduled cycle rather than being partially committed. Because the cursor is only advanced by the upstream poller after a clean fetch, no records are lost by deferring.

Quarantined records do not sit idle. Compliance teams run scheduled reconciliation jobs that reprocess DLQ entries after suppliers correct the source data, and those records re-enter the next batch on their original idempotency key so they never double-commit. This reconciliation path is shared with the broader Error Handling Workflows that own retries, dead-letter routing, and operator alerting across the ingestion layer. Persistent malformation rates surface into Data Quality Monitoring, where a rising DLQ rate for one supplier flags a schema drift that needs a mapping fix rather than a per-record retry.
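A reconciliation pass might look roughly like the sketch below. It reuses the `raw_payload` and `validation_error` fields of the DLQ entries shown earlier; the `record_id` field, the `corrections` mapping, and the `check_quantity` validator are hypothetical and exist only to keep the example self-contained.

```python
from typing import Any, Callable

Entry = dict[str, Any]


def reconcile(
    dlq_entries: list[Entry],
    corrections: dict[str, Entry],
    validate: Callable[[Entry], None],
) -> tuple[list[Entry], list[Entry]]:
    """Re-validate quarantined payloads, applying supplier corrections if any.

    `corrections` maps a hypothetical record_id to the corrected payload.
    Returns (ready_for_next_batch, still_quarantined).
    """
    ready: list[Entry] = []
    still_quarantined: list[Entry] = []
    for entry in dlq_entries:
        raw = entry["raw_payload"]
        candidate = corrections.get(raw.get("record_id"), raw)
        try:
            validate(candidate)
        except ValueError as exc:
            # Still invalid: keep the raw payload and refresh the error text.
            still_quarantined.append({**entry, "validation_error": str(exc)})
        else:
            ready.append(candidate)
    return ready, still_quarantined


def check_quantity(record: Entry) -> None:
    # Hypothetical stand-in for the full KDE validation.
    if record.get("quantity", 0) <= 0:
        raise ValueError("quantity must be greater than zero")


dlq_entries = [
    {"raw_payload": {"record_id": "r1", "quantity": 0}, "validation_error": "quantity"},
    {"raw_payload": {"record_id": "r2", "quantity": -5}, "validation_error": "quantity"},
]
corrections = {"r1": {"record_id": "r1", "quantity": 12.5}}

ready, still_quarantined = reconcile(dlq_entries, corrections, check_quantity)
print(len(ready), len(still_quarantined))
```

Records in `ready` would be handed to the next batch window, while anything left in `still_quarantined` stays visible to operators with its original raw payload intact.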

Integration With the Parent Architecture

This batch processor is the execution engine for the volume-scaling layer of the parent Supplier Data Ingestion pipeline. The dependency runs in one direction. Upstream, the poller and the parser deliver normalized payloads and advance their cursors; this stage consumes those payloads, validates them against the KDE contract, and commits the valid subset to the traceability ledger. Downstream, the FSMA 204 Architecture & KDE Compliance Mapping program turns those committed KDEs into an exportable, query-ready lot graph, and its retention component ages them out under the two-year mandate. If this stage silently drops a record, the export layer cannot reconstruct the lot chain — which is precisely why the processor quarantines instead of discarding.

Idempotency is what makes the boundary safe. Because the ledger commit upserts on the composite of lot_code, facility_id, and shipping_timestamp, an at-least-once delivery guarantee from the queue never inflates the record count. Every batch decision — commit hash, validation outcome, retry count, DLQ push — is emitted as structured, immutable telemetry, which feeds the same audit evidence trail the FDA expects and passes through the access controls defined in the Security Boundaries for Trace Data. By coupling schema validation with structured logging at the ingestion boundary, the pipeline transforms raw supplier payloads into legally defensible traceability records under the Final Rule Requirements for Additional Traceability Records for Certain Foods.

Operational Notes

Runtime and dependencies. Python 3.10+, httpx>=0.27, pydantic>=2.6, tenacity>=8.2. The asyncio, hashlib, and logging modules are standard library. Pin versions in a lockfile so validation and idempotency behavior is byte-for-byte reproducible during an audit.
Configuration variables. Expose concurrency_limit, batch_window_size, and sub_batch_size as environment variables, never literals. Tune concurrency_limit to the slowest supplier tier — start low (8–10) and raise it only while watching for 429 responses. sub_batch_size bounds per-worker memory; 100 records is a safe default.
Backpressure and clocks. The semaphore is the only throttle you need under normal load; do not add a second unbounded task pool around it. Ensure the host clock is NTP-synchronized, because the future-timestamp check and every CTE comparison use datetime.now(timezone.utc).
Monitoring. Track three metrics with alert thresholds that trip before the 24-hour SLA is at risk: batch commit latency, DLQ accumulation rate, and validation-failure distribution by supplier. A spike in any one tells you whether to adjust the semaphore limit, the retry budget, or an upstream parser mapping.
Dry runs. Ship a mode that validates and logs the intended commits without writing to the ledger, and capture that output as inspection evidence before enabling destructive writes in production.
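One way to expose the three tuning knobs as environment variables is sketched below. The variable names (`FSMA_CONCURRENCY_LIMIT` and so on) and the `BatchConfig` and `load_config` helpers are assumptions made for this example; the defaults mirror the pipeline function's own defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BatchConfig:
    concurrency_limit: int = 10
    batch_window_size: int = 500
    sub_batch_size: int = 100


def _positive_int(env: Mapping[str, str], name: str, default: int) -> int:
    """Read a positive integer from env, falling back to the default."""
    raw = env.get(name)
    if raw is None:
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError(f"{name} must be a positive integer, got {raw!r}")
    return value


def load_config(env: Mapping[str, str] = os.environ) -> BatchConfig:
    # Hypothetical variable names; adapt them to your deployment conventions.
    return BatchConfig(
        concurrency_limit=_positive_int(env, "FSMA_CONCURRENCY_LIMIT", 10),
        batch_window_size=_positive_int(env, "FSMA_BATCH_WINDOW_SIZE", 500),
        sub_batch_size=_positive_int(env, "FSMA_SUB_BATCH_SIZE", 100),
    )


config = load_config(
    {"FSMA_CONCURRENCY_LIMIT": "8", "FSMA_BATCH_WINDOW_SIZE": "1000"}
)
print(config)
```

Failing fast on a zero or non-numeric value is deliberate: a misconfigured concurrency limit is easier to diagnose at startup than halfway through an ingestion window.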

Frequently Asked Questions

Why bound concurrency with a semaphore instead of just running all requests concurrently?

An unbounded set of asyncio tasks opens one outbound connection per record, which trips supplier rate limits and earns HTTP 429s or an IP ban that halts ingestion entirely. The asyncio.Semaphore caps in-flight workers at a fixed limit tuned to the slowest supplier tier, and because a slow ledger commit holds its slot longer, the cap doubles as automatic backpressure. You get high throughput without ever exceeding what the upstream endpoints tolerate.

What happens to the valid records in a batch when one record fails validation?

They are committed. The processor uses a partial-commit contract: each record is validated independently, the valid subset is upserted to the traceability ledger, and only the malformed records are routed to the dead-letter queue with their raw payload and error. One bad timestamp never aborts the window, so downstream recall-readiness workflows do not stall on an upstream data anomaly.

How does the processor prevent duplicate ledger entries on a retry?

Every record derives a composite idempotency key from lot_code, facility_id, and shipping_timestamp before persistence. The ledger commit upserts on that key, so an at-least-once delivery from the queue, a network retry, or a duplicated supplier submission all resolve to a single entry. This is what keeps the traceability record count truthful under an at-least-once delivery guarantee.

Which fetch failures are retried, and which are not?

Only transient faults — 5xx responses, connection errors, and read timeouts — are wrapped in TransientFetchError and retried by tenacity with exponential backoff and jitter. A 4xx client error is raised immediately, because retrying a malformed or unauthorized request wastes supplier quota and delays the whole window. When the backoff budget is exhausted, the window is deferred to the next cycle rather than partially committed.

Why reject a timezone-naive shipping timestamp instead of assuming UTC?

Because guessing an offset can shift a shipping CTE onto the wrong calendar day, which corrupts recall scoping and one-up/one-back reconstruction. The field_validator rejects any naive shipping_timestamp at the boundary and quarantines the record, forcing the supplier to send an explicit offset. Under 21 CFR 1.1340 the event timestamp is load-bearing, so an ambiguous one is treated as invalid rather than silently coerced.
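A small worked example shows the day shift. The shipment time and the fixed UTC-08:00 offset are hypothetical values chosen only to illustrate the effect.

```python
from datetime import datetime, timedelta, timezone

# A supplier sends a naive local time late in the evening.
naive = datetime(2024, 3, 1, 23, 30)

# Hypothetical: the supplier actually meant a fixed UTC-08:00 offset.
supplier_offset = timezone(timedelta(hours=-8))

assumed_utc = naive.replace(tzinfo=timezone.utc)
actual_utc = naive.replace(tzinfo=supplier_offset).astimezone(timezone.utc)

print(assumed_utc.date().isoformat())  # 2024-03-01
print(actual_utc.date().isoformat())   # 2024-03-02
```

Assuming UTC places the shipping CTE on March 1, while the true instant falls on March 2 in UTC, so a recall query scoped by date would include or exclude the lot incorrectly.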

Conclusion

Async batch processing is the reliability backbone of an FSMA 204 ingestion pipeline. By aggregating supplier payloads into bounded windows, capping concurrency with a semaphore, retrying only transient faults through tenacity, enforcing the KDE contract with pydantic, committing idempotently on a composite key, and quarantining every ambiguous record instead of dropping it, the processor absorbs unpredictable supplier behavior without ever breaking lot code continuity. Treated as a deterministic compliance engine rather than a generic data mover, it delivers continuous audit readiness and a traceability ledger the FDA can reconstruct within the 24-hour window.

Python async batch sync for supplier APIs — the core orchestration walkthrough this engine extends
API Polling Strategies — stateful, rate-limit-aware fetching that feeds this batch stage
CSV/EDI Parser Setup — normalizing legacy flat files and EDI into the canonical payload shape
Schema Validation Rules — the deeper KDE enforcement this stage applies at the boundary
Error Handling Workflows — the retry, quarantine, and reconciliation path shared across ingestion

