Fixing Connection Pool Exhaustion and KDE Schema Drift in Async Supplier Syncs

High-volume asynchronous batch synchronization against supplier APIs routinely deadlocks when processing FSMA 204 Critical Tracking Events (CTEs). The failure rarely originates from raw network latency. Instead, it stems from unbounded concurrency colliding with inconsistent Key Data Element (KDE) schemas, expired pagination cursors, and silent connection pool leaks. When a supplier endpoint returns mixed ISO 8601 timestamp formats, integer-encoded lot codes, or truncated next_cursor payloads, an unbounded fan-out of asyncio tasks saturates the connection pool, triggers cascading aiohttp timeouts, and drops compliance-critical records before they reach the traceability ledger. This page reproduces both failure vectors against realistic KDE payloads, then shows the corrected Async Batch Processing implementation that keeps every CTE retrievable within the FDA’s 24-hour records-request window.

Root Cause Analysis

Pool exhaustion and schema drift look identical in the logs — both surface as a burst of dropped records — but they have distinct upstream triggers, and a durable fix must address both at once.

Unbounded concurrency saturates the connection pool. An asyncio.gather() fired across hundreds of lot endpoints starts every request at once. Once the number of in-flight requests exceeds the aiohttp.TCPConnector limit (default 100), every additional request queues on connection acquisition. Under sustained load the queue outlasts the client timeout, and aiohttp raises asyncio.TimeoutError while the request is still waiting for a connection rather than during the request itself. The first exception to escape gather() then tears down the shared session, cancelling the coroutines that were mid-flight and discarding whatever CTEs they had already fetched but not yet persisted.
Connection leaks starve the pool over time. A response body that is never fully read, or a ClientSession created per request instead of reused, holds a connection open until GC reclaims it. The pool slowly drains until the very first burst of traffic exhausts it. This is why a sync that ran fine in staging deadlocks in production after an hour.
Schema drift rejects valid-but-non-canonical records. Suppliers frequently mutate KDE payloads during Supplier Data Ingestion rollouts without updating their API documentation. A traceability_lot_code arrives as an integer instead of a string, an event_timestamp switches from a Z-suffixed UTC string to an epoch integer, or a quantity comes back null. A strict parser rejects the whole batch, and if the rejection path simply drops the batch, the one-up/one-back chain required by FSMA 204 develops a hole.

The compounding failure is the dangerous one: when the pool is saturated, the supplier gateway often returns a truncated JSON array during the throttled window. That partial payload then also fails schema validation, so a single incident registers as both a timeout spike and a pydantic.ValidationError spike. Unless you instrument for both signals independently, you cannot tell which lever to pull. The same non-canonical records that survive a saturated window are exactly the ones the Schema Validation Rules at the ingestion boundary will reject.
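
One way to keep the two signals apart is to classify each failure where it is caught and count it under its own label. The sketch below is a minimal, stdlib-only illustration: `classify_failure`, `record_failure`, `FAILURE_COUNTS`, and the label names are hypothetical and belong to neither aiohttp nor pydantic. It assumes pydantic v2, where `ValidationError` subclasses `ValueError`.

```python
import asyncio
from collections import Counter

# Hypothetical in-process counters; a real deployment would export these
# to whatever metrics backend it already uses.
FAILURE_COUNTS: Counter[str] = Counter()


def classify_failure(exc: BaseException) -> str:
    """Map an exception to the signal it should be counted under."""
    if isinstance(exc, asyncio.TimeoutError):
        return "pool_or_network_timeout"
    if isinstance(exc, ValueError):  # pydantic v2 ValidationError is a ValueError
        return "schema_drift"
    return "other"


def record_failure(exc: BaseException) -> str:
    """Count the failure under its own label and return that label."""
    label = classify_failure(exc)
    FAILURE_COUNTS[label] += 1
    return label
```

Note that `json.JSONDecodeError` is also a `ValueError`, so a truncated body would land under `schema_drift` here unless it is checked for first. That is exactly the compounding case described above, and it may deserve its own label.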

Minimal Reproducible Example

The snippet below reproduces both vectors. It fans out over many suppliers with an unbounded gather() and no explicit connector ceiling, and it validates with a naive model that performs no KDE normalization. Under load it exhausts the pool; on the first drifted payload it raises ValidationError and loses the batch.

import asyncio
from datetime import datetime

import aiohttp
from pydantic import BaseModel

# A realistic CTE payload the sync is trying to retrieve for each lot.
DRIFTED_PAGE = {
    "cte_records": [
        {
            "traceability_lot_code": "STRAWB-2026-0412-A",
            "event_timestamp": "2026-04-12T14:03:00Z",   # Z-suffixed UTC string
            "location_id": "GLN:0614141000005",
            "product_description": "Fresh strawberries, clamshell",
            "quantity": 240.0,
        },
        {
            "traceability_lot_code": 5583019,             # drift: integer, not string
            "event_timestamp": 1744466580,                # drift: epoch integer
            "location_id": "GLN:0614141000012",
            "product_description": "Fresh strawberries, clamshell",
            "quantity": 240.0,
        },
    ],
    "next_cursor": None,
}


class NaiveKDE(BaseModel):
    traceability_lot_code: str  # pydantic v2 does not coerce int -> str, so drift fails here
    event_timestamp: datetime   # no explicit normalization of drifted formats
    location_id: str
    product_description: str
    quantity: float


async def fetch(session: aiohttp.ClientSession, url: str) -> list[NaiveKDE]:
    async with session.get(url) as resp:
        data = await resp.json()
        # No KDE normalization: under pydantic v2 the drifted record raises
        # ValidationError on its integer lot code, and the exception aborts the
        # comprehension, dropping the valid record on the same page with it.
        return [NaiveKDE(**r) for r in data["cte_records"]]


async def main(supplier_urls: list[str]) -> None:
    async with aiohttp.ClientSession() as session:   # default 100-connection pool
        # Unbounded fan-out: 500 suppliers -> 500 concurrent requests competing
        # for 100 pooled connections; the overflow queues until it times out.
        await asyncio.gather(*(fetch(session, u) for u in supplier_urls))

Run this over a few hundred suppliers and two things happen: acquisition timeouts cancel in-flight fetches (pool exhaustion), and a single drifted record aborts an entire page of otherwise-valid CTEs (schema drift). Both leave audit gaps.

Fix Implementation

The fix pairs an explicit TCPConnector with an asyncio.Semaphore to bound concurrency, and a pydantic v2 model with a field_validator that normalizes drifted values before they enter the pipeline. Records that still fail validation are routed to a quarantine buffer rather than dropped. The KDE contract the validator enforces is summarized below, with the Subpart S section each field traces to:

KDE field	Type	Constraint	Regulatory Source
`traceability_lot_code`	`str`	3–64 chars, coerced from int	21 CFR 1.1320 (Subpart S)
`event_timestamp`	`datetime`	UTC, normalized from string or epoch	21 CFR 1.1340 / 1.1345 (Subpart S)
`location_id`	`str`	GLN, non-empty	21 CFR 1.1330 (Subpart S)
`product_description`	`str`	non-empty	21 CFR 1.1320 (Subpart S)
`quantity`	`float`	`>= 0`	21 CFR 1.1340 (Subpart S)

The code below implements the bounded, cursor-paginated flow. A slot is acquired before each page fetch and released before validation, so a socket is never held while pydantic runs. Each cursor page either extends the ledger on a clean batch or diverts non-canonical payloads to quarantine, and the loop continues until next_cursor is empty.

import asyncio
import logging
from datetime import datetime, timezone
from typing import Optional

from aiohttp import ClientSession, ClientTimeout, TCPConnector
from pydantic import BaseModel, Field, TypeAdapter, ValidationError, field_validator

logger = logging.getLogger("fsma204.sync.production")


class FSMA204KDE(BaseModel):
    """Strict schema for Critical Tracking Event Key Data Elements."""
    traceability_lot_code: str = Field(..., min_length=3, max_length=64)
    event_timestamp: datetime
    location_id: str = Field(..., min_length=1)
    product_description: str = Field(..., min_length=1)
    quantity: float = Field(..., ge=0)
    reference_document: Optional[str] = None

    @field_validator("traceability_lot_code", mode="before")
    @classmethod
    def coerce_lot_code(cls, v: str | int) -> str:
        # Suppliers sometimes emit numeric lot codes; coerce to the canonical string.
        return str(v) if isinstance(v, int) else v

    @field_validator("event_timestamp", mode="before")
    @classmethod
    def normalize_timestamp(cls, v: str | int | datetime) -> datetime:
        """Handle mixed ISO 8601 formats and epoch integers deterministically."""
        if isinstance(v, int):
            return datetime.fromtimestamp(v, tz=timezone.utc)   # epoch seconds -> UTC
        if isinstance(v, str):
            # Normalize a trailing "Z" so fromisoformat parses the offset.
            return datetime.fromisoformat(v.replace("Z", "+00:00"))
        return v


# Pre-compile the adapter once for high-throughput batch validation.
KDE_ADAPTER = TypeAdapter(list[FSMA204KDE])


async def bounded_supplier_sync(
    session: ClientSession,
    url: str,
    semaphore: asyncio.Semaphore,
) -> list[FSMA204KDE]:
    """Fetch, validate, and paginate one supplier's CTEs with strict bounds."""
    valid_records: list[FSMA204KDE] = []
    cursor: Optional[str] = None

    attempts = 0
    while True:
        params = {"limit": 100, "cursor": cursor} if cursor else {"limit": 100}
        try:
            async with semaphore:                  # bound in-flight requests
                async with session.get(url, params=params) as resp:
                    resp.raise_for_status()
                    raw_data = await resp.json()   # fully drain body -> release socket
        except asyncio.TimeoutError:
            attempts += 1
            if attempts >= 5:                      # illustrative cap: never retry forever
                raise
            logger.warning("Acquisition/timeout on %s, backing off", url)
            await asyncio.sleep(2 * attempts)      # back off after the slot is released
            continue
        attempts = 0                               # page fetched: reset the retry budget

        # Strict validation with deterministic quarantine routing.
        try:
            batch = KDE_ADAPTER.validate_python(raw_data.get("cte_records", []))
            valid_records.extend(batch)
        except ValidationError as exc:
            logger.error("KDE schema drift on %s: %s", url, exc)
            # Never drop silently: preserve the raw payload for reconciliation.
            await _route_to_quarantine(url, raw_data.get("cte_records", []), exc)

        cursor = raw_data.get("next_cursor")
        if not cursor:
            break

    return valid_records


async def run_all(
    supplier_urls: list[str], concurrency_limit: int = 15
) -> list[list[FSMA204KDE] | BaseException]:
    # One connector ceiling + one shared semaphore = a hard cap on sockets,
    # even when hundreds of suppliers are synced in the same event loop.
    connector = TCPConnector(limit=concurrency_limit)
    semaphore = asyncio.Semaphore(concurrency_limit)
    timeout = ClientTimeout(total=30, connect=5)
    async with ClientSession(connector=connector, timeout=timeout) as session:
        # return_exceptions=True: one failing supplier must not tear down the
        # shared session while the other suppliers are still mid-sync.
        return await asyncio.gather(
            *(bounded_supplier_sync(session, u, semaphore) for u in supplier_urls),
            return_exceptions=True,
        )


async def _route_to_quarantine(url: str, records: list[dict], exc: ValidationError) -> None:
    """Deterministic fallback for non-compliant payloads (idempotent write)."""
    logger.warning("Quarantining %d non-canonical records from %s", len(records), url)
    # Persist {url, records, exc.errors(), timestamp} to a dead-letter store
    # (S3 bucket, Redis stream, or a quarantine table) for manual reconciliation.

Three decisions in this code are compliance-relevant. First, the TCPConnector(limit=...) and the Semaphore are keyed to the same value, so the socket ceiling and the concurrency ceiling can never disagree and the pool cannot be over-committed. Second, await resp.json() fully drains the body inside the async with, so the connection returns to the pool instead of leaking. Third, the field_validator normalizes known drift before validation, so a page containing only drifted records never reaches quarantine. Because KDE_ADAPTER validates each page as a single list, a page with one genuinely broken record is quarantined whole, with its raw payload retained for audit and replay.
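
Because `KDE_ADAPTER.validate_python` validates a page as one list, a single broken record sends the whole page to quarantine. If that granularity is too coarse, a per-record pass is one option. The sketch below is a minimal illustration, not the pipeline's actual code: `partition_records` is a hypothetical helper, and it assumes the validator signals failure with `ValueError` (which pydantic v2's `ValidationError` subclasses).

```python
from typing import Any, Callable, TypeVar

T = TypeVar("T")


def partition_records(
    records: list[dict[str, Any]],
    validate: Callable[[dict[str, Any]], T],
) -> tuple[list[T], list[tuple[dict[str, Any], str]]]:
    """Validate records one at a time, separating clean ones from rejects."""
    valid: list[T] = []
    rejected: list[tuple[dict[str, Any], str]] = []
    for raw in records:
        try:
            valid.append(validate(raw))
        except ValueError as exc:
            # Keep the raw payload next to the error text for reconciliation.
            rejected.append((raw, str(exc)))
    return valid, rejected
```

With the model above, `validate` could be `FSMA204KDE.model_validate`: the valid list would extend the ledger, and only the rejected pairs would be handed to `_route_to_quarantine`.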

For suppliers whose endpoints degrade unpredictably, wrap bounded_supplier_sync in a lightweight circuit breaker so one unresponsive tenant cannot stall the whole pipeline:

import time
from typing import Any, Awaitable, Callable


class AsyncCircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.state = "CLOSED"

    async def call(self, func: Callable[..., Awaitable[Any]], *args: Any) -> Any:
        if self.state == "OPEN":
            # monotonic() is unaffected by wall-clock adjustments such as NTP steps.
            if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                self.state = "HALF_OPEN"      # allow a probe call through
            else:
                raise RuntimeError("Circuit OPEN: supplier endpoint unavailable")
        try:
            result = await func(*args)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            if self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
            raise
        # Any success closes the circuit and resets the consecutive-failure count.
        self.state = "CLOSED"
        self.failure_count = 0
        return result

When the circuit opens, route pending CTEs to a local persistence layer (SQLite WAL or a Redis stream) and schedule a retry window, guaranteeing at-least-once delivery even during prolonged supplier outages.
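
The local persistence layer can be sketched with the standard-library `sqlite3` module. This is a minimal illustration under stated assumptions: the `pending_cte` table, its columns, and the functions `open_spool`, `spool_pending`, and `load_pending` are all hypothetical, and pages are assumed to be identified by supplier URL plus page cursor.

```python
import json
import sqlite3


def open_spool(path: str) -> sqlite3.Connection:
    """Open (or create) the local spool in WAL mode."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending_cte ("
        " supplier_url TEXT NOT NULL,"
        " page_cursor TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " PRIMARY KEY (supplier_url, page_cursor))"
    )
    return conn


def spool_pending(
    conn: sqlite3.Connection, supplier_url: str, page_cursor: str, records: list[dict]
) -> None:
    """Idempotent write: re-spooling the same page is a no-op."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO pending_cte VALUES (?, ?, ?)",
            (supplier_url, page_cursor, json.dumps(records)),
        )


def load_pending(conn: sqlite3.Connection, supplier_url: str) -> list[tuple[str, list[dict]]]:
    """Return spooled pages for one supplier in insertion order."""
    rows = conn.execute(
        "SELECT page_cursor, payload FROM pending_cte"
        " WHERE supplier_url = ? ORDER BY rowid",
        (supplier_url,),
    ).fetchall()
    return [(cursor, json.loads(payload)) for cursor, payload in rows]
```

The composite primary key is what makes retries safe: a page spooled twice during a flapping outage is stored once, so the replay after the retry window cannot double-write it.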

Verification Steps

Confirm the fix along four axes before you trust it in production:

Pool is bounded. Attach an aiohttp.TraceConfig, increment a counter in on_request_start, decrement it in on_request_end and on_request_exception, and assert the peak never exceeds the connector limit under a 500-supplier load test. If the counter climbs past the ceiling, a code path is bypassing the semaphore.
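No leaked connections. After a full sync run, but before the session closes, inspect the connector's bookkeeping. These attributes are private and may change between aiohttp releases: connector._acquired holds connections that are currently checked out, while connector._conns holds idle keep-alive connections waiting for reuse. The acquired set must be empty once every supplier completes; idle entries in _conns are normal pooling. A non-empty acquired set after quiescence means a body was left undrained.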
Drift is normalized, not dropped. Feed the DRIFTED_PAGE fixture through KDE_ADAPTER.validate_python; the integer lot code and epoch timestamp must both coerce, yielding two valid FSMA204KDE objects and zero quarantine writes. Then feed a genuinely broken record (negative quantity) and assert exactly one quarantine write with the raw payload retained.
Ledger completeness. Run SELECT COUNT(*) on ingested CTEs and reconcile it against the supplier’s page manifest (sum(len(page.cte_records))). The counts must match, or the delta must equal the quarantine count — never an unexplained gap.
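
The ledger-completeness check reduces to one piece of arithmetic: manifest total minus ledger total minus quarantined total must be zero. A minimal sketch follows; the `Reconciliation` type and `reconcile` function are hypothetical names, and the three counts are assumed to come from the supplier manifest, the ledger query, and the quarantine store respectively.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reconciliation:
    manifest_total: int
    ledger_total: int
    quarantined_total: int

    @property
    def unexplained_gap(self) -> int:
        """Records the manifest promised that are in neither ledger nor quarantine."""
        return self.manifest_total - self.ledger_total - self.quarantined_total

    @property
    def is_complete(self) -> bool:
        return self.unexplained_gap == 0


def reconcile(page_sizes: list[int], ledger_total: int, quarantined_total: int) -> Reconciliation:
    """Sum the per-page record counts and compare them with what was stored."""
    return Reconciliation(sum(page_sizes), ledger_total, quarantined_total)
```

The quarantine figure must be counted in records, not in quarantine writes, because a single write can carry a whole page.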

Once the primary deadlock is resolved, three adjacent failure modes deserve a check:

Expired cursor mid-pagination. A long sync can outlive the supplier’s cursor TTL, so get(cursor=...) returns 410 Gone on page 40. Catch that status, restart pagination from the last persisted checkpoint rather than from the beginning, and dedupe replayed pages so they do not double-write. Key the dedupe on the full event identity (for example traceability_lot_code plus event_timestamp and location_id) rather than on the lot code alone, because one lot legitimately carries several CTEs. The retry mechanics live in Error Handling Workflows.
Silent throttling on a shared gateway. A single Semaphore protects one endpoint, but a gateway fronting many suppliers enforces an aggregate quota. Key a separate semaphore and breaker per supplier tenant, and pair this with the API Polling Strategies cadence so parallel tenants do not collectively breach the limit.
Timezone-naive timestamps that pass validation. A supplier that emits 2026-04-12T14:03:00 with no offset parses into a naive datetime that silently corrupts the chain-of-custody ordering. Reject naive timestamps explicitly, or assume the supplier’s documented local zone and convert to UTC before persistence.
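
Rejecting naive timestamps takes only a few lines. The sketch below is one possible guard, assuming rejection rather than local-zone inference is the chosen policy; `require_aware_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def require_aware_utc(value: datetime) -> datetime:
    """Reject naive datetimes; convert aware ones to UTC."""
    if value.tzinfo is None or value.utcoffset() is None:
        raise ValueError(f"naive timestamp rejected: {value.isoformat()}")
    return value.astimezone(timezone.utc)
```

One place to call it would be on the value that `normalize_timestamp` produces for string input, so that an offset-less string fails validation and is quarantined instead of entering the ledger with an ambiguous ordering.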

FSMA 204 requires records be maintained in a sortable, machine-readable format with precise timestamps and lot-level granularity, so a deadlock that orphans a CTE directly breaks chain-of-custody verification. The architecture described here delivers three things: KDE typing is enforced against the Subpart S contract before ingestion; pool exhaustion, schema drift, and breaker transitions are captured in structured logs that satisfy the requirement for verifiable, timestamped system activity; and non-canonical payloads are quarantined with full raw retention rather than discarded. Before replaying any quarantined records, reconcile their fields against the KDE Field Mapping Guide, and confirm the quarantine store honors the tenancy limits described in Security Boundaries for Trace Data. For authoritative KDE requirements, consult the FDA’s final rule on food traceability.

Frequently Asked Questions

Why does my sync deadlock in production but pass in staging?

Staging rarely reproduces the concurrency and duration that leak connections. A body left undrained, or a ClientSession created per request, holds a socket open until the garbage collector reclaims it, so the pool drains slowly and only the first sustained traffic burst exhausts it. Reuse one ClientSession, fully await resp.json() inside the async with, and assert that no connections remain checked out of the connector after a run.

Should the TCPConnector limit and the Semaphore value be the same?

Yes. Keying TCPConnector(limit=n) and asyncio.Semaphore(n) to the same value means the socket ceiling and the concurrency ceiling can never disagree. If the semaphore is larger than the connector limit, excess coroutines block on connection acquisition and can time out; if it is smaller, you leave sockets idle. Fifteen is a safe starting point for most ERP and WMS endpoints — raise it only after a load test shows the pool never saturates.
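
The effect of the shared ceiling can be checked without any network at all. The sketch below is a stdlib-only simulation, not aiohttp code: `measure_peak` is a hypothetical helper, and `asyncio.sleep` stands in for the network round trip.

```python
import asyncio


async def measure_peak(task_count: int, limit: int) -> int:
    """Run task_count fake requests under a semaphore and return the peak in flight."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_request() -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the network round trip
            in_flight -= 1

    await asyncio.gather(*(fake_request() for _ in range(task_count)))
    return peak
```

Calling `asyncio.run(measure_peak(500, 15))` models the 500-supplier fan-out: however many coroutines are scheduled, the number inside the guarded section never exceeds the semaphore value.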

Should I retry a payload that failed KDE schema validation?

No. A payload that fails validation is not a transport failure, so retrying produces the same invalid record. Normalize known drift (integer lot codes, epoch timestamps) in a field_validator first; anything that still fails is structurally broken and is routed to quarantine with the raw payload, the pydantic error paths, and a timestamp, preserving the audit trail for reconciliation.

Which 21 CFR Part 1 subpart makes async sync reliability a compliance issue?

Subpart S. Because § 1.1340 (shipping) and § 1.1345 (receiving) require complete, timestamped KDEs, and the rule mandates producing traceability records within 24 hours of an FDA request, any CTE dropped by a pool deadlock or a mishandled schema-drift exception is a compliance gap, not just an ingestion bug.

Async Batch Processing — the bounded worker pool this sync feeds and the retry/quarantine patterns it shares.
API Polling Strategies — the cursor-based delta polling cadence that keeps per-tenant concurrency within quota.
Schema Validation Rules — the KDE contract that drifted and truncated payloads fail at the ingestion boundary.
Error Handling Workflows — jittered retries, dead-letter routing, and expired-cursor recovery across the pipeline.
KDE Field Mapping Guide — map quarantined fields back to canonical KDEs before replay.
