SCADABLE

How we generate Python device classes from datasheets (and the failure modes we still see)

A walk-through of SCADABLE's pipeline that turns datasheet PDFs into runtime-verified Python device drivers, including the LLM mistakes we still see in production.


A hardware engineer at our first customer, Corvita Biomedical, dropped a 47-page datasheet PDF for a neonatal flow sensor into our portal at 2 AM. By the time their team logged in the next morning, the device was streaming validated readings into their staging fleet. Three production-blocking issues had been caught and patched by the runtime before the driver was promoted out of sandbox.

I want to walk through how that pipeline works, because the only way you can evaluate whether AI-generated integration code belongs anywhere near a medical device is by looking at the eval harness, not the demo. The interesting work at SCADABLE is not the LLM call. It is everything that wraps it.

The shape of the problem

A typical hardware team integrating a new sensor spends one to three weeks per device. Most of that is mechanical: read the datasheet, find the register map, write a Modbus or I2C or BLE GATT client, decode byte order, write tests against a bench unit, ship to the gateway. Multiply by the 30 to 200 device types in a real fleet and you understand why most plants run on three-year-old firmware.

The LLM-shaped temptation is obvious. Datasheets are structured (sort of). Drivers are repetitive (sort of). Surely a model can read the PDF and emit the class. And it can, right up until you put it on a device that has a 12-bit signed two's-complement temperature value packed into the upper 12 bits of a 16-bit holding register, the AI helpfully decodes it as if it were unsigned, and your incubator alarm triggers at the wrong setpoint.
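As a concrete sketch of that trap (assuming the 12-bit field occupies the upper 12 bits, with the low nibble reserved; the register layout is illustrative, not Corvita's):

```python
# Hypothetical layout: a 12-bit signed two's-complement value stored in
# the upper 12 bits of a 16-bit holding register, low nibble reserved.
def decode_temp_signed_12bit(raw: int) -> int:
    """Correct decode: shift out the reserved nibble, then sign-extend."""
    value = (raw >> 4) & 0xFFF          # isolate the 12-bit field
    if value & 0x800:                   # sign bit of a 12-bit field
        value -= 0x1000                 # two's-complement sign extension
    return value

def decode_temp_wrong(raw: int) -> int:
    """The failure mode: treat the same field as unsigned."""
    return (raw >> 4) & 0xFFF

raw = 0xF800                            # encodes -128 in the 12-bit field
decode_temp_signed_12bit(raw)  # -128
decode_temp_wrong(raw)         # 3968 -- a wildly wrong setpoint
```

Both decoders read the same bits; only the sign handling differs, which is exactly why the bug survives a casual code review.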

Stage 1: parsing the datasheet

The first surprise when you start ingesting datasheets at volume is how varied they are. Out of the first ~400 we processed:

  • About 35% have clean, machine-readable register tables embedded as PDF text.
  • About 40% have register tables rendered as raster images (scanned vendor docs, or PDFs exported from old CAD tools).
  • About 15% have register layouts split across pages with footnotes that meaningfully change the encoding.
  • About 10% have no register layout at all; you derive the protocol from prose like "the device responds to function code 03 with a single 16-bit word representing flow in tenths of a liter per minute."

So the first stage is not "feed the PDF to an LLM." It is a preprocessing pipeline:

class DatasheetIngestPipeline:
    def __init__(self):
        self.text_extractor = PdfPlumberExtractor()
        self.table_extractor = CamelotTableExtractor(flavor="lattice")
        self.ocr_fallback = TesseractOcrExtractor(lang="eng")
        self.layout_classifier = LayoutClassifier()  # in-house
 
    def ingest(self, pdf_path: str) -> NormalizedDatasheet:
        raw = self.text_extractor.extract(pdf_path)
        layout = self.layout_classifier.classify(raw)
 
        if layout.needs_ocr:
            raw = self.ocr_fallback.extract(pdf_path)
 
        tables = self.table_extractor.extract(pdf_path)
        if not tables and layout.likely_has_register_map:
            tables = self._extract_tables_from_ocr(raw, layout)
 
        return NormalizedDatasheet(
            text=raw.text,
            tables=tables,
            page_layout=layout,
            source_pdf_hash=sha256_of(pdf_path),
        )

pdfplumber handles the easy 35%. Camelot with flavor="lattice" handles register tables with visible borders (flavor="stream" is the fallback for borderless tables, with more false positives). When the layout classifier flags a page as scanned raster, we route through Tesseract with a custom dictionary trained on register names, function codes, and unit strings. We hash the source PDF so that when a vendor reissues a datasheet, the driver rebuilds automatically.
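The lattice-then-stream routing reduces to a small fallback rule. A minimal sketch, with the extractors injected as callables so the logic reads without Camelot installed (with the real library these would be `lambda p: camelot.read_pdf(p, flavor="lattice")` and the `"stream"` equivalent):

```python
# Sketch of the bordered-first table extraction strategy. Extractor
# callables are an assumption for illustration; only the routing is real.
def extract_register_tables(pdf_path, lattice_extract, stream_extract):
    """Prefer lattice (bordered) extraction; fall back to stream mode,
    which finds borderless tables but yields more false positives."""
    tables = lattice_extract(pdf_path)
    if tables:
        return tables, "lattice"
    return stream_extract(pdf_path), "stream"
```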

Output of this stage is a NormalizedDatasheet: text blocks, recovered tables, layout metadata. No LLM has run yet.

Stage 2: structured extraction before code generation

A common AI-firmware demo skips straight from PDF to code. That is where most of the embarrassing failures come from. The model has to do two things at once (understand the device, write Python), and it bungles one of the two roughly half the time.

So we split it. The first LLM pass produces a structured DeviceSpec JSON object. No code, just shape:

# Pseudo-prompt structure
system: |
  You convert sensor datasheets into a strict DeviceSpec JSON object.
  Output MUST validate against the DeviceSpec JSON Schema (provided).
  If a field cannot be determined from the datasheet, set it to null
  and add an entry to "uncertainties" explaining what is missing.
  Never guess byte order. Never guess polling intervals.
 
few_shot:
  - input: <Modbus flow sensor datasheet excerpt>
    output: <validated DeviceSpec JSON>
  - input: <I2C temperature sensor datasheet excerpt>
    output: <validated DeviceSpec JSON>
  - input: <BLE pulse oximeter GATT profile excerpt>
    output: <validated DeviceSpec JSON>
 
user: |
  <NormalizedDatasheet payload, with table cells and text blocks>
 
response_format:
  type: json_schema
  schema: <DeviceSpec schema, ~140 fields>

We use OpenAI's structured outputs, passing a strict JSON Schema as response_format, so the model cannot return malformed JSON. The schema enforces things like byte_order: "big" | "little" | null, register_type: "holding" | "input" | "coil" | "discrete", signed: boolean, scale_factor: number | null. Every nullable field has a sibling uncertainties array, and the prompt says (in capitals, multiple times) that the model must use null and flag uncertainty rather than guessing.
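A representative fragment of what such a schema looks like (field names here are illustrative, not the actual DeviceSpec). Strict mode requires every property to appear in `required` and `additionalProperties: false`, which is what prevents the model from omitting or inventing fields:

```python
# Hypothetical per-register fragment of a DeviceSpec-style JSON Schema.
REGISTER_SPEC_SCHEMA = {
    "type": "object",
    "properties": {
        "address": {"type": "integer"},
        "register_type": {"enum": ["holding", "input", "coil", "discrete"]},
        "byte_order": {"enum": ["big", "little", None]},  # null = unknown
        "signed": {"type": "boolean"},
        "scale_factor": {"type": ["number", "null"]},
        "uncertainties": {"type": "array", "items": {"type": "string"}},
    },
    # Strict mode: all properties required, no extras allowed.
    "required": ["address", "register_type", "byte_order",
                 "signed", "scale_factor", "uncertainties"],
    "additionalProperties": False,
}
```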

Splitting the pass makes uncertainty a first-class signal. A DeviceSpec with three uncertainty entries goes to a human-in-the-loop queue. One with zero goes straight to code generation.
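The triage rule itself is tiny. A sketch, assuming a spec shape where each register carries its own uncertainties list:

```python
# Hypothetical triage: any unresolved uncertainty anywhere in the spec
# diverts the device to human review before code generation runs.
def triage(spec: dict) -> str:
    uncertainties = [
        u
        for reg in spec.get("registers", [])
        for u in reg.get("uncertainties", [])
    ]
    return "human_review" if uncertainties else "codegen"
```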

Stage 3: code generation against a typed runtime

Once DeviceSpec is validated, the second LLM pass writes Python. The prompt is narrow: given this validated spec, produce a class that extends ModbusDevice, I2CDevice, or BLEDevice from the SCADABLE runtime base classes. The runtime gives us leverage here, because most of the dangerous primitives (byte unpacking, scaling, retries, polling cadence) are already implemented and typed.

from scadable.runtime import ModbusDevice, RegisterMap, register, Reading
from scadable.runtime.types import UInt16BE, Int16BE, Scaled
 
class CorvitaFlowSensorV2(ModbusDevice):
    """Corvita Biomedical neonatal flow sensor, generated 2026-05-01.
 
    Source: corvita_flow_v2_rev_C.pdf (sha256: a3f1c2...)
    DeviceSpec uncertainties: 0
    """
 
    unit_id = 17
    poll_interval_seconds = 1.0  # from datasheet "recommended polling": 1 Hz
 
    registers = RegisterMap(
        flow_rate_lpm=register(
            address=0x0010,
            kind="holding",
            decode=Scaled(Int16BE, factor=0.1),
            unit="L/min",
            valid_range=(0.0, 25.0),
        ),
        temperature_c=register(
            address=0x0012,
            kind="holding",
            decode=Scaled(Int16BE, factor=0.01),
            unit="degC",
            valid_range=(-10.0, 60.0),
        ),
        status_flags=register(
            address=0x0020,
            kind="holding",
            decode=UInt16BE,
            bitfield={
                "occlusion": 0,
                "low_battery": 1,
                "calibration_due": 4,
            },
        ),
    )
 
    async def read(self) -> Reading:
        flow = await self.registers.flow_rate_lpm.read()
        temp = await self.registers.temperature_c.read()
        flags = await self.registers.status_flags.read()
        return Reading(
            flow_lpm=flow,
            temperature_c=temp,
            occlusion=flags["occlusion"],
            low_battery=flags["low_battery"],
        )

The runtime base class enforces the contract. RegisterMap is typed; decoder primitives (Int16BE, UInt16BE, Scaled, BitField) are unit-tested in isolation and cannot be confused at runtime. valid_range is mandatory and rejects readings outside the band, which catches scaling errors. The AI's job is reduced to filling in addresses, types, and ranges from the spec, not implementing the protocol.
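The decoder primitives compose roughly like this. A minimal sketch, not the actual scadable.runtime implementation, showing why a scaling error cannot survive the declared range:

```python
# Simplified stand-ins for Int16BE, Scaled, and valid_range enforcement.
def int16_be(high_byte: int, low_byte: int) -> int:
    """Big-endian signed 16-bit decode with two's-complement handling."""
    value = (high_byte << 8) | low_byte
    return value - 0x10000 if value & 0x8000 else value

def scaled(decode, factor):
    """Wrap any decoder with a fixed scale factor."""
    return lambda *raw: decode(*raw) * factor

def check_range(value, valid_range):
    """Reject readings outside the declared band, as the runtime does."""
    lo, hi = valid_range
    if not (lo <= value <= hi):
        raise ValueError(f"reading {value} outside declared range {valid_range}")
    return value

flow_decoder = scaled(int16_be, 0.1)
check_range(flow_decoder(0x00, 0xFA), (0.0, 25.0))  # 0x00FA -> 25.0 L/min
```

A driver that mistakenly used factor 1.0 would decode 0x00FA to 250.0, land outside (0.0, 25.0), and be rejected at the first reading.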

Stage 4: the verification layer

This is where the vaporware test happens. Generated code goes through four gates before it touches a real device:

  1. Syntactic and type validation. The class compiles with py_compile, then runs through mypy --strict against the runtime stubs. Any unresolved type triggers regeneration with the type error fed back into the prompt.

  2. Schema cross-check. Every field in the generated class is matched against the DeviceSpec. If the spec says flow_rate_lpm is at 0x0010 with scale 0.1 and the code emits 0x0011 or 1.0, the gate fails. The LLM is not the source of truth; the validated spec is. This catches the case where the model "improves" on its own spec mid-generation.

  3. Behavioral simulation. We run the driver against an in-process emulator that replays known register fixtures. For Modbus we use a fork of pymodbus's test server with fixture registers preloaded; for BLE we use a fake GATT profile. If the datasheet says register 0x0010 returns 0x00FA and the value should be 25.0 L/min, that is the assertion. Any off-by-byte-order, off-by-scale, or off-by-sign trips here.

  4. Regression against the validated driver library. We maintain a corpus of ~180 hand-validated drivers. New drivers structurally similar to existing ones (same chipset, same protocol family) must pass the existing driver's regression suite, with field renames mapped through the spec. This catches the case where the AI pattern-matches to the wrong reference driver and silently inherits a bug from a different device.

Only after all four gates pass is the driver eligible to run on a real device.
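Gate 2 is the simplest to show. A sketch of the spec-versus-code comparison, with field names illustrative:

```python
# Hypothetical gate-2 cross-check: every register the generated class
# declares must match the validated DeviceSpec exactly.
def cross_check(spec_registers: dict, class_registers: dict) -> list[str]:
    """Return a list of mismatches; an empty list means the gate passes."""
    failures = []
    for name, spec in spec_registers.items():
        emitted = class_registers.get(name)
        if emitted is None:
            failures.append(f"{name}: missing from generated class")
            continue
        for field in ("address", "scale_factor"):
            if emitted[field] != spec[field]:
                failures.append(
                    f"{name}.{field}: spec={spec[field]!r} code={emitted[field]!r}")
    return failures
```

The spec saying 0x0010 with scale 0.1 while the code emits 0x0011 produces exactly one failure entry, which is fed back into regeneration.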

Stage 5: production gate

The driver runs first in a sandbox: a single, isolated, instrumented gateway with one real unit of the device wired up. Every reading is mirrored to observability. We watch for:

  • Range violations (readings outside the valid_range declared in the class). One violation in a 30-minute window pauses the driver and pages a human.
  • Variance anomalies (readings whose distribution differs significantly from the validated driver corpus for similar devices).
  • Polling-induced device misbehavior (some devices return zeros if polled too fast; the runtime detects flatlines).
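The flatline detector reduces to a window check over recent readings. A sketch, with the window size as an assumption (the production threshold is tuned per device class):

```python
# Hypothetical flatline detector: flag a run of identical trailing
# readings, e.g. a device returning zeros because it was over-polled.
def is_flatlined(readings: list[float], window: int = 30) -> bool:
    if len(readings) < window:
        return False  # not enough data to judge
    tail = readings[-window:]
    return all(r == tail[0] for r in tail)
```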

Sandbox burn-in is 24 hours minimum. After that, the driver can be promoted to a customer's staging fleet, then production, with feature flags at every stage. A driver that has never seen a production fleet looks identical from the customer's API surface to one that has run for a year, but the runtime knows the difference and the rollout policy enforces it.

The AI is the demo. The runtime is the product. Everything that makes the generated code safe lives in the verification layer, the typed runtime base classes, and the rollout policy. The model is the cheapest, most replaceable component in the stack.

The failure modes that still happen

If you have read this far you deserve the honest part. Here are the failure modes our verification layer still catches, in rough order of frequency, after roughly 1,200 driver generations:

1. Endianness errors on multi-byte registers. The most common LLM mistake by a wide margin. Big-endian and little-endian Modbus implementations look almost identical in prose. The model reads "the flow rate is reported as a 32-bit floating-point value across registers 0x0010 and 0x0011" and picks the wrong word order roughly one time in five. Behavioral simulation catches this every time, because the fixture bytes do not decode to a sensible number under the wrong endianness. The fix is automatic regeneration with the failed assertion fed back into the prompt; the second pass usually corrects it. We have seen models generate the same wrong endianness three times in a row on truly ambiguous datasheets, which is when the human-in-the-loop queue triggers.
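The ambiguity in concrete terms: a 32-bit float spread across two 16-bit registers can be stored word-order big-endian or word-swapped, and the fixture bytes only decode sensibly one way. A minimal sketch:

```python
import struct

# 25.0 as an IEEE-754 big-endian float32 is the word pair (0x41C8, 0x0000).
def decode_float32(regs: tuple[int, int], word_order: str) -> float:
    """Decode two 16-bit Modbus registers as a float32."""
    hi, lo = regs if word_order == "big" else (regs[1], regs[0])
    return struct.unpack(">f", struct.pack(">HH", hi, lo))[0]

regs = (0x41C8, 0x0000)
decode_float32(regs, "big")      # 25.0
decode_float32(regs, "swapped")  # a denormal near zero -- nonsense
```

This is why the behavioral gate is so effective against endianness mistakes: the wrong word order does not produce a plausible-but-shifted value, it produces garbage outside any valid_range.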

2. Bit-field decoding with non-standard bit numbering. Some datasheets number bits left-to-right (MSB is bit 0), some right-to-left (LSB is bit 0). The model defaults to LSB-first because that is the modal convention in its training data, and it gets bit positions wrong about 8% of the time on MSB-first devices. The schema cross-check catches obvious cases (bit position out of range); subtle cases (bit 3 vs bit 4) only show up in behavioral simulation.
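The correction is a single subtraction once the convention is known. A sketch for a 16-bit register:

```python
# MSB-first ("msb0") datasheets number bits left-to-right, so their
# "bit 0" is the most significant bit of the word.
def bit_position(datasheet_bit: int, convention: str, width: int = 16) -> int:
    """Translate a datasheet bit number into an LSB-first position."""
    if convention == "msb0":
        return width - 1 - datasheet_bit
    return datasheet_bit  # "lsb0" already matches the runtime's convention

def read_flag(raw: int, datasheet_bit: int, convention: str) -> bool:
    return bool(raw & (1 << bit_position(datasheet_bit, convention)))

# 0x8000 has only the most significant bit set:
read_flag(0x8000, 0, "msb0")  # True  -- MSB-first bit 0 is the top bit
read_flag(0x8000, 0, "lsb0")  # False -- LSB-first bit 0 is the bottom bit
```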

3. Coil registers confused with discrete inputs. In Modbus, coils (function code 01/05) are read-write, discrete inputs (function code 02) are read-only. Both are single bits. About 4% of generated drivers conflate them, producing a write attempt against a read-only address that the device rejects silently. The fix was extending the spec schema to require both the function code and the register kind, with cross-validation between them.
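The cross-validation the schema fix added can be sketched as a consistency check between register kind, read function code, and writability (the mapping follows the Modbus spec; the function and error strings are illustrative):

```python
# Modbus read function codes: coils=01, discrete inputs=02,
# holding registers=03, input registers=04.
READ_FC = {"coil": 1, "discrete": 2, "holding": 3, "input": 4}
WRITABLE = {"coil", "holding"}

def validate_register(kind: str, function_code: int, writes: bool) -> list[str]:
    """Return mismatches between declared kind, FC, and write intent."""
    errors = []
    if READ_FC[kind] != function_code:
        errors.append(
            f"kind {kind!r} implies read FC {READ_FC[kind]}, spec says {function_code}")
    if writes and kind not in WRITABLE:
        errors.append(f"kind {kind!r} is read-only; driver must not write it")
    return errors
```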

4. Polling intervals derived from the wrong field. Datasheets routinely list a "maximum sample rate" alongside a "recommended polling interval," and these are not the same number. The maximum sample rate is what the device's ADC can do internally; the recommended polling interval is what the bus and MCU can sustain without dropping responses. The model picks the faster number about 15% of the time, which causes the device to brown out under sustained polling. The runtime catches this in sandbox burn-in via flatline detection, but it wastes a full burn-in cycle. We now extract both fields explicitly into the spec and enforce that the runtime uses the conservative one.
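The enforcement rule is a one-liner once both fields are in the spec. A sketch of the conservative choice:

```python
# Never poll faster than the recommended interval, even when the ADC's
# maximum sample rate would allow it.
def poll_interval_seconds(max_sample_rate_hz: float,
                          recommended_interval_s: float) -> float:
    """Pick the slower (larger) of the two candidate intervals."""
    return max(1.0 / max_sample_rate_hz, recommended_interval_s)

poll_interval_seconds(max_sample_rate_hz=100.0, recommended_interval_s=1.0)  # 1.0
```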

5. Multi-mode devices. Some sensors are configurable into different physical modalities. The same hardware can act as a flow meter, a differential pressure sensor, or a temperature probe depending on a configuration register. The AI picks one mode (usually the most prominent in the datasheet) and emits a driver for that mode, missing the configuration logic entirely. This failure mode scares us most: the driver works perfectly for devices configured in the AI's chosen mode and silently produces nonsense for any other. We now run a dedicated "mode discovery" pass on the spec that enumerates configuration registers explicitly, and the generated class is required to read the configuration register at startup and refuse to run if the device is in an unsupported mode.
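The startup mode gate reduces to reading one register and refusing to proceed on a mismatch. A sketch with a hypothetical configuration register address and mode values:

```python
# Hypothetical mode gate: the register address (0x0001) and mode value
# (0x01 = flow-meter mode) are illustrative, not from any real device.
class UnsupportedModeError(RuntimeError):
    pass

SUPPORTED_MODES = {0x01}  # this driver implements flow-meter mode only

def check_mode(read_register) -> int:
    """Call at startup with a function that reads the mode register."""
    mode = read_register(0x0001)
    if mode not in SUPPORTED_MODES:
        raise UnsupportedModeError(
            f"device mode 0x{mode:02X} not supported; refusing to stream")
    return mode
```

Failing loudly at startup is the point: a driver that streams nothing is recoverable, a driver that streams the wrong physical quantity is not.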

There are smaller residual failure modes (CRC variants on RS-485, scaling with non-decimal factors, devices with mandatory wake-up sequences), but the five above account for roughly 90% of verification gate failures.

None of these failures reach a customer fleet. They all surface in stages 4 or 5, before the driver is promoted out of sandbox. The verification layer is not optional polish on top of the AI. It is the product. Strip it away and what you have is a generator of plausible-looking firmware that will eventually be wrong about an incubator's alarm setpoint.

Why this is shippable

The reason this pipeline works for Corvita and not just for internal demos is that the runtime carries most of the safety surface. The generated class is small, typed, and declares its valid ranges; the runtime rejects readings that violate them. Everything dangerous about a Modbus client (byte unpacking, retries, timeouts, reconnection, presence signaling, observability) lives in the base class, not in code the model wrote. The blast radius of a model mistake is bounded by the runtime, not by the LLM's worst day.

That is the bet. Make the generated surface small and typed, instrument every reading, reject anything outside declared bounds, treat every promotion from sandbox to fleet as a graded rollout with telemetry. Inside that envelope, the AI can be wrong, and you find out before a customer does.

If you're evaluating whether AI-generated integration code is production-safe enough for your devices, we run 30-min architecture reviews where we walk you through exactly how this works on your specific hardware. Book at https://cal.com/rahbaral/quick-chat