The Verification Layer: How to Trust AI-Generated Integration Code in Production
The bug isn't that LLMs hallucinate. It's deploying their output without a verification layer. Here are the six concrete checks that catch AI mistakes before they reach a regulated fleet.
The first time a hardware CTO hears "AI generates the integration code," the response is almost always the same sentence: "the LLM will hallucinate and brick our fleet."
That objection is correct. Raw LLM output is not safe to deploy onto a fleet of medical devices, industrial controllers, or anything else where a wrong register address means a wrong dose, a wrong setpoint, or a wrong shutoff. We have seen GPT-class models confidently invert a Modbus byte order, write a CRC routine that passes its own unit test but fails on the actual wire, and confuse Celsius with Kelvin in the same function. None of those are theoretical. We have caught all three in our own pipeline in the last six months.
The bug is not the AI. The bug is shipping AI output without a verification layer.
This is the part of SCADABLE that matters more than the model. The AI side is interchangeable: swap Claude for GPT for a fine-tuned Llama and the value proposition does not change. The verification layer is what makes AI-generated code safe to put on a neonatal monitor or a compressor station. It is a six-stage pipeline, each stage catching a different class of failure, each producing the kind of evidence a regulated-industry audit needs to see.
Here is what is actually in that pipeline, and which compliance requirement each stage satisfies.
Stage 1: Syntactic and type validation
The cheapest stage, and the one most teams stop at. The generated driver has to parse, compile, and pass a strict type check against the SCADABLE driver SDK. We use Rust on the gateway side, which makes this stage do more work than it would in a dynamically typed pipeline: a missing field, a mismatched register width, or a return type that does not match the trait signature all fail at compile time.
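To make the compile-time claim concrete, here is a minimal sketch of the shape such a trait can take. Every name in it (Driver, HoldingRegister, Celsius) is illustrative, not the actual SDK surface:

```rust
// Illustrative sketch only: these names are hypothetical, not the
// actual SCADABLE driver SDK.
use std::time::Duration;

/// Register addresses are a distinct type, so a driver that invents
/// a field or hands back a bare integer fails to typecheck.
#[derive(Clone, Copy, Debug)]
pub struct HoldingRegister(pub u16);

/// Readings carry their unit in the type system; returning a bare
/// f64 (or a Kelvin value) does not satisfy the trait.
#[derive(Clone, Copy, Debug)]
pub struct Celsius(pub f64);

pub trait Driver {
    type Error: std::error::Error;

    /// Poll one temperature reading in degrees C.
    fn read_temperature(&mut self) -> Result<Celsius, Self::Error>;

    /// The register this driver claims to read, checked against
    /// the canonical device model.
    fn temperature_register(&self) -> HoldingRegister;

    /// Declared poll interval.
    fn poll_interval(&self) -> Duration;
}
```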
What this catches: typos, schema drift between the device manual and the canonical model, AI inventing fields that do not exist, AI returning the wrong concrete type for a Driver trait method.
What this misses: anything semantically wrong. A driver that reads the wrong register but reads it correctly will pass every time.
Stage 2: Datasheet round-trip validation
Once the driver compiles, we run the generated code back through a structured extractor and compare its claims against the source datasheet. If the AI wrote register: 0x0042, scale: 0.1, unit: "celsius" for a temperature reading, we re-read the datasheet section that produced that mapping and confirm the address, the scale factor, and the unit all match.
This is where most "the AI hallucinated a register address" failures get caught. The model has a strong prior toward plausible-looking hex addresses, and a datasheet rarely contradicts an address that sounds right. The round-trip forces the address to come from a specific page, table, and row. If the table does not contain that row, the driver fails this stage.
The unit-confusion class (Celsius vs Kelvin, PSI vs bar, RPM vs Hz) shows up here too. Every numeric field must declare its unit and conversion factor, and we verify both against the source.
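A minimal sketch of that round-trip comparison, with hypothetical FieldClaim and DatasheetRow types standing in for the real extractor output:

```rust
/// What the generated driver claims about one field.
#[derive(Debug, PartialEq)]
struct FieldClaim {
    register: u16,      // e.g. 0x0042
    scale: f64,         // e.g. 0.1
    unit: &'static str, // e.g. "celsius"
}

/// One row of the register table as re-extracted from the datasheet,
/// with provenance back to a specific page, table, and row.
struct DatasheetRow {
    page: u32,
    table: u32,
    row: u32,
    claim: FieldClaim,
}

/// The claim must match a specific extracted row exactly; a
/// plausible-looking address with no source row is a failure.
fn round_trip(claim: &FieldClaim, rows: &[DatasheetRow]) -> Result<(), String> {
    rows.iter()
        .find(|r| r.claim == *claim)
        .map(|_| ())
        .ok_or_else(|| format!("no datasheet row supports claim {claim:?}"))
}
```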
What this catches: wrong addresses, wrong scale factors, wrong units, wrong endianness when the datasheet specifies it.
What this misses: errors in the datasheet itself, ambiguous datasheets, behavioral logic that no datasheet specifies.
Stage 3: Synthetic device simulation
Stage three runs the generated driver against a synthetic emulator. We maintain a library of device emulators (Modbus slaves, BLE peripherals, OPC-UA servers, BACnet endpoints) that respond with known values across the device's documented operating range. The driver has to read those values back correctly, in the right units, with the right semantics.
The emulator is seeded with both nominal data (a temperature ramp from 20 to 40 degrees C, a pressure cycle, a normal heartbeat pattern) and labeled known-good responses. If the driver reports 104 when the emulator is sending 40 C, the bug is a stray Fahrenheit conversion in the driver (40 C is 104 F), not a fault in the emulator. The pipeline rejects it.
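Here is roughly what one nominal-range check looks like as a test. ModbusEmulator and GeneratedDriver are hypothetical stand-ins for the emulator library and the candidate driver, and the register layout reuses the stage-two example:

```rust
#[test]
fn driver_tracks_nominal_temperature_ramp() {
    // Hypothetical harness types, not a real emulator API.
    let mut emulator = ModbusEmulator::new();
    let mut driver = GeneratedDriver::connect(emulator.endpoint());

    // Seed the emulator with a ramp across the documented range:
    // register 0x0042 holds tenths of a degree C (scale 0.1).
    for tenths in (200..=400).step_by(5) {
        emulator.set_holding_register(0x0042, tenths as u16);

        let reading = driver.read_temperature().expect("poll failed");

        // The driver must report degrees C: not raw counts, and not
        // a Fahrenheit conversion (40 C mis-reported as 104).
        let expected = tenths as f64 * 0.1;
        assert!((reading.0 - expected).abs() < 1e-9);
    }
}
```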
What this catches: behavioral errors in normal operation, unit conversions that look right on paper but compound wrong, state-machine bugs where the driver works on the first poll but breaks on the second.
What this misses: hardware-specific quirks the emulator does not model (analog noise, drift, vendor firmware bugs, real-world timing).
Stage 4: Boundary and adversarial inputs
Stage four is where we try to break the driver. The emulator from stage three runs again, this time configured to send malformed packets, out-of-range values, partial responses, delayed responses, and a library of known-bad sequences captured from real-world failures. Some are random; most are seeded by failure modes we have seen in production.
A driver that crashes on a short Modbus response, or interprets a 0xFFFF "no data" sentinel as a valid temperature reading of 6553.5 degrees, fails this stage. So does a driver that locks up the gateway when the device returns its serial number with a non-ASCII byte.
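Sketched as tests against the same hypothetical harness (the DriverError variants are illustrative):

```rust
// The 0xFFFF "no data" sentinel must surface as a fault, never as a
// temperature of 6553.5 degrees (0xFFFF * 0.1).
#[test]
fn sentinel_is_a_fault_not_a_reading() {
    let mut emulator = ModbusEmulator::new();
    let mut driver = GeneratedDriver::connect(emulator.endpoint());

    emulator.set_holding_register(0x0042, 0xFFFF);

    match driver.read_temperature() {
        Err(DriverError::NoData) => {} // correct: propagate a fault
        Ok(reading) => panic!("sentinel decoded as {reading:?}"),
        Err(other) => panic!("wrong fault class: {other:?}"),
    }
}

// Short or truncated frames must produce errors, not panics. A loop
// over truncation lengths is the cheapest version of this check.
#[test]
fn truncated_frames_never_panic() {
    let mut emulator = ModbusEmulator::new();
    let mut driver = GeneratedDriver::connect(emulator.endpoint());

    for cut in 0..8 {
        emulator.truncate_next_response(cut); // hypothetical knob
        let _ = driver.read_temperature();    // Err is fine; a panic is not
    }
}
```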
This is also where security holes get caught. A driver that trusts unbounded length fields, dereferences attacker-controlled pointers, or panics on malformed TLS certs from a fake device fails here. We do not ship drivers that can be turned into a remote-code-execution vector by a hostile sensor.
What this catches: edge-case crashes, sentinel-value misinterpretation, timeout bugs, security holes in parser code.
What this misses: anything that requires real silicon to reproduce.
Stage 5: Sandboxed real-device shadow run
The most expensive stage, and the one we cannot skip. The candidate driver runs against a real device in our hardware-in-the-loop lab, for a duration calibrated to the device class (an hour for simple sensors, twenty-four hours for anything with thermal cycling or duty-cycle behavior). Output is compared against customer-supplied expected ranges and against a reference implementation when one exists.
The shadow run catches the residual class of bugs that simulators cannot model: vendor firmware that does not match its own datasheet (extremely common), analog-front-end drift, electromagnetic interference, timing jitter that exposes a race condition, behavior on the second hour that differs from the first ten seconds.
For regulated customers, the shadow run is the artifact that maps cleanly onto a V&V evidence package. We ship the raw capture, the comparison report, the pass/fail summary, and the cryptographic hash of the driver under test. That bundle is what goes into the design history file.
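A sketch of the driver-hash step, using the sha2 crate; the bundle layout here is illustrative:

```rust
use sha2::{Digest, Sha256};

/// The evidence bundle shipped to the customer (illustrative fields).
struct EvidenceBundle {
    driver_binary: Vec<u8>,    // the exact artifact under test
    raw_capture: Vec<u8>,      // wire capture from the shadow run
    comparison_report: String, // expected-range comparison, pass/fail
}

/// The digest ties the V&V evidence to one exact artifact; any
/// rebuild of the driver produces a different digest.
fn driver_digest(bundle: &EvidenceBundle) -> String {
    let mut hasher = Sha256::new();
    hasher.update(&bundle.driver_binary);
    hasher
        .finalize()
        .iter()
        .map(|b| format!("{b:02x}"))
        .collect()
}
```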
What this catches: hardware-specific issues, vendor firmware deviations, drift, timing-dependent bugs, anything analog.
What this misses: failure modes that only emerge across thousands of devices over months. That is stage six.
Stage 6: Production observability and circuit breakers
Stage six is not a pre-deployment check. It runs continuously, on every deployed driver, on every device in the field. The runtime monitors the statistical envelope of every driver: read latency, value distribution, error rate, reconnect rate, and a handful of driver-specific invariants the AI is required to declare at generation time (for example, "temperature should never exceed 80 C; if it does, do not propagate, raise a fault").
When a driver drifts, the runtime opens a circuit breaker, halts new rollout, and (depending on customer policy) either falls back to a previous driver version or raises an alert and waits for human acknowledgement. Every transition is recorded in the immutable audit log.
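A minimal sketch of the breaker logic, using the 80 C invariant from the example above. The trip threshold is illustrative, and the real runtime tracks statistical envelopes rather than single readings:

```rust
#[derive(PartialEq)]
enum BreakerState {
    Closed,
    Open,
}

struct Breaker {
    state: BreakerState,
    consecutive_violations: u32,
    trip_threshold: u32, // e.g. 3 consecutive violations opens it
}

impl Breaker {
    /// Evaluate one reading against the AI-declared invariant
    /// ("temperature should never exceed 80 C").
    fn observe(&mut self, temperature_c: f64) {
        if temperature_c <= 80.0 {
            self.consecutive_violations = 0;
            return;
        }
        self.consecutive_violations += 1;
        if self.consecutive_violations >= self.trip_threshold {
            // Open: halt rollout, stop propagating values, raise a
            // fault, and record the transition in the audit log.
            self.state = BreakerState::Open;
        }
    }

    fn is_open(&self) -> bool {
        self.state == BreakerState::Open
    }
}
```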
What this catches: drift in production, vendor firmware updates that change behavior post-deployment, edge cases that did not appear in the lab, fleet-wide regressions from a bad rollout.
What this misses: the first occurrence of a rare failure. Stage six is statistical rather than deterministic (more on that in the residual-risk section below), but it is the safety net under everything else.
What this maps to in regulated industries
The six-stage pipeline is not designed around a specific standard, but it lines up cleanly with three of them. We work with hardware teams in medical (Corvita, our first customer, builds neonatal monitoring devices), industrial, and aerospace, and the verification evidence we produce is structured to slot into each compliance regime.
IEC 62304 (medical device software lifecycle). The standard requires documented evidence of software unit verification, integration testing, and system testing for every release of every software item, with traceability to requirements. Stages one through four produce the unit and integration evidence. Stage five produces the system-level verification artifact. Every stage emits a signed report that ties back to the driver's git SHA and the input datasheet hash. Combined with the immutable audit log from stage six, this maps onto the IEC 62304 software-of-unknown-provenance and software-item-verification clauses without us having to manually assemble anything.
21 CFR Part 11 (FDA electronic records). Part 11 requires that records used to make regulated decisions be attributable, legible, contemporaneous, original, and accurate, with audit trails that cannot be modified. Stage six's audit log is built on Postgres tables where DELETE is blocked by triggers at the database level (we wrote about the immutability pattern previously), with 13-month retention by default. Every generated driver, every verification result, every deployment, every rollback is recorded with the actor, the timestamp, the before-state and after-state, and a cryptographic chain that makes silent edits detectable (a minimal sketch of the chaining follows the standards list below).
IEC 61508 (functional safety, industrial). 61508 emphasizes independent verification, especially for higher safety integrity levels, and rewards diverse implementations. We treat the AI as one implementation and run a second, independently-derived reference driver in parallel during stage five whenever the customer is targeting a SIL 2 or higher application. Disagreement between the two is a hard failure. This is more expensive than stage five alone, and we only run it for customers who need it, but it gives us a path to production for safety-critical applications without claiming the AI is its own independent channel (it is not).
ISO 13485 (medical device QMS). The pipeline produces design controls evidence: design history files pre-populated with verification records, traceability matrices, and risk-control measures. The customer's quality team does not have to assemble it after the fact.
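As promised above, a minimal sketch of the audit-log hash chaining, assuming the sha2 crate and an illustrative record layout. Because each record commits to its predecessor's digest, editing any historical record breaks every digest after it:

```rust
use sha2::{Digest, Sha256};

/// One audit record (illustrative fields, not our actual schema).
struct AuditRecord {
    actor: String,
    timestamp: String,    // RFC 3339
    before_state: String,
    after_state: String,
    prev_digest: String,  // digest of the previous record in the chain
    digest: String,       // chain_digest() over the fields above
}

fn chain_digest(r: &AuditRecord) -> String {
    let mut h = Sha256::new();
    for field in [
        &r.actor,
        &r.timestamp,
        &r.before_state,
        &r.after_state,
        &r.prev_digest,
    ] {
        h.update(field.as_bytes());
        h.update([0u8]); // field separator so concatenations cannot collide
    }
    h.finalize().iter().map(|b| format!("{b:02x}")).collect()
}
```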
The AI is the part anyone can reproduce. The verification layer is the part that takes years. That is the moat.
What the verification layer does not catch
We try to be honest about the residual risk. Three classes of failure survive all six stages:
- Errors in the source datasheet. If the manufacturer's manual is wrong, the round-trip in stage two will agree with the wrong value. We catch some of these in stage five when the device behaves differently than the manual claims, but not all of them. Mitigation: the customer's hardware engineer signs off on the source datasheet before generation, and the audit log records that signoff.
- Failure modes that need months of fleet data to surface. A drift that is too slow to notice in 24 hours, a corrosion-induced sensor change at year three, a vendor firmware update that lands on 2% of devices and changes a register's meaning. Stage six is what we have for these, and stage six is statistical rather than deterministic. We will not catch the first occurrence; we will catch the third.
- Adversarial models. A model that intentionally produces subtly-wrong code is not in our threat model today. We pin the model version, sandbox the generation environment, and verify outputs against ground truth, but we are not running formal adversarial-AI defenses. For customers in classified or national-security applications, we do not yet have the evidence package they need.
We tell prospective customers this on the first call. The verification layer reduces residual risk; it does not eliminate it. A team that is not willing to operate a fleet with a non-zero defect rate should not be operating a fleet, regardless of who wrote the code.
Why this matters more than the AI
The center of gravity in an AI-native infrastructure company is not the model. It is the verification layer.
A team can wire up Claude or GPT to a code-generation pipeline in a week. We have done it. It is interesting until a real customer asks how you would defend the output in front of an FDA auditor, an EU MDR notified body, or a TÜV functional-safety assessor. At that point, "the AI checks its own work" stops the conversation.
What gets you past that conversation is what gets any safety-critical platform past it: layered verification, independent evidence, immutable records, observable runtime, circuit breakers, and a clear story about what is and is not in scope. Most of those layers are not AI-shaped. They are the same boring, expensive infrastructure work that has been the cost of operating in regulated industries for decades. The AI changes how integration code gets written; it does not change what production safety requires.
That work is multi-year. Hardware-in-the-loop labs, emulator libraries, datasheet extractors, immutable audit infrastructure, observability pipelines, and the institutional knowledge to know which checks matter for which class of device: none of that is replaceable by a bigger model. The AI side of SCADABLE will get cheaper every year. The verification side gets deeper every year. The combination is what makes AI-generated integration code safe to put on a baby monitor in a NICU, and that combination is the part that is hard to copy.
If you are evaluating AI-native infrastructure for a regulated-industry product (medical, industrial, aerospace) and need to understand the verification model before you can recommend it internally, we run 30-minute architecture reviews tailored to your compliance posture. Book at https://cal.com/rahbaral/quick-chat.