SCADABLE

What happens when the AI gets the integration wrong

Most AI-generated-driver pitches stop at "the model figured it out." That works 70% of the time. The other 30% is what actually decides whether the platform works in production. Here is the engineer-in-the-loop fallback for the failures the verification layer cannot catch.


Most "AI generates your IoT driver" pitches stop at the part where the model figured it out. That part is true. The model figures out the device about 70% of the time, including a non-trivial number of devices that took us a week to integrate by hand the first time we saw them.

The other 30% is the part that actually decides whether SCADABLE works as a platform. The model gets it wrong, the verification layer catches the wrong, and then something has to happen. The version of "something" you build is the difference between "the AI handles your IoT" being a real claim and being a marketing line.

The "something" we built is an engineer who steps in. This is a post about the four ways the AI gets it wrong, why the verification layer catches three of them automatically, and what the human in the loop actually does for the fourth.

The four failure modes

After running this in production for a year, the failure modes cluster into four buckets, in roughly this order of frequency.

1. Endianness inversion

Modbus is a 1979 protocol that predates IEEE 754, and the specification never pinned down how a 32-bit value spans two 16-bit registers. Different vendors decided different things about whether a 32-bit value is stored as [high16, low16] or [low16, high16], and within each of those, whether each 16-bit word is [high8, low8] or [low8, high8]. There are four valid interpretations of the same byte sequence.
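
To make the ambiguity concrete, here is a minimal sketch (illustrative code, not our decoder) that reads the same four bytes all four ways:

    import struct

    # The same two Modbus registers off the wire: 0x41BC, 0x0000.
    raw = bytes([0x41, 0xBC, 0x00, 0x00])

    def decode_float32(b, swap_words, swap_bytes):
        words = [b[0:2], b[2:4]]
        if swap_words:                       # [high16, low16] vs [low16, high16]
            words.reverse()
        if swap_bytes:                       # [high8, low8] vs [low8, high8]
            words = [w[::-1] for w in words]
        return struct.unpack(">f", words[0] + words[1])[0]

    for sw in (False, True):
        for sb in (False, True):
            print(f"swap_words={sw} swap_bytes={sb}: {decode_float32(raw, sw, sb):.6g}")
    # one of these prints 23.5; the others print a denormal near 2.4e-41,
    # -0.0117..., and a denormal near 6.8e-41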

The AI guesses based on the datasheet's notation, and the guess is sometimes wrong. The bug looks like this: a flow meter reports a flow rate of 4.6e-41 L/min. The number is denormalized but valid IEEE 754. A human reads "denormalized float" and immediately knows it is byte-swapped.

The verification layer catches this 95% of the time, because we sanity-check the first reading against the sane_range declared in the device class. A flow rate of 4.6e-41 falls below the floor of (0.1, 500.0) and the deploy aborts with a flagged exception. The remaining 5% are devices where the byte order is inverted but the value happens to land inside the sane range, which is rare but possible. Those need human review.
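
The check itself is a few lines; a sketch with illustrative names, assuming sane_range is an exclusive (low, high) pair:

    class IntegrationFlagged(Exception):
        pass

    def verify_first_reading(value, sane_range):
        low, high = sane_range          # declared in the device class
        if not (low < value < high):
            raise IntegrationFlagged(
                f"first reading {value!r} outside sane_range {sane_range}; aborting deploy"
            )

    verify_first_reading(4.6e-41, (0.1, 500.0))   # raises: byte-swapped float caught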

2. Function code mismatch

Modbus has four register types, each with a different function code: holding (FC03), input (FC04), coil (FC01), discrete (FC02). The datasheet often lists registers by address only, without explicitly stating the function code. The model guesses, usually correctly, but when a vendor uses an unusual address scheme, it guesses wrong.
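
For reference, the mapping (the function codes are protocol facts; the dict is just illustration):

    # Modbus read function codes per register type.
    READ_FUNCTION_CODES = {
        "coil":             0x01,   # FC01
        "discrete_input":   0x02,   # FC02
        "holding_register": 0x03,   # FC03
        "input_register":   0x04,   # FC04
    }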

The bug looks like this: the connection succeeds, the read is performed, and the device returns exception code 02 (illegal data address). The driver retries, gets the same exception, and gives up. The verification layer catches this on first deploy, because the exception bubbles up and we mark the integration as failed.

This goes to human review, because the fix is "try the other function code," and the human can do that in 30 seconds.
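
Written out, the 30-second fix is a fallback of a few lines; read_register and ModbusError here are assumed helpers, not a real client API:

    ILLEGAL_DATA_ADDRESS = 0x02

    class ModbusError(Exception):
        def __init__(self, exception_code):
            super().__init__(f"modbus exception {exception_code:#04x}")
            self.exception_code = exception_code

    def read_with_fc_fallback(read_register, address, guessed_fc):
        # Try the guessed function code; on exception 02, try the sibling
        # (FC03 <-> FC04 here; the FC01/FC02 pair works the same way).
        try:
            return read_register(guessed_fc, address), guessed_fc
        except ModbusError as e:
            if e.exception_code != ILLEGAL_DATA_ADDRESS:
                raise
            other_fc = 0x04 if guessed_fc == 0x03 else 0x03
            return read_register(other_fc, address), other_fc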

3. Scaling and unit confusion

Datasheets express scaling in different ways. Sometimes "value = raw / 10". Sometimes "value = raw * 0.1". Sometimes the raw value is in centidegrees but the datasheet says "°C, divide by 100". Sometimes the unit is implied by context and never stated.
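
The explicit phrasings all reduce to the same shape; the implicit one is the trap (values illustrative):

    raw = 2200                  # 16-bit register value
    a = raw / 10                # "value = raw / 10"            -> 220.0
    b = raw * 0.1               # "value = raw * 0.1"           -> 220.0 (same rule)
    c = raw / 100               # centidegrees, "divide by 100" -> 22.0 °C
    # when no rule is stated, the model has to infer one, and 220 vs 22.0
    # is exactly the bug described next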

The model handles the explicit cases. The implicit cases are the failure mode. The bug looks like this: the temperature reads 220 instead of 22.0, or 22000 instead of 22.0. The verification layer catches the 220 case (out of sane_range), but the 22000 case sometimes lands inside an overly generous range, especially for parameters where humans were sloppy about declaring tight bounds.

These get caught by the next layer of verification: cross-checking against device behavior simulators. We run the generated driver against a synthetic device that produces known values, and we check that the driver decodes them correctly. If the simulator says the input is 1850 raw and the driver says 185.0 °C, but the datasheet says it should be 18.5 °C, we catch it.
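
A sketch of that cross-check, using the 1850-raw example above (helper names are illustrative):

    def check_against_simulator(decode, cases):
        # cases: (raw value the simulator emits, value the datasheet promises)
        for raw, expected in cases:
            got = decode(raw)
            if got != expected:
                raise AssertionError(
                    f"decoded {got} from raw {raw}; datasheet says {expected}"
                )

    generated_decode = lambda raw: raw / 10          # the driver the model wrote
    check_against_simulator(generated_decode, [(1850, 18.5)])  # raises: 185.0 != 18.5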

4. PDF parsing ambiguity

This is the failure mode the verification layer cannot catch on its own, and it is what the human is for.

The datasheet has a register table where the columns are unlabeled or labeled in a non-English language we did not anticipate, or the table spans two pages with different orientations, or a footnote on page 47 changes the interpretation of the type column from page 12. The model produces a DeviceSpec that is internally consistent and passes verification against simulated data, because the simulated data was generated from the same wrong DeviceSpec.
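
The circularity is easy to see in miniature (numbers illustrative):

    spec_scale = 10                  # wrong: the page-47 footnote says 100
    true_value = 18.5                # what the physical device is measuring

    simulated_raw = true_value * spec_scale    # simulator built from the same spec
    decoded = simulated_raw / spec_scale       # driver built from the same spec

    assert decoded == true_value     # passes; a real device sending raw 1850
                                     # would be decoded as 185.0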

The deploy succeeds. The first event lands. The number is wrong. The customer notices. We notice (the audit log shows the value, the customer's dashboard shows it, the value is sane-looking but does not match the customer's calibration target). An engineer is paged.

What the engineer does

The engineer-in-the-loop pulls up three things: the original PDF, the generated DeviceSpec, and the live event stream. They diff the DeviceSpec against the PDF for the affected register. They identify the misinterpretation (often it is a single column header that the model translated from German to English imprecisely, or a footnote indicator that did not get extracted into the IR). They edit the device class manually. They redeploy.
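
For concreteness, a hypothetical shape of one DeviceSpec register entry, the granularity at which that diff happens (field names are our illustration, not the real schema):

    from dataclasses import dataclass

    @dataclass
    class RegisterSpec:
        address: int                      # e.g. 30001
        function_code: int                # 0x01..0x04
        word_order: str                   # "big" or "little": the endianness guess
        scale: float                      # raw -> engineering units
        unit: str                         # e.g. "°C"
        sane_range: tuple[float, float]   # bounds for the first-reading check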

The engineer's edit is captured. It goes back into our training corpus as a labeled correction. The next time a similar device family comes through, the model has seen the correction, and the same failure mode does not recur in the same way.

In practice, the engineer-in-the-loop work for a single device takes 20 minutes to 2 hours, depending on how exotic the device is. It is genuinely a human in the loop. The customer's experience is "we deployed a device, it took an extra few hours instead of two minutes, the SCADABLE team handled it, the data is now flowing." They never see the diff process.

Why this is a feature, not a bug

The honest version of this post is that the AI is not done. It is going to be done someday, but right now there is a 30% gap between "model output" and "production-ready integration." We could pretend the gap does not exist, ship the broken integrations, and let the customer deal with the bad data. We could refuse to ship until the model is perfect, which would mean shipping never. Or we could put a human in the gap.

We picked the third one because the customer's experience matters more than our ideological purity about whether the AI did 100% of the work. The customer's job is "ship a connected medical device" or "ship a connected smart inverter." Their job is not to debug a misinterpreted datasheet. If the integration is wrong, that is our problem. If the integration is right, that was still our problem to solve, regardless of which side of the loop solved it.

This is also the right business shape. We bill for working integrations, not for "the AI tried." That alignment forces the right behavior on us. If the AI is wrong, we eat the cost. The customer does not pay more because the device was unusual. We absorb the variability so the customer doesn't have to think about it.

The trajectory

The engineer-in-the-loop work is shrinking month over month. Our internal metric for this is "engineer minutes per integration," which has dropped from ~90 minutes per integration nine months ago to ~12 minutes per integration today. We expect it to keep dropping, because every correction the engineer makes folds back into the verification layer and the parsing pipeline.

The asymptote is not zero. There will always be devices the AI cannot handle without context, especially in regulated industries where the datasheet is intentionally imprecise to leave room for the manufacturer's clinical labeling team to refine it. For those, the human in the loop is the right primitive. The goal is not to remove the human; it is to make the human's contribution count for thousands of customers instead of one.

See it on the demo

The drag-and-drop and the verification layer are both visible in the live demo. The engineer-in-the-loop fallback is, by design, invisible: it is the part where the customer hands us the device and we hand them back the integration. If you want to talk about a specific device that has been giving your team trouble, let's talk. We are particularly interested in devices that sit in the 30%, because those are the ones that improve the system for everyone else.