Modbus TCP to MQTT, a production-grade guide (and the 7 things that break)
A field-tested walkthrough of bridging Modbus TCP to MQTT, from a 60-line Python script to a hardened gateway. Plus the seven failure modes that kill naive implementations once they meet a real factory floor.
Most "Modbus to MQTT" tutorials top out around 60 lines of Python, a while True loop, and a screenshot of MQTT Explorer showing a happy little JSON payload. Then you ship it, and within a week the loop has wedged on a half-open TCP socket, your broker is rejecting messages because the queue is full, and a firmware update on one of the PLCs has silently shifted register 40021 from a uint16 to a signed scaled int16. Now your dashboard says the chamber temperature is -32760 and someone is calling you on a Sunday.
We run this bridge in production for Corvita Biomedical's neonatal device fleet, where a corrupted reading is not a Slack notification, it is a regulatory event. This post is what we wish someone had written before we built it. We will start with the naive version (because you should still understand it), then walk through the seven failure modes that nobody documents.
The naive Modbus to MQTT bridge
Here is the script that lives in 80% of the GitHub repos you find when you search modbus to mqtt python. It uses pymodbus and paho-mqtt, polls a holding register every second, and republishes it as JSON.
import json, time

import paho.mqtt.client as mqtt
from pymodbus.client import ModbusTcpClient

modbus = ModbusTcpClient("10.20.30.40", port=502)
modbus.connect()

broker = mqtt.Client(client_id="line-1-bridge")
broker.connect("mqtt.internal", 1883, keepalive=60)
broker.loop_start()  # paho needs a running network loop for QoS 1 publishes to complete

while True:
    rr = modbus.read_holding_registers(address=40020, count=2, slave=1)
    payload = {"ts": time.time(), "raw": rr.registers}
    broker.publish("plant/line-1/oven/temp", json.dumps(payload), qos=1)
    time.sleep(1.0)

This works. On a laptop, on a clean network, with one cooperative slave, it works for hours. The first time we deployed something close to this in 2023, it ran for nine days before everything I am about to describe happened, in roughly that order. If you are looking for a node-red modbus mqtt flow you would hit the same wall; the platform is different but the failure modes are identical.
What actually breaks in production
The Modbus spec is from 1979. Your MQTT broker thinks it is 2026. The bridge between them is where every assumption goes to die.
There are seven things that break, and they break in roughly this order of frequency. We will go through each, with the symptom, the root cause, and the production fix.
1. Reconnect logic, because the TCP socket dies silently
The most common failure: your bridge stops publishing, but no exception is raised. pymodbus returns an empty response, or the socket.recv() call blocks forever, because some switch in the middle decided to drop your connection from its state table without sending a RST. You will see this on managed industrial switches, on cellular routers, and most viciously on Windows-based "data concentrator" PCs that put the NIC to sleep.
The fix is a watchdog plus aggressive socket-level keepalives. Do not trust application-level keepalives alone. On Linux, set TCP_KEEPIDLE to something like 15 seconds and TCP_KEEPINTVL to 5. In Python:
import socket

sock = modbus.socket  # the sync pymodbus client exposes its underlying socket
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 15)   # idle seconds before the first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 5)   # seconds between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before the kernel kills the connection

Then wrap every read in a deadline. If a read takes longer than, say, 800 ms when normal latency is 30 ms, close the socket and reconnect. Track consecutive failures, back off exponentially (capped around 30 seconds), and emit a bridge.reconnect MQTT event so your observability stack actually knows. The error you will eventually catch is pymodbus.exceptions.ConnectionException: Modbus Error: [Connection] Failed to connect, but only after you stop trusting the library to surface it.
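A minimal sketch of that deadline-plus-backoff loop. It assumes the client was constructed with a tight timeout= (the pymodbus constructor accepts one) so a read cannot block past the budget; publish_event is a stand-in for however you emit bridge events:

import time
from pymodbus.exceptions import ModbusException

READ_DEADLINE_S = 0.8   # ~25x normal LAN latency; tune per site
MAX_BACKOFF_S = 30.0

def read_with_watchdog(modbus, address, count, slave):
    failures = 0
    while True:
        start = time.monotonic()
        try:
            rr = modbus.read_holding_registers(address=address, count=count, slave=slave)
            ok = not rr.isError() and time.monotonic() - start < READ_DEADLINE_S
        except ModbusException:
            ok = False
        if ok:
            return rr
        # Slow, failed, or wedged: tear the socket down and rebuild it.
        failures += 1
        modbus.close()
        backoff = min(2.0 ** failures, MAX_BACKOFF_S)
        publish_event("bridge.reconnect", {"failures": failures, "backoff_s": backoff})  # hypothetical helper
        time.sleep(backoff)
        modbus.connect()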
2. Register caching, because polling 200 registers at 10 Hz is not free
The naive Modbus to MQTT bridge reads everything on every cycle. Most plants have a fast-changing tag list (a dozen analog inputs, maybe a couple of state machines) and a long tail of slow-changing config registers (setpoints, firmware versions, serial numbers) that some integrator checked into the poll list five years ago.
Stop polling all of it every cycle. The PLC's CPU is doing real work. Every Modbus request you send is serviced out of the PLC's scan cycle, and on small PLCs (Click, Micro820, Modicon M221) you can absolutely measure scan-time degradation under polling load. Group registers by change frequency. Poll fast-changing tags at your real cycle, poll setpoints every 30 seconds, and read identity registers once at startup. Cache aggressively, only publish on change (with a heartbeat republish every 60 seconds so consumers know the bridge is alive).
We had one deployment where median read latency for a single register dropped from roughly 200 ms to 40 ms after we coalesced reads into a few read_holding_registers(address, count=20) calls instead of 20 single-register reads. Modbus TCP has real per-request overhead; batch your reads.
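A condensed sketch of what that poll loop can look like, assuming contiguous register blocks per group; decode_block and publish_tag are hypothetical helpers built on the tag map from the next section:

import time

POLL_GROUPS = {
    "fast": {"addr": 40020, "count": 20, "period_s": 1.0},    # analog inputs, state machines
    "slow": {"addr": 40200, "count": 10, "period_s": 30.0},   # setpoints
}
HEARTBEAT_S = 60.0

last_run = {name: 0.0 for name in POLL_GROUPS}
last_value = {}    # tag -> last published value
last_sent = {}     # tag -> last publish timestamp

while True:
    now = time.monotonic()
    for name, grp in POLL_GROUPS.items():
        if now - last_run[name] < grp["period_s"]:
            continue
        last_run[name] = now
        # One batched read per group instead of N single-register reads.
        rr = modbus.read_holding_registers(address=grp["addr"], count=grp["count"], slave=1)
        if rr.isError():
            continue  # hand off to the reconnect logic from section 1
        for tag, value in decode_block(grp["addr"], rr.registers).items():
            changed = last_value.get(tag) != value
            stale = now - last_sent.get(tag, 0.0) > HEARTBEAT_S
            if changed or stale:  # publish on change, plus the heartbeat republish
                publish_tag(tag, value)
                last_value[tag], last_sent[tag] = value, now
    time.sleep(0.05)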
3. Type coercion, because Modbus registers do not have types
A holding register is 16 bits. That is the only thing the spec promises. Your float, your int32, your bitfield, your packed string, all of these are conventions layered on top by the device vendor, documented in a PDF, and frequently wrong.
The two reliable failure modes here are byte order and word order. A 32-bit float in two consecutive registers can be (in increasing order of pain): big-endian-big-endian, big-endian-little-endian (Modicon "swap"), little-endian-big-endian, or little-endian-little-endian. There are PLCs that ship in one mode and have a config bit to flip to another. Your bridge has to know.
Build a tag map, do not infer:
TAG_MAP = {
    "oven_temp_c":   {"addr": 40020, "type": "float32", "byteorder": "ABCD", "scale": 1.0},
    "blower_rpm":    {"addr": 40024, "type": "uint16", "scale": 1.0},
    "fault_bits":    {"addr": 40030, "type": "bitfield16"},
    "device_serial": {"addr": 40100, "type": "string", "length": 16},
}

Then decode against that map and emit typed JSON downstream. Never publish raw register arrays to MQTT, your consumer should not be parsing two's complement on the receiving end. If you do, the next person to debug it will be doing it at 2am.
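A minimal decoder for the two cases that bite hardest, using only the standard library; the two word orders shown are the common ones, extend the table for BADC/DCBA hardware if you meet it:

import struct

def decode_float32(registers, byteorder="ABCD"):
    # Decode two consecutive 16-bit registers into an IEEE 754 float.
    hi, lo = registers[0], registers[1]
    if byteorder == "ABCD":      # big-endian words, big-endian bytes
        raw = struct.pack(">HH", hi, lo)
    elif byteorder == "CDAB":    # word-swapped (the Modicon "swap")
        raw = struct.pack(">HH", lo, hi)
    else:
        raise ValueError(f"unsupported byte order {byteorder!r}")
    return struct.unpack(">f", raw)[0]

def decode_int16(register):
    # Reinterpret an unsigned 16-bit register as signed two's complement.
    return register - 0x10000 if register & 0x8000 else register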
4. Slave timeouts, because one slow PLC blocks the rest
If your modbus mqtt gateway talks to multiple slaves over a single TCP session (Modbus TCP supports this via the unit ID field), one slow or dead slave will starve all the others. The default pymodbus timeout is generous, often three seconds, and when a slave goes dark you spend three full seconds per cycle waiting on a ghost.
Two patterns work in production. First, run one connection per slave when latency budget allows, and isolate them with separate workers (asyncio tasks or threads). Second, when you must multiplex, set a tight per-request timeout (we use 500 ms for LAN, 1500 ms for cellular) and a per-slave circuit breaker: after N consecutive timeouts, mark the slave as unreachable, publish a state event to MQTT, and only retry it on a slow background timer. The other slaves keep flowing.
The error to grep for is ModbusIOException: Modbus Error: [Input/Output] No Response received from the remote slave. If your logs are full of those for one unit ID, that is exactly the slave you should be circuit-breaking.
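A minimal per-slave circuit breaker, assuming one worker per slave; the thresholds and the publish_event helper are illustrative:

import time

TRIP_AFTER = 5          # consecutive timeouts before we stop hammering the slave
RETRY_EVERY_S = 60.0    # slow background retry once tripped

class SlaveBreaker:
    def __init__(self, unit_id):
        self.unit_id = unit_id
        self.failures = 0
        self.tripped_at = None

    def allow(self):
        if self.tripped_at is None:
            return True
        if time.monotonic() - self.tripped_at > RETRY_EVERY_S:
            self.tripped_at = time.monotonic()  # let one probe through per retry window
            return True
        return False

    def record(self, ok):
        if ok:
            if self.tripped_at is not None:
                publish_event("bridge.slave_recovered", {"unit": self.unit_id})
            self.failures, self.tripped_at = 0, None
        else:
            self.failures += 1
            if self.failures >= TRIP_AFTER and self.tripped_at is None:
                self.tripped_at = time.monotonic()
                publish_event("bridge.slave_unreachable", {"unit": self.unit_id})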
5. TLS at both layers, because Modbus is unencrypted by default
Modbus TCP is plaintext. There is a Modbus/TCP Security spec from 2018 (TCP port 802, with TLS) and almost nobody ships it. So in practice your Modbus link is unencrypted, and you cannot fix it at the protocol layer. You fix it at the network layer: the bridge runs on the same trusted segment as the PLC, behind a managed VLAN or an industrial firewall, and the only thing that leaves that segment is the MQTT side.
The MQTT side is where TLS earns its keep. Use mTLS with a real certificate per gateway, not a shared password. Pin the broker CA. Set tls_version explicitly to TLS 1.3, do not let it negotiate down. The handshake error you do not want to debug at 11pm is paho.mqtt: SSL: CERTIFICATE_VERIFY_FAILED, which usually means the gateway clock has drifted. Run chrony or systemd-timesyncd, this is not optional on edge hardware.
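A hedged sketch of the paho side; the cert paths are placeholders, and tls_set_context is the paho call that lets you hand over a fully pinned TLS context instead of letting the library negotiate:

import ssl

import paho.mqtt.client as mqtt

client = mqtt.Client(client_id="line-1-bridge", clean_session=False)

# Explicit TLS 1.3 context: pinned broker CA, client cert for mTLS,
# and no negotiation below 1.3.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.minimum_version = ssl.TLSVersion.TLSv1_3
ctx.load_verify_locations("/etc/bridge/broker-ca.pem")
ctx.load_cert_chain("/etc/bridge/gateway.crt", "/etc/bridge/gateway.key")
client.tls_set_context(ctx)

client.connect("mqtt.internal", 8883, keepalive=60)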
For Corvita's deployment we issue short-lived gateway certs (24-hour lifetime) and rotate them automatically via EST enrollment, but even a yearly manual rotation is dramatically better than a static PSK. The threat model here is not exotic, it is "someone plugs a laptop into the plant network".
6. Backpressure, because the broker is not always available
Your MQTT broker will go down. The internet uplink will flap. The cell modem will reboot at 3am because Verizon decided. If your bridge does not handle backpressure, what happens is: the publish call buffers in paho-mqtt's in-memory queue, the queue grows, RAM fills, and on a small gateway (a $200 industrial PC with 1 GB of RAM) the OOM killer takes the whole bridge down. You lose the buffer and the live data simultaneously.
The right answer is a bounded persistent queue on disk. We use a small SQLite ring buffer with a configurable size (typically 100 MB or about 24 hours of data, whichever is smaller) and a writer that pulls from the queue when the broker is connected. When the queue is full, we drop oldest, not newest, and emit a counter metric so you can alert on data loss. The flow looks like:
modbus_poll -> decode -> sqlite_queue -> mqtt_publisher
                              |
                              +-> drop_oldest if full

QoS 1 with clean_session=False and a stable client ID gets you the broker-side replay, but you still need the local buffer for the disconnected case. Do not skip this. The first thing your customer will ask after an outage is "did we lose data", and the right answer is "no, here is the catch-up window".
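A compact sketch of the drop-oldest disk queue; the schema, cap, and path are illustrative, and the publish callback handed to drain() should block until the broker acks (paho's MessageInfo.wait_for_publish works for that):

import sqlite3

MAX_ROWS = 500_000  # size the cap to your disk budget (~24h of data)

class DiskQueue:
    def __init__(self, path="/var/lib/bridge/queue.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS q (id INTEGER PRIMARY KEY AUTOINCREMENT, topic TEXT, payload TEXT)")
        self.dropped = 0  # export as a counter metric; alert on any growth

    def put(self, topic, payload):
        with self.db:
            self.db.execute("INSERT INTO q (topic, payload) VALUES (?, ?)", (topic, payload))
            # Enforce the ring: drop the oldest rows, never the fresh ones.
            over = self.db.execute("SELECT COUNT(*) FROM q").fetchone()[0] - MAX_ROWS
            if over > 0:
                self.db.execute(
                    "DELETE FROM q WHERE id IN (SELECT id FROM q ORDER BY id LIMIT ?)", (over,))
                self.dropped += over

    def drain(self, publish, batch=100):
        # Call while the broker is connected; delete each row only after the ack.
        rows = self.db.execute(
            "SELECT id, topic, payload FROM q ORDER BY id LIMIT ?", (batch,)).fetchall()
        for row_id, topic, payload in rows:
            publish(topic, payload)
            with self.db:
                self.db.execute("DELETE FROM q WHERE id = ?", (row_id,))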
7. Schema drift, because firmware updates change register maps
This is the one that scares us the most, because it is silent. A vendor pushes a firmware update to a PLC. Register 40050, which used to be a raw temperature in tenths of a degree Celsius, is now in centidegrees, or it is now signed where it used to be unsigned, or the engineering units have shifted from PSI to bar. Your bridge keeps publishing. Your dashboard keeps drawing lines. The lines are wrong.
There is no protocol-level defense against this. The only defenses are operational. We do three things:
- Pin firmware versions in the tag map. Read the firmware version register at connect time, and refuse to start (or downgrade to read-only) if it does not match the version the tag map was written against.
- Range-check every published value. A neonatal incubator chamber temperature outside [20, 50] degrees Celsius is a bug, not data. Drop the publish and raise a bridge.range_violation event with the offending tag and value (see the sketch after this list).
- Diff the tag map under version control. Tag maps are config, and config belongs in git. Every change is a PR with a sign-off, and the bridge reads the map by SHA, so a deployment is auditable.
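A minimal sketch of the first two defenses; the version string, the ranges, and the read_tag, publish_event, and publish_tag helpers are all illustrative:

TAG_MAP_FW_VERSION = "2.4.1"              # the firmware the tag map was written against
RANGES = {"oven_temp_c": (20.0, 50.0)}    # plausible physical bounds per tag

def check_firmware(modbus):
    # Hypothetical: the vendor exposes a firmware version register the map knows about.
    fw = read_tag(modbus, "firmware_version")
    if fw != TAG_MAP_FW_VERSION:
        publish_event("bridge.firmware_mismatch", {"expected": TAG_MAP_FW_VERSION, "got": fw})
        raise SystemExit(f"tag map written for {TAG_MAP_FW_VERSION}, device runs {fw}")

def publish_checked(tag, value):
    lo, hi = RANGES.get(tag, (float("-inf"), float("inf")))
    if not lo <= value <= hi:
        # Out-of-range values are bugs, not data: drop and alert.
        publish_event("bridge.range_violation", {"tag": tag, "value": value})
        return
    publish_tag(tag, value)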
This is the failure mode that cannot be fixed by better code. It is fixed by treating the register map as a contract, with a version, an owner, and a review process.
Putting it together
The production-grade Modbus TCP to MQTT bridge ends up being roughly 1,500 to 3,000 lines, depending on the language. The shape is consistent: a connection manager with reconnect and circuit breakers, a poller that batches reads and respects per-slave isolation, a decoder that maps raw registers to typed values via a versioned tag map, a range/sanity checker, a bounded persistent queue, and an MQTT client with mTLS and proper QoS handling. Around all of that, structured logs, Prometheus metrics, and at least three event topics for operational visibility (bridge.connected, bridge.reconnect, bridge.range_violation).
Most of this is not Modbus expertise, it is distributed-systems hygiene applied to a 1979 protocol. If you have built reliable network services before, none of these patterns are new to you. They are just absent from every Modbus to MQTT tutorial we have read.
Or: don't
We built SCADABLE because we kept watching hardware teams burn three to six engineering months on this exact bridge, then burn another three on OTA, audit logs, and certificate rotation. If you would rather not, you can upload your device manual to SCADABLE and we will generate the production-hardened driver, with the failure modes above already handled. Same MQTT topics on the other side. Fewer weekend pages.