SCADABLE

Connecting ESP32 to AWS IoT Core: why mTLS broke for us and what we learned

A field-tested walkthrough of esp32 aws iot core integration, why the Espressif hello-world stops scaling at device 50, and how to survive cert provisioning, mTLS handshake errors, and rotation without losing the fleet.


The Espressif esp-idf MQTT example is a beautiful lie. It is not a lie because the code is wrong. It is a lie because it makes you believe production is one idf.py flash away. We believed it too. We had one ESP32 happily publishing to AWS IoT Core within an afternoon. Three weeks later we had 47 devices on a bench, half of them refusing to connect, and a Slack channel full of MBEDTLS_ERR_SSL_FATAL_ALERT_MESSAGE codes nobody could decode.

This post is the writeup we wish we had found on day three. If you searched esp32 aws iot core and ended up here, you are probably about to learn the same things, in the same order, in the same amount of pain. Skip the pain.

The hello-world that lies to you

Here is the shape of the official sample. Almost every ESP32-to-AWS tutorial on the internet is a recolor of it.

#include "mqtt_client.h"

esp_mqtt_client_config_t mqtt_cfg = {
    .broker.address.uri = "mqtts://a1b2c3d4e5f6g7-ats.iot.us-east-1.amazonaws.com:8883",
    // Amazon Root CA 1: verifies the broker is really AWS.
    .broker.verification.certificate = (const char *)aws_root_ca_pem_start,
    .credentials = {
        .authentication = {
            // Device cert + private key: prove who this device is.
            .certificate = (const char *)device_cert_pem_start,
            .key = (const char *)device_private_key_pem_start,
        },
    },
};

esp_mqtt_client_handle_t client = esp_mqtt_client_init(&mqtt_cfg);
esp_mqtt_client_register_event(client, ESP_EVENT_ANY_ID, mqtt_event_handler, NULL);
esp_mqtt_client_start(client);

Three PEM blobs are baked into the binary via EMBED_TXTFILES: AWS's Amazon Root CA 1, the device certificate, and the private key. Flash, connect, publish, done. It works on the first device. It will work on the second device too, if you do not mind both devices presenting the same identity to AWS, which you very much should mind.
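
For reference, this is roughly how those blobs end up in the binary. The certs/ paths and filenames here are ours; EMBED_TXTFILES NUL-terminates each file (so the blobs are valid C strings) and derives the linker symbol names from the filenames:

// CMakeLists.txt (component), assuming these filenames:
//   idf_component_register(SRCS "main.c" INCLUDE_DIRS "."
//       EMBED_TXTFILES "certs/aws_root_ca.pem"
//                      "certs/device_cert.pem"
//                      "certs/device_private_key.pem")

// The build exposes start/end symbols named after each file:
extern const uint8_t aws_root_ca_pem_start[]        asm("_binary_aws_root_ca_pem_start");
extern const uint8_t device_cert_pem_start[]        asm("_binary_device_cert_pem_start");
extern const uint8_t device_private_key_pem_start[] asm("_binary_device_private_key_pem_start");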

The moment you ship more than one of these, the architecture breaks in five places at once:

  1. AWS IoT policies are scoped per certificate. Same cert means same clientId rules, same shadow, same everything. Two devices sharing a clientId fight each other on connect, because AWS IoT allows one session per clientId and closes the existing session when a duplicate connects.
  2. Revoking a compromised device means revoking every device.
  3. AWS recommends a maximum cert validity of one year. Your firmware now has a hard expiration date.
  4. The private key sits in a .pem file in your repo or on a build server. If you have ever run git log -p, you know how that ends.
  5. There is no way to provision a device on the factory floor without burning the AWS keypair into the build itself, which means your contract manufacturer effectively holds your AWS root of trust.

Every problem after this point is a consequence of fixing one of those five.

mTLS, demystified, in five lines

There is a lot of mystique around mutual TLS that evaporates once you write it down plainly. The TLS handshake between an ESP32 and AWS IoT Core uses three artifacts:

  • Amazon Root CA 1. The device uses this to verify that the broker it dialed (*.iot.<region>.amazonaws.com) is really AWS. Lives in flash, ships with firmware, rotates roughly never.
  • The device certificate. A chunk of X.509 signed by either AWS's CA (if you used CreateKeysAndCertificate) or your own CA (if you registered one with RegisterCACertificate). It contains the device's public key and a subject. The device sends this to AWS during the handshake.
  • The device private key. Stays on the device, ideally never leaves it. The device proves ownership of the certificate by signing a handshake transcript with this key.

That is the whole protocol from the firmware's perspective. The device says "here is my cert, and here is a signature over the handshake proving I own it." AWS validates the signature, looks up the cert in its registry, finds the attached IoT policy, and decides whether you are allowed to do the things you try to do over the resulting MQTT session. Authentication and authorization are decoupled. The cert authenticates; the policy authorizes.

Two implications matter for embedded engineers. First, the private key is the only secret. The cert is public. The root CA is public. If you lose the private key, you lose the device. If you keep the private key, everything else is recoverable. Second, AWS does not care how you provisioned the cert as long as the chain of trust resolves. Which is the door Fleet Provisioning walks through.

Fleet Provisioning by Claim, in pseudocode you can actually use

AWS IoT Fleet Provisioning has two flavors, and the docs do not do a great job of distinguishing why you would pick one over the other. The short version:

  • Provisioning by Trusted User. A human (an installer, a technician) authenticates to AWS, gets a temporary cert, hands it to the device. Good for high-touch installs. Useless for ten thousand units off a contract manufacturer's line.
  • Provisioning by Claim. Every device ships with a shared "claim certificate" that has very narrow IoT policy permissions: it can only request a real, per-device certificate from AWS, and only over a specific MQTT topic. On first boot, the device uses the claim cert to call CreateCertificateFromCsr (or CreateKeysAndCertificate), receives a unique cert, stores it, throws away the claim cert, and reconnects as itself.

Provisioning by Claim is what you want for a fleet. The flow looks like this:

// 1. Boot. NVS contains either a claim cert (factory) or a real cert (post-provisioning).
provisioning_state_t state = nvs_load_provisioning_state();
 
if (state == PROVISIONED) {
    connect_with_device_cert();
    return;
}
 
// 2. Generate a keypair on-device. The private key never leaves flash.
mbedtls_pk_context keypair;
mbedtls_pk_init(&keypair);
generate_ecdsa_p256_keypair(&keypair);
 
// 3. Build a CSR with the device serial as CN.
char csr_pem[1024];
build_csr(&keypair, device_serial(), csr_pem, sizeof(csr_pem));
 
// 4. Connect to AWS using the claim cert.
mqtt_connect_with_claim_cert();
 
// 5. Subscribe to the /accepted and /rejected response topics FIRST,
//    so the broker's reply is not lost, then publish the CSR.
mqtt_subscribe("$aws/certificates/create-from-csr/json/accepted");
mqtt_subscribe("$aws/certificates/create-from-csr/json/rejected");
mqtt_publish("$aws/certificates/create-from-csr/json",
             json_wrap("certificateSigningRequest", csr_pem));

// 6. Parse the returned cert + ownership token from /accepted.
char device_cert[2048];
char ownership_token[512];
await_provisioning_response(device_cert, ownership_token);
 
// 7. Call RegisterThing to attach the cert to a Thing in the registry,
//    then wait for its /accepted response before touching NVS.
mqtt_publish("$aws/provisioning-templates/<TemplateName>/provision/json",
             json_wrap_two("certificateOwnershipToken", ownership_token,
                           "parameters", device_parameters_json()));
await_register_thing_accepted();
 
// 8. Persist the new cert + private key. Wipe the claim cert.
nvs_store_device_cert(device_cert);
nvs_store_private_key(&keypair);
nvs_erase_claim_cert();
 
// 9. Disconnect, reconnect as the real device.
mqtt_disconnect();
connect_with_device_cert();

The two non-obvious parts are step 2 and step 7. Generating the keypair on-device matters because it means the private key has literally never existed anywhere else. Even AWS does not see it; AWS only sees the CSR, which contains the public key. And the RegisterThing call is what binds the cert to a Thing in the AWS registry with a name template like corvita-pump-{serialNumber}, so your cloud side has a stable identity to attach shadows and policies to.
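
If you are wondering what generate_ecdsa_p256_keypair and build_csr are hiding, here is roughly the mbedTLS sequence underneath, minus error handling. The helper names in the pseudocode and the CN value here are ours; the mbedTLS calls are real:

#include "mbedtls/ctr_drbg.h"
#include "mbedtls/ecp.h"
#include "mbedtls/entropy.h"
#include "mbedtls/pk.h"
#include "mbedtls/x509_csr.h"

mbedtls_entropy_context entropy;
mbedtls_ctr_drbg_context drbg;
mbedtls_pk_context key;
mbedtls_entropy_init(&entropy);
mbedtls_ctr_drbg_init(&drbg);
mbedtls_pk_init(&key);
mbedtls_ctr_drbg_seed(&drbg, mbedtls_entropy_func, &entropy,
                      (const unsigned char *)"prov", 4);

// The keypair is born on-device; the private half never leaves flash.
mbedtls_pk_setup(&key, mbedtls_pk_info_from_type(MBEDTLS_PK_ECKEY));
mbedtls_ecp_gen_key(MBEDTLS_ECP_DP_SECP256R1, mbedtls_pk_ec(key),
                    mbedtls_ctr_drbg_random, &drbg);

// The CSR carries only the public key plus the serial-number subject.
mbedtls_x509write_csr csr;
mbedtls_x509write_csr_init(&csr);
mbedtls_x509write_csr_set_md_alg(&csr, MBEDTLS_MD_SHA256);
mbedtls_x509write_csr_set_subject_name(&csr, "CN=SER-00042");  // hypothetical serial
mbedtls_x509write_csr_set_key(&csr, &key);

unsigned char csr_pem[1024];
mbedtls_x509write_csr_pem(&csr, csr_pem, sizeof(csr_pem),
                          mbedtls_ctr_drbg_random, &drbg);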

The private key is the only secret. The cert is public. If you keep the private key, everything else is recoverable.

The claim cert's IoT policy must be ruthless: only iot:Connect with a specific clientId pattern, only iot:Publish to the two $aws/certificates/... and $aws/provisioning-templates/... topics, and iot:Subscribe to the /accepted and /rejected responses. If a claim cert leaks (and assume it will), the worst an attacker can do is enroll fake devices, which your provisioning template hooks can rate-limit and audit.
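
In concrete terms, a claim policy in that spirit looks something like this sketch. Region, account id, clientId prefix, and template name are placeholders for whatever yours are:

{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": "iot:Connect",
      "Resource": "arn:aws:iot:us-east-1:123456789012:client/prov-*" },
    { "Effect": "Allow", "Action": "iot:Publish",
      "Resource": [
        "arn:aws:iot:us-east-1:123456789012:topic/$aws/certificates/create-from-csr/json",
        "arn:aws:iot:us-east-1:123456789012:topic/$aws/provisioning-templates/MyTemplate/provision/json"
      ] },
    { "Effect": "Allow", "Action": "iot:Subscribe",
      "Resource": [
        "arn:aws:iot:us-east-1:123456789012:topicfilter/$aws/certificates/create-from-csr/json/*",
        "arn:aws:iot:us-east-1:123456789012:topicfilter/$aws/provisioning-templates/MyTemplate/provision/json/*"
      ] },
    { "Effect": "Allow", "Action": "iot:Receive",
      "Resource": [
        "arn:aws:iot:us-east-1:123456789012:topic/$aws/certificates/create-from-csr/json/*",
        "arn:aws:iot:us-east-1:123456789012:topic/$aws/provisioning-templates/MyTemplate/provision/json/*"
      ] }
  ]
}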

The five mTLS handshake failures you will actually see

Once provisioning is in place, you will spend an unreasonable fraction of your remaining time staring at handshake errors. Here are the ones that ate our weeks, in order of frequency.

1. MBEDTLS_ERR_X509_CERT_VERIFY_FAILED on first boot, only on some devices. This one is almost always the system clock. mbedTLS validates notBefore and notAfter on the server cert, and if the device's clock is at the Unix epoch (1970) because it has not yet talked to NTP, every cert in the world looks "not yet valid." The fix is to gate MQTT startup on a successful SNTP sync, or use esp_sntp_set_time_sync_notification_cb and only start the MQTT client after the callback fires. We learned this when half the devices on the bench connected and the other half did not, and the only difference was which ones had finished SNTP first. A sketch of the gate follows this list.

2. MBEDTLS_ERR_SSL_FATAL_ALERT_MESSAGE with alert code 42 (bad_certificate) or 43 (unsupported_certificate). AWS sent you a TLS alert and closed the connection. Ninety percent of the time this is a wrong root CA. AWS rotated to ATS endpoints (a1b2c3-ats.iot.us-east-1.amazonaws.com) years ago, and the corresponding root is Amazon Root CA 1 (or another Amazon Trust Services root), not the legacy VeriSign one. If you are using a *-ats endpoint with a non-ATS root, the chain does not validate. Confirm with openssl s_client -connect <your-endpoint>:8883 -showcerts from a laptop and compare the issuer chain to whatever you embedded.

3. MBEDTLS_ERR_SSL_FATAL_ALERT_MESSAGE with alert code 46 (certificate_unknown) or 48 (unknown_ca). AWS does not recognize your device cert. Either the cert was never registered (Fleet Provisioning silently failed and you fell back to a stale cert), or the cert is in INACTIVE state in the IoT registry, or the cert is registered but in a different AWS account or region. Check the cert's fingerprint against the registry before suspecting anything else: openssl x509 -in device.pem -noout -fingerprint -sha256.

4. MBEDTLS_ERR_SSL_TIMEOUT mid-handshake, especially on cellular or weak Wi-Fi. mbedTLS's default handshake timeout is shorter than you want for high-latency links. Bump CONFIG_MBEDTLS_SSL_HANDSHAKE_TIMEOUT_MS (or set it programmatically) to something like 30 seconds. Related: if you see the handshake succeed but then immediately get disconnected with no error, your MQTT keep-alive is probably colliding with a NAT timeout. Set keep-alive to 60 seconds, not the 120-second default.

5. ESP_ERR_NO_MEM during the handshake. The TLS record buffers eat heap. AWS IoT Core's cert chain plus the AWS-side ServerHello can push the handshake to use more RAM than the default 16 KB inbound buffer allows. If you are running the BLE stack, Wi-Fi, and a fat application all on a vanilla ESP32 (not S3, not WROVER), you can starve mbedTLS. Either bump CONFIG_MBEDTLS_SSL_IN_CONTENT_LEN to handle the worst case or shed memory elsewhere. We hit this on a WROOM-32 with BLE provisioning enabled and ate three days before noticing.
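
Here is the SNTP gate from failure 1 as a minimal sketch, using the IDF 5.x esp_sntp API. The event-group plumbing and function names are ours:

#include <sys/time.h>
#include "esp_sntp.h"
#include "freertos/FreeRTOS.h"
#include "freertos/event_groups.h"

static EventGroupHandle_t s_net_events;
#define TIME_SYNCED_BIT BIT0

// Fires once SNTP has actually set the system clock.
static void on_time_sync(struct timeval *tv)
{
    xEventGroupSetBits(s_net_events, TIME_SYNCED_BIT);
}

void start_mqtt_when_clock_is_sane(void)
{
    s_net_events = xEventGroupCreate();
    esp_sntp_setoperatingmode(ESP_SNTP_OPMODE_POLL);
    esp_sntp_setservername(0, "pool.ntp.org");
    esp_sntp_set_time_sync_notification_cb(on_time_sync);
    esp_sntp_init();

    // A 1970 clock makes every server cert look "not yet valid",
    // so do not start TLS until this returns.
    xEventGroupWaitBits(s_net_events, TIME_SYNCED_BIT,
                        pdFALSE, pdTRUE, portMAX_DELAY);
    // ...now it is safe to call esp_mqtt_client_start()...
}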

The pattern across all of these: the error message is local (mbedTLS) but the cause is distributed. The fix is almost always on the cloud side or in the bring-up sequence, not in the TLS code itself.

Cert rotation: the part nobody wants to write

You provisioned a fleet. Every cert expires in a year. What is your plan for month 11?

The honest answer most teams arrive at: a background task that checks notAfter on the active device cert daily, and if it is within some window (we used 30 days), kicks off a renewal. Renewal looks identical to initial Fleet Provisioning except the device authenticates with its current cert instead of the claim cert, and the new cert replaces the old one in NVS atomically. The IoT policy for the device cert needs to allow iot:CreateCertificateFromCsr and the RegisterThing topic for itself, which is a small but real expansion of trust.
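
The daily check itself is small. A sketch, assuming the active cert PEM is already loaded from NVS; the helper name and the mktime timezone hand-waving are ours:

#include <stdbool.h>
#include <string.h>
#include <time.h>
#include "mbedtls/x509_crt.h"

// True if the active cert expires within window_days. Assumes the
// clock is already SNTP-synced (see the handshake section).
static bool cert_needs_renewal(const char *cert_pem, int window_days)
{
    mbedtls_x509_crt crt;
    mbedtls_x509_crt_init(&crt);
    // PEM parsing requires the NUL terminator to be counted in the length.
    if (mbedtls_x509_crt_parse(&crt, (const unsigned char *)cert_pem,
                               strlen(cert_pem) + 1) != 0) {
        mbedtls_x509_crt_free(&crt);
        return true;  // unparseable cert: renew now rather than fly blind
    }
    struct tm exp = {
        .tm_year = crt.valid_to.year - 1900,
        .tm_mon  = crt.valid_to.mon - 1,
        .tm_mday = crt.valid_to.day,
        .tm_hour = crt.valid_to.hour,
        .tm_min  = crt.valid_to.min,
        .tm_sec  = crt.valid_to.sec,
    };
    time_t not_after = mktime(&exp);  // close enough; notAfter is UTC, mktime is local
    mbedtls_x509_crt_free(&crt);
    return (not_after - time(NULL)) < (time_t)window_days * 24 * 3600;
}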

The thing that bites you is the failure case. What if rotation fails because the device is offline for the entire 30-day window? What if the new cert is issued but the write to NVS is interrupted by a power loss? You need three things:

  • A two-slot NVS layout for cert + key, with an active-slot marker that you flip atomically only after the new cert handshakes successfully against AWS at least once. Same A/B pattern as OTA; a sketch follows this list.
  • A grace period after expiration where the device keeps trying with the old cert. AWS will reject it, but the device should fall back to the claim cert (if still present) or surface a hard provisioning error rather than silently bricking.
  • A watchdog timer on the rotation flow itself. If a rotation has been "in progress" for more than 24 hours, abort it, mark the attempt failed, and try again on the next interval. Half-rotated state is worse than no rotation.
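
A minimal shape for the two-slot layout. The key names are ours, not an IDF convention; the point is that the one-byte active marker is the only thing that flips after a successful handshake:

#include "nvs.h"

// Stage the renewed cert + key into whichever slot is inactive.
static esp_err_t stage_new_cert(nvs_handle_t h, const char *cert, const char *key)
{
    uint8_t active = 0;
    nvs_get_u8(h, "active", &active);  // absent key -> slot A active
    esp_err_t err = nvs_set_str(h, active ? "cert_a" : "cert_b", cert);
    if (err == ESP_OK) err = nvs_set_str(h, active ? "key_a" : "key_b", key);
    return err == ESP_OK ? nvs_commit(h) : err;
}

// Call only after the staged cert has completed a TLS handshake against
// AWS at least once. The single-byte flip is the atomic commit point.
static esp_err_t activate_staged_cert(nvs_handle_t h)
{
    uint8_t active = 0;
    nvs_get_u8(h, "active", &active);
    esp_err_t err = nvs_set_u8(h, "active", !active);
    return err == ESP_OK ? nvs_commit(h) : err;
}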

We landed on rotating well before expiration to give the system slack. For us the sweet spot for renewal was the 75 percent mark of the cert's validity, scheduled for midday local time (when devices are most likely awake) with jitter, so the entire fleet does not hammer the provisioning endpoint at the same instant.

Where this leaves you

If you are building all of the above (Fleet Provisioning by Claim, two-slot NVS cert storage, rotation with watchdog and grace, debug tooling for the five handshake failure modes), you are looking at four to six weeks of firmware work that is entirely about identity infrastructure and not the product you set out to build. We did it. Twice.

This is the gap SCADABLE fills. We treat cert provisioning, rotation (24-hour cert lifetime with automatic re-enrollment at hour 18 over EST, Enrollment over Secure Transport, RFC 7030), and gateway-side state recovery as platform concerns so firmware teams can spend their cycles on what their devices actually do. If that sounds useful, we are talking to early hardware teams now.

Either way: write the rotation watchdog. Future you, debugging an expired-cert outage at 2am, will be grateful.