Debugging hardware is hard

If you want a very technical illustration of why, as they say, hardware is hard, follow along as I debug a communications problem between Pickup’s two main chips.

Day 1

I was working on the bus interface between Pickup’s MCU (the STM32) and its Wi-Fi radio (an ESP32) only to find that the STM32 UART appeared to be… pretty terrible, actually. It was getting all sorts of errors I’d never seen on another MCU before. Even when I dropped the bus clock down to almost nothing, the STM32 corrupted bits on its own receive path.

Bizarre. It was like the STM32 was occasionally sampling a bit wrong, or messing up its timing. I’d never seen anything like it, and it’s not a complicated interface (you do one in Year 3 of undergrad).

Day 2

With a fresh mind, I went through everything: Clocks are accurate. The electrical interface is pristine. The ESP32 itself has zero problems, never missing a single bit.

But the STM32 threw random errors constantly. On a DMA transfer, so it wasn’t interrupt timing or anything related to that. Best I could tell, the chip itself had a glitchy receiver design and that’s just how ST rolls.

Pickup’s interface between these two chips was originally designed with the (usually correct) assumption that bit errors are so rare you can just reset the entire board. This would be a once-per-year-at-most kind of event. If it was happening this often, I’d need to redesign the entire interface to make it capable of handling bus errors, and build a test system that specifically verifies the interface is reliable.

So at this point I needed to confirm that it was systemic. I fired up another three boards (the setup process was working well, at least). One got hung up pretty quick, but didn’t do it again. Okay, a bit less doom and gloom, but still frustrating. Maybe there was a hardware glitch or noise issue that wasn’t showing up on my scope.

I set the boards to cycle the Wi-Fi about every 10 seconds to be extra noisy, and left them to run overnight.

At this point I was starting to redesign the interface in my head. My strategy would be:

  • Support retrying a packet on a frame/CRC error/timeout on the STM32 side.
  • Add error status to the acknowledge frame coming from the ESP32 side.
  • Add packet IDs to prevent the ESP32 from running a duplicate command on a retry. (We are likely to miss the acknowledge, but the actual command always goes through.)
  • Start the interface at a lower speed in the bootloader, and then add a command to bump the speed up in normal operation in the main firmware. The idea is that the bootloader gets extra timing margin so it is less likely to have a problem where it is hardest to fix.
  • Add error counters to the bus interface driver and log those for diagnostics.
  • Set up some stress tests to run 24/7.
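
The retry, packet ID, and error counter ideas in that list can be sketched in a few lines. This is a minimal simulation in Python, not Pickup’s firmware: the frame layout (1-byte ID, payload, CRC32), the class names Requester and Responder, and the single status byte in the acknowledge are all assumptions made up for illustration.

```python
import zlib
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


def encode(packet_id: int, payload: bytes) -> bytes:
    """Hypothetical frame layout: 1-byte packet ID, payload, 4-byte CRC32."""
    body = bytes([packet_id & 0xFF]) + payload
    return body + zlib.crc32(body).to_bytes(4, "little")


def decode(frame: bytes) -> Optional[Tuple[int, bytes]]:
    """Return (packet_id, payload), or None if the frame is short or the CRC fails."""
    if len(frame) < 5:
        return None
    body, crc = frame[:-4], frame[-4:]
    if zlib.crc32(body).to_bytes(4, "little") != crc:
        return None
    return body[0], body[1:]


@dataclass
class Responder:
    """Stands in for the ESP32 side: runs each command once, re-acks retries."""
    executed: List[bytes] = field(default_factory=list)
    last_id: Optional[int] = None

    def handle(self, frame: bytes) -> Optional[bytes]:
        decoded = decode(frame)
        if decoded is None:
            return None  # malformed packet: stay silent so the sender times out
        packet_id, payload = decoded
        if packet_id != self.last_id:  # a repeated ID is a retry, so skip the command
            self.executed.append(payload)
            self.last_id = packet_id
        return encode(packet_id, b"\x00")  # ack with a status byte (0 = OK, assumed)


@dataclass
class Requester:
    """Stands in for the STM32 side: retries on a missing or corrupt acknowledge."""
    link: Callable[[bytes], Optional[bytes]]
    max_attempts: int = 3
    next_id: int = 0
    errors: int = 0  # failed attempts, kept for diagnostics

    def send(self, payload: bytes) -> bool:
        packet_id = self.next_id
        self.next_id = (self.next_id + 1) & 0xFF
        frame = encode(packet_id, payload)
        for _ in range(self.max_attempts):
            ack = self.link(frame)
            decoded = decode(ack) if ack is not None else None
            if decoded == (packet_id, b"\x00"):
                return True
            self.errors += 1
        return False


# Usage: lose the first acknowledge, as if the receive path glitched once.
esp = Responder()
acks_to_drop = [1]


def flaky_link(frame: bytes) -> Optional[bytes]:
    ack = esp.handle(frame)
    if acks_to_drop[0] > 0:
        acks_to_drop[0] -= 1
        return None
    return ack


stm = Requester(link=flaky_link)
delivered = stm.send(b"status?")
```

In this run the command reaches the responder on the first attempt, the acknowledge is lost, and the retry is answered without running the command a second time. That is the case the packet IDs exist for.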

I still had no idea what the actual problem was. The ESP32 side was fine; I barely needed to touch it. The STM32 reported a framing error most of the time, but I didn’t see it on the bus. The signals run right underneath the ESP32, pretty close to the antenna, so maybe they were picking up some Wi-Fi interference during a transmission and my scopes (I had a Saleae and an oscilloscope plugged in) were just missing it? The signal was just so clean, and a glitch on the power rail should have shown up on the bus too. It was so weird.

The transmitter pulls a lot of power and ramps up really fast, so it’s a common cause of power glitches with the ESP32, but again, I wasn’t actually seeing a glitch there on a 100 MHz scope. Whatever it was, it was beyond my direct observation.

Day 3

My three test boards continued to run overnight, with a full Wi-Fi connect/disconnect cycle, along with hitting the serial bus for status information.

Was the issue related to the board I was developing on, or to having the debugger plugged in? I ran the test set on my bench power supply simulating a battery connection, with no debugger attached.

I put error checks and retries all over the place, and watched it fail 100 times in a row, with the same error every time. It received an extra byte of all 0s before the header, 98 times, with frame errors the other 2.

And it was only the UART receiver. The transmit side was perfect. The ESP32 will fail to respond if it gets a malformed packet, but it responded correctly every time.

So if it was a glitch in the STM32, it was weirdly localized. Everything else on the chip seemed to be fine. It’s not like the CPU would hang up, I didn’t see the RAM get corrupted, timers and clocks all kept working.

It could still be that the UART really didn’t like having the debugger connected, though that wasn’t documented and it would make the debugger somewhat useless. I kept going down the rabbit hole…

I updated the error handler to reset the UART, and it seemed to recover just fine. So something was glitching the UART itself and it got stuck. Maybe some error flag didn’t get cleared? Doing a full reset of the UART logic in an error case made sense anyway. The practice of “just reset your computer” works on many scales.

The thrilling conclusion

I found the root cause, though I still can’t explain why it does what it does.

There are a couple of clocks on the chip. One is the 32 kHz crystal, which by itself can only connect to a handful of things. Another is the MSI, which is a built-in RC oscillator with a couple of frequency settings. Not as accurate as a crystal, but cheap and low-power.

The MSI can be automatically calibrated against the 32 kHz crystal, and I had that turned on because I wanted the UART reference clock to be as accurate as possible to match the ESP32. UART requires the two sides to have fairly close clocks since it doesn’t use an explicit clock signal.
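
To put a rough number on “fairly close”, here is a back-of-the-envelope calculation. It assumes 8N1 framing (one start bit, eight data bits, one stop bit) and an idealized receiver that resynchronizes on the start-bit edge and samples each bit at its midpoint. Neither detail comes from Pickup’s actual configuration, and the function name and parameters are made up for the example.

```python
def max_clock_mismatch(data_bits: int = 8, parity_bits: int = 0,
                       sync_error_bits: float = 0.0) -> float:
    """Ideal upper bound on the relative baud-rate mismatch between two UARTs.

    The last sample of a frame (the stop bit) lands this many bit periods
    after the start-bit edge: start bit + data bits + parity bits + half a bit.
    The accumulated drift at that point has to stay under half a bit, minus any
    uncertainty in where the receiver thinks the start edge was (an assumed
    knob, expressed in bit periods).
    """
    last_sample = 1 + data_bits + parity_bits + 0.5
    return (0.5 - sync_error_bits) / last_sample


ideal_budget = max_clock_mismatch()                          # 0.5 / 9.5
with_16x_sampling = max_clock_mismatch(sync_error_bits=1 / 16)
```

Under those assumptions the total budget is a little over 5% between the two ends, so roughly half that per side, and less once sampling granularity and edge rise times eat into it. A real receiver’s tolerance is set by its datasheet, not by this estimate.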

If I have the auto-calibration enabled, the UART glitches appear. If I turn it off, no glitches. That’s weird, because I wanted the autocal specifically for UART reliability. But it seems like something in that process causes a glitch that throws the receiver off.

I didn’t see this when I was bringing up the bootloader and doing the original comms check on the new boards because the bootloader doesn’t enable the crystal at all. So, the bootloader never glitched.

I couldn’t find anything in the docs or on the Internet about why this might happen with the autocal, and there’s nothing that details exactly how it works either. (Welcome to microcontroller programming.) But I can at least correlate the glitches with that setting, I now have more robust UART code, and even with the glitch, comms seem to be quite stable and reliable now. That was all totally bonkers.

Day 4

Ran overnight: 22,290,582 transfers, 10 retransmissions, all recovered. 99.99995513800402340325% success rate on first attempt and 100% on the second attempt. Good enough, I think!
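
For anyone who wants to check the arithmetic, the first-attempt figure follows directly from the two counts above. This sketch assumes each retransmission corresponds to exactly one transfer that failed on its first try; the variable names are just for the example.

```python
from decimal import Decimal, getcontext

getcontext().prec = 30  # more digits than a float can carry

transfers = 22_290_582
retransmissions = 10  # assumed: one failed first attempt each

first_try_percent = (
    Decimal(transfers - retransmissions) / Decimal(transfers) * 100
)
failures_per_million = Decimal(retransmissions) / Decimal(transfers) * 1_000_000
```

That works out to a bit under one failed first attempt per two million transfers.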

This was a very software-meets-hardware problem, and you have to be good at both to solve it. I had the software debugger, logic analyzer, oscilloscope, and debug console on the other MCU all running at the same time. It’s been a good deep dive into the board, and I’m a lot more confident about the bus interface now.