|The worst possible answer to a customer problem is that it’s a hardware bug. Hardware bugs are expensive to fix. You not only have to replace the hardware, you may also have to replace everything you’ve got on the shelves. You can’t do this until you’ve “fixed” the problem, which might cost millions of dollars and take months. Hardware problems suck.|
This reminds me of a specific problem from long ago dealing with locking.
Locking is what programs do to avoid stepping on each other. It’s very similar to locking the bathroom door. When the bathroom is in use, the door is locked. Others wanting to use the bathroom will try the door, see that it’s locked, and try again later.
Our problem was that applications and even the kernel were crashing with what appeared to be two threads using the same resource. Thread B would have the lock, but thread A was in the same code acting as if it also held the lock. This normally is not possible. Thread A could not have gotten where it was without having the lock, and Thread B should have stopped when it hit the lock being held by A. There were no obvious errors in the code that could have allowed this.
Ultimately, we discovered the answer hinged on two assembly instructions and a new feature in the CPU called “speculative execution.”
The instructions were LL, SC (load link, store conditional). LL, SC is a neat concept that makes locking very fast. You “read” the lock in (LL) and “store” your lock value if the lock is free (store conditionally). If the lock is not free, the store does not happen and your code vectors back around to try again later.
Speculative execution is a feature that allows a CPU to execute ahead of itself. The CPU will pull in instructions that are coming up (and assume any if/then/else branches) and sort out what will probably happen in the next 32 instructions. In this way, the CPU can have all these values calculated and cued up for rapid and efficient execution.
So what was the problem? Turns out that if the CPU was in the middle of an LL/SC lock and happened to speculate over another LL/SC lock, the state of the first lock was overwritten by the second. So speculating over an unlocked lock while checking a locked one led you to continue on as if your lock was “unlocked”. The “SC” instruction would succeed when it shouldn’t have.
This is really a hardware issue. The hardware shouldn’t do this. However, replacing every CPU would be expensive and “hard.” The solution? It’s a compiler bug. The compilers were changed so that when applications and even the OS were compiled, 32 “no op” instructions were inserted after each and every LL/SC occurrence. This made sure that any speculative execution would never hit a second LL/SC combo since the CPU never went out past 32 instructions. Problem solved…. I guess.