Back when I was at SGI slaying dragons I had the good fortune to visit America Online. At the time, AOL was moving about 20% of the USA’s email traffic through SGI Challenge XL servers.
This was around the time they crossed 10 million (with an M) users. That was a lot back then – there were t-shirts printed. Facebook is approaching 1 billion (with a B) users today.
As you can imagine, AOL introduced some serious issues of scale to our products. We’d never really had anyone use a Challenge XL server to handle 100,000 mail users (much less five gymnasiums full of Challenge XL servers to handle 10 million). Having so many systems together created some interesting challenges.
First off, when a customer has two systems and you have a bug that occurs once a year, the two-system customer may never see it. If they do see it, they might chalk it up to a power glitch. Your engineers may never get enough reports to fix the problem, since nobody can reproduce it on demand.
Not so with 200 in a room. You might see that once-a-year glitch every day. That’s a very different prospect. Manufacturing will never see that in quality control running systems one at a time. It can only be seen with hundreds of systems together.
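To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes each machine fails independently at an average of one glitch per machine-year (a Poisson model, which is my simplification and not something we measured); the function names are made up for illustration.

```python
import math

def expected_per_day(machines, failures_per_machine_year=1.0):
    """Average number of glitches per day across the whole room."""
    return machines * failures_per_machine_year / 365.0

def prob_at_least_one_per_day(machines, failures_per_machine_year=1.0):
    """Chance of seeing one or more glitches on a given day (Poisson assumption)."""
    return 1.0 - math.exp(-expected_per_day(machines, failures_per_machine_year))

for n in (2, 200):
    print(n, expected_per_day(n), prob_at_least_one_per_day(n))
```

Under those assumptions, 200 machines average a bit more than one glitch every two days, with roughly a 42% chance of at least one on any given day. The two-system customer is looking at well under a 1% chance per day.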
In AOL’s case, we had the “Twilight Hang” (and no, there were no vampires that sparkled). A machine would simply “stop.” There was no core dump. It could not be forced down and there would be no error messages. The machine was simply frozen in a twilight state. This is the worst possible situation because engineers and support personnel cannot gather data or evidence to fix the problem. There’s no way to get a fingerprint to link the problem to other known issues.
SGI mustered a very strong team of people (including me) to go onsite with a special bus analyzer to watch one of the machines that seemed to hit the problem more than the others. I was there for three weeks. In fact, my fiancée actually flew out for the last of the three weeks because it was her birthday and I was not scheduled to be gone that long.
One highlight I recall from this trip was sitting in a room with some of the onsite SGI and AOL people on a conference call with SGI engineering and SGI field office people. During this call, the engineering manager was explaining the theory that the /dev/poll device might be getting “stuck” because of a bug with the poll-lock. Evidently, the poll-lock might get locked and never “unlocked,” which would cause the machine to hang. I had to ask, “Carl, how many poll locks does it take to hang a system?” There was dead silence. I came to find out that the other SGI field people on the phone had hit mute and were rolling on the floor laughing. (Thanks, guys.) The Corporate SGI people were not amused.
Anyhow, the ultimate cause of the problem was secondary cache corruption. Irix 5.3 was not detecting cache errors correctly, and when it did detect one, it would corrupt the result every other time. Ultimately, they completely disabled secondary cache correction, and to this day you Irix users will notice a tuning variable called “r4k_corruption.” You have to turn that on to allow the machine to attempt to correct those errors (even at the risk of corrupting memory). The ultimate solution for AOL was to upgrade to R10k processors that “correctly” corrected secondary cache errors every time.
I remember the days when CPUs were stuck in a rut. They were barely hitting 1 GHz. Networks were running at 1 Gb and beyond, and CPUs and storage just could not keep up. Clients wanted redundant, failover-capable servers that could handle 600 clients, but SGI was running out of ways to do that. We couldn’t make the bus any wider (128-bit computers?) and we couldn’t make the CPUs any faster. What should we do?
One answer was to network many systems together over NUMA (non-uniform memory access). This would let many systems (that would normally be a cluster) act as if they were one system. The problem with a setup like this is speed. Systems accessing remote memory are slow. We had to find a way to speed up access to memory.
SGI invented lots of cool stuff to do this.
One of the new things was the CPOP connector. This connector was made up of many fuzzy little pads. The fuzzy pads would be compressed together and allow for a much higher frequency connection than normally would be allowed with gold pins.
The problem with delicate things like this is that they are far more sensitive to installation mistakes. Each connector needed to be torqued down to the right pressure so that the signals made it across cleanly. Install them too loosely and you’re going to see connectivity errors.
So cut to one of our advanced training courses where we taught field engineers how to replace boards. The instructor explained how each HEX head screw on the CPU cards needs to be torqued down to an exact specification. This is where one of the helpful field guys, who had clearly done this before, piped up and explained that you know the boards are properly seated when you tighten the HEX screw down and hear three “clicks.”
The instructor and I looked at each other. We didn’t remember there being three clicks. We normally used torque drivers to accurately measure the torque and there were never any clicks.
After some investigation and a quick examination of the board our helpful field guy had just installed, we discovered the source of the “three clicks.” They were the sound of the very expensive backplane cracking as the HEX screw penetrated the various layers of plastic…. OUCH.
From that point on, correct torque drivers were provided to all field personnel.
The worst possible answer to a customer problem is that it’s a hardware bug. Hardware bugs are expensive to fix. You not only have to replace the hardware, you may also have to replace everything you’ve got on the shelves. You can’t do this until you’ve “fixed” the problem, which might cost millions of dollars and take months. Hardware problems suck.
This reminds me of a specific problem from long ago dealing with locking.
Locking is what programs do to avoid stepping on each other. It’s very similar to locking the bathroom door. When the bathroom is in use, the door is locked. Others wanting to use the bathroom will try the door, see that it’s locked, and try again later.
Our problem was that applications and even the kernel were crashing with what appeared to be two threads using the same resource. Thread B would have the lock, but Thread A was in the same code acting as if it also held the lock. This normally is not possible. Thread A could not have gotten where it was without having the lock, and Thread B should have stopped when it hit the lock being held by A. There were no obvious errors in the code that could have allowed this.
Ultimately, we discovered the answer hinged on two assembly instructions and a new feature in the CPU called “speculative execution.”
The instructions were LL and SC (load linked and store conditional). LL/SC is a neat concept that makes locking very fast. You “read” the lock in (LL), and if it is free you “store” your lock value (SC); the store only goes through if nobody else has touched the lock in between. If the lock is not free, or the store does not happen, your code loops back around to try again later.
Speculative execution is a feature that allows a CPU to execute ahead of itself. The CPU will pull in instructions that are coming up (and assume any if/then/else branches) and sort out what will probably happen in the next 32 instructions. In this way, the CPU can have all these values calculated and queued up for rapid and efficient execution.
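If it helps to see the idea in code, here is a tiny software model of LL/SC. It is only a sketch: real LL/SC lives in the CPU and cache hardware, and the class and function names here (LLSCWord, try_acquire) are invented for this example.

```python
class LLSCWord:
    """One word of memory plus the set of CPUs holding a valid 'link' to it."""

    def __init__(self):
        self.value = 0        # 0 = lock free, 1 = lock held
        self.linked = set()   # CPUs whose load-linked is still valid

    def load_linked(self, cpu):
        # LL: read the value and remember that this CPU is watching the word.
        self.linked.add(cpu)
        return self.value

    def store_conditional(self, cpu, new_value):
        # SC: only store if nobody has written the word since our LL.
        if cpu not in self.linked:
            return False
        self.value = new_value
        self.linked.clear()   # a successful store breaks every outstanding link
        return True


def try_acquire(word, cpu):
    """One pass of the lock loop; the caller retries later on False."""
    if word.load_linked(cpu) != 0:
        return False          # lock is held, come back later
    return word.store_conditional(cpu, 1)
```

In this model, if two CPUs both do their LL on a free lock, only the first SC wins; the second one fails and that CPU has to go around again.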
So what was the problem? Turns out that if the CPU was in the middle of an LL/SC lock and happened to speculate over another LL/SC lock, the state of the first lock was overwritten by the second. So speculating over an unlocked lock while checking a locked one led you to continue on as if your lock was “unlocked”. The “SC” instruction would succeed when it shouldn’t have.
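Here is a deliberately crude toy of that failure. It assumes the CPU keeps a single slot of LL state and that a speculated LL can overwrite it; that is my simplification of the story above, not a description of the real silicon, and ToyCPU and its methods are made-up names.

```python
class ToyCPU:
    """Toy CPU with one slot of LL state (an assumption of this sketch)."""

    def __init__(self, memory, speculation_clobbers_ll_state):
        self.memory = memory  # dict: lock name -> 0 (free) or 1 (held)
        self.speculation_clobbers_ll_state = speculation_clobbers_ll_state
        self.ll_value = None  # what the CPU believes the last LL saw

    def ll(self, addr):
        self.ll_value = self.memory[addr]

    def speculative_ll(self, addr):
        # A well-behaved CPU would throw speculative state away.
        # The buggy toy lets it overwrite the state of the first lock.
        if self.speculation_clobbers_ll_state:
            self.ll_value = self.memory[addr]

    def try_lock(self, addr, speculated_addr):
        self.ll(addr)
        self.speculative_ll(speculated_addr)  # running ahead over another lock
        if self.ll_value != 0:
            return False       # looks held, retry later
        self.memory[addr] = 1  # the "SC" goes through
        return True


memory = {"X": 1, "Y": 0}  # X is held by someone else, Y is free
sane = ToyCPU(dict(memory), speculation_clobbers_ll_state=False)
buggy = ToyCPU(dict(memory), speculation_clobbers_ll_state=True)
print(sane.try_lock("X", "Y"), buggy.try_lock("X", "Y"))
```

The sane toy refuses the held lock; the buggy one walks right through it, which is the "two threads in the bathroom" situation we were seeing.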
This is really a hardware issue. The hardware shouldn’t do this. However, replacing every CPU would be expensive and “hard.” The solution? It’s a compiler bug. The compilers were changed so that when applications and even the OS were compiled, 32 “no op” instructions were inserted after each and every LL/SC occurrence. This made sure that any speculative execution would never hit a second LL/SC combo since the CPU never went out past 32 instructions. Problem solved…. I guess.
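The padding trick is simple enough to sketch as a pass over a list of assembly lines. This is only an illustration of the idea: the real change lived inside the compilers, exactly where the padding went is my assumption (here, right after each SC), and pad_after_sc is a made-up name.

```python
NOP_PAD = 32  # padding length from the story above


def pad_after_sc(instructions, pad=NOP_PAD):
    """Return a copy of the instruction list with `pad` nops after every SC."""
    padded = []
    for ins in instructions:
        padded.append(ins)
        parts = ins.split()
        if parts and parts[0].lower() == "sc":
            # Keep the next LL/SC sequence out of the speculation window.
            padded.extend(["nop"] * pad)
    return padded


lock_loop = [
    "ll   t0, 0(a0)",
    "bnez t0, retry",
    "li   t1, 1",
    "sc   t1, 0(a0)",
    "beqz t1, retry",
]
print(len(pad_after_sc(lock_loop)))
```

Every lock operation gets fatter and slower this way, which is the price of fixing a hardware problem in software.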
Working with the military can be a lot of fun. It can be exhilarating. It can also be incredibly frustrating and boring.
Recently, we had a problem with a large server system being used by a military contractor at a very sensitive base. The server was periodically losing its boot drive. The drive would spin down and stop (hanging the machine). They would reboot and the system would run for a while, then spin down again and stop. Eventually, the drive would stop working altogether. It would not spin up anymore.
The solution is simple, right? Just get the drive and send it to the manufacturer - they will figure out what went wrong. Sorry…no can do. These drives are super secret and they must be shredded. They cannot be returned.
Management is up in arms. "Steve," they said, "you need to take your SCSI analyzer down there and plug it into this system and see what's going on."
I had three good reasons not to like this idea:
1. Clearly, this is a hardware problem. Software does not "fry" drives so they will not come up again. Were they thinking the drive was just demoralized from bad SCSI commands? It was broken!
2. Connecting my SCSI analyzer to a super secret system is simple enough, but the military isn't going to let me take it home again! Once it's been plugged in, it's theirs. It took me a long time to justify getting that thing and I really didn't want to give it to this military contractor.
3. Whatever was blowing up their boot disks was just as likely to blow up my SCSI analyzer! That's not a very good idea either.
Not sure of the best way to approach this, I walked over to another well-known smoke jumper guy. This guy fixed hardware. He fixed it the way only a former military guy could. He took it apart down to every nut and bolt, he smelled cables for ozone (indicating something burned) and he examined boards with a portable microscope. He also had one other thing that is required for hardware debugging…a sixth sense that just told him what was wrong.
Me: "Phil, I've got this system that's blowing up drives. They've replaced everything and they want me to put my analyzer on there, but I think this is hardware. You have any suggestions?"
Phil: "It's a burned cap on the midplane, have them replace it."
The SSE was quick to shoot this down when we talked: "No way! We've replaced the midplane twice. It can't be the midplane."
This is when "smoke jumping" happens. Phil gets on a plane and flies down.
What does he discover? It is indeed a burned cap on the midplane. The cause? The foil shielding behind the midplane is not mounted properly and is "angled" – resulting in the foil touching a 5V pin. As soon as they powered on the machine with the new midplane, “ZAP,” another midplane is fried.
They fixed the foil, put in a new midplane, and me and my SCSI analyzer lived to fight another day.