Back when I was at SGI slaying dragons I had the good fortune to visit America Online. At the time, AOL was moving about 20% of the USA’s email traffic through SGI Challenge XL servers.
This was around the time they crossed 10 million (with an M) users. That was a lot back then – there were t-shirts printed. Facebook is approaching 1 billion (with a B) users today.
As you can imagine, AOL introduced some serious issues of scale to our products. We’d never really had anyone use a Challenge XL server to handle 100,000 mail users (much less five gymnasiums full of Challenge XL servers to handle 10 million). Having so many systems together created some interesting challenges.
First off, when a customer has two systems and you have a bug that occurs once a year, the two-system customer may never see it. If they do see it, they might chalk it up to a power glitch. Your engineers may never get enough reports to fix the problem, since it simply can’t be reproduced on demand.
Not so with 200 in a room. You might see that once-a-year glitch every day. That’s a very different prospect. Manufacturing will never see it in quality control, where systems are run one at a time. It only shows up with hundreds of systems together.
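To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes each machine hits the bug independently at an average rate of once per year (a Poisson model), which is my simplification for illustration and not something SGI measured; the function name and parameters are made up for this example.

```python
import math

def fleet_failure_stats(machines, failures_per_machine_year=1.0, days_per_year=365.0):
    """Return (expected failures per day, probability of at least one failure today).

    Assumes failures are independent across machines and arrive at a constant
    average rate (Poisson process). Both are simplifying assumptions.
    """
    # Rates add across independent machines, so the fleet rate scales linearly.
    rate_per_day = machines * failures_per_machine_year / days_per_year
    # For a Poisson process, P(no events in one day) = exp(-rate_per_day).
    p_at_least_one = 1.0 - math.exp(-rate_per_day)
    return rate_per_day, p_at_least_one

if __name__ == "__main__":
    for n in (2, 200):
        rate, p = fleet_failure_stats(n)
        print(f"{n:>4} machines: {rate:.3f} expected hangs/day, "
              f"P(at least one today) = {p:.1%}")
```

Under those assumptions, two machines see the bug about once every six months between them, while 200 machines see it about once every two days. Add a few hundred more machines and it really is a daily event.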
In AOL’s case, we had the “Twilight Hang” (and no, there were no vampires that sparkled). A machine would simply “stop.” There was no core dump. It could not be forced down and there would be no error messages. The machine was simply frozen in a twilight state. This is the worst possible situation because engineers and support personnel cannot gather data or evidence to fix the problem. There’s no way to get a fingerprint to link the problem to other known issues.
SGI mustered a very strong team of people (including me) to go onsite with a special bus analyzer to watch one of the machines that seemed to hit the problem more than the others. I was there for three weeks. In fact, my fiancée actually flew out during the last of the three weeks because it was her birthday and I had not been scheduled to be gone that long.
One highlight I recall from this trip was sitting in a room with some of the onsite SGI and AOL people on a conference call with SGI engineering and SGI field office people. During this call, the engineering manager was explaining the theory that the /dev/poll device might be getting “stuck” because of a bug with the poll-lock. Evidently, the poll-lock might get locked and never “unlocked,” which would cause the machine to hang. I had to ask, “Carl, how many poll locks does it take to hang a system?” There was dead silence. I came to find out that the other SGI field people on the phone had hit mute and were rolling on the floor laughing. (Thanks, guys.) The corporate SGI people were not amused.
Anyhow, the ultimate cause of the problem was secondary cache corruption. IRIX 5.3 was not detecting cache errors correctly, and when it did, it would corrupt the result every other time. Ultimately, they completely disabled secondary cache correction, and to this day IRIX users will notice a tuning variable called “r4k_corruption.” You have to turn that on to allow the machine to attempt to correct those errors (even at the risk of corrupting memory). The ultimate solution for AOL was to upgrade to R10k processors that “correctly” corrected secondary cache errors every time.