If you're an old-timer, you may remember the story of the USA’s largest Internet service provider going down for 19 hours. If you're younger, you can read about it here:
I would hardly know about this story myself except that I received a panicked phone call from the SGI office that served AOL that same day. In that phone call I was asked, “Could installing an SGI server on AOL’s network bring down the entire network?”
Hmmmm… normally, I would have said “no.” Networks should be resilient. People make mistakes on networks all the time. Sometimes they put two systems on the network with the same IP address. Sometimes they set their subnet or broadcast addresses incorrectly. These simple errors don’t take out buildings.
Even the most egregious error I could think of - somehow looping or routing a switch back to itself - shouldn’t take out the entire network. It might hang a “dumb” switch, but AOL used expensive switches with Spanning Tree Protocol that would prevent such loops. So even if the onsite people had made the very improbable mistake of making the SGI a router and somehow sticking two of its ports onto the same switch, I could not see AOL - as in, the entire company - going offline.
I got off the phone and something started to nag at me. I remembered a case with Chrysler the year before where they had deployed some SGI workstations on their CAD network. When they turned the SGI systems on, the IBM systems would drop off the network. The upshot was that SGI systems were a lot more aggressive when sending packets and we could easily keep the IBM systems from “getting a word in edgewise.”
Could this be it? Did AOL set up a system somewhere that was handling all their DNS or something and we were forcing it off the network?
This is where politics comes in. If we shut off the SGI system and the network “magically” came back, then what? At best, AOL would have been extremely leery about letting SGI add any more servers. At worst, the headline the following day would have read, “SGI takes out entire AOL network!” Dumb luck and coincidence might have put SGI on the front page in a very unfavorable light.
Ultimately, before we could gather any traces, AOL figured it out. The complete solution is explained here: http://news.cnet.com/AOL-mystery-explained/2100-1023_3-220635.html.
As it was explained to me, a redundant router was put in place alongside the existing router to handle AOL’s network traffic. That “new” router had an empty routing table. It pushed that empty table down to all the sub-routers on AOL’s network and essentially erased their entire distributed routing table. As I recall, admins were logging into routers and manually entering routes so that different floors of the building could reattach. That let them reach other routers and fix those, until they finally had enough connectivity to get back to the main routing tables, recover everything, and push it all back down.
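For anyone who wants to see why an empty table is so destructive, here is a toy sketch of the failure as it was described to me. It is not AOL's actual setup or any real routing protocol: the `Router` class, the `push_table_to` method, the router names, and the 10.x prefixes are all made up for illustration, and the only assumption is that the sub-routers accept whatever table is pushed to them.

```python
# Toy model only: every name and address here is hypothetical, and real
# routing protocols are far more involved than a dictionary copy.

class Router:
    def __init__(self, name, routes=None):
        self.name = name
        # Maps a destination prefix to a next hop.
        self.routes = dict(routes or {})

    def push_table_to(self, others):
        """Overwrite each downstream router's table with a copy of ours."""
        for router in others:
            router.routes = dict(self.routes)


def reachable(router, destination):
    """A destination is reachable only if the router has a route for it."""
    return destination in router.routes


floor_routes = {"10.1.0.0/16": "core-a", "10.2.0.0/16": "core-a"}
sub_routers = [Router(f"floor-{n}", floor_routes) for n in range(1, 4)]

# The redundant router comes up with an empty routing table.
new_router = Router("core-b")

before = [reachable(r, "10.1.0.0/16") for r in sub_routers]
new_router.push_table_to(sub_routers)
after = [reachable(r, "10.1.0.0/16") for r in sub_routers]

# Recovery by hand: re-enter a route on one router at a time.
sub_routers[0].routes["10.1.0.0/16"] = "core-a"
recovered = reachable(sub_routers[0], "10.1.0.0/16")

print(before, after, recovered)
```

In this sketch every sub-router can reach the destination before the push and none can after it, which is why recovery had to start with someone typing routes in by hand.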
That was a very bad day for those guys, but it was no picnic for me either. ☺