Working with the military can be a lot of fun. It can be exhilarating. It can also be incredibly frustrating and boring.
Recently, we had a problem with a large server system being used by a military contractor at a very sensitive base. The server was periodically losing its boot drive. The drive would spin down and stop (hanging the machine). They would reboot and the system would run for a while, then spin down again and stop. Eventually, the drive would stop working altogether. It would not spin up anymore.
The solution is simple, right? Just get the drive and send it to the manufacturer - they will figure out what went wrong. Sorry…no can do. These drives are super secret and they must be shredded. They cannot be returned.
Management was up in arms. "Steve," they said, "you need to take your SCSI analyzer down there, plug it into this system, and see what's going on."
I had three good reasons not to like this idea:
1. Clearly, this is a hardware problem. Software does not "fry" drives so they will not come up again. Were they thinking the drive was just demoralized from bad SCSI commands? It was broken!
2. Connecting my SCSI analyzer to a super secret system is simple enough, but the military isn't going to let me take it home again! Once it's been plugged in, it's theirs. It took me a long time to justify getting that thing and I really didn't want to give it to this military contractor.
3. Whatever was blowing up their boot disks was just as likely to blow up my SCSI analyzer! That's not a very good idea either.
Not sure of the best way to approach this, I walked over to another well-known smoke jumper. This guy fixed hardware. He fixed it the way only a former military guy could. He took it apart down to every nut and bolt, he smelled cables for ozone (indicating something had burned), and he examined boards with a portable microscope. He also had one other thing that is required for hardware debugging…a sixth sense that just told him what was wrong.
Me: "Phil, I've got this system that's blowing up drives. They've replaced everything and they want me to put my analyzer on there, but I think this is hardware. You have any suggestions?" Phil: "It's a burned cap on the midplane, have them replace it."
The SSE was quick to shoot this down when we talked: "No way! We've replaced the midplane twice. It can't be the midplane."
This is when "smoke jumping" happens. Phil gets on a plane and flies down.
What does he discover? It is indeed a burned cap on the midplane. The cause? The foil shielding behind the midplane is not mounted properly and is "angled" – resulting in the foil touching a 5V pin. Every time they powered on the machine with a new midplane, "ZAP," another midplane was fried.
They fixed the foil, put in a new midplane, and me and my SCSI analyzer lived to fight another day.