We experienced a hardware malfunction yesterday that took the Topaz [1]-hosted journals offline from 4pm to 10pm PST.
James [2] sent an email late yesterday afternoon reporting site errors on the PLoS journal websites. Soon after his email, the IT team started receiving SMS alerts. I assumed that something had gone wrong with the Topaz framework and started looking at the relevant log files, but couldn't find anything. After spending the requisite amount of time banging my head against the "site error" wall without success, I called Russ [3] for assistance. After a bit of digging through the server logs, he found the culprit: a drive had failed on the Mulgara [4] server. This drive is part of a RAID 5 configuration, so we didn't lose any data, but we also mysteriously lost the connection from the Mulgara server to the DAS array [5] (the disk storage for the Mulgara data).
We restarted the server but couldn't confirm that it was rebuilding the RAID correctly. I drove down to the colo, confirmed the drive failure and babysat the server until the platform was healthy. We'll swap out the defective drive on Wednesday during the migration to a pre-release of Topaz 0.9.
In case you missed the reference to U2's "Sunday Bloody Sunday" [6]...
Links:
[1] http://www.topazproject.org
[2] http://www.plos.org/cms/
[3] http://www.plos.org/about/people/itweb.html#ruman
[4] http://www.mulgara.org/
[5] http://en.wikipedia.org/wiki/Direct-attached_storage
[6] http://en.wikipedia.org/wiki/Sunday_Bloody_Sunday_(song)