status

Title	http4 hardware issue
ID	Operation #64
State	completed
Beginning date	10/20/2012 4:51 p.m.
End date	10/21/2012 10:20 a.m.
Affected servers	http4

Messages

10/20/2012 5:05 p.m.	We’re investigating.
10/20/2012 5:10 p.m.	Back again. The machine was found frozen and had to be rebooted. Probably a kernel issue; we’ll have to update the kernel soon.
10/20/2012 10:01 p.m.	It happened again. We’re forcing the kernel upgrade immediately.
10/20/2012 10:15 p.m.	The kernel has been upgraded. We’re staying vigilant, as these freezes may be caused by a hardware issue.
10/20/2012 11:47 p.m.	We’ve found evidence of a probable hardware issue in the logs after the second freeze. We’ll schedule a motherboard replacement.
10/21/2012 12:25 a.m.	The motherboard will be replaced at 1:00 (in 35 minutes).
10/21/2012 1:02 a.m.	The operation is starting.
10/21/2012 1:47 a.m.	The hardware is still being replaced; it is taking longer than usual.
10/21/2012 3:16 a.m.	We’re still waiting to hear from our provider. They’re clearly not doing a good job right now.
10/21/2012 4:11 a.m.	It seems like the new motherboard was not the exact same model as before, and it has incompatibility issues with the kernel. We’ll know more when the operation has completed.
10/21/2012 5:08 a.m.	Apparently, they didn’t get the new motherboard to work. The old one is being put back into the server.
10/21/2012 5:48 a.m.	And it still doesn’t work (network down, as before). They must have misconfigured something during the operation. We’re very, very sorry about all of this. We’ll keep you informed as soon as we get more details.
10/21/2012 6:05 a.m.	They’re still trying to figure this out. Here is the error message that gets printed every 3 seconds, to be specific: ixgbe 0000:04:00.4 eth0: reset adapter
10/21/2012 6:36 a.m.	Nothing new since the last message. Just to make things clear: your data is fine, it’s “just” a network issue. The machine is fully accessible by KVM.
10/21/2012 7:49 a.m.	They still haven’t made it work. A senior technician will arrive at 10:00. We cannot give an ETA right now, I’m sorry.
10/21/2012 8:03 a.m.	They’re preparing a spare server where our disks will be inserted. Hopefully that will solve the issue.
10/21/2012 10:04 a.m.	The senior technician is now investigating this issue. They’re looking for another network card model, as the current card seems to be the cause of all this.
10/21/2012 10:17 a.m.	The server is up again. It will be slower than usual for several minutes.
10/21/2012 10:40 a.m.	The network card has been replaced. Why the previous one, which worked normally for 3 years, stopped working when the motherboard was replaced is still a mystery. We will have a chat with our provider next week to understand how such an extended downtime could have happened, especially on a high-end server such as http4. That’s the worst downtime we’ve experienced with a server, by far. We will obviously take action to prevent this from happening again. All customers on http4 can ask for a full refund for October by opening a support ticket. We are very, very sorry. This is clearly not the quality of service you should expect from us.
