status

Title	http6 is down
ID	Operation #168
State	completed
Beginning date	02/10/2014 10:05 a.m.
End date	02/10/2014 3:05 p.m.
Affected servers	http6

Messages

02/10/2014 10:13 a.m.	We’re investigating.
02/10/2014 10:22 a.m.	The server is up again; we’re still analyzing what happened.
02/10/2014 10:26 a.m.	Down again.
02/10/2014 10:39 a.m.	A network attack is causing the trouble. We’re working on mitigating it and trying to figure out why our firewall didn’t work as expected.
02/10/2014 10:57 a.m.	The traffic is still disrupted.
02/10/2014 12:08 p.m.	We’re obviously still working on resolving this issue.
02/10/2014 12:32 p.m.	We think we’ve found the problem.
02/10/2014 12:39 p.m.	We’re pretty sure the problem is identified. Another unrelated problem has appeared, we’re working on it.
02/10/2014 12:53 p.m.	A quotacheck is running on the /home filesystem, it may take some time.
02/10/2014 1:13 p.m.	The quotacheck is done, but we’re still fixing an issue.
02/10/2014 1:40 p.m.	We’re still fighting an mdadm issue. The data is safe.
02/10/2014 1:58 p.m.	We’ve disabled the RAID for now, the server is up.
02/10/2014 3:05 p.m.	The server is up, everything is working as normal. We’ll close the operation for now, but details will be posted later. The RAID is still degraded, but we’ll keep it this way until we find a way to fix it properly.
02/10/2014 7:38 p.m.	Here is what happened: yesterday, the http6 server was migrated to a new machine. Everything was working properly until this morning. At 09:45 today, the server started having random network issues (broken connections, timeouts, weird traffic). Debugging was a bit more difficult than usual since we didn’t have functional SSH access.

The only relevant kernel log message was:

    net_ratelimit: 2127 callbacks suppressed

Such messages indicate that the kernel avoided repeating the same messages again and again. We didn’t get any other message, though, which is unexpected. Since we couldn’t make use of those messages, we tried to find the cause of the issue by other means. We initially thought it could be a network attack, despite our protections, because the network traffic looked suspicious. After much investigation, we came to the conclusion that it wasn’t an attack. We continued debugging for a while, to no avail.

We then decided to downgrade the kernel from 3.10 to 3.4. The 3.4 kernel reported new log messages:

    IPv4: Neighbour table overflow

That helped us quickly notice a typo in a network configuration file: the default gateway was pointing to the server itself, rather than the real gateway (the reality was actually a bit more complex). As a result, the neighbour table was filled with non-local IP addresses. Everything worked fine until this morning, when the number of connections increased.

We still don’t know why the 3.10 kernel didn’t report these log messages. We’ll also have to investigate the RAID issues, which started appearing during the kernel downgrade. We’ll obviously take lessons from this outage. An automated debugging tool may prove useful in unusually weird situations such as this one.
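
As an illustration of the kind of automated check mentioned above, here is a minimal sketch in Python that flags a default gateway pointing at one of the server’s own addresses. It is not the tool we use: the function name check_default_gateway, its parameters, and the example addresses (taken from the documentation ranges) are all assumptions made for this example, and the real configuration was more complex than what this check covers.

```python
import ipaddress


def check_default_gateway(gateway, local_addresses, subnet):
    """Return a list of problems found with a default gateway setting.

    Hypothetical helper: all three arguments are plain strings, as they
    might be read from a network configuration file.
    """
    gw = ipaddress.ip_address(gateway)
    net = ipaddress.ip_network(subnet, strict=False)
    local = {ipaddress.ip_address(addr) for addr in local_addresses}

    problems = []
    # The misconfiguration described above: the gateway is the server itself.
    if gw in local:
        problems.append("gateway is one of the server's own addresses")
    # A gateway outside the local subnet is usually not directly reachable.
    if gw not in net:
        problems.append("gateway is outside the local subnet")
    return problems


if __name__ == "__main__":
    # Example values only; 192.0.2.0/24 is a documentation range.
    for issue in check_default_gateway("192.0.2.10", ["192.0.2.10"], "192.0.2.0/24"):
        print("WARNING:", issue)
```

Run at deployment time, a check like this could catch such a typo before traffic grows enough to fill the neighbour table.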