Servers status

Check our services status at a glance

ID:
168
Title:
http6 is down
Status:
completed
Started date:
02/10/2014 10:05 a.m.
End date:
02/10/2014 3:05 p.m.
Involved servers:
http6

Updates

02/10/2014
10:13

We’re investigating.

02/10/2014
10:22

The server is up again. We’re still analyzing what happened.

02/10/2014
10:26

Down again.

02/10/2014
10:39

A network attack is causing the trouble. We’re working on mitigating it and trying to figure out why our firewall didn’t work as expected.

02/10/2014
10:57

The traffic is still disrupted.

02/10/2014
12:08

We’re obviously still working on resolving this issue.

02/10/2014
12:32

We think we’ve found the problem.

02/10/2014
12:39

We’re pretty sure the problem is identified. Another, unrelated problem has appeared, and we’re working on it.

02/10/2014
12:53

A quotacheck is running on the /home filesystem. It may take some time.

02/10/2014
13:13

The quotacheck is done, but we’re still fixing an issue.

02/10/2014
13:40

We’re still fighting an mdadm issue. The data is safe.

02/10/2014
13:58

We’ve disabled the RAID for now. The server is up.

02/10/2014
15:05

The server is up, everything is working as normal. We’ll close the operation for now, but details will be posted later. The RAID is still degraded, but we’ll keep it this way until we find a way to fix it properly.

02/10/2014
19:38

Here is what happened: yesterday, the http6 server was migrated to a new machine. Everything was working properly until this morning.

At 09:45 today, the server started experiencing random network issues (broken connections, timeouts, weird traffic). Debugging was a bit more difficult than usual since we didn’t have functional SSH access.

The only relevant kernel log message was: net_ratelimit: 2127 callbacks suppressed

Such a message indicates that the kernel is rate-limiting its network log output: it suppressed that many messages rather than repeating them again and again. We didn’t have any other message, though, which was unexpected. Since we couldn’t make use of those messages, we tried to discover the cause of the issue by other means.
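As an illustration of how such a line can be read, here is a minimal Python sketch that totals the suppressed-message counts found in kernel log lines. The function name count_suppressed and the idea of feeding it lines from the kernel log are assumptions made for this example; it is not the tooling we used during the outage.

```python
import re

# Matches kernel lines such as "net_ratelimit: 2127 callbacks suppressed".
SUPPRESSED = re.compile(r"net_ratelimit: (\d+) callbacks suppressed")


def count_suppressed(log_lines):
    """Return the total number of messages reported as suppressed.

    log_lines is any iterable of kernel log lines (hypothetical input,
    for example the output of dmesg split into lines).
    """
    total = 0
    for line in log_lines:
        match = SUPPRESSED.search(line)
        if match:
            total += int(match.group(1))
    return total
```

A large total only says that the kernel had a lot to report; it does not say what the suppressed messages were.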

We initially thought it could be a network attack, despite our protections, because the network traffic was suspicious. After much investigation, we came to the conclusion that it wasn’t an attack. We continued debugging for a while, to no avail.

We decided to downgrade the kernel version from 3.10 to 3.4. The 3.4 kernel reported new log messages: IPv4: Neighbour table overflow

That helped us quickly notice a typo in a network configuration file: the default gateway was pointing to the server itself, rather than to the real gateway (the reality was actually a bit more complex). As a result, the neighbour table filled up with non-local IP addresses. Everything had worked fine until this morning, when the number of connections increased.
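This kind of typo can be checked for automatically. The following Python sketch works under the simplified description above (a default gateway equal to one of the server’s own addresses) and flags that situation. The function names, the input format (lines shaped like the "default via <address> dev <interface>" output of ip route) and the example addresses are assumptions made for illustration, not our actual configuration.

```python
import ipaddress


def default_gateway(route_lines):
    """Return the gateway of the first 'default via <addr>' route, or None."""
    for line in route_lines:
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "default" and fields[1] == "via":
            return ipaddress.ip_address(fields[2])
    return None


def gateway_points_to_self(route_lines, local_addresses):
    """Return True if the default gateway is one of the host's own addresses."""
    gateway = default_gateway(route_lines)
    if gateway is None:
        return False
    own = {ipaddress.ip_address(addr) for addr in local_addresses}
    return gateway in own


# Hypothetical example using documentation addresses (192.0.2.0/24).
routes = ["default via 192.0.2.10 dev eth0", "192.0.2.0/24 dev eth0"]
print(gateway_points_to_self(routes, ["192.0.2.10"]))
```

Running a check like this right after a migration would catch the simple form of the mistake before traffic grows enough to overflow the neighbour table.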

We still don’t know why the 3.10 kernel didn’t report these log messages. We’ll have to investigate the RAID issues as well, which started appearing during the kernel downgrade.

We’ll obviously learn from this outage. An automated debugging tool may prove useful in unusually weird situations such as this one.