Status

Check the status of our services at a glance

Title http6 is down
Operation ID #168
State completed
Start date 10 Feb. 2014 10:05
End date 10 Feb. 2014 15:05
Affected servers
  • http6

Messages

10 Feb. 2014 10:13

We’re investigating.

10 Feb. 2014 10:22

The server is up again; we’re still analyzing what happened.

10 Feb. 2014 10:26

Down again.

10 Feb. 2014 10:39

A network attack is causing the trouble. We’re working on mitigating it and trying to figure out why our firewall didn’t work as expected.

10 Feb. 2014 10:57

Traffic is still disrupted.

10 Feb. 2014 12:08

We’re obviously still working on resolving this issue.

10 Feb. 2014 12:32

We think we’ve found the problem.

10 Feb. 2014 12:39

We’re pretty sure the problem is identified. Another, unrelated problem has appeared; we’re working on it.

10 Feb. 2014 12:53

A quotacheck is running on the /home filesystem; it may take some time.

10 Feb. 2014 13:13

The quotacheck is done, but we’re still fixing an issue.

10 Feb. 2014 13:40

We’re still fighting an mdadm issue. The data is safe.

10 Feb. 2014 13:58

We’ve disabled the RAID for now; the server is up.

10 Feb. 2014 15:05

The server is up, everything is working as normal. We’ll close the operation for now, but details will be posted later. The RAID is still degraded, but we’ll keep it this way until we find a way to fix it properly.

10 Feb. 2014 19:38

Here is what happened: yesterday, the http6 server was migrated to a new machine. Everything was working properly until this morning.

At 09:45 today, the server started having random network issues (broken connections, timeouts, weird traffic). Debugging was a bit more difficult than usual since we didn’t have working SSH access.

The only relevant kernel log message was: net_ratelimit: 2127 callbacks suppressed

Such messages indicate that the kernel avoided repeating the same messages again and again. We didn’t get any other message, though, which is unexpected. Since we couldn’t make use of those messages, we tried to find the cause of the issue by other means.
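As an illustration of how such a line can at least be quantified, here is a minimal Python sketch that sums the suppressed counts found in a chunk of kernel log text. The function name suppressed_callbacks is hypothetical, and the sketch assumes the log lines follow the exact wording quoted above.

```python
import re

# Matches kernel log fragments such as
# "net_ratelimit: 2127 callbacks suppressed".
_RATELIMIT_RE = re.compile(r"net_ratelimit: (\d+) callbacks suppressed")


def suppressed_callbacks(log_text: str) -> int:
    """Return the total number of suppressed callbacks reported in log_text.

    Hypothetical helper: it only counts how much was hidden; it cannot
    recover the content of the suppressed messages.
    """
    return sum(int(m.group(1)) for m in _RATELIMIT_RE.finditer(log_text))
```

A large total only says that many network-related messages were dropped from the log, not which ones, which is consistent with the difficulty described above.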

We initially thought it could be a network attack that had got past our protections, because the network traffic looked suspicious. After much investigation, we came to the conclusion that it wasn’t an attack. We continued debugging for a while, to no avail.

We decided to downgrade the kernel version from 3.10 to 3.4. The 3.4 kernel was reporting new log messages: IPv4: Neighbour table overflow
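On Linux, this message relates to the kernel's neighbour (ARP) cache, whose hard limit is the gc_thresh3 sysctl. The following Python sketch estimates how full that cache is from /proc/net/arp-style text. It is only an approximation under stated assumptions: the function names are hypothetical, and /proc/net/arp may not list every entry the kernel counts against the limit.

```python
def count_neighbours(arp_table: str) -> int:
    """Count entries in /proc/net/arp-style text (first line is a header)."""
    lines = [line for line in arp_table.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


def neighbour_table_usage(arp_table: str, gc_thresh3: int) -> float:
    """Return the fraction of the hard limit (gc_thresh3) currently in use."""
    return count_neighbours(arp_table) / gc_thresh3


def read_live_usage() -> float:
    """Read the live values on a Linux host (not called automatically)."""
    with open("/proc/net/arp") as arp_file:
        arp_table = arp_file.read()
    with open("/proc/sys/net/ipv4/neigh/default/gc_thresh3") as limit_file:
        gc_thresh3 = int(limit_file.read())
    return neighbour_table_usage(arp_table, gc_thresh3)
```

A value approaching 1.0 would suggest the cache is close to the point where new entries are refused.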

That helped us quickly notice a typo in a network configuration file: the default gateway was pointing to the server itself, rather than the real gateway (the reality was actually a bit more complex). The neighbour table was filled with non-local IP addresses. Everything worked fine until this morning, when the number of connections increased.
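A simplified check for this class of mistake can be sketched in Python as follows. It assumes a little-endian Linux host, reads the default route from /proc/net/route-style text, and compares the gateway against a list of the server's own addresses supplied by the caller. Both function names are hypothetical, and the sketch covers only the simple case, not the more complex real configuration mentioned above.

```python
import ipaddress
import socket
import struct


def default_gateway(route_table: str):
    """Return the default gateway from /proc/net/route-style text, or None."""
    for line in route_table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        destination, gateway = fields[1], fields[2]
        if destination == "00000000":  # 0.0.0.0 marks the default route
            # Assumption: little-endian host, so the hex value is byte-swapped.
            packed = struct.pack("<L", int(gateway, 16))
            return ipaddress.ip_address(socket.inet_ntoa(packed))
    return None


def gateway_is_local(gateway, local_addresses) -> bool:
    """Return True if the gateway is one of the host's own addresses."""
    return gateway in {ipaddress.ip_address(addr) for addr in local_addresses}
```

Run after a migration with the machine's configured addresses, a True result from gateway_is_local would flag a default route that points back at the server itself.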

We still don’t know why the 3.10 kernel didn’t report these log messages. We’ll have to investigate the RAID issues as well, which started appearing during the kernel downgrade.

We’ll obviously learn from this outage. An automated debugging tool may prove useful in unusually weird situations such as this one.