Operation #39

Title http7, * down
Status completed
Start date 25 June 2012 16:40
End date 25 June 2012 17:50
Affected servers


25 June 2012 18:26

http7 first became unresponsive, which happens from time to time. This time, however, it caused NFS freezes on the server that handles * as well as our main internal database. Those freezes in turn froze our * Web applications, which kept their PostgreSQL connections open (status "idle in transaction") until they had consumed all available PostgreSQL connections.
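The exhaustion mechanism described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not our actual setup: `ConnectionLimit`, `handle_request` and `requests_until_exhausted` are hypothetical names, and the "database" is just a counter with a hard limit. It shows that once workers block before committing, every new request pins one more connection until none are left.

```python
class ConnectionLimit:
    """Toy stand-in for a database server with a hard connection limit."""

    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.in_use = 0

    def acquire(self):
        # Refuse new connections once the limit is reached.
        if self.in_use >= self.max_connections:
            raise RuntimeError("too many connections")
        self.in_use += 1

    def release(self):
        self.in_use -= 1


def handle_request(db, storage_is_frozen):
    """Simulate one Web request that opens a transaction, then touches storage."""
    db.acquire()  # transaction opened, connection now held
    if storage_is_frozen:
        # The worker blocks on file I/O and never commits:
        # the connection stays held ("idle in transaction").
        return "stuck"
    db.release()  # normal path: commit and give the connection back
    return "done"


def requests_until_exhausted(max_connections, storage_is_frozen, limit=1000):
    """Return how many requests were accepted before connections ran out,
    or None if the limit was never hit within `limit` requests."""
    db = ConnectionLimit(max_connections)
    for served in range(limit):
        try:
            handle_request(db, storage_is_frozen)
        except RuntimeError:
            return served
    return None
```

With healthy storage the limit is never reached, whereas with frozen storage the number of accepted requests equals the connection limit, after which every client is refused.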

The http7 server was restarted, but it failed to connect to our internal database to fetch its configuration, because that database was down. This explains why all sites on this server remained down for much longer than they should have.

This internal database is mirrored in real time, though, and the secondary server was ready to accept connections and serve requests. Unfortunately, this server was overloaded very quickly and stopped responding too, because its hardware hasn't been upgraded to match the primary's. This server also hosts *, which is why it was down too.
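As a rough illustration of the primary/secondary arrangement described above, the following sketch tries each configuration source in order and returns the first answer. All names here (`fetch_config`, the `sources` list, the fetcher callables) are hypothetical and do not reflect our real code; the point is only that a client can fall back to the mirror, and that the fallback still fails if the mirror cannot take the load.

```python
def fetch_config(sources):
    """Try each (name, fetch) pair in order and return (name, config)
    from the first source that answers.

    `sources` is a list of (name, callable) pairs; each callable either
    returns a configuration or raises an exception.
    """
    errors = []
    for name, fetch in sources:
        try:
            return name, fetch()
        except Exception as exc:  # any failure means: try the next source
            errors.append(f"{name}: {exc}")
    # Every source failed: report all collected errors together.
    raise ConnectionError(
        "no configuration source available: " + "; ".join(errors)
    )
```

A caller would pass something like `[("primary", fetch_from_primary), ("secondary", fetch_from_secondary)]`, where both fetchers are assumed to enforce their own timeouts so that a frozen server is treated as a failure rather than blocking forever.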

We still have to figure out why several protections didn't work as they should have. The secondary server will of course be upgraded so that it can handle the same load as the primary. http7's hardware will also be replaced, since something is definitely wrong with this server.

We're really sorry for this unexpectedly long downtime. All other http servers and other services were unaffected by this issue.