Status

Check the status of our services at a glance

Title: DNS issue
ID: Operation #483
Status: completed
Start date: August 12, 2024 09:46
End date: August 12, 2024 10:58
Affected servers

Messages

August 12, 2024 10:15

We are investigating.

August 12, 2024 10:21

DNS servers are now replying normally, although changes are not applied for now. We’re still investigating.

August 12, 2024 10:58

It’s fixed.

August 12, 2024 17:28

We apologize to our customers whose services or businesses were impacted during this incident. Here is our post-mortem, which details what happened.

Context

Every 4 years, we replace our software infrastructure, which is based on Debian, with the latest available version. It is a long process that typically spans about a year and involves migrating every piece of software we use to its newest version. Most of our infrastructure has already been migrated, and new clients have been running on our new 2024 infrastructure since early July.

Our authoritative DNS servers use dnsdist (as the front end) and PowerDNS (as the back end). We've started preparing the changes required to migrate to the latest version of PowerDNS; one particular change requires replacing how we currently handle SOA records. It is a relatively small change, transparent to our clients, that consists of two parts (a short sketch follows the list):

  • modifying how we create SOA records by explicitly setting the serial ourselves, rather than relying on PowerDNS' automatically generated serial
  • recreating the SOA record for all the domains we handle
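
To make the first part more concrete, here is a minimal Python sketch of what "setting the serial ourselves" can look like. The helper names, the date-based serial scheme, and the timer values are assumptions made for the illustration, not our actual code; only the idea of writing an explicit serial instead of relying on PowerDNS' automatically generated one comes from the change described above.

    from datetime import datetime, timezone
    from typing import Optional

    def date_based_serial(now: Optional[datetime] = None, revision: int = 1) -> int:
        # One common explicit-serial convention: YYYYMMDDnn.
        now = now or datetime.now(timezone.utc)
        return int(now.strftime("%Y%m%d")) * 100 + revision

    def build_soa_content(primary_ns: str, hostmaster: str, serial: int) -> str:
        # SOA content in zone-file order: MNAME RNAME SERIAL REFRESH RETRY EXPIRE MINIMUM.
        # The timer values below are placeholders, not our production settings.
        refresh, retry, expire, minimum = 10800, 3600, 604800, 300
        return f"{primary_ns} {hostmaster} {serial} {refresh} {retry} {expire} {minimum}"

    # dns1.alwaysdata.com is our primary name server; the hostmaster address is a placeholder.
    print(build_soa_content("dns1.alwaysdata.com.", "hostmaster.example.org.",
                            date_based_serial()))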

Timeline

The first part went into production in early August without any issue. New domains started using the new SOA format, while all previously existing domains still used the old one.

After one week, we started the second part, i.e. recreating the SOA records for existing domains. The procedure is simple: call the function that creates the SOA record, which we have been using for years. We initially did it on a few test domains only, then on a few dozen domains the next day, and then on a few hundred the day after. Again, no issues occurred.
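
For illustration only, the staged rollout could be driven by a loop as simple as the one below; recreate_soa() is a stand-in for the long-standing internal function mentioned above, and its name and behaviour are assumptions made for the sketch.

    def recreate_soa(domain: str) -> None:
        # Stand-in for the internal function that deletes the old SOA record and
        # creates the new one; the real implementation lives in our codebase.
        print(f"recreating SOA for {domain}")

    def rollout(domains: list[str]) -> None:
        # Small, explicit batches, so that any failure surfaces immediately.
        for domain in domains:
            recreate_soa(domain)

    # Day 1: a few test domains; day 2: a few dozen; day 3: a few hundred.
    rollout(["test-1.example.org", "test-2.example.org"])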

Today at 9:40 CEST, we made the change for all existing domains (about 40,000). We followed the exact same procedure we had used when we deployed on a few hundred domains.

At 9:43 CEST, we received an alert about an internal DNS issue and immediately started investigating it. The overwhelming majority of domains still worked fine, including domains that our procedure had already modified, so we continued our investigation.

At 9:48 CEST, as the number of errors and failing domains kept increasing, we decided to abort the procedure. Domains that come before re (in alphabetical order) had been modified; the others had not. It was clear that the procedure had broken something, but we still couldn't understand what exactly. Despite inspecting our database, we couldn't find anything wrong: old SOA records were deleted, and new SOA records were present and correctly formatted.

At 9:55 CEST, we decided to activate our DNS disaster plan. It changes the way PowerDNS operates: rather than querying our database, it uses simple flat zone files that are automatically generated every 30 minutes. It means that DNS changes made during the last 30 minutes are (temporarily) lost.
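
The sketch below illustrates the general idea behind that disaster mode: dump the records into flat, BIND-style zone files that PowerDNS can serve without touching the database. The hard-coded records, file layout and paths are placeholders for this example; our real generator reads from the database and runs automatically.

    from pathlib import Path

    # Placeholder stand-in for the records normally read from the database.
    RECORDS = {
        "example.org": [
            ("example.org.", "SOA",
             "dns1.alwaysdata.com. hostmaster.example.org. 2024081201 10800 3600 604800 300"),
            ("example.org.", "NS", "dns1.alwaysdata.com."),
            ("example.org.", "NS", "dns2.alwaysdata.com."),
            ("www.example.org.", "A", "203.0.113.10"),
        ],
    }

    def dump_zone_files(zone_dir: Path, default_ttl: int = 300) -> None:
        # Write one flat zone file per domain; a periodic job (every 30 minutes in
        # our case) would regenerate these files from the live records.
        zone_dir.mkdir(parents=True, exist_ok=True)
        for zone, records in RECORDS.items():
            lines = [f"$ORIGIN {zone}.", f"$TTL {default_ttl}"]
            for name, rtype, content in records:
                lines.append(f"{name} IN {rtype} {content}")
            (zone_dir / f"{zone}.zone").write_text("\n".join(lines) + "\n")

    dump_zone_files(Path("/tmp/disaster-zones"))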

At 9:59 CEST, as PowerDNS had issues starting with the disaster config, we prepared to restore a backup of our DNS records in parallel.

At 10:05 CEST, we fixed the issues that had prevented PowerDNS from starting in disaster mode. We ran a few tests before deploying the change.

At 10:10 CEST, all production DNS servers were running on the disaster config, everything was working normally again, albeit in degraded mode (DNS changes made since 9:30 CEST were not visible).

At 10:56 CEST, the original issue had been found and fixed, all production DNS servers were running the normal config again, and all changes applied during or before the incident were now visible.

Impact

For all domains that come before re (in alphabetical order), our DNS servers (dns1.alwaysdata.com and dns2.alwaysdata.com) may have returned an error (REFUSED) between 9:40 CEST and 10:10 CEST. However, due to caching (on our side, thanks to dnsdist, or on the recursive DNS servers' side) and other factors (the alphabetical position of your domain, TTL expiration, etc.), the observable errors were fairly random.
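
For reference, this failure mode could be observed from outside with a direct query to an authoritative server, for example using the dnspython library as sketched below; the IP address is a documentation placeholder, not one of our real servers.

    import dns.message
    import dns.query
    import dns.rcode

    def check_domain(domain: str, server_ip: str) -> str:
        # Query the authoritative server directly and report the response code,
        # e.g. "NOERROR" for a healthy zone or "REFUSED" during the incident.
        query = dns.message.make_query(domain, "SOA")
        response = dns.query.udp(query, server_ip, timeout=2.0)
        return dns.rcode.to_text(response.rcode())

    # Example call (placeholder IP): print(check_domain("example.org", "198.51.100.53"))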

Root cause

When a DNS record is added, it is initially in a disabled state. A procedure is then run to determine whether the record must be enabled and whether others must be disabled, which happens, for instance, when a user adds an A record that overrides our own. Finally, the cache on PowerDNS and dnsdist is flushed if necessary.

In normal cases, the procedure is almost instantaneous. However, in the very rare cases where we need to change hundreds or thousands of records at once (which is always a manual operation), the procedure is not executed on each record but batched and executed at the very end, for performance reasons.

This is what happened this morning. As our code iterated over the domains, their old SOA record was deleted and a new one was created, in a disabled state. As the procedure that would enable the new record is only run at the end, there was a period of time during which no SOA record was enabled for these domains. However, that is not supposed to cause issues, as PowerDNS keeps using the SOA in its cache as long as it is not flushed, which only happens after the new record is enabled. Unfortunately, due to a bug in the PowerDNS version we use, the cache could also be flushed when the TTL of the record expired, which is 5 minutes by default on our servers.
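
In simplified Python, the difference between the two code paths looks roughly like this. The function names and the in-memory "database" are assumptions made for the sketch; only the ordering of the steps (create disabled, enable later, flush only after enabling) and the 300-second TTL reflect what is described above.

    TTL = 300  # seconds; the default TTL on our servers, after which the buggy
               # PowerDNS version could flush the cached SOA on its own.

    def recreate_soa(domain: str, db: dict) -> None:
        # The old SOA is already gone at this point; the new one starts disabled.
        db[domain] = {"soa": "new content", "enabled": False}

    def enable_and_flush(domains: list, db: dict) -> None:
        for domain in domains:
            db[domain]["enabled"] = True
        # the PowerDNS/dnsdist cache flush happens here, only after enabling

    def per_record_path(domains: list, db: dict) -> None:
        # Normal case: each record is enabled (and the cache flushed) immediately,
        # so no domain stays without an enabled SOA for any noticeable time.
        for domain in domains:
            recreate_soa(domain, db)
            enable_and_flush([domain], db)

    def batched_path(domains: list, db: dict) -> None:
        # Bulk case: enabling is deferred to the very end for performance reasons.
        # Domains processed early in the loop wait until the whole batch finishes;
        # if that wait exceeds TTL and the buggy flush kicks in, PowerDNS re-reads
        # the database, finds no enabled SOA, and answers REFUSED.
        for domain in domains:
            recreate_soa(domain, db)
        enable_and_flush(domains, db)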

Although the SOA record is rarely queried directly, PowerDNS refuses to serve a domain if it cannot find a SOA record.

Unfortunately, the reason we saw no issues when we deployed the change on a few hundred domains is that a few hundred records was too small a batch to significantly delay the final batched procedure: everything was re-enabled well before the 5-minute TTL expired, so PowerDNS never had to query the database while the new SOA records were still disabled.
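
A back-of-the-envelope illustration of why batch size mattered (the processing rate below is an invented figure for the example, not a measurement):

    TTL = 300   # seconds; default TTL on our servers
    RATE = 50   # hypothetical number of domains processed per second

    for batch_size in (300, 40_000):
        # Rough upper bound on how long the first domains of the batch wait
        # for the final enabling step.
        exposure = batch_size / RATE
        verdict = "under the TTL, no visible impact" if exposure < TTL else "over the TTL, exposed"
        print(f"{batch_size:>6} domains -> ~{exposure:.0f}s without an enabled SOA ({verdict})")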

Prevention

First, that particular PowerDNS bug is, ironically, already fixed in the version we are preparing to upgrade to.

Second, we will test our disaster config more regularly: starting PowerDNS in disaster mode without errors would have saved us almost 10 minutes.

Third, even after a procedure has been deemed safe, we should, as much as possible, deploy in chunks rather than all at once (which is what we already do for most of our daily operations).
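
As a minimal sketch of what deploying in chunks can mean in practice (the chunk size, pause, and callback functions are placeholders, not our actual tooling):

    import time
    from typing import Callable, Iterator, Sequence

    def chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
        for start in range(0, len(items), size):
            yield items[start:start + size]

    def deploy_in_chunks(domains: Sequence[str],
                         apply_change: Callable[[str], None],
                         health_check: Callable[[], bool],
                         chunk_size: int = 500,
                         pause: float = 60.0) -> None:
        # Apply the change to a limited number of domains, wait, verify, and only
        # then move on; stop as soon as monitoring reports a problem.
        for chunk in chunks(domains, chunk_size):
            for domain in chunk:
                apply_change(domain)
            time.sleep(pause)
            if not health_check():
                raise RuntimeError("health check failed; aborting the rollout")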