how to accidentally break DNS for 15 domains or why you maybe could not send mail to me

TL;DR: DNS for golov.de and other (14) domains hosted on my infra was flaky from 15th to 17th of May, which may have resulted in undelivered mail.

Yeah, I know, I haven't blogged for quite some time. Even not after I switched the engine of my blog from WordPress to Nikola. Sorry!

But this post is not about apologizing or at least not for not blogging.

Last Tuesday, mika sent me a direct message on Twitter (around 13:00) that read „problem auf deiner Seite?“ or “problem on your side/page?”. Given side and page are the same word in German, I thought he meant my (this) website, so I quickly fired up a browser, checked that the site loads (I even checked both, HTTP and HTTPS! :-)) and as everything seemed to be fine and I was at a customer I only briefly replied “?”. A couple messages later we found out that mika tried to send a screenshot (from his phone) but that got lost somewhere. A quick protocol change later (yay, Signal!) and I got the screenshot. It said "<evgeni+grml@golov.de>: Host or domain name not found. Name service error for name=golov.de type=AAAA: Host found, but no data record of requested type". Well, yeah, that looks like an useful error message. And here the journey begins.

For historical nonsense golov.de currently does not have any AAAA records, so it looked odd that Postfix tried that. Even odder was that dig MX golov.de and dig mail.golov.de worked just fine from my laptop.

Still, the message looked worrying and I decided to dig deeper. golov.de is served by three nameservers: ns.die-welt.net, ns2.die-welt and ns.inwx.de and dig was showing proper replies from ns2.die-welt.net and ns.inwx.de but not from ns.die-welt.net, which is the master. That was weird, but gave a direction to look at, and explained why my initial tests were OK. Another interesting data-point was that die-welt.net was served just fine from all three nameservers.

Let's quickly SSH into that machine and look what's happening… Yeah, but I only have my work laptop with me, which does not have my root key (and I still did not manage to setup a Yubikey/Nitrokey/whatver). Thankfully my key was allowed to access the hypervisor, yay console!

Now let's really look. golov.de is served from from the bind backend of my PowerDNS, while die-welt.net is served from the MySQL backend. That explains why one domain didn't work while the other did. The relevant zone file looked fine, but the zones.conf was empty. WTF?! That zones.conf is autogenerated by Froxlor and I had upgraded it during the weekend to get Let's Encrypt support. Oh well, seems I hit a bug, damn. A few PHP hacks later and I got my zones.conf generated properly again and all was good.

But what had really happened?

  • On Saturday (around 17:00) I upgraded to Froxlor 0.9.35.1 to get Let's Encrypt support and hit Froxlor bug 1615 without noticing as PowerDNS re-reads zones.conf only when told.

  • On Sunday PowerDNS was restarted because of upgraded packages, thus re-reading zones.conf and properly logging:

    May 15 08:10:59 shokki pdns[2210]: [bindbackend] Parsing 0 domain(s), will report when done
    
  • On Tuesday the issue hit a friend who cared and notified me

  • On Tuesday the issue was fixed (first by a quick restore from etckeeper, later by fixing the generating code):

    May 17 14:56:08 shokki pdns[24422]: [bindbackend] Parsing 15 domain(s), will report when done
    

And the lessons learned?

  • Monitor all your domains, on all your nameservers. (I didn't)
  • Have emergency access to all you servers. (I did, but it was complicated)
  • Use etckeeper, it's easier to use than backups in such cases.
  • When hitting bugs, look in the bugtracker before solving the issue yourself. (I didn't)
  • Have friends who care :-)

Comments