An unexpected outage: blocking DNS calls in an event loop

6 April, 2024
by Mark Teisman

My service once experienced an unexpected outage. The service operates a TCP server that processes remote procedure calls (RPCs) sequentially on a single thread, in an event loop. Because these RPCs are performance-sensitive, we made sure that none of them required blocking I/O. However, every couple of seconds, a (non-RPC) task is scheduled on the event loop that flushes data to a remote HTTP server.

This all worked fine until, one day, the event loops of these processes unexpectedly jammed. This effectively caused an outage for our clients, which no longer got timely responses to their RPCs.

The incident correlated with a Domain Name System (DNS) outage within the organisation. Suddenly, DNS lookups made through the getaddrinfo() function on Linux would time out. getaddrinfo() is a function used for network address and service translation. It converts hostnames (like www.example.com) into IPv4 or IPv6 addresses, typically by querying remote DNS servers. getaddrinfo() is a blocking call, meaning that it does not return control to the application until it has completed its attempt to resolve a given hostname. If the DNS server is unreachable, the function waits for a response until a timeout occurs. Our periodic flush task had to resolve the hostname of the remote HTTP server, so during the DNS outage each flush blocked the event loop thread until the lookup timed out, and no RPCs could be processed in the meantime.
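To make the failure mode concrete, here is a minimal Python sketch of it. This is not our service's code: slow_lookup() is a hypothetical stand-in for getaddrinfo() during a DNS outage (it simply sleeps), and heartbeat() stands in for RPC handling. The only point is that one blocking call on the loop thread delays everything else scheduled on that loop.

```python
import asyncio
import time


def slow_lookup(delay: float) -> str:
    # Hypothetical stand-in for getaddrinfo() during a DNS outage:
    # it blocks the calling thread for `delay` seconds.
    time.sleep(delay)
    return "192.0.2.1"  # documentation-only address


async def heartbeat(gaps: list, interval: float, count: int) -> None:
    # Stand-in for RPC handling: records how long each tick really took.
    for _ in range(count):
        start = time.monotonic()
        await asyncio.sleep(interval)
        gaps.append(time.monotonic() - start)


async def flush_blocking(delay: float) -> None:
    # The periodic flush task, calling the blocking lookup on the loop thread.
    await asyncio.sleep(0.05)
    slow_lookup(delay)


async def run_demo(delay: float = 0.5) -> float:
    # Returns the longest gap between two heartbeat ticks, in seconds.
    gaps: list = []
    await asyncio.gather(heartbeat(gaps, 0.01, 20), flush_blocking(delay))
    return max(gaps)


if __name__ == "__main__":
    print(f"longest stall: {asyncio.run(run_demo()):.2f}s")
```

The heartbeat asks to be woken every 10 ms, but the longest gap ends up roughly as long as the blocking call, because the loop cannot run anything else while it is stuck inside slow_lookup().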

I was reminded of this outage while reading the book Asynchronous Programming in Rust, which advises against making blocking calls such as getaddrinfo() on the main event loop thread, and recommends relegating them to a thread pool instead.

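It turns out that libraries like libuv (known for powering Node.js) do operate such a thread pool. Node.js uses this thread pool to provide an asynchronous dns.lookup(), which calls getaddrinfo() on a worker thread. (dns.resolve() is asynchronous as well, but it takes a different route: it queries DNS servers over the network directly instead of calling getaddrinfo() on the thread pool.) Neither method blocks the main JavaScript thread.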

To prevent this issue from occurring in the future, we could indeed move the DNS lookup to a separate thread. Given the nature of the task, we could even move the whole asynchronous flushing of data to a separate thread (although that would require us to make the implementation thread-safe).
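As a rough illustration of the first option, the sketch below hands the blocking lookup to a thread pool using Python's asyncio. The names resolve_off_loop, count_ticks_during and slow_resolver, and the host flush.example, are made up for this example. slow_resolver again simulates a lookup that hangs, and the resolver parameter exists only so that the simulation can be swapped in for socket.getaddrinfo.

```python
import asyncio
import socket
import time
from typing import Callable


async def resolve_off_loop(host: str, port: int,
                           resolver: Callable = socket.getaddrinfo):
    # Run the blocking resolver on the loop's default thread pool so the
    # event loop thread stays free to serve RPCs in the meantime.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, resolver, host, port)


async def count_ticks_during(coro, interval: float = 0.01) -> int:
    # Counts how often the loop got to run while `coro` was still pending.
    task = asyncio.ensure_future(coro)
    ticks = 0
    while not task.done():
        await asyncio.sleep(interval)
        ticks += 1
    await task
    return ticks


def slow_resolver(host: str, port: int):
    # Hypothetical stand-in for getaddrinfo() during a DNS outage.
    time.sleep(0.3)
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("192.0.2.1", port))]


if __name__ == "__main__":
    ticks = asyncio.run(
        count_ticks_during(resolve_off_loop("flush.example", 443, slow_resolver))
    )
    print(f"event loop ticked {ticks} times while the lookup was blocked")
```

The loop keeps ticking while the worker thread waits. asyncio also ships loop.getaddrinfo(), which runs socket.getaddrinfo() on the default executor in the same way. One caveat: the pool is finite, so if every flush parks a worker in a hung lookup, the pool itself can eventually fill up.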

To contain the magnitude of the problem, we could also have configured the timeouts of the DNS lookups. The DNS timeout and retry settings can be configured in /etc/resolv.conf. In this file you can add options such as timeout (the time in seconds to wait for a response from a DNS server) and attempts (the number of times the resolver will send a query to its DNS servers before giving up). An example of a valid /etc/resolv.conf configuration would be:

options timeout:2 attempts:3
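The resolver options bound each query at the system level. A complementary idea, which we did not use and which is only sketched here under assumptions, is to put an application-level deadline around the lookup once it runs on a thread pool. The names resolve_with_deadline and hung_resolver, and the host flush.example, are hypothetical; hung_resolver simulates a lookup that does not return in time.

```python
import asyncio
import socket
import time
from typing import Callable, Optional


async def resolve_with_deadline(host: str, port: int, deadline_s: float,
                                resolver: Callable = socket.getaddrinfo
                                ) -> Optional[list]:
    # Run the blocking resolver on the default thread pool and stop waiting
    # for it after `deadline_s` seconds.
    loop = asyncio.get_running_loop()
    lookup = loop.run_in_executor(None, resolver, host, port)
    try:
        return await asyncio.wait_for(lookup, timeout=deadline_s)
    except asyncio.TimeoutError:
        # Only the caller moves on; the worker thread stays busy until the
        # resolver itself returns.
        return None


def hung_resolver(host: str, port: int):
    # Hypothetical stand-in for a lookup that hangs during a DNS outage.
    time.sleep(0.3)
    return []


if __name__ == "__main__":
    result = asyncio.run(
        resolve_with_deadline("flush.example", 443, 0.05, hung_resolver)
    )
    print("lookup result:", result)
```

The deadline does not interrupt getaddrinfo() itself, so the resolv.conf settings still decide how long the worker thread stays occupied.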

In our case, another mitigation was to have the service communicate with a service mesh proxy that runs on localhost, which avoided the DNS lookup altogether.
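One way to make "no DNS lookup" explicit in code is to address the local proxy by IP literal and pass AI_NUMERICHOST, which tells getaddrinfo() to accept only numeric addresses and never attempt name resolution. The sketch below assumes a proxy listening on 127.0.0.1; the port 15001 and the helper name numeric_only are made up for illustration and are not taken from our setup.

```python
import socket


def numeric_only(host: str, port: int):
    # AI_NUMERICHOST: accept only address literals and never attempt name
    # resolution, so this call cannot end up waiting on a DNS server.
    return socket.getaddrinfo(host, port, type=socket.SOCK_STREAM,
                              flags=socket.AI_NUMERICHOST)


if __name__ == "__main__":
    # Hypothetical local proxy address and port.
    family, _, _, _, sockaddr = numeric_only("127.0.0.1", 15001)[0]
    print(family, sockaddr)

    try:
        numeric_only("proxy.example", 15001)
    except socket.gaierror as exc:
        # A hostname is rejected immediately instead of being resolved.
        print("rejected without a lookup:", exc)
```

Whether the name localhost itself is resolved without contacting a DNS server depends on the host's resolver configuration, which is why the sketch uses the literal address.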