06 April, 2024
by Mark Teisman
My service once experienced an unexpected outage. The service operates a TCP server that processes remote procedure calls (RPCs) sequentially on a single thread, in an event loop. Because these RPCs are performance-sensitive, we made sure that none of them required blocking I/O. However, every couple of seconds, a (non-RPC) task is scheduled on the event loop that flushes data to a remote HTTP server.
This all worked fine until, one day, the event loops of these processes unexpectedly jammed. This effectively caused an outage for our clients, which no longer got timely responses to their RPCs.
The incident correlated with a Domain Name System (DNS) outage within
the organisation. Suddenly, DNS lookups using the getaddrinfo() function
in Linux would time out. getaddrinfo() is a function used for network
address and service translation. It converts hostnames (like
www.example.com) into IPv4 or IPv6 addresses, typically by querying
remote DNS servers. getaddrinfo() is a blocking call, meaning that the
function will not return control to the application until it has
completed its attempt to resolve a given hostname. If the DNS server is
unreachable, the function will wait for a response until a timeout
occurs. Because the flush task ran on the event loop thread, every RPC
queued behind a stuck lookup had to wait as well.
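To make the failure mode concrete, here is a minimal sketch in Python's asyncio rather than in the service's own code. The blocking_lookup() function is a stand-in that just sleeps, standing in for a getaddrinfo() call that is waiting on an unreachable DNS server, and heartbeat() is a hypothetical task that plays the role of the RPC handlers by recording how long the loop takes to get back to it.

```python
import asyncio
import time


def blocking_lookup(seconds):
    # Stand-in for getaddrinfo() waiting on an unreachable DNS server.
    time.sleep(seconds)


async def heartbeat(gaps, interval=0.01):
    # Records the real time that passes between consecutive wake-ups.
    last = time.monotonic()
    while True:
        await asyncio.sleep(interval)
        now = time.monotonic()
        gaps.append(now - last)
        last = now


async def main(block_for=0.3):
    gaps = []
    task = asyncio.create_task(heartbeat(gaps))
    await asyncio.sleep(0.05)    # let the heartbeat run normally for a bit
    blocking_lookup(block_for)   # blocks the event loop thread itself
    await asyncio.sleep(0.05)    # let the heartbeat record the stall
    task.cancel()
    return max(gaps)


if __name__ == "__main__":
    print(f"longest stall: {asyncio.run(main()):.2f}s")
```

The heartbeat asks to be woken every 10 ms, but its longest gap ends up at least as long as the blocking call, because nothing else can run on the loop thread while that call is in progress.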
I was reminded of this outage while reading the book Asynchronous
Programming in Rust, which advises against making blocking calls such
as getaddrinfo() on the main event loop thread and recommends
relegating them to a thread pool instead.
It turns out that libraries like libuv (known for powering Node.js)
do operate such a threadpool. Node.js uses this threadpool to provide
an asynchronous dns.lookup(), which calls getaddrinfo() on a worker
thread. (dns.resolve() is asynchronous too, but it bypasses
getaddrinfo() and queries DNS servers directly over the network.)
Neither method blocks the main JavaScript thread.
To prevent this issue from occurring in the future, we could indeed move the DNS lookup to a separate thread. Given the nature of the task, we could even move the whole asynchronous flushing of data to a separate thread (although this would require us to make the implementation thread-safe).
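As a rough illustration of the first option, the sketch below uses Python's asyncio, which is an assumption for the sake of the example and not what the service is written in. The helper name resolve_off_loop() and its resolver parameter are hypothetical. socket.getaddrinfo() and loop.run_in_executor() are real standard library calls, and asyncio's own loop.getaddrinfo() takes the same approach of running the lookup in the default executor.

```python
import asyncio
import socket


async def resolve_off_loop(host, port, timeout=2.0, resolver=socket.getaddrinfo):
    """Run a blocking resolver in the default thread pool, bounded by a timeout."""
    loop = asyncio.get_running_loop()
    # run_in_executor hands the blocking call to a worker thread, so the
    # event loop thread stays free to serve other tasks in the meantime.
    future = loop.run_in_executor(None, resolver, host, port)
    # wait_for only stops *waiting* after `timeout` seconds; the worker
    # thread keeps running until the resolver itself returns.
    return await asyncio.wait_for(future, timeout)
```

A caller would write something like `await resolve_off_loop("www.example.com", 443)` and handle asyncio.TimeoutError. The timeout protects the caller, not the pool: during a long DNS outage, stuck lookups can still pile up and occupy every worker thread, so this complements the resolver-level timeouts rather than replacing them.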
To contain the magnitude of the problem, we also could have looked at
configuring the timeouts of the DNS lookups. The DNS timeout and retry
settings can be configured in /etc/resolv.conf. In this file you can
add options including timeout (the time in seconds to wait for a
response from a DNS server; 5 by default in glibc) and attempts (the
number of times the resolver will send a query to its DNS servers
before giving up; 2 by default in glibc). An example of a valid
/etc/resolv.conf configuration would be:
options timeout:2 attempts:3
In our case, another mitigation was to have the service communicate with
a service mesh proxy that runs on localhost, which avoided the DNS
lookup altogether.
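If the goal is to guarantee that a code path never triggers name resolution, one option is to configure the target as an IP literal and pass the AI_NUMERICHOST flag to getaddrinfo(). The sketch below is in Python, and the proxy address and port are made up for illustration, not taken from our setup.

```python
import socket

# Hypothetical address of the local proxy; not taken from the real service.
PROXY_HOST = "127.0.0.1"
PROXY_PORT = 8080


def numeric_only(host, port):
    """Translate an IP literal into socket addresses without any name lookup.

    AI_NUMERICHOST tells getaddrinfo() to accept only numeric addresses,
    so a hostname raises socket.gaierror instead of triggering resolution.
    """
    return socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM, flags=socket.AI_NUMERICHOST
    )


if __name__ == "__main__":
    for family, _type, _proto, _canon, sockaddr in numeric_only(PROXY_HOST, PROXY_PORT):
        print(family, sockaddr)
```

With this guard in place, accidentally configuring a hostname fails fast with an error instead of silently reintroducing a blocking DNS lookup on the event loop.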