A Common Operational Problem in DNS Servers - Failure To Respond.
Internet Systems Consortium
950 Charter Street
Redwood City
CA
94063
US
marka@isc.org
The DNS is a query / response protocol. Failure to respond to
queries causes both immediate operational problems and long term
problems with protocol development.
This document will identify a number of common classes of
queries that some servers fail to respond to. This document
will also suggest procedures for TLD and other similar zone
operators to apply to reduce / eliminate the problem.
The DNS is a query / response protocol. Failure to respond to
queries causes both immediate operational problems and long
term problems with protocol development.
Failure to respond to a query is indistinguishable from
packet loss without doing an analysis of query response
patterns, and results in unnecessary additional queries being
made by DNS clients and unnecessary delays being introduced
to the resolution process.
Due to the inability to distinguish between packet loss and
nameservers dropping EDNS queries,
packet loss is sometimes misclassified as lack of EDNS
support, which can lead to DNSSEC validation failures.
Allowing servers which fail to respond to queries to remain
in the DNS hierarchy for extended periods results in
developers being afraid to deploy new type codes. Such servers
need to be identified and corrected / replaced.
The DNS has response codes that cover almost any conceivable
query response. A nameserver should be able to respond to
any conceivable query using them.
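
As a rough illustration of answering rather than dropping, the
sketch below turns any query that carries at least a full header
into a reply with a chosen RCODE. It is a minimal sketch and is not
taken from any particular server: the helper names error_response
and extract_question are hypothetical, and echoing the question
section only when it parses cleanly is an assumption about what a
reasonable reply looks like.

```python
import struct
from typing import Optional

# Standard RCODE values from RFC 1035.
FORMERR, SERVFAIL, NXDOMAIN, NOTIMP, REFUSED = 1, 2, 3, 4, 5


def extract_question(query: bytes) -> bytes:
    """Return the first question's raw bytes, or b"" if it does not parse."""
    pos = 12  # the question section starts right after the 12-byte header
    try:
        while query[pos] != 0:
            if query[pos] > 63:  # not a plain label; give up on echoing
                return b""
            pos += 1 + query[pos]
    except IndexError:
        return b""
    end = pos + 1 + 4  # root label plus QTYPE and QCLASS
    return query[12:end] if end <= len(query) else b""


def error_response(query: bytes, rcode: int) -> Optional[bytes]:
    """Build a reply carrying `rcode`; None if there is nothing to answer."""
    if len(query) < 12:
        return None  # no complete header, so no ID to reply to
    msg_id, flags, qdcount = struct.unpack("!HHH", query[:6])
    if flags & 0x8000:
        return None  # never answer a response (assumption: avoids loops)
    question = extract_question(query) if qdcount == 1 else b""
    # Set QR, keep the opcode and RD bits from the query, and set the RCODE.
    out_flags = 0x8000 | (flags & 0x7900) | (rcode & 0x000F)
    header = struct.pack("!HHHHHH", msg_id, out_flags,
                         1 if question else 0, 0, 0, 0)
    return header + question
```

A server that does not implement a query type could, for example,
return error_response(query, NOTIMP) instead of staying silent.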
Unless a nameserver is under attack, it should respond to
all queries directed to it as a result of following
delegations. Additionally, code should not assume that there
isn't a delegation to the server even if it is not configured
to serve the zone. Broken delegations are a common occurrence
in the DNS and receiving queries for zones that you are not
configured for is not necessarily an indication that you
are under attack.
When a nameserver is under attack it may wish to drop packets.
A common attack is to use a nameserver as an amplifier by
sending spoofed packets. This is done because response
packets are bigger than the queries and big amplification
factors are available, especially if EDNS is supported.
Limiting the rate of responses is reasonable when this
is occurring and the client should retry. This however only
works if legitimate clients are not being forced to guess
whether EDNS queries are accepted or not. While there is
still a pool of servers that don't respond to EDNS requests,
clients have no way to know if the lack of response is due
to packet loss, EDNS packets not being supported or rate
limiting due to the server being under attack. Misclassifications
of server characteristics are unavoidable when rate limiting
is done.
There are three common classes of query that result in
non-responses today. These are EDNS queries, queries for unknown
(unallocated) or unsupported types, and TCP queries that are
filtered.
Identifying servers that fail to respond to EDNS queries
can be done by first identifying that the server responds
to regular DNS queries, then making a series of otherwise
identical queries using EDNS, then making the original
query again. A series of EDNS queries is needed as at least
one DNS implementation responds to the first EDNS query
with FORMERR but fails to respond to subsequent queries
from the same address for a period until a regular DNS
query is made. The EDNS query should specify a UDP buffer
size of 512 bytes to avoid false classification of not
supporting EDNS due to response packet size.
If the server responds to the first and last queries but
fails to respond to most or all of the EDNS queries it
is probably faulty. The test should be repeated a number
of times to eliminate the likelihood of a false positive
due to packet loss.
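
The EDNS test sequence can be sketched as follows. This is a
minimal illustration using only the Python standard library, not a
reference implementation: the function names, the default of three
EDNS queries, the three second timeout, the use of IPv4 only and
the reading of "most" as "more than half" are all assumptions made
for the example. The OPT record advertises a 512 byte UDP buffer as
described above.

```python
import socket
import struct
from typing import Sequence


def encode_name(name: str) -> bytes:
    """Encode a domain name in DNS wire format (no compression)."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(qname: str, qtype: int, msg_id: int, edns: bool = False) -> bytes:
    """Build a query; with edns=True append an OPT record advertising 512 bytes."""
    header = struct.pack("!HHHHHH", msg_id, 0, 1, 0, 0, 1 if edns else 0)
    question = encode_name(qname) + struct.pack("!HH", qtype, 1)  # class IN
    if not edns:
        return header + question
    # OPT RR: root name, TYPE 41, CLASS = UDP buffer size, TTL 0, RDLENGTH 0.
    opt = b"\x00" + struct.pack("!HHIH", 41, 512, 0, 0)
    return header + question + opt


def got_reply(server: str, message: bytes, timeout: float = 3.0) -> bool:
    """Send one UDP query and report whether a reply with the same ID arrived."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(message, (server, 53))
            data, _ = sock.recvfrom(4096)
        except OSError:  # includes timeouts
            return False
    return data[:2] == message[:2]


def classify_edns(first: bool, edns_replies: Sequence[bool], last: bool) -> str:
    """Apply the rule above to one plain / EDNS series / plain run."""
    if not (first and last):
        return "inconclusive"
    unanswered = len(edns_replies) - sum(edns_replies)
    if unanswered * 2 > len(edns_replies):
        return "probably faulty"
    return "responds"


def edns_probe(server: str, qname: str, qtype: int = 6, edns_count: int = 3) -> str:
    """Run one round: plain query, a series of EDNS queries, plain query again."""
    first = got_reply(server, build_query(qname, qtype, 1))
    series = [got_reply(server, build_query(qname, qtype, 2 + i, edns=True))
              for i in range(edns_count)]
    last = got_reply(server, build_query(qname, qtype, 2 + edns_count))
    return classify_edns(first, series, last)
```

A single "probably faulty" result from edns_probe is only a hint;
as noted above, the round should be repeated before a server is
recorded as faulty.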
Firewalls may also block larger EDNS responses but there
is no easy way to check authoritative servers to see if
the firewall is misconfigured.
Identifying servers that fail to respond to unknown or
unsupported types can be done by making an initial DNS
query for an A record, making a number of queries for
an unallocated type, then making a query for an A record
again. IANA maintains a registry of allocated types.
If the server responds to the first and last queries but
fails to respond to the queries for the unallocated type
it is probably faulty. The test should be repeated a
number of times to eliminate the likelihood of a false
positive due to packet loss.
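
The query order and the effect of repeating the test can be
sketched as below. This is an illustration only: the names
probe_plan and verdict are hypothetical, the unallocated type
number is deliberately left to the caller (it must be checked
against the IANA registry at the time of testing), and treating a
server as faulty only when every usable round shows no reply to the
unallocated type is an assumption about how to combine rounds.

```python
from typing import List, Sequence

A = 1  # QTYPE of an address record


def probe_plan(unallocated_type: int, count: int = 3) -> List[int]:
    """QTYPEs to send, in order: A, the unallocated type `count` times, A."""
    return [A] + [unallocated_type] * count + [A]


def round_is_suspect(replies: Sequence[bool]) -> bool:
    """True when both A queries were answered and no unallocated-type query was."""
    return (len(replies) >= 3 and replies[0] and replies[-1]
            and not any(replies[1:-1]))


def verdict(rounds: Sequence[Sequence[bool]]) -> str:
    """Combine several rounds of reply flags into one classification."""
    # A round is usable only if the server answered the A queries around it.
    usable = [r for r in rounds if len(r) >= 3 and r[0] and r[-1]]
    if not usable:
        return "inconclusive"
    if all(round_is_suspect(r) for r in usable):
        return "probably faulty"
    return "responds"
```

Each inner sequence passed to verdict holds one boolean per query
in probe_plan order, recording whether that query was answered.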
All DNS servers are supposed to respond to queries over
TCP. Firewalls that drop TCP
connection attempts rather than resetting the connection
attempt or sending an ICMP/ICMPv6 administratively prohibited
message introduce excessive delays to the resolution
process.
Whether a server accepts TCP connections can be tested
by first checking that it responds to UDP queries to
confirm that it is up and operating then attempting
the same query over TCP. An additional query should
be made over UDP if the TCP connection attempt fails
to confirm that the server under test is still operating.
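
A sketch of the TCP check follows. DNS messages sent over TCP are
preceded by a two byte length field, which frame_tcp adds. The
function names, the outcome labels and the five second timeout are
assumptions made for this example; the UDP checks before and after
are passed in as booleans rather than performed here.

```python
import socket
import struct


def frame_tcp(message: bytes) -> bytes:
    """Prefix a DNS message with the two byte length used over TCP."""
    return struct.pack("!H", len(message)) + message


def recv_exact(sock: socket.socket, count: int) -> bytes:
    """Read exactly `count` bytes or raise if the peer closes early."""
    data = b""
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise ConnectionError("connection closed before full response")
        data += chunk
    return data


def tcp_probe(server: str, message: bytes, port: int = 53,
              timeout: float = 5.0) -> str:
    """Try one query over TCP and describe what happened."""
    try:
        with socket.create_connection((server, port), timeout=timeout) as sock:
            sock.sendall(frame_tcp(message))
            (length,) = struct.unpack("!H", recv_exact(sock, 2))
            recv_exact(sock, length)
            return "answered"
    except socket.timeout:
        return "timeout"   # attempt silently dropped: the slow case
    except ConnectionRefusedError:
        return "refused"   # connection reset: fails fast
    except OSError:
        return "error"     # e.g. an ICMP unreachable / prohibited message


def classify_tcp(udp_before: bool, tcp_outcome: str, udp_after: bool) -> str:
    """Combine the UDP checks with the TCP outcome as described above."""
    if not udp_before:
        return "inconclusive"          # server was not shown to be up
    if tcp_outcome == "answered":
        return "accepts TCP"
    if udp_after:
        return "TCP fails (" + tcp_outcome + ")"
    return "inconclusive"              # server stopped answering altogether
```

Only the "timeout" outcome corresponds to the excessive delays
discussed above; "refused" and "error" still indicate that TCP is
unavailable, but the client learns this quickly.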
While the first step in remediating this problem is to get
the offending nameserver code corrected, there is a very
long tail problem with DNS servers in that it can often
take over a decade between the code being corrected and a
nameserver being upgraded with corrected code. With that
in mind it is requested that TLD, and other similar zone
operators, take steps to identify and inform their customers,
directly or indirectly through registrars, that they are
running such servers and that the customers need to correct
the problem.
TLD operators should construct a list of the servers that child
zones are delegated to, along with a delegated zone name for each
server. This name shall be the query name used to test the server
as it is supposed to exist.
For each server the TLD operator shall make an SOA query for the
delegated zone name. This should result in the SOA record
being returned in the answer section. If the SOA record is
not returned but some other response is returned, this is an
indication of a bad delegation and the TLD operator should
take whatever steps it normally takes to rectify a bad
delegation. If more than one zone is delegated to the server
it should choose another zone until it finds a zone which
responds correctly or it exhausts the list of zones delegated
to the server.
If the TLD operator fails to get a response to an SOA query it
should make an A query, as some nameservers fail
to respond to SOA queries but respond to A queries. If it
gets no response to the A query, another delegated zone
should be queried for, as some nameservers fail to respond
to zones they are not configured for. If subsequent queries
find a responding zone, all delegations to this server need
to be checked and rectified using the TLD's normal procedures.
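
The search for a responding zone described in the last two
paragraphs can be sketched as follows. The ask callable is a
hypothetical probe supplied by the caller; its three return values
are a convention invented for this example. Treating any reply to
the A query as "responding", and returning the query type that
worked alongside the tuple, are likewise assumptions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical probe: returns "soa" when the SOA record is in the answer
# section, "other" for any other reply, and None when nothing comes back.
Ask = Callable[[str, str, str], Optional[str]]


def find_working_tuple(
    server: str, zones: Sequence[str], ask: Ask
) -> Tuple[Optional[Tuple[str, str]], Optional[str], List[str]]:
    """Return (<server, query name>, query type that worked, zones to review)."""
    review: List[str] = []  # delegations to check using the normal procedures
    for zone in zones:
        reply = ask(server, zone, "SOA")
        if reply == "soa":
            return (server, zone), "SOA", review
        if reply is None and ask(server, zone, "A") is not None:
            # Silent on SOA but answers A: usable for the later tests.
            return (server, zone), "A", review
        # Either a non-SOA reply (bad delegation) or no reply at all.
        review.append(zone)
    return None, None, review
```

When a working tuple is found after earlier zones were skipped, the
zones in the returned review list are the delegations that need to
be checked and rectified.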
Having identified a working <server, query name> tuple the
TLD operator should now check that the server responds to
EDNS, Unknown Query Type and TCP tests as described above.
If the TLD operator finds that a server fails any of the
tests, the TLD operator shall take steps to inform the
operator of the server that they are running a faulty
nameserver and that they need to take steps to correct the
matter. The TLD operator shall also record the <server,
query name> for followup testing.
If repeated attempts to inform and get the customer
to correct / replace the faulty server are unsuccessful
the TLD operator shall remove all delegations to said
server from the zone.
It will also be necessary for TLD operators to repeat
the scans periodically. It is recommended that this
be performed monthly, backing off to bi-annually once
the number of faulty servers found drops to fewer
than 1 in 100000 servers tested. Follow up tests for
faulty servers still need to be performed monthly.
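
The back-off rule can be written down as a small function. This is
a sketch only: the function name is hypothetical, and reading
"bi-annually" as every six months is an assumption, since the text
does not define the term.

```python
def scan_interval_months(faulty: int, tested: int) -> int:
    """Months until the next full scan, given the last scan's counts."""
    # Strictly fewer than 1 faulty server per 100000 tested; integer
    # arithmetic avoids floating point rounding at the threshold.
    if tested > 0 and faulty * 100000 < tested:
        return 6  # assumption: "bi-annually" read as twice a year
    return 1
```

Servers already recorded as faulty are outside this rule and are
retested monthly regardless of the result.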
Some operators claim that they can't perform checks at
registration time. If a check is not performed at registration
time it needs to be performed within a week of registration
in order to detect faulty servers swiftly.
Checking of delegations by TLD operators should be nothing
new as they have been required from the very beginnings of
DNS to do this. Checking for
compliance of nameserver operations should just be an extension
of such testing.
It is recommended that TLD operators set up a test web page
which performs the tests the TLD operator performs as part
of their regular audits to allow nameserver operators to
test that they have correctly fixed their servers. Such
tests should be rate limited to avoid these pages being
a denial of service vector.
Firewalls and load balancers can affect the externally
visible behaviour of a nameserver. Tests for conformance
need to be done from outside of any firewall so that the
system as a whole is tested.
Firewalls and load balancers should not drop DNS packets
that they don't understand. They should either pass through
the packets or generate an appropriate error response.
Requests for unknown query types are not attacks and
should not be treated as such.
DOMAIN NAMES - CONCEPTS AND FACILITIES
DOMAIN NAMES - IMPLEMENTATION AND SPECIFICATION
Extension Mechanisms for DNS (EDNS0)
DNS Transport over TCP - Implementation Requirements