DNS Traffic Control

Introduction

DNS Traffic Control (DTC) enables the IPAM authoritative nameservers to dynamically adjust their query responses for a particular domain name and record type, potentially returning a different address (A or AAAA) record each time depending on:

  • health checks
  • the source IP address of the query (note that this source IP usually belongs to the recursive DNS resolver rather than the end user)

    RFC 7871 defines an EDNS0 Client Subnet option that lets recursive DNS resolvers tell the authoritative nameserver about the originating client’s IP, but many recursive DNS resolvers do not implement this option (in part due to some privacy shortcomings).

DNS Traffic Control can be used to provide active/passive failover or active/active load distribution (not true load balancing, see below) between multiple target IP addresses for the same service.  The target IP addresses must be stable and known in advance.

The query response synthesized by DNS Traffic Control is returned instead of any normal DNS resource records which may be configured in IPAM for the queried domain name and type.  However, if DTC is not able to synthesize a response (e.g. because none of the eligible target IPs appear to be healthy), then the normal DNS resource record set is returned as a fallback.

Known Issue: DNS Traffic Control may return your fallback records (instead of synthesized responses) for about 1 second after a restart of services in which other DTC configuration changes are being applied, even if those changes have nothing to do with your LBDN.  Mitigation: choose good fallback records!

Responses for other record types, such as TXT, are completely unaffected by DNS Traffic Control.

Example

The following illustration shows a DTC Load-Balanced Domain Name (LBDN) with two DTC Pools comprising a total of three DTC Servers.

The domain name (e.g. example.illinois.edu) is associated with the LBDN; the target IP addresses are associated with the DTC Servers.

This LBDN uses the “Global Availability” load balancing method to always prefer test-pool1 over test-pool2.

Each Pool uses the “Ratio: Fixed” load balancing method to randomly select from among the available Servers in that Pool.

With this configuration, A record queries matching the LBDN will return:

  1. one A record for the IP address of either test-server1 or test-server2 (choosing randomly each time if both are healthy).

    Right now test-server2 is unhealthy, so every matching A record query will return an A record for the IP of test-server1.

  2. one A record for the IP address of test-server3, if both test-pool1 servers are down but test-server3 is healthy.
  3. the normal (non-DTC, fallback) set of A records configured in IPAM for the queried domain name, if all three servers (and therefore both pools) are down.
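The selection order described above can be sketched in a few lines of Python. This is only an illustrative model, not the vendor's implementation: the function name, the data layout, and the 192.0.2.x documentation addresses are assumptions made for this example.

```python
import random

# Hypothetical model of the example: (pool name, {server name: (IP, healthy?)}).
# The 192.0.2.x addresses are documentation placeholders, not real targets.
POOLS = [
    ("test-pool1", {"test-server1": ("192.0.2.1", True),
                    "test-server2": ("192.0.2.2", False)}),
    ("test-pool2", {"test-server3": ("192.0.2.3", True)}),
]
# Normal (non-DTC) A records configured for the same name.
FALLBACK = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]


def answer_a_query(pools, fallback, rng=random):
    """Return the list of A record addresses for one matching query."""
    # LBDN level, Global Availability: walk the pools in order and use
    # the first one that still has at least one healthy server.
    for _pool_name, servers in pools:
        healthy = [ip for ip, ok in servers.values() if ok]
        if healthy:
            # Pool level, Ratio: Fixed with equal weights: pick one
            # healthy server at random.
            return [rng.choice(healthy)]
    # No healthy server anywhere: return the normal record set instead.
    return list(fallback)


print(answer_a_query(POOLS, FALLBACK))  # ['192.0.2.1'] while test-server2 is unhealthy
```

Marking test-server1 unhealthy as well makes the same call return the address of test-server3, and marking all three unhealthy returns the fallback list.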

Using multiple Pools makes it possible to implement fairly sophisticated logic, but this is not always necessary; many real-world use cases require only a single Pool (which can be either active/passive or active/active depending on the load balancing method chosen).

For introductory purposes this example shows only DTC Servers with IPv4 addresses, but we can add more DTC Servers with IPv6 addresses to get equivalent behavior for AAAA record queries as well (using the same Pools and LBDN).  Note that DNS Traffic Control automatically ignores DTC A records while synthesizing an AAAA response, and vice versa.
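As a rough illustration of that last point, the sketch below filters a mixed list of DTC Servers by address family before answering. The server names and addresses are hypothetical placeholders, and the real product performs this filtering internally.

```python
import ipaddress

# Hypothetical DTC Server inventory; addresses are documentation placeholders.
DTC_SERVERS = {
    "test-server1": "192.0.2.1",
    "test-server1-v6": "2001:db8::1",
    "test-server2": "192.0.2.2",
}


def candidates_for(qtype, servers):
    """Return the server names whose address family matches the query type."""
    wanted_version = {"A": 4, "AAAA": 6}[qtype]
    return [name for name, addr in servers.items()
            if ipaddress.ip_address(addr).version == wanted_version]


print(candidates_for("A", DTC_SERVERS))     # IPv4 servers only
print(candidates_for("AAAA", DTC_SERVERS))  # IPv6 servers only
```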

About TTLs

The scalability of DNS depends upon caching DNS records in accordance with their TTL (time-to-live) values.  The effectiveness of DNS Traffic Control as a failover mechanism depends upon using very short TTL values, typically 60 seconds or less, to limit the cache lifetime of its synthesized query responses.  It also depends on client applications issuing a new DNS query before reconnecting to your service, instead of just retrying against the same remembered IP address.  Note as a consequence that DNS Traffic Control is generally much more effective at mitigating protracted outages than brief intermittent outages.
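To get a feel for the numbers, here is a rough back-of-the-envelope estimate in Python. It assumes that a dead server makes every health check run to its full timeout, that the failure begins just after a successful check, and that clients re-query as soon as the TTL expires; the function name is ours, and the interval and timeout values are the health monitor defaults listed under "Choose or Customize a Health Monitor".

```python
def worst_case_failover_seconds(ttl, interval, timeout, retry_down_count):
    """Rough upper bound on how long clients may keep receiving a dead IP."""
    # Each failed check can cost a full interval of waiting plus a full timeout.
    detection = retry_down_count * (interval + timeout)
    # A response cached just before detection remains valid for a full TTL.
    return detection + ttl


print(worst_case_failover_seconds(ttl=60, interval=5, timeout=15, retry_down_count=1))  # 80
```

Under these assumptions even the best case is on the order of a minute, which is why brief intermittent outages are often over before DNS-based failover takes effect.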

TTL Caveat

So far, we have not found the following to be a significant problem in practice, but rumors abound on the internet: if a particular client application or standards-violating ISP DNS resolver decides to cache old responses for significantly longer than the sanctioned TTL, there is nothing that we as the authoritative DNS provider can do about it – beyond suggesting that individual impacted users may want to reboot and/or switch to one of the various public DNS resolver services as a workaround.

Note that you do not need to set a short TTL on Stand-alone CNAME Records which point to a domain name handled by DNS Traffic Control.  The TTL values in this sample response appropriately reflect our expectations that the CNAME record will remain stable but the A record might change suddenly:

www.example.illinois.edu. 3600 IN CNAME example.illinois.edu.
example.illinois.edu.     60   IN A     18.217.84.3
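The short Python sketch below parses the sample response above and follows the CNAME chain, collecting the TTL of each record along the way. It is a simplified illustration (the helper names are ours, and it handles only one record per name), but it shows that the address reached through the alias can still change after 60 seconds even though the CNAME itself may be cached for an hour.

```python
SAMPLE = """\
www.example.illinois.edu. 3600 IN CNAME example.illinois.edu.
example.illinois.edu.     60   IN A     18.217.84.3
"""


def parse_records(text):
    """Parse 'name ttl class type rdata' lines into dictionaries."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, ttl, rclass, rtype, rdata = line.split()
        records.append({"name": name, "ttl": int(ttl), "class": rclass,
                        "type": rtype, "rdata": rdata})
    return records


def follow(records, qname, max_hops=8):
    """Follow CNAMEs from qname; return (final rdata, TTLs along the chain)."""
    by_name = {r["name"]: r for r in records}
    ttls = []
    name = qname
    for _ in range(max_hops):
        rec = by_name[name]
        ttls.append(rec["ttl"])
        if rec["type"] != "CNAME":
            return rec["rdata"], ttls
        name = rec["rdata"]
    raise ValueError("CNAME chain too long")


address, ttls = follow(parse_records(SAMPLE), "www.example.illinois.edu.")
print(address, ttls, "shortest TTL in chain:", min(ttls))
```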

Glossary

DTC Server: represents a single target IP address which is capable of providing your service.  Note that DTC Server objects do not necessarily correspond 1:1 to your real servers, since

  • a single DTC Server might target a Server Load Balancing (SLB) VIP which distributes incoming client requests among multiple real servers.
  • a real server with both an IPv4 address and an IPv6 address will require two separate DTC Server objects, one per address (assuming you wish to use DNS Traffic Control for both A and AAAA responses).

Health Monitor: a test performed independently by each authoritative nameserver to determine whether a DTC Server is healthy

DTC Pool: a grouping of one or more DTC Servers with a load balancing method for choosing among them

Load-Balanced Domain Name (LBDN): a slightly misnamed object which maps one or more domain name patterns to a grouping of one or more DTC Pools, with a load balancing method for choosing among them

DTC LBDN Record: a manifestation of the LBDN which automatically appears within associated DNS Zones based on the LBDN’s domain name patterns.  Technically LBDN Records are separate objects from the LBDN, but you don’t normally need to worry about this distinction.

DTC A (or AAAA) Record: record data associated with a DTC Server which is used to synthesize DTC query responses.  Usually auto-populated based on the DTC Server’s IP.

Load Balancing Methods

Global Availability: always selects the first available target from an ordered list.  Use this for active/passive failover.

Ratio:Fixed: randomly selects an available target each time, optionally “weighted” to select certain targets more often than others.  Use this for active/active load distribution.

Keep in mind that DNS response caching makes this an imprecise mechanism for controlling actual server load; if you need finely tuned load balancing, consider using network-layer SLB instead of or in addition to DTC (see below).
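A weighted random choice of this kind can be sketched with Python's standard library. The target names, weights, and function name below are hypothetical, and the sketch models only the selection step, not the effect of caching.

```python
import random
from collections import Counter


def ratio_fixed(targets, rng):
    """targets: list of (name, weight, healthy). Return one healthy name or None."""
    available = [(name, weight) for name, weight, ok in targets if ok]
    if not available:
        return None
    names, weights = zip(*available)
    # random.choices performs a weighted draw; k=1 returns a one-element list.
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical pool: "a" should be selected about three times as often as "b";
# "c" is unhealthy and must never be selected.
TARGETS = [("a", 3, True), ("b", 1, True), ("c", 5, False)]
rng = random.Random(42)
counts = Counter(ratio_fixed(TARGETS, rng) for _ in range(10000))
print(counts)
```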

All Available (Pool only): instead of returning just one address record at a time, return the address records for all healthy Servers in the Pool (omitting any unhealthy ones).  This is an alternative active/active method which behaves more like traditional “round-robin DNS” (see also below), but cannot be used in DNSSEC signed zones because the set of distinct responses it might hypothetically produce is the power set of the Pool’s Servers, which grows exponentially with the size of the Pool.

We don’t use DNSSEC signed zones at this time, but plan to do so in the future.
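To illustrate the growth involved, the following sketch enumerates every non-empty combination of Servers that All Available could return for a small Pool. The server names are placeholders and the function name is ours; the sketch only counts possibilities and says nothing about how signing is actually implemented.

```python
from itertools import combinations


def possible_responses(servers):
    """Every non-empty subset of servers that could be healthy at one moment."""
    return [set(combo)
            for size in range(1, len(servers) + 1)
            for combo in combinations(servers, size)]


print(len(possible_responses(["s1", "s2", "s3"])))  # 7
```

In general a Pool of n Servers has 2**n - 1 non-empty subsets, so ten Servers already give 1023 distinct possible responses.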

Topology: selects a target by evaluating the source IP address of each incoming query against a custom-configured set of rules (note that this source IP usually belongs to the recursive DNS resolver rather than the end user).  Use this e.g. if you need to provide a different answer to on-campus vs off-campus clients – or at least, to clients currently using an on-campus DNS resolver vs clients currently using an off-campus DNS resolver.

Beware of edge cases which may blur the distinction between on- and off-campus clients, such as split-tunnel VPN and AWS Enterprise VPCs.  Consider carefully what will happen in the event that such a client receives the “on-campus” answer from DNS Traffic Control but then actually connects to your service from an off-campus source IP, or vice versa.
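Conceptually, a Topology rule set is an ordered list of source networks with a target for each, plus a default. The sketch below models that with Python's ipaddress module; the networks shown are documentation ranges standing in for real on-campus prefixes, and the pool names and first-match-wins ordering are assumptions made for the example.

```python
import ipaddress

# Hypothetical rule set: the first matching source network wins.
RULES = [
    (ipaddress.ip_network("192.0.2.0/24"), "on-campus-pool"),
    (ipaddress.ip_network("2001:db8::/32"), "on-campus-pool"),
]
DEFAULT_TARGET = "off-campus-pool"


def topology_select(source_ip, rules=RULES, default=DEFAULT_TARGET):
    """Pick a target from the query's source IP (usually the resolver's IP)."""
    addr = ipaddress.ip_address(source_ip)
    for network, target in rules:
        # Membership is False when the address and network families differ.
        if addr in network:
            return target
    return default


print(topology_select("192.0.2.53"))    # resolver inside the first network
print(topology_select("198.51.100.7"))  # resolver outside every listed network
```

Note that the input is the source address of the query, so a client using an off-campus resolver is classified by that resolver's address rather than its own.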

Round Robin: rotates through the available targets in ordered sequence, returning the “next” one for each successive query (1, 2, 3, 1, 2, 3, …).  This alternative active/active method involves maintaining state on the nameserver, which will expire if no matching queries are received for a while (empirically around 60s) causing the sequence to restart deterministically from 1.

In practice this can produce extremely uneven response distributions for records which are less frequently queried, so Ratio:Fixed is usually a better choice even if you don’t need to assign a different “weight” to each target.
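The effect of the expiring state can be reproduced with a small simulation. The class below is our own simplified model, not the vendor's code, and its 60-second idle expiry is taken from the empirical figure mentioned above.

```python
from collections import Counter


class RoundRobin:
    """Simplified model of Round Robin selection with idle-state expiry."""

    def __init__(self, targets, idle_expiry=60.0):
        self.targets = targets
        self.idle_expiry = idle_expiry  # seconds; approximate, per the text above
        self.position = 0
        self.last_query = None

    def select(self, now):
        # Forget the position if no query has arrived for too long.
        if self.last_query is not None and now - self.last_query > self.idle_expiry:
            self.position = 0
        self.last_query = now
        choice = self.targets[self.position % len(self.targets)]
        self.position += 1
        return choice


def distribution(query_gap, queries=300):
    """Count selections when queries arrive every query_gap seconds."""
    rr = RoundRobin(["1", "2", "3"])
    return Counter(rr.select(i * query_gap) for i in range(queries))


print(distribution(1.0))    # busy name: the three targets are used evenly
print(distribution(120.0))  # quiet name: the state always expires, so only "1" is returned
```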

Configuring DNS Traffic Control

Unfortunately we are not able to offer any self-service configuration of DTC objects within IPAM at this time.  We are hopeful that future software releases from the vendor will provide enough granular permissions that we can safely open up at least partial self-service, but for now, all DTC configuration changes should be made by request to hostmgr (except fallback records, which are not themselves DTC).

The following documentation will walk you through the design process step by step; we encourage you to read through all the steps, and then submit one request with as much detail as possible.

1. Configure Non-DTC Fallback Records

First, follow the usual self-service steps to create one or more normal Stand-alone A and/or AAAA Records for each fully-qualified domain name (FQDN) which will be handled by DNS Traffic Control.  Ultimately these static records will be unused most of the time, but they will be returned as a fallback whenever DNS Traffic Control is unable to synthesize a satisfactory query response targeting a known healthy DTC Server.  Broadly speaking, this can happen for two reasons:

  1. Your servers are working, but something is wrong with DNS Traffic Control’s monitoring of your servers.  Consider in this scenario what static DNS response would generally give clients the best chance of successfully using your service.

    Note that it is common for DTC to return fallback records for up to 1 second after a behind-the-scenes service restart in which other DTC configuration changes are being applied (even if those changes have nothing to do with your LBDN).

  2. None of your servers are working.  Consider in this scenario what static DNS response would be most useful to have already cached (by recursive DNS servers and/or client applications) at the moment when your servers begin working again.

Be sure to set a custom TTL (e.g. 60 seconds) on your fallback records so that they will not remain cached for too long once DTC is again able to synthesize a response.

The two most common choices for a fallback resource record set are:

  • all of the eligible target IPs (essentially falling back to traditional “round-robin DNS”)

    Note: in this case, if you want the target IPs to have matching PTRs pointing back to the service FQDN (and assuming the target IPs do not have other PTRs already) you may opt to create a fallback Host record rather than stand-alone A/AAAA records.

  • just one of the eligible target IPs (this might be preferable in some active/passive use cases where there is a significant penalty associated with actually using the secondary server)

Once your DTC LBDN is configured, Grid Manager is intended to display the fallback records in strikethrough font as a visual reminder that they are masked by DTC.

Known Issue: the strikethrough font logic is currently not working, and Grid Manager displays the fallback records in normal font instead.

2. Choose or Customize a Health Monitor

Next, decide how IPAM should determine the health of your DTC Servers (i.e. target IP addresses).  You can use one of the default pre-defined Health Monitors (icmp, http, or https), or email hostmgr to request a customized Health Monitor for your service.

Try to choose a health monitor that will not be blocked by firewall policy; by default, each authoritative nameserver performs its own health monitor testing independently of the others, and not all of them have University of Illinois System IP addresses.  If necessary we can work around this with consolidated health monitor settings (configured per Pool, details further down).

Best Practice

The most powerful and flexible health check is achieved by configuring a DTC Health Monitor to invoke your own HTTP-based self-test routine which runs on each target server and answers to “GET /healthcheck HTTP/1.0” or similar, returning HTTP response status code 200 (OK) if and only if the server is healthy.

Advantages of this approach:

  • Your self-test implementation can do anything you want and be as elaborate as you like, yet the interaction with DTC is very simple and easy to troubleshoot.
  • You can manually disable the self-test response (while leaving the actual service running) to gracefully “drain” a server of clients before taking it offline for maintenance.
  • You can use the same health check for both DTC and SLB if desired.

Caution: be careful not to create a denial-of-service vector for your servers by performing lengthy or resource-intensive operations upon every single HTTP request.  If such operations are needed, perform them periodically and cache the results.
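A minimal sketch of such a self-test endpoint, using only the Python standard library, might look like the following. The /healthcheck path comes from the example above; the port, the cache period, the 503 status for an unhealthy result, and the placeholder run_self_test function are assumptions for illustration, and a production service would more likely add an equivalent route to its existing web stack.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

CACHE_SECONDS = 10  # assumed refresh period for the expensive self-test
_cache = {"checked_at": None, "healthy": False}


def run_self_test():
    """Placeholder for your real (possibly slow) checks; hypothetical."""
    return True


def is_healthy(now=None, self_test=run_self_test):
    """Return the cached result, re-running the self-test only when stale."""
    now = time.monotonic() if now is None else now
    if _cache["checked_at"] is None or now - _cache["checked_at"] >= CACHE_SECONDS:
        _cache["healthy"] = bool(self_test())
        _cache["checked_at"] = now
    return _cache["healthy"]


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthcheck":
            self.send_response(404)
        elif is_healthy():
            self.send_response(200)  # healthy: the status code DTC checks for
        else:
            self.send_response(503)  # any non-200 status marks the server unhealthy
        self.send_header("Content-Length", "0")
        self.end_headers()


def serve(port=8080):
    """Run the health endpoint until interrupted (port is an assumption)."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Calling serve() starts the listener. Because is_healthy() caches its result, a burst of health monitor requests triggers at most one run of the expensive self-test per cache period.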

Note that DTC can be configured to send HTTP health monitor requests to any port, and if necessary even to an alternate IP address (e.g. if your actual servers are non-customizable appliances).

Please provide the following details when requesting a customized Health Monitor:

  • Interval (seconds): how long to wait between the end of one health monitor cycle (either receiving a response or timing out) and sending the next request to the same DTC Server.  Default: 5s
  • Timeout (seconds): how long to wait for a response before giving up.  Default: 15s
  • Retry Up Count: number of consecutive successes for a currently unhealthy server to be marked healthy.  Default: 1
  • Retry Down Count: number of consecutive failures for a currently healthy server to be marked unhealthy.  Default: 1
  • protocol: choose one of
    • ICMP (ping): sends an ICMP or ICMPv6 echo request and expects an ICMP/ICMPv6 echo response.  No additional parameters.

    • HTTP: sends an HTTP request and optionally examines the response

      Parameters:

      • Port.  Default: 80
      • HTTP Request.  Default: GET /

        May be a multi-line message including HTTP headers.  Recommended examples:

        GET / HTTP/1.0

        GET / HTTP/1.1
        Host: example.illinois.edu
        Connection: close

        Note that we do not generally recommend using the HTTP/0.9 request format (e.g. “GET /” without HTTP-Version) because HTTP/0.9 responses do not include a status code; see RFC 1945 for more information.

      • Response Code Check: may require a specific HTTP response status code (e.g. 200) or accept any.  Default: any response code is valid
      • Response Content Check: may search the first 5KB of headers, body, or both using a POSIX Extended Regular Expression, and optionally perform additional validation on the matched content.  Default: do not check
    • HTTPS: sends an HTTPS request and optionally examines the response

      Parameters:

      • Port.  Default: 443
      • HTTP Request (same as HTTP above).  Default: GET /
      • Enable Certificate Validation (boolean).  Default: false
      • Enable SNI (Server Name Indication) (boolean; the SNI Hostname to use is configured in each DTC Server).  Default: false
      • Response Code Check (same as HTTP above).  Default: any response code is valid
      • Response Content Check (same as HTTP above).  Default: do not check
    • TCP: opens a TCP connection to the specified port (successful when the handshake completes)

      Parameters:

      • Port.  No default (must be customized)
    • SIP, PDP, SNMP: talk to us if you feel that one of these might be appropriate for your service
  • internal display name

    There is a single global namespace for all DTC Health Monitor objects, so try to choose something reasonable that begins with your department or group.

  • Owned By Domain: the name(s) of one or more domain models (from Contacts Database) that should confer “ownership” rights to this object.  Self-service management of DTC objects is not possible at this time, but any user with permissions on any of the named models will be authorized to request changes through hostmgr.
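The way Retry Up Count and Retry Down Count dampen status changes can be sketched as a small state machine. This is our own simplified model of the behavior described in the list above: the function name is hypothetical, and we assume a server starts out healthy and that the consecutive-result counter resets whenever a check agrees with the current status.

```python
def track_health(results, retry_up_count=1, retry_down_count=1, healthy=True):
    """Yield the health status after each check result (True = check succeeded)."""
    streak = 0  # consecutive results that disagree with the current status
    for ok in results:
        if ok == healthy:
            streak = 0
        else:
            streak += 1
            needed = retry_down_count if healthy else retry_up_count
            if streak >= needed:
                healthy = ok
                streak = 0
        yield healthy


checks = [False, True, False, False]
print(list(track_health(checks)))                      # defaults: status follows every result
print(list(track_health(checks, retry_down_count=2)))  # a single failure is tolerated
```

With the default counts of 1 a single failed check marks the server unhealthy, so raising Retry Down Count trades slower failure detection for fewer spurious flaps.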

You can examine Health Monitors under Data Management > DNS > Traffic Control by selecting Manage Health Monitors from the Toolbar.

3. Configure DTC Servers

Please email hostmgr to request creation or modification of DTC Server objects.  Provide the following details for each:

  • the IPv4 or IPv6 address which should be returned in response to A or AAAA queries when this DTC Server is selected as the answer

  • (optional) any Health Monitors for this DTC Server which must target a different IP address than the one returned in response to queries

  • (optional) an alternate SNI Hostname to be used for HTTPS Health Monitors
  • internal display name

    This name has no functional effect on DNS behavior; it is used only for administrative purposes within Grid Manager.  There is a single global namespace for all DTC Server objects, so try to choose something reasonable that begins with your department or group.

  • Owned By Domain: the name(s) of one or more domain models (from Contacts Database) that should confer “ownership” rights to this object.  Self-service management of DTC objects is not possible at this time, but any user with permissions on any of the named models will be authorized to request changes through hostmgr.

Once configured, you can examine DTC Server objects under Data Management > DNS > Traffic Control.

4. Configure DTC Pools

Please email hostmgr to request creation or modification of DTC Pool objects.  Provide the following details for each:

  • which DTC Servers should be members of this Pool

    Note that it is possible to add the same DTC Server to multiple Pools.

  • which Load Balancing method should be used to select a member (choices are described above under “Load Balancing Methods”)
  • which Health Monitors should be applied to all members (targeting the main IP address configured for each DTC Server object)

  • (optional) Consolidated Health Monitor settings: should off-campus nameservers use health status information obtained by the on-campus nameservers, instead of performing their own independent tests?

    This workaround makes it possible to use DNS Traffic Control for internal use cases where your entire service (and perforce also the health monitor target) is only reachable from on campus.

  • internal display name

    This name has no functional effect on DNS behavior; it is used only for administrative purposes within Grid Manager.  There is a single global namespace for all DTC Pool objects, so try to choose something reasonable that begins with your department or group.

  • Owned By Domain: the name(s) of one or more domain models (from Contacts Database) that should confer “ownership” rights to this object.  Self-service management of DTC objects is not possible at this time, but any user with permissions on any of the named models will be authorized to request changes through hostmgr.

Once configured, you can examine DTC Pool objects under Data Management > DNS > Traffic Control.

5. Configure the LBDN

Please email hostmgr to request creation or modification of a DTC LBDN object for your service.  Provide the following details:

  • one or more fully-qualified domain names (FQDNs) for which this LBDN object should synthesize responses, e.g. example.illinois.edu

    Best Practice

    Wherever possible, configure the LBDN to synthesize responses for a single primary FQDN, and implement any additional FQDNs using normal Stand-alone CNAME records:

    www.example.illinois.edu. IN CNAME example.illinois.edu.
    myothersubdomain.illinois.edu. IN CNAME example.illinois.edu.

    Assign multiple FQDNs to the same LBDN only if they cannot be implemented as aliases (e.g. because they reside at the apex of a zone).

  • the TTL value to return for synthesized responses, e.g. 60 seconds

  • one or more DTC Pools to use

    Note that it is possible to add the same DTC Pool to multiple LBDN objects.

  • which Load Balancing method should be used to select a Pool (if using more than one)
  • internal display name

    This name has no functional effect on DNS behavior; it is used only for administrative purposes within Grid Manager.  There is a single global namespace for all LBDN objects, so try to choose something reasonable that begins with your department or group.

  • Owned By Domain: the name(s) of one or more domain models (from Contacts Database) that should confer “ownership” rights to this object.  Self-service management of DTC objects is not possible at this time, but any user with permissions on any of the named models will be authorized to request changes through hostmgr.

Once configured, you can examine LBDN objects under Data Management > DNS > Traffic Control.

You can also see their DTC LBDN Records which appear within the associated DNS Zones.

The screenshot below shows a DTC LBDN Record along with two fallback A records and two fallback AAAA records (displayed in strikethrough font as a visual reminder that they are not used except as a fallback).  Note that this screen does not display the TTL or the possible IP addresses which might be returned by the LBDN itself (since these do not necessarily correspond to the fallback records).  The TXT record for the same domain name is completely unaffected by DNS Traffic Control.

Known Issue: the strikethrough font logic is currently not working, and Grid Manager displays the fallback records in normal font instead.

Monitoring DNS Traffic Control

You can see the health status of your DTC objects in Grid Manager by navigating to Data Management > DNS > Traffic Control.

Notice the “Last Status Update” time!  When your server goes down or comes back up, the underlying health check and actual DNS query responses will update quickly, but it may take a few minutes for the new status to be displayed in Grid Manager.

Select an object and click “Show Visualization” from the Toolbar panel on the right to see at a glance how this object relates to other DTC objects.

Comparisons with other technologies

The following sections discuss similarities, differences, and trade-offs between DNS Traffic Control and several other technologies which can be used for similar purposes.

DTC vs Traditional Round-Robin DNS

Publishing several static (non-DTC) A or AAAA records for the same domain name, a technique known as “round-robin DNS”, is a simple and time-honored way to achieve basic active/active load distribution and (sometimes) fault tolerance among multiple target IPs.

Most recursive DNS resolvers will randomly permute the order of the record set for each response, and most clients will initially try whichever IP appears first in the list they receive.  If that first IP doesn’t work, many clients are smart enough to automatically try another one, although this may involve a noticeable delay (outcomes vary widely depending on client implementation).  Without manual intervention, the unhealthy IP will continue to appear in round-robin DNS responses and thus continue to impact the end-user experience for some fraction of new clients until service is restored.

DNS Traffic Control attempts to improve upon this user experience by no longer returning an address record for the unhealthy IP, thereby (eventually, once the old response has expired from all caches) directing all new connections to a healthy IP. 

It is worth noting that in certain pathological cases (see also “TTL Caveat” above), an improperly-cached stale DTC answer containing only the now-unhealthy IP could actually produce worse outcomes than a cached traditional round-robin answer containing all of the IPs.

Of course DTC also offers some additional options which can’t be implemented using traditional round-robin DNS, including active/passive and topology-sensitive behaviors.  But if simple active/active is what you want, it’s worth considering whether traditional round-robin DNS might be good enough (perhaps with a moderately low TTL to facilitate manual changes in response to a protracted outage) before deciding to implement DTC.

Note that despite having a similar name, the Round Robin DTC load balancing method has nothing to do with traditional round-robin DNS.

DTC vs network-layer Server Load Balancing (SLB)

In typical internet applications (e.g. web browsing), the client first uses DNS to resolve a target domain name into an IP address, then initiates a connection to the server at that IP address.

DNS Traffic Control alters the first part of this process by returning a different DNS record so that the client will connect to a different IP address.  Network-layer Server Load Balancing (SLB) alters the second part of the process by distributing client requests made to the same virtual server IP address (VIP) among different real servers behind the scenes.

Advantages of SLB:

  • SLB offers better fault tolerance for individual real server failures, because failover is transparent to the client; existing clients need only reconnect to the same VIP, and new clients are not impacted at all.  By contrast, failover in DNS Traffic Control requires client cooperation and is subject to the potential pitfalls described in “TTL Caveat” above.
  • Because SLB handles individual client connections, it can perform true load balancing (i.e. spreading the workload evenly among multiple active real servers).  The best DNS Traffic Control can do is to provide different answers to different recursive DNS resolvers and trust to the law of large numbers to distribute the overall workload, which may not be effective e.g. if many of the clients are using the same recursive DNS resolver, or if a few clients generate disproportionately large numbers of connections.
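The second point can be illustrated with a toy simulation of one TTL window, in which each recursive resolver caches a single randomly chosen answer and passes it to all of its clients. The resolver sizes, server names, and function name are made up for the example.

```python
import random
from collections import Counter


def simulate_load(clients_per_resolver, servers, rng):
    """One TTL window: each resolver caches one randomly chosen server."""
    load = Counter({server: 0 for server in servers})
    for clients in clients_per_resolver:
        # Every client behind this resolver receives the same cached answer.
        load[rng.choice(servers)] += clients
    return load


# Hypothetical population: one very large resolver and ten small ones.
resolvers = [9000] + [100] * 10
print(simulate_load(resolvers, ["server-a", "server-b"], random.Random(0)))
```

Whichever server the large resolver happens to receive carries at least 90% of the clients in that window, no matter how evenly the DNS answers themselves were distributed.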

The primary limitation of SLB is that it is only available in certain locations and places additional constraints on the networks your real servers may belong to; see Server Load Balancing (SLB) for details.

The comparative advantage of DNS Traffic Control is that it can provide failover and load distribution between multiple geographically-distributed sites where sharing the same virtual IP address is not possible, thereby offering some protection (subject to “TTL Caveat” above) against a total failure of one of the sites.

DTC and SLB can also be used in tandem, with DTC directing clients to one of several SLB VIPs (which may in turn be located at different sites).  Note in this case that DTC will need a way to monitor the health of each SLB VIP.

DTC vs Global Server Load Balancing (GSLB)

GSLB is a term commonly used by many other vendors to describe the same functionality which we (and our vendor) call DNS Traffic Control.  We prefer the term DNS Traffic Control because it is more descriptive of the actual behavior and harder to confuse with SLB.

DTC vs Amazon Route 53 DNS Failover

Amazon Route 53 DNS Failover provides functionality very similar to DNS Traffic Control, but for DNS domains or subdomains which have been delegated to Route 53 (see AWS Authoritative DNS Guide for Illinois) instead of domains which are served by IPAM.

The main advantage of Route 53 DNS Failover is that it can take advantage of what AWS already knows about your target resources, e.g. automatically evaluating the health of alias records for Elastic Load Balancers.

DTC vs Virtual Alias records

See Virtual Alias records vs DNS Traffic Control.