The "old" domain survey, as we will now call it, counted hosts by walking the domain name tree and doing zone transfers of domain data to discover hosts and further subdomains. It is described more completely in RFC1296. The old survey counted the number of domain names that had IP addresses assigned to them.
For each IN-ADDR.ARPA network number delegation, we query for further subdelegations at each network octet boundary below that point. This process takes about two days, and when it ends we have a list of all 3-octet network number delegations that exist and the names of the authoritative domain servers that handle queries for them. This reduces the number of queries we need to send from 4.3 billion (one per possible address) to the number of possible hosts per delegation (254) times the number of delegations found. In the January 1998 survey, there were 879,212 delegations, or just 223,319,848 possible hosts.
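The arithmetic behind that reduction can be checked with a few lines of Python. This is only an illustrative sketch: the function names are ours, and the network 192.0.2 is a documentation example rather than a delegation from the survey.

```python
ADDRESS_SPACE = 2 ** 32          # every possible IPv4 address, about 4.3 billion
HOSTS_PER_DELEGATION = 254       # host octets 1 through 254
DELEGATIONS_JAN_1998 = 879_212   # figure quoted in the text above


def possible_hosts(delegations: int) -> int:
    """Number of PTR queries needed to cover every host in every delegation."""
    return delegations * HOSTS_PER_DELEGATION


def reverse_zone(network: str) -> str:
    """Map a 3-octet network such as '192.0.2' to its IN-ADDR.ARPA zone name."""
    octets = network.split(".")
    if len(octets) != 3:
        raise ValueError("expected a 3-octet network like '192.0.2'")
    return ".".join(reversed(octets)) + ".in-addr.arpa."


total = possible_hosts(DELEGATIONS_JAN_1998)
print(total)                                   # 223319848
print(round(100 * total / ADDRESS_SPACE, 1))   # 5.2 (percent of the address space)
print(reverse_zone("192.0.2"))                 # 2.0.192.in-addr.arpa.
```

In other words, knowing the delegations lets the survey skip roughly 95% of the address space.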
With the list of 3-octet delegations in hand, the next phase of the survey sends out a common UDP-based PTR query for each possible host address between 1 and 254 for each delegation. In order to prevent flooding any particular server, network or router with packets, the query order is pseudo-randomized to spread the queries evenly across the Internet. For example, a domain server that handles a single 3-octet IN-ADDR.ARPA delegation would only see one or two queries per hour. Depending on the time of day, we transmit between 600 and 1200 queries per second. The queries are streamed out asynchronously and we handle replies as they return. This phase takes about 8 days to run.
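The query ordering described above might be sketched as follows. This is a minimal illustration, not the survey's actual program: the function names and the seed parameter are our own, and a simple in-memory shuffle stands in for whatever pseudo-randomization the survey really uses.

```python
import random


def ptr_name(network: str, host: int) -> str:
    """Build the PTR query name for one host in a 3-octet network.

    Example: network '192.0.2', host 7 -> '7.2.0.192.in-addr.arpa.'
    """
    a, b, c = network.split(".")
    return f"{host}.{c}.{b}.{a}.in-addr.arpa."


def randomized_queries(networks, seed=0):
    """Return every PTR name for hosts 1..254 in each network, shuffled.

    Shuffling the whole list means consecutive queries rarely go to the
    same delegation, so no single server sees a burst of traffic.
    """
    names = [ptr_name(net, host) for net in networks for host in range(1, 255)]
    random.Random(seed).shuffle(names)
    return names


queries = randomized_queries(["192.0.2", "198.51.100"])
print(len(queries))   # 508
```

A list of more than 200 million names would not be shuffled in memory this casually. The sketch only shows the idea of interleaving delegations so that each server's 254 queries are spread across the whole run.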
With the new survey we are now publishing five figures per top-level domain (on our distribution by t-l-d charts). For each t-l-d, we show the total number of hosts found (which equals the number of PTR records found), the number of duplicate host names found (which usually indicate a host with many addresses), and then we subtract the duplicate count to arrive at the final host count.
We also publish two new numbers: counts of the names found at the 2nd and 3rd levels under each t-l-d. These counts have different meanings depending on how the particular t-l-d is organized. For example, for the .COM domain, the number of 2nd-level names equals the number of organizations using names registered under .COM, and the number of 3rd-level names is possibly meaningless. However, some t-l-d's, like .UK and .AU, have a few fixed subdomains at the 2nd level (like .CO.UK), and so the 3rd-level count shows the number of organizations.
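The five per-t-l-d figures could be derived from a list of PTR results along the following lines. This is a sketch under stated assumptions: the function name and dictionary keys are ours, and we assume a "duplicate" means a host name that has already been seen in an earlier PTR record.

```python
from collections import defaultdict


def tld_figures(ptr_names):
    """Compute the five per-t-l-d figures from a list of PTR host names."""
    by_tld = defaultdict(list)
    for name in ptr_names:
        # Normalize case and drop the trailing dot before splitting into labels.
        labels = tuple(name.lower().rstrip(".").split("."))
        by_tld[labels[-1]].append(labels)

    figures = {}
    for tld, entries in by_tld.items():
        distinct = set(entries)
        figures[tld] = {
            "total": len(entries),                       # PTR records found
            "duplicates": len(entries) - len(distinct),  # repeated host names
            "hosts": len(distinct),                      # total minus duplicates
            "second_level": len({e[-2:] for e in entries if len(e) >= 2}),
            "third_level": len({e[-3:] for e in entries if len(e) >= 3}),
        }
    return figures


sample = ["www.nw.com.", "nw.com.", "mail.nw.com.", "mail.nw.com.",
          "a.b.co.uk.", "c.b.co.uk.", "x.y.co.uk."]
result = tld_figures(sample)
print(result["com"]["hosts"], result["com"]["second_level"])   # 3 1
print(result["uk"]["second_level"], result["uk"]["third_level"])  # 1 2
```

In the sample, .COM has one 2nd-level name (nw.com), while .UK has one fixed 2nd-level name (co.uk) and two 3rd-level names (b.co.uk and y.co.uk), which is where the organizations show up.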
We decided not to try to verify the PTR entries we collected (by trying to look up the name returned and verify its address matched the PTR record). One reason is that this process would take far longer than the PTR lookup process. However, another reason is that there are a lot of PTR entries that are wrong, even though the host actually does exist. Cases were found where an IP address was pingable and had a PTR entry, but a lookup on the hostname did not return an address.
In our distribution by t-l-d charts, we show an entry called "ARPA" and one called "UNKNOWN". The count for ARPA shows the number of PTR entries where an administrator tried to set up a host name but left off the trailing dot in the zone file, so the name server appended the zone name (which ends in IN-ADDR.ARPA) to it. These are hosts that probably exist but have an invalid host name. The UNKNOWN count shows the number of PTR entries that did not have any valid t-l-d name. These are sometimes typos and other times entries for unused addresses (for example, a domain administrator might put in the hostname "unassigned" for any unused address).
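The way a PTR result ends up in one of these chart entries might be sketched as below. The function name is ours, and VALID_TLDS is a small illustrative subset, not the list the survey actually uses.

```python
# Illustrative subset only; a real run would need the full list of t-l-d's.
VALID_TLDS = {"com", "edu", "net", "org", "gov", "mil", "uk", "au"}


def chart_entry(ptr_target: str) -> str:
    """Return the chart entry a PTR result would be counted under."""
    labels = ptr_target.lower().rstrip(".").split(".")
    last = labels[-1]
    if last == "arpa":
        # Typically a name written without its trailing dot in the zone
        # file, so the IN-ADDR.ARPA zone name was appended to it.
        return "ARPA"
    if last in VALID_TLDS:
        return last.upper()
    return "UNKNOWN"


print(chart_entry("www.nw.com."))                         # COM
print(chart_entry("host.nw.com.2.0.192.in-addr.arpa."))   # ARPA
print(chart_entry("unassigned."))                         # UNKNOWN
```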
Note that this new survey has the same potential problems as the old survey: just because a hostname is assigned an IP address, or an IP address is assigned a hostname, does not mean the host actually exists. To estimate how many hosts actually exist at a given time, we ping a 1% sample of all the hosts found and apply the response rate to the total host count to obtain an estimate of the total number of pingable hosts. There are other potential survey problems, many of which are discussed in RFC1296.
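The sampling estimate can be illustrated as follows. This is a minimal sketch: is_pingable is a hypothetical callable supplied by the caller (a real one would send an ICMP echo request), and the function name and seed parameter are our own.

```python
import random


def estimate_pingable(hosts, is_pingable, fraction=0.01, seed=0):
    """Estimate the number of pingable hosts from a random sample.

    hosts       -- list of addresses found by the survey
    is_pingable -- hypothetical callable returning True if a host answers
    fraction    -- share of hosts to sample (1% in the text above)
    """
    if not hosts:
        return 0
    sample_size = max(1, round(len(hosts) * fraction))
    sample = random.Random(seed).sample(hosts, sample_size)
    responding = sum(1 for host in sample if is_pingable(host))
    # Scale the sample's response rate up to the full host count.
    return round(len(hosts) * responding / sample_size)


hosts = [f"host{i}" for i in range(10_000)]
print(estimate_pingable(hosts, lambda host: True))    # 10000
print(estimate_pingable(hosts, lambda host: False))   # 0
```

With 10,000 hosts the sample is 100 hosts, so each responding host in the sample stands for 100 hosts in the estimate.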
While comparing host counts per country code between the new survey and the last old survey, we found that a very small number of countries lost a significant number of hosts. We have not yet analyzed the data to find out exactly why this is occurring, but there are several possible reasons. We may simply have very poor network connectivity or heavy packet loss to certain countries, which interferes with the data collection process. Another possibility is that in certain places it is not common for providers to place entries in the IN-ADDR.ARPA tables. These anomalies will be looked at further in future surveys as we fine-tune the technique.
Another item some may notice is that our count of hostnames (or firstnames as we call them) has changed in interesting ways. For example, the number of hosts named "www" has dropped between the old survey and the new survey. The reason is that in the old firstname count, if a host had two names, for example nw.com and www.nw.com, that were both assigned the same IP address, the names "nw" and "www" would each be counted as a firstname for the same host. In the new survey, a PTR record can only return a single official hostname for a particular IP address. In the example above, the new survey would count either "nw" or "www" depending on which name the administrator set up to be the official name. Since the "www" count dropped between surveys, it appears that the "www." prefix is used heavily as an "alias" for official host names.
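The difference between the two firstname counts can be shown with the nw.com example from the text. This is a sketch with made-up data structures: the function names are ours, the address 192.0.2.1 is a documentation address, and we assume the administrator chose nw.com as the official name.

```python
from collections import Counter


def firstname(name: str) -> str:
    """Return the leftmost label of a host name, e.g. 'www' for www.nw.com."""
    return name.lower().split(".")[0]


def old_firstnames(forward_records):
    """Old survey: every name that maps to an address is counted."""
    return Counter(firstname(name) for name, _address in forward_records)


def new_firstnames(ptr_records):
    """New survey: only the one official name per address is counted."""
    return Counter(firstname(name) for name in ptr_records.values())


forward = [("nw.com", "192.0.2.1"), ("www.nw.com", "192.0.2.1")]
reverse = {"192.0.2.1": "nw.com"}   # assumed official name for the address

old = old_firstnames(forward)
new = new_firstnames(reverse)
print(old["nw"], old["www"])   # 1 1
print(new["nw"], new["www"])   # 1 0
```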