Robotwatching: NetResearchServer/2.2

Trying to suss out information on new web robots is always a little painful, since many webmasters can't (or won't) reveal the raw data they've collected on suspected bots. Robotwatching is an experiment in total disclosure: I'm going to share the entire log from an unknown robot's visit, share any useful observations I can make, and see if that helps further the discussions of mystery web spiders.

The access_log (with commentary)

64.133.xxx.xxx - - [05/Apr/2002:05:37:51 -0500] "GET /robots.txt HTTP/1.0" 200 116 "-" "NetResearchServer/2.2(loopimprovements.com/robot.html)"

Observations: It starts with robots.txt, so it's a well-behaved bot so far.
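For anyone who wants to follow along with their own logs, here is a minimal sketch of splitting one of these lines into fields. It assumes the standard Apache "combined" log format, which is what the entries on this page look like; the names LOG_PATTERN and parse_line are my own and not part of any tool.

```python
import re
from datetime import datetime

# Apache "combined" format (assumed):
# host ident user [time] "request" status bytes "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    # Timestamps look like 05/Apr/2002:05:37:51 -0500
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

sample = ('64.133.xxx.xxx - - [05/Apr/2002:05:37:51 -0500] '
          '"GET /robots.txt HTTP/1.0" 200 116 "-" '
          '"NetResearchServer/2.2(loopimprovements.com/robot.html)"')
entry = parse_line(sample)
print(entry["path"], entry["status"], entry["agent"])
```

Filtering a whole log for this robot is then a matter of keeping the entries whose agent field starts with "NetResearchServer".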

The URL included in the agent string, loopimprovements.com/robot.html, states:

IncyWincy crawls pages all over the world in order to build its Invisible Web search engine index. IncyWincy does not crawl entire websites. Instead it only crawls pages listed in the DMOZ Open Directory Project. IncyWincy is only interested in pages with HTML forms.

This is bad. Why is this bad? Because this used to be the information page for another robot. Apparently, NetResearchServer is just a new name for IncyWincy, a robot last seen at bauser.com on 30 March 2002. Gratuitously changing the names of web robots is not good citizenship.

NetResearchServer (née IncyWincy) is the robot that spiders sites for Loop Improvements LLC's "invisible web search engine". That engine (demonstrated at IncyWincy.com) attempts to expand the reach of the Open Directory Project database by spidering sites listed in the ODP, copying the first search box it finds, and incorporating the search box directly into the IncyWincy directory. That way, you can do remote searches of site databases from the IncyWincy directory. This is supposed to make it easier to find "invisible web" data that's often stored in formats that regular web robots won't find. In practice, it creates a lot of search forms that don't work, although it gets all the forms at bauser.com correct.

By the way, the IP address I obscured maps back to sprinthome.com, a domain used by Sprint Communications. NetResearchServer is available for licensing, so this might be a user of the software, rather than Loop Improvements itself.

64.133.xxx.xxx - - [05/Apr/2002:05:37:53 -0500] "GET /news.groups.reviews/ HTTP/1.0" 200 12188 "-" "NetResearchServer/2.2(loopimprovements.com/robot.html)"

Observations: So far, so good. NetResearchServer grabbed the first bauser.com page listed in the ODP. That page has a search box on it, so NetResearchServer didn't spider anything else. The IncyWincy category listing that URL includes an "[I-Web]" link to IncyWincy's version of the search box.

64.133.xxx.xxx - - [05/Apr/2002:05:57:20 -0500] "GET /roleplaying/ADnD/cantrips.html HTTP/1.0" 200 18446 "-" "NetResearchServer/2.2(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [05/Apr/2002:05:57:21 -0500] "GET /roleplaying/search.pl HTTP/1.0" 200 2421 "-" "NetResearchServer/2.2(loopimprovements.com/robot.html)"

Observations: Here's where we discover that Loop Improvements isn't telling the entire truth when it says NetResearchServer "only crawls pages listed in the DMOZ Open Directory Project". Here, the URL listed in the ODP doesn't have a search box, but it does have a link to "/roleplaying/search.pl". NetResearchServer must be programmed to check nearby URLs with certain keywords in their names. In this case, the search.pl page does have a search box, and that search box gets incorporated into the appropriate category at IncyWincy.
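I have no idea how NetResearchServer actually decides which extra URLs to fetch, but the behavior in the log is consistent with something like the sketch below: if a page has no form, look for links whose URL contains a keyword. The keyword list, the class and function names, the example.com address, and the sample HTML are all my own assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: the robot looks for "search" in link URLs. The real list is unknown.
KEYWORDS = ("search",)

class PageScanner(HTMLParser):
    """Records whether a page contains a <form> and collects its link targets."""

    def __init__(self):
        super().__init__()
        self.has_form = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.has_form = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def candidate_urls(page_url, html):
    """Guess which linked pages a form-hunting robot might try next."""
    scanner = PageScanner()
    scanner.feed(html)
    if scanner.has_form:
        return []  # the page already has a form, so there is nothing else to fetch
    return [urljoin(page_url, href) for href in scanner.links
            if any(k in href.lower() for k in KEYWORDS)]

# Invented sample page: no form, one link to a search script
html = '<html><body><a href="../search.pl">Search</a> <a href="index.html">Home</a></body></html>'
print(candidate_urls("http://example.com/roleplaying/ADnD/cantrips.html", html))
```

A rule like this would explain why a page with its own search box produces one request, while a page without one produces two.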

64.133.xxx.xxx - - [05/Apr/2002:05:57:23 -0500] "GET /roleplaying/Spelljammer/index.html HTTP/1.0" 200 7812 "-" "NetResearchServer/2.2(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [05/Apr/2002:05:57:24 -0500] "GET /roleplaying/search.pl HTTP/1.0" 200 2421 "-" "NetResearchServer/2.2(loopimprovements.com/robot.html)"

Observations: Here, NetResearchServer does the exact same thing. In fact, it grabs the exact same search page. The bot will go on to grab "/roleplaying/search.pl" two more times. Apparently NetResearchServer doesn't cache data in a way that can prevent it from making redundant requests.
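To put a number on the redundancy, here is a small sketch that counts repeated paths across the whole April 5 visit. The paths are copied from the log on this page; the function name redundant_requests is my own.

```python
from collections import Counter

# Request paths from the 05 April 2002 visit, in log order
paths = [
    "/robots.txt",
    "/news.groups.reviews/",
    "/roleplaying/ADnD/cantrips.html",
    "/roleplaying/search.pl",
    "/roleplaying/Spelljammer/index.html",
    "/roleplaying/search.pl",
    "/roleplaying/StarFrontiers/index.html",
    "/roleplaying/search.pl",
    "/roleplaying/Ghostbusters/",
    "/roleplaying/search.pl",
    "/beer/index.html?pref=breweries",
    "/beer/search.pl",
]

def redundant_requests(paths):
    """Return the paths that were requested more than once, with their counts."""
    counts = Counter(paths)
    return {path: n for path, n in counts.items() if n > 1}

repeats = redundant_requests(paths)
print(repeats)  # {'/roleplaying/search.pl': 4}
```

Three of those four requests could have been avoided if the robot remembered which URLs it had already fetched during the visit.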

64.133.xxx.xxx - - [05/Apr/2002:05:57:29 -0500] "GET /roleplaying/StarFrontiers/index.html HTTP/1.0" 200 9215 "-" "NetResearchServer/2.2(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [05/Apr/2002:05:57:30 -0500] "GET /roleplaying/search.pl HTTP/1.0" 200 2421 "-" "NetResearchServer/2.2(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [05/Apr/2002:05:57:37 -0500] "GET /roleplaying/Ghostbusters/ HTTP/1.0" 200 14961 "-" "NetResearchServer/2.2(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [05/Apr/2002:05:57:37 -0500] "GET /roleplaying/search.pl HTTP/1.0" 200 2421 "-" "NetResearchServer/2.2(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [05/Apr/2002:06:45:02 -0500] "GET /beer/index.html?pref=breweries HTTP/1.0" 200 20056 "-" "NetResearchServer/2.2(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [05/Apr/2002:06:45:03 -0500] "GET /beer/search.pl HTTP/1.0" 200 4904 "-" "NetResearchServer/2.2(loopimprovements.com/robot.html)"

Observations: The last two requests made that day show that IncyWincy's bot isn't afraid of URLs with trailing preferences (query strings such as "?pref=breweries"), and isn't afraid to follow links from them either.

NetResearchServer made 12 requests (based on 6 entries in the ODP) in 1 hour, 7 minutes, and 12 seconds. All in all, a very low-volume robot. It missed one bauser.com URL listed in the ODP.
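The elapsed time comes straight from the first and last timestamps of the April 5 visit. A quick sketch of the arithmetic, using only the two timestamps from the log:

```python
from datetime import datetime

FMT = "%d/%b/%Y:%H:%M:%S %z"

# First and last requests of the 05 April 2002 visit
first = datetime.strptime("05/Apr/2002:05:37:51 -0500", FMT)
last = datetime.strptime("05/Apr/2002:06:45:03 -0500", FMT)

elapsed = last - first
hours, remainder = divmod(int(elapsed.total_seconds()), 3600)
minutes, seconds = divmod(remainder, 60)
print(hours, minutes, seconds)  # 1 7 12
```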

64.133.xxx.xxx - - [07/Apr/2002:08:14:44 -0400] "GET /robots.txt HTTP/1.0" 200 116 "-" "NetResearchServer/2.2(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [07/Apr/2002:08:14:45 -0400] "GET /news.groups.reviews/alt.religion.scientology/1.html HTTP/1.0" 200 2764 "-" "NetResearchServer/2.2(loopimprovements.com/robot.html)"

Observations: Two days later, NetResearchServer returns, requesting robots.txt and the one page it missed. This suggests that it takes two days (or more) for NetResearchServer to spider the entire ODP database (which contained approximately 3.39 million URLs in April 2002).

NetResearchServer crawled bauser.com again 20 days later:

64.133.xxx.xxx - - [27/Apr/2002:05:53:27 -0400] "GET /robots.txt HTTP/1.0" 200 159 "-" "NetResearchServer/2.3(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [27/Apr/2002:05:53:28 -0400] "GET /news.groups.reviews/ HTTP/1.0" 200 12188 "-" "NetResearchServer/2.3(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [27/Apr/2002:06:11:22 -0400] "GET /robots.txt HTTP/1.0" 200 159 "-" "NetResearchServer/2.3(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [27/Apr/2002:06:11:23 -0400] "GET /roleplaying/ADnD/cantrips.html HTTP/1.0" 200 18958 "-" "NetResearchServer/2.3(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [27/Apr/2002:06:11:24 -0400] "GET /roleplaying/search.pl HTTP/1.0" 200 2421 "-" "NetResearchServer/2.3(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [27/Apr/2002:06:11:26 -0400] "GET /roleplaying/Spelljammer/index.html HTTP/1.0" 200 8418 "-" "NetResearchServer/2.3(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [27/Apr/2002:06:11:27 -0400] "GET /roleplaying/search.pl HTTP/1.0" 200 2421 "-" "NetResearchServer/2.3(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [27/Apr/2002:06:11:32 -0400] "GET /roleplaying/StarFrontiers/index.html HTTP/1.0" 200 8662 "-" "NetResearchServer/2.3(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [27/Apr/2002:06:11:33 -0400] "GET /roleplaying/search.pl HTTP/1.0" 200 2421 "-" "NetResearchServer/2.3(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [27/Apr/2002:06:11:42 -0400] "GET /roleplaying/Ghostbusters/ HTTP/1.0" 200 15001 "-" "NetResearchServer/2.3(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [27/Apr/2002:06:11:42 -0400] "GET /roleplaying/search.pl HTTP/1.0" 200 2421 "-" "NetResearchServer/2.3(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [27/Apr/2002:07:00:09 -0400] "GET /beer/index.html?pref=breweries HTTP/1.0" 200 19202 "-" "NetResearchServer/2.3(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [27/Apr/2002:07:00:10 -0400] "GET /beer/search.pl HTTP/1.0" 200 5360 "-" "NetResearchServer/2.3(loopimprovements.com/robot.html)"
64.133.xxx.xxx - - [29/Apr/2002:06:45:35 -0400] "GET /news.groups.reviews/alt.religion.scientology/1.html HTTP/1.0" 200 2764 "-" "NetResearchServer/2.3(loopimprovements.com/robot.html)"

Observations: This crawl was virtually identical to the April 5 crawl, although a bit faster (the initial visit took 1 hour, 6 minutes, and 43 seconds, against 1 hour, 7 minutes, and 12 seconds on April 5). The straggler URL is again grabbed 2 days later, but without the extra request for robots.txt.

NetResearchServer appears to be requesting pages in the alphabetical order of their ODP categories. It first grabs the URL from Computers/Usenet, then several URLs in branches of Games/Roleplaying/Genres, then a URL listed in Recreation/Food/Drink/Beer, and finally, a URL listed in Society/Religion_and_Spirituality/Opposing_Views/Scientology/a.r.s._Related.
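The ordering claim is easy to check mechanically. This sketch lists the four category paths in the order the robot visited them and compares that against a plain string sort; it says nothing about how the robot itself orders its work, only that the observed order matches.

```python
# ODP categories in the order NetResearchServer requested their URLs
categories = [
    "Computers/Usenet",
    "Games/Roleplaying/Genres",
    "Recreation/Food/Drink/Beer",
    "Society/Religion_and_Spirituality/Opposing_Views/Scientology/a.r.s._Related",
]

# True if the visit order is already the alphabetical order
in_order = categories == sorted(categories)
print(in_order)  # True
```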

Backtracking through my server logs, I see that the IncyWincy robot demonstrated virtually the same patterns (including the two-day gap) as NetResearchServer, lending credence to the theory that they're identical robots with different names.

Followup research

A Google search for NetResearchServer revealed a lot of site statistics reports from April 2002, but no discussions of the bot. Either nobody else has noticed that IncyWincy changed its name, or nobody cares.
