An HTTP robot in Tcl

On Wed, 21 Feb 2001, Eric Gorr wrote in comp.lang.tcl:

 Is it possible to use the HTTP package to (I guess) send a URL to a
 site, which can pass some parameters to a cgi script at the site and
 then capture what the cgi script returns, without going through a web
 browser?
 Where might I find sample code to help me write this?

Michael A. Cleverly responded: If I understand your question correctly, you are interested in a Tcl script which talks to a foreign web server (dynamically passing variables to a cgi script on that server), captures and parses the output, and does something with it, all without the use of a web browser. (CL notes that we sometimes call this "Web scraping".)

Basically the steps are:

  1. Figure out what the location (URL) is and what inputs (form variables) you need to pass
  2. Use the http package to fetch the page
  3. Parse the results and glean whatever data you are after

For the first step, go to the website, find the (first) page, view the source, and look at what form variables are expected. Some sites may also depend on the presence of one or more cookies, so you may need to set your browser to prompt you before accepting cookies so you can see what the values are. Lynx comes in handy in this respect.
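If a site does turn out to require a cookie, you can pass it along yourself with the -headers option to http::geturl. A minimal sketch (the URL and the cookie name/value here are made-up placeholders, just to show the mechanics):

 package require http

 # send a cookie we observed in the browser along with our request;
 # "session=abc123" is a made-up placeholder value
 set tok [::http::geturl http://www.example.com/page \
              -headers [list Cookie "session=abc123"]]
 set html [::http::data $tok]
 ::http::cleanup $tok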

The second step is straightforward. Use the http package, specifically http::geturl, and save the HTML you receive back in a variable. The third step then is normally to craft one or more regular expressions (or other string commands) to get at the data you want.
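For instance, a minimal fetch-and-parse sketch along those lines (the URL and the <title> pattern are just stand-ins for whatever page and data you're after):

 package require http

 # fetch a page and save the returned HTML in a variable (step two)
 set tok  [::http::geturl http://www.example.com/]
 set html [::http::data $tok]
 ::http::cleanup $tok

 # then pull a piece of data out with a regular expression (step three)
 if {[regexp -nocase {<title>([^<]*)</title>} $html -> title]} {
     puts "Page title: $title"
 }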

Here's a simple example I wrote to illustrate these steps, while at the same time doing something at least moderately "useful."

For basically all intents and purposes, with only a handful of exceptions (fewer than half a dozen nationwide where zipcodes cross state lines, IIRC), the first three digits of a zipcode uniquely identify a state. Supposing you wanted this data, you could purchase it from one of many vendors of zipcode data, or you could roll up your sleeves and write a little web robot in Tcl to repeatedly query the US Postal Service's website.

So, step one. We go to http://www.usps.gov . We click on the link in their navbar for "Find Zip Codes." From there we click on the link for the "City/State/ZIP Code Associations" page. We get to a page where we can enter a zipcode. Let's enter a zipcode just to see how it works and what kind of data we'll get. Any zipcode will do. I live in 84041, so I put that in.

Now we're at http://www.usps.gov/cgi-bin/zip4/ctystzip2 . There are no form variables in the URL itself (?zipcode=blah,blah,blah type stuff), so they must be using the POST method. The results we get are formatted in a fixed-width font and look like:

                                      For this ZIP Code,        ZIP Code
 City Name                  State    the city name is:         Type
 ----------------------------------------------------------------------
 LAYTON                     UT       ACCEPTABLE (DEFAULT)      STANDARD
 WEST LAYTON                UT       NOT ACCEPTABLE-           STANDARD
                                     USE LAYTON

Looking at the HTML source in our browser we see:

 <FORM METHOD="POST" ACTION="/cgi-bin/zip4/ctystzip2">
 <INPUT SIZE="35" MAXLENGTH="35" NAME="ctystzip" value="84041">
 <INPUT TYPE="submit" VALUE="Process">

So now we know that we need to POST to http://www.usps.gov/cgi-bin/zip4/ctystzip2 and pass in a form variable of ctystzip with a five-digit zipcode. We'll get an HTML page back, and for a valid zipcode it will contain one or more city names, a run of spaces, a two-letter state abbreviation, and then the words "ACCEPTABLE (DEFAULT)".
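A single query along those lines might look like this (a minimal sketch; 84041 is just the sample zipcode from above):

 package require http

 # giving -query to ::http::geturl makes it do a POST instead of a GET
 set query [::http::formatQuery ctystzip 84041]
 set tok   [::http::geturl http://www.usps.gov/cgi-bin/zip4/ctystzip2 \
                -query $query]
 puts [::http::data $tok]
 ::http::cleanup $tok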

Since we don't care about the handful of exceptions where a zipcode crosses a state border in the middle of nowhere, we just need to query their web server repeatedly and build up our list of 3-digit zipcode/state associations. If we start at XXX00 and increment by one, we can stop either when we've found a valid zipcode (and hence the state) or when we reach XXX99 (meaning we've found a 3-digit zipcode prefix that hasn't been assigned yet). That way, though we'll still have to make a bunch of repeated requests, we won't come anywhere near a full 100,000 hits.

Once we parse out the data we could save it to a file, write it to standard output, stuff it in a database, or use the standard Tcl library (Tcllib) to email someone about it. The possibilities are wide open.
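For example, emailing the results with Tcllib's mime and smtp packages might look something like this (a sketch assuming Tcllib is installed; the addresses and mail server are placeholders):

 package require mime
 package require smtp

 # $report would hold the parsed zipcode/state data built up above;
 # a couple of sample lines stand in for it here
 set report "999xx ==> AK\n998xx ==> AK\n"

 # wrap the text in a MIME token and hand it to the smtp package;
 # the From/To addresses and the server are placeholders
 set token [mime::initialize -canonical text/plain -string $report]
 smtp::sendmessage $token \
     -header [list Subject "zipcode/state data"] \
     -header [list From "robot@example.com"] \
     -header [list To "someone@example.com"] \
     -servers [list localhost]
 mime::finalize $token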

So now, here's the example code:

 #!/usr/local/bin/tclsh

 package require http

 # some websites (not necessarily the USPS) care what kind of browser is used.
 ::http::config -useragent "Mozilla/4.75 (X11; U; Linux 2.2.17; i586; Nav)"

 set url http://www.usps.gov/cgi-bin/zip4/ctystzip2

 # We'll work down from 999xx to 000xx since it's more gratifying to
 # get results immediately. 999 is Alaska, while 000 and 001 aren't
 # assigned.  :-)
 for {set i 999} {$i >= 0} {incr i -1} {
    for {set j 0} {$j <= 99} {incr j} {

        # use format to pad our string appropriately with leading zeros
        # to come up with a 5-digit zipcode to test
        set zipcode [format %03d $i][format %02d $j]

        # the http man page is a good place to read up on these commands
        set query [::http::formatQuery ctystzip $zipcode]
        set http  [::http::geturl $url -query $query]
        set html  [::http::data $http]

        # clean up the token's state so we don't leak memory over
        # thousands of requests
        ::http::cleanup $http

        # we use a regular expression pattern to extract the text
        # we are looking for
        if {[regexp {  ([A-Z][A-Z]) +ACCEPTABLE} $html => state]} {
            puts "[format %03d $i]xx ==> $state"
            # we found a match, so let's break out of the inner loop
            break
        } elseif {$j == 99} {
            puts "[format %03d $i]xx ==> not found"
        }
    }
 }

Running this produces output like:

 999xx ==> AK
 998xx ==> AK
 997xx ==> AK
 996xx ==> AK
 995xx ==> AK
 994xx ==> WA
 993xx ==> WA
 992xx ==> WA
 991xx ==> WA
 990xx ==> WA
 989xx ==> WA
 988xx ==> WA
 987xx ==> not found
 986xx ==> WA

etc.


See also: Finding distances by querying MapQuest