Version 95 of Web scraping

Updated 2007-05-29 12:47:58 by LV

[clt postings from jooky and David Welton and Larry Virden.]

Web scraping is the practice of getting information from a web page and reformatting it.
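
As a minimal sketch of the idea (in Python rather than Tcl, and using only the standard library), the following pulls the cells out of an HTML table and reformats them as plain "label: value" lines. The page fragment, the class name CellCollector and the function name scrape are all hypothetical, made up for illustration; a real scraper would fetch the page over HTTP and would have to cope with much messier markup.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a fetched web page.
SAMPLE_PAGE = """
<html><body>
<h1>Account status</h1>
<table>
<tr><td>Balance</td><td>42.00</td></tr>
<tr><td>Due date</td><td>2007-06-15</td></tr>
</table>
</body></html>
"""


class CellCollector(HTMLParser):
    """Collect the text of each <td>, grouped by the enclosing <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row being read, or None
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a table cell is kept; everything else is ignored.
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def scrape(page):
    """Return the two-column table rows of page as 'label: value' strings."""
    parser = CellCollector()
    parser.feed(page)
    parser.close()
    return ["{}: {}".format(*row) for row in parser.rows if len(row) == 2]


if __name__ == "__main__":
    print("\n".join(scrape(SAMPLE_PAGE)))
```

The resulting lines could then be mailed, stored in a database, or sent to a small device. The fragile part is the assumption about the page layout: if the site changes its markup, the scraper silently stops matching.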

Some reasons one may do this are to send updates to a pager or WAP phone, to email the status of a personal account to oneself, or to move data from a website into a local database. See projects like http://sitescooper.org/ , http://www.plkr.org/ , or the Python-based mechanize [L1 ] (itself a descendant of Andy Lester's WWW::Mechanize for Perl) for non-Tcl tools for scraping the web.

"Web Scraping ..." [L2 ]

"Web scraping is easy" [L3 ]



Apt comments on the technical and business difficulty of Web scraping, along with mentions of WebL and NQL, appear here [L6 ]. Some people go in a record-and-playback direction with such tools as AutoIt. Perl probably has the most current Web-scraping activity (even more than tclwebtest?), especially with WWW::Mechanize [L7 ], although Curl [L8 ] also has its place. As of 2004, WWWGrab [L9 ] looked interesting for those in a position to work under Windows.


[It seems many of us do our own "home-grown" solutions for these needs.]


An alternative to web scraping is to work with the web host on the details of a Web Service that would provide the useful information programmatically.

Also, some web hosts provide XML versions (RSS), or specially formatted versions for use with AvantGo or the plucker command, with the goal of aiding people who need a specialized format for small devices.
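
When a site does offer RSS, reading it is far less brittle than scraping HTML, because the structure is fixed. A minimal sketch, again in Python with only the standard library: the feed text, its example.com URLs and the function name feed_items are hypothetical, and the code assumes a plain RSS 2.0 feed without XML namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed standing in for one downloaded from a site.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example news</title>
    <link>http://example.com/</link>
    <description>Hypothetical feed</description>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>
"""


def feed_items(xml_text):
    """Return a list of (title, link) pairs, one per <item> in the feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        # findtext returns the default when the child element is missing.
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        items.append((title, link))
    return items


if __name__ == "__main__":
    for title, link in feed_items(SAMPLE_FEED):
        print("{} <{}>".format(title, link))
```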

It would be great if someone making legitimate use of some of these sources would share some of their code to do this sort of thing.

RS: A little RSS reaper loads an RSS page, renders it to HTML, plus it compacts the referenced pages into the same document with local links, while trying to avoid ads and noise. Not perfect, but I use it daily to reap news sites onto my iPaq :)


See also screenscrape, download files via http, parallel geturl.


LV: This isn't technically web scraping, but I'm uncertain where else to reference it: Amazon.de PreOrder makes use of a web site's CGI functionality from a Tk application, from what I can tell. Or even googling with SOAP.


[Mention Beautiful Soup here, or perhaps in the vicinity of htmlparse.]


Category Internet