[[clt postings from jooky and [David Welton] and [Larry Virden].]]

Web scraping is the practice of extracting information from a web page and reformatting it. Reasons for doing so include sending updates to a pager or WAP phone, e-mailing one's account status to oneself, or creating data files for reading on a PDA. See projects such as Sitescooper (http://Sitescooper.org/) or Plucker (http://www.plkr.org/) for non-[Tcl] tools for scraping the web; http://www.Sitescrape.com offers a commercial scraping service.

"Web Scraping ..." [http://cedar.intel.com/cgi-bin/ids.dll/content/content.jsp?cntKey=Generic+Editorial%3a%3aws_scraping&cntType=IDS_EDITORIAL&cat=CJA]

"Web scraping is easy" [http://www.unixreview.com/documents/s=7822/ur0302h/]

----

   * [An HTTP Robot in Tcl]
   * [websearch]
   * [Getting stock quotes over the internet]
   * Also see '''[tcllib]/examples/oscon''', which uses the [htmlparse] module (among others) to parse the schedule pages for OSCON 2001 and convert them into [CSV] files usable by [Excel] and other applications.
   * See also tDOM's HTML parser plus XPath expressions in the web scraping example for eBay presented at the [First European Tcl/Tk Users Meeting] (http://sdf.lonestar.org/~loewerj/tdom2001.ppt and http://www.tu-harburg.de/skf/tcltk/tclum2001.pdf.gz).
   * [TkChat] is one Tk/Tcl web scraper for this web site's chat room!
   * [TcLeo] allows querying the English <=> German web dictionary at http://dict.leo.org from the command line.
   * [Getleft]
   * [wiki-reaper] (and [wish-reaper])
   * [A little rain forecaster]
   * [pop3line] fetches e-mails from the web-mailer of T-Online.
   * [tclwebtest]

----

Apt comments on the technical and business difficulty of web scraping, along with mentions of WebL and NQL, appear here [http://lambda.weblogs.com/discuss/msgReader$3567]. Some people go in a record-and-playback direction with such tools as [AutoIt]. [Perl] probably has the most current web-scraping activity (even more than tclwebtest?), especially with WWW::Mechanize [http://www.perl.com/pub/a/2003/01/22/mechanize.html], although Curl [http://www.unixreview.com/documents/s=1820/uni1011713175619/0201i.htm] also has its place. iOpus [http://iopus.com/] apparently has a commercial record-and-playback tool, "iOpus Internet Macros", that is scriptable (?).

----

[[It seems many of us do our own "home-grown" solutions for these needs.]]

----

An alternative to web scraping is to work with the web host on a [Web Service] that provides the useful information programmatically. Some web hosts also provide [XML] versions of their pages, or versions specially formatted for AvantGo or the plucker command, to help people who need a particular format for small devices and the like.

It would be great if someone making legitimate use of such sources would share the code they use for this sort of thing; a minimal sketch appears at the end of this page.

----

[Category Internet]
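----

Here is one small illustration of the kind of home-grown approach mentioned above: a hedged sketch that fetches a page with the [http] package and pulls data out of it with [tDOM]'s forgiving HTML parser and an XPath expression. The URL and the XPath below are placeholders, not a real service; adapt them to whatever page you are actually allowed to scrape.

 package require http
 package require tdom

 # Fetch a page over plain HTTP.  The URL is a placeholder -- substitute
 # the page you actually want to scrape (and respect its terms of use).
 set url http://www.example.com/prices.html
 set tok [http::geturl $url -timeout 10000]
 if {[http::status $tok] ne "ok"} {
     error "fetch failed: [http::status $tok]"
 }
 set html [http::data $tok]
 http::cleanup $tok

 # tDOM's -html mode tolerates the tag soup found on real-world pages.
 set doc  [dom parse -html $html]
 set root [$doc documentElement]

 # The XPath expression is only an example; adjust it to the structure
 # of the page being scraped.
 foreach row [$root selectNodes {//table[@class='prices']//tr}] {
     set cells {}
     foreach cell [$row selectNodes {td}] {
         lappend cells [string trim [$cell asText]]
     }
     puts [join $cells \t]
 }
 $doc delete

For pages served over https, the [tls] package can be registered with http::register before calling http::geturl. The [htmlparse] module from [tcllib] is an alternative when tDOM is not available, at the cost of doing more of the tree-walking by hand.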