Version 17 of deadlinks.tcl - A Bookmark File Cleaner

Updated 2003-09-08 19:12:13

05 Sept 2003 Mike Tuxford: I was chatting on IRC this morning when a friend said, 'I need something to clean out all the dead links in my bookmark file,' so I wrote the following script.

It works well for me, as you can see from the command-line output example below, but I advise you to back up your real bookmark file first.

 burp:/tcl/s/bookmarks# ./deadlinks.tcl bookmarks.html
 Checking 1150 sites...
 Checking site  [1150]
  1024 Good sites
   126 Dead sites

Caveats: Just because a site doesn't respond within 10 seconds while the script is running doesn't mean it is permanently a dead link. You can reprocess your file.dead file again later (deadlinks.tcl file.dead) and it will produce a file.dead.good containing any good links it found. Those sites then need to be re-added to your bookmarks manually.


lv Would bracing the exprs below result in the application running a wee bit faster?

MT: If you meant the -timeout value in 'proc verifySite', I just changed that now; it makes much more sense. However, shaving milliseconds in a script that has to wait 10 seconds on each failed attempt isn't going to save much overall, but it is better programming in general, and I thank you for that. Or did you mean in 'proc extractSite'? If so, I suppose it would, so I did that too, but frankly I don't really like that proc myself. Got a better suggestion?

Frink and procheck pointed to a number of lines (there are still at least 6) where exprs were used without their arguments braced; those are the ones to which I was referring. I was about to feed deadlinks a massive bookmarks file and figured that if bracing them reduced each link's processing time by a quarter of a second, I'd save nearly an hour of processing time...

MT: OK, I never understood this expr performance issue before. I did some reading and wrote a test script that created 10,000 vars using expr, and indeed there is a huge performance gain from bracing, and I think I almost understand why: the braced expression is byte-compiled once, rather than being substituted and re-parsed on every evaluation. I think that's right? I hope my application of the rule below is correct and complete. Thanks for making me learn.
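
As a minimal illustration of the effect (not part of the script; the values and iteration count here are made up), [time] shows the difference directly:

 # Compare unbraced vs. braced expr with [time]; values are illustrative.
 set a 3; set b 4
 puts "unbraced: [time {expr $a * $b + 1} 10000]"
 puts "braced:   [time {expr {$a * $b + 1}} 10000]"
 # The braced form typically runs several times faster: it is
 # byte-compiled once, while the unbraced form is substituted and
 # re-parsed on every evaluation. Bracing also avoids accidental
 # double substitution of the operands.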


 #!/usr/bin/tclsh
 #
 #################################
 # Mike Tuxford 2003
 #
 # Script: deadlinks.tcl
 # Syntax: deadlinks.tcl <bookmark filename>
 # Checks a netscape (mozilla) style bookmark file for dead links.
 #
 # Requires the http packages.
 # Does not overwrite the bookmark file but creates 2
 # new files (name.good name.dead)
 #
 # Tests sites by just retrieving the html headers and will
 # also leave local file/dir links untouched
 #
 # Adjust the "timeout,seconds" value as desired.
 # The timeout value is how long we try retrieving a
 # page header from a site before listing it as dead.
 #

 array set opt {
   "timeout,seconds" 10
 }
 set opt(timeout,ms) [expr {1000 * $opt(timeout,seconds)}]

 array set data {
   "c,sites" 0 "c,good" 0 "c,dead" 0
 }

 proc main {} {
   global opt fd data
   set i 0
   puts "Checking $data(c,sites) sites..."
   puts -nonewline stdout "Checking site "; flush stdout
   foreach line $data(bm,in) {
     switch -glob -- $line {
       *HREF=* {
         incr i
         puts -nonewline stdout \
             " \[$i\][string repeat \b [expr {[string length $i] +3}]]"
         flush stdout
         if {![extractSite $line]} {
           # it's a local dir/file bookmark
           puts $fd(good) $line
           incr data(c,good)
           continue
         }
         if {![verifySite]} {
           puts $fd(dead) $line
           incr data(c,dead)
         } else {
           puts $fd(good) $line
           incr data(c,good)
         }
       }
       default {
         puts $fd(good) $line
       }
     }
   }
   puts ""
   flush stdout
   return
 }

 proc verifySite {} {
   global opt data
   if {[catch {http::geturl $data(site) -validate 1 \
       -timeout $opt(timeout,ms)} tok]} {
     #puts $tok
     return 0
   }
   upvar #0 $tok state
   if {$state(status) eq "timeout"} {
     http::cleanup $tok
     return 0
   }
   http::cleanup $tok
   return 1
 }

 proc extractSite {line} {
   global data
   # A "file:" scheme within the first few chars after HREF marks a
   # local file/dir bookmark, which is left untouched.
   if {[string match -nocase *file:* [string range $line 0 \
       [expr {[string first "HREF" $line]+13}]]]} {
     return 0
   }
   # The URL text starts 13 chars past "HREF" (i.e., just after
   # HREF="http://) and runs up to the closing quote.
   set data(site) [string range $line \
     [expr {[string first "HREF" $line]+13}] \
     [expr {[string first \" $line \
     [expr {[string first "HREF" $line]+13}]]-1}]]
   return 1
 }

 ###########################
 # get the show started here
 #
 if {[catch {package require http} loadErr]} {
   puts $loadErr
   puts "Need the http package"
   exit
 }

 if {[llength $argv] == 0 || [string match -h* [lindex $argv 0]]} {
   puts "Syntax: deadlinks.tcl <bookmark filename>"
   exit
 } else {
   set opt(file,in) [lindex $argv 0]
 }

 if {[catch {open $opt(file,in) r} fd(in)]} {
   puts $fd(in)
   exit
 } else {
   set data(bm,in) [split [read $fd(in)] \n]
   close $fd(in)
 }

 foreach type {good dead} {
   set opt(file,$type) "$opt(file,in).$type"
   if {[catch {open $opt(file,$type) w+} fd($type)]} {
     puts $fd($type)
     exit
   }
 }

 foreach line $data(bm,in) {
   if {[string match *HREF=* $line]} {
     incr data(c,sites)
   }
 }

 main
 foreach type {good dead} {
   close $fd($type)
   puts [format "%5d %s" $data(c,$type) "$type sites"]
 }

TV You may want to consider a -command callback with the http::geturl command; you should be able to get a few sockets (probably a couple of dozen without getting too upsetting) trying to set up connections simultaneously. Or consider running it as a background process; depending on the OS, it shouldn't eat up much processor time while waiting.

Multiple queries give you an n*100% speedup... MT: I am looking at this because it looks correct, but I'm struggling with how to implement it. The problem is that if the script continues on while an HREF is still being waited for (a slow validation, or a retry, which I assume you'd want to do), that link would end up written to the wrong position within the bookmarks.good file. It may have originally been in a folder named "Tcl" but end up in a folder named "Perl".
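
One way around the ordering problem, sketched here under assumptions not in the original script (the names maxOpen, pending, results, and siteDone are all made up): tag each request with its line index via the -command callback, collect the answers in an array keyed by that index, and only write the output files once everything has reported in:

 package require http

 set maxOpen 20   ;# illustrative cap on simultaneous connections
 set open 0
 set pending 0

 # Invoked by http::geturl when request $i completes or times out.
 proc siteDone {i tok} {
     global results open pending
     upvar #0 $tok state
     set results($i) [expr {$state(status) eq "ok"}]
     http::cleanup $tok
     incr open -1
     if {[incr pending -1] == 0} { set ::done 1 }
 }

 # Launch one -validate request per site, tagging each with its
 # index $i so results can be written out in original file order.
 proc checkAll {sites} {
     global maxOpen open pending
     set i 0
     foreach site $sites {
         while {$open >= $maxOpen} { vwait ::open }
         incr open
         incr pending
         if {[catch {http::geturl $site -validate 1 -timeout 10000 \
                 -command [list siteDone $i]} err]} {
             set ::results($i) 0
             incr open -1
             incr pending -1
         }
         incr i
     }
     # Wait for the stragglers; afterwards results(0..end) hold
     # 1 (good) or 0 (dead) in original order, so the good/dead
     # files can be written sequentially and no link jumps folders.
     if {$pending > 0} { vwait ::done }
 }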


lv Bug report. After running for a couple of hours and getting about 60% done, I see this error message:

 Checking 13020 sites...
 Checking site list element in braces followed by "</A>" instead of space
    while executing
 "lrange $line 0 $sidx"
    (procedure "extractSite" line 4)
    invoked from within
 "extractSite $line"
    (procedure "main" line 12)
    invoked from within
 "main"
    (file "/tmp/dl" line 130)

(where /tmp/dl is the deadlinks program)

I wonder if the cause of this is an HREF which has {} in it; in my bookmarks file, there is at least one entry where the href is some javascript...

MT: Egads! That lrange was never even supposed to be there; I meant to use string range. However, I don't know if that's a fix for the problem. I sent you email; can you send me an example of the suspect line that includes a java/js link?

My bookmark file had 9 js links and did not fail, but when I added a set of braces like you showed, the lrange line failed the same way as in your bug report, while the originally intended string range worked fine. Very sorry about that. 13,000+ bookmarks? Sheesh, and I'm taking pokes about having 1150.
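
For the curious, the failure is easy to reproduce (the line below is made up, not from lv's file, just shaped like the case described). lrange must parse its argument as a Tcl list, so a brace-quoted word followed directly by </A> breaks list parsing, while string range only indexes characters:

 # Illustrative line with a brace-delimited word right before </A>.
 set line "<DT><A HREF=\"javascript:alert('x')\"> {demo}</A>"
 if {[catch {lrange $line 0 4} err]} {
     puts "lrange error: $err"
 }
 # => lrange error: list element in braces followed by "</A>" instead of space
 puts [string range $line 0 5]   ;# => <DT><A  (plain text, no list parsing)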


Category Application | Category Internet