Version 16 of deadlinks.tcl - A Bookmark File Cleaner

Updated 2003-09-08 18:00:27

05 Sept 2003 Mike Tuxford: I was chatting on IRC this morning and a friend said, 'I need something to clean out all the dead links in my bookmark file', so I wrote the following script.

It works well for me, as you can see from the command-line output example below, but I advise you to back up your real bookmark file first.

 burp:/tcl/s/bookmarks# ./deadlinks.tcl bookmarks.html
 Checking 1150 sites...
 Checking site  [1150]
  1024 good sites
   126 dead sites

Caveats: Just because a site doesn't respond within 10 seconds while the script is running doesn't mean the link is permanently dead. You can reprocess your file.dead file at a later time (deadlinks.tcl file.dead) and it will produce a file.dead.good containing any links it found to be good. Those sites then need to be re-added to your bookmarks manually.


lv Would bracing the exprs below result in the application running a wee bit faster?

MT: If you mean the -timeout value in 'proc verifySite', I just changed that now; it makes much more sense. However, shaving milliseconds in a script that has to wait 10 seconds on each failed attempt isn't going to save much overall, but it is better programming in general, and I thank you for that. Or do you mean in 'proc extractSite'? If so, I suppose it would, so I did it, but frankly I don't really like that proc myself. Got a better suggestion?

Frink and procheck pointed to a number of lines (there are still at least 6) where exprs were used without their arguments braced - those are the ones I was referring to. I was about to feed deadlinks a massive bookmarks file and figured that if bracing them cut each link's processing by a quarter of a second, I'd save nearly an hour of processing time...

MT: OK, I never understood this expr performance issue before. I did some reading, and also wrote a test script that created 10,000 vars using expr, and indeed there is a huge performance gain, and I think I almost understand why: the braced expression gets byte-compiled, rather than substitution and re-parsing occurring on every evaluation. I think that's right? And I hope I've applied the rule below correctly and completely. Thanks for making me learn.
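For reference, a minimal sketch of that kind of test (not the exact script, just the idea; the numbers vary by machine). The braced body is compiled once to bytecode; the unbraced one is substituted and re-parsed by expr on every pass:

 # braced: the expression is byte-compiled once
 proc braced {n} {
   for {set i 0} {$i < $n} {incr i} {
     set x [expr {$i * 2 + 1}]
   }
 }
 # unbraced: substituted and re-parsed on every evaluation
 proc unbraced {n} {
   for {set i 0} {$i < $n} {incr i} {
     set x [expr $i * 2 + 1]
   }
 }
 puts "braced:   [time {braced 10000}]"
 puts "unbraced: [time {unbraced 10000}]"

The unbraced form typically comes out several times slower, which is where the gain MT saw comes from.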


 #!/usr/bin/tclsh
 #
 #################################
 # Mike Tuxford 2003
 #
 # Script: deadlinks.tcl
 # Syntax: deadlinks.tcl <bookmark filename>
 # Checks a netscape (mozilla) style bookmark file for dead links.
 #
 # Requires the http package.
 # Does not overwrite the bookmark file but creates 2
 # new files (name.good name.dead)
 #
 # Tests sites by retrieving just the HTTP headers, and
 # leaves local file/dir links untouched.
 #
 # Adjust the "timeout,seconds" value as desired.
 # The timeout value is how long we should try retrieving a
 # page header from a site before listing it as dead.
 #

 array set opt {
   "timeout,seconds" 10
 }
 set opt(timeout,ms) [expr {1000 * $opt(timeout,seconds)}]

 array set data {
   "c,sites" 0 "c,good" 0 "c,dead" 0
 }

 proc main {} {
   global opt fd data
   set i 0
   puts "Checking $data(c,sites) sites..."
   puts -nonewline stdout "Checking site "; flush stdout
   foreach line $data(bm,in) {
     switch -glob -- $line {
       *HREF=* {
         incr i
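         # print the counter, then backspace over it so the
         # next count overwrites it in place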
         puts -nonewline stdout \
             " \[$i\][string repeat \b [expr {[string length $i] +3}]]"
         flush stdout
         if {![extractSite $line]} {
           # it's a local dir/file bookmark
           puts $fd(good) $line
           incr data(c,good)
           continue
         }
         if {![verifySite]} {
           puts $fd(dead) $line
           incr data(c,dead)
         } else {
           puts $fd(good) $line
           incr data(c,good)
         }
       }
       default {
         puts $fd(good) $line
       }
     }
   }
   puts -nonewline "\n"
   flush stdout
   return
 }

 proc verifySite {} {
   global opt data
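   # -validate 1 does a HEAD request, fetching only the headers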
   if {[catch {http::geturl $data(site) -validate 1 \
       -timeout $opt(timeout,ms)} tok]} {
     #puts $tok
     return 0
   }
   upvar #0 $tok state
   # anything other than "ok" (timeout, eof, error) counts as dead
   if {$state(status) ne "ok"} {
     http::cleanup $tok
     return 0
   }
   http::cleanup $tok
   return 1
 }

 proc extractSite {line} {
   global data
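   # The URL starts 13 characters past "HREF": 6 for {HREF="} plus 7
   # for {http://}. That drops the scheme, but http::geturl assumes
   # http:// when none is given.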
   set start [expr {[string first "HREF" $line] + 13}]
   if {[string match -nocase *file:* [string range $line 0 $start]]} {
     return 0
   }
   set data(site) [string range $line $start \
     [expr {[string first \" $line $start] - 1}]]
   return 1
 }

 ###########################
 # get the show started here
 #
 if {[catch {package require http} loadErr]} {
   puts $loadErr
   puts "Need the http package"
   exit
 }

 if {[llength $argv] == 0 || [string match -h* [lindex $argv 0]]} {
   puts "Syntax: deadlinks.tcl <bookmark filename>"
   exit
 } else {
   set opt(file,in) [lindex $argv 0]
 }

 if {[catch {open $opt(file,in) r} fd(in)]} {
   puts $fd(in)
   exit
 } else {
   set data(bm,in) [split [read $fd(in)] \n]
   close $fd(in)
 }

 foreach type {good dead} {
   set opt(file,$type) "$opt(file,in).$type"
   if {[catch {open $opt(file,$type) w+} fd($type)]} {
     puts $fd($type)
     exit
   }
 }

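 # pre-count the links so main can report progress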
 foreach line $data(bm,in) {
   if {[string match -nocase *href=* $line]} {
     incr data(c,sites)
   }
 }

 main
 foreach type {good dead} {
   close $fd($type)
   puts [format "%5d %s" $data(c,$type) "$type sites"]
 }

TV You may want to consider a -command callback with the http::geturl command; at the least, you should be able to get a few sockets (probably a couple of dozen without getting too upsetting) trying to set up connections simultaneously. Or consider running it as a background process; depending on the OS, it shouldn't eat up much processor time while waiting.

Running n queries in parallel gives you roughly an n*100% speedup...
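A minimal sketch of what that could look like (checkSite, siteDone, and oneLess are names made up here, not part of deadlinks.tcl): with -command, http::geturl returns immediately and the replies arrive as event-loop callbacks, so several validations run at once.

 package require http

 set pending 0

 proc checkSite {url} {
   incr ::pending
   if {[catch {http::geturl $url -validate 1 -timeout 10000 \
       -command [list siteDone $url]} err]} {
     # e.g. a DNS failure can raise an error right away
     puts "dead: $url ($err)"
     oneLess
   }
 }

 proc siteDone {url tok} {
   upvar #0 $tok state
   if {$state(status) eq "ok"} {
     puts "good: $url"
   } else {
     puts "dead: $url"
   }
   http::cleanup $tok
   oneLess
 }

 proc oneLess {} {
   # release the vwait once the last request has answered
   if {[incr ::pending -1] == 0} {
     set ::done 1
   }
 }

 foreach url {http://www.tcl.tk/ http://example.com/} {
   checkSite $url
 }
 vwait ::done

To stay within 'a couple of dozen' open sockets you would launch only that many up front and have siteDone start the next URL from the queue.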


lv Bug report. After running for a couple of hours, and getting about 60% done, I see this error message:

 Checking 13020 sites...
 Checking site list element in braces followed by "</A>" instead of space
    while executing
 "lrange $line 0 $sidx"
    (procedure "extractSite" line 4)
    invoked from within
 "extractSite $line"
    (procedure "main" line 12)
    invoked from within
 "main"
    (file "/tmp/dl" line 130)

(where /tmp/dl is the deadlinks program)

I wonder if the cause of this might be an HREF which has {} in it - in my bookmarks file, there is at least one entry where the href is some javascript...

MT: Egads! That lrange was never even supposed to be there; I meant to use string range. However, I don't know if that's a fix for the problem. I sent you email; can you send me an example of the suspect line that includes a java/js link?

My bookmark file had 9 js links and did not fail, but when I added a set of braces like you showed, the lrange line failed the same way as in your bug report, while the originally intended string range worked fine. Very sorry about that. 13,000+ bookmarks? Sheesh, and I'm taking pokes about having 1150.
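For the record, the failure is easy to reproduce. lrange has to parse the whole line as a Tcl list first, and a brace-quoted "element" that runs straight into the next character kills that parse; string range just counts characters and never cares:

 set line {<DT><A HREF="javascript:x()">some {braces}</A>}

 # string range never parses the line, so it is safe
 puts [string range $line 0 12]

 # lrange must convert the line to a list first, and that fails:
 if {[catch {lrange $line 0 3} err]} {
   puts $err   ;# list element in braces followed by "</A>" instead of space
 }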


Category Application | Category Internet