05 Sept 2003 [Mike Tuxford]: I was chatting on IRC this morning and a friend said "I need something to clean out all the dead links in my bookmark file", so I wrote the following script. It works well for me, as you can see from the command-line output below, but I advise you to back up your real bookmark file first.

 burp:/tcl/s/bookmarks# ./deadlinks.tcl bookmarks.html
 Checking 1150 sites...
 Testing site [1150]
  1024 Good sites
   126 Dead sites

Caveats: Just because a site doesn't respond within 10 seconds while the script is running doesn't mean the link is permanently dead. You can reprocess your file.dead file at a later time (deadlinks.tcl file.dead) and it will produce a file.dead.good containing any good links it found. Those sites then need to be re-added manually to your bookmarks.

----

[lv] Would bracing the exprs below result in the application running a wee bit faster?

MT: If you meant the -timeout value in 'proc verifySite', I just changed that now. That makes much more sense. However, shaving milliseconds in a script that has to wait 10 seconds on each failed attempt isn't going to save much overall, but it is better programming in general and I thank you for that. Or do you mean in 'proc extractSite'? If so, I suppose it would, so I did it, but frankly I don't really like that proc myself. Got a better suggestion? (one possibility is sketched below)

[Frink] and [procheck] pointed to a number of lines (there are still at least 6) where there were [expr]s without their arguments braced; those are the ones to which I was referring. I was just facing feeding deadlinks a massive bookmarks file and figured that if bracing them would reduce each link's processing by a quarter of a second, I'd still be saving nearly an hour of processing time...

MT: OK, I never did understand this expr performance issue before. I did some reading, wrote a test script that created 10,000 vars using expr, and indeed there is a huge performance gain from bracing, and I think I almost understand why: the braced expression gets byte-compiled, rather than substitution and re-parsing happening on every evaluation. I think that's right? And I hope my applying of the rule below is correct and complete. Thanks for making me learn.
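The effect is easy to see with the [time] command. This little comparison is only an illustration (it is not part of deadlinks.tcl, and the iteration count and expression are arbitrary):

 set i 7
 # unbraced: the expression is substituted and re-parsed on each of the
 # 10000 evaluations
 puts [time {set x [expr $i * 3 + 1]} 10000]
 # braced: the expression is byte-compiled once and then reused
 puts [time {set x [expr {$i * 3 + 1}]} 10000]

On any machine the braced form should come out noticeably faster per iteration.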
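As for a better extractSite: one possibility, offered only as a sketch and not wired into the script below (note that it keeps the full URL as written in the bookmark file), is to let regexp pull the URL out of the HREF attribute:

 proc extractSite {line} {
     global data
     # grab whatever sits between HREF=" and the next double quote
     if {![regexp -nocase {href="([^"]+)"} $line -> url]} {
         return 0
     }
     # leave local file: bookmarks alone, as the original proc does
     if {[string match -nocase file:* $url]} {
         return 0
     }
     set data(site) $url
     return 1
 }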
----

 #!/usr/bin/tclsh
 #
 #################################
 # Mike Tuxford 2003
 #
 # Script: deadlinks.tcl
 # Syntax: deadlinks.tcl <bookmark file>
 # Checks a netscape (mozilla) style bookmark file for dead links.
 #
 # Requires the http package.
 # Does not overwrite the bookmark file but creates 2
 # new files (name.good name.dead)
 #
 # Tests sites by just retrieving the html headers and will
 # also leave local file/dir links untouched
 #
 # Adjust the "timeout,seconds" value as desired.
 # The timeout value is how long we should try retrieving a
 # page header from a site before listing it as dead.
 #
 array set opt {
     "timeout,seconds" 10
 }
 set opt(timeout,ms) [expr {1000 * $opt(timeout,seconds)}]
 array set data {
     "c,sites" 0 "c,good" 0 "c,dead" 0
 }

 proc main {} {
     global opt fd data
     set i 0
     puts "Checking $data(c,sites) sites..."
     puts -nonewline stdout "Checking site "; flush stdout
     foreach line $data(bm,in) {
         switch -glob -- $line {
             *HREF=* {
                 incr i
                 puts -nonewline stdout \
                     " \[$i\][string repeat \b [expr {[string length $i] +3}]]"
                 flush stdout
                 if {![extractSite $line]} {
                     # it's a local dir/file bookmark
                     puts $fd(good) $line
                     incr data(c,good)
                     continue
                 }
                 if {![verifySite]} {
                     puts $fd(dead) $line
                     incr data(c,dead)
                 } else {
                     puts $fd(good) $line
                     incr data(c,good)
                 }
             }
             default {
                 puts $fd(good) $line
             }
         }
     }
     puts -nonewline "\n"
     flush stdout
     return
 }

 proc verifySite {} {
     global opt data
     if {[catch {http::geturl $data(site) -validate 1 \
             -timeout $opt(timeout,ms)} tok]} {
         #puts $tok
         return 0
     }
     upvar #0 $tok state
     if {$state(status) == "timeout"} {
         http::cleanup $tok
         return 0
     }
     http::cleanup $tok
     return 1
 }

 proc extractSite {line} {
     global data
     if {[string match -nocase *file:* [string range $line 0 \
             [expr {[string first "HREF" $line]+13}]]]} {
         return 0
     }
     set data(site) [string range $line \
         [expr {[string first "HREF" $line]+13}] \
         [expr {[string first \" $line \
             [expr {[string first "HREF" $line]+13}]]-1}]]
     return 1
 }

 ###########################
 # get the show started here
 #
 if {[catch {package require http} loadErr]} {
     puts $loadErr
     puts "Need the http package"
     exit
 }
 if {[llength $argv] == 0 || [string match -h* [lindex $argv 0]]} {
     puts "Syntax: deadlinks.tcl <bookmark file>"
     exit
 } else {
     set opt(file,in) [lindex $argv 0]
 }
 if {[catch {open $opt(file,in) r} fd(in)]} {
     puts $fd(in)
     exit
 } else {
     set data(bm,in) [split [read $fd(in)] \n]
     close $fd(in)
 }
 foreach type {good dead} {
     set opt(file,$type) "$opt(file,in).$type"
     if {[catch {open $opt(file,$type) w+} fd($type)]} {
         puts $fd($type)
         exit
     }
 }
 foreach line $data(bm,in) {
     if {[string match -nocase *href=* $line]} {
         incr data(c,sites)
     }
 }
 main
 foreach type {good dead} {
     close $fd($type)
     puts [format "%5d %s" $data(c,$type) "$type sites"]
 }

----

[TV] You may want to consider a -command callback with the http::geturl command; at the least you should be able to get a few sockets (probably a couple of dozen without upsetting anything too much) trying to set up connections simultaneously. Or consider making it a background process; depending on the [OS], it shouldn't eat up much processor time while waiting. Multiple queries give you an n*100% speedup... (a rough sketch of this approach is at the bottom of the page)

----

[lv] Bug report. After running a couple of hours, and getting about 60% done, I see this error message:

 Checking 13020 sites...
 Checking site list element in braces followed by "" instead of space
     while executing
 "lrange $line 0 $sidx"
     (procedure "extractSite" line 4)
     invoked from within
 "extractSite $line"
     (procedure "main" line 12)
     invoked from within
 "main"
     (file "/tmp/dl" line 130)

(where /tmp/dl is the deadlinks program)

I wonder if the cause of this would be an HREF which has {} in it; in my bookmarks file, there is at least one entry where the href is some javascript...

MT: Egads! That lrange was never even supposed to be there; I meant to use string range. However, I don't know if that's a fix for the problem. I sent you email; can you send me an example of the suspect line that includes a java/js link? My bookmark file had 9 js links and did not fail, but when I added a set of braces like you showed, the lrange line failed the same way as in your bug report but worked fine with the originally intended string range. Very sorry about that. 13,000+ bookmarks? Sheesh, and I am taking pokes about having 1150.

----

[Category Application] | [Category Internet]
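----

Following up on [TV]'s suggestion above, here is a rough, untested sketch of how the -command option of http::geturl could keep a couple of dozen validations in flight at once. The proc and variable names (pending, inFlight, checkDone, allDone) are invented for illustration, it assumes $urls holds a list of complete http:// URLs, and it is not wired into deadlinks.tcl:

 package require http

 set maxOpen 20            ;# how many requests to keep open at once
 set inFlight 0
 array set results {}

 proc launch {} {
     global maxOpen inFlight pending results
     while {$inFlight < $maxOpen && [llength $pending] > 0} {
         set url [lindex $pending 0]
         set pending [lrange $pending 1 end]
         # -validate 1 fetches only the headers; checkDone fires when the
         # transaction completes, errors out, or hits the 10-second timeout
         if {[catch {http::geturl $url -validate 1 -timeout 10000 \
                 -command [list checkDone $url]}]} {
             set results($url) dead    ;# geturl rejected the URL outright
             continue
         }
         incr inFlight
     }
     if {$inFlight == 0 && [llength $pending] == 0} {
         set ::allDone 1
     }
 }

 proc checkDone {url tok} {
     global inFlight results
     incr inFlight -1
     # as in verifySite, any completed transaction counts as good;
     # timeout, error and eof count as dead
     if {[http::status $tok] eq "ok"} {
         set results($url) good
     } else {
         set results($url) dead
     }
     http::cleanup $tok
     launch
 }

 set pending $urls         ;# assume $urls is the list of bookmark URLs
 launch
 if {![info exists ::allDone]} {vwait ::allDone}
 parray results

Each callback frees a slot and launches the next waiting URL, so up to maxOpen checks overlap instead of the script sleeping through every timeout one at a time.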