05 Sept 2003 Mike Tuxford: I was chatting on IRC this morning and a friend said, 'I need something to clean out all the dead links in my bookmark file', so I wrote the following script.
It works well for me, as you can see from the example command-line output below, but I advise backing up your real bookmark file first.
burp:/tcl/s/bookmarks# ./deadlinks.tcl bookmarks.html
Checking 1150 sites...
Testing site [1150]
 1024 Good sites
  126 Dead sites
Caveats: a site that doesn't respond within 10 seconds while the script is running isn't necessarily a permanently dead link. You can reprocess your file.dead file again later (deadlinks.tcl file.dead) and it will produce a file.dead.good containing any good links it found. Those sites then need to be re-added manually to your bookmarks.
lv Would bracing the exprs below result in the application running a wee bit faster?
MT: If you meant the -timeout value in 'proc verifySite', I just changed that; it makes much more sense. Shaving milliseconds in a script that has to wait 10 seconds on each failed attempt isn't going to save much overall, but it's better programming in general, and I thank you for that. Or did you mean in 'proc extractSite'? If so, I suppose it would, so I did it, but frankly I don't really like that proc myself. Got a better suggestion?
Frink and procheck pointed to a number of lines (there are still at least 6) with exprs whose arguments weren't braced - those are the ones I was referring to. I was about to feed deadlinks a massive bookmarks file and figured that if bracing them reduced each link's processing by a quarter of a second, I'd be saving nearly an hour of processing time...
MT: OK, I never understood this expr performance issue before. I did some reading, and wrote a test script that created 10,000 vars using expr, and indeed there is a huge performance gain from bracing. I think I almost understand why: the braced expression gets byte-compiled, rather than substitution occurring on every evaluation. I think that's right? And I hope my applying the rule below is correct and complete. Thanks for making me learn.
lv The way I think it is explained is that expr does a level of substitution of its own as it processes its arguments. Thus, if you brace its arguments, in most cases you save the overhead of doing the substitution twice. There are, of course, designs that depend on two rounds of substitution - but I don't believe your code is one of those.
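A minimal illustration of the difference being discussed (the variable name and iteration count here are arbitrary), using the time command. The braced body is compiled to bytecode once; the unbraced one is substituted by the parser and then re-parsed by expr on every iteration:

set x 5
# unbraced: Tcl substitutes $x first, then expr re-parses the string each time
puts "unbraced: [time {expr $x * $x + 1} 100000]"
# braced: compiled once; also immune to an unwanted second round of substitution
puts "braced:   [time {expr {$x * $x + 1}} 100000]"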
#!/usr/bin/tclsh
#
#################################
# Mike Tuxford 2003
#
# Script: deadlinks.tcl
# Syntax: deadlinks.tcl <bookmark filename>
# Checks a netscape (mozilla) style bookmark file for dead links.
#
# Requires the http package.
# Does not overwrite the bookmark file but creates 2
# new files (name.good name.dead)
#
# Tests sites by just retrieving the html headers and will
# also leave local file/dir links untouched
#
# Adjust the "timeout,seconds" value as desired.
# The timeout value is how long we should try retrieving a
# page header from a site before listing it as dead.
#
array set opt {
    "timeout,seconds" 10
}
set opt(timeout,ms) [expr {1000 * $opt(timeout,seconds)}]
array set data {
    "c,sites" 0 "c,good" 0 "c,dead" 0
}

proc main {} {
    global opt fd data
    set i 0
    puts "Checking $data(c,sites) sites..."
    puts -nonewline stdout "Checking site "; flush stdout
    foreach line $data(bm,in) {
        switch -glob -- $line {
            *HREF=* {
                incr i
                # overwrite the previous counter in place with backspaces
                puts -nonewline stdout \
                    " \[$i\][string repeat \b [expr {[string length $i] + 3}]]"
                flush stdout
                if {![extractSite $line]} {
                    # it's a local dir/file bookmark
                    puts $fd(good) $line
                    incr data(c,good)
                    continue
                }
                if {![verifySite]} {
                    puts $fd(dead) $line
                    incr data(c,dead)
                } else {
                    puts $fd(good) $line
                    incr data(c,good)
                }
            }
            default {
                puts $fd(good) $line
            }
        }
    }
    puts -nonewline "\n"
    flush stdout
    return
}

# Validate a site by fetching only the headers; returns 1 if it
# answered in time, 0 if the request failed or timed out.
proc verifySite {} {
    global opt data
    if {[catch {http::geturl $data(site) -validate 1 \
            -timeout $opt(timeout,ms)} tok]} {
        #puts $tok
        return 0
    }
    upvar #0 $tok state
    if {$state(status) eq "timeout"} {
        http::cleanup $tok
        return 0
    }
    http::cleanup $tok
    return 1
}

# Pull the URL out of a bookmark line into data(site); returns 0
# for file:/javascript links, which pass through untouched.
proc extractSite {line} {
    global data
    if {[string match -nocase *file:* [string range $line 0 \
            [expr {[string first "HREF" $line] + 13}]]] || \
        [string match -nocase *javascript* [string range $line 0 \
            [expr {[string first "HREF" $line] + 19}]]]} {
        return 0
    }
    set data(site) [string range $line \
        [expr {[string first "HREF" $line] + 13}] \
        [expr {[string first \" $line \
            [expr {[string first "HREF" $line] + 13}]] - 1}]]
    return 1
}

###########################
# get the show started here
#
if {[catch {package require http} loadErr]} {
    puts $loadErr
    puts "Need the http package"
    exit
}
if {[llength $argv] == 0 || [string match -h* [lindex $argv 0]]} {
    puts "Syntax: deadlinks.tcl <bookmark filename>"
    exit
} else {
    set opt(file,in) [lindex $argv 0]
}
if {[catch {open $opt(file,in) r} fd(in)]} {
    puts $fd(in)
    exit
} else {
    set data(bm,in) [split [read $fd(in)] \n]
    close $fd(in)
}
foreach type {good dead} {
    set opt(file,$type) "$opt(file,in).$type"
    if {[catch {open $opt(file,$type) w+} fd($type)]} {
        puts $fd($type)
        exit
    }
}
foreach line $data(bm,in) {
    if {[string match -nocase *href=* $line]} {
        incr data(c,sites)
    }
}
main
foreach type {good dead} {
    close $fd($type)
    puts [format "%5d %s" $data(c,$type) "$type sites"]
}
TV You may want to consider a -command callback with the http::geturl command; at the least you should be able to get a few sockets (probably a couple of dozen without getting too upsetting) trying to set up connections simultaneously. Or consider making it a background process; depending on the OS, it shouldn't eat up much processor time while waiting.
Multiple parallel queries give you roughly an n*100% speedup...
MT: I am looking at this because it looks correct, but I'm struggling with how to implement it. The problem is that if the script continues on, the HREF being waited for would end up written to the wrong position within the bookmarks.good file if it's just a slow validation or a retry (which I assume you'd want to do). It may have originally been in a folder named "Tcl" but end up in a folder named "Perl".
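One way around the ordering problem - a minimal sketch, not part of MT's script; the proc names, the flat url list, and the fixed 10-second timeout are all assumptions - is to tag each -command callback with the line's original index, collect the verdicts in an array, and only write the .good/.dead files once everything has finished, walking the indices in order:

package require http

array set verdict {}
set pending 0

# Callback: record pass/fail under the request's original index.
proc checked {idx tok} {
    upvar #0 $tok state
    set ::verdict($idx) [expr {$state(status) eq "ok"}]
    http::cleanup $tok
    if {[incr ::pending -1] == 0} { set ::done 1 }
}

# Fire off all requests, then wait; output order is restored later
# by walking verdict() indices, not by completion time.
proc checkAll {urls} {
    set idx 0
    foreach url $urls {
        incr ::pending
        if {[catch {
            http::geturl $url -validate 1 -timeout 10000 \
                -command [list checked $idx]
        }]} {
            # geturl can fail synchronously (e.g. bad host)
            set ::verdict($idx) 0
            incr ::pending -1
        }
        incr idx
    }
    if {$::pending > 0} { vwait ::done }
}

A real version would also cap the number of simultaneous requests (TV's "couple of dozen") rather than opening a socket per bookmark all at once.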
lv Bug report. After running for a couple of hours and getting about 60% done, I see this error message:
Checking 13020 sites...
Checking site list element in braces followed by "</A>" instead of space
    while executing
"lrange $line 0 $sidx"
    (procedure "extractSite" line 4)
    invoked from within
"extractSite $line"
    (procedure "main" line 12)
    invoked from within
"main"
    (file "/tmp/dl" line 130)
(where /tmp/dl is the deadlinks program)
I wonder if the cause of this would be an HREF which has {} in it - in my bookmarks file, there is at least one entry where the href is some javascript...
MT: Egads! That lrange was never even supposed to be there; I meant to use string range. However, I don't know if that fixes the problem. I sent you an email; can you send me an example of the suspect line that includes a java/js link?
My bookmark file had 9 js links and did not fail, but when I added a set of braces like you showed, the lrange line failed the same as in your bug report, while the originally intended string range worked fine. Very sorry about that. 13,000+ bookmarks? Sheesh, and I am taking pokes about having 1150.
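For the record, a tiny demonstration of the failure mode (the sample line is made up): lrange parses its input as a Tcl list, and a brace-wrapped fragment followed directly by </A> is not a valid list element, while string range treats the line as plain text and never fails:

set line {<A HREF="http://x/"> {bar}</A>}
# lrange chokes: the element {bar}</A> has text after the closing brace
catch {lrange $line 0 2} err
puts $err          ;# list element in braces followed by "</A>" instead of space
# string range never parses the data, so it is safe on any HTML
puts [string range $line 0 10]   ;# <A HREF="ht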
HJG 2014-07-30 - Note that the wget program has an option "--spider".
From the documentation:

When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there. For example, you can use Wget to check your bookmarks:

wget --spider --force-html -i bookmarks.html
But checking the output from that, and removing the dead links, is left as an exercise for the user :)
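One hedged sketch of part of that exercise (the proc name is made up): check one URL at a time from Tcl and rely on wget's exit status - 0 means the resource was reachable - instead of parsing its log output:

# exec raises an error when wget exits nonzero, so a plain catch
# turns the exit status into a boolean verdict.
proc spiderOk {url} {
    expr {![catch {exec wget --spider --quiet --timeout=10 --tries=1 $url}]}
}
puts [spiderOk http://www.tcl.tk/]

This is slower than wget's batch -i mode, but it sidesteps the output-parsing problem entirely.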