05 Sept 2003 Mike Tuxford: I was chatting on IRC this morning and a friend said, 'I need something to clean out all the dead links in my bookmark file,' so I wrote the following script.
It works well for me, as you can see from the command-line output example below, but I advise backing up your real bookmark file first.
burp:/tcl/s/bookmarks# ./deadlinks.tcl bookmarks.html
Checking 1150 sites...
Testing site
1024 Good sites
126 Dead sites
Caveats: Just because a site doesn't respond within 10 seconds while the script is running doesn't mean the site is permanently a dead link. You can reprocess your file.dead file again later (deadlinks.tcl file.dead) and it will produce a file.dead.good containing any good links it found. Those sites then need to be re-added to your bookmarks manually.
lv Would bracing the exprs below result in the application running a wee bit faster?
MT: If you meant the -timeout value in 'proc verifySite', I just changed that now; it makes much more sense. However, shaving milliseconds in a script that has to wait 10 seconds on each failed attempt isn't going to save much overall, but it is better programming in general, and I thank you for that. Or did you mean in 'proc extractSite'? If so, I suppose it would, so I did it, but frankly I don't really like that proc myself. Got a better suggestion?
Frink and procheck pointed to a number of lines (there are still at least 6) where there were exprs without their arguments braced - those are the ones to which I was referring. I was just facing feeding deadlinks a massive bookmarks file and figured that if bracing them would cut a quarter of a second off each link's processing, I'd still be saving nearly an hour of processing time...
MT: OK, I never did understand this expr performance issue before. I did some reading, and also wrote a test script that created 10,000 vars using expr, and indeed there is a huge performance gain from bracing, and I think I almost understand why: the braced expression gets byte-compiled, rather than substitution occurring on every invocation. I think that's right? And I hope my applying the rule below is correct and complete. Thanks for making me learn.
lv The way I think it is explained is that expr does a level of substitution of its own as it processes its arguments. Thus, if you brace its arguments, then in most cases you save the overhead of doing the substitution twice. There are, of course, possible designs which depend on two rounds of substitution - but I don't believe that your code is one of those designs.
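The difference lv describes is easy to see with the time command. A minimal sketch (not code from the script itself; the proc names here are made up for the illustration):

```tcl
# Unbraced: the Tcl parser substitutes $total and $i first, then
# expr parses and evaluates the resulting string again on every
# call -- no byte-compilation, two rounds of substitution.
proc unbraced {n} {
    set total 0
    for {set i 0} {$i < $n} {incr i} {
        set total [expr $total + $i]    ;# slow form
    }
    return $total
}

# Braced: the expression is byte-compiled once; only the variable
# reads happen at run time.
proc braced {n} {
    set total 0
    for {set i 0} {$i < $n} {incr i} {
        set total [expr {$total + $i}]  ;# fast form
    }
    return $total
}

# Both compute the same sum; the braced version is typically
# several times faster.
puts [time {unbraced 10000} 10]
puts [time {braced 10000} 10]
```

Bracing also closes a double-substitution hole: with the unbraced form, data containing Tcl syntax gets evaluated a second time by expr.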
TV You may want to consider a -command callback with the http::geturl command; at least you should be able to get a few sockets (probably a couple of dozen without getting too upsetting) trying to set up connections simultaneously. Or consider running it as a background process; depending on the OS, it shouldn't eat up much processor time while waiting.
Running n queries in parallel gives you roughly an n-fold speedup...
MT: I am looking at this because it looks correct, but I'm struggling with how to implement it. The problem is that if the script continues on, the HREF being waited for would end up written to the wrong position within the bookmarks.good file whenever it's just a slow validation or a retry (as I assume you'd want to do). It may have originally been in a folder named "Tcl" but end up in a folder named "Perl".
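One way around the ordering problem is to key every result by the link's original index and only write the output once all callbacks have fired; completion order then no longer matters. A sketch under those assumptions (checkAll, siteDone, ::results, and ::pending are invented names, not part of the script):

```tcl
package require http

proc checkAll {urls} {
    set ::pending [llength $urls]
    set ::results [dict create]
    set i 0
    foreach url $urls {
        # With -command, geturl returns immediately; the callback
        # fires later from the event loop, in any order.
        http::geturl $url -timeout 10000 \
            -command [list siteDone $i $url]
        incr i
    }
    vwait ::done    ;# run the event loop until every reply is in
}

proc siteDone {i url token} {
    set status [http::status $token]   ;# "ok", "timeout", "error", ...
    dict set ::results $i [list $url $status]
    http::cleanup $token
    if {[incr ::pending -1] == 0} {
        set ::done 1
    }
}

# After vwait returns, walk ::results by index 0..n-1: each link is
# reported in its original bookmark-file position, so a slow "Tcl"
# link can never drift into the "Perl" folder.
```

Retries fit the same scheme: a failed callback can re-issue geturl for its own index instead of decrementing ::pending.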
lv Bug report. After running a couple of hours, and getting about 60% done, I see this error message:
Checking 13020 sites... Checking site
list element in braces followed by "</A>" instead of space
    while executing
"lrange $line 0 $sidx"
    (procedure "extractSite" line 4)
    invoked from within
"extractSite $line"
    (procedure "main" line 12)
    invoked from within
"main"
    (file "/tmp/dl" line 130)
(where /tmp/dl is the deadlinks program)
MT: Egads! That lrange was never even supposed to be there; I meant to use string range. However, I don't know whether that fixes the problem. I sent you an email - can you send me an example of the suspect line that includes a java/js link?
My bookmark file had 9 js links and did not fail, but when I added a set of braces like you showed, the lrange line failed the same as your bug report, while the originally intended string range worked fine. Very sorry about that. 13,000+ bookmarks? Sheesh, and I am taking pokes about having 1150.
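The failure mode is worth spelling out: lrange parses its argument as a Tcl list, so a brace-quoted "element" in the HTML must be followed by whitespace, which is exactly what the error message complained about. string range is purely positional and doesn't care. A contrived line of the kind that triggers it (not taken from lv's actual bookmark file):

```tcl
# An element starting with "{" whose closing brace is followed by
# "</A>" rather than a space makes list parsing blow up.
set line {<A HREF="x"> {foo}</A>}

# lrange chokes with: list element in braces followed by "</A>"
# instead of space
set failed [catch {lrange $line 0 2} msg]
puts "lrange failed: $failed ($msg)"

# string range works on raw characters and succeeds.
puts [string range $line 0 1]    ;# -> <A
```

That is why the script only broke on certain js bookmarks: most lines happen to parse as valid lists, so the stray lrange went unnoticed until braces showed up in an HREF.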
From the documentation:
When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there. For example, you can use Wget to check your bookmarks:

wget --spider --force-html -i bookmarks.html
But checking the output from that, and removing the dead links is left as an exercise for the user :)