Version 1 of Random Hangs on high cpu load

Updated 2017-06-01 10:34:47 by MHo

From time to time I face problems in several long-running Tcl programs on one of our Windows production systems, which I'm unfortunately not able to track down.

The source code is too complicated to post here...

All programs make heavy use of after events and file events, e.g. to periodically update a flag file on disk or to read stdout from called programs, and they exec many external executables, either in an endless loop or triggered by twapi filesystem monitoring. If everything works fine, the programs run for many days (or weeks) without problems, calling tens of thousands of other programs.
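For readers unfamiliar with this pattern, here is a minimal sketch of what such a program might look like. It is not the actual production code: the proc names (touchFlag, onReadable, runChild) and the globals ::output and ::done are hypothetical, and the twapi monitoring part is left out.

```tcl
# Hypothetical sketch only; all names are illustrative, not from the real programs.

# Periodically rewrite a flag file, rescheduling itself with [after].
proc touchFlag {path intervalMs} {
    set f [open $path w]
    puts $f [clock seconds]
    close $f
    after $intervalMs [list touchFlag $path $intervalMs]
}

# File event handler: collect one line of child output per call.
proc onReadable {chan} {
    if {[gets $chan line] >= 0} {
        lappend ::output $line
    } elseif {[eof $chan]} {
        # Switch back to blocking so [close] waits for the child and
        # reports its exit status; the error, if any, is ignored here.
        fconfigure $chan -blocking 1
        catch {close $chan}
        set ::done 1
    }
}

# Start an external program and read its stdout through the event loop.
# cmd is a list: program name followed by its arguments.
proc runChild {cmd} {
    set chan [open |$cmd r]
    fconfigure $chan -blocking 0
    fileevent $chan readable [list onReadable $chan]
    return $chan
}

# Usage (assumed): touchFlag flag.txt 1000; runChild {someprog arg}; vwait ::done
```

Both the timer and the file event handler depend on the same event loop, so if either handler blocks, the other stops being serviced too.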

What I saw yesterday is this:

  • We started another program on the machine, which put the CPU under heavy load and constantly allocated more and more memory, so the machine slowed down
  • We killed that program
  • Although the machine then returned to a normal load, my 3 Tcl programs on that machine did not respond any more - that is, the after events did not fire any more. No errors were generated; it simply looks like the programs hang.
  • Other programs on that machine continued to run normally, so there was no general Windows error condition etc.

As I'm only experienced at the Tcl/Tk script level, I don't know how to track such errors down any further. They aren't reproducible and happen only from time to time. The machines are Windows VMs.

My questions are:

  • Under what circumstances is it theoretically possible that the Tcl event loop stops working?
  • Would it help to save the (few) pieces of information that Sysinternals Process Explorer shows about active threads, thread state etc.? I'm not able to interpret such information, I fear...
  • Is it possible that a call to exec ... & or open |proc never returns, blocking the whole program?
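
One way to collect more evidence the next time this happens might be a small heartbeat that runs alongside the normal work. The sketch below is only a suggestion under assumptions: the proc name heartbeat and the log file name are made up, and it does not explain the hang by itself. It appends a timestamped line at a fixed interval and records how late each timer fired, so that after a hang the log shows when the event loop last ran and whether the timers were already drifting beforehand.

```tcl
# Hypothetical diagnostic sketch; proc name and log format are assumptions.

# Append a line to logFile every intervalMs milliseconds.
# "late" is how many milliseconds after the intended time this call ran
# (0 for the first call, which has no intended time).
proc heartbeat {logFile intervalMs {expected ""}} {
    set now [clock milliseconds]
    set late [expr {$expected eq "" ? 0 : $now - $expected}]
    set stamp [clock format [expr {$now / 1000}] -format {%Y-%m-%d %H:%M:%S}]
    set f [open $logFile a]
    puts $f "$stamp pid=[pid] late=${late}ms"
    close $f
    # Reschedule, passing along the time at which the next call is due.
    after $intervalMs \
        [list heartbeat $logFile $intervalMs [expr {$now + $intervalMs}]]
}

# Usage (assumed): heartbeat heartbeat.log 5000
```

Because Tcl runs all event handlers of an interpreter in one thread, a handler that blocks (for example a blocking read on a pipe) also stops the heartbeat, so the last log line gives a rough time for when the loop stopped being serviced.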