How To Read Large Strings without Newlines

XAN: I've run into a kind of surprising limitation(?) and it has me stumped. Maybe someone else here has done something similar?

The Problem

I am attempting to use TclCurl to have a "conversation" with an ASP driven webpage. A little background... these are corporate in-house webpages over a VPN and there is little to no hope of getting them to change anything. And there is a major design flaw in these webpages - the designer has decided to use the __VIEWSTATE field to save state for everything... including the kitchen sink! The result...

The stumbling block is a __VIEWSTATE field that is approximately 500K large! To communicate with the server, I have to capture this awful chunk of data and send it back in the POST request. I have code that does this, and I know that it works because it is heavily tested on pages with a sane-sized __VIEWSTATE. Trying to use the same code on a page with a large VIEWSTATE, however, just causes TCL to hang.

Possible Cause

I have no problem reading a file megabytes large into a variable, so something else is going on here. I looked for "offensive" characters that may be choking the interpreter but none seem to exist. The striking detail, to me, is the absence of newlines. The ~500K is all on one line... could it be that TCL cannot handle this scenario, or handle it well?

Things I've Tried and The Outcome

  1. I've taken a chunk of about 10K and tried building a 500K string with no newlines using append. The interpreter does get slower and slower and eventually grinds to a halt. When doing this with newlines, no slowdown occurs.
  2. I've tried every possible method of reading this data into a variable...
  • different encodings and translations on fconfigure -> read into variable
  • set variable directly using copy/paste into console
  • breaking up into chunks and rebuilding these chunks into one variable
  3. I've looked for a way to have TclCurl pull the postfields from a file. No such method seems to exist - it must come from a variable.
  4. I've tried using external options like CMD.exe type filename. Same problem. Interestingly, CMD.exe has no problem typing this file out in a CMD.exe instance.

Questions

  1. Why can't TCL handle this data?
  2. What can I do? Anyone have any suggestions or similar experience?

AMG: I wasn't aware of any problems with long strings, even when they lack newlines. The computer I'm on right now takes about thirty milliseconds to assemble a one-megabyte string consisting only of the letter "x":

time {string repeat x 1048576} 100

I suspect the problem isn't with Tcl or the way it handles long strings or newlines, but rather with the way data is exchanged with the TclCurl library. Profile your program with [time] and/or [clock microseconds] to see which command is taking the longest to complete, then split up that command and see which piece is taking too long, and keep repeating until you've isolated the culprit. Let us know what you find.
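A minimal sketch of that profiling approach, assuming Tcl 8.5 or later (for [clock microseconds]). The helper name timed and the stand-in steps are hypothetical; replace them with the real commands from your program, one suspect step at a time:

```tcl
# Hypothetical helper: run a script in the caller's scope and
# report how long it took on stderr.
proc timed {label script} {
    set start [clock microseconds]
    set result [uplevel 1 $script]
    set elapsed [expr {[clock microseconds] - $start}]
    puts stderr [format "%-12s %10d us" $label $elapsed]
    return $result
}

# Stand-in for the real payload: 500K characters with no newlines.
set payload [timed build {string repeat x 512000}]

# Stand-in for a later processing step (here, just measuring the string).
set size [timed measure {string length $payload}]
```

Whichever label reports the large number is the step to split up and time again.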

There doesn't appear to be any documentation for TclCurl on this Wiki. Maybe write a sample program or two and post it so we can see how it works.


RLE (2011-06-01) Something seems odd with your example of creating a 500K string by appending a 10K string a number of times. Note:

 % set s [ string repeat a 10240 ]
 % string length $s
 10240
 % time { for {set i 0} { $i < 50 } {incr i } {append res $s } }
 292 microseconds per iteration
 % string length $res
 512000

$res is now a 500K string of the letter "a". 292 microseconds hardly seems like "grinds to a halt". Therefore, something else seems amiss somewhere.


XAN: Thanks for the input...

I've run the code on my system and come up with similar results.

(bin) 2 % string length $s
10240
(bin) 3 % 
(bin) 3 % 
(bin) 3 % time { for {set i 0} { $i < 50 } {incr i } {append res $s } }
1626 microseconds per iteration
(bin) 4 % string length $res
512000
(bin) 5 % 

I think that it takes slightly longer here because I'm working on an older Core2 laptop. So far so good - you've convinced me that the problem isn't likely to be with the newlines.

However, I don't think it's possible that TclCurl is involved. When I open up a clean interpreter and try to set a variable with the copy/pasted VIEWSTATE content, I get the same issue.

I may have been incorrect about the special characters in the VIEWSTATE... I'm going to take a closer look at the content and see if the interpreter could possibly be hanging up trying to do substitutions.
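One way to check this is to count the characters that the Tcl parser treats specially when text is typed or pasted at the prompt. This is only a sketch: the proc name countSpecial is hypothetical, and the character set is an assumption about what matters here. Note that data obtained with [read] is never parsed as a script, so these characters only matter for the copy/paste case:

```tcl
# Hypothetical helper: count characters that trigger substitution,
# grouping, or command separation when text is evaluated as a script.
proc countSpecial {data} {
    return [regexp -all {[\[\]\$\\{}";]} $data]
}

# Fragment taken from the sample posted in this thread (base64 text).
set sample bGVkIFdhbmRhHjQ4MzfCoCAtIEEgR2FuZ2xhbmQgTG92ZSBTdG9yeRs0MjUz

puts "special characters in sample: [countSpecial $sample]"
puts "special characters in test string: [countSpecial {a$b[c]}]"
```

A count of zero for the real data would suggest that substitution is not what makes the paste hang.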


XAN:

The problem persists and I don't understand it...

1. I took several different 10K chunks out of the VIEWSTATE and plugged them into RLE's timing sequence above... the timing is similar and it completes the operation each time. I am going to try to do this with the entire block, although a similar previous attempt failed.

2. I tried replacing any characters normally escaped or substituted by the interpreter. Loading the entire string still causes a hang.

3. I tried posting the entire VIEWSTATE to the wiki but that didn't work. A sample of the data looks like this:

...bGVkIFdhbmRhHjQ4MzfCoCAtIEEgR2FuZ2xhbmQgTG92ZSBTdG9yeRs0MjUzwqAgLSBBIEdvbGRlbiBDaHJpc3RtYXMmMjIx...

XAN:

I've gotten the code to work by reading in the VIEWSTATE and then reassembling this way:

    set fp [open c:/viewstate.txt r]
    fconfigure $fp -translation binary -buffersize 10240

    set CTR 0
    set data($CTR) [read $fp 10240]
    while {![eof $fp]} {
        incr CTR
        set data($CTR) [read $fp 10240]
    }
    close $fp

    set FINAL ""
    incr CTR
    for {set x 0} {$x<$CTR} {incr x} {
        append FINAL $data($x)
    }

When I tried breaking it up into chunks previously I wasn't doing it under the binary encoding or setting an explicit buffer. That is the only difference... but it makes a BIG difference. The whole process is now nearly instantaneous.

THANK YOU RLE for setting me off in the right direction!!! The help is greatly appreciated!

BTW - TCL ROCKS!!!


AMG: That works, but this doesn't?

set fp [open c:/viewstate.txt]
fconfigure $fp -translation binary
set FINAL [read $fp]
close $fp

Go ahead and give it a try. If it's also fast, try it without the [fconfigure] line, and let us know if that makes a difference. Also, note that I omitted the r argument to [open] because it is the default.

You can wrap your code and mine inside [time] if you want an easy way to measure it.
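A rough sketch of such a comparison follows. It writes its own 500K test file so that it runs anywhere; the file name viewstate-test.txt and the proc names readWhole and readChunked are made up for illustration, and the real VIEWSTATE file should be substituted to reproduce the actual problem:

```tcl
# Hypothetical test file: 512000 bytes, no newlines.
set path [file join [pwd] viewstate-test.txt]
set out [open $path w]
fconfigure $out -translation binary
puts -nonewline $out [string repeat abcdefghij 51200]
close $out

# Read the whole file with a single [read].
proc readWhole {path} {
    set fp [open $path]
    fconfigure $fp -translation binary
    set data [read $fp]
    close $fp
    return $data
}

# Read the file in fixed-size chunks and reassemble it.
proc readChunked {path {size 10240}} {
    set fp [open $path]
    fconfigure $fp -translation binary -buffersize $size
    set data ""
    while {![eof $fp]} {
        append data [read $fp $size]
    }
    close $fp
    return $data
}

puts "whole:   [time {set a [readWhole $path]} 10]"
puts "chunked: [time {set b [readChunked $path]} 10]"
puts "identical: [string equal $a $b]"
file delete $path
```

If both approaches return the same data in similar time on the test file but not on the real one, the difference lies in the data or the environment rather than in the reading strategy.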


XAN: AMG, I can confirm that the above block does not work with the data in question, with or without the [fconfigure]. That's what had me so stumped, and actually still does. I can't post the data in question on the wiki, but I'd be happy to send it to someone who is interested.

To recap: Trying to read the whole thing in always ends in what appears to be a hang. I have left the interpreter open for 45 minutes and still no finish. Trying to copy/paste directly into the console - same thing. The ONLY method that has worked so far is encoding for binary and reading in the data in regular sized chunks, with an explicit buffer set.

Is this not the expected behavior?

AMG: Not at all, it shouldn't hang like you describe. Can you attach the file to an email to [email protected]? I can look at it late tonight when I get home from work. Also attach an md5 or sha1 checksum if possible, so I can be sure the file didn't mutate in transit. I mean, if Tcl is having trouble with the data, who knows what SMTP will do to it? Anyway, I plan to see if I can reproduce the problem on Windows and Linux. On whichever platform it hangs up, I'll trace it in gdb and see what's really going on.

Also, which versions of Tcl have you tried this with? You can use [info patchlevel] to get the Tcl version.


XAN:

1. I will send the data to the supplied address.

2. I experience the same result with 8.4.14 and 8.5.9.

AMG, replying to your email:

On 6/2/2011 6:21 PM, TGD Compute wrote:
> Thanks for your interest in this problem! The VIEWSTATE data in question 
> is attached in a text file.

Windows XP SP3, 1.86 GHz Pentium M, 1 GB RAM, Tcl 8.6b1.2:

% set chan [open viewstate.txt]
% time {set data [read $chan]}
16840 microseconds per iteration
% close $chan

Slackware Linux 13.1, 350 MHz Pentium II, 96 MB RAM, Tcl 8.6b1.2:

% set chan [open viewstate.txt]
% time {set data [read $chan]}
39857 microseconds per iteration
% close $chan

Slackware Linux 13.1, 350 MHz Pentium II, 96 MB RAM, Tcl 8.5.8:

% set chan [open viewstate.txt]
% time {set data [read $chan]}
17116 microseconds per iteration
% close $chan

I don't see a problem. Have you tried reading the file on different machines, different operating systems, etc.? Maybe the problem isn't with Tcl itself, or maybe the problem is a bad interaction between Tcl and something particular to your machine.

(By the way, it's interesting that it takes more than twice as long for 8.6 to read the file as 8.5, but it's quite possible that this is due to the fact that 8.5 came with Slackware whereas 8.6 is self-compiled. But consider the machine specs before you make a big deal out of this.)

Something just came to mind. If you're trying to read the file interactively in a tclsh or Tkcon session, you will want to suppress display of the return value of [set]; it can take a long time to display. ([set] returns the value it set the variable to.) The easiest way to do this is to throw a dummy [list] command in there, since it returns empty string when given no arguments.

% set data [read $chan]        ;# will display the full contents of the file
% set data [read $chan]; list  ;# will not display anything

Wrapping the command inside [time] will also suppress normal display, since [time] determines the return value, not [set].