Socket Performance Analysis

by Theo Verelst

When I first used Tcl/Tk to build a user interface for a suite of radiosity/ray-tracing graphics programs and simulation programs, Tcl didn't yet have sockets built in.

So I made a little stub on Hewlett-Packard Unix: a little C program I'd hook into a piped exec, which would transfer stdin and stdout to a Tcl file descriptor. That program would connect to either another stub or another program via what I then called the 'connection server' (the code of which might even be lost due to dreadful university goings-on), which would join the actual sockets, by passing file descriptors, based on name-based rendezvous.

The other program could be a C program or an existing application, such as an application allowing external control over a socket- (pipe-) based connection (like AVS, which was about a decade ago).

There are two main performance measures I normally come up with in this area:

  • ease of use, streaming control, and absence of unpleasant boundary conditions
  • the usual performance measure: raw bytes per second, and how high that can go

The first has to do with setting up the socket (not trivial for many C programmers), with dealing with flow control, in Tcl with the fileevent handlers for incoming and outgoing data streams, and with buffering, newline, and end-of-file behaviour.

At the operating-system level this is also important, because buffering and related behaviour are influenced by the OS.

Normally, the performance of a car is measured not just by its colour or even its steering behaviour, but by its speed, which is of course important for sockets, too.

The reason the other performance measure is brought forward is that in practice the actual performance in terms of speed is often so low that the other issues prevail. At times they are related, which in the era of web applications and things like COM and various IPC methods is important, because it seems to me that many parties take really incredibly slow and lame performance for granted without even being prompted to wonder why things work a certain way.

First a short note on the Tcl sockets in terms of their usage properties.

      process A (on some.host)                    process B
      ---------                                   ---------
  set s [socket -server server_accept_proc port]
                                               set s [socket some.host port]
  fileevent $s readable ....
                                               puts $s "data"
  ...                                          ...
                                               close $s
  close $s

In principle, A and B can be the same process, but this rarely makes sense.

The dots are normally filled in with at least more fileevent statements, and with the main thing that bugs me a bit among the usage-performance issues: a test for end-of-file.

Without explicit programmer care to test for eof, the fileevent readable handler will continue to execute its body indefinitely when a connection is closed on the other end; i.e., it doesn't stop once the socket whose other side was closed has been read to the end.

An important issue in file/socket access is the newline behaviour: is buffering/flushing tied to newline-ended writes, and is it reasonably possible to retrieve all actual ends of lines at the receiving end without errors, even for lines longer than any of the buffers?

And does process control make the receiving process active after a reasonable number of characters are sent, but always after a while, even when no newline is sent or only a single character?
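
As an illustration, here is a minimal sketch of the defensive end-of-file pattern the above calls for (the handler name on_readable is made up for the example):

 proc on_readable {s} {
   set data [read $s]
   if {[eof $s]} {
     # Without this test the handler keeps firing after the other
     # side closes; disable it and close the channel.
     fileevent $s readable {}
     close $s
     return
   }
   # ... process $data ...
 }
 fileevent $s readable [list on_readable $s]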

Let's first see what is sufficient on a modern Windows XP machine (and I'll try Linux, too) to get a working connection, first on the same machine (i.e., with just another process).

I started two wishes, one I call the server, the other the client, though that terminology traditionally applies to programming setups less symmetrical than what we do here.

 ###### server
 proc readdata {ss} {
   # Read whatever arrived, report its length, and close on end-of-file.
   puts "received data length: [string length [read $ss]]"
   if {[eof $ss]} {close $ss}
 }
 proc serv {ss ip port} {
   global cons
   lappend cons $ss
   fileevent $ss readable [list readdata $ss]
 }
 set s [socket -server serv 7777]


 ###### client
 set p "0123456789" ; for {set i 0} {$i < 13} {incr i} {append p $p}
 set s [socket localhost 7777]
 puts [time {for {set i 0} {$i < 1000} {incr i} {puts $s $p}}] ; close $s 

This 81921000-byte transfer took about 19 seconds in this case, on a recent 2.6 GHz Pentium 4 machine, though with a movie player running.

That makes a little over 4 megabytes per second, a seriously low figure.
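
For reference, the arithmetic behind that figure, as a one-liner:

 puts [expr {81921000.0 / 19 / (1024 * 1024)}]  ;# => about 4.11 (MB/s)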


JMN 2006-09-26: Using Tcl 8.5 on a 400 MHz Windows machine, I got approx 42 secs for the above. By changing to -encoding binary -buffering full -buffersize 819200, this came down to 12.3 seconds.
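
A minimal sketch of that tuning, assuming it is applied to the client socket of the example above:

 set s [socket localhost 7777]
 # Skip encoding translation, buffer fully, and use a large channel buffer.
 fconfigure $s -encoding binary -buffering full -buffersize 819200
 puts [time {for {set i 0} {$i < 1000} {incr i} {puts $s $p}}] ; close $s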


KBK: Indeed, socket performance can be bad on Windows. We're working on it. Expect that the Core will at some point incorporate ideas from David Gravereaux's fine iocpsock package that will address some of these concerns.


TV The page wasn't finished... It's good to hear work is being done in this area, though.

On HP-UX, at the time on a 50 MHz PA-RISC, this kind of transfer, with the right buffering and done in C, would run at about 25 MB/s, and I seem to remember I got much better performance before on other machines; that should be possible in Tcl, too.

Mind that at least this example makes read take in about 80 megabytes all at once; even though the machine physically has 512 MB of core memory, that is a lot more than any normal receive buffer is configured for. So the first, reasonable, next step would be to reduce the per-read size to some point where the process-switch overhead is small compared to the actual transfer and send/receive function time, and the buffers are used well and not required to grow beyond their normal size.
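
A minimal sketch of that step, assuming an arbitrary 64 KB chunk size (readchunk is an illustrative name, meant to replace the inline handler in serv above):

 proc readchunk {ss} {
   # Read at most 64 KB per readable event instead of everything at once.
   append ::buf($ss) [read $ss 65536]
   if {[eof $ss]} {
     puts "received data length: [string length $::buf($ss)]"
     unset ::buf($ss)
     close $ss
   }
 }
 # In serv, register with:
 #   fconfigure $ss -blocking 0
 #   fileevent $ss readable [list readchunk $ss]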

I guess it can also make a difference whether Tk is loaded. Finally, I want to see what a Cygwin- (and maybe, though I didn't try yet, MinGW-) compiled Tcl does. Linux should be faster, but maybe I overestimate the speed at which lists can be fed into the read command. In C, I know that memcpy can work quite well when implementing sockets without system calls; video processing at least suggests that soaring data-transfer rates between processes are reasonably possible.


NEM Results on SuSE Linux 8.2, running on a 1.8 GHz Athlon (2200+), with 256 MB DDR memory. Tcl/Tk 8.4.2 (shipped with SuSE):

 % puts [time {for {set i 0} {$i < 1000} {incr i} {puts $s $p}}] ; close $s
 3218768 microseconds per iteration
 % set mb [expr {81921000.0 / (1024 * 1024)}]
 78.1259536743
 % set sec [expr 3218768.0 / 1000000.0]
 3.218768
 % expr {78.1259536743/3.218768}
 24.2720052126

So, it looks like Tcl manages about 24 MB/s on Linux (although with very little else running).


TV I've just been trying two 8.4 Cygwin-compiled wishes, which run on the XFree86 X server for Windows on the same machine, this time without another relevant CPU-time-consuming application, and got about 16 MB/s.

This is not a downloadable distribution: I used the Linux (I think it was) sources and compiled them under Cygwin as a Unix-type package, making use of the Cygnus Unix emulation layer and library, and it uses X windows to display the wish window. It may well be that not running the QuickTime movie player (I like to watch the americafree.tv channels) accounts for a major performance boost. At least the Cygwin layer seems not to cost too much, performance-wise, though I'd say the Pentium should be able to transfer data at two or three times higher effective rates, given the possible bus speeds; maybe the memory isn't effectively up to it.

I'll see if I can try out Linux now, and I'll add some C stuff. I promised some time ago I'd also check whether Cygwin-on-Windows sockets behave usably, stream-wise, since in the past I had serious problems connecting a C program with Tcl, where the socket buffers couldn't all be filled, or flow would be disrupted without a way out. I wrote a lot of code to get usable results, but the bottom line was that data got 'stuck' in the buffer at some point, and only got out by streaming more data, not by flushing or waiting. Later on I read (this was years ago) that the combination of OpenGL/Mesa and sockets under Cygwin may have been the problem, but I'm not sure.

(Oct 31 2003) I tried RedHat 9 with a self-compiled wish8.4.4 to run the above, which ran at up to about 39 MB/s without further optimisation, on a lightly loaded machine. In itself not bad; I'd say the machine should be up to more, but for a scripting language, without further optimisation for streaming, that's quite ok.

From an XP machine (SP1, but no further updates except the blaster patch) at 1.7 GHz that I happen to have access to, not loaded but running TclHttpd and a monitor window, fitted with a completely cheap but recent 100 Mb/s ethernet card, to the above Linux system, with the server running on either side, the result is effectively a little over 4.5 megabytes per second. That is about half of the top network capacity, which in some cases is nearly reached by other setups (almost 10 megabytes per second over a long, decent UTP cable), but the CPU load on the XP machine was a full 100%.


TV I didn't yet mention connection set-up performance: the number of socket connections per second that can be made, and also how many can be maintained in total (which could be limited to a few hundred or so; I don't know about modern Unixes and Windows exactly).

The connect/accept calls (OSI layer 5 or so) make it possible for sockets to wait for acceptance for some time when competing parties try to connect to a certain IP/port combination (usually tied to a process). Performance measures there are at least how many such pending connections there can be (could be 5), and how long it takes before the connecting side decides there will be no accept.
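
A minimal sketch of measuring the first figure, the connection rate, against the example server above (the count of 100 connections is arbitrary):

 set t [lindex [time {
   for {set i 0} {$i < 100} {incr i} {
     close [socket localhost 7777]  ;# connect and immediately close
   }
 }] 0]
 puts "[expr {100.0 / ($t / 1.0e6)}] connections per second"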


rmax - Tcl's socket performance also heavily depends on the encoding used on the channel. I used this modified benchmark code:

 ### server
 proc serv {ss ip port} {
    global cons ;
    lappend cons $ss;
    set encoding [gets $ss]
    fconfigure $ss -encoding $encoding
    puts "length: [string length [read $ss]]; encoding: $encoding"
    close $ss
 }
 set s [socket -server serv 7777]
 vwait forever

 ### client
 proc Bench {encoding} {
    set c [socket localhost 7777]
    fconfigure $c -encoding $encoding -buffering line
    puts $c $encoding
    set t [lindex [time {puts -nonewline $c $::s; flush $c}] 0]
    puts [expr {(double($::l)/1024/1024)/(double($t)/1000/1000)}]
    close $c
    after 1000; # let the server settle down before we proceed
 }
 set s [string repeat "0123456789" 8192000]
 set l [string length $::s]
 Bench utf-8
 Bench binary
 Bench iso8859-1

On a dual AMD Opteron running SuSE Linux 9.0, I got the following numbers: 110 MB/s for utf-8, 59 MB/s for binary, and 38 MB/s for iso8859-1. Surprisingly, when running this on two machines connected to the same 100 Mbit switch, utf-8 and iso8859-1 were almost equal (~7.5 MB/s), but binary was remarkably slower (~6.5 MB/s).

SRIV My Slackware notebook with a 2 GHz Core Duo pulled 122, 62, and 77 MB/s, respectively.

CJL - Don't do as I just did and copy-paste the above into the console of a wish.exe on Windows: the console tries to echo the result of the string repetition (all 80+ MB of it!). It's better to:

 proc init {} {
   set ::s [string repeat "0123456789" 8192000]
   set ::l [string length $::s]
 }
 init

(Embarrassingly low results (even with dual CPUs maxed out) withheld!)

PT If you are concerned about socket performance on Windows with Tcl, then David Gravereaux has already done lots of work on this subject. It is distilled in the iocpsock extension, which uses the more advanced I/O capabilities of the Windows Sockets API (I/O completion ports and so on) to reap significant performance benefits. The BSD compatibility layer provided by Winsock is not the best scheme to use, although it is the simplest to program.


Related issues:

  • Application layer interface (e.g. pcom)
  • event handling (fileevent)
  • compatibility issues (no passing of sockets as file descriptors to exec under certain (or all) Windows versions)
  • flow issues: it is important to know how process scheduling will run based on sockets transferring data, that is, when the OS's scheduler will decide to take the locally buffered data pushed into a socket, transfer it to the receiving buffer, and give the receiving process CPU time
  • stdio-based program input/output and remote execution over sockets; I'm making a bwise start with this on a new page: process and stream based bwise blocks
  • Watching a remotely stored video