Cuda, short for Compute Unified Device Architecture, is in essence a small supercomputer on a modern graphics card (from NVidia, but hey, they're pretty widely available). In fact there's also Tesla, based on similar computing nodes, which (preferably in version 8, with hardware double-precision floating-point units) can make supercomputing workstations with over 4 TFlops of power for under $10k.

Cuda already has a Perl interface IIRC, handy to play around with.

TV himself has not done anything with Tcl/Cuda yet, but wants both scripting with Cuda programs and a parallel Cuda implementation of Tcl (at least in some academic form to play with).

Lars H: A "parallel Tcl" in the sense of having each processor running a separate Tcl script probably isn't going to happen — since there isn't much memory available per core, and per thread in the Cuda model even less, an EIAS data model simply doesn't fit! What could work is to do some kind of "parallel CriTcl", where a Tcl program is boosted by in-line code in some other language whose data model is closer to that of the hardware (in this case, single-precision floats).
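To make the "data model closer to that of the hardware" point concrete, here is a minimal, GPU-free sketch in plain Python using the standard array module. It only illustrates the shape of the data a kernel would work on; the names are illustrative and not part of any real CriTcl or Cuda binding.

```python
from array import array

# Flat buffers of packed single-precision floats -- the data model a GPU
# kernel expects, as opposed to a list of boxed, shimmering Tcl_Obj values.
a = array('f', [0.1 * i for i in range(8)])
b = array('f', [2.0] * 8)

# The "kernel" body: one independent multiply per element, which is what
# each GPU thread would execute in parallel over its own index.
dest = array('f', (x * y for x, y in zip(a, b)))

print(a.itemsize)   # 4 bytes per element, contiguous in memory
```

The inline-code idea is then to generate and compile the per-element body for the GPU while the surrounding bookkeeping stays in Tcl.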

TV Did you check it out? There's 8 kilobytes of registers, and for instance with my humble (cheap) 9500GT there's something like 10 gigabytes/sec of memory access speed, more than all three memory interfaces of the new i7 in normal use. And I'm talking about small formula rendering tests first, I guess. And, for non-hashable associative functions, the parallel approach could be a huge improvement over current Tcl. And maybe Tk is fun when connected without the graphics bandwidth bottleneck.

Lars H: Yes, I did follow the links you provided to find more information — that's why I concluded a "parallel Tcl" wouldn't make sense. You can't fit a sensible part of the Tcl runtime environment in 8kB! (Do the math: how many Tcl_Objs can you fit in 8kB, how many hashtable entries? How many are needed in a fresh interp?) 8kB is small even for an L1 cache today, but these aren't caches — they're registers, so any data from memory would have to be fetched explicitly! Full speed memory access presumes that threads can be put to sleep while waiting for data to arrive, and thus in turn that there are many threads per core. The 8kB of registers per core have to be shared between all threads, so in practice you only have a fraction of that to play around with.
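Doing that math, with the caveat that the Tcl_Obj sizes below are rough estimates (the struct holds a refCount, a bytes pointer, a length, a typePtr, and a two-word internalRep union), not measured figures:

```python
# Back-of-the-envelope: how many Tcl_Objs fit in 8 kB of registers?
# sizeof(Tcl_Obj) is roughly 24 bytes on a 32-bit build and roughly
# 48 bytes on a 64-bit build -- assumed sizes, for illustration only.
REGISTERS = 8 * 1024   # 8 kB of register file, shared by all threads

for label, sizeof_obj in (("32-bit", 24), ("64-bit", 48)):
    print(label, REGISTERS // sizeof_obj)
# -> 32-bit 341
# -> 64-bit 170
```

A few hundred objects at best, before counting hashtable entries, the stack, or the bytecode engine — and that is the whole register file, not one thread's share.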

There is also the problem that CUDA (according to Wikipedia) is a kind of SIMD system, so even if branching is allowed, I doubt it could handle the kind of massive branching one sees in TEBC. Wikipedia furthermore states that CUDA doesn't support function pointers, and how far do you get in the Tcl sources without using function pointers? It's simply not possible to compile Tcl for the GPU.

However, one thing I didn't notice when I wrote the above is that the Python example in Wikipedia [L1 ] looks exactly like what I called "parallel CriTcl", so I would presume this is what the Perl interface looks like too. If you primarily want to experiment, I can't see why one would need much more than that.

TV Auw man, I mean there are computations in Tcl, everything from list sorting or label finding, vector computations, variable bookkeeping (did you notice the memory bandwidth and size on average modern cards?) and even object-oriented features (like the set operations in Objective C) which lend themselves to parallelizing. All the mumbo jumbo about compiler possibilities is all very nice, but so immaterial for EEs, sorry to say. I mean, then use something other than function pointers, isn't it?

See Also

Running a Cuda program from a Tcl loop
An example Cuda program called from a Tcl script.
Tcl on Cuda