Judy Arrays

http://judy.sourceforge.net/ is a very fast, very scalable dynamic array facility. I wonder how it compares to Tcl's hashing and whether it is worth looking into as the basis for a more powerful array command.

CMcC they certainly look interesting, but it seems to me that they are trees which store their keys internally, and they distribute it within the tree (they seem to be based on a trie data structure) so I can't see how you could use them for storing objects with an arbitrary ordering unless you used the address of the object as the key (cast to an integer). Hmm, I guess that would work after all, as a kind of hash. It's quite counter-intuitive that something giving a complete (if arbitrary) ordering would be more efficient than something like a hash, but since Judy is designed to play well with CPU caches, I guess such unusual results are to be expected.

CMcC I've run together a q&d (quick and dirty?) test of Judy trees, and benchmarked it against dict. There's only a 10% improvement in performance. I suspect the requirement of converting keys to strings is possibly what's dragging it down, and I think it's unavoidable. Anyway, the code's at http://sharedtech.dyndns.org/~colin/tcljudy.tgz

APN As an aside from the OSS perspective, it appears Judy trees are patented and there is always some hesitation in their use for that reason.

http://www.snookles.com/scott/publications/pli2003-slf.pdf talks about how much faster it is in Erlang as compared with Erlang's builtin hashing.

-- Todd Coram

I happened to mention Judy Arrays to Chuck Blake [L1 ] recently. He sent back an initial but fairly detailed take on them (see below). In short: Valuable, but probably oversold. --atp

 Date: Mon, 10 Jan 2005 11:16:15 -0500

 From: Chuck Blake

 To: Andrew Piskorski

Hey, Judy is pretty neat, Andy. Thanks for the link. Baskins really went to town on tries.

While I do think it is neat, and I like all the work he did, he does overplay its ultimate performance significance.

E.g., in the one timing run I did on my Opteron, Judy was actually about 25% slower and used 25% more memory than my AVL tree implementation tends to. In his own "make check" performance tests, Judy is only slightly better than his Splay, Redblack, and hash-based implementations.

Really, I think, if one plays the right implementation tricks, that the best data structure for large scale queries of large, dynamic ordered sets that pays attention to the memory hierarchy is a B-tree with one node being the size of a VM page (to minimize TLB refills) and each node structured as a 4..8 AVL tree internally. Assuming you can keep the root of the B-Tree in L1 cache, you can get random access to an entry with 3 TLB misses and 3*5=15 AVL misses you can get 64 (expected query time is 5 levels deep == 6 level AVL => 2**6=64 entries) to the fourth power (3+1 B-tree levels) or 2**(6*4) or 2**24 = 16 Mentry. 15 cache fills for random access to a 16 Mentry table is really pretty good. It's typically about a microsecond which is about what Judy gets for a few thousand. Assuming any scaling at all, it's no better, and honestly more complex than a simple two-level data structure.

For unordered collections, which Baskins seems to be unwisely pushing Judy for, an extensible hash table is both orders of magnitude simpler in code and faster. You have the directories of the ehash take you to one VM page. Internal to a VM page, you can structure it as a small hash table with either fixed width keys living in the table (fastest) or one more level of indirection. The main directory can be kept in L2 for all but the most extreme population sizes if the table is being actively used. If not, it's just one more cache refill. If the strings don't fit it's three. I would be shocked if Judy could do less than three cache misses per 32-byte string as the first access to an uncached data structure, but that's what the extensible hashing with indirect variable-sized keys gets you. I implemented this in about 200 lines of C code compared to his 20,000. And I think it's faster. :-)

With a properly organized overall computation usually you just have one cache miss which is optimal and typically about .1 microseconds on modern HW. You might have two when the directory doesn't fit. That could grow to .3 in a more pessimal overall program flow (e.g. directory doesn't stay cached, key length cannot be bounded). Judy was .5 usec which indicates 5..10 main memory accesses. Had it been one, my world would have been a bit more rocked. :-)

True, you still must compute the hash function once per string creation. That's almost always fast since the memory is cached (from, e.g. reading the string from a file or socket) and good hashes needn't be expensive. I have a few hash functions that are only 2..4 cycles per byte, or about 1/30 of a cache-fill per byte. So for < 30 byte strings, hashing the query key only adds about one more access worth of time, if you couldn't arrange to have saved that hash value when you created the query string.

Basically, after some deliberation, I think the conclusion I've reached is that the database guys figured out how to deal with memory hierarchies to even slower external memories 25 years ago and the same tricks work in the main memory VM/cache system. Baskins would do well to acknowledge that.

He would do better to support (and play up) the kind of "pattern matching" queries that often make digital trees compelling. Those are the kind where you want to search a collection for the subset of keys matching just a suffix or other substrings, for example, or when the third byte is "foo". I guess that's a more obscure access pattern, and so less likely to take the world by storm, but the competing data structures can't really support such things directly. They are par for the course with digital trees and also do the argumentative work of rationalizing digital decomposition of the keys which is a different "key abstraction" than hashing and comparing and often requires work to retrofit.

Anyway, I did appreciate reading about all his careful memory access engineering. A fun project, and maybe useful to me someday. I have tons of work to do, so I can't really discuss this much more right now. Thanks again.


Sean Barrett benchmarked Judy against a simple hash table here: [L2 ] and came to much the same conclusion.

From his summary:

Judy [does not have an] enormous speed advantage over a traditional "trade size for speed" data structure. Judy has received countless man-hours developing and debugging 20,000 lines of code; I spent an hour or three writing a fairly standard 200-line hash table. [...] Of course, you can certainly take the attitude that Judy performs well enough, and the fact that it's 20,000 lines of code doesn't really matter, as long as those lines of code work and you never have to look at them--and it appears Judy is mature enough that this is the case. It bugs me, though.