rope

A rope is a data structure used to represent strings. Ropes are typically implemented with B-trees [L1 ] of (possibly immutable) strings. The potential immutability of source strings makes rope a good candidate for a mutable string structure on systems where the native string representation is immutable, such as Java. Also, ropes can be used to mix data from disjoint sources, e.g. stack, heap, memory-mapped files, on-demand content, etc.

See [L2 ] for a real-world implementation of ropes in C ("C Cords").

SGI provides a rope template class in its STL implementation [L3 ]. More information about the implementation here [L4 ].

See also http://en.wikipedia.org/wiki/Rope_%28computer_science%29


(escargo 25 Jan 2008 - There is a limitation in the rendering code that ignores the closing paren, so the URL if presented with literal opening and closing parens is right, but the generated HTML is wrong).
(DKF: You need to percent-encode the parenthesis.)

jima (2010-Nov-05)

I would also look into [L5 ] where an introduction to ropes is to be found with some comments on performance issues and pattern matching on ropes.

And I would also suggest reading Colibri source code.


FB: As part of the Cloverfield project I've created a rope package in C. A preview of the code is available here:

http://sourceforge.net/project/showfiles.php?group_id=216988


FB: Cloverfield suggests that ropes be used in place of flat strings as the primary string representation in Tcl.

Each substring node could have metadata attached. For example, the string representation of a proc body could point to the original sourced string, with metadata giving the file name. This could be used to generate meaningful stack traces with line numbers and file names. See also Tcl9 and annotated strings.

I'm becoming convinced that skip list-based ropes are the most suitable structure for Tcl/Cloverfield ropes, whereas B-trees are more suitable for general purpose string operations. Skip lists [L6 ] are typically implemented using layers of linked lists with the bottom-most containing a chain of all the substrings in order, and successive "fast lane" layers built on top of each other. The bottom layer gives the flat string representation, while the above layers can give successive hierarchical layers. Some sources [L7 ] claim that skip list performances are significantly worse than B-trees mostly due to memory locality problem, however I'm convinced that a fine-tuned skip list-like structure is the way to go. For example, vanilla skip lists use randomization to promote nodes to higher levels and keep the whole tree balanced probabilistically. We could use the language rules instead and thus get a tree whose depth is roughly equal to the hierarchical level of the script (for code this would be the indentation level), and where the length of each layer is roughly equal to the number of "statements" on this level (list elements, commands, etc.). Performances would be more predictable, as the corner cases are easy to figure out (the structure is no longer probabilistic but deterministic and follows the language rules). And since EIAS, string operations tend to follow specific patterns for most data structures. For example, keeping an internal list representation in sync with its string rep should be trivial.

This approach somewhat resembles the "non-extractive" tokenization techniques used in the XML world. See for example VTD-XML [L8 ], which builds a tree of tokens over the existing untouched XML data, using the offset and length of each token. Compare to DOM, which builds a tree of objects with data extracted from the original XML data, and thus consumes at least twice the memory for strings only. A similar parallel can be drawn with Tcl regarding the string and internal representations of data structures. They give detailed technical information about their solution here [L9 ], and describe their Virtual Token Descriptor (VTD) format here [L10 ]. As Tcl and XML both share the EIAS principle to a certain extent, using ropes or annotated strings with similar "non-extractive" techniques should give the same kind of benefits in a Tcl context.


The Java 6 SE library introduced two classes based on a concurrent implementation of skip lists. More info here [L11 ] along with a nice introduction to the concept.


Tcllib does skiplists!


The Scheme people at http://schemers.org had this sort of discussion a while back regarding string representations; this can give some insight on what to do for Cloverfield, given the similarities between Tcl and Scheme on many points. [L12 ] [L13 ]


This nice Java applet demonstrates basic operations on several tree structures with step-by-step animations:

http://people.ksp.sk/~kuko/bak/index.html


The following page makes "a rough comparison between different string libraries (and some APIs, embedded in other libraries/programs)":

http://www.and.org/vstr/comparison


RS 2008-01-25: Could someone give an example of a string rep for a rope - i.e. demonstrate how the internals can be turned outside? I roughly assume it to be a relation like of text widget's get method (text content only) vs. dump method ("including the text and information about marks, tags, and embedded windows")... Even if metadata is kept hidden at runtime, it has to be serialized for channel transport (files, sockets, pipes, etc.)

FB: Let's imagine that our rope object type uses 2-3 B-trees [L14 ].

# Create a rope by concatenating words.
% set s [rope create The Quick Brown Fox]
TheQuickBrownFox

# Display the 2-3 tree structure as Tcl lists.
% rope dump $s
Quick {The {Brown Fox}}

# Append more words. This will rebalance the tree.
% set s [rope append $s Over The Lazy Dog]
TheQuickBrownFoxOverTheLazyDog

% rope dump $s
Fox {Quick {The Brown} The {Over {Lazy Dog}}}

# Insert word in the middle of the rope.
% set s [rope insert $s 16 Jumps]
TheQuickBrownFoxJumpsOverTheLazyDog

% rope dump $s
Fox {Quick {The Brown} The {{Jumps Over} {Lazy Dog}}}

LV So it looks like the rope's display version isn't all that useful. That's fine - neither is the display version of the output from open, etc.

Is there, however, a way to at least simulate serializing, so that one can dump the contents of ropes into a flat file and load them back up again without having to somehow know what commands initially were issued?

FB: I don't understand. Ropes are just an internal representation for strings. Serializing a rope just gives back the whole string (ie the concatenation of its substrings). As for serializing the internal structure, it's no different than serializing a B-tree (assuming the rope uses such a structure internally). Or you can serialize a rope by just dumping all its substrings in sequence, in that case you lose the tree structure but you can recreate it by reinserting all the substrings in an empty rope. In any case the internal structure bears no significant information whatsoever.

LV I guess I don't understand. Are you saying that if I just puts the rope variable to a file, then later do a gets of the data in that file, the resulting variable will be an identical rope?

FB: Yes, the ropes will have identical string values. Their internal structures may differ, since they depend on the history of the rope objects (insertions, deletions...) but this is no more significant than identical strings having different pointer values.

LV: When I think of serializing, what I think of is the ability to write out, frequently to a file, a snapshot of a variable (or group of variables) in such a manner that I can recreate said variable(s) at a later time. Maybe there is a better word for this than serializing?

FB: Ropes, like strings, are naturally serialized and need no special handling. Let's consider the following example:

# Create s1 by appending three strings.
% set s1 [rope create "Line 1" \n "Line 2"]
Line 1
Line 2
#
# s1's internal B-tree is a root node with the middle string and two children with the extreme strings:
#        "\n"
#        /  \
# "Line 1"  "Line 2"
#
# Create s2 by appending two strings.
% set s2 [rope create "Line 1 \n" "Line 2"]
Line 1
Line 2
#
# s2's internal B-tree is a single node with both strings:
# ("Line 1\n" "Line 2")   
#
# Write s1 to file, and read it back into s3
% set f [open file.txt w]; puts -nonewline $f $s1; close $f
% set f [open file.txt]; set s3 [rope create [read $f]]; close $f
% set s3
Line 1
Line 2
#
# s3's internal B-tree is a single node with the whole string:
# "Line 1\nLine 2"
#
# Compare ropes
% expr {$s1 == $s2}
1
% expr {$s1 == $s3}
1

Here s1 is a rope constructed by appending three strings: "Line 1", a newline, and "Line 2". s2 is a rope constructed by appending two strings, but with the same string value as s1. s1 is then written to a file, and read back into s3. All three ropes are equal in value, but their internal structures differ because of their history.


FB: The Tk text widget uses a B-tree to store textual data along with tags, so it's some sort of augmented rope, where each node has associated metadata. TIP #169 [L15 ] (Add Peer Text Widgets), implemented in Tk8.5, already involved uncoupling the B-Tree from the widget. A native rope object type could be used to push this idea even further, by allowing data sharing outside of text widgets. For example, a comment dated 24Jan2004 on the text page expresses the wish to share this structure with tdom. The B-Tree code used in the text widget could serve as a basis for a rope object type, as the code is mature and widely tested.

DKF 2008-01-25: The text widget B-tree is only partially decoupled from the text widget, and it's not as widely tested as it should be (it had major changes in 8.5 to support smooth-scrolling and other related changes [L16 ]). It's not at all clear to me that it's the best place to start from for your own implementation either; it's very focussed on being a model for 2-D strings (it's a collection of paragraphs containing rich text).


DKF: I was just looking at this page again, and I thought I'd add a couple of observations.

  1. It is really important that you keep the tree at least somewhat balanced. I remember using a research language once (which the author thought was ready for production use, but wasn't, as you'll see...) where the string representation was a tree, but where there was no balancing of the tree (except for compile-time constants). This had the effect of making the code I was working on (a compiler for another language) cripplingly slow; not only was the tree unbalanced, but it was unbalanced in the worst possible pathological way. Keeping that tree balanced is vital! If you're using B-trees rather than binary trees, this is easier.
  2. A technique/pattern that is commonly used in Java is to use singleton objects for the leaves of the tree (i.e. characters). That simplifies most of the code without totally slaying memory usage.

FB:

  1. Balancing is fortunately easy with immutable structures. My code above keeps the binary trees balanced by construction. Using mutable structures it becomes a bit more complicated (nothing unbearable though). I guess the language you mention used this model (unless the implementation was very loosy). Unbalanced trees typically degenerate to linked lists where each node has one and only one child: access becomes O(n) instead of O(logn), which explain the performance problems.
  2. Rope leaves are usually flat strings (arrays), not individual chars, so no problem here. And such string "sources" are usually shared. For example, a Tcl script typically builds strings from its source code: a proc body would point to the same data as the sourced data.

EKB If this is obvious to everyone else, then I apologize, but... what happens when you delete? E.g., from your example above, by the time you've reached

TheQuickBrownFoxJumpsOverTheLazyDog

with internal representation

Fox {Quick {The Brown} The {{Jumps Over} {Lazy Dog}}}

what happens internally if you then delete "oxJum"?


See also