Version 4 of Colibri

Updated 2009-04-16 14:21:00 by FB

Announcement

ANNOUNCE: Colibri version 0.2
=============================
As part of my ongoing work on the Cloverfield project, I (FB) am pleased to announce
the second version of colibri. 


WHAT IS CLOVERFIELD?
====================

Wiki page: http://wiki.tcl.tk/Cloverfield 


WHAT DOES COLIBRI STAND FOR?
============================

Colibris, known in English as hummingbirds, are a family of birds known for 
their small size and high wing speed. The bee hummingbird (Mellisuga helenae), 
found in Cuba, is the smallest of all birds, with a mass of 1.8 g and a total 
length of about 5cm. They are renown for their stationary and backward flight 
abilities on par with flying insects, which allow them to feed on the nectar of
plants and flowers in-flight.

I've chosen this name for this project because its goal is to be fast and
lightweight, and to follow the feather and bird theme shared with Tcl and many
related projects.


WHAT IS COLIBRI?
================

Colibri is a string and data type infrastructure. It features:

 - Rope-based strings (see http://wiki.tcl.tk/20690 ) :
 
     * ropes are immutable
       ... but ...
     * extraction, insertion, deletion, concatenation... are fast
     * limited compatibility with native C strings allocated in static space
       (or guaranteed to survive the whole application lifetime)
     * support for the full Unicode range (up to 32-bit codepoints)
     * 1/2/4-byte fixed-width encodings as well as variable-width UTF-8 with
       transparent conversions and access
     * custom rope types allows for lazy or procedural representations
     * iterator interface for easy access
     
 - Extensible data type sytem dubbed "words"
 
     * similar to Tcl_Obj
     * minimum payload is identical to that of Tcl_Obj, i.e. 8 bytes, but can
       take more space if needed
     * words can have synonyms, i.e. words of any type
     * synonyms can form chains of arbitrary length, so a word can have several 
       synonyms with different representations -- no more shimmering!
     * predefined types such as ints, single chars and small strings (up to 3
       chars on 32-bit systems) are represented as immediate values: the value 
       is stored directly in the pointer rather than in an allocated structure
       
 - Fast and efficient cell-based allocator
 
     * page-based allocation for optimal locality of reference and cache use
     * 16-byte cells on 32-bit systems fit most needs, but elements can allocate
       more cells if needed (up to 63 cells on 32-bit systems)
     * overhead is small: 2 bits per 16-byte cell.
     * raw alloc performances are competitive with the stock system malloc,
       and in many cases much faster (up to 5 times as fast on small strings and
       on words vs. Tcl_Obj-like structs)
     * single cell allocation (the most frequent case) is very fast
       
 - Automatic memory management thanks to an exact (AKA accurate or precise),
   generational, copying, mark-and-sweep, garbage collector
   
     * exact GC implies that roots (externally referenced elements) and 
       parent-child relations must be declared by the application, using a 
       simple API
     * custom types can define a finalizer; one of the applications is to plug
       in externally managed structures (e.g. malloc/free)
     * the GC process is fully controllable (pause/resume) so that the 
       application don't get interrupted unexpectedly
     * generational GC limits the overhead by restricting the collection to
       newer elements, which are typically short-lived
     * longer-living elements are collected less often as they get promoted to
       older generations
     * promotion is done by moving whole pages between memory pools, optionally
       performing compaction when fragmentation exceeds a certain threshold, 
       thus limiting the overhead and improving the cache-friendliness over time
     * contrary to reference-counting schemes (RC), GCs allow circular 
       references without memory leaks; word synonym chains take advantage of
       this, being implemented as circular lists
     * studies show that GCs consistently outperform RC in raw performances and 
       real-world cases, because the cost (both space and time) of maintaining 
       reference counters outweights the amortized GC overhead, especially with 
       generational GCs
       
            
Colibri is written in plain C and is free of any dependency besides system 
libraries. The compiled binary DLL on Windows is about 32kB. The source code is
heavily commented and follows the Tcl quality standards.


HOW DOES COLIBRI RELATE TO TCL?
===============================

From the Cloverfield announcement:

"   The last major point of the project is related to implementation and low
    level issues. The existing Tcl implementation is notoriously solid,
    however some changes are too radical to be performed on the current code
    base, so a reimplementation is certainly needed on vast portions. For
    example, the string representations, object structures, and virtual
    machines will certainly require complete rewrite to accommodate with the
    needed changes, which means that a lot of code won't be reused. However
    vast portions of library code and algorithms certainly will (clock scan,
    bigint, regexp...), as well as the high coding standards and QA that are
    the trademarks of our beloved language.  "

So Colibri is a candidate infrastructure for Tcl9 as an alternative to the
current Tcl_Obj-based core implementation. I believe that the features provided
by Colibri shall yield higher levels of performances than the current 
architecture, at the price of an ABI incompatibility (for which major versions
are made anyway), but with a similar philosophy that should ease conversion of
existing code.


CHANGES SINCE VERSION 0.1
=========================

1. Major rewrite of GC code: 

  - Roots are now sorted by generation for better performances with large number
    of instances.
  - Reworked parent-child relation handling. Now uses one cell per parent
    instead of one per parent-child pair, in combination with a parent flag
    (replaces the now suppressed generation flag).
  - Roots and parent-child relations now use cells allocated in a dedicated
    memory pool.
  - No more per-cell generation flag; all cells in a page belong to the same
    generation. Promotion is now immediate upon GC.
  - Promotion is done per page rather than per cell: whole pages are moved up 
    one generation at each GC. CPU overhead is much smaller.
  - As a consequence, the first generation is always emptied at each GC, which 
    speeds up further allocations.
  - The last collected generation is by default merged with the first 
    uncollected one, but may be relocated if fragmentation reaches a certain 
    threshold; this is done by copying individual cells instead of moving pages.
  - Relocation and redirection are now performed inline during the mark phase, 
    versus during later steps in previous versions. The resulting code is much 
    simpler, and avoids cache-intensive traversal of whole memory pools.
    
2. Improved performances in every domain thanks to point 1.

3. Added two high-level data types: vectors and lists.

  - Vectors are flat collections of words. Their size is limited, as they must 
    fit within one single page. They are directly addressable.
  - Lists are linear collections of words, made by assembling vectors into
    balanced binary trees. They are to vectors what ropes are to strings. Like
    ropes, lists are immutable. Iterator and traversal APIs are provided.


PLANNED FEATURES FOR NEXT VERSION
=================================

1. Mutable data types with circular references and graceful degradation to
immutable types: 

 - Mutable lists. Candidate models include unrolled linked lists and skip lists.
 - Maps (AKA dicts) with string keys. Candidate models include Patricia trees 
   and Ternary Search Trees, but not hash tables.

2. Proper internal and user documentation


WHAT NEEDS TO BE DONE?
======================

My main development platform is Windows, so for now the source archive only 
includes Microsoft Visual Studio project files. Microsoft provides a free 
edition of their development environment known as Visual Studio Express for 
those willing to compile and test the library without having to buy an expensive
license. Other systems need a makefile and/or autoconf scripts.

The code is fairly portable on 32-bit systems. 64-bit support will need more
work because all the internals are fine-tuned and optimized at the bit level;
however porting to 64-bit should be rather straightforward: the algorithms will 
remain unchanged, structure access is abstracted behing macros, and cell size 
is proportional to the word size (a cell should be able to store 4 pointers, 
which add up to 16 bytes on 32-bit systems).

The only part that needs platform-specific code is the low-level page allocator.
Colibri needs system calls that allocate boundary-aligned pages. At present only
a Windows version is provided, but porting to other systems should require 
minimal effort, as the platform-specific code is limited to a handful of
functions. Platform-specific peculiarities should not impact the overall 
architecture.

Exception handling is poor, especially because Colibri is meant to be part of a
larger system that provides the appropriate infrastructure.

The current implementation is not thread-safe as it uses unprotected global
structures. However I don't think that adding synchronization is necessary given
that it should eventually be integrated into a larger application with the same
appartment threading model as Tcl. The main advantage of having a global 
allocator is inter-thread data sharing, and Tcl typically shares no data between
threads. As Colibri is not designed to be a general-purpose solution, and unless 
there is a real benefit in using a global thread-safe allocator, converting 
global structures to thread-local should fit our needs. 

Tests have been run extensively during development, both on stability and 
performance. However Colibri lacks a real test suite. The source includes a 
test application that has to be hand-edited to run or customize specific tests. 
Note that the current values defined by this application require a system with 
a large memory size (2GB).

Last, it lacks user and design documentation, although the source code is 
extensively commented.


GRAND UNIFICATION SCHEME
========================

A great side project would be to reimplement Tcl over Colibri as a replacement
for the current Tcl_Obj based code. Of course the resulting library would be
incompatible on an ABI level, but this would provide a great testbed for
Colibri in real-world cases (using pure script applications) as well as a
comparison point with the official Tcl core, and will possibly open the path to 
Tcl9. This most likely involves a lot of work.


WHERE CAN IT BE FOUND?
======================

Wiki page: http://wiki.tcl.tk/Colibri 
Project page @ SourceForge.net: https://sourceforge.net/projects/tcl9/ 
Mailing list: https://sourceforge.net/mailarchive/forum.php?forum_name=tcl9-colibri 
Direct Download:
    - source: http://downloads.sourceforge.net/tcl9/colibri0.2.src.zip?use_mirror= 
    - Windows binary: http://downloads.sourceforge.net/tcl9/colibri0.2.win32.zip?use_mirror= 


LICENSE
=======

The license is the same as for Tcl.