biotext: A new Tk widget dedicated to bioinformatics

lm 2010-06-30: This page is dedicated to the multiple sequences alignment visualisation problem. It contains explanations of the problems, and, maybe, discussions to solve it !

The context

Since the beginning of DNA and protein sequencing, biologists were interested in comparing sequences in order to understand their function, specificity, or regulation mechanism. Indeed, upon selection pressure the sequence of a given protein will change, but the residues (amino acids) responsible for the funcion/specificity/regulation will remain, or change with a slower speed compared to other residues of the protein. By aligning sequences from different organisms it is then possible to identify such functional residues.

The universal 20 amino acids are usually represented by a letter (A for alanine, R for arginine, W for tryptophane for example), and so a protein sequence can be seen as a string. Aligning sequences is not a letter-to-letter comparison. Indeed, inside the set of the 20 aminoacids some are physico-chemically closer to others; for example, Aspartic acid (D) and glutamic acid (E) are the only two acidic residues, Valine (V), Leucine (L), Isoleucine (I) and Methionine (M) are hydrophobic and large. It has been shown that amino acids are not randomly changed during evolution, but rather they tend to stay in their physico-chemical subgroup. Some people build substitution matrices, scoring the substitution of an aminoacid by another.

Numerous algorithms have been developed to align sequences. They almost all try to maximize a score linked to the use of a substitution matrix. All programs are able to give a good multiple sequence alignment (MSA) when sequences are very similar, but they fail when sequences are distant, or when the set of sequences to be aligned is too heterogenous. Indeed, because of the new sequencing technologies, there are now more than 1,000 completely sequenced genomes, ranging from bacteria up to homo sapiens. This leads to a large amount of sequences to align for a given protein with a wide range of diversity. Nowadays, a standard MSA is made of 200-500 sequences, each being 300-20,000 residues long. After alignemnt, a MSA is then a matrix of characters of 200-500 x 3,000 - 20,000 characters.

The image below shows sequences before and after being aligned.

http://lbgi.igbmc.fr/~moumou/Images/Stacked_vs_Aligned_Seq.jpg

The Problem

Several softwares have been developed to visualize MSA. Most of them are dealing fine with representation of protein features, like showing the secondary structure of a 3D model in the MSA, or the domains that can be identified inside proteins. The problem arises when the biologist should edit (adjust) the alignment. As said previously, alignment programs do not give optimal alignment when dealing with distant sequences, or multi-domain sequences in general. This implies the biologist should manually adjust the alignment. This is usually hard work, taking hours in front of a terminal. In the 80s, a package was developed by a lab of the Wisconsin University that includes, amongst other programs, a MSA editor, SeqLab. Some years ago the package was sold to the Accelrys company which decided last summer to stop development and distribution of this software. During this time, other packages appeared (EMBOss, JalView, Camera for example), but none of them provided a working MSA editor. Nowadays, a lot of laboratories in genetics, evolution, bioinformatics, still maintain the old GCG package only to use the SeqLab editor.

When editing an alignment, all 20 aminoacids are displayed with a given background and foreground, generally linked to their physico-chemical properties: R and K have white and blue foreground and background respectively, G is black/orange, P is white/black, H and N are black/cyan, etc ... (see Appendix). A typical editor window wiill display between 50 to 70 lines (sequences) and 150-250 characters for each. All the attempts to create a new editor, whatever the langage used to code it, lead to the same results: the scrolling is not soft and fluid as in SeqLab, or flashes, disqualifying the software for production work. Indeed a slow or flashing scrolling became quickly painful when working hours in front of a screen. Some editors (JalView) allow one to edit parts of the alignment, but they can not be used for serious work either.

Solutions

As a reminder, the purpose is to display 50-70 lines of 150-250 characters, all tagged, taken from 200-500 lines of 2,000-20,000 characters long. With the help of some Tk gurus, I tried several things in order to create a working editor. Some trials :

  • the less worst solution is to use a table (tktable package) with one residue per cell and setting all paddings and borderwidth to 0. This gives quite fast results, although it is still flashing and fluid-ish upon scrolling large alignments.
  • a widget text whose scrollbars are disconnected. For each scrolling event, the widget is cleaned and the zone to be displayed is inserted along with the tags in one go (thanks to Jos !). This is slow for 50 sequences by 150 characters.
  • a canvas widget with the same mechanism as above is also slow.
  • a pure text widget containing the whole tagged alignment is useless upon y-scrolling.

A New Widget

After trying so many things, it appears it's worth the investment of writing a new Tk widget. It should allow 1) fast and fluid scrolling even with large alignment, 2) mapping of residues according to a map given by the user, 3) grouping sequences, 4) insertion/deletion of rows/columns.

Characteristics of the widget

The edition/visualisation of MSA has some peculiar characteristics:

  • the document to be visualized is a matrix of single characters
  • there is only one font, which MUST BE of fixed width (Courier, Courier New)
  • a given character will always have the same tag associated to it: for example, a 'F' will always be displayed with a red background and white foreground.
  • you can only insert/delete rows or columns.

A first attempt gives me perfect results, using the xlib XDrawImageString function. The use of Tk_Fill3DRectangle/Tk_DrawChars to emulate the XDrawImageString function doesn't give a fluid widget upon scrolling. IMPORTANT NOTE : I'm new to the Tk C API, and I'm not used to coding in C. So my code may be rubbish !!
You can find a tar file containing the code, Linux compiled version, and example at [L1 ].
(You can try the Tk_Fill3DRectangle/Tk_DrawChars couple by editing src/tkBiotext.c , search for tkslow, change if (0) to if (1) , and recompile with the PourCompiler file ... You'll see, it's slow ...)

Test with disconnected scrollbars

jdc At the 9th European Tcl/Tk Users Meeting Luc and I made this example which is scrolling too slow:

package require Tk

grid [text .t -wrap none] [scrollbar .sy -orient vertical -command yset] -sticky ewns
grid [scrollbar .sx -orient horizontal -command xset] -sticky ewns
grid [entry .x -textvariable x]
grid [entry .y -textvariable y]
grid [button .a -text Adjust -command adjust]

grid rowconfigure . 0 -weight 1
grid columnconfigure . 0 -weight 1

set wx 150
set wy 50

set dta {}
set nx 10000
set ny 5000

# Execute code within this if statement to generate the LM.DAT input file
if {0} { 
    for {set i 0} {$i < $ny} {incr i} {
        set l {}
        for {set j 0} {$j < $nx} {incr j} {
            set c [lindex {a b c d e f g h i j k l} [expr {int(rand()*10)}]]
            lappend l $c $c
        }
        lappend dta $l
    }

    set f [open LM.DAT w]
    puts $f [join $dta \n]
    close $f
}

set f [open LM.DAT r]
set dta [split [read $f] \n]
close $f

.t tag configure a -background red
.t tag configure b -background green
.t tag configure c -background blue
.t tag configure d -background yellow

proc xset {cmd n {unit ""}} {
    global x nx wx
    switch -exact -- $cmd {
        scroll {
            switch -exact -- $unit {
                units {
                    incr x $n
                }
                pages {
                    incr x [expr {$n*$wx}]
                }
            }
        }
        moveto {
            set x [expr {int($n*$nx)}]
        }
    }
    if {$x < 0} {
        set x 0
    }
    if {$x+$wx > $nx} {
        set x [expr {$nx-$wx}]
    }
    adjust
}

proc yset {cmd n {unit ""}} {
    global y ny wy
    switch -exact -- $cmd {
        scroll {
            switch -exact -- $unit {
                units {
                    incr y $n
                }
                pages {
                    incr y [expr {$n*$wy}]
                }
            }
        }
        moveto {
            set y [expr {int($n*$ny)}]
        }
    }
    if {$y < 0} {
        set y 0
    }
    if {$y+$wy > $ny} {
        set y [expr {$ny-$wy}]
    }
    adjust
}

proc adjust {} {
    global x y dta nx ny wx wy
    .t delete 1.0 end
    for {set i $y} {$i < [expr {$y+$wy}]} {incr i} {
        set l [lindex $dta $i]
        set l [lrange $l [expr {$x*2}] [expr {($x+$wx)*2-1}]]
        .t insert end {*}$l \n ""
    }
    .sx set [expr {double($x)/$nx}] [expr {double($x+$wx-1)/$nx}]
    .sy set [expr {double($y)/$ny}] [expr {double($y+$wy-1)/$ny}]
}

set x 0
set y 0

adjust

Using a canvas iso a text widget, performance improves. The time to do one scroll adjust goes down from about 400ms in the text version to 100ms in the canvas version.

package require Tk

grid [canvas .t] [scrollbar .sy -orient vertical -command yset] -sticky ewns
grid [scrollbar .sx -orient horizontal -command xset] -sticky ewns
grid [entry .x -textvariable x]
grid [entry .y -textvariable y]
grid [button .a -text Adjust -command adjust]

grid rowconfigure . 0 -weight 1
grid columnconfigure . 0 -weight 1

set wx 150
set wy 30

set dta {}
set nx 10000
set ny 5000

# Execute code within this if statement to generate the LM.DAT input file
if {0} { 
    for {set i 0} {$i < $ny} {incr i} {
        set l {}
        for {set j 0} {$j < $nx} {incr j} {
            set c [lindex {a b c d e f g h i j k l} [expr {int(rand()*10)}]]
            lappend l $c $c
        }
        lappend dta $l
    }

    set f [open LM.DAT w]
    puts $f [join $dta \n]
    close $f
}

set f [open LM.DAT r]
set dta [split [read $f] \n]
close $f

set fill(a) red
set fill(b) green
set fill(c) blue
set fill(d) yellow
set fill(e) white
set fill(f) white
set fill(g) white
set fill(h) white
set fill(i) white
set fill(j) white
set fill(k) white
set fill(l) white

proc xset {cmd n {unit ""}} {
    global x nx wx
    switch -exact -- $cmd {
        scroll {
            switch -exact -- $unit {
                units {
                    incr x $n
                }
                pages {
                    incr x [expr {$n*$wx}]
                }
            }
        }
        moveto {
            set x [expr {int($n*$nx)}]
        }
    }
    if {$x < 0} {
        set x 0
    }
    if {$x+$wx > $nx} {
        set x [expr {$nx-$wx}]
    }
    adjust
}

proc yset {cmd n {unit ""}} {
    global y ny wy
    switch -exact -- $cmd {
        scroll {
            switch -exact -- $unit {
                units {
                    incr y $n
                }
                pages {
                    incr y [expr {$n*$wy}]
                }
            }
        }
        moveto {
            set y [expr {int($n*$ny)}]
        }
    }
    if {$y < 0} {
        set y 0
    }
    if {$y+$wy > $ny} {
        set y [expr {$ny-$wy}]
    }
    adjust
}

proc adjust {} {
    global x y dta nx ny wx wy fill
    .t delete b
    set by 12
    for {set i $y} {$i < [expr {$y+$wy}]} {incr i} {
        set l [lindex $dta $i]
        set l [lrange $l [expr {$x*2}] [expr {($x+$wx)*2-1}]]
        set bx 12
        foreach {b t} $l {
            .t create rectangle $bx $by [expr {$bx+12}] [expr {$by+12}] -fill $fill($t) -tag b
            .t create text [expr {$bx+2}] [expr {$by}] -anchor nw -text $b -tag b
            incr bx 12
        }
        incr by 12
    }
    .sx set [expr {double($x)/$nx}] [expr {double($x+$wx-1)/$nx}]
    .sy set [expr {double($y)/$ny}] [expr {double($y+$wy-1)/$ny}]
}

set x 0
set y 0
adjust

Simply changing the code to update the items on the canvas rather than recreating them is an obvious improvement.

proc adjust {} {
    global x y dta nx ny wx wy fill id

        set s [clock clicks -milli]
    #.t delete b
    set by 12
        set idx  0
    for {set i $y} {$i < [expr {$y+$wy}]} {incr i} {
        set l [lindex $dta $i]
        set l [lrange $l [expr {$x*2}] [expr {($x+$wx)*2-1}]]
        set bx 12
        foreach {b t} $l {
                        .t itemconfigure $id($idx,r) -fill $fill($t)
                        .t itemconfigure $id($idx,t) -text $b
            #incr bx 12
                        incr idx
        }
        incr by 12
    }
    .sx set [expr {double($x)/$nx}] [expr {double($x+$wx-1)/$nx}]
    .sy set [expr {double($y)/$ny}] [expr {double($y+$wy-1)/$ny}]
        set e [clock clicks -milli]
        puts "Ticks [expr {$e - $s}]"
}

proc make {} {
    global x y dta nx ny wx wy fill id
        set s [clock clicks -milli]
    .t delete b
    set by 12
        set idx 0
    for {set i $y} {$i < [expr {$y+$wy}]} {incr i} {
        set l [lindex $dta $i]
        set l [lrange $l [expr {$x*2}] [expr {($x+$wx)*2-1}]]
        set bx 12
        foreach {b t} $l {
            set id($idx,r) [.t create rectangle $bx $by [expr {$bx+12}] [expr {$by+12}] -fill $fill($t)]
            set id($idx,t) [.t create text [expr {$bx+2}] [expr {$by}] -anchor nw -text $b]
            incr bx 12
                        incr idx
        }
        incr by 12
    }
    .sx set [expr {double($x)/$nx}] [expr {double($x+$wx-1)/$nx}]
    .sy set [expr {double($y)/$ny}] [expr {double($y+$wy-1)/$ny}]
        set e [clock clicks -milli]
        puts "Ticks make [expr {$e - $s}]"
}

set x 0
set y 0

make

lm2010-07-15: Below is the same script as above, with improvements, but in a almost realistic configuration, i.e. number of sequences 50 of 150 characters long, and 20 tags. Unfortunatly, the performances are not so good ...

package require Tk

set LignesCoul "I white magenta
L white magenta
M white magenta
V white magenta
R white blue
K white blue
F white red
Y white red
W white red
D white forestgreen
E white forestgreen
Q white green
P white black
G black orange
H black cyan
N black cyan
S white darkviolet
T white darkviolet
A white darkviolet
C white darkviolet
. black darkslategrey"

proc JunkData {nseq len} {
    global dta
    
    set Lc [list A C D E F G H I K L M N P Q R S T V W Y . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .]
    set nv [llength $Lc]
    for {set i 0} {$i < $nseq} {incr i} {
        set l [list]
        for {set j 0} {$j < $len} {incr j} {
            set c [lindex $Lc [expr {int(rand()*$nv)}]]
            lappend l $c $c
        }
        lappend dta $l
    }

    return $dta
}

font create MyFont -family Courier -size 10
set Ch [lindex [font metrics MyFont] 5]
set Cw [font measure MyFont "Z"]

grid [canvas .t] [scrollbar .sy -orient vertical -command yset] -sticky ewns
grid [scrollbar .sx -orient horizontal -command xset] -sticky ewns
grid [entry .x -textvariable x]
grid [entry .y -textvariable y]
grid [button .a -text Adjust -command adjust]

grid rowconfigure . 0 -weight 1
grid columnconfigure . 0 -weight 1

set wx 150
set wy 30

set dta {}
set nx 10000
set ny 500
set dta [JunkData $ny $nx]

set CanW [expr {$wx*$Cw}]
set CanH [expr {$wy*$Ch}]
.t configure -width $CanW -height $CanH

foreach l [split $LignesCoul \n] {
    lassign [split $l " "] t f b
    set fill(Bg,$t) $b
    set fill(Fg,$t) $f 
}

proc xset {cmd n {unit ""}} {
    global x nx wx
    switch -exact -- $cmd {
        scroll {
            switch -exact -- $unit {
                units {
                    incr x $n
                }
                pages {
                    incr x [expr {$n*$wx}]
                }
            }
        }
        moveto {
            set x [expr {int($n*$nx)}]
        }
    }
    if {$x < 0} {
        set x 0
    }
    if {$x+$wx > $nx} {
        set x [expr {$nx-$wx}]
    }
    adjust
}

proc yset {cmd n {unit ""}} {
    global y ny wy
    switch -exact -- $cmd {
        scroll {
            switch -exact -- $unit {
                units {
                    incr y $n
                }
                pages {
                    incr y [expr {$n*$wy}]
                }
            }
        }
        moveto {
            set y [expr {int($n*$ny)}]
        }
    }
    if {$y < 0} {
        set y 0
    }
    if {$y+$wy > $ny} {
        set y [expr {$ny-$wy}]
    }
    adjust
}


proc adjust {} {
    global x y dta nx ny wx wy fill id Cw Ch
    
    set s [clock clicks -milli]
    #.t delete b
    set by $Ch
    set idx  0
    for {set i $y} {$i < $y+$wy} {incr i} {
        set l [lindex $dta $i]
        set l [lrange $l [expr {$x*2}] [expr {($x+$wx)*2-1}]]
        set bx $Cw
        foreach {b t} $l {
            .t itemconfigure $id($idx,r) \
                    -fill $fill(Bg,$b) \
                    -outline $fill(Bg,$b)
            .t itemconfigure $id($idx,t) \
                    -text $b \
                    -fill $fill(Fg,$b)
            #incr bx 12
            incr idx
        }
        incr by $Ch
    }
    .sx set [expr {double($x)/$nx}] [expr {double($x+$wx-1)/$nx}]
    .sy set [expr {double($y)/$ny}] [expr {double($y+$wy-1)/$ny}]
    set e [clock clicks -milli]
    puts "Ticks [expr {$e - $s}]"
}

proc make {} {
    global x y dta nx ny wx wy fill id Cw Ch

    set s [clock clicks -milli]
    .t delete b
    set by $Ch
    set idx 0
    for {set i $y} {$i < $y+$wy} {incr i} {
        set l [lindex $dta $i]
        set l [lrange $l [expr {$x*2}] [expr {($x+$wx)*2-1}]]
        set bx $Cw
        foreach {b t} $l {
            set id($idx,r) [.t create rectangle \
                    $bx $by [expr {$bx+$Cw}] [expr {$by+$Ch}] \
                    -outline $fill(Bg,$t) \
                    -fill $fill(Bg,$t)]
            set id($idx,t) [.t create text $bx $by \
                    -font MyFont \
                    -anchor nw \
                    -text $b \
                    -fill $fill(Fg,$t)]
            incr bx $Cw
            incr idx
        }
        incr by $Ch
    }
    .sx set [expr {double($x)/$nx}] [expr {double($x+$wx-1)/$nx}]
    .sy set [expr {double($y)/$ny}] [expr {double($y+$wy-1)/$ny}]
    set e [clock clicks -milli]
    puts "Ticks make [expr {$e - $s}]"
}

set x 0
set y 0

make

Solution

After some tests, it appears that the xlib function XDrawImageString is perfectly adapted to this case. The only problem is that XDrawImageString does not handle Xft fonts, but this does not matter really much as we should use Courier font to have aligned characters.

Problem

The WIN32 counterpart of XDrawImageString is :
DrawText(HDC, LPTSTR, length, RECT, format).
This function will draw a string in a given font with a color, and the background (defined by RECT, here set to the string bbox) with another color. The code to make DrawText works is as follow :

void
squareDrawImageString(Square *squarePtr,
                       Display *display,
                       Drawable d,
                       GC gc,
                       int x, int y, 
                       char *string,
                       int length)
{
  if (d == None) {
    return;
  }

  HDC hdc;
  LPTSTR lpchText;
  RECT rc;
  WinFont *fontPtr;
  HFONT oldFont;
  SubFont *lastSubFontPtr;
  int width, height;
  TkWinDCState state;
  Tk_Font tkfont = squarePtr->tkfont;
  
  //fontPtr = (WinFont *) gc->font;
  hdc = TkWinGetDrawableDC(display,d,&state);
  fontPtr = (WinFont *) tkfont; /* BEWARE ! tkfont is Tk_Font type ...*/
  //hdc = GetDC(fontPtr->hwnd);
  //hdc = GetDC(TkWinGetHWND(Tk_WindowId(squarePtr->tkwin)));

  lastSubFontPtr = &(fontPtr->subFontArray[0]);
  //oldFont = SelectObject(hdc, lastSubFontPtr->hFont);
    
  width     = squarePtr->charWidth;
  height    = squarePtr->charHeight;
  rc.left   = x;
  rc.top    = y;
  rc.right  = x + width;
  rc.bottom = y + height;
  
  SetBkMode(hdc, OPAQUE);
  SetBkColor(hdc, gc->background);
  SetTextColor(hdc, gc->foreground);
  
  lpchText = string;
  
  DrawText(hdc, lpchText, length, &rc, DT_CENTER);

  //SelectObject(hdc, oldFont);//ReleaseDC(fontPtr->hwnd, hdc);
  TkWinReleaseDrawableDC(d, hdc, &state);

  return;
}

The code above works perfectly although the font is not the right one, the one used here is the hdc one. If I uncomment the SelectObject statement to set the font, then wish crashes. The lastSubFont... and fontPtr ... statements have been taken from tk/win/tkWinFont.c.
I've checked that tkfont is OK by issuing a Tk_NameOfFont and Tk_GetFontMetrics.

Can someone help ?

Appendix :

Legend:

Color mapping:
- V,L,I,M white magenta
- R,K white blue
- F,Y,W white red
- D,E white forestgreen
- Q white green
- P white black
- G black orange
- H,N black cyan
- S,T,A,C white darkviolet
- .,- black darkslat