Version 2 of biotext: A new Tk widget dedicated to bioinformatics

Updated 2010-06-30 12:09:37 by jdc

This page is dedicated to the multiple sequence alignment visualisation problem. It explains the problem and, perhaps, hosts discussion on how to solve it!

The context

Since the beginning of DNA and protein sequencing, biologists have been interested in comparing sequences in order to understand their function, specificity, or regulation mechanism. Under selection pressure the sequence of a given protein will change, but the residues (amino acids) responsible for the function/specificity/regulation will remain, or will change more slowly than the other residues of the protein. By aligning sequences from different organisms it is then possible to identify such functional residues.

The 20 universal amino acids are usually each represented by a letter (for example A for alanine, R for arginine, W for tryptophan), so a protein sequence can be seen as a string. Aligning sequences is not a letter-to-letter comparison, however. Within the set of 20 amino acids, some are physico-chemically closer to each other than to the rest; for example, aspartic acid (D) and glutamic acid (E) are the only two acidic residues, while valine (V), leucine (L), isoleucine (I) and methionine (M) are hydrophobic and large. It has been shown that amino acids do not change randomly during evolution, but rather tend to stay in their physico-chemical subgroup. Substitution matrices have therefore been built to score the substitution of one amino acid by another.
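To illustrate the idea of scoring a substitution, here is a minimal sketch. The two groups come from the paragraph above, but the score values (2, 1, -1) and the proc names `pairScore` and `alignedScore` are assumptions made for this example; they are not taken from any published substitution matrix.

```tcl
# Toy substitution scoring. The score values are illustrative
# assumptions, not a published substitution matrix.
set groups {
    acidic      {D E}
    hydrophobic {V L I M}
}

# Build a residue -> group lookup from the group table.
set groupOf [dict create]
dict for {name residues} $groups {
    foreach r $residues {
        dict set groupOf $r $name
    }
}

# Score one aligned pair: identical > same group > unrelated.
proc pairScore {a b} {
    global groupOf
    if {$a eq $b} { return 2 }
    if {[dict exists $groupOf $a] && [dict exists $groupOf $b]
        && [dict get $groupOf $a] eq [dict get $groupOf $b]} {
        return 1
    }
    return -1
}

# Sum the pair scores over two aligned sequences of equal length.
proc alignedScore {s1 s2} {
    set total 0
    foreach a [split $s1 ""] b [split $s2 ""] {
        incr total [pairScore $a $b]
    }
    return $total
}

puts [alignedScore "DVLA" "EVIR"]   ;# 1 + 2 + 1 - 1 = 3
```

A real alignment program would also need a rule for gaps, which this sketch ignores.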

Numerous algorithms have been developed to align sequences. Almost all of them try to maximize a score derived from a substitution matrix. All programs produce a good multiple sequence alignment (MSA) when the sequences are very similar, but they fail when the sequences are distant or when the set of sequences to be aligned is too heterogeneous. Moreover, thanks to new sequencing technologies, there are now more than 1,000 completely sequenced genomes, ranging from bacteria to Homo sapiens. This yields a large number of sequences, with a wide range of diversity, to align for any given protein. Nowadays, a standard MSA is made of 200-500 sequences, each 300-20,000 residues long. After alignment, an MSA is thus a matrix of 200-500 x 3,000-20,000 characters.

The image below shows sequences before and after being aligned.

http://lbgi.igbmc.fr/~moumou/Images/Stacked_vs_Aligned_Seq.jpg

The Problem

Several programs have been developed to visualize MSAs. Most of them handle the representation of protein features well, such as showing the secondary structure of a 3D model within the MSA, or the domains that can be identified inside proteins. The problem arises when the biologist has to edit (adjust) the alignment. As said previously, alignment programs do not give an optimal alignment when dealing with distant sequences, or with multi-domain sequences in general, so the biologist must adjust the alignment manually. This is usually hard work, taking hours in front of a terminal. In the 80s, a lab at the University of Wisconsin developed a package (GCG) that includes, amongst other programs, an MSA editor, SeqLab. Some years ago the package was sold to the Accelrys company, which decided last summer to stop development and distribution of this software. In the meantime, other packages appeared (EMBOSS, JalView, Camera for example), but none of them provides a working MSA editor. Nowadays, many laboratories in genetics, evolution and bioinformatics still maintain the old GCG package only to use the SeqLab editor.

When editing an alignment, each of the 20 amino acids is displayed with a given foreground and background, generally linked to its physico-chemical properties: R and K are white on blue, G is black on orange, P is white on black, H and N are black on cyan, etc. (see Appendix). A typical editor window will display between 50 and 70 lines (sequences) of 150-250 characters each. All attempts to create a new editor, whatever the language used to code it, lead to the same result: the scrolling is not as smooth and fluid as in SeqLab, or it flashes, which disqualifies the software for production work. Slow or flashing scrolling quickly becomes painful when working for hours in front of a screen. Some editors (JalView) allow one to edit parts of the alignment, but they cannot be used for serious work either.
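To put the sizes quoted so far side by side, the following sketch simply multiplies the row and column ranges given in the text. The helper name `cells` is made up for this example.

```tcl
# Cell counts for the ranges quoted in the text (rows x columns).
proc cells {rows cols} {
    return [expr {$rows * $cols}]
}

puts "alignment: [cells 200 3000] to [cells 500 20000] characters"
puts "window:    [cells 50 150] to [cells 70 250] characters"
```

The visible window is thus a small fraction of the whole alignment, which suggests an editor should only ever draw the visible part.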

Solutions

As a reminder, the purpose is to display 50-70 lines of 150-250 characters, all tagged, taken from 200-500 lines of 2,000-20,000 characters each. With the help of some Tk gurus, I tried several things in order to create a working editor. Some of the attempts:

  • the least bad solution is to use a table (Tktable package) with one residue per cell and all paddings and the borderwidth set to 0. This gives quite fast results, although it still flashes and is only fluid-ish when scrolling large alignments.
  • a text widget whose scrollbars are disconnected. For each scrolling event, the widget is cleared and the zone to be displayed is inserted along with its tags in one go (thanks to Jos!). This is slow for 50 sequences by 150 characters.
  • a canvas widget with the same mechanism as above is also slow.
  • a pure text widget containing the whole tagged alignment is unusable when scrolling vertically.
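The "zone to be displayed" used by the disconnected-scrollbar attempts can be sketched in pure Tcl as follows. This assumes the alignment is held as a list of equal-length strings, which is a simplification chosen for this example; the proc name `visibleWindow` is hypothetical.

```tcl
# Return the visible window of an alignment stored as a list of
# equal-length strings: rows y..y+h-1, columns x..x+w-1.
proc visibleWindow {rows x y w h} {
    set out {}
    foreach row [lrange $rows $y [expr {$y + $h - 1}]] {
        lappend out [string range $row $x [expr {$x + $w - 1}]]
    }
    return $out
}

set msa {
    MKV-LLAGD
    MKVALL-GD
    MRV-LIAGE
    MKI-LLSGD
}

# Two rows starting at row 1, four columns starting at column 2.
puts [join [visibleWindow $msa 2 1 4 2] \n]
```

Extracting the window is cheap; in the attempts listed above, the cost lies in handing the tagged characters to the widget on every scroll event.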

A New Widget

After trying so many things, it appears to be worth the investment to write a new Tk widget. It should allow 1) fast and fluid scrolling even with large alignments, 2) mapping of residues according to a map given by the user, 3) grouping of sequences, 4) insertion/deletion of rows/columns.
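As an illustration of requirement 4, column insertion and deletion can be sketched at the script level. This is not the widget's implementation; it assumes the alignment is a list of equal-length strings, and the proc names `insertGapColumn` and `deleteColumn` are invented for this example.

```tcl
# Insert a gap column before index col in every row.
proc insertGapColumn {rows col {gap -}} {
    set out {}
    foreach row $rows {
        set left  [string range $row 0 [expr {$col - 1}]]
        set right [string range $row $col end]
        lappend out $left$gap$right
    }
    return $out
}

# Delete the column at index col from every row.
proc deleteColumn {rows col} {
    set out {}
    foreach row $rows {
        lappend out [string replace $row $col $col]
    }
    return $out
}

set msa {MKVLLAGD MRVLIAGE}
set gapped [insertGapColumn $msa 3]
puts $gapped                    ;# MKV-LLAGD MRV-LIAGE
puts [deleteColumn $gapped 3]   ;# MKVLLAGD MRVLIAGE
```

With this representation, rows can be inserted and deleted with the built-in `linsert` and `lreplace` list commands.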

Characteristics of the widget

Editing and visualising an MSA has some particular characteristics:

  • the document to be visualized is a matrix of single characters
  • there is only one font, which MUST BE of fixed width (Courier, Courier New)
  • a given character will always have the same tag associated with it: for example, an 'F' will always be displayed with a red background and a white foreground.
  • you can only insert/delete rows or columns.
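Because a character always carries the same tag, the colouring reduces to a fixed lookup table. The sketch below builds one from the colour pairs listed in the Appendix. The proc name `residueColours` and the black-on-white default for unmapped characters are assumptions made for this example.

```tcl
# Foreground/background pairs copied from the Appendix colour mapping.
# The gap entry (".", "-") is left out because its background colour
# name is incomplete in the Appendix.
set colourGroups {
    {V L I M}   white magenta
    {R K}       white blue
    {F Y W}     white red
    {D E}       white forestgreen
    {Q}         white green
    {P}         white black
    {G}         black orange
    {H N}       black cyan
    {S T A C}   white darkviolet
}

# Expand the groups into a residue -> {foreground background} lookup.
set colourOf [dict create]
foreach {residues fg bg} $colourGroups {
    foreach r $residues {
        dict set colourOf $r [list $fg $bg]
    }
}

# Look up the colours of one character.
proc residueColours {r} {
    global colourOf
    if {[dict exists $colourOf $r]} {
        return [dict get $colourOf $r]
    }
    return {black white}   ;# assumed default for unmapped characters
}

puts [residueColours F]   ;# white red
```

With a Tk text widget, each pair could feed the `-foreground` and `-background` options of `tag configure`, one tag per character.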

A first attempt using the Xlib XDrawImageString function gives me perfect results. Using Tk_Fill3DRectangle/Tk_DrawChars to emulate XDrawImageString does not give a fluid widget when scrolling. IMPORTANT NOTE: I'm new to the Tk C API, and I'm not used to coding in C, so my code may be rubbish!
You can find a tar file containing the code, a compiled Linux version, and an example at ftp://lbgi.igbmc.fr/~moumou/Docs/biotext.tar.

Test with disconnected scrollbars

At the 9th European Tcl/Tk Users Meeting, Luc and I made this example, which scrolls too slowly:

package require Tk

grid [text .t -wrap none] [scrollbar .sy -orient vertical -command yset] -sticky ewns
grid [scrollbar .sx -orient horizontal -command xset] -sticky ewns
grid [entry .x -textvariable x]
grid [entry .y -textvariable y]
grid [button .a -text Adjust -command adjust]

grid rowconfigure . 0 -weight 1
grid columnconfigure . 0 -weight 1

set wx 150
set wy 30

set dta {}
set nx 10000
set ny 5000

# Generate the LM.DAT input file on the first run. The file is large
# (about 200 MB), so this step takes a while.
if {![file exists LM.DAT]} {
    for {set i 0} {$i < $ny} {incr i} {
        set l {}
        for {set j 0} {$j < $nx} {incr j} {
            set c [lindex {a b c d e f g h i j k l} [expr {int(rand()*10)}]]
            lappend l $c $c
        }
        lappend dta $l
    }

    set f [open LM.DAT w]
    puts $f [join $dta \n]
    close $f
}

# Read the data back: one row per line, each row a list of char/tag pairs.
set f [open LM.DAT r]
set dta [split [read -nonewline $f] \n]
close $f

.t tag configure a -background red
.t tag configure b -background green
.t tag configure c -background blue
.t tag configure d -background yellow

proc xset {cmd n {unit ""}} {
    global x nx wx
    switch -exact -- $cmd {
        scroll {
            switch -exact -- $unit {
                units {
                    incr x $n
                }
                pages {
                    incr x [expr {$n*$wx}]
                }
            }
        }
        moveto {
            set x [expr {int($n*$nx)}]
        }
    }
    if {$x < 0} {
        set x 0
    }
    if {$x+$wx > $nx} {
        set x [expr {$nx-$wx}]
    }
    adjust
}

proc yset {cmd n {unit ""}} {
    global y ny wy
    switch -exact -- $cmd {
        scroll {
            switch -exact -- $unit {
                units {
                    incr y $n
                }
                pages {
                    incr y [expr {$n*$wy}]
                }
            }
        }
        moveto {
            set y [expr {int($n*$ny)}]
        }
    }
    if {$y < 0} {
        set y 0
    }
    if {$y+$wy > $ny} {
        set y [expr {$ny-$wy}]
    }
    adjust
}

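# Redraw the visible window: clear the widget, insert rows y..y+wy-1 and
# columns x..x+wx-1 with their tags, then update both scrollbars.
proc adjust {} {
    global x y dta nx ny wx wy
    .t delete 1.0 end
    for {set i $y} {$i < $y+$wy} {incr i} {
        set l [lindex $dta $i]
        # Each residue takes two list elements: the character and its tag.
        set l [lrange $l [expr {$x*2}] [expr {($x+$wx)*2-1}]]
        .t insert end {*}$l \n ""
    }
    .sx set [expr {double($x)/$nx}] [expr {double($x+$wx-1)/$nx}]
    .sy set [expr {double($y)/$ny}] [expr {double($y+$wy-1)/$ny}]
}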

set x 0
set y 0

adjust

Appendix

Color mapping (residues, foreground, background):
- V,L,I,M white magenta
- R,K white blue
- F,Y,W white red
- D,E white forestgreen
- Q white green
- P white black
- G black orange
- H,N black cyan
- S,T,A,C white darkviolet
- .,- black darkslat