lm 2010-06-30: This page is dedicated to the multiple sequences alignment visualisation problem. It contains explanations of the problems, and, maybe, discussions to solve it !
Since the beginning of DNA and protein sequencing, biologists were interested in comparing sequences in order to understand their function, specificity, or regulation mechanism. Indeed, upon selection pressure the sequence of a given protein will change, but the residues (amino acids) responsible for the funcion/specificity/regulation will remain, or change with a slower speed compared to other residues of the protein. By aligning sequences from different organisms it is then possible to identify such functional residues.
The universal 20 amino acids are usually represented by a letter (A for alanine, R for arginine, W for tryptophane for example), and so a protein sequence can be seen as a string. Aligning sequences is not a letter-to-letter comparison. Indeed, inside the set of the 20 aminoacids some are physico-chemically closer to others; for example, Aspartic acid (D) and glutamic acid (E) are the only two acidic residues, Valine (V), Leucine (L), Isoleucine (I) and Methionine (M) are hydrophobic and large. It has been shown that amino acids are not randomly changed during evolution, but rather they tend to stay in their physico-chemical subgroup. Some people build substitution matrices, scoring the substitution of an aminoacid by another.
Numerous algorithms have been developed to align sequences. They almost all try to maximize a score linked to the use of a substitution matrix. All programs are able to give a good multiple sequence alignment (MSA) when sequences are very similar, but they fail when sequences are distant, or when the set of sequences to be aligned is too heterogenous. Indeed, because of the new sequencing technologies, there are now more than 1,000 completely sequenced genomes, ranging from bacteria up to homo sapiens. This leads to a large amount of sequences to align for a given protein with a wide range of diversity. Nowadays, a standard MSA is made of 200-500 sequences, each being 300-20,000 residues long. After alignemnt, a MSA is then a matrix of characters of 200-500 x 3,000 - 20,000 characters.
The image below shows sequences before and after being aligned.
Several softwares have been developed to visualize MSA. Most of them are dealing fine with representation of protein features, like showing the secondary structure of a 3D model in the MSA, or the domains that can be identified inside proteins. The problem arises when the biologist should edit (adjust) the alignment. As said previously, alignment programs do not give optimal alignment when dealing with distant sequences, or multi-domain sequences in general. This implies the biologist should manually adjust the alignment. This is usually hard work, taking hours in front of a terminal. In the 80s, a package was developed by a lab of the Wisconsin University that includes, amongst other programs, a MSA editor, SeqLab. Some years ago the package was sold to the Accelrys company which decided last summer to stop development and distribution of this software. During this time, other packages appeared (EMBOss, JalView, Camera for example), but none of them provided a working MSA editor. Nowadays, a lot of laboratories in genetics, evolution, bioinformatics, still maintain the old GCG package only to use the SeqLab editor.
When editing an alignment, all 20 aminoacids are displayed with a given background and foreground, generally linked to their physico-chemical properties: R and K have white and blue foreground and background respectively, G is black/orange, P is white/black, H and N are black/cyan, etc ... (see Appendix). A typical editor window wiill display between 50 to 70 lines (sequences) and 150-250 characters for each. All the attempts to create a new editor, whatever the langage used to code it, lead to the same results: the scrolling is not soft and fluid as in SeqLab, or flashes, disqualifying the software for production work. Indeed a slow or flashing scrolling became quickly painful when working hours in front of a screen. Some editors (JalView) allow one to edit parts of the alignment, but they can not be used for serious work either.
As a reminder, the purpose is to display 50-70 lines of 150-250 characters, all tagged, taken from 200-500 lines of 2,000-20,000 characters long. With the help of some Tk gurus, I tried several things in order to create a working editor. Some trials :
After trying so many things, it appears it's worth the investment of writing a new Tk widget. It should allow 1) fast and fluid scrolling even with large alignment, 2) mapping of residues according to a map given by the user, 3) grouping sequences, 4) insertion/deletion of rows/columns.
The edition/visualisation of MSA has some peculiar characteristics:
A first attempt gives me perfect results, using the xlib XDrawImageString function. The use of Tk_Fill3DRectangle/Tk_DrawChars to emulate the XDrawImageString function doesn't give a fluid widget upon scrolling. IMPORTANT NOTE : I'm new to the Tk C API, and I'm not used to coding in C. So my code may be rubbish !!
You can find a tar file containing the code, Linux compiled version, and example at [L1 ].
(You can try the Tk_Fill3DRectangle/Tk_DrawChars couple by editing src/tkBiotext.c , search for tkslow, change if (0) to if (1) , and recompile with the PourCompiler file ... You'll see, it's slow ...)
jdc At the 9th European Tcl/Tk Users Meeting Luc and I made this example which is scrolling too slow:
package require Tk grid [text .t -wrap none] [scrollbar .sy -orient vertical -command yset] -sticky ewns grid [scrollbar .sx -orient horizontal -command xset] -sticky ewns grid [entry .x -textvariable x] grid [entry .y -textvariable y] grid [button .a -text Adjust -command adjust] grid rowconfigure . 0 -weight 1 grid columnconfigure . 0 -weight 1 set wx 150 set wy 50 set dta {} set nx 10000 set ny 5000 # Execute code within this if statement to generate the LM.DAT input file if {0} { for {set i 0} {$i < $ny} {incr i} { set l {} for {set j 0} {$j < $nx} {incr j} { set c [lindex {a b c d e f g h i j k l} [expr {int(rand()*10)}]] lappend l $c $c } lappend dta $l } set f [open LM.DAT w] puts $f [join $dta \n] close $f } set f [open LM.DAT r] set dta [split [read $f] \n] close $f .t tag configure a -background red .t tag configure b -background green .t tag configure c -background blue .t tag configure d -background yellow proc xset {cmd n {unit ""}} { global x nx wx switch -exact -- $cmd { scroll { switch -exact -- $unit { units { incr x $n } pages { incr x [expr {$n*$wx}] } } } moveto { set x [expr {int($n*$nx)}] } } if {$x < 0} { set x 0 } if {$x+$wx > $nx} { set x [expr {$nx-$wx}] } adjust } proc yset {cmd n {unit ""}} { global y ny wy switch -exact -- $cmd { scroll { switch -exact -- $unit { units { incr y $n } pages { incr y [expr {$n*$wy}] } } } moveto { set y [expr {int($n*$ny)}] } } if {$y < 0} { set y 0 } if {$y+$wy > $ny} { set y [expr {$ny-$wy}] } adjust } proc adjust {} { global x y dta nx ny wx wy .t delete 1.0 end for {set i $y} {$i < [expr {$y+$wy}]} {incr i} { set l [lindex $dta $i] set l [lrange $l [expr {$x*2}] [expr {($x+$wx)*2-1}]] .t insert end {*}$l \n "" } .sx set [expr {double($x)/$nx}] [expr {double($x+$wx-1)/$nx}] .sy set [expr {double($y)/$ny}] [expr {double($y+$wy-1)/$ny}] } set x 0 set y 0 adjust
Using a canvas iso a text widget, performance improves. The time to do one scroll adjust goes down from about 400ms in the text version to 100ms in the canvas version.
package require Tk grid [canvas .t] [scrollbar .sy -orient vertical -command yset] -sticky ewns grid [scrollbar .sx -orient horizontal -command xset] -sticky ewns grid [entry .x -textvariable x] grid [entry .y -textvariable y] grid [button .a -text Adjust -command adjust] grid rowconfigure . 0 -weight 1 grid columnconfigure . 0 -weight 1 set wx 150 set wy 30 set dta {} set nx 10000 set ny 5000 # Execute code within this if statement to generate the LM.DAT input file if {0} { for {set i 0} {$i < $ny} {incr i} { set l {} for {set j 0} {$j < $nx} {incr j} { set c [lindex {a b c d e f g h i j k l} [expr {int(rand()*10)}]] lappend l $c $c } lappend dta $l } set f [open LM.DAT w] puts $f [join $dta \n] close $f } set f [open LM.DAT r] set dta [split [read $f] \n] close $f set fill(a) red set fill(b) green set fill(c) blue set fill(d) yellow set fill(e) white set fill(f) white set fill(g) white set fill(h) white set fill(i) white set fill(j) white set fill(k) white set fill(l) white proc xset {cmd n {unit ""}} { global x nx wx switch -exact -- $cmd { scroll { switch -exact -- $unit { units { incr x $n } pages { incr x [expr {$n*$wx}] } } } moveto { set x [expr {int($n*$nx)}] } } if {$x < 0} { set x 0 } if {$x+$wx > $nx} { set x [expr {$nx-$wx}] } adjust } proc yset {cmd n {unit ""}} { global y ny wy switch -exact -- $cmd { scroll { switch -exact -- $unit { units { incr y $n } pages { incr y [expr {$n*$wy}] } } } moveto { set y [expr {int($n*$ny)}] } } if {$y < 0} { set y 0 } if {$y+$wy > $ny} { set y [expr {$ny-$wy}] } adjust } proc adjust {} { global x y dta nx ny wx wy fill .t delete b set by 12 for {set i $y} {$i < [expr {$y+$wy}]} {incr i} { set l [lindex $dta $i] set l [lrange $l [expr {$x*2}] [expr {($x+$wx)*2-1}]] set bx 12 foreach {b t} $l { .t create rectangle $bx $by [expr {$bx+12}] [expr {$by+12}] -fill $fill($t) -tag b .t create text [expr {$bx+2}] [expr {$by}] -anchor nw -text $b -tag b incr bx 12 } incr by 12 } .sx set [expr {double($x)/$nx}] [expr {double($x+$wx-1)/$nx}] .sy set [expr {double($y)/$ny}] [expr {double($y+$wy-1)/$ny}] } set x 0 set y 0 adjust
Simply changing the code to update the items on the canvas rather than recreating them is an obvious improvement.
proc adjust {} { global x y dta nx ny wx wy fill id set s [clock clicks -milli] #.t delete b set by 12 set idx 0 for {set i $y} {$i < [expr {$y+$wy}]} {incr i} { set l [lindex $dta $i] set l [lrange $l [expr {$x*2}] [expr {($x+$wx)*2-1}]] set bx 12 foreach {b t} $l { .t itemconfigure $id($idx,r) -fill $fill($t) .t itemconfigure $id($idx,t) -text $b #incr bx 12 incr idx } incr by 12 } .sx set [expr {double($x)/$nx}] [expr {double($x+$wx-1)/$nx}] .sy set [expr {double($y)/$ny}] [expr {double($y+$wy-1)/$ny}] set e [clock clicks -milli] puts "Ticks [expr {$e - $s}]" } proc make {} { global x y dta nx ny wx wy fill id set s [clock clicks -milli] .t delete b set by 12 set idx 0 for {set i $y} {$i < [expr {$y+$wy}]} {incr i} { set l [lindex $dta $i] set l [lrange $l [expr {$x*2}] [expr {($x+$wx)*2-1}]] set bx 12 foreach {b t} $l { set id($idx,r) [.t create rectangle $bx $by [expr {$bx+12}] [expr {$by+12}] -fill $fill($t)] set id($idx,t) [.t create text [expr {$bx+2}] [expr {$by}] -anchor nw -text $b] incr bx 12 incr idx } incr by 12 } .sx set [expr {double($x)/$nx}] [expr {double($x+$wx-1)/$nx}] .sy set [expr {double($y)/$ny}] [expr {double($y+$wy-1)/$ny}] set e [clock clicks -milli] puts "Ticks make [expr {$e - $s}]" } set x 0 set y 0 make
lm2010-07-15: Below is the same script as above, with improvements, but in a almost realistic configuration, i.e. number of sequences 50 of 150 characters long, and 20 tags. Unfortunatly, the performances are not so good ...
package require Tk set LignesCoul "I white magenta L white magenta M white magenta V white magenta R white blue K white blue F white red Y white red W white red D white forestgreen E white forestgreen Q white green P white black G black orange H black cyan N black cyan S white darkviolet T white darkviolet A white darkviolet C white darkviolet . black darkslategrey" proc JunkData {nseq len} { global dta set Lc [list A C D E F G H I K L M N P Q R S T V W Y . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .] set nv [llength $Lc] for {set i 0} {$i < $nseq} {incr i} { set l [list] for {set j 0} {$j < $len} {incr j} { set c [lindex $Lc [expr {int(rand()*$nv)}]] lappend l $c $c } lappend dta $l } return $dta } font create MyFont -family Courier -size 10 set Ch [lindex [font metrics MyFont] 5] set Cw [font measure MyFont "Z"] grid [canvas .t] [scrollbar .sy -orient vertical -command yset] -sticky ewns grid [scrollbar .sx -orient horizontal -command xset] -sticky ewns grid [entry .x -textvariable x] grid [entry .y -textvariable y] grid [button .a -text Adjust -command adjust] grid rowconfigure . 0 -weight 1 grid columnconfigure . 0 -weight 1 set wx 150 set wy 30 set dta {} set nx 10000 set ny 500 set dta [JunkData $ny $nx] set CanW [expr {$wx*$Cw}] set CanH [expr {$wy*$Ch}] .t configure -width $CanW -height $CanH foreach l [split $LignesCoul \n] { lassign [split $l " "] t f b set fill(Bg,$t) $b set fill(Fg,$t) $f } proc xset {cmd n {unit ""}} { global x nx wx switch -exact -- $cmd { scroll { switch -exact -- $unit { units { incr x $n } pages { incr x [expr {$n*$wx}] } } } moveto { set x [expr {int($n*$nx)}] } } if {$x < 0} { set x 0 } if {$x+$wx > $nx} { set x [expr {$nx-$wx}] } adjust } proc yset {cmd n {unit ""}} { global y ny wy switch -exact -- $cmd { scroll { switch -exact -- $unit { units { incr y $n } pages { incr y [expr {$n*$wy}] } } } moveto { set y [expr {int($n*$ny)}] } } if {$y < 0} { set y 0 } if {$y+$wy > $ny} { set y [expr {$ny-$wy}] } adjust } proc adjust {} { global x y dta nx ny wx wy fill id Cw Ch set s [clock clicks -milli] #.t delete b set by $Ch set idx 0 for {set i $y} {$i < $y+$wy} {incr i} { set l [lindex $dta $i] set l [lrange $l [expr {$x*2}] [expr {($x+$wx)*2-1}]] set bx $Cw foreach {b t} $l { .t itemconfigure $id($idx,r) \ -fill $fill(Bg,$b) \ -outline $fill(Bg,$b) .t itemconfigure $id($idx,t) \ -text $b \ -fill $fill(Fg,$b) #incr bx 12 incr idx } incr by $Ch } .sx set [expr {double($x)/$nx}] [expr {double($x+$wx-1)/$nx}] .sy set [expr {double($y)/$ny}] [expr {double($y+$wy-1)/$ny}] set e [clock clicks -milli] puts "Ticks [expr {$e - $s}]" } proc make {} { global x y dta nx ny wx wy fill id Cw Ch set s [clock clicks -milli] .t delete b set by $Ch set idx 0 for {set i $y} {$i < $y+$wy} {incr i} { set l [lindex $dta $i] set l [lrange $l [expr {$x*2}] [expr {($x+$wx)*2-1}]] set bx $Cw foreach {b t} $l { set id($idx,r) [.t create rectangle \ $bx $by [expr {$bx+$Cw}] [expr {$by+$Ch}] \ -outline $fill(Bg,$t) \ -fill $fill(Bg,$t)] set id($idx,t) [.t create text $bx $by \ -font MyFont \ -anchor nw \ -text $b \ -fill $fill(Fg,$t)] incr bx $Cw incr idx } incr by $Ch } .sx set [expr {double($x)/$nx}] [expr {double($x+$wx-1)/$nx}] .sy set [expr {double($y)/$ny}] [expr {double($y+$wy-1)/$ny}] set e [clock clicks -milli] puts "Ticks make [expr {$e - $s}]" } set x 0 set y 0 make
After some tests, it appears that the xlib function XDrawImageString is perfectly adapted to this case. The only problem is that XDrawImageString does not handle Xft fonts, but this does not matter really much as we should use Courier font to have aligned characters.
The WIN32 counterpart of XDrawImageString is :
DrawText(HDC, LPTSTR, length, RECT, format).
This function will draw a string in a given font with a color, and the background (defined by RECT, here set to the string bbox) with another color. The code to make DrawText works is as follow :
void squareDrawImageString(Square *squarePtr, Display *display, Drawable d, GC gc, int x, int y, char *string, int length) { if (d == None) { return; } HDC hdc; LPTSTR lpchText; RECT rc; WinFont *fontPtr; HFONT oldFont; SubFont *lastSubFontPtr; int width, height; TkWinDCState state; Tk_Font tkfont = squarePtr->tkfont; //fontPtr = (WinFont *) gc->font; hdc = TkWinGetDrawableDC(display,d,&state); fontPtr = (WinFont *) tkfont; /* BEWARE ! tkfont is Tk_Font type ...*/ //hdc = GetDC(fontPtr->hwnd); //hdc = GetDC(TkWinGetHWND(Tk_WindowId(squarePtr->tkwin))); lastSubFontPtr = &(fontPtr->subFontArray[0]); //oldFont = SelectObject(hdc, lastSubFontPtr->hFont); width = squarePtr->charWidth; height = squarePtr->charHeight; rc.left = x; rc.top = y; rc.right = x + width; rc.bottom = y + height; SetBkMode(hdc, OPAQUE); SetBkColor(hdc, gc->background); SetTextColor(hdc, gc->foreground); lpchText = string; DrawText(hdc, lpchText, length, &rc, DT_CENTER); //SelectObject(hdc, oldFont);//ReleaseDC(fontPtr->hwnd, hdc); TkWinReleaseDrawableDC(d, hdc, &state); return; }
The code above works perfectly although the font is not the right one, the one used here is the hdc one. If I uncomment the SelectObject statement to set the font, then wish crashes. The lastSubFont... and fontPtr ... statements have been taken from tk/win/tkWinFont.c.
I've checked that tkfont is OK by issuing a Tk_NameOfFont and Tk_GetFontMetrics.
Can someone help ?
Color mapping:
- V,L,I,M white magenta
- R,K white blue
- F,Y,W white red
- D,E white forestgreen
- Q white green
- P white black
- G black orange
- H,N black cyan
- S,T,A,C white darkviolet
- .,- black darkslat