Currently just a starter done by [DDG]. Please add comments and suggestions. [jbr] 2012-04-07 [http://en.wikipedia.org/wiki/FASTA_format] [DDG] 2004-02-21 small bugfix with returning empty lines via getSeq. makeing db a namespace variable, therefore avoiding multiple opening and closing of the indexfile ---- package require snit 0.97 package require oomk package require md5pure snit::type FastaFile { option -filename "" option -indexfile "" variable db variable pvPositions constructor {args} { $self configurelist $args if {$options(-indexfile) eq ""} { set options(-indexfile) $options(-filename).mk } set db [mkstorage %AUTO% $options(-indexfile)] if {[file size $options(-indexfile)] < 2} { $self CreateIndex } else { [$db view positions] as pvPositions } }; destructor { $db close } method CreateIndex {} { if [catch {open $options(-filename) r} infh] { puts stderr "Cannot open $options(-filename): $infh" return 0 } else { # set db [mkstorage %AUTO% $options(-indexfile)] $db layout positions {id md5 pos} [$db view positions] as pvPositions while {[gets $infh line] >= 0} { if {[regexp {^>([^\s]+)} $line -> id]} { set pos [tell $infh] incr pos [expr [string length $line] * -1 - 2] set md5 [md5pure::md5 $id] $pvPositions append id $id md5 $md5 pos $pos } } close $infh $db commit #$db close } }; method getSeq {id args} { # get the sequence either via id or via the md5 hash of the id set arg(-md5) false array set arg $args #set db [mkstorage %AUTO% $options(-indexfile)] if {$arg(-md5)} { [$pvPositions select -exact md5 $id] as pSel } else { [$pvPositions select -exact id $id] as pSel } if {[$pSel size] > 0} { set pos [$pSel get 0 pos] } if {[info exists pos]} { incr pos -1 if [catch {open $options(-filename) r} infh] { puts stderr "Cannot open $options(-filename): $infh" return "" } else { seek $infh $pos set flag false while {[gets $infh line] >= 0} { if {[regexp {^>} $line] && $flag} { break } elseif {[regexp {^>} $line]} { set flag true append seq "$line\n" } elseif {$flag} { append seq "$line\n" } } close $infh return $seq } } else { return "" } } } proc test {} { set file 20050210-134509--ciona2.seq set sf [FastaFile %AUTO% -filename e:/links//project/goblet/data/goblet-results/$file] puts -nonewline [$sf getSeq ci0100130002] puts -nonewline [$sf getSeq cc9e61010e530d9eae11d52093cf2535 -md5 true] } ---- [schlenk] Nice. I have some FASTA utils in Tcl too, (reading, splitting heading/body, joining head/body, validation, an iterator over a large multi sequence file, splitting and joining multi sequence files) all pure Tcl. Not sure if i can open source them though, have to ask. One comment on your code: why don't you use the tcllib md5 package, its more or less identical to md5pure and you have a good chance its already installed by a lot of people. ---- [DDG] Agreed about possible use of md5. However I saw some bugs, problems in the tcllib mailing lists. And we need only to digest small strings. So may be the smaller md5pure is enough. My idea is also to use starkits as the main delivery platform. Then md5pure will be included as well. No tclkit user should have the need to install anything. Just start a tclkit-shell and begin working by fetching the latest biotcl.kit via http::geturl. I will release a first kit this week. Regarding your methods for FastaFile: It should be not difficult to implement them from scratch. [HJG] 2012-04-04 What is the purpose of this program ? It looks like it is creating a file or db of some sort... ---- [Category Science] - [Category File]