FastaFile

Currently just a starter done by DDG. Please add comments and suggestions.

jbr 2012-04-07 [1 ]

DDG 2004-02-21 small bugfix with returning empty lines via getSeq. makeing db a namespace variable, therefore avoiding multiple opening and closing of the indexfile


 package require snit 0.97
 package require oomk 
 package require md5pure

 snit::type FastaFile {
    option -filename ""
    option -indexfile ""
    variable db
    variable pvPositions
    constructor {args} {
        $self configurelist $args
        if {$options(-indexfile) eq ""} {
            set options(-indexfile) $options(-filename).mk
        }
        set db [mkstorage %AUTO% $options(-indexfile)]
        if {[file size $options(-indexfile)] < 2} {
            $self CreateIndex
        } else {
            [$db view positions] as pvPositions
        }
    };
    destructor {
        $db close
    }
    method CreateIndex {} {
        if [catch {open $options(-filename) r} infh] {
            puts stderr "Cannot open $options(-filename): $infh"
            return 0
        } else {
            #            set db [mkstorage %AUTO% $options(-indexfile)]
            $db layout positions {id md5 pos}
            [$db view positions] as pvPositions
            while {[gets $infh line] >= 0} {
                if {[regexp {^>([^\s]+)} $line -> id]} {
                    set pos [tell $infh]
                    incr pos [expr [string length $line] * -1 - 2]
                    set md5 [md5pure::md5 $id]
                    $pvPositions append id $id md5 $md5 pos $pos
                } 
            }
            close $infh
            $db commit 
            #$db close
            
        }
    };
    method getSeq {id args} {
        # get the sequence either via id or via the md5 hash of the id
        set arg(-md5) false
        array set arg $args
        #set db [mkstorage %AUTO% $options(-indexfile)]
        if {$arg(-md5)} {
            [$pvPositions select -exact md5 $id] as pSel
          
        } else {
            [$pvPositions select -exact id $id] as pSel
        }
        if {[$pSel size] > 0} { 
            set pos [$pSel get 0 pos]
        } 
        if {[info exists pos]} {
            incr pos -1
            if [catch {open $options(-filename) r} infh] {
                puts stderr "Cannot open $options(-filename): $infh"
                return ""
            } else { 
                seek $infh $pos
                set flag false
                while {[gets $infh line] >= 0} {
                    if {[regexp {^>} $line] && $flag} {
                        break
                    } elseif {[regexp {^>} $line]} {
                        set flag true
                        append seq "$line\n"
                    }  elseif {$flag} {
                        append seq "$line\n"
                    }
                }
                close $infh
                return $seq
            } 
        } else {
            return ""
        }
    }
 }
 proc test {} {
    set file 20050210-134509--ciona2.seq
    set sf [FastaFile %AUTO% -filename e:/links//project/goblet/data/goblet-results/$file]
    puts -nonewline [$sf getSeq ci0100130002]
    puts -nonewline [$sf getSeq cc9e61010e530d9eae11d52093cf2535 -md5 true]
            
 }

schlenk Nice. I have some FASTA utils in Tcl too, (reading, splitting heading/body, joining head/body, validation, an iterator over a large multi sequence file, splitting and joining multi sequence files) all pure Tcl. Not sure if i can open source them though, have to ask. One comment on your code: why don't you use the tcllib md5 package, its more or less identical to md5pure and you have a good chance its already installed by a lot of people.


DDG Agreed about possible use of md5. However I saw some bugs, problems in the tcllib mailing lists. And we need only to digest small strings. So may be the smaller md5pure is enough. My idea is also to use starkits as the main delivery platform. Then md5pure will be included as well. No tclkit user should have the need to install anything. Just start a tclkit-shell and begin working by fetching the latest biotcl.kit via http::geturl. I will release a first kit this week. Regarding your methods for FastaFile: It should be not difficult to implement them from scratch.

HJG 2012-04-04 What is the purpose of this program ? It looks like it is creating a file or db of some sort...