[Bioperl-l] Help Parsing FASTA Sequence File

Chris Fields cjfields at illinois.edu
Wed Dec 22 10:38:32 EST 2010


You might want to look at Bio::DB::Fasta, Bio::Index::Fasta, or Bio::DB::Flat (all of which index FASTA files), and use SQLite or similar to create a database for the score lookups.
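
To make the two steps concrete, here is a rough sketch in plain Python rather than BioPerl, so it does not show the API of any of the modules above. It loads the score rows into an SQLite table, picks the top names whose last column is not '0.0', and pulls the matching records out of the FASTA file through a simple byte-offset index. The function names, the table layout, and the assumption that every score row has exactly five whitespace-separated fields are all mine, for illustration only.

```python
import sqlite3


def load_scores(conn, lines):
    """Load whitespace-separated score rows into an SQLite table.

    Assumed layout (hypothetical): name, length=..., numreads=..., score, e-value.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scores "
        "(name TEXT, descr TEXT, score INTEGER, evalue TEXT)"
    )
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed rows
        name, length, numreads, score, evalue = fields
        conn.execute(
            "INSERT INTO scores VALUES (?, ?, ?, ?)",
            (name, f"{length} {numreads}", int(score), evalue),
        )


def top_names(conn, n=2):
    """Return the n highest-scoring names whose e-value text is not '0.0'."""
    rows = conn.execute(
        "SELECT name FROM scores WHERE evalue != '0.0' "
        "ORDER BY score DESC LIMIT ?",
        (n,),
    )
    return [row[0] for row in rows]


def index_fasta(path):
    """Map each record ID to the byte offset of its '>' header line."""
    offsets = {}
    with open(path, "rb") as fh:
        pos = fh.tell()
        line = fh.readline()
        while line:
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    offsets[fields[0].decode()] = pos
            pos = fh.tell()
            line = fh.readline()
    return offsets


def fetch_record(path, offset):
    """Return the header and sequence lines of the record starting at offset."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        record = [fh.readline()]  # the header line itself
        for line in fh:
            if line.startswith(b">"):
                break  # next record begins here
            record.append(line)
    return b"".join(record).decode()
```

For each score/data pair you would open the score file, pass its lines to load_scores on a fresh connection (for example sqlite3.connect(":memory:")), call top_names, and then write fetch_record(path, offsets[name]) for each returned name to the output file. The e-value is compared as text because the requirement is only "not '0.0'"; ranking by it would need a numeric conversion.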

chris

On Dec 9, 2010, at 6:50 AM, Fahmida wrote:

> 
> Hi,
> 
> I've several input 'score' files and their corresponding 'data' files like:
> score1.txt data1.txt
> score2.txt data2.txt
> ....
> ....
> 
> score1.txt
> 
> contig00002 length=671 numreads=17 1207 0.0
> contig00003 length=637 numreads=26 1205 0.0
> contig00052 length=535 numreads=10 607 e-176
> contig00072 length=472 numreads=46 571 e-165
> contig00019 length=667 numreads=5 474 e-136
> 
> This file has several rows and five columns. Columns 1-3 are
> names/descriptions; column 4 (1207, 1205, etc.) and column 5 (0.0, 0.0,
> e-176, etc.) contain the scores. I want to make a list of the top 2 names
> ranked by column 4 score, keeping only rows whose column 5 score is not
> '0.0'. For example, for the above data the output list would be:
> 
> contig00052 length=535 numreads=10
> contig00072 length=472 numreads=46
> 
> I then want to use the above list to extract the matching records from 'data1.txt':
> 
> data1.txt
> 
>> contig00001 length=567 numreads=35
> GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGAAAaCCAAGGGAGAAaGAAa
> CTACACTACTAATGGAAAaGATCTACATGCTAGAAAAa
>> contig00002 length=671 numreads=17
> GGGgCTGACGTGgCcGCTAATACGACTCACTATAGGgAGAGTTACTGTGGAGGGAGAGGC
> TTGCTCAAaTCCGCGTTCAAGGATTTCCAGATTGGTAAGAACTTCAGATT
>> contig00052 length=535 numreads=10
> GGGCTGACGTGgCCGCTAATACGACTCACTATAGGGAGAGATCGTGGCGATCGCCAATCA
> CCCAGGTGCCGTTAGCCA
>> contig00003 length=637 numreads=26
> GGGCTGACGTGgCCGCTAATACGACTCACTATAGGGAGAGATCGTGGCGATCGCCAATCA
> CCCAGGTGCCGTTAGCCAGAGCTG
>> contig00072 length=472 numreads=46
> GGGCTGACGTGgCCGCTAATACGACTCACTATAGGGAGAGTTTtCCCCAGGACCCTGGGA
> GGACCATGCCGTATGGGTGTCTAGTAAGTACAAaGCCATAATTCACATAAGTGAAATATT
> CTCAAGcACTAGGATC
>> contig00019 length=504 numreads=5
> GGGCTGACGTGGCCGCTAATACGACTCACTATAGGgAGAGATCTCACTAAAAAACTGGGG
> ATAACGCCT
> 
> 
> Example Output file:
> 
>> contig00052 length=535 numreads=10
> GGGCTGACGTGgCCGCTAATACGACTCACTATAGGGAGAGATCGTGGCGATCGCCAATCA
> CCCAGGTGCCGTTAGCCA
>> contig00072 length=472 numreads=46
> GGGCTGACGTGgCCGCTAATACGACTCACTATAGGGAGAGTTTtCCCCAGGACCCTGGGA
> GGACCATGCCGTATGGGTGTCTAGTAAGTACAAaGCCATAATTCACATAAGTGAAATATT
> CTCAAGcACTAGGATC
> 
> Any reply would be greatly appreciated.
> 
> -- 
> View this message in context: http://old.nabble.com/Help-Parsing-FASTA-Sequence-File-tp30416193p30416193.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
> 



