[Bioperl-l] genbank

Dimitar Kenanov dimitark at bii.a-star.edu.sg
Tue Nov 30 04:04:30 EST 2010


On 11/30/2010 11:42 AM, Chris Fields wrote:
> On Nov 29, 2010, at 8:54 PM, Dimitar Kenanov wrote:
>
>    
>> On 11/30/2010 10:19 AM, Jason Stajich wrote:
>>      
>>> Dimitar -
>>>
>>> In terms of your question - a GenBank db query previously (ie 4-5 years ago when this was written) WOULD NOT return a sequence if a GenPept ID was specified so we had to have a separate module for GenBank and GenPept db querying since there was a different set of parameters - I think that changed so that most of the queries can run through GenBank
>>>
>>> I see that must have been improved at NCBI.  For the record if you want the full GenPept record with features and annotations you just request a different db, in this case 'gb' for genbank instead of the fasta source
>>> As in: http://gist.github.com/721012
>>>
>>> But maybe you already figured it out?
>>>
>>> -jason
>>> Chris Fields wrote, On 11/29/10 6:39 AM:
>>>        
>>>> On Nov 29, 2010, at 3:35 AM, Dimitar Kenanov wrote:
>>>>
>>>>          
>>>>> Hi again,
>>>>> it seems that when i download (with 'download_query_genbank.pl') the whole proteome from NCBI in fasta format it is first being downloaded and from it is being created some kind of SeqFastaSpeedFactory and after that from it is being copied to the output file. But i want to download and write to output file one by one so i can see the download progress(which is working for genbank data).
>>>>>
>>>>> Its frustrating :)
>>>>>
>>>>> Any ideas where to look for solution
>>>>> Cheers
>>>>> Dimitar
>>>>>            
>>>> You can't do this with the default script, but you can use a modified version and, where you are retrieving a sequence stream, in the last four lines:
>>>>
>>>> my $stream = $dbh->get_Stream_by_query($query);
>>>> while( my $seq = $stream->next_seq ) {
>>>>     $out->write_seq($seq);
>>>> }
>>>>
>>>> insert an iterator in the loop that indicates progress.  Realize the sequence data is processed through Bio::SeqIO, so it won't be exactly the same as what is retrieved from GenBank, but it should be very close.
>>>>
>>>> If you want raw sequence, you can use Bio::DB::EUtilities, but it's a bit more complicated.
>>>>
>>>> chris
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>          
>>>        
>> Thank you Jason,
>> i figured that out yes :)
>> I know am wandering how to get the GI list so i can download by ID or ACC and not by query when i download fasta. I have the simple code:
>> -------------
>> my $query_str='Caenorhabditis elegans[organism] AND refseq[filter]';
>> my $query=Bio::DB::Query::GenBank->new(-db=>'protein',
>>                                        -query=>$query_str);
>>
>> my $count=$query->count;
>> my @ids=$query->ids;<------- error msg here
>> ------------
>> but i get an error msg:
>>
>> MSG: Id list has been truncated even after maxids requested.
>>
>> How can i get the ID/ACC? Any idea?
>>
>> Thank you
>> Dimitar
>>      
> Dimitar,
>
> You are retrieving a huge number of IDs; if you print out $count above, the total is 23906.  Set -maxids to raise the returned default maximum number of IDs higher:
>
> ----------------------------------------------
> use Bio::DB::Query::GenBank;
>
> my $query_str='Caenorhabditis elegans[organism] AND refseq[filter]';
> my $query=Bio::DB::Query::GenBank->new(-maxids =>  40000,
>                                         -db=>'protein',
>                                        -query=>$query_str);
>
> my $count=$query->count;
> say "Count: $count";
>
> my @ids=$query->ids;
> say scalar(@ids); # equal to $count
> ----------------------------------------------
>
> Realize, though, you must submit these in batches of ~300 if retrieving sequences (IIRC, Bio::DB::GenBank only uses GET instead of POST, so there is a URL length limit).  Bio::DB::EUtilities can retrieve more, about ~3000 or so in a batch, when using POST.
>
> chris
>
>
>
>    
Hi again,
i managed to solve my problem. It may be dirty but it works the way i 
want :)
I reworked the 'download_query_genbank.pl' (attached). Now i can get the 
seqs in full fasta for proteomes and genomes and the genpept report 
files for the proteomes.
For DB handle i only use GenPept now cos it gives me stream which i can 
track with term::progressbar.

For the output i use 2 cases:
-----------------
while( my $seq = $stream->next_seq ) {
     #DIMITAR
     my($gi,$locus,$refnum,$desc,$seqstr);
     if($retformat eq 'fasta'){ <-------------------------| for the 
fasta as i want it
         check_progress($prgs,$seqnum,$count);
         $locus=$seq->display_id;
         $refnum=$seq->accession_number;
         $gi=$seq->primary_id;
         $desc=$seq->desc;
         $desc=~s/\.$//;
         $seqstr=$seq->seq;
         print $fhout ">gi\|$gi\|ref\|$refnum\|$locus $desc\n$seqstr\n";
     }else{
         check_progress($prgs,$seqnum,$count);
         $out->write_seq($seq); <--------------------------| for the 
genbank reports
     }
     $seqnum++;
     #DIMITAR

#    $out->write_seq($seq);#original

}
------------------

Thank you for your help and time.

Cheers
Dimitar

-- 
Dimitar Kenanov
Post doctoral fellow
Bioinformatics Institute
A*STAR Singapore
tel: +65 6478 8514
email: dimitark at bii.a-star.edu.sg

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: download_query_genbank.pl
URL: <http://lists.open-bio.org/pipermail/bioperl-l/attachments/20101130/d41a206d/attachment.pl>


More information about the Bioperl-l mailing list