[Bioperl-l] extracting GI number from BLAST hit

Joshua Orvis jorvis at gmail.com
Thu Sep 16 11:55:01 EDT 2004

How can one extract the GI number from hits when doing BLAST against
an NCBI-formatted BLAST database?

Each entry in the original multi-FASTA file was like this:

>gi|30260195|ref|NC_003997.3| Bacillus anthracis str. Ames, complete genome
[sequence .....]

and formatting was done like:

# formatdb -i filename.fna -p F -o T

When I BLAST and parse the hit section I cannot see how to get the GI
number out of each hit.  This code:

        ## returns a Bio::SearchIO::blast object
        $report = $fact->blastall($seq);
        ## each result is a Bio::Search::Result::BlastResult object
        while( my $result = $report->next_result ) {

            ## each hit is a Bio::Search::Hit::BlastHit object
            while( my $hit = $result->next_hit ) {
                my $acc  = $hit->accession || 'NOACC';
                my $desc = $hit->description || 'NODESC';
                my $name = $hit->name || 'NONAME';
                my $locus = $hit->locus || 'NOLOC';
                print "$acc - $desc - $name - $locus\n";

                ## each HSP is a Bio::Search::HSP::GenericHSP object
                while( my $hsp = $hit->next_hsp ) {
                    ## TODO, grab the alignments in a bit
                }
            }
        }

generates output like this:

NC_002940 - Haemophilus ducreyi 35000HP, complete genome - ref|NC_002940.2| - NOLOC
NC_004088 - Yersinia pestis KIM, complete genome - ref|NC_004088.1| - NOLOC
NC_003143 - Yersinia pestis strain CO92, complete genome - ref|NC_003143.1| - NOLOC
NC_002516 - Pseudomonas aeruginosa PA01, complete genome - ref|NC_002516.1| - NOLOC
NC_002677 - Mycobacterium leprae strain TN complete genome - ref|NC_002677.1| - NOLOC

I expected to be able to parse the GI out of the description line, but
the identifier seems to be stripped off at some earlier stage, so it
never shows up there.  I'm probably just missing a method somewhere in
the docs.  Any suggestions?
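
For illustration, here is a minimal sketch of the kind of parsing described
above. It only works on a string that still carries the "gi|<number>|"
field, as the original FASTA headers do; the hit names in the output shown
earlier do not. The helper name gi_from_defline is hypothetical and is not
a BioPerl method.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): return the GI number from an
# NCBI-style identifier such as "gi|30260195|ref|NC_003997.3|", or undef
# when the string has no "gi|<digits>" field. A leading ">" is tolerated.
sub gi_from_defline {
    my ($id) = @_;
    return undef unless defined $id;
    return $id =~ /(?:^>?|\|)gi\|(\d+)/ ? $1 : undef;
}

# One identifier with a GI field, and one hit name as seen in the output.
for my $id ('gi|30260195|ref|NC_003997.3|', 'ref|NC_002940.2|') {
    my $gi = gi_from_defline($id);
    print "$id => ", (defined $gi ? $gi : 'NOGI'), "\n";
}
```

In the hit loop this could be called as gi_from_defline($hit->name), but
it will return undef for as long as the name comes back without the GI.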

