HOWTO:Getting Genomic Sequences
From BioPerl
Contents |
Authors
Copyright
This document is copyright Brian Osborne. It can be copied and distributed under the terms of the Perl Artistic License.
Abstract
This is a HOWTO that talks about using Bioperl and tools related to Bioperl to get genomic sequence. There are a few different approaches, one uses files that you'll download to your own computer to query locally, others use remote, programmable interfaces or APIs. You should also see the EUtils Cookbook.
Using local Genbank and Entrez Gene files
You can download chromosomal, nucleotide files in FASTA format from NCBI (ftp://ftp.ncbi.nih.gov/genomes/) and get gene position data from Entrez Gene (see Using Bio::DB::EntrezGene to get genomic coordinates), then create indices of the fasta files using Bio::DB::Fasta. There is an example script, extract_genes.pl, that shows how this could be done. The query terms are limited to Gene id's in this example since the positional data is taken from Entrez Gene's gene2accession file.
Requirement: BioPerl.
Using the Perl API at ENSEMBL
You can connect remotely to the ENSEMBL database and query it using any name or identifier that's understood by ENSEMBL.
Requirements: BioPerl, the Ensembl Perl API, DBI, and DBD::mysql. See Ensembl API installation , and the Ensembl Perl API tutorials for information on installation and use.
Example script:
use strict; use DBI; use Getopt::Long; use Bio::EnsEMBL::DBSQL::DBAdaptor; my $identifier; GetOptions( "n=s" => \$identifier ); my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new( -user => "anonymous", -dbname => "homo_sapiens_core_49_36k", -host => "ensembldb.ensembl.org", -pass => "", -driver => 'mysql', -port => 5306, # use mysql 5 (mysql 4: port 3306) ); my $gene_adaptor = $db->get_GeneAdaptor; my $slice_adaptor = $db->get_SliceAdaptor; for my $gene (@{$gene_adaptor->fetch_all_by_external_name($identifier)}) { for my $trans (@{$gene->get_all_Transcripts}) { my $seq = $slice_adaptor->fetch_by_region("chromosome", $trans->seq_region_name, $trans->start, $trans->end); print "\n",$seq->seq,"\n"; } }
You also have the option of using raw SQL when using the ENSEMBL API, the result is that this is a very powerful API for analyzing genomic data.
Notes on this example
- This bit of code has not been extensively tested.
- The
fetch_all_by_external_namemethod does not accept a namespace or database name as an argument, so it lacks some precision. Be careful that your query returns just one sequence. Alternatively use a more precise SQL statement rather thanfetch_all_by_external_name.
- Make sure to use the latest version, not necessarily the
-dbnameargumenthomo_sapiens_core_49_36kused above. To get a listing of available databases using mysql:
$ mysql -u anonymous -h ensembldb.ensembl.org Welcome to the MySQL monitor. ... mysql> show databases like "homo_sapiens_core%"; mysql> show databases;
Using Bio::DB::GenBank when you have genomic coordinates
This is an example of how you can pull sequence chunks from GenBank, complete with GenBank's annotation, using Bio::SeqFeature::Generic objects generated by a Bio::Tools::RNAMotif factory.
Note that you could easily replace Bio::Tools::RNAMotif with anything that gives start, end, and strand information. While the last foreach loop just dumps sequence annotation information, it could be modified to add the sequence feature, determine the seqfeatures's genomic context to surrounding features, etc. For more information, see Bio::DB::GenBank.
Requirement: BioPerl.
#!perl use strict; use Bio::Tools::RNAMotif; # or anything that gives start, end, strand info use Bio::DB::GenBank; use Bio::SeqIO; my $factory = Bio::Tools::RNAMotif->new(-file=>'clean_RNAMotif.txt', -motiftag => 'protein_binding', -desctag => 'pyrR_BL' ); # array of Bio::SeqFeature::Generic objects generated by Bio::Tools::RNAMotif factory my @motifs = (); while(my $motif = $factory->next_prediction) { push @motifs, $motif; } # Start Bio::SeqIO factory my $outfile = Bio::SeqIO->new(-file => '> temp.txt', -format => 'genbank'); my @seqs = (); foreach my $motif (@motifs) { my $strand = ($sf->strand == 1) ? 1 : 2; my $seqstart = $sf->start - 500; my $seqend = $sf->end + 500; # Below is from Bio::DB::GenBank POD, with some modifications my $factory = Bio::DB::GenBank->new(-format => 'genbank', -seq_start => $seqstart, # 500 bp upstream -seq_stop => $$seqend, # 500 bp downstream -strand => $strand, # 1=plus, 2=minus ); my $seqin = $factory->get_Seq_by_acc($motif->seq_id); # store away files $outfile->write_seq($seqin); # may take lots of memory if you have many seqfeatures push @seqs, $seqin; sleep 3; # don't irritate NCBI } # from HOWTO:Feature-Annotation; gives seq annotation foreach my $seq (@seqs) { print $seq->accession_number,"\t", $seq->length,"\n"; for my $feat_object ($seq->get_SeqFeatures) { next unless $feat_object->primary_tag eq "CDS"; print "primary tag: ", $feat_object->primary_tag, "\n"; for my $tag ($feat_object->get_all_tags) { print " tag: ", $tag, "\n"; for my $value ($feat_object->get_tag_values($tag)) { print " value: ", $value, "\n"; } } } }
Using Bio::DB::EntrezGene to get genomic coordinates
You can get the coordinates of a given gene from Entrez Gene using the Bio::DB::EntrezGene module. This involves examining the Annotations associated with the gene (see the Feature-Annotation HOWTO for more information on Annotations) and finding the one labelled "Evidence Viewer", the data is found in a URL. The only identifier that the NCBI Entrez Gene API can use is a Gene id, formerly known as a LocusLink id.
Requirement: BioPerl.
Example code:
use strict; use Bio::DB::EntrezGene; my $id = shift or die "Id?\n"; # use a Gene id my $db = new Bio::DB::EntrezGene; my $seq = $db->get_Seq_by_id($id); my $ac = $seq->annotation; for my $ann ($ac->get_Annotations('dblink')) { if ($ann->database eq "Evidence Viewer") { # get the sequence identifier, the start, and the stop my ($contig,$from,$to) = $ann->url =~ /contig=([^&]+).+from=(\d+)&to=(\d+)/; print "$contig\t$from\t$to\n"; } }
This data is found in a Bio::Annotation::DBLink Annotation.
Once you have the coordinates you can use them to retrieve a sub-sequence either by using a local indexed file (e.g. Bio::DB::Fasta) or by retrieving the sequence from
a remote database (e.g. Bio::DB::GenBank, Bio::DB::RefSeq and using subseq or trunc from Bio::PrimarySeq or Bio::PrimarySeqI (the first approach will give you the best performance).

