[Bioperl-l] Re: Entrez gene parser code

Barry Moore barry.moore at genetics.utah.edu
Thu Apr 14 16:32:48 EDT 2005


Did you want to install a new ppm repository on your Windows box so you 
can install bioperl with ppm?  If so, this is not via CVS.  You want to 
run the following commands from your ppm prompt.

rep add Bioperl http://bioperl.org/DIST/
rep add Kobes http://theoryx5.uwinnipeg.ca/ppms/
rep add Bribes http://www.bribes.org/perl/ppm

You can then search for bioperl and install the version you want.  
Nathan Haigh (usually on this list) was preparing a bioperl 1.5 ppm.  
Not sure if it made it onto the website yet, but 1.4 is there and that 
install works as expected.  BTW, this will only install bioperl core.


Installing Bioperl on Windows



1) Quick Instructions for the Impatient

2) Bioperl

3) Perl on Windows

4) BioPerl on Windows

5) Beyond the Core

6) BioPerl and Cygwin

7) Cygwin Tips

8) Example Script


This installation guide was written by Barry Moore, Nathan Haigh and
other Bioperl authors based on the original work of Paul Boutros. Please
report problems and/or fixes to the bioperl mailing list,
bioperl-l at bioperl.org


1) Quick instructions for the impatient, lucky, or experienced user.



Download the ActivePerl MSI from 

Run the ActivePerl Installer (accepting all defaults is fine).

Open a command prompt (Menus Start->Run and type cmd) and run the PPM 
shell (C:\>ppm).

Add three new PPM repositories with the following commands:


      ppm> rep add Bioperl http://bioperl.org/DIST

      ppm> rep add Kobes http://theoryx5.uwinnipeg.ca/ppms

      ppm> rep add Bribes http://www.bribes.org/perl/ppm


Install Bioperl with the following commands:


      ppm> search Bioperl


This returns a numbered list of the packages that have "Bioperl" in
their name, with corresponding version numbers etc.


      ppm> install <number>


Where <number> corresponds to the relevant package and version from the
numbered list returned by the search.



Go to http://www.bioperl.org and start reading documentation or try the
example script at the end of this file.



2) Bioperl



Bioperl is a large collection of Perl modules (extensions to the Perl
language) that aid in the task of writing Perl code to deal with
sequence data in a myriad of ways.  Bioperl provides objects for various
types of sequence data and their associated features and annotations.
It provides interfaces for analysis of these sequences with a wide
variety of external programs (BLAST, fasta, clustalw and EMBOSS to name
just a few).  It provides interfaces to various types of databases both
remote (GenBank, EMBL etc.) and local (MySQL, flat files, GFF etc.) for
storage and retrieval of sequences.  And finally, with its associated
documentation and mailing list, Bioperl represents a community of
bioinformatics professionals working in Perl who are committed to
supporting both development of Bioperl and the new users who are drawn
to the project.


While most bioinformatics and computational biology applications are
developed in Unix/Linux environments, more and more programs are being
ported to other operating systems like Windows, and many users (often
biologists with little background in programming) are looking for ways
to automate bioinformatics analyses in the Windows environment.  Perl
and Bioperl can be installed natively on Windows NT/2000/XP, and most of
the functionality of Bioperl is available with this type of install.
Much of the heavy lifting in bioinformatics is done by programs
originally developed in lower-level languages like C and Pascal (e.g.
BLAST, clustalw, Staden etc.).  Bioperl simply acts as a wrapper for
running these external programs and parsing their output.  Some of those
programs (BLAST, for example) have been ported to Windows; these can be
installed and work quite happily with Bioperl in the native Windows
environment.  Other external programs, such as Staden and the EMBOSS
suite, cannot be installed on Windows at all, and therefore any part of
Bioperl that interacts with these packages either won't work or can't be
installed at all.

If you have a fairly simple project in mind, want to start using Bioperl
quickly, only have access to a computer running Windows, and/or don't
mind bumping up against some limitations, then Bioperl on Windows may be
a good place for you to start.  For example, downloading a bunch of
sequences from GenBank and sorting out the ones that have a particular
annotation or feature works great.  Running a bunch of your sequences
against remote or local BLAST, parsing the output and storing it in a
MySQL database would be fine also.  Be aware that most if not all of the
Bioperl developers are working in some type of a UNIX environment
(Linux, OSX, Cygwin).  If you have problems with Bioperl that are
specific to the Windows environment, you may be blazing new ground, and
your pleas for help on the Bioperl mailing list may get few responses,
simply because no one knows the answer to your Windows-specific problem.
If this is or becomes a problem for you then you are better off working
in some type of UNIX-like environment.  One solution to this problem
that will keep you working on a Windows machine is to install Cygwin, a
UNIX emulation environment for Windows.  A number of Bioperl users are
using this approach successfully and it is discussed in more detail
below.


3) Perl on Windows



There are a couple of ways of installing Perl on a Windows machine.  The 
most common and easiest is

to get the most recent build from ActiveState.  ActiveState is a 
software company

(http://www.activestate.com) that provides free builds of Perl for 
Windows users.  The current

(December 2004) build is ActivePerl (ActivePerl   is 
also available and should

work just fine).  To install ActivePerl on Windows:


      Download the ActivePerl MSI from 

      Run the ActivePerl Installer (accepting all defaults is fine).


You can also build Perl yourself (which requires a C compiler) or 
download one of the other binary

distributions.  The Perl source for building it yourself is available from

CPAN (http://www.cpan.org), as are a few other binary distributions that 
are alternatives to

ActiveState.  This approach is not recommended unless you have specific 
reasons for doing so and

know what you're doing.  If that's the case you probably don't need to 
be reading this guide.


Cygwin is a UNIX emulation environment for Windows and comes with its 
own copy of Perl.

Information on Cygwin and Bioperl is found below.
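
Because a Windows machine can end up with more than one Perl (for
example ActivePerl and Cygwin's Perl side by side), it can help to check
which one a given prompt is actually running.  The following is only a
minimal sketch using Perl's built-in variables; the describe_perl helper
is a made-up name for illustration, not part of Bioperl.

```perl
use strict;
use warnings;

# Describe the Perl build from its operating system name.
# $^O is 'MSWin32' for a native Windows build (such as ActivePerl)
# and 'cygwin' for the Perl that ships with Cygwin.
sub describe_perl {
    my ($osname) = @_;
    return 'native Windows Perl (e.g. ActivePerl)' if $osname eq 'MSWin32';
    return 'Cygwin Perl'                           if $osname eq 'cygwin';
    return "Perl built for $osname";
}

printf "Perl version: %vd\n", $^V;   # version of the running interpreter
print  "Perl binary:  $^X\n";        # path of the interpreter itself
print  "Build type:   ", describe_perl($^O), "\n";
```

Run it from both the Windows command prompt and the Cygwin shell if you
have both installed; the two will normally report different binaries.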


4) BioPerl on Windows



Perl is a programming language that has been extended a lot by the
addition of external modules.  These modules work with the core language
to extend the functionality of Perl.  Bioperl is one such extension to
Perl.  These modular extensions to Perl sometimes depend on the
functionality of other Perl modules and this creates a dependency: you
can't install module X unless you have already installed module Y.  Some
Perl modules are so fundamentally useful that the Perl developers have
included them in the core distribution of Perl; if you've installed Perl
then these modules are already installed.  Other modules are freely
available from CPAN, but you'll have to install them yourself if you
want to use them.  BioPerl has such dependencies.
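
To see whether a particular dependency is already present, you can ask
Perl to try loading it.  This is a rough sketch, not an official Bioperl
tool: module_available is a hypothetical helper, and the module names in
the loop are just examples taken from this guide.

```perl
use strict;
use warnings;

# Return 1 if the named module can be loaded by this Perl, 0 otherwise.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # Bio::SeqIO -> Bio/SeqIO.pm
    return eval { require $file; 1 } ? 1 : 0;
}

# File::Spec ships with Perl itself; the others must be installed
# separately (for example with PPM or from CPAN).
for my $module (qw(File::Spec Bio::SeqIO XML::Parser DBD::mysql)) {
    printf "%-12s %s\n", $module,
        module_available($module) ? 'installed' : 'MISSING';
}
```

A module reported as MISSING here is one that PPM (or you) will need to
install before anything that depends on it will work.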


Bioperl is actually a large collection of Perl modules (over 1000 
currently) and these modules are

split into six groups.  These six groups are:


      Bioperl Group             Functions

      bioperl (the core)        Most of the main functionality of Bioperl.
      bioperl-run               Wrappers to a lot of external programs.
      bioperl-ext               Interaction with some alignment functions
                                and the Staden package.
      bioperl-db                Using bioperl with BioSQL and local
                                relational databases.
      bioperl-microarray        Microarray specific functions.
      bioperl-gui               Some preliminary work on a graphical user
                                interface to some Bioperl functions.


The Bioperl core is what most new users will want to start with.
Bioperl (the core) and the Perl modules that it depends on can be easily
installed with PPM.  PPM (Programmer's Package Manager, formerly known
as the Perl Package Manager) is an ActivePerl utility for installing
Perl modules on systems using ActivePerl.  The PPM commands shown in
this document are for PPM version 3; if you use PPM version 2 the
commands you require will be different.  PPM will look online (you have
to be connected to the internet, of course) for files (these files end
with .ppd) that tell it how to install the modules you want and what
other modules your new modules depend on.  It will then download and
install your modules and all the modules they depend on for you.

These .ppd files are stored online in PPM repositories.  ActiveState
maintains the largest PPM repository, and when you installed ActivePerl,
PPM was installed with directions for using the ActiveState
repositories.  Unfortunately the ActiveState repositories are far from
complete, and other ActivePerl users maintain their own PPM repositories
to fill in the gaps.  Installing Bioperl will require you to direct PPM
to look in three new repositories.  You do this by opening a Windows
command prompt, typing ppm to start the PPM shell and then typing the
following three commands:


      ppm> rep add Bioperl http://bioperl.org/DIST

      ppm> rep add Kobes http://theoryx5.uwinnipeg.ca/ppms

      ppm> rep add Bribes http://www.bribes.org/perl/ppm



Once PPM knows where to look for Bioperl and its dependencies, you
simply tell PPM to search for packages with Bioperl in their name, and
then which of these to install.  This is done with the following
commands:


      ppm> search Bioperl


This returns a numbered list of the packages that have "Bioperl" in
their name, with corresponding version numbers etc.


      ppm> install <number>


Where <number> corresponds to the relevant package and version from the
numbered list returned by the search.



5) Beyond the Core



You may find that you want some of the features of other Bioperl groups 
like bioperl-run or

bioperl-db.  There are currently no PPM packages for installing these 
parts of

Bioperl (but check this by doing a Bioperl search at the PPM shell):


      ppm> search bioperl


If they are not present, you will have to install these manually from
source.  For this you will need a Windows version of the program make
called nmake.  You will also want to have a willingness to experiment.
You'll have to read the installation documents for each component that
you want to install, and use nmake where the instructions call for make.
You will have to determine from the installation documents what
dependencies are required, and you will have to get them, read their
documentation and install them first.  The details of this are beyond
the scope of this guide.  Read the documentation.  Search Google.  Try
your best, and if you get stuck consult with others on the bioperl
mailing list.
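
If you are unsure which make program to substitute, Perl itself records
the one it was configured with.  The sketch below prints the usual
Makefile.PL build sequence with that program filled in.  It assumes the
component you are installing uses the standard Makefile.PL procedure
(check its own installation document), and build_commands is a made-up
helper name for illustration.

```perl
use strict;
use warnings;
use Config;   # core module that records how this Perl was built

# Return the standard build steps for a Makefile.PL-based distribution,
# using the given make program (e.g. 'nmake' on a native Windows Perl).
sub build_commands {
    my ($make) = @_;
    return ('perl Makefile.PL', $make, "$make test", "$make install");
}

# $Config{make} names the make program this Perl expects; fall back to
# plain 'make' if it is empty.
my $make = $Config{make} || 'make';

print "Run these from the unpacked source directory:\n";
print "  $_\n" for build_commands($make);
```

On an ActivePerl install this will typically name nmake, but treat the
output as a starting point and follow the component's own instructions
where they differ.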


6) BioPerl and Cygwin



Cygwin is a UNIX emulator and shell environment available free at
www.cygwin.com. BioPerl runs well within Cygwin. Some users claim that
installation of Bioperl is easier within Cygwin than within Windows, but
these may be users with UNIX backgrounds.


One advantage of using Bioperl in Cygwin is that all the external
modules are available through CPAN, and most if not all external
programs can be installed and run, so many of the limitations of Bioperl
on Windows are circumvented.


To get Bioperl running first install the basic Cygwin package as well as 
the Cygwin Perl, make, and

gcc packages. Clicking the "View" button in the upper right of the 
installer enables you to see

details on the various packages. Then follow the BioPerl installation 
instructions for UNIX in

BioPerl's INSTALL file.


Note that expat comes with Cygwin (it's used by the module XML::Parser).


One known issue is that DBD::mysql can be tricky to install in Cygwin,
and this module is required for the bioperl-db, Biosql, and
bioperl-pipeline external packages. Fortunately there are some good
instructions online:



Also, set the environment variable TMPDIR; programs like BLAST and
clustalw need a place to create temporary files, e.g.:


setenv TMPDIR e:/cygwin/tmp     # csh, tcsh

export TMPDIR=e:/cygwin/tmp     # sh, bash


Note that this is not Cygwin's own path syntax, which would be something
like "/cygdrive/e/cygwin/tmp". It is the syntax that a Perl module
expects on Windows.


If this variable is not set correctly you'll see errors like this when 
you run



------------- EXCEPTION: Bio::Root::Exception -------------

MSG: Could not open /tmp/gXkwEbrL0a: No such file or directory

STACK: Error::throw
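
Before running anything that writes temporary files, you can check the
variable from Perl.  This is only a rough sketch: tmpdir_problem is a
hypothetical helper, and it tests the directory as seen by whichever
Perl runs it.

```perl
use strict;
use warnings;

# Return a description of what is wrong with a TMPDIR value,
# or nothing (undef in scalar context) if the directory looks usable.
sub tmpdir_problem {
    my ($dir) = @_;
    return 'TMPDIR is not set' unless defined $dir && length $dir;
    return "TMPDIR '$dir' is not a directory" unless -d $dir;
    return "TMPDIR '$dir' is not writable"    unless -w $dir;
    return;
}

my $problem = tmpdir_problem($ENV{TMPDIR});
if (defined $problem) {
    print "Problem: $problem\n";
}
else {
    print "TMPDIR looks fine: $ENV{TMPDIR}\n";
}
```

If the script reports a problem, fix the setenv or export line in your
shell startup file and open a new shell before trying again.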



7) Cygwin Tips



The easiest way to install MySQL is to use the Windows binaries
available at www.mysql.com. Note that Windows does not have UNIX
sockets, so you need to force the MySQL connections to use TCP/IP
instead. Do this by using the "-h" option from the command-line:


 >mysql -h -u blip -pblop biosql


Or, alias the mysql command in your .tcshrc, .cshrc, or .bashrc so it 
uses a host. For example, if

your databases are installed locally:


alias mysql 'mysql -h'


If you're trying to use some application or resource "outside" of Cygwin 
and you're having a problem

remember that Cygwin's path syntax may not be the correct one. Cygwin 
understands '/home/jacky' or

'/cygdrive/e/cygwin/home/jacky' (when referring to the E: drive) but the 
external resource may want

'E:/cygwin/home/jacky'. So your *rc files may end up with paths written 
in these different syntaxes,
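
As an illustration of the difference between the two syntaxes, here is a
small sketch that rewrites a /cygdrive path into the drive-letter form.
The function name cygdrive_to_windows is invented for this example, and
it deliberately handles only the /cygdrive/<letter>/ form; a path such
as '/home/jacky' depends on where Cygwin is installed and is returned
unchanged.

```perl
use strict;
use warnings;

# Convert '/cygdrive/e/some/path' to 'E:/some/path'.
# Any path not starting with /cygdrive/<letter> is returned as given.
sub cygdrive_to_windows {
    my ($path) = @_;
    if ($path =~ m{^/cygdrive/([A-Za-z])(/.*)?$}) {
        my $drive = uc $1;
        my $rest  = defined $2 ? $2 : '/';
        return "${drive}:$rest";
    }
    return $path;
}

# prints E:/cygwin/home/jacky
print cygdrive_to_windows('/cygdrive/e/cygwin/home/jacky'), "\n";
```

Cygwin also ships a cygpath utility that performs this kind of
conversion, including for paths under the Cygwin root; see its own help
output for the available options.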



If you can, install Cygwin on a drive or partition that's 
NTFS-formatted, not FAT32-formatted. When

you install Cygwin on a FAT32 partition you will not be able to set 
permissions and ownership

correctly. In most situations this probably won't make any difference 
but there may be occasions

where this is a problem.


If you want to use BLAST we recommend that the Windows binary be 
obtained from NCBI

(ftp://ftp.ncbi.nih.gov/blast/executables/LATEST-BLAST - the file will 
be named something like

blast-2.2.6-ia32-win32.exe). Then follow the Windows instructions in 


Although we've recommended using the BLAST and MySQL binaries you should 
be able to compile just

about everything else from source code using Cygwin's gcc. You'll notice 
when you're installing

Cygwin that many different libraries are also available (gd, jpeg, etc.).


8) Example Script





#A short script to demonstrate how to download sequences from GenBank and
#access the sequence and some associated annotations using Bioperl.


use strict;

use warnings;

use Bio::SeqIO;

use Bio::DB::GenBank; #use Bio::DB::GenPept or Bio::DB::RefSeq if needed


#Get some sequence IDs either like below, or read in from a file.  Note that
#this sample script works with the accession numbers below (at least at the
#time it was written).  If you add different accession numbers and you get
#errors, you may be calling for something that the sequence doesn't have.
#You'll have to add your own error trapping code to handle that.

my @ids = ('K03160', 'AB039327', 'BC035972');


#Create the GenBank database object to read from the database.
my $gb = Bio::DB::GenBank->new();

#Create a sequence stream that delivers the requested sequences from the
#database.
my $seqio = $gb->get_Stream_by_id(\@ids);


#Loop over all of the sequences that you requested.

while (my $seq = $seqio->next_seq) {


  #Here is how you call methods directly on the RichSeq object.  Replace
  #'display_name' with any other method in Table 2 that can be called on
  #either the RichSeq object directly, or the PrimarySeq object which it has.

  print "Display Name:  ", $seq->display_name,"\n";

  print "Sequence Date:  ",$seq->get_dates,"\n";


  #Here is how to access the classification data from the species object.

  my $species = $seq->species;

  print "Species  :", $species->common_name,"\n";

  my @class = $species->classification;

  print "Classification:  @class\n";


  #Here is a general way to call things that are stored as a
  #Bio::SeqFeature::Generic object.  Replace 'source' with any other of the
  #"major" headings in the feature table (e.g. gene, CDS, etc.) and replace
  #'mol_type' with any of the tag values found under that heading (organism,
  #locus_tag, gene, etc.)
  my @source_feats = grep { $_->primary_tag eq 'source' } $seq->get_SeqFeatures;

  my $source_feat = shift @source_feats;

  my @mol_type = $source_feat->get_tag_values('mol_type');

  print "Molecule Type:  @mol_type\n";


  #Here is a general way to call things that are stored as some type of a
  #Bio::Annotation object.  This includes reference information and comments.
  #Replace 'reference' with 'comment' to get the comment, and replace
  #$ref->authors with $ref->title (or location, medline, etc.) to get other
  #reference categories.

  my $ann = $seq->annotation();

  my @references = ($ann->get_Annotations('reference'));

  my $ref = shift @references;

  my ($title, $authors, $location, $pubmed, $reference);

  if (defined $ref) {
    $authors = $ref->authors;
    print "Authors:  $authors\n";
  }

  print "Sequence:  \n", $seq->seq, "\n\n";
}
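
The comments in the script mention adding your own error trapping.  One
possible approach, shown here only as a sketch, is to wrap method calls
in a helper that returns nothing instead of dying when the object is
missing or the call fails.  The helper name try_method and the
Demo::Species class are invented for this illustration; in the script
above the object would be $seq, $species or $source_feat.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Call $method on $obj and return its results.  Return an empty list
# if $obj is not an object, lacks the method, or the call dies.
sub try_method {
    my ($obj, $method, @args) = @_;
    return unless blessed($obj) && $obj->can($method);
    my @result;
    my $ok = eval { @result = $obj->$method(@args); 1 };
    unless ($ok) {
        warn "Call to $method failed: $@";
        return;
    }
    return @result;
}

# Stand-in class for demonstration only (not a Bioperl class).
package Demo::Species;
sub new         { return bless {}, shift }
sub common_name { return 'human' }
sub broken      { die "no data available\n" }

package main;

my $species = Demo::Species->new;
my ($name) = try_method($species, 'common_name');
print "Species: $name\n";

# A record without a species object yields no values rather than an error.
my @nothing = try_method(undef, 'common_name');
print "Values from a missing object: ", scalar(@nothing), "\n";
```

Whether to skip a record silently, warn, or stop is a design choice that
depends on your data; the helper above warns only when a call throws.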


Brian Osborne wrote:

>If you'd like a command-line environment like some sort of Unix install
>Cygwin (www.cygwin.com). No need to install everything, just click the
>"View" button in the main installation window and select and install the
>minimum, something like gcc, binutils, cvs, openssh, make, Perl.
>Brian O.
>-----Original Message-----
>From: bioperl-l-bounces at portal.open-bio.org
>[mailto:bioperl-l-bounces at portal.open-bio.org]On Behalf Of Colin Erdman
>Sent: Tuesday, April 12, 2005 2:23 PM
>To: 'Mingyi Liu'; 'Stefan Kirov'
>Cc: 'Bioperl list'
>Subject: RE: [Bioperl-l] Re: Entrez gene parser code
>I am between Linux installs right now and actually running win32 with the
>ActiveState Perl install... How does one add the cvs.open-bio.org repository
>to the PPM console list to search through it and install the bioperl-live
>packages etc? I don't see a comparable cvs command within it.
>This is all new to me and I appreciate the help!
>-----Original Message-----
>From: Mingyi Liu [mailto:mingyi.liu at gpc-biotech.com]
>Sent: Tuesday, April 12, 2005 10:56 AM
>To: Stefan Kirov
>Cc: Colin Erdman; Bioperl list
>Subject: Re: [Bioperl-l] Re: Entrez gene parser code
>Stefan Kirov wrote:
>>In order for this parser to work you need to get
>>GI::Parser::Entrezgene from sourceforge. You can get the address for
>>this module from the perl doc of entrezgene: perldoc
>I just want to add that I will be adding GI::Parser::EntrezGene to cpan
>in a few days, and most likely the name space will switch to Bio::ASN1
>(therefore it'd be Bio::ASN1::EntrezGene) based on PAUSE admin suggestion.
>Bioperl-l mailing list
>Bioperl-l at portal.open-bio.org

Barry Moore
Dept. of Human Genetics
University of Utah
Salt Lake City, UT
