[Bioperl-l] Bio::SeqIO can't guess the format of data from a pipe

J.J. Emerson jj.emerson at gmail.com
Thu Aug 25 14:52:48 EDT 2011


Hi Chris,

You asked:

> My question (not a criticism, just trying to understand the problem): why
> are you going through all the trouble of using GuessSeqFormat as a permanent
> solution anyway?  If you have a stream returning a possibly unknown data
> type, I would argue that the fundamental bug is not GuessSeqFormat but
> something else, more specifically not knowing the behavior of the data
> source and the returned format to begin with.  Is something preventing that?
>

In my particular case, I'm trying not to impose a particular usage scenario
on the script I'm writing, in the hope that it will be useful (and general)
for others in my lab in the future*. In my immediate case, I will certainly be
able to provide SeqIO with a format argument. But insofar as GuessSeqFormat
is considered desirable (and reasonable people could indeed disagree on
that), I think its applicability shouldn't hinge on whether it is
guessing on a pipe or a file.
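
To make the pipe-versus-file distinction concrete, here is a minimal sketch
in plain Perl (no BioPerl needed). The helper name is_seekable is my own
invention for illustration, not anything in BioPerl; it only assumes that a
zero-offset seek() relative to the current position fails on a pipe and
succeeds on a seekable handle.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the handle supports seek().
# A zero-offset seek relative to the current position (whence = 1)
# does not change where the next read happens.
sub is_seekable {
    my ($fh) = @_;
    return seek($fh, 0, 1) ? 1 : 0;
}

# A pipe, standing in for "cat w.fa | script".
pipe(my $reader, my $writer) or die "pipe failed: $!";
print {$writer} ">seq1\nACGT\n";
close $writer;

# An in-memory handle, standing in for a seekable source.
my $data = ">seq1\nACGT\n";
open(my $mem, '<', \$data) or die "open failed: $!";

printf "pipe seekable: %s\n",      is_seekable($reader) ? 'yes' : 'no';
printf "in-memory seekable: %s\n", is_seekable($mem)    ? 'yes' : 'no';
```

A script could use a check like this to decide whether guessing is safe on
the handle it was given, rather than failing inside the guesser.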

> My point is, GuessSeqFormat is fine as a temporary stop-gap, but it is not a
> permanent solution to your problems (it is guessing, after all).  Note the
> code has had very little development over the years, and the related SeqIO
> code hasn't aged particularly well.
>

I see. I wasn't aware that GuessSeqFormat was so neglected. Given
the rather challenging nature of the more elegant fix you suggested (using
the buffering in Bio::Root::IO), perhaps I should consider dropping my issue or
filing it as a feature request rather than a bug?
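
Just to check that I understand the buffering idea, here is a rough sketch of
"peek, guess, then replay" in plain Perl. It does not touch Bio::Root::IO or
GuessSeqFormat at all; peek_lines, next_line and toy_guess are hypothetical
names, and the toy guesser is far cruder than the real one. The point is only
that no seek() is needed if the lines used for guessing are kept in a buffer
and served again before reading further from the pipe.

```perl
use strict;
use warnings;

# Read up to $n lines from $fh and keep them in a buffer so that
# they can be replayed later (no seek() involved).
sub peek_lines {
    my ($fh, $n) = @_;
    my @buffer;
    while (@buffer < $n) {
        my $line = <$fh>;
        last unless defined $line;
        push @buffer, $line;
    }
    return \@buffer;
}

# Return the next line, serving buffered lines before the handle.
sub next_line {
    my ($buffer, $fh) = @_;
    return shift @$buffer if @$buffer;
    return scalar(<$fh>);
}

# Toy guesser (an assumption for illustration only): looks at the
# first non-blank buffered line.
sub toy_guess {
    my ($buffer) = @_;
    for my $line (@$buffer) {
        next if $line =~ /^\s*$/;
        return 'fasta'   if $line =~ /^>/;
        return 'fastq'   if $line =~ /^@/;
        return 'genbank' if $line =~ /^LOCUS\s/;
        last;
    }
    return;
}
```

A parser written against next_line() would see the whole stream, including
the lines consumed while guessing, which is what I take "push all data back
to the buffer" to mean.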

Cheers,

J.J.

PS

* The way I plan on using my script is roughly as follows:

prog1 [some arguments] \
| myscript.pl --informat fasta \
| prog2 \
| prog3 > pipeline.output

However, I'd like the "--informat" switch to be optional, mainly to
increase usability for other users. For any well-designed format, the data
itself contains enough information to identify the format, so supplying the
format a second time is somewhat redundant. In principle, being able to do
the following would be very useful:

prog1 [some arguments] \
| myscript.pl \
| prog2 > pipeline.output

The modularity of pipelining is very valuable and this is what caused me to
anticipate a usage scenario that involved both GuessSeqFormat and reading
from a pipe.
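
For the record, here is roughly how I imagine the optional "--informat"
switch combined with the tempfile workaround you mentioned. This is a sketch
under assumptions, not tested against BioPerl: seekable_handle and
parse_informat are hypothetical helpers of mine, and the Bio::SeqIO call is
left in comments so the snippet runs without BioPerl installed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: return a handle that is safe to seek on.
# Seekable input is returned unchanged; anything else (e.g. a pipe)
# is copied into an anonymous temp file first.
sub seekable_handle {
    my ($in) = @_;
    return $in if seek($in, 0, 1);

    my $tmp = tempfile();    # scalar context: filehandle only
    while (my $line = <$in>) {
        print {$tmp} $line;
    }
    seek($tmp, 0, 0) or die "Cannot rewind temp file: $!";
    return $tmp;
}

# Hypothetical helper: read the optional --informat switch.
sub parse_informat {
    my (@args) = @_;
    my $informat;
    GetOptionsFromArray(\@args, 'informat=s' => \$informat)
        or die "Usage: myscript.pl [--informat FORMAT]\n";
    return $informat;
}

# Intended use inside myscript.pl (commented out; needs BioPerl):
#   my $informat = parse_informat(@ARGV);
#   my %args = defined $informat
#       ? (-fh => \*STDIN, -format => $informat)   # no guessing needed
#       : (-fh => seekable_handle(\*STDIN));       # let SeqIO guess
#   my $seqin = Bio::SeqIO->new(%args);
```

The cost is one extra pass over the data and some disk space when the format
is not given, but unlike slurping into a variable it should not exhaust
memory on large inputs.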

On Thu, Aug 25, 2011 at 9:58 AM, Chris Fields <cjfields at illinois.edu> wrote:

> On Aug 24, 2011, at 8:53 PM, J.J. Emerson wrote:
>
> > Hello All,
> >
> > I have experienced some behavior in SeqIO that doesn't seem to be what I
> > would expect. Basically, for a certain script, if I try to pass something
> > like "-fh => \*STDIN" to Bio::SeqIO->new(), it will fail if both of the
> > following two conditions are met simultaneously:
> >
> >   1. STDIN is coming from a pipe;
> >   2. SeqIO is trying to guess the format.
> >
> > If STDIN is coming from redirection instead of a pipe, or if the format is
> > specified manually (i.e. BioPerl doesn't have to guess), the error doesn't
> > seem to occur.
> >
> > This issue has been reported previously:
> >
> > http://lists.open-bio.org/pipermail/bioperl-l/2010-July/033681.html
> > https://redmine.open-bio.org/issues/3122
>
> Yes, this was addressed according to that case.
>
> > This issue is ultimately one of using seek() on a pipe, which is forbidden
> > (see below). To be clear, there are kludgy ways around this that allow
> > BioPerl to take input from a pipe AND guess the format. My naive and
> > inefficient kludge was to test for reading from STDIN and for the absence
> > of a format. If both of these conditions are met, then I slurp STDIN into a
> > variable, open a filehandle on that variable, and pass it to SeqIO, which
> > can guess the format if the fh isn't opened on a pipe. SeqIO then
> > successfully guesses the format and does the SeqIO thing, at the expense of
> > having the program pass over the data at least twice. And if the input file
> > is huge, it could potentially consume all the memory. A better way to
> > address the problem would be to process the input one line at a time, but
> > this seems to require more extensive changes.
>
> Have you tried tempfiles?  Not that this is a great solution, but it's very
> commonly used for large sequence data, and it is seekable.  This behavior
> could also be wrapped in GuessSeqFormat, I suppose (but see below).
>
> > The reason I'm reposting this is that I think the inability to guess
> > the sequence format of data originating from a pipe is an important
> > limitation for a fundamental part of BioPerl. When designing scripts to be
> > used in pipelines, the inability to guess formats for piped data limits
> > BioPerl's pipelineability substantially. Even though previous reports of
> > this have been made and a bug opened and closed, I was wondering if anyone
> > thought this was worth fixing so as to make SeqIO (and probably AlignIO
> > as well?) more flexible?
> >
> > Does anyone think this should be refiled as a bug?
> >
> > Cheers,
> >
> > J.J.
>
> The fundamental problem with pipes (as you indicated) is that the data
> stream is not seekable.  We do have a built-in buffer in Bio::Root::IO that
> somewhat handles this, but Bio::Tools::GuessSeqFormat is (IIRC) designed to
> use the filehandle directly, bypassing the BioPerl IO layer completely.
>
> One solution is to redesign GuessSeqFormat to use Bio::Root::IO, have
> GuessSeqFormat push all data back to the buffer, then let SeqIO parse.  That
> will require some fundamental changes for both Bio::Root::IO and Bio::SeqIO
> (note that one cannot pass a Bio::Root::IO instance to another
> Bio::Root::IO-based class for parsing at this time).
>
> The other option is (as hinted above) having GuessSeqFormat dump the data
> to a tempfile, seek back after guessing, and retain the filehandle for
> Bio::SeqIO.  Not the best solutions, but either should work.
>
> My question (not a criticism, just trying to understand the problem): why
> are you going through all the trouble of using GuessSeqFormat as a permanent
> solution anyway?  If you have a stream returning a possibly unknown data
> type, I would argue that the fundamental bug is not GuessSeqFormat but
> something else, more specifically not knowing the behavior of the data
> source and the returned format to begin with.  Is something preventing that?
>
> My point is, GuessSeqFormat is fine as a temporary stop-gap, but it is not
> a permanent solution to your problems (it is guessing, after all).  Note the
> code has had very little development over the years, and the related SeqIO
> code hasn't aged particularly well.
>
> > PS
> >
> > Below are snippets of code and/or errors related to reproducing the failure
> > to guess unspecified formats. I'll see how Mailman treats my attachments and
> > post the code as a reply if they don't work.
> >
> > The bioperl_fhtest.pl attachment is the script that reproduces the error.
> > The w.fa is a fasta file containing some sequence.
> >
> > Here are the command lines to generate the behavior I observe (w.fa is a
> > file containing some fasta sequences, in my case it was the w gene from
> > different *Drosophila* species):
> >
> > ./bioperl_fhtest.pl fasta < w.fa # Works (redirection, no guessing)
> > ./bioperl_fhtest.pl < w.fa # Works (redirection, guessing)
> >
> > cat w.fa | ./bioperl_fhtest.pl fasta # Works (pipe, no guessing)
> > cat w.fa | ./bioperl_fhtest.pl # DOESN'T work (pipe, guessing)
> >
> > Here's the error I get in the last case:
> >
> > ------------- EXCEPTION: Bio::Root::Exception -------------
> > MSG: Failed resetting the filehandle; IO error occurred
> > STACK: Error::throw
> > STACK: Bio::Root::Root::throw /usr/local/share/perl/5.10.1/Bio/Root/Root.pm:472
> > STACK: Bio::Tools::GuessSeqFormat::guess /usr/local/share/perl/5.10.1/Bio/Tools/GuessSeqFormat.pm:512
> > STACK: Bio::SeqIO::new /usr/local/share/perl/5.10.1/Bio/SeqIO.pm:381
> > STACK: ./bioperl_fhtest.pl:8
> > -----------------------------------------------------------
> >
> > From what I gather, the error is triggered by a failure of seek() on a
> > STDIN fh on lines 517-518 (text from the version of GuessSeqFormat.pm
> > installed on my server):
> >
> >    512     if (defined $self->{-file}) {
> >    513         # Close the file we opened.
> >    514         close($fh);
> >    515     } elsif (ref $fh eq 'GLOB') {
> >    516         # Try seeking to the start position.
> >    517         seek($fh, $start_pos, 0) || $self->throw("Failed resetting the ".
> >    518                                         "filehandle; IO error occurred");;
> >    519     } elsif (defined $fh && $fh->can('setpos')) {
> >    520         # Seek to the start position.
> >    521         $fh->setpos($start_pos);
> >    522     }
> >
> > <bioperl_fhtest.pl><w.fa>
>
> You are always welcome to reopen and update the bug, or file a new one.
>
> chris
>
>

