[Bioperl-l] Check sequence format, question

Brian Osborne bosborne11 at verizon.net
Thu Nov 2 09:38:36 EST 2006


Yes, and validation should always be optional. SeqIO is not always as fast
as some want it to be and sometimes you're certain that your files don't
need to be validated.

Brian O.
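
[Editorial sketch] The "has-a validator" design Chris proposes in the quoted message below, combined with the point above that validation should be optional, might look roughly like this. It is a language-neutral illustration in Python, not BioPerl code: FastaReader, dna_validator and FormatError are hypothetical names, and the DNA alphabet is an assumption. The reader only parses; a validator is called only if one was supplied.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str]  # (header without '>', concatenated sequence)


class FormatError(ValueError):
    """Raised by a validator when a record looks malformed."""


def dna_validator(record: Record) -> None:
    """Hypothetical validator: reject symbols outside an assumed DNA alphabet."""
    header, seq = record
    bad = set(seq.upper()) - set("ACGTRYSWKMBDHVN-")
    if bad:
        raise FormatError(f"{header}: unexpected symbols {sorted(bad)}")


class FastaReader:
    """Parses FASTA lines; validation is delegated to an optional validator."""

    def __init__(self, lines: Iterable[str],
                 validator: Optional[Callable[[Record], None]] = None) -> None:
        self._lines = lines
        self._validator = validator  # None means "skip validation entirely"

    def __iter__(self) -> Iterator[Record]:
        header: Optional[str] = None
        chunks: List[str] = []
        for line in self._lines:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                if header is not None:
                    yield self._emit(header, chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
            # lines before the first header are silently ignored by the parser
        if header is not None:
            yield self._emit(header, chunks)

    def _emit(self, header: str, chunks: List[str]) -> Record:
        record = (header, "".join(chunks))
        if self._validator is not None:
            self._validator(record)  # may raise FormatError
        return record
```

With no validator the reader pays no validation cost: list(FastaReader(lines)) just parses. Passing validator=dna_validator turns checking on without touching the parsing code.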

On 11/2/06 11:11 AM, "Chris Fields" <cjfields at uiuc.edu> wrote:

> Brian,
> I think the validation issue is worthwhile but I can see logistical
> nightmares having every SeqIO parser validate sequence while parsing;
> GenBank and EMBL do this to some extent already but it isn't
> foolproof.  Much of SeqIO (e.g. GenBank/EMBL/Swiss parsing) is
> already in dire need of an overhaul as is w/o adding validation.
> I wonder if it would be better if SeqIO has-a validator object
> instead of acting as a validator itself, i.e. SeqIO would focus on
> parsing and writing, the validator would focus on validation.  It
> might be easier from the maintenance aspect.  It's probably
> worthwhile exploring using Bio::Tools::GuessSeqFormat within SeqIO,
> or setting up a new system altogether.  Validation using the sequence
> validator could then be enabled by having a validation option when
> instantiating SeqIO.  We could even enable XML format validation
> using the DTD/Schema, which should be fairly straightforward.
> Of course, this all depends on someone writing it up...
> Chris
> On Nov 2, 2006, at 6:49 AM, Brian Osborne wrote:
>> Chris et al.,
>> As you know the question of whether SeqIO should or should not validate or
>> check the given format is still an open one. In fact, some SeqIO modules do
>> validate to some extent. See:
>> http://bugzilla.open-bio.org/show_bug.cgi?id=1508
>> I can see that you've commented on this enhancement, I'm replying just to
>> bring this to the attention of others.
>> Brian O.
>> On 11/2/06 12:28 AM, "Chris Fields" <cjfields at uiuc.edu> wrote:
>>> On Nov 1, 2006, at 6:15 PM, Eugene Bolotin wrote:
>>>> Dear bioperl mailing list,
>>>> I am trying to get a sequence from a file using Bio::SeqIO. Before I do
>>>> anything, I want to make sure that the file is in a correct Fasta sequence
>>>> format. I want it to spit out an error message if it is in any other format.
>>>> What is the easiest way to do it?
>>>> Thanks,
>>>> Eugene Bolotin
>>>> Sladek Lab.
>>> There is no formal FASTA definition that is universally accepted
>>> beyond having the first line start with '>' and an optional
>>> description, with the sequence in subsequent lines.
>>> http://www.bioperl.org/wiki/FASTA_sequence_format
>>> Bio::SeqIO isn't currently set up to validate sequence formats
>>> directly, but you could try preparsing the data using
>>> Bio::Tools::GuessSeqFormat.
>>> Chris
>>> Christopher Fields
>>> Postdoctoral Researcher
>>> Lab of Dr. Robert Switzer
>>> Dept of Biochemistry
>>> University of Illinois Urbana-Champaign
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
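
[Editorial sketch] For the original question, a check against the minimal FASTA definition given in Chris's first reply (a '>' header line, then sequence lines) could be sketched as follows. This is an illustration in Python, not part of BioPerl and not a replacement for Bio::Tools::GuessSeqFormat. The function name check_fasta is hypothetical, and two rules are assumptions because no formal definition exists: the set of allowed sequence characters, and treating a record without sequence as an error.

```python
import re
from typing import Iterable, Optional

# Assumed alphabet: letters plus '*' and '-'; adjust for your data.
_SEQ_LINE = re.compile(r"^[A-Za-z*\-]+$")


def check_fasta(lines: Iterable[str]) -> Optional[str]:
    """Return None if the lines look like FASTA, else an error message."""
    seen_header = False
    residues_since_header = 0
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are tolerated
        if line.startswith(">"):
            if seen_header and residues_since_header == 0:
                return f"line {number}: previous record has no sequence"
            seen_header = True
            residues_since_header = 0
        elif not seen_header:
            return f"line {number}: expected '>' header, got {line[:20]!r}"
        elif not _SEQ_LINE.match(line):
            return f"line {number}: unexpected characters in sequence line"
        else:
            residues_since_header += len(line)
    if not seen_header:
        return "no FASTA records found"
    if residues_since_header == 0:
        return "last record has no sequence"
    return None
```

A caller would print the returned message and stop if it is not None, and only then hand the file to the real parser.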
