[Bioperl-l] Check sequence format, question
cjfields at uiuc.edu
Thu Nov 2 10:11:01 EST 2006
I think the validation issue is worth pursuing, but I can see
logistical nightmares in having every SeqIO parser validate sequences
while parsing. The GenBank and EMBL parsers do this to some extent
already, but it isn't foolproof. Much of SeqIO (e.g. GenBank/EMBL/Swiss
parsing) is already in dire need of an overhaul as it is, without
adding validation.
I wonder if it would be better for SeqIO to hold a validator object
(a has-a relationship) instead of acting as a validator itself, i.e.
SeqIO would focus on parsing and writing, and the validator would focus
on validation. That might be easier from the maintenance aspect. It's
probably worth exploring the use of Bio::Tools::GuessSeqFormat within
SeqIO, or setting up a new system altogether. Validation using the
sequence validator could then be enabled through a validation option
when instantiating SeqIO. We could even enable XML format validation
using the DTD/Schema, which should be fairly straightforward.
Of course, this all depends on someone writing it up...
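To make the has-a idea above a bit more concrete, here is a rough,
language-neutral sketch written in Python rather than Perl. Every name
in it (SeqReader, FastaValidator, FormatError, the `validator` option)
is hypothetical and is not part of Bio::SeqIO or any existing BioPerl
module. The validator checks only the minimal FASTA convention
mentioned further down the thread: a '>' header line followed by
sequence lines.

```python
# Hypothetical names throughout: SeqReader, FastaValidator, FormatError
# and the `validator` option are illustrative only, not BioPerl API.

class FormatError(ValueError):
    """Raised when input does not look like the expected format."""


class FastaValidator:
    """Checks only the minimal FASTA convention: each record is a '>'
    header line followed by at least one line of sequence."""

    def validate(self, lines):
        have_header = False    # has any record started yet?
        have_sequence = False  # does the current record have sequence?
        for number, line in enumerate(lines, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                if have_header and not have_sequence:
                    raise FormatError(f"line {number}: record has no sequence")
                have_header, have_sequence = True, False
            elif not have_header:
                raise FormatError(f"line {number}: expected '>' header")
            else:
                have_sequence = True
        if not have_header:
            raise FormatError("no records found")
        if not have_sequence:
            raise FormatError("last record has no sequence")


class SeqReader:
    """Parser that has-a validator instead of validating itself."""

    def __init__(self, lines, validator=None):
        self.lines = list(lines)
        self.validator = validator  # validation is opt-in

    def records(self):
        # Generator: validation runs when iteration starts.
        if self.validator is not None:
            self.validator.validate(self.lines)
        header, chunks = None, []
        for line in self.lines:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)
```

With this split, a caller opts in by passing
`validator=FastaValidator()` when constructing the reader. Without it,
the reader parses leniently and simply skips anything it does not
recognize, so the parser stays small and the checks can evolve
separately.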
On Nov 2, 2006, at 6:49 AM, Brian Osborne wrote:
> Chris et al.,
> As you know, the question of whether SeqIO should or should not
> validate or check the given format is still an open one. In fact,
> some SeqIO modules do validate to some extent. See:
> I can see that you've commented on this enhancement; I'm replying
> just to bring this to the attention of others.
> Brian O.
> On 11/2/06 12:28 AM, "Chris Fields" <cjfields at uiuc.edu> wrote:
>> On Nov 1, 2006, at 6:15 PM, Eugene Bolotin wrote:
>>> Dear bioperl mailing list,
>>> I am trying to get a sequence from a file using Bio::SeqIO. Before
>>> I do, I want to make sure that the file is in correct FASTA
>>> format. I want it to print an error message if it is in any other
>>> format. What is the easiest way to do it?
>>> Eugene Bolotin
>>> Sladek Lab.
>> There is no formal FASTA definition that is universally accepted
>> beyond having the first line start with '>' and an optional
>> description, with the sequence in subsequent lines.
>> Bio::SeqIO isn't currently set up to validate sequence formats
>> directly, but you could try preparsing the data using
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Robert Switzer
>> Dept of Biochemistry
>> University of Illinois Urbana-Champaign
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign