[Bioperl-l] Output a subset of FASTA data from a single large file

Chris Fields cjfields at uiuc.edu
Fri Jun 9 15:49:51 EDT 2006


> On 6/9/06 1:59 PM, "Chris Fields" <cjfields at uiuc.edu> wrote:
> 
> > No; I saw the same thing here.  It's not FASTA in the traditional sense:
> >
> > http://www.bioperl.org/wiki/FASTA_sequence_format
> >
> > though he did get it to build a database successfully.  Well, 'success' in
> > the sense that no errors were thrown.  I've learned the absence of error
> > messages does not necessarily mean that everything went as planned; it
> > depends on how much error handling has been added to the module by the
> > submitting author.
> >
> > It's possible that the second annotation line was ignored completely.  I
> > suppose it's also possible that two sequences are entered into the database,
> > an empty sequence for the first '>' line and the full sequence for the
> > second.  It's all dependent on how the parser handles this.
> 
> I think that Senthil was pointing out that even though >Antisense looks to
> be on its own line, it isn't, but is simply a continuation of the FASTA
> header.  Judging from the context, that is the only interpretation that
> makes sense.
> 
> Sean

Sorry.  Just checked through another mail client and you're right.  That's
what I get for trusting Mr. Gates (stupid Outlook).  I have seen a few funky
FASTA derivations, so I thought that's what was going on here.  My bad!

My point, though erroneous, was that the FASTA format parser may not parse
this data correctly if he did have two description lines, and may not
indicate that there is a problem by throwing an exception.  I demonstrated
that using Bio::SeqIO as an example (you get empty sequences).
Bio::Index::Fasta parses the file itself, using this loop to index:

	# Main indexing loop
	while (<FASTA>) {
		if (/^>/) {
			# $begin is the position of the first character after the '>'
			my $begin = tell(FASTA) - length( $_ ) + 1;

			foreach my $id (&$id_parser($_)) {
				$self->add_record($id, $i, $begin);
			}
		}
	}

This simply looks for '>' at the start of a line.  That's fine for the vast
majority of sequences.  I thought it would be nice to have something a
little more rigorous that verifies the format rather than trusting it
implicitly, maybe by using an eval{} block to make sure the format is
FASTA-like and the sequence looks like DNA/RNA/protein.
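
As a rough sketch of that idea (this is not BioPerl code; check_fasta is a
hypothetical helper, and the allowed-character set is an assumption that
would need tightening for real data), a stricter check could die on a header
with no sequence or on odd characters, with the caller wrapping it in eval{}:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl: reads FASTA-like text from a
# filehandle, dies on the first structural problem, returns the record count.
sub check_fasta {
    my ($fh) = @_;
    my ($header, $seq_len, $count, $lineno) = (undef, 0, 0, 0);

    while (my $line = <$fh>) {
        $lineno++;
        $line =~ s/\r?\n\z//;        # strip LF or CRLF line endings
        next if $line =~ /^\s*$/;    # ignore blank lines

        if ($line =~ /^>/) {
            # A new header closes the previous record, which must have
            # contained at least one residue.
            die "record '$header' has no sequence (line $lineno)\n"
                if defined $header && $seq_len == 0;
            $header  = $line;
            $seq_len = 0;
            $count++;
        }
        else {
            die "sequence data before any '>' header (line $lineno)\n"
                unless defined $header;
            # Assumption: letters plus '*' and '-' roughly cover
            # DNA/RNA/protein codes, stops, and gaps.
            die "unexpected characters in sequence (line $lineno)\n"
                unless $line =~ /^[A-Za-z*\-]+$/;
            $seq_len += length $line;
        }
    }
    die "record '$header' has no sequence (end of file)\n"
        if defined $header && $seq_len == 0;
    return $count;
}

# Usage: the two-header case discussed in this thread, read from an
# in-memory filehandle so the example needs no file on disk.
my $two_headers = <<'END';
>probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
>Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
END
open my $fh, '<', \$two_headers or die "cannot open in-memory file: $!";
my $n = eval { check_fasta($fh) };
print defined $n ? "OK: $n record(s)\n" : "Not FASTA-like: $@";
```

With a genuine second '>' line, as above, the check fails on the first
record instead of silently yielding an empty sequence.  Whether something
like this belongs in the indexer or in a separate validation step is an open
design question.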

Chris


> >> |> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
> >> |> >Antisense;
> >> |> TGGCTCCTGCTGAGGTCCCCTTTCC
> >> |
> >> |Unfortunately that's not Fasta format (which only has a single header
> >> |line starting with a '>').  I'd imagine that most programs which deal
> >> |with fasta which read that entry would see it as two sequences, the
> >> |first of which is empty.
> >> |
> >>
> >> [snipped]
> >>
> >> hi,
> >>
> >> I think the file is in fasta format and probably you might have seen it
> >> differently because of your mail transport agent.
> 