[Bioperl-l] automation of translation based on alignment

Chris Fields cjfields at illinois.edu
Tue Mar 23 00:53:48 EDT 2010

On Mar 22, 2010, at 3:51 PM, Chris Larsen wrote:

> ...
> 3.
> Chris F said:
>> To preface this, any reason you're not translating the alignment sequences using the above sequence's features as a reference?
> A logical place to start. But they are usually not given. In addition to the above reason, data for viral sequences are scarcer, since fewer grad students want to sequence things that maim you or make you hurl if you screw up on the nucleic acid extraction. Also, the locations for protein processing sites can be variable, like > or < instead of a real location in the string. So, the GenBank file isn't really very good as a reference, 5% of the time. Last, if there are three child proteins from a CDS, and one is made by a host protease, one by a viral protease, and one by a start codon, what do you say is 'mature'? What should be in the 'feature' field? It's not standardized right now. Nobody has this nailed at NCBI or UniProt.
> Still, like Chris says, a script that asks first for the coordinates, and takes that as the first go-round, is best. The GenBank coords, when provided, are accurate most of the time. After that, you end up comparing everything and making your choice.

Yes, in this case nothing will be an immediate, perfect solution.  It will take some additional work.
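
As a rough illustration of the "coordinates first" idea, here is a minimal sketch in plain Python (not BioPerl) that checks whether a simple GenBank-style location has exact ends or the "<" / ">" markers mentioned above. The function names and the returned dictionary are hypothetical, and only plain "start..end" locations are handled; anything else (join(...), complement(...), and so on) is simply flagged for the manual comparison step.

```python
import re

# Matches simple locations such as "10..250", "<10..250" or "<10..>250".
# Assumption: compound locations (join, complement, ...) are out of scope.
_LOCATION = re.compile(r"^(<?)(\d+)\.\.(>?)(\d+)$")


def parse_location(loc):
    """Return start/end and whether each end is exact, or None if unparsed."""
    match = _LOCATION.match(loc.strip())
    if match is None:
        return None
    start_mark, start, end_mark, end = match.groups()
    return {
        "start": int(start),
        "end": int(end),
        "start_exact": start_mark == "",
        "end_exact": end_mark == "",
    }


def needs_alignment_check(loc):
    """True when the coordinates cannot be trusted as given."""
    parsed = parse_location(loc)
    if parsed is None:
        return True
    return not (parsed["start_exact"] and parsed["end_exact"])


if __name__ == "__main__":
    for loc in ("10..250", "<10..>250", "join(1..5,7..9)"):
        print(loc, needs_alignment_check(loc))
```

Features that pass this check would be used as the first go-round; the rest would fall through to the alignment-based comparison.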

> 4.
> Last thoughts:
> * We tried BL2Seq to align query to target one at a time, with good reference sequences. It works, for exactly what you ask for. But! Only in a few virus families. And, it's 1200 lines long, doing error checking; as you say, it's just not easy. Pulling an HSP from a blast report leaves one with a lot of end trimming and comparing to do, since the HSP ends in an identity, and well, sometimes viruses vary at the point of cleavage of proteins. Good luck with that task, it gave us fits. It's not really appropriate to look at the ends of the HSP and say they are right. It requires that extra code. Still, we may open that code to the public after April database release. It only works for well conserved viruses. (I know... Jumbo Shrimp).

Might be nice to see what you've done, whenever that is ready.

> * I know of no BioPerl module that can parse an MSA and take out the relevant alignments, so you don't have to assign a reference sequence from scratch, every time you do this. Is there one?

If you mean pulling out sets of sequences from a larger alignment or slices of alignments, there should be methods within Bio::SimpleAlign to do this, yes.
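
To make the slicing idea concrete, here is a minimal sketch in plain Python (not BioPerl, and not the Bio::SimpleAlign API) of the underlying logic: map the reference sequence's ungapped feature coordinates to alignment columns, cut the same columns out of every row, strip gaps, and translate. It assumes the alignment is a dict of equal-length gapped nucleotide strings, that "-" is the only gap character, and that the standard genetic code applies; all function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def column_of_residue(aligned_ref, residue_number):
    """1-based alignment column holding the Nth ungapped reference residue."""
    seen = 0
    for column, char in enumerate(aligned_ref, start=1):
        if char != "-":
            seen += 1
            if seen == residue_number:
                return column
    raise ValueError("residue number beyond end of reference")


def translate(nucleotides):
    """Translate complete codons; unknown codons become 'X'."""
    nt = nucleotides.upper()
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X")
                   for i in range(0, len(nt) - 2, 3))


def translate_by_reference(alignment, ref_id, start, end):
    """Slice every row at the reference feature's columns and translate it."""
    ref_row = alignment[ref_id]
    first = column_of_residue(ref_row, start)
    last = column_of_residue(ref_row, end)
    return {name: translate(row[first - 1:last].replace("-", ""))
            for name, row in alignment.items()}


if __name__ == "__main__":
    aln = {
        "ref": "CC-ATGGCTTAA",
        "s1":  "CCTATGGCATAA",
        "s2":  "CC-ATG---TAA",
    }
    # Feature at ungapped reference positions 3..11.
    print(translate_by_reference(aln, "ref", 3, 11))
```

Note that this does nothing about frameshifting indels or variable cleavage sites; rows whose sliced length is not a multiple of three would still need the extra checking discussed above.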

> * Sometimes the features on viruses are named differently: /mat_peptide, /sig_peptide; sometimes they are named differently in /note or /product. There is no standard for much of this. It needs to be proposed. Maybe we can do that together.
> * If you want to use a synoptic MSA for all Hepatitis B viruses, and then pull the alignments out of that, I'd love to talk to you. The VBRC used precomputed MSAs for all their virus families and made a little progress. We are looking into that code.
> All ideas. Nothing set in stone. Dialog welcome.
> Good luck all.
> Chris
> -- 
> Christopher Larsen, Ph.D.
> Sr. Scientist / Grants Manager
> Vecna Technologies
> 6404 Ivy Lane #500
> Greenbelt, MD 20770
> Phone: (240) 965-4525
> Fax: (240) 547-6133
> clarsen at vecna.com

Very nice summary of the problems in the field.  Thanks!

