[Bioperl-l] arabidopsis + load_seqdatabase.pl

Hilmar Lapp hlapp at gmx.net
Tue Dec 20 13:28:27 EST 2005

100% agreed.

Angshu, it goes a long way if you are precise in the way you state 
things. Otherwise people either have great difficulty understanding what 
exactly your problem is, or you come across as clueless, or both.

So, do keep in mind that it is the Bioperl SeqIO parser that does the 
parsing, not load_seqdatabase.pl. Also, load_seqdatabase.pl doesn't 
manipulate the sequence object returned by the parser, nor does 
bioperl-db. If accession, identifier, and name have the same values, 
then that is what the SeqIO parser does - and probably it does so for a 
reason. If you don't like the way SeqIO builds the object then write 
your own SeqProcessor as mentioned before and you are free to entirely 
rearrange the object. If you feel the SeqIO parser is in error then 
file or post a bug report, in which you will need to state what exactly 
it is that you find to be in error.

So, please stop asking for files to be parsed correctly - they *are* 
parsed correctly. Instead, take a moment to step back and read Sean's 
email again and then stick to the advice given there:

	1) What exactly is it out of those or any other files that you want 
represented in biosql? Not a single one of your answers indicates that 
you know the answer to this question. In other words, not a single one 
of your answers indicates that you know precisely what you want. How do 
you expect others to help you achieve what you want if you don't even 
know what you want, let alone be able to explain it to others?

	2) What have you tried to get what you need? Why did the outcome of 
those attempts fall short of what you want? When doing so, do not label 
software you used in the process as yielding 'incorrect' results unless 
you can back that up with a solid bug report, because almost always the 
one who is 'incorrect' is you, expecting things that you shouldn't have 
expected, or executing things wrongly.

For example, you could state that 1) what you want is all A.thaliana 
transcripts in a biosql database, with each bioentry ideally being a 
CDS, or at least a transcript, with as much annotation as contained in 
the input file, and that 2) you used the NC_* records in GenBank format 
with load_seqdatabase.pl but found yourself with only the contigs as 
bioentries, not the transcripts or CDS records.

At this point, most people will have understood what your goal is, and 
people more experienced in bioperl will also have understood that you 
fell for a common misconception many people new to bioperl have, namely 
to confuse features with the main entry (i.e., sequence).

It would then have been straightforward to point out that your desired 
CDS annotation is surely present in your Biosql instance (if annotated 
in the NC* record), but as rows in seqfeature because they were 
features on the sequences as they came out of the parser.
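
To make that concrete, here is a minimal sketch in Python using an 
in-memory SQLite database. The three tables are a heavily simplified, 
assumed subset of the biosql schema (the real tables have more columns 
and constraints), and the accession NC_EXAMPLE and all rows are made up 
for illustration. It only shows where to look: one bioentry row for the 
contig, and the CDS annotation as seqfeature rows attached to it.

```python
import sqlite3

# Simplified, assumed subset of the biosql tables. The real schema has
# more columns and constraints (see doc/schema-overview.txt in biosql).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    accession   TEXT NOT NULL,
    name        TEXT NOT NULL
);
CREATE TABLE term (
    term_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE seqfeature (
    seqfeature_id INTEGER PRIMARY KEY,
    bioentry_id   INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    type_term_id  INTEGER NOT NULL REFERENCES term(term_id),
    rank          INTEGER NOT NULL
);
""")

# One contig-level record with one gene and two CDS features (made-up data).
conn.execute("INSERT INTO bioentry VALUES (1, 'NC_EXAMPLE', 'NC_EXAMPLE')")
conn.executemany("INSERT INTO term VALUES (?, ?)", [(1, "gene"), (2, "CDS")])
conn.executemany(
    "INSERT INTO seqfeature VALUES (?, ?, ?, ?)",
    [(1, 1, 1, 1), (2, 1, 2, 1), (3, 1, 2, 2)],
)


def count_bioentries(conn):
    """Number of first-class entries (here: contigs) in the database."""
    return conn.execute("SELECT COUNT(*) FROM bioentry").fetchone()[0]


def count_features(conn, accession, feature_type):
    """Number of features of the given type attached to one bioentry."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM seqfeature sf
        JOIN bioentry be ON be.bioentry_id = sf.bioentry_id
        JOIN term t ON t.term_id = sf.type_term_id
        WHERE be.accession = ? AND t.name = ?
        """,
        (accession, feature_type),
    ).fetchone()
    return row[0]
```

Calling count_bioentries(conn) counts only the contig, while 
count_features(conn, "NC_EXAMPLE", "CDS") finds the CDS rows that never 
show up in bioentry.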

It would also be straightforward to suggest a solution to this problem, 
namely by either writing a SeqProcessor that converts the CDS features 
of the contig sequences to first-class sequence objects (that's for 
instance what I do for the EMBL formatted Ensembl dumps), or by using 
an input file that has transcripts as primary records instead of as 
feature annotation on a contig.
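
The first option can be sketched as follows. This is plain Python, not 
Bioperl and not the actual SeqProcessor interface: the Entry and Feature 
classes, the promote_cds function, and the fallback accession scheme are 
all hypothetical names chosen for illustration. The sketch also assumes 
simple forward-strand features with a single start and end, so it 
ignores strand and split (joined) locations, which a real 
implementation would have to handle.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    primary_tag: str          # e.g. "gene" or "CDS"
    start: int                # 1-based, inclusive
    end: int                  # 1-based, inclusive
    tags: dict = field(default_factory=dict)


@dataclass
class Entry:
    accession: str
    sequence: str
    features: list = field(default_factory=list)


def promote_cds(entry):
    """Return one new first-class Entry per CDS feature of a contig entry.

    Assumes forward-strand, contiguous features (no strand handling,
    no joined locations).
    """
    promoted = []
    cds_features = (f for f in entry.features if f.primary_tag == "CDS")
    for n, feat in enumerate(cds_features, start=1):
        # Prefer an identifier carried by the feature; otherwise derive
        # one from the contig accession (hypothetical naming scheme).
        accession = feat.tags.get("protein_id", f"{entry.accession}.cds{n}")
        subsequence = entry.sequence[feat.start - 1:feat.end]
        promoted.append(Entry(accession=accession, sequence=subsequence))
    return promoted


# Made-up contig with one gene and two CDS features.
contig = Entry(
    accession="NC_EXAMPLE",
    sequence="AAATGGCCTAAGGG",
    features=[
        Feature("gene", 3, 11),
        Feature("CDS", 3, 11, {"protein_id": "NP_EXAMPLE"}),
        Feature("CDS", 12, 14),
    ],
)
records = promote_cds(contig)
```

Each element of records would then be stored as its own bioentry, 
instead of the single contig with its features.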

Given that, you could then set out to locate GenBank or EMBL formatted 
files containing A.thaliana transcripts as their records, instead of 
asking others to search the internet for you.

I'm afraid that as far as I'm concerned I won't be able to lend you any 
more of my time unless you are specific and precise in stating what 
your goal is and what you got.


On Dec 20, 2005, at 9:35 AM, Brian Osborne wrote:

> Angshu,
>> I want them to be correctly parsed.
> They have been correctly parsed but you're looking in the wrong place. The
> names and identifiers associated with things like "CDS" or "gene" will not
> be found in the Bioentry table. The Bioentry is the entire NC_* record, the
> genes, mRNAs, and proteins are called features. Read the Feature-Annotation
> HOWTO and doc/schema-overview.txt in the biosql package.
> Brian O.
> On 12/19/05 5:39 PM, "Angshu Kar" <angshu96 at gmail.com> wrote:
>> Sean,
>> I've tried .faa, .fna and .gbk files in the link mentioned below. After
>> running the script when I saw the loaded database, I saw that in the
>> bioentry table the 3 fields accession, identifier and name containing the
>> same data. Also, the version column was not populated. I want them to be
>> correctly parsed. So I want an arabidopsis data file that "goes well" with
>> the load_seqdatabase.pl script.
>> Thanks,
>> Angshu
>> On 12/19/05, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>>> On 12/19/05 3:20 PM, "Angshu Kar" <angshu96 at gmail.com> wrote:
>>>> Sean,
>>>> I've used files from
>>>> ftp://ftp.ncbi.nih.gov/genomes/Arabidopsis_thaliana/CHR_V  . But the script
>>>> cannot parse them according to biosql-schema.
>>>> So, I want some files that the script can parse correctly.
>>>> Else, I've to load each and every file onto the biodb and then check whether
>>>> it has been parsed correctly!
>>> Which file are you trying to load?  What format is it in?  What values are
>>> you expecting to be loaded that aren't?  For the answer to the last
>>> question, it will likely help folks to see exactly what line of the input
>>> file isn't being loaded as you think it should be.  For example, if there is
>>> a line in a file that contains
>>>     foo         /note="bar"
>>> Then you can point out that you would like to know where, if at all, the
>>> annotation associated with the foo tag is stored.
>>> Sean
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at portal.open-bio.org
>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
