Wed, 28 Feb 2001 11:16:39 -0600
I apologize up front for my newbie questions (and for cross-posting; is that
against etiquette? I am working in both Perl and Java, though...). You seem
like a nice crowd, based on my brief surf of the archives. I will continue
looking in the archive for something relating to my question, but thought I
would ask as well...
I have questions about data mining and BLAST with GCG or NCBI. I am fairly
green in biology and bioinformatics, so that doesn't help either. I am
supporting some researchers, and they want me to run multiple BLAST searches,
then parse the output for them. So far there are two purposes. One is data
mining: looking for new submissions in GenBank that might be similar to genes
of interest. The other is SNP discovery. Both parse the BLAST output and pull
out just a little information. I have thought of a few ways to do this; there
are surely more, and maybe some of you have already solved this?
I currently have some .csh scripts that fork off as many BLAST jobs as there
are sequences or sequence files in a list. Once all the jobs complete, I
parse the output. For SNP discovery I use the view=1 option on blast, do
frequency counts, and track the reference numbers at positions where
there are differences.
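As a rough illustration only, the counting step for SNP discovery can be
sketched in Python. This assumes the aligned rows have already been pulled out
of the BLAST output as strings the same length as the query, with spaces where
a hit does not cover a position. That layout, the function name
column_differences, and the IDs are all made up for the example; this is not
the actual view=1 format.

```python
from collections import Counter


def column_differences(reference, aligned_rows):
    """Count bases per alignment column and note which hits differ.

    reference    -- the query sequence as a string (hypothetical input)
    aligned_rows -- dict of hit ID -> aligned string, same length as the
                    reference, padded with spaces where the hit has no data
    Returns {1-based position: (Counter of bases, [IDs that differ])},
    containing only the positions where at least one hit differs.
    """
    report = {}
    for pos, ref_base in enumerate(reference, start=1):
        counts = Counter()
        differing = []
        for hit_id, row in aligned_rows.items():
            base = row[pos - 1]
            if base == " ":  # this hit does not cover the position
                continue
            counts[base] += 1
            if base != ref_base:
                differing.append(hit_id)
        if differing:
            report[pos] = (counts, differing)
    return report


# Made-up rows: two of the three hits carry a T where the query has a G.
rows = {"hit1": "ACGTAC", "hit2": "ACTTAC", "hit3": "  TTAC"}
for pos, (counts, ids) in column_differences("ACGTAC", rows).items():
    print(pos, dict(counts), ids)
```

A position where most hits agree with each other but not with the query is the
kind of candidate worth a second look; a single odd base is more likely a
sequencing error.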
For data mining, the researchers first have to establish a baseline by
creating a file of the IDs they already know about. Then I can BLAST against
the new entries (gbnew) and filter out anything they may have already seen,
along with anything whose E-value is too weak (too high). I can then give
them a short(er) list of new things that may be of interest to them.
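The filtering step can be sketched roughly as follows in Python, assuming the
hits have already been parsed into (ID, E-value) pairs and the baseline is a
plain text file with one ID per line. The function names, the file layout, and
the default cutoff of 1e-5 are placeholders for the example, not recommended
values.

```python
def load_known_ids(path):
    """Read the baseline file: one ID per line, blank lines ignored."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_new_hits(hits, known_ids, max_evalue=1e-5):
    """Keep hits that are both unseen and significant enough.

    hits       -- iterable of (hit_id, evalue) pairs, already parsed
    known_ids  -- collection of IDs the researchers have already reviewed
    max_evalue -- hits with a larger (weaker) E-value are dropped;
                  the default here is an arbitrary placeholder
    Returns the surviving pairs, strongest (smallest E-value) first.
    """
    known = set(known_ids)
    kept = [
        (hit_id, evalue)
        for hit_id, evalue in hits
        if hit_id not in known and evalue <= max_evalue
    ]
    return sorted(kept, key=lambda hit: hit[1])


# Made-up IDs and E-values, purely to show the call.
hits = [("idA", 1e-50), ("idB", 2e-30), ("idC", 0.5), ("idD", 1e-8)]
for hit_id, evalue in filter_new_hits(hits, known_ids={"idA"}):
    print(hit_id, evalue)
```

Appending the IDs that were reported to the baseline file after each run would
keep the same entries from showing up again next time.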
Is anybody doing something like this? Are there better solutions? Has anyone
changed BLAST itself? It might be better to change the source of the data
rather than parse the output, but I am not sure how feasible that is either.