ESTScan is a program that can detect coding regions in DNA sequences, even if they are of low quality. ESTScan will also detect and correct sequencing errors that lead to frameshifts.
ESTScan is neither a gene prediction program nor an open reading frame detector. Its strength is that it does not require an open reading frame to detect a coding region. As a result, the program may miss a few translated amino acids at either the N or the C terminus, but it detects coding regions with high selectivity and sensitivity.
ESTScan takes advantage of the bias in hexanucleotide usage found in coding regions relative to non-coding regions. This bias is formalized as an inhomogeneous 3-periodic fifth-order Hidden Markov Model (HMM). Additionally, the HMM of ESTScan has been extended to allow insertions and deletions when these improve the coding region statistics.
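To make the idea of a 3-periodic fifth-order model concrete, here is a minimal Python sketch. It is not ESTScan's implementation: the function names, the pseudocount and the uniform background are assumptions chosen for illustration, and insertions and deletions are not modeled. For each codon position, it estimates the probability of a nucleotide given the five preceding ones from in-frame coding sequences, then scores a sequence in each of the three frames as a log-odds against the background.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"
ORDER = 5  # fifth-order: each nucleotide is conditioned on the 5 before it


def train_periodic_model(coding_seqs, pseudocount=1.0):
    """Estimate P(base | previous 5 bases, codon position) from in-frame CDS."""
    # counts[phase][context][base]; phase is the codon position (0, 1 or 2)
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(3)]
    for seq in coding_seqs:
        seq = seq.upper()
        for i in range(ORDER, len(seq)):
            hexamer = seq[i - ORDER:i + 1]
            if all(b in ALPHABET for b in hexamer):
                counts[i % 3][hexamer[:ORDER]][hexamer[ORDER]] += 1.0

    def prob(phase, context, base):
        # Additive smoothing so that unseen hexamers keep a non-zero probability
        seen = counts[phase].get(context, {})
        total = sum(seen.values()) + pseudocount * len(ALPHABET)
        return (seen.get(base, 0.0) + pseudocount) / total

    return prob


def frame_scores(seq, prob):
    """Log-odds score (in bits) of seq in each frame vs. a uniform background."""
    seq = seq.upper()
    scores = []
    for frame in range(3):
        total = 0.0
        for i in range(ORDER, len(seq)):
            hexamer = seq[i - ORDER:i + 1]
            if all(b in ALPHABET for b in hexamer):
                phase = (i - frame) % 3  # codon position of base i in this frame
                p = prob(phase, hexamer[:ORDER], hexamer[ORDER])
                total += math.log2(p / 0.25)
        scores.append(total)
    return scores
```

The frame with the highest score is the most likely reading frame under this toy model. A real HMM would additionally model non-coding states and the transitions between states.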
The first two versions are Perl modules and require the BTLib
Perl module.
All versions of ESTScan can be downloaded from the SourceForge
page.
Using ESTScan requires score matrices that reflect the codon
preferences of the organism under study.
These matrices can be built with scripts found in the new
estscan-3.0 tarball or in the estscan-devel RPM packages.
These tools also require the BTLib module.
Because some users have had difficulty creating the matrices, a user
guide has been written and is available in the documentation section
of the SourceForge page.
This guide walks users through the first steps of generating the
matrices.
Creating the matrices requires writing a configuration file
that looks as follows (more details in the user guide):
################################################################################
#
# Parameters for the mouse
# (use PERL syntax!)
#
$organism = "Mus musculus";
$dbfiles = "/db/refseq/release/mus*.gbff /db/refseq/new/mus*.gbff /db/embl/86/mus*.dat /db/embl/new/mus*.dat";
$ugdata = "/db/unigene/Mm.data";
$estdata = "/db/dbest/est_mus-??.seq";
$datadir = "/export/scratch/ludi/ESTScan/Mm";
$nb_isochores = 2;
$tuplesize = 6;
$minmask = 30;
#
# End of File
#
################################################################################
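Before launching the matrix-building scripts, it can be useful to check that the file patterns in the configuration actually match something on disk. The following Python sketch is a hypothetical helper, not part of ESTScan. It assumes the configuration contains only simple `$name = value;` assignments like the example above, and that `$dbfiles`, `$ugdata` and `$estdata` hold space-separated shell-style patterns.

```python
import glob
import re

# Matches simple Perl scalar assignments such as:
#   $tuplesize = 6;    or    $organism = "Mus musculus";
ASSIGNMENT = re.compile(r'^\s*\$(\w+)\s*=\s*(?:"([^"]*)"|([^;\s]+))\s*;')


def read_simple_config(text):
    """Return {name: value} for simple scalar assignments; comment lines are skipped."""
    params = {}
    for line in text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            name, quoted, bare = match.groups()
            params[name] = quoted if quoted is not None else bare
    return params


def unmatched_patterns(params, keys=("dbfiles", "ugdata", "estdata")):
    """List the file patterns under the given keys that match no existing file."""
    missing = []
    for key in keys:
        for pattern in params.get(key, "").split():
            if not glob.glob(pattern):
                missing.append(pattern)
    return missing
```

Calling `unmatched_patterns(read_simple_config(open(path).read()))` on a configuration file returns the patterns that need fixing. An empty list means every pattern matched at least one file.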
Matrices for some organisms have already been created and are available
in the download section of the SourceForge page.
Some experiments have been carried out to facilitate the choice of
parameters. The details can be found in the Master's report of L. Rielle,
available in the same download section as the matrices.
Further information is available on the SourceForge page.
An online version is available on the Swiss EMBNet node.
There is also a mailing list dedicated to ESTScan. Information is available here.
Lottaz C, Iseli C, Jongeneel CV, Bucher P. (2003)
Modeling sequencing errors by combining Hidden Markov models.
Bioinformatics 19, 103-112.
Iseli C, Jongeneel CV, Bucher P. (1999)
ESTScan: a program for detecting, evaluating, and reconstructing
potential coding regions in EST sequences.
Proc Int Conf Intell Syst Mol Biol. 138-48.
C. Lottaz, Master's report
Wasmuth JD, Blaxter ML. (2004)
prot4EST: Translating Expressed Sequence Tags from neglected genomes.
BMC Bioinformatics 5:187.