ESTScan project

Description

ESTScan is a program that can detect coding regions in DNA sequences, even if they are of low quality. ESTScan will also detect and correct sequencing errors that lead to frameshifts.

ESTScan is not a gene prediction program, nor is it an open reading frame detector. Its strength is that it does not require an open reading frame to detect a coding region. As a result, the program may miss a few translated amino acids at either the N or the C terminus, but it detects coding regions with high selectivity and sensitivity.

Method

ESTScan takes advantage of the bias in hexanucleotide usage found in coding regions relative to non-coding regions. This bias is formalized as an inhomogeneous 3-periodic fifth-order Hidden Markov Model (HMM). Additionally, the HMM of ESTScan has been extended to allow insertions and deletions when these improve the coding region statistics.
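
The idea of a 3-periodic fifth-order model can be illustrated with a small Python sketch. This is not ESTScan's code: the function names (train, log_prob, coding_score), the add-one smoothing and the toy training sequences are all assumptions made for illustration, and the sketch leaves out the HMM states and the insertion/deletion handling entirely. It only shows how each base can be scored given the five preceding bases, with a separate table for each codon position.

```python
import math
from collections import defaultdict

ORDER = 5  # fifth-order: each base is conditioned on the 5 preceding bases


def train(seqs, periodic):
    """Count hexamers (5-base context + next base).

    With periodic=True, one table is kept per codon position (3 tables);
    training sequences are assumed to start at the first base of a codon.
    """
    n_phases = 3 if periodic else 1
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(n_phases)]
    for seq in seqs:
        for i in range(ORDER, len(seq)):
            phase = i % 3 if periodic else 0
            counts[phase][seq[i - ORDER:i]][seq[i]] += 1
    return counts


def log_prob(counts, phase, context, base):
    """Add-one smoothed log P(base | context) for the given codon position."""
    row = counts[phase].get(context, {})
    total = sum(row.values())
    return math.log((row.get(base, 0) + 1) / (total + 4))


def coding_score(seq, coding, noncoding, frame=0):
    """Log-odds of coding versus non-coding, assuming a codon starts at index `frame`."""
    score = 0.0
    for i in range(ORDER, len(seq)):
        phase = (i - frame) % 3
        context, base = seq[i - ORDER:i], seq[i]
        score += log_prob(coding, phase, context, base)
        score -= log_prob(noncoding, 0, context, base)
    return score


# Toy data only; real tables would be estimated from many annotated sequences.
coding_model = train(["ATGGCC" * 20], periodic=True)
noncoding_model = train(["ATATTTAAATTTTAAATATA" * 5], periodic=False)
for frame in range(3):
    print(frame, coding_score("ATGGCC" * 3, coding_model, noncoding_model, frame))
```

Scoring the same sequence in the three possible frames and keeping the best one is the simplest way to use such tables. A full HMM additionally decides where the coding region starts and ends.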

Versions of ESTScan

ESTScan 1.3: the old version
It works using false positive rate matrices. There is no particular reason to use this version unless you need to reproduce old results, or to compare it with newer versions.
ESTScan 2.1: the current version
This version can print text annotation that links exon splice junctions and protein sequences, through the -ft option.
ESTScan 3.0: a rewrite of ESTScan2 in pure C
This version has the advantage that it does not require Perl, but it lacks some features. Use it if you have trouble with the Perl modules.

The first two versions are Perl modules, and require the BTLib Perl module.
All versions of ESTScan can be downloaded from the SourceForge page.

Matrices for ESTScan

Using ESTScan requires score matrices that reflect the codon preferences of the organisms under study.
These matrices can be built with scripts found in the new estscan-3.0 tarball or in the estscan-devel RPM packages. These tools also require the BTLib module.
Because some people have had problems creating the matrices, a user guide has been written and is available in the documentation section of the SourceForge page. This guide assists users during the first steps that lead to the generation of the matrices.
Creating matrices requires writing a configuration file that looks as follows (more details in the user guide):

################################################################################
#
# Parameters for the mouse
# (use PERL syntax!)
#

$organism = "Mus musculus";
$dbfiles = "/db/refseq/release/mus*.gbff /db/refseq/new/mus*.gbff /db/embl/86/mus*.dat /db/embl/new/mus*.dat";
$ugdata = "/db/unigene/Mm.data";
$estdata = "/db/dbest/est_mus-??.seq";

$datadir = "/export/scratch/ludi/ESTScan/Mm";
$nb_isochores = 2;
$tuplesize = 6;
$minmask = 30;

#
# End of File
#
################################################################################
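
Because the configuration file is Perl source, a typo in it may only surface when the matrix-building scripts run. The Python sketch below is a hypothetical helper and not part of ESTScan: it reads simple `$name = value;` assignments such as those shown above, so the values can be inspected beforehand. The function name and the regular expression are assumptions, and anything beyond plain quoted strings and integers is ignored.

```python
import re

# Matches lines such as:  $tuplesize = 6;   or   $organism = "Mus musculus";
ASSIGN = re.compile(r'^\s*\$(\w+)\s*=\s*(?:"([^"]*)"|(-?\d+))\s*;')


def read_simple_perl_config(text):
    """Return a dict of the simple scalar assignments found in `text`.

    Comment lines and anything that is not a plain string or integer
    assignment are skipped; this is not a Perl parser.
    """
    settings = {}
    for line in text.splitlines():
        match = ASSIGN.match(line)
        if match:
            name, string_value, int_value = match.groups()
            if string_value is not None:
                settings[name] = string_value
            else:
                settings[name] = int(int_value)
    return settings


example = '''
# Parameters for the mouse
$organism = "Mus musculus";
$nb_isochores = 2;
$tuplesize = 6;
$minmask = 30;
'''
for name, value in read_simple_perl_config(example).items():
    print(name, "=", repr(value))
```

A check of this kind can confirm, for example, that the tuple size is an integer before a long matrix-building run is started.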

Matrices for some organisms have been created and are available in the download section of the SourceForge page.
Some experiments have been carried out to make the choice of parameters easier. The details can be found in the Master's report of L. Rielle, available in the same download section as the matrices.

Links

Information on the SourceForge page.

An online version is available on the Swiss EMBNet node.

There is also a mailing list dedicated to ESTScan. Info available here.

References

Lottaz C, Iseli C, Jongeneel CV, Bucher P. (2003)
Modeling sequencing errors by combining Hidden Markov models
Bioinformatics 19, 103-112.

Iseli C, Jongeneel CV, Bucher P. (1999)
ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences.
Proc Int Conf Intell Syst Mol Biol. 138-48.

C. Lottaz, Master's report

Wasmuth JD, Blaxter ML. (2004)
prot4EST: Translating Expressed Sequence Tags from neglected genomes.
BMC Bioinformatics 5:187.