ProSplicer
A Putative Alternative Splicing Database Based on Proteins, mRNAs and EST
Clusters
Jorng-Tzong Horng, Hsien-Da Huang, Chau-Chin Lee, and Baw-Jhiune Liu
Department of Computer Science and Information Engineering, Department of Life
Science,
National Central University, Taiwan,
Department of Computer Science and Engineering, Yuan-Ze University, Taiwan
horng@db.csie.ncu.edu.tw*
Abstract
In eukaryotes, alternative splicing can produce
variant proteins and expression patterns. ProSplicer is an automatically
generated database of putative alternative splicing, built by aligning
proteins, mRNA sequences, and ESTs against the genomic sequences. Protein
sequences, messenger RNA sequences, and expressed sequence tags (ESTs) provide
valuable information about the splice variants of genes. Each EST and mRNA
sequence is aligned separately to the corresponding genomic sequence by SIM4
[10], and each protein is aligned by BLAST. The database also provides the
splice sites and the alternative splicing forms observed in different tissues,
e.g., exon skipping in particular tissues. This information helps users
investigate the alternative splicing forms and tissue-specific expression of
the genes of interest. Some other alternative splicing databases are
constructed from gene annotations that mention alternative splicing, or
predict alternative splice sites by aligning ESTs to genomic sequences. PALS db
[17] takes the longest mRNA sequence in each UniGene cluster [7] as the
reference sequence, which is aligned with the ESTs and mRNAs in the same
cluster to predict alternative splicing sites. ProSplicer is available at
http://bioinfo.csie.ncu.edu.tw/ProSplicer/.
Contact: horng@db.csie.ncu.edu.tw
1. INTRODUCTION
Alternative splicing is a widely occurring phenomenon and a major mechanism for
controlling the expression of cellular and viral genes. It changes how a gene
acts in different tissues and developmental states by generating distinct mRNA
isoforms composed of different selections of exons, which in turn produce
variant proteins. The phenomenon is widespread in the human genome, although it
was commonly believed that alternative splicing occurs in only about 30% to 40%
of all genes [1, 5].
The number of available sequences, i.e., proteins, mRNAs, and ESTs, is
increasing exponentially, which makes it possible to decipher alternative
splicing forms with computational methods such as BLAST. Sequenced mRNAs and
ESTs provide gene expression evidence that reveals the alternative splicing of
genes. Protein sequences can likewise be aligned against the genomic sequences
by a translated search.
Several alternative splicing databases, such as AsMamDB [8], ASDB [12], and
SpliceDB [11], are constructed from gene annotations containing the keywords
"Alternative Splicing". AsMamDB [8] contains information about the alternative
splicing of several mammals. SpliceNest [6], SpliceDB [11], AsMamDB [8], and
HASDB [4] map clustered ESTs onto human genomic DNA to compute gene structures
and splice variants. PALS db [17] takes the longest mRNA sequence in each
UniGene [7] cluster as the reference sequence, which is aligned with the ESTs
and mRNA sequences in the same cluster to predict alternative splicing sites.
Table 1 compares several alternative splicing databases in the public domain.
The column "Referenced Sequence" indicates whether genomic sequences or the
longest mRNA sequence in each UniGene cluster is used as the reference.
"Supported Materials" shows which materials, i.e., proteins, mRNAs, or ESTs,
are analyzed to investigate the alternative splicing forms of genes.
"Alignment Tool" gives the tool used in each approach, and "Literature Search"
indicates whether the data were also collected through literature search.
Table 1. Comparison between different alternative splicing databases.
| Database | Referenced Sequence | Supported Materials: Proteins | Supported Materials: mRNAs | Supported Materials: ESTs | Statistics | Alignment Tool | Literature Search | Organisms |
|---|---|---|---|---|---|---|---|---|
| SpliceNest [6] | Genomic Sequence | | Yes | Yes | 90,000 EST clusters | REPuter [13] | | Human and Arabidopsis |
| PALS db [17] (Release 2) | mRNA | | Yes | Yes | 19,936 (human), 16,615 (mouse) UniGene clusters | BLAST | | Human and mouse |
| SpliceDB [11] | Genomic Sequence | | Yes | Yes | 43,337 splice site pairs | BLAST | Yes | Mammalian |
| AsMamDB [8] (version 1.0) | Genomic Sequence | | Yes | Yes | 1,563 alternative splicing genes | FASTA | Yes | Human, mouse, and rat |
| ASDB [12] (version 2.1) | | Yes | | | 1,922 proteins | | Yes | |
| HASDB [4] | Genomic Sequence | | Yes | Yes | 6,201 UniGene clusters | BLAST | | Human |
| ProSplicer | Genomic Sequence | Yes | Yes | Yes | 21,786 genes | BLAST (proteins), SIM4 (mRNAs and ESTs) | | Human |
ProSplicer takes the whole human genomic sequence and all genes, known or
novel, into account when investigating alternative splicing. It provides
alternative splicing information for the known and novel human genes by
aligning three major types of gene expression data, i.e., proteins, mRNAs, and
ESTs. The alternative splicing forms predicted by combining protein sequences,
mRNAs, and ESTs are more complete than those predicted from EST sequences
alone. The alternative splicing sites predicted from protein, mRNA, or EST
sequences are also provided. The tissue information attached to mRNA and EST
sequences helps reveal tissue-favored alternative splicing forms, e.g., exon
skipping. The database supports keyword search over its contents, and a
graphical interface displays the alternative splicing information of the genes
of interest.
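How tissue labels can point to a tissue-favored exon skipping form can be sketched roughly as follows. The input layout (one tissue label plus a sorted exon list per aligned mRNA or EST) and the function name `skipping_by_tissue` are assumptions made for illustration; they are not ProSplicer's actual schema.

```python
from collections import Counter


def skipping_by_tissue(forms, exon):
    """Count, per tissue, the transcript forms that skip or include `exon`.

    `forms` is a list of (tissue, exons) pairs, where `exons` is a list of
    inclusive (start, end) genomic intervals sorted by start position.
    Only forms that extend beyond both sides of `exon` are counted.
    """
    skipped, included = Counter(), Counter()
    for tissue, exons in forms:
        if not exons:
            continue
        spans_exon = exons[0][0] < exon[0] and exons[-1][1] > exon[1]
        if not spans_exon:
            continue  # the sequence is too short to say anything
        if exon in exons:
            included[tissue] += 1
        elif not any(s <= exon[1] and exon[0] <= e for s, e in exons):
            skipped[tissue] += 1
    return skipped, included


# Hypothetical transcript forms; coordinates and tissues are made up.
forms = [
    ("brain", [(100, 200), (500, 600)]),
    ("brain", [(100, 200), (500, 600)]),
    ("muscle", [(100, 200), (300, 400), (500, 600)]),
    ("lung", [(300, 400), (500, 600)]),
]
skipped, included = skipping_by_tissue(forms, (300, 400))
```

In this made-up example the exon at 300-400 is skipped only in the two brain sequences and included in the muscle sequence, which is the kind of pattern the tissue coloring in the interface is meant to expose.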
The genomic sequences and gene annotations are obtained from ENSEMBL
[15] (Release 2002,05,28), which contains 21,786 genes including known and
novel genes. The mRNA and EST sequences of the genes are retrieved from UniGene
[7] (Release 147), which contains 96,105 human gene clusters. The EST sequences
are from dbEST (Release 022202), which holds 3,991,208 EST sequences of
Homo sapiens. The protein sequences are from SWISS-PROT and TrEMBL
(Release 20, March 2002) [2]. SWISS-PROT (Release 40) contains 101,602 entries
in total, of which 29,751 are human. In addition, 25,972 rodent entries in
TrEMBL are taken into account.
The approach to predicting alternative splicing consists of three main phases:
the preprocessing phase, the alignment phase, and the filtering phase. Figure 1
shows the flow of the approach. In the preprocessing phase, the gene DNA
sequences, ESTs, mRNA sequences, and protein sequences are collected,
converted, and integrated into a single database, namely GeneInfo, to prepare
the sequences for analysis. Next, in the alignment phase, protein sequences are
aligned to the gene DNA sequences by TBLASTN, while mRNA and EST sequences are
aligned by SIM4. The resulting candidate exons are stored. In the filtering
phase, we filter the noise out of the candidate exons and connect the remaining
exons into the transcript form of each EST, mRNA, and protein sequence.
Finally, the exons of the transcript forms are provided in ProSplicer.

Figure 1. The approach to predicting alternative splicing.
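The filtering phase described above can be illustrated with a minimal sketch. The record layout `CandidateExon`, the function name, and both thresholds are assumptions for illustration only; the paper does not state the actual noise-filtering criteria.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateExon:
    seq_id: str       # identifier of the aligned protein, mRNA, or EST
    start: int        # start position on the genomic sequence
    end: int          # end position on the genomic sequence (inclusive)
    identity: float   # alignment identity in percent


def build_transcript_forms(candidates, min_identity=90.0, min_length=10):
    """Drop weak candidate exons, then chain the rest per sequence.

    Both thresholds are hypothetical placeholders, not ProSplicer's values.
    """
    forms = defaultdict(list)
    for exon in candidates:
        length = exon.end - exon.start + 1
        if exon.identity >= min_identity and length >= min_length:
            forms[exon.seq_id].append(exon)
    # Order each sequence's exons along the genome to get its transcript form.
    return {
        seq_id: sorted(exons, key=lambda e: e.start)
        for seq_id, exons in forms.items()
    }


# Hypothetical alignment output for two sequences of one gene.
candidates = [
    CandidateExon("EST_A", 500, 620, 98.5),
    CandidateExon("EST_A", 100, 250, 99.0),
    CandidateExon("EST_A", 300, 305, 100.0),  # too short, treated as noise
    CandidateExon("mRNA_B", 100, 250, 85.0),  # low identity, treated as noise
]
forms = build_transcript_forms(candidates)
```

Here `EST_A` ends up with the two-exon transcript form 100-250 and 500-620, while `mRNA_B` contributes nothing because its only candidate exon is filtered out.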
As shown in Table 2, ProSplicer takes
21,786 genes from ENSEMBL [15] and a total of 2,311,460 related sequences,
including protein, mRNA, and EST sequences, to compute the putative alternative
splicing. Table 2 also lists the number of exons predicted by the alignment
approach: 442,077, 395,619, and 12,361,685 exons are predicted by aligning
protein sequences, mRNA sequences, and EST sequences against the genomic
sequences, respectively.
Table 2. ProSplicer statistics.
| Sequence Type | Amount of Sequences | Amount of Predicted Exons |
|---|---|---|
| Protein | 44,184 (26,115 human; 18,069 mouse) | 442,077 (279,656 human; 162,421 mouse) |
| mRNA | 20,577 | 395,619 |
| EST | 2,246,699 | 12,361,685 |
| Total | 2,311,460 | 13,199,381 |
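As a quick consistency check, the totals reported in Table 2 can be recomputed from the per-type counts. The snippet below only restates the numbers from the table; the variable names are illustrative.

```python
# Counts copied from Table 2.
sequences = {"Protein": 44_184, "mRNA": 20_577, "EST": 2_246_699}
predicted_exons = {"Protein": 442_077, "mRNA": 395_619, "EST": 12_361_685}

# Protein counts split by organism: (sequences, predicted exons).
protein_by_organism = {"human": (26_115, 279_656), "mouse": (18_069, 162_421)}

total_sequences = sum(sequences.values())
total_exons = sum(predicted_exons.values())
protein_sequences = sum(n for n, _ in protein_by_organism.values())
protein_exons = sum(n for _, n in protein_by_organism.values())

print(total_sequences, total_exons, protein_sequences, protein_exons)
```

The sums agree with the "Total" row and with the protein row of the table.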
Comparing the alternative splicing forms of a gene provided in ProSplicer shows
whether an exon is left out or included relative to other protein, mRNA, or EST
sequences. Three types of alternative splicing forms are included in the
database: exon skipping, alternative 5' donor sites, and alternative 3'
acceptor sites [17]. Figure 2 illustrates the three types, and Figure 3 shows
how they appear in the graphical user interface provided by ProSplicer.

Figure 2. The three types of alternative splicing forms in the genome view.

Figure 3. The three types of alternative splicing in a ProSplicer example.
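The three alternative splicing types can be told apart by comparing exon coordinates between two transcript forms. The following is a minimal sketch under stated assumptions: both forms lie on the forward strand, exons are inclusive (start, end) genomic intervals sorted by start, and the function names are hypothetical. It is not the procedure used by ProSplicer.

```python
def overlaps(a, b):
    """Return True if two inclusive (start, end) intervals share a base."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_events(reference, variant):
    """List the splicing differences of `variant` relative to `reference`."""
    reference_set = set(reference)
    variant_set = set(variant)
    ref_starts = {start for start, _ in reference}
    ref_ends = {end for _, end in reference}
    events = []

    # Exon skipping: an internal reference exon has no counterpart in the
    # variant, while both neighbouring exons are present unchanged.
    for i in range(1, len(reference) - 1):
        exon = reference[i]
        flanks_present = (reference[i - 1] in variant_set
                          and reference[i + 1] in variant_set)
        if flanks_present and not any(overlaps(exon, v) for v in variant):
            events.append(("exon_skipping", exon))

    # On the forward strand the donor site is the 3' end of an exon and the
    # acceptor site is its 5' end.
    for exon in variant:
        if exon in reference_set:
            continue
        start, end = exon
        if start in ref_starts:
            events.append(("alternative_5_donor", exon))
        elif end in ref_ends:
            events.append(("alternative_3_acceptor", exon))
    return events


reference = [(100, 200), (300, 400), (500, 600), (700, 800)]
skip = classify_events(reference, [(100, 200), (500, 600), (700, 800)])
donor = classify_events(
    reference, [(100, 200), (300, 420), (500, 600), (700, 800)])
acceptor = classify_events(
    reference, [(100, 200), (280, 400), (500, 600), (700, 800)])
```

Each of the three variants differs from the reference by exactly one event, matching the three cases drawn in Figure 2.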
Keyword Search
ProSplicer provides several keyword search methods, such as search by Ensembl gene identification number or by gene symbol or name. The default search field is the gene symbol or gene name. When a user submits a keyword, the database returns the entries whose gene symbols or names contain the keyword.
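The default search behavior, returning entries whose symbol or name contains the keyword, can be sketched as a case-insensitive substring match. The record fields, the sample entries, and the function name are hypothetical; whether ProSplicer's matching is case-insensitive is also an assumption.

```python
def search_genes(genes, keyword, fields=("symbol", "name")):
    """Return the gene records whose selected fields contain `keyword`."""
    needle = keyword.strip().lower()
    if not needle:
        return []
    return [
        gene for gene in genes
        if any(needle in gene[field].lower() for field in fields)
    ]


# Made-up records; the identifiers are placeholders, not real Ensembl IDs.
genes = [
    {"ensembl_id": "ENSG_EXAMPLE_1", "symbol": "ABC1",
     "name": "example kinase 1"},
    {"ensembl_id": "ENSG_EXAMPLE_2", "symbol": "XYZ2",
     "name": "example transporter"},
    {"ensembl_id": "ENSG_EXAMPLE_3", "symbol": "KIN3",
     "name": "hypothetical protein"},
]

by_default = search_genes(genes, "kin")
by_id = search_genes(genes, "example_2", fields=("ensembl_id",))
```

With the default fields, the keyword "kin" matches the first record through its name and the third through its symbol; passing `fields=("ensembl_id",)` restricts the search to the identifier.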

Figure 4. The gene information in ProSplicer.
Gene Information
ProSplicer provides reference links to other biological databases and to the
sequences related to the selected gene. The annotations of a gene include its
Ensembl identification number, gene symbol, genomic location, and gene
description. As shown in Figure 4, the available reference links include GO
(Gene Ontology data) [16], HUGO (providing access to the list of currently
approved human gene symbols) [14], GeneCard (integrating human genes, their
products, and their involvement in diseases), LocusLink (organizing information
around genes to generate a central hub for accessing gene-specific
information), RefSeq (providing reference sequence standards for genomes,
transcripts, and proteins), and OMIM [3].

Figure 5. "Detailed View" graphical interface in ProSplicer.
The splicing view consists of two parts, i.e., "Overview" and "Detailed View".
The "Overview" interface shows graphically where the gene is located on the
chromosome. There are two graphic blocks in the "Detailed View". Figure 5 shows
an example of the "Detailed View" in ProSplicer. The graphical interface
provides the following functions.
A. Jump to an arbitrary sub-region inside the gene region. Fill in the sub-region in the text box and click the "Redraw" button; the graphic then focuses on the assigned region.
B. Zoom in and zoom out. The zoom adjust bar offers factors of 1/8x, 1/4x, 1/2x, 2x, 4x, and 8x relative to the current size of the view region.
C. Shift by windows. Clicking the single-triangle button shifts the view by one window, and clicking the double-triangle button shifts it by two windows.
The lower block of Figure 5 is the main graphical view of alternative splicing.
It contains the basic gene information such as D. Ensembl gene identification
number, E. gene symbol, and F. gene description. Other functions provided in the
splicing view are described as follows:
G. The quality of alignment. The color of a matching exon block represents a range of alignment quality.
H. The length of the displayed gene region.
I. Sequence Identification. Each "Sequence ID" is hyperlinked to SWISS-PROT, GenBank, or dbEST.
J. Annotation tip of position. When the mouse moves over an empty region, a tip displays the current chromosome position.
K. Annotation tip of exon. When the mouse hovers over an exon block for a few seconds, the information of the exon is shown, including the "Sequence ID", the start and end positions on the sequence, and the start and end positions on the genomic sequence.
L. Intron. Clicking on an intron block opens a new browser window showing the alignment flat file.
M. Annotation tip of tissue. When the mouse hovers over a tissue coloring block for a few seconds, the tissue information is shown, e.g., brain, muscle, or lung.
N. Exon. Clicking on an exon opens a new browser window showing the alignment flat file. The fill color of the exon indicates the alignment quality, as described in G.
O. Tissue Information. Different tissues are represented in different colors.
P. Zooming Popup menu. The graphic view can be adjusted into different scales by
selecting the item in the popup menu.
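The zoom (B, P) and shift (C) functions amount to simple arithmetic on the displayed region. The sketch below assumes that the zoom factor scales the size of the view region around its midpoint, that one "window" equals the current view length, and that the view is clamped to the gene region; none of these details is specified in the paper, and the function names are hypothetical.

```python
def zoom(view, factor, gene):
    """Scale the view region by `factor` around its midpoint."""
    start, end = view
    gene_start, gene_end = gene
    center = (start + end) / 2
    half = (end - start) * factor / 2
    new_start = max(gene_start, round(center - half))
    new_end = min(gene_end, round(center + half))
    return new_start, new_end


def shift(view, windows, gene):
    """Move the view by a number of windows (negative values move left)."""
    start, end = view
    gene_start, gene_end = gene
    offset = windows * (end - start)
    # Clamp the offset so that the view keeps its size inside the gene region.
    offset = max(gene_start - start, min(gene_end - end, offset))
    return start + offset, end + offset


gene = (0, 10_000)
view = (4_000, 6_000)

halved = zoom(view, 0.5, gene)     # (4500, 5500)
widest = zoom(view, 8, gene)       # clamped to the whole gene region
one_right = shift(view, 1, gene)   # (6000, 8000)
two_left = shift(view, -2, gene)   # (0, 2000)
```

Clamping keeps the view inside the gene region, so shifting at the edge of the gene leaves the view unchanged rather than running past the boundary.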
REFERENCES
1. A. A. Mironov, J. W. Fickett, and M. S. Gelfand. Frequent alternative splicing of human genes. Genome Res. 1999, 9, 1288-1293.
2. A. Bairoch and R. Apweiler. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000, 28, 45-48.
3. A. Hamosh, A. F. Scott, J. Amberger, C. Bocchini, D. Valle, and V. A. McKusick. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2002, 30, 52-55.
4. B. Modrek, A. Resch, C. Grasso, and C. Lee. Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res. 2001, 29, 2850-2859.
5. D. Brett, J. Hanke, G. Lehmann, S. Haase, S. Delbruck, S. Krueger, J. Reich, and P. Bork. EST comparison indicates 38% of human mRNAs contain possible alternative splice forms. FEBS Lett. 2000, 474, 83-86.
6. E. Coward, S. A. Haas, and M. Vingron. SpliceNest: visualization of gene structure and alternative splicing based on EST clusters. Trends Genet. 2002, 18 (1), 53-55.
7. G. D. Schuler, et al. A gene map of the human genome. Science 1996, 274 (5287), 540-546.
8. H. Ji, Q. Zhou, F. Wen, H. Xia, X. Lu, and Y. Li. AsMamDB: an alternative splice database of mammals. Nucleic Acids Res. 2001, 29, 260-263.
9. K. D. Pruitt and D. R. Maglott. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 2001, 29 (1), 137-140.
10. L. Florea, G. Hartzell, Z. Zhang, G. M. Rubin, and W. Miller. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 1998, 8, 967-974.
11. M. Burset, I. A. Seledtsov, and V. V. Solovyev. SpliceDB: database of canonical and non-canonical mammalian splice sites. Nucleic Acids Res. 2001, 29, 255-259.
12. M. S. Gelfand, I. Dubchak, I. Dralyuk, and M. Zorn. ASDB: database of alternatively spliced genes. Nucleic Acids Res. 2000, 28, 296-297.
13. S. Kurtz, et al. REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res. 2001, 29, 4633-4642.
14. S. Povey, R. Lovering, E. Bruford, M. Wright, M. Lush, and H. M. Wain. The HUGO Gene Nomenclature Committee (HGNC). Nomenclature Recommendations. Human Genetics 2001, 109, 678-680.
15. T. Hubbard, D. Barker, E. Birney, G. Cameron, et al. The Ensembl genome database project. Nucleic Acids Res. 2002, 30, 38-41.
16. The Gene Ontology Consortium. Creating the Gene Ontology Resource: Design and Implementation. Genome Res. 2001, 11, 1425-1433.
17. Y. H. Huang, Y. T. Chen, J. J. Lai, S. T. Yang, and U. C. Yang. PALS db: Putative Alternative Splicing database. Nucleic Acids Res. 2002, 30, 186-190.
18. Z. Kan, E. C. Rouchka, W. R. Gish, and D. J. States. Gene Structure Prediction and Alternative Splicing Analysis Using Genomically Aligned ESTs. Genome Res. 2001, 11, 889-900.