ProSplicer


A Putative Alternative Splicing Database Based on Proteins, mRNAs and EST Clusters

Jorng-Tzong Horng, Hsien-Da Huang, Chau-Chin Lee, and Baw-Jhiune Liu


Department of Computer Science and Information Engineering, Department of Life Science,
National Central University, Taiwan,
Department of Computer Science and Engineering, Yuan-Ze University, Taiwan
horng@db.csie.ncu.edu.tw*
 

 

  1. Introduction
  2. Methods
  3. Comparison
  4. Statistics
  5. Query interface
  6. Splicing View
  7. References

 

Abstract

In eukaryotes, alternative splicing can produce variant proteins and expression patterns. ProSplicer is an automatically generated database of putative alternative splicing, built by aligning proteins, mRNA sequences, and ESTs against genomic sequences. Protein sequences, messenger RNA sequences, and expressed sequence tags (ESTs) provide valuable information about the splice variants of genes. Each EST and mRNA sequence is aligned separately to the corresponding genomic sequence by SIM4 [10], and each protein is aligned by BLAST. The database also provides splice sites and the alternative splicing forms observed in different tissues, e.g., exon skipping in particular tissues. This information helps users investigate the alternative splicing forms and tissue-specific expression of the genes of interest. Some other alternative splicing databases are constructed from gene annotations that mention alternative splicing, or predict alternative splice sites by aligning ESTs to genomic sequences. PALS db [17] takes the longest mRNA sequence in each UniGene cluster [7] as the reference sequence and aligns it with the ESTs and mRNAs in the same cluster to predict alternative splicing sites. ProSplicer is available at http://bioinfo.csie.ncu.edu.tw/ProSplicer/.
Contact: horng@db.csie.ncu.edu.tw
 

1. INTRODUCTION

Alternative splicing is a widely occurring phenomenon and a major mechanism for controlling the expression of cellular and viral genes. It changes how a gene acts in different tissues and developmental states by generating distinct mRNA isoforms composed of different selections of exons, which in turn produce variant proteins. The phenomenon is widespread in the human genome, although it was commonly believed that alternative splicing occurred in only about 30% to 40% of all genes [1, 5].
    Because the number of available sequences, i.e., proteins, mRNAs, and ESTs, is increasing exponentially, it has become possible to decipher alternative splicing forms with computational methods such as BLAST. Sequenced mRNAs and ESTs provide gene expression evidence that reveals alternative splicing, as do protein sequences, which are aligned against the translated genomic sequences.
Several alternative splicing databases, such as AsMamDB [8], ASDB [12], and SpliceDB [11], are constructed from gene annotations containing the keywords "Alternative Splicing". AsMamDB [8] contains information about alternative splicing in several mammals. SpliceNest [6], SpliceDB [11], AsMamDB [8], and HASDB [4] map clustered ESTs onto human genomic DNA to compute gene structures and splice variants. PALS db [17] takes the longest mRNA sequence in each UniGene [7] cluster as the reference sequence and aligns it with the ESTs and mRNA sequences in the same cluster to predict alternative splicing sites. Table 1 compares several publicly available alternative splicing databases. The "Referenced Sequence" column indicates whether genomic sequences or the longest mRNA sequence in each UniGene cluster serve as the reference. The "Supported Materials" columns show which materials, i.e., proteins, mRNAs, or ESTs, are analyzed to investigate the alternative splicing forms of genes. The "Alignment Tool" column gives the alignment tool used in each approach, and the "Literature Search" column indicates whether data were also collected through a literature search.
 

Table 1. Comparison between different alternative splicing databases.

 

| Database | Referenced Sequence | Supported Materials: Proteins | Supported Materials: mRNAs | Supported Materials: ESTs | Statistics | Alignment Tool | Literature Search | Organisms |
|---|---|---|---|---|---|---|---|---|
| SpliceNest [6] | Genomic Sequence | | Yes | Yes | 90,000 EST clusters | REPuter [13] | | Human and Arabidopsis |
| PALS db [17] (Release 2) | Messenger RNA | | Yes | Yes | 19,936 (human) and 16,615 (mouse) UniGene clusters | BLAST | | Human and mouse |
| SpliceDB [11] | Genomic Sequence | | Yes | Yes | 43,337 splice site pairs | BLAST | Yes | Mammalian |
| AsMamDB [8] (version 1.0) | Genomic Sequence | | Yes | Yes | 1,563 alternative splicing genes | FASTA | Yes | Human, mouse, and rat |
| ASDB [12] (version 2.1) | | Yes | | | 1,922 proteins | | Yes | |
| HASDB [4] | Genomic Sequence | | Yes | Yes | 6,201 UniGene clusters | BLAST | | Human |
| ProSplicer | Genomic Sequence | Yes | Yes | Yes | 21,786 genes | BLAST (protein), SIM4 (mRNA and ESTs) | | Human |

In ProSplicer, the whole human genomic sequence and all genes, both known and novel, are taken into account to investigate alternative splicing. ProSplicer provides alternative splicing information for known and novel human genes by aligning three major types of gene expression data, i.e., proteins, mRNAs, and ESTs. The alternative splicing forms predicted by combining protein sequences, mRNAs, and ESTs are more complete than those predicted from EST sequences alone. The alternative splicing sites predicted from protein, mRNA, or EST sequences are also provided. The tissue information attached to mRNA and EST sequences helps reveal tissue-specific alternative splicing forms, e.g., exon skipping. The database supports keyword search over its contents, and a graphical interface displays the alternative splicing information of the genes of interest.

 

2. METHODS


The genomic sequences and gene annotations are obtained from Ensembl [15] (Release 2002,05,28), which contains 21,786 genes, both known and novel. The mRNA and EST sequences of genes are retrieved from UniGene [7] (Release 147), which contains 96,105 human gene clusters. The EST sequences are from dbEST (Release 022202), which holds 3,991,208 ESTs of Homo sapiens. The protein sequences are from SWISS-PROT [2] and TrEMBL (Release 20, March 2002) [2]. SWISS-PROT (Release 40) contains 101,602 entries in total, of which 29,751 are human. In addition, 25,972 rodent entries in TrEMBL are taken into account.
The approach to predicting alternative splicing consists of three main phases, as shown in Figure 1: the preprocessing phase, the alignment phase, and the filtering phase. In the preprocessing phase, the gene DNA sequences, ESTs, mRNA sequences, and protein sequences are collected, converted, and integrated into a single database, named GeneInfo, to prepare the sequences for analysis. In the alignment phase, protein sequences are aligned to the gene DNA sequences by TBLASTN, and the mRNA and EST sequences are aligned by SIM4; the resulting candidate exons are stored. In the filtering phase, we filter noise out of the candidate exons and connect the remaining candidate exons into the transcript form of each EST, mRNA, and protein sequence. Finally, the exons of the transcript forms are made available in ProSplicer.
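The filtering phase can be illustrated with a minimal Python sketch. The source does not state how noise is filtered, so the identity and length thresholds below, as well as the names `CandidateExon`, `filter_exons`, and `connect_exons`, are assumptions made purely for illustration and are not ProSplicer's actual implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class CandidateExon(NamedTuple):
    seq_id: str      # ID of the aligned protein, mRNA, or EST (hypothetical IDs below)
    start: int       # genomic start position (inclusive)
    end: int         # genomic end position (inclusive)
    identity: float  # alignment identity in percent


def filter_exons(exons, min_identity=90.0, min_length=10):
    """Drop candidate exons that look like alignment noise.

    Both thresholds are assumed values, not taken from the source.
    """
    return [
        e for e in exons
        if e.identity >= min_identity and e.end - e.start + 1 >= min_length
    ]


def connect_exons(exons):
    """Group exons by source sequence and order them along the genome."""
    transcripts = defaultdict(list)
    for e in exons:
        transcripts[e.seq_id].append(e)
    return {sid: sorted(es, key=lambda e: e.start) for sid, es in transcripts.items()}


candidates = [
    CandidateExon("EST_A", 500, 620, 98.5),
    CandidateExon("EST_A", 100, 220, 99.0),
    CandidateExon("EST_A", 300, 305, 100.0),  # removed: shorter than min_length
    CandidateExon("mRNA_B", 100, 220, 85.0),  # removed: identity below threshold
    CandidateExon("mRNA_B", 500, 620, 97.0),
]
transcripts = connect_exons(filter_exons(candidates))
for seq_id, exons in transcripts.items():
    print(seq_id, [(e.start, e.end) for e in exons])
```

Each surviving transcript form is simply the ordered list of exons supported by one aligned sequence, which is the unit later compared across sequences to find splice variants.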
 

Figure 1. Approach for predicting alternative splicing.

As shown in Table 2, ProSplicer takes 21,786 genes from Ensembl [15] and a total of 2,311,460 related sequences, including protein, mRNA, and EST sequences, to compute putative alternative splicing. Table 2 also gives the number of exons predicted by alignment: 442,077, 395,619, and 12,361,685 exons are predicted by aligning protein sequences, mRNA sequences, and EST sequences against the genomic sequences, respectively.

Table 2. ProSplicer statistics.
 

| Sequence Type | Amount of Sequences | Amount of Predicted Exons |
|---|---|---|
| Protein | 44,184 | 442,077 |
| Protein (human) | 26,115 | 279,656 |
| Protein (mouse) | 18,069 | 162,421 |
| mRNA | 20,577 | 395,619 |
| EST | 2,246,699 | 12,361,685 |
| Total | 2,311,460 | 13,199,381 |
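
The totals in Table 2 follow directly from the per-type rows. As a quick consistency check, the short Python snippet below sums the rows; the dictionary name `table2` and the derived exons-per-sequence ratio are illustrative additions and are not part of ProSplicer.

```python
# (number of sequences, number of predicted exons) per sequence type, from Table 2
table2 = {
    "Protein": (44_184, 442_077),
    "mRNA": (20_577, 395_619),
    "EST": (2_246_699, 12_361_685),
}

total_sequences = sum(n_seq for n_seq, _ in table2.values())
total_exons = sum(n_exon for _, n_exon in table2.values())

# Average number of predicted exons per aligned sequence, by type
exons_per_sequence = {
    seq_type: n_exon / n_seq for seq_type, (n_seq, n_exon) in table2.items()
}

print(total_sequences, total_exons)
for seq_type, ratio in exons_per_sequence.items():
    print(f"{seq_type}: {ratio:.2f} exons per sequence")
```

The sums reproduce the "Total" row (2,311,460 sequences and 13,199,381 predicted exons).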

3. QUERY INTERFACE
 

By comparing the alternative splicing forms of a gene provided in ProSplicer, users can see whether an exon is left out or included relative to other protein, mRNA, or EST sequences. Three types of alternative splicing, namely exon skipping, alternative 5' donor sites, and alternative 3' acceptor sites [17], are included in the database. Figure 2 shows the three types. They can be viewed directly in the graphical user interface provided by ProSplicer, as shown in Figure 3.
 

Figure 2. Three types of alternative splicing forms in the genome view.
 


Figure 3. Three types of alternative splicing in an example from ProSplicer.
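
The three event types can be told apart by comparing the exon coordinates of two transcript forms of the same gene. The following Python sketch is one simplified way to do this and is not ProSplicer's own procedure. It assumes that both transcripts lie on the forward strand, so that the end of an exon is its 5' donor site and the start of an exon is its 3' acceptor site, and it ignores special cases at the first and last exon. The function name `classify_events` and the coordinates are hypothetical.

```python
def classify_events(ref, alt):
    """Compare two transcripts given as sorted lists of (start, end) exons.

    Returns a list of (event type, exon in `ref`) tuples.
    Assumes forward-strand genomic coordinates.
    """
    events = []
    # Introns of the other transcript: the gaps between consecutive exons.
    alt_introns = [(a[1] + 1, b[0] - 1) for a, b in zip(alt, alt[1:])]
    for start, end in ref:
        # Exon skipping: the whole exon falls inside an intron of `alt`.
        if any(i_start <= start and end <= i_end for i_start, i_end in alt_introns):
            events.append(("exon skipping", (start, end)))
        for a_start, a_end in alt:
            if a_start == start and a_end != end:
                # Same acceptor, different donor.
                events.append(("alternative 5' donor site", (start, end)))
            elif a_end == end and a_start != start:
                # Same donor, different acceptor.
                events.append(("alternative 3' acceptor site", (start, end)))
    return events


ref = [(100, 200), (300, 400), (500, 600), (700, 800)]
alt = [(100, 220), (500, 600), (680, 800)]
events = classify_events(ref, alt)
for kind, exon in events:
    print(kind, exon)
```

In this toy example the exon at 300-400 is skipped in `alt`, the first exon ends at a different donor site, and the last exon starts at a different acceptor site.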

 

Keyword Search
 

ProSplicer provides several keyword search fields, such as Ensembl gene identification numbers and gene symbols or names. The default search field is the gene symbol or gene name. When a user submits a keyword, the database returns the entries whose gene symbols or names contain the keyword.
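
The default search behaves like a substring match on the gene symbol or name. A minimal sketch of that behavior is given below, assuming a case-insensitive match; the records, identifiers, and the function name `search_genes` are invented for illustration and do not come from the ProSplicer database.

```python
# Hypothetical gene records; IDs, symbols, and names are placeholders.
GENES = [
    {"ensembl_id": "ENSG_EXAMPLE_1", "symbol": "DEMO1", "name": "demo kinase 1"},
    {"ensembl_id": "ENSG_EXAMPLE_2", "symbol": "DEMO2A", "name": "demo receptor 2A"},
    {"ensembl_id": "ENSG_EXAMPLE_3", "symbol": "TEST3", "name": "test transporter 3"},
]


def search_genes(keyword, genes=GENES, fields=("symbol", "name")):
    """Return the records in which any searched field contains the keyword.

    Matching is case-insensitive, which is an assumption of this sketch.
    """
    needle = keyword.lower()
    return [g for g in genes if any(needle in g[f].lower() for f in fields)]


hits = search_genes("demo")
print([g["symbol"] for g in hits])

# Searching a different field, e.g. the Ensembl gene identification number
by_id = search_genes("EXAMPLE_3", fields=("ensembl_id",))
print([g["symbol"] for g in by_id])
```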



Figure 4. The gene information in ProSplicer.

 

Gene Information

ProSplicer provides reference links to other biological databases and to sequences related to the selected genes. The annotations of a gene include its Ensembl identification number, gene symbol, genomic location, and gene description. As shown in Figure 4, the available reference links include GO (Gene Ontology data) [16], HUGO (providing access to the list of currently approved human gene symbols) [14], GeneCards (integrating human genes, their products, and their involvement in diseases), LocusLink (organizing information around genes to generate a central hub for accessing gene-specific information), RefSeq (providing reference sequence standards for genomes, transcripts, and proteins), and OMIM [3].
 


Splicing graphical view

 

Figure 5. "Detailed View" graphical interface in ProSplicer.


The splicing view consists of two parts, the "Overview" and the "Detailed View". The "Overview" interface shows graphically where the gene is located on the chromosome. The "Detailed View" contains two graphic blocks; Figure 5 shows an example. The graphical interface provides the following functions.


A. Jump to an arbitrary subregion within the gene region. Enter the subregion in the text box and click the "Redraw" button; the graphic then focuses on the specified region.

B. Zoom in and zoom out. The zoom adjustment bar offers factors of 1/8x, 1/4x, 1/2x, 2x, 4x, and 8x relative to the current size of the view region.

C. Shift by windows. Clicking a single-triangle button shifts the view by one window, and clicking a double-triangle button shifts it by two windows.
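
The zoom and shift controls amount to simple arithmetic on the coordinates of the view region. The sketch below is only an interpretation: it assumes that a zoom factor scales the length of the view region around its center and that one "window" equals the current view length. The function names `zoom` and `shift` are hypothetical.

```python
from fractions import Fraction

# Zoom factors offered by the zoom adjustment bar
ZOOM_FACTORS = [Fraction(1, 8), Fraction(1, 4), Fraction(1, 2), 2, 4, 8]


def zoom(start, end, factor):
    """Scale the view region by `factor`, keeping its center fixed (assumed behavior)."""
    length = end - start + 1
    new_length = max(1, int(length * factor))
    center = (start + end) // 2
    new_start = center - new_length // 2
    return new_start, new_start + new_length - 1


def shift(start, end, windows):
    """Move the view by a whole number of windows; negative values move left."""
    offset = (end - start + 1) * windows
    return start + offset, end + offset


view = (1000, 1999)
print(zoom(*view, Fraction(1, 2)))  # half the current view length
print(shift(*view, 1))              # single triangle: one window to the right
print(shift(*view, -2))             # double triangle: two windows to the left
```

A real interface would additionally clamp the result to the boundaries of the gene region, which this sketch omits.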

The lower block of Figure 5 is the main graphical view of alternative splicing. It contains basic gene information, namely D. the Ensembl gene identification number, E. the gene symbol, and F. the gene description. The other functions provided in the splicing view are described as follows:
 

G. The quality of alignment. The color of each matching exon block represents a range of alignment quality.

H. The length of the displayed gene region.

I. Sequence identification. Each "Sequence ID" is hyperlinked to SWISS-PROT, GenBank, or dbEST.

J. Annotation tip of position. When the mouse moves over an empty region, a tip is shown that displays the current chromosome position.

K. Annotation tip of exon. When the mouse rests on an exon block for a few seconds, information about the exon is shown, including the "Sequence ID", the start and end positions on the sequence, and the start and end positions on the genomic sequence.

L. Intron. Clicking an intron block opens a new browser window showing the alignment flat file.

M. Annotation tip of tissue. When the mouse rests on a tissue coloring block for a few seconds, the tissue information is shown, e.g., brain, muscle, or lung.

N. Exon. Clicking an exon opens a new browser window showing the alignment flat file. The fill color of the exon indicates the alignment quality, as described in G.

O. Tissue Information. Different tissues are represented in different colors.

P. Zooming Popup menu. The graphic view can be adjusted into different scales by selecting the item in the popup menu.


REFERENCES

1. A. A. Mironov, J. W. Fickett, and M. S. Gelfand. Frequent alternative splicing of human genes. Genome Res, 1999, 9, 1288-1293.

2. A. Bairoch and R. Apweiler. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids. Res. 2000, 28, 45-48.

3. A. Hamosh, A. F. Scott, J. Amberger, C. Bocchini, D. Valle, and V. A. McKusick. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucl. Acids. Res. 2002, 30, 52-55.

4. B. Modrek, A. Resch, C. Grasso, and C. Lee. Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucl. Acids. Res. 2001, 29, 2850-2859.

5. D. Brett, J. Hanke, G. Lehmann, S. Haase, S. Delbruck, S. Krueger, J. Reich, and P. Bork. EST comparison indicates 38% of human mRNAs contain possible alternative splice forms. FEBS Lett. 2000, 474, 83-86.

6. E. Coward, S.A. Haas, and M. Vingron. SpliceNest: visualization of gene structure and alternative splicing based on EST clusters. Trends Genet, 2002, 18 (1), 53-55.

7. G.D. Schuler, et al. A gene map of the human genome. Science, 1996, 274 (5287), 540-6.

8. H. Ji, Q. Zhou, F. Wen, H. Xia, X. Lu, and Y. Li. AsMamDB: an alternative splice database of mammals. Nucl. Acids. Res. 2001, 29, 260-263.

9. K.D. Pruitt and D. R. Maglott. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res, 2001, 29 (1), 137-140.

10. L. Florea, G. Hartzell, Z. Zhang, G.M.Rubin, W. Miller. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 1998, 8, 967-74.

11. M. Burset, I. A. Seledtsov, and V. V. Solovyev. SpliceDB: database of canonical and non-canonical mammalian splice sites. Nucl. Acids. Res. 2001, 29, 255-259.

12. M.S. Gelfand, I. Dubchak, I. Dralyuk, and M. Zorn. ASDB: database of alternatively spliced genes. Nucl. Acids. Res. 2000, 28, 296-297.

13. S. Kurtz, et al. REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res. 2001, 29, 4633-4642.

14. S. Povey, R. Lovering, E. Bruford, M. Wright, M. Lush, H.M.Wain. The HUGO Gene Nomenclature Committee (HGNC). Nomenclature Recommendations. Human Genetics, 2001, 109, 678-680.

15. T. Hubbard, D. Barker, E. Birney, G. Cameron, et al. The Ensembl genome database project. Nucl. Acids. Res. 2002, 30, 38-41.

16. The Gene Ontology Consortium. Creating the Gene Ontology Resource: Design and Implementation. Genome Res. 2001, 11, 1425-1433.

17. Y.H. Huang, Y.T. Chen, J.J. Lai, S.T. Yang, and U.C. Yang. PALS db: Putative Alternative Splicing database. Nucl. Acids. Res. 2002, 30, 186-190.

18. Z. Kan, E. C. Rouchka, W. R. Gish, and D. J. States. Gene Structure Prediction and Alternative Splicing Analysis Using Genomically Aligned ESTs. Genome Res. 2001, 11, 889-900.