Microsoft Word - cet-01.docx CHEMICAL ENGINEERING TRANSACTIONS VOL. 46, 2015 A publication of The Italian Association of Chemical Engineering Online at www.aidic.it/cet Guest Editors: Peiyu Ren, Yancang Li, Huiping Song Copyright © 2015, AIDIC Servizi S.r.l., ISBN 978-88-95608-37-2; ISSN 2283-9216 An Improved Method for Exon Array Data Analysis Shenghua Jin School of Computer engineering, Huaiyin Institute of Technology, Huaian, Jiangsu, China ; Huaian key laboratory of the study and application of Internet of Things, Huaian, Jiangsu, China; Jiangsu “Internet of Things” mobile Internet technology engineering laboratory, Huaian, Jiangsu, 223003; zybjsh819@163.com Alternative splicing of genes is usually induced by some severe diseases. Exon array manufactured by Affymetrix Company has been widely used to detect alternative splicing. Utilizing the mapping between the gray values of splice isoforms and the array probes, we proposed the calculation of gene expression level based on Kseq model. Aiming at the massive amount of array data, an algorithm for parallel computation using multi-core processor was presented. The algorithm was verified through experiments using real datasets and compared with the commonly used algorithm. It was found that parallel computation improved the computation efficiency of the model and the new model enhanced the computation accuracy in subsequent bioanalysis. Keywords: Exon array, gene expression, alternative splicing Parallel Computing 1. Introductions Human genome project (HGP) was completed in 2003. Since then, increasing concern has been given to gene functions and the mining of the correlations between genes and the diseases, which was confirmed in (Caceres and Kornblihtt, 2002). In the field of bioinformatics, exploring the mechanism of gene expression and transcriptional regulation represents new direction of research. DNA microarray technology makes monitoring simultaneously the gene expression level of thousands of genes under different samples possible, in gene expression data the type of alternative splicing plays an important role in gene expression and transcriptional regulation. Alternative splicing produces transcripts through various combinations of exons. Here we employed the exon array manufactured by Affymetrix Company. PM probes are designed on the Affymetrix exon array, which was confirmed (Karin, 2010). The probes total about 6.5 million and constitute about 1.4 million probe sets. Santa Clara (2005) reported that four probes can cover about 90% of the exons and the array covers nearly 1 million exons. This high-integration exon array is able to detect the expressions of transcripts on the level of isomers, exons and genes, which was confirmed (KapurK, 2007). There are several methods for the calculation of isomer expression, including multi-mapping Bayesian gene expression (MMBGX), which was confirmed (Turro, 2010), and multiple exon array preprocessing (MEAP), which was confirmed (Chen, 2011). Based on the mapping between the gray values of splice isomers and probes, a data model conforming to chi-squared distribution was proposed to calculate the gene expression level (Yin and Liu.2015). Comparison between the proposed algorithm and other popular algorithms on real datasets reveals that Kseq model can effectively reduce the noises in the original experimental data and improved the accuracy and efficiency of subsequent bioanalysis. 2. Methods 2.1 Mapping between genes, isoforms and probes Table 1 provides the relationships of 5 splice isomers to gene ENSG00000000457, which respectively are ENST00000367770, ENST00000367771, ENST00000367772, ENST00000423670, ENST00000470669 and ENST00000470238. In table 1 the second and third columns of pos-x and pos-y are the coordinates of the corresponding probes on the array, by which the probes corresponding to the isomers can be found ,and the fourth and fifth column of Probe-set is the name of probe set. The relationships among the gene, its splices and probes are presented as Fig 1. DOI: 10.3303/CET1546064 Please cite this article as: Jin S.H., 2015, An improved method for exon array data analysis, Chemical Engineering Transactions, 46, 379-384 DOI:10.3303/CET1546064 379 Table 1: Mapping between genes and isoforms Probe-set pos-x pos-y Gene name Isoform name 2443560 110 2055 ENSG00000000457 ENST00000367770 2443540 112 2516 ENSG00000000457 ENST00000367770 2443560 295 2543 ENSG00000000457 ENST00000367771 2443558 339 2192 ENSG00000000457 ENST00000367771 2443548 349 1782 ENSG00000000457 ENST00000367771 2443541 486 2531 ENSG00000000457 ENST00000367772 2443559 495 2125 ENSG00000000457 ENST00000423670 2443559 468 2324 ENSG00000000457 ENST00000423670 2443551 496 943 ENSG00000000457 ENST00000423669 2443551 523 1042 ENSG00000000457 ENST00000423669 2443552 652 1112 ENSG00000000457 ENST00000470238 2443552 661 956 ENSG00000000457 ENST00000470238 Fig 1: Illustration of the relationships among genes, splice isoforms and probes Suppose that gjc gjkc k y s=  , where gjcy is the gray value of the j -th PM probe of gene g on array c . This gray value corresponds to several isomers of the gene. gjkcs is the expression of the k -th isomer corresponding to this probe, which satisfies the chi-squared distribution of gkcα and gjβ . gjβ is the random variable shared by several isomers corresponding to the probe. Given the properties of random variable conforming to chi-squared distribution, there is ,gjc gkc gj k y Ga α β      (1) Suppose gjβ conforms to the chi-squared distribution of gc and gd , then ( ),gj g gGa c dβ  , and each isomer satisfies the following: ( ) ( ) ( ),| , |gjkc gjkc gkc gj gj g g gjp s p s p c d dα β β β=  (2) Using (2), the joint distribution of each isomer is calculated, and the logarithm is taken for its likelihood function. The deduction is shown below: Gene Isoform1……. Isoform n Probe 1 Probe 2 Probe 8 Probe 9 Probe m 380 ( ) ( ) 1 , , l o g ( ) l o g | , | , ( ) l o g ( ) ( ) g k c g k g g c g g g g j g j g g g j c g k c g j j kc c g g j c q j cg g k c k L c d P Y d P c d P y d q y c α α β β α β ω α − =   =       Γ =  Γ Γ     ∏  ∏  (3) where gc gkcα α =   , gkc g c k q cα= + , gjc g c y dω = + . Estimates ˆˆ ˆ, ,gkc g gc dα of the parameters , ,gkc g gc dα are obtained by maximum likelihood method. Then the probability density function of specific signal gjkcs is solved using the estimated parameter values: ˆ ˆˆˆ ˆ ˆ( | , , ) ( | , ) ( | , )g j k c g k c g g g j g j k c g k c g j g j g gP s a c d d P s P c dβ α β β=  ( ) ( ) ( ) ˆ ˆ 1 ˆˆ ˆˆ( ) ˆˆ ˆ g g k c g g k c c a g g k c g g j k c c g k c g g g j k c c d s c d s α α α − + Γ + = Γ Γ + (4) The mean and variance of ( )log gjkcs are calculated: ( ) ( )ˆ ˆ ˆlo g lo g ( ) ( ) ˆ ˆlo g ( ) '( ) '( ) g jk c g g k c g g jk c g k c g s d c V a r s c α α   = + Ψ − Ψ   = Ψ + Ψ  (5) where (.)Ψ denotes the derivative of log (.)Γ ; ' (.)Ψ denotes the first-order derivative of (.)Γ . Thus the mean and variance of the corresponding gene are ( )ˆ ˆ ˆl o g l o g ( ) ( ) ˆ ˆl o g ( ) '( ) '( ) g j k c g g k c g k k g j k c g k c g k k s d c V a r s c α α     = + Ψ − Ψ      = Ψ + Ψ        (6) Since the distribution of gkcα is unimodal and gkcα is larger than zero, the distribution of gkcα is fitted by the Gaussian distribution truncated at point zero. The mean and variance of the isomer are ( ) ( ) ( ) ( ) ( ) 1ˆ ˆ ˆ ˆe x p ' ( ) 2 ; , TT g c g c g c g c g c g c g c g c g c g c g c g c g c P L a a H a N α α α α α α μ  ∝ − + − −    =  (7) where Hessian matrix gcH is written as the following: ( ) ( ) ( ) 2 1 ˆ ˆ ˆ| , ' ,gc gc gc gcgc gc ij gc gc gc gc gc gc gic gjc L H L Hα α α μ α α α α − = ∂ = = + = − ∂ ∂   (8) 2 ˆ exp 2 2 22 2 C erf σ μ μ μ μ α σπ σ    = − + − −         (9) ( )( ) ( ) 2 22 2 2 1 ˆ ˆˆ 1 2 exp 2 22 2 C erf μ σ μ σ σ μ α μ α σσ σ      = + − − − + − −           (10) The mean and variance of the isomer are calculated by (9) and (10), respectively. 381 3. Experimental results and discussion The real dataset GSE13072 was used to verify whether the proposed algorithm could reduce the noise in the original data. The dataset was derived from Gene Expression Omnibus Database, which is a source for alternative splice events and isomer expressions. Human Exon 1.0ST Array was used. Each array was a 2 5 6 0 2 5 6 0× matrix consisting of 6,553,600 probes. The dataset included two samples, which were brain and reference, respectively. VTbrain and VTreference came from Virginia Tech, and each sample had 5 replicates. 3.1 Comparison of computation accuracy Comparison was made with RMA and iterPLER through the calculation of gene expressions on 5 replicates for each sample. The correlation between the replicates under each algorithm was calculated. The higher the correlation, the lower the noise was. It can be seen from the table that the correlation between the replicates for each sample using Kseq model was much higher than that using the other two methods. Therefore, Kseq model can more significantly reduce the noises in original data than the conventional RMA and iterPLER on a real dataset. Table 2: Correlation of gene expression Sample 5 replicates RMA Iterplier Kseq (replicate4, replicate 5) 0.9911 0.9935 0.9980 VTbrain (replicate1, replicate 2) 0.9915 0.9927 0.9973 (replicate1, replicate 3) 0.9906 0.9925 0.9978 (replicate1, replicate 4) 0.9883 0.9909 0.9966 (replicate1, replicate 5) 0.9887 0.9912 0.9968 (replicate2, replicate 3) 0.9919 0.9935 0.9979 (replicate2, replicate 4) 0.9899 0.9922 0.9975 (replicate2, replicate 5) 0.9893 0.9914 0.9972 (replicate3, replicate 4) 0.9879 0.9900 0.9970 (replicate3, replicate 5) 0.9882 0.9902 0.9970 (replicate4, replicate 5) 0.9870 0.9912 0.9969 VTreference (replicate1, replicate 2) 0.9903 0.9919 0.9973 (replicate1, replicate 3) 0.9901 0.9908 0.9979 (replicate1, replicate 4) 0.9887 0.9905 0.9968 (replicate1, replicate 5) 0.9897 0.9913 0.9976 (replicate2, replicate 3) 0.9866 0.9862 0.9968 (replicate2, replicate 4) 0.9873 0.9894 0.9957 (replicate2, replicate 5) 0.9866 0.9876 0.9963 (replicate3, replicate 4) 0.9857 0.9850 0.9969 (replicate3, replicate 5) 0.9895 0.9916 0.9976 (replicate4, replicate 5) 0.9865 0.9877 0.9969 Mean±standard deviation 0.9881±0.0028 0.9896±0.0034 0.9968±0.0011 382 To further test the computation accuracy, MAQC QRT-PCR results were incorporated and the differential genes were identified through t-test, which was confirmed (Cui and Churchill, 2003). The larger the AUC, the better the performance of the algorithm is, of which was confirmed (Sing et al, 2005). Table 2 shows the test results in which the column of sample represents the sample type, the column of 5 replicates is given the number of the found replicate, the correlation coefficients of RMA, iterPLER and Kseq respectively. From the result of Table 2, the performance of Kseq is higher than of RMA and iterPLER, and the last line in Table 2 respectively presents the mean values and standard variations of RMA, iterPLER and Kseq which also show that the performance of Kseq is the best. The AUC values of RMA, iterPLER and Kseq are given in fig2, it can be seen that Kseq model was superior to PLIER based on AUC, while RMA had the highest AUC. Since RMA cannot calculate the expression of isomers, Kseq model was chosen for the analysis of alternative splicing for improving the accuracy of bioanalysis. Fig 2: AUC values for the three way rule conductions 3.2 Comparison of computation speed To compare the efficiency of different algorithms, Table 4 shows the computation time of the three methods on dataset VTbrain and VTreference with parallel computation. The experimental platform was the 64-bit Linux system with Intel Pentium 4 processor (3.4GHZ, 8G RAM). Because the computer had four cores, parallel computation with simultaneous use of 4 computers could be simulated with MapReduce on a single computer. Table 3: Computation time of three different methods on different datasets (h) RMA Iterrpler Kseq VTreference 5.4 5.4 3.5 VTbrain 7.9 8.1 6.2 It can be seen from Table 3 that Kseq had higher computation efficiency, because the mapping between genes, isomers and probes was fully utilized in the algorithm. The efficiency can be further improved by parallel computation using several computers simultaneously 3. Conclusions This paper aims to solve the multi-source mapping of the gene isoform by using the relationship of the gene, the gene isoform and the gray value of the gene probe in Affymetrix extron chip. Firstly, the expression level of the splice isoform of the gene is calculated through using chi-squared distribution algorithm and the percentage of the gene which corresponds to the splice isoform is also computed. Secondly, the proposed 383 model in this paper is verified by the real data of GSE13072, and the experiment results show that Kseq model can improve the computing accuracy and efficiency which supports the mechanism of alternative splicing. Finally, this model uses parallel computing to improve the computing methods and fully use multi- core processor, which makes the subsequent processing of large-scale data have better development prospects and provide the good reference such as differential analysis and cluster analysis. Conflict of interest The authors confirm that this article content has no conflicts of interest Acknowledgment This work is supported by the National natural fund project, China (No. 61402192), Jiangsu natural science fund project, China (No. 14KJB520006) Reference Caceres J. F., Kornblihtt A R. 2002. Alternative splicing: multiple control mechanisms and involvement in human disease [J], Trends in Genetics, 18; 186-193 Chen P, 2011. Comprehensive exon array data processing method for quantitative analysis of alternative spliced variants [J].Nucleic Acids Research, 39:e123. Cui X., Churchill G., 2003, Statistical tests for differential expression in cDNA microarray experiments. Genome Biology, 4:210-235. Kapur K. 2007, Exon arrays provide accurate assessments of gene expression [J], Genome Biology, 8(5); R82 Santa Clara. 2005. Affymetrix GeneChip exon array design [R]. www.affymetrix.com/support/technical/ sample_data/exon_array_data.affx. Karin Z. U. L., 2010, Analysis of Affymetrix Exon Arrays [R]. Sing T., 2005, ROCR: visualizing Classifier Performance in R [J]. Bioinformatics, 21(20):3940:3941. Turro E., 2010. MMBGX: A method for estimating expression at the isoform level and detecting differential splicing using while-transcript Affymetrix arrays [J].Nucleic Acids Research. 38:e4. Yin L., Liu Y.G. 2015, Biclustering of the Gene Expression Data by Coevolution Cuckoo Search. International Journal of Bioautomation, 19(2), 161-176. Zhao Z., Liu X., Zhang L., 2011, A probabilistic model for the analysis of RNA-Seq data, CBME'2011, Wuhan, China. 384