63   |  Vol. 3   No. 2 ,  July   202 2   

  
P-ISSN : 2715-2448 | E-ISSSN : 2715-7199   

Vol.3 No.2 July 2022   

Buana Information Tchnology and Computer Sciences (BIT and CS)  

  
A Time Series Based Gene Expression Profiling Algorithm for Stomach   
 Cancer Diagnosis  

  
Teresa Kwamboka Abuya1   
Study Program  

Computer Science   
Kisii University, Kenya tkwambokaa@gmail.com  

 
‹β›  

Bayu Priyatna 2  

Study Program  
Information System  

Universitas Buana Perjuangan Karawang 

bayu.priyatna@ubpkarawang.ac.id  

  
Abstrak— Eksperimen biologis telah menghasilkan sejumlah besar 
data ekspresi gen yang memiliki nilai sangat besar untuk diagnosis, 

pengobatan, dan pencegahan penyakit. Namun, kelemahan yang 

cukup besar memang ada dalam pemanfaatan yang tepat dari data 

ini karena skala yang besar dan kerumitannya. Sejumlah algoritma 

telah dikembangkan untuk menginterpretasikan data ini dalam 

bentuk profil gen untuk tujuan diagnosis. Diantaranya K-means, 

pengelompokan hierarkis, pengelompokan berbasis kepadatan, 

pengelompokan subruang, dan peta yang mengatur sendiri. 

Sayangnya, algoritme ini mengabaikan ketergantungan berurutan 

di antara titik waktu yang berurutan, tidak memadai dalam 

penemuan pola untuk mengubah aktivitas selama interval terbatas 

dari kerangka waktu eksperimen, dan tidak mampu membedakan 

antara pola faktual dan acak. Dengan demikian, ada kebutuhan 

untuk algoritme pembuatan profil gen yang mengatasi kekurangan 

yang dibatasi waktu dalam algoritme saat ini dan karenanya 

memfasilitasi pembuatan profil gen yang efisien untuk 

mendiagnosis kanker perut secara dini. Selama bertahun-tahun, 

eksperimen ekspresi gen deret waktu telah banyak digunakan untuk 

mempelajari berbagai proses biologis seperti siklus sel, 

perkembangan, dan respons imun. Dalam makalah ini 

dikembangkan algoritma profil gen berdasarkan deret waktu untuk 

diagnosis awal kanker lambung. Dengan menetapkan gen ke satu 

set profil model yang telah ditentukan sebelumnya yang menangkap 

pola potensial yang berbeda, signifikansi masing-masing profil ini 

dapat ditetapkan. Profil signifikan ini kemudian dapat dianalisis 

lebih lanjut dan digabungkan untuk membentuk cluster yang 

kemudian dapat dimanipulasi oleh algoritma clustering. Idenya 

adalah untuk mengukur aktivitas gen selama rentang waktu yang 

singkat sehingga dapat menghasilkan gambaran universal tentang 

fungsi seluler. Singkatnya, ini termasuk mendeteksi pola berulang 

dalam data biologis. Pola-pola ini kemudian digunakan untuk 

mengungkapkan informasi diagnostik yang mungkin penting bagi 

praktisi medis. Desain penelitian eksperimental digunakan untuk 

mencapai tujuan penelitian. Data yang berkaitan dengan genom 

biologis digunakan untuk pekerjaan penelitian ini. Karena 

perkembangan penyakit kanker saat ini, hasil dari penelitian ini 

diharapkan dapat menjadi signifikan dalam diagnosis dini kanker 

lambung sehingga pengobatan yang tepat dapat diberikan..  

Kata kunci: Data microarray, respon imun, clustering, profil 

signifikan, diagnosis kanker.  

Abstract— Biological experiments have produced enormous 

amount of gene expression data that possess enormous value for 

the diagnosis, treatment, and prevention of diseases. However, 

considerable drawbacks do exist in the appropriate utilization of 

this data due to its massive scale and intricacy. A number of 

algorithms have been developed to interpret this data in form of 

gene profiling for diagnosis purposes. They include K-means, 

hierarchical clustering, density-based clustering, subspace 

clustering, and self-organizing maps. Unfortunately, these 

algorithms ignore the sequential dependency among successive 

time points, are inadequate in the discovery of patterns for 

changing activity over a restricted interval of an experiment’s time 

frame, and are incapable of discriminating between factual and 

random patterns. As such, there is a need for a gene profiling 

algorithm that addresses the time-constrained shortcomings in the 

current algorithms and hence facilitating efficient profiling of 

genes for early stomach cancer diagnosis. Over the years, time 

series gene expression experiments have been widely used to study 

a range of biological processes such as the cell cycle, development, 

and immune response. In this paper a gene profiling algorithm 

based on time series for early stomach cancer diagnosis is 

developed. By assigning genes to a predefined set of model profiles 

that capture the potential distinct patterns, the significance of each 

of these profiles can be established. These significant profiles can 

then be analyzed further and combined to form clusters that can 

then be manipulated by clustering algorithms. The idea is to 

measure the genes’ activities over a short period span so as to come 

up with a universal depiction of the cellular functionality. In a 

nutshell, this includes detecting recurring patterns in biological 

data. These patterns are then employed to reveal diagnostic 

information that may be important for the medical practitioners. 

An experimental research design was utilized to achieve the study 

objectives. Data pertaining to biological genomes was employed 

for this research work. Due to the upsurge of cancer in the current 

times, the outcomes of this research work is anticipated to be 

significant in the early diagnosis of stomach cancer so that 

appropriate medication can be administered.  

  
Keywords: Microarray data, immune response, clustering, 

significant profiles, cancer diagnosis.  

I. INTRODUCTION   

  
64   |  Vol. 3   No. 2 ,  July   202 2   

  
Functional genomics is the discipline in which genes are 

utilized in the determination of their function whereas gene 

expression is an approach employed to examine the 

functional changes in these genes. According to [1], the 

expression level for a given gene across different 

experimental conditions are collectively referred to as the 

gene expression profile and the expression levels for all the 

genes under an experimental condition are jointly referred to 

as the sample expression profile.  

One of the goals in microarray data analysis is the identification 

of genes for which the expression level is significantly changed 

under different experimental conditions. Another objective is 

to cluster the expressed genes or samples having similar 

expression profiles to make a meaningful biological inference 

from the set of genes or samples (Martin et.al., 2016).  

The field of bioinformatics essentially deals with biological 

information processing. One of the requirements for effective 

bioinformatics is an extensive range of computational models 

that helps in representation and computation of massive 

biological data. As [2] point out, biological experiments and 

processes analysis require too much effort. Additionally, this 

process can prove to be very slow. This can be attributed to the 

ever-growing intricacy of the processes and fiery growth of 

biological data emerging from laboratories universally. The 

recent drawback, as [3] noted, is on how to convert this 

enormous data repository into knowledge that can facilitate 

understanding of biological processes and experiments 

pertaining to both health and diseases. According to [4] 

timeseries gene expression analysis allows for principled 

estimation of unobserved time-points, clustering, and dataset 

alignment. In this technique, every expression profile is 

modeled as a piecewise polynomial which is estimated from the 

observed data and every time point sways the overall smooth 

expression curve. Gene expression experiments carried out 

using time series show that unobserved timepoints can be 

reconstructed with 10-15% less error when compared to other 

profiling methods. The time series-based clustering algorithm 

operates directly on the continuous representations of gene 

expression profiles. This is particularly effective when applied 

to non-uniformly sampled data.  

Stomach Cancer (SC) is the fourth most frequently diagnosed 

malignancy and the second leading cause of cancer death 

worldwide (Yang et.al.,2018). Although the incidence of SC 

has declined for decades, the prognosis of SC remains very 

poor, especially in China. At present, the pathogenesis of SC is 

unclear, thereby necessitating effective biomarkers and 

targeted therapeutics. Traditionally, clinic pathological 

parameters were used in risk stratification of SC outcomes. 

However, a number of advanced SC patients remained stable 

for a couple of years, whereas some early-stage patients 

progressed rapidly [5]. Therefore, reliable biomarkers or 

stratification systems that can be used for more accurate 

prediction are highly essential [6].  

The greatest challenge in cancer diagnosis is the identification 

of a subset of genes with crucial roles in diverse stages of these 

diseases’ progression from early stages of carcinogenesis to its 

final stage of metastasis. As [7] explains, reliable identification 

of molecular determinants of clinical outcomes can facilitate 

the discovery of functional biomarkers predictive of therapy 

response or disease progression. In addition, this can provide 

insights into new therapeutic targets in this aggressive disease. 

[8] further point out that the complexity of genomic networks 

and the vast volume of genes present increase the challenges of 

understanding and interpreting the resulting mass of data. The 

problem is compounded by the vagueness, imprecision, and 

noise present in this data. According to [9], the current 

algorithms such as Hierarchical gene profiling algorithm, Self-

Organizing Maps (SoM), Support Vector Machines (SVM) and 

K-means algorithm, can only detect relationships where there 

is sufficient variability in gene expressions and as such, 

functional interactions are only detectable if they induce 

changes in transcriptional state that persist over a reasonable 

timescale. To address this problem, algorithms for visualizing 

high-throughput single-cell datasets and identifying putative 

functional relationships between genes are required [10].  

Due to the potential of time series to unravel biological 

processes that take place over short time duration, this 

research work employed this nonconventional data type to 

come up with a gene profiling algorithm that is instrumental 

in disease diagnosis in human beings. In this paper, a time 

series - based gene profiling algorithm for early stomach 

cancer diagnosis was developed. Early and accurate 

diagnosis of stomach cancer can significantly improve the 

design of personalized therapy and enhance the success of 

therapeutic interventions. Since time series has the potential 

of identifying significant chronological expression profiles 

and the genes associated with this profile, it can enable the 

comparison of cancer infected genes behavior across 

multiple conditions over short time duration. Specifically, 

the response of gastric epithelial cells infected with the 

vacAmutant strain of the pathogen Helicobacter pylori was 

investigated [11].  

The contributions of this paper include the derivation of 

mathematical parameters that were shown to help in the 

generations of gene profiles over a limited duration of time.  

The rest of this paper is organized as follows. Section 2 

presents the related work while section 3discusses gene 

profiles derivation. Section 4 gives a presentation of results 

and discussion while part 5 concludes the paper.  

II. METHOD  

This paper adopted an experimental research design to 

develop an algorithm that aided in the derivation of gene 

profiles. The approach involved the derivation of gene 

profiling parameters which were then employed to develop 

a time-series based algorithm. This algorithm was then 

experimented on sample genomic data described in section 

A below, to provide the required gene profiles visualization 

in the form of graphs. This visualization provided a straight 

forward means of establishing the sequential dependency 

among successive time point. In addition, the visualization 

facilitated the discovery of patterns for changing activity 

over a restricted interval of an experiment’s time frame.  

  
65   |  Vol. 3   No. 2 ,  July   202 2   

  
3.1 Data Set  

The genomics data employed in this paper were from two 

experiments measuring the response of gastric epithelial 

cells infected with the vacA-mutant strain of the pathogen 

Helicobacter pylori. The data is sampled at five time points 

0 h, .5 h, 3 h, 6 h, and 12 h. A sample of these data is shown 

in Figure 1 for G27 TC1 trial 4.  

  
Figure 1. Sample G27 TC1 Trial 4 Data  

This Figure 1 shows TC1 gastric epithelial (AGS) cells infected 

with wild type H. pylori (G27) and isogenic mutants in cagA 

and vacA for 0, 0.5, 3, 6, and 12 hours. Figure 2.0 shows the 

G27 TC1 trial 5 data.  

  
Figure 2. Sample G27 TC1 Trial 5 Data  

  
In these data samples, hybridizations of G27 (trial 4) and cag 

A- (trial 3) time-courses are accomplished in parallel. A 

technical replicate of the G27 time course (trial 5) and 

hybridization of vacA- (trial 3) time course is also 

accomplished in parallel. The cag A 6- and 12-hour time points 

technically replicated (trial 4) (the cag A 6-hour sample of trial 

3 are lost).  

  
3.2 Gene Profiling Modeling Process  

This research dealt with the profiling, comparing and 

visualizing gene expression data from short time series of two 

experiments measuring the response of gastric epithelial cells 

infected with the vacA-mutant strain of the pathogen 

Helicobacter pylori. The gene expression profiling comprised 

of four major steps as shown in Figure 3. As show in this figure, 

the steps included the generation and normalization of 

expression signals, testing each probe for its differential or 

association with the phenotype, the application of proper 

statistical significance criteria to identify the gene expression 

profile, and the investigation of the functions and pathways of 

the genes in the expression profile.  

  
Figure 3. Gene Profiling Steps  

  
Thereafter, a number of statistical significance criteria such 

as Pearson correlation, P-value, Euclidean distance, 

Logistic regression, Bonferroni correction, False discovery 

rate and Time Points Permutation were applied to help 

identify specific list of genes differentially expressed or 

associated with the phenotype.  

Although mutual information (MI) measure is superior over 

simpler measures such as Pearson correlation as it is 

capable of capturing complex non-linear and non-

monotonic dependencies. In addition, it can reflect the 

dynamics between pairs or groups of genes, computing MI 

involves estimating pair-wise joint probability distributions 

which requires density estimation or data discretization, 

with the accuracy of these estimates depending on sample 

sizes. As this measure was not deployed in this research 

study. Table 2 gives a summary for the deployment of the 

various performance metrics.  

  
Table 2 Performance Metrics Deployment  

  
SNO  Statistical  

Measure  

Deployment  

1.  Pearson 

correlation  

Weighted Relation 

between all genes  

2.  P-value  Significance of gene 

coexpressions  

3.  Euclidean 

distance  

Correlation distance 

between gene profiles  

4.  Logistic 

regression  

Estimate of cancerous 

probability  

5.  Bonferroni 

correction  

Adjustment to the 

confidence levels  


66   |  Vol. 3   No. 2 ,  July   202 2   

  
6.  False discovery 

rate  

Adjustment to the 

confidence levels  

7.  Time Points 

Permutation  

Optimize the number of 

required profiles   

  
3.3 The Algorithm of modeling gene profiles The first 

step was the commencement of the algorithm while the 

second step in the processing activities was the input of the 

genomics data as shown in Figure 4.0. in the next page. In 

step three, validation is done against empty genomic file 

upload such that if this field is empty, then an error message 

is generated for this effect in the fourth step. During the fifth 

step, the validation against spot IDs not included is done 

such that if these IDs are not included, then they are 

computed in the sixth step. The value of the spot ID is 

initialized to 1 which are thereafter incremented by one 

until the value of 24192 is reached, which is equivalent to 

the number of genes in the file that were investigated. 

Whereas spot IDs were unique for each gene entry, the same 

gene symbol may appear multiple times in the data file 

corresponding to the same gene appearing on multiple 

spots.  

  
The seventh step was the computation of the average value for 

the expression values for the same gene. This was 

accomplished using the median before further analysis on the 

data was carried out. The eighth step was that option of filtering 

some specific genes using P-value metric. In situations where 

a gene was filtered, then it was excluded from further analysis. 

Gene filtering was accomplished for those genes that did not 

show a sufficient response to experimental conditions, those 

genes that had too many missing values, or the gene expression 

pattern over repeats was too inconsistent as dictated by the 

minimum correlation between repeats. The ninth step was the 

usage of additional parameters namely the maximum Pearson 

correlation and maximum number of candidate model profiles 

to dictate the selection of model profiles along with the 

maximum number of model profiles and maximum unit change 

in model profiles between time points as shown in Figure 5.0. 

In this algorithm, the candidate model profiles were designed 

to be nonconstant profiles which started at zero and increased 

or decreased an integral number of units that was less than or 

equal to the value of the maximum unit change in model 

profiles between time points.  

  
Figure.4. Gene Profiling Algorithm Pseudo-Code  

  
When this parameter was set to zero, all permutations were 

used. In the eleventh step, the P-value based significance 

level was utilized to set the connotation level at which the 

number of genes assigned to a model profile as compared 

to the expected number of genes assigned was regarded as 

significant. During the twelfth step, the permutation test 

was set to permute all time points including time point zero 

when computing the expected number of genes assigned to 

a profile.    

  
In this case, the developed algorithm located profiles with 

significantly more genes assigned than expected on 

condition that all the input columns had been randomly 

reordered. On the other hand, during the thirteenth step, the 

permutation test was configured not to permute at time 

point zero and as such, the algorithm found profiles with 

more genes assigned than expected on condition that all the 

columns except for the first column had been randomly 

reordered. In the developed algorithm, permuting time point 

zero was preferred since it was the only test that took into 

account the significant changes that took place between 

time point zero and the immediate next time point (0.5 h).   

  
67   |  Vol. 3   No. 2 ,  July   202 2   

  
In the fourteenth step, the correction method was utilized to 

adjust the significance level since this algorithm was meant 

to test multiple profiles for significance. Two types of 

corrections were utilized in this algorithm. The first one was 

the Bonferroni correction while the second one was the 

conservative false discovery rate (FDR) control. In the third 

scenario, no correction was made for the multiple 

significance tests.    

  
Figure 5. Modeling Gene Profiling Process  

  
In the fifteenth step, two parameters namely the minimum 

correlation and the minimum correlation percentile were 

utilized to control the grouping of significant model profiles 

into clusters. In so doing, these parameters served to control 

how similar two model profiles had to be if they were grouped 

together. For the case of the minimum correlation, any two 

model profiles assigned to the same cluster of profiles had to 

have a correlation above this parameter's value.  On its part, the 

minimum correlation percentile was employed in cases there 

were repeat data from different time periods. It was used to 

specify that any two model profiles assigned to the same cluster 

of profiles had to have a correlation in their expression greater 

than the correlation of this percentile in the distribution of gene 

expression correlations between the repeats. The last step was 

the display of the gene profiles based on the Euclidean distance 

after which the algorithm halted in the seventeenth step.  Figure 

6 gives a diagrammatic representation of the gene profile 

derivation process. As this figure shows, the process gene 

profile derivation process involves the input of the genomic 

data containing the gene expressions to be profiled.   

  
These data items are analyzed using parameters such as Pvalue, 

Pearson correlations, permutations, logistic regression and 

median to yield probable profiles as already discussed above. 

Correction methods are then employed to adjust the 

significance level to permit the testing of multiple profiles for 

significance.  

  
Figure 6. Schematic Gene Derivation Process  

  
The output gene groupings are then clustered using minimum 

correlation and minimum correlation percentile before 

Euclidean distance is applied to them to distinguish the various 

gene profiles. The final outputs are the gene profiles in form of 

graphs.  

  
The logic here was that when the number of candidate  

model profiles exceeded the p - value of seeing t more genes  

in the intersection, then instead of explicitly generating al l  

candidate model profiles, a subset of candidate model profiles  

of this size was randomly selected. In the tenth step, the  

number of permutations per gene parameter was employed to  

specify the number of permutations of time points that were  

randomly selec ted for each gene when computing the  

expected number of genes assigned to each of the model  

profiles.   

  
68   |  Vol. 3   No. 2 ,  July   202 2   

  
III.RESULTS AND DISCUSSION  

In this section a time series-based gene profiling algorithm is 

developed. To test the derived parameters and their gene 

profiling abilities, the algorithms and statistical computations 

were put into use to achieve some functionality as shown in 

Table 3. The genomics data from two experiments measuring 

the response of gastric epithelial cells infected with the vac A- 

mutant strain of the pathogen Helicobacter pylori were then fed 

as input to this algorithm.  

  
Table 3. Gene Derivation Process  
Step  Parameter  Activity  
1  n/a  -Commence gene derivation process  
2  n/a  -Input genomic data  
3  n/a  -Validation is done against empty genomic file upload  
4  n/a  -Prompt genomic data input error  
5  n/a  -Validation against spot IDs  
6  n/a  -If not included in file compute spot IDs  
7  Median  -Computation of the average value for the expression values for the same 

gene  
8  P-value  -Filtering specific genes  
9  Pearson correlation, p-value  -Model profiles selections.  
10  Permutation  -Computation of  the expected number of genes assigned to each of the 

model profiles  
11  P-value, Logistic regression  

  
-Setting the connotation level at which the number of genes are assigned 

to a model profile  

12  Permutation  -Compute the expected number of genes assigned to a specific profile  
13  Permutation  -Configure  permutation test not to permute at time point zero  
14  Bonferroni, FDR  -Adjust the significance level to test multiple profiles for significance  
15  Minimum correlation, minimum correlation  -Control grouping of significant model profiles into clusters 

percentile  
16  Euclidean distance  -Display generated gene profiles  
17  n/a  -Halt gene derivation process  

    
The minimum absolute expression change was any value 

more than -0.05. As an illustration, using the maximum 

number of missing values to be 2, the minimum correlation 

between repeats to be 0, and the minimum absolute 

expression change to be 0.05 yielded the information in 

Table 4.0 for the sample filtered genes. Table 4. Sample 

Filtered Genes  

  
The genes that were devoid of these three characteristics 

were regarded as standard genes and were the ones that took 

part in further analysis. Table 5 gives information on the 

sample genes that passed the classification criteria.  

Table 5. Sample Genes Passing Classification Criteria  

  
Afterwards, eight parameters were utilized for the 

computational derivation of gene profiles from this set of data: 

maximum correlation, maximum number of candidate model 

profiles, maximum number of model profiles and maximum 

unit change in model profiles between time points, number of 

permutations per gene, significance level, and correction 

method as shown in Table 6 below.  

Table 6. Gene Profiles Evaluation Metrics  
Gene Profiling Option  Value  

Maximum correlation  1  
Maximum number of candidate model profiles  1,000,000  

Number of permutations per gene(0 for all permutations)  0  
P-value significance level  0.05  

Maximum Number of model profiles  50  
Maximum unit change in model profiles between time points  2  

Correction method  None  
Minimum Correlation  0.7  

    
Based on the evaluation metrics of Table 6.0, the algorithm was 

run to yield the proposed gene profiles.  

3.1 Time Series-Based Gene Profiling.  
In this profiling, the maximum correlation specified the value 

that the maximum correlation between any two model profiles 

had to be below, and was therefore employed to guarantee that 

two very similar profiles were not selected. The maximum 

value for this parameter was set to 1 in order to prevent two 

perfectly correlated model profiles from being selected. It was 

observed that lowering this parameter led to the number of 

model profiles selected being less than the maximum number 

of model profiles even in situations where more candidate 

model profiles were available.  

  
On the other hand, the maximum number of candidate model 

profiles represented  non-constant profiles which commenced  

at 0 and increased or decreased an integral number of units that 

was less than or equal to the value of the maximum unit change 

in model profiles between time points. The number of 

permutations per gene parameter specified the number of 

permutations of time points that were randomly selected for 

each gene when computing the expected number of genes 

assigned to each of the model profiles.   

  
69   |  Vol. 3   No. 2 ,  July   202 2   

  
When this parameter was set to 0, all permutations were used. 

It was also important to set permutation test to permute time 

point 0 or not. When computing the expected number of genes 

assigned to a profile, if the permutation test for time point 0 was 

set, the permutation test permuted all time points including time 

point 0. It was observed that doing this led to profiles with 

significantly more genes being assigned than expected if all the 

input columns had been randomly reordered. On the contrary, 

if the permutation test was not set for 0, the permutation test 

permuted all time points except for time point 0. In this 

scenario, profiles with more genes were assigned than expected 

if all the columns except for the first column had been randomly 

reordered.   

  
Permuting time point 0 was preferred since only this test took 

into account significant changes that occurred between time 

point 0 and the immediate next time point. However in some 

cases based on experimental design a gene's expression value 

before transformation at time point 0 was expected to be known 

more accurately than the other time points, and because of this 

asymmetry, not permuting time point 0 was also be useful.  

  
It was observed that increasing the maximum number of 

model profiles increased the number of candidate models as 

shown in Table 7.  

  
Table 7. Maximum Number of Gene Model Profiles Viz. 

Significant Gene Models  
Maximum Number of Model Profiles  Resulting Sig. Number of Gene Models  
50  14  
60  16  
70  18  
80  20  
100  21  
120  23  
140  24  

    
Based on the values in Table 7, a graph was plotted for 

maximum number of model profiles against the resulting 

significant number of gene models as shown in Figure 7.  

  
Figure 7. Maximum No. of model profiles Viz. Resulting 

Significant. No. of Gene Models  

  
The graph of Figure 7.0 shows that the resulting significant 

number of gene models increase nearly exponentially as the 

maximum number of model profiles was increased. 

Consequently, to get fine grained gene model profiles, the 

maximum number of model profiles had to be increased and 

vice versa. Table 8.0 gives the shift in the resulting 

significant number of gene profiles as the maximum unit 

change in model profiles between time points was adjusted.  

  
Table 8. Maximum Unit Change in Model Profiles Viz. 

Significant Gene Models  
Maximum Unit Change in  Model Profiles  Resulting Sig. Number of Gene Models  
1  13  
2  14  
3  15  
4  13  
5  14  
6  15  
7  15  
8  16  
9  16  
10  15  

    
As shown in this table, generally as the maximum unit change 

in model profiles is increased, the resulting significant number 

of gene models is increased. This is due to the pronounced 

Euclidean distances between the gene models. Regarding 

maximum correlation, the value of minimum correlation was 

set to zero (0) and the value of maximum correlation was 

slowly reduced from 1 to zero. The results obtained are shown 

in Table 9 are observed.  

  
Table 9. Maximum Correlations Viz. Significant Gene  Models  

 
 Maximum Correlation  Resulting Sig. Number of Gene Models Number of Genes Assigned  

1  12  1005  
0.9  12  1005  
0.8  9  993  
0.7  6  1185  
0.6  4  1216  
0.5  3  1243  
0.4  3  1348  
0.3  3  1348  
0.2  3  1470  
0.1  3  1470  
0  1  1275  

    
Generally, as the value of maximum correlation is reduced from 

one to zero, the number of resulting significant number of gene 

models reduced to unity (1) while the number of genes assigned 

to these gene models increased from 1005 to a maximum value 

of 1275. This implies that when the correlation value is small, 

gene models are basically indistinguishable hence at 

correlation zero, there is only one resulting significant model. 

On the other hand, at maximum correlation, the genes can be 

clearly distinguished and hence the resulting significant gene 

models are many.   

  
Concerning the number of genes assigned to models, at low 

correlation coefficients, genes profiles are indistinguishable 

and hence a large number of genes are assigned to the few 

available models. However, as the correlation coefficients are 

increased, the gene profiles become increasing disparate and 


70   |  Vol. 3   No. 2 ,  July   202 2   

  
few genes are assigned to each of the many models now 

available as the rest are discriminated due to their large pvalues.  

  
3.2 Prediction Power of the Developed Algorithm In the 
developed algorithm, sequences of gene expressions were 

listed in order of occurrence, starting at time point 0h to 12h. 

The aim was to collect and investigate precedent observations 

of gene expressions at various time points in order to come up 

with ideal models to express the intrinsic structure of the 

underlying genomic data. Based on these models, it was 

possible to predict future gene expressions. To put this into 

perspective, profile ID 17 was considered whose gene 

expressions are shown in Figure 8.  

  
Figure 8. Gene Expressions for Profile ID 17  

  
A total of 54 genes were assigned to this model profile 

whose individual expressions are shown in Figure 8. By 

sketching a line of best fit through these gene expressions 

and performing some extrapolations, the future expressions 

beyond the 12h time point can be obtained as shown in 

Figure 9 below.  

  
Figure 9. Gene Profile Prediction  

  
The white thick line through the gene expressions is the line 

of best fit while the thick red line represents the 

extrapolated gene expressions for the 54 genes assigned to 

profile ID 17 for the future 18h and 30h time points. 

Suppose that the stomach cancer patient gene expressions 

are as shown in Figure 10 below.  

  
Figure 10. Stomach Cancer Patient Gene Expressions over  

30h Duration  

  
Comparing the hypothesized gene expressions over the 30h 

duration and the gene models in Figure 11.0 below, then 

considering the first few gene expressions, model profile 

IDs 13, 14,15,16,17 and 18 are candidates’ models that the 

stomach cancer patient gene expressions can fit in. 

However, taking into account the preceding time points 

eliminates model profiles 14(experiences near exponential 

growth followed by plateau), 15(experiences linear growth 

followed by plateau), 16 (portrays linear growth, linear 

decay and plateau), and 18 (presents linear growth followed 

by plateau). This leaves profile ID 13 and 17 as the most 

probable model profiles. By drawing a horizontal line 

through these two profiles as shown in Figure 11, it is 

possible to discern which of them perfectly fits the patient 

gene expressions.  

  
Figure 11. Gene Model Fitting  

  
Based on this line and considering time-points at which troughs 

and crests appear, it is clear that model profile ID 17 perfectly 

fits the patient gene expressions for a duration of 30h time 

points. As such, it can be implied that the developed algorithm 

led to accurate diagnosis of stomach cancer patients within 12h 

time points since the commencement of the cancerous gene 

expressions. In the next section, this algorithm is validated 

against some well-known gene profiling algorithms.  

  
3.3 Validation of the Developed Algorithm In this section, 
the time series-based algorithm that was developed is validated 

  
71   |  Vol. 3   No. 2 ,  July   202 2   

  
against other gene profiling algorithms such as Hierarchical 

gene profiling algorithm, Support Vector Machine, Self-

organizing maps, and K-means algorithm. In Hierarchical gene 

profiling algorithm, genes with related expression patterns are 

grouped together and connected by a series of branches to form 

a dendrogram. Unfortunately, this algorithm considers each 

gene as an individual cluster and genes that are similar to each 

other form nested clusters based on the pair-wise distances.   

  
On the other hand, the time series-based algorithm developed 

in this research study considered a group of genes with similar 

expressions as profile clusters. For instance, in Figure 6.6, a 

total of 155 genes were represented by a single model profile 

with ID 40 and 90 genes were represented by model profile ID 

37. These two model profiles formed a cluster with a total of 

245 genes. As such, the developed algorithm is operationally 

faster during gene profiling compared to Hierarchical gene 

profiling algorithm, rendering it ideal for large genomic data 

set. The genomic data that was utilized in this research 

consisted of 24192 gene symbols observed under 5 time points, 

making the total gene expressions 120960, a very big data set 

for the rather slow Hierarchical gene profiling algorithm.  

  
To effectively apply the Support Vector Machine gene profiling 

algorithm, it requires training using the same members of each 

model profile that have to be identified. This training takes time 

and hence compared to the developed algorithm, it is slow and 

hence inefficient for large data sets such as the 120960 gene 

expressions that were under investigation in this research.   

  
Although Self-organizing maps algorithm has been employed 

to group 1,036 genes into 24 categories, this algorithm is slow 

in training, hard to train against slowly evolving data and are 

not so intuitive since neurons close on the map (topological 

proximity) may be far away in feature space. Additionally, 

these maps do not behave so gently when using categorical 

data, or mixed data. Comparing the 1036 genes that 

Selforganizing maps algorithm profiled into 24 categories with 

the 24192 genes that were profiled using the developed time 

series-based algorithm, it is clear that the proposed algorithm is 

efficient.  

  
Regarding SVM algorithm, this algorithm has been used for 

cancer classification with microarray data where it served as a 

powerful classifier together with four effective feature 

reduction methods namely principal components analysis 

(PCA), class-separability measure, Fisher ratio and t-test to the 

problem of cancer classification based on gene expression data. 

Although it very high classification accuracies, it requires 

feature reduction methods which renders it structurally 

complex compared to the time series-based algorithm 

implemented in this research.  

  
On its part, the K-means algorithm operates on a series of 

microarray experiments measuring the expression of a set 

of genes at regular time intervals in a common cell line. It 

requires that data be normalized to permit for comparisons 

across these microarrays. The output produced is in form of 

clusters of genes which vary in similar ways over time and 

hence it is possible to infer that genes which vary in the 

same way may be co-regulated and or participate in the 

same pathway. Unfortunately, the numbers of clusters need 

to be specified which may be unknown in some instances, 

and figuring out the right number of clusters that represent 

the true number of clusters in the population is quite 

subjective. As such, the profiles obtained using K-means 

can vary greatly depending on the location of the 

observations that are randomly chosen as initial centroids. 

However, the developed time series-based algorithm 

employs statistical metrics such as p-value, Pearson 

correlation, logistics regression and Euclidean distance 

whose significance levels are well known.    

  
The K-means clustering algorithm assumes that the 

underlying clusters in the population are spherical, distinct, 

and are of approximately equal size and hence tends to 

identify clusters with these characteristics. Therefore, this 

algorithm is incapable of yielding good results when 

clusters are elongated or not equal in size like the genomic 

data used in this research where some gene expressions 

were negative, zero and others positive. The K- algorithm 

is also sensitive to initial conditions, implying that different 

initial conditions produce varying result of gene profiles. It 

is also possible for a very far data from the centroid to pull 

the centroid away from the real one as shown in Figure 12. 

below. Here, 5 genes are assigned to cluster ID 55 and it is 

clear from the gene.  

  
Figure 12. K-Means Based Profiling  

  
Expressions that the profiling is not such accurate especially 

after the 0.5h time point. Whereas 4 gene expressions have 

negative gradients, one of them has a positive gradient. 

During the 3h time-point, some gene profiles are at the 

rough, others are at the crest, plateau while others are still 

on their descent. The same is observed during the 6h time 

point. These varying result of gene profiles give 

contradicting depiction of gene activities and hence may 

lead to inaccurate stomach cancer diagnosis.  

  
IV.CONCLUSION  

The aim of this paper was to develop a gene profiling 

algorithm based on time series to help in early stomach cancer 

diagnosis. Based on a number of derived gene profiling 


72   |  Vol. 3   No. 2 ,  July   202 2   

  
parameters, an algorithm was developed that was then 

experimented on sample genomic data. The results of this paper 

included a number of gene profiles that obtained from the 

underlying pathogen Helicobacter pylori data. The significance 

of this research lies on the fact that it helped generate gene 

profiles using very short time points. This feature is very 

critical in early stomach cancer diagnosis as it facilitates 

necessary preventive measures that curtail the cancerous cells 

advancement to other fatal phases. Since this research was 

purely based on stomach cancer, future work in this area lies on 

the implementation of this algorithm for other types of cancer 

or diseases.  
  

REFERENCES  

[1] Sanchita & Ashok S. (2015). Future Challenges in 

Application of Algorithms and Tools for Clustering of  

Gene Expression Data. Biotechnology Division, 

CSIRCentral Institute of Medicinal and Aromatic Plants. 

Lucknow 22601. (pp. 515-531)5 India.  

[2] Brohée S., Barriot R., &  Moreau Y. (2015).Biological 

knowledge bases using Wikis: combining the flexibility of 

Wikis with the structure of databases. Bioinformatics, 

Oxford Journals.   

[3] Wong  K.    (2016).Computational 

 Biology  and  

Bioinformatics: Gene Regulation. CRC Press.  

[4] Ziv B., Georg G., David K., & Tommi S. (2015). A New 

Approach to Analyzing Gene Expression Time Series 

Data. Whitehead Institute for Biomedical Research.  

[5] Wang, H., Wang, X., Xu, L. et al. (2020).High expression 

levels of pyrimidine metabolic rate–limiting enzymes   are 

adverse prognostic factors in lung adenocarcinoma: a 

study based on The Cancer Genome Atlas and Gene 

Expression Omnibus datasets. Purinergic Signalling 16, 

347–366 (2020). https://doi.org/10.1007/s11302-020-

09711-4.  

[6] Wenhui, Y.,  Zhiyong L.., Yuan, Li.,  Jianbing, M.,  

Mudan, Yang., Jun, X.(2019).Immune    signature 

profiling  identified prognostic factors for gastric cancer. 

Chinese journal of cancer research. Https// doi: 

10.21147/j.issn.1000-9604.2019.03.08.[7] Abolfazl R.,  

Fatemeh A., Salendra S., and Vinay V. 

(2016).NetworkBased Enriched Gene Subnetwork 

Identification: A Game-Theoretic Approach. Biomed Eng 

Comput Biol. Vol. 7, Issue 2, pp. 1–14.  

[8] Oyelade J., Itunuoluwa I., Funke O., Olufemi A., Efosa 

U., Faridah A., Moses A.,  and Ezekiel A. (2016). 

Clustering Algorithms: Their Application to Gene 

Expression Data. Bioinformatics and Biology  

Insights,10, 237–253.  

[9] Thalia E.C.,  Michael P.H., and Ann C. (2017).Gene 

Regulatory Network Inference from Single-Cell Data 

Using Multivariate Information Measures. Cell 

Systems,  5, 251–267.   

[10] Gwang H., Peter S., Sung J., & Joo H. (2016). Screening 

and surveillance for gastric cancer in the United States: 

Is it needed? American Society for Gastrointestinal 

Endoscopy. Volume 84, No. 1, pp. 18-28.  

[11] Siregar, Amril Mutoi, et al. "Perbandingan Algoritme 
Klasifikasi Untuk Prediksi Cuaca." Jurnal Accounting 

Information System (AIMS) 3.1 (2020): 15-24.