Sample Paper - Manuscript Preparation


J. mt. area res., Vol. 8, 2023 

1 

   
        Journal of Mountain Area Research 
  

IDENTIFICATION MOLECULAR FUNCTIONS OF DYNEIN MOTOR PROTEINS USING 

EXTREME GRADIENT BOOSTING ALGORITHM WITH MACHINE LEARNING 
Ali Ghulam1*, Rahu Sikander2, Dhani Bux Talpur3, Erum Saba1, Mir Sajjad Hussain Talpur1, Zulfikar 

Ahmed Maher1, Saima Tunio1 

 
1. Information Technology Centre, Sindh Agriculture University, Sindh, Pakistan  

2. School of Computer Science and Technology, Xidian University, Xi’an 710071, China 

3. Department of Computer Science, University of Gwaddar, Gwaddar, Balochistan 

 
ABSTRACT 

The majority of cytoplasmic proteins and vesicles move actively primarily to dynein motor proteins, 

which are the cause of muscle contraction. Moreover, identifying how dynein are used in cells will 

rely on structural knowledge. Cytoskeletal motor proteins have different molecular roles and 

structures, and they belong to three superfamilies of dynamin, actin and myosin. Loss of function of 

specific molecular motor proteins can be attributed to a number of human diseases, such as 

Charcot-Charcot-Dystrophy and kidney disease.  It is crucial to create a precise model to identify 

dynein motor proteins in order to aid scientists in understanding their molecular role and designing 

therapeutic targets based on their influence on human disease. Therefore, we develop an accurate 

and efficient computational methodology is highly desired, especially when using cutting-edge 

machine learning methods. In this article, we proposed a machine learning-based superfamily of 

cytoskeletal motor protein locations prediction method called extreme gradient boosting (XGBoost). 

We get the initial feature set All by extraction the protein features from the sequence and 

evolutionary data of the amino acid residues named BLOUSM62. Through our successful eXtreme 

gradient boosting (XGBoost), accuracy score 0.8676%, Precision score 0.8768%, Sensitivity score 

0.760%, Specificity score 0.9752% and MCC score 0.7536%.  Our method has demonstrated 

substantial improvements in the performance of many of the evaluation parameters compared to 

other state-of-the-art methods. This study offers an effective model for the classification of dynein 

proteins and lays a foundation for further research to improve the efficiency of protein functional 

classification. 

KEYWORDS: Dynein motor proteins, Machine learning, BLOUSM62, Computational Methods  

 
*Corresponding author: (Email: garahu@sau.edu.pk) 

 
1. INTRODUCTION 

Muscle contraction is driven by motor 

proteins, and proteins and vesicles play an 

important role in cytoplasmic transport. 

Chemical energy from the hydrolysis of 

adenosine triphosphate (ATP) converts these 

proteins into mechanical tasks that move 

along actin filaments or microtubules. 

Adenosine triphosphate (ATP) hydrolysis' 

chemical energy can be converted by these 

proteins into mechanical work that flows 

along actin filaments or microtubules. To 

provide the mechanical forces needed to 

power biological movement, various methods 

have developed. Mechanochemical 

enzymes, or "motor proteins," are a highly 

effective and common method of producing 

biological force [1, 2].  Myosin motors 

operating on actin filaments cause muscle 

Vol. 8, 2023 

https://doi.org/10.53874/jmar.v8i0.166 

Full length article 

mailto:garahu@sau.edu.pk


Ali Ghulam et al., J. mt. area res. 08 (2023) 1-13 

J. mt. area res., Vol. 8, 2023 
2 

cell contraction, vesicle movement, 

cytoplasmic streaming, and other 

morphological changes. Members of the 

dynein and kinesin microtubule-based motor 

superfamilies move vesicles and organelles 

inside of cells, cause the beating of flagella 

and cilia, and segregate replicated 

chromosomes to offspring cells in the mitotic 

and meiotic spindles. [3].  Transport of 

different freight products, including 

membrane organelles, protein complexes, 

and mRNAs directly involved in the dynein 

superfamily of proteins [4].  The dynein and 

kinesin superfamilies of microtubule motors are 

responsible for vesicle and organelle 

movement within cells, which causes flagella 

and cilia to beat. They also work within mitotic 

and meiotic spindles to identify duplicated 

chromosomes.  [4]. For example, dynamin 

and cytoplasmic dynamin can be detected in 

spinal cord spheres associated with motor 

neuron disease [5]. Neurodegenerative 

diseases are also associated with mutations in 

dynein proteins and targeted disruption of 

function [6, 7]. T cell-mediated disease [8] is 

myosin-induced acute myocarditis. Myosin 

variant Ixb increases the risk of celiac disease, 

leading to primary intestinal barrier defects 

[9]. 

Many bioinformatics researchers are 

interested in the important role that motor 

proteins play in human function. For example, 

Zhu, C et al. [10] have established an overview 

of the factor superfamily of dynein motor 

proteins, in which they focus on the structure 

and function of motor dynein proteins. The 

bioinformatics system for classifying heavy 

dynamin chains was subsequently introduced 

by Yagi [11]. Khataee and Liew [12] projected 

a computational model for studying forward 

and reverse dynamics on the basis of the four-

state discrete random motion model. The 

molecular function of motor dynein protein 

consists of a double-stranded dimer.  The 

polypeptides associated with each feature 

(shown in pink) are different and a complex 

set of intermediary chains, light intermediate 

chains, and light chains in dynamin). (Readers 

are referred to the online version of this article 

to explain the color references in this legend.) 

A process simulation model for dynein [13]. 

Stedman et al. [14] proved the association 

between dynein gene mutation and human 

anatomical lineage changes by using 

bioinformatics method. Some previous studies 

in bioinformatics have also been used to 

detect dynein activity and dynein 

phosphatase diversity [15, 16].  

We proposed computational protein 

sequence analysis, we create molecular level 

biological problems at the level of single 

molecules, cells, and tissues in this work. We 

used machine learning algorithm for the 

prediction of accuracy, sensitivity, specity 

and then MCC, live-cell imaging in neurons, 

and computational modelling to analyses the 

regulation of the beginning of dynein-

dynactin transport. In some bioinformatics 

researchers, WEKA is an automatic learning 

method for data mining technology [17]. 

Secondly, various problems related to protein 

functional classification were obtained by 

using RBF network [18, 19], good results have 

been obtained. In addition, LibSVM [20] the 

emergence of deep learning means that the 

field of bioinformatics must become more 

efficient. Machine learning is an advanced 

computer education method, which uses 

artificial intelligence to learn representative 

data with multiple neural network layers [21]. 

There are many benefits to using machine 

learning, such as getting the latest results, 


Ali Ghulam et al., J. mt. area res. 08 (2023) 1-13 

J. mt. area res., Vol. 8, 2023 
3 

reducing the need for extraction functions, 

and so on (GPU). In recent years, 

bioinformatics has shifted from traditional 

machine learning and protein function 

learning to in-depth learning strategies. For 

example, as tried by Alipanahi et al., DNA and 

RNA proteins learn the sequence properties of 

the binding in depth. Spencer et al. [22] 

proposed an ab initio deep learning network 

for protein secondary structure prediction. To 

solve this problem, [23] create Stack-ACPred, 

a brand-new predictor for accurately 

identifying ACPs. To construct the stacking-

base ensemble model for targeting efficient 

ACPs, the best qualities should be chosen. [24] 

suggested bioinformatics technique thus 

outperformed all current state-of-the-art 

sequence based CPP methods. In the current 

study, [25] developed a novel feature 

extraction technique that takes into account 

the influence of residues in the vicinity of the 

mutation site. To assess the effectiveness of 

the suggested feature extracting method, 

rigorous cross-validation and independent 

tests were run on benchmark datasets. The 

molecular processes that enable these 

numerous and varied tasks are the subject of 

this research. Therefore, it is crucial to increase 

the immunoglobulin classification's accuracy 

by using efficient illness research techniques. 

Based on the BLOSUM vector score matrix, we 

extract IgG characteristics using the reduced 

feature dimension features that were chosen. 

Extreme Gradient Boosting is an ensemble 

learning technique we've created (XGBoost) 

[26]. Cancer peptide therapy is interesting 

since it offers so many alluring advantages. 

Anticancer Peptides (ACPs), which are 

essential for the development of innovative 

cancer therapeutics, have attracted a lot of 

attention from researchers in recent years. 

Experimental methods are expensive, time-

consuming, and frequently produce 

unreliable predictions when used to forecast 

ACPs [27]. In order to increase the linkages 

between disease variation and new 

molecular correlations between genetic 

mutations, we carried out a pathway-based 

investigation [28]. On the basis of shared gene 

interactions among illness-pathways, we 

created a biological network, and then we 

used network analysis to try and understand 

how a disease develops. 

In this article, we suggested dynein motor 

proteins with XGBoost algorithm to predict 

effectors using attributes deduced from 

sequencing. As a result, this study suggests 

using an XGBoost algorithm built from a 

position-specific scoring matrix. We judiciously 

summarized the elements addressed in other 

research and added a few new features to 

get around the shortcomings of the current 

methodologies. As shown in a series of recent 

publications [29], the guidelines under the 

five-step rule should be followed to create a 

truly useful sequence-based statistical 

prediction for a biological or biomedical 

approach. We would like to explain the five 

steps here: how I should construct the 

biological sequence of samples to generate, 

select, or predict a valid data set that can 

realistically represent their interactions with 

the predicted target. How the biological 

sequence of samples should be formed or 

built to run a powerful algorithm (or engine), 

we will explain how these measures are 

accomplished one by one below. 

 
Ali Ghulam et al., J. mt. area res. 08 (2023) 1-13 

J. mt. area res., Vol. 8, 2023 
4 

Table 1. Statistics of all retrieved dynein proteins. 

 
2. MATERIALS AND METHODS 

We proposed a novel computational method 

to analyze the molecular activity of high-

performance dynein protein using eXtreme 

gradient boosting (XGBoost) firstly Dataset 

Collection, Feature Extraction, and Model 

Interpretation make up the two stages of the 

dynein protein-XGB procedure, which is 

depicted in Figure 1. The following sections 

provide a full description of each stage. 

2.1 Data collection 

In our study, we chose secreted effectors and 

non-effectors to create the benchmark 

dataset and build the machine learning-

based model for dynein protein prediction. 

Our dataset, which included 250 non-dynein 

proteins and 520 dynein proteins, was directly 

retrieved from the recently published work 

(Wang, et al., 2019). The CD-HIT programmed 

ran the protein dataset through a filter of >30% 

sequence identity to cut down on sequence 

redundancy [30]. A total of 400 dynein 

proteins and 200 non-dynein proteins made 

up the final training dataset, and 120 dynein 

proteins and 50 non-dynein proteins made up 

the independent test dataset. 

I I I
 

         (1) 

The subjects of this study were positive 

dynein protein 400 examples of good 

behavior made up the participants of this 

study's I+ dataset, while 230 examples of 

negative behavior made up the target 

population of the I dataset. In conclusion, 

Table 1 shows that the baseline dataset 

consists of 228 protein sequences. We 

employed two different data sets for 

independent validation in order to further 

highlight the accuracy of the technique 

discussed in this study. 

2.2 Feature extraction technique 

In this work, we identified the coevolution 

relationship using 3080 dynein motor protein 

characteristics MSAs. At least one family will 

have a structure that has been determined 

empirically, and each family will include 

sequences of 100, 400, and 700 lengths. The 

"no gap" location is the only one the MSA will 

calculate. A "no-gap" position, then, is one 

that occupies less than 10% of the clearance. 

We employed mutual information to extract 

the association between evolutionary 

coevolution. We extracted features sequence 

data of the protein is represented by the 

BLOSUM62 matrix profiles. The matrix with m' L 

elements, where L is the length and m = 20 is 

the number of amino acids, serves as a 

representation for each residue in the training 

dataset. The 20 frequent amino acids are 

represented by one row in the normalized 

BLOSUM62 matrix. Equal-length peptides can 

be encoded with the BLOSUM62 descriptor. 

The BLOSUM62 matrix profiles, a regularly used 

tool for determining the alignment of two 

different protein sequences, was used in this 

work. By analyzing observed polypeptide 

alignments on a vast scale, the value of the 

BLOSUM62 Matrix is calculated. In MSA, mutual 

information between two locations is 

 Original 

data 

Similarity 

<30% 

Cross-

validation  

Dynein 

protein 

4153 721 306 

non-Dynein 

protein 

4153 721 252 

Total Protein  8306 1442 558 

    
Ali Ghulam et al., J. mt. area res. 08 (2023) 1-13 

J. mt. area res., Vol. 8, 2023 
5 

described as information like [31] between 

two sites. 

       
f A ,Bi ja
L = f A ,B logi jij A Bi , j f A f Bi j

 
 
    

 
     

       

  (2) 

    
where, more particularly, we used 21 

different types of amino acids, calculating the 

gap as the 21st amino acid, if q is equal to 21. 

Although there are enough variables in the 

MSA sequence data set, we use frequency to 

roughly approximation the likelihood. The 

frequency of a single type A amino acid seen 

at position I is indicated by f (Ai). The frequency 

f of the two types of amino acids occurring at 

locations I and J combined f (Ai, Bj). Sequence 

redundancy must be taken into account 

when calculating single- and dual-frequency 

weights. We compute all the mutual 

information in reference [32]. 

When assessing coevolution correlations, 

a false positive can also happen, as was the 

case in the current study. We used the 

structural data that was present in the training 

data set. Remainders are only paired together 

if they are architecturally close to one 

another. The "adjacent residues" are defined 

as those residues with tip atoms that are 

separated by 4.5 angstroms or less. A group of 

residues are more likely to interact with one 

another the more dissimilar their sequences 

are and the closer they are to one another. 

 
2.3 Derive a new matrix from contextual 

substitutions 

An element in the amino acid substitution 

matrix that corresponds to the substitution 

percent was used to categories the target 

population. How frequently one amino acid 

gets swapped out for another is indicated by 

the substitution fractions. Coevolution 

calculations show that there are universal 

relationships between protein families. Instead 

of happening in isolation, most amino acid 

changes take place in concert with their 

structural context. In addition, alternate 

fractions that are identical to the original 

BLOSUM vector matrix are produced using the 

logarithmic odds ratio [33]. 

 
 

   

f A, B
s A, B = λlog

f A f B
         (3)                                 

In this section A and B must be substituted for 

each other in order to determine the joint 

frequency F (A, B) using the expected 

frequency for a single amino acid type and 

the actual joint frequency. Is a scalar factor 

whose objective is to succeed. 

2.4 Model Proposed 

The goal of the current study was to 

accurately predict if a protein sequence is 

that of a dynein protein; this categorization 

issue is known as a dichotomy problem [34, 

35]. Three classifiers were used in this paper's 

analyses in order to choose those that could 

predict dynein proteins more precisely than 

the others. SVM, AdaBoost, KNN, and Random 

Forest (RF) were the three classifiers 

employed. This observation might be  

 
J. mt. area res., Vol. 8, 2023 

6 

 
Figure 1. Proposed framework model 

 
interpreted as showing that a nuclear's solvent 

accessibility is determined by the amount of 

atomic surface that is exposed to solvents. The 

degree of atomic accessibility in a residue 

determines its residue availability [36]. 

Similarly, residues are regarded as exposed 

residues if their solvent accessibility is higher 

than the threshold. The following diagram 

illustrates several stages of the dataset 

processing, function formulation, RF model 

generation, and novel protein sequence 

prediction processes.  

 
2.5 Model Evaluation performance 

In this work, to evaluate how well the classifiers 

employed in this work performed, the 

following metrics were used. Accuracy, 

sensitivity, F-measure, Matthew's correlation 

coefficient (MCC), and area under receiver 

operating characteristic curve (AUC) [37, 38] 

were used to evaluate the performance of the 

classifiers utilized in this investigation. As the  

 
prediction threshold was changed, a trade-off 

between sensitivity and specificity was seen. 

AUC offers a single measure of total threshold-

independent accuracy, making it a useful tool 

for comparing the overall prediction 

performance of various approaches. Perfect 

prediction accuracy is shown by an AUC and 

MCC of 1. As follows is a definition of these 

measures: 

 
3. RESULTS AND DISCUSSION 

In this part, we used the 5-fold cross validation 

method on the training data set to assess the 

prediction strength of the various feature 

categories individually and in combination. 

The training data set was partitioned into five 

subsets at random for 5-fold cross validation. 

Four subsets were utilized to train XGBoost, 

and the final one was used to assess the 

model's effectiveness. Five times each of the 

processes were repeated. ACC and dynein 

proteins, two performance indicators from the 

training set, were averaged, and the findings 

are displayed in Table 1. On the training data 


Ali Ghulam et al., J. mt. area res. 08 (2023) 1-13 

J. mt. area res., Vol. 8, 2023 
7 

set, it can be shown that some specific Fasta 

feature categories have a greater overall 

prediction power. This finding suggests that, 

when compared to other types of 

characteristics, Fasta-based features perform 

better in the prediction of the dynein protein. 

We perfumed. The combining various features 

could provide a more thorough 

representation of protein sequences [39, 40]. 

The combined characteristics, as shown in 

Table 2, produce the ACC of 93.95 percent 

and the MCC of 0.8346, both of which are 

greater than other Fasta-based features. In 

conclusion, the combination of all 

characteristics regularly outperformed single 

feature-based models in terms of 

performance. 

 
TP
Sensitivity = 

TP FN                   (4) 

                                    
TN
Specificity = 

TN FP                 (5) 

TP TN
Accuracy = 

TP FP TN FN



                 (6) 

 
TP* TN FP* FN
MCC = 

( TP FP )( TP FN )( TN FP )( TN FN )



   
 

         (7) 

 
Information about the amounts of h dynein 

protein and non-dynein protein in the 

aforementioned formulations is provided (true 

negative). We calculated false negative and 

false positive results are also regarded as non-

dynein proteins. In addition, utilizing X-axis 

sensitivity and Y-axis accuracy, we discovered 

an operator receptor curve (ROC). When 

assessing the model effectiveness of different 

choice values, the significance of the AUC is 

important. Horses and metaphors are both 

equally useful, accurate, and adaptable. The 

connections between accuracy, precision, 

sensitivity, specificity and MCC score have 

been identified in order to find the correlation 

of Matthew similarities. 

 
3.1 Impact of BLOSUM62 Extraction 

Algorithm performance 

Building novel protein sequences should take 

into account the difficulty of obtaining 

random protein sequences using eXtreme 

gradient boosting (XGBoost), Blossom62, due 

to the high synthesis of dynein protein . For 

several categories, the same data set is 

utilized. This is the only effective method for 

reducing the function's mixed space 

(Blossom62). The value of the ensuing Spaces 

is shown by the existence of these varied 

individual and mixed places. Analysis using 

random forests was done. The ideal model 

growth parameter is discovered via the 

eXtreme gradient boosting (XGBoost), 

Blossom62 strategy. Content analysis is utilized 

as an indicator for the analysis of parameters 

impacting the dynein protein and to gauge 

how well the dynein protein is functioning. The 

accordance with maintaining consistent 

precision values between dynein protein 

structures, and their corresponding dynein 

protein precision values are displayed in Table 

2. 

Table 2. eXtreme gradient boosting (XGBoost), 

identifying optimum parameter for various models 

 
The analysis was based on the results of a 5-fold 

cross-validation of our suggested models using 

the properties of the BLOSUM vector matrix. 

Fivefold cross-validation was used to evaluate 

the performance of our proposed model on 

 ACC Precision Sensitivity Specificity 

XGBoost 0.8676% 0.8768% 0.760% 0.9752% 


Ali Ghulam et al., J. mt. area res. 08 (2023) 1-13 

J. mt. area res., Vol. 8, 2023 
8 

both positive and negative data sets. We 

trained the model based on training sets we 

employed in our experiment, we analyses the 

outcomes. We achieved prediction 

performance was relatively strong, with an 

overall accuracy of 0.8676% percent, 

according to the computational study. Figure 2 

displays the ROC (AUC) curve score that we 

were given. As seen by the predicted score and 

the ROC-AUC of 0.91% percent, our model's 

estimation is quite accurate. The dynein proteins 

exhibit strong ROC curve performance 

(auc=0.91% percent), and the ROC curve also 

works well, using a BLOSUM -extracting feature 

model. 

 
Figure. 2. Proposed ROC-AUC score predictive 

models 

3.2 Proposed methods comparison 

performance with other ML classifiers  

The objective was to compare the 

performance of the algorithms Adaboost, RF, 

SVM, and K-NN, four widely used techniques in 

the field, in order to demonstrate eXtreme 

gradient boosting (XGBoost) superiority in 

dynein protein sequence prediction. The 

training technique was created with Python 

3.8, finished with 5-fold cross-validation, and 

included relatively extensive parameter 

tuning. We investigation support as shown 

Figure 3 validity intuitively. Processed by 

dynein proteins the precision of AUC values is 

0.92 percent when using the ROC-AUC curve, 

which employs ROC AUC curves comparable 

to those used in other methodologies. Three 

classification systems, including the KNN 

classifier (KNN), the SVC classifier, and the 

Random Forest Classifier (RFC), are used to 

supplement the results for our proposed 

eXtreme gradient boosting (XGBoost) model 

[41]. XGBoost, a traditional ensemble learning 

technique, was used to evaluate KNN and 

SVM 

 
3.3 Comparison performance with other 4 

ML Classifiers hybrid features 

Our findings can be compared to results of 

earlier studies that, we compared performance 

with the reference models (AdaBoost, RF, SVM, 

and KNN) based on accuracy, precision, 

sensitivity, specificity and MCC. We obtained 

the score based on parameters were properly 

measured, as shown in Table 3.  Excellent results 

were obtained in this combined comparison 

with 3080 features length using eXtreme 

gradient boosting (XGBoost) classification. 

Following, they employed a data set with 

features from the BLOSUM-62 vector score in an 

.


J. mt. area res., Vol. 8, 2023 

9 

 
Figure 3.  Proposed ROC-AUC score predictive models 

 
eXtreme gradient boosting (XGBoost) model. 

With BLOSUM retrieved features, we achieved 

an accuracy score of 0.8676% percent. It is 

evident that standard machine learning 

algorithms are outperformed by eXtreme 

gradient boosting (XGBoost) techniques in terms 

of prediction accuracy and AUC values. 

Table 3. Performance with other 4 ML Classifiers 

hybrid features 

 
3.4 ROC (AUC) comparison of combined 4 

ML Classifiers 

The data provide convincing evidence the 

previous article described how this experiment 

compared the efficacy of four  

 
classifiers: AdaBoost, RF, SVM, and KNN. The 

parameters for the four methods were taken 

from the classifiers' default settings. The linear 

SVM kernel by default had consequence 

coefficient values of C = 1.0. In comparison to 

the findings of earlier studies, we evaluated 

the accuracy, precision, sensitivity, specificity 

and MCC of the classification of the dynein 

protein dataset utilization. The results and the 

3080 best feature subsets from the BLOSUM62 

technique are displayed in Table 3. In specific, 

based on BLOSUM62 profiles score with 

(XGBoost 0.91% ROC (AUC) curves score 

obtained.  Dynein motor proteins that 

contain ROCs matching to one feature 

extraction approaches. Figure 4 illustrates how 

eXtreme gradient boosting (XGBoost) ROC-

auc projected outputs performed better than 

those of other machine learning models. 

 
3.5 Comparison metric performance of 

various machine learning classifiers 

The BLOSUM62 matrix profiles score is a 

component of the results analysis. The 

BLOSUM62 matrix profiles with extreme 

ML-

Classifiers 

ACC Precision Sensitivity  Specificity  

AdaBoost 0.8030% 0.7810% 0.775% 0.8311% 

KNN 0.5666% 0.5601% 0.6066% 0.5266% 

RF 0.8004% 0.7855% 0.775% 0.8259% 

SVM 0.6079% 0.6460% 0.4976% 0.7183% 

XGBoost 0.8676% 0.8768% 0.76% 0.9752% 


Ali Ghulam et al., J. mt. area res. 08 (2023) 1-13 

J. mt. area res., Vol. 8, 2023 
10 

gradient boosting (XGBoost) obtained 

accuracy score better than other four 

classification models—AdaBoost and RF, KNN, 

 
Figure. 4. Proposed model ROC-AUC score result 

 
and SVM—were each regarded to have 

significant results. The BLOSUM62 model then 

recorded a precision score, indicating that 

the overall performance of our test's eXtreme 

gradient boosting (XGBoost) classification is 

good. Second, based on our experimental 

testing, the KNN classification was obtained, 

along with the highest analyses, as shown in 

Figure 5, and the third-best classifier with 

precision on the SVM model. 

 
Figure. 5. Parameter metric performance evaluation 

of dynein protein. 

 
CONCLUSION 

An interesting side finding was that machine 

learning integrating with computational 

biology is a major concern for biological 

researchers in light of its outstanding results in 

a variety of fields. In this study, we introduced 

dynein-XGB, a predictor model built on the 

XGBoost algorithm for precise identification of 

dynein proteins. Specifically, when compared 

to earlier predictors on the benchmark 

dataset, we have attained state-of-the-art 

performance. There are three key inferences 

that can be made. First off, as compared to 

other algorithms, the XGBoost method 

performs dynein prediction with a higher level 

of stability and accuracy. Second, feature 

vectors were optimized using the feature 

selection technique known as Relief, which 

helped to extract key features from a wide 

pool of candidate features and enhance the 

model's functionality. Additionally, dynein-

XGB, in contrast to previous sequence-based 

dynein predictors, can offer relevant 

explanation based on samples supplied 

utilizing the feature importance and the SHAP 

technique. When compared to conventional 

machine learning methods, our methodology 

improved on the majority of the analyzed 

metrics. In this paper, we established a reliable 

approach for accurately identifying novel 

proteins that are members of motor 

superfamilies, which can be exploited to 

develop therapeutic targets. The 

contributions of this study may serve as a 

foundation for future research that might 

tackle numerous bioinformatics issues. 

 
DECLARATIONS 

Funding: No funding was received for this study. 

0.00%

0.50%

1.00%

1.50%

Performance comparison with four 

classifiers

Adaboost KNN RF SVM XGBoost


Ali Ghulam et al., J. mt. area res. 08 (2023) 1-13 

J. mt. area res., Vol. 8, 2023 
11 

Conflicts of interest/Competing interests: The 

authors declare no any conflict of 

interest/competing interests.  

Data availability: Not applicable 

Code availability: Not applicable 

Authors’ contributions: Ali Ghulam and Rahu 

Sikander designed the concepts write-up, Dhani 

Bux Talpur, Sajjad Hussain Talpur and Erum 

Saba carried out the experiments and analyses 

and produced the manuscript. The method's 

development and the manuscript's revision were 

helped by Ali Ghulam. Zulfikar Ahmed Maher 

and Saima Tunio oversaw the project, offered 

helpful advice on how to execute it better, and 

revised the document. The article was read and 

approved by all writers. 

REFERENCES 

[1] S.A. Burgess, M.L. Walker, H. Sakakibara, P.J. 

Knight, K. Oiwa, Dynein structure and power 

stroke, Nature 421 (2003) 715–718. 

[2] R.D. Vale, T.S. Reese, M.P. Sheetz, Identification 

of a novel force-generating protein, kinesin, 

involved in microtubule-based motility, Cell 42 

(1985) 39–50. 

[3] A.J. Roberts, T. Kon, P.J. Knight, K. Sutoh, S.A. 

Burgess, Functions and mechanics of dynein 

motor proteins, Nat. Rev. Mol. Cell Biol. 14 

(2013) 713–726 

[4]  Hirokawa, N., Noda, Y., Tanaka, Y., & Niwa, S., 

Kinesin superfamily motor proteins and 

intracellular transport. Nature reviews 

Molecular cell biology, 10(10), (2009) 682-696. 

[5] Banci, L., Bertini, I., Boca, M., Calderone, V., 

Cantini, F., Girotto, S., & Vieru, M., Structural 

and dynamic aspects related to 

oligomerization of apo SOD1 and its mutants. 

Proceedings of the National Academy of 

Sciences, 106(17), (2009) 6980-6985. 

[6] Chen, X. J., Xu, H., Cooper, H. M., & Liu, Y., 

Cytoplasmic dynein: a key player in 

neurodegenerative and neurodevelopmental 

diseases. Science China Life Sciences, 57(4), 

(2014) 372-377. 

[7] Eschbach, J., & Dupuis, L., Cytoplasmic dynein 

in neurodegeneration. Pharmacology & 

therapeutics, 130(3) (2011) 348-363. 

[8] Bar‐Or, A., Fawaz, L., Fan, B., Darlington, P. J., 

Rieger, A., Ghorayeb, C., ... & Smith, C. H., 

Abnormal B‐cell cytokine responses a trigger of 

T‐cell–mediated disease in MS?. Annals of 

neurology, 67(4) (2010) 452-461. 

[9] Le, N. Q. K., Yapp, E. K. Y., Ou, Y. Y., & Yeh, H. 

Y., iMotor-CNN: Identifying molecular functions 

of cytoskeleton motor proteins using 2D 

convolutional neural network via Chou's 5-step 

rule. Analytical biochemistry, 575 (2019) 17-26. 

[10] Zhu, C., Zhao, J., Bibikova, M., Leverson, J. D., 

Bossy-Wetzel, E., Fan, J. B., ... & Jiang, W., 

Functional analysis of human microtubule-

based motor proteins, the kinesins and dyneins, 

in mitosis/cytokinesis using RNA interference. 

Molecular biology of the cell, 16(7) (2005) 3187-

3199. 

[11] Janssens, F., Glänzel, W., & De Moor, B., 

Dynamic hybrid clustering of bioinformatics by 

incorporating text mining and citation analysis. 

In Proceedings of the 13th ACM SIGKDD 

international conference on knowledge 

discovery and data mining (2007) (360-369). 

[12] H. Khataee, A.W.-C. Liew, A mathematical 

model describing the mechanical kinetics of 

kinesin stepping, Bioinformatics 30 (2013) 353–

359 

[13] Dutta, M., & Jana, B., Computational modeling 

of dynein motor proteins at work. Chemical 

Communications, 57(3), (2021) 272-283. 

[14] Li, L., Alper, J., & Alexov, E. (2016). Cytoplasmic 

dynein binding, run length, and velocity are 

guided by long-range electrostatic 

interactions. Scientific reports, 6(1), (2012)1-12. 


Ali Ghulam et al., J. mt. area res. 08 (2023) 1-13 

J. mt. area res., Vol. 8, 2023 
12 

[15] Erdős, G., Szaniszló, T., Pajkos, M., Hajdu-Soltész, 

B., Kiss, B., Pál, G., ... & Dosztányi, Z., Novel linear 

motif filtering protocol reveals the role of the 

LC8 dynein light chain in the Hippo pathway. 

PLoS computational biology, 13(12), (2017) 

e1005885. 

[16] Gao, F. J., Hebbar, S., Gao, X. A., Alexander, 

M., Pandey, J. P., Walla, M. D., ... & Smith, D. S., 

GSK‐3β phosphorylation of cytoplasmic dynein 

reduces Ndel1 binding to intermediate chains 

and alters dynein motility. Traffic, 16(9), (2015) 

941-961. 

[17] Ho, Q. T., & Ou, Y. Y., Classifying the molecular 

functions of Rab GTPases in membrane 

trafficking using deep convolutional neural 

networks. Analytical biochemistry, 555, (2018) 

33-41. 

[18] Zou, C., Gong, J., & Li, H., An improved 

sequence-based prediction protocol for DNA-

binding proteins using SVM and comprehensive 

feature analysis. BMC bioinformatics, 14(1), 

(2013)1-14. 

[19] Zou, Q., Wan, S., Ju, Y., Tang, J., & Zeng, X., 

Pretata: predicting TATA binding proteins with 

novel features and dimensionality reduction 

strategy. BMC systems biology, 10(4), 

(2016)401-412. 

[20] Tao, Z., Li, Y., Teng, Z., & Zhao, Y., A method for 

identifying vesicle transport proteins based on 

LibSVM and MRMD. Computational and 

Mathematical Methods in Medicine, (2020). 

[21] Kumar, K., & Thakur, G. S. M., Advanced 

applications of neural networks and artificial 

intelligence: A review. International journal of 

information technology and computer 

science, 4(6), (2012) 57. 

[22] Zhang, Y., Qiao, S., Ji, S., Han, N., Liu, D., & Zhou, 

J., Identification of DNA–protein binding sites 

by bootstrap multiple convolutional neural 

networks on sequence information. 

Engineering Applications of Artificial 

Intelligence, 79, (2019)58-66. 

[23] Arif, Muhammad, et al. "StackACPred: 

prediction of anticancer peptides by 

integrating optimized multiple feature 

descriptors with stacked ensemble approach." 

Chemometrics and Intelligent Laboratory 

Systems 220 (2022): 104458. 

[24] Arif, Muhammad, et al. "DeepCPPred: a deep 

learning framework for the discrimination of 

cell-penetrating peptides and their uptake 

efficiencies." IEEE/ACM Transactions on 

Computational Biology and Bioinformatics 19.5 

(2021): 2749-2759. 

[25] Ge, Fang, et al. "TargetMM: Accurate Missense 

Mutation Prediction by Utilizing Local and 

Global Sequence Information with Classifier 

Ensemble." Combinatorial Chemistry & High 

Throughput Screening 25.1 (2022): 38-52. 

[26] Ghulam, Ali, et al. "Accurate prediction of 

immunoglobulin proteins using machine 

learning model." Informatics in Medicine 

Unlocked 29 (2022): 100885. 

[27] Ghulam, Ali, et al. "ACP-2DCNN: deep learning-

based model for improving prediction of 

anticancer peptides using two-dimensional 

convolutional neural network." Chemometrics 

and Intelligent Laboratory Systems 226 (2022): 

104589. 

[28] Ghulam, Ali, et al. "Disease-pathway 

association prediction based on random walks 

with restart and PageRank." IEEE Access 8 

(2020): 72021-72038. 

[29] J. Song, F. Li, K. Takemoto, G. Haffari, T. Akutsu, 

K.-C. Chou, G.I. Webb, PREvaIL, an integrative 

approach for inferring catalytic residues using 

sequence, structural, and network features in a 

machine-learning framework, J. Theor. Biol. 443 

(2018) 125–137. 


Ali Ghulam et al., J. mt. area res. 08 (2023) 1-13 

J. mt. area res., Vol. 8, 2023 
13 

[30] G.O. Consortium, Expansion of the gene 

Ontology knowledgebase and resources, 

Nucleic Acids Res. 45 (2016) D331–D338. 

[31] Jia, K., & Jernigan, R. L., New amino acid 

substitution matrix brings sequence alignments 

into agreement with structure matches. 

Proteins: Structure, Function, and 

Bioinformatics, 89(6), (2021)671-682. 

[32] Sakhanenko NA, Galas DJ. Biological data 

analysis as an information theory problem: 

multivariable dependence measures and the 

shadows algorithm. J Comput Biol. 

2015;22:1005-1024. 

[33] Boughorbel, S.; Jarray, F.; El-Anbari, M. Optimal 

classifier for imbalanced data using Matthews 

Correlation Coefficient metric. PLoS ONE 2017, 

12, e0177678.  

[34] Ding, Y.; Tang, J.; Guo, F. Identification of drug–

target interactions via fuzzy bipartite local 

model. Neural Comput. Appl. 2020, 32, 1–17. 

[35] Lee, B.; Richards, F. M. The interpretation of 

protein structures: estimation of static 

accessibility. J. Mol. Biol., 1971, 55, 379-400. 

[36] Statnikov, A.; Wang, L.; Aliferis, C.F. A 

Comprehensive comparison of random forests 

and support vector machines for microarray-

based cancer classification, BMC 

Bioinformatics, 2008, 9, 319. 

[37] Baldi P, Brunak S, Chauvin Y, Andersen CAF, 

Nielsen H. Assessing the accuracy of prediction 

algorithms for classification: an 

overview. Bioinformatics. 2000; 16:412–424. 

[38] Matthews BW. Comparison of the predicted 

and observed secondary structure of T4 phage 

lysozyme. Biochim Biophys Acta. 1975; 405:442–

451.  

[39] Accurate prediction of potential druggable 

proteins based on genetic algorithm and 

Bagging–SVM ensemble classifier. Artif. Intell. 

Med. 2019, 98, 35–47. 

[40] Jiang, Q.; Wang, G.; Jin, S.; Yu, L.; Wang, Y. 

Predicting human microRNA–disease 

associations based on support vector 

machine. Int. J. Data Min. Bioinform. 2013, 8, 

282–293. 

[41] Murugan, A.; Nair, S.A.H.; Kumar, K.P.S. 

Detection of Skin Cancer Using SVM, Random 

Forest and KNN Classifiers. J. Med. Syst. 2019, 

43, 269. 

 
Received: 15 Sep. 2022. Revised/Accepted: 10 Nov. 2022. 

 This work is licensed under a Creative Commons Attribution 4.0 International License. 

http://creativecommons.org/licenses/by/4.0/
http://creativecommons.org/licenses/by/4.0/
http://creativecommons.org/licenses/by/4.0/