Knowledge Engineering and Data Science (KEDS)  pISSN 2597-4602 

Vol 6, No 2, October 2023, pp. 231–248  eISSN 2597-4637 

 
https://doi.org/10.17977/um018v6i22023p231-248 

©2023 Knowledge Engineering and Data Science | W : http://journal2.um.ac.id/index.php/keds | E : keds.journal@um.ac.id 

This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/) 

Comparison of Machine Learning Algorithms for Species 
Family Classification using DNA Barcode  

Lala Septem Riza a,1,*, M Ammar Fadhlur Rahman a,2, Yudi Prasetyo a,3, Muhammad Iqbal Zain a,4, 
Herbert Siregar a,5, Topik Hidayat b,6, Khyrina Airin Fariza Abu Samah c,7, Miftahurrahma Rosyda d,8

a Department of Computer Science Education, Universitas Pendidikan Indonesia 

Jl. Dr. Setiabudi No.229, Bandung 40154, Indonesia 
b Department of Biology Education, Universitas Pendidikan Indonesia 

Jl. Dr. Setiabudi No.229, Bandung 40154, Indonesia 
c Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA Cawangan Melaka 

110 off, Jalan Hang Tuah, Malaysia 
d Universitas Ahmad Dahlan 

Jl. Kapas No.9, Yogyakarta 55166, Indonesia 
1 lala.s.riza@upi.edu*; 2 mafr@student.upi.edu; 3 yudiprasetyo@upi.edu; 4 iqbalzain99@upi.edu; 5 herbert@upi.edu; 

 6 topikhidayat@upi.edu; 7 khyrina783@uitm.edu.my; 8 miftahurrahma.rosyda@tif.uad.ac.id 
* corresponding author 

 
I. Introduction  

The development of living specimen processing technology [1] in recent decades has generated large amounts of biological data, including deoxyribonucleic acid (DNA) sequence data. The collection of DNA sequences starts with taking samples from living organisms. The sample is then processed through various stages, such as extraction, enumeration, and amplification, to obtain pieces of DNA. These DNA fragments are then collected and sequenced to obtain the nucleobase symbols (adenine (A), guanine (G), cytosine (C), and thymine (T)) that compose the DNA sequence [2]. The sequence fragments are then analyzed and assembled into a complete genome. A part of that genome is then selected as a barcode representing the species [3][4]. All these stages are depicted in Figure 1.

It has long been known that DNA sequences can be used to identify species, and nowadays this activity is better known as DNA barcoding [5][6]. DNA barcoding is a method for identifying unknown specimens. It sequences certain gene regions (loci) that represent species in each kingdom, namely: cytochrome c oxidase subunit I (COI) for animals [7], located in the mitochondrial genome,

ARTICLE INFO  ABSTRACT

Article history: 

Received 25 October 2023 

Revised 27 October 2023 

Accepted 03 November 2023 

Published online 07 November 2023 

 
Classifying plant species within the Liliaceae and Amaryllidaceae families presents 
inherent challenges due to the complex genetic diversity and overlapping 
morphological traits among species. This study explores the difficulties in accurate 
classification by comparing 11 supervised learning algorithms applied to DNA 
barcode data, aiming to enhance the precision of species family classification in these 
taxonomically intricate plant families. The ribulose-1,5-bisphosphate carboxylase-
oxygenase large sub-unit (rbcL) gene, selected as a DNA barcode locus for plants, is 
used to represent species within the Amaryllidaceae and Liliaceae families. The 
experimental results demonstrate that nearly all tested models achieve accurate species 
classification into the appropriate families, with an accuracy rate exceeding 97%, 
except for the Naïve Bayes model. Regarding computational time, the Random Forest 
model requires significantly more time for training than other models. Regarding 
memory usage, the Least Squares Support Vector Machine with a polynomial kernel, 
and Regularized Logistic Regression consume more memory than other models. 
These machine learning models exhibit strong concordance with NCBI's 
classifications when predicting families using the test dataset, effectively categorizing 
species into the Amaryllidaceae and Liliaceae families. 


Keywords: 

Machine Learning 

Supervised Classification 

Species Classification 

DNA Barcode 

rbcL Gene 

Data Analysis 

Bioinformatics          



 L. S. Riza et al. / Knowledge Engineering and Data Science 2023, 6 (2): 231–248 232 

 
ribulose-1,5-bisphosphate carboxylase-oxygenase large subunit (rbcL) and maturase K (matK) for plants [8], located in the chloroplast genome, and internal transcribed spacer (ITS) for fungi [9], located in the nuclear genome.

 
Fig. 1. Process of processing living specimens into DNA barcodes 

The process of identifying species in DNA barcoding is done by analyzing the similarity of a 

barcode belonging to a specimen with another barcode belonging to a species already known in the 

database. The specimen can be classified as an existing species if the barcode has a high degree of 

similarity. If no barcode pairs are found with a high degree of similarity, then the specimen may be 

a new species and needs to be verified by a taxonomist. 
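The matching step described above can be illustrated with a minimal R sketch. It assumes the barcodes are already aligned to the same length. The reference sequences, the helper names percent_identity and identify_specimen, and the 97% threshold are hypothetical choices made for illustration, not values taken from this study.

```r
# Hypothetical aligned reference barcodes (made-up data)
references <- c(
  species_a = "acgtacgtac",
  species_b = "acgtttgtac",
  species_c = "ttgtacgaac"
)

# Percentage of positions at which two equal-length sequences agree
percent_identity <- function(x, y) {
  x_chars <- strsplit(x, "")[[1]]
  y_chars <- strsplit(y, "")[[1]]
  stopifnot(length(x_chars) == length(y_chars))
  100 * mean(x_chars == y_chars)
}

# Return the best-matching species, or NA when no reference is similar enough
identify_specimen <- function(query, references, threshold = 97) {
  scores <- vapply(references, percent_identity, numeric(1), y = query)
  best <- which.max(scores)
  if (scores[[best]] >= threshold) names(scores)[best] else NA_character_
}

known <- identify_specimen("acgtacgtac", references)    # matches species_a
unknown <- identify_specimen("ggggggggga", references)  # no close match
```

A result of NA corresponds to the case in which the specimen may be a new species and would need verification by a taxonomist.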

Several approaches are commonly used to classify species using DNA barcodes: tree-based, similarity-based, and character-based [10][11]. The tree-based method classifies a barcode into species based on its membership in the DNA barcode tree. The similarity-based method classifies barcodes based on the number of similar characters in the DNA barcode, while the character-based method relies on the presence or absence of specific characters in the DNA barcode. In addition to these three approaches, species classification using DNA barcodes can also be treated as a supervised machine learning problem [12][13][14][15][16].

The Liliaceae family, colloquially called the 'Lily Family', predominantly consists of 

monocotyledonous plants characterized by notable morphological diversity. Encompassing 

approximately 16 genera and over 610 species [17], members of this family manifest primarily as 

herbs and shrubs. They are predominantly distributed across temperate and subtropical regions [18]. 

The amphipathic properties inherent to certain compounds within Liliaceae render them effective as 

surfactants. Beyond their ecological significance, these plants exhibit multifaceted utility: they are 

esteemed for ornamental purposes and utilized as vegetables, and certain species are acknowledged 

for their medicinal properties. Given the vast potential inherent to the Liliaceae family, they hold 

promise in cosmetics and pharmaceutical development [19]. 

The Amaryllidaceae family, a prominent member of the order Asparagales, is distinguished by its 

bulbous flowering plants. These plants are celebrated for their visually captivating flowers, making 

them famous for ornamental cultivation [20]. From a taxonomic perspective, the Amaryllidaceae 

family is stratified into three subfamilies: Agapanthoideae, Allioideae, and Amaryllidoideae  [21]. 

Historically, these were regarded as distinct families. The term “Amaryllidaceae” is recurrently cited 

in phytochemical and pharmaceutical literature, particularly in discussions centered on the 

Amaryllidoideae subfamily [20][22]. 

The medicinal potential of the Amaryllidaceae family is both historical and contemporary. 

Tracing back to the Classical period, luminaries like Hippocrates and Dioscorides harnessed the 

therapeutic properties of Narcissus oil, particularly for conditions believed to be associated with 

uterine tumors. In modern traditional medicine, the applications are diverse. For instance, 

Ammocharis is employed for blood purification and wound treatment, Brunsvigia for respiratory and 



 
hepatic ailments, Clivia for snakebites and facilitation of childbirth, and Crinum for a spectrum of 

conditions ranging from tumors to rheumatism [23]. 

In previous research, the Amaryllidaceae family was classified under the Liliaceae family. 

However, advancements in phylogenetics have led to a taxonomic reorganization. A team of 

scientists, spearheaded by Rolf Dahlgren [24], extensively examined monocot characteristics, 

including numerous microscopic features, culminating in a revised classification. 

Historically, taxonomic experts such as Bentham and Hooker [25], Engler and Prantl [26], Bessey 

[27], Rendle [28], and Hutchinson [29] categorized Amaryllidaceae with an inferior ovary and 

Liliaceae with a superior ovary into distinct families based on ovary position differences. Despite 

these distinctions, both families exhibited numerous shared characteristics. Consequently, Cronquist 

[30] and Takhtajan [31] integrated the Amaryllidaceae family into Liliaceae. Further research 

regarded 'lilies' as a heterogeneous collection of genera and positioned them in families grouped 

under two orders: Asparagales and Liliales [32]. 

The classification problem between the two families is illustrated by Allium albopilosum. This species, indigenous to Turkestan, is cultivated for its notable utility as a cut flower. While traditionally, Allium species have been categorized under the Liliaceae family due to the presence of

superior ovaries in their flowers, there exists a divergence of opinion among botanists. Some propose 

their reclassification to the Amaryllidaceae family, citing the characteristic umbellate inflorescence. 

Conversely, others advocate for a distinct classification, suggesting establishing a unique family, 

Alliaceae, to accommodate them [33]. 

The Consortium for the Barcode of Life [8] advocated the rbcL gene as a barcode for plant taxonomy and phylogenetic analysis. This gene is pivotal in plant species identification and in the study of phylogenetic relationships. The rbcL gene is located in chloroplast DNA [8]. Several studies have employed the rbcL gene for plant relationship research. For instance, the rbcL gene has been used to elucidate the relationships within Selaginellaceae [34]. Similarly, another study combined the rbcL gene with trnL-F for a phylogenetic study on Rhamnaceae [35].

Machine learning is a field of study that attempts to extract knowledge from available data using computer programs that learn and improve automatically based on experience [36][37]. Currently, the application of machine learning can be found in various activities in everyday life, such as recommendations for goods in Amazon e-commerce services [38], recommendations on the music streaming platform Spotify [39], and recommendations in education assessment [40][41][42]. In bioinformatics, machine learning has been widely used to solve problems in various areas, including genomics, proteomics, systems biology, evolution, microarrays, and text mining [43][44][45]. The application of machine learning in each case handles the different characteristics of the input data.

Based on the type of feedback from the input data, there are three forms of learning: supervised 

learning, unsupervised learning, and reinforcement learning [46]. Of the three forms of machine 

learning, bioinformatics case studies generally use supervised learning and unsupervised learning to 

solve problems. For example, supervised learning is used in genomics for the case of gene finding 

[47]. Another example is the application of Support Vector Machines (SVM) [48] and Random 

Forests (RF) [49] for the prediction of phenotypic effects [50]. An example of the application of unsupervised learning in bioinformatics is the clustering of genes in microarray data into groups with specific biological meanings [51].

This study attempts to compare supervised machine learning algorithms to predict families of 

species based on DNA barcode sequences in the R programming language. By predicting the family, 

we can more accurately place the species in the correct family in the taxonomy. Machine learning 

algorithms that are used in this research are Random Ferns, SVM Linear, SVM Poly, SVM Radial, 

SVM Radial Weights, LSSVM Poly, Naïve Bayes, Random Forest, C5.0, K-Nearest Neighbours, 

and Regularized Logistic Regression.  

The DNA barcode sequence employed in this study is derived from a segment of the chloroplast 

gene specific to the rbcL gene region of each examined species. This research contributes to resolving 

the existing classification ambiguity between the Liliaceae and Amaryllidaceae families. It 



 
accomplishes this by applying various machine learning methodologies, the results of which are 

juxtaposed with contemporary, state-of-the-art classification systems from NCBI to yield more 

definitive insights into the precise familial categorizations. 

II. Methods 

A. Data Collection 

The data used are DNA barcode sequence data obtained from GenBank [52] (ncbi.nlm.nih.gov, accessed August 15, 2023). The dataset contains rbcL gene sequences from the chloroplast genome of plants in the Amaryllidaceae and Liliaceae families. Information on the number of species, sequences, and file size of each dataset is listed in Table 1.

Table 1. Descriptions of the used datasets

Dataset          Number of Species   Number of DNA Sequences   File Size (kB)
Training Data
  Amaryllis      308                 689                       708.4
  Lily           331                 713                       784.3
Testing Data
  Amaryllis      23                  113                       114.7
  Lily           28                  140                       136.5
Total            690                 1,655                     1,743.9

 
The Amaryllis dataset contains 802 samples from the Amaryllidaceae family, of which 689 were used for training and 113 for testing. Meanwhile, the Lily dataset comes from the Liliaceae family and contains 853 samples, of which 713 were used for training and 140 for testing. The sequences in the dataset vary in length (base pairs; bp), with the longest sequence having 1,458 bp and an average sequence length of 903 bp.

The training dataset was obtained by downloading the sequences of all species in each family while omitting several selected species in the Amaryllidaceae and Liliaceae families. The complete list of species omitted from the training dataset can be seen in Table 2. The testing dataset consists of the sequences of the species omitted from the training dataset. The number of species in the testing dataset in Table 1 differs from the number of species in Table 2 because (1) not all species had samples of the rbcL gene sequence in GenBank at the time of data collection (example: Allium chrysanthum) and (2) GenBank distinguishes main species from varieties/sub-species (example: Crinum asiaticum and Crinum asiaticum var. japonicum). All species collected in the testing dataset are listed in Table 3.

The entire dataset is downloaded and saved in FASTA format. Figure 2 shows an example of 

dataset content containing the GenBank accession number, species name, sequence description, and 

DNA sequence. Each sequence is indicated by a line starting with the greater than symbol (“>”) and 

ending with a blank line. 

 
Fig. 2. Example of dataset content in FASTA format
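The record layout described above (a header line starting with ">" followed by sequence lines and a blank line) can be read with a few lines of base R. This is only a sketch: the accession numbers and sequences are made up, and parse_fasta is a hypothetical helper, not the Biostrings-based parsing used in the study.

```r
# Hypothetical FASTA content with made-up accession numbers and sequences
fasta_text <- c(
  ">XX000001.1 Allium cepa rbcL gene, partial cds",
  "ATGTCACCACAAACAGAG",
  "ACTAAAGCAAGT",
  "",
  ">XX000002.1 Lilium davidii rbcL gene, partial cds",
  "ATGTCACCACAAACGGAG",
  ""
)

parse_fasta <- function(lines) {
  lines <- lines[nzchar(lines)]        # drop blank separator lines
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)       # record index for every line
  headers <- sub("^>", "", lines[is_header])
  # Concatenate the sequence lines that belong to each record
  sequences <- vapply(
    split(lines[!is_header], record_id[!is_header]),
    paste, character(1), collapse = ""
  )
  data.frame(
    accession = sub(" .*$", "", headers),
    description = sub("^[^ ]+ ", "", headers),
    sequence = unname(sequences),
    stringsAsFactors = FALSE
  )
}

records <- parse_fasta(fasta_text)
```

The sketch assumes every header is followed by at least one sequence line.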



 
Table 2. List of species selected for test data

No.  Amaryllis                  Lily
1    Agapanthus campanulatus    Alstroemeria aurea
2    Allium altaicum            Calochortus apiculatus
3    Allium cepa                Calochortus lyallii
4    Allium chrysanthum         Cardiocrinum cathayanum
5    Allium chrysocephalum      Cardiocrinum cordatum
6    Allium fistulosum          Cardiocrinum giganteum
7    Allium monanthum           Erythronium albidum
8    Allium obliquum            Erythronium americanum
9    Allium porrum              Fritillaria unibracteata
10   Allium prattii             Gagea serotina
11   Allium pskemense           Lilium bulbiferum
12   Allium sativum             Lilium davidii
13   Allium tuberosum           Lilium distichum
14   Allium xichuanense         Lilium fargesii
15   Amaryllis minuta           Lilium lancifolium
16   Crinum asiaticum           Lilium longiflorum
17   Crinum macowanii           Lilium pardalinum
18   Hymenocallis caribaea      Lloydia oxycarpa
19   Hymenocallis henryae       Medeola virginiana
20   Hymenocallis tubiflora     Nomocharis aperta
21   Lycoris radiata            Scoliopus bigelovii
22   Narcissus poeticus         Tricyrtis macropoda
23   Pancratium arabicum        Tulipa gesneriana
24   Zephyranthes candida       Zigadenus glaberrimus
25   Zephyranthes simpsonii

 
Table 3. List of species included in test data

No.  Amaryllis                        Lily
1    Agapanthus campanulatus          Calochortus apiculatus
2    Allium altaicum                  Calochortus lyallii
3    Allium ampeloprasum              Cardiocrinum cathayanum
4    Allium cepa                      Cardiocrinum cordatum
5    Allium fistulosum                Cardiocrinum giganteum
6    Allium monanthum                 Cardiocrinum giganteum var. giganteum
7    Allium prattii                   Cardiocrinum giganteum var. yunnanense
8    Allium pskemense                 Erythronium albidum
9    Allium sativum                   Erythronium americanum
10   Allium tuberosum                 Fritillaria unibracteata
11   Amaryllis minuta                 Fritillaria unibracteata var. longinectarea
12   Crinum asiaticum                 Gagea serotina
13   Crinum asiaticum var. japonicum  Lilium apertum
14   Crinum macowanii                 Lilium bulbiferum
15   Hymenocallis caribaea            Lilium bulbiferum subsp. croceum
16   Hymenocallis henryae             Lilium davidii
17   Hymenocallis tubiflora           Lilium davidii var. willmottiae
18   Lycoris radiata                  Lilium distichum
19   Narcissus poeticus               Lilium fargesii
20   Narcissus poeticus var. plenus   Lilium lancifolium
21   Pancratium arabicum              Lilium longiflorum
22   Zephyranthes candida             Lilium longiflorum var. scabrum
23   Zephyranthes simpsonii           Lilium pardalinum
24                                    Lilium pardalinum subsp. pardalinum
25                                    Lloydia oxycarpa
26                                    Medeola virginiana
27                                    Tricyrtis macropoda
28                                    Tulipa gesneriana

 
B. Computational Model 

The computational model used in this study is depicted in Figure 3. This study uses the R programming language, version 4.2.1, run on a computer with an Intel Core i5-1135G7 processor (eight logical cores, 2.4 GHz), 16 GB of RAM, and a 512 GB solid-state disk (SSD). Several stages use packages available in the public repositories CRAN and Bioconductor. However, some preparatory steps are still required to use these packages according to research needs. Each stage in the computational model of this research is explained as follows.

 
Fig. 3. Computational model of comparison of machine learning algorithms for species family classification using DNA 

barcode 



 
The first stage is to retrieve the training/testing dataset. All data are downloaded using program code with the help of the rentrez package [53]. First, a filter query is made to search for DNA sequences that match the following criteria: (1) they belong to members of the Amaryllidaceae and Liliaceae families, (2) they are more than 450 bp and less than 10,000 bp in length, (3) they exclude the held-out species (for the training data) or include only the species selected for testing (for the testing data), and (4) they are from the rbcL gene. The search results are used to download the whole sequences in FASTA format. A series of pre-processing stages is then carried out so that the DNA sequences can be used in the classification model. The pre-processing starts from the DNA Sequence Parsing stage and ends with Family Labeling.
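The four criteria above could be combined into a single search term along the following lines. The paper does not show the actual query, so this is a sketch under assumptions: the helper build_query, the use of the Entrez field tags [Organism], [Gene], and [SLEN], and the two excluded species in the usage line are illustrative only.

```r
# Build an Entrez search term for one family (training-data variant).
# Field tags and helper name are assumptions, not taken from the paper.
build_query <- function(family, excluded_species = character(0),
                        min_len = 451, max_len = 9999, gene = "rbcL") {
  parts <- c(
    sprintf("%s[Organism]", family),          # criterion 1: family membership
    sprintf("%s[Gene]", gene),                # criterion 4: rbcL gene
    sprintf("%d:%d[SLEN]", min_len, max_len)  # criterion 2: sequence length
  )
  query <- paste(parts, collapse = " AND ")
  if (length(excluded_species) > 0) {
    # criterion 3: leave out the species reserved for testing
    excluded <- paste(sprintf("\"%s\"[Organism]", excluded_species),
                      collapse = " OR ")
    query <- sprintf("%s NOT (%s)", query, excluded)
  }
  query
}

query <- build_query("Amaryllidaceae", c("Allium cepa", "Allium sativum"))
```

For the testing data, the same parts would be joined with AND instead of NOT so that only the selected species are returned.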

The second is DNA Sequence Parsing. At this stage, sequences in FASTA format are converted 

to the DNAStringSet format with the help of the Biostrings package [54]. The results of the sequence 

conversion in this stage are exemplified in Figure 4. 

 
Fig. 4. Conversion of DNA sequences from FASTA format to the DNAStringSet data type 

The third is sequence alignment. The datasets are combined and processed so that the symbols in the sequences are arranged across sequences and all sequences have the same length. Sequence alignment is run using the MUSCLE (Multiple Sequence Comparison by Log-Expectation) algorithm with the help of the muscle package [55].

Fourth, aligned sequence parsing. The sequence alignment results are then converted to DNAbin 

format with the help of the ape (Analyses of Phylogenetics and Evolution) package [56] so that it 

can be read by the package used in the next stage. 

Fifth is sequence trimming. The aligned sequences are trimmed so that there are no gap symbols at each sequence's upstream (left) and downstream (right) ends. The sequences were trimmed with the help of the ips (Interfaces to Phylogenetic Software) package [57] until 99% of the sequences had no gaps at the upstream and downstream ends. Figure 5 shows an example of DNA sequence data before and after sequence alignment and trimming.

 
Fig. 5. DNA sequences before and after alignment and trimming 
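The idea behind end trimming can be sketched in base R as follows. This is a simplified illustration, not the ips implementation: the toy alignment and the helper trim_alignment_ends are hypothetical, and a threshold of 0.75 is used instead of 0.99 only because the example has four sequences.

```r
# Hypothetical aligned sequences: rows are sequences, columns are positions
alignment <- rbind(
  s1 = strsplit("--acgtac--", "")[[1]],
  s2 = strsplit("-aacgtacg-", "")[[1]],
  s3 = strsplit("--acgtacg-", "")[[1]],
  s4 = strsplit("gaacgtacgt", "")[[1]]
)

# Drop leading and trailing columns until the outermost columns contain a
# nucleotide (not a gap) in at least `threshold` of the sequences
trim_alignment_ends <- function(aln, threshold = 0.99) {
  non_gap_share <- colMeans(aln != "-")
  keep <- which(non_gap_share >= threshold)
  if (length(keep) == 0) return(aln[, 0, drop = FALSE])
  aln[, min(keep):max(keep), drop = FALSE]
}

trimmed <- trim_alignment_ends(alignment, threshold = 0.75)
trimmed_strings <- apply(trimmed, 1, paste, collapse = "")
```

Only the two ends are cut, so gaps inside the retained region are kept, and a stricter threshold yields shorter sequences.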



 
Sixth, conversion to the data frame. The sequences resulting from Sequence Trimming are converted into a data frame, the fundamental tabular structure commonly used in the R programming language. Each symbol in the sequence is converted to a column with the character data type (chr). The DNA representation in the data frame is shown in Figure 6.

 
Fig. 6. DNA sequences in the data frame 

Seventh is the training and testing set split. The data in the data frame are separated back into data frames for training and testing. All sequences whose species names are listed in Table 3 are moved into a new data frame as the testing data frame.

Eighth, DNA bases are cast to factors. Each column containing DNA sequence symbols in the data frame is cast to an unordered factor data type with five levels. These levels represent the gap symbol and the four nucleobases in the DNA sequence: “-”, “a”, “c”, “g”, and “t”. Nucleobase symbols that are ambiguous are replaced with gaps.

Ninth is family labeling. A new column filled with family labels according to the data from GenBank is added to the training data frame, while the testing data frame receives a new, empty column for the family.

The next stage is one-hot encoding. In this stage, the data are transformed into a numeric representation to facilitate subsequent processing. Specifically, each character that represents a nucleobase or a gap derived from the alignment process (“a”, “c”, “g”, “t”, or “-”) is mapped to five columns of a matrix. The column corresponding to the specific character is assigned a value of 1, while the remaining columns are assigned a value of 0, as illustrated in Figure 7 [58].

 
Fig. 7. One hot encoding process 
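As a small worked example of this mapping, the following base R sketch encodes one short, made-up aligned sequence. The variable names and the column order of the symbols are assumptions chosen for illustration.

```r
# One-hot encode a single aligned sequence (hypothetical example)
symbols <- c("a", "c", "g", "t", "-")
sequence <- strsplit("ac-gt", "")[[1]]

# Compare every position with every symbol: rows are positions,
# columns are symbols, and each row contains exactly one 1
one_hot <- outer(sequence, symbols, "==") * 1
dimnames(one_hot) <- list(sequence, symbols)

# Flatten row by row so each position contributes five consecutive values
encoded <- as.vector(t(one_hot))
```

A sequence of length L therefore becomes a numeric vector of length 5L, which is the form in which each sample is passed to the classifiers.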

After that comes model training. At this stage, a prediction model is built from the prepared training data frame. The packages used to build the classification models are C50, kknn, LiblineaR, naivebayes, rFerns, randomForest, kernlab, and caret. Experiments were carried out on the dataset and on the parameters of the Random Ferns algorithm, namely the number of ferns and the depth. A cross-validation process is also carried out with the help of the caret package [59] to ensure the model is neither overfitting nor underfitting. The parallel [60] and doParallel [61] packages speed up the cross-validation resampling, and the foreach package [62] is used to turn parallel compute mode off again. The models used in this experiment are: C5.0, Knn (k-Nearest Neighbors), lssvmPoly (Least Squares Support Vector Machine with a polynomial kernel), naive_bayes, regLogistic (Regularized Logistic Regression), rf (Random Forest), rFerns (Random Ferns), svmLinear (Support Vector Machines with Linear Kernel), svmPoly (Support Vector Machines with Polynomial Kernel), svmRadial (Support Vector Machines with Radial Basis Function Kernel), and svmRadialWeights (Support Vector Machines with Class Weights).

The next one is prediction. Class prediction is carried out on the testing data frame based on the models built in the previous stage.

The last is evaluation. The prediction results of the classification models are evaluated based on their accuracy with respect to the family label of each sequence in GenBank and the results of the sequence consensus made using the DECIPHER package [63]. The duration and memory used when training each model are measured using the profvis package [64].
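The accuracy measure used in the evaluation stage can be computed from a confusion matrix, as in the following base R sketch. The six labels are made up for illustration and do not come from the study's results.

```r
# Hypothetical family labels for six test sequences
reference <- factor(c("Amaryllidaceae", "Amaryllidaceae", "Liliaceae",
                      "Liliaceae", "Liliaceae", "Amaryllidaceae"))
predicted <- factor(c("Amaryllidaceae", "Liliaceae", "Liliaceae",
                      "Liliaceae", "Liliaceae", "Amaryllidaceae"),
                    levels = levels(reference))

# Rows are predicted families, columns are the reference (GenBank) families
confusion <- table(Predicted = predicted, Reference = reference)

# Accuracy is the share of sequences on the diagonal of the confusion matrix
accuracy <- sum(diag(confusion)) / sum(confusion)
```

Fixing the factor levels of the predictions to those of the reference keeps the table square even when a model never predicts one of the families.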

III. Results and Discussions 

This study used rbcL gene sequence data from species in the Amaryllidaceae and Liliaceae 

families obtained from GenBank. Each species in the dataset has more than one sample because each 

sequence comes from sequencing results in different locations. All dataset downloads are performed 

using program code. For example, downloading the Amaryllis training dataset starts by searching 

GenBank using the entrez_search function from the rentrez package in the following program code. 

The argument for the term parameter is a variable that contains the search query. 

search_result <- rentrez::entrez_search(
  db = "nuccore",
  term = query,
  retmax = limit,
  use_history = TRUE
)

The search results are then used to download the sequences using the entrez_fetch function from the rentrez package. The download iterates in chunks of 50 IDs until the DNA sequences from the search results are entirely downloaded.

sequences <- ""
chunk_size <- 50
ids <- search_result$ids

# Number of chunks needed to cover all ids, rounded up
num_iterations <- ceiling(length(ids) / chunk_size)

for (i in seq_len(num_iterations)) {
  start_idx <- (i - 1) * chunk_size + 1
  end_idx <- min(i * chunk_size, length(ids))
  current_ids <- ids[start_idx:end_idx]

  # Fetch this chunk of records from the nuccore database in FASTA format
  fetchRes <- rentrez::entrez_fetch(db = "nuccore", id = current_ids,
                                    rettype = "fasta")

  # Append the chunk only if it is not already part of the result
  if (!grepl(fetchRes, sequences, fixed = TRUE)) {
    sequences <- paste0(sequences, fetchRes)
  }
}

The downloaded result is then exported into a file using the write function. Detailed information about the datasets successfully downloaded and used in this study has been presented in the Data Collection section.

write(amaryllis_train, file = "amaryllis_train.fasta") 

After all datasets have been downloaded, the next step is to carry out the series of pre-processing stages and experiments. The configuration of this experimental scenario is shown in Table 4. All the data went through the same pre-processing stages up to sequence alignment. After that step, the scenario is determined by the sequence trimming threshold, which affects the length of the resulting sequences after the sequence trimming stage. Each configuration was resampled using ten repetitions of 10-fold cross-validation.

Table 4. Experimental scenario and results

Scenario                        Result
Sequence Trimming Threshold     99%
Amaryllis training samples      689
Lily training samples           713
Total training samples used     1,402
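The resampling scheme (ten repetitions of 10-fold cross-validation on the 1,402 training samples) can be sketched in base R as shown here. The helper repeated_kfold and the seed are hypothetical. In caret, such a scheme would typically be requested with trainControl(method = "repeatedcv", number = 10, repeats = 10), although the paper does not state the exact call that was used.

```r
# Assign each of n samples to one of k folds, repeated several times
repeated_kfold <- function(n, k = 10, repeats = 10, seed = 1) {
  set.seed(seed)
  lapply(seq_len(repeats), function(r) {
    fold_id <- sample(rep(seq_len(k), length.out = n))  # shuffled fold labels
    split(seq_len(n), fold_id)                          # held-out rows per fold
  })
}

folds <- repeated_kfold(n = 1402)

# Every repetition contributes k resamples, so 10 x 10 models are fitted
n_resamples <- length(folds) * length(folds[[1]])
```

Within one repetition every sample is held out exactly once, and the folds differ between repetitions because the labels are reshuffled.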

 
One of the steps in the pre-processing stage is sequence alignment. The MUSCLE algorithm is used with the help of the muscle package. The DNA sequence data from the training and testing datasets, combined and converted to the DNAStringSet data type, are given as the argument to the muscle function. After sequence alignment, each sequence has a length of 3,050 bp and is stored in DNAMultipleAlignment format.

aligned_dataset <- muscle::muscle(parsed_dataset) 

The aligned sequences are then trimmed using the trimEnds function from the ips package. At this stage, a minimum threshold for the number of sequences that must have no gap at the ends of the alignment can be set in the sequence_trimming_threshold variable. The DNA sequences converted to the DNAbin data type are given as the first argument of this function. After the sequence trimming process with a 99% threshold configuration, each sequence has a length of 1,192 bp.

parsed_aligned_dataset <- ape::as.DNAbin(aligned_dataset)
trimmed_dataset <- ips::trimEnds(parsed_aligned_dataset, trim_at_least)

The DNA sequences are then converted to a data frame and separated again into training and testing sets. The separation is based on the sequence labels (row names) that each dataset had before being combined. After splitting, the test data frame contains 220 sequence rows and the training data frame contains 1,402 sequence rows. Note that the threshold set in the sequence trimming process affects the resulting sequence data: the larger the threshold, the shorter the sequences become, which can cause the sequence data of different specimens to have identical content.

Next, the base symbols in the sequence columns of the data frame are converted to a factor data type. The factor function, which encodes a vector as a factor, is passed as the second argument of the lapply function. Any further arguments to lapply are passed on to that function. Here, the levels parameter receives a vector of five levels representing the gap and the four nucleotide base symbols. After the conversion, any symbol other than these five becomes NA and is then replaced with a gap symbol.

# Convert every column to a factor with the five allowed levels
train_df[] <- lapply(train_df, factor, levels = c("-", "a", "c", "g", "t"))
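The replacement of out-of-level symbols with gaps is described in the text but not shown in the listing. A minimal sketch of both steps together is given here, assuming a small made-up data frame in which "n" and "r" stand for ambiguous bases. The variable base_levels is a hypothetical name.

```r
# Hypothetical data frame of aligned bases; "n" and "r" are ambiguity codes
train_df <- data.frame(
  V1 = c("a", "n", "-"),
  V2 = c("c", "g", "r"),
  stringsAsFactors = FALSE
)

base_levels <- c("-", "a", "c", "g", "t")

train_df[] <- lapply(train_df, function(column) {
  column <- factor(column, levels = base_levels)  # unknown symbols become NA
  column[is.na(column)] <- "-"                    # NA is replaced with a gap
  column
})
```

Assigning to train_df[] keeps the object a data frame, so row names and column names are preserved.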

In the training data frame, a new column is added for the family label based on the information 

obtained from GenBank. Labeling is done based on data sequence labels (row names) in each training 

dataset (Amaryllis and Lily) before being combined. After this step, the data is transformed into a 

numeric representation using one-hot encoding. 

oneHotDNA <- function(df, n = c("A", "C", "G", "T", "-")) {
  # Convert all elements in df to uppercase
  df <- as.data.frame(lapply(df, function(x) toupper(as.character(x))),
                      stringsAsFactors = FALSE)

  seq_col <- ncol(df)
  seq_row <- nrow(df)

  # One block of length(n) indicator columns per sequence position
  seq_mat <- matrix(0, nrow = seq_row, ncol = seq_col * length(n))
  column_names <- character(0)

  for (i in seq_len(seq_col)) {
    for (j in seq_len(seq_row)) {
      # Position of the symbol in n, or NA if it is not one of the symbols
      pos <- match(df[j, i], n)
      if (!is.na(pos)) {
        seq_mat[j, (i - 1) * length(n) + pos] <- 1
      }
    }
    column_names <- c(column_names, paste0(n, "-", i))
  }

  colnames(seq_mat) <- column_names
  seq_mat
}

Next, the classification models are built with the parameters obtained from the random search process. The first argument is the model formula, and the second is the data source. The third argument is the model being used, which was defined previously with the parameters obtained from the random search process. The last two arguments are the configuration for caret's train function and the maximum number of tuning parameter combinations to be generated. The resulting model is then stored in the model variable. The tilde operator (~) is used to define the model formula. The left side of the operator is interpreted as the response, and the right side as the predictors. In this case, the formula Family ~ . means that the Family column is modeled as a function of all columns other than the Family column.

model <- train_profile( 
    Family ~ ., 
    data = data, 
    method = custom_model, 
    trControl = train_configuration, 
    tuneLength = tune_length 
) 
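
The effect of the dot in the formula can be checked with base R alone. In the sketch below, the toy data frame and its column names (A_1, C_1) are assumptions made for illustration; underscores are used because names containing a hyphen would need backticks inside a formula.

```r
# Toy data frame: a Family label plus two hypothetical encoded feature columns.
toy <- data.frame(
  Family = factor(c("Amaryllidaceae", "Liliaceae", "Liliaceae")),
  A_1 = c(1, 0, 0),
  C_1 = c(0, 1, 1)
)

# terms() expands the dot against the data, listing every predictor column.
model_terms <- terms(Family ~ ., data = toy)
predictors <- attr(model_terms, "term.labels")

# The first variable of the expanded formula is the response.
response <- all.vars(model_terms)[1]
```

Here predictors holds every column except Family, which is what the formula Family ~ . asks the training function to use as input.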

Classification models can be created directly using the default parameters. However, a model that
is fitted only once gives no indication of whether it is the best one available. To verify the
performance of the classification model, a resampling step with cross-validation was carried out
using the caret package. In this study, 10-fold cross-validation was used with ten repetitions.
Caret supports creating classification models and cross-validation with models from other machine
learning packages such as C5.0, kknn, LiblineaR, naivebayes, rFerns, randomForest, and kernlab.
Below is an example of creating a random fern model.

custom_model <- caret::getModelInfo("rFerns")$rFerns 
 
custom_model$parameters <- data.frame( 
  parameter = c("depth", "ferns"), 
  class = rep("numeric", 2), 
  label = c("Fern Depth", "Number of Ferns") 
) 
 
custom_model$grid <- function(x, y, len = NULL, search = "grid") { 
  if (search == "grid") { 
    out <- expand.grid( 
      depth = unique(floor(seq(1, 16, length = len))), 
      ferns = floor((1:len) * 100) 
    ) 
  } else { 
    out <- data.frame( 
      depth = sample(1:16, size = len, replace = TRUE), 
      ferns = sample(1:1000, replace = TRUE, size = len) 
    ) 
  } 
  out 
} 
 
custom_model$fit <- function(x, y, wts, param, lev, last, classProbs, ...) { 
  if (!is.data.frame(x) | inherits(x, "tbl_df")) { 
    x <- as.data.frame(x, stringsAsFactors = TRUE) 
  } 
  rFerns::rFerns(x, y, depth = param$depth, ferns = param$ferns, ...) 
} 
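
To see what the two search modes of the grid function produce, the same logic can be run on its own without caret. The function name rferns_grid and the chosen values of len are assumptions for illustration only.

```r
# Stand-alone copy of the tuning-grid logic, so it can be run without caret.
rferns_grid <- function(len, search = "grid") {
  if (search == "grid") {
    # Full Cartesian product of evenly spaced depths and fern counts.
    expand.grid(
      depth = unique(floor(seq(1, 16, length.out = len))),
      ferns = floor((1:len) * 100)
    )
  } else {
    # Random search: len independently sampled (depth, ferns) pairs.
    data.frame(
      depth = sample(1:16, size = len, replace = TRUE),
      ferns = sample(1:1000, size = len, replace = TRUE)
    )
  }
}

set.seed(1)                                    # only to make the random draw repeatable
grid_candidates <- rferns_grid(3, "grid")      # 3 depths x 3 fern counts = 9 rows
random_candidates <- rferns_grid(5, "random")  # 5 sampled pairs
```

With random search the number of candidates grows linearly with len, whereas the grid grows with the product of the two parameter ranges.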

After the model definition is completed, a cross-validation configuration is prepared using the 

trainControl function. 

train_configuration <- caret::trainControl( 
  method = "cv", 
  number = 10, 
  search = "random", 
  timingSamps = 1, 
  verboseIter = TRUE 
) 
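
The idea behind 10-fold cross-validation with ten repetitions can be illustrated in base R. This is a simplified sketch under stated assumptions: the function name repeated_folds and the sample size of 95 are hypothetical, and, unlike caret, the sketch does not stratify the folds by class. In caret itself, repetitions are requested through method = "repeatedcv" together with the repeats argument of trainControl.

```r
# Assign each of n samples to one of k folds, repeated `repeats` times.
# Returns an n x repeats integer matrix of fold ids.
repeated_folds <- function(n, k = 10, repeats = 10) {
  sapply(seq_len(repeats), function(r) sample(rep_len(seq_len(k), n)))
}

set.seed(42)                       # arbitrary seed, for repeatability only
folds <- repeated_folds(n = 95, k = 10, repeats = 10)

# Rows held out in fold 3 of repetition 1; the remaining rows would train the model.
held_out <- which(folds[, 1] == 3)
```

Each repetition reshuffles the fold assignment, so every sample is held out exactly once per repetition and ten times overall.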

In practice, this resampling process takes a relatively long time. To speed it up, resampling is
configured to run in parallel across the available CPU cores. By default, caret allows the
resampling process to run in parallel, but a socket cluster must still be created and registered
before parallel processing can take place. The makePSOCKcluster function from the parallel package
is called with the number of cores to use, in this case eight. The cluster is then registered as
the parallel backend using the registerDoParallel function from the doParallel package. The
creation of the socket cluster can be checked with the showConnections function.

# Enable Parallel Processing 
cl <- parallel::makePSOCKcluster(8) 
doParallel::registerDoParallel(cl) 
# Check 
showConnections() 
# Disable Parallel Processing 
parallel::stopCluster(cl) 
foreach::registerDoSEQ() 
# Check 
showConnections() 
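
The listing above hardcodes eight workers. On a machine with fewer cores, the count could be capped first, as in the following sketch. The helper pick_workers is hypothetical and not part of the parallel or doParallel packages; only parallel::detectCores is an existing function.

```r
# Hypothetical helper: cap the requested worker count at the cores actually present.
pick_workers <- function(requested = 8L, available = parallel::detectCores()) {
  if (is.na(available)) available <- 1L   # detectCores() can return NA on some platforms
  max(1L, min(as.integer(requested), as.integer(available)))
}

n_workers <- pick_workers(8L)

# The cluster would then be created and released as in the listing above, e.g.:
#   cl <- parallel::makePSOCKcluster(n_workers)
#   doParallel::registerDoParallel(cl)
#   ... training ...
#   parallel::stopCluster(cl)
```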

The results of the classification algorithms are shown in Table 5, where the best hyperparameters
were determined through 10-fold cross-validation. C5.0, Regularized Logistic Regression, Random
Forest, SVM linear, SVM poly, SVM radial, and SVM radial weights achieved the highest
classification accuracy at 99.85%. In contrast, Naïve Bayes was the least accurate model, with an
accuracy of only 53.2%. The accuracy obtained during the training phase is plotted in Figure 8.

Figure 8 shows that accuracy is high for every model except Naïve Bayes; all other models obtained
more than 97% accuracy. For the computational cost, we monitored the training time and memory
usage of each model. The results are given in Table 6.

Table 6 compares computational efficiency in terms of training time and memory usage across the
machine learning models. Random Ferns is the fastest, requiring only 24.67 seconds. It is followed
by Naïve Bayes, SVM with a linear kernel, and Regularized Logistic Regression, which require 41.14
seconds, 45 seconds, and 77.76 seconds, respectively. Conversely, Random Forest is the slowest,
taking 4371.75 seconds (1 hour, 12 minutes, and 51.75 seconds).


Table 5. Best parameters obtained from the training process, along with accuracy

Algorithm         Best Parameters                                                             Accuracy
C5.0              trials: 51; model: 2; winnow: FALSE                                         0.9985
knn               k: 27                                                                       0.974320884
lssvmPoly         degree: 3; scale: 0.0438458726217134; tau: 27.3810618566797                 0.995704
naive_bayes       laplace: 0; use kernel: FALSE; adjust: 1                                    0.532098
regLogistic       cost: 5.98885780307469; loss: L2_dual; epsilon: 1                           0.998571356
rf                mtry: 270                                                                   0.998571356
rFerns            depth: 14; ferns: 330                                                       0.996428499
svmLinear         C: 0.0594077128418059                                                       0.998571356
svmPoly           degree: 1; scale: 0.00434829110517236; C: 1.73112678364247                  0.998571356
svmRadial         sigma: 0.000353579812294785; C: 8.56785410178861                            0.998571356
svmRadialWeights  sigma: 0.000353579812294785; C: 8.56785410178861; Weight: 12.9745684396476  0.998571

 
Fig. 8. Accuracy obtained from training process 

 
In terms of memory usage, Naïve Bayes has the smallest footprint, consuming only 863.9 megabytes
(Mb), but it also has the lowest accuracy in the experiment. It is followed by k-Nearest
Neighbors, Random Forest, Random Ferns, and SVM with a linear kernel, whose memory usage falls
within the 1450 to 1650 Mb range. Conversely, the Least Squares Support Vector Machine (LSSVM)
with a polynomial kernel and Regularized Logistic Regression consume considerably more memory,
requiring 6511.4 Mb and 4419.8 Mb, respectively. Computational time and memory usage are plotted
in Figure 9 and Figure 10, respectively.

Table 6. Computational time and memory used in the training process

Algorithm         Time (s)   Memory (Mb)
C5.0              326.17     2968.5
knn               308.69     1481.4
lssvmPoly         264.91     6511.4
naive_bayes       41.14      863.9
regLogistic       77.76      4419.8
rf                4371.75    1646.8
rFerns            24.67      1582.2
svmLinear         45         1599.9
svmPoly           88.83      2385.5
svmRadial         122.25     2763.1
svmRadialWeights  103.13     2959.2
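
The rankings discussed in the text can be reproduced from the Table 6 values with a few lines of base R. The data frame below only transcribes the table; the variable names are chosen here for illustration.

```r
# Values transcribed from Table 6.
usage <- data.frame(
  algorithm = c("C5.0", "knn", "lssvmPoly", "naive_bayes", "regLogistic", "rf",
                "rFerns", "svmLinear", "svmPoly", "svmRadial", "svmRadialWeights"),
  time_s = c(326.17, 308.69, 264.91, 41.14, 77.76, 4371.75,
             24.67, 45, 88.83, 122.25, 103.13),
  memory_mb = c(2968.5, 1481.4, 6511.4, 863.9, 4419.8, 1646.8,
                1582.2, 1599.9, 2385.5, 2763.1, 2959.2),
  stringsAsFactors = FALSE
)

# Four fastest models and the two models with the largest memory usage.
fastest <- usage$algorithm[order(usage$time_s)][1:4]
most_memory <- usage$algorithm[order(usage$memory_mb, decreasing = TRUE)][1:2]

# Break the Random Forest time into hours, minutes, and seconds.
rf_time <- usage$time_s[usage$algorithm == "rf"]
rf_hms <- c(hours = rf_time %/% 3600,
            minutes = (rf_time %% 3600) %/% 60,
            seconds = rf_time %% 60)
```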

 
After the classification model has been trained, it is used to make predictions on the testing
data. The predict function is called with the trained model as the first argument and the testing
data frame as the second. The output of the predict function is saved in the prediction variable.

prediction$results <- predict(model, PredictData) 

Figure 11 shows the results of the algorithms in predicting the family of the species in the test
dataset. Each machine learning model was used to classify the disputed data, and its predictions
were compared with the labels provided by NCBI. A prediction was counted as accurate if it matched
the NCBI label. The consistency between these results and the accuracy observed during the
training phase suggests that NCBI's classification of the disputed data is reliable.
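
The comparison with the NCBI labels amounts to a simple agreement rate, which can be sketched as follows. The two label vectors are made-up placeholders for illustration and are not results from this study.

```r
# Hypothetical labels and predictions for five disputed sequences (not real results).
ncbi_label <- factor(c("Amaryllidaceae", "Amaryllidaceae", "Liliaceae",
                       "Liliaceae", "Amaryllidaceae"),
                     levels = c("Amaryllidaceae", "Liliaceae"))
predicted <- factor(c("Amaryllidaceae", "Liliaceae", "Liliaceae",
                      "Liliaceae", "Amaryllidaceae"),
                    levels = levels(ncbi_label))

# Share of predictions that match the NCBI label.
agreement <- mean(predicted == ncbi_label)

# Cross-tabulation showing where the disagreements fall.
confusion <- table(predicted = predicted, ncbi = ncbi_label)
```

The off-diagonal cells of the cross-tabulation identify which family the disagreeing predictions fall into, which is the kind of detail discussed for individual species below.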

For species in the Amaryllidaceae family (according to NCBI), almost all algorithms consistently
predict Amaryllidaceae, in agreement with the NCBI label. This suggests that the algorithms agree
on classifying these species into the Amaryllidaceae family. There are a few anomalies, such as
Lilium bulbiferum subsp. croceum, for which most algorithms predict Amaryllidaceae instead of
Liliaceae, diverging from the NCBI label.

 
Fig. 9. Computational time for training process (s) 

Fig. 10. Memory used for training process (Mb) 

 
Fig. 11. Accuracy of each model in predicting the disputed data 

However, there is some variability in the predictions. For example, LSSVM poly and Naïve Bayes
occasionally predict species in the Amaryllidaceae family as belonging to the Liliaceae family.
This indicates that, although these algorithms are generally accurate, there are specific cases in
which they diverge from the NCBI label. For the Liliaceae family, the algorithms agree with NCBI,
predicting Liliaceae for all species listed under this family.

Overall, the machine learning algorithms largely agree with the NCBI labels, demonstrating their
effectiveness in classifying species into the Amaryllidaceae and Liliaceae families. Almost all
algorithms reach an accuracy of 98%, except knn with 96% and Naïve Bayes with 65%. However, the
algorithms diverge in a few instances, which suggests areas for further research and model
refinement. These results provide a basis for evaluating the performance and reliability of
machine learning algorithms in plant taxonomy.

IV. Conclusions 

Almost all of the models compared in this study classified the DNA barcode data from the rbcL gene
with more than 97% accuracy, except for Naïve Bayes, which reached only 53%. Based on resampling
with ten repetitions of 10-fold cross-validation, the most accurate models, namely C5.0,
Regularized Logistic Regression, Random Forest, SVM linear, SVM poly, SVM radial, and SVM radial
weights, achieved a classification accuracy of 99.85%. Regarding computational time, Random Forest
is the slowest model and Random Ferns is the fastest, using only 24.67 seconds. In terms of
memory, LSSVM with a polynomial kernel and Regularized Logistic Regression have the highest usage
at 6511.4 Mb and 4419.8 Mb, respectively. In contrast, Naïve Bayes uses the least memory, but its
accuracy is much lower than that of the other models. When predicting the family for the test
dataset, the machine learning models agree closely with NCBI's classifications, categorizing
species into the Amaryllidaceae and Liliaceae families. Nonetheless, some discrepancies remain,
indicating the need for additional research and model improvement.

 
Acknowledgment  

The authors would like to acknowledge the Ministry of Research and Technology, Research and Community Services 

Institutions of Universitas Pendidikan Indonesia on Indonesia Collaboration Research (RKI) for funding this work through 

the research grant of 1261/UN40.LP/PT.01.03/2022. 

Declarations  

Author contribution  

All authors contributed equally as the main contributor of this paper. All authors read and approved the final paper. 

Funding statement  

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.  

Conflict of interest  

The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence 
the work reported in this paper.  

Additional information  

Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. 

Publisher’s Note: Department of Electrical Engineering and Informatics - Universitas Negeri Malang remains neutral with 

regard to jurisdictional claims and institutional affiliations. 

 