Lontar - Template


LONTAR KOMPUTER VOL. 13, NO. 3 DECEMBER 2022 p-ISSN 2088-1541 
DOI : 10.24843/LKJITI.2022.v13.i03.p03 e-ISSN 2541-5832 
Accredited Sinta 2 by RISTEKDIKTI Decree No. 158/E/KPT/2021 
 

160 
 

Balinese Script Recognition Using Tesseract Mobile 
Framework 

 
Gede Indrawana1, Ahmad Asronia2, Luh Joni Ernawati Dewia3, I Gede Aris Gunadia4, 
I Ketut Paramartab5 

 
aDepartment of Electrical Engineering and Computer Science, Universitas Pendidikan Ganesha 

Jl. Udayana 11, Singaraja, Buleleng, Bali, Indonesia 
1gindrawan@undiksha.ac.id (Corresponding author) 

2ahmad.asroni@undiksha.ac.id 
3joni.ernawati@undiksha.ac.id 

4igedearisgunadi@undiksha.ac.id 

 
bDepartment of Balinese Language Education, Universitas Pendidikan Ganesha 

Jl. Ahmad Yani 67, Singaraja, Buleleng, Bali, Indonesia 
5ketut.paramarta@undiksha.ac.id 

 
Abstract 

 
One of the main factors causing the decline in the use of Balinese Script is that Balinese people 
are less interested in reading Balinese Script because of their reluctance to learn Balinese Script, 
which is relatively complicated in the recognition process. The development of computer 
technology has now been used to help by performing character recognition or known as Optical 
Character Recognition (OCR). Developing the OCR application for Balinese Script is an effort to 
help preserve, from the technology side, as a means of education related to Balinese Script. In 
this study, that development was conducted by using a Tesseract OCR engine that consists of 
several stages, i.e., the first one is to prepare the dataset, the second one is to generate the 
dataset using the Web Scraping method, the third one is to train the OCR engine using the 
generated dataset, and finally, the fourth one is to implement the generated language model into 
a mobile-based application. The study results prove that the dataset generation process using 
the Web Scraping method can be a better choice when faced with a training dataset that requires 
a large dataset compared to several previous studies of non-Latin character recognition. In those 
studies, the jTessBox tools were used, which took time because they had to select per character 
for a dataset. The best result of the language model is a combination of character, word, sentence, 
and paragraph datasets (hierarchical combination of character, word, sentence, and paragraph 
datasets) with a coincidence rate of 66.67%. The more diverse and structured hierarchical 
datasets used, the higher the coincidence rate. 
 
Keywords: Balinese Script, Mobile Framework, Tesseract, Optical Character Recognition, Web 
Scraping 
 
1. Introduction 

Balinese Script, literature, and language are sources of imagination, creativity, and energy in 
Balinese culture. This is starting to decline, especially in terms of the use of Balinese Script, which 
is decreasingly being used in the daily life of Balinese people [1]. One of the main factors causing 
the decline is that Balinese people are less interested in reading Balinese Script because of their 
reluctance to learn Balinese Script, which is relatively complicated in the recognition process. Bali 
Governor Regulation Number 80 of 2018, concerning the protection and use of the Balinese 
Language, Script, and literature, also the Implementation of the Balinese Language Month, 
regulates the use of the Balinese language as a means of communication in Balinese family life, 
communication in all activities of Hindu religious, Balinese customs and culture, and providing 
information on public services both in government institutions and private institutions as a 
companion to Indonesian [2]. 

The development of computer technology has now been widely used to perform character 
recognition, termed Optical Character Recognition (OCR). OCR converts printed text and images 


LONTAR KOMPUTER VOL. 13, NO. 3 DECEMBER 2022 p-ISSN 2088-1541 
DOI : 10.24843/LKJITI.2022.v13.i03.p03 e-ISSN 2541-5832 
Accredited Sinta 2 by RISTEKDIKTI Decree No. 158/E/KPT/2021 

161 
 

into digital character forms, which machines can manipulate. OCR implementation has been used 
in many application sectors, such as education, banking, finance, law, etc. Along with the 
development of OCR technology, many studies have used OCR to perform character recognition 
for non-Latin scripts [3]. Most of the development of OCR is still focused on Latin English script 
because it is supported by the encoding standard of the American Standard Code for Information 
Interchange or ASCII for short. The limited ability of OCR to recognize non-Latin scripts is a 
challenge for researchers to improvise. 

OCR technology is growing rapidly with the creation of several OCR engines that are open source 
and paid. This study tested which OCR engine has the highest performance for Information 
Extraction using Named Entity Recognition by comparing three OCR engines, namely Foxit, 
PDF2GO, and Tesseract [4]. Based on the research conducted by Ramdhani et al., compared 
the performance levels of three OCR engines with high-performance levels. The test was carried 
out with 8,562 government human resource documents in six document categories, two document 
structures, and four measurements. The test results found that Tesseract was the most suitable 
solution and got the highest performance in Information Extraction. The details of the test results, 
on average, PDF2GO gets a performance of 86.27%, Foxit gets a performance value of 84.01%, 
and Tesseract gets a performance value of 92.46%. 

In a study by Abdul Robby et al., they used the Tesseract OCR engine to be implemented as a 
Javanese Script character recognition engine. This study aims to simplify the process of 
automatically recognizing Javanese characters using a mobile application [5]. The dataset used 
as a data source to build the Tesseract OCR engine training data is 5,880 Javanese characters. 
To build the Javanese Script dataset was collected from digital characters with specifications (3 
sets x 120 characters) and handwriting (46 sets x 120 characters). The dataset training tools used 
in this study are the Neural-Network API from the Tesseract OCR engine. Before the training, the 
Javanese Script dataset was selected by segmenting each character and setting variables for the 
cluster of characters using JTessBoxEditor. The highest accuracy achieved by the model 
generated from the trained data is 97.50%. 

The following research similar to the case of non-Latin optical character recognition is the study 
conducted by Mudiarta et al. This research focuses on preserving knowledge of reading Balinese 
Script in pictures by combining information technology with Balinese Script discipline. In this 
study, the OCR application was developed on a mobile-based device with camera facilities. The 
input in this application is in the form of images and is processed with Tesseract OCR engine 
technology. The Balinese Script dataset is based on eighteen basic Balinese Script syllables and 
only numbers to carry out the training process. The tool used to carry out the training process is 
jTessBoxEditor. This tool has fully automated facilities for training datasets. In the test results for 
50 words, 62% recognition was obtained with good quality image-based Bali-Simbar font [1]. 

From the exposure of the two studies above, there are similarities in terms of the Optical Character 
Recognition engine and the data training process carried out. The training data to create the 
trained data model utilizes the jTessBoxEditor tool by segmenting characters from non-Latin 
character images. The segmentation process is carried out alternately for each dataset owned. 
The jTessBoxEditor tools must be done manually by segmenting each dataset, making the 
training process relatively more time-consuming. Several weaknesses occur in the two studies, 
especially in the data training process. In the chapter suggestions of the two studies, the focus is 
on increasing the number of datasets used. 

Based on the weaknesses and suggestions of the two studies, it can be resolved using different 
data training methods. In addition to using the jTessBoxEditor tools, there is the latest training 
method to create trained data, using the latest Tesseract OCR training method. The latest version 
of Tesseract OCR provides training tools without relying on external tools such as jTessBoxEditor. 
The concept of training datasets in the newest version of Tesseract OCR tools supports the 
automatic dataset training process by using the command line for all dataset training execution 
commands. Compared to jTessBoxEditor, almost all steps must be done manually using a GUI, 
such as selecting the character box segmentation, correcting the ground truth character box, and 
merging all the resulting training data files. This latest Tesseract OCR training method can 
perform dataset training simultaneously for all datasets. According to Idrees & Hassani, since 
version 4.0, Tesseract OCR presents a new engine based on Long Short-Term Memory (LSTM) 
[6]. LSTM, as a special form of Artificial Neural Network (RNN), provides much higher accuracy 


LONTAR KOMPUTER VOL. 13, NO. 3 DECEMBER 2022 p-ISSN 2088-1541 
DOI : 10.24843/LKJITI.2022.v13.i03.p03 e-ISSN 2541-5832 
Accredited Sinta 2 by RISTEKDIKTI Decree No. 158/E/KPT/2021 
 

162 
 

in image recognition than the previous version of Tesseract OCR. In the previous version, 
Tesseract OCR processing still used traditional processing step by step, not using artificial neural 
network (RNN). In the first stage of connected component analysis, the outline is collected and 
will be converted into a Blob. 

Furthermore, in the second stage, Blob will be arranged into proportional text lines, broken down 
into words with definite and fuzzy spaces. The third stage is character recognition, namely the 
recognition of each word, and the last is validating alternative hypotheses to find lowercase text 
using fuzzy space [7]. The Tesseract can be trained from scratch or refined based on the language 
that has been trained. 

2. Research Methods 

This research focuses on applying the latest Tesseract OCR training model for non-Latin digital 
characters, especially languages that Tesseract OCR has not supported. No research has been 
found regarding this. This study uses the latest data training method from Tesseract OCR by 
focusing on the dataset format consisting of two types of datasets, namely the image and the 
ground truth image. This training method differs from the two studies that discuss non-Latin digital 
character recognition using the jTessBoxEditor tools to conduct data training [1][5]. The stages 
carried out in this study can be seen in Figure 1. 
 

 Dataset 
Preparation

Generate 
Dataset

Training Dataset

Testing Language 
Model 

Traineddata

Implementation 
Model 

Traineddata Into 
Tesseract Mobile 

Framework

• Translation From Latin 
Into Balinese Script

• Convert Unicode Into 
Website Page HTML

• Image Acquisition Using 
Web Scraping

• Generate Ground Truth 
Image Balinese Script

• Train Tesseract LSTM with 
make from Single Line 
Images and Ground Truth

 
Figure 1. Research Methodology 

2.1. Dataset Preparation 

Dataset preparation was carried out to obtain a data set consisting of character images and 
ground truth. The data used to create the dataset is derived from the research conducted by G. 
Indrawan et al. [8]. That research consolidated a dataset with more than 35,000 words in Balinese 
with its Indonesian and English counterparts. The transliteration method implemented in the study 
was adopted using a different platform, namely using a website-based platform. The preparation 
of this dataset went through several stages for transliteration from Latin to Balinese Script. The 
first stage was converting the Latin-Balinese dataset into the database using Unicode to display 


LONTAR KOMPUTER VOL. 13, NO. 3 DECEMBER 2022 p-ISSN 2088-1541 
DOI : 10.24843/LKJITI.2022.v13.i03.p03 e-ISSN 2541-5832 
Accredited Sinta 2 by RISTEKDIKTI Decree No. 158/E/KPT/2021 

163 
 

on HTML pages. Next, add the family Balinese font Noto Sans Balinese so that the Unicode 
displayed on the HTML page can be converted into digital Balinese characters. The results of the 
dataset preparation can be seen in Figure 2. 

2.2. Dataset Generation  

The dataset generation technique used to extract information from the website platform is the web 
scraping technique [9]. The web scraping technique extracts information from websites 
automatically by parsing hypertext tags and retrieving information in the form of text, images, and 
videos embedded in them from large amounts of data from web pages [10][11]. The web scraping 
technique implemented in this research consists of four main processes. The first process is to 
create a scrapping template in the form of an HTML page that contains information that can be 
extracted into a Balinese Script image dataset, and the Balinese Script ground truth. The second 
process runs the website using the browser in the browser search field. The third process is 
making a web scraping algorithm to acquire Balinese Script images and automatically extracting 
ground truth when the algorithm is run. The last process is to store all the datasets resulting from 
the web scraping technique in the database. The dataset generated from the web scraping 
process consists of two datasets: the Balinese Script image dataset and the Balinese Script 
ground truth in digital character format. 
 

Figure 2. Result of Dataset Preparation 

2.3. Dataset Training 

As a well-known open-source OCR engine, Tesseract [12] is under active development by 
Google. It is currently available with the latest version 5.0, including the newest version of the 
LSTM-based OCR engine. Meanwhile, other Tesseract version below 5.0 is categorized as 
traditional machine [13]. LSTM is a Recurrent Neural Network in Deep Learning developed 


LONTAR KOMPUTER VOL. 13, NO. 3 DECEMBER 2022 p-ISSN 2088-1541 
DOI : 10.24843/LKJITI.2022.v13.i03.p03 e-ISSN 2541-5832 
Accredited Sinta 2 by RISTEKDIKTI Decree No. 158/E/KPT/2021 
 

164 
 

specifically for handling sequential prediction problems [14]. Tesseract can be trained using 
several operating systems, such as Linux, Windows, and macOS, by running a command line set 
and the Tesseract OCR training shell script [15]. Several operating system options can be 
selected according to needs, but Tesseract OCR is recommended to use the Linux operating 
system locally or in the cloud. A virtual server has a relatively good performance in running data 
training. In their use, containers have various benefits or advantages that make them popular 
among data training tools, such as having a simple configuration, good security level, can run on 
several cloud platforms, can perform debugging, and can be used on various operating systems 
[16]. The dataset training consists of two main processes: character form training and language 
dictionary creation. The output of the dataset training is the trained data file that needs to be 
copied to the Tesseract instance data folder and will be used to perform character recognition. 

2.4. Language Model Testing 

Testing the language model is an important stage to test the language model generated from the 
dataset training process. The result of trained data obtained after training the dataset through a 
testing process consisting of two types of testing, namely the unit testing and performance testing 
stages [17][18]. To perform automated unit testing, some additional requirements are required. It 
includes additional dependencies for training tools and downloads all necessary submodules, 
such as git and the model repository. In comparison, performance testing is carried out to obtain 
test results to see the model's level of speed and performance based on the allocation of 
resources used [15]. One of the unit testing methods that can be used to measure the language 
model's accuracy is Coincidence. Coincidence refers to the accuracy level of an optical character 
recognition language model. The way Coincidence work is to do a match based on an identifiable 
character matrix. The matrix form in question is a single-line transliteration to the ground truth of 
the testing character image. The accuracy test result using the Coincidence method is the 
percentage level of accuracy. A higher level of Coincidence means that the accuracy of the 
language model is also higher. Still, if the level of Coincidence is low, it means that the quality of 
the accuracy of the language model is also low [19][20]. A step that can be taken to optimize the 
model's performance is to optimize the code to increase memory capability in processing large 
numbers of characters. Much better performance improvements can be made by making the 
network smaller [21]. 

2.5. Tesseract Mobile Framework 

The mobile framework technology used in this research is the Flutter Mobile Framework. Flutter 
is an open-source UI kit developed by Google that allows the creation of cross-platform 
applications, including Android and IOS platforms. Flutter was first introduced at the 2015 Dart 
Developer Summit. On December 4th, 2018, Google released Flutter 1.0 at the Flutter Live Event. 
This also marks the release of the first stable version of Flutter. Subsequently, Flutter 1.12 was 
released at the Flutter Interact event on December 11, 2019 [22]. Flutter supports cross-platform 
that can be run on several different platforms. By using Flutter, the Android and iOS application 
development process can be done at the same time. 

Other than mobile platforms, Flutter can also run on web and desktop platforms. This will save 
time by not needing to learn the native language used on each platform. As a result, developers 
can produce high-quality applications that run well on multiple platforms using only one codebase 
[23]. Flutter uses Dart programming language, which Google also created in 2011. The Flutter 
engine is mainly written in the C++ programming language and remains at the core of Flutter. The 
engine implements Flutter's core APIs, including accessibility support, Dart runtime, text graphics 
layout, and plugin architecture. Flutter consists of a system layer structure. It works and runs in 
order, with each layer depending on the previous layer [24]. With the advantage offered by Flutter 
in the development process, namely one codebase for multi-platforms, it can provide a level of 
code efficiency that can be increased. In principle, the flutter system development applies the 
concept of reusable widgets, where the basic architecture of Flutter can be seen in Figure 3. 


LONTAR KOMPUTER VOL. 13, NO. 3 DECEMBER 2022 p-ISSN 2088-1541 
DOI : 10.24843/LKJITI.2022.v13.i03.p03 e-ISSN 2541-5832 
Accredited Sinta 2 by RISTEKDIKTI Decree No. 158/E/KPT/2021 

165 
 

Figure 3. Flutter Basic Architecture 

3. Result and Discussion 

Balinese Script Optical Character Recognition uses Tesseract OCR engine version 5 as the 
model and Flutter mobile framework version 2.16 as the mobile application framework. In the 
dataset training stage, the operating system used is a Linux Ubuntu 20.04 virtual server with 
specifications of 1 GB Memory, 25 GB Disk, and SGP1 - Ubuntu 20.04 (LTS) x64. For the dataset 
training process to run in an isolated environment, a service is needed that provides the ability to 
package and run an application in an isolated environment called a container. With adequate 
isolation and security, running multiple containers simultaneously on a particular host is possible. 
In this section, the discussion related to the research results consists of several sections based 
on the two main technologies used: Tesseract OCR and Flutter Mobile Framework. 

3.1. Dataset Generation Result 

Generating datasets using the web scrapping method aims to produce two datasets: the Balinese 
Script image dataset and ground truth transliteration. The process of generating data requires 
Balinese language data, which is converted into Balinese Script using Unicode. The Balinese 
language data used is a Balinese transliteration dataset totaling 35,319 words. The composition 
of the transliterated dataset consists of Balinese, Indonesian and English words. The amount of 
data based on the word index of the dataset can be seen in Table 1. 

Based on the composition of the transliterated data in Table 1, it is then converted into a pair of 
data, namely a single-line text image with a "png" file extension and its single-line transliteration 
text with a "gt.txt" file extension. The form of the resulting dataset can consist of text images of 
the alphabet and text images of words in Balinese. At the dataset generation stage, a website-
based platform uses the Laravel framework as a backend. In addition to using the backend at this 
stage, the other plugin for the image acquisition process that works on the client side was used. 
This plugin aims to ease server performance in generating a large number of datasets. This image 
acquisition process captures selected HTML pages based on the index id of each element 
simultaneously. Using an id on each HTML element aims to provide a unique identity so that when 
the image acquisition plugin performs image capture, it can select the area's boundary. A sample 
of data from the generated dataset can be seen in  

Figure 4. To carry out the training and validation process, the dataset is divided into a composition 
of 90% for training and 10% for conducting the validation process. The dataset used to carry out 


LONTAR KOMPUTER VOL. 13, NO. 3 DECEMBER 2022 p-ISSN 2088-1541 
DOI : 10.24843/LKJITI.2022.v13.i03.p03 e-ISSN 2541-5832 
Accredited Sinta 2 by RISTEKDIKTI Decree No. 158/E/KPT/2021 
 

166 
 

the testing process is built by pairs of data taken each from the word index so that the amount of 
data used to carry out the testing process is 21 pairs of data. 
 

Table 1. Composition of Transliterated Dataset 
 
 
Figure 4. A Sample of Pair of Data from the Generated Dataset  

3.2. Dataset Training Result 

At the dataset training stage, several stages must be done to the generated dataset. The first 
stage groups the dataset into several groups, namely the dataset group per character, the dataset 
group per word, the dataset group per sentence, and the dataset group per paragraph. In the next 
second stage, after grouping the dataset, the datasets are arranged based on the dataset 
hierarchy. The preparation process of a dataset hierarchy is made into several versions and 
tested whether the hierarchical arrangement can increase the quality of the dataset training result. 
The first hierarchical arrangement of dataset training is a hierarchical arrangement by combining 
the dataset randomly (Random Dataset Combination Hierarchy). The percentage rate of 
Coincidence obtained using a random hierarchical arrangement is 25%. Next is the hierarchical 

Word Index Word Count 

A 1423 
B 2090 
C 1171 
D 936 
E 642 
G 1686 
H 28 
I 468 
J 792 
K 3767 
L 1494 
M 4881 
N 4602 
O 274 
P 3508 
R 894 
S 2943 
T 2279 
U 856 
W 576 
Y 9 


LONTAR KOMPUTER VOL. 13, NO. 3 DECEMBER 2022 p-ISSN 2088-1541 
DOI : 10.24843/LKJITI.2022.v13.i03.p03 e-ISSN 2541-5832 
Accredited Sinta 2 by RISTEKDIKTI Decree No. 158/E/KPT/2021 

167 
 

arrangement of the dataset using per character only (Single Character Dataset Combination 
Hierarchy). The hierarchical arrangement of this dataset gets a coincidence percentage rate of 
40%. This result increased from the previous hierarchy, which consisted of a random dataset 
combination. 

The last dataset hierarchy is a hierarchical arrangement consisting of dataset group per character, 
dataset group per word, dataset group per sentence, and finally, dataset group per paragraph 
(Combination Hierarchy of Character, Word, Sentence, and Paragraph Datasets). The 
hierarchical arrangement of this dataset regards the order of levels according to the order 
described previously. The dataset training process using this hierarchical arrangement is carried 
out in several training iterations until all the hierarchical levels are finished. The first level being 
trained is the dataset level per character. After the process is complete, it will proceed to the 
dataset level per word, after that the dataset level per sentence, and the last is the dataset level 
per paragraph. The results from the dataset training using this hierarchy got a coincidence 
percentage rate of 66.67%. The coincidence rate obtained has increased compared to the 
previous two experiments. The generated language model by the dataset training process is in 
the form of a trained data binary file. This language model will be the language library of the 
Tesseract OCR engine. Based on the result of the data training carried out, it can be seen that 
several dataset training scenarios were carried out with different dataset compositions and 
hierarchies. The result of the language model (trained data file) that will be used is the language 
model, which has the highest coincidence rate. The following dataset training results can be seen 
in  
Figure 5. 
 

Figure 5. Dataset Training Results 
 

The combination and the hierarchy of datasets used are the main factors influencing the increase 
in the coincidence performance of the three experiments conducted using different combinations 
of training datasets. The results of the three experiments have a common thread in terms of the 
hierarchical structure of the dataset. The more structured the hierarchy used, the better the 
coincidence rate. This increase is because Tesseract OCR learns and recognizes characters 
starting from the smallest unit, namely per character, then per word, after that per sentence, and 
finally per paragraph. The following graph of the increase in the coincidence rate can be seen in 
Figure 6. 
 

LONTAR KOMPUTER VOL. 13, NO. 3 DECEMBER 2022 p-ISSN 2088-1541 
DOI : 10.24843/LKJITI.2022.v13.i03.p03 e-ISSN 2541-5832 
Accredited Sinta 2 by RISTEKDIKTI Decree No. 158/E/KPT/2021 
 

168 
 

Figure 6. The Coincidence Performance 
 

The preliminary test of the resulting model language includes several test scenarios, namely the 
Basic Syllables test scenario, the Numerals test scenario, and the word test scenario. From the 
model language test process, the maximum coincidence rate was 100%, the minimum 
coincidence rate was 66.67%, and the average coincidence rate was 88.26%. The test results 
can be seen more clearly in Figure 7. 
 

Figure 7. Testing Result 


LONTAR KOMPUTER VOL. 13, NO. 3 DECEMBER 2022 p-ISSN 2088-1541 
DOI : 10.24843/LKJITI.2022.v13.i03.p03 e-ISSN 2541-5832 
Accredited Sinta 2 by RISTEKDIKTI Decree No. 158/E/KPT/2021 

169 
 

3.3. Tesseract Mobile Framework Implementation 

The application is built using the Flutter mobile framework by applying the concept of a clean code 
architecture. The clean code architecture is a blueprint for a modular system, which strictly follows 
a design principle called separation of concerns. More specifically, this architectural style focuses 
on separating the software into multiple layers to simplify the development and maintenance of 
the application itself. When layers are appropriately separated, code snippets can be reused, 
developed, and updated independently. The resulting application is also scalable, readable, 
testable, and can be easily maintained at any time. In addition to using clean code architecture, 
the application uses the Flutter Tesseract OCR dependency version 0.4.20 with a minimum SDK 
version of 2.12. To carry out the process of recognizing Balinese characters, the application can 
receive Balinese Script image input in two ways: using existing images that can be taken from the 
smartphone gallery or images taken from smartphone cameras. The Balinese Script image input 
will be processed to be recognized and converted into text. The results of the implementation of 
the Tesseract Mobile Framework OCR can be seen in  
Figure 8. 
 

Figure 8. Balinese Script OCR Application: (a) Camera Screen; (b) Image Preview Screen; (c) 
Balinese Script Screen; and (d) History Screen 

 
4. Conclusion 

Several initial steps were carried out in the dataset preparation process: preparing Balinese 
transliteration data, converting Latin Balinese to Balinese Script using Unicode, and creating a 
template for the dataset generation process. Dataset generation utilizes web scraping methods 
and a web-based platform for the image acquisition process. The result of generating the dataset 
is in the form of paired files, namely a single-line-text image of Balinese characters (with "png" file 
extension) and its related single-line text transliteration (with "gt.txt" file extension). The dataset 
has been successfully generated with 35,319 image-text file pairs. The optical character 
recognition method and engine used in training the dataset and the Balinese character recognition 
process is Tesseract OCR version 5. The dataset training process consisted of three experiments 
with different dataset hierarchical structures. The first dataset hierarchy is a random dataset 
combination (Random Dataset Combination Hierarchy) which produces a coincidence rate of 
25%. The second dataset hierarchy is the dataset hierarchy per character (Single Character 
Dataset Combination Hierarchy), with a coincidence rate of 40%. Then, the last dataset hierarchy 
is a combination of dataset per character, dataset per word, dataset per sentence, and dataset 
per paragraph (Dataset Combination Hierarchy of Character, Word, Sentence, and Paragraph) 
by producing a coincidence rate of 66.67%. From the three dataset hierarchical structures used 
for the training process, it can be concluded that the more diverse and structured the dataset 


LONTAR KOMPUTER VOL. 13, NO. 3 DECEMBER 2022 p-ISSN 2088-1541 
DOI : 10.24843/LKJITI.2022.v13.i03.p03 e-ISSN 2541-5832 
Accredited Sinta 2 by RISTEKDIKTI Decree No. 158/E/KPT/2021 
 

170 
 

hierarchy used, the higher the coincidence rate. The training process's result from the trained 
data language model is then implemented into a mobile-based application platform. Mobile 
application development uses the Flutter mobile framework by applying a clean code architectural 
concept. That mobile application has several main pages: Camera Screen, Image Preview 
Screen, Balinese Script Screen, and History Screen. It can be concluded that generating a dataset 
can be a better choice when needing a large training dataset compared to some previous studies 
that used jTessBox tools that require relatively more time to select characters for the dataset. 

Based on the results of the research process that has been carried out, it is realized that the 
coincidence level can still be improved. Several things are important to note to improve the 
coincidence rate result. In this study, the dataset used in building the language model is limited 
to only using synthetic data images. The next work to be carried out is to enhance several dataset 
hierarchies by combining several Balinese script characters with different styles, like optical 
characters, original data, and handwritten characters. The hierarchical arrangement of the dataset 
will refer to the more complex Balinese writing rules based on the existing Balinese dictionary. 
Furthermore, the structured hierarchy will be verified by Balinese language and script experts to 
ensure the validity of the dataset to be trained. Related to the image quality of the dataset, there 
will be stages like preprocessing, thresholding, and other image preprocessing methods before 
carrying out the dataset training process. 
 
Acknowledgment 

The authors gratefully acknowledge the support of the Indonesian Ministry of Education, Culture, 
Research, and Technology for research funding in the area of technology for information data on 
various forms of local wisdom. 

 
References 
 
[1] I. M. D. R. Mudiarta et al., "Balinese character recognition on mobile application based on 

tesseract open source OCR engine," Journal of Physics: Conference Series, vol. 1516, no. 
1, 2020, doi: 10.1088/1742-6596/1516/1/012017. 

[2] Bali Governor, Bali Governor Regulation No. 80 on Protection and Usage of Balinese 
Language, Script, and Literature. Indonesia, 2018. 

[3] A. Qaroush, A. Awad, M. Modallal, and M. Ziq, "Segmentation-based, omnifont printed 
Arabic character recognition without font identification," Journal of King Saud University - 
Computer and Information Sciences, Volume 34, Issue 6, Part A, 2020, doi: 
10.1016/j.jksuci.2020.10.001. 

[4] T. W. Ramdhani, I. Budi, and B. Purwandari, "Optical Character Recognition Engines 
Performance Comparison in Information Extraction," International Journal of Advanced 
Computer Science and Applications, vol. 12, no. 8, pp. 120–127, 2021, doi: 
10.14569/IJACSA.2021.0120814. 

[5] G. Abdul Robby, A. Tandra, I. Susanto, J. Harefa, and A. Chowanda, "Implementation of 
Optical Character Recognition Using Tesseract With the Javanese Script Target in Android 
Application," Procedia Computer Science, vol. 157, pp. 499–505, 2019, doi: 
10.1016/j.procs.2019.09.006. 

[6] H. Hassani and S. Idress, "Exploiting Script Similarities to Compensate for the Large Amount 
of Data in Training Tesseract LSTM: Towards Kurdish OCR," Applied Sciences, p. 20, Oct. 
2021, doi: 10.3390/app11209752. 

[7] R. Smith, "An Overview of the Tesseract OCR Engine," in Ninth International Conference on 
Document Analysis and Recognition (ICDAR 2007), 2007, pp. 629–633, doi: 
0.1109/ICDAR.2007.4376991. 

[8] G. Indrawan, I. K. Paramarta, K. Agustini, and Sariyasa, "Latin-to-Balinese Script 
Transliteration Method on Mobile Application: A comparison," The Indonesian Journal of 
Electrical Engineering and Computer Science (IJEECS), vol. 10, no. 3, pp. 1331–1342, 2018. 

[9] S. Chaudhari, R. Aparna, V. G. Tekkur, G. L. Pavan, and S. R. Karki, "Ingredient/Recipe 
Algorithm using Web Mining and Web Scraping for Smart Chef," Proceedings CONECCT 
2020 - 6th IEEE International Conference on Electronics, Computing and Communication 
Technologies, no. 3, pp. 22–25, 2020, doi: 10.1109/CONECCT50063.2020.9198450. 

[10] W. Uriawan, A. Wahana, D. Wulandari, W. Darmalaksana, and R. Anwar, "Pearson 


LONTAR KOMPUTER VOL. 13, NO. 3 DECEMBER 2022 p-ISSN 2088-1541 
DOI : 10.24843/LKJITI.2022.v13.i03.p03 e-ISSN 2541-5832 
Accredited Sinta 2 by RISTEKDIKTI Decree No. 158/E/KPT/2021 

171 
 

Correlation Method and Web Scraping For Analysis of Islamic Content on Instagram Videos," 
Proceedings - 2020 6th International Conference on Wireless and Telematics (ICWT) 2020, 
2020, doi: 10.1109/ICWT50448.2020.9243626. 

[11] G. Adomavicius and A. Tuzhilin, "Web Scraping: State of the art," IEEE Transactions on 
Knowledge and Data Engineering, vol. 17, no. 6, pp. 734–749, 2019. 

[12] Tesseract OCR, "Tesseract User Manual," Github, 2018. https://tesseract-
ocr.github.io/tessdoc/ (accessed Jul. 08, 2022). 

[13] S. Idrees and H. Hassani, "Exploiting Script Similarities to Compensate For The Large 
Amount of Data In Training Tesseract LSTM: Towards Kurdish OCR," Applied Sciences, vol. 
11, no. 20, 2021, doi: 10.3390/app11209752. 

[14] P. Kumar, P. Sihag, P. Chaturvedi, K. V. Uday, and V. Dutt, "BS-LSTM: An Ensemble 
Recurrent Approach to Forecasting Soil Movements in the Real World," Front. Earth Sci., 23 
August 2021 Sec. Environmental Informatics and Remote Sensing, vol. 9, no. August, pp. 
1–23, 2021, doi: 10.3389/feart.2021.696792. 

[15] C. Clausner, A. Antonacopoulos, and S. Pletschacher, "Efficient and effective OCR engine 
training," International Journal on Document Analysis and Recognition (IJDAR), vol. 23, no. 
1, pp. 73–88, 2020, doi: 10.1007/s10032-019-00347-8. 

[16] V. K. Kaliappan, S. Yu, R. Soundararajan, S. Jeon, D. Min, and E. Choi, "High-Secured Data 
Communication for Cloud Enabled Secure Docker Image Sharing Technique Using 
Blockchain-Based Homomorphic Encryption," Energies, vol. 15, no. 15, 2022, doi: 
10.3390/en15155544. 

[17] N. H. Khan and A. Adnan, "Urdu optical character recognition systems: Present contributions 
and future directions," IEEE Access, vol. 6, pp. 46019–46046, 2018, doi: 
10.1109/ACCESS.2018.2865532. 

[18] K. O. Mohammed Aarif and S. Poruran, "OCR-Nets: Variants of Pre-trained CNN for Urdu 
Handwritten Character Recognition via Transfer Learning," Procedia Computer Science, vol. 
171, no. 2019, pp. 2294–2301, 2020, doi: 10.1016/j.procs.2020.04.248. 

[19] B. Wang, Y. W. Ma, and H. T. Hu, "Hybrid model for Chinese character recognition based 
on Tesseract-OCR," International Journal of Internet Protocol Technology, vol. 13, no. 2, pp. 
102–108, 2020, doi: 10.1504/IJIPT.2020.106316. 

[20] R. Bassam et al., "Autonomous Assistance System for Visually Impaired using Tesseract 
OCR & gTTS Autonomous Assistance System for Visually Impaired using Tesseract OCR & 
gTTS," Journal of Physics: Conference Series, Volume 2327, 4th International Conference 
on Intelligent Circuits and Systems, doi: 10.1088/1742-6596/2327/1/012065. 

[21] D. Sporici, E. Cus, and C. Boiangiu, "Using Convolution-Based Preprocessing," SS 
symmetry, 2020. 

[22] Google, "Flutter architectural overview." https://docs.flutter.dev/resources/architectural-
overview (accessed February 06, 2022). 

[23] Google, "Dart overview." https://dart.dev/overview (accessed Feb. 06, 2022). 
[24] N. Chigali, S. R. Bobba, K. Suvarna Vani, and S. Rajeswari, "OCR assisted translator," 7th 

International Conference on Smart Structures and Systems (ICSSS), July 2020, doi: 
10.1109/ICSSS49621.2020.9202034.