International Journal on Advances in ICT for Emerging Regions 2021 14 (3): 

International Journal on Advances in ICT for Emerging Regions                                                                                   July 2021 

Estimating the Effects of Text Genre, Image 

Resolution and Algorithmic Complexity needed 

for Sinhala Optical Character Recognition 
Isuri Anuradha#1, Chamila Liyanage2, Ruvan Weerasinghe3 

 
Abstract— While optical character recognition for Latin based 

scripts have seen near human quality performance, the accuracy 

for the rounded scripts of South Asia still lags behind. Work on 

Sinhala OCR has mainly reported on performance on 

constrained classes of font faces and so been inconclusive. This 

paper provides a comprehensive series of experiments using 

conventional machine learning as well as deep learning on texts 

and font faces of diverse types and in diverse resolutions, in order 

to present a realistic estimation of the complexity of recognizing 

the rounded script of Sinhala. While texts of both old and 

contemporary books can be recognized with over 87% accuracy, 

those in old newspapers are much harder to recognize owing to 

poor print quality and resolution. 

 
Keywords— Sinhala OCR, Optical Character Recognition, 
Tesseract, Deep learning.  

I. INTRODUCTION 

Optical Character Recognition (OCR) technology is 

designed to recognize printed texts into machine operable text. 

OCR is a collection of multiple steps such as scanning, pre-

processing, segmentation, feature extraction, classification, 

recognition and post-processing. In recent literature, many 

OCR systems have been developed for recognizing Latin 

characters [1].  With the advancement of Natural Language 

Processing during the past few years, researchers have 

integrated machine learning/deep learning techniques for 

analysing the textual representations on digital documents. 

Template Matching, Neural Network (NN) and Recurrent 

Neural Network (RNN) are popular and widely used 

algorithms for character recognition. These technologies are 

better when applied for the other character sets, since large 

volumes of data are available in print media for many 

languages. The proposed Sinhala OCR is discussed in this 

paper with special focus on the text genre, image resolution 

and algorithmic complexities needed for training an OCR 

system for the Sinhala character set.  

As the state of the art OCR technology, currently Tesseract 

is used in the training of OCR systems for many character sets. 

Further, Tesseract has moved from machine learning to deep 

learning with LSTM architecture and provides relatively better 

recognition competence [2]. However, algorithmic complexity 

is not enough for training an OCR model, as text genre and 

image quality affect training a more accurate OCR model. 

Since large volumes of available data are in print media and 

they have been printed before the computer era, the documents 

have been printed using different techniques such as offset 

printing and screen printing. Therefore, common type-faces 

used in the history of printing should also be trained to train 

the model to get such text recognizable. Further, types and 

sizes of the fonts and size of the training text is also significant. 

In this paper we discuss the OCR system developed for Sinhala 

by estimating the effects of text genre, image resolution and 

algorithmic complexity.  

The rest of this paper is structured as follows: Section II gives 

a brief overview of the related work in this area. Section III 

discusses some properties and characteristics of the Sinhala 

script as it is significant to review the complexities with regard 

to the particular script. Algorithmic complicacy adopted to 

OCR is discussed in section IV. Further, section V gives the 

motivation and rationale for the experimental set and 

systematic description on training data, word lists, and training 

regime adopted to develop the Sinhala OCR. Section VI 

presents experimental results on the OCR methods, and we 

also give an analysis of their performance comparison. Finally, 

the paper is concluded with a discussion of future works.   

II. RELATED WORKS 

Despite decades of research on the engineering aspects, the 

problem of Sinhala character recognition remains as a 

challenging issue in the OCR field. When the past few years 

are considered, some studies have been conducted to identify 

widely used font types in Sri Lanka [3].  When considering 

OCR for the Sinhala language, initially the K-Nearest 

Neighbour (KNN) algorithm-based Sinhala OCR was 

developed by the Language Technology Research Laboratory, 

University of Colombo School of Computing [3]. For the 

following study, commercially used font types have been 

employed by varying font sizes to obtain 94% of average 

accuracy.  

  Considering literature, Neural Network based Sinhala 

OCR systems have been developed in recent years [4], [5], [6]. 

In 2013, the Sri Lanka Institute of Information Technology 

conducted a research based on applying neural networks for 

Sinhala optical character recognition [4]. In this study they 

have only focused on 36 characters in the alphabet. Another 

Sinhala OCR application integrating neural networks was 

developed by a local research group [5]. These studies mainly 

focused on the character level accuracies and not on word 

accuracies. 

Correspondence: I. Anuradha #1 (E-mail isa@ucsc.cmb.ac.lk) 
Received: 29-12-2020 Revised:31-05-2021  Accepted: 11-06-2021 

I. Anuradha#1 C Liyanage2, R. Weerasinghe3, are form University of 
Colombo School of Computing. (isa@ucsc.cmb.ac.lk)  

This paper is an extended version of the paper “Deep Learning Based 

Sinhala Optical Character Recognition (OCR)” presented at the ICTer 

Conference (2020)  
DOI: 

© 2021 International Journal on Advances in ICT for Emerging Regions 

 
mailto:isa@ucsc.cmb.ac.lk
UCSC
Typewritten Text

UCSC
Typewritten Text
http://doi.org/10.4038/icter.v14i3.7231

UCSC
Typewritten Text

UCSC
Typewritten Text


Estimating the Effects of Text Genre, Image Resolution and Algorithmic Complexity needed for Sinhala Optical Character Recognition    2 

 
International Journal on Advances in ICT for Emerging Regions                                                                                   July 2021 

In addition, the Software Development Unit of University 

of Colombo School of Computing has trained a Sinhala OCR 

model using Tesseract 3 [7]. This system shows relatively 

good results only for the high-resolution images. Also, 

Language Technology and Research laboratory at University 

of Colombo is experimenting on the integration process of 

machine learning concepts to Sinhala OCR applications [8]. 

Further, Manisha et al. [9] has also tried to combine the 

Tesseract OCR engine with the Sinhala characters and 

mentions 97% of accuracy. However, the performance has not 

been well documented. It’s well-known that Indic languages 

have  many complexities and variations of characters which 

makes OCR systems hard to develop. But in the past few years, 

multiple studies have been conducted integrating Tesseract 

OCR engine for character recognition using different low 

resource languages such as Tamil [10], Hindi [11], Bengali 

[12] and Urdu [13].      

III. SINHALA SCRIPT 

The Sinhala script is an abugida or alphasyllabary script 

in which consonant-vowel sequences are written as units and 

thereby it is called a segmental writing system. The script has 

evolved from the Brahmi script. The letters in Sinhala are 

circular-shaped and are written from left to right [14]. The 

Sinhala script is used primarily to write the Sinhala language, 

which is one of the official languages of Sri Lanka spoken by 

about 16 million people in the country. In addition, it is also 

used in Sri Lanka for writing Pali, the canonical language of 

Theravada Buddhism, and sometimes Sanskrit, the Old Indo-

Aryan language [15].  

There are 20 vowels and 41 consonants in the Sinhala 

script. Since Sinhala is a segmental writing system, vowels 

take two representations as independent vowels: occur in the 

initial position of a word (infrequently occur in the middle of 

a word: E.g. නුවරඑළිය, ජාඇල) and dependent vowels also 

known as vowel modifiers: occur after a consonant. Figure 1 
and 2 illustrate the vowels with their modifiers and consonants 

in Sinhala script respectively.  

 
Vowels and modifiers included for the training data 

ආ ා   උ ා   ඒ ේා  
ඇ ා   ඌ ා   ඓ ෛා 
ඈ ා   ඍ ා   ඔ ේා  
ඉ ා   ඎ ා   ඕ ේා  
ඊ ා   එ ේා  ඖ ේා  
    ා    ා  
Vowels and modifiers not included for the training data 

 ඏ  ා  ඐ  ා   
Fig. 1 Vowel characters and modifiers in Sinhala script 

From among the vowel modifiers in figure 1, ං  (anusvara) 

and ං  (visarga) are two specific modifiers. They occur not 

only with consonants but also with vowels. E.g. අ , ඉ , උ , අ , 

ඕ .  

Consonants included for the training data 

ක ඛ ග ඝ  ඟ ළ හ 
ච ඡ ජ ඣ ඤ ඥ ය ශ 
ට ඨ ඩ ඪ ණ ඬ ර ෂ 
ත ථ ද ධ න ඳ ල ස 
ප ඵ බ භ ම ඹ ව ෆ 

Consonants not included for the training data 

  ඞ  ඦ    
Fig. 2 Consonant characters in Sinhala script 

Two vowels: ඏ, ඐ and their corresponding vowel 

modifiers in figure 1 and ඦ in figure 2 were not included for 

the training data as they  do not occur in old or contemporary 

Sinhala books. However, ඞ in figure 2 occurs in a limited 

number of words in old Sinhala books. It was not included 

because the shape of the particular character would  cause 

misrecognition with similar characters in Sinhala script.  

Sinhala consonants imply the inherent vowel /a/ (අ) when 

they are  occur with no modifiers. Absence of the inherent 

vowel is marked by adding a symbol called hal lakuna or 

halkirima to the top of the particular consonant. E.g. ක්, ව්. 

Further, hal lakuna also  occurs with two vowels and their 

modifiers. It has two shapes as illustrated in figure 3.  

 
Fig. 3 Two different shapes of hal lakuna with vowels and consonants 

As a segmental writing system, vowel modifiers  appear 

above, below or to the right or left of the basic consonant. From 

all the consonant-vowel sequences in Sinhala script, ළු is a 

special character as it appears as a separate symbol to represent 

ළ+උ sequence. As an example, following figure 4 illustrates 

all the consonant-vowel sequences for consonant ‘ක’.  

ක  ක  ක  කි කී කු 
කූ ක  ක  ේක ේේ ෛක 

ේක  ේක  ේක  ක  ක   
Fig. 4 Consonant ‘ක’ with all the vowel modifiers 

There are three consonant modifiers which occur  in the 

Sinhala script, known as rakaranshaya (  % ), yanshaya ( H ) and 

rephaya (    _).  Among them rakaranshaya represents ‘ර’ (ra) 
and yanshaya represents ‘ය’ (ya) when they appear after a 

consonant (from which the inherent vowel has been removed). 

However, as symbols, rakaranshaya appears below (e.g. ක්‍රම, 

ආශ්‍රය, වක්‍ර) and yanshaya to the right (වයසන, සත්‍ය, 


3            Isuri Anuradha#1, Chamila Liyanage2, Ruvan Weerasinghe3 

 
International Journal on Advances in ICT for Emerging Regions                                                                                   July 2021 

සංඛ්‍යාව) of the basic consonant. Further, rephaya is also 

used to denote 'ර්' when it occurs before a consonant and the 

symbol appears on top of the basic consonant (e.g. ධර්‍ම, සර්‍ව, 

ත්‍ර්‍ක). Using rephaya is an alternative rule in the Sinhala 

writing system while rakarakshaya and yanshaya are 

essential. All the vowel modifiers surround the consonant-

rakaranshaya (e.g. ක්‍ක්‍රෝ ), consonant-yanshaya (e.g. ක්‍කයෝ ) 

or rephaya-consonant (e.g. ක්‍ර්‍කෝ ) units. Figure 5 illustrates 

how vowel modifiers occur with rakaranshaya.   

 
ක්‍ර ක්‍ර  ක්‍ර  ක්‍ර  ක්‍රි ක්‍රී 
ක්‍ර  ක්‍ර  ේක්‍ර ේේ ෛක්‍ර ේක්‍ර  
ේක්‍ර  ේක්‍ර  ක්‍ර  ක්‍ර    
 

Fig. 5 Consonant ‘ක’ with rakaranshaya and the vowel modifiers 

One other significant characteristic in Sinhala writing 

system is using compound consonants. This frequently 

occurred in old Sinhala books. However, in contemporary 

Sinhala this writing system is infrequent and therefore only the 

first set of compound consonants in figure 6 (which are rarely 

occurred in contemporary Sinhala books) have been concerned 

for the training data in this research.  

Compound consonants rarely occurred in contemporary 

Sinhala books 

ක්‍ව ක + ZWJ + ව 
ක්‍ෂ ක + ZWJ + ෂ 
ග්‍ධ ග + ZWJ + ධ 
ත්‍ථ ත + ZWJ + ථ 
ත්‍ව ත + ZWJ + ව 
න්‍ථ න + ZWJ + ථ 
න්‍ද න + ZWJ + ද 
න්‍ධ න + ZWJ + ධ 
න්‍ව න + ZWJ + ව 
ද්‍ය ද + ZWJ + ය 

Compound consonants occurred in old books 

 ඤ + ZWJ + ච 
ට්‍ඨ ට + ZWJ + ඨ 
ද්‍ධ ද + ZWJ + ධ 
ද්‍ව ද + ZWJ + ව 
Fig. 6 Compound consonants in Sinhala script 

 
IV. SYSTEM OVERVIEW 

In our study, different Sinhala text genres were given 

different accuracy results. From a variety of genres, 

explanation and descriptive writings, narrative writings, and 

news reportage were selected for our purpose. When selecting 

documents, we considered a variety of documents and Unicode 

font types from different printing eras. When image 

resolutions were considered, low image resolutions may affect 

not only quality but also speed degradation of overall OCR 

performance, since uncertainty in character pictures produce 

more recognition variants. In the Tesseract engine also, high 

resolution images were able to give high accuracy by 

identifying all the punctuations, modifiers and complex letters. 

In the Tesseract engine, image processing is a combination of 

several steps such as rescaling, Binarization, Dilation / 

Erosion, and etc. 

 
For the training process, we adapted and experimented on both 

Tesseract 3.0 (Legacy version) and Tesseract 4.0 (Deep 

learning) OCR engines as a tool. Tesseract has a standard level 

of accuracy in its engine. It’s necessary to have a library file in 

the OCR engine called ‘traineddata’ which works on Sinhala 

inputs. This file is a concatenation of multiple files. According 

to the accuracy and richness in the library file, the OCR engine 

can work to its full potential. Sinhala language is complicated 

and has various types of letters including vowels, consonants, 

compound characters and other special types. Therefore, for 

Tesseract 3.0, we developed a large character set for Sinhala. 

It is important to mention that, for Tesseract 3.0 we need to 

uniquely identify each and every character. Sometimes due to 

the complexity of the character set, the OCR may not always 

detect a character correctly even if the character is included in 

the training files.  

 
The preparation of data and the training process adopted 

for developing the Sinhala OCR model for both tesseract 3.0 

and 4.0 versions are described in the following subsections. 

V. TRAINING PROCESS 

The preparation of data and the training process adopted 

for developing the Sinhala OCR model for both Tesseract 3.0 

and 4.0 versions are described in the following subsections. 

● Preparation Tesseract version 3.0  

A. Setting up the OCR Engine  

We installed the Tesseract version 3.0 in the Windows 

Operating System. Since there is no user interface of Tesseract 

3.0, we used several commands in the command line to launch 

the application.  

B. Preparing training data 

  The process followed by preparing training data is described 

below. 

1)  Preparing OCR alphabet for Sinhala: Initially the OCR 

alphabet for Sinhala was defined to collect text data. The 

OCR alphabet is distinct from the character alphabet which 

includes the basic units of training data as follows. 


Estimating the Effects of Text Genre, Image Resolution and Algorithmic Complexity needed for Sinhala Optical Character Recognition    4 

 
International Journal on Advances in ICT for Emerging Regions                                                                                   July 2021 

o All vowels: e.g.  අ ඉ උ එ ඒ ඔ ඕ 

o All consonants e.g. ක ව ඟ ඳ ඣ  

o Consonants with touching modifiers: e.g. කි කු වි  

o Consonants with hal lakuna: e.g. ක් ව්  

o Compound consonants: e.g. ක්‍ර ව්‍ර ක්‍ෂ ත්‍ථ  

o Compound consonants with touching modifiers: e.g. 

ක්‍රි ව්‍රි ක්‍ෂි ක්‍ෂු ත්‍ථි ත්‍ථු  

o Non-touching modifiers: e.g. ෙං ංා ං  ං   

o Punctuation marks: e.g. ! ? ( ) /  

2)  Text data collection: Text data for preparing training 

images were collected from the UCSC 10M Sinhala corpus 

and these were extracted as per the OCR alphabet that we 

defined. We selected 1050 words from extracted data for the 

preparation of training images. figure 7 shows a sample of 

text data.  

 
Fig. 7 Sample of text data 

 
3)  Preparing training Image: Training images were 

prepared for fonts widely used in Sinhala typing. The criteria 

followed to create text images are as follows. 

Colour: grayscale  

Font size of the text: 12, 14, 16 

DPI: 300 

Fonts used: Iskoola Pota, FMAbhaya, Malithi Web, 

BhashithaScreen, DinaminaUniWeb  

Based on the above criteria we prepared two sets of 

training images. The first set consisted of computer-

generated images (screenshots). As an iterative process of 

training, the second set of training images were prepared 

with scanned images for the same text data. Figure 8 shows 

a sample of such training images.  

 
Fig. 8 Sample of training images  

 
4)  Character Segmentation: In Tesseract 3.0, character 

segmentation is performed using the process of creating box 

files. A ‘box file’ is a text file, which contains the necessary 

information of the training images. The coordinate values of 

the characters in the training images along with 

corresponding Unicode characters are stored in these box 

files. The segmentation of the characters in the training 

images was done as per the OCR alphabet we defined. Each 

training image should have a box file in which the number of 

boxes must tally with the number of training character 

segments in the image. Figures 9 and 10 show an image with 

segmented characters and a sample of box information.  

 
Fig. 9 Character segmentation 
 
 
Fig. 10 Sample of box information 

 
5            Isuri Anuradha#1, Chamila Liyanage2, Ruvan Weerasinghe3 

 
International Journal on Advances in ICT for Emerging Regions                                                                                   July 2021 

ෙේ ´කලියුගය´ වූ කලි ´ගේ ෙෙරළිය´ හා ´යුගාන්තය´ එකට 

ඈඳීෙෙන් ඒ  තුන, තුන් ව ද රුේ එකෙ කතාන්තරයක් කරන්නකි. 

බටහිර සභ්‍යත්වය ර ගීෙ නිසා ෙඳින් ෙඳ ෙවනස ්වන්ට වූ දකුණු 

ෙළාෙත් ග මි ෙවුලක ජීවිතයත් ග මි සොජයත් ගේ ෙෙරළිෙයහි 

කථාවට වස්තු වී ය. එහි ෙසෙළාස ්වන ෙරච්ෙේදෙයහි ෙේ කියුෙ 

දක්නා ල ෙේ.   

´´ෙ ගෙලාත්සවෙයන් ෙසු ෙකාළඹ ගිය පියල් ඇත ේ විට ගෙට 

එන්ෙන් සතියකට වරකි;  ඇත ේ විට ෙදසතියකට වරකි. ක්‍රෙෙයන් 

දියුණු වන ෙවළඳාෙ උෙදසා තෙ කාලය වඩාත් ෙයදිය යුතු වූෙයන් 

පියල් ෙකාළඹ ෙදි චියට යාෙට සිතුෙව් ය. කලකට ෙෙර ඔහු ෙෙෝදර 

තනවන්ට ආරේභ්‍ කළ ෙගයි ව ඩ නිෙ ෙකෙළ් විවාහයට ෙදෙසකට 

ෙෙර ය. ඔහු ඒ ෙගය තනවන්ට ෙටන් ගත්ෙත් කුලියට ෙදනු පිණිස 

මිස තොෙේ ෙදි චිය පිණිස ෙනාෙව්. තො ස ද වූ ඒ ෙගයින් ගත යුතු 

ප්‍රෙයෝජනය පියල්ට ද න් ව ටෙහයි.´´ 

පියල් හා නන්දා, අනුලා හා තිස්ස ද සෙඟ  ෙකාළඹට ස ක්‍රෙණය වී 

ෙෙෝදර අර ෙගයි ෙදි චි ෙවති. ෙේ ´කලියුගය´ ඔවුන්ෙේත් ඔවුන්ෙේ 

දරුවන්ෙේත් ජීවිතය වණණනා කරනු පිණිස ෙබ ඳුණකි. 

C. Training the model  

The training was performed as an iterative process until 

better results were obtained. Firstly, the training models were 

done for individual data sets of computer-generated images for 

given font types and sizes. Secondly, we combined the training 

data sets for multiple fonts and multiple sizes and trained the 

models. Thirdly, the training was performed using the scanned 

images and trained multiple models for the given font types 

and sizes. Finally, all the data sets of computer-generated 

images and scanned images were combined in several ways 

and trained multiple models.  

● Preparation Tesseract 4.0 version 

A. Setting up the OCR Engine  

For setting up the Tesseract 4.0 version we selected Ubuntu 

environment. Since Tesseract 4.0 deals with deep learning 

techniques such as Long Short-Term Memory (LSTM), the 

Ubuntu operating system provides full compatibility for OCR 

engines. And all the tasks were carried out in the terminal and 

instructions were given as commands. 

B. Preparation of training data sets 

Training data plays an important role in Tesseract version 

4.0. With the integration of deep learning techniques, more 

training data will result in good outcomes. For our experiment, 

we have employed 3 datasets which are available for the 

Sinhala language. Further details of the 1) UCSC 10 million 

Sinhala dataset, 2) common crawler Sinhala dataset and 3) 

Google dataset will be discussed in the next few lines. 

1)  UCSC 10 Million Word Sinhala Corpus:  UCSC 10M 

Word Sinhala Corpus has been compiled by the Language 

Technology Research Laboratory - University of Colombo 

School of Computing (UCSC) in Sri Lanka. This text corpus 

contains a huge variety of Sinhala books including novels, 

short stories, translations, critiques written by renowned 

Sinhala writers, and Sinhala newspapers: Silumina, 

Dinamina, Lankadeepa and Lakbima. The UCSC 10 million 

dataset includes texts which belong to different eras in Sri 

Lanka. It also contains texts from various sources; the text is 

rich with different writings. Noise data and other textual data 

with different languages have been removed from this 

dataset in order to minimize the errors. 

2)  5million+ sentences in Sinhala common crawler: In 

2019, Guzman [16] presented two monolingual corpora for 

Sinhala. Those were a combination of 155k+ sentences of 

filtered Sinhala Wikipedia and 5178k+ sentences of Sinhala 

common crawl. Since this study considered only textual data 

available online, the diversity of textual representation is 

considerably low. Furthermore, a high noise rate exists in 

this dataset with other common issues like the zero width 

joiner problem and the combination of multiple language 

textual data with Sinhala textual data.  And these affect the 

overall accuracy of the system. 

3)  Google dataset for Sinhala is especially built with the 

Tesseract. This dataset includes variety of textual 

representations gathered in recent years.   

4)  UCSC 400K distinct wordlist: This list of monolingual 

vocabulary was developed from the UCSC 10 million words 

Sinhala corpus by the Language Technology Research Lab 

of UCSC. The list includes 440,021 distinct entries and is 

available on the web. 

 
After comparing these 3 datasets, the UCSC 10 million Sinhala 

dataset [17] was selected by the authors due to the enrichment 

of textual combinations in different eras and less noise data. 

UCSC 400K Distinct Word List [18] was also combined with 

the existing Tesseract word list. 

As a special feature, the Tesseract version 4.0 generates the tiff 

file and box file automatically. Additionally, image and 

corresponding UTF-8 text transcription are generated on lstmf 

file at the process of font training. Also in Tesseract 4.0 the 

clustering steps (mftraining, cntraining, shape clustering) are 

replaced with a single slow lstm training step.  

 
Fig. 11 Sample of training data for Tesseract 4.0 

C. Selection of font types and sizes 

Since typefaces are significant in training an OCR system, 

we investigated the commonly used Sinhala fonts to train the 

OCR model in Tesseract 4.0. Though there are hundreds of 

non-Unicode fonts available for the Sinhala script, they have 

no unique character code point for identification. Owing to its 

16-bit encoding, UNICODE is theoretically able to support 

over 65,000 unique character code points [19] and we selected 

9 Unicode fonts from the limited number of Sinhala Unicode 

fonts available. They include Unicode fonts which are most 

commonly used in printed and digital media [20]. The font 

types involved with the research is given below.  

○ Noto-Sans font 

○ LKLug font 

○ Malithi font 

○ Dinamina font 

○ Iskolapotha font  

○ BhashitaComplex font  


Estimating the Effects of Text Genre, Image Resolution and Algorithmic Complexity needed for Sinhala Optical Character Recognition    6 

 
International Journal on Advances in ICT for Emerging Regions                                                                                   July 2021 

D. Training the model 

As pre-processing steps for noise removal, adaptive 

thresholding, page layout analysis and connected component 

analysis were performed by the Tesseract OCR engine. The 

following steps were followed to train the model. Initially 

generated training data is provided as the input to the engine 

and extract the generated model. Then the model was fine 

tuned to decrease the error rate and finally the fine-tuned 

model was combined with the initial trained model. We 

combine multiple fonts for model creation. Single font models, 

Double font models and Triple font models were used for 

analysis.  

VI. EVALUATION AND RESULTS 

 
For the evaluation process, we considered both Tesseract 

3.0 and Tesseract 4.0. As the first phase, the Tesseract version 

3.0 was evaluated by character level. Meanwhile, the Tesseract 

4.0 was evaluated at both character and word level. The 

developed OCR models have been tested with 30 images 

selected for three different categories (10 for each category). 

When selecting images for testing we chose non identical 

images with different typefaces and different image qualities.  

1)  Old Sinhala newspapers: The testing data for this 
category was selected from archived images of Sinhala 

newspapers published in 1870 – 1890. The newspapers 

include: සෙතයෝදය (sathyodaya), සතයාල කාරය 

(sathyalankaraya) and දිනෙතා ප්‍රවෘත්ති (dinapatha 

prawruththi). All the images are in 200 DPI.  

2)  Old Sinhala books: Testing images for this category were 

selected from old Sinhala books which are printed on 

Letterpress printing. The old books selected include: 

බුත්සරණ (buthsarana), පූජාවලිය (pujawaliya) and 

සද්ධෙණරත්නාවලිය (saddharmarathnawaliya). The images 

in this category are in 72 DPI.  

3)  Contemporary Sinhala books: The books printed with 

computerized fonts were selected for this category. 10 

images of randomly selected pages from 10 books were 

taken and they were scanned for 300 DPI.  

To calculate the accuracy of the systems we compare the 

common and different characters between original and OCR 

document.  

 
A. Evaluation of the models from Tesseract 3.0 

The evaluation of Tesseract version 3.0 was conducted 

only for the third category of testing images for two reasons. 

Firstly, the results for the other two categories were not at 

satisfying level and secondly, we gave our main priority for 

the evaluation of Tesseract version 4.0 Therefore, we selected 

the most accurate model (Scanned-iskolapotha model) out of 

18 multiple models created by varying different font types and 

sizes. Original data of the testing samples consist of 2592 

words and 16380 characters. Testing results are illustrated in 

table i.  

TABLE I 

TESSERACT 3.0 RESULTS OF OCR DOCUMENTS 

Font type 

 
Recognized 

character 

count 

Recognized 

word count 

Accuracy 

of the 

system 

Iskolapotha 16962 2507 36.89% 

 
B. Evaluation of the models from Tesseract 4.0 

The models generated from Tesseract 4.0 OCR engine 

were evaluated for the three categories of testing samples 

explained above. From the generated models, all the models of 

individual fonts and three selected Combined Models (CM) 

were evaluated. The same set of testing images were used in 

the evaluation process.  

For the first category of evaluation, we selected 10 images 

from old newspapers and they consist of 1557 words and 9821 

characters. Some of the texts in these images are even hard to 

read by a human. The results for the first category of images 

are shown in table ii.  

TABLE II 

CATEGORY 01 RESULTS OF OCR DOCUMENTS 

Font type 

 
Recognized 

character 

count 

Recognized 

word count 

Accuracy 

of the 

system 

Noto-Sans 10142 1462 61.43% 

LK-LUG 10031 1441 61.66% 

Malithi 10094 1516 65.51% 

Iskolapotha 10067 1458 67.02% 

Dinamina 9897 1426 59.83% 

Bashitha 10056 1451 61.96% 

Noto-LKLug 

(CM) 

10071 1458 61.51% 

Malithi-Lug 

(CM) 

10003 1449 63.30% 

Noto-Lug-

Malithi 

(CM) 

10035 1445 64.40% 

 There were 3032 words and 18074 of characters in 

the images of category 2, the old books printed in letterpress 

printing era. The results obtained are illustrated in table iii. 

 
Accuracy (%) =   Count of common characters      X  100% 

 Count of (common characters + different characters) 


7            Isuri Anuradha#1, Chamila Liyanage2, Ruvan Weerasinghe3 

 
International Journal on Advances in ICT for Emerging Regions                                                                                   July 2021 

TABLE III 

CATEGORY 02 RESULTS OF OCR DOCUMENTS 

Font type 

 
Recognized 

character 

count 

Recognized 

word count 

Accuracy 

of the 

system 

Noto-Sans 18623 3019 85.13% 

LK-LUG 18584 3012 87.15% 

Malithi 18774 3039 87.07% 

iskolapotha 18688 3022 85.28% 

Dinamina 18428 2807 84.97% 

Bashita 18387 2983 87.53% 

Noto-LKLug 

(CM) 

18627 3023 86.06% 

Malithi-Lug 

(CM) 

18704 3017 85.69% 

Noto-Lug-

Malithi 

(CM) 

18461 3010 87.52% 

Third category of 10 images captured from contemporary 

Sinhala books and they consist 2592 words and 16151 

characters. The results are denoted in table iv.  

TABLE IV 

CATEGORY 03 RESULTS OF OCR DOCUMENTS 

 
Font type 

 
Recognized 

character 

count 

Recognized 

word count 

Accuracy 

of the 

system 

Noto-Sans 16476 2620 85.91% 

LK-LUG 16306 2615 84.91% 

Malithi 16530 2626 86.14% 

iskolapotha 16391 2635 87.63% 

Dinamina 16104 2459 85.49% 

Bashita 16178 2591 87.49% 

Noto-LKLug 

(CM) 

16335 2627 84.83% 

Malithi-Lug 

(CM)  

16338 2796 85.49% 

Noto-Lug-

Malithi(CM) 

16259 2613 86.59% 

 
When three categories of input image sets were considered, 

old newspapers have a low accuracy rate in character 

recognition due to the high existence of noises and low quality 

of images. In the old Sinhala book category Malithi, LKLug 

and the combined model of Noto Sans, LKLug and Malithi 

were given best accuracy when recognizing characters. The 

model trained with the font Iskolapotha obtained the highest 

accuracy rate in 3rd category of contemporary Sinhala books. 

When analysing OCR outputs, some identifiable improvement 

can be done in the recognition process of the system.  

Moreover, as another part of this research we randomly 

selected 5 images from the contemporary Sinhala books and 

converted for low DPI count (from 300px to 96px). Thereafter, 

we chose our best 3 models in the contemporary Sinhala books 

category and evaluated the performance. A total count of 1482 

words and 8735 characters were there in the selected 5 image 

data. 

TABLE V 

COMPARISON WITH THE LOW DPI LEVEL 

Font type 

 
Recognized 

character 

count 

Recognized 

word count 

Accuracy 

of the 

system 

iskolapotha 8983 1481 87.88% 

Malithi 9067 1479 86.09% 

Noto-Lug-

Malithi 

(CM) 

8895 1473 87.40% 

 
As a special note, after reducing DPI count, some character 

combinations which were not recognized correctly but 
samples with 300 DPI were able to recognized. For instance 
yanshaya was not recognized well in previous efforts but with 
this modification yanshaya was recognized well. Some clearly 
identifiable errors in the recognition process of our OCR 
system has been briefly noted below.  

 
o Confusing with similar shaped individual characters. (e.g. 
හ - භ්‍ - ග - ඟ - ශ, ව - ච - ට, ඔ - ඹ - ෙ, එ - ඵ - ථ, ඩ - ඬ, ද - 

ඳ, ය - ස - ඝ, ජ - ඡ, බ - ඛ, ඨ - ඪ, ත - න) Errors of this type 

frequently occur in 1st and 2nd categories of testing 

images. When the images are not clear, the tiny variations 

of the characters are difficult to be captured.  

o Inability to recognize hal lakuna. (e.g. ක් - ක, ෙක - ෙක්, 
ව - ව්, ෙව - ෙව්)  

o Confusing with vowel modifiers and hal lakuna in the 
same character or in different characters. (e.g. වි - වී - ව් - 

චි - චී - ච්, හි - හී - භි - භී, මි - මී - ේ, මු - මූ, සි - සී) 

o Misidentification of rakaranshaya with papilla, the vowel 
modifier for 'u'. (e.g. ප්‍ර - පු, ට්‍ර - ටු, ම්‍ර - මු) 

o Inability to recognize touching letters. (e.g.  - ස්ස,  
- ද්ව,  - ට්ඨ,  - ණ්ණ,  - ේඛ) Using touching 
letters was a writing style in old Sinhala. And in Pali it is 

a rule, as Pali does not have a sign like hal lakuna to show 

the absence of the inherent vowel. This writing style can 

occur in some testing images of category 1 and 2. Since 


Estimating the Effects of Text Genre, Image Resolution and Algorithmic Complexity needed for Sinhala Optical Character Recognition    8 

 
International Journal on Advances in ICT for Emerging Regions                                                                                   July 2021 

contemporary writing does not follow this style, the 

training data is not rich with these sequences. This has 

resulted in not recognizing touching letters.  

o Inability to recognize compound consonants. The 
compound consonants given in figure 6 hardly occur in 

contemporary Sinhala and therefore they are not well 

recognized.  

Some English characters were also in the testing images of 

all three categories. As we focused on developing a better 

recognition model for Sinhala characters, we did not include 

enough English text data in the training process, this resulted 

in some errors in recognition and affected the overall accuracy 

of the system. However, the above limitations will be 

considered in the next stage as a future enhancement.  

VII. CONCLUSION AND FUTURE WORKS 

In this paper we presented a process of developing an 

Optical Character Recognition system for Sinhala. In this 

research we identified the characteristics of Sinhala script 

along with properties of writing style in Sinhala scripts. The 

training process of the OCR model was initiated with 

Tesseract 3.0 and later moved to Tesseract 4.0, as it was the 

state of the art of deep learning.  

The evaluation was carried out by comparing the results 

with the different types of Sinhala fonts and adapting systems 

to recognize varieties of test data gathered from different 

sources.  Although we tested some samples with the model 

built from a Sinhala common crawl dataset, overall accuracy 

is less compared to others and unable to identify characters. 

According to the results, our system model trained with 

font iskolapotha gave accuracy of 87.63% in contemporary 

Sinhala books. In the Sinhala old book category, models 

developed using fonts Malithi, LKLug and combined font 

models using Noto Sans, LKLug and Malithi gave accuracies 

of 87.07%, 87.15% and 87.52% respectively. Meanwhile in 

the old Sinhala newspaper category 67.02% of accuracy was 

obtained from the model developed with font iskolapotha. 

Developing OCR systems for low resource languages 

needed a considerable amount of effort from both linguistics 

and computer science domain areas. Analysing linguistics 

rules and mapping them with computer science is quite 

challenging for low resources languages like Sinhala and 

Tamil. In this stage of the research, we focused highly, only on 

the recognition of the Sinhala script. As mentioned in the 

above sections, the Sinhala script is also used to write “Pali” 

and “Sanskrit” languages in Sri Lanka. As a future 

enhancement we will work on identifying touching and 

conjoining letters which occur frequently in Pali. We also plan 

to apply some n-gram or word embedding based post-

processing techniques to enhance the accuracy. Also in real 

world OCR can be categorized as one of sequence learning 

tasks.  Therefore, it is necessary to predict the sequence of 

labels from noisy, unsegmented input data. As future work, we 

plan to combine connectionist temporal classification (CTC) 

with deep learning algorithms to train the Recurrent Neural 

Network (RNN) to label unsegmented sequences directly. 

Moreover, neural net compressions and conventional neural 

machine translations for Sinhala OCR will be studied in the 

future. 

 
ACKNOWLEDGMENT 

This work was carried out as a part of a project funded by 

Theekshana - Research and Development Company. We 

acknowledge Mrs. Dinuji Godigamuwa for assisting the work 

and thank all the members of Language Technology Research 

Laboratory of the University of Colombo School of 

Computing, Sri Lanka, who helped in various ways to make 

this work bear fruit. 

REFERENCES 

[1] R. Weerasinghe, A. Wasala, D. Herath, and V. Welgama, “Nlp 
applications of sinhala: Tts & ocr,” in Proceedings of the Third 

International Joint Conference on Natural Language Processing: 

Volume-II, 2008. 
[2] Smith, R. (2007) ‘An Overview of the Tesseract OCR Engine’, in Ninth 

International Conference on Document Analysis and Recognition 

(ICDAR 2007), pp. 629–633. doi: 10.1109/ICDAR.2007.4376991. 
[3] A. R. Weerasinghe, D. L. Herath, and N. P. K. Medagoda, “A nearest-

neighbor based algorithm for printed sinhala character recognition,” 

Innov. a Knowl. Econ., p. 11, 2006. 
[4] M. Rimas, R. P. Thilakumara, and P. Koswatta, “Optical character 

recognition for Sinhala language,” in 2013 IEEE Global Humanitarian 

Technology Conference: South Asia Satellite (GHTC-SAS), 2013, pp. 
149–153. 

[5] S. Ajward, N. Jayasundara, S. Madushika, and R. Ragel, “Converting 
printed Sinhala documents to formatted editable text,” in 2010 Fifth 

International Conference on Information and Automation for 

Sustainability, 2010, pp. 138–143. 
[6] H. W. H. Premachandra, C. Premachandra, T. Kimura, and H. 

Kawanaka, “Artificial neural network based sinhala character 

recognition,” in International Conference on Computer Vision and 
Graphics, 2016, pp. 594–603. 

[7] [Online]. Available: http://192.248.22.122/ocrsinhala/  
[8] I. Anuradha, C. Liyanage, H. Wijayawardhana, and R. Weerasinghe, 

“Deep Learning Based Sinhala Optical Character Recognition (OCR),” 

in 2020 20th International Conference on Advances in ICT for 

Emerging Regions (ICTer), 2020, pp. 298–299. 
[9] U. Manisha and S. R. Liyanage, “Sinhala Character Recognition using 

Tesseract OCR,” 2018. 

[10] C. Liyanage, T. Nadungodage, and R. Weerasinghe, “Developing a 
commercial grade Tamil OCR for recognizing font and size 

independent text,” in 2015 Fifteenth International Conference on 

Advances in ICT for Emerging Regions (ICTer), 2015, pp. 130–134. 
[11] N. Mishra, C. Patvardhan, C. V. Lakshmi, and S. Singh, “Shirorekha 

chopping integrated tesseract ocr engine for enhanced hindi language 

recognition,” Int. J. Comput. Appl., vol. 39, no. 6, pp. 19–23, 2012. 
[12] M. T. Chowdhury, M. S. Islam, B. H. Bipul, and M. K. Rhaman, 

“Implementation of an Optical Character Reader (OCR) for Bengali 

language,” in 2015 International Conference on Data and Software 
Engineering (ICoDSE), 2015, pp. 126–131. 

[13] S. Hussain, A. Niazi, U. Anjum, F. Irfan, and others, “Adapting 
Tesseract for complex scripts: an example for Urdu Nastalique,” in 
2014 11th IAPR International Workshop on Document Analysis 

Systems, 2014, pp. 191–195. 

[14] R. M. Joshi and C. McBride, Handbook of Literacy in Akshara 
Orthography, vol. 17. Springer, 2019. 

[15] J. W. Gair and W. S. Karunatilaka, “Literary Sinhala Inflected Forms: 
A Synopsis with a Transliteration Guide to Sinhala Script.,” 1976. 

[16] Guzmán, F., Chen, P. J., Ott, M., Pino, J., Lample, G., Koehn, P., ... & 
Ranzato, M. (2019). Two new evaluation datasets for low-resource 

machine translation: Nepali-english and sinhala-english. arXiv 2019. 
arXiv preprint arXiv:1902.01382. 

[17] [Online]. Available: http://ltrl.ucsc.lk/tools-and-resourses/   
[18] [Online]. Available: http://ltrl.ucsc.lk/tools-and-resourses/  
[19] V. K. Samaranayake, S. T. Nandasara, J. B. Disanayaka, A. R. 

Weerasinghe, and H. Wijayawardhana, “An introduction to UNICODE 

for Sinhala characters,” Univ. Colombo Sch. Comput., 2003. 
[20] R. Subasinghe, S. Eramudugolla, S. Samarawickrama and G. Dias, 

"Atomic Vs Anatomic Features of Sinhala Fonts", 10th Typography 

Meeting, 2019.  
 

http://192.248.22.122/ocrsinhala/
http://ltrl.ucsc.lk/tools-and-resourses/
http://ltrl.ucsc.lk/tools-and-resourses/


9            Isuri Anuradha#1, Chamila Liyanage2, Ruvan Weerasinghe3 

 
International Journal on Advances in ICT for Emerging Regions                                                                                   July 2021 

APPENDIX A 

Following include the images of three categories used for 

testing each OCR model.  

 
APPENDIX B 

Interface of developed OCR system is shown in Figure B. 

 
Fig A.1. A sample image of category 1 (old Sinhala newspapers) 

Fig A.2. A sample image of category 2 (old Sinhala books) 

 
Fig A.3. A sample image of category 1 (contemporary 

Sinhala books) 

Fig B. Online Application developed for the Sinhala OCR