69 

 
Novel Ontologies-based Optical Character Recognition-error Correction  
Cooperating with Graph Component Extraction 

 
Sarunya Kanjanawattana 

Graduate School, Shibaura Institute of Technology, Tokyo, Japan 
nb14503@shibaura-it.ac .jp  

 
Masaomi Kimura 

Information Science and Engineering, Shibaura Institute of Technology, Tokyo, Japan 
masaomi@sic .shibaura-it.ac.jp  

 
Abstract 

A graph is a data representation visually presenting significant information in the academic 

literature. Extracting graph information clearly contributes to readers, who are interested in graph 
information interpretation, because we can obtain significant information presenting in the graph. A 

typical tool used to transform image-based characters to computer editable characters is optical 
character recognition (OCR). Unfortunately, OCR cannot guarantee perfect results, because it is 
sensitive to noise and input quality. This becomes a serious problem because misrecognition 

provides misunderstanding information to readers and causes misleading communication. In this 
study, we present a novel method for OCR-error correction based on bar graphs using semantics, 

such as ontologies and dependency parsing. Moreover, we used a graph component extraction 
proposed in our previous study to omit irrelevant parts from graph components. It was applied to 
clean and prepare input data for this OCR-error correction. The main objectives of this paper are to 

extract significant information from the graph using OCR and to correct OCR errors using 
semantics. As a result, our method provided remarkable performance with the highest accuracies 
and F- measures. Moreover, we examined that our input data contained less of noise because of an 

efficiency of our graph component extraction. Based on the evidence, we conclude that our solution 
to the OCR problem achieves the objectives.         

Keywords: OCR-error correction, post-processing, dependency parsing, ontology, graph-
component extraction.  
 

1. Introduction 

Nowadays, with the advent of digital optical scanners, a digital document has been required 

on account of ease of use and search. O ver several years, a great effort has been devoted to the 
study of image-based information extraction. Images, especially graph images, typically contain 
much expedient information. For example, authors usually use graphs to present their experimental 

results, including measurement data for clear explanations. Graphs graphically provide data 
summarization presenting essential information that is simply interpreted by acquiring small 

descriptive details. Thus, an automatic system extractable latent information from the graphs 
provides many contributions to society for disclosing explicit and implicit knowledge. To obtain  a 
primary interpretation, initial graph components analyzed are axis descriptions (i.e., X- and Y-titles) 

and a legend. OCR is an approving solution used for acquiring them as a digital format of character 
letters. 

OCR is extensively used  in several applications, such as medical article citation database  
MEDLINE (Lasko & Hauser, 2000 ) and text extraction from image and video frames (C hen, 
Odobez & Bourlard, 2004). In regard to academics, countless documents have been converted from 

paper-based to digitized information using OCR. However, OCR cannot guarantee accurate outputs. 
Generally, a quality of OCR outputs is fairly decreased, if OCR inputs have various defects, e.g., 

poor printing quality, small image resolution, specific language requirement, and image noises. 
These are main causes of misrecognition that produce OCR errors. For example, a word “BED” 
may be incorrectly recognized as “8ED”. Regarding the OCR errors, there have been two types of 

word errors that may be found in our study, non-word and real- word errors (Tong & Evans, 1996). 


BRAIN. Broad Res earch i n Arti fi ci al  Intelli gence and Neuros ci ence 

Vol ume 7, Is s ue 4, November 2016, ISSN 2067-3957 (onli ne), ISSN 2068 - 0473 (pri nt) 

 
 70 

A non-word error occurs when OCR recognizes a source-text as a string that invalidly corresponds 

to any vocabulary item in the dictionary. A real- word error occurs when OCR applies to the source-
text and provides incorrect output strings which coincidently match to an item in the dictionary. For 
example, if OCR renders the source-text “Today is hot” as “Toolav is not”, then “Toolav” is a non-

word error, and “not” is a real-word error. To mitigate these errors, there has been a great deal of 
study focusing on addressing them and proposing methods based on practical techniques, such as 

semantic utilization (Jobbins et al., 1996) and statistical similarity measurement (Nagata, 1998). To 
our knowledge, using the semantics is a reliable solution to alleviate the OCR errors, because it 
analyzes not only the words themselves but also the context of corresponding sentences. O ur OCR-

error correction method utilizes a concept of semantics, including ontologies and natural language 
processing (NLP) to identify and correct the errors.     

However, OCR is unsuitable to directly apply to graph images, because there are irrelevant 
parts in the graphs, which do not necessitate for the primary interpretation, such as parts of bars and 
some numeric data. They may cause recognition noise (e.g., special characters and number); hence, 

they should be eliminated in advance by our graph component extraction for improving a quality of 
OCR results (Huang, Tan & Leow, 2005), (Kataria et al., 2008).  

The input of this study is a collection of bar graphs which contains at least axis descriptions 
(i.e., X- and Y-titles), and optionally, a legend. We here highlight only the bar graphs because its 
characteristics are dominant and easy to comprehend by both human and machine. Moreover, we 

gather related contents of documents for creating our ontology, such as image captions and cited 
paragraphs.  

In this study, we propose a novel method which is a combination of a graph component 
extraction and an OCR-error correction. Note that the graph component extraction has already been 
proposed in our previous study (Kanjanawattana & K imura, 2016), which aims to separate 

irrelevant parts from graph components. We focus on only three basic components, i.e., an X-title, a 
Y-title, and a legend; thus we determine other parts of graphs as irrelevant parts that should be 

omitted beforehand to reduce noise by the graph component extraction. To improve the OCR 
results, we also present the method of OCR-error correction in this study which is a post-processing 
system to analyze the results and correct errors based on ontologies, N LP and edit distance. We 

designed and created our ontology supporting dependency parsing of English context, including 
word categories queried from DBpedia. The crucial objectives of this study are to extract 

information from graph components and to improve the quality of OCR results recognized from the 
extracted graph components. Our system considerably contributes benefit to society, particularly in 
regard to academics, by acquiring implicit knowledge in the graphs. Moreover, it can be adapted to 

many applications, such as image search engine.  
The remainder of the paper is organized into six sections as follows: Section 2 reviews 

previous studies. We introduce our method in Section 3. Section 4 presents experiments and their 
results. In Section 5, we discuss the results and reveal s ignificant findings discovered by this study. 
Section 6 concludes and suggests future work.  

  
2. Related works 

In this section, we review existing works related to this study. There are two minor sections 
as follows:   

    
2.1 Image segmentation 
Image segmentation is currently an active research area with several unsolvable problems. 

This technique can be used to capture and separate dominant objects from image backgrounds. 
Basically, it deals with many kinds of images, such as outdoor scenes (Alvarez, LeC un & Lopez, 
2012) (C heng et al., 2012) and medical images (Hiran & Doshi, 2013). In academics, a graph image 

used to summarize and analyze essential information is another target image for this active field. 
Bar graphs are our main target in this study. We  attempted to separate the basic components to 


S. Kanjanawattana, M. Kimu ra - Nov el  Ontol ogi es -bas ed Optical  Character Recogni ti on-error Correcti on Cooperati ng 

wi th Graph Componen t Extracti on 
 

71 

 
prepare the inputs of OCR-error correction. However, to achieve graph segmentation is difficult for 
traditional techniques (such as image processing), because positions of graph components, 

especially a legend, are unfixed. A dramatic study addressing this difficulty has been presented by 
(Kataria et al., 2008). They aimed to automatically extract elements (e.g., axis labels, legends, and 

data points) from within a two-dimensional graph and mitigate a problem o f overlapping text and 
data points. They performed an image profiling to detect global features in order to identify 
coordinate axes. Moreover, they applied an extended K-median to isolate and detect the data points 

from a curve. However, they confronted a problem when trying to extract a legend. That can be 
solved by performing a connected component analysis to identify individual letters before applying 

OCR. The other interesting study is proposed in (Huang & Leow, 2005) whose main targets were to 
associate recognition results of textual and graphical information in scientific graphs. They 
individually recognized text and graphical regions of the graph images and then combined their 

results to achieve a full understanding. However, they encountered OCR errors that were solved by 
manual correction. Although these previous studies proposed effective methods to extract graph 

components, it did not identify types of individual components. In fact, each component carries 
essential information, but its role certa inly differs. For example, the X- and Y-titles evince a 
relationship of the graph. The legend provides particular information regarding data described as 

data labels. Clearly, to identify the type to each component is surely important for graph 
interpretation. O ur graph component extraction can achieve this obstacle. Moreover, we not only 

extracted graph components using the OCR technique but also tackled an OCR error problem by 
correcting errors based on our methods.                    

 
2.2 OCR-error correction 

A great deal of effort has developed many approaches to correct the OCR errors over several 

years. Nagata (Nagata, 1998) emphasized his work to correct misrecognized characters using 
character shape similarity and statistical language model. He attempted to challenge to Japanese 
whose sentences did not include word delimiters (e.g., space). However, we noticed that this 

previous study cannot correct such items as acronyms and transliterated foreign words because they 
often show in English (such as ISO and SONY) that cannot recognize by OCR included by 

Japanese language package. It differs from our method because ours can correct words universally 
as long as they appear in the source document.  

Semantic-based techniques (e.g., context-based analysis and ontology) are proper solutions 

addressing the OCR problem. Wick et al. (Wick, Ross & Learned-Miller, 2007) realized that 
conventional systems identified low-confidence outputs that were insufficient to correct 

misrecognition errors. They used topic models automatically detecting the semantic context of 
scanned documents and specified the word frequency to correct the errors. However, a limitation of 
topic models is high training time required, because users must classify documents to acquire their 

corresponding topics prior applying OCR. An interesting method related to correct OCR errors is 
also described in (Bassil & Alwani, 2012). They developed a context-based method based on 

Google’s online spelling suggestion to correct the OCR errors. They avoided using an offline 
dictionary because a huge volume of terms needed to gather in a source computer, which consumed 
a lot of resources. Google is a massive online database containing a large collection of word 

sequences. It is suitable to be a data source of correcting word suggestion. However, this technique 
is limited to use via online that need to concern about network availability and efficiency, e.g., 

speed and bandwidth.  
Recent studies addressing the problem of OCR errors tend to use ontology and semantics. 

Jobbins et al. (Jobbins, Evett & Sherkat, 1996) developed a system of automatic semantic-relation 

identification between words in Roget’s Thesaurus. This knowledge source contains explicit links 
between words and related vocabulary items for each part o f speech, unlike an ordinary dictionary. 

Their method depended on Relation algorithm that located semantic relations between words and 
calculated a relatedness score of each word. However, this technique possibly encountered a 


BRAIN. Broad Res earch i n Arti fi ci al  Intelli gence and Neuros ci ence 

Vol ume 7, Is s ue 4, November 2016, ISSN 2067-3957 (onli ne), ISSN 2068 - 0473 (pri nt) 

 
 72 

difficulty, if dealing with words in a sentence. They may obtain a real-word error in a same 

category or cross reference. To solve this problem, not only word categories but also sentence 
dependencies should be used, because each word in the sentence definitely contains at least one 
dependency linking to some other words in the same sentence. Zhuang et al. (Zhuang & Zhu, 2005) 

introduced an OCR post-processing method based on multiple forms of knowledge, for example,  
language knowledge and candidate distance information given by the OC R engine. They focused on 

Chinese characters. A similarity between this existing study and our study is to find candidates 
depended on similarity distances. However, this previous study was limited to long sentences 
containing many dependencies, because it used n-gram supportable contiguous sequence of n items  

from given sentences.          
  

 Figure 1. Illustration of graph components. 

 
3. Methodology 

In this study, we utilize a pre-processing and suggest a post-processing methods to achieve a 
difficulty of OCR errors. We introduce the new post-processing method of OCR-error correction for 

bar graphs utilizing several techniques, i.e., ontologies, N LP, and edit distance. Moreover, owing to 
reducing the presence of irrelevant parts, we used the pre-processing method of graph-component 
extraction proposed in (Kanjanawattana & K imura, 2016) to detect and extract the basic graph 

components, such as a legend and axis descriptions, including omitting irrelevant parts.  
We divide our methodology into two major modules as follows: 

 
3.1 Graph-component extraction 

 The main task of this module is to separate the components (Figure 1) into individual 

images. As described in our previous study, the method was designed based on a generality of the 
graph structure. There were two distinct parts in this module: axis description extraction and legend 

extraction. For axis descriptio n extraction, we used a horizontal and vertical partitioning process to 
get X- and Y-titles respectively. Moreover, we utilized a pixel projection to investigate locations of 
peaks of each horizontal profile in order to omit irrelevant parts from the partitioned components. 

Note that the height of peak denotes how many pixels containing significant information exist. The 
location of the first peak was discarded to obtain a cleaned X-title because the first peak represents 

an internal part of a bar or data measurement. Likewise, we kept the first peak and omitted the rest 
to obtain a cleaned Y-title. Regarding legend extraction, there were five step s that used to extract 
the legend: data preprocessing, data transformation, clustering process, discrete Fourier 

transformation (DFT) process, and classification process. The main purpose of our previous study 
was to estimate a suitable Epsilon for DBSCAN to identify a legend position in graphs. Data 

preprocessing was used to eliminate unrelated areas, such as an axis title region and a left-and-
bottom region. Data transformation was used to rescale the inputs to a smaller size using average 


S. Kanjanawattana, M. Kimu ra - Nov el  Ontol ogi es -bas ed Optical  Character Recogni ti on-error Correcti on Cooperati ng 

wi th Graph Componen t Extracti on 
 

73 

 
subsampling, including transforming to the datasets of XY-coordinate corresponding to the 
locations of the existed pixels. C lustering process is the main part of our previous study. We used 

DBSCAN that typically requires two parameters: MinPts and Epsilon. To estimate a suitable 
Epsilon for each graph image, we utilized an idea of region density. In each quarter, we obtained an 

area with the highest density. Afterward, we investigated a candidate point of each area, which had 
the furthest distance from the center of the same area but had the smallest distance measured from a 
neighbor area. Finally, we calculated distances between selected candidate points and chose the 

shortest one; moreover, we divided the shortest distance by the image width to obtain the Epsilon. 
Then, we cropped the images scaled back to original sizes, corresponding to the clustering results. 

Next, the cropped images were applied by DFT that used to reveal image characteristics represented 
in the frequency domain. According to data construction for cla ssification, the characteristics of 
each quarter of the DFT image provided similar information; thus, we selected only a single quarter 

as input for classification. Finally, classification process was utilized to classify images with a 
legend. We applied support vector machines (SVM) to the data obtained from DFT process using 

radial basis function (RBF) kernel. In the end of this module, we obtained cleaned graph 
components, i.e., axis descriptions and legends.   

 
 3.2 OCR-error correction 

This module is a core of this paper. On the subject of OCR-error correction, we utilize 

ontologies to solve OCR problems. Moreover, we integrate an edit distance and NLP to our 
correction system, because we realize how useful of sentence’s context to predict unknown or 
misspelling vocabulary items based on a word suggestion from the edit distance. We create our 

ontology supporting results of parsed sentences, i.e., part of speech (POS) tags and sentence 
dependencies, as well as Named-entity recognition (NER) queried from DBpedia.  

For this module, we divide procedures into three steps.  
 
3.2.1 Candidate selection 

 
Figure 2. Steps of candidate selection.    

 
The input of this module is the cleaned graph components acquired from the previous 
module. We apply OCR to them and obtain OCR results. We utilize the edit distance technique to 

measure word distances and rank them in ascending score. Then, we selected the top five words as 
candidates to be used to replace incorrect OCR results (Figure 2). We collect only five words 
because this quantity is reasonable for utilization and resource management. For example, a graph 

image contains a word “well” at its X-title that is incorrectly rendered as “woll”. Our system can 
select top five candidates ordered by ascending distance scores as follows: welt, will, wall, well, 

and with. Obviously, if the number of candidates is too small (e.g., one or three), we definitely miss 


BRAIN. Broad Res earch i n Arti fi ci al  Intelli gence and Neuros ci ence 

Vol ume 7, Is s ue 4, November 2016, ISSN 2067-3957 (onli ne), ISSN 2068 - 0473 (pri nt) 

 
 74 

a correct word “well”, which also appears on the list. Moreover, a high quantity of candidates 

causes unnecessary processes. Their distances are calculated between tokens from OCR results and 
the other from corresponding caption or paragraphs. Note that the distance inversely varies to a 
similarity. A higher distance represents a smaller similarity and vice versa. The selected top five 

candidates for each OCR token are stored into the list of candidates in ascending order of distance 
scores. Finally, we acquire the OCR results, including their lists of candidates.  

 
3.2.2 Ontology design and creation 
To fully support o ur OCR-error correction, we need to create our own ontology following 

the design, illustrated in Figure 3, which includes four entities (i.e., Word, TagCategory, 
PartO fSpeech, and PartTypeCategory classes) and several object properties (e.g., belong_to, 

has_type, and depend_on). The Word entity represents every individual token from captions and 
cited paragraphs of the images used in this study. The TagCategory gathers category names or NER 
attached to each token, such as person name, location, and animal, by querying DBpedia via its 

SPARQL endpoint. Furthermore, we use Stanford Named Entity Recognizer (Stanford NER) to 
identify the category of the tokens organized into seven categories, i.e., Location, Person, 

Organization, Money, Percent, Date, and Time. The PartO fSpeech collects POS tagging of each 
token. For this entity, the total number of individuals is fixed at 36 instances, whose names are from 
Penn treebank nodes, such as CC, VB, and NNP. The PartTypeCategory represents groups of POS 

taggings. For example, a singular proper noun indicates NNP belonging to the Noun group. 
Regarding properties in our ontology, we design several properties, which states relations among 

entities. The same_as represents the relations of at least two synonymous tokens that are stored in 
the Word entity. For example, “Japan” and “N ihon” are synonyms that refer to the same concept. 
Our ontology covers the synonyms expressing the same concept. Moreover, the depend_on property 

is a crucial property, because it presents dependency relationships between paired tokens parsed 
from sentences in the captions and the paragraphs. The number of sub-properties of depend_on 

relations is fixed at 67 properties representing typed dependencies, such as conj, dep, and nsubj.   
To prepare individuals for our ontology, entire sentences included in the captions and 

paragraphs are tokenized into tokens. Afterward, we utilize a dependency parser (Stanford parser)  

to analyze the sentences in order to obtain their dependencies, POS tags, and NER classes. As 
mentioned above, NER classes are obtained by the parser and the SPARQL query processed in 

DBpedia. All prepared data are gathered as instances of our ontology.           
 
 
Figure 3. Demonstration of our ontology structure describing entities, properties, and relations. 

 
S. Kanjanawattana, M. Kimu ra - Nov el  Ontol ogi es -bas ed Optical  Character Recogni ti on-error Correcti on Cooperati ng 

wi th Graph Componen t Extracti on 
 

75 

 
3.2.3 Error correction 
The main purpose of this step is to correct the OCR errors using the ontology and the lists of 

candidates from the previous steps. The basic idea is that the tokens from graph components should 
appear in corresponding captions or cited paragraphs because authors generally explain information 

based on the graphs in their documents.   
Initially, we begin to create a dependency dictionary, called DepDic that records the chain 

dependencies of the tokens. This dictionary is created, if at least one OCR token identically matches 

to the first candidate in its own list, and the token is used as a head of dependency chains. As an 
example shown in Figure 4, we obtain a word “Information” from the graph’s legend. Suppose that 

it can be found in its caption, and OCR provides a correct recognition. A parser provides results as 
typed dependencies based on a caption, including POS tags and NERs. We acquire a dependency 
chain of “Information” that includes “Sources,” “of,” “used,” “Physicians,” “Pakistani,” and “by.” 

This chain is recorded into DepDic.  
To cover all possible situations for correcting OCR errors, we divided our method into four 

major conditions, as presented in Figure 5.  
 

Figure 4. Example of grammar dependency parsing, including POS tags and typed dependencies,  
and NER classes of each token. 

 
For the first condition, we focus on eliminating recognition noises from our inputs, such as 

special characters and numbers. This condition is used to filter unused cha racters. We consider 
number characters as recognition noises because our main targets in this study are the axis 

descriptions and the legend, which generally describe in alphabet rather than numeric characters. 
Moreover, we ignore escape characters (e.g., /, <, and *), because they are reserved characters of 
SPARQL. 

 
BRAIN. Broad Res earch i n Arti fi ci al  Intelli gence and Neuros ci ence 

Vol ume 7, Is s ue 4, November 2016, ISSN 2067-3957 (onli ne), ISSN 2068 - 0473 (pri nt) 

 
 76 

 
Figure 5. Demonstration of OCR-error correction covering possible conditions to filter and correct errors.  

 
The second condition is whether the OCR result finds an exact match in its own list. Our 

system examines the similarity between the OCR result and the first word of its list whose distance 
is minimal. Typically, the paired words are identical, if the distance score is equal to zero. If this 
condition is satisfied, a replacement is unnecessary due to the accurate OCR result obtained. 

Afterward, we collect the result to a mapping list, namely NewWordMap used to store the OCR 
results and their new replacements.         

The third condition is whether OCR provides a correct recognition that matches nothing in 
its own list. As aforementioned, our basic idea is that the descriptions of graph components should 
correspondingly appear in either caption or paragraphs. Unfortunately, the descriptions possibly 

appear nowhere in the document. Under this situation, we obtain the list of candidates with high 
distances, which has a low chance to find a match from the list. This condition handles the problem 

of missing matching tokens by analyzing three minor conditions.  
Condition 3a is whether the first candidate of the list has been found in DepDic. C learly, the 

first candidate of the list contains the highest chance of matching because of the lowest distance 

score provided. Moreover, if the candidate is discovered in DepDic, its chance to be selected as  a 
new replacement should be increased. Therefore, if this minor condition is satisfied, our method 

suggests the first candidate of the list as a new replacement, because not only the smallest distance 
score is provided, but it also appears in the same chain of dependency. Condition 3b is processed to 
check whether the OCR result is actually existed by querying WordNet. Condition 3c has a 

procedure similar to Condition 3b, but it differs in using DBpedia instead of WordNet. If these two 
minor conditions are satisfied, we receive no null values returned from their SPARQL endpoints, 

and the new replacement is not needed. Regards an order of Condition 3, we normally apply these 
conditions following by this order, Condition 3a, 3b, and 3c respectively. However, if the distance 
score is over than a threshold, the order of the condition is switched to the following order, i.e., 3b, 

3c, and 3a.  
We reorder the conditions because we need to prevent errors resulting from the length of 

OCR result that is lower than a threshold, especially for two or three characters. The short-length 
words generally represent as prepositions (e.g., “in,” “on,” or “at”), conjunctions (e.g., “so” or “as”) 
and abbreviations (e.g., POS and NLP). Every sentence regularly includes at least one preposition 

or conjunction, since the short-length words are ordinarily stored in DepDic. Based on this 
explanation, they may be assigned as an incorrect replacement accidentally. For example, we obtain 

a word “is” from OCR, and the first candidate of its list is “on” already recorded in DepDic. A 


S. Kanjanawattana, M. Kimu ra - Nov el  Ontol ogi es -bas ed Optical  Character Recogni ti on-error Correcti on Cooperati ng 

wi th Graph Componen t Extracti on 
 

77 

 
distance score of them is only two, but their appearances are totally different. Following the original 
order, the word “is” is wrongly assigned by the word “on” as a new replacement. In order to reduce 

a chance to encounter this situation, a rearrangement of condition orders has been suggested. 
According to the new condition order, we obtain a correct replacement as the word “is” itself, 

because it has been found in WordNet.           
The last condition is Condition 4 separately processed based on types of components. 

Initially, our system checks on NewWordMap list. If the list is not null, Condition 4a and 4b are 

processed. Otherwise, Condition 4c is operated. A basic idea of this condition is that X-title and 
legend should be described by a corresponding category. For example, a financial bar graph 

contains an X-title, a Y-title, and a legend.  The X-title describes product names, which are in the 
same category (e.g., apple, orange, or banana).  

The NewWordMap is available if there is at least one word stored. Condition 4a is whether 

the POS and NER of the first candidate of the list corresponded to both of a value stored in 
NewWordMap. To satisfy this condition, we check the POS tag and NER of the first candidate of 

the list and the POS tag and NER of the word stored in the NewWordMap. If their POS and NER 
are consistent, we obtain a new replacement. Condition 4b is similar to Condition 4a. If either POS 
tag or NER has been matched, we also flexibly accept the first candidate of the list as the new 

replacement. Condition 4c is operated if the NewWordMap is unavailable or null. We cannot find 
any comparison from the list; thus, we introduce another solution that utilizes only the list of 

candidates. This condition is whether any candidate of the list contains the minimum score which 
sums up from both the edit distance score and a POS tagging score. Regards the POS tagging score, 
we assign a score to each POS tag depended on the priorities of word replacement selection based 

on our experience. The tagging scores are assigned as following: noun (score = 0), adjective (score 
= 1), verb (score = 2), article (score = 3), adverb (score = 4), preposition (score = 5), conjunction 

(score = 6), interjection (score = 7), others (score = 8) and number (score = 9). Noun provides the 
highest chance to appear at the X-title or the legend; since its score should be minimum. We select 
the replacement assigned by the smallest score, which basically comes from a summation of the 

smallest distance score and Noun tagging score.  
Y-title is described as a sentence or a noun phrase that is different from the X-title or the 

legend. Tokens from Y-title connect to other tokens by their dependencies; therefore using DepDic 
should be an appropriate option for selecting the most similar word in the list as a new replacement. 
Condition 4d is whether any word in the list appeared in DepDic. Every candidate in the list is 

iteratively explored in the list of DepDic until a match is retrieved and is selected as the 
replacement. 

Otherwise, if we cannot obtain any new replacement from the above conditions, the OCR 
tokens are used as their own replacements.       
 

4. Experime nts and results 
 We conducted experiments to evaluate our method. We divided them into four tests as 

presented in Table 1. To evaluate the first module, we compared results obtained from a tradition 
method and the first module of our method. The traditional method, namely image partition method, 
extracted the X- and Y-titles by image partitioning similar to our method, but an idea to extract the 

legend was different. It extracted the legend by cropping all possible areas where located the 
legend, such as the top and right side of the image, including irrelevant or relevant parts. A 

comparison between Experiment 1 and 2 revealed the significant experimental results expressing 
the performance of the different graph component extraction methods. To evaluate the second 
module, we observed the results from Experiment 1 and 3 to compare the performance between the 

edit distance technique and our OCR-error correction. The performance of our study was 
represented in Experiment 4, which was a combination of module 1 and 2. 

Several performance rates (i.e., accuracy, precision, recall, and F-measure) were evaluated 
in this study. The accuracy is a statistical measurement to identify how well a method tests 


BRAIN. Broad Res earch i n Arti fi ci al  Intelli gence and Neuros ci ence 

Vol ume 7, Is s ue 4, November 2016, ISSN 2067-3957 (onli ne), ISSN 2068 - 0473 (pri nt) 

 
 78 

correctly. A higher accuracy rate represents to the consistency of predicted values which are same 

as given values. The precision is the measurement of given data to present how many outputs are 
positively classified. The recall is to define how well the outputs cover the positives. F-measure is 
an averaged comb ination of precision and recall. Noise ratio is an evaluated measurement to 

identify how much recognition noises are produced by the system, such as numbers and special 
characters. The overall measurement results are summarized in Figure 6 and 7.  

   
Table 1. Settings of our experiments.  
Experiment Method of graph component extraction Method of OCR-error correction 

1 Image partition method Edit distance 
2 Our first module  Edit distance 

3 Image partition method Our second module  
4 Our first module  Our second module  

 
As Experiment 1, it was a combination of the image partition method and edit d istance. It 

was said to be a fundamental idea to acquire the information from graphs and correcting OCR 
results. As the results, all performance rates were presented the lowest values, except the noise 
ratio. The noise ratio was up to 29.48% that was the maximum ratio comparing to other 

experiments. However, after we examining the noise ratio from Experiment 2 which was a 
combination between our graph component extraction and the edit distance, we realized that our 

first module could efficiently handle the noises of irrelevant parts better than the image partition 
method because the noise ratio obviously presented a lower rate, 19%. Moreover, the accuracy and 
F-measure were increased to 57.28% and 50.54% respectively, whereas the performance rates of 

Experiment 1 were only 46.98% and 39.77%. Experiment 3 was a combination of the image 
partition method and our OCR-error correction. All performance rates were dramatically improved 

comparing to the first experiment. The accuracy was up to 80.75%, and the F -measure reached to 
82.28%. For Experiment 4, we combined our first and second modules proposed by this study. The 
performance was better than others. We obtained the highest accuracy rates, 84.23%, and F -

measure, 86.02%, including less of recognition noises.  
We obtained the number of errors 249 tokens from total 1579 tokens in Experiment 4. We 

analytically observed causes of errors separated into three types: missing token error, real- word 
error, and suggestion error. The missing token error presents the number  of tokens unable to extract 
from the graph. The real-word error represents the error of misrecognition but accidently matches to 

a vocabulary item in a dictionary. The suggestion error means the error from our system suggesting 
an incorrect result. To rea lize a portion of errors, the percentages of each error proportioned to the 
total number of errors were presented as follows: 27.71% for the missing error, 37.75% for the real-

word error, and 34.54% for the suggestion error. Clearly, among the errors obtained during the 
experiment, the real-word error and the suggestion error needed to be concerned and mitigated.  

 
S. Kanjanawattana, M. Kimu ra - Nov el  Ontol ogi es -bas ed Optical  Character Recogni ti on-error Correcti on Cooperati ng 

wi th Graph Componen t Extracti on 
 

79 

 
Figure 6. Illustration of accuracies and noise ratios of all experiments.  

 
Figure 7. Illustration of precision, recall and F-measure of all experiments. 

  
Moreover, we investigated the number of missing tokens for Experiment 3 and 4. Note that 

the total number of tokens, which should be able extracted by OCR, was 1165. In Experiment 3, the 
missing tokens were 151 tokens or 13% of total tokens missing. Meanwhile, in Experiment 4, we 
obtained a missing tokens only 69 tokens (6.92%). Apparently, the missing tokens were decreased, 

if we applied our data to the graph component extraction.  
Figure 8 presents accuracy rates of all conditions proposed in our OCR-error correction. 

Condition 1 used to detect and omit the recognition noises provided the 100% correction. Due to 
our effective graph component extraction, the accuracy rate of Condition 2 reached 99.15%. 
Condition 3 presented a reasonable accuracy rate about 81.11%; therefore, using ontologies to 

investigate a meaning of a word was also appropriate. However, the lowest accuracy was appeared 
at Condition 4, 29.47%.  

Furthermore, we statistically calculated the significant difference between resu lts from 
Experiment 2 and 4 by McNemar's test. This tool uses for statistically testing on paired nominal 


BRAIN. Broad Res earch i n Arti fi ci al  Intelli gence and Neuros ci ence 

Vol ume 7, Is s ue 4, November 2016, ISSN 2067-3957 (onli ne), ISSN 2068 - 0473 (pri nt) 

 
 80 

data to examine a significant change from two sets of data obtained before and after treatments. In 

our case, the before treatment referred to Experiment 2 using the edit distance, and the after 
treatment was our OCR-error correction in Experiment 4. Note that we ignored results corrected by 
Condition 1 to stable the statistical data because the edit distance cannot handle the noises. We 

calculated a two-tailed probability value (P value), which used to determine to accept or reject a 
null hypothesis. The small P value represents a significant difference between two sets of data. The 

two-tailed P value is less than 0.0001 that means the results from both experiments are considered 
to be extremely statistically significant.       

 
Figure 8. The number of tokens, including accuracy rates of each condition. 

 
5. Discussion 
 We proposed a new method of OCR-error correction based on bar graph images using 
semantics. We improved our idea by using a graph component extraction proposed in our previous 

study to reduce irrelevant parts from extracted components before applying to OCR. We aimed to 
advance the quality of OCR results and proposed an effective method to accurately detect and 

extract graph components. We conducted four tests to evaluate the performance of our method 
(Table 1). We calculated several performance rates, i.e., accuracy, precision, recall, F -measure, and 
noise ratio. The experiment representing our method was Experiment 4.  

 As presented in Figure 6 and 7, Experiment 1 provided the lowest performance rates, 
including the highest noise ratio; thus, a combination of fundamental methods (i.e., the image 

partition method and edit distance) is inappropriate to solve this study’s problems. For Experiment 
2, we observed that the noise ratio reduced around 10% from the first experiment. Moreover, the 
accuracy and F-measure had 10% increased. This positive situation happened, because we changed 

the image partition method to our first module. Obviously, the performances are improved, if the 
noiseless data are used.  

Typically, during a recognition process of OCR, it analyzes many objects inside images that 
sometimes are not character strings. They may le ad OCR misunderstanding because they interfere 
the recognition process and mainly cause errors; hence, our data cleaned by our graph component 

extraction definitely enhance the performance of the system.  
Experiment 3 provided high accuracy and F-measure: 80.75% and 82.28% respectively. To 

compare between Experiment 1 and 3, the results showed that our OCR-error correct dominantly 
affects the performance of the system, because the accuracy rate and F - measure of Experiment 3 
were much greater than Experiment 1 about 33.77% and 42.51% respectively.  

The performance of our graph component extraction supports the system to reduce noises 
around 10% as resulted in Experiment 1 and 2. Moreover, according to the different performance of 

Experiment 1 and 3, it is dra matically increased to around 38% averagely. This evidence contributes 
the quality of our second module or OCR-error correction that is much better than the edit distance 
in a case of suggesting the correct tokens because ours can handle a limitation of ed it distance.  


S. Kanjanawattana, M. Kimu ra - Nov el  Ontol ogi es -bas ed Optical  Character Recogni ti on-error Correcti on Cooperati ng 

wi th Graph Componen t Extracti on 
 

81 

 
As mentioned above, the basic idea of edit distance is to use two tokens compared to 
measure their distance. The comparable tokens are selected from OCR results and the captions or 

paragraphs. In a case of edit distance providing a correct suggestion, the tokens from graphs are 
required to have a match to tokens in the captions or paragraphs. However, the tokens may not be 

mentioned in any contents of documents because of two reasons. First, they are general words that 
should be known by readers. Thus, it is unnecessary to explain them. Second, they may not directly 
relate to their topics or studies. Based on this story, the edit distance cannot guarantee the correct 

suggestion, particularly in the situation of no matched tokens in the captions and paragraphs. On the 
other hand, our study provided the good quality of OCR-error suggestion which utilized the 

ontologies to check the meaning of tokens; therefore, it properly mitigates the shortcoming of edit 
distance. We also obtain a correct sugges tion, even if we cannot find any match in the captions or 
paragraphs.  

As the results of Experiment 4, we acquired high accuracy and F-measure: 84.23% and 
86.02% respectively; in addition, the noise ratio was decreased comparing to Experiment 1, 

19.38%. The accuracy and F- measure of our method were significantly higher than the first 
experiment about 37.25%and 46.25% respectively. The main purpose of this experiment is to prove 
the performance of the combination of our methods, i.e., our graph component extraction and OCR-

error correction. An important implication of these findings is that a cooperation of our graph 
component extraction and OCR-error correction are supportive each other because the performance 

also improves comparing to other experiments.   
We endeavored to compare the results from Experiment 4 to a state-of-art study. Zhuang et 

al. (Zhuang & Zhu, 2005 ) presented a remarkable study to correct OCR errors based on semantics 

similar to ours. Their results reported that, after we compared between their method and a basic 
method, an error reduction was 29%, and the accuracy was 83.73%. Likewise, we analyzed the 

error reduction comparing between our method and the edit distance. Our error reduction was about 
27%, and our accuracy was slightly higher than the previous study, 84.23%. Eventually, the idea to 
use the semantics to mitigate the OCR problem was agreeable, because, to compare to non-semantic 

methods, the higher performance were presented from our and the previous studies; furthermore, 
the results were corresponding.   

To analyze the errors obtained from Experiment 4, there were three types of errors occurred 
during the experiments, i.e., the missing error, the real- word error, and the suggestion error. As the 
results of Experiment 4, the highest proportion of errors was the real- word error. This error happens 

when the OCR incorrectly recognizes tokens, but it has been accidently found in ontologies. The 
possible solution is to use a specific ontology. The ontologies currently used in this stud y are 

general ontologies related to English vocabularies. Since, an opportunity to get incorrect suggestion 
should be reduced if we use the specific ontology rather than the general ones because the suggested 
results relate to a domain of the graph. For example, if a graph relates to biology; thus the biology 

ontology should be utilized. The next error that we should concern is the suggestion error. It occurs, 
if the recognized token is incorrect and not found in the ontologies; hence, our system suggests t he 

most similar token which is mostly incorrect, because there is not an identical token appearing in 
the captions and the paragraphs. O ne possible solution to this problem is to  extend the content of 
documents to increase a probability to find an identica l token, such as using whole contents of the 

document, not limited to only the captions or corresponding paragraphs. However, it is time-
consuming, because there are a lot of tokens that need to measure the distances, including querying 

to ontologies. For the missing error, it happens because of the limitation of OCR and a mistake of 
partitioning process of our graph component extraction. OCR sometimes cannot recognize any 
characters, if the font size is too small or too big; since OCR returns a null value,  which causes a 

missing token. Furthermore, we partitioned images by a constant; thus, it is possible to cut them at 
wrong positions.  For example, we have a graph containing a two-sentence Y-title. The system may 

mistake to cut at the middle of the title. We retrieve an incomplete title that causes missing token 
errors.        


BRAIN. Broad Res earch i n Arti fi ci al  Intelli gence and Neuros ci ence 

Vol ume 7, Is s ue 4, November 2016, ISSN 2067-3957 (onli ne), ISSN 2068 - 0473 (pri nt) 

 
 82 

To deeply analyze the results of each condition as presented in Figure 8, Condition 1 offered 

the 100% correction. It means that our system has a high capability to detect and omit t he 
recognition noises. For Condition 2, we obtained the high accuracy because of the benefit of our 
graph component extraction. It extracts the relevant components, which help to enhance the OCR 

performance to correctly recognize character strings. The reasonable rate is presented in Condition 
3. This condition is proved that a viewpoint to use grammar dependencies and ontologies is 

acceptable. However, most errors occurred in this condition are the real- word error. The final 
condition is Condition 4 suggesting the lowest accuracy rate comparing to among conditions. The 
suggestion errors have been mostly found in this condition. The causes and solutions of these errors 

have been described above.  
As the statistical evidence, we concluded that the difference between our OCR-error 

correction and the edit distance was the significant difference due to a small P value obtained.    
 

6. Conclusion 

In this study, we proposed the method of OCR-error correction based on semantics, 
including the graph component extraction. The objectives of this study were to extract their 

significant information by using OCR and to correct OCR results by utilizing the ontologies and 
dependency parsing.    

From the experiments that have been carried out, it is possible to conclude that o ur methods 

achieve the objectives of this study. As the experimental results, our method presented the highest 
performance rates greater than other methods, and the noise ratio was small. S imultaneously, the 

fundamental method showed the highest noise ratio and the lowest performance rates, because 
many irrelevant objects were included in the input data, which interfered the recognition process. 
Furthermore, the edit distance suggested a correction based on strings appearing in the captions and 

paragraphs; therefore, suggestion errors certainly occurred, if identical tokens were missing from 
them. The OCR-error correction introduced in this study contained four conditions to cover all 

possible situations that might occur during the correction processing. It has clearly shown that our 
method can perfectly handle and omit the recognition noises, such as numbers and special 
characters. Moreover, due to our effective graph component extraction, the cleaned components 

omitted irrelevant parts were acquired. Indeed, with this data, OCR provided accurate recognitions. 
Using the semantics to correct OCR errors is also appropriate because it mitigates the problems of 

missing identical tokens in the captions or paragraphs.   
The next stage of our research will be moved to graph-content information extraction, such 

as a height of a bar and a tendency of a line. We will design a new ontology to support extractable 

graph information and to utilize other ontologies in order to reveal latent information.  
 

         References  
Lasko, T. A., & Hauser, S. E. (2000). Approximate string matching algorithms for limited -

vocabulary OCR output correction. In Photonics West 2001-Electronic Imaging. International 

Society for Optics and Photonics, 232-240. 
Chen, D., Odobez, J. M., & Bourlard, H. (2004). Text detection and recognition in images and video 

frames. Pattern recognition, 37(3), 595-608. 
Jobbins, A., Raza, G., Evett, L., & Sherkat, N. (1996). Postprocessing for ocr: Correcting errors 

using semantic relations. In LEDAR. Language Engineering for Document Analysis and 

Recognition, AISB 1996 Workshop, Sussex, England.  
Nagata, M. (1998, August). Japanese OCR error correction using character shape similarity and 

statistical language model. In Proceedings of the 17th international confer ence on 
Computational linguistics-Volume 2, 922-928.  

Kataria, S., Browuer, W., Mitra, P., & Giles, C. L. (2008). Automatic Extraction of Data Points and 

Text Blocks from 2-Dimensional Plots in Digital Documents. In AAAI, 8, 1169-1174. 


S. Kanjanawattana, M. Kimu ra - Nov el  Ontol ogi es -bas ed Optical  Character Recogni ti on-error Correcti on Cooperati ng 

wi th Graph Componen t Extracti on 
 

83 

 
Huang, W., Tan, C. L., & Leow, W. K. (2005). Associating text and graphics for scientific chart 
understanding. In Eighth International Conference on Document Analysis and Recognition 

(ICDAR'05). IEEE, 580-584. 
Tong, X., & Evans, D. A. (1996). A statistical approach to automatic OCR error correction in 

context. In Proceedings of the fourth workshop on very large corpora, 88-100. 
Wick, M., Ross, M., & Learned-Miller, E. (2007). Context-sensitive error correction: Using topic 

models to improve OCR. In N inth International Conference on Document Analysis and 

Recognition (ICDAR 2007). IEEE, 2, 1168-1172.  
Bassil, Y., & Alwani, M. (2012). Ocr post-processing error correction algorithm using google online 

spelling suggestion. arXiv preprint arXiv:1204.0191.  
Zhuang, L., & Zhu, X. (2005). An OCR post-processing approach based on multi-knowledge. In 

International Conference on K nowledge-Based and Intelligent Information and Engineering 

Systems. Springer Berlin Heidelberg, 346-352.  
Kanjanawattana, S., & K imura, M. (2016). A proposal of a method to extract and identify 

components in bar graphs. International Journal of Computer Theory and Engineering 
(IJCTE). Forthcoming 2016.  

Alvarez, J. M., Gevers, T., LeC un, Y., & Lopez, A. M. (2012, October). Road scene segmentation 

from a single image. In European Conference on Computer Visio. Springer Berlin Heidelberg, 
376-389.  

Cheng, C., Koschan, A., C hen, C. H., Page, D. L., & Abidi, M. A. (2012). Outdoor scene image 
segmentation based on background recognition and perceptual organization. IEEE 
transactions on image processing, 21(3), 1007-1019. 

Hiran, K. K., & Doshi, R. (2013). An artificial neural network approach for brain tumor detection 
using digital image segmentation. Brain, 2(5).