

UHD Journal of Science and Technology | Jan 2023 | Vol 7 | Issue 1 43

1. INTRODUCTION

Text processing has drawn the interest of numerous scholars as digital texts have become increasingly prevalent in modern life. Nevertheless, research on Kurdish text processing remains scarce compared with the significant efforts devoted to widely studied languages such as English, Persian, and Arabic.

Linguists commonly divide the world's languages into families, including Indo-European, Sino-Tibetan, Niger-Congo, Austronesian, and several others. The Indo-European family is the largest; its languages are spoken across most of Europe, the lands to which Europeans migrated, and a large portion of South-west and South Asia. This family is divided into sub-families [1]. The Kurdish dialects belong to the north-western branch of the Indo-Iranic language family. Kurdish is an independent language with its own linguistic continuum, historical origins, grammar rules, and a rich living linguistic tradition. It originated from the “Median” or “Proto-Kurdish” language. Approximately 30 million people in Kurdistan, the highlands of the Middle East, speak numerous dialects of Kurdish [1].

Kurdish is described as a dialect continuum, which means that it comprises a variety of dialects. It has four primary dialect groups with sub-dialects, including (Kurmanjí or Kurmanji Zhwrw, and Badínaní) in the north of Kurdistan and Sorani or Kurmanji Khwarw in the center

Kurdish Kurmanji Lemmatization and 
Spell-checker with Spell-correction
Hanar Hoshyar Mustafa, Rebwar M. Nabi
Technical College of Informatics, Sulaimani Polytechnic University, Sulaimani, Kurdistan Region, Iraq

ABSTRACT
Many studies address lemmatization and spell-checking with spell-correction for English, Arabic, and Persian, but only a few address low-resource languages such as Kurdish, and the Kurmanji dialect in particular, which increases the need for such systems. Lemmatization is the process of determining a base or dictionary form (lemma) for a specific surface form, whereas spell-checkers and spell-correctors determine whether a word is correctly spelled and correct a range of spelling errors, respectively. This research presents a lemmatization system and a word-level error correction system for the Kurdish Kurmanji dialect, which are, to the best of our knowledge, the first tools for this dialect. The proposed lemmatization approach is built on morphological rules, and a hybrid approach that relies on the n-gram language model and the Jaccard similarity coefficient is applied to the spell-checker and spell-correction. As detailed in this article, lemmatization achieves accuracy rates of 97.7% and 99.3% for nouns and verbs, respectively. Furthermore, the spell-checker and spell-correction attain accuracy rates of 100% and 90.77%, respectively.

Index Terms: Kurdish Language, Kurmanji Dialect, Kurdish Lemmatizer, Kurdish Spell-checker and Spell-correction, 
Kurdish Dataset

Corresponding author’s e-mail:  Hanar Hoshyar Mustafa, Technical college of Informatics, Sulaimani Polytechnic University, Sulaimani 46001, 
Kurdistan Region, Iraq. E-mail: hanar.hoshyar.m@spu.edu.iq

Received: 23-10-2022 Accepted: 05-01-2023 Published: 22-02-2023

ORIGINAL RESEARCH ARTICLE | UHD JOURNAL OF SCIENCE AND TECHNOLOGY

Access this article online

DOI: 10.21928/uhdjst.v7n1y2023.pp43-52 E-ISSN: 2521-4217
P-ISSN: 2521-4209

Copyright © 2023 Mustafa and Nabi. This is an open access article 
distributed under the Creative Commons Attribution Non-Commercial 
No Derivatives License 4.0 (CC BY-NC-ND 4.0)



Mustafa and Nabi: Kurdish Lemmatizer and Spell Corrector

of Kurdistan (Sulaimani and Mukrayani). Kurmanji and Sorani are the two main dialects [2]. The other two important divisions of the Kurdish language are Goraní (Hawrami, Zazayee, and Shabak) and Luri (Mamasani, Kurmanshani, and Kalhuri). These are further categorized into dozens of dialects and sub-dialects [3]. This paper focuses on the Northern Kurdish dialect (Kurmanji, or Kurmanji Zhwrw), which has the largest number of speakers among the Kurdish dialects [4]. Several studies have addressed widely used languages such as English [5], [6], Arabic [7]-[9], and Persian [10]-[12]. A few studies have also been completed for the Kurdish language [13], [14]; nevertheless, a large gap remains for the Kurdish Kurmanji dialect. This study therefore aims to fill that gap by creating a lemmatization system and a spell-checker with spell-correction for Kurmanji. In the future, this work can be used in applications such as data translation, sentence retrieval, and document retrieval, and it can be extended and upgraded into more powerful systems of the same kind.

This study presents a toolkit that consists of a lemmatization system and a spell-checker with spell-correction for Kurdish Kurmanji. The aim of lemmatization is to find a root or dictionary form (called a lemma) for a specific surface form. It is crucial to be able to normalize words into their most basic forms, particularly for languages with rich morphology such as Kurdish, to better support applications such as search engines and linguistic case studies.

Spell-checking is one of the most common applications of a lemmatizer. Using a spell checker, the system offers a ranked list of suggested corrections for each possibly incorrect word.

This study presents a combined algorithm, an n-gram language model together with the Jaccard similarity coefficient, for the spell-checker and spell-correction system. Furthermore, a rule-based method built on Kurdish Kurmanji morphological rules is used to create the lemmatization system.

Based on the literature and to the best of  our knowledge, 
no study has been conducted regarding the spell-checking 
and lemmatization systems in Kurdish Kurmanji Dialect. 
Therefore, our study can be the base for further studies for 
Kurdish Kurmanji dialect.

2. RELATED WORK

A large amount of research has been conducted on word lemmatization, spell-checking, and spell-correction in several widely used languages, such as English, Persian, and Arabic. For the Kurdish language, however, a large gap can be observed, especially for lemmatization and spell-checking with spell-correction in the Kurdish Kurmanji dialect.

For English, the Lemma Chase lemmatizer [5] was created to address the problems of the most widely used lemmatizers currently available. Its model accounts for nominalized and derived terms for which no lemmatizer currently in use is able to produce the proper lemmas. Identifying the morphological structure of any input English word, and in particular understanding the structure of derivational words, is the main issue in developing a lemmatizer. Finding the derivational suffix of a derived word and then extracting the dictionary base word from it is another difficult problem. Well-known and widely used lemmatizers fail to retrieve the base words of some derived terms. Lemma Chase accurately retrieves the base word by taking into account the word's part of speech and several classes of suffix rules, and by applying recoding rules with the WordNet dictionary. It successfully constructs the base form for all of the derivational and nominalized word forms that are present in any standard English dictionary.

In addition, there have been numerous studies on lemmatization in Arabic. For instance, “Build Fast and Accurate Lemmatization for Arabic” [7] covers the need for quick and precise lemmatization to improve Arabic Information Retrieval (IR) outcomes and the difficulty of developing a lemmatizer for Arabic, which has a rich and complex derivational morphology. It introduces a new data set that can be used to verify lemmatization accuracy, as well as a lemmatization algorithm that works more accurately and quickly than current Arabic lemmatization techniques.

Numerous studies have been published on the use of  spell 
checkers and spell correction in Persian as well. For example, 
Automated Misspelling Detection and Correction in Persian 
Clinical Text [10] is an article that explains the creation of  
an automatic method for identifying and fixing misspellings 
in Persian free texts related to radiology and ultrasound. 



Three distinct types of free text, from abdominal and pelvic ultrasound, head-and-neck ultrasound, and breast ultrasound reports, are processed with an n-gram language model to accomplish this aim. For free texts in radiology and ultrasound, the system obtained a detection performance of up to 90.29% with a correction accuracy of 88.56%. The findings suggested that clinical reports can benefit from high-quality spelling correction. The system also yielded significant cost reductions throughout the documentation and final approval of the reports in the imaging department.

Kurdish stemmer pre-processing for improving information retrieval was studied in [13]. That article introduces the Kurdish stemming-step method, which links search phrases and indexing terms in Kurdish texts that are related by morphology. The occurrence of words was shown to play a supportive role in the classification process. Although some errors were expected because of the complexity and difficulty of words in the Kurdish Sorani dialect, the handling of similarity changes was implemented, which helped to boost matching among words and decrease storage requirements, and the stemmer was capable of resolving most of these issues. Kurdish Sorani writings contain many stop words with added affixes; by combining these commonly occurring stop words, they can be stemmed. In addition, it was determined that employing partial words during the pre-processing stage was preferable.

Likewise, a lemmatizer and a spell-checker for Kurdish Sorani were presented in [14]. That study presented a lemmatization and word-level error correction system for Kurdish Sorani and suggested a hybrid strategy based on n-gram language modeling and morphological principles. The lemmatization and error detection systems are referred to as Peyv and Renus, respectively. The Peyv lemmatizer is built on morphological rules, while Renus corrects words both with and without a lexicon. The study indicates that these two basic text processing tools can pave the way for more research on additional natural language processing applications for Kurdish Sorani.

Finally, an intensive literature search found no studies that consider the Kurdish Kurmanji dialect. Therefore, this article's primary goal is to propose a lemmatizer and a word-level spell checker with correction for Kurmanji. The benchmark for this paper is [14]. Although the two studies use different algorithms in the spell-correction tool, their lemmatization tools use nearly the same methods and approaches, and both studies suggest a hybrid strategy based on an n-gram language model and morphological principles. This study uses the Python programming language to process data and to build a word processing system that performs lemmatization and spell checking with spell correction at the word level.

3. METHODS AND DATA

This section describes dataset collection, data preparation, and the algorithms and approaches used in lemmatization and the spell checker.

3.1. Dataset Collection
A dataset was produced in order to carry out this study. It was created by reading books and articles written in the Kurdish Kurmanji dialect, from which words were manually recorded and added to the dataset. The collected words include verbs, nouns, conjunctions, stop words, pronouns, imperative words, superlative words, and question words. There are around 1200 words in the dataset. Fig. 1 depicts the share of each part of the dataset in a pie chart. The dataset is split because nouns and verbs follow different morphological rules, which affect how they are lemmatized. The third dataset has a large number of words that do not accept any affixes. It also contains a few special terms with only one or two letters; some of the conjunctions, for instance, are written with only one or two letters.

3.2. Data Preparation
The most important features that indicated that the dataset 
was ready for analysis were its unity and quality. Furthermore, 

Fig. 1. Dataset quantity pie chart.



because the dataset was collected first-hand, it can be ensured that it is clean and has no duplicates. The dataset is then divided into three subsets. The first subset includes nouns. The second subset includes verbs, while the third subset contains pronouns, stop words, conjunctions, imperative words, superlative words, and question words. Each subset is stored in a separate Excel file with two columns: the ID column and the data (word) column. The exception is the verb subset, which has four columns: ID, Chawg, Qad, and Rag. The ID column contains a unique ID for each row; the Chawg column contains the verb's base; the Qad column contains the verb's past root; and the Rag column contains the verb's present root. Table 1 presents the structure of the verb Excel sheet.

3.3. Implementation
This section describes the approaches and methods used for noun lemmatization, verb lemmatization, and the spell-checker.

3.3.1. Lemmatization
Lemmatization for nouns and for verbs is developed separately, after the fundamental morphological rules of Kurdish Kurmanji have been obtained. Noun and verb lemmatization each use a different approach based on the morphological rules. For both, a pruning method is used to find the root of the input word. Internally, each process is contained in its own module, which reduces complexity and makes the system more readable and understandable.

The following subsections describe noun and verb lemmatization in detail.

3.3.1.1. Noun lemmatization
Noun lemmatization was created after all the rules that apply to nouns in the Kurdish Kurmanji dialect had been clarified and written down. A pruning method is used in this study. The input word passes through multiple stages and processes until the system finds the proper root for the input noun, which is called a lemma in the lemmatization process.

During noun lemmatization, predefined affixes and the nouns in the dataset are used to find a proper lemma for an input noun. The only condition is that the word is entered with the correct spelling.

When a noun is entered, a search algorithm looks for it in the dataset. If the entered noun is a root without any affixes, the system determines that the input is correct and that no further processing is necessary. The output is then the base of the entered word. Fig. 2 shows the flowchart of this process.

In other cases, the entered noun has affixes attached. For these cases, three sets of affixes are defined in the noun lemmatization module. The first set includes prefixes that are written before the noun without being attached to it directly; in Kurdish Kurmanji, some prefixes are separated from the noun by a space. The second set includes prefixes that are attached directly to the beginning of the noun without any space. The last set includes suffixes, which are attached directly to the end of the noun.

The entered noun goes through multiple processes to find the root. The system first removes any prefixes which are

TABLE 1: Structure of verb-dataset
Column    Contents
ID        Data ID
Chawg     Base of verb
Qad       Past root of verb
Rag       Present root of verb

Fig. 2. Noun lemmatization first process flowchart.



attached or not attached to the word, then searches the dataset. If there is no match for the entered noun, the system assumes that suffixes may be attached as well, so the word goes through another process that removes the possible suffixes attached to the noun. A search then checks whether there is a matching word in the dataset; if one is found, it is returned as the root. This process is shown as a flowchart in Fig. 3.

If no word matched, the system forwarded the entered noun to a procedure designed to remove prefixes and suffixes one at a time. That is, it removed the first attached prefix and looked for a matching root; if no matching root was found, it removed the first attached suffix and looked once more. It continued this process as long as no matching root had been found and further prefixes or suffixes remained attached to the word. In the end, when no affixes remained, the entered noun was correctly spelled, and the noun root existed in the dataset, the system gave the correct output lemma (root) for the entered noun. However, the system replied with the message “Input word is not in the dataset” if there was no match between the entered noun and the nouns in the dataset. Fig. 4 shows the flowchart of this process.

Following these steps, the user sees the output of the procedures, as depicted in Figs. 5 and 6. In Fig. 5, the correct word (کچ) (kiç), which means (girl), was entered with two Kurdish Kurmanji suffixes, (ەک) (ek) and (ا) (a), in the form (کچەکا) (kiçeka), which means (the girl who). The system replied with (“found”, “کچ”); “found” denotes that the entered word is correct and already exists in the dataset, and “کچ” is the base root of the word (کچەکا). In Fig. 6, however, the user entered the misspelled term (کجان) (kican) instead of (کچان) (kiçan), which means (girls): the stem ends with (ج) (c) rather than (چ) (ç), followed by the correct suffix (ان) (an). Because of the incorrect spelling, the system could not locate the base root of the word and replied with the message “Input word is not in the dataset.”
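The noun procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the one-word dataset, the suffix list, and the function name `lemmatize_noun` are hypothetical, and words are written in the Latin transliteration used in the text. The sketch strips one affix at a time and reproduces the two cases of Figs. 5 and 6.

```python
# Hypothetical miniature dataset and affix sets (Latin transliteration).
NOUNS = {"kiç"}
PREFIXES = ()                    # the two examples in the text need none
SUFFIXES = ("ek", "an", "a")     # only the suffixes mentioned in the text


def lemmatize_noun(word, nouns=NOUNS, prefixes=PREFIXES, suffixes=SUFFIXES):
    """Strip one affix at a time until the remainder is a known root."""
    pending = [word]
    seen = set()
    while pending:
        current = pending.pop(0)
        if current in nouns:
            return ("found", current)
        if current in seen:
            continue
        seen.add(current)
        for prefix in prefixes:
            if current.startswith(prefix) and len(current) > len(prefix):
                pending.append(current[len(prefix):])
        for suffix in suffixes:
            if current.endswith(suffix) and len(current) > len(suffix):
                pending.append(current[:-len(suffix)])
    return ("not found", "Input word is not in the dataset")


print(lemmatize_noun("kiçeka"))  # kiç + ek + a
print(lemmatize_noun("kican"))   # misspelled stem, no root matches
```

Trying every strippable affix in turn is a design choice of this sketch; it covers both the "remove all prefixes, then all suffixes" phase and the "one at a time" phase described in the text with a single loop.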

3.3.1.2. Verb lemmatization
Verb lemmatization is also implemented with a pruning method, as in noun lemmatization. Once the morphological rules for Kurdish Kurmanji verbs are defined, verb lemmatization is applied. The input verb passes through several procedures until the tool finds the proper root.

Because of Kurdish verb morphology, in which prefixes and suffixes are added to verb roots and can alter their meaning, finding the root of a verb during lemmatization is more difficult than, and different from, finding the root of a noun. Simply removing the suffix is therefore not sufficient.

Fig. 3. Noun Lemmatization second process flowchart, Phase 1.

Fig. 5. Noun Lemmatization of a legitimate noun.

Fig. 4. Noun Lemmatization second process flowchart, Phase 2.

Fig. 6. Noun Lemmatization of an incorrect spelled noun.



In Kurdish morphology, each verb has three states. The first is its critical state, called (Chawg) in Kurdish morphology; in this state, every verb ends with the letter (N) (ن), called (the N of Chawg), which marks the critical state of the verb. The second is the past state, called the “past root,” which is obtained by removing the (N of Chawg) from the end of the verb. When the verb is in the past state, it can be used in the past tense. The final state is the present, and several rules modify a verb's critical state to turn it into its present root. When the verb is changed to its present root, it can be used in the present tense [2].

When the system processes the input, any affix attached to the verb has to be removed. To that end, three sets of affixes are defined: suffixes, prefixes that do not attach to the verb, and prefixes that attach to the verb directly. After the affixes are removed, the remaining verb is compared with the verb dataset in the system. As described for the verb dataset Excel file, there are four columns (ID, Chawg, Qad, and Rag), in which Chawg refers to the critical state of the verb, Qad to the past state, and Rag to the present state. Once the state of the verb is recognized and found, the system returns the critical state, the Chawg of the verb, as the base root of the entered verb. Fig. 7 shows the process of finding the root of a verb when the entered verb is already a root; no matter in which tense it appears, the system returns its base root. Fig. 8 depicts the process for locating a verb root when the entered verb is attached to some affixes; the process is identical to that for locating the root of a noun attached to affixes in noun lemmatization.

After completing these stages, the user sees the output of the procedures, as shown in Figs. 9-11. In Fig. 9, the correct word (دخۆم) (dixom) is the present tense of the verb (خارن) (xarin) (eat); the prefix (د) (d) marks the present tense of the verb, and the suffix (م) (m) is the pronoun that denotes (I). The system replied with (“found”, “خارن”) in the output, where “found” implies that the word is correctly spelled and that its present root, (خۆ) (xo), is available in the dataset, and “خارن” is the base root for the entered word. In addition, in Fig. 10, the past tense of the same word (خارن) (eat) is entered
Fig. 7. Verb Lemmatization first process pseudo code.

Fig. 8. Verb Lemmatization second process pseudo code.

Fig. 9. Verb Lemmatization of correct present tense of verb (خارن) (xarin) (eat).

Fig. 10. Verb Lemmatization of true past tense of verb (خارن) (xarin) (eat).

Fig. 11. Verb Lemmatization of a wrong spelled negative imperative of verb (خارن) (xarin) (eat).

Fig. 12. Query term bi-gram frequency calculation pseudo code.



as (خارمەڤە) (xarmeve), which means (I ate). This time, there are two suffixes: (م) (m), the pronoun that denotes (I), and (ەڤە) (eve), which indicates that the event occurred and ended completely in the past. Once more, the system verified that the word root was correctly spelled and that it was included in the dataset; it also displayed the base root of the term. Furthermore, in Fig. 11, the misspelled negative imperative (مەخر) (mexir) was entered instead of (مەخۆ) (mexo) or (مەخو) (mexu), which means (don't eat); it has the improper ending (ر) (r) rather than (ۆ) (o) or (و) (u). Because of the incorrect spelling, the system could not find the base root of the word and displayed the message “Input word is not in the dataset.”
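The verb lookup against the Chawg, Qad, and Rag columns can be sketched in the same spirit. Everything below is a hedged illustration, not the authors' code: the single-row dataset, the affix lists, and the function names are hypothetical, and the forms are Latin transliterations. In particular, the present prefix is written here as "di" (the text gives it as (د) (d) in Arabic script), and the negative imperative prefix "me" is inferred from the example (mexo).

```python
# Hypothetical one-row verb dataset mirroring the four columns of Table 1.
VERBS = [
    {"ID": 1, "Chawg": "xarin", "Qad": "xar", "Rag": "xo"},
]
PREFIXES = ("di", "me")   # assumed Latin forms of the prefixes in the examples
SUFFIXES = ("eve", "m")   # the two suffixes named in the text


def find_chawg(form, verbs=VERBS):
    """Return the Chawg of the first row whose Chawg, Qad, or Rag equals form."""
    for row in verbs:
        if form in (row["Chawg"], row["Qad"], row["Rag"]):
            return row["Chawg"]
    return None


def lemmatize_verb(word, verbs=VERBS, prefixes=PREFIXES, suffixes=SUFFIXES):
    """Strip affixes one at a time and look the remainder up in all three states."""
    pending = [word]
    seen = set()
    while pending:
        current = pending.pop(0)
        chawg = find_chawg(current, verbs)
        if chawg is not None:
            return ("found", chawg)
        if current in seen:
            continue
        seen.add(current)
        for prefix in prefixes:
            if current.startswith(prefix) and len(current) > len(prefix):
                pending.append(current[len(prefix):])
        for suffix in suffixes:
            if current.endswith(suffix) and len(current) > len(suffix):
                pending.append(current[:-len(suffix)])
    return ("not found", "Input word is not in the dataset")


for form in ("dixom", "xarmeve", "mexir"):
    print(form, lemmatize_verb(form))
```

Whichever state matches (past root for xarmeve, present root for dixom), the Chawg is what gets returned, which is the behaviour the text describes for Figs. 9 and 10.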

3.3.2. Spell checker and spell correction
The spell checker and spell correction mechanisms work together in two stages in this study: first, the spell checker indicates whether the word is correct or incorrect; second, the spell correction process corrects the word by suggesting the most likely correct word forms.

After a word enters the system, the spell checker determines whether it is correct by checking word frequency in the dataset (all three files). This detection step is based on n-grams. The input word, called the query term in this paper, is split into bi-grams. A bi-gram is an n-gram for n = 2; in this study, a bi-gram (or 2-gram) is a sequence of two adjacent letters. For instance, the bi-grams “ha,” “ap,” “pp,” and “py” are the two-letter sequences extracted from the word (happy). After the bi-grams of the query term are produced, the system calculates their frequencies against the bi-grams of each word in the dataset, called a dictionary term in this paper. Fig. 12 shows the process of calculating the frequency of the query term's bi-grams in comparison to the dictionary terms. After the calculation, the system inspects the frequencies: if any frequency equals zero, the word is detected as wrong, because one of its bi-grams never occurs in the dictionary terms; if none of the frequencies is zero, the word is detected as true. Hence, when a query term equals one of the index terms in the dataset, it is detected as true, and the system presents “The word is true spelled” as a result.
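The bi-gram check just described can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the pseudo code of Fig. 12: the function names and the toy English dictionary are hypothetical, and "frequency" is taken here to mean the number of dictionary terms that contain a given bi-gram.

```python
def bigrams(word):
    """Return the overlapping two-letter sequences of a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def bigram_frequencies(query, dictionary):
    """Count, for each query bi-gram, how many dictionary terms contain it."""
    return {
        gram: sum(gram in bigrams(term) for term in dictionary)
        for gram in bigrams(query)
    }


def is_correctly_spelled(query, dictionary):
    """Apply the zero-frequency rule described in the text."""
    if query in dictionary:  # exact match with an index term
        return True
    frequencies = bigram_frequencies(query, dictionary)
    # A word is flagged as wrong when any of its bi-grams never occurs.
    return bool(frequencies) and all(n > 0 for n in frequencies.values())


# Hypothetical toy dictionary, in English like the "happy" example above.
DICTIONARY = ["happy", "apple"]

print(bigrams("happy"))
print(bigram_frequencies("hxppy", DICTIONARY))
print(is_correctly_spelled("hxppy", DICTIONARY))
```

Note that the zero-frequency rule on its own would accept any string built from known bi-grams; in the paper's system, the check is combined with the dataset lookup and lemmatization steps.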

After the query term is detected as wrong, its bi-grams are passed on, and the system proceeds to the spell correction procedure. The wrong word is then corrected based on the Jaccard similarity coefficient, which is widely used to measure how close two terms are to one another. If a query does not match any index term in the dataset, this similarity measurement is used to find the most similar terms recorded in the dataset. Equation (1) gives the Jaccard similarity coefficient [15].

$$\text{Jaccard sim}(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (1)$$

The Jaccard similarity coefficient between two sets is measured by dividing the number of elements common to both sets by the number of elements in their union [15].
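As a worked illustration of Equation (1), assume that each word is represented by its set of letter bi-grams, as in the spell checker, and take the hypothetical misspelling “hapy” against the dictionary term “happy”:

$$A = \{\text{ha}, \text{ap}, \text{py}\}, \qquad B = \{\text{ha}, \text{ap}, \text{pp}, \text{py}\}$$

$$\text{Jaccard sim}(A, B) = \frac{|\{\text{ha}, \text{ap}, \text{py}\}|}{|\{\text{ha}, \text{ap}, \text{pp}, \text{py}\}|} = \frac{3}{4} = 0.75$$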

The mechanism operates on the query term and on dictionary terms drawn from all three files of the dataset. The spell correction takes the query term and looks for a matching dictionary term in file one; if none exists, the term is sent to the noun and verb lemmatization files in turn, because it may be the root of a noun or a verb, or a noun or verb with affixes that must be removed before its spelling can be checked. If the word is found after these checks, the system marks it as a true word. Otherwise, the spell checker predicts candidate words from the dataset's three files and scores each candidate's matching degree with the Jaccard coefficient. Candidates whose matching degree is greater than the spell checker's threshold are kept, and the five with the highest matching degree are finally chosen. The threshold in this study is 0.15. It was chosen based on how accurately the correct word, or the closest-matching words in the dataset, were guessed for wrong query terms. In Kurdish Kurmanji, there are words with three letters; if one is written incorrectly by omitting a letter, only two letters remain. Hence, the threshold should be as small as possible to obtain accurate results.
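The ranking step can be sketched as follows. This is a minimal sketch under assumptions: bi-gram sets are used as the inputs to the Jaccard coefficient, the function names and the toy English dictionary are hypothetical, and only the threshold (0.15) and the limit of five suggestions come from the text.

```python
def bigram_set(word):
    """Return the set of overlapping two-letter sequences of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def jaccard(a, b):
    """Jaccard similarity of two words, computed on their bi-gram sets."""
    set_a, set_b = bigram_set(a), bigram_set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def suggest(query, dictionary, threshold=0.15, limit=5):
    """Return up to `limit` terms whose matching degree exceeds the threshold."""
    scored = [(jaccard(query, term), term) for term in dictionary]
    kept = [(score, term) for score, term in scored if score > threshold]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [term for _, term in kept[:limit]]


# Hypothetical toy dictionary for illustration only.
DICTIONARY = ["happy", "apple", "puppy", "hippo", "happen", "nappy", "zebra"]

print(jaccard("hapy", "happy"))
print(suggest("hapy", DICTIONARY))
```

With short words, a single shared bi-gram already yields a score just above 0.15 in this toy example, which is consistent with the paper's argument for keeping the threshold small.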

4. RESULTS AND DISCUSSION

This section presents the results of the algorithms in both the lemmatization and spell checker tools. It also discusses the comparison with the benchmark study of this research.

4.1. Noun Lemmatization
To evaluate the efficiency and accuracy of the noun lemmatization tool, two random words were chosen with



their derivatives: nine derivatives of the word (چیا) (mountain) and 12 derivatives of the word (کوڕ) (boy). The lemmatization process returned the correct root for all nine derivatives of the first word and all 12 derivatives of the second. To further assess the accuracy of noun lemmatization, another 66 random words with possible derivatives were chosen and entered into the system; the noun lemmatization gave the correct result in 63 cases out of 66, an accuracy of approximately 95.45%. Averaged over the two sets, the accuracy of the noun lemmatization process was approximately 97.7%. Table 2 presents the accuracy of the noun lemmatization tool.
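The arithmetic behind these figures can be checked with a few lines of Python. The counts are the ones reported for the two noun test sets; the variable and function names are illustrative only. The sketch suggests that the reported 97.7% corresponds to the mean of the two per-set accuracies, whereas pooling all 87 test words gives a slightly lower value.

```python
def accuracy(correct, total):
    """Return accuracy as a percentage."""
    return 100.0 * correct / total


# (correct, total) per test set, as reported for noun lemmatization.
noun_sets = {"1st set": (21, 21), "2nd set": (63, 66)}

per_set = {name: accuracy(c, t) for name, (c, t) in noun_sets.items()}
mean_of_sets = sum(per_set.values()) / len(per_set)
pooled = accuracy(
    sum(c for c, _ in noun_sets.values()),
    sum(t for _, t in noun_sets.values()),
)

print({name: round(value, 2) for name, value in per_set.items()})
print("mean of per-set accuracies:", round(mean_of_sets, 1))
print("pooled accuracy:", round(pooled, 2))
```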

4.2. Verb Lemmatization
To evaluate the efficiency and accuracy of the verb lemmatization tool, two sets of random verb forms were tested. The test sets included different verb forms such as present and past tense, imperative and negative imperative, passive, and negative. Provided that the verb exists in the dataset dictionary, the verb lemmatization tool finds the correct root of the input verb. Each verb in the test set was entered with all possible derivations made with specific prefixes and suffixes. The first set included 171 different forms of different verbs. The lemmatization tool lemmatized 169 of them correctly; the two errors were due to the ordering of the dataset. In the case of the imperative and negative imperative of a verb, the Rag of another verb appeared before the Rag of the intended verb, so the system took the first matching Rag before it reached the intended one. For example, the Kurdish verb “send” has two forms, (ناردن) (nardin) and (نارتن) (nartin), and both have the same Rag (نێر) (“nêr”). If a user entered the imperative of this verb, (بنێرە) (binêre), and expected to see the base root (ناردن) (nardin) in the result, the system replied with the base root (نارتن) (nartin), because it was recorded before the other form (ناردن) (nardin) in the dataset Excel file. This happens because the system stops as soon as it finds a result, without going on to the other verbs in the dataset.

As a result, the accuracy of lemmatizing the first set was 98.83 percent. The other set contained 131 different forms of different verbs in different tenses. The lemmatization tool lemmatized all of them, which means it gave the correct root for each form. Averaged over the two test sets, the verb lemmatization tool achieved approximately 99.4 percent accuracy (300 of 302 forms, or about 99.3 percent, when the two sets are pooled). Table 3 shows the accuracy of the verb lemmatization tool.

4.3. Spell Checker and Spell Correction
Calculating and analyzing the accuracy of the spell-checker
and spell-correction tool is more complex, because the tool is
connected to the lemmatization tools. As described in the
section above, there were three datasets, so the spell-checker
and spell-correction accuracy should be calculated against all
of them. The mechanism first checks whether the input word is
correct. The spell-checker tool was tested with three groups of
data drawn from the three datasets: 100 words from the first
dataset file, 100 nouns from the second dataset file, and 100
verbs from the third dataset file. The result was always true,
meaning that the spelling of the input word is correct,
whenever the word existed in the dataset. Hence, the
spell-checker tool returned the correct result in all cases.
Table 4 shows the accuracy of the spell-checker tool.

TABLE 2: Accuracy in noun lemmatization tool

| Sets | True lemmatization | False lemmatization | Total | Accuracy (%) |
|---|---|---|---|---|
| 1st set | 21 | 0 | 21 | 100 |
| 2nd set | 63 | 3 | 66 | 95.45 |
| Total | 84 | 3 | 87 | 97.7 |

TABLE 3: Accuracy in verb lemmatization tool

| Sets | True lemmatization | False lemmatization | Total | Accuracy (%) |
|---|---|---|---|---|
| 1st set | 169 | 2 | 171 | 98.83 |
| 2nd set | 131 | 0 | 131 | 100 |
| Total | 300 | 2 | 302 | 99.3 |

TABLE 4: Accuracy in spell checker tool

| Sets | True spell checking | False spell checking | Total | Accuracy (%) |
|---|---|---|---|---|
| 1st set | 100 | 0 | 100 | 100 |
| 2nd set | 100 | 0 | 100 | 100 |
| 3rd set | 100 | 0 | 100 | 100 |
| Total | 300 | 0 | 300 | 100 |

For the spell-correction tool, a set of random words including
nouns, verbs, and other word classes was tested. The nouns and
verbs included both simple forms and forms with prefixes and
suffixes. The results show that whenever a bi-gram of the
correct word appeared in the input word, there was a higher
chance of obtaining the correct, most similar word as a result.
The more bi-grams of the intended word appeared in the input
word, the higher the similarity degree and the more accurate
the outcome. On several occasions, an incorrect input word
caused incorrect lemmatization, which led to incorrect
spell-correction and ultimately lowered the accuracy of the
result.
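The relationship between shared bi-grams and similarity can be sketched in Python as follows. This is a minimal illustration of a Jaccard coefficient over character bi-gram sets, not the authors' code; the function names, the treatment of bi-grams as sets, the small lexicon, and the misspelling `nadin` are all assumptions.

```python
def bigrams(word):
    """Return the set of character bi-grams of `word`."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over the bi-gram sets of a and b."""
    set_a, set_b = bigrams(a), bigrams(b)
    if not set_a and not set_b:
        # Words shorter than two characters have no bi-grams.
        return 1.0 if a == b else 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def suggest(word, lexicon):
    """Return lexicon entries ordered from most to least similar to `word`."""
    return sorted(lexicon, key=lambda entry: jaccard(word, entry), reverse=True)


if __name__ == "__main__":
    lexicon = ["nartin", "nardin", "binêre"]  # hypothetical entries
    for candidate in suggest("nadin", lexicon):
        print(candidate, round(jaccard("nadin", candidate), 3))
```

In this sketch the hypothetical input `nadin` shares three bi-grams with `nardin` (na, di, in) out of six distinct bi-grams in total, giving a coefficient of 0.5, while it shares only two with `nartin`. The candidate with more shared bi-grams is therefore ranked first, which matches the behaviour reported above.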

To assess the efficiency of the spell-correction tool, three
sets of randomly chosen misspelled words in different forms
were tested manually. The first set included 61 misspelled
nouns; with the help of noun lemmatization, the spell-corrector
achieved a correction accuracy of 90.16%. The second set
contained 80 misspelled verbs; with the use of verb
lemmatization, the spell-corrector achieved a correction
accuracy of 88.75%. The third set consisted of misspelled
pronouns, stop words, conjunctions, imperative words,
superlative words, and question words; of its 107 words, the
spell-correction system corrected 100 successfully, an accuracy
rate of 93.4%. Table 5 displays the accuracy of the
spell-correction tool.
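The accuracy figures can be reproduced from the counts with a short Python sketch. The per-set counts and reported rates are taken from Table 5; the variable names are illustrative, and the observation that the overall 90.77% corresponds to the unweighted mean of the reported per-set rates is an inference from the numbers rather than a statement made in the paper.

```python
# (correctly corrected, total) for each test set, as listed in Table 5.
SETS = {"nouns": (55, 61), "verbs": (71, 80), "others": (100, 107)}

# Per-set rates as reported in the text.
REPORTED = [90.16, 88.75, 93.4]


def accuracy(correct, total):
    """Accuracy as a percentage of correctly handled items."""
    return 100.0 * correct / total


per_set = {name: accuracy(c, t) for name, (c, t) in SETS.items()}

# Pooled accuracy: all correct items divided by all tested items.
pooled = accuracy(
    sum(c for c, _ in SETS.values()),
    sum(t for _, t in SETS.values()),
)

# Unweighted mean of the reported per-set rates.
mean_reported = sum(REPORTED) / len(REPORTED)

if __name__ == "__main__":
    for name, value in per_set.items():
        print(name, round(value, 2))
    print("pooled", round(pooled, 2))
    print("mean of reported rates", round(mean_reported, 2))
```

Computed this way, the per-set rates are about 90.16, 88.75, and 93.46 percent. The mean of the reported rates is 90.77 percent, whereas pooling all 226 correct items over the 248 tested words gives about 91.13 percent, so the two ways of summarizing differ slightly.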

As shown in Table 5, the third set had the highest accuracy
rate of the three sets. As previously stated, some false
corrections were caused by false lemmatization. If a dataset
containing all forms of the words in the three datasets were
created, more accurate results could be obtained, because the
spell-corrector could look up the right form of the misspelled
input word directly and find it with a high degree of
certainty.

5. CONCLUSION AND FUTURE WORKS

Information retrieval and text classification can benefit
greatly from an effective lemmatizer. In addition, incorrect
words are detected and corrected by spell-checkers and
spell-correctors. This paper introduced a Kurdish Kurmanji
lemmatizer and a word-level spell-checker with
spell-correction. It is the first attempt to build tools of
this kind for Kurdish Kurmanji. A hybrid technique that depends
on the n-gram language model and the Jaccard Coefficient
Similarity algorithm was used for the spell-checker and
spell-correction, while the proposed approach for lemmatization
is based on morphological principles. With the suggested
approach, the accuracy of noun and verb lemmatization was
assessed at 97.7% and 99.3%, respectively. In addition, the
spell-checker and spell-correction accuracy rates were 100%
and 90.77%, respectively. The experimental findings show that
several false corrections were caused by incorrect
lemmatization of misspelled input words. Furthermore, more
accurate results may be obtained if a dataset is built with all
the word forms in the datasets, since the spell-corrector could
then search directly for the correct form of the misspelled
input word and find it with a high level of similarity. In the
future, this work can be expanded to a bigger dataset of
Kurdish Kurmanji, and these approaches can be used in NLP
applications such as text mining for Kurdish Kurmanji.

This study differs from its benchmark in several respects.
This study was done for the Kurdish Kurmanji dialect, while
the benchmark was done for the Kurdish Sorani dialect, whose
morphological rules differ in many respects and therefore had
to be studied and implemented separately in the system. The
datasets were also different: this research's dataset is
primary, first-hand, and organized in three subsets. In
addition, the two studies differ in accuracy and in the
algorithms used. This study achieved 97.7% and 99.3% accuracy
for noun and verb lemmatization, respectively, while the
benchmark achieved 95% and 89.4% accuracy on two test sets for
noun lemmatization and an average of 86.7% accuracy for verb
lemmatization. For spell-correction, this study used the
Jaccard Coefficient Similarity algorithm and reached 90.77%
accuracy, while the other study used an edit distance algorithm
and obtained 96.4% accuracy with a lexicon and 87% accuracy
without a lexicon. Finally, the two studies are similar in
their theoretical parts and ideas, but the practical parts
differ substantially: this study used the Python programming
language, while the other used the Java programming language,
so the system had to be recreated from beginning to end.

6. ACKNOWLEDGMENT

The authors would like to thank SPU for providing the
opportunity, support, and funding for this study. The authors
also thank the Kurdish journalist syndicate in Sulaimani.

TABLE 5: Accuracy in spell correction tool

| Sets | True correction | False correction | Total | Accuracy (%) |
|---|---|---|---|---|
| 1st set | 55 | 6 | 61 | 90.16 |
| 2nd set | 71 | 9 | 80 | 88.75 |
| 3rd set | 100 | 7 | 107 | 93.4 |
| Total | 226 | 22 | 248 | 90.77 |



REFERENCES

[1] Z. Kurdî, M.Û. Zarên Wî and H.S. Khalid. “Kurdish Language, its
Family and Dialects”. 2020. Available from:
https://www.dergipark.org.tr/en/pub/kurdiname/issue/50233/637080
[Last accessed on 2022 Aug 15].

[2] D.N. MacKenzie. “Kurdish Dialect Studies”. Oxford University
Press, London, 1961. Available from:
https://www.books.google.iq/books/about/Kurdish_dialect_studies_2_1962.html?id=eaf2zaeacaaj&redir_esc=y
[Last accessed on 2022 May 31].

[3] “Kurdish Academy of Language Enables the Kurdish Language
in New Horizon”. Available from:
https://www.kurdishacademy.org/?q=node/41
[Last accessed on 2022 Jun 04].

[4] N.A. Khoshnaw, Z.U.Z. Sulaimaniyah. “Awer Station”, 2011.
Available from:
https://rezmanikurde.blogspot.com/2018/01/blog-post_26.html?m=1
[Last accessed on 2022 Jun 09].

[5] R. Gupta and A.G. Jivani. “LemmaChase: A Lemmatizer”. 
International Journal on Emerging Technologies, vol. 11, no. 2, 
pp. 817-824, 2020.

[6] D. Hládek, J. Staš, S. Ondáš, J. Juhár and L. Kovács. “Learning string 
distance with smoothing for OCR spelling correction”. Multimedia 
Tools and Applications, vol. 76, no. 22, pp. 24549-24567, 2017.

[7] H. Mubarak. “Build Fast and Accurate Lemmatization for
Arabic”, in Proceedings of the European Language Resources
Association (ELRA), Miyazaki, Japan, 2018. Available from:
https://www.aclanthology.org/L18-118 [Last accessed on 2022 Jun 08].

[8] N. Zukarnain, B.S. Abbas, S. Wayan, A. Trisetyarso and C.H.
Kang. “Spelling Checker Algorithm Methods for Many Languages”,
in Proceedings of 2019 International Conference on Information
Management and Technology (ICIMTech), 2019, pp. 198-201.

[9] A.A. Freihat, M. Abbas, G. Bella and F. Giunchiglia. “Towards an 
optimal solution to lemmatization in Arabic”. Procedia Computer 
Science, vol. 142, pp. 132-140, 2018.

[10] A. Yazdani, M. Ghazisaeedi, N. Ahmadinejad, M. Giti, H. Amjadi
and A. Nahvijou. “Automated misspelling detection and correction
in Persian clinical text”. Journal of Digital Imaging, vol. 33, no. 3,
pp. 555-562, 2019.

[11] S. Mohtaj, B. Roshanfekr, A. Zafarian and H. Asghari. “Parsivar:
A Language Processing Toolkit for Persian”, in Proceedings of
the Eleventh International Conference on Language Resources
and Evaluation (LREC 2018), 2018. Available from:
https://www.aclanthology.org/L18-1179 [Last accessed on 2022 Aug 20].

[12] A. Rashidi and M.Z. Lighvan. “HPS: A hierarchical Persian stemming
method”. International Journal on Natural Language Computing,
vol. 3, no. 1, pp. 11-20, 2014.

[13] A.M. Mustafa and T.A. Rashid. “Kurdish stemmer pre-processing
steps for improving information retrieval”. Journal of Information
Science, vol. 44, no. 1, pp. 15-27, 2018.

[14] S. Salavati and S. Ahmadi. “Building a Lemmatizer and a spell-
checker for Sorani Kurdish”. CoRR, vol. abs/1809.10763, 2018. 
Available from: https://www.arxiv.org/abs/1809.10763 [Last 
accessed on 2021 Aug 15].

[15] S. Niwattanakul, J. Singthongcha, E. Naenudorn and S.
Wanapu. “Using of Jaccard Coefficient for Keywords Similarity”,
in Proceedings of the International Multi Conference of Engineers
and Computer Scientists, vol. 1, 2013. Available from:
https://www.data.mendeley.com/v1/datasets/s9wyvvbj9j/draft?preview=1
[Last accessed on 2022 Apr 08].