Vol. 2, No. 2 | July – December 2018

Developing Sindhi Text Corpus using XML Tags

Sayed Majid Ali Shah∗ Zeeshan Bhatti∗ Imdad Ali Ismaili∗

Abstract

The Sindhi language, one of the oldest languages in the world, still has very limited use in
the digital age due to a lack of digital content. A corpus is extremely important for
facilitating natural language processing of a language and its script. This research work
addresses the issue of building a corpus for the Sindhi language using XML-based tagging. A
tree-based XML tag structure is designed for the Sindhi corpus, with two main nodes: the
metadata and the Sindhi document, which contains the main text. The corpus contains detailed
metadata tags for the Sindhi language, documenting each relevant component of the corpus. The
final corpora will be used in various natural language applications for the Sindhi language.

Keywords: Corpus, Sindhi, Sindhi Corpus, Natural Language Processing, XML

1. Introduction

Sindhi is a widely spoken language written in an Arabic-based
script with similar cursive ligatures. It is written from right
to left and consists of 52 characters [1] [2]. Sindhi is
considered the second most widely written and spoken language in
Pakistan, after Urdu. Even though Sindhi is an old language with
a vast amount of literature and written resources, very little
computational material and few digital corpora are available for
creating efficient NLP applications for the language.

Natural Language Processing applications require a large
collection of corpus data for the language. A corpus is simply a
large collection of structured and unstructured text for a
language. A well-defined structural format is created to store
and categorize the text in large datasets, enabling computational
processing and application development. These structured datasets
facilitate statistical analysis and grammatical validation of the
script, along with other applications of NLP [3] [4].

Corpora are considered a key prerequisite and an obligatory
component for developing any Natural Language Processing
application, such as spell checkers [2] [5], machine learning,
speech-to-text, text-to-speech, OCR, translation and
transliteration [6]. Consequently, there is a great need for a
Sindhi language corpus that is publicly available for everyone to
use.

XML has long been a key technique for structuring corpora of
various languages [7] [8] [9] [10]. XML is very flexible due to
its tag-based structure, which allows the developer to easily
extract the required information from a structured XML document.
Developing the Sindhi corpus in XML enables rule-based tagging
and structured corpus design, allowing easier reuse of the corpus
along with broader dissemination to various NLP applications.

Since a corpus is essential for NLP applications in any language,
a large amount of work has been done to develop corpora for
various languages. Notably, the EMILLE project produced
multilingual corpora for South Asian languages [10]. Similarly,
an Urdu corpus containing 18 million words was developed by the
Center for Research in Urdu Language Processing (CRULP) [11].
CRULP has also developed and released the Online Urdu Dictionary
(OUD), containing 120,000 records of Urdu corpus with 80,000
dictionary words [9]. The Bank of English corpus was developed to
support dictionary development [8]. In addition, a Hindi corpus
was designed by IIT Bombay to facilitate NLP development for the
language [13], along with the EMILLE corpus [12].

2. XML Structure for Sindhi Corpus

In NLP application development, the XML tag-based structured
format has been widely used to create structured documents for
processing and developing regional language applications [14].
For this purpose, XML with custom tags has been used to structure
Sindhi text into a formalized corpus for various NLP
applications. The XML tags for the Sindhi corpus are segregated
into two main sections: Metadata tags and Sindhi Text Document
tags. Each section is further divided into sub-tags that hold
more detailed information.

Figure 1: Proposed Model with Hypothesis relation

∗Institute of Information and Communication Technology, University of Sindh, Jamshoro
Corresponding Email: lakiyarimajid@gmail.com

SJCMS | P-ISSN: 2520-0755 | E-ISSN: 2522-3003 © 2018 Sukkur IBA University - All Rights Reserved

Sayed Majid (et al.), Developing Sindhi Text Corpus using XML Tags (30-37)

A. Sindhi Corpus Structure

The XML-based Sindhi corpus structure is divided into two main
sections at the top-most level. The ‘Metadata’ tag contains tags
that describe the original source of the text and document. The
‘Sindhi Text Document’ tag is the second top-most tag and
contains the actual text from the source document. The full
hierarchy of the XML tag structure for the Sindhi corpus is
illustrated in Figure 1. Each Sindhi corpus document is stored
within XML tags according to this structure.

B. Meta Data Header of Sindhi Corpus

Metadata is defined as data about data. This main tag therefore
contains specific details about the source document. The section
contains attributes such as the “Title” of the document, the
“Sub Title” of the Sindhi document, if any, the “Topic”
discussed in the Sindhi document, “Sub Topic”, “Book”,
“Author”, “Edition”, etc. The detailed sub-tag structure of the
metadata section is shown in Figure 1.

The ‘Sindhi Text Document’ tag contains the raw source text,
which is extracted from various sources including websites,
newspapers, books, articles, etc. This tag has two further sub
tags, “Text Description” and “Text Document”, which hold the
description of the source text and the actual text file
respectively.

3. Sindhi Corpus Representations

Two main custom tags are defined under <sndhiLangCrps>:
<sndMetaData> </sndMetaData> and <sndTextDoc> </sndTextDoc>.
The (+) operator shows that both custom tags also have child
tags, as shown in Figure 2.
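
The two-level layout described above can be sketched in a few lines of Python with the standard-library xml.etree.ElementTree module. This is only an illustrative sketch, not the authors' tooling: the function name build_corpus_skeleton is hypothetical, and because the text spells the second tag both as sndTextDoc and as sndTxtDoc, the sketch assumes the sndTxtDoc spelling.

```python
import xml.etree.ElementTree as ET


def build_corpus_skeleton() -> ET.Element:
    """Return an empty corpus document: the root tag with its two top-level children."""
    root = ET.Element("sndhiLangCrps")   # root tag named in the text
    ET.SubElement(root, "sndMetaData")   # data about the source document
    ET.SubElement(root, "sndTxtDoc")     # assumed spelling; the text also writes sndTextDoc
    return root


skeleton = build_corpus_skeleton()
# Print the serialized skeleton to inspect the nesting.
print(ET.tostring(skeleton, encoding="unicode"))
```

Both children are empty here; in a viewer that shows (+) and (-) controls they would appear as expandable nodes once child tags are added.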

Figure 2: Super tags of <sndhiLangCrps>, the Sindhi Language Corpus (SLC)

Figure 3 shows the main Sindhi Language Corpus tag, which
contains the two top-most sub tags: the Sindhi Meta Data and
Sindhi Text Document tags.

Figure 3: Root and elements tags of sndhiLangCrps

Sukkur IBA Journal of Computing and Mathematical Sciences - SJCMS | Volume 2 No. 2 July – December 2018 © Sukkur IBA University

A. Portion of SndMetaData

The XML tag <sndMetaData> (“Sindhi Meta Data”) is the part of
the Sindhi corpus data file that holds data about the Sindhi
document itself. Its child tags contain further information about
the Sindhi document.

Figure 4: Elements and child tags of SndMetaData
at SndhiLangCrps

In Figure 5, the element tags of <sndMetaData> </sndMetaData>
are defined as <pblsher> </pblsher>, <lang> </lang> and
<fileDesc> </fileDesc>. These tags also have child tags of their
own, as the operator indicates.

Figure 5: Sindhi XML Corpus with MetaData

Figure 6 shows an example of publisher tag data for a Sindhi
document. In this figure, the child tags of <pblsher> </pblsher>
are defined as the custom tags <pblsherName> </pblsherName>,
<athor> </athor> and <edition> </edition>. These custom tags
are filled with actual data for building the SLC, while the other
tags are collapsed here because they are discussed with other
figures. The (-) operator shows that a tag is expanded, with all
of its child tags displayed and none hidden.

Figure 6: Data in publisher tag

The XML tag <sndMetaData> has three element tags: <pblsher>
(Publisher), <lang> (Language) and <fileDesc> (File Description).

Figure 7: Elements and child tags of pblsher in
SndMetaData at SndhiLangCrps

The publisher tag <pblsher> contains publication information
through its child elements <pblsherName> (“Publisher Name”),
<athor> (“Author”) and <edition> (“Edition”). The <pblsherName>
tag gives the name of the publisher, the <athor> tag gives the
name of the author, and the <edition> tag describes the edition
of the publication.

Figure 8: Elements and child tags of pblsher with its child
element edition in SndMetaData at SndhiLangCrps

The <pblsher> (Publisher) tag is a child of the Sindhi Meta Data
tag <sndMetaData>, and the <edition> (Edition) tag is a child of
<pblsher>. The <edition> tag in turn has the child tags <dataSrc>
(“Data Source”), <books> (“Books”), <news> (“News”), <articles>
(“Articles”) and <blogs> (“Blogs”). These are the sources of
information that supply the data for the edition, and the edition
in turn completes the publisher tag.
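
As an illustration of the nesting just described, the following Python sketch builds a publisher subtree with xml.etree.ElementTree. It is a minimal example under stated assumptions rather than the authors' implementation: the function build_publisher and its parameters are hypothetical, and the publisher, author and source values are placeholders, not data from the corpus.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Child tags of <edition> as listed in the text.
EDITION_CHILDREN = ("dataSrc", "books", "news", "articles", "blogs")


def build_publisher(name: str, author: str, sources: Dict[str, str]) -> ET.Element:
    """Build <pblsher> with <pblsherName>, <athor> and a nested <edition>."""
    publisher = ET.Element("pblsher")
    ET.SubElement(publisher, "pblsherName").text = name
    ET.SubElement(publisher, "athor").text = author
    edition = ET.SubElement(publisher, "edition")
    for tag in EDITION_CHILDREN:
        # Sources that were not supplied are stored as empty strings.
        ET.SubElement(edition, tag).text = sources.get(tag, "")
    return publisher


# Placeholder values for illustration only.
pub = build_publisher(
    "Example Publisher",
    "Example Author",
    {"dataSrc": "news", "news": "Example Daily"},
)
print(ET.tostring(pub, encoding="unicode"))
```

Keeping the list of edition children in one constant makes it easy to adjust if the tag set changes.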

Figure 9: Elements and child tags of Lang in SndMetaData at
SndhiLangCrps

The Language tag <lang> is an element tag of <sndMetaData> and
has the child tags <noRds> (“Number of Records”), <encoding>
(“Encoding”) and <data> (“Data”).

Figure 10: Elements and child tags of fileDesc in
SndMetaData at SndhiLangCrps

Figure 11: child tags of <lang> </lang>

Figure 11 shows the child tags of <lang> </lang>: <noRds>
</noRds>, <encoding> </encoding> and <date> </date>. These tags
are filled with actual data, while the other tags are collapsed
here to highlight the role of these tags in corpus linguistics.

File Description <fileDesc> is an element tag of <sndMetaData>
and consists of the child tags <title> (“Title”), <subtopic>
(“Sub Topic”) and <keywords> (“Key Words”).

Figure 12: Elements and child tags of SndTxtDoc
at SndhiLangCrps

Figure 13: Element tags of <fileDesc> </fileDesc>

Figure 13 shows the sub tags of the custom tag <fileDesc>
</fileDesc>: <title> </title>, <sbTopic> </sbTopic> and
<keyWords> </keyWords>. Each tag is assigned its own data.
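
Because the metadata section has a fixed set of child tags, a document can be checked against that set mechanically. The Python sketch below shows one possible check using xml.etree.ElementTree; it is not part of the described system. The names EXPECTED_CHILDREN and missing_children are hypothetical, and where the text gives two spellings for a tag (data/date, subtopic/sbTopic, keywords/keyWords) the sketch assumes the spellings used in the figure descriptions.

```python
import xml.etree.ElementTree as ET
from typing import List

# Expected child tags per metadata tag (assumed spellings, see the note above).
EXPECTED_CHILDREN = {
    "sndMetaData": ["pblsher", "lang", "fileDesc"],
    "pblsher": ["pblsherName", "athor", "edition"],
    "lang": ["noRds", "encoding", "date"],
    "fileDesc": ["title", "sbTopic", "keyWords"],
}


def missing_children(element: ET.Element) -> List[str]:
    """Return paths such as 'lang/date' for every expected child that is absent."""
    problems: List[str] = []
    for tag in EXPECTED_CHILDREN.get(element.tag, []):
        child = element.find(tag)
        if child is None:
            problems.append(f"{element.tag}/{tag}")
        else:
            # Recurse so that nested sections are checked as well.
            problems.extend(missing_children(child))
    return problems


# A deliberately incomplete sample: <lang> has no <date> child.
sample = ET.fromstring(
    "<sndMetaData>"
    "<pblsher><pblsherName/><athor/><edition/></pblsher>"
    "<lang><noRds/><encoding/></lang>"
    "<fileDesc><title/><sbTopic/><keyWords/></fileDesc>"
    "</sndMetaData>"
)
print(missing_children(sample))
```

A check of this kind could run before a document is added to the corpus, so that incomplete metadata is caught early.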

B. Portion of Sindhi Text Document

The XML tag <sndTxtDoc> (“Sindhi Text Document”) is the part of
the Sindhi corpus data file that holds the text. It has the child
tags <txtSrc> (“Text Source”) and <txtDesc> (“Text Description”).
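
The text document section can be sketched in the same way. The Python example below builds the tag with its two children and confirms that Sindhi script survives serialization to UTF-8 and parsing back. It is a sketch under assumptions: build_text_doc is a hypothetical helper, the raw text is assumed to be stored in txtSrc, and the Sindhi string is a placeholder phrase rather than corpus content.

```python
import xml.etree.ElementTree as ET

# Placeholder Sindhi phrase ("Sindhi language"); not taken from the corpus.
SAMPLE_TEXT = "سنڌي ٻولي"


def build_text_doc(source_text: str, description: str) -> ET.Element:
    """Build <sndTxtDoc> with <txtSrc> and <txtDesc> children."""
    doc = ET.Element("sndTxtDoc")
    ET.SubElement(doc, "txtSrc").text = source_text    # assumed to hold the raw text
    ET.SubElement(doc, "txtDesc").text = description   # free-text description
    return doc


text_doc = build_text_doc(SAMPLE_TEXT, "Placeholder description")
# Serialize to UTF-8 bytes and parse again to check the round trip.
xml_bytes = ET.tostring(text_doc, encoding="utf-8")
round_trip = ET.fromstring(xml_bytes)
print(round_trip.findtext("txtSrc") == SAMPLE_TEXT)
```

XML stores the characters in logical order, so the right-to-left direction of Sindhi is a display concern and needs no special handling in the file itself.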

4. Sample Sindhi Corpus Document

The final Sindhi documents were initially created manually by
extracting information from articles and saving it in XML tags,
as discussed in [15]. A GUI form was designed for creating the
XML document for Sindhi text, as shown in Figure 14. Each entry
was saved as an XML file following the rules and patterns
discussed above.
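
The step of saving one entry as an XML file might look like the following Python sketch. The GUI itself is not reproduced, and none of this is the authors' code: save_entry, the field names in ENTRY and the file name entry_0001.xml are hypothetical, the tag spellings follow the assumptions noted in the figure descriptions, and only two fields are written to keep the example short.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Dict

# Placeholder form values; the Sindhi phrase means "Sindhi language".
ENTRY = {"title": "Placeholder title", "text": "سنڌي ٻولي"}


def save_entry(fields: Dict[str, str], path: Path) -> None:
    """Write one form entry as a UTF-8 XML file with an XML declaration."""
    root = ET.Element("sndhiLangCrps")
    meta = ET.SubElement(root, "sndMetaData")
    file_desc = ET.SubElement(meta, "fileDesc")
    ET.SubElement(file_desc, "title").text = fields["title"]
    doc = ET.SubElement(root, "sndTxtDoc")
    ET.SubElement(doc, "txtSrc").text = fields["text"]
    # An explicit encoding and declaration keep the Sindhi script readable by other tools.
    ET.ElementTree(root).write(str(path), encoding="utf-8", xml_declaration=True)


with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "entry_0001.xml"
    save_entry(ENTRY, out_path)
    saved = out_path.read_text(encoding="utf-8")

print(saved)
```

Writing one file per entry keeps each source document self-contained, which matches the per-document structure described above.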


Figure 14: GUI form for creating XML document
of Sindhi corpus

Figure 15: Sample Sindhi XML document

The final version of each XML document of the Sindhi corpus
contains all the relevant information, which can easily be read
and processed for any NLP task, as shown in Figure 15 to
Figure 18.
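
As a small illustration of reading such a document for an NLP task, the Python sketch below parses a corpus document and counts word frequencies. It rests on several assumptions: the inline SAMPLE document is a placeholder rather than one of the documents in the figures, the text is assumed to sit in sndTxtDoc/txtSrc, and whitespace splitting stands in for a proper Sindhi tokenizer.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Placeholder document; the Sindhi words are illustrative, not corpus data.
SAMPLE = """<sndhiLangCrps>
  <sndMetaData><fileDesc><title>Placeholder</title></fileDesc></sndMetaData>
  <sndTxtDoc><txtSrc>سنڌي ٻولي سنڌي ادب</txtSrc></sndTxtDoc>
</sndhiLangCrps>"""


def word_frequencies(xml_text: str) -> Counter:
    """Count whitespace-separated tokens in the document's text section."""
    root = ET.fromstring(xml_text)
    text = root.findtext("sndTxtDoc/txtSrc", default="")
    return Counter(text.split())


freqs = word_frequencies(SAMPLE)
# Show the most frequent token and its count.
print(freqs.most_common(1))
```

The same pattern extends to the metadata: a query such as root.findtext("sndMetaData/fileDesc/title") would retrieve the title for filtering or grouping documents.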


Figure 16: Sindhi XML Document 2

Figure 17: Sample Sindhi XML Document 3


Figure 18: Sample Sindhi XML Document 4

5. Conclusion and Future Work

The use of a corpus is essential in Natural Language Processing.
The Sindhi Language Corpus is designed using XML tags to
facilitate the processing of Sindhi text for various NLP tasks.
The XML tags have been designed to provide maximum data
facilitation and long-term usability of the Sindhi corpus. The
tag structure is segregated into two main sections, containing
the metadata and the full source document. Metadata is a crucial
part of any document, so the Sindhi corpus metadata contains many
sub tags to cater for all possible information about a document.
The use of XML for the Sindhi corpus has been fruitful and
provides a platform for further processing of Sindhi text.

6. Acknowledgment

An earlier version of this paper was presented at the
International Conference on Computing, Mathematics and
Engineering Technologies (iCoMET 2018) and was published in its
proceedings, available at IEEE Xplore (URL:
https://ieeexplore.ieee.org/iel7/8337998/8346308/08346381.pdf).
This research work was carried out in the Multimedia Animation
and Graphics (MAGic) Research Group at the Institute of
Information and Communication Technology, University of Sindh,
Jamshoro.

References

[1] Ismaili, I. A., Bhatti, Z., & Shah, A. A. (2014). Design &
Development of the Graphical User Interface for Sindhi Language.
arXiv preprint arXiv:1401.1486.

[2] Bhatti, Z., Waqas, A., Ismaili, I. A., Hakro, D. N., &
Soomro, W. J. (2014). Phonetic based soundex & shapeex algorithm
for Sindhi spell checker system. arXiv preprint arXiv:1405.3033.

[3] Rahman, M. U. (2010). Towards Sindhi corpus construction. In
Conference on Language and Technology, Lahore, Pakistan.

[4] Ko, W. K., & Phyo, T. Z. (2008, January). Selection of XML
tag set for Myanmar National Corpus. In IJCNLP (pp. 33-40).

[5] Bhatti, Z., Ismaili, I. A., Hakro, D. N., & Soomro, W. J.
(2015). Phonetic-based Sindhi spellchecker system using a hybrid
model. Digital Scholarship in the Humanities, 31(2), 264-282.

[6] Hakro, D. N., Ismaili, I. A., Talib, A. Z., Bhatti, Z., &
Mojai, G. N. (2014). Issues and challenges in Sindhi OCR. Sindh
University Research Journal-SURJ (Science Series), 46(2).

[7] Mahar, J. A., & Memon, G. Q. (2010, February). Rule based
part of speech tagging of Sindhi language. In Signal Acquisition
and Processing, 2010. ICSAP’10. International Conference on
(pp. 101-106). IEEE.

[8] Sinclair, J. (1992). Introduction. BBC English Dictionary.
London: Harper Collins; McEnery, T., & Wilson, A. (2001). Corpus
Linguistics (Second Edition). Edinburgh University Press.


[9] Rahman, S. (2005). Lexical Content and Design Case Study.
Presented at From Localization to Language Processing, Second
Regional Training of PAN Localization Project. Online
presentation version:
http://panl10n.net/Presentations/Cambodia/Shafiq/LexicalContent&Design.pdf.

[10] McEnery, A., Baker, J., Gaizauskas, R., & Cunningham, H.
(2000). EMILLE: towards a corpus of South Asian languages.
British Computing Society Machine Translation Specialist Group,
London, UK.

[11] Ijaz, M., & Hussain, S. (2007). Corpus Based Urdu Lexicon
Development. In Proceedings of the Conference on Language
Technology (CLT07), University of Peshawar, Pakistan.

[12] Hardie, A., Baker, P., McEnery, T., & Jayaram, B. D. (2006).
Corpus-building for South Asian languages. Trends in Linguistics
Studies and Monographs, 175, 211.

[13] Bojar, O., Diatka, V., Rychlý, P., Stranák, P., Suchomel,
V., Tamchyna, A., & Zeman, D. (2014, May). HindEnCorp-Hindi-English
and Hindi-only Corpus for Machine Translation. In LREC
(pp. 3550-3555).

[14] Kim, J. D., Ohta, T., Tateisi, Y., Mima, H., & Tsujii, J. I.
(2001). XML-based linguistic annotation of corpus. In Proc. of
the First NLP and XML Workshop.

[15] Shah, S. M. A., Bhatti, Z., Ismaili, I. A., & Waqas, A.
Designing XML tag based Sindhi Language Corpus. International
Conference on Computing, Mathematics and Engineering Technologies
– iCoMET 2018. IEEE Xplore.
