Vol. 2, No. 2 | July – December 2018

Developing Sindhi Text Corpus using XML Tags

Sayed Majid Ali Shah∗ Zeeshan Bhatti∗ Imdad Ali Ismaili∗

∗Institute of Information and Communication Technology, University of Sindh, Jamshoro
Corresponding Email: lakiyarimajid@gmail.com
SJCMS | P-ISSN: 2520-0755 | E-ISSN: 2522-3003 © 2018 Sukkur IBA University - All Rights Reserved
Sayed Majid (et al.), Developing Sindhi Text Corpus using XML Tags (30-37)

Abstract

The Sindhi language, although one of the oldest languages of the world, still has very limited use in the digital age because of a lack of digital content. A corpus is extremely important for facilitating natural language processing of a language's script. This research work addresses the issue of building a corpus for the Sindhi language using XML-based tagging. A tree-based XML tag structure is designed for the Sindhi corpus with two main nodes: the metadata, and the Sindhi document, which contains the main text. The corpus contains detailed metadata tags that document each relevant component of the corpus. The final corpora will be used in various natural language applications for the Sindhi language.

Keywords: Corpus, Sindhi, Sindhi Corpus, Natural Language Processing, XML

1. Introduction

Sindhi is a widely spoken language written in an Arabic-based script of 52 characters, with similar cursive ligatures and a right-to-left writing direction [1] [2]. It is considered the second most widely written and spoken language in Pakistan, after Urdu. Although Sindhi is an old language with a vast amount of literature and written resources, very little computational material and few digital corpora are available for building efficient NLP applications. Natural Language Processing applications always require a large collection of corpus data for the language. A corpus is simply a large collection of structured and unstructured text for a language. A well-defined structural format is created to store and categorize the text in large datasets, allowing computational processing and application development. These structured datasets facilitate statistical analysis and grammatical validation of the script, along with other applications of NLP [3] [4]. Corpora are considered a key prerequisite and an obligatory component for developing any Natural Language Processing application, such as spell checkers [2] [5], machine learning, speech-to-text, text-to-speech, OCR, translation and transliteration [6]. There is therefore a great need for a Sindhi language corpus that is publicly available for everyone to use.

XML has long been a key technique for designing the structure of corpora for various languages [7] [8] [9] [10]. XML is very flexible because of its tag-based structure, which allows the developer to easily extract the required information from a structured XML document. Developing the Sindhi corpus in XML would enable rule-based tagging and a structured design of the corpora, allowing easier reuse of the corpus along with broader dissemination to various NLP applications.

Since a corpus is essential for NLP applications in any language, a great deal of work has been done, from various aspects, to develop corpora for different languages. Primarily, the EMILLE project developed multilingual corpora for South Asian languages [10]. Similarly, an Urdu corpus containing 18 million words was developed by the Center for Research in Urdu Language Processing (CRULP) [11]. CRULP has also developed and released the Online Urdu Dictionary (OUD), containing 120,000 records of Urdu corpus with an 80,000-word dictionary [9]. The Bank of English corpus was developed to support dictionary work [8]. A Hindi corpus was designed by IIT Bombay to facilitate NLP development for the language [13], along with the EMILLE corpus [12].

2. XML Structure for Sindhi Corpus

In NLP application development, XML tag-based structured formats have been widely used to create structured documents for processing and developing regional language applications [14]. For this purpose, XML has been used with custom tags to structure Sindhi text into a formalized corpus for various NLP applications. The tags of the XML-based Sindhi corpus are segregated into two main sections: Metadata tags and Sindhi Text Document tags. Each section is then further divided to hold more detailed information in its sub-tags.

Figure 1: Proposed Model with Hypothesis relation

A. Sindhi Corpus Structure

The XML-based Sindhi corpus structure is divided into two main sections at the top-most level. The ‘Metadata’ tag contains the tags that describe the original source of the text and document. The ‘Sindhi Text Document’ tag is the second top-most tag and contains the actual text from the source document. The full hierarchy of the XML tag structure for the Sindhi corpus is illustrated in Figure 1. Each Sindhi corpus document is stored within XML tags according to this structure.

B. Meta Data Header of Sindhi Corpus

Metadata is defined as data about the data. This main tag therefore contains specific, detailed information about the source document. The section contains attributes such as the “Title” of the document, the “Sub Title” of the Sindhi document, if any, the “Topic” discussed in the Sindhi document, “Sub Topic”, “Book”, “Author”, “Edition”, etc. The detailed sub-tag structure of the metadata section is shown in Figure 1.
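The two-section layout described above can be sketched in a few lines of Python with the standard-library xml.etree.ElementTree module. This is only an illustration under stated assumptions, not the authors' implementation: the root and section names (sndhiLangCrps, SndMetaData, SndTxtDoc) are taken from the paper's figure captions, while the metadata child tag spellings and the helper name build_corpus_document are hypothetical, and the metadata fields are kept flat here for brevity.

```python
import xml.etree.ElementTree as ET

# Root and section names follow the paper's figure captions. The child tag
# spellings below are hypothetical stand-ins for the "Title", "Sub Title",
# "Topic", "Sub Topic", "Book", "Author" and "Edition" metadata fields.
METADATA_FIELDS = (
    "title", "subTitle", "topic", "subTopic", "book", "author", "edition",
)


def build_corpus_document(metadata, text):
    """Return a root element holding a metadata section and a text section."""
    root = ET.Element("sndhiLangCrps")
    meta = ET.SubElement(root, "SndMetaData")
    for field in METADATA_FIELDS:
        # Emit every field so that all documents share one shape;
        # a missing value becomes an empty element.
        ET.SubElement(meta, field).text = metadata.get(field, "")
    ET.SubElement(root, "SndTxtDoc").text = text
    return root


# Placeholder values only, not taken from the actual corpus.
document = build_corpus_document(
    {"title": "سنڌي ٻولي", "author": "placeholder author"},
    "سنڌي ٻولي",
)
xml_text = ET.tostring(document, encoding="unicode")
```

Writing every metadata field, even when empty, is a design choice of this sketch: it keeps all documents structurally identical, so later processing can rely on a fixed set of tags.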
The ‘Sindhi Text Document’ tag contains the raw source text, which is extracted from various sources including websites, newspapers, books and articles. This tag has two further sub-tags, “Text Description” and “Text Document”, which hold the description of the source text and the actual text file respectively.

3. Sindhi Corpus Representations

Two main custom tags are defined under the root tag, <SndMetaData> and <SndTxtDoc>, and the (+) operator in Figure 2 shows that both custom tags also have child tags.

Figure 2: Super tags of SLC (Sindhi Language Corpus)

Figure 3 shows the main Sindhi Language Corpus tag, which contains the two top-most sub-tags: the Sindhi Meta Data and Sindhi Text Document tags.

Figure 3: Root and elements tags of sndhiLangCrps

Sukkur IBA Journal of Computing and Mathematical Sciences - SJCMS | Volume 2 No. 2 July – December 2018 © Sukkur IBA University

A. Portion of SndMetaData

The XML tag <SndMetaData> “Sindhi Meta Data” is the part of the Sindhi corpus data file that holds the data about the Sindhi data itself. Its child tags contain further information about the Sindhi document.

Figure 4: Elements and child tags of SndMetaData at SndhiLangCrps

In Figure 5, the element tags of <SndMetaData> are defined as <pblsher>, <Lang> and <fileDesc>, and each of them has sub-child tags of its own, as the operator indicates.

Figure 5: Sindhi XML Corpus with MetaData

Figure 6 shows another example, the publisher tag data for a Sindhi document. In this figure, the child tags of <pblsher> are defined as the “Publisher Name”, “Author” and “Edition” custom tags. Accurate data is filled into these custom tags for the building of the SLC, while the other tags are collapsed here because they are discussed with other figures. The (-) operator shows that a tag is displayed expanded, with none of its child tags hidden.

Figure 6: Data in publisher tag

The XML tag <SndMetaData> has three element tags, named Publisher, Language and File Description.
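The (+) and (-) markers in the figures are the expand and collapse indicators of a tree view of the XML document. A minimal Python sketch of that view is given below, assuming the tag names from the figure captions; the leaf tags name, encoding and txtDoc, the sample content, and the helper outline are hypothetical and serve only to give each section a child.

```python
import xml.etree.ElementTree as ET

# Tag names down to pblsher, Lang and fileDesc follow the figure captions;
# the leaf tags and all values are hypothetical placeholders.
SAMPLE = """<sndhiLangCrps>
  <SndMetaData>
    <pblsher><name>placeholder</name></pblsher>
    <Lang><encoding>UTF-8</encoding></Lang>
    <fileDesc><title>placeholder</title></fileDesc>
  </SndMetaData>
  <SndTxtDoc><txtDoc>placeholder</txtDoc></SndTxtDoc>
</sndhiLangCrps>"""


def outline(elem, expanded=(), depth=0):
    """Return tree-view lines: '+' collapsed parent, '-' expanded parent."""
    has_children = len(elem) > 0
    is_open = has_children and elem.tag in expanded
    marker = "-" if is_open else "+" if has_children else " "
    lines = [f"{'  ' * depth}{marker} {elem.tag}"]
    if is_open:
        # Only expanded tags reveal their child tags.
        for child in elem:
            lines.extend(outline(child, expanded, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)
view = outline(root, expanded={"sndhiLangCrps", "SndMetaData"})
# The three element tags of the metadata section, in document order.
element_tags = [child.tag for child in root.find("SndMetaData")]
```

With the root and the metadata section expanded, the view lists the publisher, language and file description tags as collapsed (+) entries, while the text document section stays collapsed as a whole.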
Figure 7: Elements and child tags of pblsher in SndMetaData at SndhiLangCrps

The publisher tag <pblsher> contains the publication information, described by its child elements “Publisher Name”, “Author” and “Edition”. The “Publisher Name” tag gives the name of the publisher, the “Author” tag gives the name of the author, and the “Edition” tag describes the edition of the publication.

Figure 8: Elements and child tags of pblsher with its child’s elements edition in SndMetaData at SndhiLangCrps

The publisher tag is a child tag of Sindhi Meta Data, the Edition tag is a child tag of the publisher tag, and the Edition tag in turn has the child tags “Data Source”, “Books”, “News”, “Articles” and “Blogs”. These are all sources of information that supply the complete data for the edition, and the edition in turn completes the publisher tag.

Figure 9: Elements and child tags of Lang in SndMetaData at SndhiLangCrps

The Language tag is an element tag of <SndMetaData> and has child tags such as “Number of Records”, “Encoding” and “Data”.

Figure 10: Elements and child tags of fileDesc in SndMetaData at SndhiLangCrps

Figure 11: child tags of

In Figure 11, three child tags are filled with accurate data, while the other tags are left collapsed so that the role of these tags in the corpus is clear. File Description is an element tag of <SndMetaData> and consists of the child tags <title> “Title”, <subtopic> “Sub Topic” and <keywords> “Key Words”.

Figure 12: Elements and child tags of SndTxtDoc at SndhiLangCrps

Figure 13: Element tags of <fileDesc> </fileDesc>

Figure 13 shows the sub-tags of the custom tag <fileDesc> </fileDesc> as <title>, <subtopic> and <keywords>. Each tag has been assigned its own data.

B. Portion of Sindhi Text Document

The XML tag <SndTxtDoc> “Sindhi Text Document” is the part of the Sindhi corpus data file that holds the text. It also has child tags, such as “Text Source” and “Text Description”.

4. Sample Sindhi Corpus Document

The final Sindhi documents were initially created manually by extracting information from articles and saving it in XML tags, as discussed in [15]. A GUI form was designed that allows an XML document to be created for Sindhi text, as shown in Figure 14. Each entry was saved as an XML file following the rules and patterns discussed above.

Figure 14: GUI form for creating XML document of Sindhi corpus

Figure 15: Sample Sindhi XML document

The final version of each XML document of the Sindhi corpus contains all the relevant information, which can easily be read and processed for any NLP task, as shown in Figure 15 to Figure 18.

Figure 16: Sindhi XML Document 2

Figure 17: Sample Sindhi XML Document 3

Figure 18: Sample Sindhi XML Document 4

5. Conclusion and Future Work

The use of a corpus is essential in Natural Language Processing. The Sindhi Language Corpus is designed using XML tags to facilitate the processing of Sindhi text for various NLP tasks. The XML tags have been designed to make the data as easy to use as possible and to give the Sindhi corpus long-term usability. The tag structure is segregated into two main sections, containing the metadata and the full source document.
Metadata is a crucial part of any document, so the Sindhi corpus metadata also contains many sub-tags to cater for all possible information about a document. The use of XML for the Sindhi corpus has been very fruitful and has provided a platform for further processing of Sindhi text.

6. Acknowledgment

An earlier version of this paper was presented at the International Conference on Computing, Mathematics and Engineering Technologies (iCoMET 2018) and was published in its proceedings, available at IEEE Xplore (URL: https://ieeexplore.ieee.org/iel7/8337998/8346308/08346381.pdf). This research work was carried out in the Multimedia Animation and Graphics (MAGic) Research Group at the Institute of Information and Communication Technology, University of Sindh, Jamshoro.

References

[1] Ismaili, I. A., Bhatti, Z., & Shah, A. A. (2014). Design & Development of the Graphical User Interface for Sindhi Language. arXiv preprint arXiv:1401.1486.
[2] Bhatti, Z., Waqas, A., Ismaili, I. A., Hakro, D. N., & Soomro, W. J. (2014). Phonetic based soundex & shapeex algorithm for Sindhi spell checker system. arXiv preprint arXiv:1405.3033.
[3] Rahman, M. U. (2010). Towards Sindhi corpus construction. In Conference on Language and Technology, Lahore, Pakistan.
[4] Ko, W. K., & Phyo, T. Z. (2008, January). Selection of XML tag set for Myanmar National Corpus. In IJCNLP (pp. 33-40).
[5] Bhatti, Z., Ali Ismaili, I., Nawaz Hakro, D., & Javid Soomro, W. (2015). Phonetic-based Sindhi spellchecker system using a hybrid model. Digital Scholarship in the Humanities, 31(2), 264-282.
[6] Hakro, D. N., Ismaili, I. A., Talib, A. Z., Bhatti, Z., & Mojai, G. N. (2014). Issues and challenges in Sindhi OCR. Sindh University Research Journal-SURJ (Science Series), 46(2).
[7] Mahar, J. A., & Memon, G. Q. (2010, February). Rule based part of speech tagging of Sindhi language. In Signal Acquisition and Processing, 2010. ICSAP’10. International Conference on (pp. 101-106). IEEE.
[8] Sinclair, J. (1992). Introduction. BBC English Dictionary. London: Harper Collins. McEnery, T., & Wilson, A. (2001). Corpus Linguistics (2nd ed.). Edinburgh University Press.
[9] Rahman, S. (2005). Lexical Content and Design Case Study. Presented at From Localization to Language Processing, Second Regional Training of PAN Localization Project. Online presentation version: http://panl10n.net/Presentations/Cambodia/Shafiq/LexicalContent&Design.pdf.
[10] McEnery, A., Baker, J., Gaizauskas, R., & Cunningham, H. (2000). EMILLE: towards a corpus of South Asian languages. British Computing Society Machine Translation Specialist Group, London, UK.
[11] Ijaz, M., & Hussain, S. (2007). Corpus Based Urdu Lexicon Development. In Proceedings of the Conference on Language Technology (CLT07), University of Peshawar, Pakistan.
[12] Hardie, A., Baker, P., McEnery, T., & Jayaram, B. D. (2006). Corpus-building for South Asian languages. Trends in Linguistics Studies and Monographs, 175, 211.
[13] Bojar, O., Diatka, V., Rychlý, P., Stranák, P., Suchomel, V., Tamchyna, A., & Zeman, D. (2014, May). HindEnCorp-Hindi-English and Hindi-only Corpus for Machine Translation. In LREC (pp. 3550-3555).
[14] Kim, J. D., Ohta, T., Tateisi, Y., Mima, H., & Tsujii, J. I. (2001). XML-based linguistic annotation of corpus. In Proc. of the First NLP and XML Workshop.
[15] Shah, S. M. A., Bhatti, Z., Ismaili, I. A., & Waqas, A. (2018). Designing XML tag based Sindhi Language Corpus. In International Conference on Computing, Mathematics and Engineering Technologies (iCoMET 2018). IEEE Xplore.