International Journal on Advances in ICT for Emerging Regions 2022 15 (3)
December 2022

Neural Machine Translation for Sinhala-English Code-Mixed Text

Archchana Kugathasan, Sagara Sumathipala

Abstract: Multilingual societies use a mix of two or more languages when communicating. Code-mixing has become a popular way of communicating on social media in South Asian communities. Sinhala-English Code-Mixed Text (SCMT) is the most popular text representation used in Sri Lanka in informal contexts such as social media chats, comments and small talk. This paper addresses the challenges in utilizing SCMT sentences. The main focus of this study is translating code-mixed sentences written in Sinhala-English into the standard Sinhala language. Since Sinhala is a low-resource language, we were able to collect only a limited number of SCMT-Sinhala parallel sentences. Creating the SCMT-Sinhala parallel corpus was a time-consuming and costly task. The proposed Neural Machine Translation (NMT) architecture for translating SCMT to Sinhala combines a normalization pipeline, Long Short Term Memory (LSTM) units, a Sequence to Sequence (Seq2Seq) model and the Teacher Forcing mechanism. The proposed model is evaluated against the current state-of-the-art models using the same experimental setup, and the results indicate that Teacher Forcing combined with Seq2Seq and normalization improves the quality of the translation. The predicted outputs from the models are compared using the BLEU (Bilingual Evaluation Understudy) metric, and our proposed model achieved a better BLEU score of 33.89 in the evaluation.

Keywords: Neural Machine Translation, LSTM, Seq2Seq, Sinhala-English Code-Mixed

I. INTRODUCTION
Code-mixing has been a practice in multilingual communities.
When elements of one language, such as terms, morphemes and words, are mixed with elements of another language within a sentence, it is called code-mixing. Lexicon and syntactic formulation from two different languages are combined to generate a code-mixed sentence [1]. Communities that use more than one language for communication are called multilingual communities. Most Sri Lankans are multilingual people who speak Sinhala-English, Tamil-English, Malay-English, etc. Several research studies have shown that multilingual communities use online social media as their chosen platform to express their opinions and feelings [2]. Posts, comments, reviews, etc. are considered user-generated texts in social media. Information extraction from user-generated text is in great demand in business. Analysing the sentiment, extracting the entities, identifying the user interest and providing personalized content for users have become a trending practice in business marketing strategies that use social media [3, 4]. Code-mixing has been identified as a barrier to processing user-generated texts because of the mixing of languages. Translating code-mixed texts to a standard language has been a requirement for a long time. Due to the increasing usage of SCMT in social media, there is now a huge demand to translate SCMT into the Sinhala language. The focal point of this research study is to translate a Sinhala-English Code-Mixed (SCM) sentence into a Sinhala sentence.

Correspondence: Archchana Kugathasan (E-mail: archchanakugathasan@gmail.com). Received: 20-12-2021; Revised: 07-11-2022; Accepted: 14-11-2022.
Archchana Kugathasan is from the Sri Lanka Institute of Information Technology and Sagara Sumathipala is from the University of Moratuwa, Sri Lanka (archchanakugathasan@gmail.com, sagaras@uom.lk).
DOI: http://doi.org/10.4038/icter.v15i3.7250
© 2022 International Journal on Advances in ICT for Emerging Regions
Currently available translation systems are not very successful in translating code-mixed texts to a standard language [5]. A Sinhala-English code-mixed sentence has the syntax of the Sinhala language but borrows some vocabulary from English. Figure 1 shows an example of Sinhala code-mixed text, where ‘Price’ is an English word and ‘eka’ and ‘wadi’ are transliterated Sinhala words. Transliteration is the process of representing a word from one language using the alphabet of another language. The words ‘eka’ and ‘wadi’ are Sinhala words written with the English alphabet. Translating SCMT into Sinhala is a formidable task. The major challenge is that implementing a Machine Translation system requires a parallel corpus [6]. Such datasets are typically available for standard languages, but no data resource is available for SCMT. For this reason, an SCMT-Sinhala parallel corpus is built in this study. This paper also presents a detailed analysis of SCMT and proposes an approach that adapts prevailing models to translate SCMT to the Sinhala language. The basic architecture of the proposed model is a neural network that combines normalization, Seq2Seq, LSTM and the Teacher Forcing mechanism [7]. LSTM is very successful at learning temporal dependencies [8].
Fig. 1 Example of Sinhala-English code-mixed text with language tags

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/legalcode), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

The Seq2Seq model is chosen because it can map source and target sequences of different lengths to each other [9]. The Teacher Forcing mechanism is applied in the decoding phase of the Seq2Seq model to speed up training and reduce prediction errors. Finally, the inference model predicts the Sinhala translation for a given SCMT sentence. The BLEU evaluation metric is used to evaluate the model.

The rest of the paper is organized as follows. Section II reviews the groundwork of the research area in Normalization and Machine Translation. Section III discusses code-mixing in Sri Lanka, including the challenges in SCM sentences and the usage of code-mixed text in Sri Lanka. Section IV discusses the parallel corpus preparation and its features. Sections V and VI cover the system architecture, model, experimental setting and the obtained results. Section VII describes the evaluation study and discusses the results, and Section VIII concludes the paper.

TABLE I
SURVEY RESULT - USAGE OF SINHALA-ENGLISH CODE-MIXED TEXT

| Question in survey | Answer option | Response percentage |
|---|---|---|
| Communication method often used when communicating through text in social media platforms or other online platforms? | Using Sinhala-English code-mixed text in social media | 85.2% |
| | Using native language in social media | 8.5% |
| | Other | 6.3% |
| What is the main reason to use Sinhala-English code-mixed text? | Using Sinhala-English code-mixed text because of easiness/flexibility with the keyboard | 78.0% |
| | Interested in using Sinhala-English code-mixed text | 12.2% |
| | Other | 10.0% |
| In what kind of platforms do you use Sinhala-English code-mixed text? | Social networking sites (Facebook, Twitter, Instagram etc.) | 59.80% |
| | Chat applications (WhatsApp, Viber, Emo etc.) | 93.90% |
| | Community blogs | 8.50% |
| | Discussion forums | 7.30% |
| | Other | 1.20% |

II. RELATED WORK
A. Normalization of Code-Mixed Text
The rapid growth of user-generated texts in social media has drawn researchers to the normalization domain. Normalizing code-mixed texts can help models improve their accuracy. The first corpus for normalization was introduced by Wong and Xia (2008) [10]. A Source Channel Model, which finds the most suitable translation based on probability and phonetic mapping, is used to normalize the corpus text. This model was further improved by Xue et al. (2011) as a multi-channel model that considers the phonetic factor, orthographic factor, acronym expansion and contextual factor [11]. Two approaches were proposed by Mandal et al. (2018) [12] to convert phonetically transliterated text to standard Roman transliteration. The first approach uses a Sequence to Sequence (Seq2Seq) model with a Recurrent Neural Network (RNN) and Long Short Term Memory (LSTM) for the conversion. The second approach is based on string matching using Levenshtein Distance [13]. The first approach provided better accuracy than the second for the code-mixed text normalization task. Singh et al. (2018) [14] proposed a skip-gram [15] and edit distance [16] method to normalize the anomalies of code-mixed text such as spelling variations and grammatical errors.
Skip-gram provides a similarity metric derived from the context of a word in a given semantic space. Based on this similarity metric, the most frequently used word is substituted for the variations of the same word, which normalises the data and reduces the noise. Barik et al. (2019) [17] introduce a normalization approach that combines language identification using a CRF (Conditional Random Field) with lexical normalization, replacing OOV (Out Of Vocabulary) tokens with their standard tokens from a dictionary. Lourentzou et al. (2019) [18] and Dirkson et al. (2019) [19] proposed character-based and word-based normalization approaches for OOV words. Arora and Kansal (2019) [20] used a Convolutional Neural Network (CNN) model with character embedding to normalize unstructured and noisy texts from social media. A similar approach was followed by Kayest and Jain (2019) [21] and Liu et al. (2021) [22].

B. Machine Translation
The importance of Machine Translation (MT) has increased because of the high demand for translation in overseas business, in military services, in serving profitable customers who speak different languages, and in exploiting valuable social media content for business development. Neural Machine Translation (NMT) is the currently trending domain in Machine Translation. Recurrent Neural Networks [23], the Seq2Seq approach [8] and attention-based NMT [24] are considered trending approaches for NMT.

TABLE II
CHALLENGES IDENTIFIED IN SINHALA-ENGLISH CODE-MIXED SENTENCES

| Sinhala-English Code-Mixed Sentence (SCMT) | Sinhala Sentence | English Sentence | Identified Issues in SCMT |
|---|---|---|---|
| kama vry gd | කෑම ග ොඩොක් ග ොඳයි | Food is very good | Spelling error - The words ‘vry’ and ‘gd’ represent the English words ‘very’ and ‘good’. |
| mama wathura bonawa / mama vathura bonawa / mama vathura bonawaa | මම වතුර ග ොනවො | I drink water | Inconsistent phonetic transliteration - The same sentence is written in different patterns. The word ‘water’ is represented as ‘vathura’ or ‘wathura’ and the word ‘drinking’ is represented as ‘bonawa’ or ‘bonawaa’. |
| 4to gaththa | ඡායා රූප ගත්තා | Took photo | The use of special characters and numeric characters - The word ‘4to’ takes the phonetic sound of the word ‘four’ and combines it with ‘to’; together they represent the phonetic sound of ‘photo’. |
| service eka hondai | ස ේවාව ස ාඳයි | Service is good | Borrowing of words - The sentence starts with the English word ‘Service’ and then switches to the transliterated Sinhala words ‘eka’ and ‘hondai’. |
| teacherla hamoma enna | ගුරුවරුන් ැසමෝම එන්න | All the teachers are requested to come | Integration of suffixes - The word ‘teacher’ is an English singular noun and the suffix ‘la’ is a transliterated Sinhala suffix. Together they form a word with the plural meaning ‘teachers’. |
| niyama kama so ayeth kanna hithenava | නියම කෑම ඒ නි ා ආසයත් කන්න හිතනවා | Great food, so like to eat again | Switching for discourse marker - In this sentence, the English discourse marker ‘so’ is used to join two transliterated Sinhala clauses. |

Many studies have been carried out on translation based on monolingual datasets. Gulcehre et al. (2015) [25] present two methods, shallow and deep fusion, to combine language models with Neural Machine Translation (NMT) techniques. Sennrich et al. (2016) [26] proposed two techniques for using monolingual data in translation. In the first approach, the monolingual data is paired with dummy source inputs, and the encoder and attention model parameters are fixed when training on these pairs. The second approach uses a model trained on a parallel corpus with neural translation techniques to translate the monolingual data.
Cheng (2019) [27] proposed a semi-supervised approach for monolingual machine translation that combines a labelled and an unlabelled corpus, where the labelled corpus is a parallel corpus and the unlabelled corpus is a monolingual corpus. There are multilingual NMT models in which a single model supports translating from multiple source languages to multiple target languages. These systems encourage knowledge transfer among language pairs [28, 29], enable zero-shot translation (direct translation between a language pair that has never been used in the training phase) [30, 31, 32, 33] and enhance the translation of low-resource language pairs [34, 35]. Despite these benefits, multilingual NMT systems show poor performance [32, 34] and produce bad translations when accommodating many languages [36]. Zhang et al. (2020) [37] propose an improved NMT model in which a normalization layer and linear transformation layers are used to overcome the representation issue of other multilingual NMT models. The same study [37] also addresses how the output of multilingual NMT models is affected by the unavailability of a parallel corpus.
A Random Online Back Translation (ROBT) approach is proposed to overcome the issue of unseen training language pairs.

TABLE III
SAMPLE SENTENCES FROM THE ANNOTATED CORPUS; AN1: ANNOTATOR 1, AN2: ANNOTATOR 2

| Sinhala-English Code-Mixed Sentence | Sinhala Sentence translated by Human Translator | AN1 | AN2 | Alternate translation by Annotator 1 | Alternate translation by Annotator 2 | Finalized translation by the translator |
|---|---|---|---|---|---|---|
| gaana wadi | ගාන වැඩියි | FC | FC | N/A | N/A | N/A |
| Price ekata shape wenna hoda rasata kama hambenawa | මිලට රියන්න ස ාඳ සට ේි කෑම ම්සෙනවා | FC | CR | N/A | මිලට රියන්න ස ාඳ ර වත් කෑම ම්සෙනවා | මිලට රියන්න ස ාඳ ර වත් කෑම ම්සෙනවා |
| calm place ekak, enjoy kranna puluwn | කාම් තැනක් , එන්සජෝයි කරන්න පුළුවන් | CR | CR | න්ුන් තැනක් , විසනෝද කරන්න පුළුවන් | න්ුන් තැනක් , එන්සජෝයි කරන්න පුළුවන් | න්ුන් තැනක් , විසනෝද කරන්න පුළුවන් |
| Singappooru kola kiyalai api kiyanne me gedi hedena gahata | සිංගපූරු සකෝලා කියලයි අපි කියන්සන් සම් සගඩි ැසදන ග ට | FC | FC | N/A | N/A | N/A |
| parking loku aulak na. | පාකින් සලාකු අප ු නැත | CR | FC | වා න නැවැත්ීසම් සලාකු අප ු නැත | N/A | වා න නැවැත්ීසම් සලාකු අප ු නැත |

For code-mixed languages, only a few studies exist in the translation domain. Carrera et al. (2009) [38] introduce a qualitative study on a combined code-switched corpus from social media. According to the study, hybrid models that combine Statistical Modelling [39] with the Knowledge Translation approach [40] achieved comparatively good translation. In the code-mixed machine translation model introduced by Rijhwani et al. (2016) [6], the dominant language in a sentence is called the matrix language and the non-dominant language is called the embedded language. The initial task in this model is word-level language identification and matrix language detection. The data is then passed to an existing translator to translate code-mixed tweets to the language of the user’s choice. An augmentation pipeline for code-mixed text machine translation is proposed by Dhar et al.
(2018) [5]. They introduce a parallel corpus with code-mixed Hindi-English sentences as source sentences and English sentences as target sentences. The pipeline includes language identification, matrix language identification, translation to the matrix language, and translation to the target language. The final output of the model is a translated monolingual sentence. The augmentation pipeline is applied with current translation models such as Google’s Neural Machine Translation System (NMTS) [41], Moses [42] and Bing Translator. Each of these models provided an improved BLEU score when the augmentation pipeline was added in the pre-processing phase. Masoud et al. (2019) [43] introduced a Back Translation model for Tamil-English code-switched text. Baseline, monolingual and hybrid approaches are used to evaluate the system. The back-translated approach gave the highest BLEU score of 25.28 for the code-switched sentences.

III. CODE-MIXING IN SRI LANKA
Kachru (1986) [44] explains the necessity of English in South Asia in his research study. Many former Anglo-American colonies have been identified with English language varieties, which are described as deviations from standard English in the later developing world. According to his observations in South Asia, the English language is considered a sign of ‘modernization’, ‘achievement’ and ‘strength’. He defines code-mixing as a marker of modernization, social and economic status and membership in an aristocratic society. The widest code-mixing range is identified with the English language. The main reason for code-mixing in Sri Lanka is British colonization. Sri Lanka acknowledges Sinhala, English and Tamil as the formal languages used for official activities. Sri Lanka mainly has two code-mixed language categories, Sinhala-English and Tamil-English, but there is no mixing between the Sinhala and Tamil languages. People have massively adopted internet usage in the 21st century.
Code-mixed texts are adapted to the vocabulary and grammar of the languages used by the particular bilingual or multilingual user. The structure of the code-mixed text depends on the individual [45]. The Sinhala writing system is derived from the Brahmi script. According to the Unicode standard, the latest Sinhala alphabet has 61 characters in total: 41 consonants, 18 vowels and 2 half vowels [46]. Even though there are 61 letters, the language has only 40 different sounds represented by those letters [47]. Sinhala-English code-mixing originated from the multilingual society of Sinhala-English speaking people. Sri Lankans use SCM as one of their main communication languages in social media, and it has become very popular among the younger generation of the 21st century.

We conducted a survey to identify the necessity of translating Sinhala code-mixed text. According to a recent research study on social media usage, users aged 20-29 make up 32.2% of all social media users [56]. To identify the extent of code-mixed text usage in Sri Lankan social media, we decided that this age group would be the most appropriate for collecting reliable data, as it is the most active age group on social media. The survey had 82 participants, all native Sinhala speakers aged between 20 and 29. According to the survey results shown in Table I, 85.2% of the participants stated that they use code-mixed text rather than their native language for writing in social media. Increased usage of SCMT increases the demand for processing SCMT. The best way to use code-mixed text is to translate it into a standard language so that the data can be easily used for Machine Learning tasks such as recommendation, sentiment analysis and entity extraction.
In SCMT, there are several challenges in representing the text: spelling errors, integration of suffixes, the use of special and numeric characters, borrowing words from another language, combining languages, switching of discourse markers and inconsistent phonetic transliteration. Table II provides a detailed description of the challenges in Sinhala-English code-mixed text with examples. Because SCMT appears in many different patterns, it is difficult to translate SCMT without a parallel corpus.

IV. CORPUS CREATION
Most machine translation systems need a large number of parallel sentences to achieve a good outcome. Our study required creating a parallel corpus of SCMT and Sinhala sentences. To achieve this goal, SCM (Sinhala-English Code-Mixed) sentences were gathered from social media, and 5000 SCM sentences are used to create the parallel corpus. After the extraction process, each SCM sentence in the corpus is translated into a Sinhala sentence by a human translator who is a native Sinhala speaker. The translator followed the mapping proposed by Kugathasan and Sumathipala (2020) [48] for the manual translation process. Thus, the SCM sentence is the source sentence and the Sinhala sentence is the target sentence. The translated dataset is validated using the crowdsourcing method [49], with the aim of discriminating good translations from bad ones. We split our corpus into 15 groups so that each annotator received approximately 300 sentences, and each group had 2 annotators who are native Sinhala speakers, bilingual and proficient in English.
The annotators were instructed to make sure that each Sinhala translation has no spelling errors and is grammatically correct, natural-sounding Sinhala. The annotators assign each translated Sinhala sentence to one of two categories: Fully Correct (FC) or Change Required (CR). If a sentence is labelled CR, the same annotator also provides an alternative Sinhala translation. The alternative sentence provided for each SCM sentence was a more fluent and grammatically correct Sinhala sentence. When the annotators give contradictory tags for a specific translation, only the alternative translation that accompanies the CR tag is considered. When both annotators assign the CR tag, the best alternative is selected by the human translator who worked in the initial phase of creating the corpus. Some annotated sample sentences from the corpus are shown in Table III. After the alternatives are resolved, the corpus is updated with the corrections. We randomly chose 100 translated sentences, provided them to linguistic experts in the Sinhala language, and asked them to rate each translation as good or bad, considering the following factors: spelling errors, the grammatical pattern of the sentence, and meaningful translation. In this rating process, we obtained judgments from three different linguists, so each translation has 3 rating labels drawn from two categories. We used the Fleiss’ Kappa method [50, 51] to measure the reliability of the agreement between the raters when assigning ratings to the translated sentences. The Fleiss’ Kappa score obtained for the translation of SCMT to Sinhala is 0.88, which indicates almost perfect agreement that the translated sentences are correct.

V. SYSTEM ARCHITECTURE
The MT model proposed in this study is an adapted and enhanced version of the approach of Sutskever et al. (2014) [8]. The model consists of LSTM units, a Seq2Seq model, the Teacher Forcing mechanism and a normalization pipeline to translate the code-mixed text.
A. Sequence to Sequence (Seq2Seq)
The Seq2Seq approach introduced by Sutskever et al. (2014) [8] maps an input sequence to an output sequence even when the input and output lengths are different. For example, “Did you eat?” in English has three words as input, and its output sentence in Sinhala, “ඔයා කෑවද?”, has two words. In this approach, the sequence of the source sentence is matched with the sequence of the target sentence [20]. In this machine translation model, the source sequence is the input and the target sequence is the output. The Seq2Seq model is also called the Encoder-Decoder framework, as shown in Figure 3. The source sentence is read and used as the input to the encoder. The encoder reads the input sequence word by word and encodes its meaning into a single real-valued vector, the context vector, which is also called the hidden state. The outputs of the encoder are discarded and only the hidden states are passed on as inputs to the decoder. The decoder takes the hidden state and the start token ‘START’ as its input, and it reads its input word by word during decoding. In the training phase of the decoder, the Seq2Seq baseline model feeds the predicted output from the previous time step as the input to the next time step. In our proposed approach, we instead apply the Teacher Forcing mechanism in the training phase of the decoder and disregard the predicted outputs from the previous time steps.

Fig. 3 Seq2Seq model
Fig. 4 System diagram of the proposed model

B. Long Short Term Memory (LSTM)
The LSTM network is chosen as the basic unit for text generation with the Seq2Seq model, as shown in Figure 3. An LSTM has internal mechanisms called gates that control the flow of information. The gates decide which details to keep or forget in the cell state along a long sequence, and they learn during training which information is relevant to keep and which to throw away. An LSTM cell has three main gates: the input gate, the forget gate and the output gate, as shown in Figure 5. When an input is given to the LSTM unit, it is converted into machine-readable vectors, and this sequence of vectors is processed one by one. In the forget gate, the hidden state from the previous time step (h_{t-1}) and the current input (x_t) are passed as inputs. The forget gate has a sigmoid activation function, which squashes values into the range 0 to 1. If the sigmoid output is closer to 0, the information is forgotten, and if it is closer to 1, it is kept. In the input gate, h_{t-1} and x_t are passed into a sigmoid function and a tanh function separately. The tanh activation function squashes values into the range -1 to 1 to regulate the network. The tanh output is multiplied by the sigmoid output, so the sigmoid decides which information from the tanh output to keep and which to forget. The outputs of the forget gate and the input gate are used to update the cell state. The output gate decides the next hidden state (h_t): h_{t-1} and x_t are passed into a sigmoid function, and the newly updated cell state is passed through a tanh function. The product of the sigmoid and tanh outputs determines the information carried by the next hidden state. The updated cell state (C_t) and the hidden state (h_t) are then passed to the next time step. In this way, each LSTM unit runs through these gates to store only the important details from the sequence.

C. Teacher Forcing
Using the ground truth from the prior time step as input for the current time step, for quick and efficient training of a Recurrent Neural Network, is called the Teacher Forcing method [54]. Teacher Forcing works by using the actual output at time step t as the input at time step t+1. Figure 6 shows how the decoder of the Seq2Seq model is trained with and without Teacher Forcing. In our proposed model for translating SCMT to Sinhala, Teacher Forcing is applied in the decoding phase during training.

Fig. 5 Architecture inside an LSTM unit
Fig. 6 Example of the decoder with and without the Teacher Forcing method

VI. MODEL, EXPERIMENTAL SETTING & RESULT
The initial phase of the model is data pre-processing. The dataset is cleaned by converting the sentences to lowercase and removing emojis, quotes and unnecessary spaces. Normalization is considered an important process when it comes to the translation of code-mixed text.
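The cleaning step described above (lowercasing, removing emojis, removing quotes and collapsing unnecessary spaces) can be illustrated with a minimal sketch. The function name `clean_sentence`, the emoji code-point ranges and the set of quote characters are assumptions made for illustration; the paper does not specify the exact filters that were used.

```python
import re

# Rough emoji ranges (pictographs, dingbats, regional indicators, variation
# selector). This is an assumption for illustration, not the authors' filter.
# The zero-width joiner is deliberately left alone because Sinhala conjuncts use it.
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF\uFE0F]"
)
# Straight and curly single/double quotes (assumed set).
QUOTE_PATTERN = re.compile("[\"'\u2018\u2019\u201C\u201D]")


def clean_sentence(text: str) -> str:
    """Lowercase, drop emojis and quotes, and collapse repeated whitespace."""
    text = text.lower()
    text = EMOJI_PATTERN.sub(" ", text)
    text = QUOTE_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_sentence('  "Price eka  WADI" \U0001F600 ')
print(cleaned)
```

Lowercasing only affects the Latin-script side of the corpus, since the Sinhala script has no letter case.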
Compared to monolingual sentences, code-mixed sentences contain more noise. A dictionary-based approach and a Levenshtein Edit Distance [52] based approach are used for the normalization task in our model. Spelling errors are one of the challenges in Sinhala-English code-mixed text. For example, ‘accident’ can be misspelt as ‘accsident’, ‘accxident’, ‘acddent’, etc. This happens mainly because most bilingual users are fluent only in their native language, Sinhala, and are not experts in their second language, English. In the first step of the normalization, out-of-vocabulary English words in the texts are normalized using the Birkbeck spelling error corpus dictionary [53], which contains 36,133 misspellings of 6,136 words gathered from various sources. Slang words in the code-mixed text were identified as another barrier to the translation of SCM sentences. This issue is addressed using the SlangNorm dictionary, which contains 5427 slang words. For example, words such as ‘2mrw’ and ‘3wheeler’ are replaced with the correct forms ‘Tomorrow’ and ‘three wheeler’ using the SlangNorm dictionary. In SCMT, the same word is represented in different transliterated forms in various sentences in the corpus. The Levenshtein Edit Distance approach [52] is used to normalize the transliterations by substituting low-frequency variants with the corresponding high-frequency word based on the edit distance. A dictionary with a frequency list of the words in the corpus is maintained for this purpose. After the normalization of the sentences, a ‘START’ token is added at the beginning of each target sentence and an ‘END’ token is added at its end. These tokens help the decoder recognize when to begin and end the translation. The distinct words are identified from the source and target corpus. A unique number is allocated to each distinct word to create dictionaries that map words to indices and vice versa.
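The frequency-based transliteration normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the greedy most-frequent-first grouping and the distance threshold `max_distance=1` are hypothetical choices, not details confirmed by the paper. The toy corpus reuses the ‘wathura’/‘vathura’ and ‘bonawa’/‘bonawaa’ variants from Table II.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def build_variant_map(sentences, max_distance=1):
    """Map each word to the most frequent word within max_distance edits.

    max_distance is an assumed threshold; a real pipeline would need to tune
    it so that distinct short words are not merged by accident.
    """
    freq = Counter(w for s in sentences for w in s.split())
    # Most frequent words first; ties broken alphabetically for determinism.
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    canonical, mapping = [], {}
    for word in ranked:
        target = next(
            (c for c in canonical if levenshtein(word, c) <= max_distance), None
        )
        if target is None:
            canonical.append(word)
            mapping[word] = word
        else:
            mapping[word] = target
    return mapping


def normalize(sentence, mapping):
    """Replace every known variant with its high-frequency form."""
    return " ".join(mapping.get(w, w) for w in sentence.split())


corpus = [
    "mama wathura bonawa",
    "mama wathura bonawa",
    "mama vathura bonawaa",
]
variant_map = build_variant_map(corpus)
print(normalize("mama vathura bonawaa", variant_map))
```

In this toy corpus the rarer spellings are one edit away from the more frequent ones, so all three sentences collapse to the same normalized form.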
The word-to-index and index-to-word dictionaries are used in the embedding phase of the encoder and decoder. In this research, a Seq2Seq model is built using the LSTM as its basic unit. The source sequence is mapped to the target sequence, where the source sequence is the SCM sentence and the target sequence is the Sinhala sentence. The first hidden layer of the encoder is the embedding layer, in which large sparse vectors are transformed into a dense, lower-dimensional space. Semantic relationships are preserved by the LSTM units despite this transformation. The outputs of the encoder are discarded and only the hidden states, which form the context vector, are passed to the decoder. The decoder also has an embedding layer as its first hidden layer. The hidden states passed from the encoder and the outputs of the decoder's embedding layer are taken as the input of the LSTM layer in the decoder. The Teacher Forcing mechanism is applied in the training part of the decoder: the decoder is trained to predict the word at time step t+1 from the actual output at time step t, not from the predicted output. This lets the model learn from the actual values rather than from wrongly predicted values. The LSTM layer in the decoder returns its internal states and output sequences. The internal states are stored and used in the prediction phase. A dense layer with Softmax activation is then applied to generate the decoder outputs.

TABLE IV. EXAMPLES OF PREDICTED SINHALA TRANSLATIONS AND BLEU SCORE COUNTS. THE REF AND PRE COLUMNS GIVE THE NUMBER OF WORDS IN THE REFERENCE SENTENCE AND THE PREDICTED SENTENCE; THE REMAINING COLUMNS GIVE THE N-GRAM COUNTS (CLIPPED COUNT / TOTAL COUNT) USED FOR THE CALCULATION OF MODIFIED PRECISION.

| No | Input | Reference | Prediction | REF | PRE | 1-gram | 2-gram | 3-gram | 4-gram |
|---|---|---|---|---|---|---|---|---|---|
| 1 | ganan wadi | ගාන වැඩියි | ගණන් වැඩියි | 2 | 2 | 1/2 | 0/1 | 0/1 | 0/1 |
| 2 | Budu saranai dewi pihitai | බුදු රණයි සදවි පිහිටයි | බුදු රණයි සදවි පිහිටයි | 4 | 4 | 4/4 | 3/3 | 2/2 | 1/1 |
| 3 | place eka super clean | තැන ුපිරි පිරිසදුයි | තැන ුපිරි පිරිසදුයි | 3 | 3 | 3/3 | 2/2 | 1/1 | 0/1 |
| 4 | kama raha unta gana hondatama wadi eh gaanata worth na | කෑම ර උනාට ගාන ස ාඳටම වැඩියි ඒ ගානට වින්සන් නෑ | කෑම ර උනාට ගාන ස ාඳටම වැඩියි ඒ ගානට වින්සන් නෑ | 10 | 10 | 10/10 | 9/9 | 8/8 | 7/7 |
| 5 | Price eka tikak wadi Customer service eka madi Staff eka thawa improve wenna one | මිල ිකක් වැඩියි පාරිස ෝගික ස ේවය මදියි කාර්ය මණ්ඩලය වැඩි දියුණු කළ යුතුයි | මිල ිකක් වැඩියි ැෙැයි කාර්ය මණ්ඩලය වැඩි | 12 | 7 | 6/7 | 4/6 | 2/5 | 0/4 |
| 6 | Meya hithan inne I phone thiyenne photo ganna witarai kiyala | සමයා හිතන් ඉන්සන් අයි ස ෝන් තිසයන්සන් ස ාසටා ගන්න විතරයි තියන්සන් කියලා | සමයා තියන්සන් කියලා | 11 | 3 | 3/3 | 1/2 | 0/1 | 0/1 |
| 7 | mn recommend karana thanak | මන් නිර්සේශ කරන තැනක් | මන් නිර්සේශ කරන තැනක් | 4 | 4 | 4/4 | 3/3 | 2/2 | 1/1 |
| 8 | main road eka laga nisa noisy | ප්රධාන පාර ළඟ නි ා ේද වැඩියි | පාර ළඟ නි ා ේද වැඩියි | 6 | 5 | 5/5 | 4/4 | 3/3 | 2/2 |
| 9 | kaama echchara special naha | කෑම එච්චර විසශේෂ නෑ ැ | කෑම එච්චර විසශේෂ නෑ ැ | 4 | 4 | 4/4 | 3/3 | 2/2 | 1/1 |
| 10 | kama denna puluwan | කෑම සදන්න පුළුවන් | කෑම සදන්න පුළුවන් | 3 | 3 | 3/3 | 2/2 | 1/1 | 0/1 |

The data is shuffled before training to reduce variance, so that the model overfits less and is more robust. We allocate 70% of the dataset for training and 30% for testing. Encoder and decoder inputs are in the shape of a 2D array. The encoder 2D array has a batch size of 10 and a maximum source sentence length of 27, so the shape of the encoder input is (10, 27). The decoder 2D array has a batch size of 10 and a maximum target sentence length of 26, so the shape of the decoder input is (10, 26). Decoder outputs are in the shape of a 3D array with a batch size of 10 and a maximum target sentence length of 26. NumPy, Pandas, TensorFlow and Sacrebleu are some of the important libraries used to build the model.
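The way Teacher Forcing shapes the training data can be illustrated without any deep learning library. The sketch below is not the authors' code: it uses plain Python lists instead of NumPy arrays, placeholder target tokens (`t1`, `t2`, ...) instead of Sinhala words, index 0 as an assumed padding value, and it assumes that the third dimension of the decoder output is the target vocabulary size, which the text does not state. It shows that the decoder input is the reference sentence beginning with ‘START’, while the decoder target is the same sentence shifted one step to the left and ending with ‘END’.

```python
START, END, PAD = "START", "END", 0  # index 0 reserved for padding (assumption)


def build_index(token_lists):
    """Create word-to-index and index-to-word dictionaries (indices start at 1)."""
    vocabulary = sorted({token for tokens in token_lists for token in tokens})
    word_to_index = {word: i for i, word in enumerate(vocabulary, start=1)}
    index_to_word = {i: word for word, i in word_to_index.items()}
    return word_to_index, index_to_word


def pad(ids, length):
    """Right-pad a list of indices with PAD up to the given length."""
    return ids + [PAD] * (length - len(ids))


def make_batch(sources, targets, max_source=27, max_target=26):
    """Build encoder input, decoder input and one-hot decoder target for one batch."""
    source_tokens = [s.split() for s in sources]
    target_tokens = [[START] + t.split() + [END] for t in targets]
    source_index, _ = build_index(source_tokens)
    target_index, _ = build_index(target_tokens)
    vocab_size = len(target_index) + 1  # +1 for the padding index

    encoder_input = [pad([source_index[w] for w in tokens], max_source)
                     for tokens in source_tokens]
    # Teacher forcing: the decoder reads the reference sentence without END ...
    decoder_input = [pad([target_index[w] for w in tokens[:-1]], max_target)
                     for tokens in target_tokens]
    # ... and must predict the same sentence shifted one step left (without START).
    decoder_target = []
    for tokens in target_tokens:
        ids = pad([target_index[w] for w in tokens[1:]], max_target)
        one_hot = []
        for i in ids:
            row = [0.0] * vocab_size
            if i != PAD:
                row[i] = 1.0  # padded positions stay all-zero
            one_hot.append(row)
        decoder_target.append(one_hot)
    return encoder_input, decoder_input, decoder_target, target_index


# Ten toy sentence pairs, matching the batch size of 10 used in the text.
sources = ["place eka super clean", "kama denna puluwan"] * 5
targets = ["t1 t2 t3", "t4 t5 t6"] * 5  # placeholder target tokens
enc, dec_in, dec_out, target_index = make_batch(sources, targets)
print(len(enc), len(enc[0]))                              # 10 27
print(len(dec_in), len(dec_in[0]))                        # 10 26
print(len(dec_out), len(dec_out[0]), len(dec_out[0][0]))  # 10 26 9
```

In the prediction phase no reference sentence is available, so the decoder input at each step has to come from the model's own previous prediction instead of from `decoder_input`.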
After the training phase, a prediction phase is implemented to produce the translation outputs. In the prediction phase, an input sequence from the corpus (an SCM sentence) is provided and its Sinhala translation is predicted. This phase uses the encoder-decoder framework without the Teacher Forcing mechanism: the output predicted at the previous time step t is fed in at the current time step t+1 instead of the actual output. Figure 4 shows the system architecture of the proposed model.

VII. EVALUATION & DISCUSSION

We evaluated the performance of our system by comparing our model with the most commonly used translation models. We applied our dataset to the Seq2Seq Baseline model [8] and the Attention model [24] with the same experimental setting. Each model was trained both with and without the normalization pipeline. After training the models, we evaluated the translation outputs using the BLEU [55] metric.

$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} W_n \log P_n\right) \tag{1}$$

$$BP = \begin{cases} 1 & \text{if } c > r \\ \exp\left(1 - \dfrac{r}{c}\right) & \text{if } c \le r \end{cases} \tag{2}$$

In the BLEU score equation (1), BP is the Brevity Penalty, N is the maximum n-gram order (1-gram, 2-gram, 3-gram and 4-gram are used), W_n is the weight for each modified precision and P_n is the modified precision [55]. P_n for each n-gram order up to 4 is calculated from the clipped count and the total number of n-grams of that order in the predicted sentence [55]. When the n-gram order is greater than the length of the predicted sentence, the total number of n-grams is set to 1 to avoid a division-by-zero error. The Brevity Penalty (BP) in equation (2) depends on c, the count of unigrams in all the predicted sentences, and r, the best-matching reference length in the corpus. One hundred Sinhala code-mixed sentences are selected from the corpus, and their Sinhala translations are predicted using our proposed model.
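Equations (1) and (2), together with the clipped-count definition of the modified precision, can be sketched as follows. This is an illustrative reimplementation rather than the evaluation script used in the study: it assumes one reference per sentence (so r is the sum of the reference lengths), uniform weights W_n = 1/N, natural logarithms, and corpus-level aggregation by summing the per-sentence counts. The Latin tokens in the usage example are placeholders, and the function names are invented for the example.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_and_total(reference, prediction, n):
    """Return (clipped count, total count) of predicted n-grams of order n."""
    predicted = ngram_counts(prediction, n)
    allowed = ngram_counts(reference, n)
    clipped = sum(min(count, allowed[gram]) for gram, count in predicted.items())
    total = sum(predicted.values())
    # A prediction shorter than n has no n-grams; use 1 to avoid dividing by zero.
    return clipped, max(total, 1)


def corpus_bleu(references, predictions, max_n=4):
    """BLEU = BP * exp(sum_n W_n * log P_n), with uniform weights W_n = 1/max_n."""
    weights = [1.0 / max_n] * max_n
    clipped_sum = [0] * max_n
    total_sum = [0] * max_n
    c = r = 0  # predicted length and reference length over the whole corpus
    for reference, prediction in zip(references, predictions):
        c += len(prediction)
        r += len(reference)
        for n in range(1, max_n + 1):
            clipped, total = clipped_and_total(reference, prediction, n)
            clipped_sum[n - 1] += clipped
            total_sum[n - 1] += total
    if c == 0 or min(clipped_sum) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is zero in this case
    log_precision = sum(w * math.log(clipped / total)
                        for w, clipped, total in zip(weights, clipped_sum, total_sum))
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_precision)


# A two-word prediction sharing one word with its reference (placeholder tokens).
reference = "gaana wadiyi".split()
prediction = "ganan wadiyi".split()
print([clipped_and_total(reference, prediction, n) for n in range(1, 5)])
# [(1, 2), (0, 1), (0, 1), (0, 1)]
```

Because a single missing n-gram order drives the unsmoothed score to zero, BLEU is more informative when computed over a set of sentences than over one short sentence.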
Initially, the clipped counts [55] and the total number of each n-gram in the predicted sentence are extracted to calculate the modified precision, as shown in Table IV. Then the overall BLEU score is calculated for those hundred sentences. Finally, the same evaluation approach, with the same experimental setting as explained in Section IV, is applied to the Seq2Seq Baseline model and the Attention model, with and without the normalization task. A summary of the comparison among the models is shown in Table V.

TABLE V. COMPARISON OF RESULTS RECEIVED FROM DIFFERENT MODELS (PRECISION WEIGHTS W1 = W2 = W3 = W4 = 0.25)

| Model | Training Accuracy | Training Loss | Testing Accuracy | Testing Loss | 1-gram W1*log(P1) | 2-gram W2*log(P2) | 3-gram W3*log(P3) | 4-gram W4*log(P4) | Brevity Penalty (BP) | BLEU Score |
|---|---|---|---|---|---|---|---|---|---|---|
| Seq2Seq Baseline Model without Normalization | 53.83 | 1.4032 | 27.92 | 1.76 | -0.16229 | -0.323259 | -0.496841 | -0.628076 | 0.6397 | 12.78 |
| Seq2Seq Baseline Model + Normalization | 57.11 | 0.7753 | 31.97 | 1.75 | -0.145237 | -0.204693 | -0.275824 | -0.389159 | 0.573 | 20.77 |
| Seq2Seq + Attention without Normalization | 70.55 | 0.303 | 30.3 | 1.15 | -0.080998 | -0.162399 | -0.252416 | -0.369135 | 0.6876 | 28.95 |
| Seq2Seq + Attention + Normalization | 70.22 | 0.5023 | 31.05 | 1.05 | -0.0689162 | -0.141996 | -0.208556 | -0.292517 | 0.6413 | 31.46 |
| Seq2Seq + Teacher Forcing without Normalization | 71.42 | 0.5095 | 37.17 | 0.38 | -0.066960 | -0.1232 | -0.181972 | -0.262455 | 0.595 | 31.54 |
| Seq2Seq + Teacher Forcing + Normalization | 71.57 | 0.4979 | 37.87 | 0.38 | -0.06046 | -0.1232717 | -0.189274 | -0.251089 | 0.6326 | 33.89 |

The Seq2Seq Baseline model, both with and without normalization, showed the lowest performance and achieved the lowest BLEU scores compared to the other two models. Among the Attention and Teacher Forcing models, the best BLEU score, 33.89, was obtained with the Teacher Forcing algorithm, which shows that the proposed model works comparatively well with Sinhala-English code-mixed text. The comparison with and without the normalization task also demonstrated that each model achieved a better BLEU score when the normalization pipeline was applied. Besides the BLEU score, the proposed model also achieved comparatively good values for training and testing accuracy and loss, as shown in Figure 8.

Fig 8. Experimented models' accuracies, losses and relevant BLEU scores

An analysis of the predicted sentences was performed to identify whether the proposed model helped to overcome the challenges pointed out in Table II. Consider sample sentence (1) shown in Table IV (code-mixed text: CMT, reference text: REF, translated text: TRANS):

CMT : gaana wadi (1)
REF : ගාන වැඩියි
TRANS : ගණන් වැඩියි

In sentence (1), even though TRANS does not exactly match the REF sentence, the meaning of both sentences is the same, and the prediction is correct. Consider sample sentence (3) shown in Table IV:

CMT : place eka super clean (3)
REF : තැන ුපිරි පිරිසදුයි
TRANS : තැන ුපිරි පිරිසදුයි

In sentence (3), the CMT sentence contains the English words ‘place’, ‘super’ and ‘clean’. In TRANS these words are translated to Sinhala. This translation shows that the issue of borrowing words from another language is handled by our proposed translation model. Consider sample sentences (9) and (10) shown in Table IV:

CMT : kaama echchara special naha (9)
TRANS : කෑම එච්චර විසශේෂ නෑ ැ
CMT : kama denna puluwan (10)
TRANS : කෑම සදන්න පුළුවන්

Sentences (9) and (10) contain the same word in two different transliterated forms. In the predicted sentences, both ‘kaama’ and ‘kama’ are correctly identified as the single Sinhala word ‘කෑම’. The transliteration issue has therefore also been solved with our model.
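As a consistency check, the BLEU scores in Table V can be recomputed from the weighted log-precisions and brevity penalties reported in the same table by applying equation (1). The short sketch below does this arithmetic; it assumes natural logarithms and that the reported scores are the equation (1) values multiplied by 100, both of which are inferred from the numbers rather than stated in the text. The dictionary and function names are illustrative.

```python
import math

# model: (W_n * log(P_n) for n = 1..4, brevity penalty, reported BLEU), from Table V
TABLE_V = {
    "Seq2Seq Baseline without Normalization":
        ([-0.16229, -0.323259, -0.496841, -0.628076], 0.6397, 12.78),
    "Seq2Seq Baseline + Normalization":
        ([-0.145237, -0.204693, -0.275824, -0.389159], 0.573, 20.77),
    "Seq2Seq + Attention without Normalization":
        ([-0.080998, -0.162399, -0.252416, -0.369135], 0.6876, 28.95),
    "Seq2Seq + Attention + Normalization":
        ([-0.0689162, -0.141996, -0.208556, -0.292517], 0.6413, 31.46),
    "Seq2Seq + Teacher Forcing without Normalization":
        ([-0.066960, -0.1232, -0.181972, -0.262455], 0.595, 31.54),
    "Seq2Seq + Teacher Forcing + Normalization":
        ([-0.06046, -0.1232717, -0.189274, -0.251089], 0.6326, 33.89),
}


def bleu_from_components(weighted_logs, brevity_penalty):
    """Equation (1): BLEU = BP * exp(sum of W_n * log P_n), scaled to 0-100."""
    return 100.0 * brevity_penalty * math.exp(sum(weighted_logs))


for model, (weighted_logs, bp, reported) in TABLE_V.items():
    recomputed = bleu_from_components(weighted_logs, bp)
    print(f"{model}: recomputed {recomputed:.2f}, reported {reported}")
```

Under these assumptions the recomputed values agree with the reported scores to within rounding of the tabulated inputs.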
The issues caused by special characters and numeric characters were resolved in the normalization phase with the SlangNorm dictionary.

VIII. CONCLUSION

The main goal of this research is to utilize user-generated Sinhala-English code-mixed sentences by converting them into a standard language, so that code-mixed texts can also be used for research and business purposes. By analyzing the challenges in SCMT, we pointed out the key issues that have been a barrier to processing Sinhala-English code-mixed text. Creating a dataset for this research study was a challenging task due to the unavailability of existing resources. The dataset was created through several processes: manual translation by a human translator, crowdsourcing to check whether the human-translated sentences are correct, and rating of the translations by linguistic experts to compute the Fleiss’ Kappa score. The resulting score of 0.88 shows almost perfect agreement on the translations. The corpus created in this study, following well-defined rules, could promote research in the Sinhala code-mixed domain. The proposed approach, which combines the Seq2Seq model with LSTM units and the Teacher Forcing mechanism, gives a BLEU score of 33.89 for code-mixed text translation, higher than the other models evaluated. Moreover, the evaluation study shows that most of the challenges identified in SCM sentences can be solved using our proposed model. However, a few of the challenges, such as the integration of suffixes and the change of discourse marker, remain unsolved. This research study can be considered an initiative for Sinhala-English code-mixed text translation. As future work, we plan to address the remaining challenges that we were not able to solve with the current proposed model.
Furthermore, we would like to extend the corpus to focus on other code-mixing tasks such as sentiment analysis, language identification and entity extraction.

REFERENCES

[1] E. E. Davies and A. Bentahila, “Contact linguistics: Bilingual encounters and grammatical outcomes,” 2007.
[2] K. R. Chandu, M. Chinnakotla, A. W. Black, and M. Shrivastava, “Webshodh: A code mixed factoid question answering system for web,” in International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, 2017, pp. 104–111.
[3] M. Yang, Y. Ren, and G. Adomavicius, “Understanding user-generated content and customer engagement on facebook business pages,” Information Systems Research, vol. 30, no. 3, pp. 839–855, 2019.
[4] E. Qualman, Socialnomics: How social media transforms the way we live and do business. John Wiley & Sons, 2012.
[5] M. Dhar, V. Kumar, and M. Shrivastava, “Enabling code-mixed translation: Parallel corpus creation and MT augmentation approach,” in Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing. Santa Fe, New Mexico, USA: Association for Computational Linguistics, Aug. 2018, pp. 131–140. [Online]. Available: https://www.aclweb.org/anthology/W18-3817
[6] S. Rijhwani, R. Sequiera, M. C. Choudhury, and K. Bali, “Translating codemixed tweets: A language detection based system,” in 3rd Workshop on Indian Language Data Resource and Evaluation-WILDRE-3, 2016, pp. 81–82.
[7] P. Goyal, S. Pandey, and K. Jain, “Deep learning for natural language processing,” New York: Apress, 2018.
[8] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” arXiv preprint arXiv:1409.3215, 2014.
[9] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1724–1734. [Online]. Available: https://www.aclweb.org/anthology/D14-1179
[10] K.-F. Wong and Y. Xia, “Normalization of chinese chat language,” Language Resources and Evaluation, vol. 42, no. 2, pp. 219–242, 2008.
[11] Z. Xue, D. Yin, and B. D. Davison, “Normalizing microtext,” in Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence. Citeseer, 2011.
[12] S. Mandal, S. D. Das, and D. Das, “Language identification of bengali-english code-mixed data using character & phonetic based lstm models,” arXiv preprint arXiv:1803.03859, 2018.
[13] L. Yujian and L. Bo, “A normalized levenshtein distance metric,” IEEE transactions on pattern analysis and machine intelligence, vol. 29, no. 6, pp. 1091–1095, 2007.
[14] R. Singh, N. Choudhary, and M. Shrivastava, “Automatic normalization of word variations in code-mixed social media text,” arXiv preprint arXiv:1804.00804, 2018.
[15] D. Guthrie, B. Allison, W. Liu, L. Guthrie, and Y. Wilks, “A closer look at skip-gram modelling.” in LREC, vol. 6. Citeseer, 2006, pp. 1222–1225.
[16] E. S. Ristad and P. N. Yianilos, “Learning string-edit distance,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 5, pp. 522–532, 1998.
[17] A. M. Barik, R. Mahendra, and M. Adriani, “Normalization of indonesian-english code-mixed twitter data,” in Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), 2019, pp. 417–424.
[18] I. Lourentzou, K. Manghnani, and C. Zhai, “Adapting sequence to sequence models for text normalization in social media,” in Proceedings of the International AAAI Conference on Web and Social Media, vol. 13, 2019, pp. 335–345.
[19] A. Dirkson, S. Verberne, A. Sarker, and W. Kraaij, “Data-driven lexical normalization for medical social media,” Multimodal Technologies and Interaction, vol. 3, no. 3, p. 60, 2019.
[20] M. Arora and V. Kansal, “Character level embedding with deep convolutional neural network for text normalization of unstructured data for twitter sentiment analysis,” Social Network Analysis and Mining, vol. 9, no. 1, pp. 1–14, 2019.
[21] M. Kayest and S. K. Jain, “An incremental learning approach for the text categorization using hybrid optimization,” International Journal of Intelligent Computing and Cybernetics, 2019.
[22] J. Liu, S. Zheng, G. Xu, and M. Lin, “Cross-domain sentiment aware word embeddings for review sentiment analysis,” International Journal of Machine Learning and Cybernetics, vol. 12, no. 2, pp. 343–354, 2021.
[23] N. Kalchbrenner and P. Blunsom, “Recurrent continuous translation models,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Seattle, Washington, USA: Association for Computational Linguistics, Oct. 2013, pp. 1700–1709. [Online]. Available: https://www.aclweb.org/anthology/D13-1176
[24] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[25] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares, H. Schwenk, and Y. Bengio, “On using monolingual corpora in neural machine translation,” arXiv preprint arXiv:1503.03535, 2015.
[26] R. Sennrich, B. Haddow, and A. Birch, “Improving neural machine translation models with monolingual data,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 86–96. [Online]. Available: https://www.aclweb.org/anthology/P16-1009
[27] Y. Cheng, “Semi-supervised learning for neural machine translation,” in Joint training for neural machine translation. Springer, 2019, pp. 25–40.
[28] S. M. Lakew, M. Cettolo, and M. Federico, “A comparison of transformer and recurrent neural networks on multilingual neural machine translation,” arXiv preprint arXiv:1806.06957, 2018.
[29] X. Tan, J. Chen, D. He, Y. Xia, T. Qin, and T.-Y. Liu, “Multilingual neural machine translation with language clustering,” arXiv preprint arXiv:1908.09324, 2019.
[30] M. Al-Shedivat and A. Parikh, “Consistency by agreement in zero-shot neural machine translation,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 1184–1197. [Online]. Available: https://www.aclweb.org/anthology/N19-1121
[31] J. Gu, Y. Wang, K. Cho, and V. O. Li, “Improved zero-shot neural machine translation via ignoring spurious correlations,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 1258–1268. [Online]. Available: https://www.aclweb.org/anthology/P19-1121
[32] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean, “Google’s multilingual neural machine translation system: Enabling zero-shot translation,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 339–351, 2017. [Online]. Available: https://www.aclweb.org/anthology/Q17-1024
[33] O. Firat, B. Sankaran, Y. Al-onaizan, F. T. Yarman Vural, and K. Cho, “Zero-resource translation with multi-lingual neural machine translation,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics, Nov. 2016, pp. 268–277. [Online]. Available: https://www.aclweb.org/anthology/D16-1026
[34] N. Arivazhagan, A. Bapna, O. Firat, D. Lepikhin, M. Johnson, M. Krikun, M. X. Chen, Y. Cao, G. Foster, C. Cherry et al., “Massively multilingual neural machine translation in the wild: Findings and challenges,” arXiv preprint arXiv:1907.05019, 2019.
[35] T.-L. Ha, J. Niehues, and A. Waibel, “Toward multilingual neural machine translation with universal encoder and decoder,” arXiv preprint arXiv:1611.04798, 2016.
[36] K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou, “Proceedings of the 2021 conference of the north american chapter of the association for computational linguistics: Human language technologies,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021.
[37] B. Zhang, P. Williams, I. Titov, and R. Sennrich, “Improving massively multilingual neural machine translation and zero-shot translation,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, Jul. 2020, pp. 1628–1639. [Online]. Available: https://www.aclweb.org/anthology/2020.acl-main.148
[38] J. Carrera, O. Beregovaya, and A. Yanishevsky, “Machine translation for cross-language social media,” PROMT Americas Inc, 2009.
[39] M. C. Neale, S. M. Boker, G. Xie, and H. M. Maes, “Statistical modeling,” Richmond, VA: Department of Psychiatry, Virginia Commonwealth University, 1999.
[40] P. Sudsawad, Knowledge translation: introduction to models, strategies and measures. Southwest Educational Development Laboratory, National Center for the …, 2007.
[41] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
[42] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, “Moses: Open source toolkit for statistical machine translation,” in Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions. Prague, Czech Republic: Association for Computational Linguistics, Jun. 2007, pp. 177–180. [Online]. Available: https://www.aclweb.org/anthology/P07-2045
[43] M. Masoud, D. Torregrosa, P. Buitelaar, and M. Arčan, “Back-translation approach for code-switching machine translation: A case study,” in 27th AIAI Irish Conference on Artificial Intelligence and Cognitive Science. AICS2019, 2019.
[44] B. B. Kachru, “The power and politics of english,” World Englishes, vol. 5, no. 2-3, pp. 121–140, 1986.
[45] N. Choudhary, R. Singh, I. Bindlish, and M. Shrivastava, “Sentiment analysis of code-mixed languages leveraging resource rich languages,” arXiv preprint arXiv:1804.00806, 2018.
[46] M. Punchimudiyanse and R. Meegama, “Unicode sinhala and phonetic english bi-directional conversion for sinhala speech recognizer.” IEEE International Conference on Industrial and Information Systems 2015, 2015.
[47] A. M. Gunasekara, A comprehensive grammar of the Sinhalese language. Asian Educational Services, 1999.
[48] A. Kugathasan and S. Sumathipala, “Standardizing sinhala code-mixed text using dictionary based approach,” in 2020 International Conference on Image Processing and Robotics (ICIP). IEEE, 2020, pp. 1–6.
[49] E. Estellés-Arolas and F. González-Ladrón-de-Guevara, “Towards an integrated crowdsourcing definition,” Journal of Information science, vol. 38, no. 2, pp. 189–200, 2012.
[50] T. R. Nichols, P. M. Wisner, G. Cripe, and L. Gulabchand, “Putting the kappa statistic to use,” The Quality Assurance Journal, vol. 13, no. 3-4, pp. 57–61, 2010.
[51] J. J. Randolph, “Online kappa calculator,” Retrieved October, vol. 20, p. 2011, 2008.
[52] G. Navarro, “A guided tour to approximate string matching,” ACM computing surveys (CSUR), vol. 33, no. 1, pp. 31–88, 2001.
[53] “Birkbeck spelling error corpus / roger mitton,” oxford Text Archive. [Online]. Available: http://hdl.handle.net/20.500.12024/0643
[54] I. Goodfellow, Y. Bengio, and A. Courville, “Deep learning (adaptive computation and machine learning series),” p. 372, 2016.
[55] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[56] K. Simon, “DIGITAL 2022: GLOBAL OVERVIEW REPORT,” Jan. 26, 2022. https://datareportal.com/reports/digital-2022-global-overview-report