Engineering, Technology & Applied Science Research, Vol. 8, No. 5, 2018, 3475-3478
www.etasr.com

A Review on Machine Translation in Indian Languages

Deepti Chopra
Department of Computer Science, Banasthali University, Newai, India
deeptichopra11@yahoo.co.in

Nisheeth Joshi
Department of Computer Science, Banasthali University, Newai, India
jnisheeth@banasthali.in

Iti Mathur
Department of Computer Science, Banasthali University, Newai, India
mathur.iti@rediffmail.com

Abstract: Machine translation (MT) is an important task that makes information in documents written in different languages accessible. In this paper, we discuss the different approaches to MT, the problems faced in MT for Indian languages, and the limitations of some existing MT systems, and we review the work done so far on MT for Indian languages.

Keywords: machine translation; rule based approach; example based approach; natural language processing; bilingual evaluation understudy

I. INTRODUCTION

Machine translation may be defined as the conversion of text from one language, called the source language, to another language, called the target language. There are two kinds of translation, metaphrase and paraphrase. In metaphrase, an exact word-to-word translation takes place, but the translated text may or may not have the same semantics as the source text. In paraphrase, translation is performed not at word level but at sentence level, and the semantics of the source text are preserved in the translated text. MT is one application of natural language processing. MT involves five major tasks:

1. Morphological analysis is the process of generating all possible roots from word level information.
2. Part-of-speech tagging is the process of assigning a part-of-speech tag to every word in a given sentence.
3. Chunking is the process of identifying phrases such as noun phrase (NP), adjective phrase (JJP), verb phrase (VP) etc. in a given sentence.
4. Parsing is the process of generating a parse tree with the help of the information obtained from part-of-speech tagging and chunking.
5. Word sense disambiguation is the process of identifying the meaning of a word in a particular sentence when that word has multiple meanings.

II. PROBLEMS FACED IN MT IN INDIAN LANGUAGES

Problems faced in MT in Indian languages include:

1. Indian languages are free word order languages.
2. They are morphologically and inflectionally rich languages.
3. Named entity recognition (NER) can be used to improve MT. However, NER in Indian languages is not an easy task, since these languages do not provide the capitalization information that helps in performing NER.
4. Many common nouns also exist as proper nouns, so these languages involve a large amount of semantic ambiguity.
5. There is a scarcity of resources for Indian languages on the web.

Today, many machine translators are available for Indian languages, but they still do not produce translations with very high accuracy. Consider the following source text: “Jammu and Kashmir, India’s one of the most picturesque state lies on the peaks of Himalayan Ranges with varying topography and culture. Jammu was the stronghold of Hindu Dogra kings and abounds with popular temples and secluded forest retreats. Kashmir’s capital city, Srinagar offers delightful holidays on the lakes with their shikaras and houseboats”. This source text in English was translated into Hindi using different machine translators. The translations are shown in Figure 1. The translated texts obtained from the existing machine translators are not of good quality. Some of the words in the translated text appear in English, and some of the words are transliterated instead of translated.
So, there is a need to develop a machine translator that can produce good translations.

Fig. 1. Output of existing machine translators

III. APPROACHES OF MT

Various approaches of MT are depicted in Figure 2. These include the following:

• Direct machine translation
• Rule based machine translation
• Corpus based machine translation, including statistical machine translation and example based machine translation.

The description, advantages and disadvantages of the different approaches are shown in Table I. A hybrid approach involves a combination of the approaches listed above. MT quality is expected to improve if a hybrid approach is used to perform MT in Indian languages.

Fig. 2. Classification of approaches of MT

TABLE I. MT APPROACHES

| Approach | Description | Advantages/Disadvantages |
|---|---|---|
| Direct MT | No parallel corpus is used. It makes use of a bilingual dictionary and of target language and source language corpora. | Produces good accuracy. Less tedious and less time consuming. |
| Rule Based MT | Rules are constructed. Involves analysis of source and target text at syntactic, semantic and morphological level. | Good quality translation. Complex rules need to be constructed. It involves tedious tasks and is time consuming. |
| Corpus based MT | Rules are constructed by analysis of a parallel corpus. | Accuracy can be improved by adding more examples to the corpus. |
| Statistical MT | Statistical models are used to perform MT. | MT quality can be improved by adding more examples to parallel corpora. |
| Example based MT | It makes use of translation memories. It performs translation by analogy. | MT quality can be improved by adding more examples to parallel corpora. |

IV. LITERATURE REVIEW

The work done in MT pertaining to Indian languages is shown in Table II.

TABLE II. DESCRIPTION OF MTS OF DIFFERENT LANGUAGE PAIRS

| Reference | About | Detailed Description |
|---|---|---|
| [1] | English to Hindi MT | Hybrid approach. EBMT and SMT are used to perform MT. Parallel corpus (54K English-Hindi sentences) is used. Training: 53K sentences. Testing: 100 random sentences. BLEU score: 0.432. |
| [3] | English to Assamese | Phrase based MT + transliteration. Parallel corpus: 14,371 sentences of English and Assamese. Testing: 500 sentences. Wordnet of Assamese is used to improve MT output. |
| [9] | Hindi to Punjabi MT | Overall accuracy: 95.12%. Input taken from daily news, articles, official language quotes, blogs and literature. 95.4% of sentences are found to be intelligible. Accuracy obtained: 87.6%. |
| [10] | Bilingual Hindi English text to pure Hindi and pure English | English and Hindi morphological analyzers are used. Plural forms are identified. Unknown words are considered to be proper nouns. Complex sentences are converted to simplified sentences and source text is translated to pure Hindi and pure English. In 90% of cases this approach obtained satisfactory results. |
| [13] | English to Hindi MT using SMT approach | Training: 120153 words. Testing: 8557 words. BLEU score (using baseline approach) is 12.10. BLEU score (by combining syntactic, morphological and baseline approach) is 15.88. |
| [14] | English to Hindi MT | EBMT + RBMT + post editing approach. Can produce 90% correct results for sentences up to a length of 20 words. |
| [15] | English-Hindi bilingual text to Hindi | Morphological analyzer is used to detect unknown words and unknown plural words of Hindi and English. Correct results: 90%. |
| [19] | Transliteration of English to Hindi using SMT | Accuracy: 46.3%. Alignment of English and Hindi letters was done using GIZA++; the SRILM toolkit was used for training. Mean F-Measure obtained: 0.876. |
| [20] | English to Tamil MT | Rule based text simplification approach is used for enhancing English-Tamil MT. Testing: 200 sentences. Accurate results in 115 sentences. Accuracy of MT system increased by 28% by introduction of the text simplification approach. |
| [21] | Hindi to English MT | MT output is improved by simplification of source text. Testing: 100 sentences. BLEU score: 0.805. |
| [22] | Tamil to English MT | Statistical machine translation system that performs Tamil to English MT. A bilingual corpus of 1300 Tamil-English sentence pairs is formed. The Tamil side consisted of 24,000 tokens. |
| [23] | English to Bangla MT (SMT) | Phrase based MT is performed for English to Bangla MT. Transliteration approach is used to deal with words not present in the vocabulary. Accuracy of transliteration module: 0.18. Preposition handling is also performed. Overall BLEU score: 11.7. For short sentences, the BLEU score is 23.3 and the TER is 0.63. |
| [24] | Hindi to English MT | Testing: 100 sentences taken from Hindi Treebank. Source text simplification is used to improve MT. BLEU score: 4.45. |

V. EVALUATION

For the evaluation of an MT system, automatic evaluation metrics or human evaluation metrics may be used.

A. Automatic Evaluation Metrics

1) Precision, Recall and F-Measure:

Precision (P) = Match / System Output
Recall (R) = Match / Human Output
F-Measure (F) = 2 × P × R / (P + R)

Here, Precision is the number of matches between the two outputs divided by the total number of system outputs. Recall is the number of matches between the two outputs divided by the total number of human outputs. F-Measure combines the two as their harmonic mean. Apart from these automatic evaluation metrics, BLEU, METEOR etc. can also be used for evaluation of MT output.

2) Bilingual Evaluation Understudy (BLEU)
Its value lies between 0 and 1, although it is often reported scaled to 0-100, as in several entries of Table II. It indicates how close a machine translated text is to the expected translated text. The corpus-level score is obtained by pooling n-gram match counts over all sentences and combining the resulting n-gram precisions with a brevity penalty.

3) NIST
Apart from calculating n-gram precision, it also assigns weights to n-grams according to how informative they are. A low weight is assigned to n-grams that occur frequently, and a high weight is assigned to rarer n-grams.

4) Word Error Rate (WER)
Estimates the number of token-level edits (substitutions, insertions and deletions) needed to turn the machine translated text into the expected translated text, relative to the length of the expected text.

5) METEOR
Estimates the weighted harmonic mean of unigram precision and recall. It also involves matching of synonyms and lemmatized forms.

6) LEPOR
Involves a collection of different evaluation factors such as precision, recall, sentence length penalty, and word order penalty based on n-grams.

B. Human Evaluation Metrics

For human evaluation, the authors in [10] used some linguistic features that include:

• Translation of gender and number of noun(s)
• Translation of voice in the sentence
• Translation of tense in the sentence
• Identification of the proper noun(s)
• Use of adjectives and adverbs corresponding to nouns and verbs
• Selection of proper words/synonyms (lexical choice)
• Sequence of phrases and clauses in the translation
• Use of punctuation marks in the translation
• Fluency of translated text and translator’s proficiency
• Maintaining semantics of source sentence in the translation
• Evaluating the translation of source sentence (with respect to syntax and intended meaning)

In order to assess the translation quality, a five-point scale is used, which is shown in Table III. Similarly, adequacy and fluency scores may be calculated using the five-point scales shown in Tables IV and V.

TABLE III. FIVE-POINT SCALE TO ASSESS TRANSLATION QUALITY

| Score | Meaning |
|---|---|
| 4 | Ideal |
| 3 | Perfect |
| 2 | Acceptable |
| 1 | Partially acceptable |
| 0 | Not acceptable |

TABLE IV. FIVE-POINT SCALE TO ASSESS ADEQUACY

| Score | Meaning |
|---|---|
| 5 | Complete Information |
| 4 | Most Information |
| 3 | Much Information |
| 2 | Little Information |
| 1 | None |

TABLE V. FIVE-POINT SCALE TO ASSESS FLUENCY

| Score | Meaning |
|---|---|
| 5 | Ideal |
| 4 | Good |
| 3 | Non Native |
| 2 | Disfluent |
| 1 | Incomprehensible |

VI. CONCLUSION

In this paper we discussed MT, the problems faced in MT in the Indian language context, the problems with existing machine translators, the approaches of MT, and the work done so far in MT for Indian languages. As we have seen, the quality of existing MT systems is not good, so there is a need to develop machine translators that can provide good translations with high accuracy. We have also discussed the automatic evaluation metrics and human evaluation metrics that can be used to assess translation quality.

REFERENCES

[1] V. Ambati, U. Rohini, “A hybrid approach to example based machine translation for Indian languages”, 5th International Conference on Natural Language, Hyderabad, India, January, 2007
[2] B. Babych, A. Hartley, “Improving machine translation quality with automatic named entity recognition”, 7th International EAMT Workshop on MT and other Language Technology Tools, Improving MT through other Language Technology Tools: Resources and Tools for Building MT, Budapest, Hungary, April 13, 2003
[3] A. K. Barman, J. Sarmah, S. K. Sarma, “Assamese WordNet based Quality Enhancement of Bilingual Machine Translation System”, 7th Global WordNet Conference, Tartu, Estonia, January 25-29, 2014
[4] J. G. Carbonell, S. Klein, D. Miller, M. Steinbaum, T. Grassiany, J. Frey, “Context-based machine translation”, 7th Conference of the Association for Machine Translation in the Americas, Cambridge, USA, August, 2006
[5] G. V. Garje, G. K. Kharate, “Survey of machine translation systems in India”, International Journal on Natural Language Computing, Vol. 2, No. 4, pp. 47-65, 2013
[6] A. Hassan, H. Fahmy, H. Hassan, “Improving named entity translation by exploiting comparable and parallel corpora”, AMML07, 2007
[7] J. Hutchins, “Towards a definition of example-based machine translation”, MT Summit X, Workshop on Example-Based Machine Translation, Phuket, Thailand, September 16, 2005
[8] L. Jiang, M. Zhou, L. F. Chien, C. Niu, “Named Entity Translation with Web Mining and Transliteration”, IJCAI-07, Hyderabad, India, pp. 1629-1634, January 6-12, 2007
[9] N. Joshi, H. Darbari, I. Mathur, “Human and Automatic Evaluation of English to Hindi Machine Translation Systems”, in: Advances in Computer Science, Engineering & Applications, pp. 423-432, Springer Berlin Heidelberg, 2012
[10] N. Joshi, I. Mathur, H. Darbari, A. Kumar, “HEval: Yet another human evaluation metric”, International Journal on Natural Language Computing, Vol. 2, No. 5, pp. 21-36, 2013
[11] N. Joshi, Implications of Linguistic Feature Based Evaluation in Improving Machine Translation Quality: A case of English to Hindi Machine Translation, PhD Thesis, Banasthali University, India, 2014
[12] S. Nirenburg, C. Domashnev, D. J. Grannes, “Two approaches to matching in example-based machine translation”, 5th International Conference on Theoretical and Methodological Issues in Machine Translation, Kyoto, Japan, 1993
[13] A. Ramanathan, P. Bhattacharyya, J. Hegde, R. M. Shah, M. Sasikumar, “Simple Syntactic and Morphological Processing Can Help English-Hindi Statistical Machine Translation”, Proceedings of the Third International Joint Conference on Natural Language Processing, Vol. I, pp. 513-520, 2008
[14] R. M. K. Sinha, A. Jain, “AnglaHindi: an English to Hindi machine-aided translation system”, MT Summit IX, New Orleans, USA, September 23-27, 2003
[15] R. M. K. Sinha, A. Thakur, “Machine translation of bi-lingual hindi-english (hinglish) text”, MT Summit X, Workshop on Example-Based Machine Translation, Phuket, Thailand, September 16, 2005
[16] V. Goyal, G. S. Lehal, “Evaluation of Hindi to Punjabi machine translation system”, International Journal of Computer Science Issues, Vol. 4, No. 1, pp. 36-39, 2009
[17] V. Goyal, G. S. Lehal, “Web based Hindi to Punjabi machine translation system”, Journal of Emerging Technologies in Web Intelligence, Vol. 2, No. 2, pp. 148-151, 2010
[18] G. S. Josan, G. S. Lehal, “A Punjabi to Hindi machine translation system”, 22nd International Conference on Computational Linguistics: Demonstration Papers, Manchester, UK, August 18-22, 2008
[19] T. Rama, K. Gali, “Modeling machine transliteration as a phrase based statistical machine translation problem”, 2009 Named Entities Workshop: Shared Task on Transliteration, Suntec, Singapore, August 7, 2009
[20] C. Poornima, V. Dhanalakshmi, M. A. Kumar, K. P. Soman, “Rule based sentence simplification for english to tamil machine translation system”, International Journal of Computer Applications, Vol. 25, No. 8, pp. 38-42, 2011
[21] A. Soni, S. Jain, D. M. Sharma, “Exploring Verb Frames for Sentence Simplification in Hindi”, International Joint Conference on Natural Language Processing, Nagoya, Japan, October 14-18, 2013
[22] U. Germann, “Building a statistical machine translation system from scratch: how much bang for the buck can we expect?”, Workshop on Data-driven Machine Translation, Toulouse, France, July 7, 2001
[23] M. Z. Islam, J. Tiedemann, A. Eisele, “English to Bangla phrase-based machine translation”, 14th Annual Conference of the European Association for Machine Translation, St Raphael, France, May, 2010
[24] K. Mishra, A. Soni, R. Sharma, D. M. Sharma, “Exploring the effects of Sentence Simplification on Hindi to English Machine Translation System”, Workshop on Automatic Text Simplification: Methods and Applications in the Multilingual Society, Dublin, Ireland, August 24, 2014
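
As an illustration of the precision, recall and F-measure definitions in Section V.A, and of the word error rate, the following is a minimal Python sketch. It is not the evaluation code of any reviewed system. It assumes whitespace tokenization and a single human reference, and it takes a "match" to be a token shared by both outputs, counted at most as often as it occurs in each. The function names `precision_recall_f` and `word_error_rate` are hypothetical.

```python
from collections import Counter


def precision_recall_f(system_output: str, human_output: str):
    """Return (precision, recall, f_measure) over whitespace tokens."""
    system_tokens = system_output.split()
    human_tokens = human_output.split()
    # Assumption: a "match" is a token present in both outputs, counted
    # at most as often as it occurs in each (clipped unigram overlap).
    match = sum((Counter(system_tokens) & Counter(human_tokens)).values())
    precision = match / len(system_tokens) if system_tokens else 0.0
    recall = match / len(human_tokens) if human_tokens else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def word_error_rate(system_output: str, human_output: str) -> float:
    """Word-level edit distance divided by the reference length."""
    hyp = system_output.split()
    ref = human_output.split()
    # prev[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # token missing from hypothesis
                            curr[j - 1] + 1,      # extra token in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref) if ref else 0.0


if __name__ == "__main__":
    system = "the cat sat on mat"
    human = "the cat sat on the mat"
    print(precision_recall_f(system, human))
    print(word_error_rate(system, human))
```

For the example pair, all five system tokens are matched against six human tokens, so precision is 1, recall is 5/6 and the F-measure is 10/11. One missing token gives a word error rate of 1/6.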