International Journal of Applied Sciences and Smart Technologies International Journal of Applied Sciences and Smart Technologies Volume 3, Issue 2, pages 153–160 p-ISSN 2655-8564, e-ISSN 2685-9432 153 Text Classification on Tamil Omprakash Yadav 1 , Alcina Judy 1 , Praveen D’souza 1 , Calvin Galbaw 1,* , Hinal Rane 1 1 Department of Computer, Xavier Institute of Engineering, Mahim, Mumbai 400016, India * Corresponding Author: calving2012@gmail.com (Received 01-09-2020; Revised 29-12-2021; Accepted 29-12-2021) Abstract By and large, we don't know to talk and read the territorial dialects that are spoken in our nation. So we have accepted Tamil language as it is our territorial and numerous doesn't get it. In our task, the content in Tamil language is stacked from Wikipedia. It is then sifted through and extraordinary characters are evacuated it is then characterized by the titles like id, title, URL, etc. It is then used to prepare the model utilizing CNN calculation and the dataset is created. Along these lines, you would now be able to test utilizing an irregular Wikipedia page and the content is grouped by the titles and anticipated. Keywords: tamil text classification, feature classification, vocabulary set or bag-of-words, text mining, natural language processing 1 Introduction For the most part, we don't comprehend huge numbers of the local dialects in our nation. So at whatever point an individual of various state language is spoken or composed, we were unable to get it. In this task, we characterize the content dependent on the sort like name, nation, id, and so on. Here, we use CNN to arrange the content International Journal of Applied Sciences and Smart Technologies Volume 3, Issue 2, pages 153–160 p-ISSN 2655-8564, e-ISSN 2685-9432 154 and train the dataset. It is useful for individuals to order the sort and in any event, get a thought of what the content looks like. It will be simple for the individual to know and recognize the various segments present in the information. The sort of information is helpful for various logical purposes for getting it. 2 Literature Survey We refer to references [1], [2], [3], [4], [5]. In deep learning, a convolutional neural system (CNN or ConvNet) is a class of deep neural systems, most usually applied to investigating visual features. These utilize the spatial loads of channels to extricate highlights from the picture. They have applications in picture and video acknowledgment, recommender frameworks, picture arrangement, clinical picture examination, regular language handling, and money related time arrangement. Convolutional neural systems use convolutional layers as building squares to gain from the dataset. Alongside these, pooling layers and completely associated layers are utilized. A convolution is the basic use of a channel to an info that outcomes in an activation. These channels slide over width and tallness to convolve the information and use actuation capacity to make a highlighted map. This guide can be passed to another convolutional layer to make an increasingly itemized map. These component maps can be unfurled to take care of into a completely associated layer to get the explicit prescient displaying issue, for example, picture arrangement. Since information like pictures, recordings, and other multi-dimensional information have a quadratic number of highlights, an ordinary neural system needs to process a huge measure of straight capacities and enactments which takes a quadratic measure of time. Be that as it may, convolutional organize registers each weight in a straight time utilizing channels. International Journal of Applied Sciences and Smart Technologies Volume 3, Issue 2, pages 153–160 p-ISSN 2655-8564, e-ISSN 2685-9432 155 he outcome is profoundly explicit highlights that can be distinguished anyplace on input images [3]. 1. Convolutional neural systems apply a channel to a contribution to make a component map that sums up the nearness of recognized highlights in the input [3]. 2. Filters can be high quality, for example, line finders, yet the advancement of convolutional neural systems is to get familiar with the channels during preparing with regards to a particular forecast problem [3]. 3. How to figure the component map for one-and two-dimensional convolutional layers in a convolutional neural system [3]. For regular language handling tasks, counterfeit neural systems, for example, intermittent neural systems (RNN) and long transient memory systems (LSTM) are favored because they go off past initiation or yield as a contribution to the following concealed states. This aids in recalling the word/character figured which helping I processing the following ward word. That is the reason these models are utilized most often in language models. Since the attempted assignment is of order, convolutional neural systems are utilized which changes in input. Instead of contributing a picture, word installing can be utilized as the info. Word installing is made utilizing different models, for example, Word2Vec. Since the forecast will be made on Wikipedia information, we have made an installation on Wikipedia pages. Input Layers: It's the layer where we contribute to our model. The quantity of neurons in this layer is equivalent to add up to estimate of the word implanting. Hidden Layer: For grouping utilizing word implanting, for the most part, a single layer of completely associated layers are utilized to shape the concealed layer. Additionally, a single layer of convolutional layer followed by a completely associated layer can be utilized. International Journal of Applied Sciences and Smart Technologies Volume 3, Issue 2, pages 153–160 p-ISSN 2655-8564, e-ISSN 2685-9432 156 Output Layer: Since there are n number of words, the yield from the concealed layer is then taken care of into a calculated capacity of the softmax layer which changes over the yield of each word into the likelihood score of each class. The information is then taken care of into the model and yield from each layer is acquired this progression is called feedforward, we at that point figure the blunder utilizing a mistake work, some normal mistake capacities are cross-entropy, square misfortune blunder and so forth. From that point forward, we back engender into the model by figuring the subsidiaries. This progression is called Backpropagation which fundamentally is utilized to limit the misfortune. 3 Existing System Natural language processing represents computational techniques used for processing human language. The language can either be represented in terms of text or speech. NLP in the context of deep learning has become very popular because of its ability to handle text which is far from being grammatically correct. The ability to learn from the data has made the machine learning system powerful enough to process any type of unstructured text. Machine learning approaches have been used to achieve state of the art results on NLP tasks like text classification, machine translation, question answering, text summarization, text ranking, relation classification, and others. The focus of our work is text classification of Tamil language. Text classification is the most widely used NLP task. It finds application in sentiment analysis, spam detection, email classification, and document classification to name a few. It is an integral component of conversational systems for intent detection. There have been very few text classification works in literature focusing on the resource-constrained Tamil language. While the most important reason for this is the unavailability of large training data; another reason is the generalizability of deep learning architectures to different languages. However, Tamil is a morphologically rich and relatively free word order language so we investigate the performance of different models on the Tamil text classification task. Moreover, there has been a substantial rise in Tamil language digital content in recent years. Service providers, e-commerce industries are now targeting local languages to improve their International Journal of Applied Sciences and Smart Technologies Volume 3, Issue 2, pages 153–160 p-ISSN 2655-8564, e-ISSN 2685-9432 157 visibility. An increase in the robustness of translation and transliteration systems has also contributed to the rise of NLP systems for Tamil text. This work will help in the selection of the right models and provide a suitable benchmark for further research in Tamil text classification tasks. 4 Proposed Methodology The proposed methodology is as follows. Step 1: Obtain the text from Wikipedia for Tamil pages Go to https://ta.wikipedia.org/wiki/ _ from this extract the text and convert it into csv file this file is then taken for further processing. Step 2: Filtering and removal of special characters: The special characters and the ambiguity present in the text are removed such as comma, semicolon, asterisk mark, brackets and so on. This will help the text to be simplified for further processing of data. Step 3: Classify using titles The text is classified according to the titles such as id, name, title, url, recursive words etc. Step 4: Train the dataset using CNN The dataset is trained using Convolutional Neural Networks (CNN) is one kind of feed forward neural network. CNN is an efficient recognition algorithm which is widely used in pattern recognition and image processing. It has many features such as simple structure, less training parameters and adaptability. Step 5: Test using random Wikipedia page Now we are able to test any random Wikipedia page and the text is classified according to the titles and predicts the results. International Journal of Applied Sciences and Smart Technologies Volume 3, Issue 2, pages 153–160 p-ISSN 2655-8564, e-ISSN 2685-9432 158 5 Implementation Figure 1. Flowchart. The above Figure 1 is the Flowchart of our system. The working of our system is as follows: 1. The text from Tamil Wikipedia pages are extracted and checked for special characters. 2. Such characters create problem while classifying that is these special characters are not important to be classified. 3. The text is classified according to the title, tags, key words, and what the text is about. 4. This is used as the dataset for the model to be trained on. 5. Once we achieve high accuracy on the model, the user can use this model to get the details of an unknown Tamil text such as titles, etc. International Journal of Applied Sciences and Smart Technologies Volume 3, Issue 2, pages 153–160 p-ISSN 2655-8564, e-ISSN 2685-9432 159 6 Conclusion In this report, we have introduced a Tamil language text arrangement that encourages the client to distinguish the sort of text and create a dataset by expelling all the ambiguities in the content and preparing the dataset which will be useful to test any irregular Wikipedia page. References [1] E. Annamalai and S. B. Steever. Modern Tamil in Dravidian languages. Newyork: Routledge Publication, 1999. [2] R. K. Belew, “Adaptive information retrieval.” In Proceedings of the 12th annual international ACM/SIGIR conference on research and development in information retrieval, NY, 11–20, 1989. [3] L. Chanunya and R. Peachavanish, “Automatic Thai language essay scoring using neural network and latent semantic analysis.” In Proceedings of the first Asia international conference on modeling and simulation, 2007. [4] C.H. Li and S.C. Park, “Text categorization based on artificial neural networks.” In ICONIP, 4234, LNCS 302–311, 2006. [5] C.H. Li and S.C. Park, “Neural network for text classification based on singular value decomposition.” In Seventh international conference on computer and information technology, 47–52, 2007. International Journal of Applied Sciences and Smart Technologies Volume 3, Issue 2, pages 153–160 p-ISSN 2655-8564, e-ISSN 2685-9432 160 This page intetntionally left blank