KEDS_Paper_Template Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602 Vol 4, No 2, December 2021, pp. 105–116 eISSN 2597-4637 https://doi.org/10.17977/um018v4i22021p105-116 ©2021 Knowledge Engineering and Data Science | W : http://journal2.um.ac.id/index.php/keds | E : keds.journal@um.ac.id This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/) KEDS is Sinta 2 Journal (https://sinta.kemdikbud.go.id/journals/detail?id=6662) accredited by Indonesian Ministry of Education, Culture, Research, and Technology Similarity Identification of Large-scale Biomedical Documents using Cosine Similarity and Parallel Computing Merlinda Wibowo a, 1, *, Christoph Quix b, 2 , Nur Syahela Hussien c, 3 , Herman Yuliansyah d, 4 , Faisal Dharma Adhinata a, 5 a Faculty of Informatics, Institut Teknologi Telkom Purwokerto, Jl. DI Panjaitan No.128, Karangreja, Purwokerto, Indonesia b Information Systems & Data Science, Hochschule Niederrhein, Adlerstraße 35, 47798 Krefeld, Germany c Universiti Kuala Lumpur Malaysian Institute of Information Technology (UniKL MIIT) 1016, Jln Sultan Ismail, Bandar Wawasan, 50250 Kuala Lumpur, Malaysia d Informatics Department, Universitas Ahmad Dahlan Jl. Kapas No.9, Semaki,Umbulharjo, Yogyakarta, Indonesia 1 merlinda@iitelkom-pwt.ac.id*; 2 christoph.quix@hs-niederrhein.de; 3 syahela@unikl.edu.my; 4 herman.yuliansyah@tif.uad.ac.id; 5 faisal@ittelkom-pwt.ac.id * corresponding author I. Introduction The number of articles added to the literature databases is proliferating. Large amounts of textual data could be collected as a part of the research, such as scientific literature, transcripts in the marketing and economic sectors, speeches in political discourse, such as presidential campaigns and inauguration speeches, and meeting transcripts [1]. PubMed dataset of MEDLINE also has grown enormously [2]. This large amount of textual information has created the problem of finding the relevance level between documents. Besides, it has become challenging to manage and exploit them. This difficulty is closely related to the semantic aspect of these documents. A large amount of data brings about new opportunities for discovering new values, helps to gain an in-depth understanding of hidden values, and incurs new challenges such as how effectively organized and recognized data character [3][4]. There are two main parts for identifying PubMed documents to overcome the challenges. The two parts are abstract and Medical Subject Heading (Mesh) heading. Mesh heading ARTICLE INFO A B S T R A C T Article history: Submitted 7 December 2021 Revised 25 December 2021 Accepted 29 December 2021 Published online 31 December 2021 Document similarity computation is an important research topic in information retrieval, and it is a crucial issue for automatic document categorization. The similarity value is between 0 and 1, then the closest value to 1 is represented both documents is considered more relevant, vice versa. However, the large scale of textual information has created the problem of finding the relevance level between documents. Therefore, the relevance between mesh heading text in the PubMed documents is higher than the relevance of the abstract text in the PubMed documents. Furthermore, parallel computing is implemented to speed up the large-scale documents similarity identification process that automatically calculates in the PubMed application. The execution time of mesh heading is 15.447 seconds, and the timely execution of abstract is 74.191 seconds. The execution time of mesh heading is higher than abstract because abstract contains more words than mesh heading. This study has successfully identified the similarity between large-scale biomedical documents of the PubMed documents that implemented a cosine similarity algorithm. The result has shown that the cosine similarity of the mesh heading texts is higher than the abstract text in the form of a graph and table shown in the PubMed application. The cosine similarity is useful to measure the similarity between documents based on the TF*IDF calculation result. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/). Keywords: Biomedical Documents Cosine Similarity Keyword Extraction Large Scale Parallel Computing Similarity Identification http://u.lipi.go.id/1502081730 http://u.lipi.go.id/1502081046 https://doi.org/10.17977/um018v4i22021p105-116 http://journal2.um.ac.id/index.php/keds mailto:keds.journal@um.ac.id https://creativecommons.org/licenses/by-sa/4.0/ https://sinta.kemdikbud.go.id/journals/detail?id=6662 https://creativecommons.org/licenses/by-sa/4.0/ 106 M. Wibowo et al. / Knowledge Engineering and Data Science 2021, 4 (2): 105–116 is the thesaurus for indexing, cataloging, and searching biomedical and health-related information. The relevance between mesh heading text in the PubMed documents is higher than the relevance of the abstract text in the PubMed documents. Besides, the National Library of Medicine provides the mesh heading. Text mining in big data analytics is emerging as a powerful tool for harnessing the power of unstructured textual data by analyzing it to extract new knowledge and to identify significant patterns and correlations hidden in the data [1][5]. Furthermore, quickly detecting similar documents becomes a fundamental problem as times go on [6]. This difficulty is closely related to the semantic aspect of these documents. Indeed, manual operation is possible and gives good results. However, a manual procedure is not possible with a large corpus. Therefore, document similarity computation is an important research topic in information retrieval, and it is a crucial issue for automatic document categorization. Moreover, parallel computing (for big data) reduces the processing time and quickly detects similar documents [7][8]. Thus, the parallelization of big data is emerging as an essential framework for large-scale parallel data applications. Some research determines the similarity between text used extracted keywords generated based on term frequency-inverse document frequency (TF*IDF) [9][10][11][12]. This research focuses on detecting the similarity of the document. The method for calculating similarity is cosine similarity then the result demonstrates that cosine similarity can calculate the difference of text document. Keyword extraction is a vital algorithm to extract appropriate keywords that can easily choose which document to read to learn the relationship between documents in the form of document retrieval, web page retrieval, document clustering, summarization, text mining, and others. It will automatically identify terms that best describe the keywords of a document [2][9][13]. Then, to obtain a suitable text relevance algorithm to demonstrate relevance calculation between two documents, many studies have been implemented the cosine similarity [9][14][15]. The cosine similarity is useful to measure the similarity between documents based on the result of the keyword extraction. However, the large-scale documents are needed extra time execution. Therefore, parallel computing is implemented to enhance the computing speeds by running several different tasks simultaneously on the same data [7][8]. Parallel computing refers to the breaking process of a more significant problem into smaller, independent parts. Often it can be executed concurrently by multiple processors communicating via shared memory then the results are combined upon completion as part of the overall algorithm. The main purpose of parallel computing is to increase the available computing power for faster application processing and troubleshooting. This research aims to develop a text mining application that adapts a text similarity algorithm for the biomedical domain to identify the relationship and relevance between large-scale documents. The implemented algorithms are run on a set of the published article from the biomedical documents to which keyword annotations by experts exist to compare with automatically extracted keywords by a parallel computing engine. II. Methods In this study, the similarity identification framework provided a guideline to conduct and organize the research properly. The framework illustrated in Figure 1 showed the workflow divided into several research phases that describe the action plan step by step as a guide to complete this study. Each phase will require the output to ensure that the research goals are achieved successfully. A. Master Data PubMed is an open-access search engine launched in January 1996 and made freely available online one year and a half years later. It has become one of the most commonly used search tools for retrieving scientific data. An almost continuous increase in the performed searches has been observed in Biomedical and Life Sciences [2][16][17][18]. PubMed is a search tool provided by the United States National Library of Medicine (NLM). MEDLINE is a central bibliographic database maintained by the United States National Library of Medicine (NLM), is the most commonly used electronic database in applied, systematic reviews of biomedical research. It covers articles published from 1946 to the present, primarily in a scholarly journal. This database is freely accessible via the PubMed website for 24 million records. The sample of PubMed documents is depicted in Figure 2. M. Wibowo et al. / Knowledge Engineering and Data Science 2021, 4 (2): 105–116 107 Figure 2(a) depicts the sample image of PubMed Document, and Figure 2(b) shows the dataset represented in the XML format. Each XML file consists of different publication articles; more than three thousand articles are in every XML file. Dataset will be stored in MongoDB to support the parallel computing process for document similarity identification. MongoDB is the most popular NoSQL database system [19]. MongoDB is a cross-platform document-oriented database system. As a NoSQL database, MongoDB avoids traditional table-based relational database structures that support JSON documents with dynamic schemes, making data integration in some application types easier and faster. Data is stored in a document consisting of key and value with type and size variable (not set before). Figure 3 illustrates the sample of the PubMed documents stored in MongoDB. The data successfully inserted in MongoDB will be used for the following process. This dataset will be in JSON format inside the MongoDB collection with the same tag as data in XML format. This tag can be used for reading the data for the following process. MongoDB does not use the query to read the data like a SQL database. Fig. 1. The Similarity identification framework (a) (b) Fig. 2. (a) Sample image of PubMed document, and (b) The image of a data set represented in XML format 108 M. Wibowo et al. / Knowledge Engineering and Data Science 2021, 4 (2): 105–116 B. Documents Similarity Engine Machine learning is a type of artificial intelligence that can learn from the data without explicit instructions and follow the instructions programmed [4]. Machine learning will assist in finding a solution optimizing performance by using sample data or previous experience to gain new insights, reveal new patterns, and produce more accurate results. This research will implement machine learning in the documents similarity engine to identify the similarity between large-scale documents known as master data by automatically extracting keywords using node.js. JavaScript is a programming language that runs on the client or browser side only, then Node.js exists to complete the JavaScript role. It can also apply as a programming language running on the server-side, like PHP, Ruby, or Perl. With parallel computing, the process will reduce the processing time and quickly detect the relationship and relevance between large-scale documents. 1) Preprocessing At this stage, the results obtained from the master data will automatically go through to preprocess. The tag used in this study is Mesh Heading and Abstract. Both of the tags can represent the entire contents of the article published as testing data. This preprocessing will reduce the number of words that exist by removing stopwords and changing the words into the basic form (stemming) [9][20]. Stopword is words that are not a feature or unique word of a document like conjunctions. Taking into stopword in-text transformation will make the whole text mining system depend on the language factor. Therefore, it is a weakness of the stopword removal process. However, the stopword removal process is still used because this process will significantly reduce the system workload. By removing the stopword of a text, the system will only consider the considered important words. Stemming reduces derived words to their word stem, base, or basic form. One of the most widely used stemming algorithms is the Porter Stemmer [9][20]. The process of treating words with the same stem as synonyms, e.g., query expansion for search engines, is called conflation. The stem does need not be identical to the morphological root of a word since, for purposes of conflation, it is usually sufficient that related words map to the same stem even if this stem is not in itself a valid root. For example, the preprocessing depicts in Figure 4. 2) Representative Algorithm: TF*IDF This phase is representative of algorithm TF*IDF. The TF*IDF-statistic short for term frequency times inverse the document frequency can extract keywords from a document by considering a single document and all documents from the corpus [2][21]. The promising candidate for a keyword in a specific document if it shows up relatively often within the document and rarely in the rest of the Fig. 3. Sample of PubMed documents stored in MongoDB M. Wibowo et al. / Knowledge Engineering and Data Science 2021, 4 (2): 105–116 109 corpus is a word in the term of TF*IDF. The term frequency is given by the ratio of the number of term occurrences in the document and the number of occurrences of the most frequent word in one document. The formula of TF*IDF is shown in equation (1). 𝑇𝐹 ∗ 𝐼𝐷𝐹 = 𝑓𝑟𝑒𝑞(𝑃,𝐷) 𝑠𝑖𝑧𝑒(𝐷) . 𝑙𝑜𝑔2 ( 𝑁 𝑑𝑓(𝑃) ) (1) where freq(P,D) is the number of times P occurs in document D, size(D) is the number of words in document D, df(P) is the number of documents containing P in the global corpus, and N is the size of the global corpus. 3) Cosine Similarity Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them [9][14][15]. Cosine Similarity measures the similarity between two vectors in a dimensional space obtained from the cosine value of the angle from the product of the two vectors being compared because the cosine of 0° is 1 and less than 1 for other angles values. The similarity value of the two vectors is similar when the value of cosine similarity is 1. Cosine similarity is used in positive space, where the result is limited between values 0 and 1. If the value is 0, then the document is similar. If the result is 1, then the value is said to be dissimilar [9][14][15]. This limit applies to some dimensions. Therefore, cosine similarity is most often used in high-dimensional positive spaces. For example, in Information Retrieval, each term is assumed to be a different dimension. Furthermore, the document is marked with a vector where each dimension corresponds and how many terms appear. Equation (2) depicts the formula of cosine similarity. similarity = cos(𝜃) = 𝐴 .𝐵 ||𝐴||||𝐵|| = ∑ 𝐴𝑖 𝐵𝑖 𝑛 𝑖=1 √∑ 𝐴𝑖 2𝑛 𝑖=1 √∑ 𝐵𝑖 2𝑛 𝑖=1 (2) where Ai and Bi are components of vectors A and B. A is the weight of each feature in vector A. B is the weight of each feature in B. If it is associated with information retrieval, then A is the weight of each term in document A, and B is the weight of each term in document B. In this study, cosine similarity is used because large-scale PubMed documents are high-dimensional data. In large-scale PubMed documents that contain many published articles, it also can be said that each document consists of many different tags. Measurement of similarity can be done by comparing document 1 with document 2 then the system will calculate the similarity value. Ai.Bi is the value obtained from term A and term B, then the two values are added together. The value of Ai 2 is all values of term- document A, and all values are raised to the power of two, and term Bi 2 , all values obtained are raised to the power of two, then all values obtained are added up. C. Similarity Identification Result In this stage, the identification results of document similarities will be represented in a graph, statistical table, and web application. The visualization data using a graph and statistical table are intended to make it easier to present and understand the result [4][22]. Meanwhile, web application development can enhance the end-user experience and real-time data collection and provide custom content [22]. This study will show the graph and statistical table in the web application after the document similarity engine process has finished. For example, the PubMed Application interface web application depicts in Figure 5. The documents will be uploaded to the application. The application will automatically calculate the similarity between biomedical documents with parallel computing, Fig. 4. Preprocessing 110 M. Wibowo et al. / Knowledge Engineering and Data Science 2021, 4 (2): 105–116 reducing the processing time and quickly detecting the relationship and relevance between large-scale documents. Therefore, the results will be in the form of a graph and table that facilitate reading the calculation results. III. Results and Discussions The PubMed application developed as an identification documents similarity engine as an intelligent application that automatically calculated the similarity between biomedical documents then visualized the identification result in the form of a graph and table. The calculation process is used parallel computing that is reduced the processing time and quickly detects the relationship and relevance between large-scale documents. The first process is storing the master data in MongoDB. Then the punctuation will be removed, converted to lower case, implemented stop word removal, and extracted the basic word using the Porter Stemming algorithm. Two tags were used in this study, abstract and mesh heading. This tag can be used to read the data for the next process. Figure 6 depicts the sample abstract dataset from PubMed publications captured from MongoDB. In addition, the captured dataset is then transformed into the basic word. The basic word is the biomedical word, including the chemical formulation, medicine name, and others. Therefore, this need is needed to be considered. Fig. 5. PubMed application M. Wibowo et al. / Knowledge Engineering and Data Science 2021, 4 (2): 105–116 111 Fig. 6. Sample captured abstract dataset The listing program to get the extracted keywords can be seen in preprocessing program. The input in preprocessing program is all abstract data, and the output is the string of each word from the abstract. The first step of preprocessing is removing all conjunction and punctuation in the abstract then transforming the letter into lowercase. The next step is stemming the words into their roots. Preprocessing program Input: abs_all Output: all_string Initialization var abs_all, all_string, removed_conjuction, text_array, reg, rm_punctutation, reg removed_conjuction  abstrak_fix.replace(regex_rm_conjuction," ") text_array  removed_conjuction.replace(/(\s)?\d\s+/g, ' ').replace(/\n+/g,' ').split(" ").filter((d) => { return d != '' && conjuction_list.indexOf(d.toLowerCase()) < 1 }).map((d) => { reg  new RegExp(/\d/,'gi') rm_punctuaction  d.replace(regex_rm_punctuaction,'') return reg.test(d) ? d : stemmer.stem(rm_punctuaction) }) The sample of extracted keywords result is depicted in Figure 7. 112 M. Wibowo et al. / Knowledge Engineering and Data Science 2021, 4 (2): 105–116 Fig. 7. Sample of extracted keyword results Afterward, the extracted keyword weighting is carried out to calculate the frequency of occurrence of each word of the testing document in each document in the dataset. This phase is representative of algorithm TF*IDF. The TF*IDF can extract keywords from a document by considering a single document and all documents from the corpus. Finally, the TF*IDF calculation result is used to calculate the similarity of the documents testing with the PubMed documents using the cosine similarity algorithm. The listing program to get the term frequency value can be seen in the TFIDF program. TFIDF program Input: all_string Output: tf Initialization var all_string, tfidf, tf TfIdf  natural.TfIdf tfidf  new TfIdf() abs_all.forEach((dataa) => { tfidf.addDocument(dataa) }) all_string.forEach((as) => { tfidf.tfidfs(as, function(i, measure) { }) The sample of TF*IDF results stored in MongoDB is captured in Figure 8. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in 0 and 1. This similarity calculation will result in a value between 0 and 1. The closer value to 1, then both documents are more related, vice versa. M. Wibowo et al. / Knowledge Engineering and Data Science 2021, 4 (2): 105–116 113 Fig. 8. Sample of captured TF*IDF results From the similarity process that has been done, the cosine similarity produces similarity values between one document compared to other documents. The document comparison focused on the Abstract and Mesh Heading tag of the PubMed publications document as the testing data. The listing code to measure the cosine similarity between documents can be seen in the cosine similarity program. Cosine_similarity program Input: tf Output: cos_sim Initialization var tf, cos_sim_all, l1, l2, tf1, tf2, sum, a, b, A, B, cos_sim, len_avg, len_avg2, tf_sum l1  tf[item.first].length l2  tf[item.second].length tf1  tf[item.first] tf2  tf[item.second] if ( l1 > l2 ) { len_avg  l1-l2 for (var j=0; j { a  tf2.filter((d) => { return item.term == d.term && item.term != '-' && d.term != '-'}) if (a.length > 0) { b  item.tfdif*a[0].tfdif tf_sum.push(b) }}) sum  tf_sum.length > 0 ? tf_sum.reduce((accumulator, currentValue) => accumulator + currentValue) : 0 A  tf1.map((data, index) => {return Math.pow(data.tfdif,2)}).reduce((accumulator, currentValue) => accumulator + currentValue) B  tf2.map((data, index) => {return Math.pow(data.tfdif,2)}).reduce((accumulator, currentValue) => accumulator + currentValue) Cos_sim sum / (Math.sqrt(A)*Math.sqrt(B)) 114 M. Wibowo et al. / Knowledge Engineering and Data Science 2021, 4 (2): 105–116 The cosine similarity results shown in Figure 9 illustrated the sample result of cosine similarity between abstract text with different abstracts in other publications and mesh heading text with the different mesh heading in other publications. For example, the cosine similarity between document 2 and document 1 between the mesh heading of published articles in the PubMed documents is 0.0045 and indicates that the cosine similarity is 0.45%. Figure 10 illustrates the result of cosine similarity measurement between documents. In this case, it is using abstract and mesh heading text in each PubMed document. The graph of the cosine similarity result from this PubMed document is shown the mesh heading texts cosine similarity is higher than the abstract text. The results showed that the relevance between mesh heading text in the PubMed documents is higher than the relevance of the abstract text in the PubMed documents. Hence, the relationship and correlation between published articles in PubMed documents can be known from the mesh heading text. The number of words and terms in the abstract can affect text similarity results. Besides, this mesh heading tag can be used for subsequent data processing, such as classifying or clustering the PubMed documents. Fig. 9. Cosine similarity results between biomedical documents Fig. 10. Visualization of comparison of cosine similarity result between documents M. Wibowo et al. / Knowledge Engineering and Data Science 2021, 4 (2): 105–116 115 Both visualizations of the calculation similarity result depicted in Figure 9 and Figure 10, known as similarity identification results, make it easier to present and understand the comparison result. This identification similarity result is shown in the PubMed application. In addition, this result is produced by the parallel computing engine in the PubMed application that reduced the processing time and quickly detected the relationship and relevance between large-scale biomedical documents. Meanwhile, Figure 11 is shown the execution time of the similarity engine application. The execution time of mesh heading is 15.447 seconds, and the timely execution of abstract is 74.191 seconds. The execution time of mesh heading is higher than abstract because abstract contains more words than mesh heading. Documents similarity identification application has successfully identified the similarity between large-scale documents of the PubMed documents known as biomedical documents. The implemented cosine similarity and parallel computing as the document similarity engine is executed the documents faster. The execution time of mesh heading is 15.447 seconds, and the timely execution of abstract is 74.191 seconds. Based on the results, the mesh heading runtime is higher than the abstract because the abstract contains more words than the mesh heading. In addition, using the abstract and mesh heading tag can represent the similarity between documents. The result is shown that the cosine similarity of the mesh heading texts is higher than the mesh abstract text. IV. Conclusion The documents similarity identification application has successfully identified the similarity between large-scale documents of the PubMed documents known as biomedical documents. This study implemented cosine similarity and parallel computing as the document similarity engine that executed the documents faster. The execution time of mesh heading is 15.447 seconds, and the timely execution of abstract is 74.191 seconds. The mesh heading runtime is higher than the abstract because the abstract contains more words than the mesh heading. Therefore, using the abstract and mesh heading tag can represent the similarity between documents—the result is shown that the cosine similarity of the mesh heading texts is higher than the mesh abstract text. Besides, the results showed that the relevance between mesh heading text in the PubMed documents is higher than the relevance of the abstract text in the PubMed documents. On the other hand, the number of words and terms in the abstract can affect the percentage of text similarity results. In the future, this mesh heading and abstract tag can be used for the next data processing, such as classification or clustering datasets. Declarations Author contribution All authors contributed equally as the main contributor of this paper. All authors read and approved the final paper. Funding statement This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. Conflict of interest The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper. Additional information Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. Publisher’s Note: Department of Electrical Engineering - Universitas Negeri Malang remains neutral with regard to jurisdictional claims and institutional affiliations. Fig. 11. Execution time of document similarity application http://journal2.um.ac.id/index.php/keds 116 M. Wibowo et al. / Knowledge Engineering and Data Science 2021, 4 (2): 105–116 References [1] H. Hassani, C. Beneki, S. Unger, and M. T. Mazinani, “Text Mining in Big Data Analytics,” Big Data Cogn. Comput., vol. 4, pp. 1–34, 2020. [2] R. Islamaj et al., “PubMed Text Similarity Model and its application to curation efforts in the Conserved Domain Database,” Database, vol. 1, pp. 1–13, 2019. [3] S. F. Wamba, A. Gunasekaran, S. Akter, S. J. Ren, R. Dubey, and S. J. Childe, “Big data analytics and firm performance: Effects of dynamic capabilities,” J. Bus. Res., vol. 70, pp. 356–365, 2016. [4] M. Wibowo, F. Noviyanto, S. Sulaiman, and S. M. Shamsuddin, “Machine Learning Technique For Enhancing Classification Performance In Data Summarization Using Rough Set And Genetic Algorithm,” Int. J. Sci. Technol. Res., vol. 8, no. 10, pp. 1108–1119, 2019. [5] R. M. Packiam and V. S. J. Prakash, “An empirical study on text analytics in big data,” 2016. [6] M. Erritali, A. Beni-hssane, M. Birjali, and Y. Madani, “An Approach of Semantic Similarity Measure between Documents Based on Big Data,” Int. J. Electr. Comput. Eng., vol. 6, no. October 2017, pp. 2454–2463, 2016. [7] L. A. Rahim, K. Mohan, K. Id, and S. Bahattacharjee, “Framework for parallelisation on big data,” PlosOne 14(5), pp. 1–19, 2019. [8] B. Parhami, “Parallel Processing with Big Data,” pp. 1–7, 2018. [9] R. Darmawan, R. S. Wahono, “Hybrid Keyword Extraction Algorithm and Cosine Similarity for Improving Sentences Cohesion in Text Summarization,” J. Intell. Syst., vol. 1, no. 2, pp. 109–114, 2015. [10] S. W. Iriananda, M. A. Muslim, and H. S. Dachlan, “Identifikasi Kemiripan Teks Menggunakan Class Indexing Based dan Cosine Similarity Untuk Klasifikasi Dokumen Pengaduan,” Matics, vol. 10, no. 2, p. 30, 2019. [11] D. A. R. Ariantini, A. S. M. Lumenta, and A. Jacobus, “Pengukuran Kemiripan Dokumen Teks Bahasa Indonesia Menggunakan Metode Cosine Similarity,” J. Tek. Inform., vol. 9, no. 1, pp. 1–8, 2016. [12] M. Z. Naf’an, A. Burhanuddin, and A. Riyani, “Penerapan Cosine Similarity dan Pembobotan TF-IDF untuk Mendeteksi Kemiripan Dokumen,” J. Linguist. Komputasional, vol. 2, no. 1, pp. 23–27, 2019. [13] J. Wang and Y. Dong, “Measurement of text similarity: A survey,” Inf., vol. 11, no. 9, pp. 1–17, 2020. [14] D. Kurniadi, S. F. C. Haviana, and A. Novianto, “Implementasi Algoritma Cosine Similarity pada sistem arsip dokumen di Universitas Islam Sultan Agung,” J. Transform., vol. 17, no. 2, p. 124, 2020. [15] D. Gunawan, C. A. Sembiring, and M. A. Budiman, “The Implementation of Cosine Similarity to Calculate Text Relevance between Two Documents,” J. Phys. Conf. Ser., vol. 978, no. 1, 2018. [16] J. Bian, M. Amin, S. Jonnalagadda, G. Luo, and G. Del, “Automatic identification of high impact articles in PubMed to support clinical decision making,” J. Biomed. Inform., vol. 73, pp. 95–103, 2017. [17] C. W. Halladay, T. A. Trikalinos, I. T. Schmid, C. H. Schmid, and I. J. Dahabreh, “Using data sources beyond PubMed has a modest impact on the results of systematic reviews of therapeutic interventions,” in Journal of Clinical Epidemiology, 2015, vol. 68, no. 9, pp. 1076–1084. [18] K. Z. Vardakas, G. Tsopanakis, A. Poulopoulou, and M. E. Falagas, “An analysis of factors contributing to PubMed’s growth,” J. Informetr., vol. 9, no. 3, pp. 592–617, 2015. [19] MongoDB, “MongoDB,” 2017. [20] P. dwi Nurfadila, A. P. Wibawa, I. A. E. Zaeni, and A. Nafalski, “Journal Classification Using Cosine Similarity Method on Title and Abstract with Frequency-Based Stopword Removal ,” Int. J. Artif. Intell. Res., vol. 3, no. 2, 2019. [21] N. Ghasemi and S. Momtazi, “Neural text similarity of user reviews for improving collaborative filtering recommender systems,” Electron. Commer. Res. Appl., vol. 45, no. October 2019, p. 101019, 2021. [22] M. Wibowo, S. Sulaiman, S. Mariyam, and H. Hashim, “Mobile Analytics Database Summarization Using Rough Set,” Int. J. Innov. Comput., vol. 7, no. 2, pp. 6–12, 2017. https://doi.org/10.3390/bdcc4010001 https://doi.org/10.3390/bdcc4010001 https://doi.org/10.1093/database/baz064 https://doi.org/10.1093/database/baz064 https://doi.org/10.1016/j.jbusres.2016.08.009 https://doi.org/10.1016/j.jbusres.2016.08.009 https://www.ijstr.org/paper-references.php?ref=IJSTR-1019-23769 https://www.ijstr.org/paper-references.php?ref=IJSTR-1019-23769 https://www.ijstr.org/paper-references.php?ref=IJSTR-1019-23769 https://doi.org/10.1109/ICCIC.2015.7435747 http://doi.org/10.11591/ijece.v6i5.pp2454-2461 http://doi.org/10.11591/ijece.v6i5.pp2454-2461 https://doi.org/10.1371/journal.pone.0214044 https://doi.org/10.1371/journal.pone.0214044 https://doi.org/10.1007/978-3-319-63962-8_165-1 http://journal.ilmukomputer.org/index.php?journal=jis&page=article&op=view&path%5B%5D=44 http://journal.ilmukomputer.org/index.php?journal=jis&page=article&op=view&path%5B%5D=44 https://doi.org/10.18860/mat.v10i2.5327 https://doi.org/10.18860/mat.v10i2.5327 https://doi.org/10.35793/jti.9.1.2016.13752 https://doi.org/10.35793/jti.9.1.2016.13752 https://doi.org/10.26418/jlk.v2i1.17 https://doi.org/10.26418/jlk.v2i1.17 https://doi.org/10.3390/info11090421 http://dx.doi.org/10.26623/transformatika.v17i2.1613 http://dx.doi.org/10.26623/transformatika.v17i2.1613 https://doi.org/10.1088/1742-6596/978/1/012120 https://doi.org/10.1088/1742-6596/978/1/012120 https://doi.org/10.1016/j.jbi.2017.07.015 https://doi.org/10.1016/j.jbi.2017.07.015 https://doi.org/10.1016/j.jclinepi.2014.12.017 https://doi.org/10.1016/j.jclinepi.2014.12.017 https://doi.org/10.1016/j.jclinepi.2014.12.017 https://doi.org/10.1016/j.joi.2015.06.001 https://doi.org/10.1016/j.joi.2015.06.001 https://www.mongodb.com/ https://doi.org/10.29099/ijair.v3i2.99 https://doi.org/10.29099/ijair.v3i2.99 https://doi.org/10.29099/ijair.v3i2.99 https://doi.org/10.1016/j.elerap.2020.101019 https://doi.org/10.1016/j.elerap.2020.101019 https://ijic.utm.my/index.php/ijic/article/view/144 https://ijic.utm.my/index.php/ijic/article/view/144 I. Introduction II. Methods A. Master Data B. Documents Similarity Engine 1) Preprocessing 2) Representative Algorithm: TF*IDF 3) Cosine Similarity C. Similarity Identification Result III. Results and Discussions IV. Conclusion Declarations Author contribution Funding statement Conflict of interest Additional information References