Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602
Vol 5, No 2, December 2022, pp. 150–159 eISSN 2597-4637
https://doi.org/10.17977/um018v5i22022p150-159
©2022 Knowledge Engineering and Data Science | W : http://journal2.um.ac.id/index.php/keds | E : keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)
Can Multinomial Logistic Regression Predict Research Group
using Text Input?
Harits Ar Rosyid a,1,*, Aulia Yahya Harindra Putra a,2, Muhammad Iqbal Akbar a,3,
Felix Andika Dwiyanto b,4
a Department of Electrical Engineering, Universitas Negeri Malang,
Malang 65145, Indonesia
b Faculty of Computer Science, Electronics, and Telecommunications, AGH University of Science and Technology,
30-059 Kraków, Poland
1 harits.ar.ft@um.ac.id*; 2 yahya.harindraputra.1905356@students.um.ac.id; 3 iqbal.akbar.ft@um.ac.id;
4 dwiyanto@agh.edu.pl
* corresponding author
I. Introduction
The Department of Electrical Engineering and Informatics (DEEI), Universitas Negeri Malang,
has a thesis and final project management site, SISINTA UM. Every student submitting a thesis title
must adjust the title and abstract of the thesis to match the research group. Based on a short survey
of 25 students who have submitted titles and abstracts to SISINTA UM, the results show that students
feel confused and have difficulty adjusting the proposed thesis's title and abstract. Most lecturers
from the target research group usually respond briefly to any mismatch between the proposal and the
research group. This subjective response could lead to more confusion for the students.
The traditional solution would be for students to consult their topic with lecturers or academic
supervisors. This approach is somewhat complex and not straightforward, since time and place
arrangements between students and lecturers are too dynamic. Instead, the system should recommend
the best research group based on the information describing a thesis or final project. This approach
is adapted from [1], which shows a Lexile level within an article posted on a website; such
straightforward information helps readers find their preferred articles.
We propose a text classification technique to construct a research group recommendation based
on text input: title and/or abstract. The main idea is driven by the abundant text information stored in
the SISINTA database. Once this text data is retrieved, we apply a text mining process initialized by
text preprocessing to clean and restructure the text. Then, the term weighting stage is applied to
convert the text into a computable form: numbers. Subsequently, resampling is essential to tackle the
imbalanced class distribution. In the next stage, we applied the Logistic Regression (LR) algorithm
[2], which learns to distinguish research groups based on the title and/or abstract. LR is a
classification algorithm that predicts the probability of the target variable [3]. This algorithm is useful in text
ARTICLE INFO
Article history:
Received 11 November 2022
Revised 29 November 2022
Accepted 9 December 2022
Published online 30 December 2022
ABSTRACT
While submitting proposals in SISINTA, students are often confused and submit their
proposals to a less relevant or incorrect research group. There are 13 research groups
for the students to choose from. We proposed a text classification method to help
students find the best research group based on the title and/or abstract. The stages in
this study include data collection, preprocessing data, classification using Logistic
Regression, and evaluation of the results. Three scenarios in research group
classification are based on 1) title only, 2) abstract only, and 3) title and abstract. Based
on the experiments, research group classification using title-only input is the best
overall. This scenario gets the most optimal results with accuracy, precision, recall,
and f1-score successively at 63.68%, 64.91%, 63.68%, and 63.46%. This result is
sufficient to help students find the best research group based on the text titles. In
addition, lecturers can comment more elaborately since the proposals are relevant to
the research group’s scope.
Keywords:
Classification
Logistic Regression
Title
Abstract
Research Group
Thesis
classification, such as sentiment analysis [4]. Finally, we evaluate how well the LR predicts the
research group based on the text input.
II. Method
In this research, several stages of the research methodology are described in Figure 1. We
collected raw data from DEEI's SISINTA database at the data collection stage by dumping the SQL
data into a Microsoft Excel file. No personal information, such as student identities, supervisors,
grades, or logs, was included during data exporting. The main content we retrieved was the text
information relevant to the theses and final projects. Data obtained from 16 April 2016 to 4 October
2022 contained 2164 samples, and the SISINTA administrator confirmed that these data are accurate.
Each sample has two independent variables, the title and abstract, and the target variable, the research
group class. The thirteen research groups and their class distributions are shown in Table 1. The table
shows an imbalanced distribution of research groups, a challenge to be tackled by the resampling
technique in our proposed method.
Fig 1. Research Methodology
Text preprocessing is carried out to ensure text data is ‘clean’ and the algorithm can learn from it
[5]. Text preprocessing involves stages to make text information more structured [6], which include
text cleaning, removing missing values, removing duplicate rows, tokenization, stopword removal,
and stemming.
Text cleaning consists of four steps. First, tag removal aims to remove the HTML tags contained
in the document [7]. Much of the text data contains HTML tags, which often happens when students
copy-paste text from a document processor into the SISINTA input form. We use regular expression
filtering (a.k.a. regex) to remove HTML tags and keep the informative text. Say inputText =
"<p>Hello</p>". By applying regex = re.compile(r'<[^>]+>'), the function
regex.sub('', inputText) will output "Hello". Second, case folding aims to convert
Table 1. Number of rows in each research group of the data studied
Research Group Total
Pengembangan Aplikasi dan Media Pembelajaran Teknologi dan Kejuruan 463
Strategi Pembelajaran Teknologi dan Kejuruan 395
Kurikulum Pendidikan Teknologi dan Kejuruan 200
Rekayasa pengetahuan dan ilmu data (Knowledge Engineering and Data Science) 174
Evaluasi dan Pengelolaan Pendidikan Kejuruan 155
Ketenagakerjaan Teknologi dan Kejuruan 142
Teknologi Digital Cerdas (Ubiquitous Computing Technique) 132
Intelligent Power and Advanced Energy System (IPAES) 121
Intelligent Power Electronics and Smart Grid (IPESG) 104
Game Technology and Machine Learning Applications 90
Telematics IoT System and Devices 88
Biomedic and Intelligent Assistive Technology (TAT) 55
Sistem Dinamis, Kendali, dan Robotika (Dynamic Systems, Control, and Robotics) 45
H.A. Rosyid / Knowledge Engineering and Data Science 2022, 5 (2): 150–159 152
capital letters to lowercase. It helps prevent the computer from interpreting the same word as having
different meanings [8]. For instance, in Python, "Case".casefold() will output "case".
The third stage, trim text, aims to remove white space at the beginning and end of the text [9]. In
Python, this is achieved by running the strip() function to remove spaces from both ends. The last
stage removes punctuation, special characters, double white space, and numbers [10]. We apply
regex for this purpose by extending the pattern with the additional characters to be removed.
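The four cleaning steps described above can be combined into a single helper. The sketch below is a minimal, stdlib-only illustration; the function name clean_text and the exact character whitelist are our own choices, not taken from the paper's codebase:

```python
import re

def clean_text(raw: str) -> str:
    """Apply the four text-cleaning steps: tag removal, case folding,
    trimming, and removal of punctuation, numbers, and extra spaces."""
    text = re.sub(r'<[^>]+>', '', raw)           # 1. tag removal
    text = text.casefold()                        # 2. case folding
    text = text.strip()                           # 3. trim leading/trailing space
    text = re.sub(r'[^a-z\s]', ' ', text)         # 4. drop punctuation and numbers
    text = re.sub(r'\s{2,}', ' ', text).strip()   #    collapse double whitespace
    return text

print(clean_text('<p>Prediksi 13 Kelompok Riset!</p> '))  # prints: prediksi kelompok riset
```

Note that Indonesian text uses no accented letters, so the `[^a-z\s]` whitelist is sufficient here; other languages would need a broader pattern.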
The second stage of text preprocessing is to remove missing values. This step is carried out to
handle missing data by removing columns or rows whose data is not available or NaN (Not a
Number). This deletion's purpose is to reduce data bias [11]. This study's third stage of text
preprocessing is to remove duplicates or redundant samples [12]. This will minimize the overfitting
effect due to duplicates [13].
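The second and third preprocessing stages can be sketched together. The paper does not show its code; this dependency-free illustration mirrors what pandas' dropna and drop_duplicates would do, with a hypothetical row structure:

```python
def drop_missing_and_duplicates(rows):
    """Remove rows with a missing title or abstract, then drop exact
    duplicates, keeping the first occurrence of each sample."""
    seen, cleaned = set(), []
    for row in rows:
        title, abstract = row.get('title'), row.get('abstract')
        if not title or not abstract:       # missing value (None or empty string)
            continue
        key = (title, abstract)
        if key in seen:                     # duplicate sample
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned

rows = [
    {'title': 'klasifikasi teks', 'abstract': 'studi lr'},
    {'title': 'klasifikasi teks', 'abstract': 'studi lr'},   # duplicate row
    {'title': 'deteksi topik', 'abstract': None},            # missing abstract
]
print(len(drop_missing_and_duplicates(rows)))  # prints: 1
```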
The fourth stage is tokenization, for which we use the Natural Language Toolkit (NLTK),
specifically the nltk.tokenize package. The goal is to break sentences down into words or
tokens [14]. In this study, tokenization splits the title and abstract into word fragments to identify
the words and their separators. Hence, tokenization helps extract meaning from the text.
The fifth stage of text preprocessing in this study is stopword removal, or text filtering. We use
the stopwords corpus from nltk.corpus to filter out stop words such as 'diperlukan',
'hendaknya', and 'tapi'.
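The tokenization and stopword-removal stages can be sketched without external dependencies. The stopword set below is a tiny illustrative stand-in for NLTK's Indonesian stopword corpus, and the regex tokenizer approximates what nltk.tokenize does for simple text:

```python
import re

# Tiny stand-in for NLTK's Indonesian stopword list (illustrative only).
STOPWORDS = {'diperlukan', 'hendaknya', 'tapi', 'yang', 'dan', 'untuk', 'pada'}

def tokenize(text: str):
    """Split cleaned text into word tokens (approximation of nltk.tokenize)."""
    return re.findall(r'\w+', text)

def remove_stopwords(tokens):
    """Filter out tokens that carry no critical meaning in the text."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenize('sistem rekomendasi yang diperlukan untuk kelompok riset')
print(remove_stopwords(tokens))  # prints: ['sistem', 'rekomendasi', 'kelompok', 'riset']
```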
The final text preprocessing stage is stemming [15]. Stemming cuts prefixes, suffixes, infixes,
and combinations of prefixes and suffixes to remove affixes [16]. Besides that, it reduces word
inflections to their base form. The stemming process can be done using a dedicated Indonesian-
language stemmer library, Sastrawi. This process prevents the computer from interpreting words
constructed from the same root word as having different meanings [17]. For instance, when stemming
is applied, the word “kecepatan” will produce “cepat”.
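The study uses the Sastrawi library for the actual stemming. Purely as an illustration of the idea of affix stripping, here is a toy rule-based stripper; it is not Nazief-Adriani and its affix lists and length thresholds are our own simplifications:

```python
# Toy Indonesian affix stripper, for illustration only. The study itself
# relies on Sastrawi, which implements the full stemming rules.
PREFIXES = ('ke', 'pe', 'me', 'di', 'ber', 'ter')
SUFFIXES = ('an', 'kan', 'i')

def naive_stem(word: str) -> str:
    """Strip at most one suffix and one prefix, keeping stems >= 4 letters."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 4:
            word = word[:-len(suf)]
            break
    for pre in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(pre) and len(word) - len(pre) >= 4:
            word = word[len(pre):]
            break
    return word

print(naive_stem('kecepatan'))  # prints: cepat
```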
Once the text data is clean and ready, term weighting converts data into a numeric form [18]. We
apply the Term Frequency-Inverse Document Frequency (TF-IDF) method in this study. TF-IDF
assigns a weight to each word that frequently appears to quantitatively measure how strong the
relationship between the word and the document is [19]. When a word appears more frequently in a
document, its weight increases proportionally. In contrast, the weight decreases if the word appears
more regularly across many documents [20]. We apply the scikit-learn class
sklearn.feature_extraction.text.TfidfVectorizer for this purpose.
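To make the weighting concrete, the sketch below computes TF-IDF by hand using the smoothed inverse document frequency formula that scikit-learn's TfidfVectorizer applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1. The L2 normalization that TfidfVectorizer also performs is omitted here for brevity:

```python
import math

def tfidf(docs):
    """Compute raw TF-IDF weights (term count x smoothed idf) per document."""
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(t in d for d in docs) for t in vocab}   # document frequency
    weights = []
    for d in docs:
        row = {t: d.count(t) * (math.log((1 + n) / (1 + df[t])) + 1)
               for t in set(d)}
        weights.append(row)
    return weights

docs = [['game', 'learning'], ['data', 'learning'], ['data', 'mining']]
w = tfidf(docs)
# 'learning' appears in 2 of 3 docs, so it is down-weighted relative to
# 'game', which appears in only 1.
print(w[0]['game'] > w[0]['learning'])  # prints: True
```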
Up to the resampling stage, the dataset was distributed unevenly between research groups.
Although there were significant sample drops within each research group during preprocessing, the
distribution remained imbalanced, as seen in Figure 2. An imbalanced dataset can bias the classifier,
which tends to perform well only when predicting the dominant classes [21]. Therefore, we applied
a resampling method, the Synthetic Minority Oversampling Technique (SMOTE). SMOTE
iteratively generates artificial samples by interpolating between neighboring original samples. This
phase stops when all classes have the same number of samples, 194 samples each.
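The interpolation idea behind SMOTE can be shown with a one-dimensional toy version; the real experiments use the imbalanced-learn implementation, which works on full feature vectors, and the function below is our own simplification:

```python
import random

def smote_like_oversample(minority, target_count, k=5, seed=42):
    """Generate synthetic points by interpolating between a minority sample
    and one of its k nearest neighbours (1-D toy version of SMOTE)."""
    rng = random.Random(seed)
    synthetic = list(minority)
    while len(synthetic) < target_count:
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (excluding x)
        neighbours = sorted(minority, key=lambda y: abs(y - x))[1:k + 1]
        nn = rng.choice(neighbours)
        gap = rng.random()                  # random position on the segment x -> nn
        synthetic.append(x + gap * (nn - x))
    return synthetic

minority = [1.0, 1.5, 2.0, 2.5]
augmented = smote_like_oversample(minority, target_count=10)
print(len(augmented))  # prints: 10
```

Every synthetic point lies between two existing minority samples, which is why SMOTE adds plausible variety rather than exact copies.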
Fig 2. Class distribution on the raw dataset
153 H.A. Rosyid / Knowledge Engineering and Data Science 2022, 5 (2): 150-159
This study used Multinomial Logistic Regression (MLR) due to the 13 research group classes.
Before modeling, we separated the dataset into a 70% training set and a 30% test set. The training set
was then used to train and optimize the MLR via the Grid Search Cross-Validation (GSCV) method.
This tuning method aims to find the combination of model parameters that produces the most optimal
and effective predictions [22]. The GSCV method exhaustively constructs and evaluates MLR models
using all parameter value combinations in Table 2 in a cross-validated environment (we use 10-fold).
The GSCV method produces insights into how different parameter combinations affect classification
performance. Then, we refitted the MLR using the parameters that produced the highest classification
performance.
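The candidate enumeration that grid search performs can be sketched directly from Table 2. In practice this is what sklearn.model_selection.GridSearchCV does internally before scoring each candidate with k-fold cross-validation; the helper below only enumerates, with the C values treated as numbers:

```python
from itertools import product

# Parameter grid from Table 2.
param_grid = {
    'penalty': ['l1', 'l2', 'none'],
    'C': [0.1, 1.0, 5, 10],
}

def grid_candidates(grid):
    """Enumerate every parameter combination in the grid."""
    keys = sorted(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]

candidates = grid_candidates(param_grid)
print(len(candidates))  # prints: 12  (3 penalties x 4 C values per scenario)
```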
Since two types of input are relevant to the research group, the title and the abstract, we ran three
scenarios of MLR prediction based on: 1) the title, 2) the abstract, and 3) the combination of title and
abstract. The goal is to identify which classifier performs best. Hence, the GSCV method is applied
within each scenario, producing 12 model candidates per scenario. In total, there are 36 candidates
for the research group prediction model.
In the evaluation stage, the best model from each scenario was tested using the 30% test data. The
metrics used were accuracy, precision, recall, and f1-score. The goal was to test how effective the
MLR was in terms of classification performance, or correctness level [23]. From there, we can
choose which MLR is best suited for SISINTA.
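The four evaluation metrics can be computed by hand. The sketch below uses the weighted averaging that sklearn's classification_report applies with average='weighted' (each class's score weighted by its support); the example labels are invented for illustration:

```python
from collections import Counter

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_prf(y_true, y_pred):
    """Support-weighted average precision, recall, and F1 over all classes."""
    support = Counter(y_true)
    precision = recall = f1 = 0.0
    for cls, n_cls in support.items():
        tp = sum(t == p == cls for t, p in zip(y_true, y_pred))
        pred_cls = sum(p == cls for p in y_pred)
        p_c = tp / pred_cls if pred_cls else 0.0    # per-class precision
        r_c = tp / n_cls                             # per-class recall
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        w = n_cls / len(y_true)                      # class weight = support share
        precision += w * p_c
        recall += w * r_c
        f1 += w * f_c
    return precision, recall, f1

y_true = ['keds', 'keds', 'game', 'iot']
y_pred = ['keds', 'game', 'game', 'iot']
print(round(accuracy(y_true, y_pred), 2))  # prints: 0.75
```

Note that weighted recall always equals accuracy, which is why Table 3 reports identical accuracy and recall values.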
III. Results and Discussion
The retrieved 2164 rows of data were raw text structured into three columns: title, abstract, and
research group. Figure 3 shows the raw state of the dataset.
Fig 3. Example of data collection results
The processes of tag removal, case folding, text trimming, and removal of punctuation marks,
special characters, double spaces, and numbers are carried out at the next cleaning stage. The
processing results of this stage can be seen in Figure 4.
Fig 4. Example of text cleaning results
Table 2. MLR parameters for grid search CV
Parameter Specification
multi_class multinomial
solver saga
penalty ['l1', 'l2', 'none']
C [0.1, 1.0, 5, 10]
The next step is to remove the missing values. There are four rows of missing values in the title
column and 896 rows of missing values in the abstract column, where the number of missing values
in the dataset can be seen in Figure 5.
Fig 5. Number of missing values in each dataset column
Furthermore, we identified one duplicated sample in the title column but none in the abstract.
As a result of text preprocessing, the dataset size decreases, yet the distribution of research group
classes remains imbalanced; see Figure 2.
The tokenization stage is carried out to separate text into tokens or words [24]. Figure 6 and Figure
7 show examples of the tokenization result in the title and abstract columns.
Fig 6. Tokenization results in the title column
Fig 7. Tokenization results in the abstract column
The stopwords removal stage is carried out to remove words or tokens that appear frequently and
have no critical meaning in the text [25]. The results of the stopwords removal process in the title
and abstract columns can be seen in Figure 8 and Figure 9.
Fig 8. Stopwords removal results in the title column
Fig 9. Stopwords removal results in the abstract column
The stemming stage is carried out to remove all affixes in words, such as suffixes, infixes, prefixes,
and combinations of prefixes and suffixes [26]. The results of the stemming process in the title
and abstract columns can be seen in Figure 10 and Figure 11.
Fig 10. Stemming results in the title column
Fig 11. Stemming results in the abstract column
TF-IDF produced a training-set matrix in the title scenario of 884 samples × 2300 columns.
Meanwhile, the test-set matrix of the title scenario is 380 samples × 2300 columns. The second and
third scenarios produced nearly quadruple the columns: 8218 and 8485, respectively. An example
view of term weighting using TF-IDF can be seen in Figure 12.
Fig 12. Term weighting examples using TF-IDF: (a) Title scenario, (b) Abstract scenario, and
(c) Combination of title and abstract
We applied the default configuration of SMOTE when generating synthetic samples (n_neighbors
= 5). Each research group contains 194 samples after the resampling process using SMOTE. In total,
2522 samples are ready for model training.
In the title scenario, the Grid Search Cross-Validation (GSCV) method scored the parameter
configuration C=0.1 with the 'none' penalty as the best for the MLR. Figure 13 depicts the comparison
between the candidates' performances (dots) across the regularization parameter values (x-axis) and
penalty types (colored lines). The graph shows that the MLR performs well when the C value is high,
regardless of the penalty type. The result of the MLR on the green line is suspected of overfitting,
because the other MLRs (orange and blue lines) underperform when C is lowest. This means that
regularization is essential for the MLR to generalize. From Figure 13, the L2-type regularization
(orange line) should be the best, since it performs better than the L1-type even at a low C value, and
at higher C values the MLR using L2 stays on top of the MLR using L1. Therefore, the MLR in this
scenario was refitted using Penalty=L2 with C=5 as the most optimal configuration.
Fig 13. Grid search CV results on title scenario
In the abstract scenario, the most optimal combination of parameters can be seen in Figure 14.
Our analysis in this second scenario is similar to the first; the differences appear only slightly in the
resulting scores. From this graph, the MLR using the abstract as input is refitted with Penalty=L2
and C=5.
Fig 14. Grid search CV results on the abstract scenario
GSCV results for the third scenario can be seen in Figure 15. Our analysis in this third scenario
is similar to the former two; the differences appear only slightly in the resulting scores. From this
graph, the MLR using the combined title and abstract as input is refitted with Penalty=L2 and C=5.
Fig 15. Grid Search CV results on the title and abstract scenario
From the three scenarios using GSCV, there were no significant differences in the effect of the
input used; the performances were relatively identical. However, we tested each model on the test
data to delve deeper into how the three MLR models perform. We measured each scenario's
performance metrics; the results can be seen in Table 3.
The evaluation results show that the title scenario is the best and most optimal. Although its
advantage over the other two scenarios is insignificant, it is more efficient, since the input size for
the MLR is much smaller when using the title only. This reduces the curse of dimensionality in
research group classification and requires less computational power. In addition, repeated words
(except stopwords) are less likely in titles than in abstracts. Hence, we argue that using the title is
the more concise choice for classification performance.
We also point out that the overall metrics are below 70%. We identified the causes: typographical
errors (typos) within the title or abstract, concatenated words, and the lack of a validation process to
check for these errors. Examples of errors contained in the dataset can be seen in Figure 16. The
highlighted words were only the few found in a brief observation. However, these words are not core
or root words that highly correlate with the research group. The classification model will lose some
accuracy when a word that contributes to a particular research group is mistyped. One solution is a
SISINTA policy under which any typo entered in the title or abstract bars the proposal from receiving
comments from the research group until corrected; either manual or automatic checking is feasible.
Alternatively, additional text preprocessing could identify these typos and decide whether to correct
or remove them.
Fig 16. Writing errors in the dataset
In addition, topics overlap greatly between research group classes, for instance, the research
groups "Game Technology and Machine Learning" and "Knowledge Engineering and Data Science".
Both contain research with the keywords “machine learning”, “data mining”, “classification”, etc.
Too many terms are shared between these two example research groups, and only a few keywords,
such as “game” and “text”, separate them. To address the shared-word problem by looking at linked
words, we can use n-grams, which decompose a text into chunks of n units so that linked words can
be parsed. However, using n-gram features significantly enlarges the dimensionality; hence, more
complex algorithms like Deep Learning would be needed to fit the task.
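As an illustration of the n-gram idea, the sketch below builds word-level bigrams (character n-grams are analogous) so that a linked phrase such as "machine learning" survives as a single feature; the helper name and joining convention are our own:

```python
def word_ngrams(tokens, n=2):
    """Build n-grams over tokens so that linked terms like
    'machine learning' become one feature instead of two."""
    return ['_'.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['game', 'machine', 'learning', 'application']
print(word_ngrams(tokens))
# prints: ['game_machine', 'machine_learning', 'learning_application']
```

Each document of m tokens contributes up to m - n + 1 n-grams, and the union over the corpus grows far faster than the unigram vocabulary, which is the dimensionality blow-up noted above.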
Finally, our proposed method is applicable in other departments as long as the digital storage of
the students' research is organized by research group (a web-based information system and its
database). Based on our findings, a future implementation may only need to structure the data into a
title column and a research group column. Additional text preprocessing to identify and replace typos
in the content is also essential to ensure the dataset's quality for the learning algorithm. Other learning
Table 3. Performance comparison
No Input Type Accuracy Precision Recall F1-Score
1. Title 63.68% 64.91% 63.68% 63.46%
2. Abstract 61.05% 61.16% 61.05% 60.73%
3. Title+Abstract 62.89% 63.17% 62.89% 62.57%
algorithms are also available, depending on the target classes and the size of the dataset provided.
Parameter tuning should be performed using GSCV with more parameter combinations, since a new
dataset's target case differs from our research. The remaining stages of the research group
recommendation are repeatable as is.
When SISINTA implements research group recommendation based on user input, the initial
procedure of the thesis or final project proposal can be completed in seconds. This can also help
lecturers in the research group provide more elaborate and comprehensive comments within their
scope of knowledge regarding the proposals. Any revisions required for the proposal will then be
relevant and constructive, steering the students' research in the right direction. Overall, this automatic
instruction can make SISINTA an intelligent information system for educational purposes. This
approach is applicable not only in DEEI but also in other departments, as long as a good platform
and data are available.
IV. Conclusion
This research showed that we successfully applied the Multinomial Logistic Regression (MLR)
algorithm to predict the research group based on text input, either the title or the thesis abstract. The
stages we followed in the text mining technique were straightforward, and the MLR performed
adequately well in classifying 13 research groups. The best scenario in this study was the MLR with
the title as the input variable. Using title data as the model training scenario is considered adequate,
optimal, and efficient, because repeated words (except stopwords) are rare within a thesis title. With
performances just above 63% in overall metrics, we argue that this MLR model with title text input
is optimal due to its small dimensionality. However, the performances remained below the 70%
threshold because research groups share similar keywords and the dataset contains typos. These typos
become noise, or the core word must still be extracted from them; therefore, additional text
preprocessing should address these typos.
Declarations
Author contribution
All authors contributed equally as the main contributor of this paper. All authors read and approved the final paper.
Funding statement
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit
sectors.
Conflict of interest
The authors declare no known conflict of financial interest or personal relationships that could have appeared to
influence the work reported in this paper.
Additional information
Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds.
Publisher’s Note: Department of Electrical Engineering - Universitas Negeri Malang remains neutral with regard to
jurisdictional claims and institutional affiliations.
References
[1] H. A. Rosyid, U. Pujianto, and M. R. Yudhistira, “Classification of Lexile Level Reading Load Using the K-Means
Clustering and Random Forest Method,” Kinet. Game Technol. Inf. Syst. Comput. Network, Comput. Electron.
Control, pp. 139–146, May 2020.
[2] M. Taddy, “Multinomial inverse regression for text analysis,” J. Am. Stat. Assoc., vol. 108, no. 503, pp. 755–770,
2013.
[3] H. Chai, Y. Liang, S. Wang, and H. Shen, “A novel logistic regression model combining semi-supervised learning
and active learning for disease classification,” Sci. Rep., vol. 8, no. 1, p. 13009, Aug. 2018.
[4] W. P. Ramadhan, S. T. M. T. Astri Novianty, and S. T. M. T. Casi Setianingsih, “Sentiment analysis using multinomial
logistic regression,” in 2017 International Conference on Control, Electronics, Renewable Energy and
Communications (ICCREC), Sep. 2017, pp. 46–49.
[5] S. A. Salloum, M. Al-Emran, A. A. Monem, and K. Shaalan, “Using Text Mining Techniques for Extracting
Information from Research Articles,” in Studies in Computational Intelligence, 2018, pp. 373–397.
[6] V. Dogra, A. Singh, S. Verma, Kavita, N. Z. Jhanjhi, and M. N. Talib, “Understanding of Data Preprocessing for
Dimensionality Reduction Using Feature Selection Techniques in Text Classification,” in Intelligent Computing and
Innovation on Data Science, 2021, pp. 455–464.
[7] Y. HaCohen-Kerner, D. Miller, and Y. Yigal, “The influence of preprocessing on text classification using a bag-of-
words representation,” PLoS One, vol. 15, no. 5, p. e0232525, May 2020.
[8] P. F. Muhammad, R. Kusumaningrum, and A. Wibowo, “Sentiment Analysis Using Word2vec And Long Short-Term
Memory (LSTM) For Indonesian Hotel Reviews,” Procedia Comput. Sci., vol. 179, pp. 728–735, 2021.
[9] J. Lever et al., “PGxMine: Text mining for curation of PharmGKB,” Pac Symp Biocomput, no. 25, pp. 611–622,
2020.
[10] S. Vijayaraghavan et al., “Fake News Detection with Different Models,” ArXiv, 2020.
[11] ReLearn: A Robust Machine Learning Framework in Presence of Missing Data for Multimodal Stress Detection from
Physiological Signals,” in 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology
Society (EMBC), Nov. 2021, pp. 535–541.
[12] P. R. Vishnu, P. Vinod, and S. Y. Yerima, “A Deep Learning Approach for Classifying Vulnerability Descriptions
Using Self Attention Based Neural Network,” J. Netw. Syst. Manag., vol. 30, no. 1, p. 9, Jan. 2022.
[13] H. Inoue, “Multi-Sample Dropout for Accelerated Training and Better Generalization,” ArXiv, 2019.
[14] G. N. R Prasad Sr Asst professor, “Identification of Bloom’s Taxonomy level for the given Question paper using NLP
Tokenization technique,” Turkish J. Comput. Math. Educ., vol. 12, no. 13, pp. 1872–1875, 2021.
[15] Y. A. Alhaj, J. Xiang, D. Zhao, M. A. A. Al-Qaness, M. Abd Elaziz, and A. Dahou, “A Study of the Effects of
Stemming Strategies on Arabic Document Classification,” IEEE Access, vol. 7, pp. 32664–32671, 2019.
[16] M. Adriani, J. Asian, B. Nazief, S. M. M. Tahaghoghi, and H. E. Williams, “Stemming Indonesian,” ACM Trans.
Asian Lang. Inf. Process., vol. 6, no. 4, pp. 1–33, Dec. 2007.
[17] M. A. Rosid, A. S. Fitrani, I. R. I. Astutik, N. I. Mulloh, and H. A. Gozali, “Improving Text Preprocessing For Student
Complaint Document Classification Using Sastrawi,” IOP Conf. Ser. Mater. Sci. Eng., vol. 874, no. 1, p. 012017, Jun.
2020.
[18] J. M.-T. Wu, G. Srivastava, J. C.-W. Lin, and Q. Teng, “A Multi-Threshold Ant Colony System-based Sanitization
Model in Shared Medical Environments,” ACM Trans. Internet Technol., vol. 21, no. 2, pp. 1–26, Jun. 2021.
[19] S. Qaiser and R. Ali, “Text Mining: Use of TF-IDF to Examine the Relevance of Words to Documents,” Int. J. Comput.
Appl., vol. 181, no. 1, pp. 25–29, Jul. 2018.
[20] N. S. Mohd Nafis and S. Awang, “An Enhanced Hybrid Feature Selection Technique Using Term Frequency-Inverse
Document Frequency and Support Vector Machine-Recursive Feature Elimination for Sentiment Classification,”
IEEE Access, vol. 9, pp. 52177–52192, 2021.
[21] M. Umer et al., “Scientific papers citation analysis using textual features and SMOTE resampling techniques,” Pattern
Recognit. Lett., vol. 150, pp. 250–257, Oct. 2021.
[22] G. S. K. Ranjan, A. Kumar Verma, and S. Radhika, “K-Nearest Neighbors and Grid Search CV Based Real Time
Fault Monitoring System for Industries,” in 2019 IEEE 5th International Conference for Convergence in Technology
(I2CT), Mar. 2019, pp. 1–5.
[23] B. H. Shekar and G. Dagnew, “Grid Search-Based Hyperparameter Tuning and Classification of Microarray Cancer
Data,” in 2019 Second International Conference on Advanced Computational and Communication Paradigms
(ICACCP), Feb. 2019, pp. 1–8.
[24] M. P. Geetha and D. Karthika Renuka, “Improving the performance of aspect based sentiment analysis using
fine-tuned Bert Base Uncased model,” Int. J. Intell. Networks, vol. 2, pp. 64–69, 2021.
[25] A. W. Pradana and M. Hayaty, “The Effect of Stemming and Removal of Stopwords on the Accuracy of Sentiment
Analysis on Indonesian-language Texts,” Kinetik: Game Technology, Information System, Computer Network,
Computing, Electronics, and Control, pp. 375–380, Oct. 2019, doi: 10.22219/kinetik.v4i4.912.
[26] J. Jumadi, D. S. Maylawati, L. D. Pratiwi, and M. A. Ramdhani, “Comparison of Nazief-Adriani and Paice-Husk
algorithm for Indonesian text stemming process,” IOP Conf. Ser. Mater. Sci. Eng., vol. 1098, no. 3, p. 032044, Mar.
2021.