JURNAL RISET INFORMATIKA 
Vol. 5, No. 3 June 2023 

P-ISSN: 2656-1743 |E-ISSN: 2656-1735 
DOI: https://doi.org/10.34288/jri.v5i3.555 

Accredited rank 4 (SINTA 4), excerpts from the decision of the DITJEN DIKTIRISTEK No. 230/E/KPT/2023 

 
419 

 
ANALYSIS OF INDONESIAN LANGUAGE DATASET FOR TAX COURT 
CASES: MULTICLASS CLASSIFICATION OF COURT VERDICTS 

 
Ade Putera Kemala-1*, Hafizh Ash Shiddiqi-2 

 
Data Science1*, Computer Science2 

School of Computer Science, Bina Nusantara University 
Jakarta, Indonesia 

ade.kemala@binus.ac.id1*), hafizh.shiddiqi@binus.ac.id2) 

(*)Coresponding Author 
 

Abstrak 
Pajak adalah kewajiban yang timbul akibat adanya undang-undang, menciptakan kewajiban bagi warga 
negara untuk memberikan sebagian pendapatan mereka kepada negara. Pengadilan Pajak berperan sebagai 
otoritas peradilan bagi wajib pajak yang mencari keadilan dalam sengketa pajak. Penelitian ini menyajikan 
analisis dataset pengadilan pajak dalam bahasa Indonesia dengan tujuan melakukan klasifikasi multiclass 
untuk memprediksi putusan pengadilan. Sebelum digunakan dataset melalui tahap pra-pemrosesan untuk 
membersihkan data, proses augmentasi data menggunakan metode oversampling dan label weighting untuk 
mengatasi ketidakseimbangan kelas. Dua model, yaitu bi-LSTM dan IndoBERT, digunakan untuk 
melaksanakan proses klasifikasi. Penelitian ini menghasilkan model akhir dengan akurasi 75,83% 
menggunakan model IndoBERT. Hasil penelitian menunjukkan efektivitas kedua model dalam memprediksi 
putusan pengadilan. Penelitian ini memiliki implikasi dalam memprediksi kesimpulan pengadilan dengan 
informasi kasus yang terbatas, dan memberikan wawasan berharga untuk proses pengambilan keputusan 
hukum. Temuan ini berkontribusi pada bidang analisis data hukum, menampilkan potensi teknik NLP dalam 
memahami dan memprediksi hasil pengadilan, sehingga meningkatkan efisiensi proses hukum. 
 
Kata kunci: NLP; Tax; BERT; Deep learning; Klasifikasi 
 

Abstract 
Tax is an obligation that arises due to the existence of laws, creating a duty for citizens to contribute a certain 
portion of their income to the state. The Tax Court serves as a judicial authority for taxpayers seeking justice 
in tax disputes, handling various types of taxes on a daily basis. This paper presents an analysis of an 
Indonesian language dataset of tax court cases, aiming to perform multiclass classification to predict court 
verdicts. The dataset undergoes preprocessing steps, while data augmentation using oversampling and 
label weighting techniques address class imbalance. Two models, bi-LSTM and IndoBERT, are utilized for 
classification. The research produced a final result of model with 75.83% using IndoBERT model. The 
results demonstrate the efficacy of both models in predicting court verdicts. This research has implications 
for predicting court conclusions with limited case details, providing valuable insights for legal decision-
making processes. The findings contribute to the field of legal data analysis, showcasing the potential of NLP 
techniques in understanding and predicting court outcomes, thus enhancing the efficiency of legal 
proceedings. 
 
Keywords: NLP; Tax; BERT; Deep learning; Classification 
 
 
INTRODUCTION 
 
Taxation, from an economic standpoint, 

refers to the transfer of resources from the private 
sector to the public sector. From a legal perspective, 
taxation is an obligation that arises due to the 
existence of laws, creating a duty for citizens to 
contribute a certain portion of their income to the 
state (Sutedi, 2022). Tax is a compulsory 
contribution imposed by the government on 

taxpayers, whether individuals or corporations, 
and it is enforced based on legal provisions. The 
government does not provide direct compensation 
to taxpayers; however, tax revenue should be 
utilized for the maximum prosperity of the people 
and the needs of the state (Halim et al., 2014). 

Taxation in Indonesia is primarily 
regulated in the constitution through Article 23A of 
the Undang Undang Dasar (UUD) Tahun 1945 
(Pracasya, 2021) “Taxes and other compulsory 


P-ISSN: 2656-1743 | E-ISSN: 2656-1735 
DOI: https://doi.org/10.34288/jri.v5i3.555 

JURNAL RISET INFORMATIKA 
Vol. 5, No. 3 June 2023 

Accredited rank 4 (SINTA 4), excerpts from the decision of the DITJEN DIKTIRISTEK No. 230/E/KPT/2023 

 
420 

 
levies for the purposes of the state are regulated by 
law”. Based on the collecting institutions, taxes in 
Indonesia are levied by both central and regional 
institutions. The types of taxes managed by the 
central government include Income Tax (PPh), 
Value-Added Tax (PPN), Luxury Goods Sales Tax 
(PPnBM), Stamp Duty (Bea Materai), Land and 
Building Tax for Plantations, Forestry, and Mining. 
These taxes are mostly managed by the Directorate 
General of Taxes and the Ministry of Finance 
(Farouq, 2018).  

Taxes play a crucial role in the life of a 
nation, particularly in the implementation of 
development, as taxes serve as a source of state 
revenue to finance expenditures, including 
development expenses. The general functions of 
taxes are budgetary, regulatory, stability, and 
income redistribution functions (Rohendi, 2014). 

The Tax Court is a judicial body that 
exercises judicial authority over taxpayers or tax 
payers seeking justice in tax disputes they 
encounter. The Tax Court has the same status, rank, 
and independence as other equivalent courts. This 
Tax Court operates within the framework of state 
administration and has an organizational structure 
that ultimately reports to the Mahkamah Agung 
(Supreme Court) (Sandra, 2021).  

NLP (Natural Language Processing) 
research has been flourishing in recent years due to 
the advancements in context-based models in NLP 
research, revolutionalized by the publication of the 
Transformer model (Vaswani et al., 2017). One of 
the developments stemming from the Transformer 
model is the BERT (Bidirectional Encoder 
Representations from Transformers) model 
(Devlin et al., 2018), which is an architecture 
created by stacking the Encoder component of the 
Transformer model. Previously, the popular 
method involved using word vectors like word2vec 
(Church, 2017) or glove(Pennington et al., 2014) 
combined with deep neural networks such as LSTM 
(Yu et al., 2019). 

Using BERT architecture, it is possible to 
create a pre-trained model that is trained with a 
huge amount of unlabeled data to provide it with a 
general understanding of language. Subsequently, 
the pre-trained model can be further fine-tuned 
using a small amount of labeled data to adapt it to a 
specific task. The BERT model has achieved 
remarkable results in various NLP tasks, such as 
classification (Sun et al., 2019), question answering 
(Wang et al., 2019), and named entity recognition 
(Church et al., 2020). 

One of the implementation of the BERT 
architecture, pre-trained on the Indonesian 
language, is referred to as IndoBERT (Wilie et al., 

2020). This model is trained with dataset called 
Indo4B which is a 23GB collection of corpus dataset, 
including Wikipedia, twitter, newsletter data. The 
resulting model has achieved state-of-the-art 
results in several NLP tasks specific to Indonesian 
language. 

The aim of this paper is to analyze the 
dataset of legal cases from the Indonesian Tax Court 
and attempt to predict the court verdict using 
available data by leveraging the capabilities of the 
BERT model for multiclass classification tasks. 

Due to the limited availability of textual 
dataset in the Indonesian language (Ferdiana et al., 
2019), there is still a gap that needs to be addressed 
in Indonesian natural language processing 
research. Therefore, this research aims to explore 
and utilize pre-trained models to build a classifier 
model for analyzing and classifying court-based 
data. 

 
RESEARCH METHODS 

 
This quantitative research focuses on 

analyzing and performing a multiclass classification 
task to predict court verdicts based on provided 
Indonesian language text data. 

Most of the research was conducted in the 
virtual space, utilizing cloud services provided by 
Google for the required computational tasks. The 
execution timeframe for this research was from 
June 1, 2023, to July 7, 2023. 
 

Figure 1. Flowchart research 

 
Figure 1 shows the steps involved in this 

research, starting with the dataset processing and 
utilizing the data to develop two NLP models. Each 


JURNAL RISET INFORMATIKA 
Vol. 5, No. 3 June 2023 

P-ISSN: 2656-1743 |E-ISSN: 2656-1735 
DOI: https://doi.org/10.34288/jri.v5i3.555 

Accredited rank 4 (SINTA 4), excerpts from the decision of the DITJEN DIKTIRISTEK No. 230/E/KPT/2023 

 
421 

 
model has different mechanisms, resulting in 
different treatments for the training dataset. Once 
the training process is completed, the accuracy of 
the models is evaluated. 
 
Dataset  

The dataset used is the Indonesian Tax 
Court Verdict Summary, which is a secondary 
dataset and is open source, obtained through 
Kaggle (Christian, 2021). This dataset consists of 
12,283 text data entries describing the content of 
tax disputes in the tax court along with the court 
decisions. Each data row represents a case in the tax 
court. The dataset consists of 7 columns. The 
detailed explanations for each column can be found 
in Table 1. 

 
Table 1. Dataset Features 

No Column Desc 
1 Text Textual data from court 

documents 
2 Nomor_putusan Court decision number 
3 Tahun_pajak Tax year 
4 Jenis_pajak Tax type 
5 Tahun_putusan Year of court decision 
6 Pokok_sengketa Main dispute 
7 Jenis_putusan Type of verdicts 

 
The dataset contains tax disputes handled by the 
tax court from 2005 to 2020. Among these 7 
columns, the jenis_pajak (tax type) and 
pokok_sengketa (main dispute) columns will be 
used as features for analysis, while the 
jenis_putusan (verdict type) column will be used as 
the label column. In this stage, we will attempt a 
multiclass classification task using the dataset. We 
will analyze the dataset, clean it, and determine 
which parameters to use. Based on preliminary 
inspection, we are particularly interested in using 
the pokok_sengketa (main dispute) as the primary 
feature for determining the verdict. This column 
contains the textual main object of discussion 
regarding the dispute related to this legal case. 
 

Figure 2. Tax Type Distribution 
 

Figure 3. Court Verdict Distribution 
 
Figures 2 and 3 display the statistical data 

distribution based on the jenis_putusan (decision 
type) and jenis_pajak (tax type) columns. It is 
evident that based on the jenis_putusan column, 
there are three types of verdicts with higher 
frequencies which is : mengabulkan seluruhnya 
(approved completely), mengabulkan sebagian 
(partially approved), and menolak (rejected). For 
convenience, menolak will be referred to as label '0', 
mengabulkan sebagian as label '1', and 
mengabulkan seluruhnya as label '2'.  

In Figure 2, it is shown that there are four 
major types of tax cases in the dataset: BeaCukai 
(customs), PPN (Value-added Tax), PPh (Income 
Tax), and gugatan (Tax Lawsuit). Based on this 
preliminary examination, we will only utilize these 
four types of cases that have verdicts falling into the 
three most common categories in the dataset. 

Table 2 displays the final data to be used 
for training and testing. There is a significant 
imbalance in the dataset labels, and various 
approaches will be employed to address this issue 
later. 

 
Table 2. Data Distribution 

Data Total 
Count 

Class Distribution 
(0 : 1 : 2) 

All Data 11380 4011 : 1464 : 5905 
BeaCukai 4371 1340 : 265 : 2766 

PPN 3992 1201 : 773 : 2018 
PPh 1730 573 : 404 : 753 

Gugatan 1287 897 : 22 : 368 
 

P-ISSN: 2656-1743 | E-ISSN: 2656-1735 
DOI: https://doi.org/10.34288/jri.v5i3.555 

JURNAL RISET INFORMATIKA 
Vol. 5, No. 3 June 2023 

Accredited rank 4 (SINTA 4), excerpts from the decision of the DITJEN DIKTIRISTEK No. 230/E/KPT/2023 

 
422 

 
Models  
There are two models used in this research. 

The first model is the bi-LSTM model. For this 
algorithm, the dataset is further pre-processed by 
removing stopwords and punctuation. Afterward, 
the dataset is split into an 80:20 distribution as 
training data and testing data. The resulting 
training data is fed into a bi-LSTM network to train 
the model.  

To ensure the best result, we employ a 
Gridsearch method (Akiba et al., 2019) to find the 
optimal hyperparameters for the model. The search 
space consists of the embedding dimension ranging 
from 50 to 300, LSTM units ranging from 64 to 256, 
and dropout rates ranging from 0 to 0.5. The best 
hyperparameters generated by this method and the 
default initial values to train the model can be 
observed in Table 3. The resulting models then 
evaluated with testing data to find the accuracy of 
the model. 

 
Table 3. Hyperparameter Searching 

Hyperparameter Range Best 
Value 

Default 
Value 

Embedding 
dimension 

50 – 300 271 100 

LSTM unit 64 – 256 75 128 
Dropout 0 – 0.5 0.11 0.5 

The second model used in this research is 
called IndoBERT, which is an implementation of the 
BERT architecture specifically for the Indonesian 
language. For this model, pre-processing is not 
required since the model focuses on the context of 
the sentences as a whole. Deleting or modifying the 
sentences may remove or alter the context. 

The data split used remains the same, with 
an 80:20 ratio for the train data and test data. The 
model is trained with the following 
hyperparameter settings: maximum input length of 
512, batch size of 16, epoch of 10, and a learning 
rate of 5e-06. Several scenarios were tested with 
this model, and further details of these scenarios 
will be explained in the following section. 

 
Imbalance Dataset 

To address data imbalance, two methods 
will be employed namely : label weighting and 
oversampling. Label weighting is a technique used 
to address the issue of imbalanced datasets in 
machine learning. It involves assigning different 
weights to the labels or classes in the dataset based 
on their frequency or importance (Madabushi et al., 
2020). With label weighting, each label will be 
assigned a different weight based on its frequency 
distribution. Label weighting helps in giving more 
importance to the minority classes 

On the other hand, for the oversampling 
method, the EDA (Easy Data Augmentation) 
technique (Wei & Zou, 2019) combined with 
WordNet will be used to generate new synthetic 
data using similar words from the WordNet corpus. 
It is worth noted that only the train data will be 
oversampled to prevent data testing leakage.  

There are several steps implemented by 
the EDA method to perform data augmentation, 
including replacing words with their synonyms, 
deleting a percentage of words, rearranging the 
positions of words in a sentence, and randomly 
inserting words into sentences. These steps are 
randomly applied to each data row. This 
augmentation will result in a more balanced 
dataset, where all classes with fewer instances will 
be augmented to match the size of the majority 
class. The results of the oversampling technique are 
presented in Table 4. 

 
  Table 4. Data Distribution after Oversampling 

Data Class Distribution (0 : 1 : 2) 
BeaCukai 2166 : 2180 : 2195 

PPN 1950 : 1833 : 1607 
PPh 924 : 630 : 607 

Gugatan 731 : 630 : 846 
 
 
RESULTS AND DISCUSSION 
 

The overall results of the research are 
presented in Table 5. Several insights can be 
derived from these results. 
 

Table 5. Research Result 
Model Data Acc 

IndoBERT All data Normal 75.83% 
Weighted label 75.04% 

BeaCukai 
Data 

Normal 79.66% 
Weighted label 78.29% 
Oversampling 65.49% 

PPN Data Normal 80.48% 
Weighted label 79.50% 
Oversampling 51.81% 

PPh Data Normal 66.48% 
Weighted label 65.61% 
Oversampling 43.06% 

Gugatan 
Data 

Normal 86.05% 
Weighted label 81.78% 
Oversampling 64.34% 

Bi-LSTM All data Normal 
Parameter 

67.36% 

Best Parameter 67.57% 

 
The first insight observed from the results 

is that context-based models like BERT 
outperformed word vector-based models with deep 


JURNAL RISET INFORMATIKA 
Vol. 5, No. 3 June 2023 

P-ISSN: 2656-1743 |E-ISSN: 2656-1735 
DOI: https://doi.org/10.34288/jri.v5i3.555 

Accredited rank 4 (SINTA 4), excerpts from the decision of the DITJEN DIKTIRISTEK No. 230/E/KPT/2023 

 
423 

 
neural networks such as bi-LSTM. Even with the 
best parameters applied to the bi-LSTM model, 
there was no significant increase in performance 
compared by using the initial hyperparameter in bi-
LSTM model.  

For the IndoBERT model, several scenarios 
were tested. The first scenario involved fine-tuning 
the model using all available data. There was no 
significant difference between using normal 
distribution data or weighted distribution data; 
both scenarios achieved an accuracy of 75%. 

After observing the performance using all 
the data, it was decided to attempt dividing the data 
by tax types. The rationale behind this is that 
different types of taxes may involve distinct 
wording and contextual elements in each legal case. 
The results indicated that the fine-tuning process 
using BeaCukai and PPN data yielded better results 
compared to the previous approach. It is worth 
noting that among all tax types, PPh (Income Tax) 
proved to be the most challenging for the model to 
classify accurately.  

Regarding the gugatan (Tax Lawsuit) data, 
its comparatively higher performance may be 
attributed to significant label imbalances within the 
testing data. There were only sufficient data for two 
labels, while the last label had very little 
representation. As a result, the model primarily 
focused on classifying the two main labels, which 
was relatively easier than classifying all three 
labels, leading to a more significant improvement in 
performance. 

It is also observed that using the weighted 
label and oversampling methods does not 
contribute to an improvement in the models' 
performance. In the case of oversampling, where 
several words in a sentence may be altered, 
removed, or added, it can potentially change the 
contextual meaning of the text, making it more 
challenging for the BERT model to accurately 
analyze the true context of the data. 

 
CONCLUSIONS AND SUGGESTIONS 
 

After analyzing the data, performing the 
necessary preprocessing data for each model it is 
concluded that context-based model like BERT 
performed best on a multiclass classification task. 
Interestingly, neither label weighting and 
oversampling method yielded a better result in this 
particular case. This research has implications for 
predicting court conclusions with limited case 
details, providing valuable insights for legal 
decision-making processes. The findings contribute 
to the field of legal data analysis, showcasing the 

potential of NLP techniques in understanding and 
predicting court outcomes, thus enhancing the 
efficiency of legal proceedings. Suggestion for 
future research is to use ‘text’ column in dataset as 
a new features and extract relevance information 
from it in order to perform multiclass classification 
with a better result. 

 
REFERENCES 
 
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. 

(2019). Optuna: A next-generation 
hyperparameter optimization framework. 
Proceedings of the 25th ACM SIGKDD 
International Conference on Knowledge 
Discovery & Data Mining, 2623–2631. 

Christian, W. (2021). Indonesian Tax Court Verdict 
Summary. 
https://www.kaggle.com/datasets/christian
wbsn/indonesia-tax-court-verdict 

Church, K. W. (2017). Emerging Trends: Word2Vec. 
Natural Language Engineering, 23(1), 155–
162. 
https://doi.org/10.1017/S13513249160003
34 

Church, K. W., Luoma, J., & Pyysalo, S. (2020). 
Exploring cross-sentence contexts for named 
entity recognition with BERT. ArXiv Preprint 
ArXiv:2006.01563, 23(1), 155–162. 

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. 
(2018). Bert: Pre-training of deep 
bidirectional transformers for language 
understanding. ArXiv Preprint 
ArXiv:1810.04805. 

Farouq, M. (2018). Hukum pajak di Indonesia. 
Prenada Media. 

Ferdiana, R., Jatmiko, F., Purwanti, D. D., Ayu, A. S. T., 
& Dicka, W. F. (2019). Dataset Indonesia 
untuk Analisis Sentimen. Jurnal Nasional 
Teknik Elektro Dan Teknologi Informasi, 8(4), 
334–339. 

Halim, A., Bawono, I. R., & Dara, A. (2014). 
Perpajakan: Konsep, Aplikasi, Contoh, dan 
Studi Kasus. Jakarta: Salemba Empat. 

Madabushi, H. T., Kochkina, E., & Castelle, M. (2020). 
Cost-sensitive BERT for generalisable 
sentence classification with imbalanced data. 
ArXiv Preprint ArXiv:2003.11563. 

Pennington, J., Socher, R., & Manning, C. D. (2014). 
Glove: Global vectors for word 
representation. Proceedings of the 2014 
Conference on Empirical Methods in Natural 
Language Processing (EMNLP), 1532–1543. 

Pracasya, D. P. (2021). Penerapan Peraturan 
Perundang-Undangan Pajak Daerah Atas 


P-ISSN: 2656-1743 | E-ISSN: 2656-1735 
DOI: https://doi.org/10.34288/jri.v5i3.555 

JURNAL RISET INFORMATIKA 
Vol. 5, No. 3 June 2023 

Accredited rank 4 (SINTA 4), excerpts from the decision of the DITJEN DIKTIRISTEK No. 230/E/KPT/2023 

 
424 

 
Perubahan Pasal Mengenai Perpajakan Dalam 
Undang-Undang Dasar Republik Indonesia 
Tahun 1945. " Dharmasisya” Jurnal Program 
Magister Hukum FHUI, 1(2), 13. 

Rohendi, A. (2014). Fungsi budgeter dan fungsi 
regulasi dalam ketentuan perpajakan 
indonesia. Jurnal Ecodemica: Jurnal Ekonomi, 
Manajemen, Dan Bisnis, 2(1), 119–126. 

Sandra. (2021). Mengenal Tugas dan Wewenang 
Pengadilan Pajak. 
https://www.pajakku.com/read/60cc494e5
8d6727b1651ab0f/Mengenal-Tugas-dan-
Wewenang-Pengadilan-Pajak 

Sun, C., Qiu, X., Xu, Y., & Huang, X. (2019). How to 
fine-tune bert for text classification? China 
National Conference on Chinese Computational 
Linguistics, 194–206. 

Sutedi, A. (2022). Hukum pajak. Sinar Grafika. 
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., 

Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, 
I. (2017). Attention is all you need. ArXiv 
Preprint ArXiv:1706.03762. 

https://arxiv.org/abs/1706.03762 
Wang, Z., Ng, P., Ma, X., Nallapati, R., & Xiang, B. 

(2019). Multi-passage bert: A globally 
normalized bert model for open-domain 
question answering. ArXiv Preprint 
ArXiv:1908.08167. 

Wei, J., & Zou, K. (2019). Eda: Easy data 
augmentation techniques for boosting 
performance on text classification tasks. ArXiv 
Preprint ArXiv:1901.11196. 

Wilie, B., Vincentio, K., Winata, G. I., Cahyawijaya, S., 
Li, X., Lim, Z. Y., Soleman, S., Mahendra, R., 
Fung, P., & Bahar, S. (2020). IndoNLU: 
Benchmark and resources for evaluating 
Indonesian natural language understanding. 
ArXiv Preprint ArXiv:2009.05387. 

Yu, Y., Si, X., Hu, C., & Zhang, J. (2019). A review of 
recurrent neural networks: LSTM cells and 
network architectures. Neural Computation, 
31(7), 1235–1270.