Knowledge Engineering and Data Science (KEDS), pISSN 2597-4602, eISSN 2597-4637, Vol 4, No 1, July 2021, pp. 38–48. https://doi.org/10.17977/um018v4i12021p38-48
©2021 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/). KEDS is a Sinta 2 journal (https://sinta.ristekbrin.go.id/journals/detail?id=6662) accredited by the Indonesian Ministry of Research & Technology.

Indonesian Sentence Boundary Detection using Deep Learning Approaches

Joan Santoso a,1,*, Esther Irawati Setiawan a,2, Christian Nathaniel Purwanto b,3, Fachrul Kurniawan c,4
a Department of Information Technology, Institut Sains dan Teknologi Terpadu Surabaya, Jalan Ngagel Jaya Tengah 73-77, Surabaya, Indonesia
b Electrical Engineering and Computer Science, National Yang Ming Chiao Tung University, 1001 University Road, Hsinchu, Taiwan
c Department of Informatics, Maulana Malik Ibrahim State Islamic University, Jalan Gajayana No. 50, Malang, Indonesia
1 joan@istts.ac.id*; 2 esther@istts.ac.id; 3 chrisnp.ee08@nycu.edu.tw; 4 fachrulk@ti.uin-malang.ac.id
* corresponding author

I. Introduction

Sentence segmentation, or tokenization, is a primary text-processing step in natural language processing [1]. Before processing each word token, we need to determine whether those tokens belong to the same sentence. Sentence boundary detection splits every sentence in a document so that the boundary information can be passed to the following process; it is therefore a crucial task for natural language processing. The core of detecting a sentence boundary is identifying the end of a sentence [2]. A full stop mark "." usually ends the sentence, but not in all cases: it may also denote an abbreviation, a decimal value, or even a currency value.
Other punctuation marks that may end a sentence are the question mark and the exclamation mark, and even an ordinary word may finish a sentence. Covering all the possibilities requires many rules, since every writer has their own writing style, and many rules mean a lot of effort and time.

Several studies use sentence boundary detection for text pre-processing. Walker [3] improves the accuracy of machine translation using a sentence splitter. Liu [4][5] and Roark [6] detect sentence boundaries in conversation. Goldstein [7] and Erwin [8] use a sentence extractor to summarize documents. Rudrapal [9] uses sentence boundary detection for social media text. Another study by Chang et al. [10] uses sentence position as a feature for question answering. Sentence boundary detection can help the pre-processing phase and further improve performance results.

ARTICLE INFO
Article history: Received 7 February 2021; Revised 22 May 2021; Accepted 21 June 2021; Published online 17 August 2021
Keywords: Bahasa Indonesia; Bidirectional LSTM; Natural Language Processing; Sentence Boundary Detection; Sequence Classification

ABSTRACT
Detecting the sentence boundary is one of the crucial pre-processing steps in natural language processing. It defines the boundary of a sentence, since the border between one sentence and the next might be ambiguous. Because there are multiple separators and dynamic sentence patterns, relying on a full stop at the end of a sentence is sometimes inappropriate. This research uses a deep learning approach to split each sentence in an Indonesian news document, so there is no need to define handcrafted features or rules. As in Part-of-Speech tagging and Named Entity Recognition, we use sequence labeling to determine sentence boundaries, with two labels: O for a non-boundary token and E for the last token of a sentence. We use the Bi-LSTM approach, which has been widely used in sequence labeling, and show that it works for Indonesian text using pre-trained Indonesian embeddings, as in previous studies. This study achieved an F1-score of 98.49 percent, a significant increase over previous results. This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

We choose the deep learning approach to simplify the learning process without crafting any rules by hand, as in the traditional machine learning approach. We use a Bidirectional LSTM because of its ability to remember long-term sequences in both directions. With this model, we do not need handcrafted features as in previous research; the model only needs the word tokens.

There are several reasons to conduct this research for Bahasa Indonesia. The main reason is the limited availability of tools and resources. Moreover, there is a need for sentence tokenization, which natural language processing approaches can use as a basis for further tasks. Sentence boundary detection is a crucial pre-processing phase of many natural language processing tasks. One use is in simultaneous translation, where sentence boundary detection detects sentences before the translation process [11]. It is also needed for chatbots [12], machine translation, named entity recognition, and coreference resolution [13].
Previous researchers have applied several machine learning approaches to sentence boundary detection, i.e., unsupervised methods [14], rule-based methods [13][15], Maximum Entropy [16], Hidden Markov Models [17], Conditional Random Fields [18], Support Vector Machines [19], and confusion networks [20]. We use a deep learning approach to detect the sentence boundary, as in our previous work [21]. Sentence boundary detection has been studied in other languages such as English [22], Portuguese [23], French [24], Vietnamese [25], Chinese [26], Japanese [27], Marathi [28], Kannada [29], Arabic [30], and Urdu [31]; another study, in Thai, uses a Bi-LSTM CNN approach [32]. In Indonesian, sentence boundary detection has been presented using Maximum Entropy [33] and Bidirectional LSTM [21].

Our contribution is aimed directly at text processing in Bahasa Indonesia. The result of sentence boundary detection might be used for extracting information or, further, for solving other natural language processing problems. To our knowledge, we are the first to propose sentence boundary detection with deep learning in Indonesian. After a document is tokenized, a punctuation mark can be ambiguous: it may or may not mark the end of a sentence. In this research, a sequential learning method classifies each token according to whether it marks the end of a sentence. We use deep learning to provide a crucial text pre-processing step that detects each sentence in a text document. Our sentence boundary detector can be used as a feature extractor for later tasks. Furthermore, we also prove that a deep learning model is capable of detecting sentence boundaries: our approach achieves a higher F1 score than the previous approach, without the need to build any handcrafted rules.

II. Method

This section explains the steps of our research framework. The first step explains how we build our corpus for sentence boundary detection.
It describes how the raw data are obtained and processed into a labeled dataset, followed by a discussion of the proposed architecture. The discussion is divided according to each architecture layer: the input layer, the Bidirectional LSTM cells, and the output layer. This section also explains the optimization method used.

A. The Problem in Sentence Boundary Detection for Bahasa Indonesia

This section explains some problems that occur when detecting sentence boundaries for Bahasa Indonesia [33]. All of them stem from the ambiguity that punctuation marks may not always end a sentence [34]. We list each problem with a few examples and discuss the relevant points for each case.

The first problem is the writing of titles and degrees. When writing someone's title, the writer often uses the short version of the title or degree. As seen in the first example, "H" (which stands for "Haji") is a title for someone who has performed the pilgrimage, and "Ir" (which stands for "Insinyur") is an academic degree for an engineering major. The full stop mark in a title or degree does not end the sentence. The case is different when the title or degree is placed at the end of the sentence: in the second example, the stop mark in "Kom." ends the sentence because it is the last word.

1. Presiden Ir. H. Joko Widodo berkunjung ke Surabaya.
   President Ir. H. Joko Widodo visited Surabaya.
2. Kelas kami diajar oleh Joan Santoso, M.Kom.
   Our class is taught by Joan Santoso, M.Sc.

The second problem, abbreviation of names, occurs when writing a long name. The writer usually shortens the name by using the first character of each word, with a full stop after each abbreviation. Such abbreviations are written in uppercase letters.
It is hard to list all abbreviations of names because many names appear in the document collection. A full stop in an abbreviated name does not end a sentence, unless the abbreviation is placed at the end of the sentence. This case is similar to the first case, the writing of titles and degrees. In the first example, the writer uses "W", which stands for "Widodo", to shorten the name. In the second example, the stop mark after "S", which stands for "Santoso", ends the sentence because it is the last word.

1. Presiden Ir. H. Joko W. berkunjung ke Surabaya.
   President Ir. H. Joko W. visited Surabaya.
2. Kelas kami kedatangan alumni bernama Joan S.
   Our class is visited by an alumnus named Joan S.

The third problem is common abbreviations. There are some standard abbreviations in Bahasa Indonesia, for example: "a.n." (atas nama / by the name of), "s.d." (sampai dengan / until), "d.a." (dengan alamat / placed in), "jl." (jalan / street), and "hlm." (halaman / page). A full stop in this kind of abbreviation does not end a sentence; writers usually use these abbreviations in the middle of a sentence. The first example shows the stop mark in "tgl.", shortened from the original word "tanggal" (date). The second example uses the abbreviation "s.d." to shorten "sampai dengan". In the third example, the writer could write the original word "Jalan" or just "Jl." for short.

1. Dia akan pergi pada tgl. 25 Agustus 2018.
   He will go on 25 August 2018.
2. Dia akan pergi dari Senin s.d. Minggu.
   He will go from Monday to Sunday.
3. Dia akan pergi ke Jl. Ngagel Jaya.
   He will go to Ngagel Jaya Street.

The fourth problem is the time separator. Time can be separated using punctuation marks, and the full stop in a time separator does not end the sentence. In the first example, the time expression "10.30" does not end its sentence.
It only separates the hours (10) from the minutes (30). The second example also uses a full stop mark to separate hours and minutes. Both examples show the use of the stop mark in time expressions within a sentence.

1. Dia telah tiba di Surabaya pukul 10.30 WIB.
   He had arrived in Surabaya at 10.30 WIB.
2. Pada pagi hari jam 08.05, sang pembunuh menemui korban.
   At 08.05 in the morning, the murderer met the victim.

The next problem is the money separator, which can also be expressed using punctuation marks. In the first example, the full stop mark in "100.000" does not end the sentence; it separates the amount of money. In Bahasa Indonesia, money is usually grouped per three digits to make it easier for the reader. The second example expresses the money format together with the currency: "Rp." is the formal way to write the Indonesian currency. There is a third way to express money in Bahasa Indonesia, as stated in the third example; the only difference is the use of ",-" to end the money expression.

1. Buku ini seharga 100.000.
   This book costs 100,000.
2. Tas ini seharga Rp. 100.000.
   This bag costs Rp. 100,000.
3. Meja ini seharga Rp. 100.000,-.
   This table costs Rp. 100,000,-.

Another problem is the number separator. A full stop mark separates numbers per thousand, not only when expressing money but when writing any number. For example, "1.123" in the first example contains a full stop separating the number of people who died in the earthquake. The second example shows a full stop separating the number of smartphones. Almost any numeric expression uses a full stop per thousand; this separation is similar to money separation and makes the number easier to read and understand.

1. Gempa pekan lalu menimbulkan korban sekitar 1.123 jiwa.
   Last week's quake caused casualties of around 1,123 people.
2. Ada 1.500.000 ponsel pintar yang terhubung ke server kami.
   There are 1,500,000 smartphones connected to our server.

Email-formatted text can be problematic because it may contain more than one full stop mark. The first example shows a standard email address. However, as the second example shows, the number of full stops in an email address can be arbitrary, since users can freely choose a custom name for their email. The third example shows that there are many non-formal ways to write an email address; building rules for each case is time-consuming. Moreover, as in the fourth example, the writer can use "dot" instead of a full stop mark.

1. Pertanyaan lain dapat dikirimkan ke email christian.np@indocl.stts.edu.
   Other questions can be sent to christian.np@indocl.stts.edu.
2. Email kami yaitu people.hrd.tech.123@main.hrd.indocl.stts.edu.
   Our email is people.hrd.tech.123@main.hrd.indocl.stts.edu.
3. Email saya adalah christian at indocl.stts.edu.
   My email is christian at indocl.stts.edu.
4. Email dia adalah christian.np at indocl dot stts dot edu.
   His email is christian.np at indocl dot stts dot edu.

The eighth problem is username-formatted text. Sometimes the writer quotes social media and includes the username, and there is no limit on full stop marks within a username; a full stop in a username does not end the sentence. The first example shows a full stop in an ordinary username, "@christian.np". A username can also contain many full stop marks, as in the second example, "@christian.n.p.stts.sby". This case rarely happens, but it is still possible.

1. Akun @christian.np juga mengatakan hal yang serupa.
   Account @christian.np also said the same thing.
2. @christian.n.p.stts.sby @joan.s. Ayo pergi ke Bali bulan depan!
   @christian.n.p.stts.sby @joan.s. Let's go to Bali next month!

Sentence emphasis is often used when the author wants to stress some meaning in the text. This kind of writing is often found in drama scripts to express feeling through the writing, and the writer can combine many different punctuation marks according to his or her creativity. The same happens when handling free-structured text from social media: chats, comments, or posts have no fixed writing rules, and users write according to current trends, so the rule for splitting sentences differs each time. Sometimes multiple punctuation marks are combined into a single token; usually, the last punctuation mark in the token is the one that ends the sentence. A question mark may not finish a sentence when it comes together with another punctuation mark, as in "?!". As seen in Figure 1, the question mark after the word "Surabaya" does not end the sentence; the exclamation mark after the question mark does. Likewise, a single punctuation mark may not end the sentence if it is placed in line with other punctuation marks: the last mark in the token "!!!", an exclamation mark, is the one that ends the sentence.

Fig. 1. Sentence emphasis example
Fig. 2. Non-punctuation token example from detik.com

The next problem happens in dialogue text. A conversation may consist of multiple sentences; when we split them up, we lose the context that determines to whom the sentences belong. The sentences seem to have their own context, but they are actually in the same context. We therefore adopt the convention that all words spoken by a person at a particular time count as a single sentence, even if there is more than one sentence inside.
This convention may differ from other sentence tokenizer tools, where the text is tokenized based on the end of a sentence rather than the context of the whole text.

1. "Siapa namamu?" tanya Joan.
   "What is your name?" asked Joan.
2. "Hai! Nama saya Christian NP. Saya senang berkenalan denganmu!" ujar Christian.
   "Hi! My name is Christian NP. I am glad to know you!" said Christian.

The first example is a common writing style in which the dialogue contains only one sentence. The second example is more complex: it consists of three different sentences, "Hai!", "Nama saya Christian NP.", and "Saya senang berkenalan denganmu!". We count these three sentences, together with the main sentence, as a single sentence, because they are all spoken by one person at a particular time and thus share the same context.

The last problem is the non-punctuation token. As we analyzed our dataset, we found that a sentence does not always end with a punctuation mark; a non-word may end the sentence. This usually happens when a list is converted to plain text: each point in a list can end either with a punctuation mark, such as a full stop, or with just a word. Figure 2 [35] shows the output of sentence tokenization from a list. The colon mark ":" ends the first sentence, which describes the list. The second sentence onward is split according to the list numbering, and their endings differ: "Widjojanto", "Husein", "Hehamahua", or other words may end the sentence. Note also that the full stop after each index number is combined with the current sentence; these numbers serve as indices and do not end the sentence.

B. Data Preparation

Our corpus is built from Indonesian news documents. All news is crawled from two news sites, Detik and Kompas. Each news article is then extracted and parsed to get the text.
We remove unused information such as ads, pictures, video, and audio, because we only need plain text. We then conduct post-processing, which converts all list types to readable text and performs tokenization at the word level; the product is a sequence of tokens, each either a word or a punctuation mark. In the last step, we split each sentence manually for all documents. We crawled the two sites from 2011 until 2012, yielding 14,142 sentences in total from all documents.

Figure 1 content:
Input  : Akankah Bapak Gunawan berkunjung ke Surabaya?! Kami yakin beliau datang!!!
Output : Akankah Bapak Gunawan berkunjung ke Surabaya?!
         Kami yakin beliau datang!!!

Figure 2 content:
Berikut 8 nama calon pimpinan KPK hasil seleksi Pansel yang dikirim Presiden ke DPR: 1. Bambang Widjojanto 2. Yunus Husein 3. Abdullah Hehamahua 4. Handoyo Sudradjat 5. Abraham Samad 6. Jaksa Zulkarnain 7. Adnan Pandu Praja 8. Irjen Pol (Purn) Aryanto Sutadi

Fig. 3. Data preparation from news site Detik

Figure 3 [36] displays an example of the data preparation process; the rest of the dataset follows the same procedure. On the left side is the original HTML-formatted text from a Detik news article. On the right side is the result, 20 sentences in total. Each sentence is split based on its context, as discussed in the previous section. The last part contains list-typed text, separated point by point, where the numbering is essential information for further tasks.

C. Sequence Classification for Sentence Boundary Detection

Long Short-Term Memory (LSTM) is established for Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and noun phrase chunking. Sentence boundary detection can be seen as a sequence classification problem, in which we label every timestep of the input, or as a collocation identification problem [37]. Every input token is predicted to be the end of a sentence or not, based on the previous tokens.
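The two-label scheme described in this paper (E for an end-of-sentence token, O otherwise) can be sketched directly. Below is a minimal Python illustration; the helper name and example tokens are ours, not from the paper:

```python
# Minimal sketch of the O/E labeling scheme: every token gets "E" if it
# ends a sentence and "O" otherwise. Helper name and data are illustrative.

def label_tokens(sentences):
    """Flatten gold sentences (lists of tokens) into (token, label) pairs."""
    pairs = []
    for sent in sentences:
        for i, tok in enumerate(sent):
            label = "E" if i == len(sent) - 1 else "O"
            pairs.append((tok, label))
    return pairs

gold = [
    ["Presiden", "Ir.", "H.", "Joko", "Widodo", "berkunjung", "ke", "Surabaya", "."],
    ["Kelas", "kami", "diajar", "oleh", "Joan", "Santoso", ",", "M.Kom."],
]
pairs = label_tokens(gold)
print(pairs[:3])   # the abbreviations "Ir." and "H." are labeled "O", not boundaries
print(pairs[8])    # ('.', 'E'): the full stop ends the first sentence
print(pairs[-1])   # ('M.Kom.', 'E'): a degree abbreviation ending the second sentence
```

A model trained on such pairs learns that the same surface token (a full stop, an abbreviation) can receive either label depending on context, which is exactly the ambiguity described above.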
We build the architecture based on the nature of the problem, paying attention to the whole sequence rather than individual predictions. Thus, we use a Bidirectional LSTM to capture sequential features from both directions (left to right and right to left). Figure 4 visualizes our system architecture, which we divide into three layers: the input layer, the sequence learning layer, and the output layer. The input is a sequence of tokens from a single sentence, and the output is a sequence of labels. The input layer simply converts each token into a vector using word embedding; the word vectors are then learned by the sequence learning layer, for which we use a Bidirectional LSTM. In the end, all predicted results are converted into final predictions in the output layer, identifying which tokens end a sentence and which do not. We also use an optimization method, the Adam optimizer, to help the learning process.

Our proposed input layer is a token embedding, converting each input token into a vector. Every token is a string, either a word or a punctuation mark. We use Skip-Gram Word2Vec as our embedding model; it gives a semantic representation of a token and captures the contextual similarity of different words. However, Word2Vec has a drawback when handling unknown words: it cannot provide a vector representation for a word it was not trained on. To handle this problem, we use a random trained vector to represent every unknown word.

Figure 3 output (the 20 extracted sentences):
1. Penerbit buku panduan traveling terkemuka dunia, Lonely Planet mengumumkan 10 destinasi terbaik di Asia.
2. Salah satunya ada dari Indonesia, yakni Pulau Komodo. Melansir CNN Travel, Jumat (13/7/2018), destinasi nomor satu di Asia berasal dari Korea Selatan, yakni Busan.
3. Kota ini sering disebut juga sebagai kota kedua di Korea Selatan. Busan, sekitar 2,5 jam perjalanan dari Seoul.
4. Kota ini terkenal karena merupakan tujuan berlibur di musim panas dengan seafoodnya yang lezat dan pantai yang cantik. Busan menawarkan berbagai kegiatan bagi pra traveler yang mengunjunginya.
5. Anda bisa mendaki perbukitan ke kuil Buddha, bersantai di pemandian air panas dan menikmati hidangan laut di pasar ikan terbesar di negara itu.
6. "Asia adalah benua yang sangat luas dengan keberagaman budayanya akan sangat cocok bagi mereka yang memimpikan tempat pelarian," kata juru bicara Lonely Planet Asia-Pasifik, Chris Zeiher.
7. "Para ahli kami telah menyisir ribuan rekomendasi untuk memilih tujuan terbaik untuk dikunjungi selama 12 bulan ke depan," tukas dia.
8. Tempat-tempat lain dipuji karena perbaikan infrastruktur destinasinya, sebagai contohnya Taman Nasional Komodo Indonesia. Berada di nomor 10 karena lebih mudah diakses daripada sebelumnya berkat rute penerbangan baru.
9. "Selain melihat Komodo yang terkenal, para pengunjung dapat mengunjungi pulau-pulau kecil seperti di Padar, Kanawa dan menyelam dengan pemandangan terumbu karang cantik," katanya.
10. Berikut daftar 10 Destinasi Terbaik Asia tahun 2018 versi Lonely Planet:
11. 1. Busan, Korea Selatan
12. 2. Uzbekistan
13. 3. Ho Chi Minh City, Vietnam
14. 4. Ghats Barat, India
15. 5. Nagasaki, Jepang
16. 6. Chiang Mai, Thailand
17. 7. Lumbini, Nepal
18. 8. Teluk Arugam, Sri Lanka
19. 9. Provinsi Sìchuan, China
20. 10. Taman Nasional Komodo, Indonesia

Sequence learning is used to predict the outputs from the given inputs. We use a Bidirectional LSTM, which uses two different LSTM cells: one acts as a forward learner and the other as a backward learner. The forward LSTM reads the input from the first token to the last token, and the backward LSTM reads the input from the last token to the first token.
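This forward/backward reading can be sketched numerically. Below is a minimal NumPy implementation of one Bi-LSTM pass following the gate equations given in this section; the dimensions and random weights are ours, chosen only for illustration (the paper does not publish code):

```python
# Minimal NumPy sketch of a Bi-LSTM layer following Eqs. (1)-(13):
# a forward cell reads tokens left-to-right, a backward cell reads them
# right-to-left, and the two hidden states are concatenated per timestep.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: activation, input, forget, output gates, then cell/hidden."""
    a = np.tanh(p["wa"] @ x + p["ua"] @ h_prev + p["ba"])  # Eq. (1): activation gate
    i = sigmoid(p["wi"] @ x + p["ui"] @ h_prev + p["bi"])  # Eq. (2): input gate
    f = sigmoid(p["wf"] @ x + p["uf"] @ h_prev + p["bf"])  # Eq. (3): forget gate
    o = sigmoid(p["wo"] @ x + p["uo"] @ h_prev + p["bo"])  # Eq. (4): output gate
    c = c_prev * f + i * a                                 # Eq. (5): cell state
    h = o * np.tanh(c)                                     # Eq. (6): hidden state
    return h, c

def init_params(rng, n_in, n_hid):
    p = {}
    for g in "aifo":
        p["w" + g] = rng.standard_normal((n_hid, n_in)) * 0.1
        p["u" + g] = rng.standard_normal((n_hid, n_hid)) * 0.1
        p["b" + g] = np.zeros(n_hid)
    return p

def bilstm(xs, p_fwd, p_bwd, n_hid):
    T = len(xs)
    h_f, h_b = [None] * T, [None] * T
    h, c = np.zeros(n_hid), np.zeros(n_hid)
    for t in range(T):                        # forward pass: first -> last token
        h, c = lstm_step(xs[t], h, c, p_fwd)
        h_f[t] = h
    h, c = np.zeros(n_hid), np.zeros(n_hid)
    for t in reversed(range(T)):              # backward pass: last -> first token
        h, c = lstm_step(xs[t], h, c, p_bwd)
        h_b[t] = h
    # Eq. (13): concatenate both directions at every timestep
    return [np.concatenate([h_f[t], h_b[t]]) for t in range(T)]

rng = np.random.default_rng(0)
n_in, n_hid, T = 8, 16, 5                     # toy token-embedding sequence
xs = [rng.standard_normal(n_in) for _ in range(T)]
hs = bilstm(xs, init_params(rng, n_in, n_hid), init_params(rng, n_in, n_hid), n_hid)
print(len(hs), hs[0].shape)                   # 5 timesteps, each of size 2 * n_hid
```

Each concatenated vector carries left context from the forward cell and right context from the backward cell, which is what lets the output layer judge a token such as "." against both sides of the sentence.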
The results from both cells are concatenated. In the figure, the gray circle denotes the input to the LSTM cell; the colored circles denote the gates of the LSTM cell: yellow for the activation gate, green for the input gate, red for the forget gate, and blue for the output gate. The light blue circle is the cell state, which holds long-term memory from several previous calculations.

$a_t^{fwd} = \tanh(w_a^{fwd} \cdot x_t^{fwd} + u_a^{fwd} \cdot h_{t-1}^{fwd} + b_a^{fwd})$ (1)
$i_t^{fwd} = \sigma(w_i^{fwd} \cdot x_t^{fwd} + u_i^{fwd} \cdot h_{t-1}^{fwd} + b_i^{fwd})$ (2)
$f_t^{fwd} = \sigma(w_f^{fwd} \cdot x_t^{fwd} + u_f^{fwd} \cdot h_{t-1}^{fwd} + b_f^{fwd})$ (3)
$o_t^{fwd} = \sigma(w_o^{fwd} \cdot x_t^{fwd} + u_o^{fwd} \cdot h_{t-1}^{fwd} + b_o^{fwd})$ (4)
$c_t^{fwd} = c_{t-1}^{fwd} * f_t^{fwd} + i_t^{fwd} * a_t^{fwd}$ (5)
$h_t^{fwd} = o_t^{fwd} * \tanh(c_t^{fwd})$ (6)

Equations (1) to (6) are the mathematical functions of the forward LSTM cell: (1) is the activation gate, (2) the input gate, (3) the forget gate, (4) the output gate, (5) the cell state, and (6) the prediction of the forward LSTM cell. Equations (7) to (12) are the analogous functions for the backward cell, and (13) gives the final prediction of both LSTM cells, concatenating the two output vectors.

$a_t^{bwd} = \tanh(w_a^{bwd} \cdot x_t^{bwd} + u_a^{bwd} \cdot h_{t-1}^{bwd} + b_a^{bwd})$ (7)
$i_t^{bwd} = \sigma(w_i^{bwd} \cdot x_t^{bwd} + u_i^{bwd} \cdot h_{t-1}^{bwd} + b_i^{bwd})$ (8)
$f_t^{bwd} = \sigma(w_f^{bwd} \cdot x_t^{bwd} + u_f^{bwd} \cdot h_{t-1}^{bwd} + b_f^{bwd})$ (9)
$o_t^{bwd} = \sigma(w_o^{bwd} \cdot x_t^{bwd} + u_o^{bwd} \cdot h_{t-1}^{bwd} + b_o^{bwd})$ (10)
$c_t^{bwd} = c_{t-1}^{bwd} * f_t^{bwd} + i_t^{bwd} * a_t^{bwd}$ (11)
$h_t^{bwd} = o_t^{bwd} * \tanh(c_t^{bwd})$ (12)
$h_t = \mathrm{concat}(h_t^{fwd}, h_t^{bwd})$ (13)

The output layer converts every vector result from the sequence learner into a predicted label using the Softmax function, which provides a probability distribution over the labels and outputs the label with the largest probability. The output labels are "E" ("EndOfSentence"), meaning the current token is the ending of a sentence, and "O" ("Others"), meaning the current token does not end a sentence. Because of the sequential nature of the model, every input token gets exactly one output label.

Fig. 4. System architecture

In this research, we choose the Adam optimizer to obtain an appropriate gradient for each weight in the network. Adam combines an adaptive learning rate with momentum; technically, every weight is updated using the gradient calculated with Adam. Algorithm 1 shows the pseudocode of the Adam optimizer; the default value for each hyperparameter is based on the original paper [38].

Algorithm 1: Adam Optimizer
while θ_t not converged do:
    t = t + 1
    g_t = GetGradient(θ_{t-1})
    m_t = β1 * m_{t-1} + (1 - β1) * g_t
    v_t = β2 * v_{t-1} + (1 - β2) * g_t^2
    m̂_t = m_t / (1 - β1^t)
    v̂_t = v_t / (1 - β2^t)
    θ_t = θ_{t-1} - α * m̂_t / (√v̂_t + ε)
return θ_t

III. Results and Discussions

We conducted several experiments to prove the capability of our proposed architecture, providing test cases by fine-tuning a few hyperparameters. Besides, we also report a different approach using a standard LSTM to compare with our Bidirectional LSTM model. We ran different scenarios based on changes to the hyperparameters, each using the same dataset. We split our corpus into 70% (9,953 sentences) for training and 30% (4,189 sentences) for testing. The random seed was turned off to focus only on the effect of the hyperparameter settings. There were two broad categories based on the model; for every model we varied the hidden units of the LSTM cell, the number of layers, and the training iterations. Table 1 reports all experiments across the different methods; rows are methods and columns are numbers of iterations. Each method varies either the LSTM or the word embedding. Based on the results in Table 1, we found that BiLSTM (Bidirectional LSTM) works better than UniLSTM (Unidirectional LSTM). Word embedding gives a small difference in overall accuracy.
Increasing the number of iterations improves accuracy, but the gains diminish with further iterations. We also conducted another trial to identify the effect of the word-embedding dimension, using 50% of the training documents, again split into 70% of sentences for training and 30% for testing. The results are 98.43% for 50 dimensions, 98.30% for 100, 98.32% for 150, 98.46% for 200, 98.50% for 250, 98.57% for 300, 98.08% for 350, 98.26% for 400, 98.63% for 450, and 98.26% for 500 dimensions. Our final result is 98.49%, using the Bi-LSTM model with Word2Vec embedding and 100 iterations.

The second experiment compares the performance of the proposed method with several approaches from previous state-of-the-art research. Since the problem is modeled as sequential tagging over input token sequences, several sequential tagging methods are used for comparison: Maximum Entropy, Decision Tree, and Naïve Bayes. In addition to these traditional non-deep-learning models, the proposed method is also compared with the previous Bi-LSTM study by Purwanto et al. [21]. The experimental results can be seen in Table 2.

Table 1. Experiment results

Method               | 10 iter. | 20 iter. | 50 iter. | 100 iter.
UniLSTM              | 96.79%   | 96.94%   | 97.14%   | 96.81%
BiLSTM               | 96.95%   | 97.43%   | 98.22%   | 98.47%
UniLSTM + Word2Vec*  | 96.91%   | 96.41%   | 97.44%   | 97.48%
BiLSTM + Word2Vec*   | 97.09%   | 98.10%   | 98.39%   | 98.49%
*We use the Skip-Gram model for Word2Vec

Based on the experimental results, the best Bi-LSTM configuration proposed in this study provides a significant increase of approximately 13% over the approaches that do not use deep learning.
However, compared with the Bi-LSTM proposed by [21], there is an increase of approximately 2%. The reason is that the proposed approach uses two labels, while the approach in [21] uses four; for sentence boundary detection, two labels give better results than four.

IV. Conclusion

We have conducted several experiments to prove the capability of the Bidirectional LSTM as a sequence learner for sentence boundary detection. We view this task as a sequential problem in which every input token is predicted to end a sentence or not. Based on our experiments, we reached a 98.49% F1 score with a Bidirectional LSTM as the sequence learner and trained word embeddings as the best model. We also compared our approach with other widely used sequence classification methods and conclude that the Bidirectional LSTM clearly outperforms a Unidirectional LSTM. In our case, Word2Vec does not effectively capture sentence boundaries for Indonesian news documents, and our last trial gives similar F1 scores whether using a low- or high-dimensional embedding.

Acknowledgment

The authors thank Institut Sains dan Teknologi Terpadu Surabaya (ISTTS) for supporting this research, and the members of the Natural Language Processing Laboratory at ISTTS for helping us finish this research.

Declarations

Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.

Funding statement. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflict of interest. The authors declare no known conflict of financial interest or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Reprints and permission information is available at http://journal2.um.ac.id/index.php/keds.
Publisher’s Note: Department of Electrical Engineering - Universitas Negeri Malang remains neutral with regard to jurisdictional claims and institutional affiliations.

Table 2. Experimental results compared with other studies

No.  Previous Research                  Performance
1    Maximum Entropy                    87.91%
2    Decision Tree                      82.23%
3    Naïve Bayes                        86.28%
4    Bi-LSTM by Purwanto et al. [21]    96.57%
5    Our Proposed Model                 98.49%

References
[1] D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall, Englewood Cliffs, 2008.
[2] J. Read, R. Dridan, S. Oepen, and L. J. Solberg, “Sentence boundary detection: A long solved problem?,” in Proceedings of COLING 2012: Posters, 2012, pp. 985–994.
[3] D. J. Walker, D. E. Clements, M. Darwin, and J. W. Amtrup, “Sentence boundary detection: A comparison of paradigms for improving MT quality,” in Proceedings of the MT Summit VIII, 2001, vol. 58.
[4] Y. Liu, A. Stolcke, E. Shriberg, and M.
Harper, “Comparing and combining generative and posterior probability models: Some advances in sentence boundary detection in speech,” in Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004, pp. 64–71.
[5] Y. Liu, A. Stolcke, E. Shriberg, and M. Harper, “Using conditional random fields for sentence boundary detection in speech,” in Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, 2005, pp. 451–458.
[6] B. Roark et al., “Reranking for sentence boundary detection in conversational speech,” in 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, 2006, vol. 1, pp. I-I.
[7] J. Goldstein, V. Mittal, J. Carbonell, and M. Kantrowitz, “Multi-document summarization by sentence extraction,” in Proceedings of the 2000 NAACL-ANLP Workshop on Automatic Summarization, 2000, pp. 40–48.
[8] E. Y. Hidayat, F. Firdausillah, K. Hastuti, I. N. Dewi, and A. Azhari, “Automatic text summarization using latent Dirichlet allocation (LDA) for document clustering,” Int. J. Adv. Intell. Informatics, vol. 1, no. 3, pp. 132–139, 2015.
[9] D. Rudrapal, A. Jamatia, K. Chakma, A. Das, and B. Gambäck, “Sentence Boundary Detection for Social Media Text,” in Proceedings of the 12th International Conference on Natural Language Processing, 2015, pp. 254–260.
[10] X. Chang and Q. Zheng, “Offline definition extraction using machine learning for knowledge-oriented question answering,” in International Conference on Intelligent Computing, 2007, pp. 1286–1294.
[11] R. Zhang and C. Zhang, “Dynamic Sentence Boundary Detection for Simultaneous Translation,” in Proceedings of the First Workshop on Automatic Simultaneous Translation, 2020.
[12] T. A. Le, “Sequence labeling approach to the task of sentence boundary detection,” in ACM International Conference Proceeding Series, Jan. 2020, pp. 144–148, doi: 10.1145/3380688.3380703.
[13] N. Sadvilkar and M.
Neumann, “PySBD: Pragmatic Sentence Boundary Disambiguation,” Oct. 2020, [Online]. Available: http://arxiv.org/abs/2010.09657.
[14] T. Kiss and J. Strunk, “Unsupervised multilingual sentence boundary detection,” Comput. Linguist., vol. 32, no. 4, pp. 485–525, 2006.
[15] J. Wang, Y. Zhu, and Y. Jin, “A rule-based method for Chinese punctuations processing in sentences segmentation,” in 2014 International Conference on Asian Language Processing (IALP), 2014, pp. 195–198.
[16] J. C. Reynar and A. Ratnaparkhi, “A maximum entropy approach to identifying sentence boundaries,” in Proceedings of the Fifth Conference on Applied Natural Language Processing, 1997, pp. 16–19.
[17] B. Jurish and K.-M. Würzner, “Word and Sentence Tokenization with Hidden Markov Models,” JLCL, vol. 28, no. 2, pp. 61–83, 2013.
[18] K. Tomanek, J. Wermter, and U. Hahn, “Sentence and token splitting based on conditional random fields,” in Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics, 2007, vol. 49, p. 57.
[19] Y. Akita, M. Saikou, H. Nanjo, and T. Kawahara, “Sentence boundary detection of spontaneous Japanese using statistical language model and support vector machines,” 2006.
[20] D. Hillard, M. Ostendorf, A. Stolcke, Y. Liu, and E. Shriberg, “Improving automatic sentence boundary detection with confusion networks,” in Proceedings of HLT-NAACL 2004: Short Papers, 2004, pp. 69–72.
[21] C. N. Purwanto, A. T. Hermawan, J. Santoso, and Gunawan, “Distributed Training for Multilingual Combined Tokenizer using Deep Learning Model and Simple Communication Protocol,” in 2019 1st International Conference on Cybernetics and Intelligent System (ICORIS), 2019, vol. 1, pp. 110–113.
[22] D. Gillick, “Sentence boundary detection and the problem with the US,” in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, 2009, pp. 241–244.
[23] C.
N. Silla and C. A. A. Kaestner, “An analysis of sentence boundary detection systems for English and Portuguese documents,” in International Conference on Intelligent Text Processing and Computational Linguistics, 2004, pp. 135–141.
[24] C.-E. González-Gallardo and J.-M. Torres-Moreno, “Sentence boundary detection for French with subword-level information vectors and convolutional neural networks,” arXiv Prepr. arXiv:1802.04559, 2018.
[25] H. P. Le and T. V. Ho, “A maximum entropy approach to sentence boundary detection of Vietnamese texts,” 2008.
[26] N. Xue and Y. Yang, “Chinese sentence segmentation as comma classification,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Jun. 2011, pp. 631–635, [Online]. Available: https://www.aclweb.org/anthology/P11-2111.
[27] K. Shitaoka, K. Uchimoto, T. Kawahara, and H. Isahara, “Dependency Structure Analysis and Sentence Boundary Detection in Spontaneous Japanese,” in Proceedings of the 20th International Conference on Computational Linguistics, 2004, pp. 1107–es, doi: 10.3115/1220355.1220514.
[28] N. Wanjari, G. M. Dhopavkar, and N. B. Zungre, “Sentence Boundary Detection For Marathi Language,” Procedia Comput. Sci., vol. 78, pp. 550–555, 2016, doi: 10.1016/j.procs.2016.02.101.
[29] D. N and R. K. P, “Article: Sentence Boundary Detection in Kannada Language,” Int. J. Comput. Appl., vol. 39, no. 9, pp. 38–41, Feb. 2012.
[30] C.-E. González-Gallardo, E. L. Pontes, F. Sadat, and J.-M. Torres-Moreno, “Automated Sentence Boundary Detection in Modern Standard Arabic Transcripts using Deep Neural Networks,” Procedia Comput. Sci., vol. 142, pp. 339–346, 2018, doi: 10.1016/j.procs.2018.10.485.
[31] Z. Rehman, W. Anwar, and U. I.
Bajwa, “Challenges in Urdu Text Tokenization and Sentence Boundary Disambiguation,” in Proceedings of the 2nd Workshop on South Southeast Asian Natural Language Processing (WSSANLP), Nov. 2011, pp. 40–45, [Online]. Available: https://www.aclweb.org/anthology/W11-3007.
[32] S. Sirirattanajakarin, D. Jitkongchuen, and P. Intarapaiboon, “BoydCut: Bidirectional LSTM-CNN Model for Thai Sentence Segmenter,” Sep. 2020, doi: 10.1109/IBDAP50342.2020.9245454.
[33] S. J. Putra, M. N. Gunawan, I. Khalil, and T. Mantoro, “Sentence boundary disambiguation for Indonesian language,” in ACM International Conference Proceeding Series, Dec. 2017, pp. 587–590, doi: 10.1145/3151759.3156474.
[34] S. Raharjo, R. Wardoyo, and A. E. Putra, “Rule Based Sentence Segmentation of Indonesian Language,” J. Eng. Appl. Sci., vol. 13, no. 21, pp. 8986–8992, 2018.
[35] “Siapa Calon Pimpinan KPK yang Akan Dipilih DPR?,” Nov. 14, 2011. https://news.detik.com/berita/d-1766855/siapa-calon-pimpinan-kpk-yang-akan-dipilih-dpr (accessed Aug. 09, 2021).
[36] “10 Destinasi Terbaik Asia 2018 Versi Lonely Planet, Ada Komodo,” Jul. 13, 2018. https://travel.detik.com/travel-news/d-4113452/10-destinasi-terbaik-asia-2018-versi-lonely-planet-ada-komodo (accessed Aug. 09, 2021).
[37] T. Kiss and J. Strunk, “Viewing sentence boundary detection as collocation identification,” in Proceedings of KONVENS, 2002, vol. 2002, pp. 75–82.
[38] D. P. Kingma and J. L. Ba, “Adam: a Method for Stochastic Optimization,” Int. Conf. Learn. Represent. 2015, 2015.