International Journal on Advances in ICT for Emerging Regions 2019 12 (2):  

 
Classification of Voice Content in the Context of Public Radio Broadcasting

G.A.G.S.Karunarathna#1, K.L.Jayaratne #2, P.V.K.G.Gunawardana#3

Abstract: With the rapid development of mass media technology,
content classification of radio broadcasts has emerged as a major
research area that facilitates the automation of radio broadcast
monitoring. This research focuses on classifying voice-dominant
radio broadcast content using a multi-class Support Vector Machine
(SVM), with the aim of automating the monitoring of radio
broadcasting in Sri Lanka. The study investigates the performance
of the “One Vs. One” and “One Vs. All” methods, the two
conventional ways to build a multi-class SVM. Both multi-class SVM
models are trained to classify five voice-dominant classes: news,
conversations, advertisements without jingles, radio dramas, and
religious programs.

One of the most important steps in building such a classifier is
the selection of optimal feature sets. For that purpose, time
domain features, frequency domain features, cepstral features, and
chroma features are manually analyzed for each binary SVM
classifier independently. The two multi-class SVM models are
trained on the selected features. The “One Vs. One” model
classified the recordings better, with 85% accuracy compared to
the 83% accuracy achieved by the “One Vs. All” model. Further, the
results revealed the importance of careful feature selection for
achieving higher classification accuracy.

Keywords: Audio monitoring, Audio classification, Radio
Broadcasting, Audio feature analysis, Support Vector Machines

I. INTRODUCTION  

Radio has been a dynamic and approachable communication medium
for many decades since its invention. Unlike other communication
devices such as computers and smartphones, the radio can be used
easily by people of any age. According to statistics from 2015,
more than half of the population listen to the radio while
driving, women tend to listen to the radio while cooking, and most
people listen to the radio even at their workplaces [1].
Therefore, compared with other communication media, radio plays an
important role in sharing information.

In radio transmission, the radio station and the listeners are
the two endpoints. Radio stations broadcast unidirectional
wireless signals to multitudes of individual listeners with radio
receivers. They broadcast a sequence of content categories such as
songs, advertisements, news, interviews, conversations, and radio
dramas. The number of listeners of a radio channel depends on the
programs that the station broadcasts. Thus, to attract an
audience, a program should be well produced and suited to that
audience. Hence, to measure the performance of a broadcast
program, radio stations need to monitor the broadcast content
regularly. Broadcast content monitoring helps to verify when and
where content was broadcast, to protect copyrights by knowing
precisely how the content is being used, and to measure
performance across other broadcast channels [2, 3, 4, 5, 6].
Therefore, broadcast content monitoring is a necessity for radio
stations.

Furthermore, the stakeholders of radio channels also need to
monitor broadcast content for different purposes, such as
business, political, and legal needs [7, 8]. Authorized officials
in mass media and information corporations need to track FM
channels regularly to ensure that the broadcast content adheres to
rules and regulations and offers a diversity of programs. Singers
and composers need songs to be monitored in order to claim their
rights [9, 10, 11, 12, 13]. Advertising agents are keen to know how
frequently advertisements are broadcast, which has a huge impact
on their corporate income. Political parties also watch for
references to their names in radio broadcast content, especially
in news and political discussions. Therefore, it is reasonable to
state that many stakeholders are interested in monitoring radio
broadcast content for a wide range of reasons.

Radio broadcasts are monitored both manually and automatically.
Manual monitoring uses techniques such as having an observer
listen to the radio content, reading the attached metadata, and
requesting broadcast reports from the stations. These techniques
become inefficient and resource intensive when the amount of
content to be monitored is high. In automated radio monitoring, a
well-trained machine monitors the radio content more effectively
and efficiently than the manual process. Developed countries
mostly use automated radio monitoring [14]. As a developing
country, Sri Lanka has not yet established such a technology to
monitor radio broadcasts. Since there is a large number of radio
channels in Sri Lanka, manual monitoring is not practical.
Unfortunately, the mechanisms used in developed countries cannot
be applied directly to FM channels in Sri Lanka because of
differences in languages and pronunciations. Hence, it is
imperative for the Sri Lankan broadcasting context to have an
automatic radio monitoring program.

As the initial step in building an automated monitoring process,
it is essential to identify the different content classes (i.e.
songs, advertisements, news, interviews, conversations, and radio
dramas) in radio broadcast content. An analysis of the current
state of this problem identifies the classification of broadcast
context for onset detection as the closest research work [15].
Onset detection is the mechanism used to identify the places where
content changes occur in a musical note or other sound stream. The
researchers proposed a unified methodology to automate radio
broadcast monitoring, which detects onsets in the radio
broadcasting context with the assistance of broadcast content
classification. The proposed mechanism distinguishes songs,
commercial advertisements with jingles, news, and other content in
a radio stream. However, this unified method is unable to identify
the voice-dominant content classes in the broadcasting context.
Hence, the classification of different voice-dominant content in
the radio broadcast stream, such as news, advertisements without
jingles, conversations, radio dramas, and religious programs, is
identified as the knowledge gap between the requirement and the
existing solutions. Therefore, this research proposes a
methodology to classify voice-dominant content in radio
broadcasting.

Manuscript received on 22nd Feb. 2019. Recommended by Dr. D.N.
Ranasinghe on 30th Dec. 2019.

G.A.G.S.Karunarathna, K.L.Jayaratne, P.V.K.G.Gunawardana are from the
University of Colombo School of Computing, Sri Lanka.
(gothamikarunarathna@gmail.com, klj@ucsc.cmb.ac.lk,
kgg@ucsc.cmb.ac.lk).

II. RELATED WORK 

As audio classification has emerged as an in-demand research
area, a considerable amount of related work exists. Based on the
approach followed, these works can be divided into two groups:
algorithmic approaches and machine learning approaches.

A. Algorithmic Approaches 

Lie Lu, Stan Z. Li and Hong-Jiang Zhang [21] proposed an
algorithm that classifies an audio stream into speech, music,
environmental sounds, and silence. Silence detection is performed
based on short-time energy and zero-crossing rate (ZCR) in a
one-second window. Linear Spectral Pairs (LSP) distance analysis
is used to refine the results of the proposed algorithm. The
results show some misclassifications between music and
environmental sound due to overlaps in the distributions of the
features.

John Saunders [16] proposed an algorithm for discriminating
speech from music on broadcast FM radio based on the ZCR of the
time domain waveform. This technique emphasized that
characteristics of speech, such as limited bandwidth, alternating
voiced and unvoiced sections, and an energy contour that
alternates between high and low levels, are well suited to
separating speech from music.

Barzilay et al. [17] proposed an algorithmic approach for speaker
role identification in the radio broadcasting context. This
approach classified anchor, journalist, and guest programmer roles
by considering lexical features, features from the surrounding
context, and duration features.

Although the algorithmic approaches show promising results, the
problem becomes non-trivial as the number of classes increases.
Identifying the threshold values that discriminate each class is
also difficult. To avoid these drawbacks of algorithmic
approaches, researchers have recently moved to machine learning
approaches.

B. Machine Learning Approaches 

The machine learning community has produced numerous works under
both supervised and unsupervised learning. The supervised learning
approaches include Neural Networks (NN), the Hidden Markov Model
(HMM), and the Support Vector Machine (SVM). The unsupervised
learning approaches include K-means and the Gaussian Mixture Model
(GMM). Since our problem domain calls for supervised learning,
more attention is given to NN, HMM, and SVM.

The most recent and closest work addressing the same problem is
“Classification of public radio broadcast context for onset
detection” by C. Weeratunga et al. [15]. In this approach, an onset
detection mechanism together with a classification model is
proposed to predict four classes (i.e. songs, voice-related
segments, news, and radio commercials). A supervised neural
network model with 38 extracted features is included in the
classification framework. Radio commercials, songs, news, and
other voice content are classified with accuracies of 76%, 75%,
41%, and 59% respectively. In this approach, the output of the
onset detection largely depends on the accuracy of the
classification. Currently, it achieves 82% accuracy for onset
detection with respect to the aforementioned audio classes in the
radio broadcasting context. To automate the radio broadcast
monitoring process, the existing onset detection method should be
improved. Therefore, as a further step, the focus should be on
voice-dominant content classification in radio broadcasting.

Another supervised neural network approach was used by Khan et
al. [22] to classify speech and music. The classification
framework uses a multilayer perceptron neural network trained with
the back-propagation learning algorithm. The experimental results
showed an overall accuracy of 96.6%, with 100% accuracy in
recognizing music from speech.

In the research by R. Kotsakis, G. Kalliris, and C. Dimoulas
[14], various audio pattern classifiers for broadcast-audio
semantic analysis are investigated using radio program-adaptive
classification strategies with a supervised ANN system. In the
evaluation, Kotsakis et al. found the ANN and KNN classifiers to be
more effective than the tree complex and SMO methods.

Kons et al. [23] suggested a Deep Neural Network (DNN) for
classifying four classes: crowds of people, car/road noise,
applause yelling/cheering, and various kinds of music recorded
outdoors. The DNN classifier achieved the best overall performance
in most of the classes, except for the music class, where the SVM
performed better.

Like the neural network approaches, the Hidden Markov Model has
also shown high performance in radio broadcast content
classification problems. HMM is used for radio commercial
classification by G. Koolagudi et al. [19]. As they observed, HMM
performed well in some situations where ANN failed (i.e. when
background music follows an advertisement). Another HMM-related
work, conducted by Yang Liu [18], identifies the roles of speakers
in radio broadcast news content. This research uses
well-structured news content, which highlights the speaker role
sequences. An accuracy of 80% is obtained, and the beginnings and
ends of the speaker's sentences were found to be a good heuristic
for role identification.

SVM is fundamentally designed for binary classification problems.
As an extension, a multi-class SVM is obtained by combining a set
of binary SVM classifiers. There are three ways to design
multi-class SVMs: One Vs. All, One Vs. One, and DAGSVM [24]. The
main advantage of SVM over other machine learning approaches is
that it performs much better in many cases because it finds the
best hyperplane(s) separating the data into different classes,
even when the dataset is small [25]. Aurino et al. [26] proposed a
One-class SVM based approach to detect anomalous events, namely
abnormal sounds in the environment such as gunshots, screams, and
breaking glass. The proposed methodology consists of two stages.
In the first stage, the researchers introduced a new mechanism
called “Majority Voting and Rejection” to classify short time
frames into predefined classes. In the second stage, the results
of the first stage are aggregated into longer time frames and
reclassified.

In the work of Bouril, A. et al. [27], 3000 phonocardiograms from
9 locations of the body of both adults and children were used to
identify normal and abnormal heart sounds using SVM. Here, 74 time
and frequency domain features were considered. The SVM model used
a Gaussian kernel and allowed three different classifications: -1
for a normal heartbeat, 0 for ambiguous sounds due to noise, and 1
for an abnormal heartbeat. In this research, a binary SVM was
found to be effective for normal versus abnormal classification.
Audio-based event detection in live office environments using
optimized MFCC features with an SVM model was implemented by
Kucukbay et al. [28]. Sixteen classes, such as alert beeping,
throat clearing, keyboard, and switch on/off sounds, were
classified. A One Vs. All multi-class SVM is used as the
classifier. Martin Morato et al. [29] conducted a case study on
feature sensitivity for audio event classification using a One Vs.
All multi-class SVM. As in [28], sixteen classes were
differentiated by MFCC features and MFE features using a 2.5 s
frame length with 1 s overlaps and a 44.1 kHz sampling frequency.
Wang, J. et al. [30] used a frame-based multi-class SVM classifier
to differentiate fifteen audio classes including both male and
female voices. The frame-based classifier segmented each audio
file into several frame sizes and trained a classifier for each.
Even though this method improves accuracy from 13.9%, the
pre-processing and training time was considerably high.

Lie Lu, Stan Z. Li and Hong-Jiang Zhang [31] proposed a method
called the hierarchical binary support vector machine for audio
segmentation and classification. Here the researchers considered
five pre-defined classes: silence, music, background sound, pure
speech, and non-pure speech, which includes speech over music and
speech over noise. The evaluation showed that the accuracy of the
SVM-based method is better than that of the methods based on KNN
and GMM. The major disadvantage of this approach is that
misclassifications at the upper levels can propagate to the
classifiers at the lower levels.

Since broadcasting FM channels demand news content
classification, Vavrek, J. et al. [20] also proposed a hierarchical
tree to address the news content classification problem. This
hierarchical classification strategy uses a particular feature set
for each binary SVM classifier. Accordingly, the F-score feature
selection algorithm is used to obtain the optimal features for
each SVM. The drawback of this work is that errors at the upper
levels of the tree propagate to the bottom levels. To prevent
that, misclassifications of the upper levels were not considered.

The work of Zhu, Y., Ming, Z., and Q. Huang [32] classified six
audio classes using a clip-based SVM method. Here, the researchers
classified pure speech, music, silence, environmental sounds,
speech with music, and speech with environmental sounds. The key
finding of this work is that SVM performs better in such cases
than Decision Trees, KNN, and Neural Networks.

The potential of these approaches varies from one problem domain
to another. Based on the research question and past studies in the
domain, the following choices were made in order to carry out this
study. Since the dataset consists of a set of pre-defined classes,
a supervised learning approach is proposed for the classification.
Therefore, unsupervised classifiers were eliminated. As mentioned
before, a unique set of features that discriminates each class
from the rest has been identified. If all the features were input
to the classification model together, the accuracy of the model
would be reduced, because some features are irrelevant to the
classification of some classes. Therefore, another facet of this
research is to input a different feature set to discriminate each
class. With the ANN approach, it is impossible to provide a unique
set of features for each class separately. Moreover, according to
the research conducted by C. Weeratunga et al. [15], the ANN model
is not the best approach for distinguishing voice-dominant
categories such as news. On the other hand, HMM was rejected
because the sequence in which the audio events appear is not
beneficial to our problem. Accordingly, the SVM classification
model was selected after considering all aspects.

Since SVMs are originally designed for binary classification, a
multi-class SVM is built as a combination of binary SVM
classifiers. As specific features have already been identified for
each class, only the relevant features need to be input to each
binary SVM model within the multi-class SVM. Accordingly, the
multi-class SVM is chosen as the most suitable classifier for our
problem domain. As a composition of several binary SVMs, a
multi-class SVM can be designed using one of the following methods
[24]:

• One Vs. One
• One Vs. All
• Directed Acyclic Graph SVM (DAGSVM)

One Vs. All constructs N binary SVM models for N classes. Each
binary SVM is trained with all of the data of one class labeled
positive and the rest labeled negative. The class whose decision
function has the largest value is taken as the predicted class.
One Vs. One constructs N(N-1)/2 binary SVM models, each trained on
only two classes, and a class is predicted using the “Max Winning”
voting strategy. Like One Vs. One, DAGSVM also constructs N(N-1)/2
binary SVM models, each trained on two classes. These binary SVMs
are arranged as a top-to-bottom rooted graph with N leaf nodes, so
that N-1 binary decisions are evaluated for each prediction.
Evaluation starts at the root node, where a binary decision
function is evaluated, and then moves either left or right
depending on the output value of the previous node.
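As a small illustration of the counts above, the following sketch enumerates the binary classifiers that each scheme would need for the five classes used in this study. The function and variable names are illustrative and are not taken from the original implementation.

```python
from itertools import combinations

# The five voice-dominant classes considered in this study.
CLASSES = ["news", "advertisements", "conversations",
           "radio drama", "religious program"]

def one_vs_all_tasks(classes):
    """One binary task per class: that class against all the others."""
    return [(c, "others") for c in classes]

def one_vs_one_tasks(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))

ova = one_vs_all_tasks(CLASSES)
ovo = one_vs_one_tasks(CLASSES)
n = len(CLASSES)

print(len(ova))  # N binary classifiers
print(len(ovo))  # N(N-1)/2 binary classifiers
```

The number of One Vs. One classifiers grows quadratically with the number of classes, while the number of One Vs. All classifiers grows linearly.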

Since DAGSVM is a hierarchical graph, misclassifications at the
upper levels of the graph can propagate to the lower levels [33],
leading to erroneous predictions. Hence, DAGSVM was rejected at
the very first step. One Vs. One and One Vs. All both have benefits
as well as limitations [24], depending on the application domain.
Hence, this research attempts to identify the more reliable method
by modeling the multi-class SVM in both ways.

 
III. PROPOSED APPROACH 

The proposed approach focuses on the classification of different
voice-dominant classes in the radio broadcasting of Sri Lanka.
Since the dataset consists of a set of pre-defined classes (i.e.
news, advertisements without jingles, radio dramas, conversations,
and religious programs), a supervised learning approach is
proposed for the classification. SLBC (Sri Lanka Broadcasting
Corporation) audio recordings are used as the dataset in order to
represent all the Sri Lankan FM channels. The dataset is 5 hours
and 50 minutes long and contains both male and female voices. As
shown in Figure 1, the whole dataset is initially divided into 60%
and 40% for training and evaluation purposes respectively. The
training dataset is then divided again into 70% and 30% for
training and testing respectively. The numbers of frames in the
training and testing datasets are given in Table 1 and Table 2.
The frame length is 5 s.
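The partition described above can be illustrated with a short calculation. This is only a rough sketch: it assumes that the split is applied to 5 s frames and that no audio is discarded, whereas silence is later removed from news and conversations, so the actual frame counts reported in Table 1 and Table 2 may differ.

```python
# Nominal frame counts implied by the stated dataset length and split ratios.
# Assumption: splits are applied to 5 s frames with no silence removed.
FRAME_SECONDS = 5
total_seconds = 5 * 3600 + 50 * 60        # 5 hours 50 minutes
total_frames = total_seconds // FRAME_SECONDS

train_pool = total_frames * 60 // 100     # 60% set aside for model building
evaluation = total_frames - train_pool    # 40% held out for evaluation
training = train_pool * 70 // 100         # 70% of the pool used for training
testing = train_pool - training           # 30% of the pool used for testing

print(total_frames, train_pool, evaluation, training, testing)
```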

 
A quantitative interpretation of the audio data is required in
order to identify the most suitable features for distinguishing
each class separately. This research specifically focuses on the
analysis of the time domain and frequency domain representations
of audio signals. Figure 2 depicts the design of the proposed
approach.

A. Feature Analysis 

Features capture the measurable information in the dataset. In
classification, identifying the most appropriate features is
essential for differentiating one class from another. To select
the appropriate features, the dataset should be thoroughly
analyzed. Here, altogether 34 features used in the most recent and
relevant past study [15] are analyzed. These features comprise
time domain features, frequency domain features, cepstral
features, and chroma features.

 
Figure 2: Design Overview 

 
The novelty of this research is that, rather than feeding all the
features together, a specific set of features is fed separately to
each binary classifier. This helps to avoid the input of
unnecessary features, reduce dimensionality, and make
classification faster and more accurate. Since a multi-class SVM
holds multiple binary SVMs, one advantage of using a multi-class
SVM is that specific features can be input to each binary SVM
separately.
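A minimal sketch of this idea is given below: one binary SVM is trained on its own subset of feature columns, and the same subset is applied again at prediction time. The data is synthetic, the column indices are hypothetical, and the RBF kernel and `C` value are assumptions, since the kernel and hyperparameters are not stated here.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for 34-dimensional frame features (not the study's data).
X = rng.normal(size=(200, 34))
y = rng.integers(0, 2, size=200)      # 0 = first class, 1 = second class
X[y == 1, 3] += 3.0                   # make column 3 informative

# Hypothetical column indices selected for this particular binary classifier.
selected = [1, 3, 7]

# Assumed kernel and C; train on the selected columns only.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X[:, selected], y)

# The same column subset must be used when predicting.
pred = clf.predict(X[:, selected])
accuracy = float((pred == y).mean())
print(accuracy)
```

Repeating this for every binary classifier, each with its own `selected` list, gives a multi-class model in which no classifier sees features judged irrelevant to its class pair.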

Since this research compares the performance of two types of
multi-class SVM models, the feature selection is carried out
separately for each. As illustrated in Table 3, the binary SVM
models used to construct the multi-class SVM models are trained to
classify different class pairs. Therefore, for each binary SVM
model, the features are chosen according to the class pair to be
classified.

Table 3: Binary classifiers of two multi-class SVMs

Multi-class SVM   Binary classifier   Class pair
One Vs. One       SVM 1               News Vs. Advertisements
                  SVM 2               News Vs. Conversations
                  SVM 3               News Vs. Radio drama
                  SVM 4               News Vs. Religious program
                  SVM 5               Advertisements Vs. Conversations
                  SVM 6               Advertisements Vs. Radio drama
                  SVM 7               Advertisements Vs. Religious program
                  SVM 8               Conversations Vs. Radio drama
                  SVM 9               Conversations Vs. Religious program
                  SVM 10              Radio drama Vs. Religious program
One Vs. All       SVM 1               News Vs. others
                  SVM 2               Advertisements Vs. others
                  SVM 3               Conversations Vs. others
                  SVM 4               Radio drama Vs. others
                  SVM 5               Religious program Vs. others

Figure 1: Dataset Partition

Table 1: Number of frames in the training dataset

Table 2: Number of frames in the testing dataset

 
1) Feature Analysis: One Vs. One 
According to Table 3, the best features for distinguishing the
ten class pairs are analyzed by examining their spectra. For that
purpose, 10 audio clips, one per class pair, are prepared as in
Figure 3, where each class is 15 minutes long.

By observing the spectra of each pair, 24 of the 34 features are
selected by eliminating the features that do not show a
discriminating pattern for any class pair. As an example, Figure 4
shows the spectrum of the energy feature, which is selected to
distinguish advertisements from religious programs.

 
Figure 4: Energy feature spectrum for advertisements Vs. religious programs

 
The selected features are then ranked by computing a feature
importance score. Feature importance is scored using a tree-based
classifier, which measures the relevance of each feature to the
output variable. It is a built-in class provided by the
scikit-learn library. Table 4 shows the selected features for the
One Vs. One model with their ranks.
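The ranking step can be sketched as follows. The specific tree-based class is not named here, so this example assumes scikit-learn's `ExtraTreesClassifier` and uses synthetic two-class data. The feature names are taken from the tables, but the values are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
feature_names = ["ZCR", "Energy", "Energy entropy", "Spectral centroid"]

# Synthetic two-class data (not the study's data); "Energy" is made informative.
X = rng.normal(size=(300, len(feature_names)))
y = rng.integers(0, 2, size=300)
X[y == 1, 1] += 2.5

# Assumed tree-based classifier exposing impurity-based importance scores.
forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
importances = forest.feature_importances_

# Rank 1 is assigned to the feature with the highest importance score.
order = np.argsort(importances)[::-1]
ranks = {feature_names[idx]: rank for rank, idx in enumerate(order, start=1)}
print(ranks)
```

Running this once per binary classifier, on the frames of its class pair only, would yield one column of ranks of the kind shown in Table 4.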

 
2) Feature Analysis: One Vs. All 
As shown in Table 3, the One Vs. All method requires only five
binary classifiers because it constructs as many binary SVMs as
there are classes. Each classifier is dedicated to the
classification of one class. Here, the relevant features for
distinguishing a class from the rest were analyzed. For that, a 2
hours and 5 minutes long audio clip is prepared as in Figure 5,
which includes all classes at 25 minutes per class. Figure 6 shows
a pattern obtained from the spectrum of Spectral Entropy across
the five classes. Using this sample audio clip, the 34 features
are analyzed and 24 features are selected, as shown in Table 5.

 
Figure 3: Audio clip structure designed for analyzing features of two classes

 
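Table 4: Selected features for One Vs. One model with ranks
(columns are the binary SVMs 1 to 10 in One Vs. One; a dash means the
feature is not used by that SVM)

Feature              1   2   3   4   5   6   7   8   9   10
ZCR                  -   -   -   -   -   -   -   -   -   7
Energy               -   -   7   3   -   -   4   -   1   3
Energy entropy       1   -   -   -   -   3   -   -   5   4
Spectral centroid    -   2   3   4   2   1   3   -   -   6
Spectral spread      -   -   4   1   -   6   1   -   2   1
Spectral entropy     -   1   -   -   1   4   -   1   -   -
Spectral flux        -   -   -   -   6   -   -   -   -   -
Spectral roll off    -   4   -   -   4   -   -   3   -   -
MFCC 1               -   -   -   -   -   -   -   -   7   -
MFCC 2               -   3   -   -   5   -   -   -   -   -
MFCC 3               -   -   -   5   -   -   2   -   4   2
MFCC 4               3   5   1   -   -   2   -   4   8   -
MFCC 5               -   6   -   -   7   -   -   -   -   -
MFCC 6               -   -   6   -   -   -   -   -   -   -
MFCC 7               -   -   -   -   -   5   5   -   -   -
MFCC 8               6   -   -   -   -   7   -   6   -   8
MFCC 9               2   -   2   2   -   -   6   -   3   -
MFCC 10              -   7   -   -   3   -   7   -   -   -
MFCC 11              -   -   -   6   -   -   -   -   6   5
MFCC 12              -   -   5   -   -   -   -   2   -   -
MFCC 13              -   -   -   -   -   -   -   5   -   9
Chroma vector 1      -   -   -   -   -   -   -   -   -   -
Chroma vector 2      5   -   -   -   -   -   -   -   -   -
Chroma vector 3-9    -   -   -   -   -   -   -   -   -   -
Chroma vector 10     7   -   -   -   -   -   -   -   -   -
Chroma vector 11     4   -   -   -   -   -   -   -   -   -
Chroma vector 12     -   -   -   -   -   -   -   -   -   -
Chroma std           -   -   -   -   -   -   -   -   -   -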

 
Figure 5: Audio clip structure designed for analyzing features of one class from the rest

 

 
Figure 6: Spectral Entropy for five classes 

 
Table 5: Selected features for One Vs. All model with ranks
(columns are the binary SVMs 1 to 5 in One Vs. All; a dash means the
feature is not used by that SVM)

Feature              1    2    3    4    5
ZCR                  -    -    4    -    7
Energy               5    -    -    -    2
Energy entropy       -    8    13   -    8
Spectral centroid    -    -    2    -    5
Spectral spread      -    -    5    -    1
Spectral entropy     -    4    1    -    -
Spectral flux        -    -    12   -    9
Spectral roll off    -    5    3    -    -
MFCC 1               -    2    -    -    -
MFCC 2               -    1    6    -    -
MFCC 3               -    -    7    4    3
MFCC 4               1    -    -    3    -
MFCC 5               -    10   9    -    -
MFCC 6               -    -    -    -    10
MFCC 7               -    -    10   -    11
MFCC 8               -    9    -    2    -
MFCC 9               2    7    -    6    4
MFCC 10              -    6    8    -    -
MFCC 11              3    -    -    7    6
MFCC 12              -    3    -    1    -
MFCC 13              -    -    -    5    -
Chroma vector 1-2    -    -    -    -    -
Chroma vector 3      4    -    -    -    -
Chroma vector 4-6    -    -    -    -    -
Chroma vector 7      -    -    11   -    -
Chroma vector 8-10   -    -    -    -    -
Chroma vector 11     6    -    -    -    -
Chroma vector 12     -    -    -    -    -
Chroma std           -    -    -    -    -

 
B. Data Pre-processing 

Data pre-processing prepares the raw data from the audio files in
a consistent way before the features are extracted. As shown in
Figure 2, it consists of the following stages:

• In the data formatting stage, the data files are converted to
the .wav file format. A monophonic channel is chosen as the
channel type and 44.1 kHz is selected as the sample rate [34].

• In the data annotation stage, the audio clips are listened to
manually using the “Audacity” tool, segmented into the
different classes, and labeled with the relevant class labels.

• Then, silence is removed from news and conversations only.
The reason for removing silence from news and conversations
is described in Section IV.

C. Feature Extraction 

In audio classification, feature extraction is the most important
component. Frame blocking is performed before the features are
extracted: the audio waveforms are divided into blocks of 5 s. The
best subset of the selected features is then extracted from each
frame and expressed as a feature vector. Table 6 lists all the
features that are extracted in both One Vs. One and One Vs. All.
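The frame blocking and feature vector construction can be sketched as below for two of the simplest features, ZCR and energy. This is an illustration under stated assumptions: frames are taken as non-overlapping, the feature definitions are common textbook forms rather than the exact ones used in the study, and a synthetic tone stands in for a recording.

```python
import numpy as np

SAMPLE_RATE = 44_100   # Hz, as chosen in the data formatting stage
FRAME_SECONDS = 5      # frame length used in this study

def block_frames(signal, sample_rate=SAMPLE_RATE, seconds=FRAME_SECONDS):
    """Split a mono signal into non-overlapping frames, dropping the remainder."""
    frame_len = sample_rate * seconds
    n_frames = len(signal) // frame_len
    return signal[: n_frames * frame_len].reshape(n_frames, frame_len)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(frame)
    return float(np.mean(signs[1:] != signs[:-1]))

def energy(frame):
    """Mean squared amplitude of the frame."""
    return float(np.mean(frame ** 2))

# Synthetic 12 s, 440 Hz tone standing in for a recording.
t = np.arange(12 * SAMPLE_RATE) / SAMPLE_RATE
signal = 0.5 * np.sin(2 * np.pi * 440 * t)

frames = block_frames(signal)
vectors = np.array([[zero_crossing_rate(f), energy(f)] for f in frames])
print(vectors.shape)
```

Each row of `vectors` is the feature vector of one 5 s frame. In the full pipeline each row would hold the features listed in Table 6 for the binary SVM concerned.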

D.  Classification 

As stated in the related work, the multi-class SVM classifier is
selected as the approach that best fits our problem domain. To
find the better-performing model, the One Vs. One and One Vs. All
multi-class SVM models are implemented and evaluated in parallel.
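For the One Vs. All model, the decision rule stated in the related work (the class whose decision function has the largest value is predicted) can be sketched as follows. The decision values are hypothetical numbers chosen for illustration, not outputs of the trained models.

```python
def one_vs_all_predict(decision_values):
    """Return the class whose binary SVM reports the largest decision value."""
    return max(decision_values, key=decision_values.get)

# Hypothetical decision values from the five "class Vs. others" SVMs
# for a single 5 s frame.
scores = {
    "news": -0.4,
    "advertisements": 1.3,
    "conversations": 0.2,
    "radio drama": -1.1,
    "religious program": 0.9,
}

predicted = one_vs_all_predict(scores)
print(predicted)
```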

Table 6: Extracted features for each binary SVM

Multi-class SVM   Binary SVM   No. of features   Extracted features
One Vs. One       SVM 1        7                 Chroma2, Chroma10, Chroma11, Energy entropy, MFCC4, MFCC8, MFCC9
                  SVM 2        7                 MFCC2, MFCC4, MFCC5, MFCC10, Spectral centroid, Spectral entropy, Spectral rolloff
                  SVM 3        7                 Energy, MFCC4, MFCC6, MFCC9, MFCC12, Spectral centroid, Spectral spread
                  SVM 4        6                 Energy, MFCC3, MFCC9, MFCC11, Spectral centroid, Spectral spread
                  SVM 5        7                 MFCC2, MFCC5, MFCC10, Spectral centroid, Spectral entropy, Spectral flux, Spectral rolloff
                  SVM 6        7                 Energy entropy, MFCC4, MFCC7, MFCC8, Spectral centroid, Spectral spread, Spectral entropy
                  SVM 7        7                 Energy, MFCC3, MFCC7, MFCC9, MFCC10, Spectral spread, Spectral centroid
                  SVM 8        6                 MFCC4, MFCC8, MFCC12, MFCC13, Spectral entropy, Spectral rolloff
                  SVM 9        8                 Energy, Energy entropy, MFCC1, MFCC3, MFCC4, MFCC9, MFCC11, Spectral spread
                  SVM 10       9                 Energy, Energy entropy, MFCC3, MFCC11, MFCC13, MFCC8, Spectral centroid, Spectral spread, ZCR
One Vs. All       SVM 1        6                 Chroma3, Chroma11, Energy, MFCC4, MFCC9, MFCC11
                  SVM 2        10                Energy entropy, MFCC1, MFCC2, MFCC5, MFCC8, MFCC9, MFCC10, MFCC12, Spectral entropy, Spectral rolloff
                  SVM 3        13                Chroma7, Energy entropy, MFCC2, MFCC3, MFCC5, MFCC7, MFCC10, Spectral entropy, Spectral centroid, Spectral flux, Spectral rolloff, Spectral spread, ZCR
                  SVM 4        7                 MFCC3, MFCC4, MFCC8, MFCC9, MFCC11, MFCC12, MFCC13
                  SVM 5        11                Energy, Energy entropy, MFCC3, MFCC6, MFCC7, MFCC9, MFCC11, Spectral flux, Spectral centroid, Spectral spread, ZCR



 
1)  One Vs. One model 

In this approach, N(N-1)/2 binary SVMs are implemented to classify N 
classes. Therefore, for five classes, we design ten binary SVMs, where 
each SVM classifies a pair of classes as shown in Table 3. For a data 
point D = ($x_t$, $y_t$), the SVM that separates the i-th and j-th 
classes satisfies Equation (1) and Equation (2). In these equations, 
$(w^{ij})^T \varphi(x_t) + b^{ij}$ is the decision function that defines 
the decision boundary, where $w^{ij}$ is the weight vector, $x_t$ is the 
input vector, $b^{ij}$ is the bias, and the data $x_t$ is mapped to a 
higher-dimensional space by the function $\varphi$. The motivation 
behind the SVM is to maximize the margin between the two classes. The 
maximum margin for the i-th and j-th classes is obtained by minimizing 
the magnitude of $w^{ij}$, as in Equation (3). When the data are not 
linearly separable, $C \sum_t \varepsilon_t^{ij}$ is introduced as a 
penalty term to reduce the number of training errors. 

$$\min_{w^{ij},\, b^{ij},\, \varepsilon^{ij}} \; \frac{1}{2} (w^{ij})^T (w^{ij}) + C \sum_t \varepsilon_t^{ij} \tag{3}$$

The One Vs. One model builds ten binary SVMs to classify five classes. 
Since there are ten decision boundaries, the predicted class for a 
particular data point is identified using a voting strategy called the 
“Max Winning” strategy. If the binary SVM for the i-th and j-th classes 
assigns the data point to the i-th class, the i-th class receives one 
vote. Otherwise, the j-th class receives the vote. The class with the 
maximum number of votes is then taken as the predicted class.  
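The voting step can be sketched as follows. This is a minimal illustration rather than the implementation used in this study: the decision values are made up, the short class names are placeholders, and resolving ties in favour of the class listed first is an assumption of the sketch.

```python
from itertools import combinations

# Short class names are placeholders for the five classes of the study.
CLASSES = ["news", "conv", "advert", "drama", "relprog"]


def class_pairs(classes):
    """Return the N(N-1)/2 unordered class pairs, one per binary SVM."""
    return list(combinations(classes, 2))


def max_wins(decisions, classes):
    """Predict a class by majority vote over the pairwise decisions.

    decisions maps (class_i, class_j) to the decision value of that
    binary SVM: a positive value votes for class_i, otherwise class_j.
    """
    votes = {c: 0 for c in classes}
    for (class_i, class_j), value in decisions.items():
        votes[class_i if value > 0 else class_j] += 1
    # max() returns the first class with the highest count, so a tie is
    # resolved in favour of the class listed first (assumption).
    return max(classes, key=lambda c: votes[c]), votes


pairs = class_pairs(CLASSES)

# Hypothetical decision values: every SVM that involves "drama" favours it,
# and every other SVM favours the first class of its pair.
decisions = {}
for class_i, class_j in pairs:
    if class_i == "drama":
        decisions[(class_i, class_j)] = 1.0
    elif class_j == "drama":
        decisions[(class_i, class_j)] = -1.0
    else:
        decisions[(class_i, class_j)] = 0.5

predicted, votes = max_wins(decisions, CLASSES)
print(len(pairs), predicted, votes)
```

With five classes the sketch builds ten pairs, and each data point receives exactly ten votes in total.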

2) One Vs. All model 

The One Vs. All method constructs N binary SVMs to classify N classes. 
Therefore, five SVM classifiers are designed as shown in Table 3. Each 
SVM is trained with the whole dataset, where the data belonging to the 
i-th class carry positive labels and the remaining data carry negative 
labels. For a data point D = ($x_t$, $y_t$), the SVM for the i-th class 
satisfies Equation (4) and Equation (5). 

To find the maximum margin, the magnitude of $w^{i}$ should be minimized 
as in Equation (6), where C is the constant that weights the penalty on 
training errors. 

$$\min_{w^{i},\, b^{i},\, \varepsilon^{i}} \; \frac{1}{2} (w^{i})^T (w^{i}) + C \sum_t \varepsilon_t^{i} \tag{6}$$

The One Vs. All model implements five binary SVMs, each of which 
separates one class from the rest. After training the five classifiers, 
the class of a data point x is predicted by finding the classifier whose 
decision function gives the maximum value. Equation (7) gives the 
prediction function for the data point x. 

$$\text{class of } x = \arg\max_{i} \left( (w^{i})^T \varphi(x) + b^{i} \right) \tag{7}$$
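The prediction rule of Equation (7) can be sketched as follows. This is an illustration only: the weights, biases and feature values are hypothetical, the short class names are placeholders, and the feature map is taken to be the identity (a linear kernel), which the study does not state.

```python
def decision_value(weights, bias, features):
    """Linear decision function w^T x + b (identity feature map assumed)."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict_one_vs_all(models, features):
    """Return the class whose binary SVM gives the largest decision value."""
    scores = {
        name: decision_value(weights, bias, features)
        for name, (weights, bias) in models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical (weights, bias) pairs, one per class, for two input features.
models = {
    "news": ([1.0, -0.5], 0.0),
    "conv": ([0.5, 0.5], -0.5),
    "advert": ([0.0, 1.0], 0.0),
    "drama": ([-0.5, 1.0], 0.25),
    "relprog": ([0.25, 0.25], -1.0),
}

predicted, scores = predict_one_vs_all(models, [2.0, 1.0])
print(predicted, scores)
```

Unlike the voting scheme of the One Vs. One model, this rule compares real-valued decision values, so an exact tie between classes is unlikely in practice.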

  
E. Evaluation 

In the evaluation, the performance of the One Vs. One and One Vs. All 
multi-class SVM models is assessed. Ground truth data is required to 
evaluate the accuracies of the two models. 40% of the total data, which 
is never included in the training set, is taken as the ground truth 
data. The ground truth data contains 28 minutes of audio recordings for 
each class (news, conversations, advertisements, drama, and religious 
programs). The “Audacity” tool is used to annotate the ground truth 
data.  

The models are evaluated under different criteria, as depicted in 
Figure 7. First, the performance of the models is measured as the 
number of features is increased. Additionally, the two models are 
evaluated using selected frame lengths and selected sample rates, and 
before and after silence removal. The performance of the two models is 
presented using graphs and confusion matrices. Necessary refinements 
for both models are made based on the evaluation results. The precision 
value is taken to measure performance. 

$$(w^{ij})^T \varphi(x_t) + b^{ij} \ge 1 - \varepsilon_t^{ij}, \quad \text{if } y_t = \text{class } i \tag{1}$$

$$(w^{ij})^T \varphi(x_t) + b^{ij} \le -1 + \varepsilon_t^{ij}, \quad \text{if } y_t = \text{class } j \tag{2}$$

$$(w^{i})^T \varphi(x_t) + b^{i} \ge 1 - \varepsilon_t^{i}, \quad \text{if } y_t = \text{class } i \tag{4}$$

$$(w^{i})^T \varphi(x_t) + b^{i} \le -1 + \varepsilon_t^{i}, \quad \text{if } y_t \ne \text{class } i \tag{5}$$

Figure 7: Evaluation Plan 

Figure 8: Changing of precision when increasing features of binary SVMs in One Vs. One model 

IV. EXPERIMENTS AND RESULTS 

 Models were initially evaluated using data that were not used in the 
training phase. All the experiments mentioned in Section III.E were 
repeated five times, and the average of the results was calculated to 
determine the overall performance of the system. According to the 
results, necessary refinements were made to the classification model. 

A. Increase the number of features 

This experiment is used to find the point of diminishing returns, 
beyond which adding features no longer improves precision. First, the 
features selected for each SVM are ranked according to their feature 
importance scores. Thereafter, several rounds of experiments were 
conducted, increasing the input feature count each time, in order to 
determine the optimal feature set. Figure 8 and Figure 9 illustrate the 
obtained results for the One Vs. One and One Vs. All models. These 
figures indicate that a smaller number of features can achieve the 
highest precision value. The minimal required subset of features for 
each SVM is selected through this process. The obtained features are 
listed in Table 6. 
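One plausible reading of this procedure is sketched below: the ranked features are added one at a time, and the smallest set that reaches the highest precision is kept. The `evaluate` callback, the feature ranking and the precision curve are all hypothetical stand-ins for real training runs, not values from this study.

```python
def smallest_best_prefix(ranked_features, evaluate):
    """Grow the feature set in rank order and keep the smallest prefix
    that reaches the highest precision.

    evaluate is a caller-supplied function (hypothetical here) that
    trains a binary SVM on the given features and returns its precision.
    """
    best_precision, best_count = float("-inf"), 0
    history = []
    for count in range(1, len(ranked_features) + 1):
        precision = evaluate(ranked_features[:count])
        history.append((count, precision))
        # Strict comparison: on equal precision the smaller set is kept.
        if precision > best_precision:
            best_precision, best_count = precision, count
    return ranked_features[:best_count], best_precision, history


# Made-up precision curve standing in for real training runs.
fake_curve = {1: 0.70, 2: 0.78, 3: 0.85, 4: 0.85, 5: 0.83}
ranked = ["MFCC4", "Energy", "Spectral centroid", "MFCC9", "ZCR"]

selected, precision, history = smallest_best_prefix(
    ranked, lambda features: fake_curve[len(features)]
)
print(selected, precision)
```

In this toy curve the fourth feature adds nothing and the fifth lowers precision, so only the first three features are retained.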

Figure 9: Changing of precision when increasing features of binary SVMs in One Vs. All model 

B. Different frame sizes 

Selecting a suitable frame size for feature extraction is essential in 
a classification problem. The closest work to this research, by 
Weerathunga et al. [15], used 2.5s as the frame size. Other works in 
the literature have used different frame sizes, such as 25ms, 0.25s, 
and 0.3s. Therefore, the models are evaluated with respect to different 
frame sizes, and the results are reported for the chosen frame sizes of 
0.25s, 2.5s, 4s, and 5s. Increasing the frame size beyond 5s is not 
possible in this case, since some of the data segments in the dataset 
are between 5s and 6s long. When changing the frame size, the rest of 
the model's parameters, such as the K value and the sample rate, were 
kept constant. The obtained results are shown in Figure 10. A frame 
length of 5s is selected as the best frame size. 
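The effect of the frame size on short segments can be illustrated with a small framing sketch. It assumes consecutive, non-overlapping frames and drops any trailing remainder; the study does not state whether frames overlap, and the toy sample rate of 100 Hz is chosen only to keep the numbers small.

```python
def frame_signal(samples, sample_rate, frame_seconds):
    """Split a sample sequence into consecutive, non-overlapping frames.

    A trailing remainder shorter than one frame is dropped (assumption).
    """
    frame_length = int(round(sample_rate * frame_seconds))
    if frame_length <= 0:
        raise ValueError("a frame must contain at least one sample")
    count = len(samples) // frame_length
    return [
        samples[i * frame_length:(i + 1) * frame_length]
        for i in range(count)
    ]


# A 5.5 s segment at a toy sample rate of 100 Hz.
segment = list(range(550))

frames_5s = frame_signal(segment, 100, 5)
frames_6s = frame_signal(segment, 100, 6)
frames_quarter = frame_signal(segment, 100, 0.25)
print(len(frames_5s), len(frames_6s), len(frames_quarter))
```

A segment between 5s and 6s long yields one 5s frame but no 6s frame, which is why frame sizes above 5s cannot be used with such segments.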

C. Different sample rates 

FM radio channels have an audio bandwidth of approximately 15 kHz. 
Bandwidth is the difference between the highest and lowest frequencies 
carried in an audio stream. According to the Nyquist-Shannon sampling 
theorem, the highest frequency that a sampled signal can represent is 
half of the sample rate. In practice, the highest frequency of interest 
lies between 20000 Hz and 22050 Hz, because the upper limit of human 
hearing is about 20000 Hz [34]. Thus, 44100 Hz, whose half is 22050 Hz, 
is logically the best sample rate for our study. In addition, 16000 Hz 
and 22050 Hz were used as sample rates in previous works related to 
radio broadcasting classification: Weerathunga et al. [15] used 22050 
Hz, while Saunders [16] and Vavrek et al. [20] used 16000 Hz. 
Therefore, we evaluate the models with sample rates of 16000 Hz, 22050 
Hz, and 44100 Hz to find the most reliable sample rate for this 
research. Figure 11 illustrates the change in performance against the 
different sample rates for both the One Vs. One and One Vs. All models. 
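The relation between the three candidate sample rates and the FM audio bandwidth can be checked with a short calculation. The sketch below only applies the "half of the sample rate" rule to the values quoted in the text; the function and constant names are illustrative.

```python
FM_AUDIO_BANDWIDTH_HZ = 15000  # approximate value quoted in the text


def nyquist_frequency(sample_rate_hz):
    """Highest frequency representable at the given sample rate."""
    return sample_rate_hz / 2


results = {}
for rate in (16000, 22050, 44100):
    limit = nyquist_frequency(rate)
    # True when the full FM audio band fits below the Nyquist frequency.
    results[rate] = (limit, limit >= FM_AUDIO_BANDWIDTH_HZ)

for rate, (limit, covers_band) in results.items():
    print(rate, limit, covers_band)
```

Of the three rates, only 44100 Hz has a Nyquist frequency (22050 Hz) above the 15 kHz band, while 16000 Hz and 22050 Hz are limited to 8000 Hz and 11025 Hz respectively.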

 
D. With silence removal 

In the data pre-processing phase, silence removal is applied only to 
news and conversation content because these classes have long silent 
periods within an audio clip. For evaluation purposes, the models are 
evaluated without silence removal and with silence removal from all 
classes. Figure 12 and Figure 13 show the change in classification 
performance without and with silence removal for the One Vs. One model 
and the One Vs. All model, respectively. As shown in the figures, after 
removing silence from all data, performance increased only for news and 
conversations. Therefore, we decided to remove silence only from news 
and conversations. 

 
Figure 10: Variation of precision against frame size 

Figure 11: Variation of precision against sample rate 

Figure 12: Variation of the precision values of One Vs. One against Silence removal 

Figure 13: Variation of the precision values of One Vs. All against Silence removal 

V. EVALUATION 

One of the main aspects of this research is to select the optimal 
subset of features for classification. Table 7 and Table 8 provide the 
training accuracy (precision value) of each SVM when using all features 
and when using the selected subset of features. They indicate that the 
overall performance of the models increases with the optimal subset of 
features. 

 
Table 7: Training accuracies with all the features and the optimal subset of features of One Vs. One model

Models | Accuracy using all the features (Precision) | Accuracy using the optimal subset of features (Precision)
news/advertisement | 85% | 87%
news/conversation | 88% | 92%
news/drama | 90% | 92%
news/religious program | 99% | 100%
advertisement/conversation | 89% | 92%
advertisement/drama | 91% | 93%
advertisement/religious program | 98% | 99%
conversation/drama | 84% | 87%
conversation/religious program | 95% | 95%
drama/religious program | 97% | 99%
One Vs. One | 81% | 85%

 
Table 8: Training accuracies with all the features and the optimal subset of features of One Vs. All model

Models | Accuracy using all the features | Accuracy using the optimal subset of features
news/other | 76% | 82%
conversation/other | 82% | 92%
advertisement/other | 86% | 91%
drama/other | 79% | 82%
religious program/other | 93% | 96%
One Vs. All | 78% | 83%

 
After applying the optimal features with the selected parameters, the 
performance of the One Vs. One and One Vs. All models is shown in 
Table 9 and Table 10, respectively. 

 
Table 9: Confusion matrix of One Vs. One model

True class \ Predicted class | news | conv | advert | drama | relprog | Support | Precision
news | 260 | 8 | 43 | 6 | 0 | 317 | 81%
conv | 25 | 247 | 14 | 33 | 6 | 325 | 85%
advert | 28 | 12 | 242 | 22 | 3 | 307 | 80%
drama | 7 | 13 | 6 | 290 | 0 | 316 | 82%
relprog | 0 | 7 | 0 | 2 | 314 | 323 | 98%

Overall results: 85.20%
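The summary figures of a confusion matrix can be recomputed as sketched below, using the counts of Table 9 with rows as true classes and columns as predicted classes. The function names are illustrative. The row sums reproduce the Support column and the overall accuracy reproduces the reported 85.20%; per-class precision computed column-wise from this single matrix can differ slightly from the rounded values printed in the table.

```python
LABELS = ["news", "conv", "advert", "drama", "relprog"]

# Counts from Table 9: rows are true classes, columns are predicted classes.
MATRIX = [
    [260, 8, 43, 6, 0],
    [25, 247, 14, 33, 6],
    [28, 12, 242, 22, 3],
    [7, 13, 6, 290, 0],
    [0, 7, 0, 2, 314],
]


def overall_accuracy(matrix):
    """Correctly classified frames divided by all frames."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_precision(matrix):
    """Diagonal count divided by the column sum (frames predicted as k)."""
    size = len(matrix)
    return [
        matrix[k][k] / sum(matrix[row][k] for row in range(size))
        for k in range(size)
    ]


def per_class_recall(matrix):
    """Diagonal count divided by the row sum (frames that truly are k)."""
    return [matrix[k][k] / sum(matrix[k]) for k in range(len(matrix))]


support = [sum(row) for row in MATRIX]
accuracy = overall_accuracy(MATRIX)
precision = per_class_precision(MATRIX)
recall = per_class_recall(MATRIX)

print(support, round(accuracy * 100, 2))
for label, p, r in zip(LABELS, precision, recall):
    print(label, round(p * 100, 1), round(r * 100, 1))
```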

 
Table 10: Confusion matrix of One Vs. All model   

 
     Considering the results of both models, a considerable number of 
news frames were misclassified as advertisements, while conversations 
and advertisements were misclassified as news and drama. Religious 
programs were classified better than the other classes. Even though 
news reading can be considered monotonic, in some scenarios 
conversations and drama also have a monotonic nature similar to news. 
Even advertisements show a monotonic nature after the removal of music 
and jingles. Therefore, the monotonic nature of news, conversations, 
advertisements, and drama might be the reason for these 
misclassifications. Table 11 and Table 12 show the obtained accuracies 
of each binary SVM in both models. 
 

As shown in Table 11, even though the ten binary SVMs were trained with 
high training accuracies, the accuracy of the combined model is lower 
once the ten models are put together. The reason might be the “Max 
Winning” strategy that is used to predict the classes in One Vs. One. 
In the “Max Winning” strategy, the predicted class is the mode of the 
votes; if two classes share the maximum number of votes, only the class 
that appears first in the array is output. This might degrade the final 
result of the One Vs. One SVM. In our training dataset, approximately 
13% of the data faced this issue when the frame length was 0.25s. When 
the frame length was increased, the proportion of such ties (identical 
classes) decreased to 4%, as shown in Figure 14. 
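The proportion of tied frames can be measured as sketched below. The vote counts are made up for illustration (each row holds the ten votes of one frame across the five classes); they are not data from this study.

```python
def has_tie(votes):
    """True when two or more classes share the highest vote count."""
    top = max(votes)
    return votes.count(top) > 1


def tie_proportion(vote_table):
    """Fraction of frames whose winning vote count is shared."""
    if not vote_table:
        return 0.0
    return sum(has_tie(votes) for votes in vote_table) / len(vote_table)


# Made-up vote counts for four frames, ten votes per frame.
vote_table = [
    [4, 3, 2, 1, 0],  # clear winner
    [3, 3, 2, 1, 1],  # two classes tie for first place
    [2, 2, 2, 2, 2],  # five-way tie
    [4, 2, 2, 1, 1],  # clear winner
]

proportion = tie_proportion(vote_table)
print(proportion)
```

Tracking this proportion for each frame length gives the kind of curve summarised in Figure 14, and it shows how often the first-in-array rule, rather than the votes, decides the prediction.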

Table 11: Overall results of One Vs. One model

Models | Accuracy | Precision | Recall | F1-score
news/advertisement | 87% | 87% | 87% | 87%
news/conversation | 90% | 92% | 91% | 91%
news/drama | 92% | 92% | 92% | 92%
news/religious program | 99% | 100% | 100% | 100%
advertisement/conversation | 91% | 92% | 92% | 92%
advertisement/drama | 94% | 93% | 94% | 94%
advertisement/religious program | 99% | 99% | 99% | 99%
conversation/drama | 85% | 87% | 86% | 87%
conversation/religious program | 95% | 95% | 95% | 95%
drama/religious program | 99% | 99% | 99% | 91%
One Vs. One | 85% | 85% | 85% | 85%

 
Table 12: Overall results of One Vs. All model

Models | Accuracy | Precision | Recall | F1-score
news/other | 86% | 82% | 75% | 78%
conversation/other | 91% | 92% | 78% | 84%
advertisement/other | 91% | 91% | 79% | 85%
drama/other | 88% | 82% | 81% | 82%
religious program/other | 98% | 96% | 97% | 97%
One Vs. All | 83% | 83% | 83% | 83%

 
Table 10: Confusion matrix of One Vs. All model

True class \ Predicted class | news | conv | advert | drama | relprog | Support | Precision
news | 265 | 9 | 38 | 5 | 0 | 317 | 73%
conv | 38 | 247 | 4 | 34 | 2 | 325 | 85%
advert | 44 | 11 | 198 | 51 | 3 | 307 | 80%
drama | 13 | 12 | 6 | 283 | 2 | 316 | 76%
relprog | 1 | 11 | 0 | 0 | 311 | 323 | 98%

Overall results: 83.37%

As depicted in Table 12, the One Vs. All model shows lower accuracy 
than the One Vs. One model. One disadvantage of the One Vs. All model 
is that, when analyzing the features, each binary SVM must treat the 
features as one class against the other four classes. It is therefore 
difficult to observe features that distinguish one class from four. In 
contrast, when analyzing features for the One Vs. One model, only two 
classes are compared at a time, so it is easier to identify a pattern 
that distinguishes them. Another drawback of the One Vs. All model is 
its high training time, since each binary SVM classifier requires the 
complete dataset for training. The binary SVM classifiers in the One 
Vs. One model take less training time than those in the One Vs. All 
model, as each requires data from only two classes. 
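The difference in training-set size can be made concrete with a small calculation. The sketch assumes a balanced dataset, and the figure of 500 training samples per class is a hypothetical round number, not a value from this study.

```python
def training_load(num_classes, samples_per_class):
    """Number of binary SVMs and training samples seen by each one,
    assuming every class contributes the same number of samples."""
    return {
        # One Vs. One: one SVM per class pair, trained on two classes.
        "ovo": (num_classes * (num_classes - 1) // 2, 2 * samples_per_class),
        # One Vs. All: one SVM per class, trained on the whole dataset.
        "ova": (num_classes, num_classes * samples_per_class),
    }


load = training_load(num_classes=5, samples_per_class=500)
ovo_classifiers, ovo_samples = load["ovo"]
ova_classifiers, ova_samples = load["ova"]
print(ovo_classifiers, ovo_samples, ova_classifiers, ova_samples)
```

With five balanced classes, each One Vs. One classifier sees two fifths of the data that a One Vs. All classifier sees, although twice as many classifiers have to be trained.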

According to Table 9 and Table 10, the One Vs. One model achieved an 
overall precision of 85% and the One Vs. All model achieved an overall 
precision of 83%. These results guide the choice of the most suitable 
multi-class SVM for this problem domain. Based on the performance of 
the two models, the One Vs. One model, with 85% precision, is chosen as 
the most appropriate model for this research. 

 
VI. CONCLUSION AND FUTURE WORK 

The main aim of this research is to identify voice dominant content 
categories in order to automate the monitoring of radio broadcasting 
in Sri Lanka. For that, a multi-class SVM was proposed. The multi-class 
SVM was built in two conventional ways, “One Vs. One” and “One Vs. 
All”, and their performance was compared to find the best model for 
this domain. The novelty of this approach is that, instead of feeding 
all the features at once, only selected features were fed separately to 
each classifier in the model.  

The performance of these two models was evaluated under different 
criteria. The One Vs. One model classified the pre-defined content 
categories with accuracies of 81% for news, 85% for conversations, 80% 
for advertisements, 82% for drama and 98% for religious programs. The 
One Vs. All model classified the categories with accuracies of 73% for 
news, 85% for conversations, 80% for advertisements, 76% for drama and 
98% for religious programs. The final overall accuracies of the One Vs. 
One and One Vs. All models are 85% and 83% respectively. Moreover, the 
proposed methodology increases the classification accuracy of news 
content to 81%, whereas the accuracy of the existing methodology [15] 
was 41%.  

The major limitation of this research is that the model is trained and 
tested only on the “SLBC” FM radio channel. However, this study creates 
a platform to further generalize the model to all Sri Lankan FM 
channels. Another limitation is that the data of each class is 
restricted to 1 hour and 10 minutes. The reason is that no more than 1 
hour and 10 minutes of data could be obtained for religious programs. 
Therefore, the data of the rest of the classes were also limited to 1 
hour and 10 minutes to avoid proportional bias in the dataset. Using 
more training data for classification is therefore a promising way to 
improve performance. In addition, identifying the most prominent 
features yields more accurate results. 

 
REFERENCES 

 
[1] C. R. S. Celebrating Radio: Statistics / World Radio Day 2015, 2018. 
[Online]. Available: http://www.diamundialradio.org/2015/en/content/celebrating-radiostatistics.html 

[2] Nishan, W. Senevirathna, and K. L Jayaratne, “A highly robust audio 
monitoring system for radio broadcasting, Proceedings of sixth Annual 
International Conference on Computer Games, Multimedia and Allied 
Technology” GSTF Journal on Computing (JoC), vol. 3, no. 2, pp. 87-
98, 2013. 

[3] N. Senevirathna and K. L Jayaratne, “Automated content based audio 
monitoring approach for radio broadcasting,” Proceedings of sixth 
Annual International Conference on Computer Games, Multimedia 
and Allied Technology (CGAT 2013), Singapore, pp. 110–118, CGAT, 
2013. 

[4] E. N. W. Senevirathna and K. L. Jayaratne, “Audio music monitoring: 
Analyzing current techniques for song recognition and identification,” 
GSTF Journal on Computing (JoC), vol. 4, no. 3, pp. 23-34, 2015. 

[5] E. D. N.W. Senevirathna and K. L Jayaratne, “Automated Audio 
Monitoring Approach for Radio Broadcasting in Sri Lanka,” 
Proceedings of International Conference on Advances in ICT for 
Emerging Regions (ICTer 2017), Sri Lanka, pp. 92–98, 2017. 

[6] E.D.N.W. Senevirathna and Lakshman Jayaratne (2018): Radio 
Broadcast Monitoring to Ensure Copyright Ownership. International 
Journal on Advances in ICT for Emerging Regions (ICTer), 11(1) 

[7] Dhanith Chaturanga and Lakshman Jayaratne (2013): Automatic 
Music Genre Classification of Audio Signals with Machine Learning 
Approaches. International Journal of Computing (JOC) by Global 
Science and Technology Forum (GSTF), 3(2):137-148 

[8] Dhanith Chaturanga and Lakshman Jayaratne (2012): Musical Genre 
Classification Using Ensemble of Classifiers. Proceedings of fourth 
International Conference on Computational Intelligence, Modeling 
and Simulation (CIMSim 2012), Kuantan, Malaysia. 

[9] Rajitha Amarasinghe and Lakshman Jayaratne (2016): Supervised 
Learning Approach for Singer Identification in Sri Lankan Music. 
European Journal of Computer Science and Information Technology 
(EJCSIT) by European Centre for Research Training and Development 
UK, 4(6):1-14 

[10] Rajitha Peiris and Lakshman Jayaratne (2016): Musical Genre 
Classification of Recorded Songs Based on Music Structure Similarity. 
European Journal of Computer Science and Information Technology 
(EJCSIT) by European Centre for Research Training and Development 
UK, 4(5):70-88 

[11] Tharika Madurapperuma, Gothami Abayawickrama, Nesara 
Dissanayake, Viraj B. Wijesuriya and K. L. Jayaratne (2017): Highly 
Efficient and Robust Audio Identification and Analytics System to 
Secure Royalty Payments for Song Artists, Proceedings of IEEE 
International Conference on Advances in ICT for Emerging Regions 
(ICTer 2017), Sri Lanka, 149-157. 

[12] Rajitha Peiris and Lakshman Jayaratne (2016): Supervised Learning 
Approach for Classification of Sri Lankan Music based on Music 
Structure Similarity, Proceedings of ninth Annual International 
Conference on Computer Games, Multimedia and Allied Technology 
(CGAT 2016), Singapore, 84-90. 

Figure 14: Proportion of the identical classes against the frame length 

[13] M. G. Viraj Lakshitha and K. L. Jayaratne (2016): Melody Analysis for 
Prediction of the Emotion Conveyed by Sinhala Songs, Proceedings of 
IEEE International Conference on Information and Automation for 
Sustainability (ICIAfS 2016), Sri Lanka. 

[14] R. Kotsakis, G. Kalliris, and C. Dimoulas, “Investigation of broadcast-
audio semantic analysis scenarios employing radio-programme-
adaptive pattern classification,” Speech Communication, vol. 54, no. 6, 
pp. 743–762, 2012. 

[15] C.O.B. Weerathunga, P.V.K.G. Gunawardena and K.L. Jayaratne 
(2018): Classification of Public Radio Broadcast Context for Onset 
Detection. European Journal of Computer Science and Information 
Technology (EJCSIT) by European Centre for Research Training and 
Development UK, 7(6):1-22, Published by ECRTD – UK, ISSN2054 – 
0957 print 2054 – 0965 Online, www.eajournals.org, 13 Duncan Rd, 
Gillingham Kent ME7 4 LA, UK.  

[16] J. Saunders, “Real-time discrimination of broadcast speech/music,” in 
icassp. IEEE, 1996, pp. 993–996. 

[17] R. Barzilay, M. Collins, J. Hirschberg, and S. Whittaker, “The rules 
behind roles: Identifying speaker role in radio broadcasts,” in 
AAAI/IAAI, 2000, pp. 679–684. 

[18] Y. Liu, “Initial study on automatic identification of speaker role in 
broadcast news speech,” in Proceedings of the Human Language 
Technology Conference of the NAACL, Companion Volume: Short 
Papers. Association for Computational Linguistics, 2006, pp. 81–84. 

[19] S. G. Koolagudi, S. Sridhar, N. Elango, K. Kumar, and F. Afroz, 
“Advertisement detection in commercial radio channels,” in Industrial 
and Information Systems (ICIIS), 2015 IEEE 10th International 
Conference on. IEEE, 2015, pp. 272–277. 

[20] J. Vavrek, E. Vozáriková, M. Pleva, and J. Juhár, “Broadcast news 
audio classification using svm binary trees,” in Telecommunications 
and Signal Processing (TSP), 2012 35th International Conference on. 
IEEE, 2012, pp. 469–473. 

[21] L. Lu, H. Jiang, and H. Zhang, “A robust audio classification and 
segmentation method,” in Proceedings of the ninth ACM international 
conference on Multimedia. ACM, 2001, pp. 203–211. 

[22] M. Khan, W. G. Al-Khatib, and M. Moinuddin, “Automatic 
classification of speech and music using neural networks,” in 
Proceedings of the 2nd ACM international workshop on Multimedia 
databases. ACM, 2004, pp. 94–99. 

[23] Z. Kons, O. Toledo Ronen, and M. Carmel, “Audio event classification 
using deep neural networks.” in Interspeech, 2013, pp. 1482–1486. 

[24] C. W. Hsu and C. J. Lin, “A comparison of methods for multiclass 
support vector machines,” IEEE transactions on Neural Networks, vol. 
13, no. 2, pp. 415–425, 2002. 

[25] C. Lin, “A comparison of methods for multi-class support vector 
machines,” IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 
415–425, 2002. 

[26] F. Aurino, M. Folla, F. Gargiulo, V. Moscato, A. Picariello, and C. 
Sansone, “One-class svm based approach for detecting anomalous 
audio events,” in Intelligent Networking and Collaborative Systems 
(INCoS), 2014 International Conference on. IEEE, 2014, pp. 145– 
151. 

[27] A. Bouril, D. Aleinikava, M. S. Guillem, and G. M. Mirsky, 
“Automated classification of normal and abnormal heart sounds using 
support vector machines,” in Computing in Cardiology 
Conference (CinC), 2016. IEEE, 2016, pp. 549–552. 

[28] S. E. Küçükbay and M. Sert, “Audio-based event detection in office 
live environments using optimized mfcc-svm approach,” in Semantic 
Computing (ICSC), 2015 IEEE International Conference on. IEEE, 
2015, pp. 475–480. 

[29] I. Martín-Morató, M. Cobos, and F. J. Ferri, “A case study on 
feature sensitivity for audio event classification using support vector 
machines,” in Machine Learning for Signal Processing (MLSP), 2016 
IEEE 26th International Workshop on. IEEE, 2016, pp. 1–6. 

[30] J. C. Wang, J. F. Wang, C. B. Lin, K.-T. Jian, and W. Kuok, “Content-
based audio classification using support vector machines and 
independent component analysis,” in Pattern Recognition, 2006. ICPR 
2006. 18th International Conference on, vol. 4. IEEE, 2006, pp. 
157–160. 

[31] L. Lu, S. Z. Li, and H. J. Zhang, “Content-based audio segmentation 
using support vector machines,” in Proc. ICME, vol. 1, 2001, pp. 749–
752. 

[32] Y. Zhu, Z. Ming, and Q. Huang, “Automatic audio genre classification 
based on support vector machine,” in Natural Computation, 2007. 
ICNC 2007. Third International Conference on, vol. 1. IEEE, 2007, pp. 
517–521. 

[33] B. Kijsirikul and N. Ussivakul, “Multiclass support vector machines 
using adaptive directed acyclic graph,” in Neural Networks, 2002. 
IJCNN’02. Proceedings of the 2002 International Joint Conference on, 
vol. 1. IEEE, 2002, pp. 980–985. 

[34] “Sample rates - audacity manual,” 
https://manual.audacityteam.org/man/sample_rates.html, (Accessed on 12/22/2018). 

[35] T. Giannakopoulos, “pyaudioanalysis: An open-source python library 
for audio signal analysis,” PloS one, vol. 10, no. 12, p. e0144610, 2015. 

 