FACTA UNIVERSITATIS Series: Electronics and Energetics Vol. 29, No 1, March 2016, pp. 139-149 DOI: 10.2298/FUEE1601139S

THE IMPLEMENTATION OF SIGNAL ANALYSIS IN JAVA TO DETERMINE THE SOUND OF HUMAN VOICE AND ITS GRAPHICAL REPRESENTATION IN STANDARD MUSIC NOTATION

Patryk Solecki, Wojciech Zabierowski
Lodz University of Technology, Department of Microelectronics and Computer Science, Poland

Abstract. The article presents the problems associated with signal processing in human voice analysis. Based on a specific implementation of algorithms determining the pitch of the human voice, the paper shows the result in the form of standard music notation, with treble and bass clefs on the stave. Particular attention is paid to the performance of the algorithms used and to their implementation in Java. Basic analysis of human voice signals is not a challenge in general, but its implementation on mobile devices such as smartphones, with their limited hardware resources, remains one. The limitations of both the CPU and the memory affect the processing speed of the Java virtual machine. One should also remember that the quality of the microphones used in this type of mobile device is low. From this point of view, we present a new approach to the well-known problem of signal analysis implemented in computer applications such as Raven.

Key words: signal processing, Java programming, music notation, voice analysis

1. INTRODUCTION

The pitch of a sound generated by a musical instrument was analyzed in Ref. [1] on the example of a smartphone application, which used the phone's limited resources. In addition to the limitations associated with hardware (CPU, memory, bandwidth of the microphone), the selection and use of an appropriate programming language had a significant impact on the application.
The problem with the choice of the language is that it is not just a matter of habit; it is very often determined by the selected hardware platform. It seems obvious that the most effective are C++-type languages, especially when the sound analysis is done "online" on data coming from the microphone. Analysis of sound pitch is one of the easiest issues related to the processing of acoustic signals. Therefore, the implementation of algorithms for the identification of the human voice or the sound of an instrument [1] is a frequently undertaken problem.

Received March 27, 2015; received in revised form October 21, 2015. Corresponding author: Wojciech Zabierowski, Lodz University of Technology, Department of Microelectronics and Computer Science, Poland (e-mail: wojciech.zabierowski@p.lodz.pl)

In the literature there are different approaches to signal analysis. Various methods are used for speech recognition and for the determination of the pitch of sound or speech. One example is the use of adaptive algorithms in human speech segmentation [2]. Various approaches to the problem can be found in [3], where different methods of segmentation in automatic speech recognition are presented. A common tool in speech recognition is also the discrete wavelet transform, used to identify the pitch and to segment the signal [4]. An important aspect of speech recognition is taking into account the impact of the individual features of the speakers and of the signal transmission conditions on automatic speech recognition [5]. Another one is cepstral analysis, necessary for speech recognition not only in terms of pitch but also of individual sounds [6].
The purpose of this publication is to show that for some simple problems of sound analysis, in particular pitch determination, it is possible to create an effective implementation in Java, using the Java virtual machine and the available input/output libraries. The intention of the authors was not to propose a commercial application capable of creating complete music notation from an arbitrary song. Simple applications of this kind exist, e.g. for the iPhone, as well as for desktops, like Raven. Some of them, especially those for smartphones, were not yet available at the time of this research. The aim of this research was to show that using fairly simple, publicly available Java mechanisms, applicable to various systems, including e.g. Symbian on mobile phones, it is possible, despite the hardware limitations (among others, of the phone's microphone), to create this type of application. The simplifications and the consideration of only certain ranges of signal analysis are intentional and aim at a good presentation of the implementation of human speech-signal analysis on devices with limited hardware resources.

2. SPECIFICATION OF THE SOUND PROCESSING PROBLEM

Assuming that the current version of Java can deal with the issue of sound analysis, it was decided that the determined pitch, for a better visual effect and clarity, will be presented in the form of traditional musical notation on a stave. From the algorithmic point of view, DFT (Discrete Fourier Transform) analysis should be used for the bands of the spectrum of the analyzed signal [7]. The algorithm should recognize the band number of the fundamental tone and transform the calculated frequency into the corresponding note in standard musical notation. The implementation also considered the use of the FFT (Fast Fourier Transform) algorithm, which saves a considerable amount of calculation in comparison to the direct implementation of the DFT.
The computational complexity of both algorithms is given by the equations:

k_DFT = O(N^2)  (1)

k_FFT = O(N * log2 N)  (2)

where N is the number of input data samples. With proper implementation, the memory load is not much larger than the amount of memory occupied by the input data. In the case of real samples, the FFT can be further improved by a modification of the algorithm, the 2N-point Real FFT, which further reduces the amount of needed resources. Unfortunately, apart from its benefits, this algorithm also has a drawback: the number of samples increases to 2N, which, in turn, adversely affects the flexibility of the analysis of the sound. The final results are obtained only at the end of the execution of the algorithm. These advantages and disadvantages of the considered solutions have a significant influence on the implementation, in particular when considering the operation of the system in the "online" version. The basic DFT algorithm has two main inherent advantages. In contrast to the FFT, the results for the various bands are evenly spaced in time. The second advantage, not to be underestimated, is the flexible number of input samples. This means that by regulating the number of samples used in the algorithm one can control the resolution, denoted f_r (Equation 3):

f_r = f_s / N  (3)

where f_s is the adopted sampling frequency. In the case of the FFT, a large number of input samples must be used, which affects the application performance. During online analysis the signal is analyzed continuously; in offline analysis only specific samples are analyzed.
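As a quick sanity check of Equation 3, the resolution can be computed directly for a few power-of-two buffer sizes (a minimal sketch; the class and method names are ours, not from the paper's code):

```java
// Sketch: frequency resolution f_r = f_s / N (Equation 3) for a few
// power-of-two buffer sizes, assuming the standard f_s = 44100 Hz.
public class PitchResolution {

    /** Resolution in Hz per spectral band for sampling rate fs and N samples. */
    public static double resolution(double fs, int n) {
        return fs / n;
    }

    public static void main(String[] args) {
        for (int n : new int[]{8192, 16384, 32768, 65536}) {
            System.out.printf("N = %5d -> f_r = %.2f Hz%n", n, resolution(44100.0, n));
        }
    }
}
```

For example, N = 8192 gives f_r of about 5.38 Hz, the value quoted below.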
In the case of "online" processing (samples are processed in real time and immediately presented by the application in standard music notation), an extended sampling time increases the inertia of the system, i.e. of the implemented application: there are delays in the graphical presentation of notes. Assuming the standard f_s = 44100 Hz, we chose the following numbers of samples, all powers of two (Equation 4):

N_i ∈ {8192, 16384, 32768, 65536}  (4)

Too small a number of samples, as in the case of N = 8192, results in a very low resolution; in this case f_r = 5.38 Hz, on the basis of Equation 3. The modified 2N-point FFT algorithm that is used requires at least N = 16384 samples, which can significantly affect the data acquisition time for the next step of the calculation. From the above discussion, the following conclusions can be derived with respect to the applicability of the specific algorithms to the problem under consideration. The FFT is a faster but more complex implementation, which requires more resources. The DFT algorithm in its basic version, with a simple implementation, has lower hardware requirements, which in turn gives the programmer more possibilities to adapt the implementation to the limited resources. Calculating the FFT of the significant harmonics requires calculating the whole FFT, in other words all the harmonics, which in the given example requires at least 8820 samples every time (because f_s = 44.1 kHz). The experimentally used narrowed DFT, which computes only the bands up to 900 Hz, has the computational complexity stated below:

O_exp = O( (900 / f_s) * N^2 )  (5)

Complexity graphs of the FFT, the DFT and the experimental DFT for the analyzed ranges of 500 Hz and 900 Hz are shown in Figure 1.

Fig. 1 Graph of the calculation complexity: computing time relative to the number of samples.

Specifying the fundamental pitch is a useful tool in the analysis of musical sounds.
It allows one to specify the frequency of the fundamental tone and, on this basis, determine the name of the sound. It also helps to examine such traits of the sound as vibrato. The frequency of the fundamental tone lies in the intervals shown in Table 1 [5]:

Table 1 Frequency of the fundamental tones depending on the voice type [10]

Voice name      Frequency [Hz]
Bass            80-320
Baritone        100-400
Tenor           120-480
Alto            160-640
Mezzo-soprano   200-800
Soprano         240-960

In addition, it varies depending on individual characteristics and on the resonators participating in creating the sound: laryngeal, sinus, mouth and pectoral (chest) [10]. That is why, among other things, for the purpose of this research we decided to limit the frequency range to 1 kHz. This restriction was introduced because it was assumed that sounds will be read and written in musical notation only in this range of frequencies. The human speech spectrum includes frequencies from 100 Hz to over 8 kHz, where the largest spectral density (energy) is in the vicinity of 500 Hz and gradually decreases with increasing frequency, which also supports our limitation of the analyzed interval. The human ear receives signals in a much wider frequency range, although the exact limits depend on the individual. The typical range of signals registered by the human ear covers frequencies from 20 Hz to 15 kHz (sometimes 20 kHz), with the highest sensitivity from 1 kHz to 3 kHz [16].

3. SOLVING THE PROBLEM - THE DFT ANALYSIS

Analysis of the human voice induces a lot of problems. The human voice is very complex in terms of the number of parameters describing it [2,3,12,14]. This also results in a very complex set of harmonics visible in the spectrum of the signal.
Changes in the voice can occur dynamically during the analysis, because of the subject's conscious voice modulation, but also through the impact of external factors that can affect the spectrum of the voice of the tested person [10]. Although the voice of every human is determined by the personal sound produced by the vocal folds, as a result of changes in the vocal tract it can vary considerably. This means that the voice of each person will be different due to inter-individual characteristics, although it will still have the same pitch or the same character. With the DFT analysis, based on the transformed expression (Equation 6), one must be aware of certain characteristics of such signals, which facilitate further analysis and can prevent errors:

X(m) = Σ_{n=0}^{N-1} x(n) * [cos(2πnm/N) - j*sin(2πnm/N)]  (6)

Attention should be drawn to the following points:
- The fundamental tone is not always the most powerful component of the sound.
- The quality of the equipment used is of fundamental importance and has an impact on the resolution and the possible disruption of the spectrum at low frequencies.
- If the input data is used in the DFT without windowing, the fundamental tone may be disturbed by other bands, through so-called spectral leakage ("leaks").
- For a proper analysis of the signal and, in particular, of a human voice, the signal strength must be at the right level for the appropriate resolution of the harmonics, which allows for proper analysis.

Fig. 2 Spectrum of 'e' sound produced by a male voice [1].

Fig. 3 Spectrum of 'e' sound produced by a female voice [1].

As shown in Figures 2 and 3, the frequency range of the human voice, in the sense of the fundamental tone, starts in this particular case already around 60-70 Hz and ends just over 1000 Hz.
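A direct, band-by-band implementation of Equation 6 can be sketched as below (an illustrative sketch, not the authors' code; the class and method names are ours). Evaluating a single band costs O(N), which is what makes the narrowed DFT, restricted to the bands covering the voice range, cheaper than the full transform:

```java
// Sketch: direct DFT of a real signal, one spectral band at a time
// (Equation 6). Computing a single band X(m) costs O(N), so restricting
// the analysis to the bands of interest avoids the full O(N^2) transform.
public class Dft {

    /** Magnitude of spectral band m for real-valued input x. */
    public static double bandMagnitude(double[] x, int m) {
        int n = x.length;
        double re = 0.0, im = 0.0;
        for (int k = 0; k < n; k++) {
            double angle = 2.0 * Math.PI * k * m / n;
            re += x[k] * Math.cos(angle);
            im -= x[k] * Math.sin(angle); // minus sign as in Equation 6
        }
        return Math.hypot(re, im);
    }

    public static void main(String[] args) {
        // One full cycle of a sine over 8 samples: energy lands in band 1.
        double[] x = new double[8];
        for (int k = 0; k < x.length; k++) {
            x[k] = Math.sin(2.0 * Math.PI * k / x.length);
        }
        System.out.printf("band 1: %.2f, band 2: %.2f%n",
                bandMagnitude(x, 1), bandMagnitude(x, 2));
        // band 1: 4.00, band 2: 0.00
    }
}
```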
Taking into account the fact that the lowest and the highest frequencies are achieved only by a small percentage of the population, for the tests and the described implementation the scope was, for simplicity, narrowed to the range f = 98.00 Hz - 783.99 Hz. This is due to the need to ensure the appropriate resolution and the frequency values assigned by the equal-tempered scale. The smallest difference between neighbouring sounds in this range was at the level of df = 5.83 Hz. This means that adopting a resolution of f_r = 5 Hz would be sufficient for most of the range. Unfortunately, the mathematical properties of the Fourier transform can introduce errors at low frequencies, the so-called "leaks". The correctly adopted resolution (see above) implies a certain behaviour of the variants of the algorithm. With such a resolution of the DFT procedure, a component with a pitch close to the boundary between observed spectral bands spreads into the adjacent bands, which can result in two neighbouring bands of similar values. The interpretation of such bands depends on the applied algorithm: either one chooses the band of larger value as the basis for the sound diagnosis, or one approximates the neighbouring bands, identifies the maximum and assigns the sound pitch to it. At the resolution f_r = 5 Hz it is possible to recognize even the extremely low sound pitches at the basic level. In this way we also limit the DFT resolution, and for the calculations, according to Equation 3, 8820 samples may be used. With such a resolution it is possible to reduce the number of samples while maintaining a basic, satisfactory sensitivity of the pitch markings. The algorithm adopted in the analysis was narrowed to five bands.
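The assignment of a detected frequency to the nearest note of the equal-tempered scale can be sketched as follows (our own illustration, assuming the usual A4 = 440 Hz reference; this is not taken from the paper's code):

```java
// Sketch: mapping a detected frequency to the nearest equal-tempered note,
// assuming the common A4 = 440 Hz reference. The analyzed range quoted in
// the text, 98.00 Hz - 783.99 Hz, then corresponds to the notes G2 - G5.
public class NoteMapper {
    private static final String[] NAMES =
            {"C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"};

    /** Nearest note name, e.g. "G2" for 98.00 Hz. */
    public static String noteName(double freq) {
        // 69 is the MIDI number of A4; 12 semitones per octave.
        int midi = (int) Math.round(69 + 12.0 * Math.log(freq / 440.0) / Math.log(2.0));
        return NAMES[midi % 12] + (midi / 12 - 1);
    }

    public static void main(String[] args) {
        System.out.println(noteName(98.00));   // G2
        System.out.println(noteName(440.00));  // A4
        System.out.println(noteName(783.99));  // G5
    }
}
```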
The limitation was adopted on the basis of the signal analysis and the observation that, for the purposes of pitch recognition, this limitation gives such advantages in terms of utilization and load on system resources that the potential inaccuracies in the determination of the pitch can be tolerated.

For the proper operation of the described algorithm, the following assumptions were made:
- The two adjacent bands on both sides must have a smaller value.
- The analyzed band must exceed a value established in advance.

One should be aware of the consequences of these simplifying assumptions. One can, of course, expand the described algorithm, but such changes will increase the computational complexity and this, in turn, will decrease the processing performance of the solution. The presented algorithm is based on a single pass, which implies that the time needed to find the correct tone varies for different tones. The analysis proceeds from low to high frequencies and, therefore, a lower sound is detected earlier and will be marked sooner.

4. THE APPLICATION

A multi-threaded application was written, using [8], so that the described algorithms work efficiently. The project was split into a group of classes responsible for the analysis of samples, a group responsible for the presentation, and the input/output classes that receive the samples. A control group of classes provides communication between the groups and between the threads. The collection and processing of data is carried out continuously and can be controlled by the user. Data are collected directly from the buffered sound card stream and then subjected to normalization.

Fig. 4 A simple line-in configuration [11].

Java libraries provide a mechanism for acquiring data directly from the line-in of the standard audio mixer of the operating system (Figure 4). The analysis described in the previous section is carried out on the prepared data to search for the fundamental tone.
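The band-acceptance rules described above (two smaller neighbours on each side, value above a preset threshold) can be sketched as a single low-to-high pass over the magnitude spectrum; the names and sample data below are our own illustration:

```java
// Sketch: single-pass search for the fundamental in a magnitude spectrum.
// A band qualifies only if it exceeds a preset threshold and is larger than
// the two adjacent bands on both sides. Scanning from low to high bands
// means lower sounds are detected (and displayed) sooner, as in the text.
public class FundamentalDetector {

    /** Returns the index of the first qualifying band, or -1 if none. */
    public static int findFundamental(double[] spectrum, double threshold) {
        for (int i = 2; i < spectrum.length - 2; i++) {
            double v = spectrum[i];
            if (v > threshold
                    && v > spectrum[i - 1] && v > spectrum[i - 2]
                    && v > spectrum[i + 1] && v > spectrum[i + 2]) {
                return i;
            }
        }
        return -1;
    }
}
```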
The result of this thread is delivered to the thread responsible for the presentation in music notation (Figure 5). Saving the marked tone must also take into consideration special characters such as the flat and sharp symbols, which denote lowering or raising the displayed note by one halftone. Although the frequency range of the human voice is not large, for the proper presentation of the tones both the treble and the bass clef must be used. The application has several features enabling various options of sound analysis.

Fig. 5 The fragment of the dialog window (simplified) - presentation of the sound [11].

To generate the appropriate notes in musical notation, the Java Swing package was used. In addition, controls have been introduced to allow the user to modify the algorithm (Figure 6). The user can change parameters and select the method of analysis. It is possible to choose how the data acquisition and the data analysis should be performed. These additional features allow one to show different aspects of the application's functioning.

Fig. 6 The fragment of the dialog window - presentation of the sound [11].

Before beginning the sound pitch analysis, the size of the incoming data buffer (samples) sent to the analyzing functions must be set in the application window. This setting determines the number N of samples analyzed in one step. This number is shown in the lower-left corner of the application's dialog window. The number of the band currently being identified is shown on the right, which allows a fast real-time check of the correctness of the results. There is also an intonation indicator in this corner: it shows whether the identified pitch is below or above the standard frequency of the identified sound. During the calculation process, the type of calculation can be changed at any time by choosing between the FFT and the narrowed DFT.
To assure the correct identification of the tone by the program, calibration of the detection threshold is necessary. The possibility to control this parameter allows adjusting the program to the level and timbre of the analyzed sound. This means scaling the levels of the analyzed parameters in order to avoid false identification of tones. The use of the FFT avoids the leakage effects described in the DFT section. While using the FFT, the user has the possibility to enable the option of thorough peak checking, zero-padding in order to increase the resolution of the spectrum, and windowing in order to suppress leakage (the so-called side lobes).

5. SMARTPHONE APPLICATION

As already mentioned, the real challenge is to write the application for a mobile device, such as a smartphone, which is now a very popular device accessible to everyone. Having experience in sound processing for recording guitar tablature and for guitar tuner implementations on a mobile phone [1], we decided to face a more complex challenge. On a smartphone we verified the results obtained earlier on relatively powerful machines: a desktop computer and a laptop. To increase the challenge, the application was tested not on the latest models of smartphones, but on those 2-3 years old. The implementation of the user interface in particular required considerable changes in comparison to the desktop version. The touch screen, instead of a mouse and a keyboard, significantly expands the functionality and usability of the application. However, as with the guitar tablature creator [1], a serious problem was the mobile device's microphone. During the implementation of the application, the following additional assumptions were adopted.
First, an FIR (Finite Impulse Response) filter has been used, additionally serving as an interpolating filter to increase the digital sample rate. The main operation of data filtering is the convolution (Equation 7):

y(n) = Σ_{k=0}^{M} h(k) * x(n - k)  (7)

where h(k) are the filter coefficients and M is the filter order. Thanks to this operation one can obtain a better resolution of the tuning. The Dolph-Chebyshev window, also applied to the FIR filter, is very useful and corrects the characteristic of the window. Furthermore, the characteristics of the filter and of the window (such as the gamma parameter) can be set by the user. In addition, the autocorrelation function was introduced and the analysis described below has been done:

r(n) = Σ_m x(m) * x(m + n)  (8)

where m is a sample from the input range and n is the lag (sample number).

Fig. 7 Analysis of the autocorrelation function. Extraction of the fundamental harmonic [6].

This algorithm provides a good resolution, but it does not eliminate noise, and problems can also occur if the input signal does not include the fundamental harmonic. The following elements were therefore used:
- Analysis of the spectral function. The main goal of the spectrum function analysis is to find the peak of the function and establish the current fundamental frequency of the input signal.
- Adaptation of the Dolph-Chebyshev window (FIR filter). This type of window is very useful when creating the FIR filter.
- Approximation of the complex vector modulus. A commonly used operation is arithmetic with complex numbers, i.e.:

|V| = sqrt(I^2 + Q^2)  (9)

where I is the real part and Q the imaginary part of a complex number. It can be replaced with a simple, low-cost operation which gives a comparable result:

|V| = α*Max + β*Min  (10)

where Max is the larger and Min the smaller of the two parts of the complex value. Alpha and beta are parameters chosen from the appropriate table [7].

A standard sound sampling rate for mobile devices amounts to 8 kHz.
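Equations 9 and 10 can be compared directly. Since the text does not state which α, β pair was used, the sketch below uses 15/16 and 15/32, one commonly tabulated choice for this "alpha max plus beta min" approximation; the class and method names are ours:

```java
// Sketch: exact complex magnitude (Equation 9) vs. the cheap
// alpha*Max + beta*Min approximation (Equation 10). The coefficient pair
// 15/16 and 15/32 is one commonly tabulated choice; other pairs trade
// accuracy against even cheaper arithmetic.
public class Magnitude {

    public static double exact(double i, double q) {
        return Math.sqrt(i * i + q * q);
    }

    public static double approx(double i, double q) {
        double max = Math.max(Math.abs(i), Math.abs(q));
        double min = Math.min(Math.abs(i), Math.abs(q));
        return (15.0 / 16.0) * max + (15.0 / 32.0) * min;
    }

    public static void main(String[] args) {
        // For I = 3, Q = 4 the exact magnitude is 5; the approximation
        // stays within a few percent while avoiding the square root.
        System.out.printf("exact: %.4f, approx: %.4f%n", exact(3, 4), approx(3, 4));
    }
}
```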
This frequency is typical for voice calls and simple voice recording, but it can cause difficulties when one wants to analyze the digital signal (sometimes a digital increase of the sample rate is required). For the purposes of this implementation, the input signal sample rate has been increased from 8 kHz to 80 kHz by a digital interpolation process. The cost is the additional interpolation to be executed with the smartphone's resources but, on the other hand, we profit from the possibility of using an anti-alias filter of lower order, which needs less processing power. A further disadvantage is the possible appearance of noise and artifacts, because the interpolation process is never able to reproduce the original signal exactly. The main profit is that it increases the signal-to-noise ratio. Summing up, by applying this procedure it became possible to obtain a good quality of the signal for further analysis and, finally, satisfactory results.

6. SUMMARY

It has been shown that the Java programming language and the Java virtual machine, despite their limitations, are able to process signals "online" in a satisfactory manner. New versions of the Java platform, as well as newer computers, significantly improve the comfort of the programmer, allowing a more accurate analysis of increasingly complex computing problems. However, we must remember that today's challenge is not a desktop computer or even a laptop, but a mobile device [1]. Therefore, the issue of signal analysis for a variety of platforms, including Java, is still valid. This study may also have a practical aspect. An application of this type can be used for educational purposes, e.g. for learning signal analysis and related issues. It can also serve as an interesting aid in learning to recognize a sound pitch, which is a basic exercise for students learning to play an instrument or to sing.
The signal was collected online from a microphone and analyzed according to the presented algorithm. No tests were carried out on external databases; instead, a few people (musicians) evaluated the application by ear, in terms of the quality of the voice recognition. Indeed, one could think of a more systematic way of checking the application, but it is worth noting that apart from being checked by a few musicians, the application is used during lessons as a teaching aid at a music school. This is a good practical stress test. A detailed comparison with commercial applications has not been done, because it was not the purpose of the authors to compete with commercial applications for desktops, like Raven [17]. In these tests, the recognition results were satisfying. The obtained results show that the effect of our work may be useful for people learning to play musical instruments, tuning instruments, etc. The program was used with good results as a teaching aid for children learning music at a music school.

REFERENCES

[1] P. Solecki, W. Zabierowski, "The signal analysis of sound based on the application of guitar tabulatures for mobile devices", Przegląd Elektrotechniczny, vol. 88, no. 10b, pp. 239-242, 2012.
[2] V. A. Petrushin, "Adaptive Algorithms for Pitch-Synchronous Speech Signal Segmentation", In Proc. SPECOM'2004: 9th Conference Speech and Computer, St. Petersburg, Russia, September 20-22, 2004.
[3] A. S. Spanias, "Speech coding: A tutorial review", Proc. IEEE, vol. 82, pp. 1541-1575, October 1994.
[4] C. Wendt, A. P. Petropulu, "Pitch determination and speech segmentation using the discrete wavelet transform", Electrical and Computer Engineering Department, Drexel University, Philadelphia, PA 19104.
[5] P. Mrowka, "Algorytmy kompensacji warunków transmisyjnych i cech osobniczych mówcy w systemach automatycznego rozpoznawania mowy", PhD dissertation, Politechnika Wrocławska, Instytut Telekomunikacji, Teleinformatyki i Akustyki, Raport Nr I28/PRE-001/07, Wrocław, 2007.
[6] A. P. Dobrowolski, E. Majda, "Analiza cepstralna w systemach rozpoznania mówców", no. 6/2012, Instytut Logistyki i Magazynowania, 2012.
[7] R. G. Lyons, Understanding Digital Signal Processing, Pearson, 2010.
[8] B. Eckel, Thinking in Java, Wydawnictwo Helion, Gliwice, 2003.
[9] K. Demuynck, T. Laureys, "A Comparison of Different Approaches to Automatic Speech Segmentation", http://www.esat.kuleuven.ac.be/#spch, 2013.
[10] W. P. Morozow, Isskustwo Rezonansnawo Pienija, Iskusstwo i nauka, Instytut Psychologii Rosyjskiej Akademii Nauk, Państwowe Konserwatorium im. P.I. Czajkowskiego w Moskwie, Moskwa, 2002.
[11] M. Dybowski, W. Zabierowski, "Aplikacja rozpoznająca wysokość dźwięków głosu ludzkiego JAVA – w mgnieniu oka", XIII Konferencja SIS - Sieci i Systemy Informatyczne - teoria, projekty, wdrożenia, aplikacje, Łódź, pp. 421-426, vol. 2, Piątek Trzynastego Wydawnictwo, 2005, ISBN 83-7415-069-6.
[12] A. Gersho, "Advances in speech and audio compression", Proc. IEEE, vol. 82, June 1994.
[13] L. R. Rabiner, R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, New Jersey, Bell Laboratories, 1978.
[14] T. Robinson, Speech Analysis, Lent Term 1998.
[15] W. Hess, Pitch Determination of Speech Signals, Springer-Verlag, 1983.
[16] D. Gerhard, "Pitch extraction and fundamental frequency: History and current techniques", Technical Report TR-CS 2003-06, Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada, November 2003.
[17] http://www.birds.cornell.edu/brp/raven/RavenTestimonials.html, 2013.