

FACTA UNIVERSITATIS 

Series: Electronics and Energetics Vol. 36, No 2, June 2023, pp. 299-314 

https://doi.org/10.2298/FUEE2302299M 

© 2023 by University of Niš, Serbia | Creative Commons License: CC BY-NC-ND 

 
Original scientific paper  

ANALYSIS OF PORTABLE SYSTEM FOR SOUND ACQUISITION OF VEHICLES POWERED BY INTERNAL COMBUSTION ENGINES

Marko Milivojčević¹, Emilija Kisić², Dejan Ćirić³

¹Academy of Technical and Art Applied Studies, School of Electrical and Computer Engineering, Belgrade, Serbia

²Metropolitan University, Faculty of Information Technology, Belgrade, Serbia

³University of Niš, Faculty of Electronic Engineering in Niš, Niš, Serbia

Abstract. In this paper a portable system for acquisition of sound generated by passenger 

vehicles powered by internal combustion engines is described and analyzed. The acquisition 

system is developed from scratch and tested in order to satisfy the requirements such as 

high quality of audio recordings, high mobility, robustness and respect for privacy. With this

acquisition system and adequate signal processing, the main goal was to collect a large 

amount of clear audio recordings that will form a quality dataset. In further research, this 

dataset will be used for machine learning model training and testing, i.e. for developing a 

system for automatic recognition of the type of car engine based on fuel. 

Key words: acoustic based acquisition system, dataset, audio signals, internal 

combustion engines 

1. INTRODUCTION 

Applications of artificial intelligence algorithms to audio signals are becoming more 

numerous over time [1-4]. Sound classification, audio event detection and audio scene 

recognition are examples of tasks that are successfully realized in practice by applying 

machine or deep learning [5, 6]. In this context, machine and deep learning could be used 

to identify the type of internal combustion engine with regard to the fuel based on the 

sound generated by the engine. Namely, the sound of these engines differs depending on 

the fuel used: petrol (gasoline) or diesel. The human ear can recognize this difference,

that is, whether the sound comes from a petrol or a diesel engine. Those facts and the need to classify

passenger vehicles by fuel as a result of improved environmental standards [7] have 

served as major pillars of the present research. Its main aim is to develop a system for 

automatic recognition of engine type based on sound generated by the engine, that is, to 

 
Received November 09, 2022; revised January 18, 2023; accepted February 06, 2023 

Corresponding author: Marko Milivojčević 

Academy of Technical and Art Applied Studies, School of Electrical and Computer Engineering, 

Belgrade, Serbia  

E-mail: markom@viser.edu.rs 


300 M. MILIVOJČEVIĆ, E. KISIĆ, D. ĆIRIĆ 

 
build a machine/deep learning model that will be able to recognize the type of engine 

with high accuracy, where an input to the model will be the engine sound. 

Since the successful implementation of machine/deep learning requires an adequate 

dataset (containing, in this case, audio samples), a specialized acquisition system has 

been developed for this purpose. Details of the development of such an acquisition 

system for the collection of audio samples of the passenger vehicles powered by internal 

combustion engines are presented here. 

The first requirement that the acquisition system should satisfy is automation, because

manual collection of a large number of samples would require a lot of time and might 

introduce certain differences in conditions during the acquisition. Then, the collected data 

should meet the requirements for quality, duration and invariability of environmental 

conditions in order to provide reliable information regarding the acoustic characteristics.

The paper is divided into several sections. The technical characteristics of the system, 

hardware configuration and selection of components as well as acquisition procedure and 

processing of the collected audio signals are presented in the section related to methodology. 

The section describing the results provides a tabular presentation of the system efficiency for 

three cases of time interval between the start of detection of two consecutive vehicles, as 

well as the presentation of audio signals in the time and spectral domain as a measure of 

validity of the obtained images for further analysis with a machine or deep learning 

system. The paper ends with concluding remarks. 

2. ACQUISITION SYSTEM AND PROCEDURE DESCRIPTION 

In the earlier phases of the research, the influence of microphone position in the area 

below the engine compartment on the characteristics of audio recordings was analyzed in 

detail [8]. As a result, it was determined that the basic characteristics of the audio signal 

varied minimally regardless of where the measuring microphone was placed, as long as

the microphone was directly below the engine compartment [8]. In that regard, depending

on the vehicle, the target area where the microphone can be placed below the engine is

approximately 1.2 m by 1.2 m. Because of that, it is possible to collect relevant audio 

samples regardless of the exact position of the vehicle when it is stopped above the 

microphone. Based on the previous findings, the acquisition system uses a microphone 

positioned in the area below the engine compartment chosen as the most suitable area in 

terms of "purity" of sound [8, 9], and audio recording begins only after the presence of 

the vehicle is detected. In this way, audio samples of engine operation in the idle mode 

are collected, without the microphone itself being positioned on the vehicle. 

The system has been developed to be mobile, so that it can be set up independently of 

availability of power sources, and it is fully designed to run on battery power. Additionally, 

the system is designed to be autonomous, i.e., not to require human presence during operation. 

As the system has limited memory space, it was necessary to develop several verification 

steps before the current audio sample was written in the memory. Specifically, this system has 

four levels of verification before storing the audio recording, which resulted in a dataset of 

recordings that contains only sounds of interest, i.e., engine operation. When the system is 

applied in real conditions involving presence of interfering sources of noise and different 

engine load modes, despite a large number of successfully collected audio recordings, some 

recordings containing not only the desired engine mode but also other engine modes appeared 


  Analysis of Portable System for Sound Acquisition of Vehicles Powered by Internal Combustion Engines 301 

 
in the formed dataset. So, it was necessary to develop a procedure that detects and then 

extracts the idling mode of the internal combustion engine. In order to have as much autonomy 

of the system as possible, the requirement for minimum energy consumption conditioned the 

application of the simplest possible procedure for separating the desired engine mode. Thus, the 

procedure of extracting the engine idle mode applied here is based on the audio signal 

processing in the time domain, i.e., usage of signal envelope. It is worth mentioning that the 

number of recordings containing only the engine idling mode is also affected by the minimum 

time period between the start of detection of two consecutive vehicles. 

2.1. Acquisition system 

The main goal of collecting audio samples of engine operation is to make a dataset 

containing sounds of passenger vehicles recorded in real conditions. In this way, the future 

classification system will be able to properly work in such conditions, as those at entrances to 

underground garages, toll plazas, gas stations, etc. The generated dataset of audio samples 

should preferably have such characteristics that will enable its usage in different machine and 

deep learning approaches [10]. They include support vector machine (SVM) [11], k-nearest 

neighbors (k-NN) [12], deep forest [13] or various deep neural network architectures as multi-

layer perceptron [14] or convolutional neural network [15]. For this purpose, the audio 

samples may be transformed either into selected set of features or images, such as 

spectrogram-like images, or they may be used in the existing format (raw audio signals). 

The entrance to the underground garage with a ramp was chosen as the most suitable 

space for collecting audio samples, where it is necessary to stop the vehicle until the 

driver takes the card / token. During this period, the car is static and idling. Even if it has 

a start / stop system, it will run in idle mode for a certain period of time. In addition, in 

such a situation, the movement of the vehicle is so directed that there is no possibility of 

mechanical damage to the microphone and sensor that are placed on the ground in the 

space between the wheels. The block diagram of the system is presented in Fig. 1, and the 

realized system in a laboratory environment is shown in Fig. 2.  

 
Fig. 1 Block diagram of the sound acquisition system 



 
Fig. 2 Realized acquisition system in laboratory environment 

The system has been developed so that the presence of vehicles is detected with the 

ultrasonic sensors before the process of recording the engine operation sound begins (the 

first level of verification). Ultrasonic sensors are primarily selected as sensors that, unlike 

widely used cameras, do not affect user privacy. Also, these sensors, which are among the

cheapest on the market, have very low power consumption, and they are accurate

enough to detect vehicles. This type of vehicle detection enables the installation of the 

system almost everywhere because there is no possibility of interference with any existing 

induction sensors at the entrance ramp and violations of the law related to user privacy. In 

order to avoid detection of objects that are not vehicles of interest, two sensors are used. 

The sensors are positioned so that one measures the distance along the horizontal (x) axis 

and the other one along the vertical (y) axis. The plane formed by the ultrasound sensors is 

not perpendicular to the direction of vehicle movement, as shown in Fig. 3. 

By using the sensors placed in the described way, the possibility of detecting two-wheelers

and pedestrians that might also show up at the ramp is eliminated. Namely, due to

the position and orientation of the sensor placed on the ground, two-wheelers can only be 

detected if they pass directly above, i.e., over the sensor. However, even in such a case, 

they will not meet the distance requirement from the side sensor, if they move in the 

intended direction of entering the garage. If the sensors were located in the same plane, 

then it would theoretically be possible for a motorcycle to be oriented perpendicularly in 

reference to the intended direction of movement of the vehicle, i.e., above the ground 

sensor and facing the side sensor with the front or rear wheel. By positioning the sensors 

in two planes, a motorcycle would have to be in an almost impossible position to enter 

the garage, i.e., it would need to hit the ramp in order to satisfy the condition of the vehicle 

presence on both sensors. In a similar manner, a pedestrian who is above the ground sensor 

could move in tandem with another pedestrian who would satisfy the condition of the side 

sensor if both sensors were in the same plane. However, if the distances are measured in 

different planes, it would be more difficult and less likely to meet the condition of the 

vehicle presence on both sensors. 



 
Fig. 3 The acquisition system positioned at the entrance to the underground garage, 

where horizontal (x) and vertical (y) axis as well as horizontal and vertical plane, 

which is also the plane formed by the axes, are presented 

Readily available waterproof ultrasonic distance measurement modules containing an 

ultrasonic sensor JSN-SR04T, whose specification is given in [16], are used in the 

acquisition system. These modules (that is, the sensors) are controlled by a microcontroller

within the Arduino Nano platform [17], where distances are set for the specific measurement 

case. Distance measurement is realized by the short-term emission of an ultrasonic signal 

triggered by Arduino, after which Arduino measures the time until the reflected signal 

appears. The distance to an obstacle is calculated based on the measured time required for 

the signal to reach the obstacle and then return, and based on the speed of sound in the 

air. Since measurement of the distance to the vehicle does not require precision greater than 

1 cm, the best results were obtained by a trigger signal lasting 10 microseconds. Thus, if both 

sensors detect an object (the horizontal sensor at distance less than 80 cm and the vertical 

sensor at distance less than 40 cm), the microcontroller registers a vehicle presence and sends 

this information using serial communication to the Raspberry Pi computer [18]. This 

computer represents the central part and heart of the acquisition system. The reason for using 

an additional microcontroller in addition to the Raspberry Pi, which can also control and read 

ultrasonic sensors, is the need to detect vehicle presence continuously, i.e., in parallel with 

recording the audio. By having both the Raspberry Pi and the additional microcontroller 

(Arduino Nano), two activities − vehicle detection and audio recording, supposed to be 

done in parallel, can be realized in an easier and more reliable way. 
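The time-of-flight calculation and the two-sensor presence condition described above can be sketched as follows. This is only an illustrative Python sketch, not the Arduino firmware used by the authors: the speed of sound (343 m/s, roughly valid at 20 °C) is an assumed value, and the function names are hypothetical. Only the 80 cm and 40 cm limits come from the text.

```python
SPEED_OF_SOUND_M_S = 343.0   # assumed value for air at about 20 degrees C

HORIZONTAL_LIMIT_CM = 80.0   # side-sensor limit quoted in the text
VERTICAL_LIMIT_CM = 40.0     # ground-sensor limit quoted in the text


def echo_time_to_distance_cm(echo_time_us: float) -> float:
    """Convert a round-trip echo time (microseconds) to a one-way distance (cm)."""
    round_trip_m = SPEED_OF_SOUND_M_S * echo_time_us * 1e-6
    # The pulse travels to the obstacle and back, so halve the path length.
    return round_trip_m / 2.0 * 100.0


def vehicle_present(horizontal_echo_us: float, vertical_echo_us: float) -> bool:
    """A vehicle is registered only if BOTH sensors see an object within their limits."""
    return (
        echo_time_to_distance_cm(horizontal_echo_us) < HORIZONTAL_LIMIT_CM
        and echo_time_to_distance_cm(vertical_echo_us) < VERTICAL_LIMIT_CM
    )
```

Requiring both conditions at once is what lets the system ignore objects that satisfy only one sensor, such as a pedestrian walking past the side sensor.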

Each audio sample is recorded with an omnidirectional microphone that is placed on 

the ground in the area below the vehicle. In this way, in almost all cases, the microphone 

is positioned directly below the vehicle’s engine after the vehicle is stopped in front of 

the ramp. In order to obtain the highest quality audio recordings, the AKG C562CM 

omnidirectional microphone is used, with the specifications that are listed in [19] and 

presented in Fig. 4. 



 

Fig. 4 Characteristics of AKG C562CM microphone: (a) frequency and (b) polar response [19] 

The microphone and ultrasonic sensor that measures the vertical distance are placed 

in a purpose-made cable protector (Fig. 5a) made of industrial rubber with a hardness of 

90 Shore. For the purpose of collecting audio recordings, the edges of this cable protector

had to be processed at an appropriate angle so that the sound of wheel crossing over it 

should be negligible in the recordings. The processing angle was determined empirically 

and was approximately 150°. In addition to protecting the cables that connect the microphone 

and ultrasonic sensor to the rest of the system, the cable protector is designed to protect both 

the microphone and sensor in the case that the vehicle wheel passes directly over them (Fig. 

5b). When the cable protector is placed at the measuring position, it is not necessary to fasten 

this guide to the base, because it is not subject to slipping and moving due to the structure 

of the rubber and its width of 20 cm. The guide stayed in almost the same position during

the acquisition regardless of how large and heavy the vehicles passing over it were.
 


Fig. 5 Cable and ultrasonic sensor protection: (a) purpose-made cable protector and 

(b) microphone/sensor protection 

The hardware limitations of the Raspberry Pi computer in terms of maximum sampling 
frequency and number of bits for audio signal quantization as well as the need for microphone 
phantom power resulted in the insertion of an A/D converter between the microphone and the 
Raspberry Pi computer. For the purpose of A/D conversion and microphone power supply, a 
dedicated high-quality audio interface iRig Pre HD is employed, which is also a battery-
powered device whose specifications are given in [20]. Additionally, the use of an external 



 
A/D converter enables the Raspberry Pi to run at lower processing power and lower power 
consumption. The developed Python code starts running on the Raspberry Pi computer once the
power is turned on. This code listens to the serial connection on the USB port in
order to receive the information from the Arduino about the vehicle presence. When the
vehicle is detected, a series of processes described in the next subsection is carried out.

2.2. Acquisition procedure 

After the vehicle presence is detected, and in order to save the battery, the Raspberry 
Pi starts the microphone listening mode via the A/D interface. The storage of the stream
in the buffer begins only when the detected sound level is above the set threshold
(the second level of verification). The threshold level was determined empirically as 74 dB.
In this way, an accidental excitation of the sensors that can be caused by the passing of a 
pedestrian, dog or cat is avoided. The audio recording duration is initially set to 5 s and 
after the time has elapsed, the stream stops. In order to avoid an accidental excitation 
potentially caused by the passing of a motorcycle, the stream is additionally checked after 
stopping it. Namely, at the location where the samples were taken, and in most of the 
underground garages, motorcycles are allowed to enter without any obstruction next to 
the ramp, so they are not stopped at the entrance. The check itself is simple: two
seconds after the beginning of the stream, the signal level is compared again with the
threshold set in the previous step (the third level of verification). If the threshold
condition is met in that segment of the stream, the stream is stored as
a wav file on the SD card. The entire procedure is shown as the flowchart given in Fig. 6. 
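The second and third levels of verification can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the authors' code: the calibration offset that maps digital RMS to dB, the 0.5 s length of the segment inspected at the two-second mark, and all function names are hypothetical. Only the 74 dB threshold, the 5 s recording length and the 2 s check time come from the text.

```python
import math

THRESHOLD_DB = 74.0            # empirical threshold quoted in the text
CHECK_AT_SECONDS = 2.0         # time of the third-level check quoted in the text
CHECK_WINDOW_SECONDS = 0.5     # assumed length of the inspected segment
CALIBRATION_OFFSET_DB = 94.0   # hypothetical: level assigned to a full-scale RMS of 1.0


def level_db(samples):
    """RMS level of a block of samples (floats in [-1, 1]) in calibrated dB."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-12)) + CALIBRATION_OFFSET_DB


def passes_second_level(listen_block):
    """Start buffering only if the listened block is louder than the threshold."""
    return level_db(listen_block) > THRESHOLD_DB


def passes_third_level(stream, sample_rate_hz=44_100):
    """Keep the stream only if it is still loud two seconds after it started."""
    start = int(CHECK_AT_SECONDS * sample_rate_hz)
    stop = start + int(CHECK_WINDOW_SECONDS * sample_rate_hz)
    segment = stream[start:stop]
    return bool(segment) and level_db(segment) > THRESHOLD_DB
```

A source that is loud only briefly, such as a motorcycle riding past the ramp, would pass the second level but fail the third, so its stream would not be written to the SD card.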

 
Fig. 6 Acquisition procedure flowchart 



 
The fourth level of verification is a specially developed algorithm where only the engine 

idle mode (stationary signal) is extracted from the existing wav file. The description of this 

procedure is given in the next subsection. 

The initial installation of the system at the entrance ramp of the underground garage 

showed that the system detected only vehicles and that the audio recordings contained only 

signals originating from internal combustion engines. However, the waiting time of vehicles 

above the microphone varied considerably from case to case. Due to this phenomenon, three 

different approaches for audio signal recording (A, B and C) were applied based on the 

activities of ultrasound sensors. In the first one (A), it was defined that after detecting the 

object (vehicle), the ultrasound sensors remained inactive for 5 s until the rest of the system 

finished the audio sample recording. In the approach B, a fixed time of 5 s of sensor inactivity 

after detection was replaced by the time of 8 s. The third approach (C) is related to the 

situation where the sensors were constantly active in order to detect when the vehicle left the 

space above the microphone, thus not sending a command to the rest of the system to start the 

next recording. If the sensors detected the absence of a vehicle in two successive iterations

separated in time by 50 microseconds, the system interpreted this as the vehicle having

left the position. This is important because occasionally one of the sensors

measures greater distance to a vehicle caused by the higher-order reflections of ultrasonic 

waves, due to the long waiting of the vehicle. This is interpreted as non-compliance with the 

presence condition. Such a phenomenon is attributed to the dispersion of ultrasonic waves that 

can occur due to the shape of the vehicle’s body. During the system testing, it was shown that 

this phenomenon was rare. 
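The departure rule of approach C, in which a single spurious "absent" reading must not be mistaken for the vehicle leaving, can be sketched as a small counter. The class and parameter names below are hypothetical, and the sketch ignores the timing between iterations; it only shows the "two successive absences" logic stated in the text.

```python
class DepartureDetector:
    """Reports a departure only after several successive 'absent' readings."""

    def __init__(self, required_absent: int = 2):
        self.required_absent = required_absent
        self._absent_run = 0  # length of the current run of 'absent' readings

    def update(self, present: bool) -> bool:
        """Feed one presence reading; return True once the vehicle is deemed gone."""
        if present:
            self._absent_run = 0      # any 'present' reading resets the run
        else:
            self._absent_run += 1
        return self._absent_run >= self.required_absent


detector = DepartureDetector()
readings = [True, False, True, True, False, False]
print([detector.update(r) for r in readings])
```

In this example the isolated `False` reading (for instance, one caused by a higher-order reflection) does not end the session; only the final pair of absent readings does.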

In terms of the negative effects of constant exposure to the ultrasonic waves, the used 

ultrasonic sensors are of very low power, designed to measure distances of up to 4.5 m, 

which means that the signal level can be negligible at longer distances due to dispersion. If

we look at the configuration of the entrances to the underground garages, the width of the 

passage for vehicles must be at least 3 m. In this way, if a pedestrian passage exists, it can 

only be found at a distance greater than 3 m from the sensor. Besides, within the few 

hours of the acquisition, fewer than 10 people were seen in the pedestrian passage,

and all of them were further than 5 m from the sensors.

2.3. Extraction of idling mode of operation 

In order to extract the stationary part of each recorded audio signal that corresponds to 

the engine idle by the signal processing in the time domain applied here, it is necessary to 

determine the threshold (time moment) after which the non-stationary part of the signal 

should be rejected. Due to the nature of the problem, the stationary part of the signal 

always appears at the signal beginning, see Figs. 7, 8 and 9 given in the next section. 

There are no cases where the idling occurs later (in the middle or at the end of the signal). 

So, it is clear that the threshold needs to be found at a certain time point after the signal 

starts, i.e., at the first moment when the signal becomes non-stationary.  

Based on the analysis of the waveforms of the recorded signals in the time domain, it 

is noticed that at the moment when the signal ceases to be stationary, its amplitude abruptly 

increases. Thus, at that moment, there is a noticeable increase (jump) in the signal envelope. 

The idea for extracting a stationary part of a recorded signal is based on generating the signal 

envelope and calculating the difference between the current and previous envelope values 

along the envelope. While the signal is stationary, the difference between the current and 



 
previous envelope values is expected to be small. On the other hand, at the moment when 

the signal ceases to be stationary, the difference between the current and previous value 

of the envelope must be significantly greater than the difference at time instants before that 

moment. The first time instant from the beginning to the end of the signal where there is a 

significant increase in the difference between the current and the previous envelope value is a 

candidate for setting the threshold. This significant increase needs to be quantified. 

If the signal envelope is denoted as Env(t) and the threshold representing the upper 

time limit of the stationary signal part as tL, the threshold itself can be determined as: 

 













−−=

f

s
L

N

t
AtEnvtEnvt })1()(min{  (1) 

where ts denotes the duration of the signal, and Nf is the number of frames in which the 

signal maxima are calculated in the procedure of generating the signal envelope. A is a 

constant having the value of 0.1 determined empirically. Since it is necessary to set the 

threshold at the first time instant after the envelope jump looking from the signal beginning to 

its end, the smallest value that satisfies the condition in (1) is taken as the threshold tL. More 

precisely, since the time variable t is given in frames used for generating the signal envelope, 

the condition min{Env(t)-Env(t-1)>A} returns an envelope frame in which there is an 

envelope jump indicating a transition from stationary to non-stationary part of the signal. In 

order to obtain the exact time instant for setting the threshold, it is necessary to normalize the 

obtained envelope frame value by ts/Nf. In our case, the frame size for generating the envelope 

is 4000 samples with the frame overlap of 1000 samples. This means that the resolution for 

setting the threshold tL is determined by the frame size, which can be chosen in accordance 

with the nature of the analyzed signal. 
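The envelope-based threshold search can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the envelope is taken as the per-frame maximum of the absolute signal, the signal is normalized to a peak of 1 before A = 0.1 is applied, and the function names are hypothetical. The frame size (4000 samples), overlap (1000 samples) and the scaling by ts/Nf follow the text.

```python
def frame_envelope(samples, frame_size=4000, overlap=1000):
    """Envelope as the maximum absolute value in each (overlapping) frame."""
    if not samples:
        return []
    hop = frame_size - overlap
    # Assumption: amplitudes are normalized so that the global peak equals 1.
    peak = max(abs(s) for s in samples) or 1.0
    envelope = []
    for start in range(0, max(len(samples) - frame_size, 0) + 1, hop):
        frame = samples[start:start + frame_size]
        envelope.append(max(abs(s) for s in frame) / peak)
    return envelope


def idle_cutoff_seconds(samples, sample_rate_hz=44_100, jump=0.1,
                        frame_size=4000, overlap=1000):
    """Return t_L in seconds, or None if no envelope jump larger than `jump` exists."""
    envelope = frame_envelope(samples, frame_size, overlap)
    n_frames = len(envelope)
    if n_frames == 0:
        return None
    duration_s = len(samples) / sample_rate_hz
    for t in range(1, n_frames):
        if envelope[t] - envelope[t - 1] > jump:
            # First frame with a jump, converted to time by t_s / N_f.
            return t * duration_s / n_frames
    return None  # no mode change: the whole recording is treated as idle
```

For a 5 s test signal that is quiet for the first 2 s and loud afterwards, the returned cutoff lands close to 2 s, with a resolution set by the frame hop.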

3. ANALYSIS OF RECORDED AUDIO SIGNALS 

Positioning the acquisition system at the entrance of the underground garage with a ramp 
where it is necessary to stop a vehicle in order to take a token gave the results above the 
expectations in terms of the quality and number of audio recordings. These recordings have 
the following parameters: sampling rate of 44.1 kHz, the bit depth of 16 bits, fixed duration of 
5 s resulting in a file size of approximately 431 kB, which provides the possibility of storing 
approximately 67800 audio samples assuming the effective storage space of 28 GB on the 32 
GB SD card. The power supply, a power bank with a capacity of 10 Ah, lost about
20% of its capacity during 2 hours of recording, showing that the system is able to function with
this power supply for about 10 hours in a completely autonomous way.
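The storage and autonomy figures above follow from simple arithmetic, which the sketch below reproduces. It assumes a mono recording, a 44-byte PCM WAV header and binary units (1 kB taken as 1024 bytes, 1 GB as 1024³ bytes); none of these are stated in the text, so the result matches the quoted values only approximately (about 431 kB per file and roughly 68,000 files).

```python
def wav_file_bytes(sample_rate_hz=44_100, bit_depth=16, duration_s=5,
                   channels=1, header_bytes=44):
    """Size of an uncompressed PCM WAV file (mono and 44-byte header are assumed)."""
    return sample_rate_hz * (bit_depth // 8) * duration_s * channels + header_bytes


def recordings_that_fit(storage_bytes, file_bytes):
    """Number of whole files that fit into the available storage."""
    return storage_bytes // file_bytes


def runtime_hours(fraction_used, hours_observed):
    """Extrapolated battery life, assuming constant consumption."""
    return hours_observed / fraction_used


file_bytes = wav_file_bytes()
print(file_bytes / 1024)                              # file size in kB (1 kB = 1024 B)
print(recordings_that_fit(28 * 1024**3, file_bytes))  # files on 28 GB of usable space
print(runtime_hours(0.20, 2.0))                       # hours on a full power bank
```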

In parallel with the autonomous operation of the system, manual records of the engine 
type by fuel were made, meaning that the samples were labeled manually. This was done to 
identify the possible error, e.g. audio recording that would be unusable due to excessive noise 
of the environment that might be present indoors typically coming from the garage ventilation. 
This case did not happen in practice as a result of a correctly set threshold that determines the 
beginning of the recording. Considering the three approaches mentioned above (A, B and C), 
after analyzing the recordings, the most important fact is that no vehicle passed by the 
acquisition system without triggering the system to record the sound of its engine operation. 
Also, events other than passenger vehicles passing by did not falsely trigger the system, and a 
completely blank recording was not obtained. Table 1 provides a comparative overview of 



 
these three approaches in terms of the number of samples collected as well as the usability of 
the samples. It is worth mentioning that during the collection of audio samples, a very small
percentage of vehicles belonged to the older generation of vehicles. The majority of diesel 
vehicles belonged to the generation of common rail type injection, while the majority of 
gasoline vehicles had multipoint indirect injection. 

Table 1 gives the total number of audio recordings and the number of useful audio recordings.
Here, the latter contain the engine idling mode sounds, while the rest of the recordings still
contain the vehicle engine sounds, but not the idling mode of operation; instead, they contain
the sound of a vehicle leaving the ramp. The large majority of recordings are useful,
and their percentage in reference to the total number of recordings is above 90%. This
percentage is the highest for the ultrasonic detection approach C, where it is close to 97%.

By comparing three ultrasonic detection approaches from Table 1, it can be noticed 
that the approach A with a fixed time interval of detection (sensor inactivity) of 5 s gave 
the most audio samples, as many as 202% of useful recordings in relation to the number 
of vehicles. This approach is primarily suitable for generating the largest possible dataset, 
but it is not suitable from the point of view of efficient usage of the storage resources. If 
the system is used employing this approach for detection and recognition of the engine 
type in real conditions, there will be cases where the same vehicle is detected more than 
once. Strictly speaking, this increased number of recorded audio signals for some 
vehicles could have certain detrimental effects on the machine/deep learning due to over-
representation of these vehicles in comparison to others. Although the number of 
recordings for the majority of vehicles is at most two, these effects will be analyzed in the next
phases of the research. Besides, if necessary, the redundant recordings for the same 
vehicle could easily be removed from the dataset according to the time of recording. 

Table 1 Comparative overview of three different detection approaches (A, B and C) in 

terms of the number of samples collected as well as the usability of the samples 

| Quantity | A (sensors inactive for 5 s after vehicle detection) | B (sensors inactive for 8 s after vehicle detection) | C (continuous detection of vehicles by sensors) |
|---|---|---|---|
| Number of vehicles that passed through the acquisition system | 50 | 100 | 100 |
| Number of detected vehicles | 50 | 100 | 100 |
| Total number of audio recordings | 111 | 143 | 122 |
| Number of useful audio recordings | 101 | 133 | 118 |
| Number of idle mode records only (without any additional processing) | 69 | 97 | 109 |
| Percentage of vehicles detected | 100% | 100% | 100% |
| Percentage of useful recordings in relation to the total number of recordings | 90.99% | 93% | 96.72% |
| Percentage of useful recordings in relation to the number of sampled vehicles | 202% | 133% | 118% |
| Percentage of recordings of idle mode only without additional processing in relation to the number of sampled vehicles | 138% | 97% | 109% |
| Percentage of recordings not requiring the fourth level of verification in relation to the total number of recordings | 62.16% | 67.83% | 89.34% |
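The percentage rows of Table 1 can be recomputed from its count rows, which is a quick way to check how each ratio is defined. The sketch below uses only the counts given in the table; the dictionary keys and function names are hypothetical labels chosen for this illustration.

```python
# Count rows of Table 1 for the three detection approaches.
COUNTS = {
    "A": {"vehicles": 50, "total": 111, "useful": 101, "idle_only": 69},
    "B": {"vehicles": 100, "total": 143, "useful": 133, "idle_only": 97},
    "C": {"vehicles": 100, "total": 122, "useful": 118, "idle_only": 109},
}


def pct(part, whole):
    """Ratio expressed as a percentage."""
    return 100.0 * part / whole


def derived_rows(c):
    """Percentage rows of Table 1 computed from the count rows."""
    return {
        "useful_vs_total": pct(c["useful"], c["total"]),
        "useful_vs_vehicles": pct(c["useful"], c["vehicles"]),
        "idle_only_vs_vehicles": pct(c["idle_only"], c["vehicles"]),
        "idle_only_vs_total": pct(c["idle_only"], c["total"]),
    }


for name, counts in COUNTS.items():
    print(name, {k: round(v, 2) for k, v in derived_rows(counts).items()})
```

The last ratio (idle-only recordings over all recordings) is the share of recordings that do not need the fourth level of verification, which is why approach C requires the least additional processing.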



 
The approach B (time interval of sensor inactivity of 8 s) also gave good results in terms 

of the number of vehicles detected and the amount of audio recordings. However, it has the 

lowest percentage of recordings of idling mode only without additional processing compared 

to the number of sampled vehicles. This approach has more efficient usage of memory 

resources compared to the first approach. 

The most complex approach (C), continuous detection with the recognition of the next 

vehicle, gave the fewest audio recordings in relation to the number of detected vehicles. On

the other hand, this approach is the most efficient in terms of memory utilization, 

achieving a high percentage of clean recordings. In this way, the lowest redundancy 

among samples and the highest percentage of useful recordings in relation to the total 

number of recordings were obtained. The latter led to the least need for additional 

processing (saving CPU resources) and additional power from the power supply. 

Within all three approaches from A to C, one or two audio recordings per vehicle 

were obtained for the majority of vehicles. Here, the first recording represented the 

engine idling stationary mode without exceptions, Fig. 7. In most of the samples, the 

second recording (in some cases the last one) partially contained the engine idling mode 

followed by an increase in the crankshaft speed and partial engine load mode in order to 

accelerate the vehicle, as shown in Figs. 8 and 9. There were no cases where the partial 

load mode of the engine appeared before the idling mode in the recordings. In these three 

figures (Figs. 7, 8 and 9), the audio signals of approximately the same generation of 

vehicles are presented. Here, the signals’ amplitudes are normalized; hence the focus is 

on differences in the signal level caused by the change in operating mode. 

 
Fig. 7 Audio signal of (a) petrol and (b) diesel engine at idle, without changing the mode 

 
Fig. 8 Audio signal of (a) petrol and (b) diesel engine having early operation mode 

change from idle to load mode (during the recording interval) 



 
Fig. 9 Audio signal of (a) petrol and (b) diesel engine having late operation mode change 

from idle to load mode (during the recording interval) 

Calculation of the threshold (i.e., the time instant of the audio signal until which the 

engine is in the idling mode) used for extraction of idling mode of operation is illustrated 

in Figs. 10 and 11, where the threshold is marked with a purple vertical line. The terms 

“early” and “late” refer to the cases where the operation mode change happens 

earlier (up to 1 s) and later (after 1 s) in the recorded signal, respectively. In the recorded 

signals where there is no change in the engine operation mode, the threshold (cutoff time) 

cannot be determined in the described way. In such a case, the entire audio track is 

treated as engine idle and is used for further analysis and processing.  
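
The envelope-based extraction of the idling mode can be illustrated with a minimal sketch. It is not the implementation used in this work: the frame-wise RMS envelope, the function names and the parameters `frame_len`, `ref_frames` and `ratio` are assumptions introduced only for illustration. The sketch assumes that the first few frames represent idle and takes as cutoff the start of the first frame whose level clearly exceeds that reference. If no frame does, the whole track is kept as idle.

```python
import math


def rms_envelope(signal, frame_len):
    """Return one RMS value per non-overlapping frame of frame_len samples."""
    env = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        env.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return env


def idle_cutoff_time(signal, fs, frame_len=256, ref_frames=4, ratio=2.0):
    """Estimate the time (in s) at which the engine leaves the idling mode.

    Assumption: the first ref_frames frames are idle. The cutoff is the
    start of the first later frame whose RMS exceeds ratio times the mean
    RMS of those reference frames. Returns None if no frame exceeds it.
    """
    env = rms_envelope(signal, frame_len)
    if len(env) <= ref_frames:
        return None
    ref = sum(env[:ref_frames]) / ref_frames
    for i in range(ref_frames, len(env)):
        if env[i] > ratio * ref:
            return i * frame_len / fs
    return None


def extract_idle(signal, fs, **kwargs):
    """Return the idle part of the signal, or all of it if no change is found."""
    cutoff = idle_cutoff_time(signal, fs, **kwargs)
    if cutoff is None:
        return list(signal)
    return list(signal[:int(round(cutoff * fs))])
```

A limitation of this simple variant is that a mode change occurring within the reference frames raises the reference level, so very early changes may be missed unless `ref_frames` is reduced.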

 
Fig. 10 Waveform and envelope of the audio signal of (a) petrol and (b) diesel engine with 

an early change of operation mode (the threshold is marked by a vertical line) 

 
Fig. 11 Waveform and envelope of the audio signal of (a) petrol and (b) diesel engine 

with a late change of operation mode (the threshold is marked by a vertical line) 

The waveforms of the characteristic audio signals extracted in the described way are 

presented in Figs. 12 and 13. For the presented case of the vehicle using diesel fuel where 

an early operation mode change (almost at the very beginning of the recording) occurred, 

the calculated threshold (cutoff) time was also very close to the beginning of the signal 



 
(Fig. 10b), which means that this recording is rejected using the function for checking the 

duration of the stationary mode. This duration can be set according to the requirement 

related to the minimal length of the signals. Depending on a particular need, the signal 

length may be either shorter or longer. In the present case, the minimum duration of the stationary 

mode is set to 0.5 s, meaning that the minimum signal length is 0.5 s. 
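
The check of the stationary-mode duration can be sketched as follows. The function name and its arguments are hypothetical. Only the 0.5 s limit is taken from the text, and the numerical values in the usage lines are invented examples rather than measured cutoff times.

```python
MIN_IDLE_S = 0.5  # minimum stationary-mode duration stated in the text


def keep_recording(cutoff_s, total_s, min_idle_s=MIN_IDLE_S):
    """Return True if the idle segment is long enough to be kept.

    cutoff_s is the estimated end of the idling mode in seconds, or None
    when no mode change was found (the whole recording is then idle).
    """
    idle_s = total_s if cutoff_s is None else min(cutoff_s, total_s)
    return idle_s >= min_idle_s


# Illustrative values only (not measurements from the dataset):
early_change = keep_recording(0.05, 2.0)   # cutoff near the start: rejected
late_change = keep_recording(1.4, 2.0)     # long idle segment: kept
no_change = keep_recording(None, 2.0)      # whole track is idle: kept
```

Raising or lowering `min_idle_s` corresponds to requiring longer or shorter signals, as discussed above.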

 
Fig. 12 Audio signal of (a) petrol and (b) diesel engine at idle extracted from the recordings 

with a late change of operation mode 

 
Fig. 13 Audio signal of (a) petrol and (b) diesel engine at idle extracted from the 

recordings with an early change of operation mode 

As the mapping of audio signals into an adequate image format [21, 22], such as 

spectrogram-like images, is increasingly used in modern signal processing and deep 

learning, the obtained audio signals are also presented in the form of spectrograms, see 

Figs. 14, 15 and 16. There are some properties present in the spectrograms of both engine 

types (petrol and diesel), such as stronger components at low and mid frequencies than at 

high frequencies as well as rather steady-state behavior along the time axis. However, 

these images also contain certain differences between the sounds of petrol and diesel 

engines, such as a more uniform energy distribution along the frequency axis for the petrol 

engine and more prominent individual frequency components for the diesel engine. A more 

detailed analysis of the recorded audio signals and their representations in different 

domains, as well as of the correlation between the signals and vehicle types by fuel, will be done 

in the next phase of the research. 
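
How a spectrogram of the kind shown in Figs. 14 to 16 is obtained can be illustrated by a short sketch. The sketch does not reproduce the settings used for the figures: the frame length, hop size and Hann window are assumptions, and a plain discrete Fourier transform from the standard library is used only to keep the example self-contained (in practice an FFT routine from a numerical library would be preferred).

```python
import cmath
import math


def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def spectrogram_db(signal, frame_len=64, hop=32, floor_db=-120.0):
    """Magnitude spectrogram in dB.

    Rows are time frames, columns are frequency bins 0 .. frame_len // 2.
    Bin k corresponds to the frequency k * fs / frame_len.
    """
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    # Precomputed DFT kernel, one row per frequency bin.
    kernel = [[cmath.exp(-2j * math.pi * k * n / frame_len)
               for n in range(frame_len)] for k in range(n_bins)]
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(n_bins):
            mag = abs(sum(x * w for x, w in zip(frame, kernel[k])))
            level = 20 * math.log10(mag) if mag > 0 else floor_db
            row.append(max(level, floor_db))
        rows.append(row)
    return rows
```

For a steady tone, every row of the result peaks in the same bin, which corresponds to the steady-state behavior along the time axis noted above. Prominent individual components, as observed for the diesel engine, would appear as persistent horizontal lines in such an image.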



 
Fig. 14 Spectrogram of (a) petrol and (b) diesel engine audio signal at idle, without changing 

the mode and without applying the idling mode extraction 

 
Fig. 15 Spectrogram of (a) petrol and (b) diesel engine audio signal at idle with a late 

change of operation mode 

 
Fig. 16 Spectrogram of (a) petrol and (b) diesel engine audio signal at idle with an early 

change of operation mode 

4. CONCLUSIONS 

Considering the number of recordings containing exclusively the idling mode of the 

vehicles relative to the number of sampled vehicles, it can be seen that the developed 

acquisition system has collected at least one such recording for each vehicle. Also, the 

system has not recorded a single blank audio file, and it is rather robust to false 

triggering. In addition, the selected amount of memory proved to be sufficient, and the 

most critical part of the system, the battery power, gave very satisfactory results in terms 

of system autonomy. Since 250 vehicles in total passed behind the microphone and 

sensors placed on the ground without any consequences for functionality, the robustness 

requirement has been satisfied, and the suitability for unattended use has been demonstrated. 



 
The additional processing developed for extracting exclusively the 

engine idle mode from the entire audio recording has made it possible to create a dataset of audio 

samples containing only this target mode of operation. The acquisition system has proven 

to be efficient for recording the sound of a passenger vehicle at idle regardless of the type 

of fuel. The number of audio recordings can also be affected by the approach applied for 

detecting the presence of a vehicle using ultrasound sensors. Depending on the approach, either a 

larger number of recordings with higher redundancy or a smaller number with lower redundancy 

is obtained. 

By using the developed acquisition system, a dataset has been created consisting of 352 

audio recordings for 250 vehicles containing the sound of engines in the idling mode of 

operation. This acquisition system can find application in different use cases, including 

control of vehicle entry into restricted areas of smart cities, prevention of misfueling at gas 

stations, optimization of road usage or noise prevention based on engine fuel type. In such 

cases, this proof-of-concept system could be implemented as an embedded system on a 

dedicated single platform. 

Depending on a particular application and its requirements, the acquisition system might 

be modified to become even less demanding. Thus, taking into account relatively high 

sound pressure levels at the microphone (above 74 dB) and proximity of the source, the 

condenser AKG C562CM microphone might be replaced by an electro-dynamic microphone 

not requiring phantom power. Since the dynamic range of the acquired signals is not 

expected to be that large, the bit depth might be smaller than the 16 bits used here. In 

addition, after developing an adequate classifier and considering the useful frequency range, 

it would be worthwhile to explore the option of reducing the sampling frequency.  
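
The trade-off between bit depth and dynamic range can be estimated with the standard relation for an ideal uniform quantizer driven by a full-scale sine, SNR ≈ 6.02 N + 1.76 dB for N bits. The sketch below only evaluates this textbook formula. The 60 dB requirement in the usage line is a hypothetical value, not a figure measured in this work.

```python
import math

DB_PER_BIT = 20 * math.log10(2)      # about 6.02 dB per bit
SINE_OFFSET_DB = 10 * math.log10(1.5)  # about 1.76 dB for a full-scale sine


def ideal_quantization_snr_db(bits):
    """Theoretical SNR of an ideal uniform quantizer with a full-scale sine."""
    return DB_PER_BIT * bits + SINE_OFFSET_DB


def min_bits_for_dynamic_range(range_db):
    """Smallest bit depth whose ideal SNR reaches the required range in dB."""
    return math.ceil((range_db - SINE_OFFSET_DB) / DB_PER_BIT)


snr_16 = ideal_quantization_snr_db(16)        # bit depth used in this work
bits_needed = min_bits_for_dynamic_range(60)  # hypothetical 60 dB requirement
```

Real converters fall short of the ideal value, so some margin above the computed bit depth would be needed in practice.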

The generated dataset of audio samples will play an important role in future work for 

developing a system for automatic recognition of the type of engine based on the used 

fuel. This system will be designed by applying an adequate approach of deep or machine 

learning for classification and employing the created dataset for model training and 

testing. Based on the samples from the generated dataset, it can be concluded that 

spectrograms of petrol and diesel engines at idle appear to differ, which provides a 

solid basis for achieving high accuracy in engine type classification. 

Acknowledgment: This work has been supported by the Ministry of Education, Science and 

Technological Development of the Republic of Serbia, contract no. 451-03-68/2022-14/200102.  

REFERENCES  

[1] S. Das, A. Dey, A. Pal and N. Roy, "Applications of Artificial Intelligence in Machine Learning: Review and 
Prospect", Int. J. Comput. Appl., vol. 115, no. 9, pp. 31-41, April 2015. 

[2] P. Dhanalakshmi, S. Palanivel and V. Ramalingam, "Classification of Audio Signals Using SVM and RBFNN", 
Expert Syst. Appl., vol. 36, no. 3, part 2, pp. 6069-6075, 2009. 

[3] P. Dhanalakshmi, S. Palanivel and V. Ramalingam, "Classification of Audio Signals Using AANN and GMM", 
Appl. Soft. Comput., vol. 11, no. 1, pp. 716-723, 2011. 

[4] H. Ponce, P. Ponce and A. Molina, "Adaptive Noise Filtering Based on Artificial Hydrocarbon Networks: An 
Application to Audio Signals", Expert Syst. Appl., vol. 41, no. 14, pp. 6512-6523, 2014. 

[5] Z. Liu, J. Huang, Y. Wang and T. Chen, "Audio feature extraction and analysis for scene classification", In 
Proceedings of First Signal Processing Society Workshop on Multimedia Signal Processing, Princeton, NJ, 
USA, 23-25 June 1997, pp. 343-348. 



 
[6] T. Birtchnell, "Listening Without Ears: Artificial Intelligence in Audio Mastering", Big Data & Society, vol. 5, 
no. 2, July 2018.  

[7] G. P. Chossière, R. Malina, F. Allroggen, S. D. Eastham, R. L. Speth and S. R. H. Barrett, "Country- and 
Manufacturer-Level Attribution of Air Quality Impacts due to Excess NOx Emissions from Diesel Passenger 
Vehicles in Europe", Atmospheric Environ., vol. 189, pp. 89-97, Sept. 2018. 

[8] M. Milivojčević, F. Pantelić and D. Ćirić, "Pozicioniranje mikrofona prilikom snimanja audio karakteristika motora 
putničkih vozila" (Microphone positioning when recording audio characteristics of passenger car engines), In 
Proceedings of 63rd National Conference on Electrical, Electronic and Computing Engineering ETRAN, 
Srebrno Jezero, Serbia: 3-6 June 2019, pp. 58-62 (in Serbian). 

[9] M. Milivojčević, F. Pantelić and D. Ćirić, "Comparison of frequency characteristics of sound generated by 
internal combustion engines depending on fuel", In Proceedings of 26th Noise and Vibration, Niš, Serbia: 6-7 
December 2018, pp. 115-120. 

[10] N. Evans, Automated Vehicle Detection and Classification using Acoustic and Seismic Signals. Ph.D. Thesis, 
University of York, 2010. 

[11] H. Frederick, A. Winda and M. Iwan Solihin, "Automatic petrol and diesel engine sound identification based on 
machine learning approaches", In Proceedings of the International Conference on Automotive, Manufacturing, 
and Mechanical Engineering. Bali, Indonesia: 26-28 September 2018, published at E3S Web of Conferences, 
vol. 130, article no. 01011. 

[12] A. D. Mayvana, S. A. Beheshtib and M. H. Masoom, "Classification of Vehicles Based on Audio Signals using 
Quadratic Discriminant Analysis and High Energy Feature Vectors", Int. J. Soft Comput., vol. 6, no. 1, pp. 53-64, 
Feb. 2015. 

[13] A. Wieczorkowska, E. Kubera, T. Słowik and K. Skrzypiec, "Spectral Features for Audio Based Vehicle and 
Engine Classification", J. Intell. Inf. Sys., vol. 50, pp. 265-290, 2018. 

[14] E. Alexandre, L. Cuadra, S. Salcedo-Sanz, A. Pastor-Sánchez and C. Casanova-Mateo, "Hybridizing Extreme 
Learning Machines and Genetic Algorithms to Select Acoustic Features in Vehicle Classification Applications", 
Neurocomput., vol. 152, pp. 58-68, March 2015. 

[15] S. D. Badiger and M. UttaraKumari, "Vehicle Classification Using Machine Learning Algorithms Based on the 
Vehicular Acoustic Signature", Sci. Tech. Dev., vol. 8, no. 11, pp. 369-374, Nov. 2019. 

[16] Ultrasonic Waterproof Range Finder datasheet. Available at: https://www.jahankitshop.com/getattach.aspx?id=4635&Type=Product. 

[17] A. Pajankar, Kickstart to Arduino Nano. Susteren, The Netherlands: Elektor International Media, 2022.  

[18] B. R. Kent, Science and Computing with Raspberry Pi. San Rafael, USA: Morgan & Claypool Publishers, 2018. 

[19] C562 CM specifications. Available at: https://www.akg.com/Microphones/Boundary%20Layer%20Microphones/C562CM.html. 

[20] Digital high definition microphone interface specifications. Available at: https://www.ikmultimedia.com/products/irigprehd/. 

[21] S. Amiriparian, M. Gerczuk, S. Ottl, N. Cummins, M. Freitag, S. Pugachevskiy, A. Baird and B. Schuller, 
"Snore sound classification using image-based deep spectrum features", In Proceedings of Interspeech 2017, 
Stockholm, Sweden, August 20–24, 2017, pp. 3512-3516. 

[22] D. Ćirić, Z. Perić, J. Nikolić and N. Vučić, "Audio signal mapping into spectrogram-based images for deep learning 
applications", In Proceedings of 20th International Symposium Infoteh-Jahorina (INFOTEH), East Sarajevo, 
Bosnia and Herzegovina: March 17-19, 2021, pp. 1-6.