HUNGARIAN JOURNAL OF INDUSTRY AND CHEMISTRY Vol. 48(1) pp. 3–10 (2020)
hjic.mk.uni-pannon.hu
DOI: 10.33927/hjic-2020-02

IMPROVING THE EFFICIENCY OF NEURAL NETWORKS WITH VIRTUAL TRAINING DATA

JÁNOS HOLLÓSI*1,2, RUDOLF KRECHT2, NORBERT MARKÓ2, AND ÁRON BALLAGI2,3

1 Department of Information Technology, Széchenyi István University, Egyetem tér 1, Győr, 9026, HUNGARY
2 Research Center of Vehicle Industry, Széchenyi István University, Egyetem tér 1, Győr, 9026, HUNGARY
3 Department of Automation, Széchenyi István University, Egyetem tér 1, Győr, 9026, HUNGARY

*Correspondence: hollosi.janos@sze.hu

At Széchenyi István University, an autonomous racing car for the Shell Eco-marathon is being developed. One of the main tasks is to create a neural network which segments the road surface, the protective barriers and other components of the racing track. The difficulty with this task is that no suitable dataset exists for special objects such as protective barriers, and the available dataset is limited in size. Therefore, computer-generated virtual images from a virtual city environment are used to expand the dataset. In this work, the effect of computer-generated virtual images on the efficiency of different neural network architectures is examined. In the training process, real images and computer-generated virtual images are mixed in several ways, and three different neural network architectures are trained for road surface and protective barrier detection. The results indicate how the datasets should be mixed and to what extent the mixing improves efficiency.

Keywords: neural network, virtual training data, autonomous vehicle

1. Introduction

Shell Eco-marathon is a unique international competition held by Royal Dutch Shell Plc. This event challenges university students to design, develop, build and drive the most energy-efficient racing cars. Our University's racing team, the SZEnergy Team, has been a successful participant in the Shell Eco-marathon for over 10 years. Two years ago, Shell introduced the Autonomous UrbanConcept (AUC) challenge, a separate competition for self-driving vehicles that participate in the Shell Eco-marathon. Participants in the AUC challenge have to complete five different tasks, e.g. parking in a dedicated parking rectangle, avoiding obstacles on a straight track and driving one lap of the track autonomously.

Our long-term goal is to prepare for the AUC challenge. One of the main tasks is to create an intelligent system which perceives the environment of our racing car, e.g. other vehicles, the road surface and other components of the racing track. In this paper, only the segmentation of the road surface and of the protective barriers is considered. The segmentation is performed with neural networks, because such networks are among the best tools for solving problems of visual information-based detection and segmentation. Many high-performance neural network architectures are available, such as AlexNet by Krizhevsky et al. [1], VGGNet by Simonyan and Zisserman [2], GoogLeNet by Szegedy et al. [3], Fully Convolutional Networks by Shelhamer et al. [4], U-Net by Ronneberger et al. [5], ResNet by He et al. [6] and the Pyramid Scene Parsing Network by Zhao et al. [7]. Training neural networks requires a large amount of training data. However, in this case, the number of training samples is insufficient, e.g.
no training images of protective barriers are available, and the generation and annotation of real-world data are labour-intensive and time-consuming. Computer simulation environments are therefore used to generate training data for this task. Some attempts to apply virtually generated data to train neural networks have been made. Peng et al. [8] demonstrated CAD model-based convolutional neural network training for joint object detection. Tian et al. [9] presented a pipeline to construct virtual scenes and virtual datasets for neural networks, and showed that mixing virtual and real data when training neural networks for joint object detection helps to improve performance. Židek et al. [10] presented a new approach to joint object detection using neural networks trained on virtual model-based datasets. In this paper, an attempt is made to show the effects of computer-generated training data on the learning process of different network architectures.

The paper is structured as follows: in Section 2, the virtual simulation environment used for generating training data is described; in Section 3, our neural network architectures are presented; in Section 4, the training process of the networks is outlined; in Section 5, our results and experiences are shared; finally, in Section 6, our conclusions are stated.

2. Our virtual environment

Our aim is to create highly realistic image sets that depict racing tracks which follow the rulebook of the Shell Eco-marathon Autonomous UrbanConcept. In order to ensure repeatability and a simple parameter setup, the creation of complete, textured 3D models of the racing tracks is advised. These simulated environments can be used to create images with the desired weather and lighting conditions by scanning the track environment with a camera moving at a predefined constant speed. The images created in this way can be processed further, e.g. by segmentation and clustering of different types of objects such as the road surface, protective barriers and vegetation. Based on the characteristics of the predefined task, the requirements of the simulation environment can be enumerated:

• highly realistic appearance,
• easy use of textures,
• fast workflow,
• characteristics definable by parameters (parametric lights, weather conditions),
• modular environment construction,
• importability of external CAD models.

Unreal Engine 4 [11] is a game engine designed for the fast creation of modular simulated environments through the use of modular relief, vegetation and building elements. In these environments, actors based on external CAD models can be used. Fields of engineering that apply different visual sensors and cameras require computer simulation technologies very similar to those of the video game industry. Video games need to be highly realistic as well as efficient due to limited computational capacity, and the requirements are the same for the simulation of camera-equipped vehicles. Highly realistic computer simulations reduce the cost and duration of real-life tests and camera calibrations. It is also important to mention that by using technology implemented and/or developed by the video game industry, the support of a vast developer community is available.
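To make this data-generation workflow concrete, the following is a minimal Python sketch of how paired realistic and segmentation images could be collected from such an Unreal Engine environment through the AirSim simulator module that is described later in this section. The camera name, mesh-name patterns, object IDs and file names are illustrative assumptions, not the settings of our actual project.

```python
import airsim
import numpy as np

# Connect to the AirSim plugin running inside the Unreal Engine environment.
client = airsim.CarClient()
client.confirmConnection()

# Assign segmentation IDs to meshes; the regex patterns and IDs are placeholders
# for whatever the Unreal project actually calls its road and barrier actors.
client.simSetSegmentationObjectID(r"[\w]*", 0, True)         # background
client.simSetSegmentationObjectID(r"road[\w]*", 1, True)     # road surface
client.simSetSegmentationObjectID(r"barrier[\w]*", 2, True)  # protective barriers

# Request one realistic image and its segmentation counterpart from camera "0".
responses = client.simGetImages([
    airsim.ImageRequest("0", airsim.ImageType.Scene, False, False),
    airsim.ImageRequest("0", airsim.ImageType.Segmentation, False, False),
])

for name, response in zip(("scene", "segmentation"), responses):
    # Uncompressed responses arrive as a flat buffer of 8-bit pixels.
    pixels = np.frombuffer(response.image_data_uint8, dtype=np.uint8)
    image = pixels.reshape(response.height, response.width, 3)
    airsim.write_png(f"{name}_0000.png", image)
```

Repeating such a capture at fixed time intervals while the vehicle follows its pre-defined path yields the paired realistic and segmented images used for training.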
Since our goal is to develop image-perceiving solutions for the Shell Eco-marathon Autonomous UrbanConcept challenge, it is important to carefully follow the rules of this competition with regard to the racing tracks. The simulated environments and racing tracks created in Unreal Engine strictly follow the rules defined by the aforementioned rulebook. These rules state that the self-driving vehicles have to compete on racing tracks equipped with protective barriers of a known height, painted in alternating red and white segments. It is also defined that every racing track contains three painted line markings: a green one to denote the starting position, a yellow one to trigger the self-driving mode and a red one to mark the finish line. Because the racing tracks and tasks are well defined, it is crucial to create accurate models of the expected environments; differences between real and simulated environments might lead further development in the wrong direction.

Two simulated test environments were created. The first one was based on a readily available city model with streets corresponding to a typical racing track. Barrier elements were added to the roads to ensure that the racing track complies with the requirements outlined in the rulebook. This model includes road surface defects and textures to make the detection of the road surface robust. In order to create image sets based on this environment model, a vehicle model equipped with a camera travelled around the racetrack on a pre-defined path. The camera was set to take pictures at pre-defined time intervals. The image set was annotated using a module called AirSim. AirSim is an open-source, cross-platform simulation platform built on Unreal Engine, with a Unity release also available. This simulator module provides a built-in Python API (Application Programming Interface) that supports image segmentation. By using this API, the necessary realistic and segmented image datasets were created. Some example pairs of images from our virtual dataset are presented in Fig. 1.

Figure 1: Example images from the training set.

In order to prepare for all the tasks defined in the rulebook, multiple models of racing tracks were created. All such models are based on the same environment model, which includes vegetation and the sky as shown in Fig. 2. The models of sections of racing track were realized according to the challenges defined in the rulebook. The CAD models representing elements of the racing track were custom-made to comply with the shapes, sizes and colors outlined in the rulebook. The generated sections of racing track can be used to simulate handling (slalom) and parking tasks. This virtual racing track is shown in Fig. 3. The image sets were created by a moving camera in the environment, and segmentation was carried out by changing the textures.

Figure 2: Basic environment of racing tracks.
Figure 3: Parking place and slalom course.

3. Neural network architectures

Three different neural network architectures are implemented in this work: FCN, U-Net and PSPNet. All of them are designed for image segmentation, with an input image size of 256×512×3 and an output size of 256×512×1. Every network is trained for the segmentation of the road surface and protective barriers.
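As an illustration of this shared input/output contract, the following is a minimal Python sketch of how an image and its segmentation mask could be brought to the common 256×512×3 and 256×512×1 sizes before training. The file layout and the 0/255 mask convention are assumptions made only for this example.

```python
import cv2
import numpy as np

def load_pair(image_path, mask_path, width=512, height=256):
    """Load one training pair and resize it to the common network size."""
    image = cv2.imread(image_path, cv2.IMREAD_COLOR)        # H x W x 3
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)      # H x W

    # OpenCV expects (width, height); nearest-neighbour keeps mask labels intact.
    image = cv2.resize(image, (width, height), interpolation=cv2.INTER_LINEAR)
    mask = cv2.resize(mask, (width, height), interpolation=cv2.INTER_NEAREST)

    image = image.astype(np.float32) / 255.0                 # 256 x 512 x 3
    mask = (mask > 127).astype(np.float32)[..., np.newaxis]  # 256 x 512 x 1
    return image, mask
```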
3.1 FCN

The Fully Convolutional Network (FCN) [4] architecture is built from convolutional layers only; the basic idea is to extend effective classification neural networks to segmentation tasks. Our FCN architecture is shown in Fig. 4. Let:

γ = (conv, bn, ReLU)  (1)
b1 = (γ, γ, mp)  (2)
b2 = (γ, γ, γ, mp)  (3)

where conv denotes a convolutional layer, bn represents a batch normalization layer, ReLU stands for a rectified linear activation unit and mp is a max pooling layer. Let:

B1 = (b1, b1, b2)  (4)
B2 = (b2)  (5)
B3 = (b2, γ, γ, γ)  (6)
x = (conv, bn)  (7)
y = (ReLU, softmax)  (8)
Z = (ReLU, softmax)  (9)

where softmax denotes a softmax layer. In this implementation, the dimensions of all convolutional layers are 3×3, except for the three fully connected layers in block B3, which are implemented as 7×7 convolutions. In block B1, the first two convolutional layers both contain 64 filters, the third and fourth both contain 128 filters, and the last three convolutional layers each contain 256 filters. The convolutional layers in block B2 contain 512 filters in total. The first three convolutional layers in block B3 contain 512 filters in total, and the fully connected layers are based on 4096 filters in total.

Figure 4: FCN architecture.

3.2 U-Net

The U-Net [5] neural network architecture was originally created for biomedical image segmentation. It is based on the FCN, and the network can be divided into two main blocks, namely the downsampling and upsampling blocks. Our implementation is shown in Fig. 5. Let:

D1 = D2 = (γ, γ, mp)  (10)
D3 = D4 = B = (γ, γ, γ, mp)  (11)
U1 = U2 = U3 = (convt, bn, ReLU, γ, γ)  (12)
U4 = (convt, bn, ReLU, γ, γ, conv, softmax)  (13)

where convt is a transposed convolution layer. In the U-Net, the dimensions of all convolutions are 3×3 and those of all transposed convolutions are 2×2. The numbers of convolutional filters are as follows: each convolutional layer in D1 consists of 64, in D2 of 128, in D3 of 256 and in D4 as well as in B of 512 filters. The upsampling block is very similar: U1 consists of 512, U2 of 256, U3 of 128 and U4 of 64 filters.

Figure 5: U-Net architecture.

3.3 PSPNet

The Pyramid Scene Parsing Network (PSPNet) [7] was judged to be the best architecture in the ImageNet Scene Parsing Challenge in 2016 [12]. The main building block of the PSPNet is a pyramid pooling module, in which the network fuses features at four different pyramid scales. Our PSPNet-based architecture is shown in Fig. 6. Let:

B1 = (γ, γ, γ, mp)  (14)
C = (γ, γ, conv, bn) + (conv, bn)  (15)
I = (γ, γ, conv, bn)  (16)
p = (avg, conv)  (17)
P1 = (p)  (18)
P2 = (p, p)  (19)
P3 = (p, p, p)  (20)
P4 = (p, p, p, p)  (21)
B2 = (γ, dropout, conv, convt, softmax)  (22)

where avg denotes an average pooling layer and dropout represents a dropout unit. In block B1, the dimensions of all convolutions are 3×3. In blocks C and I, the first and third convolutions are 3×3 and the second is 1×1. In block B2, the first convolution is 3×3 and the second is 1×1; the dimensions of the transposed convolution are 16×16. Each of the first two convolutions in block B1 consists of 64 filters, and the last one of 128. The first block C and the first two I blocks contain 64, 64 and 256 filters, respectively, while the second block C and the following three I blocks consist of 128, 128 and 512 filters, respectively. The third block C and the following five I blocks contain 256, 256 and 1024 filters, respectively, and the fourth block C along with the last two I blocks consist of 512, 512 and 2048 filters, respectively.

Figure 6: PSPNet architecture.
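To ground the block notation used throughout this section, the following is a minimal Keras sketch of the composite block γ = (conv, bn, ReLU) and of the blocks b1, b2 and B1 from Eqs. (1)-(4). The helper names are ours and the kernel sizes follow the 3×3 convention stated above; this is an illustrative sketch under those assumptions, not the authors' actual implementation.

```python
from tensorflow.keras import layers

def gamma(x, filters, kernel_size=3):
    """gamma = (conv, bn, ReLU): convolution, batch normalization, ReLU."""
    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def b1(x, filters):
    """b1 = (gamma, gamma, mp): two gamma blocks followed by max pooling."""
    x = gamma(x, filters)
    x = gamma(x, filters)
    return layers.MaxPooling2D(pool_size=2)(x)

def b2(x, filters):
    """b2 = (gamma, gamma, gamma, mp): three gamma blocks and max pooling."""
    for _ in range(3):
        x = gamma(x, filters)
    return layers.MaxPooling2D(pool_size=2)(x)

def fcn_block_B1(x):
    """B1 = (b1, b1, b2) of Eq. (4) with the 64/128/256 filter counts above."""
    x = b1(x, 64)
    x = b1(x, 128)
    return b2(x, 256)
```

The U-Net and PSPNet blocks of Eqs. (10)-(22) can be composed from the same γ helper in an analogous way.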
4. Training with virtual data

An attempt was made to improve the accuracy of neural networks using computer-generated virtual training data originating from the virtual city environment. Several mixed datasets were compiled which contain real-world images and computer-generated virtual images. The real-world images originate from the Cityscapes Dataset, a large-scale dataset for semantic segmentation [13]. The dataset contains 5000 images with fine annotations, recorded in 50 different cities under various weather conditions. It covers 30 object classes, e.g. roads, sidewalks, people, vehicles, traffic lights, terrain and sky, but in this research only road surface segmentation is examined. The computer-generated images originate from the simulation environment described in Section 2.

For road surface segmentation, five different datasets were created from the Cityscapes Dataset and our collection of virtual images. Table 1 shows how these two collections were mixed.

Table 1: Number of images in our mixed datasets

Dataset    Training set               Validation set
           virtual    real-world      virtual    real-world
A             0          500             0          125
B           500          500             0          250
C          1500          500             0          500
D          1500         1000             0          625
E          1500         1500             0          750

Our goals are to use as little data as possible from the real-world dataset and to observe how the efficiency of the neural networks changes as the number of virtual images is varied. Dataset A contains only real-world images and is therefore regarded as the basic dataset to which the others are compared. Dataset B contains the same number of virtual images as real-world images; here, we seek to observe how the introduction of virtual images changes the initial degree of efficiency. Dataset C contains three times more virtual images than Dataset B. If the number of virtual images is much higher than the number of real-world images, the efficiency may be reduced; a future paper of ours will investigate this. In Datasets D and E, the number of real-world images was increased in order to investigate its effect on efficiency. For the segmentation of protective barriers, only virtual training data were used; this will show how efficiently a neural network recognizes real objects when it is trained on virtual data alone.

Adam optimization was used for training with a learning rate of 10⁻⁴ and a learning rate decay of 5×10⁻⁴. Categorical crossentropy is used as the objective function:

L(y, ŷ) = −y × log(ŷ)  (23)

and the dice coefficient is measured as:

dc(y, ŷ) = 1 − (2 × y × ŷ + 1) / (y + ŷ + 1)  (24)

where y ∈ {0, 1} is the ground truth and 0 ≤ ŷ ≤ 1 is the output of the neural network.
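The following is a minimal Keras sketch of this training setup. The stand-in model only reproduces the 256×512×3 to 256×512×1 contract of Section 3 (in the experiments it would be replaced by the FCN, U-Net or PSPNet), and the decay argument assumes a TensorFlow version whose Adam optimizer still accepts it; the loss and metric follow Eqs. (23) and (24) and the learning rates quoted above.

```python
from tensorflow import keras
from tensorflow.keras import backend as K

def crossentropy_loss(y_true, y_pred):
    """Objective function of Eq. (23): L(y, y_hat) = -y * log(y_hat)."""
    y_pred = K.clip(y_pred, K.epsilon(), 1.0)
    return -K.mean(y_true * K.log(y_pred))

def dc(y_true, y_pred):
    """Eq. (24): one minus the dice coefficient, smoothed by the +1 terms."""
    y_true = K.flatten(y_true)
    y_pred = K.flatten(y_pred)
    intersection = K.sum(y_true * y_pred)
    return 1.0 - (2.0 * intersection + 1.0) / (K.sum(y_true) + K.sum(y_pred) + 1.0)

# Stand-in network with the 256x512x3 -> 256x512x1 contract of Section 3;
# it is a placeholder, not one of the architectures described above.
inputs = keras.Input(shape=(256, 512, 3))
hidden = keras.layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
outputs = keras.layers.Conv2D(1, 1, activation="sigmoid")(hidden)
model = keras.Model(inputs, outputs)

# Adam with the learning rate and decay reported above; on newer Keras
# versions the decay argument is replaced by a learning-rate schedule.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4, decay=5e-4),
              loss=crossentropy_loss,
              metrics=[dc])
```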
5. Results

An attempt was made to examine the efficiency of road surface detection while the composition of the dataset was modified. For examining changes in efficiency, the most useful datasets were A, C and E. Dataset A is the basic dataset, which contains only a small set of real-world images. Dataset C is based on Dataset A, but contains three times as many virtual images as real-world images; it shows how performance changes when virtual images are integrated into a small dataset. In Dataset E, the size of the collection was expanded; this dataset shows how much greater the efficiency of a larger mixed dataset is. Fig. 7 shows the validation accuracy over the training process of road surface detection, while Fig. 8 shows the best dice coefficient values for road surface segmentation.

Figure 7: Road surface segmentation performance.
Figure 8: Best accuracy of road segmentation.

FCN is much simpler than both the U-Net and PSPNet architectures, hence the efficiency of the FCN on Dataset A is a little lower than that of the other networks. U-Net and PSPNet are very robust and complex, therefore mixed datasets do not significantly increase the efficiency of these architectures. However, for simpler networks such as FCN, this method does improve the efficiency. Fig. 9 shows the performance with regard to the segmentation of protective barriers. Only virtual images were used to train the neural networks that segment the protective barriers. This would not have been possible in the case of road surface segmentation, because the road surface is too complex. The texture of the protective barriers is very simple, therefore it is possible to recognize them from virtual images alone.

Figure 9: Barrier segmentation performance.

It is our intention to use an environment detection system in a low-budget racing car, where the available hardware resources are limited and detection must occur in real time with a high degree of accuracy. Therefore, the neural network should be designed to be as simple as possible. If the neural network architecture is too simple, however, it is more difficult to train for complex recognition tasks. Moreover, the dataset concerning the racing track, protective barriers, etc. is not large. In this case, it is helpful to be able to train simpler neural networks, e.g. FCN, with virtual datasets to achieve higher degrees of efficiency. Experience has shown that the efficiency of road surface detection is improved by using three times as many virtual images as real-world images, while for protective barrier detection it is sufficient to use virtual images only.

6. Conclusion

This paper presents how computer-generated virtual images can be used to train artificial neural networks when the amount of available real-world images is limited. Three different neural network architectures, namely FCN, U-Net and PSPNet, were investigated, and these networks were trained with mixed datasets. It was shown that virtual images improve the efficiency of neural networks. Our research demonstrates that when the texture of the objects is simple, e.g. that of protective barriers, it is sufficient to use only virtual image-based training datasets. This work may help us to create an efficient environment detector for the Shell Eco-marathon, where special objects have to be detected in the absence of real-world datasets.

Acknowledgements

The research was carried out as part of the "Autonomous Vehicle Systems Research related to the Autonomous Vehicle Proving Ground of Zalaegerszeg (EFOP-3.6.2-16-2017-00002)" project in the framework of the New Széchenyi Plan. The completion of this project is funded by the European Union and co-financed by the European Social Fund.
REFERENCES

[1] Krizhevsky, A.; Sutskever, I.; Hinton, G. E.: ImageNet classification with deep convolutional neural networks, Commun. ACM, 2017, 60(6), 84–90. DOI: 10.1145/3065386
[2] Simonyan, K.; Zisserman, A.: Very deep convolutional networks for large-scale image recognition, 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015
[3] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.: Going deeper with convolutions, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 2015. DOI: 10.1109/CVPR.2015.7298594
[4] Shelhamer, E.; Long, J.; Darrell, T.: Fully convolutional networks for semantic segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4), 640–651. DOI: 10.1109/CVPR.2015.7298965
[5] Ronneberger, O.; Fischer, P.; Brox, T.: U-Net: convolutional networks for biomedical image segmentation, In: Navab, N.; Hornegger, J.; Wells, W.; Frangi, A. (eds.) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Lecture Notes in Computer Science, 9351, 234–241, Springer: Cham, Switzerland, 2015. DOI: 10.1007/978-3-319-24574-4_28
[6] He, K.; Zhang, X.; Ren, S.; Sun, J.: Deep residual learning for image recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 770–778, 2016. DOI: 10.1109/CVPR.2016.90
[7] Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J.: Pyramid scene parsing network, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 6230–6239, 2017. DOI: 10.1109/CVPR.2017.660
[8] Peng, X.; Sun, B.; Ali, K.; Saenko, K.: Learning deep object detectors from 3D models, IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 1278–1286, 2015. DOI: 10.1109/ICCV.2015.151
[9] Tian, Y.; Li, X.; Wang, K.; Wang, F.: Training and testing object detectors with virtual images, IEEE/CAA J. Autom. Sin., 2018, 5(2), 539–546. DOI: 10.1109/JAS.2017.7510841
[10] Židek, K.; Lazorík, P.; Pitel, J.; Hošovský, A.: An automated training of deep learning networks by 3D virtual models for object recognition, Symmetry, 2019, 11, 496–511. DOI: 10.3390/sym11040496
[11] Unrealengine.com, 2020. Unreal Engine | The Most Powerful Real-Time 3D Creation Platform. https://www.unrealengine.com/en-US/ [Accessed 14 September 2019]
[12] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge, Int. J.
Comput. Vision, 2015, 115, 211–252. DOI: 10.1007/s11263-015-0816-y
[13] Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B.: The Cityscapes Dataset for Semantic Urban Scene Understanding, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 3213–3223, 2016. DOI: 10.1109/CVPR.2016.350