HUNGARIAN JOURNAL OF INDUSTRY AND CHEMISTRY Vol. 48(1) pp. 3–10 (2020)
hjic.mk.uni-pannon.hu
DOI: 10.33927/hjic-2020-02

IMPROVING THE EFFICIENCY OF NEURAL NETWORKS WITH VIRTUAL TRAINING DATA

JÁNOS HOLLÓSI*1,2, RUDOLF KRECHT2, NORBERT MARKÓ2, AND ÁRON BALLAGI2,3

1 Department of Information Technology, Széchenyi István University, Egyetem tér 1, Győr, 9026, HUNGARY
2 Research Center of Vehicle Industry, Széchenyi István University, Egyetem tér 1, Győr, 9026, HUNGARY
3 Department of Automation, Széchenyi István University, Egyetem tér 1, Győr, 9026, HUNGARY

*Correspondence: hollosi.janos@sze.hu

At Széchenyi István University, an autonomous racing car for the Shell Eco-marathon is being developed. One of the main tasks is to create a neural network which segments the road surface, the protective barriers and other components of the racing track. The difficulty with this task is that no suitable dataset exists for special objects such as protective barriers, and the available dataset is limited in size. Therefore, computer-generated virtual images from a virtual city environment are used to expand the dataset. In this work, the effect of computer-generated virtual images on the efficiency of different neural network architectures is examined. In the training process, real images and computer-generated virtual images are mixed in several ways, and three different neural network architectures are trained for road surface and protective barrier detection. The results indicate how the datasets should be mixed and to what extent the mixing improves efficiency.

Keywords: neural network, virtual training data, autonomous vehicle

1. Introduction

Shell Eco-marathon is a unique international competition held by Royal Dutch Shell Plc. This event challenges university students to design, develop, build and drive the most energy-efficient racing cars. Our University's racing team, the SZEnergy Team, has been a successful participant in the Shell Eco-marathon for over 10 years. Two years ago, Shell introduced the Autonomous UrbanConcept (AUC) challenge, a separate competition for self-driving vehicles that participate in the Shell Eco-marathon. Participants in the AUC challenge have to complete five different tasks, e.g. parking in a dedicated parking rectangle, avoiding obstacles on a straight track and driving one lap of the track autonomously.

Our long-term goal is to prepare for the AUC challenge. One of the main tasks is to create an intelligent system which perceives the environment of our racing car, e.g. other vehicles, the road surface and other components of the racing track. In this paper, only the segmentation of the road surface and of the protective barriers is considered. The segmentation is performed with neural networks, because such networks are among the best tools for solving problems of visual information-based detection and segmentation. Many high-performance neural network architectures are available, such as AlexNet by Krizhevsky et al. [1], VGGNet by Simonyan and Zisserman [2], GoogLeNet by Szegedy et al. [3], Fully Convolutional Networks by Shelhamer et al. [4], U-Net by Ronneberger et al. [5], ResNet by He et al. [6] and the Pyramid Scene Parsing Network by Zhao et al. [7]. Training neural networks requires a large amount of training data. However, in this case, the number of training samples is insufficient, e.g.
no training images of protective barriers are available, and the generation and annotation of real-world data are labour-intensive and time-consuming. Computer simulation environments are therefore used to generate training data for this task. Some attempts to apply virtually generated data to train neural networks have been made. Peng et al. [8] demonstrated CAD model-based convolutional neural network training for joint object detection. Tian et al. [9] presented a pipeline to construct virtual scenes and virtual datasets for neural networks, and showed that mixing virtual and real data when training neural networks for joint object detection helps to improve performance. Židek et al. [10] presented a new approach to joint object detection using neural networks trained on virtual model-based datasets. In this paper, an attempt is made to show the effects of computer-generated training data on the learning process of different network architectures.

The paper is structured as follows: in Section 2, the virtual simulation environment used for generating training data is described; in Section 3, our neural network architectures are presented; in Section 4, the training process of the networks is outlined; in Section 5, our results and experiences are shared; finally, in Section 6, our conclusions are stated.

2. Our virtual environment

Our aim is to create highly realistic image sets that depict racing tracks which follow the rulebook of the Shell Eco-marathon Autonomous UrbanConcept. In order to ensure repeatability and a simple parameter setup, the creation of complete, textured 3D models of the racing tracks is advised. These simulated environments can be used to create images with the desired weather and lighting conditions by scanning the track environment with a camera moving at a predefined constant speed. The images created in this way can be processed further, e.g. by segmentation and clustering of different types of objects such as the road surface, protective barriers and vegetation. Based on the characteristics of the predefined task, the requirements of the simulation environment can be enumerated:

• highly realistic appearance,
• easy use of textures,
• fast workflow,
• characteristics definable by parameters (parametric lights, weather conditions),
• modular environment construction,
• importability of external CAD models.

Unreal Engine 4 [11] is a game engine designed for the fast creation of modular simulated environments through the use of modular relief, vegetation and building elements. In these environments, actors based on external CAD models can be used. Fields of engineering that apply different visual sensors and cameras require computer simulation technologies very similar to those of the video game industry. Video games need to be highly realistic as well as efficient due to limited computational capacity, and the requirements are the same for the simulation of camera-equipped vehicles. Highly realistic computer simulations reduce the cost and duration of real-life tests and camera calibrations. It is also important to mention that by using technology implemented and/or developed by the video game industry, the support of a vast developer community is available.
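To make this data-generation workflow concrete, the following is a minimal Python sketch of how paired realistic and segmentation images could be collected from such an Unreal Engine environment through the AirSim simulator module that is described later in this section. The camera name, mesh-name patterns, object IDs and file names are illustrative assumptions, not the settings of our actual project.

```python
import airsim
import numpy as np

# Connect to the AirSim plugin running inside the Unreal Engine environment.
client = airsim.CarClient()
client.confirmConnection()

# Assign segmentation IDs to meshes; the regex patterns and IDs are placeholders
# for whatever the Unreal project actually calls its road and barrier actors.
client.simSetSegmentationObjectID(r"[\w]*", 0, True)         # background
client.simSetSegmentationObjectID(r"road[\w]*", 1, True)     # road surface
client.simSetSegmentationObjectID(r"barrier[\w]*", 2, True)  # protective barriers

# Request one realistic image and its segmentation counterpart from camera "0".
responses = client.simGetImages([
    airsim.ImageRequest("0", airsim.ImageType.Scene, False, False),
    airsim.ImageRequest("0", airsim.ImageType.Segmentation, False, False),
])

for name, response in zip(("scene", "segmentation"), responses):
    # Uncompressed responses arrive as a flat buffer of 8-bit pixels.
    pixels = np.frombuffer(response.image_data_uint8, dtype=np.uint8)
    image = pixels.reshape(response.height, response.width, 3)
    airsim.write_png(f"{name}_0000.png", image)
```

Repeating such a capture at fixed time intervals while the vehicle follows its pre-defined path yields the paired realistic and segmented images used for training.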
Since our goal is to develop image-perceiving solutions for the Shell Eco-marathon Autonomous UrbanConcept challenge, it is important to carefully follow the rules of this competition with regard to the racing tracks. The simulated environments and racing tracks created in Unreal Engine strictly follow the rules defined by the aforementioned rulebook. These rules state that the self-driving vehicles have to compete on racing tracks equipped with protective barriers of a known height, painted in alternating red and white segments. It is also defined that every racing track contains three painted line markings: a green one to denote the starting position, a yellow one to trigger the self-driving mode and a red one to mark the finish line. Because the racing tracks and tasks are well defined, it is crucial to create accurate models of the expected environments; differences between real and simulated environments might lead further development in the wrong direction.

Two simulated test environments were created. The first one was based on a readily available city model with streets corresponding to a typical racing track. Barrier elements were added to the roads to ensure that the racing track complies with the requirements outlined in the rulebook. This model includes road surface defects and textures to make the detection of the road surface robust. In order to create image sets based on this environment model, a vehicle model equipped with a camera travelled around the racetrack on a pre-defined path. The camera was set to take pictures at pre-defined time intervals. The image set was annotated using a module called AirSim. AirSim is an open-source, cross-platform simulation platform built on Unreal Engine, with a Unity release also available. This simulator module provides a built-in Python API (Application Programming Interface) that supports image segmentation. By using this API, the necessary realistic and segmented image datasets were created. Some example pairs of images from our virtual dataset are presented in Fig. 1.

Figure 1: Example images from the training set.

In order to prepare for all the tasks defined in the rulebook, multiple models of racing tracks were created. All such models are based on the same environment model, which includes vegetation and the sky as shown in Fig. 2. The models of sections of racing track were realized according to the challenges defined in the rulebook. The CAD models representing elements of the racing track were custom-made to comply with the shapes, sizes and colors outlined in the rulebook. The generated sections of racing track can be used to simulate handling (slalom) and parking tasks. This virtual racing track is shown in Fig. 3. The image sets were created by a moving camera in the environment, and segmentation was carried out by changing the textures.

Figure 2: Basic environment of racing tracks.
Figure 3: Parking place and slalom course.

3. Neural network architectures

Three different neural network architectures are implemented in this work: FCN, U-Net and PSPNet. All of them are designed for image segmentation, with an input image size of 256×512×3 and an output size of 256×512×1. Every network is trained for the segmentation of the road surface and protective barriers.
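As an illustration of this shared input/output contract, the following is a minimal Python sketch of how an image and its segmentation mask could be brought to the common 256×512×3 and 256×512×1 sizes before training. The file layout and the 0/255 mask convention are assumptions made only for this example.

```python
import cv2
import numpy as np

def load_pair(image_path, mask_path, width=512, height=256):
    """Load one training pair and resize it to the common network size."""
    image = cv2.imread(image_path, cv2.IMREAD_COLOR)        # H x W x 3
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)      # H x W

    # OpenCV expects (width, height); nearest-neighbour keeps mask labels intact.
    image = cv2.resize(image, (width, height), interpolation=cv2.INTER_LINEAR)
    mask = cv2.resize(mask, (width, height), interpolation=cv2.INTER_NEAREST)

    image = image.astype(np.float32) / 255.0                 # 256 x 512 x 3
    mask = (mask > 127).astype(np.float32)[..., np.newaxis]  # 256 x 512 x 1
    return image, mask
```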
3.1 FCN

The Fully Convolutional Network (FCN) [4] architecture is built from convolutional layers only; the basic idea is to extend effective classification neural networks to segmentation tasks. Our FCN architecture is shown in Fig. 4. Let:

γ = (conv, bn, ReLU)  (1)
b1 = (γ, γ, mp)  (2)
b2 = (γ, γ, γ, mp)  (3)

where conv denotes a convolutional layer, bn represents a batch normalization layer, ReLU stands for a rectified linear activation unit and mp is a max pooling layer. Let:

B1 = (b1, b1, b2)  (4)
B2 = (b2)  (5)
B3 = (b2, γ, γ, γ)  (6)
x = (conv, bn)  (7)
y = (ReLU, softmax)  (8)
Z = (ReLU, softmax)  (9)

where softmax denotes a softmax layer. In this implementation, the dimensions of all convolutional layers are 3×3, except for the three fully connected layers in block B3, which are implemented as 7×7 convolutions. In block B1, the first two convolutional layers both contain 64 filters, the third and fourth both contain 128 filters, and the last three convolutional layers each contain 256 filters. The convolutional layers in block B2 contain 512 filters in total. The first three convolutional layers in block B3 contain 512 filters in total, and the fully connected layers are based on 4096 filters in total.

Figure 4: FCN architecture.

3.2 U-Net

The U-Net [5] neural network architecture was originally created for biomedical image segmentation. It is based on the FCN, and the network can be divided into two main blocks, namely the downsampling and upsampling blocks. Our implementation is shown in Fig. 5. Let:

D1 = D2 = (γ, γ, mp)  (10)
D3 = D4 = B = (γ, γ, γ, mp)  (11)
U1 = U2 = U3 = (convt, bn, ReLU, γ, γ)  (12)
U4 = (convt, bn, ReLU, γ, γ, conv, softmax)  (13)

where convt is a transposed convolution layer. In the U-Net, the dimensions of all convolutions are 3×3 and those of all transposed convolutions are 2×2. The numbers of convolutional filters are as follows: each convolutional layer in D1 consists of 64, in D2 of 128, in D3 of 256 and in D4 as well as in B of 512 filters. The upsampling block is very similar: U1 consists of 512, U2 of 256, U3 of 128 and U4 of 64 filters.

Figure 5: U-Net architecture.

3.3 PSPNet

The Pyramid Scene Parsing Network (PSPNet) [7] was judged to be the best architecture in the ImageNet Scene Parsing Challenge in 2016 [12]. The main building block of the PSPNet is a pyramid pooling module, in which the network fuses features at four different pyramid scales. Our PSPNet-based architecture is shown in Fig. 6. Let:

B1 = (γ, γ, γ, mp)  (14)
C = (γ, γ, conv, bn) + (conv, bn)  (15)
I = (γ, γ, conv, bn)  (16)
p = (avg, conv)  (17)
P1 = (p)  (18)
P2 = (p, p)  (19)
P3 = (p, p, p)  (20)
P4 = (p, p, p, p)  (21)
B2 = (γ, dropout, conv, convt, softmax)  (22)

where avg denotes an average pooling layer and dropout represents a dropout unit. In block B1, the dimensions of all convolutions are 3×3. In blocks C and I, the first and third convolutions are 3×3 and the second is 1×1. In block B2, the first convolution is 3×3 and the second is 1×1; the dimensions of the transposed convolution are 16×16. Each of the first two convolutions in block B1 consists of 64 filters, and the last one of 128. The first block C and the first two I blocks contain 64, 64 and 256 filters, respectively, while the second block C and the following three I blocks consist of 128, 128 and 512 filters, respectively. The third block C and the following five I blocks contain 256, 256 and 1024 filters, respectively, and the fourth block C along with the last two I blocks consist of 512, 512 and 2048 filters, respectively.

Figure 6: PSPNet architecture.
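To ground the block notation used throughout this section, the following is a minimal Keras sketch of the composite block γ = (conv, bn, ReLU) and of the blocks b1, b2 and B1 from Eqs. (1)-(4). The helper names are ours and the kernel sizes follow the 3×3 convention stated above; this is an illustrative sketch under those assumptions, not the authors' actual implementation.

```python
from tensorflow.keras import layers

def gamma(x, filters, kernel_size=3):
    """gamma = (conv, bn, ReLU): convolution, batch normalization, ReLU."""
    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def b1(x, filters):
    """b1 = (gamma, gamma, mp): two gamma blocks followed by max pooling."""
    x = gamma(x, filters)
    x = gamma(x, filters)
    return layers.MaxPooling2D(pool_size=2)(x)

def b2(x, filters):
    """b2 = (gamma, gamma, gamma, mp): three gamma blocks and max pooling."""
    for _ in range(3):
        x = gamma(x, filters)
    return layers.MaxPooling2D(pool_size=2)(x)

def fcn_block_B1(x):
    """B1 = (b1, b1, b2) of Eq. (4) with the 64/128/256 filter counts above."""
    x = b1(x, 64)
    x = b1(x, 128)
    return b2(x, 256)
```

The U-Net and PSPNet blocks of Eqs. (10)-(22) can be composed from the same γ helper in an analogous way.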
4. Training with virtual data

An attempt was made to improve the accuracy of neural networks using computer-generated virtual training data originating from the virtual city environment. Several mixed datasets were compiled which contain real-world images and computer-generated virtual images. The real-world images originate from the Cityscapes Dataset, a large-scale dataset for semantic segmentation [13]. The dataset contains 5000 images with fine annotations, recorded in 50 different cities under various weather conditions. It covers 30 object classes, e.g. roads, sidewalks, people, vehicles, traffic lights, terrain and sky, but in this research only road surface segmentation is examined. The computer-generated images originate from the simulation environment described in Section 2.

For road surface segmentation, five different datasets were created from the Cityscapes Dataset and our collection of virtual images. Table 1 shows how these two collections were mixed.

Table 1: Number of images in our mixed datasets

Dataset    Training set               Validation set
           virtual    real-world      virtual    real-world
A             0          500             0          125
B           500          500             0          250
C          1500          500             0          500
D          1500         1000             0          625
E          1500         1500             0          750

Our goals are to use as little data as possible from the real-world dataset and to observe how the efficiency of the neural networks changes as the number of virtual images is varied. Dataset A contains only real-world images and is therefore regarded as the basic dataset to which the others are compared. Dataset B contains the same number of virtual images as real-world images; here, we seek to observe how the introduction of virtual images changes the initial degree of efficiency. Dataset C contains three times more virtual images than Dataset B. If the number of virtual images is much higher than the number of real-world images, the efficiency may be reduced; a future paper of ours will investigate this. In Datasets D and E, the number of real-world images was increased in order to investigate its effect on efficiency. For the segmentation of protective barriers, only virtual training data were used; this will show how efficiently a neural network recognizes real objects when it is trained on virtual data alone.

Adam optimization was used for training with a learning rate of 10⁻⁴ and a learning rate decay of 5×10⁻⁴. Categorical crossentropy is used as the objective function:

L(y, ŷ) = −y × log(ŷ)  (23)

and the dice coefficient is measured as:

dc(y, ŷ) = 1 − (2 × y × ŷ + 1) / (y + ŷ + 1)  (24)

where y ∈ {0, 1} is the ground truth and 0 ≤ ŷ ≤ 1 is the output of the neural network.
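The following is a minimal Keras sketch of this training setup. The stand-in model only reproduces the 256×512×3 to 256×512×1 contract of Section 3 (in the experiments it would be replaced by the FCN, U-Net or PSPNet), and the decay argument assumes a TensorFlow version whose Adam optimizer still accepts it; the loss and metric follow Eqs. (23) and (24) and the learning rates quoted above.

```python
from tensorflow import keras
from tensorflow.keras import backend as K

def crossentropy_loss(y_true, y_pred):
    """Objective function of Eq. (23): L(y, y_hat) = -y * log(y_hat)."""
    y_pred = K.clip(y_pred, K.epsilon(), 1.0)
    return -K.mean(y_true * K.log(y_pred))

def dc(y_true, y_pred):
    """Eq. (24): one minus the dice coefficient, smoothed by the +1 terms."""
    y_true = K.flatten(y_true)
    y_pred = K.flatten(y_pred)
    intersection = K.sum(y_true * y_pred)
    return 1.0 - (2.0 * intersection + 1.0) / (K.sum(y_true) + K.sum(y_pred) + 1.0)

# Stand-in network with the 256x512x3 -> 256x512x1 contract of Section 3;
# it is a placeholder, not one of the architectures described above.
inputs = keras.Input(shape=(256, 512, 3))
hidden = keras.layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
outputs = keras.layers.Conv2D(1, 1, activation="sigmoid")(hidden)
model = keras.Model(inputs, outputs)

# Adam with the learning rate and decay reported above; on newer Keras
# versions the decay argument is replaced by a learning-rate schedule.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4, decay=5e-4),
              loss=crossentropy_loss,
              metrics=[dc])
```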
5. Results

An attempt was made to examine the efficiency of road surface detection while the composition of the dataset was modified. For examining changes in efficiency, the most useful datasets were A, C and E. Dataset A is the basic dataset, which contains only a small set of real-world images. Dataset C is based on Dataset A, but contains three times as many virtual images as real-world images; it shows how performance changes when virtual images are integrated into a small dataset. In Dataset E, the size of the collection was expanded; this dataset shows how much greater the efficiency of a larger mixed dataset is. Fig. 7 shows the validation accuracy over the training process of road surface detection, while Fig. 8 shows the best dice coefficient values for road surface segmentation.

Figure 7: Road surface segmentation performance.
Figure 8: Best accuracy of road segmentation.

FCN is much simpler than both the U-Net and PSPNet architectures, hence the efficiency of the FCN on Dataset A is a little lower than that of the other networks. U-Net and PSPNet are very robust and complex, therefore mixed datasets do not significantly increase the efficiency of these architectures. However, for simpler networks such as FCN, this method does improve the efficiency. Fig. 9 shows the performance with regard to the segmentation of protective barriers. Only virtual images were used to train the neural networks that segment the protective barriers. This would not have been possible in the case of road surface segmentation, because the road surface is too complex. The texture of the protective barriers is very simple, therefore it is possible to recognize them from virtual images alone.

Figure 9: Barrier segmentation performance.

It is our intention to use an environment detection system in a low-budget racing car, where the available hardware resources are limited and detection must occur in real time with a high degree of accuracy. Therefore, the neural network should be designed to be as simple as possible. If the neural network architecture is too simple, however, it is more difficult to train for complex recognition tasks. Moreover, the dataset concerning the racing track, protective barriers, etc. is not large. In this case, it is helpful to be able to train simpler neural networks, e.g. FCN, with virtual datasets to achieve higher degrees of efficiency. Experience has shown that the efficiency of road surface detection is improved by using three times as many virtual images as real-world images, while for protective barrier detection it is sufficient to use virtual images only.

6. Conclusion

This paper presents how computer-generated virtual images can be used to train artificial neural networks when the amount of available real-world images is limited. Three different neural network architectures, namely FCN, U-Net and PSPNet, were investigated, and these networks were trained with mixed datasets. It was shown that virtual images improve the efficiency of neural networks. Our research demonstrates that when the texture of the objects is simple, e.g. that of protective barriers, it is sufficient to use only virtual image-based training datasets. This work may help us to create an efficient environment detector for the Shell Eco-marathon, where special objects have to be detected in the absence of real-world datasets.

Acknowledgements

The research was carried out as part of the "Autonomous Vehicle Systems Research related to the Autonomous Vehicle Proving Ground of Zalaegerszeg (EFOP-3.6.2-16-2017-00002)" project in the framework of the New Széchenyi Plan. The completion of this project is funded by the European Union and co-financed by the European Social Fund.
REFERENCES

[1] Krizhevsky, A.; Sutskever, I.; Hinton, G. E.: ImageNet classification with deep convolutional neural networks, Commun. ACM, 2017, 60(6), 84–90. DOI: 10.1145/3065386
[2] Simonyan, K.; Zisserman, A.: Very deep convolutional networks for large-scale image recognition, 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015
[3] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.: Going deeper with convolutions, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 2015. DOI: 10.1109/CVPR.2015.7298594
[4] Shelhamer, E.; Long, J.; Darrell, T.: Fully convolutional networks for semantic segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4), 640–651. DOI: 10.1109/CVPR.2015.7298965
[5] Ronneberger, O.; Fischer, P.; Brox, T.: U-Net: convolutional networks for biomedical image segmentation, In: Navab, N.; Hornegger, J.; Wells, W.; Frangi, A. (eds.) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Lecture Notes in Computer Science, 9351, 234–241, Springer: Cham, Switzerland, 2015. DOI: 10.1007/978-3-319-24574-4_28
[6] He, K.; Zhang, X.; Ren, S.; Sun, J.: Deep residual learning for image recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 770–778, 2016. DOI: 10.1109/CVPR.2016.90
[7] Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J.: Pyramid scene parsing network, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 6230–6239, 2017. DOI: 10.1109/CVPR.2017.660
[8] Peng, X.; Sun, B.; Ali, K.; Saenko, K.: Learning deep object detectors from 3D models, IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 1278–1286, 2015. DOI: 10.1109/ICCV.2015.151
[9] Tian, Y.; Li, X.; Wang, K.; Wang, F.: Training and testing object detectors with virtual images, IEEE/CAA J. Autom. Sin., 2018, 5(2), 539–546. DOI: 10.1109/JAS.2017.7510841
[10] Židek, K.; Lazorík, P.; Pitel, J.; Hošovský, A.: An automated training of deep learning networks by 3D virtual models for object recognition, Symmetry, 2019, 11, 496–511. DOI: 10.3390/sym11040496
[11] Unrealengine.com, 2020. Unreal Engine | The Most Powerful Real-Time 3D Creation Platform. https://www.unrealengine.com/en-US/ [Accessed 14 September 2019]
[12] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge, Int. J.
Comput. Vision, 2015, 115, 211–252. DOI: 10.1007/s11263-015-0816-y
[13] Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B.: The Cityscapes Dataset for Semantic Urban Scene Understanding, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 3213–3223, 2016. DOI: 10.1109/CVPR.2016.350