Vision-based reinforcement learning for lane-tracking control

ACTA IMEKO, ISSN: 2221-870X, September 2021, Volume 10, Number 3, 7-14

András Kalapos1, Csaba Gór2, Róbert Moni3, István Harmati1
1 BME, Dept. of Control Engineering and Information Technology, Budapest, Hungary
2 Continental ADAS AI, Budapest, Hungary
3 BME, Dept. of Telecommunications and Media Informatics, Budapest, Hungary

Section: RESEARCH PAPER
Keywords: Artificial intelligence, machine learning, mobile robot, reinforcement learning, simulation-to-reality, transfer learning
Citation: András Kalapos, Csaba Gór, Róbert Moni, István Harmati, Vision-based reinforcement learning for lane-tracking control, Acta IMEKO, vol. 10, no. 3, article 4, September 2021, identifier: IMEKO-ACTA-10 (2021)-03-04
Section Editor: Bálint Kiss, Budapest University of Technology and Economics, Hungary
Received January 17, 2021; In final form September 22, 2021; Published September 2021
Copyright: This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Corresponding author: András Kalapos, e-mail: andras.kalapos.research@gmail.com

ABSTRACT

The present study focused on vision-based end-to-end reinforcement learning in relation to vehicle control problems such as lane following and collision avoidance. The controller policy presented in this paper is able to control a small-scale robot to follow the right-hand lane of a real two-lane road, although its training has only been carried out in a simulation. This model, realised by a simple, convolutional network, relies on images of a forward-facing monocular camera and generates continuous actions that directly control the vehicle. To train this policy, proximal policy optimization was used, and to achieve the generalisation capability required for real performance, domain randomisation was used. A thorough analysis of the trained policy was conducted by measuring multiple performance metrics and comparing these to baselines that rely on other methods. To assess the quality of the simulation-to-reality transfer learning process and the performance of the controller in the real world, simple metrics were measured on a real track and compared with results from a matching simulation. Further analysis was carried out by visualising salient object maps.

1. INTRODUCTION

Reinforcement learning has been used to solve many control and robotics tasks. However, only a handful of papers have been published that apply this technique to end-to-end driving [1]-[7], and even fewer studies have focused on reinforcement learning-based driving, trained only in simulations and then applied to real-world problems. Generally, bridging the gap between simulation and the real world is an important transfer-learning problem related to reinforcement learning, and it remains an open task for researchers.

Mnih et al. [1] proposed a method to train vehicle controller policies that predict discrete control actions based on a single image of a forward-facing camera. Jaritz et al. [2] used WRC6, a realistic racing simulator, to train a vision-based road-following policy. They assessed the policy's generalisation capability by testing it on previously unseen tracks and on real driving videos in an open-loop configuration, but their work did not extend to an evaluation of real vehicles in closed-loop control. Kendall et al. [3] demonstrated real-world driving by training a lane-following policy exclusively on a real vehicle under the supervision of a safety driver. Shi et al. [4] presented research that involved training reinforcement learning agents in Duckietown, in a similar way to that presented here; however, their focus was mainly on presenting a method that explains the reasoning behind the trained agents rather than on the training methods. Also similar to the present study, Balaji et al. [5] presented a method for training a road-following policy in a simulator using reinforcement learning and tested the trained agent in the real world, yet their primary contribution is the DeepRacer platform rather than an in-depth analysis of the road-following policy. Almási et al.
[7] also used reinforcement learning to solve lane following in the Duckietown environment, but their work differs from the present study in the use of an off-policy reinforcement learning algorithm (deep Q-networks (DQNs) [8]); in this study an on-policy algorithm (proximal policy optimization [9]) is used, which achieves significantly better sample efficiency and shorter training times. Another important difference is that Almási et al. applied hand-crafted colour-threshold-based segmentation to the input images, whereas the method presented here takes the 'raw' images as inputs, which allows for a more robust real performance.

This paper is an extended version of the authors' original contribution [10]. It includes the results of the 5th AI Driving Olympics [11] and aims to improve the description of the methods. In both works, vision-based end-to-end reinforcement learning relating to vehicle control problems is studied and a solution is proposed that performs lane following in the real world, using continuous actions, without any real data provided by an expert (as in [3]). Also, validation of the trained policies in both the real and simulated domains is conducted. The training and evaluation code for this paper is available on GitHub (https://github.com/kaland313/Duckietown-RL, accessed 23 September 2021).

2. METHODS

In this study, a neural-network-based controller was trained that takes images from a forward-looking monocular camera and produces control signals to drive a vehicle in the right-hand lane of a two-way road. The vehicle to be controlled was a small differential-wheeled mobile robot, a Duckiebot, which is part of the Duckietown ecosystem [11], a simple and accessible platform for research and education on mobile robotics and autonomous vehicles. The primary objective was to travel as far as possible within a given time without leaving the road. Lane departure was allowed but not preferred. Although the latest version of the Duckiebot is equipped with wheel encoders, in this method the vehicle relied solely on data from its forward-facing monocular camera.

2.1. Reinforcement learning algorithm

In reinforcement learning, an agent interacts with the environment by taking an action a_t, after which the environment returns an observation s_{t+1} and a reward r_{t+1}. The agent then computes the next action a_{t+1} based on s_{t+1}, and so on. The policy is the parametric controller of the agent, and it is tuned during reinforcement learning training. Sequences of actions, observations and rewards (trajectories τ) are used to train the parameters of the policy to maximise the expected reward over a finite number of steps (agent-environment interactions). For vehicle control problems, the actions are the signals that control the vehicle, such as the steering and throttle, and the observations are sensor data relating to the environment of the vehicle, such as camera or lidar data or higher-level environment models. In this research, the observations were images from the robot's forward-facing camera, and the actions were the velocity signals for the two wheels of the robot.
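As an illustration, the agent-environment interaction described above can be written as a short rollout loop. The sketch below assumes a Gym-style environment interface (such as the Duckietown simulation used later) and a placeholder policy object; neither is the exact code released with this paper.

```python
# Minimal sketch of the agent-environment loop of Section 2.1, assuming a
# Gym-style environment (reset/step) and a policy with an act() method.
# Both interfaces are illustrative, not the paper's released code.
import numpy as np

class RandomPolicy:
    """Stand-in for a trained policy: outputs two wheel velocities in [-1, 1]."""
    def act(self, observation):
        return np.random.uniform(-1.0, 1.0, size=2)

def rollout(env, policy, max_steps=500):
    """Collect one trajectory tau = (s_0, a_0, r_1, s_1, a_1, r_2, ...)."""
    trajectory = []
    s_t = env.reset()
    for _ in range(max_steps):
        a_t = policy.act(s_t)                       # action a_t computed from s_t
        s_next, r_next, done, info = env.step(a_t)  # environment returns s_{t+1}, r_{t+1}
        trajectory.append((s_t, a_t, r_next))
        s_t = s_next
        if done:
            break
    return trajectory
```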
Policy optimisation algorithms are on-policy reinforcement learning methods that optimise the parameters θ of the policy π_θ(a_t|s_t) based on the actions a_t and the rewards r_t received for them. On-policy reinforcement learning algorithms optimise the policy π_θ(a_t|s_t) based on trajectories in which the actions have been computed by π_θ(a_t|s_t) itself. In contrast, off-policy algorithms (such as DQNs [8]) compute actions based on an estimate of the action-value function of the environment, which they learn using data from a large number of (earlier) trajectories, making these algorithms less stable in some environments. In policy optimisation algorithms, the policy π_θ(a_t|s_t) is stochastic, and in the case of deep reinforcement learning it is implemented by a neural network, which is updated using a gradient method. The policy is stochastic because, instead of computing the actions directly, the policy network predicts the parameters of a probability distribution (see μ and σ in Figure 1) that is sampled to acquire the predicted actions ã_t (here, predicted refers to this action being predicted by the policy).

In the present study, the proximal policy optimization algorithm [9] was used to train the policy because of its stability, sample efficiency and ability to take advantage of multiple parallel workers. Proximal policy optimization performs the weight updates using a special loss function to keep the new policy close to the old, thereby improving the stability of the training. Two loss functions were proposed by Schulman et al. [9]:

L^CLIP(θ) = Ê_t[ min( ρ_t(θ) Â_t, clip(ρ_t(θ), 1 − ε, 1 + ε) Â_t ) ],   (1)

L^KLPEN(θ) = Ê_t[ ρ_t(θ) Â_t − β KL[π_θold(·|s_t), π_θ(·|s_t)] ],   (2)

where clip(·) and KL[·] refer to the clipping function and the Kullback-Leibler (KL) divergence, respectively, while Â_t is the generalised advantage estimate [12]. In these loss functions, ε is usually a constant in the [0.1, 0.3] range, while β is an adaptive parameter, and

ρ_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t).   (3)

An open-source implementation of proximal policy optimization from RLlib [13] was used, which performs the gradient updates based on the weighted sum of these loss functions. The pseudo code and additional details of the algorithm are provided in the Appendix.
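For reference, a minimal PyTorch sketch of the clipped surrogate objective of equation (1) is given below. RLlib's actual implementation additionally combines the KL-penalty term of equation (2) and a value-function loss, which are omitted here for brevity.

```python
# Illustrative sketch of the clipped surrogate objective of Eq. (1);
# this is not the RLlib implementation used in the paper.
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Return the negative clipped surrogate (a loss to minimise).

    logp_new:   log pi_theta(a_t | s_t) under the current policy
    logp_old:   log pi_theta_old(a_t | s_t) recorded when the data was collected
    advantages: generalised advantage estimates A_hat_t
    """
    ratio = torch.exp(logp_new - logp_old)                  # rho_t(theta), Eq. (3)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    return -surrogate.mean()                                # maximise E[...] -> minimise its negative
```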
2.2. Policy architecture

The controller policy was realised by a shallow (4-layer) convolutional neural network. Both the policy and the value network used the architecture presented by Mnih et al. [1], with the only difference being the use of linear activation in the output of the policy network. No weights were shared between the policy and the value network. This policy is considered to be end-to-end because the only learning component is the neural network, which directly computes actions based on observations from the environment. Some pre- and post-processing was applied to the observations and actions, but these steps only performed very simple transformations (explained in the next paragraph and Section 2.3). The aim of these pre- and post-processing steps was to transform the observations s_t and actions a_t into representations that enabled faster convergence without losing any important features in the observations or restricting necessary actions.

Figure 1. Illustration of the policy architecture with the notations used. The agent is represented jointly by the 'Policy network' and 'Sampling action distribution' blocks; s_t: 'raw' observation, s̃_t: pre-processed observation, ã_t: predicted action, a_t: post-processed action.

The input of the policy network consisted of the last three observations (images), scaled, cropped and stacked along the depth axis. The observations returned by the environment (s_t in Figure 1) were 640 × 480 (width, height) RGB images, the top third of which mainly showed the sky and was therefore cropped. The cropped images were then scaled down to 84 × 84 resolution (note the uneven scaling) and stacked along the depth axis, resulting in 84 × 84 × 9 input tensors (s̃_t in Figure 1). The last three images were stacked to provide the policy with information about the robot's speed and acceleration.

Multiple action representations were experimented with (see Section 2.3). Depending on the representation, the predicted policy output ã_t was either a two-element action vector or a scalar value that controlled the vehicle. The policy was stochastic, so the output of the neural network produced the μ and log σ parameters of a multivariate diagonal normal distribution. During training, this distribution was sampled to acquire the ã_t actions, which improved the exploration of the action space. During evaluations, the sampling step was skipped by using the predicted mean μ as the policy output ã_t.
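The observation pre-processing described above can be summarised with the following sketch, which assumes OpenCV for resizing; the exact crop ratio, interpolation method and episode-start padding used in the released code may differ.

```python
# Sketch of the observation pre-processing of Section 2.2: crop the sky,
# rescale to 84x84 and stack the last three frames along the depth axis.
from collections import deque
import cv2
import numpy as np

def preprocess(rgb_frame, out_size=(84, 84)):
    """Crop the top third (mostly sky) of a 640x480 RGB frame and rescale it."""
    h = rgb_frame.shape[0]
    cropped = rgb_frame[h // 3:, :, :]
    # Note the uneven scaling: the cropped 640x320 image is squeezed to 84x84.
    return cv2.resize(cropped, out_size, interpolation=cv2.INTER_AREA)

class FrameStacker:
    """Stack the last three pre-processed frames into an 84x84x9 tensor."""
    def __init__(self, n_frames=3):
        self.frames = deque(maxlen=n_frames)

    def __call__(self, rgb_frame):
        frame = preprocess(rgb_frame)
        if not self.frames:
            # At episode start, pad the stack with copies of the first frame
            # (an assumption; the released code may handle this differently).
            for _ in range(self.frames.maxlen):
                self.frames.append(frame)
        else:
            self.frames.append(frame)
        return np.concatenate(list(self.frames), axis=-1)  # shape (84, 84, 9)
```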
2.3. Action representations

The action mapping step transformed the predicted actions ã_t, which could be implemented using many representations, into wheel velocities a_t = [ω_l, ω_r] (see Figure 1). The vehicle to be controlled was a differential-wheeled robot; the most basic action representation was therefore to directly compute the angular velocities of the two wheels as continuous values in the ω_l,r ∈ [−1, 1] range (where 1 and −1 corresponded to forward and backward rotation at full speed). However, this action space allowed actions that were not necessary for the manoeuvres examined in this paper. Moreover, as the reinforcement learning algorithm had to rule out unnecessary actions, exploration of the action space was potentially made more difficult, and the number of steps required to train an agent was therefore increased. Several methods can be used to constrain and simplify the action space, such as discretisation, clipping some actions or mapping to a lower-dimensional space. Most previous studies [1], [2], [5], [7] have used discrete action spaces, in which the neural network selects one from a set of hand-crafted actions (steering-throttle combinations), while Kendall et al. [3] utilised continuous actions, as is done in this study.

In order to test the reinforcement learning algorithm's ability to address general tasks, multiple action mappings and simplifications of the action space were experimented with. These are described in the following paragraphs.

Wheel velocity: Wheel velocities were a direct output of the policy; a_t = [ω_l, ω_r] = ã_t, therefore ω_l,r ∈ [−1, 1].

Wheel velocity - positive only: Only positive wheel velocities were allowed, because only these were required to move forward. Values predicted outside the ω_l,r ∈ [0, 1] interval were clipped: a_t = [ω_l, ω_r] = clip(ã_t, 0, 1).

Wheel velocity - braking: Wheel velocities were still restricted to the ω_l,r ∈ [0, 1] interval, but the predicted values were interpreted as the amount of braking from the maximum speed. The main differentiating factor from the 'positive only' option was the bias towards moving forward at full speed: a_t = [ω_l, ω_r] = clip(1 − ã_t, 0, 1).

Steering: A scalar value was predicted that was continuously mapped to combinations of wheel velocities. The scalar value 0.0 corresponds to moving straight (at full speed), while −1.0 and 1.0 refer to turning left or right with one wheel completely stopped and the other going at full speed. Intermediate values are computed using linear interpolation between these, and the speed of the robot is always maximal for a particular steering value. The mapping is implemented by a_t = [ω_l, ω_r] = clip([1 + ã_t, 1 − ã_t], 0, 1).
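The four mappings can be summarised by the short NumPy sketch below; each function takes the raw policy output ã_t and returns the wheel-velocity pair [ω_l, ω_r] according to the formulas above.

```python
# Minimal sketch of the four action mappings of Section 2.3.
import numpy as np

def wheel_velocity(a_tilde):
    """a_t = a_tilde: both wheel velocities are used directly, in [-1, 1]."""
    return np.asarray(a_tilde, dtype=float)

def wheel_velocity_positive_only(a_tilde):
    """Only forward rotation is allowed; values outside [0, 1] are clipped."""
    return np.clip(a_tilde, 0.0, 1.0)

def wheel_velocity_braking(a_tilde):
    """Predictions are the amount of braking from full speed (bias towards 1)."""
    return np.clip(1.0 - np.asarray(a_tilde, dtype=float), 0.0, 1.0)

def steering(a_scalar):
    """Scalar steering: 0 -> straight at full speed, -1 / +1 -> pivot left / right."""
    return np.clip(np.array([1.0 + a_scalar, 1.0 - a_scalar]), 0.0, 1.0)
```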
2.4. Reward shaping

The reward function is a fundamental element of every reinforcement learning problem, as it serves the important role of converting a task from a textual description into a mathematical optimisation problem. The primary objective for the agent is to travel as far as possible within a given time in the right-hand lane; therefore, two rewards that promote this behaviour were proposed.

Distance travelled: The agent's reward was directly proportional to the distance it moved along the right-hand lane at each step. Only longitudinal motion was counted, and only if the robot stayed in the right-hand lane.

Orientation: The agent was rewarded if it was facing and moving in the desired orientation, which was determined based on its lateral position. In simple terms, it received the largest reward if it faced towards the centre of the right-hand lane (some example configurations are shown in Figure 2 d). A term proportional to the angular velocity of the faster-moving wheel was also added to encourage fast motion. This reward was calculated as r = λ_Ψ r_Ψ(Ψ, d) + λ_v r_v(ω_l, ω_r), where r_Ψ(·) and r_v(·) are the orientation- and velocity-based components (explained below), while the constants λ_Ψ and λ_v scale these to [−1, 1]. Ψ and d are the orientation and lateral error from the desired trajectory, which is the centreline of the right-hand lane (see Figure 2 a). The orientation-based term was calculated as r_Ψ(Ψ, d) = Λ(Ψ_err) = Λ(Ψ − Ψ_des(d)), where Ψ_des(d) is the desired orientation calculated from the lateral distance to the desired trajectory (see Figure 2 b for an illustration of Ψ_des(d)). The Λ function promotes errors with |Ψ_err| < φ, while an error larger than φ leads to a small negative reward (a plot of Λ(x) is shown in Figure 2 c):

Λ(x) = { 1/2 + 1/2 cos(π x/φ)   if −1 ≤ x/φ ≤ 1
         ε (1 − |x/φ|)          otherwise ,   (4)

where the hyperparameters ε ∈ [10⁻², 10⁻¹] and φ = 50° are selected arbitrarily. The velocity-based component was calculated as r_v(ω_l, ω_r) = max(ω_l, ω_r) to reward equally high-speed motion in both straight and curved sections. In the curved sections, only the outer wheel is able to rotate at maximal speed, while on a straight road, both wheels are.

Figure 2. Explanation of the proposed orientation reward: (a) explains Ψ and d, (b) shows how the desired orientation depends on the lateral error, (c) shows the Λ(x) function and (d) provides some examples of desired configurations.

2.5. Simulation-to-reality transfer

To train the agents, an open-source simulation of the Duckietown environment was used [14]. This simulation models certain physical properties of the real environment accurately (dimensions of the robot, camera parameters, dynamic properties, etc.), but several other effects (textures, objects at the side of the road) and the light simulation are less realistic (e.g. compared with modern computer games). These inaccuracies create a gap between simulation and reality that makes it challenging for any reinforcement learning agent trained only in simulation to operate in reality. To bridge the simulation-to-reality gap and to achieve the generalisation capability required for real performance, domain randomisation was used. This involves training the policy in many different variants of the simulated environment by varying lighting conditions, object textures, camera and vehicle dynamics parameters and road structures (see Figure 3 for examples of domain-randomised observations). In addition to the 'built-in' randomisation options of Gym-Duckietown, this study used a diverse set of maps to train on in order to further improve the agent's generalisation capability.

Figure 3. Examples of domain randomised observations.

2.6. Collision avoidance

Collision avoidance with other vehicles greatly increases the complexity of the lane-following task. These problems can be solved in different ways, for example by overtaking or by following at a safe distance. However, the sensing capability of the vehicle and the complexity of the policy determine the solution it can learn. Images from the forward-facing camera of a Duckiebot only have a 160° horizontal field of view; therefore, the policy controlling the vehicle has no information about objects moving next to or behind the robot. For simplicity, in this study the same convolutional network was used for collision avoidance as for lane following, which does not feature a long short-term memory cell or any other sequence-modelling component (in contrast to [2]). For these reasons, it is unable to plan long manoeuvres, such as overtaking, which also require side vision to check when it is safe to return to the right-hand lane. The policy was therefore trained in situations where there was a slow vehicle ahead, and the agent had to learn to perform lane following at full speed until it had caught up with the vehicle in front, at which point it had to reduce its speed and maintain a safe distance to avoid collision.

In these experiments, the wheel velocity - braking action representation was used as the policy's output because this allowed the agent to slow down or even stop the vehicle if necessary (unlike the steering action). Both the orientation and the distance travelled reward functions were used to train agents for collision avoidance. The former was supplemented with a term that promoted collision avoidance, while the latter was used unchanged. The simulation provided a penalty p_coll if the safety circles around the two vehicles overlapped. The reward component r_coll that promoted collision avoidance was calculated from this penalty. If the penalty decreased because the robot was able to increase its distance from an obstacle, the reward term was proportional to the change in penalty; otherwise, it was 0:

r_coll = { −λ_coll · Δp_coll   if Δp_coll < 0
           0                   otherwise .   (5)

This term was added to the orientation reward, and it aimed to encourage the policy to increase the distance from the vehicle ahead if it got too close. Collisions themselves were only penalised by terminating the episode without giving any negative reward.
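A compact sketch of the reward terms defined in Sections 2.4 and 2.6 is shown below. The Λ function follows equation (4) with φ = 50°; the desired-heading function Ψ_des(d) and the scaling constants λ_Ψ, λ_v and λ_coll are only described qualitatively in this paper, so the concrete values and the linear Ψ_des(d) used here are illustrative placeholders.

```python
# Sketch of the reward terms of Sections 2.4 and 2.6. lambda_fn implements
# Eq. (4); psi_des, lambda_psi, lambda_v and lambda_coll are placeholders.
import numpy as np

PHI = np.deg2rad(50.0)   # angular error range promoted by the reward
EPS = 0.05               # small slope of the negative branch, eps in [1e-2, 1e-1]

def lambda_fn(x):
    """Eq. (4): raised-cosine reward inside +-phi, small negative value outside."""
    if abs(x / PHI) <= 1.0:
        return 0.5 + 0.5 * np.cos(np.pi * x / PHI)
    return EPS * (1.0 - abs(x / PHI))

def orientation_reward(psi, d, omega_l, omega_r,
                       psi_des=lambda d: -0.5 * d,   # illustrative desired-heading curve
                       lambda_psi=1.0, lambda_v=0.5):
    """r = lambda_psi * Lambda(psi - psi_des(d)) + lambda_v * max(omega_l, omega_r)."""
    r_psi = lambda_fn(psi - psi_des(d))
    r_v = max(omega_l, omega_r)
    return lambda_psi * r_psi + lambda_v * r_v

def collision_term(delta_p_coll, lambda_coll=1.0):
    """Eq. (5): reward increasing the distance from an obstacle, otherwise 0."""
    return -lambda_coll * delta_p_coll if delta_p_coll < 0 else 0.0
```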
2.7. Evaluation

To assess the performance of the reinforcement learning-based controller, multiple performance metrics were measured in the simulation and compared against two baselines, one using a classical control theory approach and the other being human driving.

Survival time (t_survive) in s: The time until the robot left the road, or the duration of an evaluation episode.

Distance travelled in ego-lane (s_ego) in m: The distance travelled along the right-hand lane within a fixed time period. Only longitudinal motion was counted; tangential movement therefore counted the most towards this metric.

Distance travelled in both lanes (s_both) in m: The distance travelled along the right-hand lane within a fixed time period, including sections where the agent moved into the oncoming lane.

Lateral deviation (d_d) in m·s: The lateral deviation from the lane's centreline, integrated over the time of an episode.

Orientation deviation (d_Ψ) in rad·s: The deviation of the robot's orientation from the tangent of the lane centreline, integrated over the time of an episode.

Time outside ego-lane (t_out) in s: The time spent outside the ego-lane.

Figure 4. a) Test track used for simulated reinforcement learning and baseline evaluations; b) and c) real and simulated test track used for the evaluation of the simulation-to-reality transfer.

Even though Duckietown is intended to be a standardised platform, it is still under development, and the official evaluation methods and baselines have not been adopted widely in the research community. The AI Driving Olympics provided a great opportunity to benchmark the solution presented here against others; however, the methods behind those solutions have not yet been published in the scientific literature. For this reason, this method was analysed primarily by comparing it with baselines that could be evaluated under the same conditions.

The classical control theory baseline relies on information about the robot's relative location and orientation with respect to the centreline of the lane, which is available in the simulator. This baseline works by controlling the robot to orient itself towards a point ahead on its desired path and calculating wheel velocities using a proportional-derivative (PD) controller based on the orientation error of the robot. The parameters of this controller are hand-tuned to achieve sufficiently good performance, but more advanced control schemes could offer better results.
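A hypothetical sketch of such a PD baseline is given below; the gains, the time step and the mapping from the steering command to wheel velocities are hand-picked placeholders rather than the values actually used for the baseline in Table 1.

```python
# Illustrative sketch of the classical PD baseline of Section 2.7: steer towards
# a look-ahead point on the lane centreline using a PD controller on the
# heading error. Gains and time step are placeholders, not the tuned values.
import numpy as np

class PDLaneFollower:
    def __init__(self, k_p=2.0, k_d=0.5, dt=0.05, max_speed=1.0):
        self.k_p, self.k_d, self.dt, self.max_speed = k_p, k_d, dt, max_speed
        self.prev_error = 0.0

    def control(self, heading_error):
        """heading_error: angle between the robot heading and the direction of a
        look-ahead point on the desired path (available from the simulator)."""
        derivative = (heading_error - self.prev_error) / self.dt
        self.prev_error = heading_error
        steer = self.k_p * heading_error + self.k_d * derivative
        # Map the steering command to differential wheel velocities.
        omega_l = np.clip(self.max_speed - steer, 0.0, 1.0)
        omega_r = np.clip(self.max_speed + steer, 0.0, 1.0)
        return omega_l, omega_r
```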
In many reinforcement learning problems (e.g. the Atari 2600 games [15]), agents are compared to human baselines. Motivated by this benchmark, a method to measure how well humans are able to control Duckiebots was proposed and used as a second baseline. The values shown in Table 1 were recorded by controlling the simulated robot using the arrow keys of a keyboard (therefore via discrete actions), while the observations seen by the human driver were very similar to the observations of the reinforcement learning agent.

2.8. Methods to improve results at the AI Driving Olympics

The agents in this study were trained to solve autonomous driving problems in the Duckietown environment, not to maximise scores at the AI Driving Olympics. Therefore, some hyperparameters and methods had to be modified to match the competition's evaluation procedures. It was found that training at lower frame rates (0.1 s step time) improved the scores even though the evaluation simulation was stepped more frequently. In addition, implementing the same motion blur simulation that was applied in the official evaluation improved the results significantly compared with agents trained on non-blurred observations.

3. RESULTS

3.1. Simulation

Even though multiple papers have demonstrated the feasibility of training vision-based driving policies using reinforcement learning, adapting to a new environment still poses many challenges. Due to the high dimensionality of the image-like observations, many algorithms converge slowly and are very sensitive to hyperparameter selection. The method presented in this study, using proximal policy optimization, is able to converge to good lane-following policies within 1 million timesteps thanks to the good sample efficiency of the algorithm. This training takes 2-2.5 hours on five cores of an Intel Xeon E5-2698 v4 2.2 GHz CPU and an Nvidia Tesla V100 GPU if 16 parallel environments are used.

3.1.1. Comparison against baselines

Table 1 compares the reinforcement learning agent from this study with the baselines. The performance of the trained policy is comparable to that of the classical control theory baseline as well as to how well humans are able to control the robot in the simulation. Most metrics indicate similarly good or equal performance, even though the PD-controller baseline relies on high-level data such as position and orientation error rather than images.

3.1.2. Comparison against other solutions at the AI Driving Olympics

Table 2 shows the top-ranking solutions of the simulated lane-following (validation) challenge at the 5th AI Driving Olympics. All top-performing solutions were able to control the robot reliably in the simulation for the duration of an episode (60 s); however, the distances travelled were different. The method in this study is able to control the robot reliably at the highest speed and therefore achieves the highest distance-travelled value, while also showing good lateral deviation and rarely departing from the ego-lane.

3.1.3. Action representation and reward shaping

Experiments with different action representations show that constrained and preferably biased action spaces allow convergence to good policies (wheel velocity - braking and steering). However, more general action spaces (wheel velocity and its clipped version) only converge to inferior policies within the same number of steps (see Figure 5). The proposed orientation-based reward function also leads to a final performance as good as that of the reward based 'trivially' on the distance travelled; however, the latter seems to perform better with the more general action representations (because policies using these action spaces and trained with the orientation reward do not learn to move fast).
Table 1. Comparison of the reinforcement learning agent with two baselines in simulation.

Mean metrics over 5 episodes          | RL agent | PD baseline | Human baseline
Survival time in s ↑                  | 15       | 15          | 15
Distance travelled both lanes in m ↑  | 7.1      | 7.6         | 7.0
Distance travelled ego-lane in m ↑    | 7.0      | 7.6         | 6.7
Lateral deviation in m·s ↓            | 0.5      | 0.5         | 0.9
Orientation deviation in rad·s ↓      | 1.5      | 1.1         | 2.8

Table 2. Comparison of the method in this study with other solutions at the AI Driving Olympics.

Author                | t_survive in s ↑ | s_ego in m ↑ | d_d in m·s ↓ | t_out in s ↓
A. Kalapos [10], [16] | 60               | 30.38        | 2.65         | 0
A. Béres [16]         | 60               | 29.14        | 4.10         | 1.4
M. Tim [16]           | 60               | 28.52        | 3.45         | 0.4
A. Nikolskaya         | 60               | 24.80        | 3.15         | 1.6
R. Moni [16]          | 60               | 18.60        | 1.78         | 0
Z. Lorincz [16]       | 60               | 18.6         | 3.5          | 0.8
M. Sazanovich         | 60               | 16.12        | 4.35         | 3.4
R. Jean               | 60               | 15.5         | 3.28         | 0
Y. Belousov           | 60               | 14.88        | 5.41         | 9.8
M. Teng               | 60               | 11.78        | 2.92         | 0
P. Almási [7], [16]   | 60               | 11.16        | 1.32         | 0

Figure 5. Learning curves for the reinforcement learning agent with different action representations and reward functions: a) orientation reward, b) distance travelled reward.

3.2. Real-world driving

To measure the quality of the transfer learning process and the performance of the controller in the real world, performance metrics that were easily measurable both in reality and in simulation were selected. These were recorded in both domains in matching experiments and compared against each other. The geometry of the tracks and the dimensions and speed of the robot were simulated accurately in order to evaluate the robustness of the policy against all the inaccurately simulated effects and those that were not simulated at all. Using this method, policies trained in the domain-randomised simulation were tested as well as those trained only in the 'nominal' simulation. This allows for the evaluation of the transfer learning process and highlights the effects of training with domain randomisation. The real and simulated versions of the test track used in this analysis are shown in Figure 4 b and Figure 4 c.

During real evaluations, it was generally found that under ideal circumstances (no distracting objects at the side of the road and good lighting conditions), the policy trained in the 'nominal' simulation was able to drive reasonably well. However, training with domain randomisation led to a more reliable and robust performance in the real world. Table 3 shows the quantitative results of this evaluation. The two policies seemed to perform equally well when compared based on their performance in the simulation. However, metrics recorded in the real environment show that the policy trained with domain randomisation performed almost as well as in the simulation, while the other policy performed noticeably worse. The lower distance travelled ego-lane metric of the domain-randomised policy can be explained by the fact that this vehicle tended to drift into the left-hand lane at sharp turns but returned to the right-hand lane afterwards, while the nominal policy usually made more serious mistakes. Note that in these experiments the orientation-based reward and the steering action representation were used, as this configuration learns to control the robot in the minimum number of steps and the shortest training time.

Table 3. Evaluation results of the reinforcement learning agent in the real environment and in matching simulations.

Eval. domain | Mean metrics over 6 episodes          | Domain rand. policy | Nominal policy
Real         | Survival time in s ↑                  | 54                  | 45
Real         | Distance travelled both lanes in m ↑  | 15.6                | 11.4
Real         | Distance travelled ego-lane in m ↑    | 7.0                 | 8.4
Sim.         | Survival time in s ↑                  | 60                  | 60
Sim.         | Distance travelled in m ↑             | 15.5                | 15.0
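The matched 'nominal' versus domain-randomised setup can be reproduced, in outline, by instantiating the Gym-Duckietown simulator [14] with and without its randomisation flag. The sketch below assumes the Simulator constructor exposes seed, map_name and domain_rand keywords (names may differ between package versions), and the map name is only an example.

```python
# Sketch of the nominal vs. domain-randomised simulation setup, assuming the
# Gym-Duckietown Simulator [14] accepts these keyword arguments.
from gym_duckietown.simulator import Simulator

def make_env(domain_rand: bool, map_name: str = "loop_empty", seed: int = 0):
    """Nominal (domain_rand=False) or domain-randomised (True) simulation."""
    return Simulator(seed=seed, map_name=map_name, domain_rand=domain_rand)

# One policy is trained with domain randomisation, another in the nominal
# simulation; both are then evaluated on the real track and on its accurately
# matched simulated counterpart.
train_env = make_env(domain_rand=True)
nominal_env = make_env(domain_rand=False)
```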
An online video demonstrates the performance of the trained agent from this study: https://youtu.be/kz7YWEmg1Is (Accessed 23 September 2021).

An important limitation of the method presented in this study is that during real evaluations, the speed of the robot had to be decreased to half of the simulated value. The policy evaluations were executed on a PC connected to the robot via wireless LAN; therefore, the observations and the actions were transmitted between the two devices at every step. This introduced delays in the order of 10-100 ms, making the control loop unstable when the robot was moving at full speed. However, at half speed, stable operation was achieved. It was noticed that models trained with motion blur and longer step times for the AI Driving Olympics performed more reliably in the real world regardless of whether they used domain randomisation. However, further analysis and retraining of these agents multiple times is needed to firmly support these presumptions.

3.3. Collision avoidance

Figure 6 demonstrates the learned collision avoidance behaviour. In the first few seconds of the simulation, the robot controlled by the reinforcement learning policy accelerates to full speed. Then, as it approaches the slower, non-learning robot, it reduces its speed and maintains an approximately constant distance from the vehicle ahead (see Figure 6). The simple convolutional network of this policy cannot be expected to learn, plan and execute a more complex behaviour, such as overtaking.

Figure 6. Sequence of robot positions in a collision avoidance experiment with a policy trained using the modified orientation reward: a) t = 0 s (initial positions), b) t = 6 s (catching up), c) t = 8 s and d) t = 24 s (following the vehicle ahead), e) approximate distance between the vehicles. After t = 6 s, the controlled robot follows the vehicle in front at a short but safe distance until the end of the episode (the approximate distance is calculated as the distance between the centre points of the robots minus the length of a robot).

Table 4 shows that training with both reward functions leads to functional lane-following behaviour. However, the non-maximal survival time values indicate that neither of the policies is capable of performing lane following reliably for 60 s in the presence of an obstacle robot. All metrics in Table 4 indicate that the modified orientation reward leads to better lane-following metrics than the simpler distance travelled reward. It should be noted that these metrics were mainly selected to evaluate the lane-following capabilities of an agent; a more in-depth analysis of collision avoidance with a vehicle in front calls for more specific metrics.

Table 4. Evaluation results of policies trained for collision avoidance with different reward functions.

Mean metrics over 15 episodes         | Distance travelled | Orientation + r_coll
Survival time (max. 60) in s ↑        | 46                 | 52
Distance travelled both lanes in m ↑  | 22.5               | 22.9
Distance travelled ego-lane in m ↑    | 22.7               | 23.1
Lateral deviation in m·s ↓            | 1.9                | 1.6
Orientation deviation in rad·s ↓      | 6.3                | 5.8
An online video demonstrates the performance of the agent trained in this study: https://youtu.be/8GqAUvTY1po (Accessed 23 September 2021).

3.4. Salient object maps

Visualising which parts of the input image contribute the most to a particular output (action) is important because it provides some explanation of the network's inner workings. Figure 7 shows salient object maps in different scenarios, generated using the method proposed in [17]. All of these images indicate high activations on the lane markings, which is expected.

Figure 7. Salient objects highlighted on observations in different domains and tasks: a) simulated, b) real, c) collision avoidance. Blue regions represent high activations throughout the network.

4. CONCLUSIONS

This work presented a solution to the problem of complex, vision-based lane following in the Duckietown environment using reinforcement learning to train an end-to-end steering policy capable of simulation-to-real transfer learning. It was found that the training is sensitive to problem formulation, such as the representation of actions. This study has demonstrated that, by using domain randomisation, a moderately detailed and accurate simulation is sufficient for training end-to-end lane-following agents that operate in a real environment. The performance of these agents was evaluated by comparing some basic metrics in matching real and simulated scenarios. Agents were also successfully trained to perform collision avoidance in addition to lane following. Finally, salient object visualisation was used to give an illustrative explanation of the inner workings of the policies in both the real and simulated domains.

ACKNOWLEDGEMENT

We would like to show our gratitude to Professor Bálint Gyires-Tóth (BME, Dept. of Telecommunications and Media Informatics) for his assistance and comments on the progress of our research. The research reported in this paper and carried out at the Budapest University of Technology and Economics was supported by Continental Automotive Hungary Ltd. and the 'TKP2020, Institutional Excellence Programme' of the National Research Development and Innovation Office in the field of Artificial Intelligence (BME IE-MI-SC TKP2020).

REFERENCES

[1] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, Proc. of the International Conference on Machine Learning, New York, United States, 19-24 June 2016, pp. 1928-1937.
[2] M. Jaritz, R. de Charette, M. Toromanoff, E. Perot, F. Nashashibi, End-to-end race driving with deep reinforcement learning, Proc. of the IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21-25 May 2018, pp. 2070-2075.
[3] A. Kendall, J. Hawke, D. Janz, P. Mazur, D. Reda, J. Allen, V. Lam, A. Bewley, A. Shah, Learning to drive in a day, Proc. of the International Conference on Robotics and Automation (ICRA), Montreal, Canada, 20-24 May 2019, pp. 8248-8254.
[4] W. Shi, S. Song, Z. Wang, G. Huang, Self-supervised discovering of causal features: towards interpretable reinforcement learning, 2020. Online [Accessed 3 August 2020] https://arxiv.org/abs/2003.07069
[5] B. Balaji, S. Mallya, S. Genc, S. Gupta, L. Dirac, V. Khare, G. Roy, T. Sun, Y. Tao, B. Townsend, E. Calleja, S. Muralidhara, D. Karuppasamy, DeepRacer: educational autonomous racing platform for experimentation with Sim2Real reinforcement learning, 2019. Online [Accessed 13 April 2020] https://arxiv.org/abs/1911.01562
[6] M. Szemenyei, P. Reizinger, Attention-based curiosity in multi-agent reinforcement learning environments, Proc. of the International Conference on Control, Artificial Intelligence, Robotics & Optimization (ICCAIRO), Majorca Island, Spain, 3-5 May 2019, pp. 176-181.
[7] P. Almási, R. Moni, B. Gyires-Tóth, Robust reinforcement learning-based autonomous driving agent for simulation and real world, Proc. of the International Joint Conference on Neural Networks (IJCNN), Glasgow, United Kingdom, 19-24 July 2020, pp. 1-8.
[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing Atari with deep reinforcement learning, 2013. Online [Accessed 13 April 2020] https://arxiv.org/abs/1312.5602
[9] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, 2017. Online [Accessed 2 December 2019] https://arxiv.org/abs/1707.06347
[10] A. Kalapos, C. Gór, R. Moni, I. Harmati, Sim-to-real reinforcement learning applied to end-to-end vehicle control, Proc. of the 23rd International Symposium on Measurement and Control in Robotics (ISMCR), Budapest, Hungary, 15-17 October 2020, pp. 1-6.
[11] J. Zilly, J. Tani, B. Considine, B. Mehta, A. F. Daniele, M. Diaz, G. Bernasconi, C. Ruch, J. Hakenberg, F. Golemo, A. K. Bowser, M. R. Walter, R. Hristov, S. Mallya, E. Frazzoli, A. Censi, L. Paull, The AI Driving Olympics at NeurIPS, 2018. Online [Accessed 13 April 2020] https://arxiv.org/abs/1903.02503
[12] J. Schulman, P. Moritz, S. Levine, M. Jordan, P. Abbeel, High-dimensional continuous control using generalized advantage estimation, Proc. of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2-4 May 2016, 14 pp. Online [Accessed 23 September 2021] http://arxiv.org/abs/1506.02438
[13] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. Gonzalez, M. Jordan, I. Stoica, RLlib: abstractions for distributed reinforcement learning, Proc. of the International Conference on Machine Learning, Stockholm, Sweden, 10-15 July 2018, pp. 3053-3062.
[14] M. Chevalier-Boisvert, F. Golemo, Y. Cao, B. Mehta, L. Paull, Duckietown environments for OpenAI gym, 2018. Online [Accessed 15 January 2021] https://github.com/duckietown/gym-duckietown
[15] M. G. Bellemare, Y. Naddaf, J. Veness, M. Bowling, The arcade learning environment: an evaluation platform for general agents, J. Artif. Intell. Res. 47 (2013), pp. 253-279. DOI: 10.1613/jair.3912
[16] R. Moni, A. Kalapos, A. Béres, M. Tim, P. Almási, Z. Lőrincz, PIA project achievements at AIDO5, 2020. Online [Accessed 15 January 2021] https://medium.com/@SmartLabAI/pia-project-achievements-at-aido5-a441a24484ef
[17] M. Bojarski, P. Yeres, A. Choromanska, K. Choromanski, B. Firner, L. D. Jackel, U. Muller, Explaining how a deep neural network trained with end-to-end learning steers a car, 2017. Online [Accessed 15 April 2020] https://arxiv.org/abs/1704.07911

APPENDIX

Proximal policy optimization

The pseudo code for proximal policy optimization (PPO) is as follows:

Algorithm: PPO, Actor-Critic Style (based on [9])
Input: initial policy with parameters θ_0 and initial value function estimator with parameters φ_0
for iteration = 1, 2, ... do
    for actor = 1, 2, ..., N do
        Run π_θold in the environment for T timesteps to collect trajectory τ_i
        Compute advantage estimates Â_1, ..., Â_T based on the current value function
    end
    Optimise L^CLIP(θ) + L^KLPEN(θ) with respect to θ, for K epochs and minibatch size M ≤ NT
    Fit the value function estimate by regression on the mean-squared error
    θ_old ← θ, φ_old ← φ
end

The adaptive parameter β mentioned in Section 2.1 is updated according to the following rule:

β ← { β / 2   if d < d_targ / 1.5
      β × 2   if d > d_targ × 1.5 ,   (6)

where d_targ is a hyperparameter and d is the KL-divergence between the old and the updated policy:

d = Ê[ KL[π_θold(·|s_t), π_θ(·|s_t)] ].   (7)

The generalised advantage estimate Â_t [12] is calculated as

Â_t = Σ_{l=0}^{∞} (γλ)^l δ^V_{t+l} ,   (8)

δ^V_t = r_t + γ V(s_{t+1}) − V(s_t) ,   (9)

where V(s_t) and V(s_{t+1}) are the value function estimates calculated by the value network at steps t and t+1; γ is the discount factor, while λ is a hyperparameter of the generalised advantage estimate. To ensure reproducibility, the hyperparameters of the algorithm are provided in Table 5.

Table 5. Hyperparameters of the algorithm. The description of some parameters is from the RLlib documentation [13].

Description                                          | Value
Number of parallel environments                      | N = 16
Learning rate                                        | α = 5 × 10⁻⁵
Discount factor for return calculation               | γ = 0.99
λ parameter for the generalised advantage estimate   | λ = 0.95
PPO clip parameter                                   | ε = 0.2
Sample batch size                                    | T = 256
SGD minibatch size                                   | M = 128
Number of epochs executed in every iteration         | K = 30
Target KL-divergence for the calculation of β        | d_targ = 0.01
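A short sketch of how the generalised advantage estimate of equations (8) and (9) can be computed over a finite trajectory is given below; the infinite sum is truncated at the end of the trajectory, which corresponds to the common recursive implementation.

```python
# Sketch of the generalised advantage estimate of Eqs. (8)-(9), computed
# backwards over a finite trajectory (the infinite sum is truncated at T).
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """rewards has length T, values has length T + 1 (bootstrap value at the end);
    returns the advantage estimates A_hat for every step of the trajectory."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae_acc = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # Eq. (9)
        gae_acc = delta + gamma * lam * gae_acc                  # Eq. (8), recursive form
        advantages[t] = gae_acc
    return advantages
```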