ACTA IMEKO, ISSN: 2221-870X, September 2021, Volume 10, Number 3, 28 - 35

A2CM: a new multi-agent algorithm

Gabor Paczolay 1, Istvan Harmati 1
1 Budapest University of Technology and Economics, Magyar Tudósok Körútja 2, 1117 Budapest, Hungary

Section: RESEARCH PAPER

Keywords: Reinforcement learning, multiagent learning

Citation: Gabor Paczolay, Istvan Harmati, A2CM: a new multi-agent algorithm, Acta IMEKO, vol. 10, no. 3, article 6, September 2021, identifier: IMEKO-ACTA-10 (2021)-03-06

Section Editor: Bálint Kiss, Budapest University of Technology and Economics, Hungary

Received January 15, 2021; In final form August 13, 2021; Published September 2021

Copyright: This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Corresponding author: Gabor Paczolay, e-mail: paczolay.gabor@gmail.com

ABSTRACT
Reinforcement learning is currently one of the most researched fields of artificial intelligence. New algorithms are being developed that use neural networks to compute the selected action, especially for deep reinforcement learning. One subcategory of reinforcement learning is multi-agent reinforcement learning, in which multiple agents are present in the world. As it involves the simulation of an environment, it can be applied to robotics as well. In our paper, we use our modified version of the advantage actor–critic (A2C) algorithm, which is suitable for multi-agent scenarios. We test this modified algorithm on our testbed, a cooperative–competitive pursuit–evasion environment, and later we address the problem of collision avoidance.

1. INTRODUCTION

Reinforcement learning is one of the most researched fields within the scope of artificial intelligence. Newer algorithms are continually being developed to achieve successful learning in more situations or with fewer samples. In reinforcement learning, a new challenge arises when we take other agents into consideration. This research field is called 'multi-agent learning'. Dealing with other agents – whether they are cooperative, competitive or a mixture of both – brings the learning model closer to a real-world scenario. In real life, no agent acts alone; even random counteractions can be treated as 'counteractions of nature'. In our work, we optimised the synchronous actor–critic algorithm to perform better in cooperative multi-agent scenarios (those in which agents help each other).

Littman [1] utilised the minimax-Q algorithm, a zero-sum multiagent reinforcement learning algorithm, and applied it to a simplified version of a robotic soccer game. Hu and Wellman [2] created the Nash-Q algorithm and used it on a small gridworld example to demonstrate the results. Bowling [3] varied the learning rate of the training process to speed it up while ensuring convergence. Later, he applied the win or learn fast methodology to an actor–critic algorithm to improve its multi-agent capabilities [4].

Reinforcement learning advanced significantly when neural networks gained popularity and convergence was improved. Mnih et al. [5] successfully applied deep reinforcement learning to playing Atari games by feeding multiple frames at once and utilising experience replay to ensure convergence. Later, deep reinforcement learning was applied to multi-agent systems, such as independent multi-agent reinforcement learning. Foerster et al. [6] stabilised experience replay for independent Q-learning using fingerprints. Omidshafiei et al. [7] utilised decentralised hysteretic deep recurrent Q-networks for partially observable multi-task multi-agent reinforcement learning problems.
Multiple advancements have also been made in the field of centralised learning and decentralised execution. Foerster et al. [8] created counterfactual multi-agent policy gradients to solve the issue of multi-agent credit assignment. Peng et al. [9] created multiagent bidirectionally-coordinated nets with an actor–critic hierarchy and recurrent neural networks for communication. Sunehag et al. [10] utilised value-decomposition networks with common rewards and Q-function decomposition. Rashid et al. [11] utilised QMIX with value function factorisation, Q-function decomposition and a feed-forward neural network, achieving better performance than the aforementioned value-decomposition networks. Lowe et al. [12] improved the deep deterministic policy gradient by altering the critic to contain all actions of all agents, thus making the algorithm capable of processing more multi-agent scenarios. Shihui et al. [13] improved upon the previous MADDPG algorithm, increasing its performance in zero-sum competitive scenarios by utilising a method based on minimax-Q learning. Casgrain et al. [14] upgraded the deep Q-network algorithm utilising methods based on Nash equilibria, making it capable of solving multi-agent environments.

Benchmarks have also been created to analyse the performance of various algorithms in multi-agent environments. Vinyals et al. [15] modified the StarCraft II game to make it a learning environment. Samvelyan et al. [16] also pointed to StarCraft as a multi-agent benchmark but with a focus on micromanagement. Liu et al. [17] introduced a multi-agent soccer environment with continuous simulated physics. Bard et al. [18] reached a new frontier with the cooperative Hanabi game benchmark.

Cooperative multi-agent reinforcement learning and the proposed algorithm are usable in many scenarios in robotics. As our algorithm is decentralised, it can be installed into the robots themselves without any central command centre. It might be useful in exploration or localisation tasks in which the use of multiple agents would significantly speed up the process. Our testbed can be considered a simplified version of a localisation task, as the pursuer robots are trying to approach and measure a non-cooperative moving object. For proper use in robotics, a well-prepared simulation of the robots and the environment is required, in which thousands of episodes can be run for learning.

In our work, we modified the already existing advantage actor–critic (A2C) algorithm to make it better suited for multi-agent scenarios by creating a single-critic version of the algorithm. Then, we tested this modified A2CM algorithm on our cooperative–competitive pursuit–evasion testbed.

In the following section, we explain the theoretical background for our work. Then, the experiments themselves and the testbed are introduced. We continue by presenting the results and end with our conclusions and suggestions for future work on the topic.

2. THEORETICAL BACKGROUND
2.1. Markov decision processes

A Markov decision process is a mathematical framework for modelling decision making, as shown in Figure 1. In a Markov decision process there are states, selectable actions, transition probabilities and rewards [1]. At each timestep, the process starts in a state $s$ and selects an action $a$ from the available action space. It receives a corresponding reward $r$ and then finds itself in a state $s'$ given by the probability $P(s, s')$. A process is said to be Markovian if

$P(a_t = a \mid s_t, a_{t-1}, \ldots, s_0, a_0) = P(a_t = a \mid s_t)$ , (1)

which means that a state transition is based only on the previous state and the current action. Thus, only the last state and action are considered when deciding on the next state.

Figure 1. Markov decision process.

In a Markov decision process, the agents are trying to find a policy that maximises the sum of discounted expected rewards. The standard solution for this uses an iterative search method that searches for a fixed point of the Bellman equation:

$v(s, \pi^*) = \max_a \left( r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v(s', \pi^*) \right)$ . (2)

2.2. Reinforcement learning

When the state transition probabilities or the rewards are unknown, the problem of the Markov decision process becomes a problem of reinforcement learning. In this group of problems, the agent tries to make a model of the world around itself via trial and error.

Figure 2. The simulation environment. The squares represent the controlled agents, while the circle represents the fleeing enemy. The goal is to catch the enemy by moving horizontally or vertically.

Algorithm 1: A2CM.
  Initialise Model: initialise N+1 hidden and N+1 output layers (1 value + N action), i.e. 4 different networks in one model, 1 critic + 3 actors
  Input: number of updates, batch size
  for number of updates do
    for batch size do
      Calculate the next actions a based on the previous state
      Take actions a, get the terminal-state booleans and the new rewards
      Store the actions, the terminal-state booleans, the calculated values, the rewards and the states
    end for
    Calculate the returns based on (13)
    Calculate the advantages based on (12)
    Update the critic neural network based on the observed states and the corresponding returns: the loss is the mean squared error between the returns and the calculated values
    Update the actor neural networks based on the observed states, the taken actions and the advantages: the loss is the policy loss (weighted sparse categorical cross-entropy) minus the entropy loss (cross-entropy of the policy over itself)
  end for

One type of reinforcement learning is value-based reinforcement learning. In this case, the agent tries to learn a value function that renders a value to states or to actions from states. These values correspond to a reward achieved by reaching a state or taking a specific action from a state. The most commonly used type of value-based reinforcement learning is Q-learning [2], in which the so-called Q-values are estimated for each of the state–action pairs of the world. These Q-values represent the value of choosing a specific action in a state, meaning the highest reward the agent could possibly get by taking that action. The equation for updating the Q-values is:

$Q(s, a) \leftarrow (1 - \alpha) \cdot Q(s, a) + \alpha \cdot \left( r + \gamma \cdot \max_{a'} Q(s', a') \right)$ , (3)

where $\alpha$ is the learning rate and $\gamma$ is the discount for the reward. The agent always selects an action that maximises the Q-function for the state that the agent is in.
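As a concrete illustration of the update rule in (3), the following Python sketch performs tabular Q-learning updates for a small discrete problem. The state/action space sizes and the hyperparameter values are illustrative placeholders, not the settings used in the paper.

import numpy as np

# Tabular Q-learning, equation (3); sizes and hyperparameters are placeholders.
n_states, n_actions = 25, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99  # learning rate and reward discount

def q_update(s, a, r, s_next):
    # Blend the old estimate with the bootstrapped target r + gamma * max_a' Q(s', a').
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

def greedy_action(s):
    # The agent selects the action that maximises the Q-function in its current state.
    return int(np.argmax(Q[s]))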
Another type of reinforcement learning is policy-based reinforcement learning. In this case, actions are derived as a function of the state itself. The most common policy-based reinforcement learning method is the policy gradient approach [19]. Here, the agent tries to maximise the expected reward by following the policy $\pi_\theta$ parametrised by $\theta$, based on the total reward for a given trajectory $r(\tau)$. Thus, the cost function of the parameters $\theta$ is the following:

$J(\theta) = E_{\pi_\theta}[r(\tau)]$ . (4)

The parameters are then tuned based on the gradient of the cost function:

$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)$ . (5)

The advantages of policy-based methods include the ability to map environments with huge or even continuous action spaces and to solve environments with stochasticity. However, when using these methods, there is also a much greater possibility of getting stuck in a local maximum rather than following the optimal policy.

Apart from the aforementioned model-free reinforcement learning methods, there is also model-based reinforcement learning. In this case, a model is built (or just tuned) to perform the reinforcement learning. This is more sample-efficient than model-free methods and thus requires fewer samples to perform equally well, but it is very dependent on the particular model. It can be combined with model-free methods to achieve better results, as in [20].

2.3. Multi-agent systems and Markov games

A matrix game is a stochastic framework in which each player selects an action and gets an immediate reward based on their action and those of the other agents [1]. They are called 'matrix games' because the game can be written as a matrix, with the first two players selecting actions in the rows and columns of the matrix. Unlike Markov decision processes, these games have no states.

Markov games, or stochastic games, are extensions of Markov decision processes with multiple agents. They can also be thought of as extensions of matrix games with multiple states. In a Markov game, each state has its own payoff matrix for the agents. The next state is determined by the joint actions of the agents. A game is Markovian if

$P(a_i^t = a_i \mid s^t, a_i^{t-1}, \ldots, s^0, a_i^0) = P(a_i^t = a_i \mid s^t)$ , (6)

so the next state depends only on the current state and the current actions taken by all agents.

2.4. Deep reinforcement learning

A reinforcement learning algorithm is called 'deep' if it is assisted by a neural network.

Figure 3. An example of catching the randomly moving opponent.

Figure 4. An example of catching the fleeing opponent.

A neural network is a function approximator built from (sometimes billions of) artificial neurons. An artificial neuron, which is based on the real neurons of the brain, has the following equation:

$y = \mathrm{Act}\left( \sum w x + b \right)$ , (7)

where $x$ is the input vector, $w$ is the weight vector, $b$ is the bias and Act() is the activation function that introduces nonlinearity into an otherwise linear system. The parameters ($w$ and $b$) are tuned with backpropagation, propagating the partial derivatives of the error from the final output back towards the input. The selection of the activation function is important in deep learning because of vanishing gradients: when many layers are stacked upon each other, the gradients reaching the layers far from the output become very small during backpropagation, and thus those layers are difficult to train. A basic activation function is the sigmoid or logistic activation function:

$y = \frac{1}{1 + \mathrm{e}^{-x}}$ . (8)
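To make (7) and (8) concrete, a single artificial neuron can be written in a few lines of NumPy; the weight, bias and input values below are arbitrary and serve only as an example.

import numpy as np

def sigmoid(x):
    # Logistic activation of equation (8).
    return 1.0 / (1.0 + np.exp(-x))

def neuron(x, w, b, act=sigmoid):
    # Artificial neuron of equation (7): weighted sum plus bias, passed through an activation.
    return act(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # input vector
w = np.array([0.1, 0.4, -0.3])   # weight vector
print(neuron(x, w, b=0.2))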
A common activation function in deep learning is the rectified linear unit (ReLU) [21], whose gradients vanish less and which is therefore easier to train. It has the following equation:

$y = x$ if $x > 0$, $\quad y = 0$ if $x \le 0$ . (9)

For multi-class classification, another activation function is used: the softmax activation function. When used as the last layer, the probabilities of all of the output neurons add up to exactly 1. Thus, in reinforcement learning, it is useful as the probability distribution over the possible actions. It has the following equation:

$y_i = \frac{\mathrm{e}^{x_i}}{\sum_j \mathrm{e}^{x_j}}$ . (10)

Deep reinforcement learning algorithms have several advantages compared to traditional reinforcement learning algorithms. First of all, they are not based on a state table, as the states are approximated (which is much more robust than using linear function approximators). This allows many more states to be mapped and even allows for continuous states. However, deep methods are more prone to diverging, and thus many optimisations have been introduced into deep reinforcement learning algorithms to provide better convergence.

2.5. Actor–critic

An actor–critic system combines value-based and policy-based reinforcement learning. In these systems, there are two distinct parametrised networks: the critic, which estimates a value function (as in value-based reinforcement learning), and the actor, which updates the policy network in the direction suggested by the critic (as in policy-based reinforcement learning). Actor–critic algorithms follow an approximate policy gradient:

$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a) \right]$, $\quad \Delta\theta = \alpha \nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a)$ . (11)

Approximating the policy gradient introduces bias into the system. A biased policy gradient may not find the right solution, but if we choose the value function approximation carefully, we can avoid introducing any bias. Actor–critic systems generally perform better than regular reinforcement learning algorithms. The critic network ensures that the system does not get stuck in a local maximum, while the actor network enables the mapping of environments with huge action spaces and provides better convergence [19].

2.6. The A2C algorithm

A2C stands for (synchronous) advantage actor–critic. It is a one-environment-at-a-time derivation of the asynchronous advantage actor–critic (A3C) algorithm [22], which processes multiple agent–environments simultaneously. In that algorithm, multiple workers update a global value function, thus exploring the state space effectively. However, the synchronous advantage actor–critic provides better performance than the asynchronous model.

Figure 5. The performance of the original A2C algorithm on our benchmark.

Figure 6. Performance of the modified A2C algorithm on our benchmark.

Figure 7. Performance of the original A2C algorithm on our benchmark with collision (with terminating at collision).

The advantage function significantly reduces the variance of the policy gradient by subtracting a baseline from the cumulative reward, producing smaller gradients; thus, it provides much better convergence than regular Q-values. It has the following equation:

$A(s, a) = Q(s, a) - V(s)$ . (12)

Returns are calculated using the equation

$G_t = r_t + \gamma\, G_{t+1} \left( 1 - T_t \right)$ , (13)

where $G_t$ is the return, $r_t$ is the reward at time $t$, $\gamma$ is the discount factor and $T_t$ indicates whether the step at time $t$ is a terminal state.
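The return recursion (13) and the advantage (12) can be evaluated over a collected batch with the short sketch below. It assumes the batch is stored in NumPy arrays and that the critic's value of the state following the batch is used to bootstrap the final return, which is one common way to close the recursion; the function name and signature are ours.

import numpy as np

def returns_and_advantages(rewards, terminals, values, last_value, gamma=0.99):
    # Backward recursion of equation (13): G_t = r_t + gamma * G_{t+1} * (1 - T_t),
    # bootstrapped from the value of the state observed after the batch.
    returns = np.zeros_like(rewards, dtype=np.float64)
    running = last_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running * (1.0 - terminals[t])
        returns[t] = running
    # Equation (12): A(s, a) = Q(s, a) - V(s), with the return standing in for Q(s, a).
    advantages = returns - values
    return returns, advantages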
3. EXPERIMENTS AND RESULTS

The testbed is a 5 × 5 grid with three cooperating agents (the squares) in three of the four corners of the environment. In the middle, there is a fourth agent (the circle). The former three agents have the objective of catching the fourth agent, which moves randomly. This testbed is analogous to pursuit–evasion (or predator–prey) scenarios that are also significant in robotics. The agents can move in four directions: up, down, left or right. When one of the three agents catches the fourth one, the episode ends. A penalty is given to the cooperative agents at every timestep; thus, the return of an episode is maximised by ending the episode as soon as possible (i.e. catching the fleeing agent as quickly as possible). Each episode must end within 1,000 timesteps to avoid getting stuck.

In the modification of the A2C algorithm, we followed the theory of centralised learning and decentralised execution. This means that the execution is decentralised, but the learning phase can be assisted by additional information from other agents. In our case, we used the information that the agents are cooperative and thus acquire the same rewards (and returns). As noted before, decentralised execution is most helpful in real-world scenarios in which communication difficulties make a centralised task-solving architecture impossible. Such scenarios are often encountered in robotics.

In our experiment, the many A2C models with one actor and one critic each were replaced by one model with one critic and multiple actors. The pseudocode of the algorithm can be seen in Algorithm 1. The neural networks were implemented as subclasses of the TensorFlow model class, which provides useful functions for training and prediction – even for batch tasks – when only the forward steps of the network are provided. The optimiser was RMSprop, with a learning rate of 7 · 10−3.

The value-estimator critic contained a hidden layer of 128 units with the ReLU activation function and an output layer with one unit. Its loss function was a simple mean squared error between the returns and the value. The actors contained a hidden layer with 128 hidden units and an output layer with four units (the number of actions in the action space). Their loss function contained two distinct parts: a policy loss and an entropy loss. The policy loss was a weighted sparse categorical cross-entropy loss, where the weights were given by the advantages. This method increased the convergence of the algorithm. The entropy loss increases exploration by encouraging actions other than the currently preferred, possibly only locally optimal one. This is very important for tasks with sparse rewards, in which the agent does not receive feedback often. This loss was calculated as a cross-entropy of the policy over itself, and it was subtracted from the policy loss because it should be maximised, not minimised. The entropy loss was weighted by a constant, which was taken as 1 · 10−4.

Episode rewards were stored in a list where a value of 0 was appended to the end of the list at each episode's end. During the episodes, only the last value of the list was incremented by the episode reward of the given step.
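The sketch below shows one way the networks and losses described above could look in TensorFlow/Keras. It uses separate Model subclasses for the critic and the actors instead of the single four-network model of Algorithm 1, and all class, function and variable names are our own, so it should be read as an approximation of the described setup rather than as the authors' code.

import tensorflow as tf
import tensorflow.keras.losses as kls

class Critic(tf.keras.Model):
    # Value estimator: one 128-unit ReLU hidden layer and a single-unit output.
    def __init__(self):
        super().__init__()
        self.hidden = tf.keras.layers.Dense(128, activation='relu')
        self.value = tf.keras.layers.Dense(1)

    def call(self, states):
        return self.value(self.hidden(states))

class Actor(tf.keras.Model):
    # Policy head: one 128-unit ReLU hidden layer and four action logits.
    def __init__(self, n_actions=4):
        super().__init__()
        self.hidden = tf.keras.layers.Dense(128, activation='relu')
        self.logits = tf.keras.layers.Dense(n_actions)

    def call(self, states):
        return self.logits(self.hidden(states))

ENTROPY_COEF = 1e-4  # constant weighting the entropy loss

def actor_loss(actions_and_advantages, logits):
    actions, advantages = tf.split(actions_and_advantages, 2, axis=-1)
    # Policy loss: sparse categorical cross-entropy weighted by the advantages.
    ce = kls.SparseCategoricalCrossentropy(from_logits=True)
    policy_loss = ce(tf.cast(actions, tf.int32), logits, sample_weight=advantages)
    # Entropy loss: cross-entropy of the policy over itself, subtracted to reward exploration.
    probs = tf.nn.softmax(logits)
    entropy_loss = kls.categorical_crossentropy(probs, probs)
    return policy_loss - ENTROPY_COEF * tf.reduce_mean(entropy_loss)

critic = Critic()
actors = [Actor() for _ in range(3)]  # one actor per cooperative agent
critic.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=7e-3),
               loss='mean_squared_error')
for actor in actors:
    actor.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=7e-3),
                  loss=actor_loss)

With the models compiled this way, one update of the batch loop reduces to critic.train_on_batch(states, returns) and, for each agent, actors[i].train_on_batch(states, y_i), where y_i is a two-column array holding that agent's taken actions and the advantages; this matches the remark below that no gradients had to be computed by hand thanks to the Keras API.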
For the training, a batch-sized container was created for the actions, rewards, terminal-state booleans, state values and observed states. Then, a two-level loop was started: the outer one was run for the number of required updates (set by us), while the inner loop was run as many times as the batch size. The state observations, the taken actions (which were sampled from a probability distribution based on the actor neural network outputs), the state values, the rewards, the terminal-state booleans and the last observed state were stored in the aforementioned containers. Next, the returns and advantages were calculated on the batch using the collected data, and then a batch training step was performed on those data. There was no need to calculate the gradients themselves due to the use of the Keras API.

Figure 8. Performance of the A2CM algorithm on our benchmark with collision (with terminating at collision).

Figure 9. Number of steps per episode of the original A2C algorithm on our benchmark with collision (without terminating at collision).

Figure 10. Number of steps per episode of the A2CM algorithm on our benchmark with collision (without terminating at collision).

During our experiment, the system was run for 5,000 updates with a batch size of 128, for a total of 640,000 environment steps. Gamma was taken to be 0.99. Figure 3 and Figure 4 show the ends of some remarkable episodes of catching the opponent. Figure 5 and Figure 6 show the results of our experiments. It is important to note the x-coordinates in Figure 5 and Figure 6: for the same number of steps, the original algorithm ran for 40,340 episodes, while the modified algorithm managed to complete 82,119 episodes. This means that the A2CM algorithm spent half as many steps per episode and was able to catch the fleeing opponent in, on average, half of the time required by the agent based on the original algorithm. These figures also show that the original algorithm did not find an optimal solution without diverging later, and even between divergences, the solutions were not as stable. Our agent, on the other hand, found a solution without diverging later, with only small divergences after the first half of the episodes. The A2CM algorithm found a solution with which it can catch the opponent in 6 steps, and it maintained this knowledge for 20,000 episodes, with one positive spike where it found the solution to the problem in just 3 steps.

The run times are worth considering as well. The regular A2C algorithm took 14,567.45 seconds to run, while the modified one ran for 14,458.28 seconds. It is worth noting that, because almost twice as many episodes were completed, the environment had to be reset twice as often, so the modified algorithm is even faster than the normal one.

Later, the difference between the algorithms was tested with collision turned on, bringing the problem set even closer to real-world robotics scenarios. In this case, the agents received a penalty if they collided with each other. This makes the environment much harder to learn, as failures (collisions) are most likely to occur precisely while chasing the enemy agent. It also makes the training process harder, as the steps leading to success are not as easy to determine; a collision that occurs before the enemy is caught will make similar attempts less likely to be selected as actions.

When considering the training process of the environment with collision detection turned on, it is important to pay attention to the ratio between the negative reward for each step and the negative reward for a collision. The larger the penalty for collision, the better the agents will evade collisions; otherwise, they will simply be optimised to finish the episode as fast as possible. For this reason, the negative reward for each step was selected as −1,000 and the negative reward for a collision as −150,000,000, providing a ratio that is large enough to encourage the agents to follow a collision-evasion policy.
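As a hypothetical illustration of this reward shaping, the per-step reward logic of the collision variants of the testbed could be written as follows; the helper and its signature are ours, and only the two reward constants come from the text above.

STEP_PENALTY = -1_000
COLLISION_PENALTY = -150_000_000

def step_reward(caught, collided, terminate_on_collision):
    # Reward and termination flag for one timestep of the pursuit grid,
    # assuming the reward values reported above.
    reward = STEP_PENALTY
    if collided:
        reward += COLLISION_PENALTY
    done = caught or (collided and terminate_on_collision)
    return reward, done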
In the first experiment on the environment with collision detection, we set up the algorithms such that a collision would terminate the episode. This scenario is analogous to certain scenarios in robotics in which collisions can cause malfunctions in the robots themselves and should therefore be avoided even via high-level control. Apart from turning on the collision, all other conditions and parameters of the training process were the same. Figure 7 and Figure 8 show the cumulated rewards per episode for the original A2C and our A2CM algorithm, respectively. It can be seen that, while neither was able to solve the environment over the timespan of the training, there was a time span of ca. 700 episodes in which our algorithm was able to catch the enemy without colliding. The original algorithm lacked any such longer periods. The training of the original algorithm in this case took 14,173.42 seconds, while the training of the A2CM took 14,659.00 seconds. It is worth noting that the original algorithm completed 1,665 episodes and the A2CM completed 3,723; the different numbers of reinitialisations should be considered when comparing the training times.

To make the environment easier to train on, the second experiment with collisions was conducted such that the episodes only terminated if the opponent was caught. This way, the episodes were longer and always terminated successfully, and they therefore might provide better training information than the setting of the previous experiment. This scenario is analogous to problems in robotics in which the presence of two robots in the same area is discouraged, such as area-scanning scenarios or sub-tasks in which two robots should not scan the same area at once. Just as in the previous experiment, all other parameters were left as they were in the training of the system without collision.

Figure 9 and Figure 10 show the number of steps required to finish each episode for the original A2C and the modified A2CM algorithms, respectively, while Figure 11 and Figure 12 show the cumulated (negative) rewards per episode (higher is better) for the A2C and the A2CM algorithms, respectively. It can be seen that, while the original A2C algorithm did not show any clear sign of successful training, there is some indication of success for the A2CM algorithm. Approaching the end of the training process, the number of steps was kept low, and, as per Figure 12, collisions were also evaded, with the exception of some episodes. The original algorithm completed 1,177 episodes, while the modified one completed 1,964, which can also be seen as a sign of the superiority of the A2CM algorithm. Regarding the training times, the original algorithm was trained for 13,981.22 seconds, while the modified one was trained for 19,519.85 seconds. In this case, it is clear that our algorithm used significantly more training time.

Figure 11. Rewards per episode of the A2CM algorithm on our benchmark with collision (without terminating at collision).

Figure 12. Rewards per episode of the original A2C algorithm on our benchmark with collision (without terminating at collision).
4. CONCLUSION

Looking at the previous section, we can conclude that our modification of the original A2C algorithm, the A2CM algorithm, was able to perform much better than the original on our testbed without collision. To some extent, it outperformed the original A2C algorithm even in environments with collision; thus, it is recommendable for tasks in robotics. However, the algorithm has the caveat of being usable only when the agents are fully cooperative and do not have special, predefined roles.

There are still many ways to improve upon the current state of our algorithm. One possible improvement would be to introduce a variable learning rate, such as win or learn fast [3], into a deep reinforcement learning algorithm. Another possible improvement is to include the fleeing agent in the algorithm so that the algorithm encompasses the full cooperative–competitive nature of the environment. In addition, other activation functions could be tried to check their behaviour; for example, exponential linear units [23] might provide better convergence at the price of slightly more training time. The algorithm could also be extended using recurrent neural networks so that it could handle partially observable Markov decision processes in which the full state is unknown.

ACKNOWLEDGEMENT

The research reported in this paper and carried out at the Budapest University of Technology and Economics was supported by the TKP2020, Institutional Excellence Program of the National Research Development and Innovation Office in the field of Artificial Intelligence (BME IE-MI-SC TKP2020). The research was supported by the EFOP-3.6.2-16-2016-00014 project, which was financed by the Hungarian Ministry of Human Capacities.

REFERENCES

[1] M. L. Littman, Markov games as a framework for multi-agent reinforcement learning, Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, USA, 10 – 13 July 1994, pp. 157-163. DOI: 10.1016/b978-1-55860-335-6.50027-1

[2] J. Hu, M. Wellman, Nash Q-learning for general-sum stochastic games, Journal of Machine Learning Research 4 (2003), pp. 1039-1069. Online [Accessed 6 September 2021] https://www.jmlr.org/papers/volume4/hu03a/hu03a.pdf

[3] M. Bowling, M. Veloso, Multiagent learning using a variable learning rate, Artificial Intelligence 136 (2002), pp. 215-250. DOI: 10.1016/S0004-3702(02)00121-2

[4] M. H. Bowling, M. M. Veloso, Simultaneous adversarial multi-robot learning, IJCAI (2003) pp. 699-704. DOI: 10.5555/1630659.1630761

[5] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing Atari with deep reinforcement learning, arXiv (2013), 9 pp. Online [Accessed 14 September 2021] https://arxiv.org/abs/1312.5602

[6] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. S. Torr, P. Kohli, S. Whiteson, Stabilising experience replay for deep multi-agent reinforcement learning, PMLR 70 (2017) pp. 1146-1155. DOI: 10.5555/3305381.3305500

[7] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, J. Vian, Deep decentralized multi-task multi-agent reinforcement learning under partial observability, PMLR 70 (2017) pp. 2681-2690. DOI: 10.5555/3305890.3305958

[8] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, S. Whiteson, Counterfactual multi-agent policy gradients, Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, USA, 2 – 7 February 2018, arXiv (2017), 12 pp. Online [Accessed 14 September 2021] https://arxiv.org/abs/1705.08926
[9] P. Peng, Y. Wen, Y. Yang, Q. Yuan, Z. Tang, H. Long, J. Wang, Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games, arXiv (2017), 10 pp. Online [Accessed 14 September 2021] https://arxiv.org/abs/1703.10069

[10] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, T. Graepel, Value-decomposition networks for cooperative multi-agent learning, arXiv (2017), 17 pp. Online [Accessed 14 September 2021] https://arxiv.org/abs/1706.05296

[11] T. Rashid, M. Samvelyan, C. S. de Witt, G. Farquhar, J. Foerster, S. Whiteson, QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning, Proceedings of Machine Learning Research, Stockholm, Sweden, 10 – 15 July 2018, pp. 4295-4304. arXiv (2018), 14 pp. Online [Accessed 14 September 2021] https://arxiv.org/abs/1803.11485

[12] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, I. Mordatch, Multi-agent actor-critic for mixed cooperative-competitive environments, Advances in Neural Information Processing Systems 30 (2017), pp. 6379-6390. DOI: 10.5555/3295222.3295385

[13] S. Li, Y. Wu, X. Cui, H. Dong, F. Fang, S. Russell, Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient, Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, Hawaii, USA, 27 January – 1 February 2019, pp. 4213-4220. DOI: 10.1609/aaai.v33i01.33014213

[14] P. Casgrain, B. Ning, S. Jaimungal, Deep Q-learning for Nash equilibria: Nash-DQN, arXiv (2019), 16 pp. Online [Accessed 14 September 2021] https://arxiv.org/abs/1904.10554

[15] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Kättler, J. Agapiou, J. Schrittwieser, J. Quan, S. Gaffney, S. Petersen, K. Simonyan, T. Schaul, H. van Hasselt, D. Silver, T. Lillicrap, K. Calderone, P. Keet, A. Brunasso, D. Lawrence, A. Ekermo, J. Repp, R. Tsing, StarCraft II: A new challenge for reinforcement learning, arXiv (2017), 20 pp. Online [Accessed 14 September 2021] https://arxiv.org/abs/1708.04782

[16] M. Samvelyan, T. Rashid, C. S. de Witt, G. Farquhar, N. Nardelli, T. G. J. Rudner, C. Hung, P. H. S. Torr, J. Foerster, S. Whiteson, The StarCraft multi-agent challenge, arXiv (2019), 14 pp. Online [Accessed 14 September 2021] https://arxiv.org/abs/1902.04043

[17] S. Liu, G. Lever, J. Merel, S. Tunyasuvunakool, N. Heess, T. Graepel, Emergent coordination through competition, arXiv (2019), 19 pp. Online [Accessed 14 September 2021] https://arxiv.org/abs/1902.07151

[18] N. Bard, J. N. Foerster, S. Chandar, N. Burch, M. Lanctot, H. F. Song, E. Parisotto, V. Dumoulin, S. Moitra, E. Hughes, I. Dunning, S. Mourad, H. Larochelle, M. G. Bellemare, M. Bowling, The Hanabi challenge: A new frontier for AI research, Artificial Intelligence 280 (2020), 103216. DOI: 10.1016/j.artint.2019.103216

[19] R. S. Sutton, D. McAllester, S. Singh, Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, Proceedings of the 12th International Conference on Neural Information Processing Systems, Denver, USA, 29 November – 4 December 2000, pp. 1057-1063. DOI: 10.5555/3009657.3009806
[20] A. Nagabandi, G. Kahn, R. S. Fearing, S. Levine, Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning, 2018 IEEE International Conference on Robotics and Automation, Brisbane, Australia, 21 – 26 May 2018, pp. 7559-7566. DOI: 10.1109/ICRA.2018.8463189

[21] A. F. Agarap, Deep learning using rectified linear units (ReLU), arXiv (2018), 7 pp. Online [Accessed 14 September 2021] https://arxiv.org/abs/1803.08375

[22] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, Proceedings of Machine Learning Research, New York, USA, 20 – 22 June 2016, pp. 1928-1937. DOI: 10.5555/3045390.3045594

[23] D. A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs), arXiv (2015), 14 pp. Online [Accessed 14 September 2021] https://arxiv.org/abs/1511.07289