Navigating Communication Networks with Deep Reinforcement Learning

Electronic Communications of the EASST, Volume 080 (2021)
Conference on Networked Systems 2021 (NetSys 2021)

Patrick Krämer1   Andreas Blenk1,2
1 Chair of Communication Networks, Technical University of Munich, Germany
2 Faculty of Computer Science, University of Vienna, Austria

Abstract: Traditional routing protocols such as Open Shortest Path First cannot incorporate fast-changing network states due to their inherent slowness and limited expressiveness. To overcome these limitations, we propose COMNAV, a system that uses Reinforcement Learning (RL) to learn a distributed routing protocol tailored to a specific network. COMNAV interprets routing as a navigational problem in which flows have to find a way from source to destination. Thus, COMNAV has a close connection to congestion games. The key concept and main contribution is the design of the learning process as a congestion game, which allows RL to effectively learn a distributed protocol. Game theory thereby provides a solid foundation against which the policies RL learns can be evaluated, interpreted, and questioned. We evaluate the capabilities of the learning system in two scenarios in which the routing protocol must react to changes in the network state and make decisions based on the properties of the flow. Our results show that RL can learn the desired behavior and requires the exchange of only 16 bits of information.

Keywords: Reinforcement Learning, Game Theory, Routing

1 Introduction

After proper configuration, communication networks are able to adjust to changes in their state, e.g., to changes in the topology [CRS16]. Communication networks achieve this with a rule-based system. Routers exchange information about the network state with each other or with a central entity [BSL+18]. A routing-protocol-dependent algorithm, e.g., Dijkstra's algorithm [CRS16], computes the Routing Information Base (RIB) from this information [KR16]. The routing protocol derives forwarding rules, i.e., which traffic is forwarded through which port, from the RIB. These rules form the Forwarding Information Base (FIB) [KR16]. The data plane matches packets against those rules and derives the applicable action set [AED+14].

This approach suffers from two shortcomings: 1) limited adaptability and 2) limited expressiveness. The process of translating a change in the network into new rules in the FIB has multiple steps and is thus slow to react. It is difficult for this rule-based system to condition forwarding decisions on fast-changing network state attributes such as utilization or queue length [AED+14]. To counteract this, rules for different contingencies could in principle be computed and installed on the devices. However, forwarding rules based on dynamic network state could become complex and intricate and exceed the rule table size. Increasing memory is possible but expensive [SB18].
The restricted space for rules can thus limit their expressiveness and hence the extent to which forwarding decisions can be conditioned on the network state.

Figure 1: System diagrams of a traditional routing protocol (a) and COMNAV (b), and a comparison of the induced logical control planes. COMNAV integrates the storage of network state, the RIB, and the calculation of routes into a Neural Network.

We propose COMNAV, a system that relies on network programmability, Reinforcement Learning (RL), Neural Networks (NNs), and hardware acceleration to overcome this limitation. The idea of COMNAV is to train a Recurrent Neural Network (RNN) with a binary hidden state that implements a routing protocol. This routing protocol can be tailored to a specific network and its workload during training. Using a binary representation for the hidden state allows concise encodings through a distributed representation of information in the RNN weights [GBB11]. Forwarding devices implement only limited functionality: the execution of the forward pass of an RNN, which can be implemented directly in hardware [SSB18, SB18].

Fig. 1a and Fig. 1b illustrate the architectural differences between a traditional rule-based system and COMNAV. Network devices in traditional systems have a routing process that computes the rules of the RIB, and thus the FIB, from the current network state and configured policies. The network devices exchange local state on the level of the routing processes through well-defined messages. In COMNAV, the computation of rules for the FIB and the storage of network state are implemented in an RNN. Nodes in COMNAV also exchange messages. These messages are exchanged only with direct neighbors and are not explicitly specified. Instead, the RNN learns how to design those messages during a training phase. This allows the network to exchange exactly the information that is necessary to achieve a specific objective.

To obtain such a protocol, COMNAV takes a flow-centric perspective: we teach flows navigational skills with RL, which allows them to traverse a communication network, thereby exploiting its current state (e.g., current loads, failed links, etc.). In other words, we present a first building block towards a routing that is learned by the network itself, providing the opportunity to let the network react to changing traffic conditions, hardware, or topological changes. To this end, we design and train a novel neural architecture that caters to the specific properties of communication networks and routing. The focus on flows introduces a connection to game theory and in particular congestion games [RT02], which we will discuss in more detail. As a first step, we investigate the feasibility of using an RNN with a binary hidden state to make routing decisions based on the current state at network nodes. Using our proof-of-concept implementation, we show the potential of such an approach in two case studies: learning to evade congested links and learning to navigate flows with different properties.
Indeed, we find that RL can be used to learn strategies for different network states and to differentiate path-selection criteria based on observed flow characteristics and network state information.

The remainder of this paper is organized as follows: We present our vision, identify challenges, and present our approach in Sec. 2. Sec. 3 introduces the environmental model for RL. Sec. 4 discusses the used RL architecture. In Sec. 5 we evaluate our design in two experiments to provide a proof-of-concept for our proposed methodology, Sec. 6 discusses related work, and Sec. 7 concludes this work.

2 Vision: Navigation vs Routing

We envision networks that adapt their routing strategy to traffic that changes over short and long time-scales. In order to achieve this vision, we use RL, which allows an agent to evolve its behavior by adapting it to reward signals.

In our work, we take an unconventional approach to the choice of our agent. Intuitively, a forwarding device or a central entity would be the agent [BL94, CY96, VSST17], i.e., the entity that uses RL to take actions in order to maximize its reward. In contrast to this forwarding-device-centered view, we identify as agents the entities that traverse the network: a flow, a flowlet, or a packet. We understand routing as a navigation problem. In this interpretation, multiple entities traversing the network actively take actions, which can be interpreted as a congestion game [RT02].

The choice to take on this viewpoint is motivated by recent advances in the navigation of artificial agents in complex simulated and real environments [MPV+16, MGM+18]. In robotics, navigational tasks are called Simultaneous Localization and Mapping (SLAM): a robot has to infer its location from sensory information and creates a map of its environment through exploration, which can later be used for planning routes [FGS07]. The agent has to associate the goal description, e.g., an address or an image, and its sensory information with a decision such as moving forward or turning in order to reach its goal.

Our vision of routing is similar: Packets navigate a network and decide at each switch where to go next, i.e., choose an outgoing port. The decision is based on the state of the network, the agent's previous experiences, and a time-invariant context. Fig. 2 illustrates this process. At time t the network is in state Rt. Rt is not fully observable. Instead, the agent obtains sensory information St from Rt. Sensory information can include port utilization on the switch the agent is situated on, latency information, or congestion information. The only restriction on St is that it can be obtained efficiently to support line rate. The agent then chooses an action At based on St, its memory Mt, and time-invariant context information C. The action At corresponds to an outgoing port on the switch. The memory summarizes past experiences of the agent and allows it to include this information during decision making. For example, the utilization pattern on earlier links can include relevant information for later decisions. The time-invariant context C can correspond to IP addresses, Quality-of-Service requirements, the volume of the flow, etc.
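To make this decision process concrete, the sketch below outlines a single per-hop decision step and the resulting navigation loop. It is a minimal Python sketch; all names (policy, update_memory, network.sense, etc.) are illustrative assumptions and not part of COMNAV.

```python
# Minimal sketch of the FlowAgent decision process (illustrative names, not COMNAV code).

def decision_step(sensor_reading, memory, context, policy, update_memory):
    """One hop: choose an outgoing port A_t from (S_t, M_t, C) and update the memory."""
    action = policy(sensor_reading, memory, context)      # A_t: outgoing port at this switch
    next_memory = update_memory(memory, sensor_reading)   # M_{t+1}: carried to the next node
    return action, next_memory


def navigate(flow, network, policy, update_memory):
    """Repeat the decision step until the flow reaches its destination."""
    node, memory = flow.source, flow.initial_memory
    while node != flow.destination:
        s_t = network.sense(node)                         # S_t: partial view of state R_t
        port, memory = decision_step(s_t, memory, flow.context, policy, update_memory)
        node = network.next_hop(node, port)               # transition R_t -> R_{t+1}
    return memory
```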
Figure 2: Dependencies between the different components of a FlowAgent: Mt corresponds to the memory of the agent represented by an RNN, At corresponds to the action taken, Rt corresponds to the state of the environment, i.e., the network, and St corresponds to the sensory information the agent obtains from the environment. C corresponds to a time-invariant context.

After choosing an action, i.e., a port on the node, the memory of the agent is updated. Thus, the relevant information from the current node is present at the next node. Then, the environment transitions into a new state Rt+1, i.e., the agent moves to the next hop. This process continues until the agent reaches its destination.

3 A Navigation Network Model

In the following we introduce the different components of the agent-environment system.

The Environment. We represent a network as a graph G = (N, E, b, m) and model it as a queuing network. A node v ∈ N corresponds to a forwarding device, and an edge e ∈ E to a link between two forwarding devices. Nodes and edges are both modeled as service units. A link e ∈ E corresponds to a service unit with one First-In-First-Out (FIFO) queue. A forwarding device v ∈ N has multiple queues, one queue for each flow traversing the node. Queues on nodes are served using Fair Queuing (FQ) with Byte-by-Byte Round-Robin. The function b : N ∪ E → N returns the service rates of nodes and links, and the function m : N ∪ E → N the buffer sizes. Forwarding devices enqueue packets into the queues of their connected links, and links enqueue packets into the queues of incident forwarding devices. Packets are pushed into the network by sources using deterministic arrival times, i.e., each source places packets at fixed times into the queue of the incident link. An environment state r ∈ R is thus associated with a specific forwarding device and contains the states of the queues of the forwarding device, as well as the queues of the outgoing links.

The FlowAgent. We consider flow-level routing, and an agent corresponds to a flow to avoid packet re-ordering. Note that flows consist of multiple packets, i.e., we use a packet-level simulation. In our simulation, the first packet of the flow, e.g., the TCP SYN packet, discovers a path from source to destination. The forwarding decisions are cached in the FIB of the nodes, similar to CONTRA [HBC+20]. Subsequent packets follow the path of the first packet. A FlowAgent has time-invariant contextual information, e.g., source and destination addresses, QoS parameters, Type-of-Service, or flow size. The agent utilizes this contextual information for decision making.

Actions. When arriving at a forwarding device, a FlowAgent decides which link to take. The forwarding device then enqueues the FlowAgent into the buffer of the chosen link. The number of actions of a FlowAgent can therefore differ between nodes, since each node can have a different number of incident links. The possible actions in each state correspond to the outgoing links at the forwarding device associated with that state, i.e., the set of all actions A corresponds to the set of edges E. At each time step t the agent can choose from a subset At ⊂ A.

Sensor. The sensor of a FlowAgent defines what the FlowAgent perceives, i.e., the observable part of the environment.
The FlowAgent uses information from the sensors to make a decision. A sensor reading s ∈ S could include the MAC address of the switch or time-dependent flow properties such as the Time-to-Live. Local information at the forwarding device can also be observed, e.g., the number of flows at each port or the utilization of each port.

Objective. The goal of each FlowAgent is the minimization of the Flow Completion Time (FCT). The reward used during the learning of the protocol with RL is designed to achieve this objective. Other objectives, such as minimizing the maximum link utilization, or a combination of different objectives, can be used as well. Depending on the requirements, even policies such as "specific traffic should not traverse a specific link or node" could be included.

4 Learning to Navigate Networks

We represent an agent as an RNN which is trained with RL. For training, we use a simulated network with changing state. Each agent obtains a reward that depends on the FCT, i.e., the time it takes until the last packet is received by the destination. The agents are exposed to varying demands and should learn which route to take based on the observed state, i.e., based on the source, destination, current location, and the utilization of the ports at the current location.

A game-theoretical perspective. We can view the FCT minimization as a congestion game Γ = (P, Σ, u), P being the set of players, i.e., FlowAgents, Σ the space of possible strategy profiles, and u : Σ × P → R a utility function evaluating a strategy profile with respect to a specific player. A strategy for a player is the selected path from a source node s to a destination node d. The utility function is the FCT. Here, a player consists of the tuple (s, d, v, t), where s, d ∈ N are the source and destination node of a flow, v its volume, and t its arrival time. Now, let σ ∈ Σ be a strategy profile, i.e., for each player in P the strategy vector σ contains the strategy of that player. Further, let σ' := (σ_{-i}, σ'_i) be the profile that is created when player i ∈ P changes its strategy σ_i to σ'_i. Further, let Φ : Σ → R be a potential function; here, the potential function is the sum over all FCTs. Then the game is an exact potential game since:

u(σ', i) − u(σ, i) = Φ(σ') − Φ(σ).   (1)

A simple proof sketch is as follows: Since the potential function is a sum of individual FCTs, a change in one FCT translates to a change in Φ [RT02]. Thus, the above game admits at least one pure-strategy Nash Equilibrium [Gib92], and the RL approach converges towards a Nash Equilibrium [GLS17]. That is, it learns a mapping from the state to the respective strategy that optimizes its utility, i.e., the FCT.

Asynchronous Advantage Actor-Critic (A3C). We use an actor-critic-based algorithm, which maintains a policy π(At | St, C, θπ) and an estimate of the value function V(St | C, θv) [MBM+16]. We opted for an on-policy algorithm without replay memory to account for the non-stationarity of the environment in multi-agent scenarios [GD21]. We use parametric functions in the form of NNs to represent π and V, where θπ identifies the parameters of the policy, and θv the parameters of the value function. The policy maps states to actions, whereas the value function computes the expected reward that can be obtained with a policy from a specific state. In this case, a state corresponds to a sensor reading.
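The sketch below gives a schematic of the resulting per-episode update, written as a generic advantage actor-critic step in PyTorch. It is not COMNAV's exact training code; the module policy_value_net(obs, ctx), which returns an action distribution and a scalar value estimate, and all hyperparameter names are assumptions.

```python
import torch

def episode_update(policy_value_net, optimizer, trajectory, gamma=0.99,
                   value_coef=0.5, entropy_coef=0.01):
    """Generic advantage actor-critic update over one finished episode (sketch).

    trajectory: list of (obs, ctx, action, reward) tensors collected by one agent.
    policy_value_net(obs, ctx) -> (Categorical action distribution, scalar value).
    """
    # Discounted returns, computed back to front.
    returns, g = [], 0.0
    for *_, reward in reversed(trajectory):
        g = reward + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))

    policy_loss, value_loss, entropy = 0.0, 0.0, 0.0
    for (obs, ctx, action, _), ret in zip(trajectory, returns):
        dist, value = policy_value_net(obs, ctx)
        advantage = ret - value.detach()                  # critic acts as baseline
        policy_loss = policy_loss - dist.log_prob(action) * advantage
        value_loss = value_loss + (ret - value) ** 2
        entropy = entropy + dist.entropy()                # encourages exploration

    loss = policy_loss + value_coef * value_loss - entropy_coef * entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```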
We use an asynchronous variant to update the parameters of the functions. That is, multiple agents and environments run in parallel. Parameter updates are applied asynchronously to a global set of parameters, which is then copied to the local agents. The agents can thus continuously adapt to the changed policies of the other agents [GD21].

Reward. Our problem is episodic. One episode corresponds to the delivery of one flow to its destination. In the case of the FlowAgent the reward is:

g_t = α · (t*_t + t*_p) / (t_t + t_p),   (2)

where t*_t is the transmission time on the shortest path based on residual capacity, and t*_p the propagation delay on the shortest path based on delays calculated during the arrival of the flow. Those paths serve as lower bounds. This information is only required during the training phase in the simulation. The denominator corresponds to the transmission delay t_t and propagation delay t_p the flow then experienced. Eq. (2) becomes larger the closer the agent gets to the minimum values. Note that values larger than one are also possible in case flows leave the system, yielding better shortest paths during the lifetime of the current flow.

In addition to the terminal reward, we use a per-step reward. For the per-step reward, we exploit the relation to routing games and imitate a form of marginal cost pricing [RT02]:

g_s = −(1/N) Σ_{i=1}^{N} (t'^{(i)}_t − t^{(i)}_t).   (3)

t'^{(i)}_t is the transmission time of flow i with the current flow on the same link, while t^{(i)}_t is the previous transmission time. The current flow is thus charged with the additional time its presence on the link causes for the other flows. This mechanism has been shown to reduce the Price of Anarchy in congestion games [RT02] and should therefore provide good guidance.

NN Architecture. In our architecture, we make use of parameter sharing. That is, θπ and θv do not correspond to different NNs but instead share large parts of the parameters. Fig. 3 illustrates the architecture we use during our experiments.

Figure 3: NN architecture. Each switch has its own policy layer; except for the policy layer, all weights are identical across switches. The numbers above the layers give their sizes.

The network consists of multiple building blocks: two encoders, a binary recurrent layer, policy layers, a layer for the value function, and a layer to predict loops in the network path. The purpose of the encoders is to extract a low-dimensional representation from the observation St and the context C. The encoder output for the observation St is then fed into a binary recurrent layer, serving as memory for the agent. For our experiments, a hidden layer with 16 neurons is sufficient. The output of the recurrent layer is binary, i.e., consists of zeroes and ones. This is an important design aspect, since the hidden state of the recurrent layer must be transmitted to the next node. Learning a binary representation results in small and concise codes and is a good fit for the distributed representations of information that NNs learn [GBB11]. The encoded context is not fed into the recurrent layer since the context is time-invariant. The outputs of the context encoder and the recurrent layer are then used for different purposes: to predict the value and the actions, and to detect loops. Since the main purpose of the encoders and the recurrent layer is the extraction of useful information, we share them between policy and value function.
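A PyTorch sketch of this architecture is given below. It follows the building blocks of Fig. 3 and the layer sizes reported in Sec. 5 (a 30/25 ReLU observation encoder, a 20/10 sigmoid context encoder, a 16-unit LSTM, per-device policy heads, and value and loop-detection heads), but the concrete binarization of the hidden state (here a hard threshold with a straight-through gradient) and all names are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class BinarizeSTE(torch.autograd.Function):
    """Threshold activations to {0, 1}; straight-through gradient (an assumed choice)."""
    @staticmethod
    def forward(ctx, x):
        return (x > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out


class ComNavNet(nn.Module):
    """Sketch of the COMNAV architecture: shared encoders and a binary LSTM,
    one policy head per forwarding device, plus value and loop-detection heads."""

    def __init__(self, obs_dim, ctx_dim, node_degrees, hidden=16):
        super().__init__()
        self.state_enc = nn.Sequential(                  # observation encoder (30/25, ReLU)
            nn.Linear(obs_dim, 30), nn.ReLU(),
            nn.Linear(30, 25), nn.ReLU())
        self.ctx_enc = nn.Sequential(                    # context encoder (20/10, sigmoid)
            nn.Linear(ctx_dim, 20), nn.Sigmoid(),
            nn.Linear(20, 10), nn.Sigmoid())
        self.rnn = nn.LSTMCell(25, hidden)               # memory carried between nodes
        # One policy head per forwarding device (multi-head policy).
        self.policy_heads = nn.ModuleList(
            [nn.Linear(hidden + 10, deg) for deg in node_degrees])
        self.value_head = nn.Linear(hidden + 10, 1)
        self.loop_head = nn.Linear(hidden + 10, 1)       # auxiliary loop-detection task

    def forward(self, obs, ctx, node_id, hc):
        h, c = self.rnn(self.state_enc(obs), hc)
        h_bin = BinarizeSTE.apply(h)                     # 16-bit state written to the packet
        feat = torch.cat([h_bin, self.ctx_enc(ctx)], dim=-1)   # 16 + 10 = 26 features
        logits = self.policy_heads[node_id](feat)        # action = outgoing port at this node
        value = self.value_head(feat).squeeze(-1)
        loop_prob = torch.sigmoid(self.loop_head(feat))
        # The paper exchanges only the 16-bit hidden state between nodes;
        # carrying the LSTM cell state c as well is a simplifying assumption here.
        return logits, value, loop_prob, (h_bin, c)
```

During training, the policy logits, the value estimate, and the loop prediction would feed the A3C loss and the auxiliary loop-detection loss, respectively.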
Loop detection is an auxiliary task that is utilized during training. The purpose of the loop-detection task is to provide additional training for the recurrent layer and the encoders to produce a meaningful representation. If the agent is able to discern that it has already visited a certain location, then it has discriminated between states and memorized them. Both skills are important for the task of the agent. The auxiliary task provides an additional training signal to the agent, which can improve data efficiency, i.e., the agent converges faster to a good policy [MPV+16].

Challenges we face in the navigation of communication networks are the changing number of actions at each forwarding device and the semantic differences between the "same" action at different devices. With a single output layer, the number of actions at each forwarding device must equal the maximum degree in the network. Further, some actions are forbidden at nodes that have a smaller degree. The agent then has to learn which actions are admissible at which location. Also, the "same" action, e.g., leave through port zero, need not be related in a meaningful way: choosing the action "leave through port 2" at different forwarding devices does not mean that the agent moves in a similar direction. In navigational tasks in the real world, the same action (for example, turn left) is related between two successive states. Moreover, forwarding devices are farther apart in communication networks, and information such as MAC addresses is not necessarily related, whereas in robotics one usually has a constant visual input stream.

We address these challenges by proposing an architecture featuring multiple policy layers, one layer for each forwarding device. Here we exploit the fact that the agent always knows exactly where it is, which is usually not the case in robotics [MGM+18, MPV+16]. During learning, the weights of the different policy layers can thus be tuned independently, giving credit to the different semantics of actions at different forwarding devices.

Figure 4: Data plane of a switch in COMNAV. Upon a table miss, the NN is queried for an output port for the packet. The output is cached in the FIB and re-used for subsequent packets of the flow.

Centralized Learning, Distributed Inference. A packet cannot carry an NN. Instead, we propose to distribute the NN across the communication network. At each time step, a FlowAgent performs a forward pass through the encoders and the recurrent layer to the policy and value function, which requires the output of the recurrent layer from the previous time step. We propose to situate the parameters of the NN at the forwarding devices, as illustrated in Fig. 4. When an agent, i.e., the first packet of a flow, arrives at a node, it is first parsed. During parsing, relevant information such as the context information, i.e., source and destination addresses, as well as the hidden state of the RNN, is extracted. Then, the forwarding device searches for forwarding rules in its FIB. Since no such rule exists, the switch queries the NN. The NN computes a port through which the packet should be sent. The switch installs a corresponding rule in its FIB. Keeping rules for the active set of flows is feasible [HBC+20].
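The table-miss handling just described can be summarized in the following Python-style sketch; names such as fib, query_nn, and deparse are illustrative assumptions, and a real switch would realize this logic in the programmable data plane rather than in software.

```python
def handle_packet(packet, switch):
    """Table-miss handling at a COMNAV switch (illustrative sketch, assumed names).

    Only the first packet of a flow triggers an NN query; its decision is cached
    in the FIB, and its updated binary hidden state is written back to the packet.
    """
    ctx, hidden = switch.parse(packet)                    # context + RNN state from header
    rule = switch.fib.lookup(packet.flow_id)
    if rule is None:                                      # table miss: ask the local NN
        out_port, new_hidden = switch.query_nn(ctx, hidden, switch.observation())
        switch.fib.install(packet.flow_id, out_port)      # cache decision for the active flow
        switch.deparse(packet, new_hidden)                # write 16-bit hidden state to packet
    else:                                                 # table hit: reuse cached decision
        out_port = rule.out_port
    switch.forward(packet, out_port)
```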
Subsequent packets of the flow match the installed rule, and the NN is not queried again. In addition, the NN updates the hidden state. The deparser writes the new hidden state to the packet. Since the output of the recurrent layer is binary, it can be used without further processing. The switch then forwards the packet to the next hop. Note that the network devices perform only forward passes, i.e., no backward passes or parameter updates are computed during operation.

While inference can happen in a distributed manner, training must be performed in a central location. Distributed training would incur a very high overhead, since a lot of information would have to be transmitted between nodes. Further, updating the network during operation can lead to unwanted behavior of the network. Instead, we rely on the increasing capability to gather data, e.g., provided by network telemetry [HBC+20], in order to obtain real state information from the network and simulate traffic for learning. In this way, the parameters can continuously be updated, adapted to shifting demands, and, most importantly, validated with respect to the consistency of the resulting routes. If the updated weights meet corresponding quality criteria, they can be deployed on the forwarding devices. During training, varying network conditions can be simulated and learned with the neural architecture. In this way, shifting traffic patterns, larger topological changes, or other optimization objectives can be incorporated into the protocol. The routing protocol implemented by the resulting NN then reacts to fast-changing network conditions, such as link failures, changes in the utilization of the network, and traffic patterns.

One limitation of using an NN in the data plane is the large number of inference steps that must be performed to sustain line rate. Recent work suggests that even today's ASICs are capable of executing small NNs at line rate [SB18]. Thus, we believe that per-flow inference is possible, even for larger networks.

5 Evaluation

We consider two scenarios as use cases for our approach: learning to evade congested links and learning to navigate flows with different properties.

Topology Setting: In both cases, we use the Abilene network [KNF+11] as topology. Fig. 7 shows the Abilene network topology. Based on publicly available data, we added Autonomous Systems (ASes) to the Abilene network from which traffic originates during our evaluation.

Learning Setting: For both setups, we use A3C with 16 cores and train the network with 10 000 flows per core. At every location, the observation St consists of the utilization of the ports at the forwarding device and a binary identifier for the ingress port. Utilization is represented using zero-one-hot encoding based on Fortz et al. [FT02] and passed through an encoder consisting of two fully connected layers with 30 and 25 neurons and Rectified Linear Unit (ReLU) activation. The port identifier is passed to the recurrent layer, implemented by a Long Short-Term Memory (LSTM) RNN [HS97] with 16 binary hidden neurons. The context C consists of a one-hot encoding for the source and destination node, which is passed separately through an encoder with 20 and 10 neurons and sigmoidal activation. The two outputs are then concatenated into a vector with 20 elements. For the policies, we use a linear layer with softmax activation that consumes the combined output of the context encoder and the LSTM. For the value function, we use a linear layer with a single output neuron.
Similarly, for the loop detection, we use a linear layer with a single output neuron and a sigmoid activation. During our experiments, we encountered the problem that the agent quickly converged to a sub-optimal policy. To mitigate this risk, we additionally introduce random actions to increase exploration and let the agent discover a better strategy.

5.1 Learning to Evade

In this scenario, the agent starts at Nysernet, connected to New-York, and has to navigate the network to Pacific Wave, connected to Seattle. The objective is to minimize the FCT. In the uncongested case, the best FCT is obtained via the path New-York - Chicago - Indianapolis - Kansas City - Denver - Seattle. In the congested case, the flow is slowed down on the link Indianapolis - Kansas City, and the best path becomes New-York - Washington - Atlanta - Houston - Los-Angeles - Sunnyvale - Seattle. We train the agent for both scenarios: 50 % of the trained cases contain congestion on the link Indianapolis - Kansas City, while the other half does not.

Figure 5: Reward (a) and path length (b) for the agent learning to evade a bottleneck link, plotted over the number of flows for the congested and uncongested case.

Figure 6: Reward (a) and path length (b) for the agent learning to differentiate between different flow patterns based on the source and destination, plotted over the number of flows for the small and big flows.

Fig. 5a shows the discounted reward of the agent and illustrates the learning progress; it shows that after approximately 2 000 flows the agent has converged to a stable solution, since the reward does not increase anymore. Fig. 5b shows the path length, which drops as the reward increases. The difference between the congested and uncongested case is only marginal, as indicated by the overlapping lines. The agent has learned to avoid the link between Indianapolis and Kansas City when it is congested. The agent thereby chooses the path New-York - Chicago - Indianapolis - Atlanta - Houston - Los-Angeles - Sunnyvale - Seattle as alternative. This result can be explained by the fact that the agent has no means of knowing at New-York City that the link between Indianapolis and Kansas City is congested. In the absence of additional information, the best choice is indeed to first travel to Indianapolis and then decide based on the conditions there. The agent is very confident in making that decision. In the congested case, the agent chooses with probability one to travel to Atlanta, whereas in the uncongested case the agent chooses to travel to Kansas City with 87 % probability and to Atlanta with 12 % probability. Those numbers do not change when the agent has more trials.

5.2 Learning to Differentiate

In this scenario, two flows start from the Oregon Gigaport, connected to Sunnyvale, and Pacific Wave 2, connected to Los-Angeles, towards Pacific Wave, connected to Seattle.

Figure 7: Abilene topology with additionally attached Autonomous Systems.

We now assume different applications behind the flows.
Flows originating at Pacific Wave 2 are assumed to be mostly larger downloads, whereas flows originating from the Oregon Gigaport are small flows. We increase the utilization on the link between Sunnyvale and Seattle, such that the larger flows from Pacific Wave 2 experience significant transmission delay, while the flows from the Oregon Gigaport are small enough to experience no significant delay. Thus, flows from Pacific Wave 2 should take a detour over Denver in order to obtain a shorter FCT.

Fig. 6a illustrates the learning progress of the A3C agents. The reward converges after approximately 5 000 flows. Fig. 6b illustrates how the agents find the shorter path for the smaller flow and the longer path for the larger flow. Interestingly, the agent for the larger flow learned to take the correct path after only 2 000 flows, while the agent for the small flow took longer to converge. After the training ended, the two agents are very confident, taking the correct path with probability one. Thus, the agents learned to associate certain sources with specific flow properties and exploited this knowledge when navigating the network.

6 Related Work

6.1 Routing and Traffic Engineering

There exist many routing and traffic engineering (TE) solutions that address various challenges and cope with different problems [NR17]. Solutions range from well-established protocols such as OSPF [CRS16] over new centralized designs [ARR+10, BAAZ11, ZCY+16, HKM+13, JKM+13, KMSB14, LSYR11, LVC+16, ZCB+15] to distributed protocols [KKSB07, MT14, AED+14] that can also utilize data-plane programmability [HBC+20].

In contrast, COMNAV is a routing system in which relevant parts of the protocol are learned and implemented with an NN. COMNAV learns forwarding rules autonomously with RL and relies on the flexibility of data-plane programmability to realize the NN in the data plane.

6.2 ML and Networking

Combinations of ML and networking have been explored from different angles [BSL+18]. Besides applications like traffic prediction [CWG16] and classification [EMAW07, EAM06, RRS20], the optimization of reconfigurable topologies [WCX+18, SSC+18], queue management [SZ07, ZDCG09, CLCL18], and the implementation of routing schemes have also been addressed [Mam19, GC18, XMW+20, YLX+20, BL94, VSST17, CY96]. In particular for routing schemes, there are several approaches considering centralized as well as distributed implementations. Our work differs from previous work in the flow-centric perspective, the resulting formulation of the underlying problem as a congestion game, and the explicit design of the neural architecture with binary hidden layers and multi-head policies. Previous work, especially [YLX+20], relies on large, real-valued vectors that need to be exchanged between nodes, which is impractical; the same is true for the online learning of NN weights. In contrast, COMNAV learns the weights during a training phase and exchanges only 16 bits of information between nodes.

6.3 Navigation

Work in [MPV+16] considers the navigation of complex mazes with rich structure using an end-to-end deep-RL-based approach that learns directly from pixels. In a later work, [MGM+18] design a deep-learning-based system that is able to reliably navigate in large cities over long distances. They also showed that the learned skills can be transferred to other cities.
In our work, we consider the navigation of communication networks, which is quite different from real-world navigational tasks. In communication networks, the agent cannot carry its intelligence by itself; moreover, actions have completely different semantics between forwarding devices.

7 Conclusion

We presented COMNAV, a novel routing architecture that is motivated by programmable data planes and machine learning. In COMNAV, artificial FlowAgents successfully learn to traverse a communication network, accounting for the state at nodes and traffic characteristics to decide their path. FlowAgents query binary Recurrent Neural Networks situated at forwarding devices for their next hop. The hidden state of the RNN is piggy-backed on packets, allowing them to memorize their route as well as encountered network conditions. We find that 16 bits are enough to successfully navigate a network. We showcase with two examples the ability of flows, i.e., agents, to learn which path to take solely based on information locally available at nodes. This limits the action space of the agents and allows distributed inference by placing a forward model at each forwarding device.

We understand our work as a first step and believe that our vision opens many interesting directions for future research. For example, it would be interesting to see whether skills obtained on one topology can be transferred to another one. Another avenue of research could be to further explore and exploit the connection between our approach and game theory (and its formal guarantees). Finally, the performance of the proposed approach could be evaluated in a testbed to explore its limits.

Bibliography

[AED+14] M. Alizadeh, T. Edsall, S. Dharmapurikar, R. Vaidyanathan, K. Chu, A. Fingerhut, V. T. Lam, F. Matus, R. Pan, N. Yadav, G. Varghese. CONGA: Distributed Congestion-aware Load Balancing for Datacenters. SIGCOMM Comput. Commun. Rev. 44(4):503–514, Aug. 2014. doi:10.1145/2740070.2626316

[ARR+10] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, A. Vahdat. Hedera: Dynamic Flow Scheduling for Data Center Networks. In NSDI 2010. Pp. 281–296. San Jose, CA, USA, 2010.

[BAAZ11] T. Benson, A. Anand, A. Akella, M. Zhang. MicroTE: Fine Grained Traffic Engineering for Data Centers. In CoNEXT '11. Pp. 8:1–8:12. ACM, New York, NY, USA, 2011. doi:10.1145/2079296.2079304

[BL94] J. A. Boyan, M. L. Littman. Packet Routing in Dynamically Changing Networks: A Reinforcement Learning Approach. In NIPS 6. Pp. 671–678. Morgan Kaufmann, 1994.

[BSL+18] R. Boutaba, M. A. Salahuddin, N. Limam, S. Ayoubi, N. Shahriar, F. Estrada-Solano, O. M. Caicedo. A comprehensive survey on machine learning for networking: evolution, applications and research opportunities. JISA 9(1):1–99, 2018.

[CLCL18] L. Chen, J. Lingys, K. Chen, F. Liu. AuTO: Scaling deep reinforcement learning for datacenter-scale automatic traffic optimization. In SIGCOMM '18. Pp. 191–205. Budapest, Hungary, 2018.

[CRS16] M. Chiesa, G. Rétvári, M. Schapira. Lying your way to better traffic engineering. In CoNEXT '16. Pp. 391–398. 2016.

[CWG16] Z. Chen, J. Wen, Y. Geng. Predicting future traffic using hidden Markov models. In ICNP. Pp. 1–6. 2016.

[CY96] S. Choi, D.-Y. Yeung. Predictive Q-routing: A memory-based reinforcement learning approach to adaptive traffic control. NIPS 8:945–951, 1996.

[EAM06] J. Erman, M. Arlitt, A. Mahanti. Traffic classification using clustering algorithms. In SC2D '06. Pp. 281–286. 2006.

[EMAW07] J. Erman, A. Mahanti, M. Arlitt, C. Williamson. Identifying and discriminating between web and peer-to-peer traffic in the network core. In WWW. Pp. 883–892. 2007.
[FGS07] A. Förster, A. Graves, J. Schmidhuber. RNN-based Learning of Compact Maps for Efficient Robot Localization. In ESANN. 2007.

[FT02] B. Fortz, M. Thorup. Optimizing OSPF/IS-IS weights in a changing world. IEEE J. Sel. Areas Commun. 20(4):756–767, May 2002. doi:10.1109/JSAC.2002.1003042

[GBB11] X. Glorot, A. Bordes, Y. Bengio. Deep Sparse Rectifier Neural Networks. In Gordon et al. (eds.), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research 15, pp. 315–323. JMLR Workshop and Conference Proceedings, Fort Lauderdale, FL, USA, 11–13 Apr 2011. http://proceedings.mlr.press/v15/glorot11a.html

[GC18] F. Geyer, G. Carle. Learning and Generating Distributed Routing Protocols Using Graph-Based Deep Learning. In Big-DAMA '18. Pp. 40–45. ACM, Budapest, Hungary, 2018. doi:10.1145/3229607.3229610

[GD21] S. Gronauer, K. Diepold. Multi-agent deep reinforcement learning: a survey. Artificial Intelligence Review, Apr. 2021. doi:10.1007/s10462-021-09996-w

[Gib92] R. Gibbons. A primer in game theory. Pearson Academic, 1992.

[GLS17] A. Greenwald, J. Li, E. Sodomka. Solving for Best Responses and Equilibria in Extensive-Form Games with Reinforcement Learning Methods. In Başkent et al. (eds.), Rohit Parikh on Logic, Language and Society. Pp. 185–226. Springer International Publishing, Cham, 2017. doi:10.1007/978-3-319-47843-2_11

[HBC+20] K.-F. Hsu, R. Beckett, A. Chen, J. Rexford, D. Walker. Contra: A programmable system for performance-aware routing. In NSDI 20. Pp. 701–721. 2020.

[HKM+13] C.-Y. Hong, S. Kandula, R. Mahajan, M. Zhang, V. Gill, M. Nanduri, R. Wattenhofer. Achieving High Utilization with Software-driven WAN. In SIGCOMM '13. Pp. 15–26. ACM, New York, NY, USA, 2013. doi:10.1145/2486001.2486012

[HS97] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Comput. 9(8):1735–1780, Nov. 1997. doi:10.1162/neco.1997.9.8.1735

[JKM+13] S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh, S. Venkata, J. Wanderer, J. Zhou, M. Zhu, J. Zolla, U. Hölzle, S. Stuart, A. Vahdat. B4: Experience with a Globally-deployed Software Defined Wan. In SIGCOMM '13. Pp. 3–14. ACM, New York, NY, USA, 2013. doi:10.1145/2486001.2486019

[KKSB07] S. Kandula, D. Katabi, S. Sinha, A. Berger. Dynamic load balancing without packet reordering. ACM SIGCOMM Comp. Com. Rev. 37(2):51–62, 2007.

[KMSB14] S. Kandula, I. Menache, R. Schwartz, S. R. Babbula. Calendaring for Wide Area Networks. In SIGCOMM '14. Pp. 515–526. ACM, New York, NY, USA, 2014. doi:10.1145/2619239.2626336

[KNF+11] S. Knight, H. Nguyen, N. Falkner, R. Bowden, M. Roughan. The Internet Topology Zoo. J-SAC 29(9):1765–1775, October 2011. doi:10.1109/JSAC.2011.111002

[KR16] J. F. Kurose, K. W. Ross. Computer Networking: A Top-Down Approach. Pearson, Boston, MA, 7th edition, 2016.
[LSYR11] N. Laoutaris, M. Sirivianos, X. Yang, P. Rodriguez. Inter-datacenter Bulk Transfers with Netstitcher. In SIGCOMM '11. Pp. 74–85. ACM, New York, NY, USA, 2011. doi:10.1145/2018436.2018446

[LVC+16] H. H. Liu, R. Viswanathan, M. Calder, A. Akella, R. Mahajan, J. Padhye, M. Zhang. Efficiently Delivering Online Services over Integrated Infrastructure. In NSDI 16. Pp. 77–90. USENIX Association, Santa Clara, CA, 2016.

[Mam19] Z. Mammeri. Reinforcement learning based routing in networks: Review and classification of approaches. IEEE Access 7:55916–55950, 2019.

[MBM+16] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. arXiv preprint arXiv:1602.01783, 2016. http://arxiv.org/abs/1602.01783

[MGM+18] P. Mirowski, M. K. Grimes, M. Malinowski, K. M. Hermann, K. Anderson, D. Teplyashin, K. Simonyan, K. Kavukcuoglu, A. Zisserman, R. Hadsell. Learning to Navigate in Cities Without a Map. ArXiv e-prints, Mar. 2018.

[MPV+16] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, D. Kumaran, R. Hadsell. Learning to Navigate in Complex Environments. CoRR abs/1611.03673, 2016. http://arxiv.org/abs/1611.03673

[MT14] N. Michael, A. Tang. Halo: Hop-by-hop adaptive link-state optimal routing. IEEE/ACM Transactions on Networking 23(6):1862–1875, 2014.

[NR17] M. Noormohammadpour, C. S. Raghavendra. Datacenter traffic control: Understanding techniques and tradeoffs. IEEE Communications Surveys & Tutorials 20(2):1492–1525, 2017.

[RRS20] A. Rashelbach, O. Rottenstreich, M. Silberstein. A Computational Approach to Packet Classification. In SIGCOMM '20. Pp. 542–556. 2020.

[RT02] T. Roughgarden, E. Tardos. How Bad is Selfish Routing? J. ACM 49(2):236–259, Mar. 2002. doi:10.1145/506147.506153

[SB18] G. Siracusano, R. Bifulco. In-network Neural Networks. CoRR abs/1801.05731, 2018. http://arxiv.org/abs/1801.05731

[SSB18] D. Sanvito, G. Siracusano, R. Bifulco. Can the Network Be the AI Accelerator? In Proceedings of the 2018 Morning Workshop on In-Network Computing. NetCompute '18, pp. 20–25. Association for Computing Machinery, New York, NY, USA, 2018. doi:10.1145/3229591.3229594

[SSC+18] S. Salman, C. Streiffer, H. Chen, T. Benson, A. Kadav. DeepConf: Automating data center network topologies management with machine learning. In NetAI. Pp. 8–14. 2018.

[SZ07] J. Sun, M. Zukerman. An adaptive neuron AQM for a stable internet. In NETWORKING. Pp. 844–854. 2007.

[The12] The Open Networking Foundation. OpenFlow Switch Specification. Jun. 2012.

[VSST17] A. Valadarsky, M. Schapira, D. Shahaf, A. Tamar. A Machine Learning Approach to Routing. CoRR abs/1708.03074, 2017. http://arxiv.org/abs/1708.03074

[WCX+18] M. Wang, Y. Cui, S. Xiao, X. Wang, D. Yang, K. Chen, J. Zhu. Neural network meets DCN: Traffic-driven topology adaptation with deep learning. POMACS 2(2):1–25, 2018.

[XMW+20] S. Xiao, H. Mao, B. Wu, W. Liu, F. Li. Neural Packet Routing. In NetAI '20. Pp. 28–34. ACM, Virtual Event, USA, 2020. doi:10.1145/3405671.3405813
[YLX+20] X. You, X. Li, Y. Xu, H. Feng, J. Zhao, H. Yan. Toward Packet Routing With Fully Distributed Multiagent Deep Reinforcement Learning. IEEE TNSM, 2020.

[ZCB+15] H. Zhang, K. Chen, W. Bai, D. Han, C. Tian, H. Wang, H. Guan, M. Zhang. Guaranteeing Deadlines for Inter-datacenter Transfers. In EuroSys '15. Pp. 20:1–20:14. ACM, New York, NY, USA, 2015. doi:10.1145/2741948.2741957

[ZCY+16] H. Zhang, L. Chen, B. Yi, K. Chen, M. Chowdhury, Y. Geng. CODA: Toward Automatically Identifying and Scheduling Coflows in the Dark. In SIGCOMM '16. Pp. 160–173. ACM, New York, NY, USA, 2016. doi:10.1145/2934872.2934880

[ZDCG09] C. Zhou, D. Di, Q. Chen, J. Guo. An adaptive AQM algorithm based on neuron reinforcement learning. In IEEE ICCA. Pp. 1342–1346. 2009.