International Journal of Interactive Mobile Technologies (iJIM) – eISSN: 1865-7923 – Vol. 15, No. 12, 2021

A Reinforcement Learning Approach for Interference Management in Heterogeneous Wireless Networks

https://doi.org/10.3991/ijim.v15i12.20751

Akindele Segun Afolabi (*), University of Ilorin, Ilorin, Nigeria, afolabisegun@unilorin.edu.ng
Shehu Ahmed, The Nigerian Television Authority, Ilorin, Nigeria
Olubunmi Adewale Akinola, Federal University of Agriculture, Abeokuta, Nigeria

Abstract—Due to the increased demand for scarce wireless bandwidth, it has become insufficient to serve network user equipment with macrocell base stations alone. Network densification through the addition of low-power nodes (picocells) to conventional high-power nodes addresses the bandwidth dearth but unfortunately introduces unwanted interference into the network, which reduces throughput. The purpose of this paper is to develop a model for controlling the interference between picocell and macrocell users of a cellular network so as to increase overall network throughput. To achieve this, a reinforcement learning model was developed and used to coordinate interference in a heterogeneous network comprising macrocell and picocell base stations. The learning mechanism was derived from Q-learning, which consists of an agent, states, actions, and rewards. The base station was modeled as the agent, while the state represented the condition of the user equipment in terms of Signal to Interference Plus Noise Ratio (SINR). The action was represented by the transmission power level, and the reward was given in terms of throughput. Simulation results showed that the trend of the learning-rate values (e.g., high to low, low to high) plays a major role in throughput performance.
It was particularly shown that a multi-agent system with a normal learning rate could increase the throughput of associated user equipment by 212.5% compared to a macrocell-only scheme.

Keywords—Heterogeneous Network, Q-Learning, Macrocell, Picocell, Interference

1 Introduction

Mobile broadband usage has increased dramatically in the last couple of years due to new types of terminals such as smartphones and tablet computers [1, 2]. Traditional homogeneous networks [3, 4], comprising only macrocell base stations (BSs), have become insufficient to meet the high traffic demands and stringent quality of service (QoS) requirements of mobile broadband communications [5]. A key method of fulfilling these traffic demands is network densification, which involves adding smaller low-power nodes, such as picocells, to traditional high-power macro nodes. This results in what are termed "Heterogeneous Networks", or simply HetNets [3]-[7]. HetNets are expected to boost capacity and coverage beyond what macrocells provide and have been regarded as a promising paradigm for offering mobile users a high-quality experience [2, 5]. However, network densification through the addition of picocells introduces harmful interference into the network [2, 4, 8]. The influence of picocell densification on network performance is therefore of great interest, and the use of sophisticated inter-cell interference management techniques is crucial. This paper aims at developing a learning model for coordinating the inter-cell interference existing between picocell and macrocell base stations for the purpose of improving network throughput. The ability to learn new behaviours and adapt to the temporal dynamics of a system is associated with reinforcement learning (RL).
Q-learning (QL), a basic form of RL, is adopted in this paper. Q-learning is related to the Markov decision process, in which learning agents interact with their environment to achieve desired goals (rewards). A Q-learning model has a set of states S, actions A, and rewards R, and its learning cycle is a state-action-reward process. During learning, an agent takes an action a ∈ A that interacts with the environment; the agent then enters a state s(t) ∈ S and receives a reward r(s(t)) ∈ R. The objective is to select, at each state s, the actions that maximize the reward r [9]-[12]. The agent observes the state of the environment and takes actions that affect that state; in addition, a goal relating to the state of the environment must be introduced. Learning can be performed using a centralised (single-agent) [9, 10] or a distributed (multi-agent) approach [10, 11, 12]. A decentralized learning approach is effective for solving complex problems, since each agent in a multi-agent system specializes in solving a particular sub-problem. A multi-agent system is therefore useful if a model can be developed for the agents' behaviour in terms of desires and goals. The performances of single- and multi-agent systems are compared in this paper.

2 Related Works

The problem that interference poses to heterogeneous networks has recently dominated discussions in the research community [13]-[26]. In heterogeneous cellular networks, a user equipment (UE) at the cell border always experiences high interference from neighbouring transmitters, as distance-dependent attenuation is a critical issue [27]. Spectrum splitting is a mechanism used by the operator of a multi-tier network to divide the available sub-bands among the cells so as to mitigate interference; it can be carried out using a centralized approach.
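As a concrete illustration of this state-action-reward cycle, the following minimal sketch trains a tabular Q-learning agent on a hypothetical two-state toy environment. The environment, ε-greedy policy, and parameter values are our own illustrative choices, not the paper's network model:

```python
import random

random.seed(0)  # for reproducibility of this illustrative run

ALPHA, BETA = 0.5, 0.9          # learning rate and discount factor (assumed)
states, actions = [0, 1], [0, 1]
Q = {(s, a): 0.0 for s in states for a in actions}  # tabular Q-values

def step(state, action):
    """Toy environment: reward 1.0 when the action matches the state."""
    reward = 1.0 if action == state else 0.0
    next_state = random.choice(states)
    return next_state, reward

s = 0
for _ in range(500):
    if random.random() < 0.2:                       # explore occasionally
        a = random.choice(actions)
    else:                                           # otherwise exploit
        a = max(actions, key=lambda x: Q[(s, x)])
    s_next, r = step(s, a)
    best_next = max(Q[(s_next, x)] for x in actions)
    # State-action-reward update: blend old estimate with observed reward
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + BETA * best_next)
    s = s_next
```

After a few hundred cycles, the Q-table ranks the rewarding action of each state above the unrewarding one, which is exactly the "select actions based on maximized reward" objective described above.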
Splitting the spectrum in a centralized fashion assigns sub-bands to the macrocell base station (MBS) and the small cells by means of a controller, which achieves efficient resource utilization at the expense of complexity and signalling overhead. The authors in [28] proposed an interference coordination method in which the resource partitioning operation is done centrally, such that sub-bands are assigned to the base stations in the network based on a weighted vertex colouring operation executed by a central controller.

The authors in [29]-[31] used beam-forming techniques to mitigate interference. Specifically, [29] proposed a method called "dynamic interference steering", in which interference is steered in an optimum direction where its impact on an interference victim is minimized. Ref. [30] applied a cross-layer approach in which interference coordination is performed at both the Physical and MAC (Media Access Control) layers: beam-forming is used to suppress interference at the physical layer, while an optimisation problem is solved at the MAC layer to determine the set of users that a resource block will accommodate such that interference is reduced. In Ref. [31], a beam selection scheme known as "beam skipping" is used to optimise a performance utility in a way that reduces inter-beam interference.

Self-organization and self-configuration are useful features that are usually exploited during interference coordination. Self-Organizing Networks (SONs) [32] attempt to minimize human intervention by using measurements from the network to reduce the cost of installing, configuring, and maintaining a network [33]. Base stations can be made to self-organise by learning from their environment.
Several research works on Q-learning-based interference coordination exist in the literature, in which agents (usually base stations) self-organise based on network measurements. A self-organised method for mitigating interference was proposed in [9]. It considers a vehicular network in which a base station agent selects the optimal resource control policy during each action policy interval of the learning and running phases. During the learning phase, the agent is trained to maximize the expected future reward by updating the Q elements until convergence is attained. In the running phase, the agent chooses the action that yields the highest expected reward from the updated Q. Ref. [34] presents a multi-agent deep reinforcement learning system in which femtocell and macrocell base stations act as agents whose goal is to maximize network capacity. The neural network used in the system enhances its ability to process large amounts of state information, and the parallel operation of multiple agents ensures that the overall network interference is reduced, thereby enhancing capacity. In [35], a downlink reinforcement-learning-based interference control algorithm is presented. The algorithm employs a convolutional neural network to estimate Q-values, which reduces the size of the state space; after a sufficient number of power control iterations, the network throughput is significantly increased. The authors in [36] developed a reinforcement learning algorithm for the optimal configuration of interference coordination parameters that have stochastic characteristics, such as the location of users, traffic demands, and the strength of received signals.

3 System Model and Formulations

In this model, a learning-based strategy is considered for interference coordination in an environment where a macrocell and a picocell co-exist. Our focus is on the analysis of a network deployment with a picocell underlaying a macrocell network.
The total bandwidth (BW) of the network is divided into sub-carriers, with each sub-carrier having a bandwidth of Δf (15 kHz). Resource blocks are grouped using orthogonal frequency division multiplexing (OFDM) symbols as shown in Figure 1. Both the macrocell and the picocell operate in the same frequency band, and they have access to the same set of resource blocks.

Fig. 1. LTE downlink physical resource based on OFDM

When the picocell and macrocell utilise the same spectrum, an inter-cell interference problem emerges. A typical collocation scenario of a picocell and a macrocell is shown in Figure 2. In this scenario, the downlink transmissions from the MBS or picocell base station (PBS) create strong interference at a nearby macrocell user equipment (MUE) or picocell user equipment (PUE) and may degrade the received macrocell or picocell signal at the MUE or PUE. Hence, inter-cell interference hampers successful macrocell-picocell co-existence.

Fig. 2. A heterogeneous scenario

3.1 Computation of signal to interference plus noise ratio

We consider that each MBS has a set of MUEs associated with it and that the MUEs periodically report the quality of each resource block (RB), in terms of signal to interference plus noise ratio (SINR), to their serving MBS in order to facilitate channel-aware scheduling. PBSs in the vicinity of an MUE constitute interference sources to the downlink signal of the MUE and, in a similar way, the MBS interferes with the downlink signal of a PUE. Without loss of generality, only downlink transmission is considered in this work, since interference in the downlink is usually more severe than in the uplink, especially when the interfering base station is in close proximity to the victim UE.
It can be observed in Figure 2 that the interference between the macrocell and the picocell is mutual; that is, the transmissions of the MBS interfere with the PUE's received signal, and the transmissions of the PBSs interfere with the MUE's received signal. The SINR of MUE i is computed as:

$$\gamma_i^{\mathrm{MUE}} = \frac{P_{\mathrm{MBS}_d} h_{\mathrm{MBS}_d,\mathrm{MUE}_i}}{\sum_{f=1,\, f \neq d}^{M} P_{\mathrm{MBS}_f} h_{\mathrm{MBS}_f,\mathrm{MUE}_i} + \sum_{g=1}^{\psi} P_{\mathrm{PBS}_g} h_{\mathrm{PBS}_g,\mathrm{MUE}_i} + \delta^2} \quad (1)$$

where P_MBS_d denotes the transmit power from serving MBS d to the i-th MUE, P_MBS_f denotes the transmit power from interfering MBS f to the i-th MUE, P_PBS_g denotes the transmit power from interfering PBS g to the i-th MUE, h_MBS_d,MUE_i denotes the link gain between serving MBS d and the i-th MUE, h_MBS_f,MUE_i denotes the link gain between interfering MBS f and the i-th MUE, h_PBS_g,MUE_i denotes the link gain between interfering PBS g and the i-th MUE, and δ² denotes the noise power. Similarly, the SINR of PUE j is computed as:

$$\gamma_j^{\mathrm{PUE}} = \frac{P_{\mathrm{PBS}_k} h_{\mathrm{PBS}_k,\mathrm{PUE}_j}}{\sum_{f=1}^{M} P_{\mathrm{MBS}_f} h_{\mathrm{MBS}_f,\mathrm{PUE}_j} + \sum_{g=1,\, g \neq k}^{\psi} P_{\mathrm{PBS}_g} h_{\mathrm{PBS}_g,\mathrm{PUE}_j} + \delta^2} \quad (2)$$

where P_PBS_k denotes the transmit power from serving PBS k to the j-th PUE, h_PBS_k,PUE_j denotes the link gain between serving PBS k and the j-th PUE, h_MBS_f,PUE_j denotes the link gain between interfering MBS f and the j-th PUE, and h_PBS_g,PUE_j denotes the link gain between interfering PBS g and the j-th PUE. By applying Shannon's capacity formula, the data rate achieved on an RB with SINR γ scheduled by the base station is computed as:

$$C = BW_{\mathrm{RB}} \log_2 (1 + \gamma) \quad (3)$$

where BW_RB denotes the resource block bandwidth (in Hertz). The throughput of a UE is a function of the SINR and is expressed as:

$$TP_{\mathrm{user}} = f(\gamma) \quad (4)$$

3.2 Model formulation

This section presents a single-agent, and also a multi-agent, learning approach to solve the inter-cell interference problem in HetNets. In this study, the base station is modeled as the learning agent, as shown in Figures 3(a) and (b). It learns the condition or state of the UE in terms of interference level before taking a power allocation action on the RBs of the UEs, while ensuring that the best reward in terms of throughput is realised.
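As a rough numerical illustration of the SINR and rate computations of Section 3.1 (Eqns. (1)-(3)), the sketch below evaluates a downlink SINR for one MUE and the corresponding Shannon rate on one resource block. The power, gain, and noise values are illustrative placeholders, and the 180 kHz resource-block bandwidth assumes the standard LTE grouping of 12 sub-carriers of 15 kHz each:

```python
import math

DELTA_SQ = 1e-13          # noise power delta^2 in watts (assumed value)
BW_RB = 180e3             # RB bandwidth in Hz (12 x 15 kHz, standard LTE)

def sinr_mue(p_serving, h_serving, p_interf_mbs, h_interf_mbs,
             p_interf_pbs, h_interf_pbs, noise=DELTA_SQ):
    """Eqn. (1): serving-MBS received power divided by the summed
    interfering MBS and PBS received powers plus noise."""
    interference = (sum(p * h for p, h in zip(p_interf_mbs, h_interf_mbs)) +
                    sum(p * h for p, h in zip(p_interf_pbs, h_interf_pbs)))
    return (p_serving * h_serving) / (interference + noise)

def rate_on_rb(gamma, bw=BW_RB):
    """Eqn. (3): Shannon capacity achieved on one resource block."""
    return bw * math.log2(1 + gamma)

# One interfering MBS and one interfering PBS, with placeholder link gains.
gamma = sinr_mue(40.0, 1e-7, [40.0], [1e-9], [1.0], [1e-9])
rate = rate_on_rb(gamma)
```

The UE throughput of Eqn. (4) is then simply a function of such per-RB rates aggregated over the RBs the scheduler assigns to the UE.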
We consider both single- and multi-agent Q-learning models, such that in the former either the PBS or the MBS acts as an agent (not both concurrently), while in the latter both act as agents concurrently. Throughout the rest of this paper, the terms "Q-learning" and "reinforcement learning" are used interchangeably.

a) A single agent learning model where macrocell BS is the learning agent.
b) A single agent learning model where picocell BS is the learning agent.

Fig. 3. Agent learning model

In the next section, we present our proposed learning method for mitigating the aggregate interference generated between the picocell and the macrocell of a HetNet. The section also introduces the concept of learning rate.

Enhanced inter-cell interference coordination based on single agent Q-learning model: A single agent, which is either a picocell or macrocell BS (but not both), learns the SINR condition of the resource blocks of its UE. Typical single-agent scenarios are illustrated in Figures 3(a) and (b). Let D represent a set of similar base stations; in this context, we assume that picocell BSs are similar to each other but dissimilar to macrocell BSs, and vice versa. A single agent x is then such that:

$$x \in D, \qquad D \cup D' = \psi \cup M \quad (5)$$

where ψ denotes the set of picocell BSs and M the set of macrocell BSs. The actions of the learning agent x ∈ D, the associated states, and the reward functions are explained next:

• Agent: This is base station x, which is a member of the set of similar base stations D that satisfies Eqn. (5).
• State: The state represents the condition of a UE within the cell of agent x ∈ D based on the SINR seen on RB r of its UE.
The set of states of a UE for all N RBs can be represented mathematically by:

$$\mathbf{S}^x = \left\{ s_r^x \right\}, \quad r \in \{1, \ldots, N\} \quad (6)$$

where

$$s_r^x = \begin{cases} 0, & \gamma_r^x < \gamma_T \ (\text{bad state}) \\ 1, & \gamma_r^x \geq \gamma_T \ (\text{good state}) \end{cases} \quad (7)$$

where γ_r^x is the instantaneous value of the SINR reported by a UE on resource block r served by learning agent x ∈ D, while γ_T is the SINR threshold used to classify an RB as being in either a good or a bad state. Depending on the state of an RB of its UE, the agent takes an action. For instance, if the agent observes that an RB r of a UE has an SINR below the threshold γ_T, it takes an action different from the one it would take when the SINR is above the threshold. The actions of the BS are described next.

• Action: The action is the power level allocated by the single agent x ∈ D to resource block r ∈ {1, 2, ..., N} of its served UE. The possible set of actions is represented mathematically as:

$$\mathbf{A}^x = \left\{ a_r^x \right\}, \quad r \in \{1, \ldots, N\} \quad (8)$$

where

$$a_r^x = \begin{cases} 0, & s_r^x = 0 \ (\text{zero power will be loaded on resource block } r) \\ 1, & s_r^x = 1 \ (\text{full power will be loaded on resource block } r) \end{cases} \quad (9)$$

where a_r^x is the transmitted power level of agent x ∈ D on resource block r. In this paper, we consider two possible transmit power levels for each resource block: maximum power and zero power. Maximum power is loaded when the reported SINR for an RB is above the threshold γ_T, which indicates a state of 1, while zero power is loaded for a state of 0.

• Reward: The reward is the capacity achieved by the single agent x ∈ D on resource block r when transmitting at power level a_r^x to a UE associated with it. It is represented mathematically as:

$$R_r^x = C_r^x \quad (10)$$

where C_r^x of agent x ∈ D is computed according to Eqn. (3), such that whenever zero transmission power is loaded on a resource block r (bad state of the RB of the UE), the capacity of that resource block is zero, meaning that the reward is zero, and vice versa. A reward of 0 is regarded as a penalty; by learning an optimal policy, agent x will, after some time, be able to avoid actions causing zero rewards and instead take those that yield higher rewards.

Fig. 4. A multi-agent learning model where both macro-cell BS and picocell BS are learning agents.

Enhanced inter-cell interference coordination based on multi-agent Q-learning model: In this section, the multi-agent Q-learning model, in which both the picocell and the macrocell serve as learning agents that learn the condition of the UE, is introduced. In this approach, multiple agents (picocell and macrocell) carry out the learning process by repeatedly interacting with the environment to provide the best reward to their associated user equipment. A typical multi-agent learning scenario is illustrated in Figure 4. The set of learning agents in this case comprises all picocell and macrocell base stations, represented as ψ ∪ M. The learning agents, actions, states, and reward functions are designed and explained as follows:

• Agent: An agent y is a member of ψ ∪ M.
• State: The state represents the condition of a UE within the cell of agent y ∈ (ψ ∪ M) based on the SINR seen on RB r of the UE. The set of states of a UE for all N RBs can be represented mathematically by:

$$\mathbf{S}^y = \left\{ s_r^y \right\}, \quad r \in \{1, \ldots, N\} \quad (11)$$

where

$$s_r^y = \begin{cases} 0, & \gamma_r^y < \gamma_T \ (\text{bad state}) \\ 1, & \gamma_r^y \geq \gamma_T \ (\text{good state}) \end{cases} \quad (12)$$
where γ_r^y is the instantaneous value of the SINR reported by a UE on resource block r served by learning agent y ∈ (ψ ∪ M), and s_r^y ∈ S^y represents the state of an RB r of the UE in the cell of agent y, taking the value 0 if the SINR of the user falls below a certain threshold γ_T and the value 1 otherwise. The MBS and PBS, as agents, are capable of jointly observing interference levels through the periodic SINR reports received from their associated UEs. If the reported SINR falls below the threshold γ_T, the BS identifies the RB as occupied (i.e., in a bad state) and takes a subsequent action, which is explained next.

• Action: For the multi-agent scenario, the action is defined as:

$$\mathbf{A}^y = \left\{ a_r^y \right\}, \quad r \in \{1, \ldots, N\} \quad (13)$$

where

$$a_r^y = \begin{cases} 0, & s_r^y = 0 \ (\text{zero power will be loaded on resource block } r) \\ 1, & s_r^y = 1 \ (\text{full power will be loaded on resource block } r) \end{cases} \quad (14)$$

where a_r^y, just as in the single-agent case, is the transmitted power level of agent y ∈ (ψ ∪ M) on resource block r.

• Reward: In this paper, the reward is the capacity achieved by the multi-agent y while transmitting to a UE in its cell. It is represented mathematically as:

$$R_r^y = C_r^y \quad (15)$$

where C_r^y is computed according to Eqn. (3). The rationale behind this reward function is that the Q-learning model aims to select the optimum power level capable of improving the capacity of the UEs associated with agent y ∈ (ψ ∪ M).

Algorithm of Q-learning for the inter-cell interference coordination scenario: To achieve interference coordination, exploration is performed and the Q-learning equation is updated as:

$$Q^*(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha \left( R + \beta \max_{a'} Q(s',a') \right) \quad (16)$$

where Q*(s,a) is the learnt or updated Q-value corresponding to the quality value of the state (interference) and the action (allocated power level) that gave the best reward in terms of the throughput of the UE. Q(s,a) is the previous Q-value, corresponding to the quality value of the state (interference) and the action (power level allocation) previously learnt by the agent (base station), which does not result in an optimum reward in terms of throughput to the UE [34].
Q(s′,a′) is the next Q-value learnt by the agent (base station) after observing that Q(s,a) is not optimal, and it is used to give a correct update. The action a denotes the transmitted power level of the agent (base station). α and β are the learning rate and discount factor, respectively. Note that x, y, and r are omitted from Eqn. (16) in order to reduce notational complexity. The algorithm for the computation of the Q-value and associated parameters is given in Algorithm 1 [34].

Learning rate (α) model: The learning rate reflects the willingness of the agent to learn from its environment. Three types of learning rate are considered in this paper:

Normal learning rate:

$$\alpha = \frac{1}{\Theta} \quad (17)$$

Logarithm learning rate:

$$\alpha = \frac{\log_2 (1 + \Theta)}{\Theta} \quad (18)$$

Polynomial learning rate:

$$\alpha = \frac{\Theta}{1 + \Theta^2} \quad (19)$$

where Θ is the state learning indicator and the update of α is constrained by Eqn. (20), which is computed as:

(20)

where ε is a small positive number greater than 0. Equation (20) ensures that α is always maintained in the interval (0, 1) during any learning episode.

Analysis of the learning rate model: Considering Eqn. (16), it can be observed that 1−α and α are the weighting factors of the previous value Q(s,a) and the new estimate R + β max Q(s′,a′), respectively. This indicates that when α is high, the new estimate contributes significantly to the updated value of Q*(s,a). This allows the system to explore new state and action pairs that have the tendency of yielding higher rewards.
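This exploration/exploitation weighting can be seen directly in a minimal sketch of the update in Eqn. (16). The Q-table layout and the parameter values below are illustrative choices, not taken from the paper:

```python
def q_update(q, s, a, reward, s_next, actions, alpha, beta):
    """Eqn. (16): blend the old Q-value (weight 1 - alpha) with the newly
    observed reward plus the discounted best next Q-value (weight alpha)."""
    best_next = max(q[(s_next, a2)] for a2 in actions)
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * (reward + beta * best_next)
    return q[(s, a)]

# A high alpha makes the update track the new estimate; a low alpha keeps
# the old value. Toy table: one action, old Q(0,0) = 4.0, next best = 2.0.
q = {(0, 0): 4.0, (1, 0): 2.0}
new_high = q_update(dict(q), 0, 0, 1.0, 1, [0], alpha=1.0, beta=0.5)  # -> 2.0
new_low = q_update(dict(q), 0, 0, 1.0, 1, [0], alpha=0.0, beta=0.5)   # -> 4.0
```

With alpha = 1.0 the old value 4.0 is discarded entirely in favour of the new estimate 1.0 + 0.5 × 2.0 = 2.0, whereas alpha = 0.0 preserves the old value unchanged.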
On the other hand, when α is low, the previous value Q(s,a) contributes significantly to the updated value of Q*(s,a), making the system more likely to adopt already known values (i.e., exploitation is favoured). Figure 5 graphically illustrates the relationship between the learning rate and the state learning indicator.

Fig. 5. Illustration of α as a function of Θ

It can be observed that for the Normal and Logarithm learning rates, the learning rate is a monotonically decreasing function of the state learning indicator. In the case of the Polynomial learning rate, the learning rate initially increases and subsequently decays as the state learning indicator increases. If we consider two extreme values of Θ (i.e., 1.0 and 5.0), the behaviour of the learning system can be summarised as shown in Table 1.
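Assuming the three schedules take the forms α = 1/Θ, α = log₂(1+Θ)/Θ, and α = Θ/(1+Θ²) (our reading of the extracted equations), their contrasting behaviour over Θ can be sketched as:

```python
import math

# Learning-rate schedules as functions of the state learning indicator Θ.
def alpha_normal(theta):
    return 1.0 / theta                     # monotonically decreasing

def alpha_logarithm(theta):
    return math.log2(1.0 + theta) / theta  # monotonically decreasing

def alpha_polynomial(theta):
    return theta / (1.0 + theta ** 2)      # rises to a peak at theta = 1,
                                           # then decays

# Comparing the two extreme values of Θ discussed in the text:
for theta in (1.0, 5.0):
    print(theta, alpha_normal(theta),
          alpha_logarithm(theta), alpha_polynomial(theta))
```

Under these forms, the Normal and Logarithm rates shrink as learning progresses (favouring exploitation late), while the Polynomial rate first grows and then shrinks, which matches the qualitative behaviour described for Figure 5.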