Acta Polytechnica doi:10.14311/AP.2014.54.0367 Acta Polytechnica 54(5):367–377, 2014 © Czech Technical University in Prague, 2014 available online at http://ojs.cvut.cz/ojs/index.php/ap

TOWARDS EVOLUTIONARY DESIGN OF COMPLEX SYSTEMS INSPIRED BY NATURE

Jaroslav Vítků∗, Pavel Nahodil
Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Cybernetics, Technická 2, 166 27 Prague 6, Czech Republic
∗ corresponding author: vitkujar@fel.cvut.cz

Abstract. This paper presents the first steps towards the evolutionary design of complex autonomous systems. The approach is inspired by the modularity of the human brain and by the principles of evolution. Rather than evolving neural networks or neural-based systems, the approach focuses on evolving hybrid networks composed of heterogeneous sub-systems implementing various algorithms/behaviors. Currently, evolutionary techniques are used to optimize the weights between predefined blocks (so-called Neural Modules) in order to find an agent architecture appropriate for a given task. The framework, together with a simulator of such systems, is presented here. Then, examples of agent architectures represented as hybrid networks are presented. One architecture is hand-designed and one is automatically optimized by means of an Evolutionary Algorithm. Even such a simple experiment shows how evolution is able to pick up unexpected attributes of the task and exploit them when designing a new architecture.

Keywords: agent, architecture, artificial life, creature, behavior, hybrid, neural network, evolution.

1. Introduction — problem decomposition

Many researchers have had a long-term goal: to build an autonomous robotic system that is able to operate in real-world conditions in a robust way. However, after many years of research and development, there are still no satisfactory results. The difficult task of building reliable robots decomposes into too many smaller sub-problems, such as vision or reasoning in unknown domains, and most of these smaller problems do not yet have a feasible solution.

1.1. Modular design in practice

Nevertheless, many of these small problems have already been solved relatively well, for example pattern recognition, planning and reinforcement learning. In order to build a well-performing system it is often possible simply to reuse these known systems, or, even better, to reuse existing implementations of known algorithms that have been tested and have performed well on a given task. Building a more complex system then becomes faster and simpler. This approach of re-using modules is exemplified by the Robot Operating System (ROS). The main goal of ROS is to provide nodes — implementations of algorithms — that can be re-used in robotic applications (such as path-planning, vision, etc.) [1]. The user picks selected nodes and composes the resulting system from them. Generally, such an approach is called Top-down.

The opposite direction is called a Bottom-up design approach. Connectionist Artificial Neural Networks (ANNs) provide an example. In ANNs, no single piece of the system (a neuron) implements useful processing alone. However, when composed together, complex behavior can be obtained, often through emergence. The most traditional design of ANNs is as follows: (1) predefine some structure of the ANN; (2) optimize the weights between neurons in order to obtain the desired behavior.
Optimization can be done by a local algorithm [2], or by global approaches such as neuro-evolution [3]. More recently, ANNs have often been designed according to a principle called "Neural Engineering". This is a Top-down approach, which works with Modular Neural Networks (MNNs) [4]. The purpose of each module (sub-network) is defined, and the required behavior of the resulting system is then obtained by composing multiple sub-networks together [5]. In this way, complex systems can be composed [6].

1.2. How can systems be designed in a general way?

Both of the approaches mentioned here have their own benefits and their own weaknesses. The use of already-implemented pieces of robotic software provides high reliability, and the user has good insight into the inner functionality. However, the resulting design can be very constrained, and the design possibilities are limited. ANNs provide unconstrained design options and are very good at dealing with uncertainty in data, but their structure is often too complex to be understood directly and altered reliably.

Here, the authors focus on designing agent architectures (learning and decision-making systems for (virtual) robots) in a hybrid way. The process of designing agents is not fixed to either of the design approaches mentioned above, but takes advantage of both. The presented framework, called Hybrid Artificial Neural Network Systems (HANNS), is novel in that it attempts to unify Modular Neural Networks with purely top-down, practical sub-systems, such as those implemented in ROS. One of the main goals of this framework is to be able to automate the design of new architectures for a given task. This approach will be shown here on a simple example.

Figure 1. A Neural Module with two data inputs, four data outputs and four configuration inputs. The module also contains a "Prosperity" output, which provides a subjective heuristic defining how well the module performs in the current architecture during the current simulation. Here, a modem serves as a communication interface between the NengoROS simulator and an external ROS node, which implements a given algorithm/functionality.

1.3. Structure of the text

The following chapter describes the main concepts of our HANNS framework, together with the NengoROS simulator. The third chapter shows these principles applied to a simple agent architecture implementing Motivation-Driven Reinforcement Learning (RL). Then the evolutionary approach is used to design a similar architecture automatically. Finally, the two results are compared and discussed.

2. Hybrid Artificial Neural Network Systems

Hybrid Artificial Neural Networks can be described as Modular ANNs [4] composed of heterogeneous subsystems. Each subsystem can implement various methods of decision-making and employ various representations of information processing, from neural-like to symbolic representations [7]. This means that (for connecting a symbolic-based module to a neural network) the symbolic representation has to be derived from the activity of particular neurons, for example by means of a predefined lexicon, rule extraction or similar methods. This chapter describes the main principles employed by the presented Hybrid Artificial Neural Network Systems (HANNS) framework.
The framework uses seamless communication between all modules in the network, with encapsulated information transformation where needed. The particular strategy of information transformation belongs to the module itself and is hidden from the rest of the hybrid network.

2.1. Communication and Sub-system Representation

In order to deal with connecting different modules which use different types of communication, a common type of communication had to be defined. Since there is a possibility that every sub-system could one day be implemented by neural computation, the aim is to add higher-level subsystems into ANNs. Therefore the entire network uses neural-like communication in this framework. Each "Neural Module" subsystem can implement arbitrary behavior and has a defined number of input and output connections. Each connection can represent a real-valued number, typically from the interval ⟨0, 1⟩. The scheme of an example Neural Module is shown in Fig. 1. The figure shows that the framework uses simple communication in the network, while a transformation from/to a symbolic (or other) domain is implemented inside a particular Neural Module (depicted as the Modem in the schematics), if needed. According to [7], this framework uses "hybrid system coupling interleaved by function calls", where all activity on the inputs of a Neural Module is translated into the inner representation of the Module and is processed accordingly. After processing the data, the Module encodes the result back into the "neural communication" and sets the data on its outputs.

2.2. Configuration of a Neural Module

This type of Neural Module can implement various types of information processing. However, even domain-independent algorithms often need fine-tuning of their parameters in order to work efficiently. Therefore the Neural Module has configuration inputs in addition to data inputs and outputs. These inputs are represented in the same way as data inputs, but their purpose is to define the parameters of the algorithm(s) encapsulated in the Neural Module. These parameters can be used as normal data inputs, which means that their value can be changed during the simulation. The only difference is that these inputs can be ignored: if the configuration inputs are left unconnected, they hold a predefined (default) value of the algorithm parameters. This can be used to simplify the overall complexity of the network topology.

2.3. Prosperity of the Neural Module

One of the main aims of this framework is to study new, alternative use-cases of known algorithms/subsystems. During the automatic design of these Hybrid Artificial Neural Network Systems, the following use cases of a Neural Module can occur:
• The Neural Module is connected in a completely wrong way: the corresponding algorithm is used inefficiently or it does not work at all.
• The Neural Module is connected in an unexpected, new way: the corresponding algorithm is employed, but in a way not anticipated by the designer, a potentially new use of the algorithm.
• The Neural Module is connected in an expected way: the algorithm is employed as expected during the Neural Module design.
The second of these three cases does not necessarily mean that the algorithm is not advantageous in the architecture. However, the behavior of an algorithm (and potentially its purpose in the architecture) can often be hard to analyze by hand.
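Before turning to how such misuse can be detected automatically, the module interface described in Sections 2.1–2.3 can be summarized by a minimal sketch. Python is used purely for illustration; the class and method names below are assumptions of this sketch and do not correspond to the actual HANNS/NengoROS API.

```python
from abc import ABC, abstractmethod

class NeuralModule(ABC):
    """Illustrative sketch of the Neural Module interface described above.

    A module exposes real-valued data inputs/outputs (typically in <0, 1>),
    optional configuration inputs that fall back to default parameter values
    when left unconnected, and a single Prosperity output in <0, 1>.
    """

    def __init__(self, n_inputs, n_outputs, config_defaults):
        self.n_inputs = n_inputs
        self.n_outputs = n_outputs
        self.config_defaults = dict(config_defaults)  # e.g. {"alpha": 0.5}
        self.config_connections = {}                  # name -> callable returning a value

    def connect_config(self, name, source):
        """Wire a configuration input; unconnected inputs keep their defaults."""
        self.config_connections[name] = source

    def config(self, name):
        source = self.config_connections.get(name)
        return source() if source is not None else self.config_defaults[name]

    @abstractmethod
    def step(self, inputs):
        """Decode neural-like inputs, run the wrapped algorithm, encode outputs."""

    @abstractmethod
    def prosperity(self):
        """Subjective heuristic in <0, 1>: how well the module is used so far."""
```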
In order to identify incorrectly used parts of the resulting architecture, it would be convenient to distinguish between these three use-cases automatically. Furthermore, it would be useful to be able to evaluate the performance of the algorithms in a given situation. However, this is possible only by means of heuristics. The value of the Prosperity output is a subjective heuristic defining "how well the algorithm performs" in a given architecture during the simulation. This enables the user (and potentially an EA) to distinguish between good and bad parts of a particular architecture. It is up to the designer of a particular Neural Module how to define its Prosperity function. The function should produce values in the interval ⟨0, 1⟩.

2.4. The NengoROS simulator

The next goal of our HANNS framework is to provide a platform for simulating agent architectures. In order to rapidly prototype and simulate these architectures, a simulator of Hybrid Artificial Neural Network Systems was created. The main objectives were the following: maximum reuse of current implementations of algorithms, decentralized and event-driven simulation, and integration with an advanced simulator of large-scale ANNs.

The resulting open-source system is called NengoROS [8]. It connects a simulator of large-scale ANNs of the 3rd generation called Nengo [9] with the Robot Operating System (ROS) [1]. The Nengo simulator was created with the goal of implementing the Neural Engineering Framework (NEF), and it therefore supports modular ANNs. The addition of ROS enables the simulator to use any ROS-enabled subsystem in the simulation. The integration of ROS into the simulator is depicted in Fig. 1, where the component called "Q-Lambda Module" is an external ROS node and the "Modem" serves as the interface between Nengo and ROS. The ROS-based components can be run remotely and can represent a particular piece of SW, a simulated world or even robotic HW.

3. Description of selected Neural Modules

An example of the use of the presented framework will be shown on two selected Neural Modules. The first module implements the Reinforcement Learning algorithm, and the other module represents the physiological state of the agent. Together, these modules implement a principle called Motivation-driven Reinforcement Learning. For each of these algorithms, the theory will be briefly described first; then its integration into a Neural Module will be introduced.

3.1. Reinforcement Learning Module

Agent architectures from the domain of Artificial Life (ALife) often require online and model-free Reinforcement Learning (RL). An algorithm called Q-Learning meets these requirements. This discrete algorithm learns a desired strategy only by means of interaction with the environment, based on the rewards/punishments received for the actions produced. The algorithm learns behavior which leads towards the nearest reward while avoiding punishments. The algorithm is encapsulated here as a standalone sub-system — a Neural Module. This particular example (the use of RL in HANNS) can be likened to Ensemble Algorithms in Reinforcement Learning [10], or to the Aggregated Multiple Reinforcement Learning System (AMRLS) [11]. Compared to these, a single ensemble is represented as a Multiple-Input Multiple-Output sub-system, which communicates in a way compatible with the 2nd generation of artificial neurons. The Action Selection Mechanism (ASM) is currently also integrated directly in the Neural Module.
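In the simulator, each such algorithm lives in its own ROS node behind a Modem. The following rospy sketch is only a hypothetical illustration of this wrapping pattern; the topic names and the use of Float32MultiArray are assumptions of this sketch and do not reflect the actual message conventions of NengoROS.

```python
#!/usr/bin/env python
# Hypothetical illustration of an external algorithm wrapped as a ROS node.
# Topic names and the Float32MultiArray message are assumptions of this sketch.
import rospy
from std_msgs.msg import Float32MultiArray

class DummyAlgorithmNode:
    def __init__(self):
        self.pub = rospy.Publisher('module/data_out', Float32MultiArray, queue_size=10)
        rospy.Subscriber('module/data_in', Float32MultiArray, self.on_input)

    def on_input(self, msg):
        # Decode the neural-like input vector, run the algorithm, encode the result.
        result = [min(1.0, max(0.0, x)) for x in msg.data]  # placeholder computation
        self.pub.publish(Float32MultiArray(data=result))

if __name__ == '__main__':
    rospy.init_node('dummy_algorithm_node')
    DummyAlgorithmNode()
    rospy.spin()
```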
Figure 2 presents a graphical representation of our modification of the Q-Learning algorithm. This algorithm runs inside a standalone ROS node and communicates externally only by means of ROS messages. Figure 1 shows the integration of this ROS node into a Neural Module. The Module is compatible with the HANNS framework, and can be seamlessly connected into a network of heterogeneous nodes.

Figure 2. Scheme of the Stochastic Return Predictor (SRP) implementation. It is composed of the Q-Lambda algorithm and the ASM. The (sub-)system is implemented as a stand-alone ROS node, which can be used as a Neural Module. Outputs encode a selected action by means of the 1-of-N code (where N is the number of available actions). M state variables are encoded by M data inputs, and each data input is sampled into a predefined number of discrete values. The node selects one action at each time-step and expects information about the new state and the reward. The node has the following configuration inputs: α, γ, λ and Importance, which affect learning and the Action Selection Method (if these inputs are not connected, default values are used). The Prosperity heuristic is the average of the MCR and the overall coverage of the state space (the number of visited states).

3.1.1. Learning

For learning, the Neural Module uses the standard algorithm called Q-Learning (more exactly Q-Lambda, see below). It is named after its Q matrix, which maps state-action pairs to utility values. For each environment state, the Q(s, a) matrix stores utility values of all possible actions:

\[ Q : A \times S \to \mathbb{R}, \tag{1} \]

where A is the set of all available actions and S is the set of all possible states of the environment. The utility value represents the discounted future reinforcement that will be received by the agent if it follows a given action a in a given state s. Online learning amounts to obtaining new Q(s, a) values for the matrix. A change of a value in the matrix is driven by the temporal-difference (TD) error:

\[ \delta = r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t). \tag{2} \]

The algorithm stores the current state and the action that was just executed, (s_t, a_t). Action a_t may cause receiving the reward r_{t+1} and a transition into the new state s_{t+1} (note that this theoretically requires an environment with the Markov property). Based on this information and the optimal action in the new state, a^*_{t+1} = \arg\max_a Q(s_{t+1}, a), the value Q(s_t, a_t) is updated as follows:

\[ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\delta, \tag{3} \]

with the following parameters: γ ∈ ⟨0, 1) is a discount factor and α ∈ (0, 1⟩ is a learning rate.

According to equation (3), the Q-Learning algorithm updates only one value at a time. The learning speed can be enhanced by a modification called the Eligibility Trace, which enables the algorithm to update the values of multiple past state-action pairs in one step. Such a modification of Q-Learning is called Q-Lambda, or the Q(λ) algorithm. By introducing the eligibility trace e(s, a), which is fundamental to eligibility-trace-based approaches, the update can be rewritten for all state-action pairs as:

\[ Q(s, a) \leftarrow Q(s, a) + \alpha\delta\, e_t(s, a), \tag{4} \]

where the eligibility is defined for each state-action pair as follows:

\[ e_t(s, a) = \begin{cases} \gamma\lambda e_{t-1}(s, a) & \text{if } (s, a) \neq (s_t, a_t), \\ \gamma\lambda e_{t-1}(s, a) + 1 & \text{if } (s, a) = (s_t, a_t). \end{cases} \tag{5} \]

The equation states that for every state-action pair there is an eligibility value that decays in time; whenever the state-action pair is used, its eligibility is increased by 1.
Such a modified equation (4) means that all state-action pairs are updated in each step. The decay parameter λ ∈ ⟨0, 1⟩ defines the magnitude of the update of the previous states. In the case of λ = 0, pure one-step Temporal Difference (TD) learning is obtained; in the case of λ = 1, Monte-Carlo learning is obtained. A suitable value of λ can improve the speed of learning, but an unsuitable value can cause oscillations in learning.

In the implementation of this Neural Module, a modification of the Q(λ) algorithm is used. Here, the eligibility trace is constrained to a finite length of N previous steps, which saves computational resources and limits the destabilization of learning convergence caused by an unsuitable value of the λ parameter.

3.1.2. Action Selection Method in non-episodic experiments

A typical use of the Q(λ) algorithm is in episodic experiments, where the initial state of the environment is selected randomly. This helps towards uniform exploration of (and learning in) the entire state-space. In order to add domain-independence, the designed module has to be able to operate in non-episodic experiments. In non-episodic experiments (particularly simpler ones with a single attractor), it is necessary to achieve a balance between knowledge exploitation and exploration/learning.

A typical Action Selection Method (ASM) for efficient knowledge exploitation is the Greedy strategy, where the action with the highest utility is selected:

\[ a_{t+1} = a^*_{t+1} = \arg\max_a Q(s_{t+1}, a). \tag{6} \]

This strategy may get stuck in a local optimum, so the ε-Greedy ASM is often used. In this ASM, the parameter ε defines the amount of randomization. A random action is selected with probability ε, while the Greedy strategy is followed with probability

\[ P(a^*_{t+1}) = 1 - \varepsilon. \tag{7} \]

The parameter ε therefore directly balances between exploitation of knowledge and exploration of the state space around the nearest attractor.

In order to provide an efficient learning ability in non-episodic experiments, the authors introduce an input to the Neural Module called "importance", which generally defines the current need for the "services" provided by the module. For the Q(λ) module, the importance input represents the motivation for the behavior represented by this node, see Fig. 2. The amount of randomization in the ASM should be inversely proportional to the importance input. Here, the ε-Greedy ASM is used, but the randomization is defined as:

\[ \varepsilon = 1 - \mathrm{Importance}. \tag{8} \]

By increasing the importance of the Q(λ) module (increasing the motivation for executing the behavior represented by this node), the probability of taking the greedy action a^* increases. This means that the importance enables the agent to learn by exploration in its free time and to exploit the information when needed.
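The learning rule (Eqs. 2–5) together with the importance-driven ε-greedy ASM (Eqs. 6–8) can be summarized in a short sketch. This is a simplified reading of the module, not its actual implementation: the eligibility trace is kept only for the last N visited state-action pairs, repeated visits are treated in a replacing-trace manner, and ε = 1 − Importance as described above.

```python
import random
from collections import deque, defaultdict

class QLambdaSketch:
    """Simplified sketch of the Q(lambda) module with a bounded eligibility trace
    and an importance-driven epsilon-greedy ASM (epsilon = 1 - importance)."""

    def __init__(self, actions, alpha=0.5, gamma=0.9, lam=0.6, trace_len=20):
        self.q = defaultdict(float)           # Q(s, a), default 0
        self.actions = actions
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.trace = deque(maxlen=trace_len)  # last N visited (s, a) pairs

    def select_action(self, state, importance):
        epsilon = 1.0 - importance            # Eq. (8): more need -> less exploration
        if random.random() < epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])  # greedy, Eq. (6)

    def learn(self, s, a, reward, s_next):
        best_next = max(self.q[(s_next, b)] for b in self.actions)
        delta = reward + self.gamma * best_next - self.q[(s, a)]    # TD error, Eq. (2)
        self.trace.append((s, a))
        # Update the bounded history with decaying eligibility (Eqs. 4-5 simplified):
        eligibility = 1.0
        for (ps, pa) in reversed(self.trace):
            self.q[(ps, pa)] += self.alpha * delta * eligibility
            eligibility *= self.gamma * self.lam
```

In the grid-world experiment described below, `actions` would be the four movement directions and `state` the discretized (X, Y) position of the agent.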
3.1.3. Prosperity of the Reinforcement Learning Module

This subjective online heuristic estimates how efficiently the module is used in the current topology during the current simulation (task). Since the Q(λ) module has to follow two antagonistic objectives (exploitation vs. exploration), it can be difficult to represent the efficiency of its use by a single value. The heuristic currently being used is given by the following equation:

\[ P_t = \frac{\mathrm{Cover}_t + \mathrm{MCR}_t}{2}, \tag{9} \]

where the Mean Cumulative Reward (MCR) is defined as the mean reward r received during the simulation up to time step t. This represents the efficiency of knowledge exploitation:

\[ \mathrm{MCR}_t = \frac{1}{t}\sum_{i=0}^{t} r_i. \tag{10} \]

The value of Cover_t represents how many states of the entire state-space have been visited so far, i.e., the fraction of visited states:

\[ \mathrm{Cover}_t = \frac{|\mathrm{Visited}_t|}{|S|}. \tag{11} \]

This value represents exploration efficiency.

3.2. Motivation Source Module

In order to represent an agent's needs, its physiology can be modeled [12]. It has been shown that decomposing the task into subtasks can help RL to learn more efficiently, which is beneficial especially in more complex tasks or in tasks that are difficult for RL to learn [13]. The agent's needs can represent the motivation to execute/learn the subtasks of such a more complicated policy. The second Neural Module, which serves as a motivation source and holds one physiological variable, is used here. The value of this variable decays with predefined dynamics in time. The value of 0 represents the purgatory area and the value of 1 represents the limbo area. The amount of motivation is inversely related to the physiological variable: in the limbo area, no motivation is produced, while the purgatory area represents the maximum motivation/need.

Figure 3. The Neural Module implementing the physiological state space. This node produces/represents motivation for the behavior which leads to the correct reinforcement. The value of its physiological variable decreases in each time step with given dynamics, and thus increases the motivation. Once the correct reinforcement is received, the value of the variable goes back towards the limbo area, where no motivation is produced. The value of the Prosperity output is defined as P_t = 1 − SF_t (the Mean State Distance to the limbo area).

The simple dynamics of the physiological variable are defined as follows:

\[ V_{t+1} = V_t - \mathit{decay}. \tag{12} \]

The amount of motivation that is produced is determined by applying a sigmoid to the inverse value of the physiological variable V. The resulting amount of motivation M at time t is:

\[ M_t = \frac{1}{1 + e^{\,min + (max - min)(1 - V_t)}}, \tag{13} \]

where the min and max parameters are chosen so that the value V_t = 0 roughly corresponds to the motivation M_t = 1. If the reward is received, the value of V_{t+1} is set to one, and the motivation therefore decreases towards 0, which switches the agent back towards exploration.
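The dynamics of Eqs. (12)–(13) can be written out as a short sketch. The decay value follows the experiments below; the sigmoid constants are assumptions of this sketch, chosen only so that V = 0 maps roughly to M = 1 and V = 1 (limbo) maps roughly to M = 0 under Eq. (13) as written.

```python
import math

class MotivationSourceSketch:
    """Simplified sketch of the motivation source (Eqs. 12-13).

    The physiological variable V decays each step; motivation M rises as V
    falls and V is reset when the reward is received. The constants below are
    illustrative assumptions only."""

    def __init__(self, decay=0.01, s_min=6.0, s_max=-6.0):
        self.v = 1.0                     # start in the limbo area
        self.decay = decay
        self.s_min, self.s_max = s_min, s_max

    def step(self, reward_received):
        if reward_received:
            self.v = 1.0                 # reward satisfies the need (Section 3.2)
        else:
            self.v = max(0.0, self.v - self.decay)       # Eq. (12)
        exponent = self.s_min + (self.s_max - self.s_min) * (1.0 - self.v)
        return 1.0 / (1.0 + math.exp(exponent))          # Eq. (13): motivation M_t
```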
3.2.1. Prosperity of the Motivation Source Module

If the agent behaves efficiently enough, the mean motivation produced by this module remains low. The mean motivation value can be expressed by the Mean State Distance to optimal conditions (SF), which is defined as follows:

\[ \mathrm{SF}_t = \frac{1}{t}\sum_{i=0}^{t} d_i, \tag{14} \]

where d_i is the distance of the state variable V_i from the optimal condition V = 1. SF_t is computed online at each simulation step. Since the Prosperity is inversely proportional to it, its value is computed as:

\[ P_t = 1 - \mathrm{SF}_t. \tag{15} \]

Figure 5. Observing the motivation-driven RL behavior of the architecture: exploration vs. knowledge exploitation is dynamically balanced, according to the agent's current needs. The X-axis represents the simulation steps; the Y-axis is the value of the motivation/reward. Peaks in the lower graph represent the binary event of receiving the reward. These events correlate with the amount of motivation that is currently produced (upper graph). The speed with which the motivation is satisfied depends on the quality of the agent's knowledge (Fig. 6b and Fig. 6c) and on the agent's current distance to the reward source.

4. Experiments

First, experiments which validate the expected functionality of the particular Neural Modules are performed. Then their correct interaction is evaluated on the architecture, which is hand-wired for a given experiment. Two types of Evolutionary Algorithms (EAs) were used: the standard generational model of the Genetic Algorithm (GA) and its modification for vectors of real-valued numbers, called the Real-valued GA (RGA). Both the GA and the RGA were tested and compared on designing new architectures for a given task.

In the experiments, the agents are allowed to move in a 2D discrete world 15 × 15 positions in size. The map contains two obstacles and one source of motivation (see Fig. 6b). The agent has 4 actions — moving in four directions — and if the agent steps on a tile with the reward, a positive reinforcement is received. Therefore, the Q-Lambda Module is configured to have four outputs (four actions) and two inputs (the X, Y coordinates on the map), each sampled into 15 discrete values (see Fig. 4). The value of the physiological variable is configured to decrease by decay = 0.01 each step.

Figure 4. Scheme of mapping the genotype (a vector of binary/real values) to the phenotype (a working agent architecture). The Physiological Module is wired to the reward source in the map, and this determines the main goal of the agent (by the module's Prosperity value). All configuration inputs are unconnected, so the default parameters are used. Outputs of the Q-Lambda Module are directly wired to the agent's actuators. The genotype of the hand-designed architecture is depicted at the bottom and its connections are highlighted in the scheme. Variables representing the state of the environment (X, Y coordinates) are connected to the data inputs of the Q-Lambda node. The reinforcement is connected to the Physiological Module, which produces motivation for the ASM and a reward for learning in the Q-Lambda Module.

The simulated environment is also implemented as a ROS node and is used in the NengoROS simulator as another Neural Module. The simulation is non-episodic: the agent is placed in the environment and is allowed to interact with the world for a predefined number of simulation steps.

4.1. Description of Hardwired Agent Architecture

The modules are connected so that the Physiological Module receives a reward from the environment and produces motivation for the Q-Lambda Module. States of the environment are connected to the data inputs of the Q-Lambda Module. Actions produced by the Q-Lambda Module are then directly applied as actions of the agent in the environment. The scheme of the hand-wired architecture, represented as a hybrid artificial neural network, is depicted in Fig. 4. Together, this network implements motivation-driven reinforcement learning used in a non-episodic simulation.

4.1.1. Simulation of Hardwired Architecture

During the simulation, the Prosperity values of both the Q-Lambda Module and the Physiological Module are observed. These should correlate with the successfulness of the agent's behavior.
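Since the following analysis refers to these Prosperity signals repeatedly, a short sketch of how they can be accumulated online (Eqs. 9–11 and 14–15) is given below. It is a direct transcription of the formulas, not the simulator's code.

```python
class ProsperityTracker:
    """Online accumulation of the Prosperity heuristics (Eqs. 9-11, 14-15)."""

    def __init__(self, n_states):
        self.n_states = n_states        # |S|, e.g. 15 * 15 = 225 for the grid world
        self.visited = set()
        self.reward_sum = 0.0
        self.distance_sum = 0.0         # sum of d_i = 1 - V_i
        self.t = 0

    def update(self, state, reward, physiological_v):
        self.t += 1
        self.visited.add(state)
        self.reward_sum += reward
        self.distance_sum += 1.0 - physiological_v

    def rl_prosperity(self):            # Eq. (9): average of coverage and MCR
        mcr = self.reward_sum / self.t                  # Eq. (10)
        cover = len(self.visited) / self.n_states       # Eq. (11)
        return (cover + mcr) / 2.0

    def physiology_prosperity(self):    # Eq. (15): P_t = 1 - SF_t
        return 1.0 - self.distance_sum / self.t         # Eq. (14)
```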
Note that the Q-Lambda Prosperity is composed of the MCR_t and Cover_t values (how successful the learning is and how efficient the exploitation is), and the Prosperity of the Physiological Module is defined as 1 − SF_t (defining how satisfied the agent is).

Figure 6a shows the course of these values in time. It can be seen that at about step 20000, the Prosperity of the Physiological Module and the agent's average reward/step converge. Since the behavior is learned reliably at about step 20000, the agent is able to exploit the knowledge efficiently (it fulfills the requirements for a reward faster) and therefore has more time to explore. This causes further discovery of new states between steps 20000 and 40000. Figure 5 shows relatively stable behavior of the motivation-driven RL system during this part of the simulation. Here, the agent balances between exploration and exploitation of knowledge based on the current motivation value. The longer the delay between satisfying the motivation, the further the agent was from the source of a reward (potentially an unexplored area).

Figure 6. Simulation of the hand-designed agent architecture. Each position in b) and c) represents one position in the map. There are two obstacles and one reward source in the environment. b) and c) show the agent's learned knowledge. The Greedy ASM selects the action with the highest utility in each state. These actions are shown in b) and their utility is shown in c). Note that the utility values on the Z-axis are rescaled for better visibility. (a) Convergence of the Prosperity values of both Neural Modules in time. Since the behavior is learned reliably around step 20000, the agent is satisfied and systematically explores further new states. (b) Visualization of the knowledge learned by the agent during the first 10000 simulation steps. Each position in the table is filled with a representation of the action with the highest utility value. (c) Utility value of the best action (see b) based on the agent's current position in the map. The nearer to the reward source, the higher the expected outcome of the best action.

The knowledge that has been learned by the agent can be visualized in two ways. Figure 6b shows a graphical representation of the best action (a^*) based on the agent's current position in the map (state s of the environment). Convergence to the optimal strategy can be observed, especially near the source of the reward. Figure 6c depicts the actual utility values in the Q(s, a) matrix for the best action a^* in each state. The higher the surface, the higher the expected reward. Zero values represent unexplored states, such as obstacles. Note that the behavior of this architecture can be seen in the attached movie.

4.2. Principle of the evolutionary design of new architectures

Two types of simple Genetic Algorithm (GA) were used here for the neuro-evolutionary design (optimizing the connection weights between modules) of new architectures.
As was mentioned above, the agent architecture is represented as an oriented graph, where the nodes are Neural Modules and the edges are the connection weights between their inputs/outputs. Similarly to the neuro-evolutionary design of ANNs [14], the topology of the agent architecture is optimized by modifying the connection weights between nodes.

Each individual consists of a genome and a fitness value. A genome is a vector of binary/real values representing the connection weights between modules. The fitness value represents the quality of a given architecture (its performance on a given task) and is determined by means of Prosperity values (see below). The generational model of the GA/RGA (Real-valued GA) is used here.

In this particular case, the mapping of the genotype (genome) to the phenotype (architecture) is depicted in Fig. 4. The representation of the architecture is inspired by feed-forward ANN topologies, where the inputs/outputs between particular layers are fully connected. The weights of these connections are optimized by the GA/RGA.

The previous experiment suggested that during a simulation of 20000 steps the architecture should be able to successfully learn the desired behavior. The evaluation of an individual is therefore determined by means of the Prosperity value of the Physiological Neural Module, and is obtained after simulating the architecture for 20000 steps.

The parameters of the GA and RGA were empirically set to the following values: the population size is PopSize = 50 and the number of generations is MaxGens = 80. Each gene is mutated with a probability of pMut = 0.05. One-point crossover is applied to two individuals with a probability of pCross = 0.8. The GA mutation is defined as flipping the value of a given gene. For the RGA, the mutation is implemented as sampling a Gaussian with standard deviation σ = 1 and mean value µ = gene_i, where gene_i ∈ R is the value of the mutated gene (the interval constraints are applied after the mutation).

4.3. Evolutionary design of new architectures

Since the Prosperity of the Physiological Module is determined by the agent's ability to learn new knowledge and use it, the suitability of the wiring of both modules is reflected in the Prosperity of the Physiological Module. We therefore define the Simple Fitness (SF), which is equal to the Prosperity of the Physiological Module:

\[ P_{\mathrm{arch}} = \mathrm{SF}_{\mathrm{arch}} = P_{\mathrm{phys}} = 1 - \mathrm{SF}_t, \tag{16} \]

where P_phys is the Prosperity of the Physiological Module. This means that the Single-Objective GA (SOGA) and the Single-Objective RGA (SORGA) will maximize only the Prosperity of the Physiological Module.

Figure 7. Comparing the GA and the RGA with the Single-Objective Fitness (SF). The fitness is defined as the Prosperity of the Physiology Module (the courses of evolution for the composed fitness CF look similar). The GA finds a solution of similar quality faster than the RGA in both cases. Note that each result is averaged over 10 runs of the algorithm and fitted by a polynomial for better readability.
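A minimal sketch of the generational single-objective GA described above (PopSize = 50, MaxGens = 80, pMut = 0.05, pCross = 0.8) is given below. The binary (SOGA) variant is shown; the RGA differs only in using real-valued genes and the Gaussian mutation. The truncation selection and the `evaluate` function (assumed to build the architecture, simulate it for 20000 steps and return the Prosperity of the Physiological Module) are assumptions of this sketch.

```python
import random

POP_SIZE, MAX_GENS, P_MUT, P_CROSS = 50, 80, 0.05, 0.8

def evolve(genome_len, evaluate):
    """Generational single-objective GA over binary genomes (connection weights).

    `evaluate(genome)` is assumed to return the Prosperity of the Physiological
    Module after a 20000-step simulation of the corresponding architecture."""
    pop = [[random.randint(0, 1) for _ in range(genome_len)] for _ in range(POP_SIZE)]
    for _ in range(MAX_GENS):
        scored = sorted(((evaluate(g), g) for g in pop), reverse=True)
        parents = [g for _, g in scored[:POP_SIZE // 2]]   # simple truncation selection
        children = []
        while len(children) < POP_SIZE:
            a, b = random.sample(parents, 2)
            if random.random() < P_CROSS:                  # one-point crossover
                cut = random.randrange(1, genome_len)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            # bit-flip mutation with probability P_MUT per gene
            children.append([1 - g if random.random() < P_MUT else g for g in a])
        pop = children
    return max(pop, key=evaluate)
```

For the architectures in Fig. 4 and Table 1, the genome has 14 connection weights (A–N), so `genome_len = 14`.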
Here, the GA and the RGA were used to design new architectures for the described environment. The graph in Fig. 7 compares the convergence of the two types of evolutionary design of agent architectures. Again, it can be said that SOGA is able to find a similarly fit solution faster than SORGA. The following sections describe some typical automatically found architectures and their properties.

4.3.1. Analyzing new architectures found by SOGA

Table 1 shows several architectures designed automatically by means of SORGA and SOGA, and compares them to the manually designed architecture. SOGA found functional architectures relatively fast. Both selected typical individuals (marked as Ind1 and Ind2 in the table) have the following properties:
• The reward input of the Q-Lambda Module is wired correctly, so that the RL part works as expected.
• Both environment state variables, and also the motivation output of the physiology, are connected to the motivation input of the Q-Lambda Module. This means that motivation-driven RL is also used. The motivation is directly proportional to the motivation produced by the physiology and to the agent's position/distance from the reward source.
• The binary reward output of the physiology is also connected to the motivation source, but this has no significant influence on the agent's behavior.

Table 1. Comparison of agent architectures' genomes: the hand-designed architecture, two SOGA-designed and two SORGA-designed architectures. Parameters A–N are the connection weights from Fig. 4; SF is the resulting fitness.

Architecture    A  B     C  D     E  F  G  H  I  J  K     L  M    N     SF
Hand-designed   1  0     0  1     0  0  0  1  0  0  1     0  0    0     0.625
SOGA – Ind1     1  0     0  1     0  0  0  1  0  0  1     1  1    1     0.699
SOGA – Ind2     1  1     0  1     0  0  0  1  0  1  1     1  1    1     0.697
SORGA – Ind1    0  1     1  0.71  0  0  0  1  1  0  0.89  0  1    1     0.745
SORGA – Ind2    0  0.51  1  0     0  0  0  1  0  0  0.76  0  0.3  0.44  0.723

Also note that Ind1 has the state variables correctly wired to the agent's position, while Ind2 has one dimension "diagonalized" (both the X coordinate and the Y coordinate are connected to the S1 variable). This means that only one half of the Q(s, a) matrix was used here, but the architecture still performed relatively well. Figure 8 shows the behavior of a typical best architecture found by SOGA. The architecture performs similarly to the hand-designed architecture (see Fig. 6). Compared to the hand-designed architecture, this architecture has a stronger motivation to stay near the reward source, so that the overall Prosperity of the Physiological Module is higher, while the number of explored states is lower.

4.3.2. Analyzing new architectures found by SORGA

Finding new architectures by means of SORGA took more generations than with SOGA. However, SORGA has wider options for connecting the modules. For example, it can be seen from Table 1 that Ind2 used only half of the Q(s, a) memory. Ind1 uses swapped coordinates with one dimension slightly diagonalized. In both architectures, motivation-driven RL is used. Again, the amount of motivation originates from the physiology and from the agent's position in the map (the agent is therefore afraid of going further away from the reward source). Figure 9 shows the course of learning of Ind1 and the content of its memory. The knowledge is diagonalized (see Fig. 9b), but the learned data corresponds to the reward source. The position of the obstacle is also visible.
The separated peak is caused by the fact that the reward output of the Physiological Module is connected to the S1 input. This means that while receiving the reward, the perceived X position "jumps" to the maximum value.

Figure 8. Analyzing the typical architecture found by SOGA (marked as Ind1). It can be seen that the higher motivation (combined from two sources) causes the agent to stay nearer the reward source (once found) than in the hand-designed architecture. Compared to this, the SORGA approach is able to weight the amount of importance produced by the particular sources. (a) Course of the life of the typical best agent found by SOGA – Ind1. The behavior is similar to the hand-designed architecture (see Fig. 6a). (b) Knowledge of the selected agent found by SOGA – Ind1. The representation is identical to the hand-designed architecture. Not all environment states are explored here.

Figure 9. Behavior and knowledge learned in the SORGA-designed architecture. Although the convergence of learning is slower, the overall results of the behavior are similar to the SOGA-designed architecture (Fig. 8). (a) Course of the life of the typical best agent found by SORGA – Ind1. Compared to the SOGA-designed architecture (Fig. 8a), this agent learns more slowly (slower convergence of the Prosperity of the Physiological Module). This is caused by a sub-optimal representation of the knowledge in the Q-Lambda Module. (b) Knowledge of the selected agent found by SORGA – Ind1. The Q-Lambda Module has swapped axes and the value of the S2 variable is computed as S2 = X + 0.71Y. This causes slower convergence of learning. The "peak" in the graph is caused by connecting the reward output to the S1 input.

4.3.3. Discussion

It has been shown that the agent's ability to balance between exploration and knowledge exploitation can be represented solely by means of the Prosperity of the Physiological Module. Table 1 shows that all architectures found by SOGA and SORGA have notably higher fitness values than the hand-designed architecture. Furthermore, the results of SORGA (particularly Ind1) suggest that it is more efficient to use two components as a source of motivation: time (the agent's physiology) and space (the agent's position). This is a simple example of how evolution is able to correctly identify possibly hidden attributes of the task and employ them while designing architectures suitable for the task. Solutions provided by the evolution typically use a less efficient representation of the knowledge in the SRP's memory, but are able to use Neural Modules in an unexpected manner.
One of the goals of our research is to test how known algorithms can be employed in new, unexpected ways.

5. Conclusion

A novel approach for designing hybrid agent architectures, inspired by neuro-evolution, has been presented. This approach uses the newly presented framework of Hybrid Artificial Neural Network Systems (HANNS). It searches for new topologies of these hybrid networks by modifying the connection weights between particular Neural Modules. This method enables the semi-automatic design of agent architectures that are specialized for a given task. It has been shown that this approach is able to do the following:
• Determine whether and how the module will be used, by connecting its outputs.
• Define the form of knowledge representation in the module, by connecting its inputs.
It has been shown that, during the design of new architectures, the approach is able to pick out the important aspects of the task and make efficient use of this knowledge when designing the architecture. This method of automatic design is able to find design solutions that are as good as, or even better than, those designed by hand. This can be particularly useful in designing autonomous systems, where the task is too complicated to be fully understood by the human designer.

In this particular case, evolutionary design was able to choose the inner representation of a problem inside a Neural Module (how to represent the X, Y coordinates), to discover inherent properties of the task (that the reward source is near the coordinates (0, 0)) and to define the solution by wiring the connections between Neural Modules (e.g., that the importance of knowledge exploitation is directly dependent on the agent's position). The RGA discovered that it is more robust to use two independent sources of motivation: one based on the agent's position (environment state) and the other based on the agent's physiology.

Last but not least, the particular nodes in the HANNS framework are implemented in a way that can be reused in a variety of different (e.g., more complex) architectures and/or in a variety of modular systems in the future. For example, multiple (differently configured) RL modules can either compete (for learning/action selection of antagonistic goals) or cooperate to learn/execute one (hierarchically decomposable) behavior.

Acknowledgements

This research has been funded by the Dept. of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague, under SGS Project SGS14/144/OHK3/2T/13.

References

[1] M. Quigley, K. Conley, B. Gerkey, et al. ROS: an open-source Robot Operating System. In ICRA Workshop on Open Source Software, 2009.
[2] F. Jiang, H. Berry, M. Schoenauer. The impact of network topology on self-organizing maps. In Proceedings of the First ACM/SIGEVO Summit on Genetic and Evolutionary Computation, GEC '09, pp. 247–254. ACM, New York, NY, USA, 2009. doi:10.1145/1543834.1543869.
[3] J. Bullinaria. Generational versus steady-state evolution for optimizing neural network learning. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2004.
[4] G. Auda, M. Kamel. Modular neural networks: a survey. Int J Neural Syst 9(2):129–151, 1999.
[5] C. Eliasmith, C. H. Anderson. Neural Engineering: Computation, Representation, and Dynamics in Neurobiological Systems. The MIT Press, Cambridge, ISBN 0-262-05071-4, 2003.
[6] E. Crawford, M. Gingerich, C. Eliasmith. Biologically plausible, human-scale knowledge representation. In 35th Annual Conference of the Cognitive Science Society, pp. 412–417, 2013.
[7] K. McGarry, S. Wermter, J. MacIntyre. Hybrid neural systems: from simple coupling to fully integrated neural networks. Neural Computing Surveys 2:62–93, 1999.
[8] Dept. of Cybernetics, FEE, CTU in Prague. NengoROS. http://nengoros.wordpress.com/ [2014-10-01].
[9] Centre for Theoretical Neuroscience, U Waterloo. Nengo. http://nengo.ca/ [2014-10-01].
[10] M. Wiering, H. van Hasselt. Ensemble algorithms in reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 38(4):930–936, 2008. doi:10.1109/TSMCB.2008.920231.
[11] J. Jiang. A Framework for Aggregation of Multiple Reinforcement Learning Algorithms. Ph.D. thesis, Waterloo, Ont., Canada, 2007. AAINR34511.
[12] D. Kadlecek, P. Nahodil. Adopting animal concepts in hierarchical reinforcement learning and control of intelligent agents. In Proc. 2nd IEEE RAS & EMBS Int. Conf. on Biomedical Robotics and Biomechatronics (BioRob 2008), pp. 924–929, 2008. doi:10.1109/BIOROB.2008.4762882.
[13] T. Watanabe, T. Sawa. Instruction for reinforcement learning agent based on sub-rewards and forgetting. In 2010 IEEE International Conference on Fuzzy Systems (FUZZ), pp. 1–7, 2010. doi:10.1109/FUZZY.2010.5584788.
[14] J. Fekiac, I. Zelinka, J. C. Burguillo. A review of methods for encoding neural network topologies in evolutionary computation. In Proceedings of the 25th European Conference on Modelling and Simulation (ECMS), pp. 410–416, 2011. ISBN 978-0-9564944-2-9.