INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL
ISSN 1841-9836, 12(3), 330-346, June 2017.

Walking Motion Generation and Neuro-Fuzzy Control with Push Recovery for Humanoid Robot

P.E. Mendez-Monroy

Paul Erick Mendez-Monroy
IIMAS - Universidad Nacional Autonoma de Mexico
Parque Científico y Tecnologico de Yucatan
Km. 5.5 Carretera Sierra Papacal - Chuburna
C.P. 97302 Sierra Papacal, Yucatan
erick.mendez@iimas.unam.mx

Abstract: Push recovery is an essential requirement for a humanoid robot that must safely perform tasks within a real dynamic environment. In such an environment the robot is exposed to external disturbances that in some cases are inevitable, so push recovery strategies are required to avoid falls and damage to humans and the environment. In this paper, a novel push recovery approach to counteract disturbances from any direction and in any walking phase is developed. It presents a pattern generator that can be modified according to the adopted push recovery strategy. The result is a humanoid robot that can maintain its balance in the presence of strong disturbances, taking their magnitude into account and determining the best push recovery strategy. Push recovery experiments with different disturbance directions have been performed using a 20 DOF Darwin-OP robot. The adaptability and low computational cost of the whole scheme allow its incorporation into an embedded system.

Keywords: push recovery, neuro-fuzzy systems, reinforcement learning, biped walking.

1 Introduction

With the recent growth of humanoid robots performing tasks within the human environment, research has focused on improving their movement and design for stable bipedal walking. Dynamic walking is a general technique with predefined trajectories and feedback control to ensure that the robot is always rotating around a point inside a base of support (BoS). To achieve dynamic walking the centre of gravity (COG) can be outside of the BoS of the robot, but the zero moment point (ZMP) cannot. The ZMP is the point where the total angular momentum at the ankle is zero, and the COG is the projection of the centre of mass (COM) on the ground, inside the BoS established by the feet. These concepts can be associated through an inverted pendulum model as the simplest model. However, this technique is susceptible to external disturbances and surface irregularities, so it requires strategies to maximise the balance and minimise the adverse effects of these disturbances.

The aim of push recovery is to perform an action that counteracts an external disturbance to prevent the robot from falling. The problem of push recovery has been researched using several approaches. Hyon et al. [5] use force-controlled actuators and a complete dynamic model to reject external disturbances. Park et al. [10] also use force-controlled actuators with optimisation algorithms for active balancing. Nevertheless, the requirements for multi-axis force sensors, an accurate model and intensive computing are their disadvantages. Other research considers biomechanical movements, using the knowledge of human behaviour in response to unexpected disturbances with simple movements [11], [17], [7]. Generally, these strategies are classified into three groups: ankle, hip and stepping.

Some researchers have presented their work in push recovery when a robot is walking in place. Pratt et al.
[11] introduced the concept of "capture point" to determine where the foot should step after being pushed in order to return to its original pose. Pratt et al. [12] extended the concept to multiple steps. The use of learning strategies to adapt the push recovery has increased in recent years. Rebula et al. [14] used capture point learning to improve simple-step push recovery, while Missura et al. [8] proposed online learning to define the amplitude of the leg depending on the robot posture and the error in the desired step.

Some researchers have integrated several methods to complement the simple strategies at this level of complexity. Stephens [17] combined hip and ankle strategies for balance control, defining a larger region of recoverable pushes, and established theoretical analytic bounds for both strategies. Hyon et al. [5] presented a multi-level posture balance method for humanoid robots and demonstrated push recovery with a biped SARCOS pushed from behind. Semwal et al. [15] used a hierarchical fuzzy controller to select among several push recovery strategies on a simulated robot, with good performance and low computational cost but without online learning, obtaining the same boundaries for each push recovery strategy as an analytic method.

However, few researchers have presented push recovery strategies in the walking phase, in order to continue the walking that was being performed before the push. Komura et al. [6] proposed a feedback controller for biped walking and applied a hip strategy as push recovery in a simulated robot. Wieber [19] used model predictive control to minimise the jerk and the ZMP error, which improved the robustness to disturbances during walking.

The objective of this work is to design a practical strategy against external disturbances, using a neuro-fuzzy system to define the best push recovery behaviour to counteract the disturbance. The novel push recovery approach can keep the robot walking, counteracting pushes from any direction and in any phase up to a certain magnitude of disturbance. This scheme has the advantages of using a simple model and a controller of low computational cost. Furthermore, the design is applicable to generic robots with position-controlled actuators, an inertial measurement unit and joint position sensors.

The remainder of the paper proceeds as follows. Section 2 reviews three biomechanical push recovery controllers and their implementation on position-controlled actuators. Section 3 explains the details of the neuro-fuzzy system: structure, design and state estimation. Section 4 describes the reinforcement learning process with a modified Levenberg-Marquardt algorithm. Section 5 explains the walking pattern. Section 6 shows the experimental results using the neuro-fuzzy system with reinforcement learning. The paper finishes with some concluding remarks and future work.

2 Biomechanical push recovery

Biomechanical studies of human walking have shown that humans perform three different movement patterns as recovery strategies (ankle, hip, step) to reject a sudden external disturbance. The ankle strategy modulates torque at the ankle joint to keep humans upright, the hip strategy converts unpredicted linear momentum into angular motion at the torso, and the stepping strategy moves the swing foot in the direction of the disturbance. Recent work has focused on using such biomechanical strategies [12], [17].
The following review of these biomechanical strategies assumes simplified physical models of the robot and explains how they can be implemented on real humanoid robots with position-controlled actuators.

The ankle strategy maintains the COG inside the BoS by applying torque to the ankle joints. For position-controlled actuators, the torque at the ankle can be modified either by controlling the ZMP or by modulating the target angle of the ankle servo. In this approach, the latter is used as the ankle push recovery strategy:

\delta_{zmp} = -K_a x_{a,swing}    (1)

where K_a is the ankle feedback gain and x_{a,swing} is the current position of the swinging ankle.

The hip strategy applies an angular acceleration to the torso to generate a reaction force that returns the COG to the BoS during the interval {T_0, T_1}, decelerates during {T_1, T_2} and slowly reverts to the initial position during {T_2, T_3}. This control, called "bang-bang", can be represented for position-controlled actuators as

\delta\theta_t = \begin{cases} K_h & 0 \le t < T_2 \\ K_h \frac{T_3 - t}{T_3 - T_2} & T_2 \le t < T_3 \end{cases}    (2)

where \delta\theta_t is the torso position bias, T_3 is the time to complete the control, and K_h is the hip feedback constant. For even larger disturbances, no amount of body torquing will result in push recovery; in this case, it is necessary to take a step.

The stepping strategy moves the BoS by taking a step. This strategy looks for the point on the ground where the robot can step to bring it back to its original speed or even to a full stop [17]. The new step (walking phase) or modified step (stance phase) is calculated using the relative target foot position x_step [1]:

x_{step} = K_s \dot\theta_x \sqrt{\frac{z_c}{g}}    (3)

The definition of a real analytic boundary to apply each strategy is difficult without exact knowledge of the current system state as well as the kinematic and mechanical restrictions. In addition, its design becomes impractical if some of the parameters, such as the weight or the COG, change. Instead, the problem is formulated through learning techniques and the parameters are obtained online using past experiences. Some learning strategies have been applied in other robots [3], [18], where the aim is to learn a policy that maps robot states to appropriate actions.
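As a summary of how the three feedback laws in eqs.(1)-(3) can be evaluated on a position-controlled robot, a minimal Python sketch is given below; the function and variable names are illustrative assumptions and do not come from the original implementation.

    import math

    def ankle_correction(k_a, x_ankle_swing):
        """Ankle strategy, eq.(1): bias applied to the ankle servo target angle."""
        return -k_a * x_ankle_swing

    def hip_correction(k_h, t, t2, t3):
        """Hip strategy, eq.(2): bang-bang torso position bias over one recovery cycle."""
        if 0.0 <= t < t2:
            return k_h
        if t2 <= t < t3:
            return k_h * (t3 - t) / (t3 - t2)
        return 0.0

    def step_correction(k_s, theta_dot_x, z_c, g=9.81):
        """Stepping strategy, eq.(3): relative target foot position for the new or modified step."""
        return k_s * theta_dot_x * math.sqrt(z_c / g)

The three gains k_a, k_h and k_s are exactly the outputs that the neuro-fuzzy controller of the next section learns to produce.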
3 Neuro-fuzzy system for push recovery

Push recovery is one of the most complex tasks for humanoid robots and it requires an adaptive intelligent controller. The push recovery method is achieved using simulation, manipulation of the real system and expert knowledge about intuitive biped push recovery. Thus, a neuro-fuzzy system is proposed to select the appropriate control parameters for the three strategies mentioned above. The selection uses the current walking state and the torso position. The fuzzy controller is designed using system constraints (e.g. maximum torque of the actuators, maximum angles of the joints, maximum step size, etc.) and expert knowledge, and it is updated through new reinforcement training with a low computational cost.

Figure 1 shows the architecture of the neuro-fuzzy system with "actor-critic" reinforcement learning. The strategy is divided into the sagittal and the coronal planes, but only the sagittal plane will be described, as the design for the coronal plane is similar. The controller determines a proper push recovery action in the presence of an external disturbance using the current states {θt, θ̇t} and the internal reinforcement signal r̂ from the predictor. The scheme separates the push recovery system from the motion generator to simplify the design.

Figure 1: The neuro-fuzzy reinforcement structure

Thus, the motion generator performs its normal movements when there are no external disturbances or unbalance. As soon as an external disturbance or unbalance occurs, the neuro-fuzzy controller establishes the best learnt strategy to counteract its effect. The neuro-fuzzy controller uses the trunk position θt and its angular velocity θ̇t to define the feedback gains for the ankle, hip and stepping strategies. It is initialised offline using simulation, expert knowledge and the robot constraints (limits of the actuator torques, maximum step size, etc.) and it learns using reinforcement learning with low computational cost. The neuro-fuzzy predictor has the same structure as the controller but with the internal reinforcement signal as its output. Its objective is to evaluate the performance of the whole neuro-fuzzy system using the external reinforcement signal (reward) to update the parameters of the controller and the predictor. Learning is performed by a modified Levenberg-Marquardt algorithm. In the state estimation, the trunk position and angular velocity in roll and pitch are estimated from the accelerometer and gyroscope measurements and mixed with the measured joint angles (q), which are used in the forward kinematic equations. A walking pattern generator is presented to modify the walking trajectory according to the adopted push recovery strategy.

3.1 Neuro-fuzzy controller

The neuro-fuzzy controller ("actor") can determine a better action for the next time instant based on the current system state and the internal reinforcement signal. The internal reinforcement signal from the predictor enables both the controller and the predictor to learn without waiting for external reinforcement, which may only become available long after a sequence of actions has occurred. This scheme can speed up the learning of the complete system.

The fuzzy structure comprises five principal components: inputs, antecedent labels, rules, normalisation and consequent labels. Moreover, at the computational level, a fuzzy system can be seen as a feedforward neural network with five layers. Thus, the fuzzy structure is defined in five layers with basic node functions suitable for online training. Each layer has an input vector {u^k_{1...p}}, a parameter vector {w^k_{1...p}} and an output vector {o^k_{1...q}} defined through an activation function:

u^k_i = f\left(u^k_1, u^k_2, \cdots, u^k_p, w^k_1, w^k_2, \cdots, w^k_p\right)    (4)

o^k_j = a_j(u^k_i), where a(\cdot) is the activation function    (5)

where k = {1, ..., 5} denotes the kth layer, with p inputs and parameters for the jth output of the layer. The following paragraphs describe the functions of the nodes in each of the five layers of the proposed neuro-fuzzy controller, with inputs X = {x_1, ..., x_in}, outputs Y = {y_1, ..., y_out} and r fuzzy rules of the form:

R_j: if x_1 is \mu_{1,j} and ... and x_{in} is \mu_{in,j} then y_1 is \nu_{1,j} and ... and y_{out} is \nu_{out,j}    (6)

where \mu_{ij} and \nu_{lj} are fuzzy sets, with indexes i = {1, ..., in} for the inputs, l = {1, ..., out} for the outputs and j = {1, ..., r} for the fuzzy rules.

Layer 1 (Inputs): the nodes in this layer transmit the inputs directly to the next layer (the parameter vector has unity values, w^1_i = 1 ∀i), eq.(7).

Layer 2 (Antecedents): each node performs a membership function; the Gaussian membership function is used in this work, eq.(8).
where m^2_{ij} and \sigma^2_{ij} correspond to the centre and the width of the Gaussian function of the jth rule for the ith input variable.

Layer 3 (Rules): implements the conjunction of all or some antecedent conditions in the jth rule. Hence, the rule nodes perform the fuzzy "and" operation; in this case the product is used, which in fuzzy theory is called the firing strength, eq.(9).

Layer 4 (Normalisation): the number of nodes in this layer is equal to that of the rules layer. The effect of the layer is to perform a normalisation of each rule; the result is called the normalised firing strength of the jth rule, eq.(10).

Layer 5 (Consequents): this layer has as many nodes as there are output variables. Each output node combines the normalised firing strength with the fuzzy action, and the defuzzification is computed with the centre-of-area method, eq.(11) [4].

f^1_i = u^1_i and a^1_i = f^1_i,  i = 1, 2, ..., in    (7)

f^2_{ij} = \mu_{ij}\left(a^1_i; m^2_{ij}, \sigma^2_{ij}\right) = -\frac{\left(a^1_i - m^2_{ij}\right)^2}{\left(\sigma^2_{ij}\right)^2} and a^2_{ij} = \exp(f^2_{ij})    (8)

f^3_j = \prod_{i=1}^{2} a^2_{ij} and a^3_j = f^3_j,  w^3_{ij} = 1 ∀i    (9)

f^4_j = \frac{a^3_j}{\sum_{j=1}^{r} a^3_j} and a^4_j = f^4_j, where \sum_{j=1}^{r} a^4_j = 1    (10)

f^5_l = \frac{\sum_{j=1}^{r}\left(m^5_{lj}\sigma^5_{lj}\right)a^4_j}{\sum_{j=1}^{r}\sigma^5_{lj}a^4_j} and a^5_l = f^5_l    (11)

The complete fuzzy system F with fuzzy rules (6) is:

F = \{F_l, l = 1, ..., out\},  F_l = \frac{\sum_{j=1}^{r}\bar\nu_{lj}\prod_{i=1}^{in}\mu_{ij}(x_i)}{\sum_{j=1}^{r}\prod_{i=1}^{in}\mu_{ij}(x_i)},  \bar\nu_{lj} = \frac{\sum_{j=1}^{r}\left(m^5_{lj}\sigma^5_{lj}\right)a^4_j}{\sum_{j=1}^{r}\sigma^5_{lj}a^4_j}    (12)

where \bar\nu_{lj} is the centre of the lth output fuzzy set in the jth fuzzy rule. Thus, F is the fuzzy control action with parameter vector p_c = \{m^2_{ij}, \sigma^2_{ij}, m^5_{lj}, \sigma^5_{lj}\}, which collects all the weights of the controller. This parameter vector is updated by minimising the internal reinforcement signal r̂, such that the system ends up with a good recovery strategy and avoids failure. The learning rule used to adjust the parameter vector is a gradient-based adaptation \delta p_c combined with Levenberg-Marquardt, with the objective function E_c to be minimised in the controller:

\delta p_c = \eta_c \frac{\partial E_c}{\partial p_c} = \eta_c \frac{\partial E_c}{\partial v}\frac{\partial v}{\partial F}\frac{\partial F}{\partial p_c},  E_c(t) = \frac{1}{2}v^2(t)    (13)

where \eta_c is the learning rate factor and v is the prediction of the external reinforcement signal.
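To make the layer-by-layer computation of eqs.(7)-(11) concrete, the following sketch evaluates the controller for one input sample. It is a minimal illustration only: the array shapes and names are assumptions, not the original implementation.

    import numpy as np

    def fuzzy_forward(x, m2, s2, m5, s5):
        """Five-layer forward pass of the neuro-fuzzy controller, eqs.(7)-(11).
        x      : (n_in,) input vector, e.g. [delta_theta_x, theta_dot_x]
        m2, s2 : (n_in, r) antecedent Gaussian centres and widths (layer 2)
        m5, s5 : (n_out, r) consequent centres and widths (layer 5)
        returns: (n_out,) defuzzified outputs, e.g. [k_a, k_h, k_s]
        """
        a1 = x                                          # layer 1: pass-through, eq.(7)
        f2 = -((a1[:, None] - m2) ** 2) / (s2 ** 2)     # layer 2: Gaussian memberships, eq.(8)
        a2 = np.exp(f2)
        a3 = np.prod(a2, axis=0)                        # layer 3: firing strengths, eq.(9)
        a4 = a3 / np.sum(a3)                            # layer 4: normalisation, eq.(10)
        return (m5 * s5) @ a4 / (s5 @ a4)               # layer 5: centre-of-area, eq.(11)

The same forward pass, restricted to a single output node, serves for the predictor described next.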
3.2 Neuro-fuzzy predictor

Supervised learning [4] is the most common strategy used to update a neuro-fuzzy system. These schemes need accurate training data that indicate the correct desired output in order to compute the training error. However, in the case of biped robots it is difficult to obtain the correct output for each push recovery strategy, because the robot is a dynamic system with simplifications and uncertainties in the model and the controller. On the other hand, most real systems can provide a right-wrong binary decision (reinforcement signal) based on the current control, which makes reinforcement training a better option for learning. Thus, a neuro-fuzzy predictor ("critic") is used to determine the parameter update using the external reinforcement signal r and to predict an internal reinforcement signal r̂. The predictor structure is the same as that of the controller, but with the internal reinforcement signal as its only output.

The learning algorithm is applied to the fuzzy controller and the fuzzy predictor simultaneously and is conducted only by an external reinforcement feedback (R) defined through an orbital energy:

R = \frac{\dot\theta_x^2 + w^2\theta_x^2}{2}    (14)

This is the "critic" information available for learning and indicates whether the output is right or wrong. In this paper, R is mapped to the range r = {−1 → failure, 1 → success} to get a continuous reinforcement signal. Thus, the predictor output v approximates the discounted total reward-to-go \tilde R(t) = r(t+1) + \alpha r(t+2) + \cdots [16]. It is the future accumulated reward-to-go, where \alpha is a discount factor for the infinite-horizon problem and r(t+1) is the external reinforcement. The prediction error e_p and the objective function E_p to be minimised are defined as

e_p(t) = \alpha v(t) - v(t-1) - r(t),  E_p(t) = \frac{1}{2}e_p^2(t)    (15)

Therefore, the parameter vector p_p = \{m^{2p}_{ij}, \sigma^{2p}_{ij}, m^{5p}_{j}, \sigma^{5p}_{j}\} is updated using a Levenberg-Marquardt algorithm and the gradient-based update rule

\delta p_p = \eta_p \frac{\partial E_p}{\partial p_p} = \eta_p \frac{\partial E_p}{\partial e_p}\frac{\partial e_p}{\partial v}\frac{\partial v}{\partial p_p}    (16)

3.3 State estimation

State estimation uses the joint angles measured by the encoders, the inertial acceleration and angular velocity of the trunk measured by the IMU, and a forward kinematic model to reconstruct the whole-body pose of the robot. From the reconstructed whole-body pose, the current COM and the supporting leg are obtained. The trunk position is estimated by fusing the gyroscope and accelerometer data.

The trunk position \hat\theta_t = (\theta_{tx}, \theta_{ty}) is expressed using the inertial acceleration vector a = (a_x, a_y) ∈ [−1, 1] measured by the accelerometer: first, a raw angle estimate is obtained (eqs. 17-18); then, it is mixed with the angular velocity at the nth iteration of the control loop (eq. 19).

\hat\theta_{tx} = \arcsin(a_x)    (17)

\hat\theta_{ty} = \arcsin(a_y)    (18)

\theta_n = (1 - \kappa)\left(\theta_{n-1} + \rho\left(\hat{\dot\theta}_n - b_n\right)\right) + \kappa\hat\theta_n    (19)

where \hat{\dot\theta}_n is the raw angular velocity from the gyroscope, \kappa is a blending factor, \rho is the control period, and b_n is the gyro drift. There are other, more advanced options for mixing the gyroscope and accelerometer measurements [2].

On the other hand, the joint measurements (q) obtained through the encoders of each motor are applied to a forward kinematic algorithm to get the current pose. This kinematic model is rotated around the centre of the supporting foot such that the trunk position is expressed as the roll and pitch angles \theta_q = (\theta_{tx}, \theta_{ty}), comparable with those obtained in equation (19). Thus, the difference between the trunk angles obtained from the IMU and from the joint angles is used by the neuro-fuzzy controller:

\Delta\theta = \theta_t - \theta_q,  \dot\theta = \hat{\dot\theta}    (20)

4 Reinforcement Levenberg-Marquardt learning

As mentioned earlier, a neuro-fuzzy system is used in both the controller and the predictor, and their parameters are updated using reinforcement learning. According to eq.(13) and eq.(15), it is possible to use a modification of the Levenberg-Marquardt (LM) algorithm [13] to speed up the update of the neuro-fuzzy system. The parameters p = \{p_c, p_p\} are updated with a normalisation in both networks to confine the parameter values to an appropriate range, eq.(21):

p(t+1) = \frac{p(t) + \delta p(t)}{\lVert p(t) + \delta p(t)\rVert_1} with \delta p(t) = -(H(p))^{-1}J^T(p)e(p)    (21)

where the approximation of the Hessian matrix H and the Jacobian matrix J are calculated using the backpropagation algorithm:

H(p) = J^T(p)J(p) + \mu I with J = \left[\frac{\partial E}{\partial p_1}, \frac{\partial E}{\partial p_2}, ..., \frac{\partial E}{\partial p_N}\right]    (22)

In Algorithm 1, p_0 denotes the initial parameter values for the controller and predictor, \varepsilon_1, \varepsilon_2 and \varepsilon_3 are the stop criteria of the LM algorithm, k_max is the maximum number of iterations and \varepsilon_4 bounds the controller or predictor prediction error. E_new is the value of the objective function evaluated with the new parameters.
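A compact sketch of the normalised update in eqs.(21)-(22) is given below. It is illustrative only: the Jacobian J and error vector e are assumed to come from the backpropagation pass, and the full iteration logic (damping adaptation and stop criteria) is the one listed in Algorithm 1.

    import numpy as np

    def lm_step(p, J, e, mu):
        """One damped Levenberg-Marquardt update with L1 normalisation, eqs.(21)-(22).
        p  : (N,) current parameter vector (controller p_c or predictor p_p)
        J  : (M, N) Jacobian of the errors with respect to the parameters
        e  : (M,) error vector (v for the controller, e_p for the predictor)
        mu : damping factor, adapted between iterations as in Algorithm 1
        """
        H = J.T @ J + mu * np.eye(p.size)        # damped Hessian approximation, eq.(22)
        delta = -np.linalg.solve(H, J.T @ e)     # LM step, eq.(21)
        p_new = p + delta
        return p_new / np.linalg.norm(p_new, 1)  # confine parameters via L1 normalisation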
The Jacobian matrix in eq.(22) for the controller is formed by the partial derivatives in eq.(13), which are calculated as

\frac{\partial E_c}{\partial v} = v,  \frac{\partial v}{\partial F} \approx \frac{dv}{dF} \approx \frac{v(t) - v(t-1)}{F(t) - F(t-1)}    (23)

The term \partial v/\partial F in eq.(13) depends on F only indirectly and is complex to compute, so an approximation is obtained from the instantaneous difference ratio. The other term, \partial F/\partial p_c, is more tractable: with F known and differentiable, a few applications of the chain rule through the five layers of the controller give the following set of learning rules.

Algorithm 1: The modified Levenberg-Marquardt algorithm
  Data: p_c or p_p, E_c or E_p, e_c or e_p
  Result: p_new
  k := 0; v := 2; p := p_0
  H := J^T J; g := J^T e(k)
  stop := (||g||_∞ ≤ ε_1); µ := τ · max(diag(H))
  while (not stop) and (k < k_max) and (E(k) > ε_4) do
    h := −(H + µI)^{−1} g
    if ||h|| ≤ ε_2 ||p|| then
      stop := true
    else
      p_new := p + h
      ρ := (||E(k)||² − ||E_new||²) / (h^T(µh + g))
      if ρ > 0 then
        p := p_new
        H := J^T J; g := J^T e(k)
        stop := (||g||_∞ ≤ ε_1) or (||e(k)||² ≤ ε_3)
        µ := µ · max(1/3, 1 − (2ρ − 1)³); v := 2
      else
        µ := µ · v; v := 2v
      end if
    end if
  end while

\frac{\partial F}{\partial m^5_{lj}} = \frac{\partial F}{\partial a^5_j}\frac{\partial a^5_j}{\partial f^5_j}\frac{\partial f^5_j}{\partial m^5_{lj}} = \left(f^5_l\right)\left(1\right)\left(\frac{\sigma^5_{lj}a^4_j}{\sum_{j=1}^{r}\sigma^5_{lj}a^4_j}\right)

\frac{\partial F}{\partial \sigma^5_{lj}} = \frac{\partial F}{\partial a^5_j}\frac{\partial a^5_j}{\partial f^5_j}\frac{\partial f^5_j}{\partial \sigma^5_{lj}} = \left(f^5_l\right)\left(1\right)\left(\frac{\left(\sum_{j=1}^{r}\sigma^5_{lj}a^4_j\right)m^5_{lj}a^4_j - \left(\sum_{j=1}^{r}m^5_{lj}\sigma^5_{lj}a^4_j\right)a^4_j}{\left(\sum_{k=1}^{r}\sigma^5_{lj}a^4_j\right)^2}\right)    (24)

The equations in (24) are the partial derivatives used to update the consequent parameter values p^5_{lj} = \{m^5_{lj}, \sigma^5_{lj}\}. The antecedent parameter values p^2_{ij} = \{m^2_{ij}, \sigma^2_{ij}\} are updated when the error is propagated to the preceding layers as

\frac{\partial F}{\partial p^2_{ij}} = \frac{\partial F}{\partial a^5_j}\frac{\partial a^5_j}{\partial f^5_j}\frac{\partial f^5_j}{\partial p^2_{ij}} = \frac{\left(\sum_{j=1}^{r}\sigma^5_{lj}a^4_j\right)\left(m^5_{lj}\sigma^5_{lj}\right)\frac{\partial a^4_j}{\partial p^2_{ij}} - \left(\sum_{j=1}^{r}m^5_{lj}\sigma^5_{lj}a^4_j\right)\left(\sigma^5_{lj}\right)\frac{\partial a^4_j}{\partial p^2_{ij}}}{\left(\sum_{k=1}^{r}\sigma^5_{lj}a^4_j\right)^2}

\frac{\partial a^4_j}{\partial p^2_{ij}} = \frac{\partial a^4_j}{\partial f^4_j}\frac{\partial f^4_j}{\partial a^3_j}\frac{\partial a^3_j}{\partial p^2_{ij}} = \frac{\left(\sum_{j=1}^{r}a^3_j\right)\frac{\partial a^3_j}{\partial p^2_{ij}} - a^3_j\frac{\partial a^3_j}{\partial p^2_{ij}}}{\left(\sum_{k=1}^{r}a^3_j\right)^2}

\frac{\partial a^3_j}{\partial p^2_{ij}} = \frac{\partial a^3_j}{\partial f^3_j}\frac{\partial f^3_j}{\partial a^2_{ij}}\frac{\partial a^2_{ij}}{\partial f^2_{ij}}\frac{\partial f^2_{ij}}{\partial p^2_{ij}} = \left(\prod_{k \ne i} a^2_{kj}\right)\left(\exp(f^2_{ij})\right)\frac{\partial f^2_{ij}}{\partial p^2_{ij}}, with \frac{\partial f^2_{ij}}{\partial m^2_{ij}} = \frac{2\left(a^1_i - m^2_{ij}\right)}{\left(\sigma^2_{ij}\right)^2}, \frac{\partial f^2_{ij}}{\partial \sigma^2_{ij}} = \frac{2\left(a^1_i - m^2_{ij}\right)^2}{\left(\sigma^2_{ij}\right)^3}    (25)

In the case of the predictor, the partial derivatives are very similar to those of the controller; the difference is

\frac{\partial E_p}{\partial e_p} = e_p,  \frac{\partial e_p}{\partial v} = \alpha    (26)

The partial derivative \partial v/\partial p_p is calculated in the same way as \partial f^5_l/\partial p_c.

5 Walking pattern

The walking pattern generates trajectories from the design parameters (mode, sway amplitude, step time, direction) in real time. The parameters are modified at the start of each step.

5.1 ZMP and COG trajectories

The objective of defining the trajectories of the zero moment point (ZMP) and the centre of gravity (COG) is to guarantee the balance of the robot in open loop, keeping these trajectories within the base of support (BoS). The trajectories are defined for the sagittal plane; the trajectories in the coronal plane are designed in a similar way. In the case of the ZMP, a dynamic ZMP can be represented with two equations in the X and Y axes using the simple inverted pendulum model (figure 2); assuming a constant height z_c of the pelvis in the transverse plane, equation (27) is simplified. In the case of the COG, it is assumed to be identical to the centre of the pelvis. Thus, the ZMP equation in the sagittal plane is represented as

x_{zmp} = \frac{\sum_{i=1}^{n} m_i(\ddot z_i + g)x_i - \sum_{i=1}^{n} m_i\ddot x_i z_i - \sum_{i=1}^{n} I_{iy}\ddot\omega_{iy}}{\sum_{i=1}^{n} m_i(\ddot z_i + g)} \approx x_{cog} - \ddot x_{cog}\frac{l}{g}    (27)

where m_i is the mass, (x_i, z_i) is the mass position, I_i is the moment of inertia and \omega_i is the angular velocity.
The first term (x_{cog}) is the static component and the second term (\ddot x_{cog}\, l/g) is the dynamic component of the ZMP.

Figure 2: Inverted pendulum model (left); step parameters in the x direction (right)

In this paper a third-order polynomial interpolation is used to obtain the desired ZMP trajectory. A polynomial can be used to generate a trajectory in a straightforward and direct form. Moreover, the COG trajectory is designed together with this trajectory. The polynomials in eq.(28) represent the position and velocity trajectories of the ZMP in the sagittal plane, setting a normalised time that starts at t_n = 0 and ends at t_n = 1 for each step [10], without loss of generality:

x_{zmp}(t_n) = \sum_{i=0}^{3} b_i t_n^i,  \dot x_{zmp}(t_n) = \sum_{i=1}^{3} i b_i t_n^{i-1},  b = \Omega^{-1}x^{boundary}_{zmp}    (28)

The boundary conditions {x_{zmp}(0), \dot x_{zmp}(0)} and {x_{zmp}(1), \dot x_{zmp}(1)} of the lifting and landing of the ZMP in each step define the polynomial coefficients in eq.(28), so as to get a smooth ZMP trajectory:

\begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 \\ 3 & 2 & 1 & 0 \end{bmatrix}\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix} = \begin{bmatrix} x_{zmp}(0) \\ \dot x_{zmp}(0) \\ x_{zmp}(1) \\ \dot x_{zmp}(1) \end{bmatrix},  \Omega b = x^{boundary}_{zmp}    (29)

By assuming that the COG x_{cog} is located at the centre of the pelvis, it is defined with the same form as (29) but with different coefficients. The boundary conditions for the COG {x_{cog}(0), \dot x_{cog}(0), x_{cog}(1), \dot x_{cog}(1)} are used to design the polynomial coefficients a_i:

x_{cog}(t_n) = \sum_{i=0}^{3} a_i t_n^i,  \dot x_{cog}(t_n) = \sum_{i=1}^{3} i a_i t_n^{i-1},  a = \Omega^{-1}x^{boundary}_{cog}    (30)

By substituting x_{cog} and its derivatives from eq.(30), together with x_{zmp} from eq.(28), into eq.(27), a direct match between the ZMP and the COG is obtained:

\sum_{i=0}^{3} b_i t_n^i = \sum_{i=0}^{3} a_i t_n^i - \left(\frac{l}{g}\right)\sum_{i=2}^{3} i(i-1) a_i t_n^{i-2}    (31)

The relationship between the boundary conditions allows all the coefficients of the ZMP and COG polynomials to be defined using (31) in the matrix form b = \Gamma a, where \Gamma is the coefficient matrix:

x^{boundary}_{zmp} = \Omega\,\Gamma\,\Omega^{-1}x^{boundary}_{cog} with \Gamma = \begin{bmatrix} 1 & 0 & -2\frac{l}{g} & 0 \\ 0 & 1 & 0 & -6\frac{l}{g} \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}    (32)

In the sagittal plane, each step can be defined by the amplitude of the feet during the double support phase before and after the single support phase (figure 2). The boundary conditions of the COG are defined with these parameters:

\begin{bmatrix} x_{cog}(0) & \dot x_{cog}(0) & x_{cog}(1) & \dot x_{cog}(1) \end{bmatrix}^T = \begin{bmatrix} -0.5c_i & \alpha c_i & 0.5c_{i+1} & \alpha c_{i+1} \end{bmatrix}^T    (33)

where S_x = c_i + c_{i+1} is the stride length in X. To define this trajectory, it is assumed that the position of the pelvis during a step is at the centre of the feet. The boundary conditions are parametrised by a factor \alpha that modifies the pelvis velocity; a high value of \alpha produces acceleration at the start and end of the step and deceleration while the leg is swinging. By establishing trajectories with a direct relationship, and defining boundary conditions based on the step parameters, it is simpler to define the trajectories and to modify them in the presence of a disturbance.
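To illustrate eqs.(28)-(33), the sketch below builds the COG cubic for one normalised step from the step amplitudes and derives the ZMP cubic through eq.(27). It is a hedged example: the coefficient ordering (ascending powers of t_n) and the numerical values are assumptions made for the illustration only.

    import numpy as np

    def cog_zmp_step(c_i, c_ip1, alpha, l, g=9.81):
        """COG/ZMP cubic coefficients for one normalised step (t_n in [0,1]).
        Coefficients are stored in ascending powers, coeff[k] multiplying t_n**k."""
        # COG boundary conditions, eq.(33)
        x_b = np.array([-0.5 * c_i, alpha * c_i, 0.5 * c_ip1, alpha * c_ip1])
        # position/velocity constraints at t_n = 0 and t_n = 1 (the role of Omega in eq.(29))
        omega = np.array([[1., 0., 0., 0.],    # x(0)
                          [0., 1., 0., 0.],    # x'(0)
                          [1., 1., 1., 1.],    # x(1)
                          [0., 1., 2., 3.]])   # x'(1)
        a = np.linalg.solve(omega, x_b)        # COG coefficients, eq.(30)
        # ZMP coefficients from x_zmp = x_cog - (l/g) * x_cog'', i.e. b = Gamma a, eq.(32)
        b = a.copy()
        b[0] -= (l / g) * 2.0 * a[2]
        b[1] -= (l / g) * 6.0 * a[3]
        return a, b

    # example: 4 cm and 5 cm step amplitudes, pelvis height 0.295 m, alpha = 1.5
    a, b = cog_zmp_step(0.04, 0.05, alpha=1.5, l=0.295)
    x_cog_mid = np.polyval(a[::-1], 0.5)       # COG position at mid-step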
5.2 Ankle trajectory

Once the COG and ZMP trajectories are defined, it is necessary to identify the trajectories of the ankles (support and swing) according to several movement conditions, which complete the pattern.

The swinging foot trajectory (\hat x_{a,swing}) is defined by a cycloid function: it has zero velocity at the start and at the end, and a maximum velocity in the swinging phase. The time t_d of the double support phase (DSP) is taken into account. The trajectory \hat x_{a,swing} is defined from one DSP to the next, between the start time t_1 and the final time t_2 (figure 3, left):

\hat x_{a,swing}(t_a) = S_x\left(t_a - \frac{\sin(2\pi t_a)}{2\pi}\right) - c_i with t_a = \frac{t - (t_1 + t_d/2)}{t_2 - t_1 - t_d}    (34)

Figure 3: Position of the swinging ankle (left); relative trajectories of the feet and the centre of the pelvis (right)

The supporting foot trajectory (\hat x_{a,support}(t)) must perform the opposite motion of the centre of the pelvis (x_p). These trajectories are generated in a local coordinate frame located at the front of the supporting foot and, for their implementation, must be transformed to the global coordinates of the robot (figure 3). Figure 3 (right) shows an example of the ankle trajectories, where it is assumed that the right ankle is the supporting one and the left ankle is swinging. The relative positions in the global coordinate frame are:

x_{a,support} = -x_p(t_n),  x_{a,swing} = \hat x_{a,swing}(t_a) - x_p(t_n)    (35)

The walking pattern is thus completely defined for the sagittal plane, where the input parameters (S_x, t_1, t_2) directly generate the trajectories of the ankles, knees and hip, as well as the ZMP and the COG. The tuning parameters (\alpha, t_d) are determined for the forward, turning and sideways modes. The trajectories for the coronal plane have the same structure but with different boundary conditions.
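A short sketch of the swing-foot profile of eqs.(34)-(35) is given below (illustrative only; the variable names and the clamping of the normalised time are assumptions):

    import math

    def swing_foot_x(t, t1, t2, t_d, s_x, c_i, x_pelvis):
        """Cycloid swing-ankle trajectory, eq.(34), expressed relative to the pelvis, eq.(35)."""
        t_a = (t - (t1 + t_d / 2.0)) / (t2 - t1 - t_d)   # normalised swing time
        t_a = min(max(t_a, 0.0), 1.0)                    # clamp outside the single-support phase
        x_local = s_x * (t_a - math.sin(2.0 * math.pi * t_a) / (2.0 * math.pi)) - c_i
        return x_local - x_pelvis                        # relative position, eq.(35)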
6 Experimental results

A DARwIn-OP humanoid robot developed by the RoMeLa lab was used to implement the neuro-fuzzy system experimentally. It is 45 cm tall, weighs 2.8 kg, and has 20 degrees of freedom. It has a web camera for visual feedback, and a 3-axis accelerometer and 3-axis gyroscope for inertial sensing. Position-controlled Dynamixel servos are used as actuators; they are driven by a CM730 microcontroller connected to an Intel Atom-based embedded PC at a control frequency of 100 Hz.

The neuro-fuzzy design begins when the fuzzy rules eq.(6) are defined, incorporating information about the Darwin robot and expert knowledge. Once the structure is defined, reinforcement learning is applied. First, learning is performed in a realistic physical simulation to avoid damage to the real system. Finally, the neuro-fuzzy system is implemented on the Darwin robot and tested with random pushes; its performance and the selected strategies show the advantages of the proposed scheme.

6.1 Design

The neuro-fuzzy systems for the controller and the predictor start with the same rules (r = 9) and with two inputs {Δθx, θ̇x}, but the controller has three outputs {ka, kh, ks} and the predictor has one output {v}. Thus, the r fuzzy rules are:

R_j: if \theta_x is \mu_{1j} and \dot\theta_x is \mu_{2j} then k_a is \nu_{1j} and k_h is \nu_{2j} and k_s is \nu_{3j}    (36)

where \mu_{ij} and \nu_{lj} are fuzzy sets, with indexes i = {1, 2} for the inputs, l = {1, 2, 3} for the outputs and j = {1, ..., r} for the fuzzy rules. {ka, kh, ks} are the recovery variables for the ankle eq.(1), hip eq.(2) and stepping eq.(3) strategies, respectively. With the fuzzy system defined, expert knowledge and the limitations of the humanoid robot are employed to set the initial rule base before learning. Three membership functions are used for each input variable, and five membership functions for each output variable. The rule base is defined as follows:

Table 1: Fuzzy rule base for the feedback gains

                 Δθx
  ka          NT     ZT     PT
  θ̇x   NV    PA     SPA    SNA
        ZV    SPA    ZA     SNA
        PV    SPA    SNA    NA

  kh          NT     ZT     PT
  θ̇x   NV    PH     SPH    SNH
        ZV    SPH    ZH     SNH
        PV    SPH    SNH    NH

  ks          NT     ZT     PT
  θ̇x   NV    NS     ZS     SPS
        ZV    SNS    ZS     SPS
        PV    SNS    ZS     PS

The membership functions in Table 1 are defined using expert knowledge and the min-max ranges for each push recovery strategy. The term sets of the input and output variables and their limits are selected as follows:

T(\theta_x) = \{\mu_{11}, \mu_{12}, \mu_{13}\} = \{NT, ZT, PT\}  \{-45, 45\}
T(\dot\theta_x) = \{\mu_{21}, \mu_{22}, \mu_{23}\} = \{NV, ZV, PV\}  \{-120, 120\}
T(k_{xa}) = \{\nu_{11}, \nu_{12}, \nu_{13}, \nu_{14}, \nu_{15}\} = \{NA, SNA, ZA, SPA, PA\}  \{-0.35, 0.35\}
T(k_{xh}) = \{\nu_{21}, \nu_{22}, \nu_{23}, \nu_{24}, \nu_{25}\} = \{NH, SNH, ZH, SPH, PH\}  \{-1, 1\}
T(k_{xs}) = \{\nu_{31}, \nu_{32}, \nu_{33}, \nu_{34}, \nu_{35}\} = \{NS, SNS, ZS, SPS, PS\}  \{-0.85, 0.85\}    (37)

where the letters of each fuzzy set stand for: Negative (N), Small Negative (SN), Zero (Z), Small Positive (SP), Positive (P), Trunk (T), Velocity (V), Ankle (A), Hip (H) and Step (S). For example, SPA means Small Positive Ankle.

6.2 Learning

The neuro-fuzzy system learns from random pushes during walking and is able to successfully absorb multiple pushes. Simulation trials were performed for 3 push types (frontal, from behind and lateral) with 6 repetitions in different strength ranges (small, medium, large). Reinforcement learning in simulation was applied over the 54 trials, resulting in the fuzzy surfaces of fig.(4), one for each output gain K.

Figure 4: Ankle feedback surface after simulation learning (left); hip feedback surface after simulation learning (middle); step feedback surface after simulation learning (right)

The ankle gain surface shows that this strategy is applied in all ranges: with a large or medium push its gain is high, while with a small push it is regulated. The hip strategy is applied when the disturbance is greater, allowing a recovery without having to execute a step. Finally, the step strategy surface shows the boundaries where taking a step is inevitable, while staying within the robot kinematics.

6.3 Results

After designing and training the neuro-fuzzy system in a physically realistic simulation, the system was implemented on the Darwin robot. A trial with multiple pushes was defined to evaluate the ability of the neuro-fuzzy system to represent expert knowledge while respecting the limits, as well as to assess the reinforcement learning algorithm used to update the system.

Figure 5 shows an experimental test with 6 random pushes in forward walking mode; the first and second plots show θx and θ̇x, respectively. The pushes of unknown strength, frontal (3 s, 29.9 s and 40 s) and from behind (9 s, 15.5 s and 21.6 s), are marked with vertical dashed lines in the plots. The first push is the strongest, requiring the application of the three strategies as shown in fig.(6), where the robot steps forward and corrects the trunk angle to an upright position. The feedback gains are weighted using the effect of the lower-level strategy learnt in the simulation trials, respecting the limits defined in the design. Note the sudden velocity change at the moment of the pushes; the robot is then able to recover its velocity to the normal level and maintain walking. This is because the external reinforcement signal is applied with minimal delay to the neuro-fuzzy system, which has a low computational cost with 18 fuzzy rules and 64 parameters to update.
In a comparison between the Levenberg-Marquardt algorithm and the backpropagation algorithm, the proposed algorithm takes an average of 0.5 ms with 8 iterations on average, while backpropagation takes an average of 4 ms with 80 iterations on average. This allows an efficient implementation in embedded systems.

Figure 5: Experimental test with 6 random pushes without failure

Figure 6: Gains of the 3 push recovery strategies with 6 random pushes after learning

Finally, a phase plot between θx and θ̇x is shown in fig.(7), together with the theoretical boundaries for each push recovery strategy obtained with a linear inverted pendulum model. The boundaries are calculated using the following parameters: z_c = 0.295, d = 0.05, T_H = 0.3, m = 2, g = 9.81, w = \sqrt{g/z_c}, \tau^{max}_{hip} = 1 and x^{max}_{capture} = 0.08 [17]:

Ankle: \left|\frac{\dot\theta_x}{w} + \theta_x\right| < \frac{d}{z_c}

Ankle + Hip: \left|\frac{\dot\theta_x}{w} + \theta_x\right| < \frac{d}{z_c} + \frac{\tau^{max}_{hip}\left(e^{-wT_H} - 1\right)^2}{m g z_c}

Ankle + Hip + Stepping: \left|\frac{\dot\theta_x}{w} + \theta_x\right| < \frac{d}{z_c} + \frac{\tau^{max}_{hip}\left(e^{-wT_H} - 1\right)^2}{m g z_c} + \frac{x^{max}_{capture}}{z_c}    (38)

Figure 7: The phase plot shows how the neuro-fuzzy system responds to 6 unknown pushes. The coloured areas are the theoretical boundaries for the ankle (white), hip (light green), step (light blue) and unstable (light red) regions.

The phase plot shows that the neuro-fuzzy system improved the robustness to strong pushes with a larger stable region. The robot has thus learnt the right ankle, hip and stepping strategies to cope with pushes from any direction.

7 Conclusions

In this paper, a practical method to implement a full-body push recovery system on a general humanoid robot without specialised sensors and actuators is proposed. The method consists of a neuro-fuzzy system designed by combining expert knowledge, system constraints and online reinforcement learning.

This approach has a number of advantages over previous approaches. It is able to recover from unknown disturbances from any direction, performing a combination of push recovery strategies based on the current system state, without an accurate dynamical model of the robot or the environment.

A straightforward and robust online reinforcement learning algorithm updates the parameters of the neuro-fuzzy controller and predictor to improve the push recovery performance, both in a simulation environment and on a small-size humanoid. The implementation using the modified LM algorithm guarantees low-cost computation.

The method is integrated with a non-periodic, step-omnidirectional walking generator whose step parameters produce smooth curves and can be flexibly changed in response to a disturbance. Moreover, it does not attempt to follow a future ZMP reference; only the joint trajectories are prescribed, using the step parameters and a simple inverted pendulum model.

Experimental results show that the trained system can successfully develop full-body push recovery under external perturbations while walking in an arbitrary direction. There is a significant improvement of the stable region in the sagittal plane from a low number of experiences, but enough for the neuro-fuzzy controller to produce convincing push recovery capabilities.
Possible future work includes incorporating strategies for unreliable terrain and further improving robustness.

Acknowledgment

Thanks to the Laboratory of Humanoid Robots at PCC-UNAM for the facilities provided, and to Dr. Fernando Arambula Cosio and Dr. Mario Pena Cabrera for the facilities and knowledge provided during the development of the project.

Bibliography

[1] Adiwahono A.H., Chew C.M., Liu B. (2013); Push recovery through walking phase modification for bipedal locomotion, International Journal of Humanoid Robotics, DOI: http://dx.doi.org/10.1142/S0219843613500229, 10(3), 1350022, 2013.

[2] Allgeuer P., Behnke S. (2014); Fused angles for body orientation representation, Proceedings of 9th Workshop on Humanoid Soccer Robots of the IEEE-RAS Int. Conf. on Humanoid Robots (Humanoids), DOI: 10.1109/IROS.2015.7353399, 366-373, 2014.

[3] Faber F., Behnke S. (2007); Stochastic optimization of bipedal walking using gyro feedback and phase resetting, 7th IEEE-RAS International Conference on Humanoid Robots, 203-209, doi: 10.1109/ICHR.2007.4813869, 2007.

[4] Fuller R. (2013); Introduction to neuro-fuzzy systems, Vol. 2, Springer Science and Business Media, 2013.

[5] Hyon S.H., Osu R., Otaka Y. (2009); Integration of multi-level postural balancing on humanoid robots, IEEE Int. Conf. Robotics and Automation (ICRA), doi: 10.1109/ROBOT.2009.5152434, 1549-1556, 2009.

[6] Komura T., Leung H., Kudoh S., Kuffner J. (2005); A feedback controller for biped humanoids that can counteract large perturbations during gait, IEEE Int. Conf. Robotics and Automation (ICRA), Barcelona, Spain, 1989-1995, 2005.

[7] Koolen T., de Boer T., Rebula J.R., Goswami A., Pratt J.E. (2012); Capturability-based analysis and control of legged locomotion: Theory and application to three simple gait models - Part 1, Int. J. of Robotics Research, 31(9), 1094-1113, 2012.

[8] Missura M., Behnke S. (2015); Gradient-Driven Online Learning of Bipedal Push Recovery, IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), Hamburg, Germany, doi: 10.1109/IROS.2015.7353402, 387-392, 2015.

[9] Missura M., Behnke S. (2013); Omnidirectional Capture Steps for Bipedal Walking, IEEE-RAS Int. Conf. on Humanoid Robots (Humanoids), 14-20, doi: 10.1109/HUMANOIDS.2013.7029949, 2013.

[10] Park J.-W., Kim J.-Y., Oh J.-H. (2008); Online Walking Pattern Generation and Its Application to a Biped Humanoid Robot - KHR-3 (HUBO), Advanced Robotics, 22(2), 159-190, 2008.

[11] Pratt J., Carff J., Drakunov S., Goswami A. (2006); Capture point: A step toward humanoid push recovery, IEEE-RAS Int. Conf. Humanoid Robots, Genova, Italy, 200-207, 2006.

[12] Pratt J.E., Koolen T., de Boer T., Rebula J.R., Cotton S., Carff J., Johnson M., Neuhaus P.D. (2012); Capturability-based analysis and control of legged locomotion: Application to M2V2, a lower-body humanoid - Part 2, Int. J. of Robotics Research, 31(10), 1117-1133, 2012.

[13] Ranganathan A. (2004); The Levenberg-Marquardt Algorithm, Tutorial on LM algorithm, 2004.

[14] Rebula J., Pratt J., Canas F., Goswami A. (2007); Learning capture point for improved humanoid push recovery, IEEE-RAS Int. Conf. Humanoid Robots, Pittsburgh, PA, doi: 10.1109/ICHR.2007.4813850, 65-72, 2007.

[15] Semwal V.B., Chakraborty P., Nandi G.C. (2015); Less computationally intensive fuzzy logic (type-1)-based controller for humanoid push recovery, Robotics and Autonomous Systems, 63, 122-135, 2015.

[16] Si J., Wang Y.T. (2001); On-line Learning Control by Association and Reinforcement, IEEE Trans.
on Neural Networks, doi: 10.1109/72.914523, 12(2), 264-276, 2001. [17] Stephens B. (2007); Humanoid push recovery, IEEE-RAS Int. Conf. Humanoid Robots, IEEE Press, Pittsburgh, PA, 589-595, 2007. [18] Tedrake R. (2004); Stochastic policy gradient reinforcement learning on a simple 3d biped, Proc. of the 10th Int. Conf. on Intelligent Robots and Systems, 2849-2854, 2004. [19] Wieber P.B. (2006); Trajectory free linear model predictive control for stable walking in the presence of strong perturbations, IEEE-RAS Int. Conf. Humanoid Robots, IEEE Press, Nashville, doi: 10.1109/ICHR.2006.321375, 137-142, 2006.