CHEMICAL ENGINEERING TRANSACTIONS VOL. 46, 2015. A publication of The Italian Association of Chemical Engineering. Online at www.aidic.it/cet. Guest Editors: Peiyu Ren, Yanchang Li, Huiping Song. Copyright © 2015, AIDIC Servizi S.r.l., ISBN 978-88-95608-37-2; ISSN 2283-9216

Data Fitting Based on Genetic Programming and Least Square Method

Pinchao Meng(a), Weishi Yin(a), Zhixia Jiang(a), Yanzhong Li(*b)
(a) Department of Applied Mathematics, Changchun University of Science and Technology, Changchun, China
(b) College of Mathematics and Statistics, Beihua University, Jilin, China. lyz@cust.edu.cn

Traditional data fitting techniques usually require the basis functions to be estimated in advance, and the resulting programs are specific to particular application areas. Exploiting the dynamic characteristics of genetic programming, a two-phase data fitting algorithm is proposed: genetic programming optimizes the model structure, and the least square method estimates the parameters. The proposed algorithm is tested on different types of data fitting, and its efficiency and accuracy make it applicable across different areas.

1. Introduction
Professor Koza proposed the genetic programming algorithm in 1989, and it has since been widely studied. Algorithmic studies cover the selection of the fitness function, evolution strategies, optimization of the genetic operators, and the corresponding algorithm designs, often in combination with specific practical problems. In theoretical research, adaptive mechanism analysis of genetic programming has been carried out with the help of the fitness state diagram and the schema theorem. Genetic programming has also been integrated with other computational intelligence algorithms, such as decision trees, artificial neural networks, fuzzy inference and rough sets, and the improved variants perform very well.
Genetic programming has subsequently been combined with machine learning methods, and applying it to machine learning is currently a very popular research topic. In this paper, to overcome the fact that the solution structure is unknown when using the least square method, and that errors occur when genetic programming determines parameters, the two methods are combined to conduct data fitting.

2. Genetic programming
Genetic programming is a search optimization technique. It imitates biological evolution and genetics, following the principles of "struggle for existence" and "survival of the fittest": starting from an initial solution, it approaches the optimal solution of the problem step by step using duplication, exchange and mutation operations. Problems are represented in the hierarchical format of computer programs, which suits a wide variety of complex issues; the goal is to construct computer programs that reflect the essence of the real problem. Solving a problem with genetic programming means finding the computer program with the best fitness in a search space composed of many possible computer programs, and genetic programming provides exactly the method for finding it.
DOI: 10.3303/CET1546218
Please cite this article as: Meng P.C., Yin W.S., Jiang Z.X., Li Y.Z., 2015, Data fitting based on genetic programming and least square method, Chemical Engineering Transactions, 46, 1303-1308, DOI: 10.3303/CET1546218

2.1 General steps of genetic programming
The general steps of genetic programming are as follows:
(1) Randomly generate an initial population, i.e., produce computer programs composed of random functions and variables;
(2) Run every computer program (individual) in the population and assign each a fitness according to how well it solves the problem;
(3) Generate a new generation of computer programs in two steps:
(a) duplicate individuals from the current generation into the new one, selecting the individuals to be duplicated randomly according to their fitness;
(b) generate new computer programs by exchanging randomly selected parts of two parent individuals, the parents also being selected randomly according to their fitness;
(4) Execute steps (2) and (3) iteratively until the termination criterion is satisfied. The best individuals produced in any generation are considered potential results of genetic programming, and may turn out to be correct solutions of the corresponding problem (Ebner (2007)).

2.2 Generation of the initial population
The initial population consists of many initial individuals, produced randomly as possible symbolic expressions of the problem to be solved. At the beginning, a function is selected from the function set F according to a uniform distribution and used as the root node of the tree. Confining the root node to the function set F ensures that a hierarchical, complex structure is generated.
We assume that there are n functions in the curve fitting function set F: F = {f1, f2, ..., fn}. A function fi in the function set can be an arithmetic operator such as +, -, × and ÷, a standard mathematical function such as sin, cos, log, exp, etc., or another subfunction that has already been defined. The terminator set T includes m terminators: T = {t1, t2, ..., tm}. The terminators can be variables, such as x, y and z, or constants, such as a and b.

2.3 Fitness function
Fitness is the driving force of natural selection in genetic programming. Darwin's principle of evolution is "survival of the fittest": good individuals survive and bad individuals are abandoned, so that the population is optimized over generations of genetic recombination. A measurement is needed to judge whether an individual is good or bad, and this measurement is fitness: a value assigned to each individual using an evaluation method appropriate to the specific problem. In general, there are four common fitness measures: original fitness, standard fitness, conciliatory fitness and normalized fitness.

2.4 Genetic operator
As in genetic algorithms, the main operators of genetic programming are duplication and crossover. The duplication operation (the better an individual's fitness, the greater its chance of being duplicated) selects the best individuals from the current generation based on fitness and carries them into the new generation through self-reproduction. Specifically, it consists of two parts: (1) select a parent individual from the population based on fitness, using one of several selection methods; (2) the selected parent individual duplicates itself into the next generation without any change.
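The tree representation and random initialisation described in Sections 2.1-2.2 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the particular function set, terminator set, depth limit and leaf probability below are assumptions chosen for the sketch.

```python
import random

# Illustrative function set F (symbol, arity) and terminator set T;
# these particular choices are assumptions made for the sketch.
FUNCTIONS = [("+", 2), ("*", 2), ("sin", 1)]
TERMINATORS = ["x", 1.0, 2.0]

def grow(depth=0, max_depth=4):
    """Randomly grow an expression tree. The root is always drawn from the
    function set, so a hierarchical structure is guaranteed (cf. Section 2.2)."""
    if depth > 0 and (depth == max_depth or random.random() < 0.3):
        return random.choice(TERMINATORS)       # leaf: variable or constant
    symbol, arity = random.choice(FUNCTIONS)    # internal node: function
    return (symbol, [grow(depth + 1, max_depth) for _ in range(arity)])

random.seed(0)
tree = grow()   # one random initial individual
```

Repeating `grow()` M times yields an initial population of M random individuals, each a symbolic expression whose root node is a function.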
2.4.1 Selection
When duplication takes place, a parent individual generates an offspring individual in two steps: first, a parent is selected from the population according to fitness; second, the parent duplicates itself into the next generation without any change. Fitness-based selection methods can be divided into four types: (1) proportional selection; (2) classification (rank) selection; (3) competition (tournament) selection; (4) elite selection. For a specific problem, these methods can be combined, or a more appropriate selection method can be customized to the actual situation.

2.4.2 Crossover
Crossover is the operation in which parts of two parent individuals' structures are exchanged with each other to produce two new offspring individuals. The procedure is as follows: select two individuals from the parent population using the selection method described above, with the two selections made independently, so that the structure and size of the two parents may differ. Each selected parent chooses a crossover point using a uniformly random method, defining a subtree whose root is the crossover point. The part of the tree below the crossover point, including the point itself, is called the crossover section; sometimes a crossover section is only a leaf. The first individual then deletes its crossover section and grafts the crossover section of the second individual onto its own crossover point, and the second individual does the same. In this way, two new offspring individuals are generated from the crossover and exchange of the two parent individuals.

2.5 Mutation
The crossover operation adopts a partially matched hybridization method.
That is, two crossover points on the bit string are generated according to a uniform distribution, and the region between them is taken as the matching crossover area; a position swap operation then exchanges the bit strings within this matching area. For a chromosome, gene bits are selected randomly for mutation; the mutation space is the set of all available actions of the corresponding layer, idle actions included. In genetic programming, mutation is an auxiliary operation: the mutation probability is generally small, usually in the range 0-0.01. Mutated individuals are selected as follows: K1 individuals are chosen from the population at random, and mutation is carried out on the individual with the worst fitness among these K1; when a randomly generated number p falls below the mutation probability, random mutation is carried out.

2.6 Termination criterion
Genetic programming executes evolution iterations continuously; once a termination criterion is satisfied, the evolutionary process stops immediately. Generally, there are two termination criteria: (1) the maximum allowable number of evolution generations G is reached; (2) the preset problem-solving conditions are met.

3. Data fitting algorithm based on genetic programming and least square method
The function structure has to be specified when the least square method is used for data fitting, yet in practical applications the structure of the fitting function is usually unknown. Genetic programming can determine the function structure from the data, but the parameters of the resulting function may not be optimal. We therefore first use genetic programming to determine the function structure and then apply the least square method to optimize the parameters, repeating this process to obtain good results.
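The division of labour described above can be illustrated with a small sketch. Suppose genetic programming has proposed the structure y = a·x + b·sin(x) (a hypothetical example, not one of the paper's evolved individuals), leaving the parameters a and b to the least square method. Because this model is linear in its parameters, NumPy's `lstsq` solves it directly:

```python
import numpy as np

# Noiseless synthetic data generated from the assumed structure y = 2x + 0.5*sin(x).
x = np.linspace(-1.0, 1.0, 21)
y = 2.0 * x + 0.5 * np.sin(x)

# Design matrix: one column per parameterised term of the GP-proposed structure.
A = np.column_stack([x, np.sin(x)])
params, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b = params   # least square estimates of the structure's parameters
```

In the paper's algorithm the structure itself comes from the evolved individuals; only the parameter-estimation phase is shown here.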
The steps of the data fitting algorithm based on genetic programming and the least square method (LSGP) are as follows:
(1) Determine the parameters, including the function set and the terminator set, and randomly generate the variable parameters. Function set: F = {+, -, *, /, exp, log, sin}. Terminator set: T = {p, rand}.
(2) Generate the initial population. The growth method is used to generate M initial individuals randomly, which form the initial population.
(3) Fitness selection. Use the original fitness to calculate the fitness of the individuals. If the termination criterion (fitness less than 0.01) is satisfied, stop; otherwise go to step (4).
(4) Apply the genetic operators of genetic programming to the M individuals:
Duplication: fitness-proportional selection is used; the higher an individual's fitness, the greater its probability of being selected, which ensures that good individuals enter the next population.
Exchange: select two individuals at random from the M individuals and exchange parts of them to generate two new individuals; repeat until the required number of exchanges is reached.
Mutation: single-point mutation is used. Select two individuals at random from the M individuals, then mutate the selected nodes randomly; repeat until the required number of mutations is reached.
(5) Select part of the individuals and use the least square method to optimize the function parameters corresponding to each individual, generating M new individuals.
(6) Return to step (3) and loop until the termination condition is met.
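Step (3) of the algorithm above can be sketched as follows. Taking the "original fitness" to be the sum of squared errors of an individual's function over the data is an assumption made for the sketch, since the paper does not spell out the exact formula:

```python
import numpy as np

def fitness(individual, xs, ys):
    """Sum of squared errors of an individual's function over the data
    (assumed form of the 'original fitness'; smaller is better)."""
    preds = np.array([individual(x) for x in xs])
    return float(np.sum((preds - ys) ** 2))

def terminated(population, xs, ys, tol=0.01):
    """Termination criterion of step (3): best fitness drops below the threshold."""
    return min(fitness(ind, xs, ys) for ind in population) < tol

# Two toy individuals standing in for evolved programs; the data come from y = 2x.
xs = np.linspace(0.0, 1.0, 11)
ys = 2.0 * xs
population = [lambda x: 2.0 * x, lambda x: x ** 2]
```

With this convention the first toy individual reproduces the data exactly, so the loop of steps (3)-(6) would stop at once; the second would be improved or discarded by the genetic operators.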
4. Calculation examples
4.1 Example 1
A group of data (xi, yi) is shown in Table 1:

Table 1: First group of data
xi: -1, -0.9, -0.8, -0.7, -0.6, -0.5, -0.4, -0.3, -0.2, -0.1, 0
yi: -1.9, -2.2, -2.4, -2.5, -2.5, -2.3, -2.0, -1.6, -1.1, -0.6, 0
xi: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1
yi: 0.6, 1.1, 1.6, 2.0, 2.3, 2.5, 2.5, 2.4, 2.2, 1.9

Function set: F = {+, -, *, /, exp, log, sin}. Terminator set: T = {p, rand}. Parameters of genetic programming: initial population size M = 100, number of iterations N = 100, termination condition fitness less than 0.01. Run the genetic programming algorithm. Parameters of the combination of genetic programming and least square method (LSGP) are the same: M = 100, N = 100, termination condition fitness less than 0.01. Every 5 runs of the genetic programming algorithm, take out the first 50 individuals containing variable parameters and use the least square method to optimize their parameters; the optimized functions then rejoin the population, replacing the original individuals, and the genetic programming algorithm is run on the new population. This is considered one cycle and is repeated 20 times. 50 experiments were carried out and the results are shown in Table 2:

Table 2: The experimental results of Example 1
Method   Number of convergences   Average iterations
GP       26.36                    21.22
LSGP     32.78                    18.40

Figure 1: Scatter diagram of the experimental data of Example 1. Figure 2: Graph of the fitting function of Example 1.

Table 2 shows that, compared with genetic programming alone, combining genetic programming with the least square method increases the number of convergences of the algorithm and also improves the average number of generations to convergence.
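As a quick cross-check of Example 1 (not the paper's evolved model), the Table 1 data can be fitted by ordinary least squares with a hand-picked odd basis {x, x³}; in the paper this structure would instead be discovered by genetic programming:

```python
import numpy as np

# Table 1 data of Example 1.
x = np.linspace(-1.0, 1.0, 21)
y = np.array([-1.9, -2.2, -2.4, -2.5, -2.5, -2.3, -2.0, -1.6, -1.1, -0.6, 0.0,
              0.6, 1.1, 1.6, 2.0, 2.3, 2.5, 2.5, 2.4, 2.2, 1.9])

# Hand-picked basis for illustration; GP would evolve the structure itself.
A = np.column_stack([x, x ** 3])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef   # pointwise fitting errors
```

The residuals stay small, consistent with the paper's observation that the data points lie evenly around a smooth fitting curve.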
As can be seen from Figures 1 and 2, the data points distribute evenly around the graph of the fitting function, and the fitting accuracy is high. Considering both aspects, the combination of genetic programming and the least square method wins out.

4.2 Example 2
A group of data (xi, yi) is shown in Table 3:

Table 3: Second group of data
xi: 22.00, 27.00, 32.00, 37.00, 40.00, 45.00, 48.00, 49.00, 52.00, 59.00
yi: 3.08, 4.25, 4.01, 2.97, 4.43, 4.65, 3.10, 2.94, 4.94, 4.71
xi: 65.00, 66.00, 70.00, 77.00, 80.00, 82.00, 83.00, 85.00, 88.00, 90.00
yi: 5.00, 4.16, 5.02, 5.34, 3.39, 4.72, 5.39, 4.27, 4.51, 5.39

Function set: F = {+, -, *, /, exp, log, sin}. Terminator set: T = {p, rand}. Parameters of genetic programming: initial population size M = 100, number of iterations N = 100, termination condition fitness less than 0.01. Run the genetic programming algorithm. Parameters of the combination of genetic programming and least square method (LSGP) are the same: M = 100, N = 100, termination condition fitness less than 0.01. Every 5 runs of the genetic programming algorithm, take out the first 50 individuals containing variable parameters and use the least square method to optimize their parameters; the optimized functions then rejoin the population, replacing the original individuals, and the genetic programming algorithm is run on the new population. This is considered one cycle and is repeated 20 times.
50 experiments were carried out and the results are shown in Table 4:

Table 4: The experimental results of Example 2
Method   Number of convergences   Average iterations
GP       17.03                    24.20
LSGP     25.59                    20.00

Table 4 shows that, compared with genetic programming alone, combining genetic programming with the least square method increases the number of convergences of the algorithm and also improves the average number of generations to convergence. As can be seen from Figures 3 and 4, the data points distribute evenly around the graph of the fitting function, and the fitting accuracy is high. Considering both aspects, the combination of genetic programming and the least square method wins out. Comparing Figure 2 with Figure 4, the experimental data of Example 2 look more disorderly than those of Example 1; such data are difficult to fit with the least square method alone, yet the fitting results of the method proposed in this paper are very good.

Figure 3: Scatter diagram of the experimental data of Example 2. Figure 4: Graph of the fitting function of Example 2.

5. Conclusions
Genetic programming has seen rapid development and wide application over a history of about thirty years. As a branch of evolutionary computation, it shares the common characteristics of evolutionary computation, such as global optimization, self-learning ability and independence from gradients. Traditional data fitting methods need to pre-determine the structure of the fitting function from empirical knowledge, so the resulting programs can only solve a certain problem in a certain area and port poorly. Genetic programming, by contrast, has dynamically variable characteristics and requires no prior knowledge, and has therefore been introduced into the field of data fitting.
To improve computing efficiency and avoid the premature convergence and poor population diversity of traditional genetic programming algorithms, practical experience should be drawn upon and changes made to the fitness function design, the selection of individuals for exchange, and the mutation operation, so as to obtain new algorithms and good results. In data analysis, least square fitting is the most commonly used method for obtaining approximate functions; however, it presupposes a known function structure, all of whose parameters must be determined, and the structure must not be complicated. When genetic programming performs data fitting, on the other hand, the obtained parameters may not be optimal. On this basis, this paper proposes a method that combines genetic programming with the least square method: during the genetic programming process, the least square method is used to optimize the parameters of individuals, which are then put back into the population. This breaks the limitation of traditional data fitting methods, which can only use a fixed function model. The number of convergences and the average number of generations to convergence of the new algorithm are significantly better than those of traditional methods, and the fitting function also has higher precision. The advantage of this method is that, when the function model is unknown, it can fit the function structure to the experimental data points through genetic programming; a new initial population is then generated by optimizing the parameters with the least square method, and genetic programming is applied again. The obtained function thus has a better structure, and its number of convergences and average iteration count are very good. Of course, the method still needs improvement: for example, the number of iterations and the size of the initial population have a great impact on computation speed.
References
Abraham A., Ramos V., 2003. Web usage mining using artificial ant colony clustering and linear genetic programming. In: Proceedings of the 2003 IEEE Congress on Evolutionary Computation, 2: 1384-1391. http://dx.doi.org/10.1109/CEC.2003.1299832
Aslam M.W., Zhu Z.C., Nandi A.K., 2012. Automatic modulation classification using combination of genetic programming and KNN. IEEE Transactions on Wireless Communications, 11(8): 2741-2750.
Angeline P.J., 1997. Comparing subtree crossover with macromutation. In: Proceedings of the Sixth Annual Conference on Evolutionary Programming. Berlin: Springer-Verlag, 101-111. http://dx.doi.org/10.1007/BFb0014804
Chen Z.W., Wang W.L., 2003. Research actuality and development of genetic programming. Journal of Zhejiang University of Technology, 31(2): 153-159.
Ebner M., 2007. Fast genetic programming on GPUs. In: EuroGP 2007, LNCS 4445, 90-101.
Freitas A.A., 2003. A survey of evolutionary algorithms for data mining and knowledge discovery. In: Advances in Evolutionary Computing. Springer Berlin Heidelberg, 819-845.
Hirasawa K., Okubo M., Katagiri H., 2001. Comparison between genetic network programming (GNP) and genetic programming (GP). In: Proceedings of the 2001 IEEE Congress on Evolutionary Computation, 1276-1282. http://dx.doi.org/10.1109/CEC.2001.934337
Jin R., Kou C.H., Liu R.J., 2014. Biclustering algorithm of differential co-expression for gene data. Review of Computer Engineering Studies, 1(1): 7-12. http://dx.doi.org/10.18280/rces.010102
Koza J.R., 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: MIT Press.
Koza J.R., 2000. Automatic creation of human-competitive programs and controllers by means of genetic programming. Genetic Programming and Evolvable Machines, 1: 121-164.
Miller J.F., Thomson P., 2000. Cartesian genetic programming. In: Genetic Programming. Springer Berlin Heidelberg, 121-132. http://dx.doi.org/10.1007/978-3-540-46239-2_9
Niu A.G., Wang H.W., 2007.
An improved algorithm of genetic programming. Operations Research and Management Science, 16(1): 67-70.
Nordin P., 1999. Book review: Genetic Programming III - Darwinian Invention and Problem Solving. Evolutionary Computation, 7(4): 451-453. http://dx.doi.org/10.1162/evco.1999.7.4.451
Shao G.F., Zhou Q.F., Chen G.Q., 2009. Data fitting based on improved genetic programming. Application Research of Computers, 26(2): 481-484.
Sun Y.G., Qiang H.Y., Yang K.R., Chen Q.L., Dai G.W., Dong M., 2014. Experimental design and development of heave compensation system for marine crane. Mathematical Modelling and Engineering Problems, 1(2): 15-21. http://dx.doi.org/10.18280/mmep.010204
Zhang M., Zhou Y.Q., Wang D.D., 2006. A data fitting method based on genetic programming. Journal of Harbin Engineering University, 27: 527-530.
Zhang Y., Zhang M., 2004. A multiple-output program tree structure in genetic programming. In: Proceedings of the 7th Asia-Pacific Conference on Complex Systems.
Zhou Q., 2015. Effective kidney MPI segmentation method based on level set with prior shape. Review of Computer Engineering Studies, 2(1): 43-46. http://dx.doi.org/10.18280/rces.020108