CHEMICAL ENGINEERING TRANSACTIONS VOL. 51, 2016 A publication of The Italian Association of Chemical Engineering Online at www.aidic.it/cet Guest Editors: Tichun Wang, Hongyang Zhang, Lei Tian Copyright © 2016, AIDIC Servizi S.r.l., ISBN 978-88-95608-43-3; ISSN 2283-9216 An Improved Shark-Search Algorithm for Agriculture Web Search Engine Bin Wanga, Junhao Wang*a, Xiaohua Sunb, Na Wangc a College of Information Science and Technology,Agricultural University of Hebei,Baoding,China b Department of Digital Media,Hebei Software Institute,Baoding,China c. Department Economics and Management,Baoding Vocational and Technical College, Baoding, China wb900@126.com An improved algorithm for agriculture web search engine based on Shark-Search is proposed. Aiming at deficiency of the traditional algorithm including the context of the anchor text and the trapping in local optimum, the new algorithm clusters the links between the different block of the web pages by the k-average algorithm. It randomly selects k texts as the mass of the original class. According to the average value of the text in the class, each text is assigned to the nearest class. By using the tunnel technique, the crawler can complete the search process. The range of the optimality has been expanding. It can get a globally optimal solution from locally optimal solutions. This algorithm is easy to implement and the computing speed can meet the business needs. The experiment indicates that the new algorithm can improved the performance in fetching the pages evidently. 1. Introduction The research of agriculture search engine is very essential when more and more agricultural information is required by farm workers and the self-employed. As a special crawler category, focused crawler is developed to retrieve as many documents related to a topic of interest as possible while reducing computational resources and the network which was confirmed (Guo et al., 2005; SI, et al., 2010; Ginsberg et al., 2008; Gao et al., 2010; Apte et al., 1994).Traditional focused crawler targets pages that are related to some specific topics. The goal of focused web crawler is getting more pages that are correlative with a certain topic to prepare information for querying which was confirmed Brin et al., 1998; Kleinberg et al., 1999; Chakrabarti et al., 1999). Having been developed for several years, there are many algorithms at present for the focused crawler. The typical examples are Fish-Search algorithm and Shark-Search algorithm which was confirmed (Debra et al., 1994; Herseovici et al., 1998).The advantages of them are the better theoretical foundation and the simple calculation. However, they ignore the relevant information about the linking structure.so there are inadequacies in predicting the value of links. The Shark-Search algorithm has improved a great deal based on the Fish-Search algorithm. But the insufficient of the context of the anchor text and the trapping in local optimum are its obvious defect. For the lack of proposed Shark-Search algorithm, this paper means to ameliorate and demonstrate it with the clustering of link and the tunnelling technique. The comparisons data of the two algorithms show that the improved method can improve the performance in fetching the pages related to the topic evidently. 2. the improved algorithm The Shark-Search algorithm gains a significant improvement for the Fish-Search method which was confirmed (Menczer et al., 2004). It introduces the vector space model to improve the determination of the correlation. And it makes the most of the correlation of the parent nodes and the anchor text along with the context. However, there are two drawbacks in current Shark-Search algorithm. One is the insufficient of the context of the anchor text; the other is trapping in local optimum. This paper improves the algorithm; the method is elaborated as follows. DOI: 10.3303/CET1651134 Please cite this article as: Wang B., Wang J.H., Sun X.H., Wang N., 2016, An improved shark-search algorithm for agriculture web search engine, Chemical Engineering Transactions, 51, 799-804 DOI:10.3303/CET1651134 799 2.1 the clustering of link Suppose that a pattern set is expressed as: The Clustering form can be described below which was confirmed (Sun et al., 2008): },...,,{ 21 n pppU (1) The ith pattern is expressed as pi, and i= (1, 2, … , n). Meanwhile, Suppose that: UC t , },...,2,1{ kt (2) },...,,{ 21 twttt pppC (3) ),( irms ppproximity (4) The result of clustering is expressed as Ct, and some conditions need to be satisfied: The similarity distance of the pattern is described by the function proximity. Among the formulas, the first subscript of pti expresses the class, to which the mode belongs; the second subscript expresses the pattern of the class. (1) UCU t k t 1 . (5) (2)For mC , UC r , rm CC , rm CC . (6) )),(()),(( ,, mymxpprvmupp ppproximityMAXppproximityMIN mymxrvmu (7) There are many kinds of clustering algorithms. The K-means clustering methods are described below. Step1: randomly select k texts as the mass of the original class. Step2: repeat the steps below until there are not any more changes. 1) Each text is assigned to the nearest class according to the average value of the text in the class. 2) Update the average value. The pseudo-code of K-means clustering algorithm is which was confirmed (Gianluigi, et. al, 2002): KMean(X1, X2, … , XN, K) Initialize the allocation schemes of clustering, the imputation vector are: A [1], A [2],…A [N]; do {change = false; For (i=1;i<=N;i++) { K = argminkdist(Xi,Ck); if(A[i]!=K) { A[i]= K; change=true; } } } while(change==false) returnA[1],A[2],…,A[N]; } This algorithm is easy to implement and the computing speed can meet the business needs. 2.2 The tunnel technique The figure1 can describe a case of crossing the irrelevant web page to locate the pages which are really needed. The Web Community exists in the network, so some nodes have the characteristics of aggregation. And they can form an obviously pages set which was confirmed (Gibson et al., 1998). 800 The Web Community A and Web Community B are two pages set related to the topic. The web pages corresponding to the seeds set are also related to the topic. The page one that links with the seeds set has the link to the page in pages set A, so the pages related to the topic can be easily get from page one. For the pages set B, the web pages that links with the seeds set has not the link to the page in pages set B, and there is not any page in pages set A directly related to the pages set B. By adopting conventional web crawler method, all the related pages cannot be crawled. By the tunnel technique, the page two can get through the irrelevant pages and reach the pages set B. The page in pages set A can also get through the irrelevant pages and reach the pages set B. Figure 1: The tunnel technique By using the tunnel technique, the crawler can complete the Process shown in figure2. Figure 2: The crawling Process Traditional focused crawler is targeting web pages that are relevant to some specific topics. But it has many local optimizations. In the figure, the related pages are A0,B0,A3,B3,B4,B5.The irrelevant pages are A1,A2,B1,B2.The search path is P0,P1,P2,P3,P4,P5,P6,P7,P8.Suppose that the focused crawler starts from A0. The B0 page is related to the topic and the A1 page is non-relevant, so the traditional crawler ends its searching when reaches B0 along P0.When using the tunnel technique, the P1 path can also be used for searching. So the path newly adds P1→P2→P4.And the A3 page is related to the topic.it can be added to the queue. Thus, by the tunnel technique, the range of the optimality has been expanded. It can get a globally optimal solution from locally optimal solutions. 2.3 The improved Shark-Search algorithm Aiming the deficiencies of the Shark-Search algorithm including the insufficient of the context of the anchor text and the trapping in local optimum, the improved algorithm clusters the links between the different block of the web pages by the k-average algorithm. The similarity between the corresponding categories and the topic is figured out. Then the nodes of the tree are numbered by the hierarchy traversal. The numbering of paths corresponding to the link is extracted. The operation is shown in Figure3. Seed set tunnel page page tunnel Web community A Web community B tunnel page page tunnel tunnel tunnel A0 A1 A2 A3 B0 B1 B2 B3 B4 B5 P0 P1 P2 P3 P4 P5 P6 P7 P8 801 Figure 3: The operation of improved algorithm In the figure3, the link17,link18 and link19 has the same path 1→3→5→8,and the link20,link21 and link22 has the same path 1→3→6→10.According to the clustering algorithm, the link17, link18 and link19 belongs to the same class and also the link20, link21 and link22 belongs to the same class. Step1: the web page information is translated into a document object model tree. The context of the anchor text in Shark-Search algorithm is replaced by the similarity to influence the potential_score, as shown in the following steps. Step2: put the links in web page2 to the queue by the retrieving sequence. Find the matching string which satisfies that the path between any two nodes is greater than or equals to 2.Take all the elements in the string out of the queue and put it into the corresponding class. Repeat the process until all the satisfied links are put it into the class. Step3: all the links for categorizing are expressed as L, the links that belong to category I are expressed as Gi, the number of current category is expressed as class_num. Define a mark expressed as flag. 1) Initialization. Some settings are listed below. },...,,{ 21 n uuuL (8) n GGG ,...,, 21 (9) 1_ numclass (10) 1flag (11) 2) When the set L is nonempty and flag=1, let flag=0. 3) Goes through each link ui in L, if there is the same route as ui and the value is greater than 1,put ui in the corresponding Gclass_num and class_num is plused one.set flag=0. 4) Repeat 2) until flag=0 or L is empty. Step4: according to the previous step, the number of the links that is contained in each class can be figure out as |Gi|.the total of the classes is expressed as cluster_url_num, so: )_max(__ numclassnumurlcluster (12) The score of class is expressed as class_score,so: numurlcluster urlscoreanchor scoreclass numurlcluster __ )(_ _ __ (13) html head body title div div hs hs ul ul li li li li li li a a a a a a 2 3 4 5 6 7 9 10 8 11 12 13 14 15 16 17 18 19 20 21 22 1