Scalable and Accur ate Clones Detection B ased on M etr ics for Dependence Gr aph ¤ S e va k S . S a r g s ya n , S h a m il F. K u r m a n g a le e v, A r t io m V . B a lo ia n a n d H a yk K . A s la n ya n Laboratory of System Programming IT Educational and Research Centre Yerevan State University e-mail: sevaksargsyan@ispras.ru, kursh@ispras.ru, artyom.baloyan@ysu.am, hayk@ispras.ru Abstract The article describes a new method of code clones detection for C/C++ program- ming languages. The method is based on metrics for program dependence graph. For every node of program dependence graph a characteristic vector is constructed, which contains information about neighbors. These characteristic vectors are represented as sixty four bit integer numbers, which allows determining similarity between two nodes in amortized constant time. Due to this it is possible to analyze e®ectively projects with million lines of source code. The high accuracy of the determined clones was achieved by checking locations of source code for corresponding nodes. The paper also describes new approach for dependency graphs generation, which allows building them much faster than any of the existing methods. This method was compared with several widely used tools. It performs better both execution time and accuracy. Keywords: Metrics, Dependency graph, Scalable, Code clones, Compiler, Bit vector. 1 . In t r o d u c t io n S o ft wa r e d e ve lo p e r c a n r e u s e t h e s a m e p ie c e o f c o d e m a n y t im e s . It c a n b e d o n e wit h d ir e c t c o p y-p a s t e o r c o p y-p a s t a n d s m a ll m o d i¯ c a t io n s . R e u s e d c o d e c a n le a d t o m a n y s e m a n t ic e r r o r s . Fo r e xa m p le , s o ft wa r e d e ve lo p e r c a n fo r g e t t o r e n a m e s o m e va r ia b le a ft e r c o p y-p a s t e . Th e s o ft wa r e , wh ic h h a s m a n y c lo n e s p r o b a b ly will h a ve m a n y m is t a ke s a n d lo w qu a lit y. A c c o r d in g t o d i®e r e n t s t u d ie s [1 ,2 ] u p t o 2 0 p e r c e n t o f s o u r c e c o d e c a n b e a c lo n e in s o ft wa r e . Clo n e d e t e c t io n t o o ls a r e wid e ly u s e d d u r in g s o ft wa r e d e ve lo p m e n t t o a vo id m is t a ke s a n d im p r o ve c o d e qu a lit y. N u m b e r s o f a p p r o a c h e s we r e p r o vid e d [3 ] fo r c lo n e s d e t e c t io n , b u t t h e y h a ve s o m e r e s t r ic t io n s . S o m e o f t h e m c a n n o t d e t e c t a ll c lo n e t yp e s ( s e e 2 .) . Ot h e r s h a ve h ig h c o m p u t a t io n a l c o m p le xit y, wh ic h m a ke s t h e m u n u s a b le fo r la r g e s c a le p r o je c t s a n a lys is . Th e g o a l o f t h is p a p e r is t o in t r o d u c e a m e t r ic s -b a s e d s c a la b le a n d a c c u r a t e a lg o r it h m o f c o d e c lo n e s d e t e c t io n . Th e m e t r ic s a r e d e ¯ n e d fo r P r o g r a m D e p e n d e n c e Gr a p h ( P D G) . B e s id e s t h e a lg o r it h m , a n e w a p p r o a c h is p r e s e n t e d fo r d e p e n d e n c y g r a p h s g e n e r a t io n b a s e d ¤The paper is supported by RFBR grant 15-07-07541 5 4 Mathematical Problems of Computer Science 42, 54{62, 2014. S. Sargsyan, S. Kurmangaleev, A. Baloian and H. Aslanyan 5 5 o n L L V M c o m p ile r [4 ]. It a llo ws g e n e r a t in g t h e s e g r a p h s e ®e c t ive ly, wit h o u t d o u b le a n a lys e s o f s o u r c e c o d e . Fle xib le d e ¯ n it io n o f m e t r ic s a n d e ®e c t ive g e n e r a t io n o f P D Gs a llo w u s t o c r e a t e a s c a la b le a n d a c c u r a t e t o o l o f c o d e c lo n e s d e t e c t io n . 2 . B a c kg r o u n d Clone T ypes: Th e r e a r e t h r e e b a s ic t yp e s o f c lo n e s . Th e ¯ r s t t yp e is id e n t ic a l t o c o d e fr a g m e n t s e xc e p t t h e va r ia t io n s in wh it e s p a c e ( m a y b e a ls o va r ia t io n s in la yo u t ) a n d c o m - m e n t s (T 1). Th e s e c o n d t yp e is s t r u c t u r a lly/ s yn t a c t ic a lly id e n t ic a l t o c o d e fr a g m e n t s e xc e p t t h e va r ia t io n s in id e n t i¯ e r s , lit e r a ls , t yp e s , la yo u t a n d c o m m e n t s (T 2). Th e t h ir d t yp e is c o p ie d fr a g m e n t s o f c o d e wit h fu r t h e r m o d ī c a t io n s . S t a t e m e n t s c a n b e c h a n g e d , a d d e d o r r e m o ve d in a d d it io n t o va r ia t io n s in id e n t ī e r s , lit e r a ls , t yp e s , la yo u t a n d c o m m e n t s (T 3)[5 ]. Code clone detection appr oaches: Th e r e a r e ¯ ve b a s ic a p p r o a c h e s fo r c o d e c lo n e d e t e c t io n . 1 . Me t h o d s b a s e d o n t e xt u a l a p p r o a c h c o n s id e r t h e s o u r c e c o d e a s t e xt a n d t r y t o ¯ n d e qu a l s u b s t r in g s [6 ]. Th e s e s u b s t r in g s a r e c lo n e s . W h e n a ll c lo n e s a r e fo u n d , t h e c lo n e s wh ic h a r e lo c a t e d n e a r b y c a n b e c o m b in e d in t o o n e . B a s ic a lly (T 1) t yp e o f c lo n e s is d e t e r m in e d . 2 . In c a s e o f le xic a l a p p r o a c h t h e s o u r c e c o d e is p a r s e d t o s e qu e n c e o f t o ke n s . Th e n t h e lo n g e s t c o m m o n s u b s e qu e n c e is d e t e r m in e d . Th e r e a r e a fe w e ®e c t ive a lg o r it h m s b a s e d o n t h e p a r a m e t e r iz e d s u ± x t r e e fo r c lo n e d e t e c t io n [7 ]. On e m o r e in t e r e s t in g m e t h o d t r a n s fo r m s Ja va c o d e t o s o m e in t e r m e d ia t e r e p r e s e n t a t io n a n d c o m p a r e s t h e m in s t e a d o f o r ig in a l s o u r c e [8 ]. Th e s e t yp e s o f a lg o r it h m s c a n ¯ n d b a s ic a lly (T 1) a n d (T 2) c lo n e t yp e s . 3 . Th e n e xt is t h e s yn t a c t ic a p p r o a c h . Th e a lg o r it h m wo r ks o n A b s t r a c t S yn t a x Tr e e ( A S T) . In t h is c a s e t h e c lo n e s m a t c h e d s u b t r e e s o f A S T. S o m e a lg o r it h m s d ir e c t ly c o m p a r e t wo A S Ts t o ¯ n d c o m m o n s u b t r e e s [9 ]. A n o t h e r a lg o r it h m c o n s t r u c t s ve c t o r s fo r A S T s u b t r e e s a n d c o m p a r e s t h e m [1 0 ]. A lg o r it h m s b a s e d o n t h is a p p r o a c h ¯ n d a ll t h r e e t yp e s o f c lo n e s . 4 . Me t r ic s -b a s e d a lg o r it h m s a r e wid e ly u s e d fo r c lo n e d e t e c t io n . A lg o r it h m s b a s e d o n t h is m e t h o d , c o m p u t e n u m b e r o f m e t r ic s fo r c o d e fr a g m e n t a n d c o m p a r e t h e m . B a s ic a lly t h e s e m e t r ic s a r e c o m p u t e d fo r A S T a n d P r o g r a m D e p e n d e n c e Gr a p h ( P D G) [1 1 ]. A n o t h e r m e t h o d c lu s t e r s c o m p u t e d m e t r ic s b y u s in g n e u r a l n e t wo r ks [1 2 ]. Me t r ic s - b a s e d a lg o r it h m s h a ve b e t t e r p e r fo r m a n c e t h a n A S T o r P D G c o m p a r is o n a lg o r it h m s , b u t wit h lo w a c c u r a c y. 5 . Th e la s t is t h e s e m a n t ic a p p r o a c h . Th e s o u r c e c o d e is p a r s e d t o P D G. N o d e s o f P D G a r e in s t r u c t io n s o f p r o g r a m . E d g e s o f P D G a r e d e p e n d e n c e s b e t we e n t h e in s t r u c t io n s . A lg o r it h m s b a s e d o n P D G t r y t o ¯ n d m a xim a l is o m o r p h ic s u b g r a p h s fo r p a ir o f P D Gs [1 3 ,1 4 ]. A ll a lg o r it h m s a r e a p p r o xim a t e b e c a u s e t h e m a xim a l is o m o r p h ic s u b g r a p h s d e t e c t io n p r o b le m is N P h a r d . P D G-b a s e d m e t h o d s h a ve h ig h a c c u r a c y b u t lo w p e r - fo r m a n c e . 3 . A p p r o va l Te xt u a l a n d le xic a l a p p r o a c h e s a r e n o t ve r y e ®e c t ive fo r d e t e c t in g c lo n e s o f (T 3) t yp e . Tr e e - b a s e d m e t h o d s a r e m o r e e ®e c t ive fo r d e t e c t in g c lo n e s o f (T 1) a n d (T 2) t yp e s ; (T 3) t yp e 5 6 Scalable and Accurate Clones Detection Based on Metrics for Dependence Graph o f c lo n e s a r e d e t e c t e d wit h lo w a c c u r a c y, b e c a u s e t h e a d d e d o r d e le t e d in s t r u c t io n s s t r ic t ly c h a n g e t h e s t r u c t u r e o f A S T. A lg o r it h m s b a s e d o n s e m a n t ic a n a lys is h a ve h ig h c o m p u t a t io n a l c o m p le xit y, wh ic h m a ke s t h e m u n u s a b le fo r la r g e s o ft wa r e s ys t e m s a n a lys is . Me t r ic s -b a s e d a p p r o a c h e s h a ve lo w a c c u r a c y. Fo r qu a lit a t ive a n a lys is o f s o ft wa r e s ys t e m s , (T 3) a s we ll a s t h e o t h e r c lo n e s s h o u ld b e d e t e c t e d . S o , t h e r e a r e t wo s c e n a r io s fo r g e t t in g a ll c lo n e s wit h h ig h a c c u r a c y. On e o f t h e m is m e t r ic s -b a s e d a lg o r it h m s a c c u r a c y im p r o ve m e n t ( n o t e : m e t r ic s s h o u ld b e d e ¯ n e d fo r P D G) . A n d t h e s e c o n d o n e is t h e r e d u c t io n o f c o m p u t a t io n a l c o m p le xit y o f t h e s e m a n t ic a p p r o a c h [1 5 ]. Th is wo r k d e s c r ib e s a n e ®e c t ive m e t h o d fo r P D Gs g e n e r a t io n a n d a c c u r a t e a lg o r it h m o f c o d e c lo n e s d e t e c t io n b a s e d o n m e t r ic s . W id e ly u s e d a lg o r it h m s a r e b a s e d o n m e t r ic s wo r k fo r A S T [9 ] o r fo r s o m e s p e c ia liz e d P D G [1 6 ]. A s wa s d is c u s s e d , A S T-b a s e d m e t h o d s d e t e c t (T 3) c lo n e s wit h lo w a c c u r a c y. Ot h e r m e t h o d s m o d ify P D Gs t o a c h ie ve h ig h p e r fo r m a n c e o r m a ke d e ¯ n e d m e t r ic s s t a b le fo r va r ia t io n s in c o d e fr a g m e n t s . It d o e s n o t s o lve t h e p r o b le m o f a c c u r a t e d e t e c t io n fo r (T 3) c lo n e s . S o m e o f t h e m [1 7 ] b r in g g r a p h is o m o r p h is m p r o b le m t o t r e e s im ila r it y t a s k, wh ic h c a n t h e d e c r e a s e a c c u r a c y o f t h e d e t e c t e d c lo n e s . Ou r m e t h o d c o m p u t e s m e t r ic s fo r n o d e s o f P D G, a n d c o m p a r e s t h e m . D u e t o t h e ° e xib le d e ¯ n it io n o f t h e m e t r ic , t h e a d d e d o r r e m o ve d in s t r u c t io n s h a ve a s m a ll im p a c t o n n o d e s , wh ic h a llo ws t o d e t e c t a ll t h r e e t yp e s o f c lo n e s wit h h ig h a c c u r a c y. Ge n e r a t io n o f P D Gs h a s h ig h c o m p u t a t io n a l c o m p le xit y wh ic h r e qu ir e s m u c h t im e . To r e d u c e it s c o s t we h a ve p u r p o s e d L L V M-b a s e d m o d e l o f t h e s e g r a p h s g e n e r a t io n . In t h is m o d e l P D Gs a r e c o n s t r u c t e d d u r in g t h e c o m p ila t io n o f t h e p r o je c t . Ot h e r m e t h o d s u s e t h e ir o wn p a r s e r fo r P D G c o n s t r u c t io n . It h a s a n u m b e r o f d is a d va n t a g e s . S o u r c e c o d e s h o u ld b e a n a lyz e d a n d p a r s e d s e p a r a t e ly. D e p e n d e n c e s b e t we e n t h e c o m p ila t io n m o d u le s s h o u ld b e p r o p e r ly p r o c e s s e d . Th e la r g e p r o je c t s a r e n o t p o s s ib le wit h o u t t h e u s e o f M ake¯le. L L V M b u ilt -in g e n e r a t o r o f P D Gs a llo ws u s in g M ake¯le o f t h e p r o je c t a n d a n a lyz in g it o n ly o n c e , d u r in g it s c o m p ila t io n . Co m p a r in g wit h o t h e r s c a la b le P D G-b a s e d t o o ls [1 7 ] o u r m e t h o d a llo ws t o b u ild t h e s e g r a p h s m u c h fa s t e r . 4 . D e p e n d e n c y Gr a p h s Ge n e r a t io n P D Gs fo r t h e p r o je c t a r e g e n e r a t e d b y a s e p a r a t e p a s s o f L L V M ( s e e Fig .1 ) . Fig. 1. LLVM-based PDG generation. S. Sargsyan, S. Kurmangaleev, A. Baloian and H. Aslanyan 57 It has several advantages. Graphs are generated during the compilation time of the project. It allows to effectively construct graphs for large scale projects (up to million lines of source code). Vertices of PDG graph are LLVM bitcode [4] instructions. Edges are obtained based on LLVM use-def [4], alias and control flow analyses. Those vertices which have no edges are removed. Then the optimized PDGs are stored to a file. The stored graphs are loaded from files to memory before running the clone detection algorithm. 5. Algorithm Bit vector (BV) of PDG’s node is a vector the length of which is equal to 2 ∗ N , where N is the number of all possible different types of nodes. Assume that each type of node is labeled with a number from 1 to N (these labels are instruction types). The bit vector for node is initialized in the following way. For every i = 1, ..., N , i−th position of vector is assigned to 1 if the corresponding node has an incoming edge from the node, which has the label i. Analogically, for every j = N + 1, ..., 2 ∗ N , j−th position of vector is assigned to 1 if the corresponding node has an outgoing edge to some node labeled as j − N (see Fig. 2). Other positions are assigned to 0. µ´ ¶³ 3 µ´ ¶³ 1 µ´ ¶³ 2 µ´ ¶³ 4 ¡ ¡¡ª ? @ @@R 1 0 1 0 0 0 0 1- Fig. 2. Bit vector’s representation. Definition 1: For two vectors V 1 and V 2 which have a length N , V 1&V 2 is a new vector V with a length N where V [i] = V 1[i]&V 2[i] (1&1 = 1, otherwise 0), 1 ≤ i ≤ N . Definition 2: For two vectors V 1 and V 2 which have a length N , V 1|V 2 is a new vec- tor V with a length N where V [i] = V 1[i]|V 2[i] (0|0 = 0, otherwise 1), 1 ≤ i ≤ N . Definition 3: For two vectors V 1 and V 2 with equal lengths, and with elements 0 or 1, andC(V 1, V 2) is a number of 1 in V 1&V 2 vector. Definition 4: For two vectors V 1 and V 2 with equal lengths, and with elements 0 or 1, orC(V 1, V 2) is a number of 1 in V 1|V 2 vector. Definition 5: The similarity for two vectors V 1 and V 2 with equal lengths, and with ele- ments 0 or 1 is a number 1 − (orC(V 1,V 2)−andC(V 1,V 2)) (1+orC(V 1,V 2)) . Definition 6: Density for set of nodes of PDG is |S| (max(S)−min(S)) where S is a set of nodes of PDG sorted by the corresponding source code lines of these nodes, max(S) is a correspond- ing source code line for maximal node of S, min(S) is a corresponding source code line for minimal node of S. 58 Scalable and Accurate Clones Detection Based on Metrics for Dependence Graph The first stage of algorithm constructs two sets of similar nodes based on BV(bit vector) for the corresponding graphs. Every PDG node has information about the source code line from which it was constructed. The second stage of algorithm eliminates the nodes from the constructed sets. Node is removed if its corresponding source code line is located far from the other nodes source code lines. Description of the Metrics-Based Code Clone Detector (MBCCD) algorithm: 1. Input: G1 and G2 PDG graphs, S similarity level and CL minimal length of clone. 2. Construct C1 and C2 multi maps (key is BV, value is PDG node) of BV (bit vector) for G1 and G2 nodes. 3. Any element n1 from C1 is removed, if there is no such n2 from C2, that similarity of the corresponding BVs of n1 and n2 is higher than S. The same is done for C2. 4. Construct S1 and S2 sets of PDG’s nodes for the corresponding C1 and C2. Sort the sets by the corresponding lines numbers of source code for nodes. 5. Consider the first element is f e from S1 and the last element is le from S1. If Density(S1 \ {f e}) > Density(S1 \ {le}) then remove le from S1, otherwise remove f e. Repeat the process until Density(S1) becomes higher than S. The same is done for S2. 6. If S1 and S2 have more elements than CL, this pair of sets as clones are printed. 6. Complexity A few optimizations were applied to achieve high performance and make this algorithm scalable for analyzed million lines of source code. Programming languages have some limited set of operations. Usually it is less than 60. It allows representing the above described BV as a 64-bit integer number. Consequently, the complexity of two BV comparisons is constant. For every BV of the first PDG, MBCCD algorithm examines all BVs of the second PDG, which requires N 1∗N 2 steps, where N 1 is a number of nodes for the first graph and N 2 is a number of nodes for the second graph. Complexity of the MBCCD algorithm is O(N 1∗N 2). 7. Results Comparison: The described method was compared with two widely used tools. The first one is MOSS [18]. It has been developed for detecting a plagiarism in programming classes (Stanford University). The second one is CloneDR [19]. It was developed by Semantic Designs Company, which provides different tools for software design and analysis. Test suites are described in the Table 1. The first test (Original Code) was modified in different ways to obtain all three types of clones. Article [5] contains more details for all tests. Theoretically all files are clones, because they were obtained by modification of one test. Clone detection tool with high accuracy should determine as much clones as possible. S. Sargsyan, S. Kurmangaleev, A. Baloian and H. Aslanyan 59 Table 1. Test suites. copy00.cpp: Original Code. 1: void foo(float sum, float prod) { 2: float result = sum + prod; 3: } 4: void sumProd(int n) { 5: float sum = 0.0; //C1 6: float prod = 1.0; 7: for (int i = 1; i<=n; i++) { 8: sum = sum + i; 9: prod = prod * i; 10: foo(sum, prod); 11: } 12: } copy01.cpp: spaces are changed with tabs in line 8 and 9. copy02.cpp: comments are added in line 6 and 9. copy03.cpp: variables ”sum” and ”prod” are renamed to ”s” and ”p”. copy04.cpp: arguments of ”foo” are exchanged, line 10. copy05.cpp: type of ”sum” and ”prod” are changed into int, line 5 and 6. copy06.cpp: i is exchanged into (i*i), line 8 and 9. copy07.cpp: lines 5 and 6 are exchanged. copy08.cpp: lines 8 and 9 are exchanged. copy09.cpp: lines 9 and 10 are exchanged. copy10.cpp: ”for” replace with ”while”. copy11.cpp: condition (if(i%2)) is added after line 7. copy12.cpp: instruction in line 9 is deleted. copy13.cpp: condition (if(i%2)) is added after line 9. copy14.cpp: default value is added for the second argument for ”foo” function. copy15.cpp: extra argument is added for ”foo” function. Table 2 shows results of comparison for MOSS, CloneDR and MBCCD algorithms. Table 2. The results of comparison: ”yes” - clone is found, ”no” - clone is not found. Test Name MOSS CloneDR MBCCD copy00.cpp yes yes yes copy01.cpp yes yes yes copy02.cpp yes yes yes copy03.cpp yes yes yes copy04.cpp yes yes yes copy05.cpp yes yes yes copy06.cpp yes yes yes copy07.cpp yes yes yes copy08.cpp no no yes copy09.cpp no yes yes copy10.cpp no yes yes copy11.cpp no no yes copy12.cpp yes yes no copy13.cpp yes yes yes copy14.cpp yes yes yes copy15.cpp yes yes yes Run time: The MBCCD was applied to a number of widely used libraries and software systems. Tests were run on Intel Core i3 CPU 540, 8GB RAM. Table 3 shows basic results of clone detection with minimal length equal to fifteen and similarity higher than 85%. 60 Scalable and Accurate Clones Detection Based on Metrics for Dependence Graph Table 3. Test results of libraries: openssl, llvm/clang and firefox. Test Name Lines PDGs PDG gen. time Time clones False openssl 280.000 1800 43 3 16 0 llvm/clang 1.300.000 19946 396 1876 94 4 firefox 3.800.000 61643 465 2489 18 0 The second column contains the number of source code lines written in C/C++ pro- gramming languages. The third column contains the number of PDGs for the project. The fourth column contains PDGs generation time. The fifth column contains the detection time in seconds. The sixth column contains the number of detected clones. The last column presents the results of manual analysis. We have tried to run MOSS and CloneDR on the same tests, but these tests were analyzed partially (MOSS and CloneDR are not able to process dependences between the compilation modules properly). Even for these partially analyzed pieces of code MBCCD is faster and more accurate. Table 4. illustrates one of the detected clones for the openssl library. Table 4. Illustration of clones. openssl-1.0.1g/crypto/cast/c enc.c openssl-1.0.1g/crypto/bf/bf enc.c 141: for (l-=8; l>=0; l-=8) 237: for (l-=8; l>=0; l-=8) 142: { 238: { 143: n2l(in,tin0); 239: n2l(in,tin0); 144: n2l(in,tin1); 240: n2l(in,tin1); 145: tin0∧=tout0; 241: tin0∧=tout0; 146: tin1∧=tout1; 242: tin1∧=tout1; 147: tin[0]=tin0; 243: tin[0]=tin0; 148: tin[1]=tin1; 244: tin[1]=tin1; 149: CAST encrypt(tin,ks); 245: BF encrypt(tin,schedule); 150: tout0=tin[0]; 246: tout0=tin[0]; 151: tout1=tin[1]; 247: tout1=tin[1]; 152: l2n(tout0,out); 248: l2n(tout0,out); 153: l2n(tout1,out); 249: l2n(tout1,out); 154: } 250: } 8. Conclusion We have proposed a metrics-based method of code clones detection, which is capable to analyze million lines of source code. (T3) as well as other types of clones are detected. The method is used as built-in instrument for LLVM. LLVM bitcode-based construction of PDGs allows building them much faster than any existed method. This tool can analyze and compare source code quality. Semantic mistakes arising during the software development process can be detected by the compiler in early stages. It can also be used for automatic refactoring. References [1] B. Baker, “On finding duplication and near-duplication in large software systems”, Pro- ceedings of the 2nd Working Conference on Reverse Engineering, WCRE 1995, pp. 86-95, 1995. S. Sargsyan, S. Kurmangaleev, A. Baloian and H. Aslanyan 61 [2] C. K. Roy and J. R. Cordy, “An empirical study of function clones in open source software systems”, Proceedings of the 15th Working Conference on Reverse Engineering, WCRE 2008, pp. 81-90, 2008. [3] D. Rattana, R. Bhatiab and M. Singhc, “Software clone detection”, A systematic review, Information and Software Technology, vol. 55, no. 7, pp. 1165-1199, 2013. [4] [Online]. Available: http://llvm.org [5] Ch. K. Roya , J. R. Cordya and R. Koschkeb, “Comparison and evaluation of code clone detection techniques and tools”, A qualitative approach, Science of Computer Program- ming, vol.74, no. 7, pp. 470-495, 2009. [6] S. Ducasse, M. Rieger and S. Demeyer, “A language independent approach for detecting duplicated code”, Proceedings of the 15th International Conference on Software Mainte- nance, (ICSM’99), Oxford, England, UK, pp. 109-119, 1999. [7] T.Kamiya, S.Kusumoto and K.Inoue, “CCFinder: A multilinguistic token-based code clone detection system for large scale source code”, IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 654-670, 2002. [8] R. Kaur and S. Singh, “Clone detection in software source code using operational sim- ilarity of statements”, ACM SIGSOFT Software Engineering Notes, vol. 39, no. 3, pp. 1-5, 2014. [9] I. Baxter, A. Yahin, L. Moura and M. Anna, “Clone detection using abstract syntax trees”, Proceedings of the 14th IEEE International Conference on Software Maintenance, IEEE Computer Society, pp. 368-377, 1998. [10] L.Jiang, G.Misherghi, Z.Su and S.Glondu, “DECKARD : Scalable and accurate tree- based detection of code clones”, Proceedings of the 29th International Conference on Software Engineering, (ICSE07), IEEE Computer Society, pp. 96-105, 2007. [11] J. Mayrand, C. Leblanc and E. Merlo, “Experiment on the automatic detection of function clones in a software system using metrics”, Proceedings of the 12th International Conference on Software Maintenance, (ICSM96), Monterey, CA, USA, pp. 244-253, 1996. [12] N. Davey, P. Barson, S. Field and R. Frank, “The development of a software clone detector”, International Journal of Applied Software Technology, vol. 1, no. 3/4, pp. 219-236, 1995. [13] R.Komondoor and S.Horwitz, “Using slicing to identify duplication in source code”, Proceedings of the 8th International Symposium on Static Analysis, pp. 40-56, 2001. [14] J. Krinke, “Identifying similar code with program dependence graphs”, Proceedings of the 8th Working Conference on Reverse Engineering, (WCRE 2001), pp. 301-309, 2001. [15] S. Gupta and P. C. Gupta, “Literature Survey of Clone Detection Techniques”, Inter- national Journal of Computer Applications, vol. 99, no. 3, pp. 41-44, 2014. [16] Y. Higo and S. Kusumoto, “Code clone detection on specialized PDGs with heuristics”, Proceedings of the 15th European Conference on Software Maintenance and Reengineering (CSMR11), Oldenburg, Germany, pp.75-84, 2011. [17] M. Gabel, L. Jiang and Z. Su, “Scalable detection of semantic clones”, Proceedings of 30th International Conference on Software Engineering (ICSE08), Leipzig, Germany, pp. 321-330, 2008. [18] [Online]. Available: http://theory.stanford.edu/∼ aiken/moss/ [19] [Online]. Available: http://www.semdesigns.com/products/clone/ 6 2 Scalable and Accurate Clones Detection Based on Metrics for Dependence Graph Submitted 06.09.2014, accepted 27.11.2014. Ìñ³·ñÇ Ï³Ëí³ÍáõÃÛ³Ý ·ñ³ýÇ íñ³ ë³ÑÙ³Ýí³Í Ù»ïñÇϳݻñáí Ïá¹Ç ÏÉáÝÝ»ñÇ ÁݹɳÛÝ»ÉÇ ¨ ×ß·ñÇï áñáÝáõÙ ê. ê³ñ·ëÛ³Ý, Þ. Îáõñٳݷ³É»¨, ². ´³ÉáÛ³Ý ¨ Ð. ²ëɳÝÛ³Ý ²Ù÷á÷áõÙ Ðá¹í³ÍáõÙ Ýϳñ³·ñí³Í ¿ Ïá¹Ç ÏÉáÝÝ»ñÇ áñáÝÙ³Ý Ýáñ Ù»Ãá¹ C/C++ Íñ³·ñ³íáñÙ³Ý É»½áõÝ»ñÇ Ñ³Ù³ñ: ²ÛÝ ÑÇÙÝí³Í ¿ Íñ³·ñÇ Ï³Ëí³ÍáõÃÛ³Ý ·ñ³ýÇ Ñ³Ù³ñ ë³ÑÙ³Ýí³Í Ù»ïñÇϳݻñÇ íñ³: γËí³ÍáõÃÛ³Ý ·ñ³ýÇ ³Ù»Ý ÙÇ ·³·³ÃÇ Ñ³Ù³ñ ë³ÑÙ³ÝíáõÙ ¿ µÝáõó·ñÇã í»Ïïáñ, áñÁ å³ñáõݳÏáõÙ ¿ ïíÛ³ÉÝ»ñ` ѳñ¨³Ý ·³·³ÃÝ»ñÇ Ù³ëÇÝ: ÞÝáñÑÇí Ýñ³Ý, áñ µÝáõó·ñÇã í»ÏïáñÝ»ñÁ Ý»ñϳ۳óíáõÙ »Ý áñå»ë í³ÃëáõÝãáñë µÇà å³ñáõݳÏáÕ ³ÙµáÕç³ïÇå Ãí»ñ, Ñݳñ³íáñ ¿ »ñÏáõ ·³·³ÃÝ»ñÇ ÝÙ³ÝáõÃÛáõÝÁ å³ñ½»É ѳëï³ïáõÝ Å³Ù³Ý³ÏáõÙ: ¸³ ÃáõÛÉ ¿ ï³ÉÇë ³ñ¹Ûáõݳí»ïáñ»Ý ѻﳽáï»É ÙÇÉÇáݳíáñ ïáÕ»ñ å³ñáõݳÏáÕ Íñ³·ñ»ñ: ¶ïÝí³Í ÏÉáÝÝ»ñÇ ×ß·ñïáõÃÛáõÝÁ ³å³ÑáííáõÙ ¿ ѳٳå³ï³ëË³Ý ·³·³ÃÝ»ñÇ ïáÕ»ñÇ ¹³ë³íáñí³ÍáõÃÛ³Ý ëïáõ·Ù³Ùµ: Ðá¹í³ÍáõÙ Ýϳñ³·ñí³Í ¿ ݳ¨ ϳËí³ÍáõÃÛ³Ý ·ñ³ýÇ ëï³óÙ³Ý Ýáñ Ùáï»óáõÙ, áñÁ ÃáõÛÉ ¿ ï³ÉÇë ëï³Ý³É ³Û¹ ·ñ³ýÝ»ñÁ ß³ï ³í»ÉÇ ³ñ³·, ù³Ý ·áÛáõÃÛáõÝ áõÝ»óáÕ Ù»Ãá¹Ý»ñÁ: Üϳñ³·ñí³Í Ù»Ãá¹Á ѳٻٳïí»É ¿ ÙÇ ß³ñù ɳÛÝ ï³ñ³ÍáõÙ ·ï³Í ³ÛÉ Ù»Ãá¹Ý»ñÇ Ñ»ï: ²ñ¹ÛáõÝùÝ»ñÁ óáõÛó »Ý ï³ÉÇë, áñ ³Ûë Ù»Ãá¹Á ³ß˳ïáõÙ ¿ ³í»ÉÇ ³ñ³· ¨ ×ß·ñÇï: Ìàñøòàáèðóåìûé è òî÷íûé èíñòðóìåíò ïîèñêà êëîíîâ êîäà íà îñíîâå ìåòðèê äëÿ ãðàôà çàâèñèìîñòåé ïðîãðàììû Ñ. Ñàðãñÿí, Ø. Êóðìàíãàëååâ, À. Áàëîÿí è À. Àñëàíÿí Àííîòàöèÿ Ñòàòüÿ îïèñûâàåò íîâûé ìåòîä ïîèñêà êëîíîâ êîäà äëÿ ÿçûêîâ C/C++. Îí îñíîâàí íà ìåòðèêàõ äëÿ ãðàôà çàâèñèìîñòåé ïðîãðàììû. Äëÿ êàæäîé âåðøèíû ãðàôà ñòðîèòñÿ õàðàêòåðèñòè÷åñêèé âåêòîð, êîòîðûé ñîäåðæèò èíôîðìàöèþ î ñîñåäíèõ âåðøèíàõ. Ýòè âåêòîðû ïðåäñòàâëåíû â âèäå øåñòèäåñÿòè ÷åòûðåõ áèòíûõ öåëûõ ÷èñåë, è ýòî ïîçâîëÿåò îïðåäåëèòü ñõîäñòâî ìåæäó äâóìÿ âåðøèíàìè â ïîñòîÿííîå âðåìÿ. Ýòî ïîçâîëèëî ýôôåêòèâíî àíàëèçèðîâàòü ìèëëèîíû ñòðîê èñõîäíîãî êîäà. Âûñîêàÿ òî÷íîñòü íàéäåííûõ êëîíîâ áûëà äîñòèãíóòà ïóòåì ïðîâåðêè ìåñòîïîëîæåíèÿ èñõîäíîãî êîäà ñîîòâåòñòâóþùèõ âåðøèí.  ñòàòüå òàêæå ïðåäëàãàåòñÿ íîâûé ïîäõîä äëÿ ãåíåðàöèè ãðàôîâ çàâèñèìîñòåé, ÷òî ïîçâîëÿåò ïîëó÷èòü èõ ãîðàçäî áûñòðåå, ÷åì ëþáîé ñóùåñòâóþùèé ìåòîä. Ïðåäëîæåííûé ìåòîä áûë ñðàâíåí ñ íåñêîëüêèìè øèðîêî èñïîëüçóåìûìè ìåòîäàìè. Ðåçóëüòàòû ïîêàçàëè, ÷òî ýòîò ìåòîä ðàáîòàåò áûñòðåå è òî÷íåå. 1_BAL.pdf (p.1-3) 2_Bal.pdf (p.4-8) 3_Bal.pdf (p.9)