About Metrics for Clone Detection Electronic Communications of the EASST Volume 63 (2014) Proceedings of the Eighth International Workshop on Software Clones (IWSC 2014) About Metrics for Clone Detection — Position Paper — Thierry Lavoie, Ettore Merlo 5 pages Guest Editors: Nils Göde, Yoshiki Higo, Rainer Koschke Managing Editors: Tiziana Margaria, Julia Padberg, Gabriele Taentzer ECEASST Home Page: http://www.easst.org/eceasst/ ISSN 1863-2122 http://www.easst.org/eceasst/ ECEASST About Metrics for Clone Detection Thierry Lavoie1, Ettore Merlo1 1 thierry-m.lavoie@polymtl.ca, ettore.merlo@polymtl.ca Departement de genie informatique et logiciel Ecole Polytechnique de Montreal Montreal, Canada Abstract: This paper presents some results about some metrics and their possible impact on clone detectors. It also discusses some advantages of why we should use metrics instead of arbitrary measures. Keywords: Clones, Metrics 1 Introduction Clone detectors rely on the concept of similarity and distance measures to identify cloned frag- ments. The choice of a specific distance function in a clone detector is arbitrary up to some extent. However, with a deeper knowledge of similarity measures, we can condition this choice to have some properties that can help improve scalability and quality of tools. This paper presents some interesting results, insights and questions about similarity and distance measures, including a somehow counter-intuitive result on the cosine distance. For a comprehensive survey of many distances and metrics, the reader is invited to read [CC10]. This paper covers the following topics: • A link between the Jaccard measure on sets and the Manhattan distance in euclidean space • Limitations of the cosine distance on normalized vectors and approaches to overcome them • Unanswered questions on some similarity detection techniques As a convention in this paper, we set Rn+ = [0, ∞) n. 2 An Equivalence of the Jaccard Metric and the Manhattan Dis- tance Sometimes it is useful to link two known metrics to get a desired behavior or a better under- standing of one of the two. In our case, we link the Jaccard metric on sets with the Manhattan distance on the space Rn+. Our equivalence holds for arbitrary sets, but different representations in the space R⋉+ may be possible, which leads to different applications. We first prove the result, then we will explain how the result may be used. 1 / 5 Volume 63 (2014) mailto:thierry-m.lavoie@polymtl.ca mailto:ettore.merlo@polymtl.ca About Metrics for Clone Detection Recall the Jaccard measure on two finite sets U,V is defined as: J(U,V ) = |U ⋂ V | |U ⋃ V | and from this we define the Jaccard metric as: δ (U,V ) = 1 − |U ⋂ V | |U ⋃ V | = |U ⋃ V |−|U ⋂ V | |U ⋃ V | which is known to satisfy all the metric axioms (see section 3 for a brief recall). The Manhattan distance, noted l1, on two vectors 1 u, v from Rn+ is: l1(u, v) = n ∑ i=1 |ui − vi| Now, lets associate a unique integer 1 ≤ i ≤ |U ⋃ V | to every element µ in U and V . Now, choose n = |U ⋃ V | and take u, v ∈ Rn+ such as ui = 1 ↔ µi ∈ U otherwise ui = 0, and vi = 1 ↔ µi ∈ V otherwise vi = 0. From this, we draw: |U ⋃ V | = n ∑ i=1 max(ui, vi) |U ⋂ V | = n ∑ i=1 min(ui, vi) The min and max terms in the preceding equations are only equal if ui = vi, otherwise one of the two is ui and the other is vi and because |ui − vi| = |vi − ui| we must have: |U ⋃ V |−|U ⋂ V | = | n ∑ i=1 max(ui, vi) − n ∑ i=1 min(ui, vi)| = n ∑ i=1 |ui − vi| Replacing the last two equalities in the original Jaccard metric leads to: δ (U,V ) = |U ⋃ V |−|U ⋂ V | |U ⋃ V | = ∑ni=1 |ui − vi| ∑ni=1 max(ui, vi) with the numerator of the last term being the Manhattan distance between u and v. This proves the existence of an equivalence between a normalization of the Manhattan distance between certain vectors and the Jaccard similarity between sets. It is easy to generalize this result to allow any positive integer for ui and vi instead of 0, 1. We simply need to project multiple elements of the sets U and V onto a single coordinate i. The proof is then almost identical to the one presented here. Why is this result interesting ? In practice, clone detectors use a lot of similarity and distance measures on sets. Most of them are not metrics (like the Dice coefficient, the Tanimoto distance, 1 Actually, to define the Manhattan distance , we do not need the full power of a vector space but only requires the n-tuples in Rn+. However, it is now a custom to name everything in R n + a vector and we shall follow the custom. Proc. IWSC 2014 2 / 5 ECEASST etc.) and thus have a behavior less understood. Moreover, metrics lead to known opportunity of optimizations in search spaces that arbitrary distances do not offer [CPZ97]. Thus, even if the choice is ultimately arbitrary, there exist some arguments that favor metrics over arbitrary distances and knowing the link between some of them can help making a better choice. In this case, a simple distance between vectors gives us a useful interpretation as a distance between sets. 3 Properties of Some Angular Distances We start by proving a result on the sine function. Theorem 1 Let X be a subset of Rn+|x ∈ X → ||x|| = 1 . Let θx,y be the angle between any two x and y ∈ X . Finally, let δ : Rn+ × R n + −→ R be defined as δ (x, y) = sin θx,y. Then, δ is a metric on X . Proof. We need to prove that δ satisfies the four properties of a metric. (Non-negativity) δ (x, y) ≥ 0. This is true, since the all x, y have positive coordinates and the angle between such vectors must be in [0, π2 ]. (Nullity) δ (x, y) = 0 ↔ x = y. This is true, since the sine of an angle restricted to [0, π2 ] is 0 if and only if that angle is 0, and the angle between x and y is 0 if and only if x = y. (Symmetry) δ (x, y) = δ (y, x). This is true since the angle between x and y equals the angle between y and x. (Triangle inequality) δ (x, y) + δ (y, z) ≥ δ (x, z). The property holds, but it is tricky to prove. First, observe that if the angle between x and y or the angle between y and z is greater than the angle between x and z, then the property must hold since the sine is monotonically increasing in [0, π2 ]. It remains to prove that it holds if the bigger angle is between x and z. Now, clearly the sum of the angles θx,y and θy,z is greater or equal than θx,z. Because sine is monotonically increasing in the considered interval, we now have: sin(θx,y) + sin(θy,z) ≥ sin(θx,y + θy,z) ≥ sin(θx,z) We develop the first two members of this inequality: sin(θx,y) + sin(θy,z) ≥ sin(θx,y + θy,z) = sin(θx,y) cos(θy,z) + sin(θy,z) cos(θx,y) Subtracting the right member to the left leaves: sin(θx,y) − sin(θx,y) cos(θy,z) + sin(θy,z) − sin(θy,z) cos(θx,y) ≥ 0 Because the cosine of these angles lies in [0, 1], we have sin(θx,y) − sin(θx,y) cos(θy,z) ≥ 0 sin(θy,z) − sin(θy,z) cos(θx,y) ≥ 0 3 / 5 Volume 63 (2014) About Metrics for Clone Detection and we conclude that the sum of the two must be greater than or equal to 0. The inequality between the first two members of equation 1 holds, and because of our definition of the second member, the inequality between the last two members already held. Thus, we conclude that: sin(θx,y) + sin(θy,z) ≥ sin(θx,z) and the triangle inequality is satisfied. All four properties hold and we have secured the theorem. Why is this result interesting ? First, even if the cosine distance is a very popular similarity measure on vectors, it is not a metric. To preserve the non-linearity of the cosine function, this results actually states that you need its dual function, the sine, to get a metric, under certain restriction. The sine is not as straightforward to compute as the cosine on vectors, but it has the advantage of being a metric on the first quadrant of Rn. It is also worth questioning whether or not the non-linearity of the sine and cosine is a de- sirable property. The angle distance between two vectors is a metric (this fact is used in the above proof) and is linear on the arc distance between the vectors on a unit circle. It would be interesting to verify whether or not the cosine is actually better than the sine and the angular dis- tance considering our arguments. In general, it would be safe to compare a measure with closely related one to assess which is better even if it means a small additional computational cost. 4 Further Questions and Research: Implicit Distances, Why Should we Recover them ? The previous sections dealt with special cases of distances that can be easily converted to a metric in order to use all the properties and knowledge we can draw from the vast literature on the subject. All these observations are interesting to ponder in the context of clone detectors explicitly based on a similarity or a distance measure. However, many tools only use implicit distances that are not properly defined as a mapping between a pair or a cluster of objects onto the real line R. What should be the course of action for these tools ? As hard as it can be, it should be possible in practice to recover the distance function. The mapping might be hard to express, or too many parameters may interact together to produce a long and complex formula, but these are only additional reasons to support the need to formalize implicit or hidden distance functions. Complete understanding of clone detection technology is intertwined with our ability to encapsulate their behavior in mathematical formula: if this task proves to be tedious, what can we say of understanding their behavior or how can we even completely compare them ? The following questions are a starting point to help investigate the mathematical foundation of clone detection tools: • What distance function does the clone detector uses ? • How many parameters does the distance depends upon ? Are some redundant ? Are some useless ? Proc. IWSC 2014 4 / 5 ECEASST • If no mathematical formula seems possible to exist to encapsulate the distance used, why is it so ? • What key features does the tool use ? Are there other tools using those features ? Do those tools have a well-formulated distance ? • Is there an already existing distance that approximates the tool’s implicit function ? How good is this approximation ? Nevertheless, unanswered questions do not hinder performances and tools built around im- plicit distances can produce good results. To shed a deeper light on why they do have good results might however help the general understanding of clone detectors’ behavior. Acknowledgement The authors wish to thank Theotime Menguy for his quick review of the proof in section 3. They would also like to thank Mathieu Merineau for providing the excellent reference [CC10]. Bibliography [CC10] S. seok Choi, S. hyuk Cha. A survey of Binary similarity and distance measures. Jour- nal of Systemics, Cybernetics and Informatics, pp. 43–48, 2010. [CPZ97] P. Ciaccia, M. Patella, P. Zezula. M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. In Proc. of 23rd International Conference on Very Large Data Bases. Pp. 426–435. Morgan Kaufmann Publishers, 1997. 5 / 5 Volume 63 (2014) Introduction An Equivalence of the Jaccard Metric and the Manhattan Distance Properties of Some Angular Distances Further Questions and Research: Implicit Distances, Why Should we Recover them ?