Formal verification of a theory of packages Electronic Communications of the EASST Volume 48 (2011) Proceedings of the Fifth International Workshop on Foundations and Techniques for Open source Software Certification (OpernCert 2011) Formal verification of a theory of packages Jaap Boender 9 pages Guest Editors: Luis Soares Barbosa, Dimitrios Settas Managing Editors: Tiziana Margaria, Julia Padberg, Gabriele Taentzer ECEASST Home Page: http://www.easst.org/eceasst/ ISSN 1863-2122 http://www.easst.org/eceasst/ ECEASST Formal verification of a theory of packages Jaap Boender1∗ 1 boender@pps.jussieu.fr Univ Paris Diderot, Sorbonne Paris Cité, PPS, UMR 7126 CNRS, F-75205 Paris, France Abstract: Over the years, open source distributions have become increasingly large and complex—as an example, the latest Debian distribution contains almost 30 000 packages. Consequently, the tools that deal with these distribution have also become more and more complex. Furthermore, to deal with increasing distribution sizes optimisation has become more important as well. To make sure that correctness is not sacrificed for complexity and optimisation, it is important to verify the underlying assumptions formally. In this paper, we present an example of such a verification: a formalisation in Coq of a theory of packages and their interdependencies. Keywords: verification, theorem proving, open source distributions 1 Introduction During the last decade, open source distributions have become more and more popular, and more and more complex. Where the first release of Debian only had 128 packages, the latest is well on its way to 30 000. In order to deal with these complexities, the MANCOOSI project has developed new tools and methods to help distribution editors gain insight into the structure of their distributions, to more easily discover errors and to help with administrative procedures (such as migrating packages from unstable to stable distributions). Since distributions are so large and complex, it is very important that these tools be as fast as possible, without sacrificing correctness. We have therefore used the Coq theorem prover to formally verify some of the assumptions used for optimising the MANCOOSI tools. In the rest of this article, we will present a brief overview of the formalisation and explain the design decisions taken. 2 Formalisation In this section, we will discuss the formalisation of the theory of packages as described in [MBD+06], [ADBZ09] and [DB10]. We shall not repeat the definitions presented there unless absolutely necessary, so that we can concentrate on the formalisation itself. ∗ Partially supported by the European Community’s 7th Framework Programme (FP7/2007-2013), grant agreement n◦214898, “Mancoosi” project. Work developed at IRILL. 1 / 9 Volume 48 (2011) mailto:boender@pps.jussieu.fr Formal verification of a theory of packages The Coq sources of the formalisation are available through the Web at http://www.pps.jussieu. fr/∼boender/package-theory.tar.gz 2.1 Definitions The formalisation has been done in the Coq proof assistant, which comes with an extensive library. The first choice to be made is of how to represent packages and repositories. In the original definition, packages are represented as a pair (u,v), where u is the name and v ∈ N the version. For the formalisation, we have decided to treat packages as atomic entities, and forget the version number completely. This is possible because the theory we want to formalise never uses the specific fact that a package can exist in multiple versions. In fact, the only point where this is used in the real world is in the Debian distribution, where there is an implicit conflict between two packages that have the same name but a different version (in RPM, this is not the case). This specific instance can easily be modelled by adding the implicit conflict ourselves where necessary. A repository, then, is a set of packages. There are at least three ways in Coq to represent sets: the ListSet library, which uses linked lists, the Ensemble library, which uses characteristic functions, and the MSet library. We have chosen the latter, because it is very modular, flexible, and comes with a large set of theorems. We do not need the complexity of the Ensemble library, since package repositories are finite. The modularity of MSet comes to the fore when looking at the definition of a package. The only thing we need to do is to define a package module type. This type is defined as a refine- ment of the built-in OrderedType, with a few standard axioms defining the equality and order relations, as well as a comparison function: Module Type PACKAGE <: OrderedType. Parameter t: Set. Parameter eq: t -> t -> Prop. Parameter lt: t -> t -> Prop. Axiom eq refl: forall x: t, eq x x. Axiom eq sym: forall x y: t, eq x y -> eq y x. Axiom eq trans: forall x y z: t, eq x y -> eq y z -> eq x z. Instance eq equiv : Equivalence eq := Build Equivalence t eq eq refl eq sym eq trans. Axiom lt trans: forall x y z: t, lt x y -> lt y z -> lt x z. Axiom lt not eq: forall x y: t, lt x y -> ˜ eq x y. Axiom lt strorder: StrictOrder lt. Axiom lt compat : Proper (eq==>eq==>iff) lt. Parameter compare: t -> t -> comparison. Parameter compare spec : forall s s’, CompSpec eq lt s s’ (compare s s’). Parameter eq dec: forall x y, { eq x y } + { ˜ eq x y }. End PACKAGE. Proc. OpenCert 2011 2 / 9 http://www.pps.jussieu.fr/~boender/package-theory.tar.gz http://www.pps.jussieu.fr/~boender/package-theory.tar.gz ECEASST Declare Module Package: PACKAGE. The declaration of a set of packages (i.e. a repository) now can be done very easily: Declare Module PackageSet : MSetInterface.Sets with Module E := Package. All theorems about sets that are part of MSet are now defined for package sets. The modularity of MSet also makes it very easy to define conflicts. A conflict is defined as a pair of two packages, such that conflicts are symmetrical, and a package never conflicts with itself. We can define this using the built-in type PairOrderedType. This is a functor that, given two OrderedTypes, returns the type of a pair of these two OrderedTypes. Module Conflict := PairOrderedType Package Package. Declare Module ConflictSet : MSetInterface.Sets with Module E := Conflict. Axiom conflicts sym: forall (C: ConflictSet.t) (p q: Package.t), ConflictSet.In (p, q) C -> ConflictSet.In (q, p) C. Axiom conflicts irrefl: forall (C: ConflictSet.t) (p: Package.t), ˜ ConflictSet.In (p, p) C. In only four lines, we have now defined conflicts, sets of conflicts, and their properties. We now have the basic definitions of packages, repositories and conflicts. In the following sections, we will use these definitions to formalise the rest of the theory of packages. 2.2 Dependencies All definitions pertaining to dependencies have been put in a separate module. This is not only for ease of reference, but it also allows us to use the dependency module as a parameter later on. This is especially interesting when we start formalising the dependency cone (the transitive closure of the dependency function, i.e. all packages that could possibly be needed to install another package); by making this independent of the dependency function, we can easily specify different cones: the normal dependency cone, but also the cone that includes only conjunctive dependencies (dependencies that can only be satisfied by a single package), or even the reverse dependency cone (the cone of packages that depend on a given package). The dependency function itself is modelled as a Coq function Dependencies : Package.t→listPackageSet.t. The dependencies of a package are spe- cified as a list of alternatives; every alternative is a set of packages, one of which has to be installed for the package itself to be installed. We have added a ‘filter’ to the formalisation of dependencies, a function dep filter : PackageSet.t→bool. This function allows us to consider only select depend- encies, for example conjunctive ones (a dependency is conjunctive iff its set is a singleton). An example of how this works can be seen in subsection 2.4. By default, the dep filter function simply returns true for any argument, so that all dependencies are taken into account. The predicate direct dependency that notes whether a package q is a direct dependency of p (that is to say directly mentioned in the dependencies of p): 3 / 9 Volume 48 (2011) Formal verification of a theory of packages Definition direct dependency (p: Package.t) (q: Package.t) := exists d: PackageSet.t, In q d ∧ List.In d (List.filter dep filter (Dependencies p)). We use the direct dependency predicate to define the dependency relation; a package p depends on a package q if it is possible to trace a path of successive direct dependencies from p to q. Function dependency path (p q: Package.t) (l: list Package.t) { struct l }: Prop := match l with | nil => direct dependency p q | h::t => direct dependency p h ∧ dependency path h q t end. Definition dependency (p q: Package.t): Prop := exists l: list (Package.t), dependency path p q l. The dependency path function is recursive on l; we need to specify this explicitly so that Coq can be sure that the function terminates (the recursive call must be on a subterm of the original argument). Since the dep filter and Dependencies functions are defined as parameters in the dependency module, every function outside the dependency module that calls them will have them added as parameter. This allows us to define functions on other dependency relations, as we will see in subsection 2.4. 2.3 The dependency cone The dependency cone is defined as the transitive closure of the aforementioned dependency func- tion. It is formalised as follows: Function cone (P: {x: PackageSet.t | x [<=] R}) {measure (fun x => cardinal R - cardinal (proj1 sig P)) P}: PackageSet.t := if equal (inter R (dependencies (proj1 sig P))) (proj1 sig P) then (proj1 sig P) else cone (exist (fun v => v [<=] R) (inter R (dependencies (proj1 sig P))) (fun a => inter subset 1 (s:=R) (s’:=dependencies (proj1 sig P)) (a:=a))). This definition basically states that the cone of a set P is the fixed point of the repeated applic- ation of the dependency function. There is a lot of extra information, because Coq needs to be sure that the function actually terminates. To start with, the argument P is not a simple package set, but a subset of a repository R (as stated in the dependent type). This is used for the termination: the measure keyword states that the number |R|−|dependencies P| strictly decreases. Coq now generates the necessary proof obligations for this statement automatically. The only disadvantage here is that we can no longer use P by itself, since it is not a simple set anymore, but a set with a specific property, and a witness that proves that the set actually has that property. To extract the set, we use the function proj1 sig; there is also proj2 sig if we need to extract the witness. Proc. OpenCert 2011 4 / 9 ECEASST Note that for the recursive application of the cone function, we have to provide a proof that the recursive argument is also a subset of R (to satisfy the dependent type); we do this using the built-in MSet theorem inter subset 1 theorem which states that ∀s,s′[s∩s′ ⊆ s]. We can also see that the cone function calls the dependencies function; if we want to have a cone of another type dependencies, we can specify another dependency function, for example conjunctive dependencies for a cone of only conjunctive dependencies. 2.4 Strong dependencies In [ADBZ09], we introduced the notion of strong dependencies; a package p strongly depends on another package q if every installation of p also contains q. However, if p itself is not installable, it is discounted, because then ‘every installation of p’ would be the empty set, and so p would strongly depend on every other package in the distribution. In Coq, this is formalised as follows: Definition satisfied pkg (S: PackageSet.t) (p: Package.t): Prop := forall d: PackageSet.t, List.In d (Dependencies p) -> exists p’: Package.t, In p’ (inter S d). Definition abundant (S: PackageSet.t): Prop := PackageSet.For all (satisfied pkg S) S. Definition concerned (S: PackageSet.t) (c: Package.t * Package.t): Prop := let (p, q) := c in (In p S) ∧ (In q S). Definition peaceful (S: PackageSet.t) (C: ConflictSet.t): Prop := ConflictSet.For all (fun c => ˜(concerned S c)) C. Definition healthy (S: PackageSet.t) (C: ConflictSet.t): Prop := abundant S ∧ peaceful S C. Definition is install set (p: Package.t) (R: PackageSet.t) (C: ConflictSet.t) (I: PackageSet.t) := I [<=] R ∧ In p I ∧ healthy I C. Definition strong dep (R: PackageSet.t) (C: ConflictSet.t) (p: Package.t) (q: Package.t) := (exists N: PackageSet.t, is install set p R C N) ∧ forall I: PackageSet.t, I [<=] R -> In p I ∧ healthy I C -> In q I. The notions of abundance and peace are explained in [MBD+06]. Informally, a set of packages is abundant iff all the dependencies of its constituent packages are satisfied, and it is peaceful iff there are no conflicting packages within the set. If a healthy (abundant and peaceful) set can be found that contains a package p, then p is installable, and the set is called an install set of p. As for the definition of strong dependency, it is necessary that p is installable, as noted above, and every healthy set that includes p must also include q. We can use this definition to prove an optimisation of the algorithm to find strong dependencies (as presented in [ADBZ09]). This optimisation uses the fact that a conjunctive dependency is automatically a strong dependency (a conjunctive dependency only has one alternative, which 5 / 9 Volume 48 (2011) Formal verification of a theory of packages means that this alternative must necessarily be installed). We can therefore mark all conjunctive dependencies as strong dependencies, which drastically reduces the search space. In order to formalise this notion, we first define the notion of a conjunctive dependency: Definition conjunctive dependency (R: PackageSet.t) (p q: Package.t): Prop := dependency Dependencies is conjunctive bool p q. The conjunctive dependency function is simply defined as the dependency function with another dependency filter. Now we see why the definition of Dependencies and dep filter as parameters is useful: we parametrise the conjunctive dependency function with the normal Dependencies function, but we filter them with is conjunctive bool, which only returns true if its argument is a conjunctive dependency. Now we can use this conjunctive dependency function to define the theorem: Theorem conjunctive strong dep: forall R C p q, conjunctive dependency R p q -> installable R C p -> strong dep R C p q. 2.5 Strong conflicts In [DB10], we introduced the notion of strong conflicts, i.e. pairs of packages that cannot be installed together under any circumstances. We also presented an algorithm to quickly find all the strong conflicts in a given distribution. This algorithm is much faster than the naive method of just checking every possible pair, by using a theorem that states that for two packages p and q not to be installable together, there must be an explicit conflict (c,c′) such that p depends on c and q depends on c′. The algorithm can then traverse the dependency tree backwards from every explicit conflicts, which limits the number of candidates to check. This theorem is easy to formalise. First the definition of a strong conflict: Definition strong conflict (R: PackageSet.t) (C: ConflictSet.t) (p: Package.t) (q: Package.t) := installable R C p ∧ installable R C q ∧ ˜exists I: PackageSet.t, healthy I C ∧ In p I ∧ In q I. The definition of the theorem: Theorem scp: forall (R: PackageSet.t) (C: ConflictSet.t) (p: Package.t | In p R) (q: Package.t | In q R), strong conflict R C (proj1 sig p) (proj1 sig q) -> ConflictSet.Exists (fun c => match c with (c1,c2) => (E.eq (proj1 sig p) c1 ∨ normal dependency (proj1 sig p) c1) ∧ (E.eq (proj1 sig q) c2 ∨ normal dependency (proj1 sig q) c2) end) C. The proof can be found in [DB10]; the Coq proof closely follows this scheme. Proc. OpenCert 2011 6 / 9 ECEASST alpha bravo charlie # Figure 1: Example of a triangle conflict 2.6 Triangle conflicts An optimisation to the strong conflict algorithm uses the concept of triangle conflicts. Here is its definition: Definition 1 A conflict (c1,c2) is a triangle conflict if and only if there exists a package p such that: • there is a d ∈Dependencies(p) such that c1 ∈ d and c2 ∈ d; • there is no other p′ such that p′ depends on either c1 or c2. An example of this situation can be found in figure 1; bravo and charlie are in conflict, and there are no packages that depend on them except for alpha. We have proven that triangle conflicts can be discarded when checking for strong conflicts. Informally, the reason for this is that since the only way to get to the two packages in conflict (c1 and c2) is through p; in order to install p, we can choose either c1 or c2. Since there are no other packages that depend on c1 or c2, which one we choose does not matter. In Coq, we express this by the following theorem: Theorem triangles ok: forall (R: PackageSet.t) (C: ConflictSet.t) (a: PackageSet.t | a [<=] R), (For all (fun a => installable (NoSupDependencies R) R C a) (proj1 sig a)) -> only triangles R C (fun => true) -> ˜ConflictSet.Exists (fun c => let (c1, c2) := c in In c1 (proj1 sig a) ∨ In c2 (proj1 sig a)) C -> co installable (NoSupDependencies R) R C (proj1 sig a). This expresses that given a repository R, a set of conflicts C, and a set of packages a, if all packages in a are installable separately, if the only conflicts are triangle conflicts, and if there is no direct conflict between any members of a, then all packages from a must be installable together. The proof sketch is as follows: • By induction on a. 1. If a = /0, then all packages in a are trivially installable together. 7 / 9 Volume 48 (2011) Formal verification of a theory of packages 2. If a = a1 ∪{x}, then: – By the induction hypothesis, all packages from a1 are installable together (take A1 as the installation set). Also, x is a member of a, and therefore it is installable by the original hypothesis (with Ax as the installation set). – Now, we take the union of A1 and Ax, where we remove c1 or c2 if both are present. This set is an installation set as well (it contains all packages needed and no conflicts). • Hence, all packages from a are installable together. 3 Conclusion and discussion We have presented a short overview of the formalisation of the theory of packages as used in the MANCOOSI project. More details can be found in the Coq files, which are available through the Web (see the previous section for the exact location). As for future work, the formalisation is not yet complete. For example, the proof presented in [Boe11] has not yet been formalised, mostly due to the lack of an appropriate graph theory library in Coq, especially one that integrates well with MSet. After this extension, the logical next step would be to verify the actual algorithms used in the tools; these use the theorems presented above, but only proving the theorems is no guarantee for correctness. There are several methods to execute this verification; it could be done using the Program extension to Coq [Soz07], which automatically generates proof obligations for programs written in Coq, based on invariants and assertions supplied by the user. Another method would be code extraction, where the code is first written in Coq, proved cor- rect, and then ‘extracted’ to OCaml. The resulting code would be faster than the code generated by Coq and Program, but if there were to be a bug in the code extraction process, it could produce false results. Most of the algorithms in MANCOOSI rely on a SAT solver to check for installability. Form- ally verifying a modern SAT solver would be very difficult, but it might be possible to treat the SAT solver as a black box and only verify its result. Acknowledgements: The author would like to thank Roberto Di Cosmo and Pierre Letouzey for their very useful advice and encouragement. References [ADBZ09] P. Abate, R. Di Cosmo, J. Boender, S. Zacchiroli. Strong dependencies between software components. In ESEM ’09: Proceedings of the 2009 3rd International Symposium on Empirical Software Engineering and Measurement. Pp. 89–99. IEEE Computer Society, Washington, DC, USA, 2009. doi:http://dx.doi.org/10.1109/ESEM.2009.5316017 Proc. OpenCert 2011 8 / 9 http://dx.doi.org/http://dx.doi.org/10.1109/ESEM.2009.5316017 ECEASST [Boe11] J. Boender. Efficient Computation of Dominance in Component Systems. In Pro- ceedings of SEFM. 2011. [DB10] R. Di Cosmo, J. Boender. Using strong conflicts to detect quality issues in component-based complex systems. In ISEC ’10: Proceedings of the 3rd India soft- ware engineering conference. Pp. 163–172. ACM, New York, NY, USA, 2010. doi:http://doi.acm.org/10.1145/1730874.1730905 [MBD+06] F. Mancinelli, J. Boender, R. Di Cosmo, J. Vouillon, B. Durak, X. Leroy, R. Treinen. Managing the Complexity of Large Free and Open Source Package-Bas ed Software Distributions. In ASE. Pp. 199–208. 2006. [Soz07] M. Sozeau. Subset Coercions in Coq. In Altenkirch and McBride (eds.), Types for Proofs and Programs. Lecture Notes in Computer Science 4502, pp. 237–252. Springer Berlin / Heidelberg, 2007. 9 / 9 Volume 48 (2011) http://dx.doi.org/http://doi.acm.org/10.1145/1730874.1730905 Introduction Formalisation Definitions Dependencies The dependency cone Strong dependencies Strong conflicts Triangle conflicts Conclusion and discussion