Incremental Pattern Matching for Regular Expressions Electronic Communications of the EASST Volume 47 (2012) Proceedings of the 11th International Workshop on Graph Transformation and Visual Modeling Techniques (GTVMT 2012) Incremental Pattern Matching for Regular Expressions Arash Jalali, Amir Hossein Ghamarian, Arend Rensink 12 pages Guest Editors: Andrew Fish, Leen Lambers Managing Editors: Tiziana Margaria, Julia Padberg, Gabriele Taentzer ECEASST Home Page: http://www.easst.org/eceasst/ ISSN 1863-2122 http://www.easst.org/eceasst/ ECEASST Incremental Pattern Matching for Regular Expressions Arash Jalali2, Amir Hossein Ghamarian1, Arend Rensink3 1 a.h.ghamarian@cs.utwente.nl 3 rensink@cs.utwente.nl Department of Computer Science, University of Twente, The Netherlands 2 arash@netstairs.com NetStairs.com, Inc. Abstract: Graph pattern matching lies at the heart of any graph transformation- based system. Incremental pattern matching is one approach proposed for reducing the overall cost of pattern matching over successive transformations by preserving the matches that stay relevant after a rule application. An important issue in any matching scheme, is the ability to properly and consistently deal with various facil- ities that add to the expressiveness of a GT-tool’s rule language. One such feature is the support for regular path expressions, which would let two nodes to be con- sidered as a “match”, if a certain path of edges exists between them. In this paper, the incorporation of regular expression support into incremental pattern matching is discussed within the context of the GROOVE tool set. This includes laying down a formal foundation for incremental pattern matching for regular expressions which is then used to justify the extension proposed to add regular expression support to a well-known pattern matching algorithm. Keywords: Incremental Matching, Path Matching, Regular Expressions, RETE Al- gorithm, State Space Exploration 1 Introduction Tool performance is essential to the success of any specification formalism. This is certainly also true for graph transformation, which has an innate disadvantage that it crucially relies on the computationally expensive problem of pattern matching. Previous work of Bergmann et al [BÖR+08] has shown that the principle of incremental matching, in the form of the RETE algorithm [BGT91, For82], can outperform plan-based searching (as advocated in, e.g., [GBG+06, VVF06]). The fundamental observation is that graph transformation is all about the gradual evolution of graphs, and therefore search results found for a given graph mostly continue to be valid for any graph obtained by the application of a single rule. We have ourselves reported that the same is true for the exhaustive state space exploration in GROOVE; see [GJR10]. In this paper we extend the scope of the RETE algorithm to matching regular expressions. These were already included in PROGRES [SWZ99] (where they are called path expressions) and are also offered by GROOVE. A regular expression can be used on a rule edge to indicate that the 1 / 12 Volume 47 (2012) mailto:a.h.ghamarian@cs.utwente.nl mailto:rensink@cs.utwente.nl mailto:arash@netstairs.com Incremental Pattern Matching for Regular Expressions host graph is expected to have a path between the image of the rule edge’s source to the image of the rule edge’s target, such that the path labels, when concatenated, form a word in the language of the regular expression. Typically, one is only interested in the existence (or absence) of such a path, and not in the precise path itself. A limited version of the problem of dynamic transitive closure, where one is merely interested in answering queries of the form “is node y reachable from node x?” in the face of updates to the graph, has already been dealt with in the past (see [DI06] for an exhaustive survey). The existing RETE algorithm for incremental matching is not geared towards checking regular expressions. The main hurdle is the fact that the match of a regular expression does not have a fixed size; in particular, transitively closed expressions allow paths of arbitrary length. This does not sit well with the basic principle of RETE, which relies on the propagation of a match through a static, non-cyclic network of subgraph checker nodes. For regular expressions this network needs to be cyclic; this could potentially make the propagation non-terminating. Research question and approach. The questions we set out to answer in this paper are: (i) can RETE be extended to cope with regular expression matching, and (ii) does the resulting algorithm continue to outperform search plan-based matching? In order to answer the first question, we will extend the concept of a RETE network with path checker nodes; in contrast to the existing subgraph checkers, path checkers may be connected in a cycle. We will then give a cut-off criterion to stop the propagation and guarantee termination. The main proof obligation is to show that the set of paths found at the cut-off moment is large enough to include at least one representative path between every pair of connected nodes. However, it is not enough to have a terminating algorithm: it should also perform well in prac- tice. This means that the cut-off criterion should be as sharp as possible, so that the propagation is fast enough to beat ordinary regular expression matching. In order to evaluate this, we have added regular expression matching to the RETE implementation in GROOVE and compared the performance with the pre-existing, search plan-based matching algorithm. The results show that incremental matching continues to pay off in the presence of regular expressions, but that the actual gain depends on the dynamics of the rules. In particular, if the structure that is tested by regular expression matching changes frequently, then incremental matching loses out. This is indeed not surprising, as match propagation for path checker nodes is a potentially complex operation which, when repeated often, does not scale as well as the time-honoured word checking algorithm for regular automata. 2 Definitions We first define some basic concepts: graphs, paths and match morphisms. Graphs consist of nodes and labelled edges. We assume the existence of global universes Vertex (ranged over by v), Edge (ranged over by e), and Label of vertex (or node), edge and label entities. Label is further partitioned into the universe Atom of atomic labels (ranged over by a) and RegExp of regular expressions (ranged over by R). There are also global functions src,tgt : Edge → Vertex and lab : Edge → Label associating with every edge a source and target node and label. Proc. GTVMT 2012 2 / 12 ECEASST Definition 1 (graph) A graph G is a tuple 〈VG,EG〉 in which VG ⊆ Vertex is the set of vertices (or nodes) and EG ⊆ Edge is the set of edges, such that src(EG)⊆VG and tgt(EG)⊆VG. We distinguish host graphs, in which only Atom is used as label universe, and rule graphs, in which also RegExp labels may occur. Regular expressions are defined by the following grammar: R ::= λ | a | −R | R1|R2 | R1.R2 | R∗ where λ stands for the empty word and a ∈ Atom. The only non-standard operation is −R. In graphs, this will be matched by an inversed path; to capture this formally, we introduce the complementary actions Atom, disjoint from Label. For arbitrary a ∈ Atom∪Atom we use ā to denote the complement of a; this operator is its own inverse, hence ¯̄a = a. This is extended to words w ∈ Word where Word = (Atom∪Atom)∗ by also reversing them: ε̄ = ε and aw = w̄ ā. The language of a regular expression R is then inductively defined by LR ⊆ Word such that: Lλ ={ε} La ={a} L−R ={w̄ | w ∈ LR} LR1.R2 ={w1w2 | w1 ∈ LR1,w2 ∈ LR2} LR1|R2 = LR1 ∪LR2 LR∗ ={w1 ···wn | n ≥ 0,w1,...,wn ∈ LR} To define the meaning of regular expressions over graphs, we need the concept of paths in a host graph. For this purpose, we first introduce inverse edges: ĒG for a graph G will denote the set of inverse edges, such that src(ē) = tgt(e), tgt(ē) = src(e) and lab(ē) = lab(e). Definition 2 (path) A path is an alternating sequence p = v1 e2 v2 ··· en vn of vertices and edges, starting and ending on a vertex, such that vi−1 = src(ei) and tgt(ei) = vi for 1 < i ≤ n. We use Π to denote the set of all paths and Π(G) for the subset of all paths in G, and extend the source and target functions to paths by src(p) = v1 and tgt(p) = vn. Path concatenation of p = v1 e2 v2 ···en vn and p = v′1 e ′ 2 v ′ 2 ···e ′ n′ v ′ n′ is defined by p.p′ = v1 e2 v2 ···en vn e′2 v ′ 2 ···e ′ n′ v ′ n′ if vn = v ′ 1 Two further relevant functions on paths are word : Π → Word that maps each path in G to the corresponding sequence of labels, and path reversal rev : Π → Π, inductively defined by: word(v) = ε rev(v) = v word(p e v) = word(p)lab(e) rev(p e v) = v ē rev(p) We can now formally define the relation between regular expressions and graphs. For an arbitrary regular expression R and graph G, we define ΠR(G) as the set of paths in G satisfying R, thus: Π R(G) ={p ∈ Π(G) | word(p)∈ LR} . (1) 3 / 12 Volume 47 (2012) Incremental Pattern Matching for Regular Expressions Note that ΠR(G) is in general infinite, due to the possible presence of loops in G. Finally, a match morphism is a function from a rule graph L to a host graph G that provides images in G for the atom-labelled edges of L and guarantees the existence of paths in G for the regular expression-labelled edges of L: Definition 3 (match morphism) Given a rule graph L and a host graph G, a match morphism of L into G is a partial function f : (VL ∪EL) → (VG ∪EG) that is total on VL such that f (VL) ⊆VG and f (EL)⊆ EG, satisfying for all e ∈ EL: • if lab(e) = a ∈ Atom, then f (e) = e′∈ EG such that src(e′) = f (src(e)), tgt(e′) = f (tgt(e)) and lab(e′) = a; • if lab(e) = R ∈ RegExp, then there is a p ∈ ΠR(G) such that src(p) = f (src(e)) and tgt(p) = f (tgt(e)). For the purpose of our discussions, we shall only be concerned with the second condition in the above definition. We use F L to denote the set of all match morphisms of L, and F L(G) for the subset of match morphisms into G. It is important to note that the paths themselves are not part of the match morphism images. Among other things, this implies that F L(G) is always finite. It should also be noted that, in this paper, we deal with positivie conditions only. Negative application conditions can be supported using the same techniques as described in [GJR10]. 3 Incremental Matching The RETE algorithm for incremental matching uses the notion of subgraph checkers that keep a record of their match morphisms into a host graph, and are updated upon every graph change. Graph changes are additions and deletions of nodes and edges, using the following (partially defined) operations: G + v = 〈V ∪{v},E〉 for all v /∈VG G + e = 〈V,E ∪{e}〉 for all e /∈ EG with src(e),tgt(e)∈VG G−v = 〈V \{v},E〉 for all v ∈VG such that @e ∈ EG : v ∈{src(e),tgt(e)} G−e = 〈V,E \{e}〉 for all e ∈ EG The essence of incremental matching is the existence of “cheap” updates δ L+x(G),δ L −x(G) ⊆ F L for every rule graph L host graph G and graph element x ∈ Vertex∪Edge, which capture the change to F L(G) when a vertex or edge is added or removed: F L(G + v) = F L(G)∪δ L+v(G) F L(G−v) = F L(G)\δ L−v(G) F L(G + e) = F L(G)∪δ+e(G) F L(G−e) = F L(G)\δ L−e(G) whenever the graph operations are defined. In RETE, this is achieved by creating a directed acyclic graph 〈N,@〉 (called a network) of checker nodes n ∈ N, each of which is associated with a subgraph Sn ⊆ L such that n1 @ n2 implies Sn1 ⊂ Sn2 . For the @-minimal nodes, the Sn are single nodes and edges of L, whereas Sn = L when n is the @-maximum. Proc. GTVMT 2012 4 / 12 ECEASST Each checker node n continually maintains F Sn(G), and computes its own δ ’s upon every graph update, propagating from small to large in the @-ordering. For instance, for a rule graph L = 〈{v′0,v ′ 1},{e ′}〉 consisting only of a single edge e′ with src(e′) = v′0 and tgt(e ′) = v′1, the δ ’s are given by δ L +v(G) = δ L −v(G) = /0 δ L +e(G) = δ L −e(G) = { {(v′0 7→ v0,v ′ 1 7→ v1,e ′ 7→ e)} if lab(e) = lab(e′) /0 otherwise Only non-empty δ are propagated to the @-successors in the network, reducing the amount of work required to update the entire structure. 3.1 Path matches Since this paper deals with the extension of RETE to regular expressions, we will not go further into the details of the standard algorithm. The extension involves the definition of new checker nodes for paths and extending the RETE network with those nodes. There are two important novelties: • The path checker nodes do not have to maintain all paths, but only have to test for the existence of a path (see Def. 3). • The @-ordering over path checkers is no longer acyclic; instead, we will have n @ n when- ever Rn (the regular expression associated with n) is of the form R∗. The main issue is to decide what kind of information every path checker node n should maintain, given that it is only required to know whether a satisfying path exists between given (arbitrary) v1,v2 ∈VG. Important criteria for this decision are: (i) the amount of information should be kept as small as possible and (ii) incremental updates should be cheap. To meet these criteria, we introduce an algebra of path matches: m ::= 〈v0〉 | 〈e〉 | inv(m) | dot(m1,m2) | close(m1,m2) Here, v0 ∈ Vertex and e ∈ Edge; the other operators are syntactic constructors that give rise to a tree-like structure of path matches. The close constructor, for instance, represents a match for a closure regular expression. The universe of all path matches will be denoted M. The following table defines functions start,end : M→Vertex that retrieve the start and end nodes of a path match, and a function int : M → 2Vertex that retrieves the set of internal nodes: m start(m) end(m) int(m) 〈v0〉 v0 v0 /0 〈e〉 src(e) tgt(e) /0 inv(m1) end(m1) start(m1) /0 dot(m1,m2) start(m1) end(m2) /0 close(m1,m2) start(m1) end(m2) {start(m1)}∪int(m2) m is said to be a path match in G if it only uses nodes and edges from G. We call a path match m consistent if 5 / 12 Volume 47 (2012) Incremental Pattern Matching for Regular Expressions • m = inv(m1) implies m1 is consistent; • m = dot(m1,m2) implies (i) m1 and m2 are consistent and (ii) end(m1) = start(m2); • m = close(m1,m2) implies (i) m1 and m2 are consistent, (ii) end(m1) = start(m2), (iii) start(m1) /∈ int(m2) and (iv) m2 = 〈v0〉 or m2 = close(m21,m22). Note in particular condition (iv) of close, which states that the second operand is always either an empty path match or itself of the form close. Obviously, every path match is essentially a path with additional structure. Indeed, we can easily retrieve the path from the path match: path(〈v0〉) = v0 path(〈e〉) = src(e)e tgt(e) path(inv(m)) = rev(path(m)) path(dot(m1,m2)) = path(m1).path(m2) path(close(m1,m2)) = path(m1).path(m2) We now recursively define the path matches of arbitrary regular expressions in a given graph G, as the smallest sets satisfying the following equations: Mλ (G) ={〈v〉 | v ∈VG} Ma(G) ={〈e〉 | e ∈ EG,lab(e) = a} M−R(G) ={inv(m) | m ∈ MR(G)} MR1|R2(G) = MR1(G)∪MR2(G) MR1.R2(G) ={m = dot(m1,m2) | m1 ∈ MR1(G),m2 ∈ MR2(G),m is consistent} MR ∗ (G) = Mλ (G)∪{m = close(m1,m2) | m1 ∈ MR(G),m2 ∈ MR ∗ (G),m is consistent} In particular, the path matches of the Kleene star are defined recursively. The following property is the core of the correctness proof for our incremental match algorithm. Theorem 1 Let G be a host graph and R a regular expression. 1. MR(G) is finite; 2. For every m ∈ MR(G), path(m)∈ ΠR(G); 3. For every p ∈ ΠR(G), there is an m ∈ MR(G) such that start(m) = src(p) and end(m) = tgt(p). Clause 1 ensures that MR(G) can be effectively computed. It follows from the fact that, due to the consistency conditions on path matches, int(close(m1,m2))⊃ int(m2), combined with the fact that int(m) ⊆VG for all m ∈ MR(G). Clause 2 guarantees that the path matches are sound: they only generate paths that satisfy the regular expression. Finally, Clause 3, called the sufficient cov- erage clause, states that the path matches are complete: they contain at least one representative element for every pair of host nodes between which there exists a path. Proc. GTVMT 2012 6 / 12 ECEASST This provides us with an answer to the question above: the RETE path checker node for R will maintain the set MR(G). In the remainder of this section we will show that this can be updated incrementally. 3.2 Incremental Computation of Path Matches Given the definition of the path matches, defining the δ -functions, which now range over the terms of our path match algebra, is relatively straightforward. For the base regular expressions they are given by δ λ +v(G) = δ λ −v(G) ={〈v〉} δ λ +e(G) = δ λ −e(G) = /0 δ a +v(G) = δ a −v(G) = /0 δ a +e(G) = δ a −e(G) = { {〈e〉} if lab(e) = a /0 otherwise For the composed regular expressions we can ignore the distinction between node and edge updates: for x ∈ Vertex∪Edge δ −R +x (G) ={inv(m) | m ∈ δ R +x(G)} δ R1|R2 +x (G) = δ R1 +x(G)∪δ R2 +x(G) δ R1.R2 +x (G) ={m = dot(m1,m2) | m1 ∈ δ R1 +x(G),m2 ∈ M R2(G + x),m is consistent} ∪{m = dot(m1,m2) | m1 ∈ MR1(G + x),m2 ∈ δ R2+x(G),m is consistent} δ R∗,0 +x (G) = δ λ +x(G)∪{m = close(m1,m2) | m1 ∈ δ R +x(G),m2 ∈ M R∗(G),m is consistent} δ R∗,i+1 +x (G) ={m = close(m1,m2) | m1 ∈ M R(G + x),m2 ∈ δ R∗,i +x (G),m is consistent} δ R∗ +x(G) = ⋃ ω i=0 δ R∗,i +x (G) The most interesting is of course the case for R∗. This is approximated by a sequence of steps, which is defined as unbounded but in practice will always result in δ R ∗,i +x (G) = /0 from a certain value of i onward — the argument is the same as for the finiteness of MR ∗ (G). The removal updates for the composed operators only have to look at existing matches; in other words, we do not have to compute them: δ −R −x (G) ={inv(m)∈ M R(G) | m ∈ δ R−x(G)} δ R1|R2 −x (G) = δ R1 −x(G)∪δ R2 −x(G) δ R1.R2 −x (G) ={dot(m1,m2)∈ M R(G) | m1 ∈ δ R1−x(G) or m2 ∈ δ R2 −x(G)} δ R∗,0 −x (G) = δ λ −x(G)∪{close(m1,m2)∈ M R(G) | m1 ∈ δ R−x(G)} δ R∗,i+1 +x (G) ={m = close(m1,m2) | m2 ∈ δ R∗,i −x (G)} δ R∗ −x(G) = ⋃ ω i=0 δ R∗,i −x (G) In an actual implementation, removal can be implemented efficiently by keeping a mapping from each match to the set of matches derived from it; i.e., from m to all existing matches of the form inv(m), dot(m,m′), dot(m′,m), close(m′,m) and close(m,m′). This obviates all further lookups. The correctness criterion of incremental updating is captured by the following theorem: 7 / 12 Volume 47 (2012) Incremental Pattern Matching for Regular Expressions Theorem 2 For every graph G, regular expression R and element x ∈ Vertex∪Edge, MR(G + x) = MR(G)∪δ R+x(G) MR(G−x) = MR(G)\δ R−x(G) . In order to show this theorem for the choice operator, we need an alternative characterisation of the set δ R−x, namely that it always removes precisely those match objects that contain x. For every x ∈ Vertex∪Edge define Mx ={m ∈ M | x occurs in m} . Now we can easily show the following property, by induction on the structure of R: Proposition 1 For every graph G, regular expression R and element x ∈ Vertex∪Edge, δ R −x(G) = M R(G)∩Mx . With the use of this property we can prove the main theorem above. Proof of Theorem 2. By induction on the structure of R. The only really interesting cases are those of MR1|R2(G−x) and MR ∗ (G + x). Case MR1|R2(G−x) = MR1|R2(G)\δ R1|R2−x (G). This follows from MR1|R2(G−x) = MR1(G−x)∪MR2(G−x) (by definition) = (MR1(G)\δ R1−x)∪(M R2(G)\δ R2−x) (by the induction hypothesis) = (MR1(G)\Mx)∪(MR2(G)\Mx) (by Prop. 1) = (MR1(G)∪MR2(G))\Mx (by set algebra) = MR1|R2(G)\δ R1|R2−x (G) (by definition and Prop. 1). Case MR ∗ (G + x) = MR ∗ (G)∪δ R ∗ +x(G). The inclusion from right to left is immediate. Now assume m ∈ MR ∗ (G + x). We proceed by induction on |m|, where |m|= 0 if m is not of the form close, and |m| = |m2|+ 1 if m = close(m1,m2). The hypothesis is that either m ∈ MR ∗ (G) or m ∈ δ R ∗,i +x (G) for some i ≤|m|. Base case: |m|= 0 It follows that m ∈Mλ (G+x). Then m ∈Mλ (G) or m ∈δ λ+x(G) by the outer induction hypothesis; in the first case m ∈ MR ∗ (G) and in the second m ∈ δ R ∗,0 +x (G). Since 0 ≤|m| we are done. Induction step: |m|= n + 1. It follows that m = close(m1,m2) with |m2|= n; hence m1 ∈MR(G+ x) and m2 ∈ MR ∗ (G + x). By the outer induction hypothesis, either (i) m1 ∈ MR(G) or (ii) m1 ∈ δ R+x(G), and by the inner induction hypothesis, either (iii) m2 ∈ MR ∗ (G) or (iv) m2 ∈ δ R∗,i +x (G) for i ≤|m2|. This gives rise to the following combinations of cases: Proc. GTVMT 2012 8 / 12 ECEASST • (i) and (iii): Then m ∈ MR ∗ (G), hence we are done. • (ii) and (iii): Then m ∈ δ R ∗,0 +x (G); since 0 ≤|m| we are done. • (iv): Then m ∈ δ R ∗,i+1 +x (G); since i + 1 ≤|m| we are done. 4 Implementation In this section we briefly discuss how the extended RETE algorithm has been implemented in GROOVE. Our discussion will be exclusive to regular expressions. For a more general expla- nation of RETE and its implementation in GROOVE, the reader is referred to [GJR10]. Our explanation will mainly focus on the static build of the RETE network, as the dynamic behavior of the network follows directly from the formal semantics of the δ operations. 4.1 Path-checkers The RETE algorithm relies heavily on the network of checker nodes that are responsible for finding and passing along partial matches for one or more rules. The network in the classic RETE algorithm comes equipped with several types of checker nodes, including edge checkers, node checkers, subgraph checkers, as well as condition/production rule checkers. In order to add the functionality of finding matches for paths, as opposed to partial graphs, it is natural to think that one needs a new type of checker node that finds and passes along matches for a specific kind of path - hence the addition of path checkers. A path checker node, in general, receives one or two partial path matches, and then tries to combine and/or transform them based on the semantics of the regular expression operator they represent. For instance, a sequence path checker, receives two path matches, and tries to see if they can be combined into a larger path by sequentially concatenating them. 4.2 The RETE Static Build Given a rule edge with a regular expression label R, we use the syntax of R as a guide to build the RETE network, such that it would be able to incrementally find path matches. Figure 1 shows Path−checker for: a Path−checker for: a|(b.c)∗ Path−checker for: (b.c)∗ ROOT Path−checker for: b Path−checker for: b.c Path−checker for: c Production Node rule1 receive receive receive receive receive receive receive receive receive Figure 1: RETE network for a|(b.c)* 9 / 12 Volume 47 (2012) Incremental Pattern Matching for Regular Expressions Legend: A Ab Matched and preserved A Ab Forbidden A Ab Matched and deleted A Ab Created − end + end end Pool next next next+ in (a) ‘add’ end+ end nextnext+ (b) ‘del’ − end+ end ! end next+ next (c) ‘move’ + end ! end next+ next (d) ‘mark’ next next (e) A chain of nodes Figure 2: The chain manipulation example the RETE network that finds matches for the expression a|(b.c)*. It is not a coincidence that the way the path checkers are connected to each other represents an upside-down view of the expression’s syntax tree, as syntax trees are top-down representations of an expression, while the RETE network is constructed to build matches in a bottom-up way. The highlighted path checker in Figure 1 is the one that performs the final choice operation. It should also be noted that although the network looks acyclic, the closure path checker for (b.c)+ does in fact have a built-in loop-back mechanism that performs the closure operation repeatedly enough, i.e. until the path matches become inconsistent, to construct a sufficient coverage for that expression. 5 Experimental Results Figure 2, shows a grammar for manipulating chains of nodes. Four operations (grammar rules) are defined over these chains: marking the end of the chain, adding nodes to the marked end from a pool of available nodes, moving the end marker closer to the beginning, and deleting a node from the marked end. We chose this grammar for our experiments as it allowed us to create the different scenarios under which RETE is expected to perform better or worse than search plan. This was inspired by the benchmarking results in [GJR10], as well as the intuitive rule that RETE is expected to perform better when the extra work it does at finding all complete and partial matches at each state is preserved across several consecutive rule applications. To better demonstrate the behavior of RETE, some of the runs in our experiment made use of a subset of the rules. The experiment was performed on a machine with a dual core 1.6GHz Intel CPU and 2.5GB of RAM. Table 1 summarizes the results of our experiment. The first column indicates which rules have been used. The ‘Len’ column specifies the length of the chain in the initial graph, and the ‘Pool’ column indicates the number of available nodes in the pool, if that run makes use of the ‘add’ rule. The ‘RETE’ and ‘SP’ columns contain the total exhaustive exploration time for RETE and search plan. In the case of search plan, the reported time is the average time taken using DFS and BFS exploration strategies. For RETE, we used the specially tailored strategy as discussed in [GJR10]. The right-most column in Table 1 indicates RETE’s performance gain or loss (if it is a negative number) compared to SP. The numbers clearly show that in the absence of any rule that modifies the structure of the Proc. GTVMT 2012 10 / 12 ECEASST # Rules Len Pool States / Transitions RETE(ms) SP(ms) Prf. Gain(%) 1 move 27 - 2941 / 8042 1781 1984 10.2 2 move 54 - 19063 / 54439 18734 40437 53.7 3 move,mark 12 - 2048 / 8192 1016 2531 59.9 4 move,mark 15 - 16384 / 77824 7297 21758 66.5 5 move,add,del 2 5 23248 / 42615 6062 5750 -5.2 6 move,add,del 4 5 357605 / 634772 89719 87000 -3.0 7 all 2 5 179480 / 764760 90125 125375 28.1 Table 1: Experimental results for exhaustive state space exploration chain (rows 1 to 4), the path matches found for next+ by RETE remain valid throughout the ex- ploration, while SP is forced to follow and match the chain repeatedly after each rule application. When the rules ‘del’ and ‘add’ are present, RETE starts losing a lot of matches, or ends up creat- ing more matches every time a next edge is deleted or created by those rules. This leads to RETE losing the race to SP (rows 5 and 6). This performance loss is easily compensated when the rule ‘mark’, which also does not modify the structure of the chain, is used along with the others (row 7), where a 5.2% loss is turned into a 28% performance gain. While the other experiments are meant to reveal the underlying mechanisms affecting performance, row 7 represents the typical and naturally ocurring scenario in which RETE can be used to achieve gains. There is however another characteristic that should be noted about the above scenarios: RETE’s loss occurs at a more or less stable rate (rows 5 and 6) even when doubling the length of the chain has resulted in a 11 fold increase in the size of the state space. Rows 1-4 show that this is not the case when SP is losing, as increasing the length of the chain causes further declines in its performance compared to RETE. 6 Conclusion and Future Work In this paper, we provided an extension to the RETE algorithm to support matching for regular path expressions. Finding paths in a graph that satisfy a given regular expression is particularly problematic as the matches for a regular expression with transitive closures are neither of a fixed size, nor is there always a finite number of them due to the presence of cycles. This gave rise to the question of what is actually meant by finding a match for a regular expression, which we addressed by formally defining paths and their matches. We also provided formal specifications of incremental updates to the finite set of matches found for each regular expression and showed that our formulation of the set of matches for a given regular expression gives rise to a finite and complete set, which makes it suitable for implementation. Our experimental results showed that RETE can in fact provide performance gains in regular expression matching when the dynamics of the situation are such that RETE’s ’find once, use many times’ behavior makes it possible to avoid repetitive finding of long, mostly unaltered paths. As future work, one possible direction for improvement would be to boost RETE’s regular expression matching by enhancing its path checkers, especially the closure path checkers, with an on-demand capability, such that they would only produce matches when needed and only 11 / 12 Volume 47 (2012) Incremental Pattern Matching for Regular Expressions those that are anchored at a certain node in the host graph. In our example grammar discussed in Section 5, the closure path checker does in fact find paths that lie in the middle of the chain, which are basically of little use, since all our operations happen at the end of the chain. This could lead to serious performance declines in finding matches for the expression a* in cases where the host graph is dense with a-labelled edges and there are frequent updates in the form of addition and removal of such edges. Bibliography [BGT91] H. Bunke, T. Glauser, T.-H. Tran. An Efficient Implementation of Graph Gram- mars Based on the RETE Matching Algorithm. In Proceedings of the 4th Interna- tional Workshop on Graph-Grammars and Their Application to Computer Science. Pp. 174–189. Springer-Verlag, London, UK, 1991. [BÖR+08] G. Bergmann, A. Ökrös, I. Ráth, D. Varró, G. Varró. Incremental pattern matching in the VIATRA model transformation system. In GRaMoT ’08: Proceedings of the third international workshop on Graph and model transformations. Pp. 25–32. ACM, NY, USA, 2008. [DI06] C. Demetrescu, G. F. Italiano. Dynamic shortest paths and transitive closure: Al- gorithmic techniques and data structures. Journal of Discrete Algorithms 4(3):353– 383, 2006. [For82] C. Forgy. RETE, a fast algorithm for the many pattern / many object pattern match problem. Artificial Intelligence 19:17–37, 1982. [GBG+06] R. Geiß, G. V. Batz, D. Grund, S. Hack, A. Szalkowski. GRGEN: A Fast SPO-Based Graph Rewriting Tool. In Corradini et al. (eds.), International Conference on Graph Transformations (ICGT). LNCS 4178, pp. 383–397. Springer, 2006. [GJR10] A. H. Ghamarian, A. Jalali, A. Rensink. Incremental Pattern Matching in Graph- Based State Space Exploration. ECEASST 32, 2010. [SWZ99] A. Schürr, A. J. Winter, A. Zündorf. The PROGRES approach: language and environ- ment. In Ehrig et al. (eds.), Handbook of graph grammars and computing by graph transformation: applications, languages, and tools. Volume 2, pp. 487–550. World Scientific Publishing Co., Inc., 1999. [VVF06] G. Varró, D. Varró, K. Friedl. Adaptive Graph Pattern Matching for Model Transfor- mations using Model-sensitive Search Plans. In Karsai and Taentzer (eds.), GraMot 2005, International Workshop on Graph and Model Transformations. ENTCS 152, p. 191–205. Elsevier, 2006. http://www.inf.mit.bme.hu/FTSRG/Publications/varro/2005/gramot05 vvf.pdf Proc. GTVMT 2012 12 / 12 http://www.inf.mit.bme.hu/FTSRG/Publications/varro/2005/gramot05_vvf.pdf Introduction Definitions Incremental Matching Path matches Incremental Computation of Path Matches Implementation Path-checkers The rete Static Build Experimental Results Conclusion and Future Work