Journal of Software Engineering Research and Development, 2021, 9:10, doi: 10.5753/jserd.2021.477. This work is licensed under a Creative Commons Attribution 4.0 International License.

A Data-centric Model Transformation Approach using Model2GraphFrame Transformations

Luiz Carlos Camargo [Universidade Federal do Paraná, C3SL Labs | lccamargo@inf.ufpr.br]
Marcos Didonet Del Fabro [Universidade Federal do Paraná, C3SL Labs | didonet@inf.ufpr.br]

Abstract

Data-centric (Dc) approaches are being used for data processing in several application domains, such as distributed systems, natural language processing, and others. Different data processing frameworks ease the task of parallel and distributed data processing. However, few research approaches study how to execute model manipulation operations, such as model transformations, on such frameworks. In addition, it is often necessary to extract XMI-based formats into possibly distributed models. In this paper, we present a Model2GraphFrame operation to extract a model from a modeling technical space into the Apache Spark framework and its supported GraphFrame format. It generates a GraphFrame from the input models, which can be used for partitioning and for processing model operations. We used two model partitioning strategies: one based on sub-graphs, and one based on clustering. The approach allows performing model analysis by applying operations on the generated graphs, as well as Model Transformations (MT). The proof of concept results, such as Model2GraphFrame, GraphFrame partitioning, GraphFrame connectivity, and GraphFrame model transformations, indicate that our model extraction can be used in various application domains, since it enables the specification of analytical expressions on graphs. Furthermore, its model graph elements are used in model transformations on a scalable platform.
Keywords: Model Extractor, Data-centric approach, Spark GraphFrames, Model Transformations

1 Introduction

Model Transformations (MTs) are key artifacts for existing MDE (Model-Driven Engineering) approaches, since they implement operations between models (Brambilla et al., 2012). Nevertheless, the transformation of models via parallel and/or distributed processing is still a challenging question in MDE platforms. There are recent initiatives that aim to improve existing solutions by adapting the computation models, for instance, using MapReduce (Dean and Ghemawat, 2008) to integrate model transformation approaches within data-intensive computing models. Works such as Burgueno et al. (2016), Pagán et al. (2015), Benelallam et al. (2015), and Tisi et al. (2013) aim at providing solutions for this new scenario using frameworks such as Linda and MapReduce. Even when adopting these frameworks, model processing is not a straightforward task, since models are semi-structured and can have self-contained or inter-contained elements, differently from flat data structures with linear space usage, such as logs, text files, and others. The need for performing complex processing on large volumes of data has led to the re-evaluation of the utilization of different kinds of data structures (Raman, 2015). Very Large Models (VLMs) are composed of millions of elements. VLMs are present in specific domains such as the automotive industry, civil engineering, Software Product Lines, and the modernization of legacy systems (Gómez et al., 2015). Furthermore, new applications are emerging in domains such as the Internet of Things (IoT), open data repositories, and social networks, demanding intensive and scalable computing for manipulating their artifacts (Ahlgren et al., 2016).
There is a wide range of model transformation approaches (Kahani et al., 2018), such as QVT (OMG, 2016), ATL, ETL (Kolovos et al., 2008), and VIATRA (Varró et al., 2016), among others. However, most of these approaches adopt local and sequential execution as their strategy for transforming models, limiting the processing of models with large numbers of elements (VLMs) to the capacity of the execution environment. Given the nature of models and meta-models, they can have elements that are densely interconnected. This complicates the processing of transformation rules, mainly when executing a pattern matching step (Jouault et al., 2008). Moreover, distributed Model Transformation (MT) requires strategies for partitioning and distributing the model elements on distinct nodes, while at the same time ensuring the consistency among their elements (Benelallam et al., 2018). A large part of model-based tools uses a graph-oriented data model. These tools have been designed to help users specify and execute model-graph manipulation operations efficiently in a variety of domains (Xin et al., 2013; Szárnyas et al., 2014; Junghanns et al., 2016; Shkapsky et al., 2016; Li et al., 2017; Benelallam et al., 2018; Tomaszek et al., 2018; Azzi et al., 2018). The extraction of large semi-structured data under a graph perspective can be useful for choosing a strategy to design distributed/parallel MTs, graph-data processing, and model partitioning, for analyzing model inter-connectivity, and for offering graph-structured information to different contexts. Even so, graph processing in the MT context requires more research, involving implicit parallelism, parallel/distributed environments, lazy evaluation, and other mechanisms for model processing.
For these reasons, in this paper, we present an evaluation study on the application of a Data-centric (Dc) approach for model extraction and MT in the Spark framework, based on GraphFrames (Apache, 2019). We consider that mechanisms such as implicit parallelism, lazy evaluation, model partitioning, and a scalable framework can compose an approach for MT. First, we inject the input model into a DataFrame, which is a format supported by Apache Spark. Second, we implement in Scala a model extraction with graph generation from the DataFrame and its schema. It translates the input models from a DataFrame into a GraphFrame, through a Model2GraphFrame transformation, which allows us to process them. We evaluate how to query the graph elements using its native query language, and also how to specify different kinds of operations over GraphFrames. We focus on the partitioning of graphs from GraphFrames into sub-graphs, as well as the clustering of their vertices, which are used in Model Transformations. We provide the following contributions:
• We produce an automated mechanism for data translations between the MDE technical space and the DataFrame and GraphFrame formats, which allows the execution of different operations (including MT) over the models from the GraphFrame;
• We use two (semi-automated) partitioning strategies for models on GraphFrames, one based on the Motif algorithm and another on clustering using the Infomap framework. The model partitioning result is used in MT, aiming to improve execution performance;
• To validate our approach, we implemented a proof of concept, in which we compared the partitioning strategies in MT executions on top of Spark, a scalable framework.
This paper is organized into 6 sections. In Section 2, we introduce the context for this work, with the DataFrame and GraphFrames APIs and their data formats, as well as Model Transformations using graphs; in Section 3, we present the specifications of our approach, including extracting, translating, partitioning, and model transformations; in Section 4, we describe the proof of concept for validating our approach; in Section 5, we present related work; in Section 6, we conclude with future work.

2 Context

In this section, we present DataFrame, a distributed collection of data organized into named columns, and GraphFrames, a graph processing library based on DataFrames, both for Apache Spark. We also introduce: MT, the key artifact for existing MDE approaches; the Model Extractor (ME), for extracting model elements from different technical spaces; and graphs, data structures composed of vertices and edges, which may be used in MT.

2.1 Data Structures on GraphFrame

Apache Spark (Apache, 2019) is a general-purpose data processing engine providing a set of APIs that allow the implementation of several types of computations, such as interactive queries, data and stream processing, and graph processing. The DataFrame Spark API uses distributed Datasets. A Dataset is a strongly-typed data structure organized in collections. The Dataset API allows the definition of a distributed collection of structured data from JVM objects, and its manipulation using functional transformations such as map, flatMap, filter, and others. Structurally, a DataFrame is a two-dimensional labeled data structure with columns of potentially different types. Each row in a DataFrame is a single record, which is represented by Spark as an object of type Row. Each DataFrame contains data grouped into named columns, and keeps track of its own schema.
Summarizing, a DataFrame is similar to a table in a relational database, with one difference: its columns allow the manipulation of multivalued attributes. A DataFrame can be transformed into new DataFrames using the various relational operators available in its API and expressions based on SQL-like functions. DataFrames and Datasets are (distributed) table-like collections with well-defined rows and columns. Each column must have the same number of rows, and each column has type information that must be consistent for every row in the collection. DataFrames and Datasets represent immutable and lazily evaluated plans that specify what operations to apply to data residing at a location to generate some output (Chambers and Zaharia, 2018). Figure 1 shows an example of a DataFrame. It is formed by three rows and five columns, and contains data extracted from the Families model (rows with the March, Sailor, and Camargo families). A Row can have Columns with different types, such as String, Integer, Date, Boolean, and Array.

+---------+--------------------+---------+----------+-----------------+
| lastName|           daughters|   father|    mother|             sons|
+---------+--------------------+---------+----------+-----------------+
|    March|        [[, Brenda]]|  [, Jim]| [, Cindy]|    [[, Brandon]]|
|   Sailor|         [[, Kelly]]|[, Peter]|[, Jackie]|[[,David],[,Dy...|
|  Camargo|[[, Jor], [, Teste]]| [, Luiz]|   [, Sid]|     [[, Lucas]]|
+---------+--------------------+---------+----------+-----------------+
Figure 1. DataFrame Families

Another possible way to describe elements and their relationships is the creation of graphs, due to their high expressiveness. Spark provides the GraphX and GraphFrames APIs to process data in graph formats. In the GraphFrames API, the GraphFrame class is used for instantiating graphs. In Figure 2, we present a simple illustrative example of a Family model, loading the March family elements into a GraphFrame instance.
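The SQL-like operators described above can be illustrated without Spark. The sketch below is plain Python (not the Spark API): a DataFrame is modeled as a list of row dicts, with helpers mimicking select and filter. The row values follow Figure 1; the truncated second Sailor son is assumed here to be "Dylan", following the ATL Families example.

```python
# Minimal sketch: a "DataFrame" as a list of rows (dicts), with helpers
# mimicking the SQL-like select/filter operators of the real API.
families = [
    {"lastName": "March",   "father": "Jim",   "mother": "Cindy",  "sons": ["Brandon"]},
    {"lastName": "Sailor",  "father": "Peter", "mother": "Jackie", "sons": ["David", "Dylan"]},
    {"lastName": "Camargo", "father": "Luiz",  "mother": "Sid",    "sons": ["Lucas"]},
]

def select(rows, *cols):
    """Project each row onto the given columns (like DataFrame.select)."""
    return [{c: r[c] for c in cols} for r in rows]

def where(rows, pred):
    """Keep only the rows satisfying the predicate (like DataFrame.filter)."""
    return [r for r in rows if pred(r)]

multi_son = where(families, lambda r: len(r["sons"]) > 1)
print(select(multi_son, "lastName"))  # [{'lastName': 'Sailor'}]
```

In Spark, the same projection and filtering would be expressed over distributed Rows; the sketch only conveys the relational flavor of the API.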
A GraphFrame can be created from a vertex DataFrame (nameVerticesDF) and an edge DataFrame (roleEdgesDF). A vertex DataFrame has to contain a special column named "id", which specifies a unique ID for each vertex in the graph. An edge DataFrame should contain two special columns: "src" (the source vertex ID of the edge) and "dst" (the destination vertex ID of the edge) (Chambers and Zaharia, 2018; Apache, 2019). The GraphFrame model supports user-defined attributes within each vertex and edge. The GraphFrames API provides the same operations as the DataFrame API, such as map, select, filter, join, and others. It has a set of built-in graph algorithms, such as breadth-first search (BFS), label propagation, PageRank, and others. The GraphFrames and DataFrame APIs are based on the concept of a Resilient Distributed Dataset (RDD), which is an immutable collection of records partitioned across a number of computers or nodes. To provide fault tolerance, each RDD is logged to construct a lineage Dataset (data lineage (Tang et al., 2019)).

nameVerticesDF            roleEdgesDF
+---+-------+             +---+---+--------+
| id|   Name|             |src|dst|    role|
+---+-------+             +---+---+--------+
|  1|  March|             |  1|  2|daughter|
|  2| Brenda|             |  1|  3|  father|
|  3|    Jim|             |  1|  4|  mother|
|  4|  Cindy|             |  1|  5|     son|
|  5|Brandon|             +---+---+--------+
+---+-------+
Figure 2. March Family GraphFrame

When a data partition of an RDD is lost due to node failure, the RDD can recompute that partition with the full information on how it was generated from other RDD partitions (Apache, 2019).

2.2 Model Transformations using Graphs

A directed graph may be represented by G(V, E), where V represents the set of vertices and E the set of edges of the graph G.
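The March family GraphFrame of Figure 2 is itself a small directed graph, and its vertex/edge split can be simulated in plain Python (this is not the GraphFrames API): a vertex table keyed by id, an edge table with src/dst columns, and a join resolving ids to names.

```python
# Sketch of Figure 2's data: vertex and edge tables, plus a join that
# resolves each edge into a readable (source, role, destination) triple.
vertices = {1: "March", 2: "Brenda", 3: "Jim", 4: "Cindy", 5: "Brandon"}
edges = [
    {"src": 1, "dst": 2, "role": "daughter"},
    {"src": 1, "dst": 3, "role": "father"},
    {"src": 1, "dst": 4, "role": "mother"},
    {"src": 1, "dst": 5, "role": "son"},
]

# Matching edge src/dst ids against vertex ids, as a GraphFrame query would.
triples = [(vertices[e["src"]], e["role"], vertices[e["dst"]]) for e in edges]
print(triples[0])  # ('March', 'daughter', 'Brenda')
```

The same id-matching underlies the joins that GraphFrames performs between its vertex and edge DataFrames.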
A sub-graph S of a graph G is a graph whose vertex set V(S) is a subset of V(G), that is, V(S) ⊆ V(G), and whose edge set E(S) is a subset of E(G), that is, E(S) ⊆ E(G). Extensions of this basic representation have been proposed to define the graph as a data model (Junghanns et al., 2016; Barquero et al., 2018). Graphs are useful for modeling computational problems. They can be adopted to model relationships among objects. A graph can be used as a representation format for models, enabling the abstraction of model features. In model transformation processes, graphs can be used to translate instances from one modeling language to another, since the structures of a language can be represented by a type of graph. The Triple Graph Grammars approach (Schürr, 1995) is a way to specify translators of data structures and to check their consistency. In addition to model transformation, there is a variety of graph-based algorithms used for processing graph models in different domains, such as complex network structures, network analysis, business intelligence, and others (Junghanns et al., 2016; Löwe, 2018). Graph transformation has been widely used for expressing model transformations, since graphs are well suited to describe the underlying structures of models and meta-models. Operations are implemented as model transformations solving different tasks. A transformation is a set of rules that describe how a model in the source language can be transformed into a model in the target language (Rutle et al., 2012). The extraction is a process that transcribes model/meta-model elements from the native source platform to the target platform (Jia and Jones, 2015). This is necessary mainly when the input model comes from a different technical space (e.g., the input model is in the XMI format and the transformation platform works on data collections).
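The sub-graph definition above can be stated directly as a check. The following plain-Python sketch (illustrative, not from the paper's implementation) verifies the two containment conditions plus the requirement that every edge of S joins vertices of S.

```python
def is_subgraph(sub_v, sub_e, v, e):
    """Check S ⊆ G: V(S) ⊆ V(G), E(S) ⊆ E(G), and every edge of S
    connects vertices that belong to V(S)."""
    return (sub_v <= v and sub_e <= e
            and all(s in sub_v and d in sub_v for s, d in sub_e))

# The March family graph from Figure 2 as vertex-id and edge sets.
V = {1, 2, 3, 4, 5}
E = {(1, 2), (1, 3), (1, 4), (1, 5)}
print(is_subgraph({1, 2, 3}, {(1, 2), (1, 3)}, V, E))  # True
print(is_subgraph({1, 2}, {(1, 5)}, V, E))             # False: endpoint 5 not in V(S)
```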
3 A Data-centric Approach for MT

In a previous work (Camargo and Fabro, 2019), we presented a study on applying a data-centric language called Bloom (Alvaro et al., 2011) to develop model transformations. There are three major differences from the previous study to this paper: a) we defined a specific format based on RDF (W3C, 2014), and we used it in the injection/extraction operations for translating the source model into a new modeling domain; b) we implemented the RDF models in data collections and specified transformation rules, mapping the source and target meta-model and model elements as Ruby classes; and c) we chose the Bloom language, a data-centric declarative language, since it is based on collections (unordered sets of facts) and provides implicit parallelism. On the other hand, the use of the data-centric approach and parallel model transformations are the main similarities between these works. The approach proposed in this work is built on top of the Apache Spark framework, using Dc aspects such as high-level programming and parallel/distributed environments, and considering that a model element is a set of data. It allows extracting models and meta-models in different formats and transforming them into a directed graph, which is assigned to a GraphFrame. The transformation output is the input to process graph operations and model transformations. In order to improve the performance of transformation executions, we use two different strategies for partitioning models from the GraphFrame. Figure 3 shows an overview of our approach. There are arrows between Spark components, mainly in the Spark Context, which is responsible for managing all executions on the Spark framework. The arrows among the approach modules (2, 3, and 4) represent the interaction between them and their outputs, forming a workflow. All the steps of the workflow are automated, except for the Operation on Graph for the partitioning of models (semi-automated).
We describe these steps in the next sections. The Driver Node controls the execution of a Spark Application and maintains all states of a Spark cluster. It exchanges messages with the Cluster Manager in order to obtain physical resources and launch executors (Worker Nodes). The Executor is the process that performs the tasks assigned by the Spark driver. The Executors have the responsibility to receive the tasks (Task) assigned by the driver, run them, and report back their state and results. The interaction between the Worker Nodes and the Spark Context is supported by a Cluster Manager, which is responsible for maintaining a cluster of machines (nodes) that will run one or more Spark Applications (Chambers and Zaharia, 2018; Apache, 2019). In our approach, modules 2 and 3 are executed on the Driver Node. The Injector module is responsible for extracting the input model to the DataFrame, which is transformed into a GraphFrame by the Model Translator module. The Model Transformation (module 4) is executed on Worker Node(s). For Module 3, we create a meta-model to instantiate the result of the translation of the input model to a graph model. It is necessary for assuring the conformance and consistency of the translation output. Such meta-model is based on the GraphDB meta-model proposed by Daniel et al. (2016), which focuses on NoSQL graph databases. Figure 4 depicts our Graph Meta-model, where GraphElement represents all elements of a graph. Its sub-types, GraphVertex and GraphEdge, express the vertices and edges, respectively. A GraphVertex has an Id attribute, meaning that each vertex is unique. Also, there are type and value attributes to represent the model element properties, forming a triple.
In contrast, the GraphEdge type has a string attribute key for identifying the elements from src and dst links, which are represented by the src (source) and dst (destination) associations between the GraphVertex and GraphEdge classes. We use the Graph Meta-model as a schema to instantiate model elements and their relationships by means of the GraphVertex and GraphEdge classes. Their properties, such as attributes and associations, indicate the model element structures. GraphVertex and GraphEdge classes are instantiated into a GraphFrame, and from the GraphFrame it is possible to specify operations and queries to manipulate them. An instance of the Graph Meta-model is shown in Sub-Figures 5a and 5b.

Figure 3. An Overview of the Data-centric Approach for MT
Figure 4. Graph Meta-model

A set of operations over the graph elements of a GraphFrame can be executed, such as the Motif algorithm to split a graph into sub-graphs, graph degree to compute the valency¹ of a vertex in a graph, queries, and others. In addition to such executions, the Model2GraphFrame (M2G) output is also used as input by the Model Transformation module, which transforms the input model elements in a directed-graph format to the target model. In the next sections, we present the steps to extract and transform models, as well as two alternatives for model partitioning.

3.1 Extracting model elements into a DataFrame

The initial step consists of the extraction of the input model elements into a DataFrame model. It starts when the user submits (1 in Figure 3) the input model, with its name and location (path), to the Driver Node. The Injector Module (2 in Figure 3) assigns the input model, in formats such as XMI or JSON, to a variable (modelPath), which is read for loading the input model. Next, the input model is parsed (DataFrame API) and its elements are assigned to a DataFrame (modelDF).

¹The valency of a vertex of a graph is the number of edges that are incident to the vertex.
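The parsing step just described, from an XMI-like input to rows of a tabular structure, can be approximated in plain Python. The sketch below uses the standard xml.etree parser on an assumed, illustrative XMI-like excerpt of the Families model (the element and attribute names are hypothetical; Spark would instead infer this structure while reading the file).

```python
import xml.etree.ElementTree as ET

# Assumed XMI-like excerpt for the Families model (structure is illustrative).
XMI = """
<Families>
  <Family lastName="March">
    <father firstName="Jim"/><mother firstName="Cindy"/>
    <sons firstName="Brandon"/><daughters firstName="Brenda"/>
  </Family>
</Families>
"""

def to_rows(xml_text):
    """Parse each Family element into a row (dict): one scalar column for
    lastName, one multivalued column per child tag."""
    rows = []
    for fam in ET.fromstring(xml_text).iter("Family"):
        row = {"lastName": fam.get("lastName")}
        for child in fam:
            row.setdefault(child.tag, []).append(child.get("firstName"))
        rows.append(row)
    return rows

print(to_rows(XMI)[0]["lastName"])  # March
```

Each resulting dict plays the role of one DataFrame Row, and the nesting of child tags mirrors the struct/array columns of the inferred schema.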
Every DataFrame has a schema describing its data structures, such as those of the input model. Thus, a schema is formed according to the input data structures. Listing 2 shows an example of a DataFrame schema. We choose to use the DataFrame in this step due to its schema: it preserves the input data structures, easing the translation of the input models to the GraphFrame through the reuse of these structures. Furthermore, it is not necessary to implement a parser for loading the input model into a DataFrame. We use the Family model excerpt from the ATL Zoo (Eclipse, 2019) to illustrate the extraction into the DataFrame, and we then describe how model elements are represented in a DataFrame. In Spark, the operations on data are made by means of Transformations and Actions. A Transformation is formed by a set of instructions to manipulate data, and an Action is specified to trigger the computation on data. When it is called, it notifies the Spark Engine to compute a result from a series of transformations (Chambers and Zaharia, 2018). Listing 3 illustrates the extraction result from the Family model (excerpt) in XMI format (Listing 1) to a DataFrame, whose structure is supported by the DataFrame Schema shown in Listing 2.

Listing 1: Model Families Excerpt
...

Listing 2: Family Schema Excerpt
root
 |-- Family: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- lastName: string (nullable = true)
 |    |    |-- daughters: struct (nullable = true)
 |    |    |    |-- firstName: string (nullable = true)
...

Listing 3: DataFrame Family Excerpt
+---------+--------------------+---------+----------+--------------+
| lastName|           daughters|   father|    mother|          sons|
+---------+--------------------+---------+----------+--------------+
|    March|        [[, Brenda]]|  [, Jim]| [, Cindy]| [[, Brandon]]|
...

According to Figure 3, the model elements are structured in a set of columns with an unspecified number of rows, since a schema defines only the column names and types of a DataFrame.
The rows are unspecified because the reading of the model elements is a lazily evaluated operation (lazy evaluation (Michael L., 2016)). The schema does not require the rows to be identified explicitly. Although a DataFrame schema can be specified manually, we opt for the schema generated by the parser during the read operation of the input model (Extraction step). In this schema, the structures of the input model elements are preserved in a tree format by the translation process. Listing 2 has a translation example, where the DataFrame Schema is structured by the element root and its rows are represented by the Family element. The multivalued elements are represented by arrays (array), and their elements are represented by structs that may have one or more elements, including null values (containsNull). These elements represent the leaves (e.g., lastName) and have a type (e.g., string). All elements represented in the DataFrame Schema have the nullable attribute assigned as true by default. This fits the Spark framework, which handles DataFrame columns with the nullable attribute set to true or false. The columns are logical constructions that represent a value computed by means of programmatic expressions. Thus, to have a real value for a column, we need to have a row, and consequently a DataFrame. Therefore, since the input model was translated to a DataFrame, it can be transformed according to the transformation domains of the user.

3.2 Translating the input DataFrame to GraphFrame

In a second step, the Model Translator Module (3 in Figure 3) translates the input model, which was assigned to a DataFrame, into a GraphFrame. We use the model elements in the DataFrame as input to the Model Translator.
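The Transformation/Action distinction and the lazy evaluation discussed above can be illustrated with Python generators (a loose analogy, not the Spark mechanism): building the pipeline performs no work, and only a terminal "action" forces the reads.

```python
# Sketch: lazy evaluation via generators. Defining the pipeline (the
# "transformation") reads nothing; consuming it (the "action") does.
def read_rows():
    for name in ["March", "Sailor", "Camargo"]:
        print(f"reading {name}")       # side effect shows when work happens
        yield {"lastName": name}

# Transformation: a plan; no row has been read yet.
plan = (r for r in read_rows() if r["lastName"].startswith("S"))

# Action: triggers the reads and materializes the result.
result = list(plan)
print(result)  # [{'lastName': 'Sailor'}]
```

In Spark the analogue is a chain of DataFrame transformations that is only executed when an action such as show() or collect() is called.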
In addition to the elements, the schema associated with the DataFrame, which describes the model element structures, is essential for our Model Translator, since we use it for reproducing these element structures in a graph, assigning them to the GraphFrame. We create an algorithm for translating a DataFrame to a GraphFrame, conforming to the meta-model of Figure 4. Algorithm 1 is responsible for this translation. As input, the algorithm receives a DataFrame, which is processed by combining its content and schema. Algorithm 1 contains the functions model2GraphFrame and model2GraphSchema. The source code of the functions is available online². Since the modelDF DataFrame contains all model elements, it is passed as a parameter to the model2GraphFrame function, which is responsible for starting the transformation process. For simplicity's sake, we omit the specification of the model2GraphSchema function. In Algorithm 1 (line 2), the model2GraphSchema function is called with the model elements and the DataFrame schema as parameters. It performs the processing of the model elements and their structures together with the respective schema columns of the DataFrame in a recursive way, assigning its result to the verticesDF and edgesDF DataFrames (lines 3 and 4). We use the wildcard parameters (_1 and _2) and the toDF function with its parameters, the respective DataFrame columns ("id", "value"). Thus, the first elements are separated into the verticesDF DataFrame and the remaining elements into the edgesDF DataFrame. Both DataFrames shape the vertices and edges and are assigned into the GraphFrame (GF, line 7) by the model2GraphFrame function.

²https://github.com/lzcamargo/extracSpk

(a) GraphFrame Vertices (b) GraphFrame Edges
Figure 5.
Family Model Elements Translated to a GraphFrame

Algorithm 1 M2G Translation Algorithm
Input: modelDF : DataFrame
Output: GF : GraphFrame
1: function model2GraphFrame(modelDF)
2:   graphData ← model2GraphSchema(modelDF.collect, modelDF.schema, 0)
3:   verticesDF ← graphData._1.toDF("id", "value")
4:   edgesDF ← graphData._2.toDF("src", "dst", "key")
5:   return (verticesDF, edgesDF)
6: end function
7: GF ← model2GraphFrame(modelDF)

We use some Family model elements (Listing 2) as input to present a translation example (an execution of Algorithm 1). To access the vertex and edge contents, we execute the commands GF.vertices.show() and GF.edges.show(). Their outputs are represented in Figures 5a and 5b. The values of the Family model elements from the DataFrame are instantiated into graph vertices. The model element names are assigned to graph edges as keys. The links (src and dst) among vertices and edges establish the relationships of the model elements. In Figure 5, we use circles and rectangles for illustrating the model element structures and their relationships. For example, the vertices and edges marked in red demonstrate the structure of the lastName Sailor element, and the blue ones denote the firstName David element. The relationship between these two elements is marked on an edge (Figure 5b), where the src column value is noted in red and the dst column value in blue. The join of these structures (the match between the id, src, and dst columns) allows us to identify that David is a son (sons), and belongs to the Sailor family. Thus, the model elements are structured into GraphFrames so that they can be queried and processed for different purposes. In the first two steps, we obtain the extraction of the input model to the modelDF DataFrame and its translation to the GraphFrame GF. We consider the result of these operations as the transformation of the input model to a graph, in particular the Model2GraphFrame transformation.
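The recursion performed by model2GraphSchema can be approximated in plain Python (the actual implementation is Scala, available in the repository linked above; this sketch makes its own simplifying assumptions). Each scalar value becomes a vertex (id, value); each field name becomes an edge key linking parent to child.

```python
def to_graph(value, key="root", counter=None, vertices=None, edges=None, parent=None):
    """Recursively flatten a nested structure into vertex and edge tables,
    loosely following the paper's scheme: vertices are (id, value) pairs,
    edges are (src, dst, key) triples keyed by the field name."""
    if counter is None:
        counter, vertices, edges = [0], [], []
    vid = counter[0]
    counter[0] += 1
    if isinstance(value, dict):
        vertices.append((vid, key))
        for k, v in value.items():
            to_graph(v, k, counter, vertices, edges, vid)
    elif isinstance(value, list):
        vertices.append((vid, key))
        for item in value:
            to_graph(item, key, counter, vertices, edges, vid)
    else:  # scalar leaf: the model element value becomes the vertex value
        vertices.append((vid, value))
    if parent is not None:
        edges.append((parent, vid, key))
    return vertices, edges

model = {"lastName": "Sailor", "sons": ["David"]}
V, E = to_graph(model)
print(V)  # [(0, 'root'), (1, 'Sailor'), (2, 'sons'), (3, 'David')]
print(E)  # [(0, 1, 'lastName'), (2, 3, 'sons'), (0, 2, 'sons')]
```

Joining E against V on the src and dst ids reconstructs statements such as "David is a son in the Sailor family", as in Figure 5.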
In the next steps, we use the GraphFrame contents for Model Partitioning and Model Transformations.

3.3 Model Partitioning

In this step, we present the two strategies that we use for partitioning models from the GraphFrame: one based on the model key-element names with the Motif algorithm, and another using clustering. First we present their implementation; in the next section, we present a proof of concept using these strategies. We choose the first strategy because it allows us to use the transformation rule names with an algorithm implemented in the GraphFrames API itself, in this case the Motif algorithm. Regarding clustering, we choose it to link the model elements in clusters by means of the related vertices (src to dst) in the edges contained in the GraphFrame. We use the clusters as parameters for the Spark framework partitions in the processing of the Model Transformations. In a graph, a motif can be defined as a pattern of interconnections of edges that occurs in a graph (Milo et al., 2002). We are interested in finding patterns in a graph for a given purpose, forming sub-graphs as partitions of this graph. Thus, we consider the following definition: a graph G′ = (V′, E′) is a sub-graph of a graph G = (V, E) if V′ ⊆ V and E′ ⊆ E ∩ (V′ × V′). If G′ ⊆ G and G′ contains all of the edges ⟨u, v⟩ ∈ E with u, v ∈ V′, then G′ is an induced sub-graph of G. In our context, consider a scenario with the following transformation rule names: Package2Schema, Class2Table, Att2Col, and Family2Person. From each rule name, we use its prefix (i.e., Package, Class, Att, and Family) as a parameter (key-element) in graph partitioning using the Motif algorithm, particularly for the key column of the edges. This means that these prefixes are points of interest in the graph.
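The idea of partitioning by rule-name prefixes can be sketched without Spark. Below, a plain-Python illustration (edge keys and ids are invented for the example) groups edges by their key, yielding one edge set per prefix from which an induced sub-graph can be built.

```python
# Sketch: partition a graph's edges by rule-name prefix (key-element),
# in the spirit of the Motif-based strategy. Data is illustrative.
edges = [
    (1, 2, "Package"), (2, 3, "Class"), (3, 4, "Att"),
    (2, 5, "Class"), (5, 6, "Att"),
]
prefixes = ["Package", "Class", "Att"]

def partition(edges, prefixes):
    """Group (src, dst) pairs by edge key; each group seeds one sub-graph."""
    parts = {p: [] for p in prefixes}
    for src, dst, key in edges:
        parts[key].append((src, dst))
    return parts

parts = partition(edges, prefixes)
print(parts["Class"])  # [(2, 3), (2, 5)]
```

Each edge group, together with the vertices its edges touch, forms one partition handed to a transformation rule.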
In a GraphFrame, Motif Finding is implemented in a Domain-Specific Language (DSL) for expressing structural queries. For example, graph.find("(a)-[e]->(b); (b)-[e2]->(a)") will search for pairs of vertices a, b connected by edges in both directions. It will return a DataFrame of all the structures in the graph, with columns for each of the named elements (vertices or edges) in the motif. The returned columns will be the vertices a, b and the edges e, e2 (Apache, 2019). We specify the sub-graph extraction combining Motif Finding and a filter. This means that, depending on the input model, it is necessary to adjust the Motif algorithm parameters and/or the filter, which makes the model partitioning semi-automated. Listing 4 shows the implementation in Spark Scala for the Class elements through the tag "classes", which were mapped to the key column of the edgesDF DataFrame. Graph motifs are patterns that occur repeatedly in graphs and represent the relationships among the vertices. In a GraphFrame, Motif Finding uses a declarative DSL for expressing structural queries for finding patterns among edges and vertices by means of the find() function. Therefore, we choose it for easing the sub-graph extractions. We believe that its characteristics can generate consistent sub-graphs from key model elements (rule name prefixes). Line 3 of Listing 4 is the specification of a query for searching for pairs of vertices between (a,b), (b,c), and (c,d), which are respectively connected by the edges e, ea, and eb. We also use a filter for delimiting the vertex pairs, starting from an edge whose key property element is equal to the tag "classes". This means that the execution of this expression will return, as motifsDF, all the structures (vertices and edges) related to the filtered property (classes) on the graph, which are arranged in the a, e, b, ea, c, eb, and d columns.
We select the edges contained in motifsDF and assign them to the immutable variable subE (line 5). We use them as the edges of the subG sub-graph, whose vertices are the same as in the GF graph. We apply the dropIsolatedVertices() function to exclude isolated vertices (i.e., vertices with degree zero, if any), ensuring that every vertex in the subG sub-graph is linked by an edge. In this case, Listing 4 allows us to get all the Class elements and their associated elements from the GraphFrame that represents a Class model, producing a sub-graph. Listings 11 and 12 show an example of the edges and vertices of a sub-graph (S-G) resulting from Listing 4. This example and the results of the other Motif specifications for the model key-elements, such as Package, Att, Female, and Male, are presented in Section 4.

Listing 4: Motifs Sub-Graph Extraction
1 object SubGraph {
2   def main(args: Array[String]): Unit = {
3     val motifsDF = GF.find("(a)-[e]->(b); (b)-[ea]->(c); (c)-[eb]->(d)")
4       .filter("e.key = 'classes'")
5     val subE = motifsDF.select("eb.src", "eb.dst", "eb.key")
6     val subG = GraphFrame(GF.vertices, subE)
7       .dropIsolatedVertices()
8   }
9 }

Now we present the utilization of clustering as a strategy, implemented with Infomap from the MapEquation framework (Bohlin et al., 2014). There are other alternatives for such an implementation, such as the k-means algorithm (MacQueen, 1967), one of the most commonly used clustering algorithms. We could also adapt Apache Spark's machine learning library, MLlib, which provides various ML-based operations, including clustering. Infomap is a fast, stochastic, and recursive search algorithm that applies the Louvain heuristic (Blondel et al., 2008), based on the optimization of modularity.
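The MLlib alternative is not the implementation used in this paper, but it could be sketched as follows, assuming the GraphFrame edges are reshaped into numeric src/dst columns (the edgesDF name and the parameter values are our assumptions):

```scala
import org.apache.spark.ml.clustering.PowerIterationClustering
import org.apache.spark.sql.functions.lit

// Sketch only: edgesDF is assumed to already hold numeric src/dst ids.
val similarities = edgesDF.select("src", "dst").withColumn("weight", lit(1.0))

val pic = new PowerIterationClustering()
  .setK(10)          // number of clusters (assumed value)
  .setMaxIter(20)
  .setWeightCol("weight")

val assignments = pic.assignClusters(similarities) // columns: id, cluster
```

Such a route would keep the clustering inside the Spark pipeline, avoiding an export/import round trip; Infomap, in contrast, runs outside Spark.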
When Infomap is executed with the vertices and edges of a graph, neighboring nodes are joined into modules, which are subsequently joined into super-modules, and so on, clustering tightly interconnected nodes into modules. Infomap has been used in community partitioning problems (Aslak et al., 2018; Edler et al., 2017), for detecting communities in large networks, and to help in the analysis of complex systems. In addition, Infomap operates on graph structures in the Pajek format (file.net)3, which can be easily extracted from the GraphFrame as input to Infomap. For example, Listing 5 shows an excerpt of the File.net extracted from the Class-0 model, and Listing 6 shows the .clu output file with the clustering result, where the nodes are gathered into their respective clusters (node and cluster columns). The flow column contains the per-node flow values, but they are discarded when the .clu file is loaded into a DataFrame and used for clustering the model elements. However, clustering from a GraphFrame using the Infomap framework is a semi-automated operation, since we do not implement an integration between our approach and the Infomap framework (Operations on Graph, Figure 3).

3https://gephi.org/users/supported-graph-formats/pajek-net-format/

Listing 5: Class-0 File.net
*Vertices 50031
0 0
1 1
2 2
...
*Arcs 50030
1 2
4 5
4 6
...

Listing 6: Clustering Nodes
# node cluster flow:
8 1 0.0457141
7 1 0.00261991
10 1 0.00261991
6 1 0.00222776
9 1 0.00222776
5 1 0.00195755
11 1 0.027326
46 1 0.00233907
...

We present the use of Infomap and the model partitioning in Section 4.

3.4 MT using GraphFrame

In the last step, we specify a set of operations and transformation rules to transform the source model in the GraphFrame into a target model.
They are executed as parallel tasks on the Worker Nodes of the Spark framework, through the Model Transformation module (4 in Figure 3). The source code of the operations and transformation rules is available online4. Listing 7 shows the Family2Person rule written in Scala as a singleton object (object Family2Person). We separate the male and female elements into the maleEdgesDF and femaleEdgesDF DataFrames. They contain the target values (dstm, dstf, dst) that link each last name with its first names. We use the select, join, and filter functions to select the last and first names from maleEdgesDF. For each join operation, we use the filter function (lines 4, 6, 12, and 14) to ensure the accurate selection of model elements, since they are formed by relationships among edges and vertices ("dstm" === "id"). In lines 7 and 15, we use the select and concat functions to assign the last name (lastName) and the respective first names (value) as the full name (fullName column) to the maleFullNamesDF DataFrame.

Listing 7: F2P Rule
1 object Family2Person {
2   val maleFullNamesDF = maleEdgesDF
3     .select($"dstm", $"dst").join(GF.vertices)
4     .filter($"dstm" === $"id")
5     .select($"value".alias("lastName"), $"dst")
6     .join(GF.vertices).filter($"dst" === $"id")
7     .select(concat($"lastName", lit(" "), $"value")
8       as "fullName")
9
10   val femaleFullNamesDF = femaleEdgesDF
11     .select($"dstf", $"dst").join(GF.vertices)
12     .filter($"dstf" === $"id")
13     .select($"value".alias("lastName"), $"dst")
14     .join(GF.vertices).filter($"dst" === $"id")
15     .select(concat($"lastName", lit(" "), $"value")
16       as "fullName")
17 }

For the femaleFullNamesDF DataFrame (lines 10 to 16), we apply the same idea used for the maleFullNamesDF DataFrame. These DataFrames are merged (union function) into the personDF DataFrame, each one with a new Gender column (withColumn("Gender")) to ensure the gender distinction among persons.
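This merge step can be sketched as follows (a minimal sketch assuming the two DataFrames from Listing 7 are in scope; the literal gender values are our assumption):

```scala
import org.apache.spark.sql.functions.lit

// Tag each DataFrame with its gender and merge them into personDF.
val personDF = maleFullNamesDF.withColumn("Gender", lit("Male"))
  .union(femaleFullNamesDF.withColumn("Gender", lit("Female")))
```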
4https://github.com/lzcamargo/transformSpk

Next, we specify an operation using the coalesce(1) method to instantiate the transformation output in a single partition. This means that the output tasks are reduced to a single partition (a distinct output) as the final result of the transformation. The example in Listing 8 is obtained with the write function and the tags (root and row) of the databricks:spark-xml library, indicating that the output format was set to xml. We separate these commands (write operations on the target model) from the loading rules for better code legibility. Since the target model is stored in a repository, the output in xml/xmi format can be loaded and instantiated back into a GraphFrame. Listing 8 shows a portion of the content of the persons.xml file. It represents the result of the Family2Person transformation, using the Family model presented in Listing 1 as the source model.

Listing 8: Persons Model Excerpt
<Persons>
  <gender>Male</gender>
  <fullName>March Jim</fullName>
  <gender>Male</gender>
  <fullName>Sailor Dylan</fullName>
  <gender>Female</gender>
  <fullName>March Cindy</fullName>
  <gender>Female</gender>
  <fullName>March Brenda</fullName>
...

In this section, we described our approach. In the next section, we perform the proof of concept in order to validate its feasibility.

4 Implementation

We implemented a Proof of Concept (PoC) (Kendig, 2016) using GraphFrames to demonstrate the feasibility of our approach and to show its usefulness under the following aspects: the processing of Model2GraphFrame outputs, the partitioning of the graphs contained in the GraphFrame, the connectivity among model elements in a set of GraphFrames, and the execution of model transformations using GraphFrames. We run the PoC on a single machine with the following software stack: Ubuntu 18.04, Spark 2.4, and Scala 2.3. It is hosted on an Intel Core i5-4210U 1600 CPU with 8096 MB of RAM; the processor has two cores. As input, we use both the Class and Family models in XMI format.
There are four models with the following specifications:
• Class-0, a class model with no attributes or methods, only Package and Class elements. This kind of model is used in Domain Modeling and is useful to understand the ideas and concepts of the domain (Larman, 2004);
• Class-3, a class model with Package and Class elements, where each Class contains from 1 to 3 methods and attributes;
• Class-6, as the previous item, but each Class contains from 1 to 6 methods and attributes;
• Family, a model with 0 to 3 sons and daughters. Its elements are self-contained in LastName elements and their attributes.

We get the Class models from5, each one with 10000 classes. They were created to be used as a benchmark for the Class2Relational transformation case studies in parallel transformations using Lintra (Burgueno et al., 2015)6. These models have references among their elements established by attributes. For instance, the Class-0 model has 10 Package elements, and each Package has 1000 Class elements. The Family model has 10000 LastName elements, which we created for this proof of concept. We consider the elements of the Class-0 and Family models as self-contained. However, other models (Class-3 and Class-6) contain, besides self-contained elements, inter-connected elements, where Class elements are referenced by one or more Class elements contained in other Packages. Attributes such as super and type establish such references. The models used in the PoC have different densities (Class-0, Family, and Class-6) and interconnectivity (Class-3 and Class-6) among their elements. This means that we validate our approach with respect to these model aspects.
To measure the execution times in seconds, we use the System.currentTimeMillis() function of the Scala language, on a dedicated machine with no UI interactions. Once the input model elements are extracted to a GraphFrame, they must be available for processing: each model element in the GraphFrame vertices has to be linked to its properties through the GraphFrame edges. We have defined three research questions to validate the PoC implementation and its main aspects.

Q1: How to check whether the Model2GraphFrame output is available for processing?

To address this question, we use the directed-graph property (DGP) to check the totals of edges and vertices in a directed graph G: |V(G)| − 1 = |E(G)|, i.e., the total number of vertices minus 1 is equal to the total number of edges. When this property holds, the directed graph is considered a simple directed graph (Hochbaum, 2008). A directed graph is no longer simple if there are multiple edges or loops; in that case, |V(G)| − 1 < |E(G)|. In addition, we execute a set of queries on the GraphFrame to validate the contents of vertices and edges, whose input models contain 100 classes and 100 families. This means that we take a set of model elements contained in the GraphFrame and compare it with the elements of its input model. Although the M2G outputs are directed graphs in a GraphFrame, we need to know whether it is feasible to use them in model transformations. To address this issue, we define question Q2.

Q2: Is it possible to perform MT using GraphFrame?

We address this question in order to use GraphFrame in Model Transformations. Our goal is to verify how the source models in GraphFrames can be transformed into target models. We specify operations and rules using methods and functions in Scala for manipulating vertices and edges in a GraphFrame (e.g., Listing 7).
5http://atenea.lcc.uma.es/Descargas/MTBenchmark/classModels
6http://atenea.lcc.uma.es/index.php/Main_Page/Resources/LinTra

They are similar to transformation specifications in ATL, the ATLAS Transformation Language (Jouault et al., 2008), where Helpers and Transformation Rules are the constructs used to specify the transformation functionality. Finally, the last question concerns the performance of MT executions using clusters.

Q3: Does executing model transformations using model partitioning improve performance?

We address this question in order to verify whether executing model transformations with model partitioning improves performance, since we adopted two partitioning strategies in this approach: partitioning the input model in the GraphFrame into sub-graphs, and generating clusters from the GraphFrame vertices. In the following sections, we present the proof of concept, the results, and the answers to the above questions, as well as further discussions.

4.1 Processing Model2GraphFrame Outputs

To check the GraphFrame outputs with respect to the input models, we obtain the totals of vertices and edges and use the DGP to check them. Columns V(G) and E(G) of Table 2 show the totals of vertices and edges of the input models (Model column). The number of vertices V(G) minus 1 is equal to the number of edges E(G) for the Class-0 and Family models, demonstrating that they are simple directed graphs. However, the totals of vertices and edges of the Class-3 and Class-6 models indicate that they are not simple directed graphs (V(G) − 1 < E(G)). In addition, we execute queries such as the one shown below, and their results are compared to the input model elements to validate the M2G consistency. The query returns the values of class properties such as name, isAbstract, and visibility from the GraphFrame vertices. It does not return Attributes and Methods, because the key-element (key) is assigned the "classes" value.
gf.edges.where($"key" === "classes")
  .select($"dst".as("dstv")).join(gf.edges)
  .filter($"dstv" === $"src").select($"dst")
  .join(gf.vertices).filter($"dst" === $"id").show()

Listings 9 and 10 show excerpts of the Class-0 model elements and of the query output. They represent an example of our validation. In this case, the relations among classes and their properties are established by the GraphFrame edges (gf.edges src and dst), whereas the value of each property is assigned to the GraphFrame vertices (gf.vertices).

Listing 9: Class-0 Model