1. Article JISIB 2012 - Version publiée 32 Multiversion Document Warehouse: An Approach to Multidimensional Analysis Kaïs Khrouf*, Jamel Feki*, Chantal Soulé-Dupuy** * MIR@CL Laboratory - University of Sfax - Tunisia, ** IRIT - University of Toulouse I – France Received 15 December 2010; received in revised form 2 March 2012; accepted 27 April 2012 ABSTRACT: Document warehouses allow the storage of selected and filtered heterogeneous documents, as well as their exploitation through multidimensional analyses techniques. However, the content of documents is dynamic and changes across time. In practice, decisional analysts may be interested with various versions of documents. Thus, the document warehouse should store and manage these versions. This paper presents an extended generic model for document warehouses allowing the management of the multiversion documents. In addition, it interests with multidimensional analysis on documents versions. Keywords: Document warehouse, Multiversion documents, Multidimensional analyses 1. Introduction Nowadays, Internet allows an exponential evolution of data volumes stored and exchanged among organizations. These evolutions raise new problems: How to deal with changes undergone by documents? What are these changes and how to detect them? For instance, a user revisiting a document might want to be informed of the document changes since his last visit. In order to maintain various versions of the same warehoused document, we need the concept of document warehouse. The author of (Khrouf & Soulé-Dupuy, 2004) defined the document warehouse as a source of information that is subject-oriented, filtered, integrated, archived (versions), and organized for a process of retrieval, interrogation or analysis. According to this definition, documents integrated in the warehouse could be historized (i.e., retain their evolution over time through different versions). In order to reach this objective, we propose an extension for the document warehouse meta-model defined in (Khrouf, Feki and Soulé- Dupuy, 2011). This extension is expected to manage content changes (i.e., when the document content is modified) and structural changes (i.e., when the document structure changes) that can undergo one document or class of documents. The extended meta-model allows applying techniques of multidimensional analyses on multiversion documents. We distinguish two types of analysis: i) Multiversion analysis, i.e., analysis covering all versions for the same document, and ii) Recent-version analysis; i.e., analysis relying on the last version of document(s). Available for free online at https://ojs.hh.se/ Journal of Intelligence Studies in Business 2 (2012) 32-40 33 This paper deals with the problematic of multiversion document warehouse; it is organized as follows. In section 2 we outline some works devoted to the management of multiversion documents. In section 3, we propose an extended meta-model for document warehouse and, in sections 4 to 6 we detail our approach of multidimensional analyses on multiversion documents integrated in the warehouse. Finally, we give an overview of our software prototype baptized DocWare (Document Warehouse). 2. Related works For the management of multiversion documents, several theoretical works have been proposed in the literature; furthermore, software prototypes have emerged. Nicolle, Alvarez & Amghar (2001) consider that the document is a set of independent fragments (parts). They distinguish two types of versions: a document version and a fragment version. In fact, the modification of certain document fragments creates new versions of fragments, and therefore a new version of the whole document. XyDiff (Cobéna Abiteboul & Marian, 2002) is a component of Xylème (Abiteboul, Cluet, Ferran & Rousset, 2002) to manage different versions of a document. Every modified item is represented as an XML file, stored in a data warehouse and indexed. These files are used thereafter to reconstruct previous versions of documents. XyDiff uses the tree structure of XML documents in order to detect movements and changes taking place on a document. X-Diff (Wang, DeWitt & Cai, 2003) is an algorithm for integrating the characteristics of XML structures with standard techniques of tree comparison in order to calculate the differences between two versions of an XML document. The main feature of this algorithm is that XML documents are modeled by unordered tree structures, unlike the work of XyDiff. Rusu, Rahayu & Taniar (2006) propose an approach for extracting rules from the changes of version of dynamic XML documents. Specifically, the authors propose an algorithm that studies the conduct of versions of XML documents in time and thus determines learning rules to predict document changes in the future. In our work, we are interested not only in the management of document versions (track and detect changes of the document evolution through time), but also for managing the versions of the collections of documents (set of documents gathered in the same class). In addition, we develop a multidimensional analysis approach for these multiversion documents. 3. Meta-model for document warehouses 3.1 Meta-model description The document warehouse should store pertinent documents in order to apply the multidimensional analyse on these documents; In addition, it should be able to manage the heterogeneity and support the evolution of structures and contents. To do so, we propose the meta-model of Figure 1. Figure 1: Meta-model for multiversion document warehouses Specific Structures (c) 1..* 1..* Associate 1 1..* 1..* 1 1..* GenElt NameGE CardGE VersGenElt DateVGE VersSpeElt DateVSE Information Content GenAtt NameGA SpeAttc NameSA ValueSA Include S_Include Define 0..* 0..* Generic Structures (b) Documents (a) 0..* 0..* S_Compose1 Compose 0..1 Contain {ordre} {order} {order} ItsGenStr 1..* Documents NameDoc Content (d) {order} 1 Ontologies (e) 1..* 1 1 Ontologies NameOnt Concept NameCpt belong 0..* S_Concept Assign0..* 0..1 1..* 1 GenStr NameGS VersGenStr DateVGS VersDoc DateVDoc 1..* ItsDoc 1 1..* ItsVersGenStr 34 This metadata includes the following components: •A set of documents (Figure 1.a) to be integrated in the document warehouse and their different versions (Figure 1.a). •The hierarchical structure of documents. It is made up of two types of structures: I. The generic structure (Figure 1.b): It is a common structure for a document set. It is composed of a set of versions each of which is defined by a set of versions of generic elements which can be composed of other versions of generic elements. Each of these elements can also be described by generic attributes for example book-Id. II. The specific structure (Figure 1.c): It is associated to a single document and has to be compliant/identical to one among the existing versions of generic structures. This structure is defined by a set of versions of specific elements that can include specific attributes. •The content (Figure 1.d) is the textual element of the specific structure. •The semantic layer (Figure 1.e) is defined using domain ontologies. In our context, ontology is composed of a set of concepts hierarchically organized where each leave concept is described by a set of keywords. 3.2 Example Figure 2 depicts a simple instantiation example for our meta-model of Figure 1. In this example, we manage three versions of the same document Doc1: •Doc1 is initially compliant to Version1 of the generic structure Article composed of Title and Content. •After changes made on the Content element, Doc1 belongs now to the new Version2 of Article. •After renaming the Content element to Section composed of two Paragraphs (i.e., Dimension and Fact), the new version of Doc1 is becoming conform to Version3 of the generic structure Article. Figure 2: An instantiation example for the meta-model in Figure 1.
DW DSS …
Doc1 Version1
DW OLAP …
Doc1 Version2 V1 Article V1 Title V1 Content Doc1 V2V1 V2 Article V2 Content DW DSS… OLAP…
DW

Dimension…

Fact…

Doc1 Version3 V3 V3 Article V1 Section V1 P V1 P Dimension… Fact… Discovery Knowledge Data Mining OLAP Design kw11 ... kw22 kw23 ... kw12 ... Data Warehouse kw11 ... 35 3.3 Meta-model advantages The meta-model we proposed has the following advantages: •Grouping heterogeneous documents having identical or similar structures into classes. This relies on an algorithm for comparing labeled tree structures (Ben Messaoud, Feki, Khrouf & Zurfluh, 2011) •Storing various versions of documents due to evolutions. •Adding up of semantics to the documents by linking the textual content to the concepts of domain-ontologies (Ben Meftah, Khrouf, Feki, Ben Kraiem & Soulé-Dupuy, 2011). •Applying multidimensional techniques on documentary information. This feature will be detailed in section four. 3.4 Meta-model implementation As shown in Figure 1, the meta-model is designed using the Unified Modeling Language (UML) object-oriented modeling. The meta-model implementation is carried out in an object relational DBMS (Oracle 10g). To ensure this translation, we have used the following transformation rules: •Classes are transformed into tables. •For one-to-many relationships implementation, we have two alternatives: use one mono-valuated link or one multi-valued link in the opposite direction. We opted for the mono-valuated link as they facilitate the generation phase of views necessary for the multidimensional analyses. Example 1 •We implement many-to-many relationships using multi-valuated links, specifically by using a list of references as nested tables. Example 2 •For inheritance, we opted for mono-valued links from subclasses to super-classes in order to separate the two structures, generic and specific. Figure 3: The navigational diagram of the proposed meta-model in Figure 1 VersDoc VersGenStr Id_Doc DateDoc ItsVGS Id_VGS DateVGS 319 04/02/2012 17 01/02/2012 716 05/04/2012 24 05/04/2012 1426 14/05/2012 VersGenStr VersGenElt Id_VGS DateVGS ItsVGE Id_VGE DateVGS 17 01/02/2012 67 01/02/2012 68 01/02/2012 24 05/04/2012 85 05/04/2012 InheritGE ItsGenStr Specific Structures (c) Associate GenElt Id_GE NameGE CardGE VersGenElt Id_VGE DateVGE VersSpeElt Id_VSE DateVSE Information Id_Cont Content GenAtt Id_GA NameGA SpeAttc Id_SA NameSA ValueSA Include S_Include Generic Structures (b) Documents (a) S_Compose Compose Contain Documents Id_Doc NameDoc Content (d) Ontologies (e) Ontologies Id_Ont NameOnt Concept Id_Cpt NameCpt Belong S_Concept Assign GenStr Id_GS NameGS VersGenStr Id_VGS DateVGS VersDoc Id_VDoc DateVDoc ItsDoc ItsVersGenStr InheriteGA InheriteVGE Define 36 3.5 Meta-model instantiation The integration of a document into the warehouse is accomplished through the three following steps: I. Extraction of the specific structure for the document by using a parser; it includes the document tags and its hierarchical structure. II. Comparison of the specific structure of the document with the generic structures stored in the warehouse. This step is accomplished through an algorithm which calculates a similarity degree to compare labeled tree structures (Ben Messaoud, Feki, Khrouf & Zurfluh, 2011). III. Insertion of the document content, information and list of keywords into the warehouse while linking the textual information to one or more concepts that also are characterized by keywords. We use the information retrieval techniques to perform this step (reference). 4. Multidimensional analyses The document warehouse is intended to allow decision-making. To do so, we adopt the multidimensional model (Kimball & Ross, 2002) that considers an analyzed subject as a point within a space having several dimensions. This model relies on the concepts of fact and dimension. The fact represents the subject to be analyzed as the number of articles and, the dimensions represent the context of recording the fact such as Author, publication Year and Conference. Dimensions are made up of attributes organized, from the finest to the greatest granularity, into hierarchies. Figure 3 describes our proposed multidimensional process to analyze textual information stored in the document warehouse. Figure 4: Multidimensional analysis process In following section, we detail the first two phases of this process. 5. Phase 1: Construction of the document mart schema Let us remember that a generic structure gathers a set of documents having identical or similar structures. The decision makers can focus on a generic structure to perform his/her analyses. The first step consists in (1) selecting the analysis context through the choice of the generic structure on which analyses will be applied, and then (2) selecting the type of analysis: Analysis covering all versions or relying only on the last version of documents. During step two, the decision-maker selects the multidimensional schema components, one fact and a set of related dimensions: •A fact represents a subject of analysis, composed of a set of attributes describing the business activity. These attributes are called measures or indicators and have numeric values. As an example, let us consider the fact Publication that has the measure Number of published articles. •The dimensions represent the analysis axes of measures. This means that the measures of an activity are observed according to these different dimensions. For instance, measures of the Publication fact can be analyzed according to the several dimensions as Author, Year, and Concept. In addition, the decision-maker indicates the order of dimensions and the aggregation function (Count, Sum, Max, Min and Avg) to be applied to the fact measures. In the third step, the decision-maker can select specific values or introduce predicates in order to filter data for analysis. We distinguish two types of data filtering: •Dimension filtering through which the user can select values on a dimension. •Fact filtering where the user restricts the values of the fact measures using the comparison operators (<, >, <>, <=, >=, =). Example: Let us analyze the number of Publications addressing the Data warehouse concept by Author and by Year. Construction of mart schema Warehouse Multidimensional Schema Document Mart Multidimensional Table Automatic Generation of mart Visualization Multidimensional Table 37 Figure 5: Affectation of analysis components Once all these document mart schema-components are defined, the next phase generates the document mart. In our approach, this generation is automatically performed. 6. Phase 2: Automatic generation of document mart The decision-maker task is now completed and the automatic generation produces a document mart instantiated from the warehouse. To simplify this generation, we decompose it into two complementary steps namely view generation for each analysis component, element or concept, and joining and grouping generated views. 6.1 Views generation for analysis component The first step is to recover the identifiers of the versions of documents belonging to the same generic structure and concerned by the analysis. •Multiversion analysis SELECT Id_VDoc FROM VersDoc VD WHRE VD.ItsVersGenStr.ItsGenStr.SaGS.NameGS = 'NameGS'; •Recent version analysis SELECT vd.ItsDoc.Id_Doc, Max(DateVDoc) FROM VersDoc VD GROUP BY VD.ItsDoc.Id_Doc; Secondly, we recuperate trough a sub-query three attributes: (1) The identifier of each document. (2) The identifier of the common ancestor of analysis components. (3) The concerned information. These sub-queries are merged by the SQL Union operator to obtain a single view. The sub-query the system generates is the following. SELECT 'Id_VDOC', (1) i.Associate.S_Compose.S_Compose....ID_VSE, (2) i.Content (3) FROM Information i (4) WHERE i.Associate IN (Select nt.AdrVSE From The (select vd.Contain From VersDoc vd Where ID_Doc= 'ID_Doc')nt);(5) --If the dimension is a generic element AND i.Associate.InheritVGE.InheriteGE.NameGE= 'NameGE' (6) --If the dimension is a concept AND i.contain.NameCpt='NameCpt' (7) Where: (1) Document identifier (2) Identifiers of specific elements those inherit from the first common ancestor of all analysis elements. (3) Content of the specific element. (4) Meta-model table name. (5) Selection of the specific elements belonging to the document ID_Doc. (6) Selected name of the generic element (when a dimension is based on a generic element). (7) Name of the concept on which a dimension is based. Note that the fact view is generated in the same way like dimensions; the S_Compose denotes the link between a specific element and its father Conference Name Year Language Thematics Thematic Dates Submission Notification Registration Conf Committee program Member Papers Paper Title Authors Abstract AuthorFact (Count) Dimension 2 Information System Database Cube Data Warehouse OLAP Dimension 1 Dimension 3 38 specific element so then the occurrences of S_Compose equal the number of levels between a chosen element and its ancestor. As an example, for the Year dimension (cf. Figure 5) and the document 314 the system generates the following script. SELECT '314', i.Associate.S_Compose.ID_VSE, i.Content FROM Information i WHERE i.Associate IN (Select nt.AdrVSE From The (select vd.Contain From VersDoc vd Where ID_Doc= '314')nt); AND i.Associate.InheritVGE.InheriteGE.NameGE= 'Year' The ancestor element of the analysis components (Abstract, Author, Year, Title) is Conference. There is one level between Year and Conference. That’s why S_Compose is 1. For the analysis component Data Warehouse concept (cf. Figure 5), the system generates the following script for the same document Id 314. SELECT '314', i.Associate.S_Compose.S_Compose.S_Compose.ID_ SE, i.Content FROM Information i WHERE i.Associate IN (Select nt.AdrVSE From The (select vd.Contain From VersDoc vd Where ID_Doc= '314')nt); AND i.contain.NameCpt='Datawarehouse' The number of levels between Abstract and Conference (ancestor element of the analysis components) is 3. Thus the occurrences of S_Compose equal 3. 6.2 Joining and grouping generated views After generating the view for the fact and its dimension views, we follow by linking these views on their two first attributes, thus we generate a new view called Joint. For our running example, it is the following. CREATE VIEW Joint (DataWarehouse, Year, Author, Title) AS SELECT DataWarehouse, Year, Author, Title FROM DataWarehouse d1, Year d2, Author d3, Title f WHERE d1.doc = d2.doc AND d2.doc = d3.doc AND d3.doc = f.doc AND d1.Anc = d2.Anc AND d2.Anc = d3.Anc AND d3.Anc = f.Anc; To generate the final view that describes the document mart we Group by all dimensions and apply the Count function. CREATE VIEW Result (DataWarehouse, Year, Author, Nb) AS SELECT DataWarehouse, Year, Author, Count(title) FROM Join GROUP BY DataWarehouse, Year, Author; Figure 6 displays the result, obtained with the generated view, in a multidimensional table. Figure 6: Multidimensional table 7. DocWare prototype: Experimentation To validate our proposals we developed the software prototype DocWare (Document Warehouse) for the integration and the analysis of textual data. Specifically, DocWare provides the two following main features: First it determines the generic and specific structures of documents and then inserts these documents automatically into the document warehouse, and secondly assists the administrator (or even skilled decision-makers) during the construction of the document mart. In the remainder we illustrate some functionalities of DocWare through the following example. Suppose we want to count the number of scientific papers dealing with the Data Warehouse concept, by Author and publication Year. •CONTEXT Accessing the document warehouse content we find that the documents describing the papers are grouped into the generic structure Conference. It contains all necessary elements to perform the analysis (Abstract, Year and Author). Nb 2007 1 1 2008 * 2009 2 … … Concept Data Warehouse Publication Foulen Dupont 39 •APPROACH We follow the three steps of our approach. I. Choice of analysis context: We start by defining the generic structure for the document mart to be constructed. Thus, the system displays. Among the list of stored structures in the warehouse, we choose the generic structure Conference that will be visualized by a tree (Figure 7). II. Selection of analysis components: We specify the role (dimension or fact) of elements to build the mart by using contextual menus. Chosen elements are automatically highlighted by using different shapes and colors for dimensions (read) and facts (yellow). In our example, we assign the Data Warehouse concept to the generic element Abstract as the first dimension. Then, we select the generic elements Year and Author as the second and third dimensions. Finally, the measure is the count of Titles. To assign a concept to a generic element, DocWare displays the list of all existing ontologies in the warehouse; this enables us to choose the appropriate ontology (cf. Figure 8). III. Filtering: As we want to analyze the count of papers for the authors of this paper, we apply a filter on the third dimension. The system displays all Author values; among them we select the three following names: Kaïs Khrouf, Jamel Feki and Chantal Soulé- Dupuy. •RESULT To visualize the result, DocWare creates views according the approach described in section 6 and displays the result multidimensional table (cf. Figure 9). Figure 7: Affectation of a fact and dimensions Figure 8: Affectation of concept for the generic element Abstract 40 Figure 9: The Result multidimensional table 8. Conclusion The document warehouse allows flexible manipulation of heterogeneous collections of documents based on their structures and contents. In this paper, we extended the document warehouse meta-model toward a metamodel that supports multiversion document warehouse. This is for integrating a new feature: the management and analysis of multiple versions of documents. As documents evolution may concern their structure and/or content, we addressed the storage of versions compliant to a same document structure, as well as versions compliant to a multiple document structures. Decision makers could be interested with the document evolutions, or even ignore them. Therefore, we suggested two types of analysis on documents namely: i) Multiversion analysis; i.e., covering all versions for a same document; and ii) Recent-version analysis; i.e., analysis relying only on the last version of documents. In our proposed approach, each document version is compliant to a version of specific structure. Furthermore, various versions of the same document are able to be compliant to several versions of generic structures. As an immediate perspective, we aim to extend the process of multidimensional analysis by integrating personalization criteria and metadata; this could be done by the user himself or by an assisted process. In addition, semantic aspects during the analysis process are interesting; they can help decision makers to get better analytics. Acknowledgement We would like to kindly thank Dr Mohamed Mbarki and Ms Maha Azabou (Master degree student) for their contribution to the implementation of the DocWare system prototype. References Abiteboul S., Cluet S., Ferran G., Rousset M.C. (2002). The Xyleme Project, Computer Networks, 39(3): 225-238, 2002. Ben Meftah S., Khrouf K., Feki J., Ben Kraiem M., Soulé-Dupuy C. (2012). Document Warehouse: Integration of Semantic Structures, International Conference on Information Systems ans Intelligence Economic, Djerba, Tunisia. Ben Messaoud I., Feki J., Khrouf K., Zurfluh G. (2011). Unification of XML Document Structures for Document Warehouse (DocW), International Conference on Enterprise Information Systems, p. 85-94, Beijing, China. Cobéna G., Abiteboul S. & Marian A. (2002). Detecting changes in XML documents. In International Conference on Data Engineering (ICDE’2002), p. 41-52, San Jose, California, USA. Kimball R. & Ross M. (2002). The Data Warehouse Toolkit (2 edition). New York: John Wiley & Sons. Khrouf K. & Soulé-Dupuy C. (2004). A Textual Warehouse Approach: a Web Data Repository, (p. 101-124). Hershey: Idea Group Publishing. Khrouf K., Feki J., Soulé-Dupuy C. (2011). An Approach of Multidimensional Analysis of Document. International Conference on Information Systems ans Intelligence Economic, Marrakech, Morocco. Nicolle C., Alvarez A., Amghar Y. (2001). Managing Versions and Links for Structured Legacy Documents, International Symposium on Information Systems and Engineering (ISE’2001), June 25-28, Las Vegas, Nevada, USA. Rusu L.I., Rahayu J.W., Taniar D. (2006). Mining Changes from Versions of Dynamic XML Documents, p. 3- 12, Workshop on Knowledge Discovery in XML Documents (KDXD), p. 3-12, Singapore. Wang Y., DeWitt D.J., Cai J.Y. (2003). X-Diff: An Effective Change Detection Algorithm for XML Documents, International Conference on Data Engineering (ICDE’03), p. 519-530, Bangalore, India.