Implementation of Data Backup and Synchronization Based on Identity Column Real-Time Data Warehouse

I Gede Adnyana a1, I Made Dwi Jendra Sulastra b2
a Computer System Department, STMIK STIKOM Indonesia, Denpasar, Indonesia
1 adnyana.nakkuta@gmail.com
b Accounting Department, Bali State Polytechnic, Denpasar, Indonesia
2 dwijendrasulastra@gmail.com

Abstract

Failures can occur in the process of loading data from the Online Transactional Processing (OLTP) system to the Normalized Data Store (NDS) database. Such failures can be caused by a network disruption that prevents the OLTP system from saving data to the OLTP and NDS databases. Backup and data synchronization scenarios are therefore needed to maintain data consistency and data availability. In this research, data backup and synchronization are performed by adding an identity column to each table in the OLTP database. The identity column holds a status value: '0' if the insert process fails and '1' if it succeeds. Data backup is done by temporarily storing data in a CSV file; the CSV file is later read, and its records are inserted into the OLTP database. After the insertion into the OLTP database succeeds, the synchronization process between the OLTP database and the NDS database is carried out. Data synchronization between the OLTP and NDS databases is done by checking the value of the identity column in each table of the OLTP database.

Keywords: Normalized Data Store, Real-Time Data Warehouse, Backup, Synchronization, Identity Column

1. Introduction

Data processing using information systems provides benefits, including faster processing time due to automation. Information systems can also process data accurately, thereby reducing the risk of human error. Problems arise when information systems cannot process very large volumes of data, or when data is scattered across different systems with diverse database structures [1][2][3][4][5][6]. An organization will have difficulty making decisions when it encounters conflicting reports caused by inconsistent data from the various data sources used. Therefore, an integrated data processing model that can process heterogeneous transactional data is needed, called a data warehouse (DWH) [1][2][3][4][5][6][7][8][9][10][11].

Currently, organizations need the latest information for decision making. The real-time data warehouse has different characteristics from the classic data warehouse [1][2][5][7][11]. Achieving a real-time data warehouse depends heavily on the process known as ETL (Extract, Transform, Loading) [1][2][5][7][10][11][12][13][14][15][16][17]. There are several ETL approaches to realizing a real-time data warehouse; one of them processes only data that has changed, known as the Change Data Capture (CDC) concept [1][2][5][10][11][16][18]. Another approach is to increase the frequency of data extraction [2][11][12][15][19]. Both approaches aim to reduce the data processing time lag so that a real-time data warehouse is realized.

In recent years, the Real-Time Data Warehouse (RTDW) has become a worldwide trend. Data warehouse ETL that used to be run once a day is now run every hour, or even every 5 minutes (mini-batch).
This can be done using two approaches, namely the push approach and the pull approach. With a push approach, the source system pushes data into the data warehouse: the data warehouse is updated as soon as the data in the source system changes, and changes to the source system are detected using database triggers. The pull approach uses time intervals to update data in the data warehouse; changes to the source system are detected using the timestamp or identity column method. An identity column is a column (field) in a table whose value is used as a benchmark when data is pulled from the source system and then stored in a data warehouse [2].

Some research has been carried out on RTDW, such as research on RTDW modeling using CDC [5][10][11][16][18]. There is also research on building RTDW using the CDC Event-Driven Programming method [11], as well as research on the architecture of RTDW and the process of building RTDW from the traditional model [7]. The process of building RTDW is largely determined by the ETL process, and several studies have been conducted to optimize it [11][12][13][14][15][16][17][18]. However, these studies all focus on building RTDW without discussing the possibility of failure in the process of transferring data from the source system to the RTDW database.

To realize RTDW, data from the source system is loaded into a staging area or Normalized Data Store (NDS), which is used as a temporary storage place for data from various sources and as the site of the ETL process before the data is loaded into the Dimensional Data Store [1][2][5][11][12][13][14][15][16][17]. In one study on RTDW that implemented CDC based on Event-Driven Programming [11], data was stored in parallel in the OLTP database and the NDS database. In the process of loading data from the source system into the NDS, failure may occur because the OLTP system fails to save data to the OLTP database and the NDS database. Failures can be caused by disruptions in the network, causing connection failures between the OLTP system and the OLTP and NDS databases. To maintain data consistency between the source database and the NDS database, as well as data availability, backup and data synchronization scenarios are needed. In this research, the data backup process is carried out by storing temporary data in a file in Comma Separated Values (CSV) format, followed by a data synchronization process. The data synchronization process is carried out by giving an identity column to each table in the source database and then checking the value of that identity column. This research is expected to handle failures of the data transfer process from the OLTP system to the Real-Time Data Warehouse.

2. Research Methods

The stages of this research are shown in general in Figure 1. The research starts with defining the problem, namely the possibility of data transfer failure from the source system during the formation of the RTDW. The next stage is a literature study to obtain supporting literature for solving the problem.
The backup scenario design stage produces a backup scenario model for the case where data transfer from the source system to the data warehouse fails. The next step is to design a data synchronization scenario between the source database and the NDS database using the identity column method. The data backup and synchronization models are then developed and implemented, followed by system functionality testing and analysis of the test results. Conclusions are then drawn from the analysis of the test results.

Figure 1. General Stages of Research

2.1. Real-Time Data Warehouse

The data warehouse is a system that retrieves and consolidates data periodically from source systems into dimensional or normalized data stores [2]. The data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of the management decision-making process [2]. The data warehouse is a system that extracts, cleans, adapts, and delivers source data into dimensional data storage and then supports queries and analysis for decision-making purposes. Information in the data warehouse is always presented in the form of dimensions and facts [1][2].

A classic data warehouse usually updates its data every day or every week. In response to business demands for up-to-date or real-time data processing, the concept of the real-time data warehouse was created. In a real-time data warehouse, the data is updated dynamically and continuously, with a break time that is almost zero [2]. Creating a real-time data warehouse is highly dependent on the Extract, Transform, Load (ETL) process. Another approach to reducing the lag time is to process only data that has been updated, known as the Change Data Capture concept [1][2].

2.2. Change Data Capture (CDC)

Change Data Capture (CDC) is an innovative approach to data integration based on identifying, capturing, and delivering changes made to data sources. By processing only the changes, CDC makes the data integration process more efficient and reduces costs by reducing latency [1][2].

CDC is designed to maximize the efficiency of the ETL process. Without CDC, all data in the Online Transaction Processing (OLTP) database is moved to the data warehouse whenever needed, while with CDC only the data changes that occur in the OLTP database are moved. Therefore, CDC can minimize the resources used to move data changes and minimize the latency of delivering information, which saves costs. There are two CDC scenario models integrated with ETL tools [2]:
a. Batch-Oriented CDC (Pull CDC) scenario: processes a set of data that only changes periodically, at high or low frequency.
b. Live CDC (Push CDC) scenario: sends data changes to the ETL tool as soon as the changes occur. This can be done with an event-delivery mechanism or messaging middleware.

2.2.1. CDC Based on Event-Driven Programming

Event-driven programming is a programming technique in which the flow of program execution is determined by events. When the program starts, it waits for user input events. For each event that occurs, the program runs the corresponding code in response; the flow of program execution is determined by the order in which events occur [11]. In CDC based on event-driven programming, when a user triggers an event by clicking a button on the GUI, the data that has been filled in on the GUI is stored in the OLTP database, and the same data is also sent to the Normalized Data Store (NDS) database for further processing. Data entered through the GUI is thus stored in parallel in two databases, namely the OLTP database and the NDS database [11].
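As an illustration of this parallel write, the following sketch stores one record in both databases inside a save-button event handler. It is not the authors' implementation: the column names (complaint_id, customer_name, description) are hypothetical, and SQLite is used only to keep the example self-contained, whereas the study works with separate OLTP and NDS databases.

```python
import sqlite3

def on_save_clicked(oltp_conn, nds_conn, complaint_id, customer_name, description):
    """Event handler sketch: when the user clicks Save on the GUI,
    the same record is written to both the OLTP and the NDS database."""
    record = (complaint_id, customer_name, description)

    # Write to the OLTP database (source system).
    oltp_conn.execute(
        "INSERT INTO TComplain (complaint_id, customer_name, description) VALUES (?, ?, ?)",
        record,
    )
    oltp_conn.commit()

    # Write the same record to the NDS database in parallel.
    nds_conn.execute(
        "INSERT INTO TNDSComplain (complaint_id, customer_name, description) VALUES (?, ?, ?)",
        record,
    )
    nds_conn.commit()

# Usage with in-memory databases for illustration only:
oltp = sqlite3.connect(":memory:")
nds = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE TComplain (complaint_id INTEGER, customer_name TEXT, description TEXT)")
nds.execute("CREATE TABLE TNDSComplain (complaint_id INTEGER, customer_name TEXT, description TEXT)")
on_save_clicked(oltp, nds, 1, "Customer A", "No signal in area X")
```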
2.3. Extract, Transform, Loading (ETL)

Extract, Transform, Loading (ETL) is a very important process in the data warehouse; through ETL, data from the operational systems can be entered into the data warehouse. The purpose of ETL is to collect, filter, process, and combine data from various sources to be stored in a data warehouse [1][2].

2.3.1. Extract

Most of the data in the source systems is very complex, so determining the relevant data is difficult, and designing and building extraction processes is very time-consuming [1][2]. Raw data originating from the source system can usually be stored directly in the staging area with minimal restructuring to preserve the authenticity of the data. There are three commonly used extraction methods [2]:
a. Whole table every time: this method extracts all rows in the table (full extraction). It is suitable if the table is small and consists of only a few rows.
b. Incremental extract: this method extracts only the changed rows rather than the entire table. To identify a changed row, the timestamp column, identity column, transaction date, triggers, or a combination of these can be used.
c. Fixed range: this method extracts a fixed number of records or records within a certain time period, for example the last six months of data.

2.3.2. Transform

The transformation phase applies a set of rules or functions to the data taken from the source in order to obtain the data to be delivered to the final target. Some data sources require very little or no data manipulation [1][2].

2.3.3. Loading

The loading process, also known as the delivery process, is the process in which the transformed data is entered into the data warehouse; the table structure of the data to be loaded is designed in the dimensional design process. The data resulting from the loading process is ready to be queried and presented by the data warehouse. Therefore, the dimensional design of the data warehouse determines the speed of the queries performed [1][2].

2.4. Data Backup and Synchronization Based on Identity Column

Data backup is needed to maintain the availability of data when there are problems with the system, for example database connection problems. In this research, data backup is done by storing temporary data in a file in Comma Separated Values (CSV) format. Once the connection problem has been resolved, the CSV file is read and its records are inserted into the OLTP database. If the insert into the OLTP database succeeds, it is followed by synchronizing the data between the OLTP database and the NDS database.
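A minimal sketch of this backup step is given below. It is an approximation of the described logic rather than the study's code: the record layout and the file name backup_transactions.csv are assumptions, and the connection-failure case is represented simply by passing no database connection.

```python
import csv
import os

BACKUP_FILE = "backup_transactions.csv"  # hypothetical backup file name
FIELDS = ["complaint_id", "customer_name", "description", "created_at"]

def save_transaction(record, oltp_conn=None):
    """Save a complaint record to the OLTP database; if no connection is
    available, append the record to a CSV backup file instead."""
    if oltp_conn is None:
        # Simulated connection failure, as in the paper's test scenario.
        file_exists = os.path.exists(BACKUP_FILE)
        with open(BACKUP_FILE, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            if not file_exists:
                writer.writeheader()
            writer.writerow(record)
        return "backed up to CSV"

    # Normal path: insert directly into the TComplain table of the OLTP database.
    oltp_conn.execute(
        "INSERT INTO TComplain (complaint_id, customer_name, description, created_at) "
        "VALUES (:complaint_id, :customer_name, :description, :created_at)",
        record,
    )
    oltp_conn.commit()
    return "saved to OLTP"

# With the connection closed (None), the record goes to the CSV backup file.
save_transaction({"complaint_id": 2, "customer_name": "Customer B",
                  "description": "Slow internet connection", "created_at": "2020-01-15 10:30:00"})
```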
Data synchronization is needed to ensure that data consistency is maintained. In the data warehouse data flow, data from the source system is loaded into a staging area or Normalized Data Store (NDS). It is possible that data fails to be loaded into the NDS; therefore, the data must be synchronized between the OLTP database as the source and the NDS database as the destination. The technique used to check records that fail to load into the NDS is the Identity Column, a column holding a status value such as '0' or '1'. If a record fails to load into the NDS database, its Identity Column value is '0'. Records with an Identity Column value of '0' are then inserted into the NDS database, and if the insert succeeds, the Identity Column value is updated to '1'.

3. Results and Discussion

The test is carried out using a Customer Relationship Management (CRM) Online Transaction Processing (OLTP) simulation application that records customer complaints about telecommunication services.

3.1. CDC Event Programming

When users enter new data in the CRM system, the data is sent in parallel to the OLTP database and the NDS database. CDC event programming is triggered by events that occur in the CRM system. The data input process in the CRM system can be seen in Figure 2.

Figure 2. CDC Event Programming on CRM Systems

Figure 3 shows that the data has been successfully stored in the TComplain table in the OLTP database.

Figure 3. Results of CDC Event Programming in the OLTP Database

Figure 4 shows that the data has been successfully stored in the TNDSComplain table in the NDS database.

Figure 4. Results of CDC Event Programming in the NDS Database

3.2. Data Backup

Data backups are needed to ensure that every transaction remains recorded when a database connection problem occurs. Transactions are saved in a Comma Separated Values (CSV) file and processed into the OLTP database when the database connection returns to normal. The process of inputting data on the OLTP system while offline can be seen in Figure 5.

Figure 5. Create Case Form Input

If there is a database connection problem, the save and update buttons of the offline action feature are enabled. In this research, the connection failure is simulated by giving a false (closed) value to the connection variable in the OLTP system, so that the connections to the OLTP database and the NDS database fail. Since no connection can be made, all transaction data is saved in the CSV file shown in Figure 6.

Figure 6. Transaction Data in CSV Format

When the database connection has returned to normal, the CSV data file is first read into tabular form, and the records are then inserted into the TComplain table in the OLTP database. The process of reading the CSV file and inserting the data into the OLTP database is shown in Figure 7.

Figure 7. CSV File Format Reading

If the insert OLTP button is pressed, the data stored in the CSV file is inserted into the TComplain table in the OLTP CRM database. The results of adding data to the TComplain table in the OLTP CRM database are shown in Figure 8.

Figure 8. CRM Synchronization Menu
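The read-back step shown in Figure 7 can be approximated with the following sketch. This is a hedged illustration, not the authors' code: the column names are hypothetical, and restored rows are assumed to be marked NDS_Status = '0' so that the later synchronization step can detect that they are not yet in the NDS database.

```python
import csv

def restore_from_csv(csv_path, oltp_conn):
    """Read the CSV backup file and insert each row into the TComplain table
    of the OLTP database. Restored rows get NDS_Status = '0' because they
    have not yet been propagated to the NDS database."""
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            oltp_conn.execute(
                "INSERT INTO TComplain "
                "(complaint_id, customer_name, description, created_at, NDS_Status) "
                "VALUES (?, ?, ?, ?, '0')",
                (row["complaint_id"], row["customer_name"],
                 row["description"], row["created_at"]),
            )
    oltp_conn.commit()

# Example usage (assuming the TComplain table and the backup file exist):
#   restore_from_csv("backup_transactions.csv", oltp_connection)
```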
Adding data from the CSV file is done in the OLTP CRM database, which then requires the process of synchronizing the data between the OLTP database and the NDS database, shown in Figure 9.

Figure 9. Synchronization between TComplain Table and TNDSComplain Table

Then, at 30-second intervals, the scheduler runs the data extraction process on the NDS database. Data in the NDS database is sent to the data warehouse database using the timestamp-based Incremental Extraction method. This method processes only the latest data contained in the NDS database. The results of inserting into the data warehouse database with the timestamp-based Incremental Extraction method are shown in Figure 10.

Figure 10. Results in the Data Warehouse

3.3. Data Synchronization

In the CDC method based on event-driven programming, data entered on the CRM system GUI is stored in parallel in the OLTP database and the NDS database. Data synchronization is needed to ensure that the data in the OLTP database is the same as in the NDS database. Checking and synchronizing the data in the OLTP database can be seen in Figure 11.

Figure 11. CRM Synchronization Menu

Figure 11 shows the results of the synchronization check in the OLTP CRM database: two records are not synchronized. Both records failed to be stored in the NDS database, while the data was stored in the OLTP CRM database. Unsynchronized data is found by checking the column in the TComplain table that functions as the Identity Column, namely the NDS_Status column, whose value is '0' for records that have not been synchronized. If unsynchronized data is found, the data synchronization process is carried out by pressing the Synchronize button. When the Synchronize button is pressed, the unsynchronized data is inserted into the TNDSComplain table in the NDS database. If the data is successfully inserted into the NDS database, the NDS_Status column value is updated to '1'. The process of synchronizing data between the TComplain table and the TNDSComplain table is shown in Figure 12.

Figure 12. Synchronizing Data between the TComplain Table and the TNDSComplain Table

Then, at 30-second intervals, the scheduler runs the data extraction process on the NDS database. Data in the NDS database is sent to the data warehouse database using the timestamp-based Incremental Extraction method, which processes only the latest data contained in the NDS database. The results of inserting into the data warehouse database with this method are shown in Figure 13.

Figure 13. Results in the TFact_Complain Table
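The synchronization and scheduled extraction described in this section can be sketched as follows. The sketch assumes hypothetical column names, uses SQLite connections in place of the OLTP, NDS, and data warehouse databases used in the study, and omits the GUI and the 30-second scheduler itself (the extraction function would simply be called at each interval).

```python
def synchronize_to_nds(oltp_conn, nds_conn):
    """Identity-column synchronization: find OLTP rows whose NDS_Status is '0',
    insert them into TNDSComplain, and mark them '1' once the insert succeeds."""
    pending = oltp_conn.execute(
        "SELECT complaint_id, customer_name, description, created_at "
        "FROM TComplain WHERE NDS_Status = '0'"
    ).fetchall()

    for complaint_id, customer_name, description, created_at in pending:
        nds_conn.execute(
            "INSERT INTO TNDSComplain (complaint_id, customer_name, description, created_at) "
            "VALUES (?, ?, ?, ?)",
            (complaint_id, customer_name, description, created_at),
        )
        nds_conn.commit()
        # Only after the NDS insert has succeeded is the status flag set to '1'.
        oltp_conn.execute(
            "UPDATE TComplain SET NDS_Status = '1' WHERE complaint_id = ?",
            (complaint_id,),
        )
        oltp_conn.commit()

def incremental_extract_to_dw(nds_conn, dw_conn, last_loaded_at):
    """Timestamp-based incremental extraction: load only NDS rows newer than
    the last loaded timestamp into the TFact_Complain table of the data
    warehouse. In the study this runs on a scheduler every 30 seconds."""
    rows = nds_conn.execute(
        "SELECT complaint_id, customer_name, description, created_at "
        "FROM TNDSComplain WHERE created_at > ?",
        (last_loaded_at,),
    ).fetchall()
    dw_conn.executemany(
        "INSERT INTO TFact_Complain (complaint_id, customer_name, description, created_at) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    dw_conn.commit()
    # Return the newest timestamp seen, used as the watermark for the next run.
    return max((r[3] for r in rows), default=last_loaded_at)
```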
4. Conclusions

Data backup is done by storing temporary data in a Comma Separated Values (CSV) file; the CSV file is then read, and its records are inserted into the OLTP database. After the insertion into the OLTP database succeeds, the synchronization process between the OLTP database and the NDS database is carried out. Data synchronization between the OLTP database and the NDS database is done by checking the value of the Identity Column in each table in the OLTP database. If the value of the Identity Column is '0', the data is inserted into the NDS database. If the data is successfully loaded into the NDS database, the value of the Identity Column is updated to '1'. Data in the NDS database is loaded into the Dimensional Data Store with the timestamp-based Incremental Extraction method to create the Real-Time Data Warehouse.

References

[1] R. Kimball and M. Ross, The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling. 2013.
[2] V. Rainardi, Building a Data Warehouse: With Examples in SQL Server. 2008.
[3] A. Khalaf Hamoud, A. Salah Hashim, and W. Akeel Awadh, "Clinical Data Warehouse: A Review," Iraqi Journal for Computers and Informatics, 2018, doi: 10.25195/2017/4424.
[4] M. P. Ambara, M. Sudarma, and I. N. S. Kumara, "Desain Sistem Semantic Data Warehouse dengan Metode Ontology dan Rule Based untuk Mengolah Data Akademik Universitas XYZ di Bali," Majalah Ilmiah Teknologi Elektro, 2016, doi: 10.24843/mite.2016.v15i01p02.
[5] I. M. D. J. Sulastra, M. Sudarma, and I. N. S. Kumara, "Pemodelan Integrasi Nearly Real Time Data Warehouse dengan Service Oriented Architecture untuk Menunjang Sistem Informasi Retail," Majalah Ilmiah Teknologi Elektro, 2015, doi: 10.24843/mite.2015.v14i02p03.
[6] M. R. Pasha, "Data Warehousing and the Unstructured Data," Bahria University Islamabad Campus, 2016.
[7] S. Bouaziz, A. Nabli, and F. Gargouri, "From Traditional Data Warehouse to Real Time Data Warehouse," in Advances in Intelligent Systems and Computing, 2017, doi: 10.1007/978-3-319-53480-0_46.
[8] S. Bouaziz, A. Nabli, and F. Gargouri, "Design a Data Warehouse Schema from Document-Oriented Database," in Procedia Computer Science, 2019, doi: 10.1016/j.procs.2019.09.177.
[9] F. Z. Al Faris, Suharjito, Diana, and A. Nugroho, "Development of Data Warehouse to Improve Services in IT Services Company," in Proceedings of the 2018 International Conference on Information Management and Technology (ICIMTech 2018), 2018, doi: 10.1109/ICIMTech.2018.8528146.
[10] H. Chandra, "Analysis of Change Data Capture Method in Heterogeneous Data Sources to Support RTDW," in 2018 4th International Conference on Computer and Information Sciences: Revolutionising Digital Landscape for Sustainable Smart Society (ICCOINS 2018) - Proceedings, 2018, doi: 10.1109/ICCOINS.2018.8510574.
[11] I. G. Adnyana, M. Sudarma, and W. G. Ariastina, "Middleware ETL with CDC based on Event Driven Programming," International Journal of Engineering and Emerging Technology, vol. 3, 2018.
[12] A. Wibowo, "Problems and Available Solutions on the Stage of Extract, Transform, and Loading in Near Real-Time Data Warehousing (A Literature Study)," in 2015 International Seminar on Intelligent Technology and Its Applications (ISITIA 2015) - Proceedings, 2015, doi: 10.1109/ISITIA.2015.7220004.
[13] A. Sabtu et al., "The Challenges of Extract, Transform and Loading (ETL) System Implementation for Near Real-Time Environment," in International Conference on Research and Innovation in Information Systems (ICRIIS), 2017, doi: 10.1109/ICRIIS.2017.8002467.
[14] R. P. Deb Nath, K. Hose, T. B. Pedersen, and O. Romero, "SETL: A Programmable Semantic Extract-Transform-Load Framework for Semantic Data Warehouses," Information Systems, 2017, doi: 10.1016/j.is.2017.01.005.
[15] N. Biswas, A. Sarkar, and K. C. Mondal, "Efficient Incremental Loading in ETL Processing for Real-Time Data Integration," Innovations in Systems and Software Engineering, 2019, doi: 10.1007/s11334-019-00344-4.
[16] S. Thulasiram and N. Ramaiah, "Real Time Data Warehouse Updates Through Extraction-Transformation-Loading Process Using Change Data Capture Method," 2020.
[17] B. Pan, G. Zhang, and X. Qin, "Design and Realization of an ETL Method in Business Intelligence Project," in 2018 3rd IEEE International Conference on Cloud Computing and Big Data Analysis (ICCCBDA 2018), 2018, doi: 10.1109/ICCCBDA.2018.8386526.
[18] Denny, I. P. M. Atmaja, A. Saptawijaya, and S. Aminah, "Implementation of Change Data Capture in ETL Process for Data Warehouse Using HDFS and Apache Spark," in Proceedings - 2017 International Workshop on Big Data and Information Security (IWBIS 2017), 2018, doi: 10.1109/IWBIS.2017.8275102.
[19] I. Mekterović and L. Brkić, "Delta View Generation for Incremental Loading of Large Dimensions in a Data Warehouse," in 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO 2015) - Proceedings, 2015, doi: 10.1109/MIPRO.2015.7160496.