CHEMICAL ENGINEERING TRANSACTIONS VOL. 51, 2016 A publication of The Italian Association of Chemical Engineering Online at www.aidic.it/cet Guest Editors: Tichun Wang, Hongyang Zhang, Lei Tian Copyright © 2016, AIDIC Servizi S.r.l., ISBN 978-88-95608-43-3; ISSN 2283-9216 A Design of ETL for the Construction of Traffic Network Based on Big Data Qinan Liu College of energy and transportantion engineering,Jiangsu Vocational Institute of Architectural Technology, Xuzhou, Jiangsu 221006, China lqnch@163.com Intelligent traffic management system based on big data is the trend of the development of the transportation system. It integrates information technology, wireless communication technology, computer technology, sensor technology and other advanced technologies to form a comprehensive and efficient integrated traffic management system. Traffic system based on big data needs to deal with a large number of unstructured data and semi-structured data. Also, the capacity of the data becomes larger, the data grows faster, and the format of the data becomes more complex. Traditional ETL technology has been unable to meet the needs of the construction of intelligent traffic network which is based on big data. According to the characteristics of the traffic network based on big data, this paper designs a kind of ETL system with high universality and high data processing efficiency. First, in order to improve the efficiency of data processing, we optimize the workflow of ETL. In order to make ETL suitable for big data traffic network environment, we redesign the ETL data processing rules by identifying and merging. And then we optimize the extracting, transforming and loading of the ETL system. Finally, the experimental results show that the redesigned ETL system can effectively serve the traffic network system based on big data. This method has a high efficiency in processing complex data structure and large data capacity of big data. 1. Introduction With the rapid development of the city, traffic congestion, traffic pollution and traffic accidents are the major problems to be solved in the city. Intelligent transportation system based on big data is imperative. It embeds the internet of things and car networking technology into traffic management, so it can control the entire traffic management system efficiently and comprehensively. However, due to the rapid increase of traffic users, the traffic related data of intelligent traffic system has jumped from the TB level to PB, EB level, even ZB level. A great amount of data has a great impact on the operation and management of intelligent transportation system. Traffic data includes data base, picture, video, text and other structured data, unstructured data and semi-structured data and so on. Researchers in the field of big data generally believe that data mining and analysis is the key of big data technology. However, many scholars in the study find that the implementation time of the data acquisition and preprocessing stage usually accounts for 60%-80% of big data entire process time. This stage is the most time consuming stage in the whole system development process (Abiteboul et al., 1999). In the process of data acquisition and preprocessing, we need to extract data from distributed and heterogeneous data sources, and then clean, transform and integrate the data. Finally, we load the data into the data warehouse. This process is called extraction, transformation and loading (ETL). Due to the large number of data from different sources, the work performance of intelligent traffic system based on big data is directly influenced by processing efficiency and quality of ETL. The traditional traffic control system mainly uses the ring coil and video to detect the traffic flow, but the system is passive. The whole framework of intelligent transportation based on big data includes physical sensing layer, data processing and analysis prediction, and optimization management application. According to the wide application of big data systems in variety of fields, the traffic network architecture based on big data is shown in Figure 1. DOI: 10.3303/CET1651076 Please cite this article as: Liu Q.A., 2016, A design of elt for the construction of traffic network based on big data, Chemical Engineering Transactions, 51, 451-456 DOI:10.3303/CET1651076 451 Video Surveillance GPS RFID Traffic Data Relational Database Database File System Data Integration Flow Analysis Data Mining Road Network Residential Travel Behavior Traffic Flow Forecasting Traffic Model Forecasting Traffic Congestion Monitoring Travel Information Services Traffic Guidance Road Planning Map Revision Physical Perception Data Processing Application Figure 1: The traffic network architecture based on big data The big data technology can be widely used and widely recognized in traffic network, what is mainly reflected in the real-time distribution, efficiency and predictability of big data technology. The real-time and distributed data processing guarantees the high efficiency and predictive ability of the big data in the traffic network application. Excellent ETL design is the premise to guarantee the real time processing of the distributed traffic data. At present, the major software companies have developed many outstanding ETL system, such as Oracle's Oracle Warehouse Builder (OWB) (Borowski et al., 2008), Microsoft’s SQL Server Integration Services (SSIS) (Haselden et al., 2007), IBM’ Data Stage. There are single ETL tools, such as Kettle, Clover ETL, Talend and Data Cleaner. These soft wares have their own advantages and the application scope. Researchers in the study of ETL mainly concentrate in 4 areas: ETL modeling, ETL processing, data quality and metadata (Xu et al., 2011). Modeling research mainly focuses on conceptual modeling, logical modeling, model transformation and optimization etc. Mu et al. (2009) propose a method for automatically generating code in the process of ETL, which is based on conceptual model. Lenzerini et al. (2003) study the definition, initial loading scenarios and conceptual model specification, and improved the concept of modeling. Ding (2007) simplifies the general model of ETL activities, and models the workflow on basis of the process. Study on the process of ETL focuses on the research of data extraction and data transformation. Zhang Rui (2010) describes the core contents of ETL from extraction, transformation and loading of three different angles. Zhang Xufeng et al. (2006) divide data extraction into total data extraction and incremental data extraction, and the ETL process is divided into full ETL process and incremental ETL process. In the face of complex data, Strong et al. (1996) defined a common data quality, which can make ETL work efficiently. Kimball et al. (2007)) put forward 6 important indicators of data quality evaluation in ETL. Wang et al. (2010) propose an ETL service framework based on metadata, the metadata in data warehouse is used to describe the structure of the data warehouse and the establishment of the method. 452 2. ETL technology ETL is an important part of the construction of data warehouse (Zhang Ning et al. (2002)). Firstly, it solves the problem of data dispersion (Ma (2004)). Secondly, ETL can not only filter the "dirty data" in the data set, but also can guarantee the data clean and effective (Lin (2003)). ETL is the process of data extraction, data transformation and data loading, every industry will encounter the problem of mass data processing. Figure 2 shows the structure of the traditional ETL model. Data Loading Data Extraction Data Transformation Data Access Interface Temporary Data Storage Area E T L P ro g ra m In te rfa ce Data Base Video Data GPS Data Textual Data Data Warehouse O p e ra tio n M a n a g e m e n t P la tfo rm Metadata Figure 2: The structure of the traditional ETL model. 2.1 Data extraction Data extraction is the process of extracting data from the source database system. The data that is extracted from different databases or heterogeneous sources is the basis for subsequent data processing. Data extraction has two modes, namely incremental extraction and total extraction. Incremental extraction is the most important way to extract data. 2.2 Data transformation The result of the data transformation is that the data required by the database is obtained after the transformed. Data transformation focuses on the data processing, and it contains a lot of data processing rules. Common rules for handling data include data extraction rule, data loading rule, filter rule, the projection rule, column splitting rule, rule to merge column, column function rule, column filtering rule, table joining rule, multi broadcast rule, search and replace rule, derived column rule, column mapping rules, order rule, and group rule. 2.3 Data loading Data loading refers to loading transformed and processed data in the activity area into the user specified database. The whole process of ETL is over until the completion of the procedure. 3. Traffic network data analysis The construction of intelligent transportation network makes the traffic data more and more diversified. The continuous video signal of camera, the text information of traffic police portable equipment, and traffic illegal captured pictures produce a large number of types of data. These data are not only complex but also have huge data capacity. The core of ETL technology is loading data from the source system into data warehouse in accordance with certain rules and logic, which is used to further analysis of data mining. 453 4. The design and implementation of ETL for the construction of traffic network based on big data 4.1 Optimal design of ETL Workflow is a summary model of workflow execution process. In this paper, a reasonable set of ETL data processing rules can achieve the optimization of workflow and improve the efficiency of data processing. In addition, we add a temporary use of the database in the middle of the database, namely the memory database, so that the frequent operation of the data can be completed quickly in the memory. 1) Optimizing the setting of data processing rules. Firstly, we classify the rules in accordance with their handled data size into different types. These rules acting on the same data type are the first class, such as column splitting rule, column function rule, column filtering rule, and search and replace rule etc. The rules acting on the single data set are the second class, such as filter rule, the projection rule, multi broadcast rule, column mapping rules, group rule etc. Third type of rule can act on two data sets, namely, the join rule. Then, in this paper, these three categories of rules are classified, the first type of rule is denoted as CRule rule, the second type is Output rule, the third type is Group rule, and the fourth type is Join rule. 2) In order to ensure that the data transformation can quickly extract the required data in memory, instead of the disk database I/O operation, we put all kinds of data temporarily stored inside the memory. This method is convenient for data transformation, and greatly reduces the time cost for disk database I/O operation. This is the purpose of adding temporary database. The optimized ETL design architecture is shown in Figure 3. Split Replace � CFilter Data Base Video Data GPS Data Textual Data Data Warehouse Temp Database Filter Project � Map Data Extraction CRule Output Data Loading Data Transformation Figure 3: The optimized ETL design architecture. 4.2 Optimal design of ETL According to the redesign and optimization of ETL structure, this paper implements a strong data processing capability, real-time and efficient ETL tools. ETL first extracts different types of data sources. Before the execution of the workflow, the process is first analyzed. Then, according to the type of conversion, the different conversion is sent to different execution parts. ETL workflow execution starts from the Begin node and ends at the End node. 5. Experiment and analysis 5.1 Introduction of experimental environment This experiment compares the improved ETL with the traditional ETL, the results of experiments shows that the method has a significant improvement in the processing of big data. In order to verify the performance of the improved ETL, the configuration of the computer we choose is as follows: Intel(R) Core(TM) i5-4300M @2.60GHz; RAM: 4G; OS: windows 7; jdk7.0. Database using My SQL 5.0, SQL Server 2008, Oracle 10i. Data comes from a city near the second ring traffic network information in 2013. 454 5.2 Experimental results and analysis 1) Firstly, we compare the efficiency of ETL and traditional ETL in data conversion. The experiments are divided into 5 groups. Each group is divided into 5 experiments, which is based on the number of rules. The numbers of these rules are 2, 4, 6, 8, and 12. The test of each rule is carried out for 2 times, and the average time of the test is taken. As can be seen from Figure 4, the execution time of ETL in this paper is obviously less than that of the traditional ETL tools. Figure 4: Comparison of ETL time 2) The advantage of the improved ETL tool in this paper is that it has a fast response time when the data is processed. And at the same time, the ETL can quickly extract, transform and load the useful data into the data warehouse. In this experiment, we compare the throughput of the data with two different ETL tools, and verify the processing ability of the different ETL tools to the data. As can be seen from Figure 5, the processing capacity of the method is not obvious, even worse than the traditional method, when the data size is small. However, with the increase of data size, the advantages of this method in the data processing ability are reflected. Figure 5: Comparison of experimental results 6. Conclusions By optimizing the data processing rules of the ETL data transformation process, and adding a memory database in this process, we propose a design and implementation of ETL for the construction of traffic network based on big data. Compared with the traditional ETL tools, the ETL can process a variety of data types, and transform data faster. In this paper, the ETL mainly deals with the large amount of data generated 0 50 100 150 200 250 300 2 4 6 8 12 T im e (s ) Number of Rules Traditional ETL Improved ETL 455 by intelligent traffic. First of all, we prove this method to achieve the desired purpose, saving more time during data conversion. Secondly, we verify the method in the ability to handle large amounts of data processing. With the increase of data size, the advantages of this method in the data processing ability is reflected. Finally, the ETL tool designed in this paper can not only meet the requirements of data acquisition and pre-treatment of big data of traffic network, but also has very good performance. References Abiteboul S., Cluet S., Milo T., et al. Tools for data translation and integration. IEEE Data Engineering Bulletin, 1999, 22(1): 3-8. Borowski E.L., 2008, Design of a workflow system to improve data quality using oracle warehouse builder [J]. Journal of Applied Quantitative Methods, 3:198-206. Ding Z.M. 2007, Modelling and optimization of ETL workflow: [D]. Shanghai: Shanghai Normal University Haselden K., Baker B., 2007, Microsoft SQL server 2005 integration services [M].E'earson Education India. Kimball R., Caserta J., 2007, The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Journal of Information Management. 23(11):1123 Lenzerini M., Vassiliou Y., Vassiliadis P., 2003, Fundamentals of data warehouses [M].Springer. Lin Y. et al., 2003, Principles and practice of data warehouse [M], Beijing, Post & Telecom Press. Ma H.L., 2004, Research on data acquisition and modelling in data warehouse [J], Journal of Minzu University of China (Natural Sciences Edition), 13(4):339-342 Mu L., Maz O., Trujillo J., 2009, Automatic generation of ETL processes from conceptual models. in: Proceedings of the ACM twelfth international workshop on Data warehousing and OLAP. 2 Penn Plaza, Suite 701 New York NY USA: ACM. 3340 Wang H., Ye Z., 2010, An ETL Services Framework Based on Metadata in: 2nd International Workshop on Intelligent Systems and Applications. Wuhan, China: IEEE Computer Society. 1-4 Wang R., Strong D., 1996, Beyond accuracy: What data quality means to data consumers. Journal of management information systems. Journal of Management Information System. 12 (4):5-33 Xu J.G, Pei Y., 2011, Data ETL research review. Computer Science, 38(4): 15-20 Zhang N., Jia Z.Y., Shi Z.Z., 2002, Research on ETL technology in data warehouse[J], Computer Engineering and Applications, 24:213-216 Zhang R., 2010, An overview of ETL data extraction [J]. Software Guide. 9(10): 164165 Zhang X.F, Sun W.W, Wang W. et al., 2006, Research on the generation method of incremental ETL process automation [J], Journal of Computer Research and Development, 43(6): 10971103 456