IBN AL- HAITHAM J. FOR PURE & APPL. S CI. VOL.24 (2) 2011 Developing a Real Time Method for the Arabic Heterogonous DBMS Transformation S. M. Hadi , S. Murtatha Departme nt of Information & Comm. Eng. College of Engineering Al- Kha warizmi ,Unive rsity of Baghdad Departme nt of Computer Science, College of Science, Unive rsity of Baghdad Received in : 12, June, 2011 Accepte d in : 13, July, 2011 Abstract A common p roblem facing many Ap p lication models is to extract and combine information from multiple, heterogeneous sources and to derive information of a new quality or abstraction level. New app roaches for managing consistency , uncertainty or quality of Arabic data and enabling e-client analysis of distributed, heterogeneous sources are st ill required. This p aper p resents a new method by combining two algorithms (the p artitioning and Group ing) that will be used to transform information in a real time heterogeneous Arabic database environment. Key word: Heterogeneous DB, M app ing DB, cleaning data and real time transforming Introduction One of the first and most imp ortant steps in any data p rocessing task is to verify that t he data values are correct or, at the very least, conform a set of rules. Data warehouses require and p rovide extensive sup p ort for data cleaning. They load and continuously refresh huge amounts of data from a v ariety of sources so the p robability that some of the sources contain”dirty data” is high. Furt hermore, data warehouses are used for decision making, so that the correctness of t heir data is vital to avoid wrong conclusions [1]. M any applications today need information fro m diverse d ata sources, in which related data may be represented quite differ ently , in one common scenario, if a DBA wants to add data from a new source to an existing warehouse DB, the data in the new source may not match the exist ing warehouse schema and this will cause a problem [2]. In any scenarios, one or more data sets must be mapp ed into a single target representation. Needed transformations may include two app roaches:  Schema transformations (changing the structure of the data)  And data transformation with cleansing (chan ging the format and vocabulary of the data and eliminating or at least reducing dup licates and errors). In each area, there is a broad range of p ossible transformations, from simple to comp lex. Schema and data transformation has typically been st udied sep arately . We believ e they need to be handled together via a uniform mechanism [3]. In addition to a consistent and uniform access to heterogeneous sources there is a further value of integr ation. The integrated data can contain information, which are not integrated data exp licitly but in the form of dependencies, relationships or p atrons over the various sources [4]. IBN AL- HAITHAM J. FOR PURE & APPL. S CI. VOL.24 (2) 2011 The Data Cleaning Data cleaning, also called data cleansin g or scrubbing, deals with detecting and remov ing errors and inconsistencies from d ata in order to imp rove the quality of data. Data quality p roblems are p resent in single data collections, such as files and databases, e.g., due to missp ellings dur ing data entry , missing information or other invalid d ata. The data warehouses require and p rovide extensive sup p ort for the data cleanin g [5]. They load and continuously refresh huge amounts of data from a variety of sources so t he p robability that some of the sources contain”dirty data” is high. Furt hermore, data warehouses are used for decision making, so that the correctness of their data is vital to avoid wrong conclusions. The need for data cleaning increases si gn ificantly . This is because the sources often contain redundant data in different r epresentations. Provide access to an accurate and consistent data, consolidation of different data representations and to elimination the duplicate information, in this fut ure age the cleanin g p rocess become n ecessary [6]. During the ETL p rocess (extraction, transformation, loading), illustrated in Figure (1), further data transformations deal with schema/data translation and integr ation, and with filterin g and aggregatin g data to be stored in the warehouse. In this Fi gure, all data cleanin g is typically p erformed in a sep arate data st aging area b efore loading the transformed data into the warehouse. A large number of tools of varying functionality are available to supp ort these tasks, but often a significant p ortion of the cleaning and transformation work has to be done manually or by low-level p rograms t hat are difficult t o write and maintain. While a hu ge body of research deals with schema translation and schema inte gration, data clean ing has received only little attention in the research community . A number of authors focused on the p roblem of dup licate identification and elimination, while som e resear ch group s concentrate on general p roblems not limited but relevant to data cleaning, such as sp ecial data mining approaches and data transformations based on schema matchin g. M ore recently , several research efforts p rop ose and invest igate a more comp rehensive and uniform treatment of data cleaning cover ing several transformation phases, sp ecific operators and their imp lementation. [7]. In this p ap er we focus on the new p roblem that faces transformation and conversion in ETL p rocess related to our DB environment, which is the Arabic data. Several st andard p rograms and methods (workin g under windows op erating sy st em) are available for transformation of English data in any database. But the same transformation will be so difficult when we deal with an Arabic data in a real time from any sp ecial database or warehouses. The Proposed System In the p resent work, four methods have been p rop osed to solve the problem in a sequential approach to assess the advantages and disadvantages of each method. All are b ased on find ing new relationship s between the original Arabic data and the converted un-clear (rubb ish) one (using the standard methods). T o check the validity of these prop osed methods, a case st udy is conducted to retrieve an Arab ic d atabase file from FoxPro DB under Dos to Oracle 10g und er window. T he p rop osed sy stem is based on a server to servers' architecture conn ection; or what we call it as a three tire architecture that will consist of the following: A- The Se rver (desti nation se rver): This server will represent the database server that collect data from mu lti-source and can b e act as a warehouse server, oracle DBM S will b e the core that hold the warehouse data which will be treated and cleaned in the middleware. B- The Middleware: is the software that do ETL p rocess, this p art can be built in the destination server or could be bu ild in a sep arated server to increase the p erformance of the database and to avoid the load that may be happ ened on the destination server as shown in Figure (2) that contain the prop osed sy stem architecture. IBN AL- HAITHAM J. FOR PURE & APPL. S CI. VOL.24 (2) 2011 C- The Clients (sources servers): the second server will be the multi source local d atabase servers that contain all the desired d ata ,each source can be use different DBM S (M S-Access, FoxPro under-DOS, Excel, M ysql…etc). M ost of the Arabic information has a p roblem when transformin g it from one d atabase to another due to many reasons. The reason why the data ap p ears t o be different (rubbish) is not related to the difference in the DBM S only. This p roblem might appear also when transferring the Arabic data to the same kind of DM BS with the same version. The real fact and the main reason to this p roblem is the difference in the Character Set used in the Op erating Sy st em combined with the DBM S. For examp le, the Op erating Sy stem (DOS) uses a sp ecial character set for the Arabic lan guage and the Op erating Sy st em WINDOWS uses t he (M S-WIN 1256) Character Set for the Arabic lan guage and (M S-WIN 1252) Character Set for t he En glish lan gu age. From other hand the Op erating Sy st em (LINUX), use the different set character for the Arabic lan guage and for the English lan gu age. So if we tried to transfer the data that have been written in Arabic lan gu age from Oracle10 g DBM S using the (M S-WIN 1256 Character Set) host ed by Windows Server 2003 Op erating system to another Oracle10g DBM S hosted by another comp uter that has the same operating sy stem but using the (M S-WIN 1252 Character Set) t hat will not supp ort the Arabic langu age, then the transferred data will defiantly appear to be different in shap e and meaning (rubbish). There are many soft ware p rograms used to solve this p roblem that will transfer the data to its original for m regardless of what the data is, but the real problem is that all these p rograms works using the Single Patch form, which means it transfers all the data directly and at once for the same DBM S. Because of that t he needs t o design a new method b ecome requ ired that will enable us to transfer huge amounts of data from any language in Real Time's environment between differ ent DBM S(s) and op erating sy stems. The Problem Solutions In this p rop osed sy stem we imp lement four solutions to solve the p roblem that mansion before each of these methods have been imp lemented on a real case st udy that contain a kind of an Arabic data. The four methods will be illustrated sequentially as the following:  First Method It has been thought of many way s to solve this p roblem, one solution used to be a new method was designed to insert all the Arabic letters and their movements (The movement in the Arabic language used to be treated as a sep arated letter in the comp uter that will has a sep arated ASCII code) into t he syst em that will transfer data from it, the result of translating these data to the related ASCII Code b efore the op eration of transformin g and after, can b e seen in the Look up Table shown in table (1). This method requires sep arating the words (data) letter by letter, and each letter is to be worked on and swapp ed with t he corresp onding letter in the Look up Table. This method can be exp lained as the following st eps which have been written as a p rogram with the (Procedural Langu age/Structured Query Language) known as (PL/SQL) in Oracle10g, 1. The data will be converted for each table column by column, 2. And each colu mn will be read raw by raw, 3. And each raw will b e read cell by cell, 4. Then each cell will be read word by word 5. Finally each word will be read letter by letter (of course the mov ement will be treated as a separated lett er). When we transform the Arabic d ata and gett ing the rubbish view as mentioned before these rubbish letters will be managed and after read ing each one, it will be co mpared with the Look up Table to obtain the original letter, this is where the swapp ing p rocedure will take p lace. Each rubbish l ett er will be swapped with the original lett er. This method has been highly IBN AL- HAITHAM J. FOR PURE & APPL. S CI. VOL.24 (2) 2011 efficient when readin g the transferred data, howev er, it is not suitable for hu ge amounts of transferred data (just like a warehouse database), and due to the fact it takes t oo much time for the transformation p rocess. T herefore, this method can be suitable only for the small Database that has small amount of data and this will gu ide us t o the second method.  Second Method The second method has been p rogrammed usin g SQL Server 2008 as a p ackage in ord er to get benefit of using the Integrated Services characteristics. This method is quite similar to the first method; the only diference is that it works as a Parallel for the tables and the columns. In fact by using this method it will be possible to t ransfer more than one table in the same time and ev en transferring all the columns in one table in the same time, but t he process will remain to be Serial for any colu mns' data ( Serial concerning the rows). Obviously this method gives us more sp eed in the transformation process comp aring with the first method.  Thir d Method The third method has been built on the second method with some development. With This method we use the Table Partitioning Technique, through this technique the table wi ll b e divided in to dummy p arts according to the st ored data, in a way that each p art will contain some p ortion of the data from all columns (that’s mean the dividing will be hor izont ally not vertically ). Here we will have p arallel p rocessing for all the Table's p arts at the same time, and this will ensure to make this method faster than the second and the first method, but we find that the coming method will be the optimum solution for the p roblem rep resented here.  Forth Method This method is a develop ment of the third method. Aft er study ing and analyzing the data in the Look up Table, we discover that an implementation of a minus p rocedure p erformed between the ASCII code of the original data and the rubbish data will give us a new fa ct. T his op eration of havin g the d ifference between the Two ASCII cod will gain a sp ecial number that will be related to a certain amount for each group of letters (each group of letters beholds one differ ence as shown in Table No. (1). by using some analy sis we will reduce the p ossibilities of search ing from (35) p ossibilities to (10) p ossibilities only. This will lead the p rogram not to search in all the Look up Table In another word by elimin ating the amount of p ossibilities we reduce the I/O op erations. In this method we will divide all the Arab ic lett ers to only (10) ASCII group s, t hen during the transformation p rocess we will check each letter in which group it will be related, then we will mak e the swap op eration on the original ASCII code according to its group number. This method will b ecome the fastest comp aring with the other 3 methods mentioned above; it is indeed the method in which we reco mmend when transferrin g hu ge amounts of data in the Real Time environment as shown in Figure (3). Result and Conclusions Through app ly ing an analy sis methods concernin g the 4 methods and the time required to p erform the transp ortation op eration in a real time environment we have the result of this mathematical analy sis shown in Figure (4) which will app ear that the forth method will be the new fasts method that will solve the p roblem of transferring a huge Arabic data from one DBM S to another without any rubbish data and the most imp ortant in a real time for m. As a conclusion for imp lementing this method on a real data taken from the database of the M inistry of Internal Affairs we can list the following p oints: 1. Through the imp lementation p rocess we concluded that the forth developed method will work very fast in an accurate manner concerning the transformation op eration for any DBM S hosted by any Op erating Sy st em p laced in any Network architecture. IBN AL- HAITHAM J. FOR PURE & APPL. S CI. VOL.24 (2) 2011 2. It has been found that this method can be imp lemented successfully not only on the Arabic data, in fact it will be p owerful also with any language data which have the 3. same characteristics of the Arabic lan guage just like the Kurdish, Chinese, Korean, …..ext. 4. By using the data mining with this method we could obtain different data and information from the DBM S of the different source dest ination servers and in different lan guages not only in the Arabic. Re ferences 1. Niswonger, B. ; Haas, L. M . , M iller, R. J.( 2009), T ransforming Heterogeneous Data with Database M iddleware beyond Integration,www.10.1.1.56.1853.edu. 2. Heiner Stuckensch midt, Ubbo Visser, (2008), Using environmental Information Efficiently: Sharin g Data and Knowled ge From Hetero geneous Sources, T he Computing Technologies Center. 3. Kai_Wie sattler and Gunter Sakee, (2006),sup p orting Information Fusion with Federated database Technolo gy , IEEE co mputer transaction. 4. Susan B. Davidson and Kosky , S. (2006), A Langu age for Database Transformations and Const raints, www.ICDE97.com. 5. Sundus Norry Shukry , (2007), Decision supp ort sy stem of JAVA infrastructure Analysis and control, Research Journal of App lied Science,USA. 6. Erhard Rahm Hon g Hai Do, (2009), Data Cleaning: Problems and Current App roaches, www.ghh.com. 7. Ronald Cody , Ed.D., Robert Wood Johnson, (2009), Data Cleanin g 101, www.ats.ucla.edu. IBN AL- HAITHAM J. FOR PURE & APPL. S CI. VOL.24 (2) 2011 Table :(1) Contains the Arabic le tters and its corresponding AS CII Code Seq. rubbish Character rubbish ASCII Original Character Original ASCII ASCII Difference group 1 À 22 21 ض 4 3 ­11 Group 1 22 ل 2 21 ط 5 6 ­9 Group 2 3 Â 22 21 ظ 6 7 ­9 22 م 4 21 ع 7 8 ­9 22 ن 5 21 غ 8 9 ­9 6 Ç 23 22 ق 1 2 ­9 7 È 23 22 ك 2 3 ­9 22 ه 8 22 ف 9 1 ­8 Group 3 9 É 23 22 ل 3 5 ­8 10 Ê 23 22 م 4 7 ­7 Group 4 11 Ë 23 22 ن 5 8 ­7 23 ى 12 22 ه 6 9 ­7 23 ي 13 23 و 7 0 ­7 14 rubbish Î 23 23 ى 8 6 ­2 Group 5 15 Ï 23 23 ي 9 7 ­2 16 << and >> 18 21 س 7 1 24 Group 6 17 ­ 17 21 ص 3 2 39 Group 7 15 ں 18 19 ا 9 9 40 Group 8 19 space 16 20 ب 0 0 40 20 ، 16 20 ة 1 1 40 21 ¢ 16 20 ت 2 2 40 22 £ 16 20 ث 3 3 40 23 ¤ 16 20 ج 4 4 40 24 ¥ 16 20 ح 5 5 40 16 ال 25 20 ح 5 5 40 26 ¦ 16 20 خ 6 6 40 27 § 16 20 د 7 7 40 28 ¨ 16 20 ذ 8 8 40 29 © 16 20 ر 9 9 40 ؛|||| ھ 30 17 21 ز 0 0 40 31 « 17 21 س 1 1 40 32 ¬ 17 21 ش 2 2 40 15 ک 33 19 ء 2 3 41 Group 9 34 › 15 19 ؤ 5 6 41 35 ‹ 13 19 ؤ 9 6 57 Group 10 IBN AL- HAITHAM J. FOR PURE & APPL. S CI. VOL.24 (2) 2011 Fig. (1): The ETL process i n a data warehouse.[6] Fig. (2): Represe nt the Middleware for the propose d S ystem IBN AL- HAITHAM J. FOR PURE & APPL. S CI. VOL.24 (2) 2011 Fig. (3): The Block Diagram of the 4 methods Fig. (4): The Time Analysis for the 4 Methods 2011) 2( 24مجلة ابن الهیثم للعلوم الصرفة والتطبیقیة المجلد لتحویل البیانات العربیة في قواعد البیانات تطویرطریقة ضمن الزمن الحقیقي الموزعة سها محمد هادي ، صفاء مرتضى تصاالت ، كلیة الهندسة الخوارزمي ، جامعة بغداد قسم هندسة المعلومات واال قسم علوم الحاسبات،كلیة العلوم ، جامعة بغداد 2011،حزیران،12: استلم البحث في 2011،تموز، 13: قبل البحث في الخالصة المشـــاكل التـــي تواجــه نمـــاذج تطبیقـــات قواعــد البیانـــات یمكـــن تلخیصــها بكیفیـــة التعامـــل وادارة ن واحــدة مـــن اهــم ا ان الحاجـة الـى . من المصادر المختلفة لقواعد البیانـات التوزیعیـة لیـتم تخزینهـا فـي قاعـدة بیانـات موحـدة ودمجها المعلومات ومــات المخزنــة باللغـة العربیــة ضـمن قواعــد البیانــات فر طریقــة جدیـدة للتعامــل بثبـات مــن اجــل الحصـول علــى نـوع المعلاتـو ین مــن ولهـذا سیتضـمن هــذا البحـث تقـدیم طریقـة جدیـدة تتــالف مـن دمـج أثنـ.التوزیعیـة اصـبح ضـروریا فـي الوقــت الحاضـر ي مـن لنقل البیانات باللغة العربیة دون حدوث اي معوقـات اوتشـویه للبیانـات و ضـمن الوقـت الحقیقـالخوارزمانات المستعملة .خالل بیئة قواعد البیانات التوزیعیة نقل البیانات ضمن الزمن الحقیقي، تنظیف البیانات، قواعد البیانات التوزیعیة: الكلمات المفتاحیة