IBN AL- HAITHAM J. FOR PURE & APPL. S CI.         VOL.24 (2) 2011  

 
Developing a Real Time Method for the Arabic 

Heterogonous DBMS Transformation  
 
  S. M. Hadi  , S. Murtatha  
Departme nt of Information & Comm. Eng. College  of Engineering  Al-
Kha warizmi  ,Unive rsity of Baghdad                                                             
Departme nt of  Computer Science,   College of Science, Unive rsity of 
Baghdad  
Received  in :  12, June, 2011 
Accepte d  in :  13, July, 2011 
 
Abstract 
       A common p roblem facing many  Ap p lication models is to extract and combine 
information from multiple, heterogeneous sources and to derive information of a new quality  
or abstraction level. New app roaches for managing consistency , uncertainty  or quality  of 
Arabic data and enabling e-client analysis of distributed, heterogeneous sources are st ill 
required. This p aper p resents a new method by  combining two algorithms (the p artitioning 
and Group ing) that will be used to transform information in a real time heterogeneous Arabic 
database environment. 
  
Key word: Heterogeneous DB, M app ing DB, cleaning data and real time transforming 
 

 Introduction 
    One of the first and most imp ortant steps in any data p rocessing task is to verify  that t he 
data values are correct or, at the very least, conform a set of rules. Data warehouses require 
and p rovide extensive sup p ort for data cleaning. They load and continuously  refresh huge 
amounts of data from a v ariety  of sources so the p robability  that some of the sources 
contain”dirty  data” is high.  Furt hermore, data  warehouses are used  for decision making,  so  
that the correctness of t heir data is vital to avoid wrong conclusions [1]. 
        M any  applications today need information fro m diverse d ata sources, in which related 
data may  be represented quite differ ently , in one common scenario, if a DBA wants to add 
data from a new source to an existing warehouse DB, the data in the new source may not 
match the exist ing warehouse schema and this will cause a problem [2]. 
In any scenarios, one or more data sets must be mapp ed into a single target representation. 
Needed transformations may  include two app roaches: 

 Schema transformations (changing the structure of the data)  
 And data transformation with cleansing (chan ging the format and vocabulary  of the 

data and eliminating or at least reducing dup licates and errors).  
In each area, there is a broad range of p ossible transformations, from simple to comp lex.  
Schema and data transformation has typically been st udied sep arately . We believ e they need 
to be handled together via a uniform mechanism [3]. 
    In addition to a consistent and uniform access to heterogeneous sources there is a further 
value of integr ation. The integrated data can contain information, which are not integrated  
data exp licitly but in the form of  dependencies, relationships or p atrons over the various  
sources [4].   

 
IBN AL- HAITHAM J. FOR PURE & APPL. S CI.         VOL.24 (2) 2011  

 
The Data Cleaning 
     Data cleaning, also called data cleansin g or scrubbing, deals with detecting and remov ing 
errors and inconsistencies from d ata in order to imp rove the quality  of data. Data quality  
p roblems are p resent in single data collections, such as files and databases, e.g., due to  
missp ellings dur ing data entry , missing information or other invalid d ata. The data  
warehouses require and p rovide extensive sup p ort for the data cleanin g [5].  
     They load and continuously  refresh huge amounts of data from a variety  of sources so t he 
p robability  that some of the sources contain”dirty  data” is high. Furt hermore, data warehouses  
are used for decision making, so that the correctness of their data is vital to avoid wrong 
conclusions. The need for  data cleaning increases si gn ificantly . This is because the sources  
often contain redundant data in different r epresentations. Provide access to an accurate and 
consistent data, consolidation of different data representations and to elimination the duplicate 
information, in this fut ure age the cleanin g p rocess become n ecessary  [6]. 
        During the ETL p rocess (extraction, transformation, loading), illustrated in Figure (1),  
further data transformations deal with schema/data translation and integr ation, and with  
filterin g and aggregatin g data to be stored in the warehouse. In this Fi gure, all data cleanin g is  
typically p erformed in a sep arate data st aging area b efore loading the transformed data into  
the warehouse. A large number  of tools of varying functionality  are available to supp ort these 
tasks, but often a significant p ortion of the cleaning and transformation work has to be done 
manually  or by  low-level p rograms t hat are difficult t o write and maintain. 
    While a hu ge body  of research deals with schema translation and schema inte gration, data  
clean ing has received only little attention in the research community . A number of authors 
focused on the p roblem of dup licate identification and elimination, while som e resear ch 
group s concentrate on general p roblems not limited but relevant to data cleaning, such  as  
sp ecial data mining approaches and data transformations based on schema matchin g. M ore 
recently , several research efforts p rop ose and invest igate a more comp rehensive and uniform 
treatment of data cleaning cover ing several transformation phases, sp ecific operators and their  
imp lementation. [7]. 
     In this p ap er we focus on the new p roblem that faces transformation and conversion in 
ETL p rocess related to our DB environment, which is the Arabic data. Several st andard  
p rograms and methods (workin g under windows op erating sy st em) are available for  
transformation of English data in any database. But the same transformation will be so  
difficult when we deal with an Arabic data in a real time from any sp ecial database or  
warehouses. 
 

The Proposed System 
In the p resent work, four methods have been p rop osed to solve the problem in a sequential 
approach to assess the advantages and disadvantages of each method. All are b ased on find ing 
new relationship s between the original Arabic data and the converted un-clear (rubb ish) one 
(using the standard methods). T o check the validity  of these prop osed methods, a case st udy  is 
conducted to retrieve an Arab ic d atabase file from FoxPro DB under Dos to Oracle 10g und er 
window. T he p rop osed sy stem is based on a server to servers' architecture conn ection; or what  
we call it as a three tire architecture that will consist  of the following: 
 

A- The  Se rver (desti nation se rver): This server will represent the database server that 
collect data from mu lti-source and can b e act as a  warehouse server,  oracle DBM S will b e the  
core that hold the warehouse data which will be treated and cleaned in the middleware.  
B- The Middleware: is the software that do ETL p rocess, this p art can be built in the 
destination server or could be bu ild in a sep arated server to increase the p erformance of the 
database and to avoid  the load  that may  be happ ened on the destination server as shown in 
Figure (2) that contain the prop osed sy stem architecture.  
 

IBN AL- HAITHAM J. FOR PURE & APPL. S CI.         VOL.24 (2) 2011 

  
C- The  Clients (sources servers): the second server will be the multi source local d atabase 
servers that contain all the desired d ata ,each source can be use different DBM S  (M S-Access,  
FoxPro under-DOS, Excel, M ysql…etc).  
  M ost  of the Arabic information has a p roblem when  transformin g it from one d atabase to  
another due to many reasons. The reason why  the data ap p ears t o be different (rubbish) is not  
related to the difference in the DBM S only. This p roblem might appear also when transferring 
the Arabic data to the same kind of DM BS with the same version. The real fact and the  main  
reason to this p roblem is the difference in the Character Set used in the Op erating Sy st em 
combined with the DBM S. 
 

    For examp le, the Op erating Sy stem (DOS) uses a sp ecial character set for the Arabic 
lan guage and the Op erating Sy st em WINDOWS uses t he (M S-WIN 1256) Character Set for  
the Arabic lan guage and (M S-WIN 1252)  Character Set for t he En glish lan gu age. From other 
hand the Op erating Sy st em (LINUX), use the different set character for the Arabic lan guage 
and for the English lan gu age. So if we tried to transfer the data that have been written in 
Arabic lan gu age from Oracle10 g DBM S using the (M S-WIN 1256 Character Set) host ed by 
Windows Server 2003 Op erating system to another Oracle10g DBM S hosted by  another 
comp uter that has the same operating sy stem but using the (M S-WIN 1252 Character Set) t hat 
will not supp ort the Arabic langu age, then the transferred data will defiantly  appear to be 
different in shap e and meaning (rubbish).  
      There are many soft ware p rograms used to solve this p roblem that will transfer the data to  
its original for m regardless of what the data is, but  the real problem is that all these p rograms  
works using the Single Patch form, which means it transfers all the data directly and at once 
for the same DBM S. Because of that t he needs t o design a new method b ecome requ ired  that 
will enable us to transfer huge amounts of data from any language in Real Time's 
environment between differ ent DBM S(s) and op erating sy stems. 
 

The Problem Solutions 
    In this p rop osed sy stem we imp lement four solutions to solve the p roblem that mansion 
before each  of these methods have been  imp lemented on a real case  st udy  that contain a kind 
of an Arabic data. The four methods will be illustrated sequentially as the following: 

 First Method 
  It has been thought of many way s to solve this p roblem, one solution used to be  a new 
method was designed to insert all the Arabic letters and their movements (The movement in 
the Arabic language used to be treated as a sep arated letter in the comp uter that will has a 
sep arated ASCII code) into t he syst em that will transfer data from it, the result of translating 
these data to the related ASCII Code b efore the op eration of transformin g and after, can b e 
seen in the Look up  Table shown in table (1). This method requires sep arating the words 
(data) letter by letter, and each letter is to be worked on and swapp ed with t he corresp onding 
letter in the Look up  Table. This method can be exp lained as the following st eps which have 
been written as a p rogram with the (Procedural Langu age/Structured Query Language) known  
as (PL/SQL) in Oracle10g,  

1. The data will be converted for each table column by column,  
2. And each colu mn will be read raw by  raw, 
3. And each raw will b e read cell by cell,  
4. Then each cell will be read word by  word 
5. Finally each word will be read letter by letter (of course the mov ement will be treated 

as a separated lett er). 
    When we transform the Arabic d ata and gett ing the rubbish view as mentioned before these  
rubbish letters will be managed and after read ing each one, it will be co mpared with the Look  
up  Table to obtain the original letter, this is where the swapp ing p rocedure will take p lace. 
Each rubbish l ett er will be swapped with the original lett er. This method has been highly  


IBN AL- HAITHAM J. FOR PURE & APPL. S CI.         VOL.24 (2) 2011 

  
efficient when readin g the transferred  data, howev er, it is  not suitable for  hu ge amounts of  
transferred data (just like a warehouse database), and due to the fact it takes t oo much time for 
the transformation p rocess. T herefore, this method can be suitable only  for the small Database 
that has small amount of data and this will gu ide us t o the second method. 

 Second Method 
      The second method has been p rogrammed usin g SQL Server 2008 as a p ackage in ord er 
to get benefit of using the Integrated Services characteristics. This method is quite similar to 
the first method; the only  diference is that it works as a Parallel for the tables and the 
columns. In fact by  using this method it will be possible to t ransfer more than one table in the  
same time and ev en transferring all the columns in one table in the same time, but t he process 
will remain to be Serial for  any colu mns' data ( Serial concerning the rows). Obviously  this 
method gives us more sp eed in the transformation process comp aring with the first  method. 

 Thir d Method 
     The third method has been built on the second  method with some development. With This  
method we use the Table Partitioning Technique, through this technique the table wi ll b e 
divided in to dummy  p arts according to the st ored data, in a way  that each p art will contain 
some p ortion of the data from all columns (that’s mean the dividing will be hor izont ally not 
vertically ). Here we  will have p arallel p rocessing for  all the Table's p arts at the same time, 
and this will ensure to make this method faster than the second and the first method, but we 
find that the coming method will be the optimum solution for the p roblem rep resented here. 

 Forth Method 
    This method is a develop ment of the third method. Aft er study ing and analyzing the data in 
the Look up  Table, we discover that an implementation of a minus p rocedure p erformed 
between the ASCII code of  the original data and the rubbish data will give us a new fa ct. T his 
op eration of havin g the d ifference between the Two ASCII cod will gain a sp ecial number 
that will be related to a certain amount for each group  of letters (each group  of letters beholds  
one differ ence as shown in Table No. (1).  by  using some analy sis we will reduce the 
p ossibilities of search ing from (35) p ossibilities to (10) p ossibilities only. This will lead the 
p rogram not to search  in all the Look up Table In another word by  elimin ating the amount of  
p ossibilities we reduce the I/O op erations.  
      In this method we will divide all the Arab ic lett ers to only (10) ASCII group s, t hen during 
the transformation p rocess we will check each letter in which group  it will be related, then we 
will mak e the swap op eration on the original ASCII code according to its group  number.  This  
method will b ecome the fastest comp aring with the other 3 methods mentioned above; it is  
indeed the method in which we  reco mmend when transferrin g hu ge amounts of data in the 
Real Time environment as shown in Figure (3). 

 
Result and Conclusions 
      Through app ly ing an analy sis methods concernin g the 4 methods and the time required to  
p erform the transp ortation op eration in a real time environment we have the result of this 
mathematical analy sis shown in Figure (4) which will app ear that the forth method will be the 
new fasts method that will solve the p roblem of transferring a huge Arabic data from one 
DBM S to another without any  rubbish data and the most imp ortant in a real time for m. 
 

As a conclusion for imp lementing this method on a real data taken from the database of the 
M inistry  of Internal Affairs we can list the following p oints: 

1. Through the imp lementation p rocess we concluded that the forth developed method 
will work very  fast in an accurate manner concerning the transformation op eration for 
any DBM S hosted by any  Op erating Sy st em p laced in any  Network architecture. 

 
IBN AL- HAITHAM J. FOR PURE & APPL. S CI.         VOL.24 (2) 2011  

 
2. It has been found that this method can be imp lemented successfully  not only on the 

Arabic data, in fact it will be p owerful also with any language data which have the  
3. same characteristics of the Arabic lan guage just like the Kurdish, Chinese, Korean, 

…..ext.  
4. By  using the data mining with this method we could obtain different data and 

information from the DBM S of the different source dest ination servers and in different  
lan guages not only in the Arabic. 

 
Re ferences 

1. Niswonger, B. ; Haas, L. M . , M iller, R. J.( 2009), T ransforming Heterogeneous Data 
with Database M iddleware beyond Integration,www.10.1.1.56.1853.edu. 

2. Heiner Stuckensch midt, Ubbo Visser, (2008), Using environmental Information 
Efficiently: Sharin g Data and Knowled ge From Hetero geneous Sources, T he 
Computing Technologies Center.  

3. Kai_Wie sattler and Gunter Sakee, (2006),sup p orting Information Fusion with 
Federated database Technolo gy , IEEE co mputer transaction. 

4. Susan B. Davidson and  Kosky  ,  S. (2006), A Langu age for Database Transformations 
and Const raints, www.ICDE97.com. 

5. Sundus Norry  Shukry , (2007), Decision supp ort sy stem of JAVA infrastructure 
Analysis and control, Research Journal of App lied Science,USA.  

6. Erhard Rahm  Hon g Hai Do, (2009), Data Cleaning: Problems and Current  
App roaches, www.ghh.com.  

7. Ronald Cody , Ed.D., Robert Wood Johnson, (2009), Data Cleanin g 101,  
www.ats.ucla.edu.  

 
IBN AL- HAITHAM J. FOR PURE & APPL. S CI.         VOL.24 (2) 2011 

 
Table :(1) Contains the Arabic le tters and its corresponding AS CII Code 

Seq. 
rubbish 

Character 
rubbish 

ASCII 
Original 

Character 
Original 

 ASCII 
ASCII 

 Difference 
group 

1 À 22 21 ض 4 3 ­11 Group 1 

22 ل 2 21 ط 5 6 ­9  
 

Group 2 

3 Â 22 21 ظ 6 7 ­9 
22 م 4 21 ع 7 8 ­9 
22 ن 5 21 غ 8 9 ­9 
6 Ç 23 22 ق 1 2 ­9 
7 È 23 22 ك 2 3 ­9 

22 ه 8 22 ف 9 1 ­8  
Group 3 9 É 23 22 ل 3 5 ­8 

10 Ê 23 22 م 4 7 ­7  
 

Group 4 
11 Ë 23 22 ن 5 8 ­7 
23 ى 12 22 ه 6 9 ­7 
23 ي 13 23 و 7 0 ­7 
14 

rubbish Î 23 23 ى 8 6 ­2 
Group 5 

15 Ï 23 23 ي 9 7 ­2 
16 << and >> 18 21 س 7 1 24 Group 6 

17  ­ 17 21 ص 3 2 39 Group 7 

15 ں 18 19 ا 9 9 40  
 

Group 8 

19 space   16 20 ب 0 0 40 
20 ، 16 20 ة 1 1 40 
21 ¢ 16 20 ت 2 2 40 
22 £ 16 20 ث 3 3 40 
23 ¤ 16 20 ج 4 4 40 
24 ¥ 16 20 ح 5 5 40 
16 ال 25 20 ح 5 5 40 

26 ¦ 16 20 خ 6 6 40 
27 § 16 20 د 7 7 40 

28 ¨ 16 20 ذ 8 8 40 
29 © 16 20 ر 9 9 40 
؛|||| ھ 30  17 21 ز 0 0 40 
31 « 17 21 س 1 1 40 
32 ¬ 17 21 ش 2 2 40 

15 ک 33 19 ء 2 3 41  
Group 9 34 › 15 19 ؤ 5 6 41 

35 ‹ 13 19 ؤ 9 6 57 
Group 

10 
 

IBN AL- HAITHAM J. FOR PURE & APPL. S CI.         VOL.24 (2) 2011 

 
Fig. (1): The ETL process i n a data warehouse.[6]  

 
Fig. (2): Represe nt the  Middleware for the  propose d S ystem 

 
IBN AL- HAITHAM J. FOR PURE & APPL. S CI.         VOL.24 (2) 2011 

 
Fig. (3): The Block Diagram of the 4 methods 

 
Fig. (4): The Time Analysis for the 4 Methods 

 
 2011) 2( 24مجلة ابن الهیثم للعلوم الصرفة والتطبیقیة               المجلد

لتحویل البیانات العربیة في قواعد البیانات تطویرطریقة ضمن الزمن الحقیقي 

  الموزعة

  
  سها محمد هادي  ،  صفاء مرتضى   

تصاالت  ،  كلیة الهندسة الخوارزمي ،   جامعة بغداد                                                       قسم هندسة المعلومات واال

 قسم علوم الحاسبات،كلیة العلوم ، جامعة بغداد  

  2011،حزیران،12: استلم البحث في 

  2011،تموز، 13:   قبل البحث في

  
  الخالصة
المشـــاكل التـــي تواجــه نمـــاذج تطبیقـــات قواعــد البیانـــات یمكـــن تلخیصــها بكیفیـــة التعامـــل وادارة ن واحــدة مـــن اهــم ا        

ان الحاجـة الـى . من المصادر المختلفة لقواعد البیانـات التوزیعیـة لیـتم تخزینهـا فـي قاعـدة بیانـات موحـدة  ودمجها المعلومات

ومــات المخزنــة باللغـة العربیــة ضـمن قواعــد البیانــات فر طریقــة جدیـدة للتعامــل بثبـات مــن اجــل الحصـول علــى نـوع المعلاتـو 

ین مــن ولهـذا سیتضـمن هــذا البحـث تقـدیم  طریقـة جدیـدة تتــالف مـن دمـج أثنـ.التوزیعیـة  اصـبح ضـروریا فـي الوقــت الحاضـر

ي مـن لنقل البیانات باللغة العربیة دون حدوث اي معوقـات اوتشـویه للبیانـات و ضـمن الوقـت الحقیقـالخوارزمانات المستعملة 

  .خالل بیئة قواعد البیانات التوزیعیة

  
  نقل البیانات ضمن الزمن الحقیقي، تنظیف البیانات، قواعد البیانات التوزیعیة: الكلمات المفتاحیة