Acta Polytechnica 58(5):297–307, 2018, doi:10.14311/AP.2018.58.0297, © Czech Technical University in Prague, 2018, available online at http://ojs.cvut.cz/ojs/index.php/ap

NEW APPROACH FOR ONLINE ARABIC MANUSCRIPT RECOGNITION BY DEEP BELIEF NETWORK

Benbakreti Samir∗, Aoued Boukelif
University of Djillali Liabes, Department of Electronic, Sidi Bel Abbes 22000, Algeria
∗ corresponding author: benbakretisamir@gmail.com

Abstract. In this paper, we present a neural approach for unconstrained Arabic manuscript recognition that uses the online writing signal rather than images. First, we build a database containing 2800 characters and 4800 words collected from 20 different handwritings. Thereafter, we perform the preprocessing, feature extraction and classification phases, respectively. The use of classical neural network methods has been beneficial for the character recognition, but revealed some limitations for the recognition rate of Arabic words. To remedy this, we used deep learning through the Deep Belief Network (DBN), which resulted in a 97.08 % recognition rate for Arabic words.

Keywords: Arabic manuscript; online recognition; neural networks; MLP; TDNN; RBF; deep learning; DBN.

1. Introduction

Nowadays, the cursive writing recognition problem represents a major challenge for both handwritten and typed forms. The complexity of this problem comes from the style, the variability of the patterns and the tilt in the case of unconstrained handwriting; it is even worse for languages with more complicated morphological characteristics. Arabic handwriting, which is cursive by nature, presents a great variability and is consequently very difficult to recognize. The recent electronic revolution has allowed the emergence of new typing devices capable of creating high-quality online documents, such as exam papers, notes, filled forms, online reporting, etc. This progress extends the application field of online handwriting recognition, which was previously restricted to small terminals (PDA, smartphone) that support character recognition only. Online documents represent a new source of information that still has few applications in the field of pattern recognition. They are represented by signals that may be assimilated to sampled trajectories captured from the handwriting instrument as a temporal sequence of points (x(t), y(t)), represented in an orthonormal coordinate system.

The Arabic script is semi-cursive in both forms, printed and handwritten. A word may be composed of one or several pseudo-words. These pseudo-words are ligated horizontally or vertically, which makes the segmentation task very complicated. The shape of a character differs according to its position in the word. Some characters include diacritical points, which lie above or below the body; others share the same body or shape, and only the diacritical points allow two characters to be differentiated. Also, in handwritten Arabic text, variations in the horizontal bands appear according to the calligraphy of the characters contained in the pseudo-word [1].

Figure 1. Online Arabic handwriting: (a) the “Bah” character in the middle of a word; (b) the “Tah” character at the end of a word; (c) the word “wahran” composed of 4 pseudo-words.

In the case of the manuscript, this complexity is even more important, because other problems may arise, such
as the intra- and inter-writer variability, the overlap of pseudo-words, the fusion of diacritical points, the writing conditions or the writer's state.

In this paper, the use of the on-line aspect decreases some processing difficulties, because the diacritical points are considered as a signal that complements the character body; hence the fusion of diacritical points is not a real problem. However, the segmentation into pseudo-words or characters is not performed, because the word is taken in its entirety and the overlaps are therefore ignored. Nevertheless, these features increase the intra- and inter-writer variability.

Among the approaches presented in the literature for handwriting recognition, neural networks demonstrate a great discriminating power and a very good capacity to construct class separators in multidimensional spaces. That being said, models based on hidden Markov models (HMM) [2] use a parametric approach to model sequences of observations. These sequences are generated by stochastic processes (e.g., the handwriting of a manuscript), which are more visible in the word recognition process. The HMM has a strong expressive power to model the distribution of observations for any class. For the character case, neural networks are much better adapted, because it is the global shape that dominates. These networks have the advantage of being compatible with the two-dimensional pictorial nature of the writing.

1.1. Related work

Handwriting recognition is a complicated task due to the variability of handwriting styles. Indeed, the shape of a handwritten character differs not only from one writer to another, but also for a single writer, according to its position in the word, the neighbouring letters, the writer's health (motor disabilities, Parkinson's disease, ...) and the psychological state (depression, sadness, ...). In addition, it is also possible to have the same trace for two different meanings depending on the context, which represents another source of ambiguity. The aforementioned properties make handwriting recognition a challenging and complicated pattern recognition problem.

Online handwriting recognition is not a recent subject; it dates back around 30 years. However, the emergence of more sophisticated devices, like smartphones and tablets, has stimulated more research in this field. The online recognition problem is characterized by three fundamental properties: the temporal chaining of the line, the dynamics of the trace (speed, acceleration, pressure on the pen) and the trace skeleton (the thickness of the line is ignored). In [3], Liu et al. presented the state of the art of online Chinese handwriting recognition. In [4], Kessentini et al. developed a unified approach for Arabic and Latin alphabetic script recognition.

However, few works have been interested in online Arabic writing recognition. In fact, its cursive nature triggers numerous problems in an automatic recognition system. These problems are related to the diversity and the complexity of the handwriting signal. The unavailability of online character databases also represents another problem for this field of research. Consequently, most of the works focused on the offline aspect, such as [5], where Märgner et al. present a development strategy for offline Arabic manuscript recognition systems.
More recent work is presented in [6], which proposes a deep architecture for Arabic manuscript recognition. Nevertheless, several works treat the online case, like Mezghani et al., who used a Bayesian classification [7] and a Kohonen network [8] for online Arabic character recognition. In [9], Biadsy used hidden Markov models for Arabic manuscript recognition. More recently, S. Benbakreti and A. Boukelif proposed a TDNN (time delay neural network) based recognition system for online isolated Arabic characters [10].

In this work, we present the design of an Arabic handwriting recognition system and propose a classification approach based on neural networks. Figure 2 illustrates the architecture of the proposed system.

Figure 2. Architecture of the recognition system: acquisition and storage interface → pretreatment → feature extraction (optional step according to the NN choice) → classifier (NN) → identified character or word.

2. Processing and features extraction

The preprocessing stage of the recognition system described above performs the character size normalization and samples the trace into a fixed number of equidistant points. The next stage extracts features from the previously sampled trace, which prepares the input for the neural network classifier. After the training phase, the class of the character is supplied at the output layer. This procedure is followed by almost all the neural networks. However, the deep neural networks do not require any feature extraction phase; the task is done implicitly in the network. The task of character recognition is difficult because of the large variability inherent to the handwriting style and the variation of the character position in the word (see Figure 3).

Figure 3. Different forms of the Arabic character according to its position in the word.

For most of the Arabic letters, similarities are noticed between the forms at the beginning/middle on the one hand and at the end/isolated on the other. The presence of a ligature with a previous/next letter does not significantly modify the form of the letter (no more than in the case of Latin cursive handwriting). Arabic ligatures occur where two letters are written one upon the other. Furthermore, the Arabic writing is rich in diacritic signs (or secondary signs) such as vowels, points (dots), chaddah, maddah, hamzah, etc. In our work, we define a diacritical mark as a secondary component of a letter, which may complete it or even modify the whole sense of the word. The vowel signs, such as el damah, el kasrah, el fathah and el soukoun, are not considered.

These specificities of the Arabic language make the recognition task more complicated, hence the necessity of a preprocessing including the following steps:

• A spatial sampling that allows preserving the useful information of the signal and excluding the redundant information resulting from the repetition of points. Indeed, the duration of the character formation as well as its shape can vary significantly from one writer to another and from one occurrence to another. The spatial sampling transforms a handwriting signal into a sequence of equally spaced points with coordinates [X(n), Y(n)] (n is the point index) according to the length of the trace, for a fixed number of points, as follows:
– Definition of the line between pen-up and pen-down, with N the number of sampling points, and reservation of the memory space for the vectors X, Y and PenUp/Down.
– Calculation of the total length of the signal: for a single stroke, Length = length + distance between points; for several strokes, Length = length + length between strokes.
– Calculation of the sampling distance Dist-samp = Length/(N − 1) and determination of the N − 1 points by interpolation.

• Character normalization and centring: the character is recentred on (x0, y0) and then normalized to the maximal size of the character, in order to obtain a representation that is invariant with respect to translations and spatial distortions.

A sample result of these two preprocessing steps is illustrated in Figure 4.

Figure 4. Result of preprocessing: (a) character Noun before preprocessing; (b) character Noun after preprocessing.

Once the preprocessing step is finished, in order to facilitate the classifier task, geometric information, such as the direction of the movement and the curvature of the trajectory, is extracted from this sequence of points. We then obtain a sequence of vectors with 7 features each:

• The x coordinates
• The y coordinates
• The cosine directions of the movement direction (cos θ and sin θ)
• The cosine directions of the trajectory curvature (cos Φ and sin Φ)
• The pen up/down state.

This step is summarized in Algorithm 1. Once the feature extraction is realized, we can proceed to the classification.

Algorithm 1 (normalization and feature computation).
  xmax ← x(0); xmin ← x(0); ymax ← y(0); ymin ← y(0)
  {Calculation of the centre of the character [x0, y0]}
  for n = 0 to N − 1 do
    if x(n) > xmax then xmax ← x(n) end if
    if x(n) < xmin then xmin ← x(n) end if
    if y(n) > ymax then ymax ← y(n) end if
    if y(n) < ymin then ymin ← y(n) end if
  end for
  x0 ← (xmin + xmax)/2; y0 ← (ymax + ymin)/2
  {Calculation of the scaling factor Delta}
  Deltax ← xmax − xmin; Deltay ← ymax − ymin
  if Deltax < Deltay then Delta ← Deltay else Delta ← Deltax end if
  {New coordinates x[n][1] (x coordinate), x[n][2] (y coordinate) and pen state x[n][7], for every point n}
  x[n][1] ← (X[n] − x0)/Delta
  x[n][2] ← (Y[n] − y0)/Delta
  x[n][7] ← penUpDown[n]
  {Direction cosines x[i][3] and x[i][4], for every point i; dS is the length of the chord}
  dx ← X[i+1] − X[i−1]; dy ← Y[i+1] − Y[i−1]
  dS ← sqrt(dx·dx + dy·dy)
  if dS = 0 then x[i][3] ← 0; x[i][4] ← 0
  else x[i][3] ← dx/dS; x[i][4] ← dy/dS end if
  {Special points: the initial point takes the values of the next point, the last point those of the previous point}
  {Curvature x[i][5] and x[i][6], computed from the direction cosines, for every point i}
  x[i][5] ← x[i+1][3]·x[i−1][3] + x[i+1][4]·x[i−1][4]
  x[i][6] ← x[i+1][3]·x[i−1][4] + x[i+1][4]·x[i−1][3]
  {Special points: the initial point takes the values of the next point, the last point those of the previous point}

Figure 5. MLP type connection — complete field of view.

3. Neural Network Classification

We have used a set of neural classifiers that gave satisfactory results; the proposed neural architectures are described below.

3.1. Multi-Layer Perceptron (MLP)

We have used an MLP with error back-propagation and one hidden layer only [11, 12]. The seven features extracted from the Arabic characters (characters in their isolated form, at the beginning, middle and end), generated by the previous module, constitute the inputs of the network. The hidden layer includes 80 neurons. The classes to be discriminated are the 28 characters of the Arabic alphabet (or the 48 words), hence the choice of 28 (48) neurons for the output layer.
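As an illustration only (not the authors' implementation), the following minimal sketch reproduces this configuration with scikit-learn's MLPClassifier: one hidden layer of 80 neurons, a logistic (sigmoid) activation and back-propagation training. The number of resampled points per character and the training data are assumptions introduced for the example.

import numpy as np
from sklearn.neural_network import MLPClassifier

N_POINTS = 30      # assumed number of resampled points per character
N_FEATURES = 7     # x, y, cos/sin of the direction, cos/sin of the curvature, pen up/down
N_CLASSES = 28     # 28 Arabic letters (48 for the word task)

# Each character is an (N_POINTS x N_FEATURES) sequence, flattened into one input vector.
X_train = np.random.rand(200, N_POINTS * N_FEATURES)      # placeholder feature vectors
y_train = np.random.randint(0, N_CLASSES, size=200)       # placeholder class labels

# One hidden layer of 80 neurons, sigmoid ("logistic") activation, trained by back-propagation.
mlp = MLPClassifier(hidden_layer_sizes=(80,), activation="logistic",
                    solver="sgd", learning_rate_init=0.01, max_iter=2000)
mlp.fit(X_train, y_train)
print(mlp.predict(X_train[:1]))    # predicted class index of the first sample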
The unipolar sigmoid was used as the neuron activation function. Figure 5 shows how the MLP processes the letter Mim. However, the training of the lowest layers (those farthest from the output layer) is less efficient in a deep MLP [13, 14]: the parameter updates become less and less relevant as the error is propagated back towards the lowest layers. Because of that, we use only one hidden layer for our MLP.

3.2. Time Delay Neural Network (TDNN)

The used TDNN consists of three layers: input, hidden and output. Furthermore, every layer possesses two directions: a feature direction and a temporal direction.

Figure 6. Convolution network connection — restricted field of view.

The TDNN is distinguished from a multi-layer perceptron by the fact that it takes a notion of time into account. Consequently, it processes all the neurons of the input layer at the same time (see Figure 6), before making a temporal scan. This is the notion of the temporal window. The TDNN has three basic principles: temporal window, shared weights and delays:

• The temporal window: the basic idea is that every neuron of the L + 1 layer is only connected to one subset of neurons of the L layer. After an experimental study, we have fixed the size of this window to seven neurons.
• The shared weights: this notion allows reducing the number of network parameters, which positively influences the network generalization capacity. A window with a given characteristic will have the same weights along the temporal direction. It also allows the differences to be extracted progressively when scanning the signal.
• The delays: in addition to the previous two constraints, we introduce delays between two successive windows of a given layer. The first delay is three neurons and the second one is four.

For the shared weights constraint, the same neuron is duplicated in the time direction (the same duplicated weight matrix) to detect the presence or the absence of the same feature at various points of the signal. By using several neurons in every temporal position, the network detects different features: the outputs of the different neurons of one layer produce a new feature vector for the next layer, and so on. Thus, the temporal component of the original signal is progressively eliminated and transformed into features by the upper layers. To compensate for this loss of information, we increase the number of neurons in the feature direction.

Figure 7. The RBF network architecture: input layer, centres, output layer.

3.3. Radial Basis Function network (RBF)

The RBF uses radial functions rather than sigmoids (e.g., in the MLP) to build local decision functions centred on subsets of the inputs [15]. The summation of all local functions represents the global decision function [16], which helps to solve the local minima problem. The used RBF network [17, 18] consists of 3 layers (Figure 7). Every hidden node applies a kernel function to the input data, then the output layer performs a weighted sum of these kernel functions. Every node is characterized by two parameters, the width and the centre of the radial function. If the input data of a node are close to the centre (in terms of the Euclidean distance), the output value will be high. In this work, we have used a Gaussian kernel function.
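As a purely illustrative sketch (not the authors' code), the response of one Gaussian hidden node and the weighted sum performed by the output layer can be written as follows; the centres, widths and weights are assumed to be already determined.

import numpy as np

def rbf_node_output(x, centre, width):
    # Gaussian kernel: inputs close to the centre (in Euclidean distance) give a value near 1.
    squared_distance = np.sum((np.asarray(x, float) - np.asarray(centre, float)) ** 2)
    return np.exp(-squared_distance / (2.0 * width ** 2))

def rbf_network_output(x, centres, widths, weights):
    # Output layer: weighted sum of the kernel responses of all hidden nodes
    # (one row of weights per output class in a classification setting).
    phi = np.array([rbf_node_output(x, c, w) for c, w in zip(centres, widths)])
    return np.asarray(weights) @ phi

# Tiny usage example with two hidden nodes and three output classes (made-up numbers).
scores = rbf_network_output([0.2, 0.4], centres=[[0.1, 0.5], [0.8, 0.3]],
                            widths=[0.3, 0.3], weights=np.random.rand(3, 2))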
The complete configuration of the network is achieved after determining the centre and the width associated with each node, as well as the weights of the connections between the hidden layer and the output layer, which contains 28 neurons for the Arabic characters or 48 neurons for the words representing the Algerian provinces. In this work, we have experimented with several variants of RBF networks; a brief description of each of them is given in the following paragraphs.

3.3.1. Radial basis function networks exact (RBFNE)

The design of an RBFNE can be done using a function that takes as input the matrix of input vectors P, the target vectors T and the spread factor of the radial basis layer; it then returns a network with weight and bias values such that the output is exactly T when the input is P. This function creates as many radial basis neurons as there are input vectors in P. So, we have a radial basis neuron layer in which every neuron acts as a detector for a different input vector. The only parameter of the RBFNE is therefore the spread of the radial basis functions of the first layer. In this work, the value of the spread factor is 0.3.

3.3.2. Radial basis function (RBF)

The second type creates radial basis neurons iteratively, one at a time. In every iteration, the input vector that reduces the network error the most is used to create a new radial basis neuron. The error of the new network is then checked: if it is sufficiently small, the network creation is ended, otherwise a further neuron is added. This procedure is repeated until the targeted mean squared error is reached (10^−6), or the maximum number of neurons is reached (300). The number of neurons added between every evaluation is fixed, after several experiments, to 25.

3.3.3. Generalized regression neural networks (GRNN)

This variant is another type of NN, proposed by Specht [19]. This regression-based network estimates the expected mean value of the output variable using a Bayesian technique. Any continuous function can be approximated by a linear combination of Gaussian functions. The objective is to perform a regression, i.e., to build a good approximation of a function that is only known through a limited number of experimental points. This network can also be used to solve the classification problem. For each input, the first layer calculates the distances between the input vector and the weight vectors and then produces a vector, which is multiplied by the bias. The GRNN diagram is described in Figure 8.

Figure 8. The GRNN network architecture: input layer, radial basis layer, special linear layer, output layer.

The other GRNN-specific layer is called the special linear layer and consists of two subparts: the numerator part performs a summation of the products of the training output data and the activation functions, while the denominator part is the summation of all activation functions. This layer feeds both the numerator and the denominator to the output layer.

3.3.4. Probabilistic neural networks (PNN)

The probabilistic neural network, introduced by Donald Specht in 1988, is a feedforward network with three layers used for the classification of data [20]. Contrary to the other neural networks, which are based on the back-propagation method, the PNN is based on the Bayes decision strategy and the probability density estimation.
The PNN uses spherical Gaussian radial basis functions centred on each training vector. The membership probability of a vector in a given class is expressed as follows [21, 22]:

$$f_i(x) = \frac{1}{(2\pi)^{p/2}\,\sigma^p\, M_i} \sum_{j=1}^{M_i} e^{-\frac{(x-x_{ij})^T (x-x_{ij})}{2\sigma^2}}, \qquad (1)$$

where i indexes the classes (28 or 48 in our case), j indexes the forms to be recognized, M_i is the number of training vectors of the i-th class, x is a test vector, x_ij is the j-th training vector of the i-th class, p is the dimension of the vector x, σ is the standard deviation, and f_i(x) is the summation of the spherical multivariate Gaussian functions centred on each training vector x_ij, used for the evaluation of the probability density function of the i-th class. The classification decisions are taken according to the Bayes decision rule [22]:

$$d(x) = C_i \quad \text{if } f_i(x) > f_k(x) \text{ for } k \neq i, \qquad (2)$$

where C_i is the i-th class. When an input is presented:

• The first layer calculates the distances between the input vector and all the training vectors, and produces a vector whose elements indicate how close this input vector is to every training vector.
• The second layer adds these contributions for every input class to produce a probability vector at the network output. Finally, a transfer function at the output of the second layer selects the maximum of these probabilities, and assigns “1” to the corresponding class and “0” to the others.

3.4. Deep Learning

The deep learning models are built in the same way as the previously described multilayer perceptron, except that the middle layers are more numerous. The main difference, however, is the learning method, which distinguishes them from a classic MLP and which has given them renewed interest since 2006. In fact, the limits and the drawbacks [23] of the classic architectures mentioned so far in this paper contributed to this interest. For these reasons, Bengio et al. [11] proposed a greedy learning algorithm based on a stacking of auto-associators, which builds the hidden layers one after the other. In this paper, we have used the deep network model proposed by Hinton et al. [12] in 2006, which is based on stacked restricted Boltzmann machines (RBMs), to construct the so-called Deep Belief Networks. The topology and the learning method used in this model are presented below.

Figure 9. RBM: V is the visible layer, H is the hidden layer and W is the connection weight matrix.

3.4.1. Restricted Boltzmann machine (RBM) and contrastive divergence learning (CD)

An RBM is an undirected graph with two layers. The first layer is considered visible while the second is hidden. Each node represents a neuron; a neuron is active when its state is equal to 1 and inactive when it is 0. The visible layer is connected to the hidden layer by weighted edges (Figure 9). The visible layer represents the observed data and the hidden layer represents the unknown elements associated with the data; its size (fixed arbitrarily) allows the RBM to model more or less complex distributions. Let us suppose that a bias unit, always active, is present in both the visible layer and the hidden layer. W is the weight matrix, where W_ij represents the weight of the link between the units v_j and h_i. The RBM energy is given by:

$$\mathrm{Energy}(v,h) = -h^T W v, \qquad (3)$$

Each unit corresponds to a hypothesis to which we assign a binary value (1 or 0).
The connections represent constraints for these hypotheses: if W_ij is negative, i and j should not be activated at the same time in order to reduce the RBM energy; if the weight W_ij is positive, then the simultaneous activation of both units reduces the energy. To summarize, the energy is a function of the configuration, which depends on the constraints related to the weights. This energy function associates with an RBM of weights W a probability over the space of (v, h) configurations:

$$P_W(v,h) = \frac{e^{-\mathrm{Energy}(v,h)}}{Z}, \qquad (4)$$

where Z is the partition function, the sum over all possible joint configurations:

$$Z = \sum_{v,h} e^{-\mathrm{Energy}(v,h)}. \qquad (5)$$

The activation probabilities of the visible (or hidden) units are independent of each other. Let sgm(t) be the sigmoid function:

$$\mathrm{sgm}(t) = \frac{1}{1 + e^{-t}}, \qquad (6)$$

The conditional probabilities for an RBM are then given by:

$$\forall i \le r, \quad P_W(h_i \mid v) = \mathrm{sgm}\Big(\sum_j w_{ij} v_j\Big), \qquad (7)$$

$$\forall j \le q, \quad P_W(v_j \mid h) = \mathrm{sgm}\Big(\sum_i w_{ij} h_i\Big). \qquad (8)$$

An RBM models the probability of an input v through the joint distribution P_W(v, h). We use the Gibbs sampling technique to make random draws from the model and generate configurations (v, h), which serve as representative examples of inputs v. This technique uses the Markov Chain Monte Carlo method, which consists in repeatedly drawing at random from P_W(h | v) and P_W(v | h), in such a way that the Markov chain is guaranteed to converge to the P_W distribution. In summary, this distribution is modelled by the RBM independently of the inputs, and we can draw random samples from it with the Gibbs sampling method.

For the training part, we define P_TR as the probability distribution corresponding to the random draw of a sample v from a training database TR, followed by the random draw of a hidden representation h associated with v. P_TR is the target distribution of the training; it is fixed from the sample basis. Having defined P_W and P_TR, the training step minimizes the Kullback-Leibler divergence between these two distributions, in such a way that the probability of drawing a sample v from the examples approaches the probability of generating it from the trained RBM. It turns out that the Kullback-Leibler divergence minimization has a very high computational cost, so we prefer to minimize an approximated criterion: the Contrastive Divergence (CD). This technique was developed by Geoffrey Hinton in 2001; the pioneering paper, written in 2002 [24], demonstrated an improvement of the handwritten digit recognition results (MNIST dataset) obtained by using the CD algorithm to train an RBM.

3.4.2. Deep Belief Network (DBN)

Restricted Boltzmann machines stacked in a generative way represent a particular type of deep neural network, which is the reason why we detailed the RBM structure above. The topology of such a network is presented in Figure 10.

Figure 10. A deep network topology: stacked RBMs (v = input, hidden layers h1 and h2, RBMs R1 and R2).

The training of the stacked RBMs is made layer by layer, in a greedy way: a first RBM is trained on the dataset of samples to minimize the Contrastive Divergence (CD); then, each of the following RBMs is trained on the hidden representations of the previous RBM. The training process includes two phases:

(1.) Unsupervised pre-training: the DBN distinguishes itself from the other neural networks by its layer-by-layer way of learning; the idea is to train every layer as an RBM.
The training begins from the lowest hidden layers (close to the visible input vector) and progresses towards the output vector, so a DBN learns to probabilistically reconstruct its inputs. The layers then act as feature detectors. The DBN training algorithm for each RBM layer is the following:

Step 1: build a multilayer network from the restricted Boltzmann machines (RBMs), then train the layers with the greedy layer-wise algorithm. Naturally, the input vector is the visible layer v.
Step 2: encode the input X with the RBM to produce a new sample v′, using the Gibbs sampling method.
Step 3: repeat the second step with v ← v′ for each pair of successive layers, until the two top layers of the network are reached.

(2.) Supervised training: a supervised algorithm, such as the back-propagation used in the MLP, is applied to search for the optimal parameters of the DBN network. Nevertheless, we use the weights W and the hidden biases b obtained during pre-training to initialize this MLP.

In [13], several experiments were performed on different networks trained for simple classification tasks (handwritten digits, geometrical forms). The conclusion of this reference paper is that training deep architectures was unsuccessful until the recent advent of algorithms based on an unsupervised pre-training. The concept of deep learning does not mean that the number of layers is high; it is rather the way of learning, a different (layer-by-layer) method, that has been applied in our work with the DBN network. We deduce from these works that, if the network contains a sufficient number of neurons, the classification performances are better with an unsupervised pre-training; even on classical (not deep) networks, the performances are improved. In our current work, we suggest the use of all the neural architectures described previously for the on-line unconstrained Arabic handwriting recognition (letters and words). We also observe how the DBN manages to discriminate between the data without having any prior information and without performing the feature extraction. In this classifier, the complexity resides in the choice of the hyper-parameters of the CD algorithm; this heuristic choice requires many experiments to find the optimal architecture giving the best classification rates.

4. Description of the Database

The objective of the architectures described previously is the classification and the recognition of Arabic characters/words coming from our database NOUN-DATABASE, built from an on-line acquisition by means of an acquisition tablet. This choice is motivated by the fact that most of the existing databases are off-line, i.e., they contain images. In this work, we wanted to handle the on-line writing signal rather than the images. The Arabic alphabet consists of 28 letters whose shapes vary depending on their position in the word. Our database contains 2800 Arabic characters written by 20 different writers; each writer inserts the alphabet 5 times (once for every position in the word: in its isolated form, at the beginning, middle and end, and one more time in a form of the writer's choice). The constructed database also includes 4800 words, inputted by the same 20 writers. The words represent the 48 Algerian provinces (Wilayas). Each writer performs five insertions of the 48 Wilayas.
The handwriting can be influenced by several parameters, such as the age, the sex and the state of the writer (each writer has his or her own writing style). To preserve the variability of the writing signals, we took care to acquire the data (the sample database) from 20 people of different sexes and age categories. The database is summarized in Table 1.

Table 1. Description of the database.
Database         Number of classes   Writers   Samples (training/test)
Letter classes   28                  20        1400/1400
Word classes     48                  20        2400/2400

5. Experiments and Results

In this section, we shall only consider the final results of our system for the Arabic handwritten character recognition, i.e., with all the parameter adjustments made in the different neural approaches. We also divided the database into two parts: 50 % of the samples formed the training set (1400 characters and 2400 words) and the remaining 50 % were used for the test (1400 characters and 2400 words).

5.1. Neural Approach (classical methods)

Experiment 1: In the first experiment, we have used the classic architectures described previously for the on-line Arabic handwritten character recognition, regardless of the character position in the word, i.e., in its isolated form, at the beginning, middle and end. The best configurations of our system gave the results illustrated in Table 2.

Table 2. Arabic character recognition rates and classification times according to the neural network.
NN     Correct (%)   Error (%)   Classification time (s)
MLP    97.73         2.27        48.21
TDNN   49.46         50.54       2358.00
RBF    86.09         13.91       23.09
RBFe   75.39         24.61       8.68
GRNN   86.36         13.64       8.14
PNN    84.92         15.08       8.73

Discussion: We notice that the used classic architectures give more or less satisfactory results; the best classification results for the Arabic characters were obtained using the multi-layer perceptron (MLP), with a 97.73 % success rate.

Experiment 2: In the second experiment, we use the same architectures for the online Arabic word recognition, the words being the 48 Wilayas of Algeria. The best configurations of our system gave the results illustrated in Table 3.

Table 3. Arabic word recognition rates and classification times according to the neural network.
NN     Correct (%)   Error (%)   Classification time (s)
MLP    85.31         14.69       27.01
TDNN   46.76         53.24       32017.00
RBF    81.85         18.15       37.57
RBFe   62.58         37.42       16.38
GRNN   78.44         21.56       16.74
PNN    78.44         21.56       17.77

Discussion: We notice that the word recognition results have fallen considerably in comparison with Table 2; this is due to the increase in the number of classes (which rises from 28 to 48). It can also be explained by the complexity of the data samples: a word can consist of several pseudo-words, each containing one or several characters, and the variability of the writing signal is richer in the case of word writing. In addition, the reduction of the recognition rate can be explained by the main limitation of the classic architectures: if the architecture is too deep, the optimization of the parameters very often leads to a non-optimal local minimum. Increasing the number of layers did not improve the results (sometimes, it even degraded them). For this reason, we highlighted previously that the reported results are those of the ideal architecture (with the best parameters for each neural network, including the number of layers). As explained in [13, 14], the error surface is not convex and becomes more irregular as the network gets deeper.
Despite this, the MLP still gives the best result, 85.31 %, with an acceptable classification time of 27.01 seconds. Another observation is that the TDNN takes a long time for the classification because of its complex architecture, which adds an additional dimension to the network, i.e., a temporal dimension in this particular case.

5.2. Deep Neural Approach

Experiment 3: To enhance these results, we use a deep neural architecture of the Deep Belief Network (DBN) type, which represents a stacking of restricted Boltzmann machines. It should be noted that the recognition system architecture is slightly modified, because we have eliminated the “Feature extraction” block (see Figure 2). This is due to the abstraction ability of the deep learning network: it automatically learns the different representations of the data, abstraction level by abstraction level, from the raw data.

Phase 1: Pre-training (unsupervised training, without back-propagation): in this phase, the algorithm uses a concatenation of the RBM layers to learn the distribution of the input data X (the handwriting signal) without taking the labels Y into consideration. Each RBM layer (the first two layers in this case) produces a more abstract representation than the previous one. Hinton et al. [12] proved that this greedy unsupervised learning method, also called contrastive divergence (CD), is effective. This phase, sometimes called pre-training, can be considered as an efficient initialisation of the DBN.

Phase 2: Fine-tuning (supervised phase, with back-propagation): the role of the second part of the DBN is to convert the abstract representation of X into the labels Y used in the supervised learning (the last RBM layer has 2000 units). The back-propagation algorithm is used to readjust the network parameters and, finally, the globally optimal network can be obtained. The output layer has as many neurons as there are classes (48 in this case). The training phase is considered accomplished once the error between the output and the desired one is stabilized. After that, we proceed to test the other samples of the database (see Table 1). This phase, which is called fine-tuning, is more time-consuming than the previous one.

The parameters of the used DBN network for each layer (pre-training and fine-tuning phases) are given in Table 4.

Table 4. DBN parameters.
Parameter                   Value
Number of hidden layers     2
Units of 1st hidden layer   100
Units of 2nd hidden layer   100
Units of 3rd hidden layer   2000
Batch size                  100
Number of RBM epochs        50

The above-described DBN network architecture gave the results shown in Table 5.

Table 5. Arabic word recognition rates and classification time for the DBN network.
NN    Error before BP (%)   Error after BP (%)   Correct (%)   Classification time (s)
DBN   33.75                 2.92                 97.08         252.21

Discussion: We can see that the results of the word classification are significantly improved with the use of the DBN. Indeed, in the previous tables, the best classification rate was 85.31 % with the MLP; it reaches 97.08 % with the DBN built from stacked RBMs, giving a gain of 11.77 %.
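For orientation only, the following sketch shows how a greedy RBM stack with the layer sizes of Table 4 could be pre-trained and then fine-tuned with a back-propagation classifier, using scikit-learn's BernoulliRBM and MLPClassifier rather than the authors' implementation. The input data, their scaling to [0, 1], the learning rate and the use of a separate classifier on top of the deepest representation (instead of initialising it with the pre-trained weights) are assumptions of this example.

import numpy as np
from sklearn.neural_network import BernoulliRBM, MLPClassifier

# Layer sizes, batch size and number of epochs follow Table 4; the data are placeholders.
rbm_stack = [BernoulliRBM(n_components=n, learning_rate=0.05, batch_size=100, n_iter=50)
             for n in (100, 100, 2000)]

X = np.random.rand(500, 210)        # hypothetical word signals scaled to [0, 1]
y = np.random.randint(0, 48, 500)   # 48 word classes (the Algerian Wilayas)

# Greedy, layer-by-layer unsupervised pre-training: each RBM is trained (by contrastive
# divergence) on the hidden representation produced by the previous one.
representation = X
for rbm in rbm_stack:
    rbm.fit(representation)
    representation = rbm.transform(representation)

# Supervised fine-tuning by back-propagation on top of the deepest representation.
# A full DBN would instead copy the pre-trained weights into this network before training.
classifier = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500)
classifier.fit(representation, y)
print(classifier.predict(representation[:1]))   # predicted word class of the first sample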
We also note that the second step, the supervised part (with back-propagation), called fine-tuning, is very important in terms of the error rate reduction (2.92 % error rate) and turns out to be much more efficient than the pre-training step alone (33.75 % error rate), because the back-propagation does not start from a random initial solution. Figure 11 summarizes the classification rates obtained for each of the neural networks used.

Figure 11. Comparison between the deep neural approach (DBN) and the classic neural approaches for the Arabic word recognition: classification rate (%) per classifier (MLP, TDNN, RBF, RBFe, GRNN, PNN, DBN).

6. Conclusion

In this paper, we have proposed a neural approach aiming to develop a solution, based on neural networks, for the recognition of on-line Arabic manuscripts acquired dynamically. With the aim of performing the on-line recognition, we built our own database, containing 2800 characters and 4800 words. We divided our work into two parts. The first presents a classic neural approach using the following neural networks: MLP, TDNN, RBFNE, RBF, GRNN and PNN. Our experiments on the character recognition gave interesting results, such as a 97.73 % success rate for the MLP and 86.36 % for the GRNN. Nevertheless, the application of the same architectures to the word recognition significantly deteriorates the results, which did not exceed 85.31 % for the MLP and 81.85 % for the RBF. This can be explained by the complexity of the input signal and by the local minima problem. The second approach uses deep learning, involving the DBN, which represents a stacking of restricted Boltzmann machines and is characterized by a layer-by-layer learning method. This network significantly improves the performance on the word database and gives a classification rate of 97.08 %. In fact, the DBN has shown its ability to achieve an excellent performance for classification tasks and dimensionality reduction. We conclude that the deep learning network is superior to the classic neural networks. We also note that the various parameters used in all the previous neural architectures have allowed us to obtain the best classification rates.

References

[1] N. Ben Amara, A. Belaïd. Modélisation pseudo bidimensionnelle pour la reconnaissance de chaînes de caractères arabes imprimés. In Colloque International Francophone sur l'Écrit et le Document (CIFEd'98), pp. 131–141, Québec, Canada, 1998.
[2] L. R. Rabiner, B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, pp. 4–16, 1986.
[3] C.-L. Liu, S. Jaeger, M. Nakagawa. Online recognition of Chinese characters: the state-of-the-art. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(2):198–213, 2004. doi:10.1109/TPAMI.2004.1262182.
[4] Y. Kessentini, T. Paquet, A. Ben Hamadou. A multi-lingual recognition system for Arabic and Latin handwriting. In Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR), pp. 1196–1200, July 26–29, 2009. doi:10.1109/ICDAR.2009.55.
[5] V. Märgner, H. El Abed. Databases and competitions: strategies to improve Arabic recognition systems. In Proceedings of the 2006 Conference on Arabic and Chinese Handwriting Recognition, pp. 82–103, College Park, MD, USA, September 27–28, 2006.
[6] M. Elleuch, N. Tagougui, M. Kherallah. Optimization of DBN using regularization methods applied for recognizing Arabic handwritten script. In International Conference on Computational Science (ICCS), 12–14 June 2017. doi:10.1016/j.procs.2017.05.070.
[7] N. Mezghani, A. Mitiche, M. Cheriet. Bayes classification of online Arabic characters by Gibbs modeling of class conditional densities. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(7):1121–1131, 2008. doi:10.1109/TPAMI.2007.70753.
[8] N. Mezghani, M. Cheriet, A. Mitiche. Combination of pruned Kohonen maps for on-line Arabic characters recognition. In Proceedings of the Seventh International Conference on Document Analysis and Recognition, p. 900, August 3–6, 2003. doi:10.1109/ICDAR.2003.1227790.
[9] F. Biadsy, J. El-Sana, N. Habash. Online Arabic handwriting recognition using hidden Markov models, 2006.
[10] S. Benbakreti, A. Boukelif. Features extraction and on-line recognition of isolated Arabic characters, pp. 481–500, 2018. doi:10.1007/978-3-319-67056-0_23.
[11] Y. Bengio, P. Lamblin, V. Popovici, H. Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19, eds. B. Schölkopf, J. Platt, T. Hoffman, MIT Press, pp. 153–160, 2007.
[12] G. Hinton, S. Osindero, Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation 18(7):1527–1554, 2006.
[13] D. Erhan, P.-A. Manzagol, Y. Bengio, et al. The difficulty of training deep architectures and the effect of unsupervised pre-training. In International Conference on Artificial Intelligence and Statistics, pp. 153–160, 2009.
[14] I. Guyon, P. Albrecht, Y. Le Cun, et al. Design of a neural network character recognizer for a touch terminal. Pattern Recognition 24(2):105–119, 1991. doi:10.1016/0031-3203(91)90081-F.
[15] P. Burrascano. Learning vector quantization for the probabilistic neural network. IEEE Transactions on Neural Networks 2(4):458–461, 1991. doi:10.1109/72.88165.
[16] D. K. Kim, D. H. Kim, S. E. K. Chang, S. A. K. Chang. Modified probabilistic neural network considering heterogeneous probabilistic density functions in the design of breakwater. KSCE Journal of Civil Engineering 11(2):65–71, 2007. doi:10.1007/BF02823849.
[17] M. J. D. Powell. Radial basis functions for multivariable interpolation: a review. In Algorithms for Approximation, pp. 143–167, Clarendon Press, Oxford, 1987.
[18] J. Park, I. W. Sandberg. Universal approximation using radial basis function networks. Neural Computation 3(2):246–257, 1991. doi:10.1162/neco.1991.3.2.246.
[19] D. F. Specht. A generalized regression neural network. IEEE Transactions on Neural Networks 2(6):568–576, 1991. doi:10.1109/72.97934.
[20] D. F. Specht. Probabilistic neural networks. Neural Networks 3(1):109–118, 1990. doi:10.1016/0893-6080(90)90049-Q.
[21] M. Berthold, J. Diamond. Constructive training of probabilistic neural networks. Neurocomputing 19(1):167–183, 1998. doi:10.1016/S0925-2312(97)00063-5.
[22] W. Tomasz, J. Jacek, M. Jacek. Probabilistic neural network for direction estimation. In Proceedings of the Third Conference on Neural Networks and their Applications and Summer School on Neural Networks Applications to Signal Processing, Kule, 14–18 October 1997, pp. 173–178, eds. R. Tadeusiewicz, L. Rutkowski, J. Chojcan, Częstochowa: Pol. Neural Netw. Soc., 1997.
[23] Y. Bengio, Y. LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines, MIT Press, 2007.
[24] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation 14(8):1771–1800, 2002. doi:10.1162/089976602760128018.