Acta Polytechnica 58(5):297–307, 2018, doi:10.14311/AP.2018.58.0297, © Czech Technical University in Prague, 2018, available online at http://ojs.cvut.cz/ojs/index.php/ap

NEW APPROACH FOR ONLINE ARABIC MANUSCRIPT RECOGNITION BY DEEP BELIEF NETWORK

Benbakreti Samir∗, Aoued Boukelif
University of Djillali Liabes, Department of Electronic, Sidi Bel Abbes 22000, Algeria
∗ corresponding author: benbakretisamir@gmail.com

Abstract. In this paper, we present a neural approach for unconstrained Arabic manuscript recognition that uses the online writing signal rather than images. First, we build a database containing 2800 characters and 4800 words collected from 20 different handwritings. Thereafter, we perform the preprocessing, feature extraction and classification phases, respectively. The use of classical neural network methods has been beneficial for the character recognition, but revealed some limitations for the recognition rate of Arabic words. To remedy this, we used deep learning through the Deep Belief Network (DBN), which resulted in a 97.08 % recognition rate for Arabic words.

Keywords: Arabic manuscript; online recognition; neural networks; MLP; TDNN; RBF; deep learning; DBN.

1. Introduction

Nowadays, the cursive writing recognition problem represents a major challenge for both handwritten and typed forms. The complexity of this problem comes from the style, the variability of the patterns and the tilt in the case of unconstrained handwriting; it is even worse for languages with more complicated morphological characteristics. Arabic handwriting, which is cursive by nature, presents a great variability and is consequently very difficult to recognize. The recent electronic revolution has allowed the emergence of new typing devices capable of creating high-quality online documents, such as exam papers, notes, filled forms, online reporting, etc. This progress extends the application field of online handwriting recognition, which was previously restricted to small terminals (PDA, smartphone) that support character recognition only. Online documents represent a new source of information that still has few applications in the field of pattern recognition. They are represented by signals that may be assimilated to sampled trajectories captured from the handwriting instrument as a temporal sequence of points (x(t), y(t)), represented in an orthonormal coordinate system.

The Arabic script is semi-cursive in both forms, printed and handwritten. A word may be composed of one or several pseudo-words. These pseudo-words are ligated horizontally or vertically, which makes the segmentation task very complicated. The shape of a character differs according to its position in the word. Some characters include diacritical points, which lie above or below the body; others share the same body or shape, and only the diacritical points allow two characters to be differentiated. Also, in handwritten Arabic text, variations in the horizontal bands appear according to the calligraphy of the characters contained in the pseudo-word [1].

Figure 1. Online Arabic handwriting: (a) the “Bah” character in the middle of a word; (b) the “Tah” character at the end of a word; (c) the word “wahran” composed of 4 pseudo-words.

In the case of the manuscript, this complexity is even more important, because other problems may arise, such
as the intra- and inter-writer variability, the overlap of pseudo-words, the fusion of diacritical points, the writing conditions or the writer's state.

In this paper, the use of the on-line aspect decreases some processing difficulties, because the diacritical points are considered as a signal that complements the character body; hence the fusion of diacritical points is not a real problem. However, the segmentation into pseudo-words or characters is not performed, because the word is taken in its entirety and the overlaps are therefore ignored. Nevertheless, these features increase the intra- and inter-writer variability.

Among the approaches presented in the literature for handwriting recognition, neural networks demonstrate a great discriminating power and a very good capacity to construct class separators in multidimensional spaces. That being said, models based on hidden Markov models (HMM) [2] use a parametric approach to model sequences of observations. These sequences are generated by stochastic processes (e.g., the handwriting of a manuscript), which are more visible in the word recognition process. The HMM has a strong expressive power to model the distribution of observations for any class. For the character case, neural networks are much better adapted, because it is the global shape that dominates. These networks have the advantage of being compatible with the two-dimensional pictorial nature of the writing.

1.1. Related work

Handwriting recognition is a complicated task due to the variability of handwriting styles. Indeed, the shape of a handwritten character differs not only from one writer to another, but also for a single writer, according to its position in the word, the neighbouring letters, the writer's health (motor disabilities, Parkinson's disease, ...) and the psychological state (depression, sadness, ...). In addition, it is also possible to have the same trace for two different meanings depending on the context, which represents another source of ambiguity. The aforementioned properties make handwriting recognition a challenging and complicated pattern recognition problem.

Online handwriting recognition is not a recent subject; it dates back around 30 years. However, the emergence of more sophisticated devices, like smartphones and tablets, has stimulated more research in this field. The online recognition problem is characterized by three fundamental properties: the temporal chaining of the line, the dynamics of the trace (speed, acceleration, pressure on the pen) and the trace skeleton (the thickness of the line is ignored). In [3], Liu et al. presented the state of the art of online Chinese handwriting recognition. In [4], Kessentini et al. developed a unified approach for Arabic and Latin alphabetic script recognition.

However, few works have been interested in online Arabic writing recognition. In fact, its cursive nature triggers numerous problems in an automatic recognition system. These problems are related to the diversity and the complexity of the handwriting signal. The unavailability of online character databases also represents another problem for this field of research. Consequently, most of the works focused on the offline aspect, such as [5], where Märgner et al. present a development strategy for offline Arabic manuscript recognition systems.
More recent work is presented in [6], which proposes a deep architecture for Arabic manuscript recognition. Nevertheless, several works treat the online case, like Mezghani et al., who used a Bayesian classification [7] and a Kohonen network [8] for online Arabic character recognition. In [9], Biadsy used hidden Markov models for Arabic manuscript recognition. More recently, S. Benbakreti and A. Boukelif proposed a TDNN (time delay neural network) based recognition system for online isolated Arabic characters [10].

In this work, we present the design of an Arabic handwriting recognition system and propose a classification approach based on neural networks. Figure 2 illustrates the architecture of the proposed system.

Figure 2. Architecture of the recognition system: acquisition and storage interface → pretreatment → feature extraction (optional step according to the NN choice) → classifier (NN) → identified character or word.

2. Processing and features extraction

The preprocessing stage of the recognition system described above performs the character size normalization and samples the trace into a fixed number of equidistant points. The next stage extracts features from the previously sampled trace, which prepares the input for the neural network classifier. After the training phase, the class of the character is supplied at the output layer. This procedure is followed by almost all the neural networks. However, the deep neural networks do not require any feature extraction phase; the task is done implicitly in the network. The task of character recognition is difficult because of the large variability inherent to the handwriting style and the variation of the character position in the word (see Figure 3).

Figure 3. Different forms of the Arabic character according to its position in the word.

For most of the Arabic letters, similarities are noticed between the forms at the beginning/middle on the one hand and at the end/isolated on the other. The presence of a ligature with a previous/next letter does not significantly modify the form of the letter (no more than in the case of Latin cursive handwriting). Arabic ligatures occur where two letters are written one upon the other. Furthermore, the Arabic writing is rich in diacritic signs (or secondary signs) such as vowels, points (dots), chaddah, maddah, hamzah, etc. In our work, we define a diacritical mark as a secondary component of a letter, which may complete it or even modify the whole sense of the word. The vowel signs, such as el damah, el kasrah, el fathah and el soukoun, are not considered.

These specificities of the Arabic language make the recognition task more complicated, hence the necessity of a preprocessing including the following steps:

• A spatial sampling that allows preserving the useful information of the signal and excluding the redundant information resulting from the repetition of points. Indeed, the duration of the character formation as well as its shape can vary significantly from one writer to another and from one occurrence to another. The spatial sampling transforms a handwriting signal into a sequence of equally spaced points with coordinates [X(n), Y(n)] (n is the point index) according to the length of the trace, for a fixed number of points, as follows:
– Definition of the line between pen-up and pen-down, with N the number of sampling points, and reservation of the memory space for the vectors X, Y and PenUp/Down.
– Calculation of the total length of the signal: for a single stroke, Length = length + distance between points; for several strokes, Length = length + length between strokes.
– Calculation of the sampling distance Dist-samp = Length/(N − 1) and determination of the N − 1 points by interpolation.

• Character normalization and centring: the character is recentred on (x0, y0) and then normalized to the maximal size of the character, in order to obtain a representation that is invariant with respect to translations and spatial distortions.

A sample result of these two preprocessing steps is illustrated in Figure 4.

Figure 4. Result of preprocessing: (a) character Noun before preprocessing; (b) character Noun after preprocessing.

Once the preprocessing step is finished, in order to facilitate the classifier task, geometric information, such as the direction of the movement and the curvature of the trajectory, is extracted from this sequence of points. We then obtain a sequence of vectors with 7 features each:

• The x coordinates
• The y coordinates
• The cosine directions of the movement direction (cos θ and sin θ)
• The cosine directions of the trajectory curvature (cos Φ and sin Φ)
• The pen up/down state.

This step is summarized in Algorithm 1. Once the feature extraction is realized, we can proceed to the classification.

Algorithm 1 (normalization and feature computation).
  xmax ← x(0); xmin ← x(0); ymax ← y(0); ymin ← y(0)
  {Calculation of the centre of the character [x0, y0]}
  for n = 0 to N − 1 do
    if x(n) > xmax then xmax ← x(n) end if
    if x(n) < xmin then xmin ← x(n) end if
    if y(n) > ymax then ymax ← y(n) end if
    if y(n) < ymin then ymin ← y(n) end if
  end for
  x0 ← (xmin + xmax)/2; y0 ← (ymax + ymin)/2
  {Calculation of the scaling factor Delta}
  Deltax ← xmax − xmin; Deltay ← ymax − ymin
  if Deltax < Deltay then Delta ← Deltay else Delta ← Deltax end if
  {New coordinates x[n][1] (x coordinate), x[n][2] (y coordinate) and pen state x[n][7], for every point n}
  x[n][1] ← (X[n] − x0)/Delta
  x[n][2] ← (Y[n] − y0)/Delta
  x[n][7] ← penUpDown[n]
  {Direction cosines x[i][3] and x[i][4], for every point i; dS is the length of the chord}
  dx ← X[i+1] − X[i−1]; dy ← Y[i+1] − Y[i−1]
  dS ← sqrt(dx·dx + dy·dy)
  if dS = 0 then x[i][3] ← 0; x[i][4] ← 0
  else x[i][3] ← dx/dS; x[i][4] ← dy/dS end if
  {Special points: the initial point takes the values of the next point, the last point those of the previous point}
  {Curvature x[i][5] and x[i][6], computed from the direction cosines, for every point i}
  x[i][5] ← x[i+1][3]·x[i−1][3] + x[i+1][4]·x[i−1][4]
  x[i][6] ← x[i+1][3]·x[i−1][4] + x[i+1][4]·x[i−1][3]
  {Special points: the initial point takes the values of the next point, the last point those of the previous point}

Figure 5. MLP type connection — complete field of view.

3. Neural Network Classification

We have used a set of neural classifiers that gave satisfactory results; the proposed neural architectures are described below.

3.1. Multi-Layer Perceptron (MLP)

We have used an MLP with error back-propagation and one hidden layer only [11, 12]. The seven features extracted from the Arabic characters (characters in their isolated form, at the beginning, middle and end), generated by the previous module, constitute the inputs of the network. The hidden layer includes 80 neurons. The classes to be discriminated are the 28 characters of the Arabic alphabet (or the 48 words), hence the choice of 28 (48) neurons for the output layer.
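As an illustration only (not the authors' implementation), the following minimal sketch reproduces this configuration with scikit-learn's MLPClassifier: one hidden layer of 80 neurons, a logistic (sigmoid) activation and back-propagation training. The number of resampled points per character and the training data are assumptions introduced for the example.

import numpy as np
from sklearn.neural_network import MLPClassifier

N_POINTS = 30      # assumed number of resampled points per character
N_FEATURES = 7     # x, y, cos/sin of the direction, cos/sin of the curvature, pen up/down
N_CLASSES = 28     # 28 Arabic letters (48 for the word task)

# Each character is an (N_POINTS x N_FEATURES) sequence, flattened into one input vector.
X_train = np.random.rand(200, N_POINTS * N_FEATURES)      # placeholder feature vectors
y_train = np.random.randint(0, N_CLASSES, size=200)       # placeholder class labels

# One hidden layer of 80 neurons, sigmoid ("logistic") activation, trained by back-propagation.
mlp = MLPClassifier(hidden_layer_sizes=(80,), activation="logistic",
                    solver="sgd", learning_rate_init=0.01, max_iter=2000)
mlp.fit(X_train, y_train)
print(mlp.predict(X_train[:1]))    # predicted class index of the first sample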
The unipolar sigmoid was used as the neuron activation function. Figure 5 shows how the MLP processes the letter Mim. However, the training of the lowest layers (those farthest from the output layer) is less efficient in a deep MLP [13, 14]: the parameter updates become less and less relevant as the error is propagated back towards the lowest layers. Because of that, we use only one hidden layer for our MLP.

3.2. Time Delay Neural Network (TDNN)

The used TDNN consists of three layers: input, hidden and output. Furthermore, every layer possesses two directions: a feature direction and a temporal direction.

Figure 6. Convolution network connection — restricted field of view.

The TDNN is distinguished from a multi-layer perceptron by the fact that it takes a notion of time into account. Consequently, it processes all the neurons of the input layer at the same time (see Figure 6), before making a temporal scan. This is the notion of the temporal window. The TDNN has three basic principles: temporal window, shared weights and delays:

• The temporal window: the basic idea is that every neuron of the L + 1 layer is only connected to one subset of neurons of the L layer. After an experimental study, we have fixed the size of this window to seven neurons.
• The shared weights: this notion allows reducing the number of network parameters, which positively influences the network generalization capacity. A window with a given characteristic will have the same weights along the temporal direction. It also allows the differences to be extracted progressively when scanning the signal.
• The delays: in addition to the previous two constraints, we introduce delays between two successive windows of a given layer. The first delay is three neurons and the second one is four.

For the shared weights constraint, the same neuron is duplicated in the time direction (the same duplicated weight matrix) to detect the presence or the absence of the same feature at various points of the signal. By using several neurons in every temporal position, the network detects different features: the outputs of the different neurons of one layer produce a new feature vector for the next layer, and so on. Thus, the temporal component of the original signal is progressively eliminated and transformed into features by the upper layers. To compensate for this loss of information, we increase the number of neurons in the feature direction.

Figure 7. The RBF network architecture: input layer, centres, output layer.

3.3. Radial Basis Function network (RBF)

The RBF uses radial functions rather than sigmoids (e.g., in the MLP) to build local decision functions centred on subsets of the inputs [15]. The summation of all local functions represents the global decision function [16], which helps to solve the local minima problem. The used RBF network [17, 18] consists of 3 layers (Figure 7). Every hidden node applies a kernel function to the input data, then the output layer performs a weighted sum of these kernel functions. Every node is characterized by two parameters, the width and the centre of the radial function. If the input data of a node are close to the centre (in terms of the Euclidean distance), the output value will be high. In this work, we have used a Gaussian kernel function.
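As a purely illustrative sketch (not the authors' code), the response of one Gaussian hidden node and the weighted sum performed by the output layer can be written as follows; the centres, widths and weights are assumed to be already determined.

import numpy as np

def rbf_node_output(x, centre, width):
    # Gaussian kernel: inputs close to the centre (in Euclidean distance) give a value near 1.
    squared_distance = np.sum((np.asarray(x, float) - np.asarray(centre, float)) ** 2)
    return np.exp(-squared_distance / (2.0 * width ** 2))

def rbf_network_output(x, centres, widths, weights):
    # Output layer: weighted sum of the kernel responses of all hidden nodes
    # (one row of weights per output class in a classification setting).
    phi = np.array([rbf_node_output(x, c, w) for c, w in zip(centres, widths)])
    return np.asarray(weights) @ phi

# Tiny usage example with two hidden nodes and three output classes (made-up numbers).
scores = rbf_network_output([0.2, 0.4], centres=[[0.1, 0.5], [0.8, 0.3]],
                            widths=[0.3, 0.3], weights=np.random.rand(3, 2))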
The complete configuration of the network is achieved after determining the centre and the width associated with each node, as well as the weights of the connections between the hidden layer and the output layer, which contains 28 neurons for the Arabic characters or 48 neurons for the words representing the Algerian provinces. In this work, we have experimented with several variants of RBF networks; a brief description of each of them is given in the following paragraphs.

3.3.1. Radial basis function networks exact (RBFNE)

The design of an RBFNE can be done using a function that takes as input the matrix of input vectors P, the target vectors T and the spread factor of the radial basis layer; it then returns a network with weight and bias values such that the output is exactly T when the input is P. This function creates as many radial basis neurons as there are input vectors in P. So, we have a radial basis neuron layer in which every neuron acts as a detector for a different input vector. The only parameter of the RBFNE is therefore the spread of the radial basis functions of the first layer. In this work, the value of the spread factor is 0.3.

3.3.2. Radial basis function (RBF)

The second type creates radial basis neurons iteratively, one at a time. In every iteration, the input vector that reduces the network error the most is used to create a new radial basis neuron. The error of the new network is then checked: if it is sufficiently small, the network creation is ended, otherwise a further neuron is added. This procedure is repeated until the targeted mean squared error is reached (10^−6), or the maximum number of neurons is reached (300). The number of neurons added between every evaluation is fixed, after several experiments, to 25.

3.3.3. Generalized regression neural networks (GRNN)

This variant is another type of NN, proposed by Specht [19]. This regression-based network estimates the expected mean value of the output variable using a Bayesian technique. Any continuous function can be approximated by a linear combination of Gaussian functions. The objective is to perform a regression, i.e., to build a good approximation of a function that is only known through a limited number of experimental points. This network can also be used to solve the classification problem. For each input, the first layer calculates the distances between the input vector and the weight vectors and then produces a vector, which is multiplied by the bias. The GRNN diagram is described in Figure 8.

Figure 8. The GRNN network architecture: input layer, radial basis layer, special linear layer, output layer.

The other GRNN-specific layer is called the special linear layer and consists of two subparts: the numerator part performs a summation of the products of the training output data and the activation functions, while the denominator part is the summation of all activation functions. This layer feeds both the numerator and the denominator to the output layer.

3.3.4. Probabilistic neural networks (PNN)

The probabilistic neural network, introduced by Donald Specht in 1988, is a feedforward network with three layers used for the classification of data [20]. Contrary to the other neural networks, which are based on the back-propagation method, the PNN is based on the Bayes decision strategy and the probability density estimation.
The PNN uses spherical Gaussian radial basis functions centred on each training vector. The membership probability of a vector in a given class is expressed as follows [21, 22]:

$$f_i(x) = \frac{1}{(2\pi)^{p/2}\,\sigma^p\, M_i} \sum_{j=1}^{M_i} e^{-\frac{(x-x_{ij})^T (x-x_{ij})}{2\sigma^2}}, \qquad (1)$$

where i indexes the classes (28 or 48 in our case), j indexes the forms to be recognized, M_i is the number of training vectors of the i-th class, x is a test vector, x_ij is the j-th training vector of the i-th class, p is the dimension of the vector x, σ is the standard deviation, and f_i(x) is the summation of the spherical multivariate Gaussian functions centred on each training vector x_ij, used for the evaluation of the probability density function of the i-th class. The classification decisions are taken according to the Bayes decision rule [22]:

$$d(x) = C_i \quad \text{if } f_i(x) > f_k(x) \text{ for } k \neq i, \qquad (2)$$

where C_i is the i-th class. When an input is presented:

• The first layer calculates the distances between the input vector and all the training vectors, and produces a vector whose elements indicate how close this input vector is to every training vector.
• The second layer adds these contributions for every input class to produce a probability vector at the network output. Finally, a transfer function at the output of the second layer selects the maximum of these probabilities, and assigns “1” to the corresponding class and “0” to the others.

3.4. Deep Learning

The deep learning models are built in the same way as the previously described multilayer perceptron, except that the middle layers are more numerous. The main difference, however, is the learning method, which distinguishes them from a classic MLP and which has given them renewed interest since 2006. In fact, the limits and the drawbacks [23] of the classic architectures mentioned so far in this paper contributed to this interest. For these reasons, Bengio et al. [11] proposed a greedy learning algorithm based on a stacking of auto-associators, which builds the hidden layers one after the other. In this paper, we have used the deep network model proposed by Hinton et al. [12] in 2006, which is based on stacked restricted Boltzmann machines (RBMs), to construct the so-called Deep Belief Networks. The topology and the learning method used in this model are presented below.

Figure 9. RBM: V is the visible layer, H is the hidden layer and W is the connection weight matrix.

3.4.1. Restricted Boltzmann machine (RBM) and contrastive divergence learning (CD)

An RBM is an undirected graph with two layers. The first layer is considered visible while the second is hidden. Each node represents a neuron; a neuron is active when its state is equal to 1 and inactive when it is 0. The visible layer is connected to the hidden layer by weighted edges (Figure 9). The visible layer represents the observed data and the hidden layer represents the unknown elements associated with the data; its size (fixed arbitrarily) allows the RBM to model more or less complex distributions. Let us suppose that a bias unit, always active, is present in both the visible layer and the hidden layer. W is the weight matrix, where W_ij represents the weight of the link between the units v_j and h_i. The RBM energy is given by:

$$\mathrm{Energy}(v,h) = -h^T W v, \qquad (3)$$

Each unit corresponds to a hypothesis to which we assign a binary value (1 or 0).
The connections represent constraints for these hypotheses: if W_ij is negative, i and j should not be activated at the same time in order to reduce the RBM energy; if the weight W_ij is positive, then the simultaneous activation of both units reduces the energy. To summarize, the energy is a function of the configuration, which depends on the constraints related to the weights. This energy function associates with an RBM of weights W a probability over the space of (v, h) configurations:

$$P_W(v,h) = \frac{e^{-\mathrm{Energy}(v,h)}}{Z}, \qquad (4)$$

where Z is the partition function, the sum over all possible joint configurations:

$$Z = \sum_{v,h} e^{-\mathrm{Energy}(v,h)}. \qquad (5)$$

The activation probabilities of the visible (or hidden) units are independent of each other. Let sgm(t) be the sigmoid function:

$$\mathrm{sgm}(t) = \frac{1}{1 + e^{-t}}, \qquad (6)$$

The conditional probabilities for an RBM are then given by:

$$\forall i \le r, \quad P_W(h_i \mid v) = \mathrm{sgm}\Big(\sum_j w_{ij} v_j\Big), \qquad (7)$$

$$\forall j \le q, \quad P_W(v_j \mid h) = \mathrm{sgm}\Big(\sum_i w_{ij} h_i\Big). \qquad (8)$$

An RBM models the probability of an input v through the joint distribution P_W(v, h). We use the Gibbs sampling technique to make random draws from the model and generate configurations (v, h), which serve as representative examples of inputs v. This technique uses the Markov Chain Monte Carlo method, which consists in repeatedly drawing at random from P_W(h | v) and P_W(v | h), in such a way that the Markov chain is guaranteed to converge to the P_W distribution. In summary, this distribution is modelled by the RBM independently of the inputs, and we can draw random samples from it with the Gibbs sampling method.

For the training part, we define P_TR as the probability distribution corresponding to the random draw of a sample v from a training database TR, followed by the random draw of a hidden representation h associated with v. P_TR is the target distribution of the training; it is fixed from the sample basis. Having defined P_W and P_TR, the training step minimizes the Kullback-Leibler divergence between these two distributions, in such a way that the probability of drawing a sample v from the examples approaches the probability of generating it from the trained RBM. It turns out that the Kullback-Leibler divergence minimization has a very high computational cost, so we prefer to minimize an approximated criterion: the Contrastive Divergence (CD). This technique was developed by Geoffrey Hinton in 2001; the pioneering paper, written in 2002 [24], demonstrated an improvement of the handwritten digit recognition results (MNIST dataset) obtained by using the CD algorithm to train an RBM.

3.4.2. Deep Belief Network (DBN)

Restricted Boltzmann machines stacked in a generative way represent a particular type of deep neural network, which is the reason why we detailed the RBM structure above. The topology of such a network is presented in Figure 10.

Figure 10. A deep network topology: stacked RBMs (v = input, hidden layers h1 and h2, RBMs R1 and R2).

The training of the stacked RBMs is made layer by layer, in a greedy way: a first RBM is trained on the dataset of samples to minimize the Contrastive Divergence (CD); then, each of the following RBMs is trained on the hidden representations of the previous RBM. The training process includes two phases:

(1.) Unsupervised pre-training: the DBN distinguishes itself from the other neural networks by its layer-by-layer way of learning; the idea is to train every layer as an RBM.
The training begins from the lowest hidden layers (close to the visible input vector) and progresses towards the output vector, so a DBN learns to probabilistically reconstruct its inputs. The layers then act as feature detectors. The DBN training algorithm for each RBM layer is the following:

Step 1: build a multilayer network from the restricted Boltzmann machines (RBMs), then train the layers with the greedy layer-wise algorithm. Naturally, the input vector is the visible layer v.
Step 2: encode the input X with the RBM to produce a new sample v′, using the Gibbs sampling method.
Step 3: repeat the second step with v ← v′ for each pair of successive layers, until the two top layers of the network are reached.

(2.) Supervised training: a supervised algorithm, such as the back-propagation used in the MLP, is applied to search for the optimal parameters of the DBN network. Nevertheless, we use the weights W and the hidden biases b obtained during pre-training to initialize this MLP.

In [13], several experiments were performed on different networks trained for simple classification tasks (handwritten digits, geometrical forms). The conclusion of this reference paper is that training deep architectures was unsuccessful until the recent advent of algorithms based on an unsupervised pre-training. The concept of deep learning does not mean that the number of layers is high; it is rather the way of learning, a different (layer-by-layer) method, that has been applied in our work with the DBN network. We deduce from these works that, if the network contains a sufficient number of neurons, the classification performances are better with an unsupervised pre-training; even on classical (not deep) networks, the performances are improved. In our current work, we suggest the use of all the neural architectures described previously for the on-line unconstrained Arabic handwriting recognition (letters and words). We also observe how the DBN manages to discriminate between the data without having any prior information and without performing the feature extraction. In this classifier, the complexity resides in the choice of the hyper-parameters of the CD algorithm; this heuristic choice requires many experiments to find the optimal architecture giving the best classification rates.

4. Description of the Database

The objective of the architectures described previously is the classification and the recognition of Arabic characters/words coming from our database NOUN-DATABASE, built from an on-line acquisition by means of an acquisition tablet. This choice is motivated by the fact that most of the existing databases are off-line, i.e., they contain images. In this work, we wanted to handle the on-line writing signal rather than the images. The Arabic alphabet consists of 28 letters whose shapes vary depending on their position in the word. Our database contains 2800 Arabic characters written by 20 different writers; each writer inserts the alphabet 5 times (once for every position in the word: in its isolated form, at the beginning, middle and end, and one more time in a form of the writer's choice). The constructed database also includes 4800 words, inputted by the same 20 writers. The words represent the 48 Algerian provinces (Wilayas). Each writer performs five insertions of the 48 Wilayas.
The handwriting can be influenced by several parameters, such as the age, the sex and the state of the writer (each writer has his or her own writing style). To preserve the variability of the writing signals, we took care to acquire the data (the sample database) from 20 people of different sexes and age categories. The database is summarized in Table 1.

Table 1. Description of the database.
Database         Number of classes   Writers   Samples (training/test)
Letter classes   28                  20        1400/1400
Word classes     48                  20        2400/2400

5. Experiments and Results

In this section, we shall only consider the final results of our system for the Arabic handwritten character recognition, i.e., with all the parameter adjustments made in the different neural approaches. We also divided the database into two parts: 50 % of the samples formed the training set (1400 characters and 2400 words) and the remaining 50 % were used for the test (1400 characters and 2400 words).

5.1. Neural Approach (classical methods)

Experiment 1: In the first experiment, we have used the classic architectures described previously for the on-line Arabic handwritten character recognition, regardless of the character position in the word, i.e., in its isolated form, at the beginning, middle and end. The best configurations of our system gave the results illustrated in Table 2.

Table 2. Arabic character recognition rates and classification times according to the neural network.
NN     Correct (%)   Error (%)   Classification time (s)
MLP    97.73         2.27        48.21
TDNN   49.46         50.54       2358.00
RBF    86.09         13.91       23.09
RBFe   75.39         24.61       8.68
GRNN   86.36         13.64       8.14
PNN    84.92         15.08       8.73

Discussion: We notice that the used classic architectures give more or less satisfactory results; the best classification results for the Arabic characters were obtained using the multi-layer perceptron (MLP), with a 97.73 % success rate.

Experiment 2: In the second experiment, we use the same architectures for the online Arabic word recognition, the words being the 48 Wilayas of Algeria. The best configurations of our system gave the results illustrated in Table 3.

Table 3. Arabic word recognition rates and classification times according to the neural network.
NN     Correct (%)   Error (%)   Classification time (s)
MLP    85.31         14.69       27.01
TDNN   46.76         53.24       32017.00
RBF    81.85         18.15       37.57
RBFe   62.58         37.42       16.38
GRNN   78.44         21.56       16.74
PNN    78.44         21.56       17.77

Discussion: We notice that the word recognition results have fallen considerably in comparison with Table 2; this is due to the increase in the number of classes (which rises from 28 to 48). It can also be explained by the complexity of the data samples: a word can consist of several pseudo-words, each containing one or several characters, and the variability of the writing signal is richer in the case of word writing. In addition, the reduction of the recognition rate can be explained by the main limitation of the classic architectures: if the architecture is too deep, the optimization of the parameters very often leads to a non-optimal local minimum. Increasing the number of layers did not improve the results (sometimes, it even degraded them). For this reason, we highlighted previously that the reported results are those of the ideal architecture (with the best parameters for each neural network, including the number of layers). As explained in [13, 14], the error surface is not convex and becomes more irregular as the network gets deeper.
Despite this, the MLP still gives the best result, 85.31 %, with an acceptable classification time of 27.01 seconds. Another observation is that the TDNN takes a long time for the classification because of its complex architecture, which adds an additional dimension to the network, i.e., a temporal dimension in this particular case.

5.2. Deep Neural Approach

Experiment 3: To enhance these results, we use a deep neural architecture of the Deep Belief Network (DBN) type, which represents a stacking of restricted Boltzmann machines. It should be noted that the recognition system architecture is slightly modified, because we have eliminated the “Feature extraction” block (see Figure 2). This is due to the abstraction ability of the deep learning network: it automatically learns the different representations of the data, abstraction level by abstraction level, from the raw data.

Phase 1: Pre-training (unsupervised training, without back-propagation): in this phase, the algorithm uses a concatenation of the RBM layers to learn the distribution of the input data X (the handwriting signal) without taking the labels Y into consideration. Each RBM layer (the first two layers in this case) produces a more abstract representation than the previous one. Hinton et al. [12] proved that this greedy unsupervised learning method, also called contrastive divergence (CD), is effective. This phase, sometimes called pre-training, can be considered as an efficient initialisation of the DBN.

Phase 2: Fine-tuning (supervised phase, with back-propagation): the role of the second part of the DBN is to convert the abstract representation of X into the labels Y used in the supervised learning (the last RBM layer has 2000 units). The back-propagation algorithm is used to readjust the network parameters and, finally, the globally optimal network can be obtained. The output layer has as many neurons as there are classes (48 in this case). The training phase is considered accomplished once the error between the output and the desired one is stabilized. After that, we proceed to test the other samples of the database (see Table 1). This phase, which is called fine-tuning, is more time-consuming than the previous one.

The parameters of the used DBN network for each layer (pre-training and fine-tuning phases) are given in Table 4.

Table 4. DBN parameters.
Parameter                   Value
Number of hidden layers     2
Units of 1st hidden layer   100
Units of 2nd hidden layer   100
Units of 3rd hidden layer   2000
Batch size                  100
Number of RBM epochs        50

The above-described DBN network architecture gave the results shown in Table 5.

Table 5. Arabic word recognition rates and classification time for the DBN network.
NN    Error before BP (%)   Error after BP (%)   Correct (%)   Classification time (s)
DBN   33.75                 2.92                 97.08         252.21

Discussion: We can see that the results of the word classification are significantly improved with the use of the DBN. Indeed, in the previous tables, the best classification rate was 85.31 % with the MLP; it reaches 97.08 % with the DBN built from stacked RBMs, giving a gain of 11.77 %.
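For orientation only, the following sketch shows how a greedy RBM stack with the layer sizes of Table 4 could be pre-trained and then fine-tuned with a back-propagation classifier, using scikit-learn's BernoulliRBM and MLPClassifier rather than the authors' implementation. The input data, their scaling to [0, 1], the learning rate and the use of a separate classifier on top of the deepest representation (instead of initialising it with the pre-trained weights) are assumptions of this example.

import numpy as np
from sklearn.neural_network import BernoulliRBM, MLPClassifier

# Layer sizes, batch size and number of epochs follow Table 4; the data are placeholders.
rbm_stack = [BernoulliRBM(n_components=n, learning_rate=0.05, batch_size=100, n_iter=50)
             for n in (100, 100, 2000)]

X = np.random.rand(500, 210)        # hypothetical word signals scaled to [0, 1]
y = np.random.randint(0, 48, 500)   # 48 word classes (the Algerian Wilayas)

# Greedy, layer-by-layer unsupervised pre-training: each RBM is trained (by contrastive
# divergence) on the hidden representation produced by the previous one.
representation = X
for rbm in rbm_stack:
    rbm.fit(representation)
    representation = rbm.transform(representation)

# Supervised fine-tuning by back-propagation on top of the deepest representation.
# A full DBN would instead copy the pre-trained weights into this network before training.
classifier = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500)
classifier.fit(representation, y)
print(classifier.predict(representation[:1]))   # predicted word class of the first sample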
We also note that the second step, the supervised part (with back-propagation), called fine-tuning, is very important in terms of the error rate reduction (2.92 % error rate) and turns out to be much more efficient than the pre-training step alone (33.75 % error rate), because the back-propagation does not start from a random initial solution. Figure 11 summarizes the classification rates obtained for each of the neural networks used.

Figure 11. Comparison between the deep neural approach (DBN) and the classic neural approaches for the Arabic word recognition: classification rate (%) per classifier (MLP, TDNN, RBF, RBFe, GRNN, PNN, DBN).

6. Conclusion

In this paper, we have proposed a neural approach aiming to develop a solution, based on neural networks, for the recognition of on-line Arabic manuscripts acquired dynamically. With the aim of performing the on-line recognition, we built our own database, containing 2800 characters and 4800 words. We divided our work into two parts. The first presents a classic neural approach using the following neural networks: MLP, TDNN, RBFNE, RBF, GRNN and PNN. Our experiments on the character recognition gave interesting results, such as a 97.73 % success rate for the MLP and 86.36 % for the GRNN. Nevertheless, the application of the same architectures to the word recognition significantly deteriorates the results, which did not exceed 85.31 % for the MLP and 81.85 % for the RBF. This can be explained by the complexity of the input signal and by the local minima problem. The second approach uses deep learning, involving the DBN, which represents a stacking of restricted Boltzmann machines and is characterized by a layer-by-layer learning method. This network significantly improves the performance on the word database and gives a classification rate of 97.08 %. In fact, the DBN has shown its ability to achieve an excellent performance for classification tasks and dimensionality reduction. We conclude that the deep learning network is superior to the classic neural networks. We also note that the various parameters used in all the previous neural architectures have allowed us to obtain the best classification rates.

References

[1] N. Ben Amara, A. Belaïd. Modélisation pseudo bidimensionnelle pour la reconnaissance de chaînes de caractères arabes imprimés. In Colloque International Francophone sur l'Écrit et le Document (CIFEd'98), pp. 131–141, Québec, Canada, 1998.
[2] L. R. Rabiner, B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, pp. 4–16, 1986.
[3] C.-L. Liu, S. Jaeger, M. Nakagawa. Online recognition of Chinese characters: the state-of-the-art. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(2):198–213, 2004. doi:10.1109/TPAMI.2004.1262182.
[4] Y. Kessentini, T. Paquet, A. Ben Hamadou. A multi-lingual recognition system for Arabic and Latin handwriting. In Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR), pp. 1196–1200, July 26–29, 2009. doi:10.1109/ICDAR.2009.55.
[5] V. Märgner, H. El Abed. Databases and competitions: strategies to improve Arabic recognition systems. In Proceedings of the 2006 Conference on Arabic and Chinese Handwriting Recognition, pp. 82–103, College Park, MD, USA, September 27–28, 2006.
[6] M. Elleuch, N. Tagougui, M. Kherallah. Optimization of DBN using regularization methods applied for recognizing Arabic handwritten script. In International Conference on Computational Science (ICCS), 12–14 June 2017. doi:10.1016/j.procs.2017.05.070.
[7] N. Mezghani, A. Mitiche, M. Cheriet. Bayes classification of online Arabic characters by Gibbs modeling of class conditional densities. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(7):1121–1131, 2008. doi:10.1109/TPAMI.2007.70753.
[8] N. Mezghani, M. Cheriet, A. Mitiche. Combination of pruned Kohonen maps for on-line Arabic characters recognition. In Proceedings of the Seventh International Conference on Document Analysis and Recognition, p. 900, August 3–6, 2003. doi:10.1109/ICDAR.2003.1227790.
[9] F. Biadsy, J. El-Sana, N. Habash. Online Arabic handwriting recognition using hidden Markov models, 2006.
[10] S. Benbakreti, A. Boukelif. Features extraction and on-line recognition of isolated Arabic characters, pp. 481–500, 2018. doi:10.1007/978-3-319-67056-0_23.
[11] Y. Bengio, P. Lamblin, V. Popovici, H. Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19, eds. B. Schölkopf, J. Platt, T. Hoffman, MIT Press, pp. 153–160, 2007.
[12] G. Hinton, S. Osindero, Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation 18(7):1527–1554, 2006.
[13] D. Erhan, P.-A. Manzagol, Y. Bengio, et al. The difficulty of training deep architectures and the effect of unsupervised pre-training. In International Conference on Artificial Intelligence and Statistics, pp. 153–160, 2009.
[14] I. Guyon, P. Albrecht, Y. Le Cun, et al. Design of a neural network character recognizer for a touch terminal. Pattern Recognition 24(2):105–119, 1991. doi:10.1016/0031-3203(91)90081-F.
[15] P. Burrascano. Learning vector quantization for the probabilistic neural network. IEEE Transactions on Neural Networks 2(4):458–461, 1991. doi:10.1109/72.88165.
[16] D. K. Kim, D. H. Kim, S. E. K. Chang, S. A. K. Chang. Modified probabilistic neural network considering heterogeneous probabilistic density functions in the design of breakwater. KSCE Journal of Civil Engineering 11(2):65–71, 2007. doi:10.1007/BF02823849.
[17] M. J. D. Powell. Radial basis functions for multivariable interpolation: a review. In Algorithms for Approximation, pp. 143–167, Clarendon Press, Oxford, 1987.
[18] J. Park, I. W. Sandberg. Universal approximation using radial basis function networks. Neural Computation 3(2):246–257, 1991. doi:10.1162/neco.1991.3.2.246.
[19] D. F. Specht. A generalized regression neural network. IEEE Transactions on Neural Networks 2(6):568–576, 1991. doi:10.1109/72.97934.
[20] D. F. Specht. Probabilistic neural networks. Neural Networks 3(1):109–118, 1990. doi:10.1016/0893-6080(90)90049-Q.
[21] M. Berthold, J. Diamond. Constructive training of probabilistic neural networks. Neurocomputing 19(1):167–183, 1998. doi:10.1016/S0925-2312(97)00063-5.
[22] W. Tomasz, J. Jacek, M. Jacek. Probabilistic neural network for direction estimation. In Proceedings of the Third Conference on Neural Networks and their Applications and Summer School on Neural Networks Applications to Signal Processing, Kule, 14–18 October 1997, pp. 173–178, eds. R. Tadeusiewicz, L. Rutkowski, J. Chojcan, Częstochowa: Pol. Neural Netw. Soc., 1997.
[23] Y. Bengio, Y. LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines, MIT Press, 2007.
[24] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation 14(8):1771–1800, 2002. doi:10.1162/089976602760128018.