Effective Data Mining for a Transportation Information System
P. Haluzová

This paper describes the application of data mining methods to the database of the DORIS transportation information system, currently used by the Prague Public Transit Company. The goal is to create knowledge about the behavior of objects within this information system. The data is analyzed partly with the help of descriptive statistical methods, and partly with the help of association rules, which can discover combinations of attribute values that occur most frequently within a given data set. Two types of quantifiers were used when creating the association rules, namely "founded implication" and "above average". The results of the analysis are presented in the form of graphs and hypotheses.

Keywords: data mining, knowledge discovery in databases, association rules

1 Introduction
Data mining can be defined as the non-trivial extraction of implicit, previously unknown, yet potentially useful information from data; in other words, it is the science of extracting useful information from large data sets or databases. With the help of data mining, the derived knowledge, relationships and conclusions are often represented as models or patterns; for example, a data cluster, a tree structure or a set of rules can each form a model or a pattern. The whole process is sometimes referred to as knowledge discovery in databases (KDD). In this sense, data mining refers only to the modeling or analytical step and is considered to be one part of the KDD process [1].

One standard, named CRISP-DM (Cross-Industry Standard Process for Data Mining), describes this process step by step. It elaborates each phase of the KDD process and, in addition, helps to avoid common mistakes. The particular phases of the CRISP-DM methodology are [1]:

Business understanding – the first objective is to understand thoroughly what is really to be accomplished. We have to begin by uncovering the important factors that can influence the outcome of the project. This task also involves more detailed fact-finding about all of the resources, assumptions and other factors that should be considered in determining the data analysis goal.

Data understanding – becoming familiar with the data. This step usually investigates a variety of descriptive data characteristics (counts of entities in tables, frequencies of attribute values, average values, etc.).

Data preparation – the most difficult and most time-consuming part of the whole KDD process. The goal of data preparation is to choose (or create) relevant data from the available data and to represent it in a form suitable for the analytical methods that are applied (the data often needs to be in a single data table together with all the other values of the object's attributes). Data preparation includes activities such as data selection, filtering, transformation, creation, integration and formatting.

Modeling – the application of the analytical methods (algorithms); this step is sometimes referred to as data mining proper. There are many different methods, and the most suitable one must be chosen for a given task; efficient settings of its parameters must also be found. This phase includes verifying the quality of the model (e.g. testing on an independent data matrix, cross validation, and so on).

Evaluation – interpretation and evaluation of the discovered knowledge. The main aspects are the novelty, interest, utility and comprehensibility of the descriptive tasks. The derived knowledge falls into the following categories:
- evident knowledge, comparable to common sense or to the common knowledge of an expert. Even if such knowledge does not offer anything new, it shows that the method works and is able to discover knowledge.
- interesting knowledge that yields a new point of view. This is the main aim of KDD.
- knowledge that seems to be unclear or is at variance with expert knowledge. Such knowledge may have arisen by coincidence and should usually be ruled out; nevertheless, it can also expose a genuinely new view of the problem, and this possibility must be taken into consideration.

Deployment – the acquired knowledge should be turned into an applicable form, which can range from simply writing a final report to carrying out specific actions.
2 Analytic methods
The core of every KDD process is the use of analytic methods: their input is the prepared data, and their output is the discovered knowledge. Analytic methods include regression analysis, discriminant analysis, cluster analysis, decision trees and association rules, among others. It is possible to use standard methods as well as more modern methods, such as neural networks. The analysis in this paper is based on association rules, so they are described here in more detail.

2.1 Association rules
This method is based on finding connections (associations) among attributes; in particular, it is used to discover combinations of attribute values that occur most frequently within a given data set. Association rules, which originally come from the Czech GUHA method (General Unary Hypotheses Automaton), can be divided into two categories: rules of the form Ant ~ Suc, and so-called conditional rules of the form Ant ~ Suc / Cond, where Ant (antecedent, or precondition), Suc (succedent, conclusion or consequent) and Cond (condition) are logical conjunctions of literals, and the symbol ~ denotes a generalized quantifier indicating the type of relationship between Ant and Suc. A literal is defined as an attribute value (positive literal) or its negation (negative literal). A generalized quantifier can take the form of an implication, an equivalence, a statistical test, etc. For conditional association rules, only the objects that satisfy the given condition are taken into account when the hypothesis is evaluated [1].

A contingency table for the rule Ant ~ Suc and for n instances can be constructed as follows [1]:

Table 1: Contingency table

            Suc    ¬Suc    Σ
  Ant        a       b     r
  ¬Ant       c       d     s
  Σ          k       l     n

where a is the number of rows of the analyzed data matrix that satisfy both the precondition and the conclusion, b is the number of instances that satisfy the precondition but not the conclusion, c is the number of instances that do not satisfy the precondition but do satisfy the conclusion, and d is the number of instances that satisfy neither the precondition nor the conclusion. From these numbers we can determine various characteristics of a rule, and thus quantitatively evaluate the knowledge that has been found.

The basis of all association-rule search algorithms is the generation of combinations (conjunctions) of attribute values. These algorithms work only with categorical data, and numerical attributes therefore have to be divided into intervals.
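To make the role of the four frequencies concrete, the following minimal Python sketch (not part of the original study; the record fields "line" and "average_delay" and their values are purely illustrative, not the DORIS schema) computes a, b, c, d and the confidence a / (a + b) for one candidate rule over a list of categorized records.

```python
# Minimal sketch: evaluating one candidate rule Ant ~ Suc on categorical records.
# Field names and values are illustrative only.

def contingency(records, antecedent, succedent):
    """Return the 2x2 contingency counts (a, b, c, d) for Ant ~ Suc.

    antecedent / succedent are predicates over a single record (dict).
    """
    a = b = c = d = 0
    for rec in records:
        ant, suc = antecedent(rec), succedent(rec)
        if ant and suc:
            a += 1
        elif ant and not suc:
            b += 1
        elif not ant and suc:
            c += 1
        else:
            d += 1
    return a, b, c, d


if __name__ == "__main__":
    data = [
        {"line": 25, "average_delay": "over 3 min"},
        {"line": 25, "average_delay": "under 1 min"},
        {"line": 9, "average_delay": "under 1 min"},
        {"line": 25, "average_delay": "over 3 min"},
    ]
    a, b, c, d = contingency(
        data,
        antecedent=lambda r: r["line"] == 25,
        succedent=lambda r: r["average_delay"] == "over 3 min",
    )
    confidence = a / (a + b) if (a + b) else 0.0   # a / (a + b)
    print(a, b, c, d, confidence)                  # -> 2 1 0 1 0.666...
```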
Quantifiers
The quantifier (generalized quantifier) characterizes the type of relationship between the antecedent and the succedent. There are many types of quantifiers, e.g. lower critical implication, upper critical implication, above average, below average, founded equivalence, the Fisher quantifier, etc. Two types of quantifiers were used to create the hypotheses with the LISp-Miner program, namely "founded implication" and the "above average" relationship.

Founded implication
This quantifier has two parameters, p and B; p is called the confidence and B the base, with p ∈ (0, 1] and B > 0. We say that the antecedent and the succedent are in the relation of founded implication with minimum parameter values p and B in the data matrix if the following holds [1]:

  a / (a + b) ≥ p   and   a ≥ B

(for a and b see the contingency table, Table 1). This can be written formally as Ant ⇒_{p,B} Suc and can be interpreted as: "At least p·100 % of the objects satisfying the antecedent also satisfy the succedent, and at least B objects satisfy both the antecedent and the succedent", or "the antecedent implies the succedent with probability p·100 %".

AA quantifier (above average)
This quantifier also has two parameters, the confidence p and the base B, with p > 0 and B > 0. We say that the antecedent and the succedent are in the above-average relation with minimum parameter values p and B in the data matrix if the following holds [1]:

  a (a + b + c + d) / ((a + b)(a + c)) ≥ 1 + p   and   a ≥ B

(for a, b, c and d see the contingency table, Table 1). This can be written formally as Ant ∼+_{p,B} Suc and can be interpreted as: "Among the objects satisfying the antecedent, the relative frequency of objects satisfying the succedent is at least p·100 % higher than the relative frequency of objects satisfying the succedent among all the objects in the whole data matrix, and at least B objects satisfy both the antecedent and the succedent." Another way of expressing this rule is: "If we add one to p, we find out how many times the probability of the succedent increases when the antecedent is satisfied, compared with the case where we do not know whether or not the antecedent is satisfied." [2]

The values of the confidence and base parameters are very important for the resulting hypotheses: the higher the parameters, the more weight the found hypotheses carry. If the parameter values are set too high, only the strongest hypotheses are found; their number is usually low and may even be zero. Consequently, it is necessary to set the parameters repeatedly and to try various values.
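As an illustration of how the two quantifiers could be evaluated from the contingency counts, the following hedged Python sketch writes out the two tests exactly as defined above. The function names and the numerical examples are my own and do not come from the study.

```python
# Sketch of the two quantifier tests, written directly from the definitions above.
# Function names and example counts are illustrative.

def founded_implication(a, b, p, base):
    """Ant =>(p, B) Suc: a / (a + b) >= p and a >= B."""
    return (a + b) > 0 and a / (a + b) >= p and a >= base


def above_average(a, b, c, d, p, base):
    """Ant ~+(p, B) Suc: a(a+b+c+d) / ((a+b)(a+c)) >= 1 + p and a >= B."""
    denom = (a + b) * (a + c)
    return denom > 0 and a * (a + b + c + d) / denom >= 1 + p and a >= base


# Example counts from a hypothetical contingency table:
print(founded_implication(a=23, b=3, p=0.7, base=15))          # True: 23/26 ~ 0.88 >= 0.7
print(above_average(a=20, b=30, c=40, d=110, p=2.0, base=20))  # False: 20*200/(50*60) ~ 1.33 < 3
```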
2.2 Software for knowledge discovery in databases
Systems for KDD can be divided into research tools and commercial tools. Most of them include components for modeling, data preparation, visualization and interpretation. Examples of commercial systems are Clementine, Enterprise Miner, Intelligent Miner, Knowledge Studio and Statistica Data Miner; LISp-Miner and Weka are examples of non-commercial tools.

The association rules were generated with the LISp-Miner system, which has been under development for research and educational purposes since 1996 at the Faculty of Informatics and Statistics of the University of Economics, Prague. The system is based on the GUHA method and allows the character of each association rule to be specified closely, so that only really interesting hypotheses are searched for; narrowing the selection in this way also speeds up the generation of hypotheses. LISp-Miner works with MS Access databases, which was convenient, since the analyzed data was available only in this database system.

3 Data understanding and preparation

3.1 The DORIS system
The analyzed data, created between April and December 2003, was taken from the DORIS tram dispatch control system. This system monitors the tram-cars in the electric tramway network of the Prague Public Transit Company and supports tram-car control in real time. DORIS ensures basic functions such as localization of all tram-cars logged into the system, evaluation of the time deviation of tram-cars as they pass the tram stops, information about the departures of tram-cars compared with the timetable, control of data and voice traffic, control of the departures of tram-cars from the terminus, unified digital information about the exact time, etc. [3].

The data is archived by month; three databases were available for each month: the database of daytime traffic, the database of night traffic, and the database of emergency records (traffic accidents). The database of daytime traffic includes one table of records for each day of the month. There are approximately 50 000 records daily, and each record has 11 attributes (see Table 2).

Table 2: Attributes of the daytime traffic tables

  field name | data type     | meaning
  Rec_id     | text          | primary key (integer)
  Datum      | date and time | date and time (in the general form dd.mm.yyyy hh:mm:ss)
  Kod        | text          | identification of the type of record (log-in to the system, passage round the infra-pylon, departure from route, etc.)
  LnNo       | number        | line number
  PorNo      | number        | line order
  EvCislo    | text          | tram-car licence number
  Vozovna    | number        | number of the depot (1–8) from which the tram departed
  A          | number        | ID of the tram stop
  B          | number        | number of the stop post
  C          | number        | delay (overtake recorded as a negative value)
  D          | number        | reserve column, without data

3.2 Data preparation
It is necessary to integrate data from several tables to create a single database that includes all the required data. Furthermore, duplicate records have to be deleted, because their presence distorts the statistics: the greater the duplication, the greater the distortion. Besides duplicate records, extreme values can also influence the resulting statistics and the computation of new attributes (average values), so records with extreme values of the delay/overtake attribute were deleted as well. Such values occur due to so-called "reduced operating effectiveness" events (when a tram is diverted from its route, towed to the depot because of technical problems or an accident, or is being handled in the depot). A data table, which usually holds about 1.5 million records per month, yields roughly 20 000 duplicate records, which corresponds to about 1.3 % of the total.
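The following minimal Python sketch illustrates the cleaning steps just described (duplicate removal and filtering of extreme delay/overtake values), together with the removal of records with no applicable delay value, which is discussed just below. The field name "C" follows Table 2; the one-hour cut-off for "extreme" values is an illustrative assumption, not a value taken from the study.

```python
# Sketch of the data-preparation steps described above: removing exact duplicates,
# discarding records with extreme delay/overtake values, and dropping records
# without an applicable delay value. The cut-off value is an assumption.

EXTREME_DELAY_S = 3600  # assumed cut-off for "reduced operating effectiveness" records

def clean(records):
    seen = set()
    cleaned = []
    for rec in records:
        key = tuple(sorted(rec.items()))      # whole-record duplicate test
        if key in seen:
            continue
        seen.add(key)
        delay = rec.get("C")
        if delay is None or abs(delay) > EXTREME_DELAY_S:
            continue                          # no applicable delay, or extreme value
        cleaned.append(rec)
    return cleaned


raw = [
    {"Rec_id": "1", "C": 45},
    {"Rec_id": "1", "C": 45},     # exact duplicate
    {"Rec_id": "2", "C": 99999},  # extreme value ("reduced operating effectiveness")
    {"Rec_id": "3", "C": None},   # diverted tram, no applicable delay
]
print(clean(raw))                 # -> [{'Rec_id': '1', 'C': 45}]
```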
If a dispatcher diverts a tram from its route, there is no applicable delay/overtake value for that tram, because no timetable is defined for its "new" route and no departure deviation can therefore be determined. A 'NULL' value then arises in these attributes, and records with such values were also deleted in the data preparation phase; missing values can, in general, be handled in various ways. Because attributes such as 'day of the week' and 'hour interval' were used, they had to be created from the 'date' attribute.

3.3 Creation of attributes for generating association rules
To generate association rules and create hypotheses, the groups of attributes must be collected in a single data table, mostly with the help of SQL aggregation functions, and relevant categories must be created for the values of each attribute. One actual group of attributes is offered here as an example: the attributes 'weekday', 'hour', 'line', 'stop' and 'average delay' appear in the data table used for generating the hypotheses. The meanings of these attributes and their division into categories are shown in Table 3.

Table 3: Meaning and categories of the attributes

  attribute     | meaning                                   | category name       | attribute values
  weekday       | day of the week                           | Monday to Sunday    | 1 to 7
  hour          | hour interval (e.g. 7 means 7:00 to 7:59) | morning             | 5, 6
                |                                           | morning rush hour   | 7, 8
                |                                           | before noon         | 9, 10
                |                                           | noon                | 11, 12
                |                                           | afternoon           | 13, 14
                |                                           | afternoon rush hour | 15, 16
                |                                           | evening             | 17, 18, 19, 20
                |                                           | late evening        | 21, 22, 23
  average delay | average delay [s]                         | under 1 min         | [0, 59]
                |                                           | from 1 to 2 min     | [60, 119]
                |                                           | from 2 to 3 min     | [120, 179]
                |                                           | over 3 min          | > 179

The tram direction should also be reflected in the survey of the delays at the stops, i.e. whether the tram is heading inbound towards or outbound from the city center (this can be distinguished with the help of the 'stop post' attribute in the database). The attributes 'line' and 'stop' have one category for each value. The resulting data table for generating the association rules contains about 30 000 rows.

4 Results
We will now give some real examples and show how they have been modeled.

4.1 Description
The database allows us to follow any leg of a selected line, either over the long term, over a single day, or in hourly segments. The long-term pattern has the higher informative value and can be used for modifying timetables (adjusting the running time between stops) or as a basis for drafting new timetables.

The accident statistics also reveal much interesting information. About 800 accidents were recorded in the database between July and December 2003. With the help of a simple SQL query it was possible to determine that the most common type of accident was a collision between a tram and a car without injury; injuries occurred in only 4.4 % of all such accidents.

Fig. 1 shows that the greatest numbers of accidents occurred on Mondays and Wednesdays.

[Fig. 1: Number of accidents per day of the week in the study period (Monday 162, Tuesday 123, Wednesday 156, Thursday 137, Friday 147, Saturday 46, Sunday 25)]

The greatest number of accidents during the day may be expected to occur during the rush hours, and this is confirmed by the second graph (Fig. 2): the greatest number of accidents occurred between 3 and 4 p.m. On the other hand, the total number of accidents in the evening and at night is below twenty per hour, appreciably fewer than during the other hours of the day.

[Fig. 2: Number of accidents per hour of the day in the study period]
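Summaries of the kind shown in Figs. 1 and 2 can be produced with the simple SQL aggregation mentioned above. The sketch below (not from the original study) uses an in-memory SQLite table with an illustrative name and column; the original data was held in MS Access, but the aggregation principle is the same.

```python
# Sketch of the kind of SQL aggregation used for the accident statistics
# (counts per day of the week and per hour). Table and column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accidents (occurred TEXT)")  # ISO date-time strings
conn.executemany(
    "INSERT INTO accidents VALUES (?)",
    [("2003-07-07 15:40:00",), ("2003-07-09 07:55:00",), ("2003-07-09 16:05:00",)],
)

# Accidents per day of the week (0 = Sunday ... 6 = Saturday in SQLite's strftime).
per_weekday = conn.execute(
    "SELECT strftime('%w', occurred) AS weekday, COUNT(*) "
    "FROM accidents GROUP BY weekday ORDER BY weekday"
).fetchall()

# Accidents per hour of the day.
per_hour = conn.execute(
    "SELECT strftime('%H', occurred) AS hour, COUNT(*) "
    "FROM accidents GROUP BY hour ORDER BY hour"
).fetchall()

print(per_weekday)  # [('1', 1), ('3', 2)]
print(per_hour)     # [('07', 1), ('15', 1), ('16', 1)]
```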
Another example that illustrates the use of data mining in the DORIS database is the influence of a single accident on the delays of other trams in the affected area. An accident was chosen from the database that happened to the second tram in the line order of line 18, between the Chotkovy sady and Malostranská tram stops. The tram struck a pedestrian and was then taken to the depot. The accident influenced the progress of trams not only on line 18, but also on lines 22 and 23. Fig. 3 shows the trams operating on lines 22 and 23 and the increase in their delays at the time of the accident. Each tram is labeled with its line number and ordinal number; at first sight the ordinal numbers of the two lines appear to run in step in the graph, but in reality they differ slightly.

[Fig. 3: The influence of an accident on the delay increase of trams on lines 22 and 23 – delay [s] per stop (Malovanka, Brusnice, Král. letohrádek, Malostranská, Újezd, Národní divadlo) for trams 22/1–22/7 and 23/1–23/7]

As shown in Fig. 3, the first trams on both lines ran without any major delay. The accident occurred between the passages of trams 23/1 and 22/2, and the delays increased rapidly after this event. The trams most affected were 22/2, 23/2, 22/3 and 23/3, whose delays reached as much as 700 s. Afterwards, the flow of traffic slowly returned to normal, but several of the following trams were still delayed by about 200 s.

4.2 Searching for association rules
The attributes 'weekday', 'hour' and 'number of accidents' were placed in the antecedent (precondition) of the association rules, and the attributes 'percentage of delayed trams', 'average delay' and 'average overtake' were assigned to the succedent. Founded implication with the parameters B = 15 and p = 0.700 was used as the quantifier; the search therefore looked for hypotheses in which Ant and Suc are in the relation of founded implication, at least 15 objects satisfy both Ant and Suc, and the confidence of the implication is at least 0.700. With this setting of the parameters, for example, the following strong hypothesis appeared:

  Number of accidents (high) ⇒_{0.89, 23} percentage of delayed trams (high),

where the confidence is 89 % and the support a = 23 (the number of objects that satisfy both the precondition and the conclusion). The hypothesis can be interpreted as follows: if a high number of accidents occurs on a given day and in a given hour, then a high percentage of trams delayed by more than 180 seconds will appear on the same day and in the same hour with a probability of 89 %. This is confirmed by 23 entities in the data matrix.
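Purely as an illustration of the search principle described in this section, the following Python sketch exhaustively tests single-attribute antecedent/succedent pairs against the founded-implication quantifier. The real analysis was carried out with LISp-Miner; the attribute names, rows and thresholds below are made up for the example.

```python
# Illustrative brute-force search for founded-implication rules over a small
# categorized data matrix. Attribute names, rows and thresholds are made up;
# the study itself used LISp-Miner for this step.
from itertools import product

rows = [
    {"accidents": "high", "delayed": "high"},
    {"accidents": "high", "delayed": "high"},
    {"accidents": "zero", "delayed": "low"},
    {"accidents": "low",  "delayed": "low"},
]

P, BASE = 0.7, 2   # confidence and base thresholds (illustrative values)

attributes = {attr: sorted({r[attr] for r in rows}) for attr in rows[0]}

for ant_attr, suc_attr in product(attributes, repeat=2):
    if ant_attr == suc_attr:
        continue
    for ant_val, suc_val in product(attributes[ant_attr], attributes[suc_attr]):
        a = sum(1 for r in rows if r[ant_attr] == ant_val and r[suc_attr] == suc_val)
        b = sum(1 for r in rows if r[ant_attr] == ant_val and r[suc_attr] != suc_val)
        if a >= BASE and (a + b) > 0 and a / (a + b) >= P:
            print(f"{ant_attr}({ant_val}) =>[{a/(a+b):.2f}, {a}] {suc_attr}({suc_val})")
```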
Examples of other hypotheses:
- Number of accidents (zero) ⇒_{0.79, 15} percentage of delayed trams (low) & average delay (low)
- Number of accidents (low) ⇒_{0.82, 44} average delay (low)

From the hypotheses above it can be unambiguously deduced that the number of accidents influences the delay rate. Such a conclusion can be assumed intuitively, and could be drawn without the use of association rules. Nothing new appears here, but it confirms that the method used for generating association rules works well.

A search for conditional rules followed. The attributes 'line', 'weekday' and 'hour' were used in the antecedent, and 'average delay' in the succedent. The 'stop' attribute (direction towards the city center) was used as the condition, and the "above average" quantifier with the parameters p = 2 and B = 20 was applied. Thirty hypotheses satisfied the given conditions; the 'stop' attribute restricts the validity of the discovered rules to the individual stops. One of the discovered hypotheses reads:

  Line (25) ∼+_{3, 20} average delay (over 3 min) / stop (Vltavská)

This hypothesis holds for the Vltavská tram stop, in the direction towards the city center, for 20 instances in the data matrix, with p approximately 3. It can be interpreted as follows: a delay of over 3 minutes occurs on line 25 four times more often at the Vltavská tram stop than for the trams of all lines running through this stop. The rules can be interpreted in various ways; while intelligibility and simplicity should be the most important aims of the explanation, it is sometimes difficult to explain a hypothesis both simply and precisely at the same time. The hypothesis above can be stated simply as: trams of line 25 are those most often delayed at the Vltavská tram stop.

As with the rules generated from the previous group of attributes, evident or already known conclusions also appeared, for example:

  Hour (late evening) ∼+_{2, 64} average delay (under 1 min) / stop (Štěpánská)

Simply stated: compared with any other hour of the day, the smallest number of delayed trams occurs at the Štěpánská tram stop between 21:00 and 24:00.

5 Conclusions
The results of the data analysis can serve for modifying the running times between stops, or as a basis for drafting new timetables. They also draw attention to the existence of black spots with a high accident rate or a high percentage of delayed trams (or a high incidence of tram overtake). The conclusions drawn from such analyses provide an exact tool for recognizing black spots, so that suitable measures can be taken, e.g. decisions about tram priority at traffic lights or about constructing suitable passive elements.

As a result of applying the association rules, hypotheses appeared which in most cases could have been assumed intuitively. Basically, it can be inferred that the method used here for searching for association rules in the LISp-Miner program works well. No surprising or new hypotheses were found, above all because of the interdependence among the attributes that were used (for example, the correlation coefficient between the attributes 'number of accidents' and 'average delay' is 0.89). It would be necessary to enrich the database with other, independent attributes in order to find more interesting hypotheses. For example, vehicle occupancy could be recorded at the stops as a new attribute, if the equipment required for reading the weight on an axle were available.
If the vehicle occupancy were known, it would be possible to optimize the scheduled intervals between trams, i.e. to reduce or increase the frequency of connections. The most interesting hypotheses are those predicting the probability of a delay on a given line at a selected stop; no such statistics had previously been compiled at the Prague Public Transit Company. When considering transport options, the reliability of transfer connections and information about the probability of a delay are decisive elements in the choice of a mode of transport.

References
[1] Berka, P.: Dobývání znalostí z databází. Praha: Academia, 2003.
[2] Kejkula, M.: 4ft-Miner pro začátečníky. Získávání znalostí z databází. Praha: VŠE, 2004.
[3] Internal materials of the Prague Public Transit Company: Dispečerský řídicí systém DORIS.
[4] The official site of the LISp-Miner project, http://lispminer.vse.cz
[5] Hájek, P., Havránek, T.: Metoda GUHA. Automatická tvorba hypotéz. Praha: Academia, 1983.
[6] Ducháček, M.: Nástroj pro správu databází s využitím pro multi-relační data mining. Diploma thesis, MFF UK, Praha, 2005.
[7] Burian, J.: Datamining a AA (Above Average) kvantifikátor. Sborník 2. ročníku konference Znalosti. Ostrava, 2003. ISBN 80-248-0229-5.
[8] Rauch, J., Šimůnek, M.: Systém LISp-Miner. Sborník 2. ročníku konference Znalosti. Ostrava, 2003. ISBN 80-248-0229-5.

Ing. Petra Haluzová
email: p.haluzova@seznam.cz
Department of Informatics and Telecommunications
Czech Technical University in Prague
Faculty of Transportation Sciences
Konviktská 20
110 00 Prague 1, Czech Republic