AP08_6.vp


1 Introduction
In some cases, usual methods of supervised learning are

not able to provide satisfactory results. This may occur in data
with asymmetric distributed error, which is typical in insur-
ance. In order to manage the asymmetry in the sense of the
law of large numbers, this paper offers a new algorithm,
which constructs a predictor not for points, but for sets. We
will show an algorithm for finding sets of units with above-
-average outputs.

Let X be a set, and let �, � be measures over it. The Pareto
principle arises if there is a set P X� where

p P X
P
P

X
X

( , )
( )
( )

( )
( )

� � ��
�

�

�

�
1 (1)

and r P P X( ) ( ) ( )� ��� � 0. Let � stand for volume, � for pro-
duction, p for productivity, r for proportion. Typically, the
Pareto principle is considered as a rule that 20 % of elements
“produces” 80 % or more of the output. (This principle was
discovered by Vilfredo Pareto while assessing the welfare
distribution in the UK at the end of the 19th century. His
ideas were systematically described, applied and extended
by Max Lorenz[7].) In this case, r P( ) .� 0 2 and p P( ) � 4. Man-
agerial science often works with the Pareto diagram [1]. X is
discrete � 	x x xn1 2, , ,� . The elements are ordered by their
production �( )xi , and the production is drawn in a chart. In
stastitics,the Pareto principle is represented by the continu-
ous Pareto distribution:

Pr( )X x
x

xm

k

� �


�
�
�


�
�
�

�

(2)

for all x xm� , where xm is the (necessarily positive) minimum
possible value of X, and k is a positive parameter. Pareto dis-
tribution has positive skewness

2 1
3

2( )�
�

�k
k

k
k

,

which means that the below-average subset is bigger than
the above-average complement. This occurs in many real life
situations: the median citizen has a below-average salary, the
median driver causes below-average claims, etc.

Let � : X n� � be attributes of X. (Attributes can be con-
sidered as columns in a data table, i.e. the n-tuple from the
i-th row. Mapping � can also involve preprocessing. If X is �k

for some k � �, the � mapping may also be an identity.) The
problem of prediction consists in constructing the mapping �y
so that � ( )y i i d

i X
� � �� �

�� 0.
However, the construction of mapping �y may be difficult

if the Pareto principle arises. A small subset of high pro-
ductivity (called outliers) corrupts usual assumptions. Usual
datamining techniques propose removing the set and work-
ing only with the rest. However, in case of the Pareto principle,
the small set is very interesting. It is not adequate to speak
about outliers, because such data is relevant and obvious.
Therefore, we are dealing with more humble result, i.e. with
finding the set P defined in (1).

The formulated problem, i.e. finding

arg max ( , )
P X

p X P
�

under condition r P r( ) � 0 (3)

is new, and has not been found in the current literature. How-
ever, many other topics are related to it. First, clustering
methods [12] can be employed. Creating clusters of above-av-
erage individuals, the set P will be defined as these clusters.It
is important to define the border of the clusters somehow. If
these clusters are well found, they can be employed for a more
precise approximation [6]. Another approach is to attempt to
find a prediction mapping �y where the set P is afterwards de-
fined at some level of this mapping. RBF neural networks [5]
provide an example where the approach of rough sets is
employed. Finally, effective P can also be detected also Data
Envelopment Analysis [2]. However, none of these methods –
as the research in the bibliography shows – has been applied
explicitly to the problem of above-average subsets.

2 The Fencing algorithm
The following algorithm is the first attempt to solve this

problem (3). It offers the construction of �P X� with above-
-average production, i.e. with high p. The space �n will be

©  Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 55

Acta Polytechnica Vol. 48  No. 6/2008

The Pareto Principle in Datamining: an
Above-Average Fencing Algorithm

K. Macek

This paper formulates a new datamining problem: which subset of input space has the relatively highest output where the minimal size of this
subset is given. This can be useful where usual datamining methods fail because of error distribution asymmetry. The paper provides a novel
algorithm for this datamining problem, and compares it with clustering of above-average individuals.

Keywords: ART, Pareto principle, insurance risk.


considered as X. The set P is represented by union intervals
(from points to hyperboxes) represented by means of comple-
ment coding. (Complement coding is a concept applied in
ART and ARTMAP neural networks, e.g.[3]. However, the ob-
jective of the Fencing algorithm is different. While ART and
ARTMAP work iteratively, the Fencing algorithm must often
go through the entire training set.) This algorithm works only
with finite data sets D, namely with ( , )xi iy pairs. Therefore, for
all subsets of D, the volume � is defined as count
� �( ) ( )M M D� � and the production as the sum of the pro-
duction of particular items �

�
( )

( )
M yii M D

�
� �� 1 � .

Other ways to construct such intervals may be considered.
The following algorithm uses fencing. Fencing is a heuristic
approachwhich anticipates that areas of higher average pro-
duction are located between mutually close points with high
production. The algorithm attempts to build a rectangular
fence around the area of above-average production, as shown
in Fig. 1.

2.1 Measuring and data preprocessing
� mapping is necessary. This mapping involves measuring

and data preprocessing. The simplest way is to transform
binary attributes into real attributes by 0 – 1 coding. Categori-
cal attributes are transformed into more binary binary attrib-
utes. It is very useful to reduce the input vector dimension,
e.g. by Principal Components Analysis [11]. Let us define
xi i� �( ) and the vector x x x xmax ( , , , )max max max� 1 2 � n so
x xj j

i i Dmax � � � . The vector xmax is used for complemen-
tary coding x.

2.2 Data splitting
The data set D is divided randomly into three subsets: base

subset B, training subset T and validation subset V. Sets B and

T are used for constructing the predictor y, whereby their size
will be represented by B T�� , say10 � �B T . The size of V is
chosen with respect to cross validation [8].

2.3 Starting set of intervals

The starting set of intervals is defined as follows

R xi i BO O� �cc( ) , whereby B ni B y k
B
B

O i� � � �
�
�
�

�
�
�

�

�

( )
( )

, cc is

the complement coding cc : � �n n� 2 and k is a parameter.

The definition of BO ensures that p B B kO( , ) � .

2.4 Interval expansion

Two intervals r1, r2 can be expanded as follows
r r ri i i

new � min( , )1 2 , i n� 1 2 2, , ,� . The construction of �P con-
sists in iterative expansion of intervals. The process starts with
RO. Two intervals are expanded only if the new interval covers
p I T qt( ( ), )r

new � , where qt is a parameter that sinks linearly
during the process from p0 to p1. In order to expand the com-
pared intervals, a heuristic is used. Two intervals I a( )r , I b( )r
are suitable for expansion if p I a( ( ))r and p I b( ( ))r are high,
and if ra and rb are close. Let us define suitability as:

v
p I T p I T

a b

a b

a b,
( ( ), ) ( ( ), )

( , )
�

�r r

r r�
, (4)

where � is a metrics. The first version of the algorithm worked
with Hamming distance [6], but other metrics can be also ap-
plied. In each step, a pair of intervals is tested for expansion.
The pair is selected partly randomly as follows: the pair with
highest suitability (probability 0.8), the pair with lowest suit-
ability (0.1), or a random pair (0.1). If a pair is expanded, its
suitability is recalculated for all other intervals.

56 ©  Czech Technical University Publishing House http://ctn.cvut.cz/ap/

Acta Polytechnica Vol. 48  No. 6/2008

1 2 3 4

1

2

3

� � 4

� �10

� �10

x2

x2

Fig. 1: Fencing: Expanding the square, a new fence (dashed) is recommanded to the rectangle that is close and has high productivity �


2.5 Termination
The algorithm terminates after all pairs of intervals have

been tested and none can be expanded. Afterwards, unex-
panded intervals (i.e. points) are deleted. Because intervals
may overlap, their conjunction may have lower p than aver-
age p of all intervals. Therefore, only intervals with highest p
are considered as results so their conjunction has p high
enough (e.g. higher than a given threshold).

2.6 Validation
Finally, the results are validated with respect to the vali-

dation set r P V V( , )� , and p P V V( , )� are calculated. Such
values can be considered as the quality of the algorithm.

3 Results
The Fencing algorithm has been applied successfully on

data on 18 177 insurance claims related to traffic accidents in
the Czech Republic in 2003–2005. Categorical and numerical
attributes were transformed into binary attributes. There was
a total of 135 binary attributes. The considered attributes and

their transformation is summarized in Table 1. The Fencing
algorithm has been implemented in Matlab as a set of simple
scripts. It should be mentioned that this particular data set in-
spired the author to invent of the Fencing Algorithm, after
attempts to build some regression model failed. Generalized
Linear Models [4], which are typical in insurance mathemat-
ics, and multilayer perceptrons [8] did not provide sufficient
results, as shown in Table 2. The data set was split into 10 sub-
sets and one subset was always tested. The logarithm of total
costs was taken as an output variable. However, the mean ab-
solute error remains very high (the prediction and reality dif-
fer over twentyfold on average!).

First experiments showed that the 135 dimensional space
is too sparse and than there are many futher unexpandable
intervals. Therefore, the dimension was reduced by selecting
36 attributes describing the region of the claimant, road type,
and cause of the accident. After next unsuccessful experi-
ments with p1 4� and p1 2� , it was neccessary to set p1 15� . .
Then 9 intervals were found. However, the conjunction of
them had p � 119. only. Therefore only 3 best intervals were
selected, with p � 1 48. and r � 0 32. .

©  Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 57

Acta Polytechnica Vol. 48  No. 6/2008

Observed values Data type Used as

Accident-related information

Hour numerical 24 input binary variables

Day numerical 7 input binary variables

Month numerical 12 input binary variables

Year numerical 6 input binary variables

District categorical not used

Municipality size numerical 6 input binary variables

Cause categorical 11 input binary variables

Road type categorical 10 input binary variables

Tariff group categorical 25 input binary variables

Car make categorical not used

Information about causing person

Age numerical 8 input binary variables

Sex categorical 3 input binary variables

District categorical not used

Municipality size numerical 6 input binary variables

Region categorical 15 input binary variables

Accident at place of abode binary 1 input binary variables

Claim costs 1 output numerical variable

Paid numerical

Additional expected numerical

Table 1: Transformation of observed values into input and output variables


3.1 Comparison
The problem (3) formulated here is novel and the Fencing

algorithm is the only solution so far. However, for a simple
comparison a clustering based method was involved that can
be described briefly as follows:

1. Building above-average clusters from training data:
Best 20 % records were extracted and clustered via the
k-means algorithm. For each cluster, the diameter was cal-
culated as the maximum of distance between the center
and the record belonging to it.

2. Finding above-average records in the testing data: For
each record, we test whether there is a cluster whose cen-
ter is closer to the record than the c multiplied diameter
of the cluster. Parameter c is set up so that the level of r
is satisfied. So P is defined and r ensured.

3. Calculation of p from provided data, p is calculated.

Table 3 shows the results achieved by this method, and
compares them with the Fencing Algorithm: the alternative
method based on known algorihm provides less narrow re-
sults. However, the goal of this paper was not test proposed
Fencing Algorithm, but to show that this algorithm is able to
solve problem formulated above (3). More experiments with
the k-means based approach might provide better results.

4 Discussion and further work
The Fencing algorithm can be modified so the suitability

va,b is calculated in another way. There should be an increase
in p in both intervals and a decrease in distance between r(a)
and r(b). The randomized selection rule can also be modified.

If a pair of intervals is tested, the whole training set T is
gone through. This is probably the Achiles tendon, because
the size of T is usually very large. Therefore more detailed

examination complexity and the design of more suitable data
structures are desirable.

The basic idea of constructing an above-average subset
can be evolved in many ways. The subset need not be a union
of intervals, but they may be simplexes. The set must not be
narrow, it may be fuzzy. Or the subset can be given in an alge-
braic form and detected by genetic programming or other
optimization methods, such as Ant Colony Optimization [10].
The Fencing algorithm will be compared with these other ap-
proaches in terms of complexity and effectiveness on more
data sets. Systematic examination of relevant preprocessing
methods is also desirable. Finally, the algorithm could be
modified not for data, but for an estimated probability func-
tion, e.g. in form of copulas [9] which are more appropriate
for assymetric distributions.

5 Conclusion
The Fencing algorithm is a novel heuristic method for

finding a subset of with above-average production. The main
idea of the algorithm is to join intervals with high production
and small mutual distance. The Fencing Algorithm has been
successfully applied to insurance data. Further work has been
discussed above.

References
[1] Akpolat, H.: Six Sigma in Transactional and Service Envi-

ronments. . Gower, Burlington, Vt.:,Hasan Akpolat 2004.
[2] Andersen, P., Petersen, N. C.: A Procedure for Ranking

Efficient Units in Data Envelopment Analysis. Manage.
Sci., Vol. 39 (1993), No. 10, p. 1261–1264.

[3] Dagher, I.: L-p Fuzzy Artmap Neural Network Architec-
ture. Soft Comput., Vol. 10 (2006), No. 8, p. 649–656.

[4] de Jong, P., Heller, G. Z.: Generalized Linear Models for In-
surance Data. Cambridge Press, 2008.

58 ©  Czech Technical University Publishing House http://ctn.cvut.cz/ap/

Acta Polytechnica Vol. 48  No. 6/2008

Validation sets # 1 2 3 4 5 6 7 8 9 10

GLM 3.43 3.37 3.27 3.33 3.30 3.22 3.35 3.27 3.43 3.41

MLP 3.35 3.29 3.24 3.30 3.23 3.10 3.27 3.28 3.44 3.34

Table 2: Mean absolute error of machine learning for different validation sets

Method r (fixed) p

means based approach (6 means) 0.32 0.96

means based approach (20 means) 0.32 0.94

Fencing algorithm (raw results) 0.48 1.19

Fencing algorithm (results after best interval selection) 0.32 1.48

Table 3: Comparison of the Fencing algorithm with the k-means based approach


[5] Jia, Z., Gong, L.: The Project Risk Assessment Based on
Rough Sets and Neural Network (rs-rbf ). In Networking
and Mobile Computing, 2008. WiCOM ’08. 4 t h International
Conference on Wireless Communications, 2008.

[6] Kreinovich, V., Yam, Y.: Why clustering in function approxi-
mation? Theoretical explanation, 2001.

[7] Lorenz, M. O.: Methods of Measuring Concentration
and Wealth. Journal of the American Statistical Association,
Vol. 9 (1905), p. 209–219.

[8] Mitchell, T. M.: Machine Learning. McGraw-Hill Science/
Engineering/Math, March 1997.

[9] Nelsen, R. B.: An Introduction to Copulas, volume 139 of
Lecture Notes in Statistics. New York: Springer-Verlag,
1999.

[10] Ramos, G. N., Hatakeyama, Y., Dong, F., Hirota, K.:
Hyperbox Clustering with Ant Colony Optimization

(haco) Method and its Application to Medical Risk pro-
file Recognition. Appl. Soft Comput., Vol. 9 (2009), No. 2,
p. 632–640.

[11] Theodoridis, S., Koutroumbas, K.: Pattern Recognition,
Third Edition. Academic Press, February 2006.

[12] Xu, R., Wunsch, D.: Clustering. IEEE Press Series on
Computational Intelligence. John Wiley & Sons, 2009.

Mgr. Karel Macek
e-mail: karel.macek@fjfi.cvut.cz

Department of Mathematics

Czech Technical University in Prague
Faculty of Nuclear Sciences and Physical Engineering
Trojanova 13
120 00 Praha2, Czech Republic

©  Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 59

Acta Polytechnica Vol. 48  No. 6/2008


	Table of Contents
	An Intelligent System for Structural Analysis-Based Design Improvements 3
	M. Novak, B. Dolšak
	The Amount of Regenerated Heat Inside the Regenerator of a Stirling Engine 10
	J. Škorpík


	Prerequisites for Increasing the Axle Load on Railway Tracks in the Czech Republic 15
	 M. Lidmila, L. Horníèek, H. Krejèiøíková, P. Tyc
	Simple Numerical Model of Laminated Glass Beams 22
	A. Zemanová, J. Zeman, M. Šejnoha


	Influence of Simulated LOCA on the Properties of Zircaloy Oxide Layers 27
	H. Frank

	Evaluation of Methods used for Separation of Vibrations Produced by Gear Transmissions 31
	A. Doèekal, M. Kreidl, R. Šmíd
	Rock Burst Mechanics: Insight from Physical and Mathematical Modelling 38
	J. Vacek, J. Vacek, J. Chocholoušová


	Properties of Starch Based Foams Made by Thermal Pressure Forming 45
	J. Štancl, J. Skoèilas, J. Šesták, R. Žitný

	The Pareto Principle in Datamining: an Above-Average Fencing Algorithm 55
	K. Macek