CUBO A Mathematical Journal
Vol. 11, No. 02, (139–149). May 2009

A General Purpose Platform for Data Clustering Analysis

Haiyan Qiao and Brandon Edwards
Department of Computer Science and Engineering
California State University, San Bernardino
San Bernardino, CA 92407, USA
email: hqiao@csusb.edu

ABSTRACT

Grouping objects into meaningful sets, known as clustering, is an important procedure in many fields of the social sciences. Yet clustering analysis is a difficult problem, because many factors come into play in devising a well-tuned clustering technique for a given clustering problem. An easy-to-use clustering analysis tool is therefore needed. In this paper, we design a general purpose clustering analysis platform, which integrates different clustering algorithms and provides the clustering results in both textual and visual form. The platform assists users without any computing or programming background in selecting an appropriate clustering algorithm, improves the quality of clustering, and is extendable to the evaluation of clustering results.

RESUMEN

Grouping objects into meaningful sets, known as clustering, is an important procedure in many fields of the social sciences. Clustering analysis is a difficult problem, since several factors come into play in devising a well-tuned clustering technique for a given clustering problem. An easy-to-use clustering analysis tool is therefore necessary. In this paper, we design a general purpose clustering analysis platform, which integrates the different clustering algorithms and provides the clustering results in both textual and visual form. This platform assists users without any computing or programming background in the selection of appropriate clustering algorithms, improves the quality of clustering, and is extendable to the evaluation of clustering results.

Key words and phrases: Data clustering, clustering validation, clustering platform.
Math. Subj. Class.: 62H30, 91C20.

1 Introduction

With the continued advancement of computers and of technologies for data collection and storage, the task of finding patterns and discovering knowledge through data analysis techniques, e.g., data clustering, is becoming exceedingly important. Data clustering groups data into clusters so that the objects within the same cluster are similar to each other, while the objects in different clusters are dissimilar from one another. For example, social network analysis, i.e., the identification and organization of communities according to certain types of interdependency among people, such as values, visions, ideas, financial exchange, friendship, kinship, dislike, conflict, trade, web links, sexual relations, disease transmission, or airline routes [1], plays an important role in the study of the social sciences. In the analysis of complex survey data, cluster analysis is usually the crucial step in discovering patterns that will be used in decision-making, such as identifying the target audience for a product or developing new products. Data clustering analysis is also an essential technique for behavioral and socio-economic studies.

Although data clustering is an important procedure in many fields of the social sciences, clustering analysis is a difficult problem. Many factors, such as effective similarity measures, algorithms, and initial parameters, come into play in devising a well-tuned clustering technique for a given clustering problem. Moreover, it is well known that no clustering method is universally applicable to all sorts of data structures in terms of data distribution, size, density, dimensionality, etc. A large number of clustering algorithms have been developed for different contexts or purposes. This diversity, on the one hand, provides us with many choices; on the other hand, the profusion of options causes confusion [2, 3]. How to choose an effective clustering algorithm and how to assess the quality of the clusters returned are critical for the success of data clustering. However, these fundamental questions are not well addressed in the research on data clustering.

As new developments in one discipline usually spread slowly to related disciplines, the development and use of data clustering as a formal analysis procedure by anthropologists, psychologists, political scientists, and sociologists is not only promising but challenging as well. It is therefore desirable to have a clustering tool that assists researchers in different disciplines with clustering analysis. A few platforms for cluster analysis have been developed for special contexts or purposes; see [4, 5]. However, they are either limited in scope or far from user-friendly. One such platform, Cluster 3.0 [4], allows a user to cluster gene expression data for bioinformatics. However, Cluster 3.0 offers only the K-means algorithm and a hierarchical clustering method, and the results generated by the platform are not easy to interpret. Another such platform, RapidMiner [5], offers a variety of clustering algorithms. RapidMiner, however, requires extensive knowledge and skill in programming, which general users typically lack. So far there is no platform that offers a simple and user-friendly environment for cluster analysis.

In this paper, we describe a generalized cluster analysis platform with an easy-to-use interface. The platform provides the classic clustering algorithms with both textual and visual display of clustering results for any given data set. It offers researchers in the social sciences an easy-to-use interface and the ability to be extended in the future with clustering evaluation criteria. The platform will assist users in comparing the clustering results of different clustering algorithms and in making a wise decision about which clustering algorithm best serves their purpose.

The rest of the paper is organized as follows. Section 2 surveys the primary data clustering algorithms and clustering metrics. The general purpose clustering analysis platform is described and demonstrated in Section 3. Our conclusions and directions for future work are presented in Section 4.

2 Methodology

To cluster a data set, one question we need to answer is which similarity measure is appropriate in a given situation. The most common similarity measurements within a set of data are distance measurements, such as the Euclidean distance, which is the special case (p = 2) of the Minkowski metric

d_p(X_i, X_j) = \left( \sum_{k=1}^{n} |x_{i,k} - x_{j,k}|^p \right)^{1/p} = \| X_i - X_j \|_p,

where n is the dimension of the data space. Instead of using the Minkowski distance directly, normalization of the continuous features is usually adopted in order to avoid the tendency of the largest-scaled feature to dominate the others [3]. There are other measures used for clustering strings, such as cosine similarity, which is the most commonly used measure for document clustering [2].
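As a concrete illustration of the distance computation and the normalization step, the following minimal Java sketch scales each feature to [0, 1] with min-max normalization before evaluating the Minkowski metric. It is an illustrative sketch, not code from the platform itself; the class name and sample values are hypothetical.

```java
import java.util.Arrays;

/** Minimal sketch of the Minkowski distance with min-max feature
 *  normalization, as discussed above. Illustrative only. */
public class MinkowskiDemo {

    // Scale every feature (column) to [0, 1] so that no single
    // large-scaled feature dominates the distance.
    static double[][] minMaxNormalize(double[][] data) {
        int n = data[0].length;
        double[] min = new double[n], max = new double[n];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : data)
            for (int k = 0; k < n; k++) {
                min[k] = Math.min(min[k], row[k]);
                max[k] = Math.max(max[k], row[k]);
            }
        double[][] out = new double[data.length][n];
        for (int i = 0; i < data.length; i++)
            for (int k = 0; k < n; k++) {
                double range = max[k] - min[k];
                out[i][k] = range == 0 ? 0 : (data[i][k] - min[k]) / range;
            }
        return out;
    }

    // d_p(x, y) = (sum_k |x_k - y_k|^p)^(1/p); p = 2 gives Euclidean distance.
    static double minkowski(double[] x, double[] y, double p) {
        double sum = 0;
        for (int k = 0; k < x.length; k++)
            sum += Math.pow(Math.abs(x[k] - y[k]), p);
        return Math.pow(sum, 1.0 / p);
    }

    public static void main(String[] args) {
        // The second feature is on a much larger scale than the first;
        // without normalization it would dominate the distance.
        double[][] data = { {1.0, 2000.0}, {2.0, 500.0}, {1.5, 1250.0} };
        double[][] norm = minMaxNormalize(data);
        System.out.println(minkowski(norm[0], norm[1], 2)); // Euclidean case
    }
}
```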
Another important question for clustering analysis is the choice of clustering metric. Data clustering algorithms are classified into different categories based on different clustering metrics. A few categories of commonly used clustering algorithms are introduced below.

2.1 Clustering analysis

(a) K-means algorithm and its variants

The K-means algorithm is the most well-known centroid algorithm; it assigns every data point to whichever cluster's center is nearest. Its basic procedure is first to select k points as initial group centroids, and then to repeat the following two steps until there is no change in the position of any cluster's centroid.

(1) Assign each point in the data set to the cluster with the nearest centroid.

(2) Re-calculate the position of each cluster's centroid.

The clustering metric for K-means is the total intra-cluster variance, which the algorithm seeks to minimize:

V = \sum_{i=1}^{k} \sum_{x_j \in S_i} \| x_j - \mu_i \|^2,

where S_i, i = 1, 2, ..., k, denote the k clusters and \mu_i is the centroid, or mean point, of cluster S_i.

A number of variants of this algorithm have been developed [6, 7, 8], such as fuzzy c-means [6], in which a data point, instead of belonging exclusively to a single cluster, belongs to all clusters with a degree of membership, and the sum of its membership coefficients over the clusters is one. Compared with other clustering algorithms, the class of K-means algorithms and its variants has a fast convergence rate and is relatively easy to implement. However, to apply the algorithm the user must provide a pre-specified value of k, i.e., the number of clusters, which is usually difficult to determine. In addition, this class of algorithms is susceptible to noise in the data set: since each data point must belong to a cluster, an outlier can distort the shapes of the clusters [2, 3].
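The two-step procedure above can be made concrete with a short sketch. The following minimal Java implementation is illustrative only, not the platform's code: it assumes Euclidean distance, random initial centroids, and a fixed iteration cap in place of the convergence test on centroid positions.

```java
import java.util.Random;

/** Minimal K-means sketch following the two-step procedure above.
 *  A fixed iteration cap stands in for the convergence test. */
public class KMeansSketch {

    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int k = 0; k < a.length; k++) s += (a[k] - b[k]) * (a[k] - b[k]);
        return s;
    }

    static int[] kMeans(double[][] data, int k, int maxIter) {
        Random rnd = new Random(0);
        double[][] centroids = new double[k][];
        for (int i = 0; i < k; i++)                 // step 0: pick k initial centroids
            centroids[i] = data[rnd.nextInt(data.length)].clone();
        int[] label = new int[data.length];
        for (int iter = 0; iter < maxIter; iter++) {
            for (int i = 0; i < data.length; i++) { // step 1: nearest-centroid assignment
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (sqDist(data[i], centroids[c]) < sqDist(data[i], centroids[best]))
                        best = c;
                label[i] = best;
            }
            double[][] sum = new double[k][data[0].length]; // step 2: recompute centroids
            int[] count = new int[k];
            for (int i = 0; i < data.length; i++) {
                count[label[i]]++;
                for (int d = 0; d < data[i].length; d++) sum[label[i]][d] += data[i][d];
            }
            for (int c = 0; c < k; c++)
                if (count[c] > 0)                   // guard against empty clusters
                    for (int d = 0; d < sum[c].length; d++)
                        centroids[c][d] = sum[c][d] / count[c];
        }
        return label;
    }

    public static void main(String[] args) {
        double[][] data = { {1, 1}, {1.2, 0.9}, {5, 5}, {5.1, 4.8} };
        System.out.println(java.util.Arrays.toString(kMeans(data, 2, 10)));
    }
}
```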
(b) Hierarchical methods

Hierarchical methods group data points into a tree of clusters. The data points are clustered by either an agglomerative (bottom-up) or a divisive (top-down) method, and the clustering procedure is presented as a dendrogram [2, 3, 9]. The clustering metric in hierarchical methods is the inter-group distance, defined by single linkage, complete linkage, or average linkage. Suppose D(r, s) denotes the distance between two clusters r and s. Single linkage is the minimum distance over all pairs of points drawn from the two clusters, i.e.,

D(r, s) = \min \{ d(i, j) : \text{point } i \text{ is in cluster } r \text{ and point } j \text{ is in cluster } s \}.

Complete linkage is the maximum distance over all such pairs, i.e.,

D(r, s) = \max \{ d(i, j) : \text{point } i \text{ is in cluster } r \text{ and point } j \text{ is in cluster } s \}.

Average linkage is the average distance over all pairs of points drawn from the two clusters, i.e., D(r, s) = T_{rs} / (N_r N_s), where T_{rs} is the sum of all pairwise distances between cluster r and cluster s, and N_r and N_s are, respectively, the sizes of clusters r and s. At each stage of an agglomerative method, the two clusters r and s with the minimum value of the clustering metric are merged into a single cluster.

The advantages of hierarchical methods are that a complete hierarchy of clusters is computed and visually illustrated, and that the number of clusters does not need to be specified in advance. However, these methods cannot make flexible adjustments once a splitting (divisive method) or merging (agglomerative method) decision has been made during the clustering process.
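The three linkage definitions above can be illustrated concretely. The following Java sketch computes D(r, s) for two clusters given as point arrays, using Euclidean distance for d(i, j); it is a hypothetical example, not part of the platform.

```java
/** Sketch of the three inter-cluster linkage distances defined above,
 *  for two clusters given as arrays of points. Illustrative only. */
public class LinkageDemo {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int k = 0; k < a.length; k++) s += (a[k] - b[k]) * (a[k] - b[k]);
        return Math.sqrt(s);
    }

    /** mode 0 = single (min), 1 = complete (max), 2 = average (T_rs / (N_r * N_s)). */
    static double linkage(double[][] r, double[][] s, int mode) {
        double min = Double.POSITIVE_INFINITY, max = 0, total = 0;
        for (double[] p : r)
            for (double[] q : s) {
                double d = dist(p, q);
                min = Math.min(min, d);
                max = Math.max(max, d);
                total += d;          // accumulates T_rs over all pairs
            }
        switch (mode) {
            case 0:  return min;
            case 1:  return max;
            default: return total / (r.length * s.length);
        }
    }

    public static void main(String[] args) {
        double[][] r = { {0, 0}, {0, 1} }, s = { {3, 0}, {4, 1} };
        for (int m = 0; m < 3; m++) System.out.println(linkage(r, s, m));
    }
}
```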
(c) Density-based clustering algorithms

In density-based clustering algorithms [10], clusters are interpreted as dense regions in the data space, separated by regions of low object density (noise). These regions may have an arbitrary shape, and the points inside a region may be arbitrarily distributed. For these algorithms, the clustering metric is the density of data points in a region. The key idea is that, for each data point of a cluster, the neighborhood of a given radius (Eps) must contain at least a minimum number of data points (MinPts). The performance of density-based clustering thus relies on the parameters Eps and MinPts.

Density-based clustering algorithms have some advantages over K-means and hierarchical clustering methods: they are more robust in identifying clusters of arbitrary shapes and sizes, and they can separate clusters from the surrounding noise [2, 3]. Their disadvantage is that they can be sensitive to the input parameters Eps and MinPts, which are difficult to determine in advance.
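The Eps/MinPts criterion can be illustrated with a short sketch. The code below implements only the core-point test at the heart of DBSCAN [10], namely whether the Eps-neighborhood of a point contains at least MinPts points; the full algorithm additionally expands clusters outward from core points. The class and sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the density criterion above: a point is a "core" point
 *  if its Eps-neighborhood contains at least MinPts points. */
public class DensityDemo {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int k = 0; k < a.length; k++) s += (a[k] - b[k]) * (a[k] - b[k]);
        return Math.sqrt(s);
    }

    /** Indices of all points within radius eps of data[i] (including i itself). */
    static List<Integer> epsNeighborhood(double[][] data, int i, double eps) {
        List<Integer> nbrs = new ArrayList<>();
        for (int j = 0; j < data.length; j++)
            if (dist(data[i], data[j]) <= eps) nbrs.add(j);
        return nbrs;
    }

    static boolean isCorePoint(double[][] data, int i, double eps, int minPts) {
        return epsNeighborhood(data, i, eps).size() >= minPts;
    }

    public static void main(String[] args) {
        double[][] data = { {0, 0}, {0.1, 0}, {0, 0.1}, {9, 9} };
        System.out.println(isCorePoint(data, 0, 0.5, 3)); // true: dense region
        System.out.println(isCorePoint(data, 3, 0.5, 3)); // false: noise point
    }
}
```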
(d) Expectation-Maximization (EM) algorithm and its variants

The EM algorithm [11] is used in statistics to classify each point into the most likely probabilistic model and to estimate the parameters of each model. EM regards the data set as incomplete and divides each data point into two parts: the observable features and the missing data. Its basic procedure is first to initialize the distribution parameters and then to repeat the following steps until the estimates of the distribution parameters converge.

(1) Expectation (E): compute the expectation of the likelihood, treating the latent variables as if they were observed.

(2) Maximization (M): re-estimate the parameters by maximizing the expected likelihood found in the E step.

The major disadvantages of EM algorithms are their sensitivity to the selection of initial parameters, the possibility of convergence to a local optimum, and a slow convergence rate [2, 3, 11]. Variants of EM have addressed these problems.
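To make the E and M steps concrete, the following sketch runs EM on a one-dimensional mixture of two Gaussians. It is a minimal illustration under assumed starting values and a fixed iteration count, not the clustering routine used by the platform.

```java
/** Minimal EM sketch for a two-component 1-D Gaussian mixture,
 *  illustrating the E and M steps above. Illustrative only. */
public class EmSketch {

    static double gaussian(double x, double mu, double var) {
        return Math.exp(-(x - mu) * (x - mu) / (2 * var))
                / Math.sqrt(2 * Math.PI * var);
    }

    public static void main(String[] args) {
        double[] x = {1.0, 1.2, 0.8, 5.0, 5.3, 4.9};       // toy data, two groups
        double[] mu = {0.0, 4.0}, var = {1.0, 1.0}, w = {0.5, 0.5};
        double[][] r = new double[x.length][2];             // responsibilities

        for (int iter = 0; iter < 50; iter++) {
            // E step: posterior probability that x_i belongs to each component.
            for (int i = 0; i < x.length; i++) {
                double p0 = w[0] * gaussian(x[i], mu[0], var[0]);
                double p1 = w[1] * gaussian(x[i], mu[1], var[1]);
                r[i][0] = p0 / (p0 + p1);
                r[i][1] = p1 / (p0 + p1);
            }
            // M step: re-estimate weights, means, and variances
            // from the responsibilities computed in the E step.
            for (int c = 0; c < 2; c++) {
                double n = 0, sum = 0, sq = 0;
                for (int i = 0; i < x.length; i++) { n += r[i][c]; sum += r[i][c] * x[i]; }
                mu[c] = sum / n;
                for (int i = 0; i < x.length; i++)
                    sq += r[i][c] * (x[i] - mu[c]) * (x[i] - mu[c]);
                var[c] = Math.max(sq / n, 1e-6);            // floor avoids degeneracy
                w[c] = n / x.length;
            }
        }
        System.out.printf("mu = [%.3f, %.3f]%n", mu[0], mu[1]);
    }
}
```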
3 Implementation of Clustering Platform

To our knowledge, no platform so far offers a simple and user-friendly environment for cluster analysis. We fill that gap by developing a general purpose cluster analysis platform, which implements several commonly used clustering algorithms and has an easy-to-use interface. The platform can be used for cluster analysis and validation of general data sets.

3.1 Development of the clustering platform

To implement the general purpose platform, we choose the Java programming language for its good support of graphical user interfaces (GUIs) and its platform independence. In the development process, NetBeans is used as the integrated development environment because of its visual tools that generate skeleton code. In addition, the GUI Builder of NetBeans is used to support a sophisticated yet simplified Swing Application Framework and Beans Binding.

In the platform, the computation is conducted by Matlab. Matlab is designed for convenient numerical computation, especially matrix manipulation, and includes many special functions developed for specific fields such as optimization and statistics. We choose Matlab because it is a high-performance language for computation, simulation, and visualization. In addition, a variety of open-source Matlab codes for clustering analysis are available, which saves us the time of writing these algorithms from scratch. The platform brings together a variety of resources for performing cluster analysis using Matlab [12].

The only drawback of developing the clustering algorithms in Matlab is that Matlab is not free software. To make our platform accessible to users at no extra financial cost, we installed a licensed copy of Matlab on our server [14], so users do not need to install Matlab on their local machines. To configure the Matlab server and call Matlab code from the Java interface, we first download the package from [13] and then do the following:

1. Copy exitform.fig, exitform.m, RemoteMatlab.jar, and StartMatlabServer.m to our working directory.

2. Copy the xmlrpc libraries from Apache and some commons libraries to our working directory.

3. To launch the Matlab server, type "StartMatlabServer" in the Matlab command window.

To be able to call Matlab procedures located on the server from the Java program during the development of the platform, we follow the steps below (an illustrative sketch of such a call appears after this list):

1. Install NetBeans.

2. Import the lib directory library into NetBeans.

3. Import RemoteMatlab.jar into NetBeans.

4. Create an instance of the MatlabClient class in the Java application.
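For illustration, the sketch below shows how a Java client could issue a command to a Matlab server over XML-RPC using the Apache XML-RPC client library mentioned above. The server URL, port, and the RPC method name "matlab.execute" are assumptions made for the sketch; the actual interface exposed by the RemoteMatlab package from [13] may differ.

```java
import java.net.URL;
import org.apache.xmlrpc.client.XmlRpcClient;
import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;

/** Hypothetical sketch of a Java-to-Matlab call over XML-RPC.
 *  The endpoint and RPC method name are assumed, not the
 *  confirmed API of the RemoteMatlab package. */
public class MatlabCallSketch {
    public static void main(String[] args) throws Exception {
        XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
        // Placeholder address; the platform's real server is hosted at [14].
        config.setServerURL(new URL("http://localhost:9999/xmlrpc"));
        XmlRpcClient client = new XmlRpcClient();
        client.setConfig(config);
        // Ask the server to run a Matlab command string; "matlab.execute"
        // is an assumed method name for illustration only.
        Object result = client.execute("matlab.execute",
                new Object[] { "labels = kmeans(data, 3)" });
        System.out.println(result);
    }
}
```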
3.2 Running the clustering platform

To run the platform, a user does not need to install any software; the user only needs to download the clusteringPlatform.jar file from our website [14] and run it on a machine with Internet access. After launching the downloaded file, the user sees a window like the one shown in Figure 1. The top half of the window shows the data set to be clustered, together with the number of data points and the number of attributes in the data set. The lower half of the window allows the user to choose which clustering algorithm to run on the current data set and to set the values of the parameters for that algorithm. For example, when the "k-means" tab is selected, as shown in Figure 1, the user inputs the number of clusters k and the number of iterations, and selects a similarity measure from the pull-down list. When the user clicks the "run" button, the K-means algorithm is called and runs on the Matlab server in the back end, and the results, including the output data file and the visual display, are sent back to the user and displayed in the platform on the user's machine.

To load a data set for clustering, the user can copy and paste the data set into the top window, or click the "file" menu and choose "open" to open the desired data file. The data format the platform accepts is similar to that of a relational database: each data entry is represented by one row, and each data entry consists of multiple attributes represented in columns separated by spaces or tabs; see the data set in Figure 1.

Figure 1: Clustering platform GUI

To illustrate the visual display of the clustering results, a few snapshots were taken. In Figure 2, the hierarchical clustering result is displayed for a randomly picked data set. The dendrogram display gives the user the merge distance and the points contained within each cluster at the level the user is currently viewing. The user may pick the level at which the dendrogram is drawn; the default display is the top level.

Figure 2: Hierarchical results

In Figure 3, the EM visual result is displayed. The chart shows the likelihood that each particular point belongs to a cluster: the darker the shade, the more likely the point belongs to that cluster. Figure 4 shows the K-means visual results, in which each cluster center is depicted as a circle. Figure 5 shows the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) visual results, in which each cluster is displayed in a separate color.

Figure 3: EM visual results

Figure 4: K-means visual results

Figure 5: DBSCAN visual results

4 Conclusion and Future Work

In this paper, a general purpose clustering analysis platform has been introduced. The platform integrates a number of commonly used clustering algorithms with a friendly user interface and both textual and visual display of results. One advantage of the platform is that it is extendable, so more clustering algorithms can easily be integrated in the future. Another advantage is that it is easy to use, so we expect it to satisfy the need of researchers in the social sciences to cluster data without writing a single line of code. A drawback of the visual display of results in our platform is that it cannot effectively visualize clustering results in high-dimensional data spaces.

After a number of clustering algorithms have been implemented in this platform, it will be important to validate the clustering results of different algorithms with different parameters. When assessing the output of a clustering algorithm, both the intraconnectivity and the interconnectivity of the clusters have to be taken into account to ensure that the clusters are compact and isolated; it is desirable to have high intraconnectivity and a low degree of interconnectivity. However, there are no general standards for cluster validation in the existing literature, except in well-prescribed subdomains. Our future work will address the problem of clustering evaluation and will integrate evaluation criteria into the platform to assist users in making better choices of clustering algorithms and in improving clustering quality.

Received: June 4, 2008. Revised: July 8, 2008.

References

[1] Wikipedia, the Free Encyclopedia, "Social Network", 25 Jun. 2008. Wikimedia Foundation, Inc., http://en.wikipedia.org/wiki/Social_network

[2] R. Xu and D. Wunsch II, Survey of clustering algorithms, IEEE Transactions on Neural Networks, 16(2005), No. 3, 645–678.

[3] A. K. Jain, M. N. Murty and P. J. Flynn, Data clustering: a review, ACM Computing Surveys, 31(1999), No. 3, 264–323.

[4] http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm

[5] http://www.rapid-i.com

[6] M. Sato, Y. Sato and L. C. Jain, Fuzzy Clustering Models and Applications, Physica-Verlag, 1997.

[7] Z. Huang, Extensions to the k-means algorithm for clustering large datasets with categorical values, Data Mining and Knowledge Discovery, 2(1998), 283–304.

[8] G. Hamerly and C. Elkan, Learning the k in k-means, Technical Report CS2002-0716, University of California, San Diego, 2002.

[9] N. Jardine and R. Sibson, The construction of hierarchic and non-hierarchic classifications, Computer Journal, 11(1968), 177–184.

[10] M. Ester, H. P. Kriegel, J. Sander and X. Xu, A density-based algorithm for discovering clusters in large spatial data sets with noise, Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, 1996, 226–231.

[11] G. McLachlan and T. Krishnan, The EM Algorithm and Extensions, New York: Wiley, 1997.

[12] http://www.dcorney.com/ClusteringMatlab.html

[13] http://plasmapowered.com/wiki/index.php/Calling_MatLab_from_Java

[14] http://caplatform.ias.csusb.edu