Mathematical Problems of Computer Science 40, 55–67, 2013.

Hierarchical Cluster Analysis for Partially Synthetic Data Generation

Levon H. Aslanyan and Vardan H. Topchyan

Institute for Informatics and Automation Problems of NAS RA
e-mail: lasl@sci.am, vardan.topchyan@gmail.com

Abstract

Limiting the risk of information disclosure is now a common concern for statistical agencies. One widespread approach is to release synthetic, public use microdata sets: by means of multiple imputation, the values of sensitive variables in the original data are replaced by new, synthetic values. This paper introduces a method for partially synthetic data generation based on hierarchical cluster analysis.

Keywords: Confidentiality, Multiple imputation, Synthetic data, Hierarchical clustering.

1. Introduction

Releasing sociological and/or economic data to public bodies is an integral part of the work of statistical organizations. However, this task must not entail the risk of disclosing sensitive or personal information. An analysis of the published literature in this area [1] reveals a variety of approaches to the problem, including variable recoding, data swapping, and the addition of noise. Although these methods mask the original data, protecting information in this way may distort the relationships between different segments of the data set, which in turn can lead to erroneous conclusions at the data analysis stage, for instance when standard statistical methods are applied. An alternative approach, which aims to preserve the functional relationships between the segments of the data set, is fully synthetic data generation [2].
In this case, the statistical organization should: (i) randomly and independently select the general format and content of the critical information units and integrate them into the corresponding set of expected synthetic data; (ii) impute new, synthetic values into the information units according to the selected strategy; (iii) release a number of generated synthetic data sets to the public. Various methods [3]-[5] exist for generating fully synthetic data that still yield meaningful results under standard statistical analysis. Despite the advantages of fully synthetic data, generating them is quite time-consuming. For this reason, statistical organizations often use partially synthetic data, a mix of original and synthesized values [6]. For example, a statistical agency may seek to protect the confidentiality of certain records, or may need to prevent the identification of several records. Accordingly, synthetic values are generated only for certain variables, while the values of the other variables remain unchanged. As with fully synthetic data, partially synthetic data also limit the disclosure risk while allowing meaningful results to be obtained with standard statistical analysis. Note that, by their nature, partially synthetic data support more accurate statistical calculations; for the same reason, the disclosure risk is higher than for fully synthetic data. Nevertheless, several algorithms [7]-[9] for generating partially synthetic data are already used by many statistical agencies (the US Federal Reserve Board, the US Bureau of the Census, etc.), which indicates the promise of this approach.
The situation described above served as the basis for our analysis of non-parametric methods for generating partially synthetic data sets used to compute simple estimands (averages, standard deviations, etc.) and to construct data-driven linear regression models [3]. The published literature [10] shows that one of the best known non-parametric approaches is the method of sequential imputation of variables [11], which uses the CART (Classification and Regression Trees) model [12] for this purpose. In this non-parametric method [11], a hierarchical clustering model can substitute for CART as an alternative approach.

The paper is organized as follows. Section 2 reviews partially synthetic data and the principles of their generation and analysis. Section 3 shows how a hierarchical clustering model can be used, in a CART-like way, to generate partially synthetic data. Section 4 concludes the discussion of using hierarchical clustering in partially synthetic data generation.

2. Partially Synthetic Data

2.1 Creation of Partially Synthetic Data

For the definition of partially synthetic data we use the notation of [13]. The process of generating partially synthetic data consists of two parts: (i) preprocessing of the data; (ii) replacement of the tagged values with synthetic ones. Formally, this process can be described as follows. Let U be the set of records (information units), U = {U_1, U_2, ..., U_N}, where each information unit U_i (1 ≤ i ≤ N) is characterized by p attributes (variables), Y = {Y_1, Y_2, ..., Y_p}; thus U can be viewed as an N × p matrix with entries y_{ij}, the value of variable Y_j for unit U_i. During preprocessing, information units and confidential variables (rows and columns of the matrix U) are selected, and threshold conditions for these variables are set.

Let n (n ≤ N) be the number of randomly selected information units considered in the current observation; denote these units by {U_{i_1}, U_{i_2}, ..., U_{i_n}}. Similarly, let d (d ≤ p) be the number of confidential variables, {Y_{j_1}, Y_{j_2}, ..., Y_{j_d}}. In addition, an N-block I = (I_1, I_2, ..., I_N) and a p-block J = (J_1, J_2, ..., J_p) of indicators are defined as follows:

I_r = 1 if U_r ∈ {U_{i_1}, U_{i_2}, ..., U_{i_n}} and I_r = 0 otherwise, for 1 ≤ r ≤ N;
J_k = 1 if Y_k ∈ {Y_{j_1}, Y_{j_2}, ..., Y_{j_d}} and J_k = 0 otherwise, for 1 ≤ k ≤ p.

The blocks I and J thus mark the selected rows and columns of U. As a result, the matrix U_obs = (U_rep, U_nrep) is defined. Here U_rep is an n × d matrix with the values of the confidential variables {Y_{j_1}, Y_{j_2}, ..., Y_{j_d}}, and U_nrep is an n × (p − d) matrix with the values of the other variables (replaced vs. not replaced).

Next, threshold conditions c_{j_1}, c_{j_2}, ..., c_{j_d} are set for the variables Y_{j_1}, Y_{j_2}, ..., Y_{j_d}. Based on these conditions, an n × d indicator matrix Z = [z_{rk}] is defined, where z_{rk} = 1 if the value of Y_{j_k} in unit U_{i_r} satisfies the threshold condition c_{j_k} and must be replaced, and z_{rk} = 0 otherwise. This matrix describes which values of the confidential variables need to be replaced. Thus, preprocessing defines the matrices U_obs = (U_rep, U_nrep) and Z. The resulting observed data set is denoted D = (U_rep, U_nrep, Z). The second part of partially synthetic data generation is the replacement process.
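As a toy illustration of this preprocessing step, the indicator matrix Z can be built from the threshold conditions as follows. This is a minimal Python sketch; the data values, variable indices, and thresholds are hypothetical, not taken from the paper.

```python
import random

random.seed(0)
# Toy observed data: n units, p = 3 variables; columns 0 and 1 are confidential.
n, p = 8, 3
U = [[random.uniform(0, 300000) for _ in range(p)] for _ in range(n)]

confidential = [0, 1]                      # indices j_1, ..., j_d of confidential variables
thresholds = {0: 225000.0, 1: 175000.0}    # threshold conditions c_{j_k} (illustrative)

# Indicator matrix Z (n x d): Z[i][k] = 1 means the value of the k-th
# confidential variable in unit i must be replaced by a synthetic value.
Z = [[1 if U[i][j] > thresholds[j] else 0 for j in confidential]
     for i in range(n)]
```

The replacement step then only touches the cells of U_rep whose Z entry equals 1.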
Namely, based on D = (U_rep, U_nrep, Z) and the selected method (algorithm), the tagged values of the U_rep matrix are substituted with synthetic ones. Replacements are made independently m times to generate m different partially synthetic data sets SD_i = (U^i_syn, U_nrep), 1 ≤ i ≤ m, where U^i_syn is the matrix of imputed (replaced) values of the i-th synthetic data set. The values in U_nrep are the same in all synthetic data sets SD_i, 1 ≤ i ≤ m. The generated partially synthetic data sets D_syn = {SD_1, SD_2, ..., SD_m} are the information that is provided to the corresponding organizations and the public.

2.2 Analysis of Synthetic Data

Based on the released synthetic data sets D_syn = {SD_1, SD_2, ..., SD_m}, the corresponding organization (in other words, the analyst) makes inferences about some population quantity Q = Q(Y); for example, Q can be an average of interest or the coefficients of a regression model. In each synthetic data set SD_i (i = 1, 2, ..., m), the analyst estimates Q by some value q_i and estimates the variance of q_i by some estimator v_i. It is assumed that the analyst determines q_i and v_i as if SD_i were in fact data collected from a random sample of U. This technique is common in the statistics of missing values, where approaches to unbiased data generation are needed [3]. The approach used in this article is to treat the pairs (q_i, v_i), i = 1, 2, ..., m, as in [3], as a sufficient characterization of the synthetic databases D_syn, and to construct an approximate posterior distribution of Q given D_syn, Pr(Q | D_syn), in analogy with the theory of multiple imputation for missing data [5]. The analyst can then obtain valid inferences for Q by combining the results q_i and v_i (i = 1, 2, ..., m).
The following quantities are needed for inference (see [3]):

q̄_m = (1/m) Σ_{i=1}^{m} q_i,
b_m = Σ_{i=1}^{m} (q_i − q̄_m)² / (m − 1),
v̄_m = (1/m) Σ_{i=1}^{m} v_i.

The analyst can then use q̄_m to estimate a scalar Q and T_m = v̄_m + (1 + 1/m) b_m to estimate the variance of q̄_m.

3. Hierarchical Clustering Model for Generation of Partially Synthetic Data

Two basic approaches can be distinguished in data approximation: approximation of the probabilistic distribution on the one hand, and heuristic data extension on the other. Even though the first seems more fundamental, heuristic approaches are more interpretable and applicable, especially when used by non-professionals. Such methods are also fast and can in many cases successfully replace their theoretical counterparts. Taking the use of the CART model in synthetic data generation as our example, we try to understand the role and power of hierarchical cluster analysis in the same role. It is equally important to understand how the inferences of these two approaches complement each other: a generated CART is pruned, much as a clustering is cut, and the clustering decides where to stop with the help of additional quality estimates.

3.1 Algorithm for Imputations

Our studies are based on hierarchical agglomerative clustering [14] for generating partially synthetic data sets. The principle of synthetic data generation is the same as in the algorithm considered in [11]; the only difference is that the clustering model is used as the tool for estimating the conditional distribution of a sensitive variable in the space where data need to be replaced. The replacement of sensitive variables proceeds sequentially, in descending order of the number of values to be replaced. Moreover, Bayesian bootstrapping [15] is used to generate the new, synthetic values. Formally, the process can be described as follows. Assume that the variables Y = {Y_1, Y_2, ..., Y_p} are continuous.
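The combining rules of Section 2.2 can be sketched in a few lines of Python. The q_i and v_i values below are purely illustrative, not taken from the paper's simulations.

```python
# Hypothetical point estimates q_i and variance estimates v_i, one per
# synthetic data set (m = 5 here, matching the simulations later on).
q = [132611.3, 132480.9, 132702.4, 132559.1, 132630.2]
v = [8.1e7, 7.9e7, 8.3e7, 8.0e7, 8.2e7]
m = len(q)

q_bar = sum(q) / m                                   # point estimate of Q
b_m = sum((qi - q_bar) ** 2 for qi in q) / (m - 1)   # between-imputation variance
v_bar = sum(v) / m                                   # average within-imputation variance
T_m = v_bar + (1 + 1 / m) * b_m                      # variance estimate for q_bar
```

An interval estimate for Q can then be based on q̄_m and the square root of T_m.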
Without loss of generality, suppose that the first d of the variables {Y_1, Y_2, ..., Y_p} are confidential, and that the matrix U_obs = (U_rep, U_nrep) consists of the first n records of the information unit set U. First, for each variable Y_k (1 ≤ k ≤ d), the number of its replacements, Σ_{i=1}^{n} z_{ik}, is computed from the indicator matrix Z. The variables Y_1, Y_2, ..., Y_d are then sorted in descending order of these counts and denoted Y_(1), Y_(2), ..., Y_(d). After that, the confidential variables are imputed sequentially.

Let Y_(k) be the current variable. First, we estimate the conditional distribution of Y_(k) in the space where data need to be replaced by using hierarchical agglomerative clustering. Naturally, clustering is performed on the information units satisfying the condition z_{i(k)} = 1, i = 1, ..., n; suppose there are n_k such units. At the start of clustering, each information unit is treated as a singleton cluster: C_init = {C_1, C_2, ..., C_{n_k}}. Then a sequential merging process begins: at each step the pair of closest clusters is merged, where the distance between two clusters is taken as the distance between their centers, and the Euclidean distance is used as the underlying measure:

d(C_i, C_r) = sqrt( Σ_{j=1}^{p} (c_{ij} − c_{rj})² ),

where c_i = (c_{i1}, ..., c_{ip}) and c_r = (c_{r1}, ..., c_{rp}) are the centers of clusters C_i and C_r, respectively. To limit the disclosure risk, this process continues while merging remains admissible according to one of the homogeneity validity measures [16]: SPR (semi-partial R-squared), RMSSTD (root-mean-square standard deviation), etc. The result is the set of final clusters C_fin = {C'_1, C'_2, ..., C'_t}. Then, in each cluster C'_l (1 ≤ l ≤ t), new values for Y_(k) are generated by the Bayesian bootstrap procedure.
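A minimal sketch of this centroid-based agglomerative step follows. As a stand-in for the SPR/RMSSTD criteria mentioned above, it uses a simpler stop rule (merge until every cluster reaches a minimum size); the points and the `min_size` value are illustrative.

```python
import math

def centroid(cluster):
    """Mean point of a cluster (a list of p-dimensional points)."""
    p = len(cluster[0])
    return [sum(x[j] for x in cluster) / len(cluster) for j in range(p)]

def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def agglomerate(points, min_size):
    """Repeatedly merge the two clusters with the closest centroids until
    every cluster holds at least min_size units (a simple stop rule)."""
    clusters = [[pt] for pt in points]          # C_init: one unit per cluster
    while any(len(c) < min_size for c in clusters) and len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for r in range(i + 1, len(clusters)):
                d = distance(centroid(clusters[i]), centroid(clusters[r]))
                if best is None or d < best[0]:
                    best = (d, i, r)
        _, i, r = best
        clusters[i] = clusters[i] + clusters[r]  # merge the closest pair
        del clusters[r]
    return clusters

# Three well-separated pairs of points; with min_size = 2 each pair
# should end up as one final cluster.
pts = [(1.0, 2.0), (1.1, 1.9), (5.0, 5.0), (5.2, 4.8), (9.0, 1.0), (8.8, 1.2)]
final = agglomerate(pts, min_size=2)
```

The O(n³) pairwise search is fine for a sketch; a production implementation would cache centroid distances and update them incrementally after each merge.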
Let Y^l = {Y^l_1, Y^l_2, ..., Y^l_{n_l}} be the set of values of Y_(k) in the corresponding cluster C'_l. The Bayesian bootstrap draws values from a donor pool; in this algorithm, Y^l is taken as the donor pool for C'_l. The Bayesian bootstrap method proceeds as follows:

1. Draw n_l − 1 uniform random numbers in the range (0, 1) and sort them in ascending order: a_0 = 0, a_1, a_2, ..., a_{n_l − 1}, a_{n_l} = 1.
2. Draw n_l uniform random numbers u_1, u_2, ..., u_{n_l} in the range (0, 1].
3. For each u_i (1 ≤ i ≤ n_l), determine the interval (a_{j−1}, a_j] that contains it, and replace the value Y^l_i by Y^l_j.

Note that some information units involved in the clustering may already contain new, synthetic values of the variables Y_(1), Y_(2), ..., Y_(k−1). To maintain imputation consistency, for every possible combination of the Y_(k) value and the new values of Y_(1), Y_(2), ..., Y_(k−1), the corresponding information units are searched, after which the Y_(k) imputations are carried out among them. Thus, sequential imputation of the confidential variable values generates a set of partially synthetic data. The whole process is repeated independently m times, and the generated synthetic databases D_syn are provided to the public.

3.2 Simulation Studies

Our simulation studies of the above method are based on public release data from the 2011 R.A. Household's Integrated Living Conditions Survey, which consists of N = 7872 records. Of the entire set of attributes (variables) that characterize these units, we are interested in only six. The variables of interest are described in Table 1.
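The three bootstrap steps above can be sketched in Python as follows; the donor pool values are illustrative.

```python
import random

def bayesian_bootstrap(donor_pool, rng):
    """Draw len(donor_pool) synthetic values from donor_pool using the
    Bayesian bootstrap: the n-1 sorted cut points a_1 < ... < a_{n-1}
    partition (0, 1] into n intervals, one per donor value."""
    n = len(donor_pool)
    cuts = sorted(rng.random() for _ in range(n - 1))  # step 1
    a = [0.0] + cuts + [1.0]                           # a_0 = 0, a_n = 1
    draws = []
    for _ in range(n):                                 # step 2
        u = rng.uniform(0.0, 1.0)
        # Step 3: find the interval (a[j-1], a[j]] containing u and
        # emit the corresponding donor value.
        j = next(k for k in range(1, n + 1) if u <= a[k])
        draws.append(donor_pool[j - 1])
    return draws

rng = random.Random(42)
pool = [101.0, 250.0, 320.0, 199.0, 275.0]   # hypothetical Y^l for one cluster
synthetic = bayesian_bootstrap(pool, rng)
```

Every synthetic value is an element of the donor pool, so the imputed values stay within the observed range of Y_(k) for that cluster.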
Table 1: Description of variables used in the empirical studies.

Variable         | Type           | Range                                | Notes
Monitory income  | Numeric(18,10) | 0 – 3512850                          |
Food purchase    | Numeric(18,10) | 0 – 555600                           |
Food consume     | Numeric(18,10) | 0 – 193191.8083296074                |
Nonfood purchase | Numeric(18,10) | 0 – 965100                           |
Expenditures     | Numeric(18,10) | 6256.3126710940 – 4672865.1342223603 | 13.795% of householders have expenditures of more than 175000 AMD
Total income     | Numeric(18,10) | 0 – 3529697.3182837632               | 13.643% of householders have a total income of more than 225000 AMD

Next, we assume that total income and expenditures are the sensitive variables and set their threshold conditions as follows: total income > 225000 and expenditures > 175000. In the empirical studies, each observed data set D consists of n = 315 households randomly sampled from the 7872 households. As a result of the simulation, m = 5 partially synthetic data sets SD_1, SD_2, ..., SD_5 are constructed for each D, where each SD_i (i = 1, 2, ..., 5) is generated using the algorithm presented earlier. Clustering for each sensitive variable is performed on the units that satisfy the threshold condition for that variable. As validity measures we use the minimal number of units in a cluster and the minimal number of distinct values of the sensitive variable in a cluster; in this simulation we require at least ten units with at least three distinct values in each cluster.

Table 2: Simple estimands for the sensitive variables.

Estimand                             | Q             | Avg. q_5
% of H.H. with total income > 300000 | 5.206344      | 5.3860208
% of H.H. with expenditures > 230000 | 6.031744      | 6.2291976
Average of total income              | 132357.290182 | 132611.3233328
Average of expenditures              | 109301.86175  | 109548.668466
Standard deviation of total income   | 94569.735448  | 95539.39169016
Standard deviation of expenditures   | 77458.3434    | 77376.82509436

Table 3: Regression model. Dependent variable: total income. Independent variables: monitory income, food purchase, expenditures.

Estimand        | Q          | Avg. q_5
Constant        | 11279.3796 | 14181.77272
Monitory income | 0.9662     | 0.90748
Food purchase   | -0.2676    | -0.20408
Expenditures    | 0.1652     | 0.16512
R               | 0.9844     | 0.93436

Table 4: Regression model. Dependent variable: total income. Independent variables: monitory income, expenditures.

Estimand        | Q         | Avg. q_5
Constant        | 8868.7488 | 11893.24448
Monitory income | 0.9638    | 0.89952
Expenditures    | 0.072     | 0.09412
R               | 0.9818    | 0.93268

Table 5: Regression model. Dependent variable: expenditures. Independent variables: total income, food purchased, food consumed, nonfood purchased.

Estimand          | Q          | Avg. q_5
Constant          | -1664.4878 | -825.07984
Total income      | 0.026      | 0.06612
Food purchased    | 1.017      | 1.00768
Food consumed     | 0.9924     | 0.90364
Food nonpurchased | 1.0886     | 0.906
R                 | 0.9842     | 0.90692

Table 6: Regression model. Dependent variable: expenditures. Independent variables: food purchased, food consumed, food nonpurchased.

Estimand          | Q        | Avg. q_5
Constant          | -168.794 | 3725.97836
Food purchased    | 1.0334   | 1.05348
Food consumed     | 1.0162   | 1.204032
Food nonpurchased | 1.1082   | 0.95308
R                 | 0.9838   | 0.90272

Tables 2 – 6 summarize the simulation results for a variety of estimands. Inferences are made using the methods presented in Section 2.2. For the simple estimands (Table 2), the averages of the synthetic point estimates are close to their corresponding Q. The average of the parameter R, which shows how much the independent variables explain the dependent variable, is greater than 0.900 for each regression model on the synthetic data sets (Tables 3 – 6), which indicates that these models are worth considering. In addition, the averages of the regression coefficients are close to the original values. So the analyst will reach the same inferences as in the case of the actual data.

Concerning the disclosure risk of each sensitive variable Y_{j_l} (l = 1, 2, ..., d), we assume that the analyst would estimate the Y_{j_l} value of unit i_r by averaging the replaced values:

ȳ_{i_r, j_l} = (1/m) Σ_{k=1}^{m} y^{rep,k}_{i_r, j_l}.

To assess that risk we calculate the root mean squared error (RMSE) and the relative root mean squared error (RelRMSE) of this estimator for each information unit:

RMSE_{i_r, j_l} = sqrt( Σ_{k=1}^{m} ( y^{rep,k}_{i_r, j_l} − ȳ_{i_r, j_l} )² / ((m − 1) m) ),
RelRMSE_{i_r, j_l} = RMSE_{i_r, j_l} / y_{i_r, j_l}.

For any data set, the distributions of RMSE_{i_r, j_l} and RelRMSE_{i_r, j_l} across all units with replaced values can be examined to ensure sufficient variability in the imputations. Table 7 displays the averages of these quantities across all simulations. The median of the RelRMSEs is typically around 13.5%, which indicates that the imputations for most records have a wide range of uncertainty. When the data owner requires larger errors, in the sense of further decreasing the disclosure risk of the sensitive variables, stricter validity measure criteria can be used in the clustering.

Table 7: Sensitive variable limitation in the simulation studies.

RMSE         | Min.     | 1st Quartile | Median
Total income | 5097.683 | 16711.692    | 35692.008
Expenditures | 5994.552 | 22308.272    | 42092.148

RelRMSE      | Min.  | 1st Quartile | Median
Total income | 0.019 | 0.062        | 0.10
Expenditures | 0.031 | 0.092        | 0.168

4. Conclusion

The simulation results show that the proposed clustering model can be used as an alternative approach to partially synthetic data generation, in a way similar to CART. The only limitation is that the attributes (variables) characterizing the information units must be continuous. The foregoing can serve as a reasonable prerequisite for developing the clustering model further, in order to generate synthetic data sets for mixed information units with both continuous and categorical attributes.

References

[1] L. Willenborg and T.
de Waal, Elements of Statistical Disclosure Control, New York: Springer-Verlag, 2001.
[2] D. B. Rubin, “Discussion: statistical disclosure limitation”, Journal of Official Statistics, vol. 9, pp. 462–468, 1993.
[3] T. E. Raghunathan, J. P. Reiter and D. B. Rubin, “Multiple imputation for statistical disclosure limitation”, Journal of Official Statistics, vol. 19, pp. 1–16, 2003.
[4] J. P. Reiter, “Significance tests for multi-component estimands from multiply-imputed, synthetic microdata”, Journal of Statistical Planning and Inference, vol. 131, pp. 365–377, 2005.
[5] D. B. Rubin, Multiple Imputation for Nonresponse in Surveys, New York: John Wiley and Sons, 1987.
[6] R. J. A. Little, “Statistical analysis of masked data”, Journal of Official Statistics, vol. 9, pp. 407–426, 1993.
[7] W. Alvey and B. Jamerson (eds), Record Linkage Techniques, Washington, D.C.: National Academy Press, 1997.
[8] J. M. Abowd and S. D. Woodcock, “Disclosure limitation in longitudinal linked data”, in Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, Amsterdam: North-Holland, 2001.
[9] F. Liu and R. J. A. Little, “Selective multiple imputation of keys for statistical disclosure control in microdata”, ASA Proceedings of the Joint Statistical Meetings, pp. 2133–2138, 2002.
[10] J. Drechsler, Synthetic Datasets for Statistical Disclosure Control: Theory and Implementation, Springer, 2011.
[11] J. P. Reiter, “Using CART to generate partially synthetic, public use microdata”, Journal of Official Statistics, vol. 21, pp. 441–462, 2005.
[12] L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Belmont, CA: Wadsworth, Inc., 1984.
[13] J. P. Reiter, “Inference for partially synthetic, public use microdata sets”, Survey Methodology, vol. 29, pp. 181–189, 2003.
[14] C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
[15] D. B.
Rubin, “The Bayesian bootstrap”, The Annals of Statistics, vol. 9, pp. 130–134, 1981.
[16] M. Halkidi, Y. Batistakis and M. Vazirgiannis, “Clustering validity checking methods: Part II”, ACM SIGMOD Record, vol. 31, pp. 19–27, 2002.

Submitted 05.09.2013, accepted 18.10.2013.

Hierarchical Cluster Analysis for Partially Synthetic Data Generation

L. Aslanyan and V. Topchyan

Summary (translated from Armenian)

Reducing the risk of disclosure of confidential information is today one of the main problems of statistical agencies. One of the best known methods applied to solve this problem is the development and release of so-called synthetic data sets. In other words, the values of confidential variables are replaced by new, synthetic values. This article presents a method for the creation/generation of partially synthetic data based on hierarchical cluster analysis.

Hierarchical Cluster Analysis for the Generation of Partially Synthetic Data

L. Aslanyan and V. Topchyan

Summary (translated from Russian)

Limiting the risk of disclosure of confidential information is today one of the main tasks of statistical agencies. One of the widely applied approaches is the release of synthetic data sets. In other words, by means of multiple imputation, the values of the confidential variables of the original data are replaced by new, synthetic values. This article presents a method for generating partially synthetic data based on hierarchical cluster analysis.