Mathematical Problems of Computer Science 41, 81--92, 2014. 81 Pair Correlations Preserving Model in Synthetic Data Generation Vardan H. Topchyan Institute for Informatics and Automation Problems of NAS RA e-mail: vardan.topchyan@gmail.com Abstract The risk of disclosure of confidential information increases by the statistical organizations, due to the large volume of data released to the public. The most common methods of limiting the risk of dicloure are synthetic data genaretion methods. Unfortunately, these methods have a heuristic nature, because they do not have a clear theoretical basis. In this work presented a formal model of synthetic data generation for pair correlation preservation Keywords: Synthetic data, Confidentiality, Disclosure limitation. 1. Introduction While providing state, economic or social data to public organizations it may be necessary to hide/protect the confidential component of the provided information. For this purpose, publishing organizations often resort to modification of initial data or their replacement by other synthetic data. New synthetic data [1] generated based on different models and algorithms; they should provide the analyzing organizations with the achievement of adequate conclusions. Synthetic data clearly distort the true picture of the data, but in addition, which is very important, they may cause distortion of links between different segments of the data sets, which, in turn, can lead to rougher erroneous conclusions at the data analysis stage. Therefore, in such cases, it is necessary together with the protection of personal information, also to ensure the safety of the same functional relations between the corresponding segments of data sets. The specified field intensively studied in the literature [2], [3] and [4]; there exist typical approaches and solutions. It often characterized as a problem of statistical disclosure limitation. This is because the data provided are mainly the object of statistical data analysis. Theoretically, as it is observed by some authors [1], the created tasks and methods of their solutions are similar to probabilistic problems with recovery of missing values [5], [6]. Analysis of the available methods for generating the synthetic data [7], [8], [9] indicates their heuristic nature. And this, in turn, means that the consistency of these methods substantiated by simulation methods and there is no theoretical validity of the use of a particular approach. In this paper we will make an attempt to formulate a formal model of the problem under consideration to identify and examine its very essence, in the case of saving of paired correlations, i.e. the study of structures of the input data Pair Correlations Preserving Model in Synthetic Data Generation82 themselves, natural limitations imposed on them by the methods of data processing as well as by various types of privacy requirements. 2. Model Description Consider themainmodel of the problem. Let U= , ,…, be a set of certain elements from which the considered input data tasks (information units set)composed.Let a certain data element be characterized by a set of attributesA= , ,…, := , ,…, , 1 ≤ ≤ . Without loss of generality, we assume that confidential information contains in first columns of data table. Given that the order of attributes not fixed within the meaning of our problem, by rearranging them we can ensure that the confidential information were only in the first columns, = , ,…, , ⊆A. In principle, as the so-called “categorical” as well as continuous attributes are considered, but in our study as confidential attributes we will restrict ourselves only to the consideration of continuous attributes. Let the intervals , ,…, of the real axis be the range of values of the attributes , ,…, , respectively. For these attributes introduce a set of threshold conditions which determines the degree of their confidentiality, C = , ,…, . The condition 1 ≤ ≤ defines the critical аrеа of attribute values . For the consideration simplicity assume that the conditions specify numerical intervals in the range of definition of the corresponding attribute , although the consideration of other restriction structures may be quite natural and useful. Let define the interval ( ̅ , ̿ ) of critical values in the range ( , ) of attribute definition, ≤ ̅ ≤ ̿ ≤ . Further, we assume that as additional information the setRis given, the elements of which represent (announce, declare) correlatedness (the quality of being correlated) between certain pairs of attributes of the set A, R= , ,…, .1 ≤ ≤ is a subset of A, ⊂A, which indicates the existence of correlation (or puts forward a demand of maintenance of the correlation form and degree) between the elements of this set of attributes. V. Topchyan 83 Fig. 1. The structure of the original data, critical intervals of values and system of correlativeness of the attributes. Thus, our task is protection / concealment of confidential attributes critical values, that are determined based on the threshold conditions set C = , ,…, , with the condition of maintaining pairing correlations of attributes A based on system R= , ,…, . 3. Analysis of Critical Areas of Confidential Attributes in Data Table When analyzing the critical areas of confidential attributes in the considered data table it is important to evaluate the predictability of values in these areas. So long as for some attribute∈ 1≤ ≤ the number (volume) of different from each other critical values is relatively small, then during their imputations the reduction of disclosure risk of confidential information contained in this attribute will be insignificant. In this regard, it is expedient to evaluate the information entropy [10] of critical values for each set of attribute . The results of simulations indicate that for each confidential attribute with a value of not less than 7.5 - 8 entropy the generation of synthetic data categorical in terms of limiting the risk of disclosure of confidential information is possible. For further analysis of critical areas of the attributes introduce necessary definitions. Definition 1: The attribute 1 ≤ ≤ is called single (isolated) if it is not correlated withone of the other attributes of the set A by the system of constraints R. The set of single attributes is denoted by , ⊆A. Definition 2: The attribute 1 ≤ ≤ is called linked if it is correlated with at least one of the otherattributes of the setA. The set of linked attributes is denoted by , ⊆A. It is easy to notice that the set of confidential attributes can be represented as a union of the corresponding subsets of single and linked attributes:= ′ ∪ ′ , U= … …… … C = , ,…, R = , ,…, … … ⋮ … … Pair Correlations Preserving Model in Synthetic Data Generation84 where ′ is the set of confidential single attributes, ′ ⊆ , and ′ − the set of confidential linked attributes ′ ⊆ . To analyze the permissible areas of modification/imputation of critical values of the attributes from , consider separately the sets ′ and ′ . Without loss of generality, assume that the first attributes of the set are single, ′ = , ,…, , and the rest ( − )- are linked ′ = , ,… . Consider the set ′ = , ,…, . Changes in the critical values of the elements of the set ′ in the construction of synthetic data may be carried out independently from each other, as they do not correlate with any of the attributes of the setA. Let 1 ≤ ≤ beacurrent attribute under consideration. We can assume that the critical values of this attribute are located in the upper part of the corresponding column; otherwise this representation can be obtained by consecutive relocations of certain rows of the table (Fig. 2). The above presented grouping may serve as a basis for consideration of changes in the critical values of the set in two separate areas: (1) change of values in critical area of the column, (2) change of values in the whole column. Fig. 2. Location scheme of single attribute values and areas of their confidential values. Specific change in the attribute values will depend on the supposed procedures of data analysis. Consider a simple example of calculating the mean value of the attribute under consideration. If there are not other limitations, then relocationscan be considered in the whole area (2). Relocations do not change the objective values, and they change their distribution by individuals - rows. In our simpleexample, there isgreater scope forchange. One can just take another arbitrary column in (2) with the same mean value, or, - in the area (1) if there is a condition of preserving noncritical values. Saving the mean value is a weak condition and it does not appear separately in practice, so that real changes will maintain the character of the receivable values of the attribute under consideration. Thus, the concealment of critical values of the attributes ′ = , ,… can be carried out independently of the rest of attributes of the set A, within the relevant field. To analyze the attributes ′ = , ,… , consider the set R= , ,…, . Further analysis will be based on the assumption that all attributes of the set presented in the system R and each of its elements contains at least one confidential attribute. Since, otherwise, if some element 1 ≤ ≤ does not contain any confidential attribute, then its V. Topchyan 85 consideration does not make sense, for the changes in the critical attribute values will nowise affect the connection represented by this element. Let ( − ) > 2. Assume also that the attributes taking part in the definition of the pair correlation for , ,…, , contain intersection. Consider the subsets , ∈R represent the correlation between the attributes , и , < ≠ ≠ ≤ , respectively, = , , = , . Assume also that the correlation between the attributes and is not additionally defined. In order to preserve communication over , it is necessary that the change of the values and were agreed. Namely, the values of should be changed in view of the appropriate attribute values , and vice versa. Similar judgments hold for and the attributes , . Obviously, the attribute depends as on , as well as on . Therefore, to preserve the correlation by , the attribute valuesandA should also be changed in agreement with each other. As a result, between the attributes and an interrelation arises subject to consideration of the attribute . The foregoing data allow us to introduce the following natural definition. Definition 3: Let's say that the attributes and are conditionally correlated, subject to consideration of the attributes , ,…, , if there exists a set of paired correlations,…, so that = , , = , ,…, = , . The conditional correlation of attributes , we denote by , ,…, = , . The next stage of analysis of the set ′ = , ,… was the study of a binary relation between those attributes for which the communication preserved by the system R=, ,…, . Consider the set of these attributes, denoted by (correlated). Definition 4: We say that the attribute enters into the binary relation with the attribute , , if they meet one of the following conditions:  Attributes and coincide: = ⇒ ,  Attributes , are correlated: ∃ ∈ ∈R, = , ⇒ ,  Attributes , are conditionally correlated: ∃ ,…, ∈ , ,…, = , ⇒ . It is obvious that satisfies the properties of reflexivity and symmetry:∀ ∈ ⇒ ,∀ , ∈ , ⇒ . Let us show that this relation also satisfies the transitivity property, namely:∀ , , ∈ , , ⇒ . Since , , then from the definition of the relation it follows that between the attributes , and , there exists either a direct or a conditional correlation. Then by virtue of Definition 3 the attributes and will be conditionally correlated. And this, in turn, means that enters into the relation α with the attribute : . Pair Correlations Preserving Model in Synthetic Data Generation86 Thus, the relation satisfies the properties of reflexivity, symmetry and transitivity, hence it is an equivalence relation. In this case divides the set into disjoint equivalence classes:= ∪ ∪…∪ ,∩ = ∅, 1 ≤ ≠ ≤ . Moreover, any two attributes of one and the same class are interconnected to each other, and between the attributes of various classes the correlation is missing. Thus, in order to conceal the confidential information contained in the attributes of the set ′ = , ,… , on the assumption of preservation of the pair relations in the system R= , ,…, , changes in critical values of the attributes , ,… may be carried out in the obtained equivalence classes , ,… , separately. For convenience, consider a particular case when an equivalence class= , , ,…, 1 ≤ ≤ contains two confidential attributes , ∈ . Since does not have common attributes with other classes of equivalence, then for data analysis it is worthwhile to group the values of its attributes as shown in Figure 4. ⋯ Fig. 4. Location scheme of linked attribute values and areas of their confidential values. As shown in Figure 4, the critical values of the attribute are situated in the interval of rows1, and at their concealment the appropriate values should be taken into account . Moreover, the rows of the given area apart from the other values of the attribute may contain also critical ones. Therefore, to ensure consistency during the imputations of the attribute values , , it is necessary that the rows in the interval 1, should be considered separately from the rest of the rows of the interval 1, . In order to generate categorical synthetic data, the imputations of critical values in the interval 1, should be carried out between the rows close / homogeneous by the attribute values and , on condition that the correlations between the elements of the class are preserved. Since, in our studies we restrict ourselves to the 1i . 1 2i . 3i . n . V. Topchyan 87 consideration of only continuous attributes as confidential ones, then to group the rows of this range the technique of decomposing hierarchical cluster analysis can be applied [11]. In this case, the data elements are considered as r-dimensional vectors consisting of values of the class attributes , which enables partitioning based on correlations between the attributes of this class. As a measure of distance between the data elements the Euclidean distance is considered, and as a measure of homogeneity of the obtained subsets - the measure of RMSSTD (Root- Mean-Square Standard Deviation) [12], which is equal to the mean square deviation of critical values of confidential attributes. In addition, it is more appropriate to carry out imputations by collection of these attributes instead of successive imputations for each of them. Due to this, the correlation between the attributes , will be preserved in the best possible way. Then, as for the other critical values of the attribute , they may be considered together with the values from the interval + 1, or without them, depending on the restrictions imposed on noncritical values of confidential attributes. In any case, to preserve the correlations between the elements of the class the imputations of these values should be made taking into account the conditional distribution of for the rest of the class attributes . According to the literature, [4], one of the most appropriate methods for determining the conditional distribution in the generation of synthetic data are CART trees (Classification and Regression Trees) [13]. CART trees are used to predict the values of the dependent variable based on a set of predictors. In this case, as a dependent variable is considered the attribute , and as predictors - the rest − 1 attributes of the class . The principle of constructing the CART trees consists in a recursive partition of the set of data elements under consideration into subsets that are homogeneous with respect to the dependent variable. Namely, at each step the best condition is determined for some predictor and a partition of the current set is produced (by growing it). As a result, in the leaves of the obtained tree the data elements will be contained with the same value of the dependent variable. Since the obtained tree may consist of unjustifiably large number of nodes and branches, then to reach an acceptable size of these trees their pruning is made on the basis of some optimality criterion. In essence, the leaves of the CART tree represent a conditional distribution of the dependent variable for the set of predictors under consideration. Subsequent imputations of the attribute values will consistently be carried out in each leaf. Similar statements hold also for the critical values of the attribute in the interval + 1, . Thus, from the above simple example one may conclude that if the equivalence class consists of attributes, = , , ,…, , the first of which are confidential, then the rows of the data table are divided into not more than 2 separate areas, each of which contains a specific combination of critical values of confidential attributes and is processed by one of the methods presented above. These model structures and analysis confirmed by computational experiments presented in the next section. 4. Simulation Studies The presented model has been approved based on the data of 2012 Household's Integrated Living Conditions Survey, provided by the "National Statistical Service" of the Republic of Armenia. From the entire set of attributes characterizing these data, we are interested only in the following six: FoodPurchased, FoodConsumed, NonFoodPurchased, Expenditure, MonitoryIncome, TotalIncome (Table 1). Pair Correlations Preserving Model in Synthetic Data Generation88 Table 1: Description of attributes of interest Name Label Type Description FoodPurchased FP Numeric(18,10) Foodpurchased of household per month in AMD. FoodConsumed FC Numeric(18,10) Foodconsumed of household per month in AMD. NonfoodPurchased NFP Numeric(18,10) Nonfood purchased of household per month in AMD. Expenditure E Numeric(18,10) Expenditures of household permonth in AMD. MonitoryIncome M Numeric(18,10) Monetary income of household per month in AMD. TotalIncome I Numeric(18,10) Total income of household per month in AMD. In our experiments we assume that the confidential information is contained in the attributes Expenditure and TotalIncome, = , , and as threshold conditions are considered >200000 and > 250000, respectively. As for pair correlations, which should be saved, they are as follows: = , , = , , = , и = , , i.e. R= , , , . It is obvious that in this case, when generating the synthetic data, only one equivalence class= , , , , will be considered. In addition, the imputations of critical attribute values and are implemented due to the methods of relocation and reevaluation of values. First of all relocation of values is carried out using the method of Bayesian bootstrapping [14]. After that if some values remain unchanged, then their reevaluation is made: (i) probabilistic density of these values is determined using the Gaussian kernel density estimator; (ii) new values are set using the inverse-cdf method. Finally, in the result of experiment, = 5 sets of synthetic data are generated and as resulting values of statistical quantities, their average values are taken calculated on these sets. Table 2: Mean and standard deviation of confidential attributes Mean Standard deviation I E I E Estimated value on original data set 132862.38 109818.56 105518.35 95060.952 Average of estimated values on synthetic data sets 132523.88 108960.52 100335.54 78792.5834 V. Topchyan 89 Table 3: Correlation coefficients Correlations coefficient , , , , Estimated value on original data 0.974 0.414 0.708 0.579 Avg. of estimated values on synthetic data 0.880 0.511 0.845 0.673 Table 4: Coefficient in regression of total income on monitory income, food purchase, expenditures Table 5: Coefficient in regression of total income on monitory income and expenditures Table 6: Coefficient in regression of expenditure on total income, food purchased, food consumed, nonfood purchased. Value on original data set Avg. of values on synthetic data sets Coefficients Constant 11279.38 17607.02 M 0.966 0.800 FP -0.268 -0.297 E 0.165 0.303 R 0.984 0.900 Value on original data set Avg. of values on synthetic data sets Coefficients Constant 8868.75 13756.59 M 0.964 0.800 E 0.072 0.214 R 0.980 0.894 Value on original data set Avg. of values on synthetic data sets Coefficients Constant -1664.49 1813.41 I 0.026 0.041 FP 1.017 0.919 FC 0.992 0.9262 NFP 1.089 1.110 R 0.984 0.957 Pair Correlations Preserving Model in Synthetic Data Generation90 Table 7: Coefficient in regression of expenditure on food purchased, food consumed, nonfood purchased The above presented tables contain the results of experiments conducted. As shown in Tables 2 and 3, the mean values of simple statistical quantities (mathematical expectation, mean square deviation) of confidential attributes and the coefficients of the corresponding pair correlations calculated on the sets of synthetic data, are close to the original ones. And this, in turn, indicates that the numerical characteristics of the attributes Totalincomeand Expenditure, as well as the primary relations are saved also in the synthetic data. Then, in Tables 4 - 7 the coefficients of linear regressions constructed on the original and synthetic data sets are shown. On the sets of synthetic data the values of the parameter R (indicating the degree of correctness of interpretation of a dependent variable from the independent ones) are in the vicinity of 0.9 and more, indicating that they are correct. In addition, the corresponding values of the regression coefficients are close to the original. Consequently, conclusions drawn from the analysis of synthetic data will correctly reflect the results of analysis of the original data. 5. Conclusion Problem of maintaining confidentiality of state, economic and social data in distributed computing related to new theoretical and applied research. As of today, the existing algorithms of their analysis are of a heuristic nature. In the present work the structure of these data was studied and a model was presented limiting the risk of disclosure of confidential information with preservation of paired connections between various segments of data. References [1] D. B. Rubin, “Discussion: statistical disclosure limitation”, Journal of Official Statistics, vol. 9, pp. 462–468, 1993. [2] T. E. Raghunathan, J. P. Reiter and D. B. Rubin, “Multiple imputation for statistical disclosure limitation”, Journal of Official Statistics, vol. 19, pp. 1–16, 2003. [3] J. Drechsler, Synthetic Datasets for Statistical Disclosure Control. Theory and Implementation, Springer, 2011. [4] J. Drechsler and J. P. Reiter, “An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets”, Computational Statistics & Data Analysis, vol. 55, no. 12, pp. 3232--3243, 2011. Value on original data set Avg. of values on synthetic data sets Coefficients Constant -168.794 3692.55 FP 1.033 0.943 FC 1.016 0.962 NFP 1.108 1.140 R 0.983 0.957 V. Topchyan 91 [5] T. E. Raghunathan, J. M. Lepkowski, J. van Hoewyk and P. Solenberger, ”A multivariate technique for multiply imputing missing values using a series of regression models”, Survey Methodology, vol. 27, pp. 85--96, 2001. [6] D. B. Rubin, Multiple Imputation for Nonresponse in Surveys, New York: John Wiley and Sons, 1987. [7] J. P. Reiter, “Using CART to generate partially synthetic, public use microdata”, Journal of Official Statistics, vol. 21, pp. 441-462, 2005. [8] G. Caiola and J. P. Reiter, “Random forests for generating partially synthetic, categorical data”, Transactions on Data Privacy, vol. 3, pp. 27--42, 2010. [9] J. Domingo-Ferrer, J. Magkos, (eds.), Privacy in Statistical Databases, New York: Springer, pp. 148--161, 2010. [10] В. Лидовский, Теория Информации, Москва, Спутник+, 2004 [11] L. Aslanyan and V. Topchyan, “Hierarchical cluster analysis for partially synthetic data generation”, Transactions of IIAP NAS RA, Mathematical Problems of Computer Science, vol. 40, pp. 55--67, 2013. [12] M. Halkidi, Y. Baristakis and M. Vazirgiannis, “On clustering validation techniques”, Journal of Intelligent Information Systems, vol. 17, no. 2-3, pp. 107--145, 2001. [13] L. Breiman, J. H. Friedman, R. A. Olshen and C.J. Stone, Classification and Regression Trees, Belmont, CA: Wadsworh, Inc., 1984. [14] D. B. Rubin, “The Bayesian bootstrap”, The Annals of Statistics, vol. 9, pp. 130–134, 1981. Submitted 20.12.2013, accepted 05.03.2014. Զույգ կորելյացիաների պահպանման մոդելը սինթետիկ տվյալների գեներացման միջոցով Վ. Թոփչյան Ամփոփում Կոնֆիդենցիալ տեղեկատվության բացահայտման ռիսկը մեծանում է ի շնորհիվ վիճակագրական կազմակերպությունների կողմից հանրությանը տրամադրվող մեծ քանակությամբ տվյալների: Այս խնդիրի լուծման ամենատարածված մեթոդներից են սինթետիկ տվյալների գեներացումը: Ցավոք, այդ մեթոդներն ունեն էվրիստիկ բնույթ, քանի որ նրանք չունեն հստակ տեսական հիմնավորում: Այս աշխատանքում ներկայացված է սինթետիկ տվյալների գեներացման ֆորմալ մոդելը, որն ապահովում է զույգ կորելյացիաների պահպանությունը: Pair Correlations Preserving Model in Synthetic Data Generation92 Модель сохранения парных корреляций при генерации синтетических данных В. Топчян Аннотация Риск раскрытия конфиденциальной информации увеличивается в связи с большим объемом данных, предоставляющимися статистическими организациям и общественности. Наиболее распространенными методами для решения данной проблемы являются методы генерации синтетических данных. К сожалению эти методы имеют эвристический характер, потому что они не имеют четкой теоретической основы. В этой работе представлена формальную модель генерации синтетических данных, обеспечивающих сохранение парных корреляций.