Microsoft Word - cet-01.docx CHEMICAL ENGINEERING TRANSACTIONS VOL. 46, 2015 A publication of The Italian Association of Chemical Engineering Online at www.aidic.it/cet Guest Editors: Peiyu Ren, Yancang Li, Huiping Song Copyright © 2015, AIDIC Servizi S.r.l., ISBN 978-88-95608-37-2; ISSN 2283-9216 The K-Anonymization Method Satisfying Personalized Privacy Preservation Jinling Song*, Liming Huang, Gang W ang, Yan Kang, Haibin Liu Hebei Normal University of Science & Technology, Qinhuangdao 066004, China. songjinling99@126.com Even if k-anonymity model can prevent publishing data from disclosing privacy effectively and efficiently, due to the uneven distribution of the sensitive data, ordinary k-anonymization method cannot guarantee each tuple satisfying the personalized privacy requirement of it’s data owner although the publishing table has been satisfied k-anonymity constraint. The reason which k-anonymity table fails to satisfy personalized privacy requirement is analyzed firstly, then Correlate degree of Sensitive Values, Leakage Collection, privacy disclosure metric and data quality metric are presented. At last an anonymization method satisfying personalized privacy requirements is presented, in which a utility-driven adaptive clustering method is proposed to partition tuples with similar best data quality. 1. Introduction The general process of privacy preserving data publishing is: firstly, the data owners offer the data publisher their individual data containing sensitive information, and then, the data publisher treats the collected data uniformly to meet certain privacy model, finally, the data publisher publishes the data satisfying privacy requirement to the data receiver for statistical analysis. In the data release process, although the data owner has authorized the data publisher to deal with his individual data for protecting privacy, the uniform treatment of data may not meet the privacy requirement of each individual. So, this leads to the problem of personalized privacy preservation, that is, each data owner will specify the privacy preservation level for his data independently and the publishing data set should satisfy the privacy requirement of each individual. K-anonymity is a typical privacy model, which reduces privacy leakage based on generalizing the values on quasi-identifier attributes (called k-anonymizaiton). Due to the sensitive value do not distribute symmetrically, ordinary k-anonymization method may lead to the same sensitive value appears frequently in a QI-group. Even the generalizing data set meet the k-anonymity constraint, some tuples may violate the privacy requirements of their owner, which implicates that the disclosing risk is still exist. Table 1 is the diagnosis records published by a hospital, in which the sensitive attribute is “disease”. To resist the linking attack, we can k-anonymize Table 1 before publishing. Assume the privacy request of each data owner is 50%, Table 1 should be 2-anonymous from the principle of the k-anonymity model. Table 3 is 2-anonymous table of Table 1 on {Age, Postcode}. But a closer look reveals that the risk of privacy leakage still exists in it: attacker can deduce that Zhao is related to the fourth tuple or fifth tuple in Table 3 according to his Age (12) and Postcode (“22000”), so he can infer that Zhao suffers from “pneumonia” because the disease is same in the two tuples. What we can deduce from the example is that the uniform k-anonymization of publishing data set can’t preserve the privacy of each individual even the privacy requirement of each one is consistent. The reason is the personal privacy requirements are neglected in the k-anonymization process and result in the slope of sensitive value. So, the k-anonymization method need to consider the privacy requirement of each individual adequately, we present a k-anonymization method which aims at personalized privacy preservation. DOI: 10.3303/CET1546031 Please cite this article as: Song J.L., Huang L.M., Wang G., Kang Y., Liu H.B., 2015, The k-anonymization method satisfying personalized privacy preservation, Chemical Engineering Transactions, 46, 181-186 DOI:10.3303/CET1546031 181 Table 1: Medicine diagnosis records Age Zipcode Diease 5 12000 gastric ulcer 9 14000 dyspepsia 8 19000 bronchitis 12 22000 pneumonia 19 24000 pneumonia Table 2: Voter information list Name Age Zipcode Li 5 12000 Wang 9 14000 Wang 6 18000 Liu 8 19000 Chen 7 17000 Zhao 12 22000 Table 3: 2-anonymous table Age Zipcode Diease [5-10] [10001-20000] gastric ulcer [5-10] [10001-20000] dyspepsia [5-10] [10001-20000] bronchitis [11-20] [20001-25000] pneumonia [11-20] [20001-25000] pneumonia 2. Related work K-anonymity privacy protection model (L. Sweeney (2002)) got the wide attention of experts and scholars when it was presented by L. Sweeney. Previous studies mostly focus on k-anonymization algorithm under different scenarios. Datafly algorithm was adopted by L. Sweeney (1997), which have promoted the generation of k-anonymity model. To improve the data precision of the generated table, Mingen algorithm was adopted in (L. Sweeney (2002)). The global Incognito algorithm was proposed by K. Lefvre et al (2005), which generalize all the domain values of attributes. Multi-dimensional algorithm was proposed by Kristen LeFevre et al (2006), which generalize multi-attributes at the same time. Fung B C M et al presented the TDS (top-down specialization) algorithm which achieves the k-anonymity by gradual specialization from the most generalization state (attribute values are represented by the root nodes in classification tree). To preserve the clustering information of anonymous data, Fung B C M et al (2009) extended TDS algorithm. Bo WANG and Jing YANG (2012) proposed a local coding anonymous algorithm was proposed based on the attribute hierarchy. Although the above algorithms are all excellent in their considering scenarios, no one consider the personalized privacy requirement. Zakerzadeh H and Osborn SL (2011), Cao JM et al. (2011) presented the anonymization method for numerical streaming data. HYUNJI L and JAE-W OO C (2013) present Density- based k-anonymization scheme in location-based services. Yuan MX et al (2011), M.E.S karkala et al. (2011) utilized the k-anonymity model in social network. Jinling SONG et al (2014) presented a multi-objective optimization method for selecting the value of k in the k-anonymity Model. 3. Basic definitions Definition 1 (Quasi-identifier, QI) If a dataset T (AQI, AS) can be connected with other dataset by attributes AQI={Ai,…,Aj} and re-identify the privacy of some individuals. Then the attribute set AQI is called Quasi- identifier. Definition 2 (Generalization Tree, Gtree) Let D is a finite domain of the attribute Ai, then the Generalization Tree of Ai is a tree whose leaves are attribute values in D and each non-leaf node is a value summarizing it’s children. Gtrees of the numeric attribute Age and Categorical attribute Disease are shown in Figure 1. 182 Definition 3 (Generalization) Let the Gtree of attribute Ai is GTAi, the generalization of Ai is the process of mapping a value v to its ancestor in in GTAi. The value has been generalized is called generalization value. Definition 4 (k-anonymization) For the dataset T (AQI, AS), if we generalize the values on AQI and get a dataset T* in which each tuple has at least k-1(K≥2) other tuples same with it on AQI, then T* is the k-anonymized dataset of T, the generalization process from T to T* is called k-anonymization. Definition 5 (QI Group) For a k-anonymous dataset T*(AQI, AS), the generalized tuples with the same value in T*[AQI] are called a QI group, i.e. QG. The QI groups in T* are denoted as QG (T*) = {QG1, QG2, QGm}, where |QGi |≥k,and |QG1|+|QG2|+…+|QGm|= |T*| . 4. The metric for privacy leakage Definition 6 (Correlate degree of sensitive values) Let t* and t*i are generalizing tuples in the k-anonymous table, t is the original tuple of t*, the correlate degree of the sensitive value of tuple t* and t*i is denoted as DCorrelate (t*, t*i). DCorrelate (t*, t*i)=1/Nleaf(t*i) when t[AS] is a leaf of the tree rooted as t*i [AS], where Nleaf(t*i) is the count of the leaves in the tree t*i [AS]. DCorrelate (t*, t*i)=1 if t[AS] is same as t*i [AS], else DCorrelate (t*, t*i)=0. Definition 7(Leakage Collection) Let t* is a tuple in the anonymous table, all the tuples relevant with the privacy leakage of t* in it’s QI group is called the Leakage collection of t*, denoted as LC(t*). LC(t*) expresses the tuple set composed by t*i making DCorrelate (t*, t*i)>0 in the QI group containing t*. Definition 8 (Privacy leakage probability of t*) Let the QI group including t* is QGt, then the privacy leakage probability of t* (denoted as Pleak (t*)) only associated with other tuples in the QGt. Let the cardinality of QI is b, then the probability that an attacker matches t* to any t*i is 1/b, so the privacy leakage probability that t* correlates to t*i is DCorrelate (t*,t*i)/b. Then b* * * Correlate i=1 ( ) ( , ) / i Pleak t D t t b  , where t * i represents the ith tuple in QGt. (a) Part of GTree about “Age” (b) Part of GTree about “Disease” Figure 1: Gtree about attribute Age and Disease 5. The Metric of Information Left Degree Definition 9 (Information Left Degree of value) After the attribute value v is generalized to v*, the size of information left in v* is called information left degree of v*, denoted by ILeft_value(v*), * * The number of leaf nodes in SubT( )-1 _ ( ) 1 The number of all leaf nodes in GTree v ILeft value v   , SubT(v*) is the child tree rooted by v* in GTree. Definition 10 (Information Left Degree of tuple) Let the generalized tuple of t is t*, and d is the sum of dimensions of AQI and AS in the tuple t, then the information left degree of tuple t* (denoted as ILeft_tuple(t*) ) can be expressed as ILeft_tuple(t*)= (ILeft_value(t*[AS])+ ILeft_value(t*[AQI]))/d. 183 Definition 11 (Information Left Degree of Anonymous Table) Let T* is the anonymous table of original table T, the information left degree of T* can be expressed as | |* * * 1 * _ ( ) _ ( )/ | | T i i ILeft table T ILeft tuple t T    . 6. The Anonymization Method Satisfying Personalized Privacy Preservation Considering the limitation and insufficiency in traditional k-anonymization methods, our method will not only satisfy the privacy requirements of each data owner but also persist as more as information. The algorithm contains the following steps: firstly, we k-anonymize the original table by a utility-driven adaptive clustering method which can partition tuples with best data quality similarly, Secondly, check each QI group and call SA_Generalization algorithm to ensure each tuple satisfy personalized privacy requirement. To partition tuples in a table with less information loss (that is with best data quality), the intuition is that the tuples in a partitioned group must be more similar and the tuples in two groups must be have more difference. If we sort the tuples on QI attributes, the adjacent tuples will be more similar than other tuples which are far away. So, we can partition the front k tuples to a group and the group will have least information loss naturally. To the (k+1)th tuple, it can belong to this group or the next group as the first tuple, we decide it by judging the information left degree change which can make the best data quality of (k+1)th tuple greeedly, so it can make the anonymity table with best data quality similarly. Theorem 1. Let the k-anonymous table of the table T is T*, and the QI group of T* that tuple t* locates in is QGt, and b is the cardinality of QGt (that is b=|QGt|), then the minimum privacy leakage probability of tuple t* is 1/b. By Theorem 1, if the privacy requirement of data owner is less than 1/b, then the tuple t* will not satisfy the privacy requirement whatever to do for it, that is the anonymous table fails to satisfy privacy requirement. Theorem 2. For any tuple t* in a QI group, if there is a tuple t*i (t*i≠t*) in the QI group satisfy DCorrelate(t*,t*i)>0, then generalizing the sensitive attribute value of the t*i will decrease Pleak(t*). Anonymity_Generalization (T, AQI, pleakj (1≤j≤|T|)) Input: publishing table T, quasi-identifier attributes AQI, the privacy requirements (maximum privacy leakage probability) of each data owner pleakj (1≤j≤|T|); Output: the anonymous table T* satisfying personalized privacy requirements 1. T*=; /*T* stores the anonymous table */ 2. flag=false; /* identify whether the anonymous table satisfy the personalized privacy requirements */ 3. T =Sort table T on AQI; 4. QGi =the front k tuples of Tto an anonymized group; 5. while (l<|T|) /* |T| is the number of the tuples in publishing table T*/ 6. {if (ILeft_table(QGitl)-ILeft_table (QGi) ILeft_tuple (tl { the k tuples of T from tl })) 7. QGi = QGi  tl; 8. else 9. {T = T- QGi; 10. QGi= generalizing QGi on AQI; 11. T* = T* QG*i; 12. QGi =the front k tuples of T; 13. }} 14. for each QGi T* 15. {for each (t*j in QGi) 16. if(pleakj<1/| QGi|) then 17. {flag=true; 18. return; } 19. if (not flag) 20. QGi =SA_Generalization(QGi, pleakl(1≤j≤|QGi|)); } 21. return T*; SA_Generalization algorithm is called to dispose the group QGi to satisfy the privacy requirements given by data owners. It examines each tuple in the QGi, if the privacy leakage probability of a tuple t* is greater than the privacy requirement, then the sensitive value that has the largest information left degree in the leakage collection LC (t*) will be generalized to make it achieve the requirement. SA_Generalization (QGi, pleakj (1≤j≤b)) Input: QGi (containing tuple t*1, t*2,...,t*b), the privacy requirements pleakj (1≤j≤b) of each data owner in QGi Output: QGi satisfying the privacy requirements of each data owner 1. For each tuple t*j∈QGi (1≤j≤b) 2. if (Pleak (t*j)>pleakj) then 3. {Compute the leakage collection LC (t*j) of tuple t*j; 4. do{ 5. search the tuple r in LC(t*j), which has the largest information left degree; 184 6. if (r[AS]= the root of the GTAi (AS)) 7. LC(t*j)= LC(t*j)-{r}; 8. else 9. r[AS]= the parent of r[AS]; 10. } while(Pleak(t*j)> pleakj); 11. } 7. Experiment Our experimental environment is: Intel Pentium IV CPU, Memory 2G, Microsoft Visual C++ 6.0 and SQL Server. The data we selected is a company employees database, where attributes are {country, age, sex, zipcode, profession}, {country, age, sex, zipcode} is quasi-identifier and {profession} is sensitive attribute. In the experiment we sets the privacy requirement of every data owners is pleak=0.25 (that is equivalent to k=4). We compare the variation of privacy preservation of Anonymity_Generalization and other traditional k- anonymization algorithm by changing tuples from 0 to 50000, Figure 2 shows the influence on the privacy leakage probability (here refers to the average disclosure probability). Figure 2: The experimental result As can be seen from Figure 2, when the tuples in the publishing table is less, both the Anonymity_Generalization algorithm and traditional k-anonymity algorithm has a lower privacy leakage probability, which satisfies the data owner’s requirements. But when the number of the tuples is larger enough, the privacy leakage probability of the anonymous table generated by traditional algorithm is obviously increased, while Anonymity_Generalization algorithm is almost invariable and less than the privacy requirement of data owners. It means that traditional k-anonymity algorithm cannot guarantee the privacy of the publishing data under the uneven distribution of the sensitive value, but Anonymity_Generalization algorithm can satisfy the privacy requirement of all data owners in deed. 8. Conclusions In this paper, we analyses the disclosure caused by k-anonymization even under uniform privacy requirements because the slope of sensitive values. We present a k-anonymization method which aims at personalized privacy preservation. The experiments show that our method can make the publishing dataset satisfying the privacy request of each data owner and prevent the slope attack effectively. Acknowledgements This work is supported by The Natural Science youth Foundation of Hebei Province (No. F2015407039), Hebei Province Department of science and technology Fund (NO.13227427), Hebei Province Department of Education Fund (NO. sz132075, sz141097, 2012-112), Doctoral Foundation of Hebei Normal University of Science & Technology (2013YB007, 2013YB001). References Cao J.M., Carminati B., Ferrari E., Tan K., 2011, CASTLE: Continuously anonymizing data streams. IEEE -352. [DOI: 10.1109/TDSC.2009.47] 185 Fung B.C.M., Wang K., Yu P.S., 2007, Anonymizing Classification Data for Privacy Preservation, IEEE Transactions on Knowledge and Data Engineering, 19(5): 711-725. [DOI: 10.1016/j.datak.2008.12.001] Fung B.C.M., Wang K., Wang L., 2009, Privacy-Preserving Data Publishing for Cluster Analysis, Data & Knowledge Engineering, 68(6): 552-575. [DOI: 10.1145/1857947.1857950] Hyunji L., Jae-Woo C., 2013, Density-based k-anonymization scheme for preserving users’ privacy in location- based services. Lecture Notes in Computer Science, 7861: 536-545. [DOI: 10.1007/978-3-642-38027- 3_57] Lefvre K., DeWitt D., Ramakrishnan R., 2005, Incognito: Efficient full-domain k-anonymity, In Proceedings of the International Conference on Management of Data, 49-60. [DOI: 10.1145/1066157.1066164] LeFevre K., DeWitt D.J., Ramakrishnan R., 2006, Mondrian Multidimensional K-Anonymity, In Proceedings of International Conference on Data Engineering, 25. [DOI: 10.1109/ICDE.2006.101] Li J.Y., Wong R.C.W., Fu A.W.C., 2008, Anonymisation by Local Recoding in Data with Attribute Hierarchical Taxonomies, IEEE Transactions on Knowledge and Data Engineering, 20(9): 1181-1194. [DOI: 10.1109/TKDE.2008.52] Song J.L., Huang L.M., Zhang C., Zhang G.B., 2014, Multi-objective Optimization of K in K-anonymity Model, Journal of Computational Information Systems, 10(22): 9759- 9770. [DOI: 10.12733/jcis13119] Skarkala M.E., Maragoudakis M., Gritzalis S., 2012, Privacy Preservation by k-Anonymization of Weighted Social Networks. In Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM), IEEE Computer Society: 423-428. [DOI: 10.1109/ASONAM.2012.75] Sweeney L., 2002, K-Anonymity: a model for protecting privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5): 557-570. [DOI: 10.1142/S0218488502001648] Sweeney L., 1997, Guaranteeing anonymity when sharing medical data: the Datafly system, In Proceedings of the 1997 AMIA Annual Fall Symposium, 4 (suppl): 51-55. [DOI: 10.1109/ITAB.1997.649428] Sweeney L., 2002, Achieving k-anonymity privacy protection using generalization and suppression, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5): 571-588. [DOI: 10.1142/S021848850200165X] Wang B., Yang J., 2012, Personalized (α, k)-Anonymity Algorithm Based on Entropy Classification, Journal of Computational Information Systems, 8(1): 259- 266. [DOI: 10.12733/jcis11309] Yuan M.X., Chen L., Yu P.S., Yu T., 2013, Protecting sensitive labels in social network data anonymization. IEEE Transactions on Knowledge and Data Engineering, 25(3): 633-647. [DOI: 10.1109/TKDE.2011.259] Zakerzadeh H., Osborn S.L., 2011, FAANST: Fast anonymizing algorithm for numerical streaming data. In Proceedings of the 5th International Workshop on Data Privacy Management and 3rd International Conference. on Autonomous Spontaneous Security, 36-50. [DOI: 10.1007/978-3-642-19348-4_4] 186