Original article Biomath 2 (2013), 1212127, 1–5 B f Volume ░, Number ░, 20░░ BIOMATH ISSN 1314-684X Editor–in–Chief: Roumen Anguelov B f BIOMATH h t t p : / / w w w . b i o m a t h f o r u m . o r g / b i o m a t h / i n d e x . p h p / b i o m a t h / Biomath Forum Rough Sets in Biomedical Informatics Antony Popov and Simeon Stoykov Faculty of Mathematics and Informatics, Sofia University “St. Kliment Ohridski” 5 James Bourchier Blvd., 1164 Sofia, Bulgaria E-mail:simeon@microdicom.com Received: 30 October 2012, accepted: 12 December 2012, published: 21 January 2013 Abstract—The intent of this paper is to face the essen- tials of granular computing and in its major component— the rough sets theory, introduced by Pawlak, since any rough set represents an information granule. As a part of modern soft computing paradigm, rough sets have been in- troduced as an interval-like extension of the usual sets with main applications in the intelligent systems. The proposed rough approach provides efficient algorithms for finding hidden patterns in data, finds minimal sets of data (data reduction), evaluates significance of data. Applications in medicine via DICOM standard are presented, as well as ideas for applications to microbiology. Keywords-Rough sets; mathematical morphology; molec- ular biology classifier; medical diagnosis; medical imaging I. INTRODUCTION In classical set theory a set is uniquely determined by its elements. In other words, it means that every element must be uniquely classified as belonging to the set or not. That is to say the notion of a set is a precise, or crisp one. For instance, the set of integer numbers is crisp because every number can be uniquely represented by its decimal digits. In mathematics traditionally crisp notions are mainly use to ensure precise reasoning. How- ever philosophers and natural scientists for many years were interested also in imprecise notions like feelings, moral categories, beauty, including also many biological features like the color of the skin or a flower. The rough set approach makes the vagueness of the data possible. It provides efficient algorithms for finding hidden patterns in data, finds minimal sets of data (data reduction), evaluates significance of data. Applications in medicine via DICOM standard are presented in this paper, as well as applications to microbiology and biometrics. Strictly speaking, any rough set represents an information gran- ule. As an example, in gray scale images boundaries between object regions are often ill defined because of grayness or spatial ambiguities. This uncertainty can be effectively handled by describing the different objects as rough sets with upper (or outer) and lower (or inner) approximations as follows: Let the universe U be an image consisting of a collec- tion of pixels. Then if we partition U into a collection of non-overlapping windows of size m × n, each window can be considered as a granule G. Given this granulation, object regions in the image can be approximated by rough sets. A rough image is a collection of pixels and the equivalence relation induced partition of an image into sets of pixels lying within each non-overlapping window over the image. II. INFORMATION SYSTEMS AND ROUGH SETS Let U be a non-empty, finite set called the universe and A is a non-empty, finite set of attributes, that is every a ∈ A is a mapping of the form a : U → Va, where Va is called a value set of a. The elements of U are called objects and interpreted as, e.g. cases, states, processes, patients, observations. Attributes are inter- preted as features, variables, characteristic conditions, etc. Every information system A = (U,A) and non- empty set B ⊆ A determine a B-information function defined by InfB(x) = {(a,a(x)) : a ∈ B}. The set {InfA(x) : x ∈ U} is called A-information set and it is denoted by INF(A). With every subset of attributes B ⊆ A, an equivalence relation, denoted by INDA(B) Citation: A Popov, S Stoykov, Rough Sets in Biomedical Informatics, Biomath 2 (2013), 1212127, http://dx.doi.org/10.11145/j.biomath.2012.12.127 Page 1 of 5 http://www.biomathforum.org/biomath/index.php/biomath http://dx.doi.org/10.11145/j.biomath.2012.12.127 A Popov at al., Rough Sets in Biomedical Informatics (or IND(B)) called a B-indiscernibility relation, is as- sociated and defined by IND(B) = {(s,s′) ∈ U2 : for every a ∈ B,a(s) = a(s′)}. Any minimal subset B ⊆ A such that IND(A) =IND(B) is called a reduct in the information system. In fact, microcalcification on X-ray mammogram is a signifi- cant mark for early detection of breast cancer. Texture analysis methods can be applied to detect clustered mi- crocalcification in digitized mammograms [2]. In order to improve the predictive accuracy of the classifier, the original number of feature set is reduced into smaller set using feature reduction techniques. In [5] have been introduced rough set based reduction algorithms based on the extracted features. The rough reduction algorithms are tested on mammograms from Mammography Image Analysis Society (MIAS) database [8]. III. ROUGH SET FORMAL DEFINITION AND MAIN PROPERTIES Rough set theory can be viewed as a specific imple- mentation of fuzzyness and vagueness, i.e., imprecision in this approach is expressed by a boundary region of a set, and not by a partial membership, like in fuzzy set theory. However a rough set can be expressed by a fuzzy membership function, as demonstrated below, but it many cases the textual and table representation of a rough set makes it easier and more efficient to practical implementations rather than the original fuzzy approach (3). Moreover, the attributes may be numeric, or in the most cases non-numeric (categorical) quantities, such as big, small, good, malignant, benign etc. As we said previously, we represent the rough. objects x in the universe by their information vector InfA(x). Thus we can define an equivalence relation R between two objects x and y if their information representation coincides, i.e. InfA(x) = InfA(y), so x and y belong to a same information granule. Then the lower approximation of a rough set X with respect to R is the set of all objects, which can be for certain classified as X with respect to this relation. The upper approximation of a rough set X with respect to R is the set of all objects which can be possibly classified as X with respect to R. The boundary region of a set X with respect to R is the set of all objects, which can be classified neither as X nor as not- X with respect to R. For any crisp set the boundary is empty. Therefore we should mainly work with rough sets for which the boundary region of X is nonempty. The equivalence class of R determined by element x will be denoted by R(x). Formal definitions of approximations and the boundary region follow below. • R-lower approximation of X: R∗(x) = ⋃ x∈U {R(x) : R(x) ⊆ X}. • R-upper approximation of X: R∗(x) = ⋃ x∈U {R(x) : R(x)∩X 6=}. • R-boundary region of X: RNR(X) = R ∗(X)−R∗(X). It is easy to see that R∗(x) ⊆ X ⊆ R∗(x). Thus we can define a fuzzy membership function, i.e. a fuzzy representation of the rough set X in the universe X with respect to the relation R, see [4]: µR(X) = #(R(x)∩X) #(X) . Note, that in the definitions above, X is a normal precise subset of the universe U, but we have constructed its rough representation with respect to the relation R - the pair (R∗(X),R∗(X)) and its fuzzy analog µR(X). Here the sign # means the cardinality (the number of the elements) of a set. The lower approximation is sometimes referred to as positive region, while the space of the universe outside the upper approximation is called also negative region. It has been mentioned by Bloch [7] that there is an analogy between rough sets and mathematical morphol- ogy. Namely, the R-upper approximation is an analog of morphological dilation, while the R-lower approxima- tion is an analog of morphological erosion. This fact is not surprising, since the relations between fuzzy sets and operations on them and morphology are well studied [1], and a relation between classical interval operations and morphological ones have been established. Moreover, it is evident that the rough approximation of a set is similar to an interval approximation of a real number. On the other hand, interval operations and mathemat- ical morphology have demonstrated their capabilities in solving problems in biomedicine. As an example, due to a complex nature of biomedical images, it is practically impossible to select or develop automatic segmentation methods of generic nature, that could be applied for any type of images, namely for either micro- and macroscopic images, cytological and histological Biomath 2 (2013), 1212127, http://dx.doi.org/10.11145/j.biomath.2012.12.127 Page 2 of 5 http://dx.doi.org/10.11145/j.biomath.2012.12.127 A Popov at al., Rough Sets in Biomedical Informatics ones, MRI and X-ray, and so on. Medical image segmen- tation is an indispensable process in the visualization of human tissues. However, medical images always contain a large amount of noise caused by operator performance, equipment and environment. This leads to inaccuracy with segmentation. So, a robust segmentation technique is required. The basic idea behind introducing rough sets is that while some cases may be clearly distinguished as being in a set X (positive region in rough sets theory), and some cases may be clearly labeled as not being in set X (negative region). Since we can obtain limited information we are not able to label all possible cases clearly. The remaining cases cannot be distinguished and lie in the boundary region. IV. ROUGH SET SPECIFICATION BY DECISION RULES For rough separation of the universe U one can use efficiently fuzzy C-means clustering [4]. If we want to separate m data elements into n clusters, by this algorithm we obtain n cluster centers and m n numbers between 0 and 1 showing the degree of membership of ith data element to the jth cluster. Thus if this number is not less than 0.75 then the element belongs to the positive region of the cluster, if it is less than 0.25 it belongs to negative region, otherwise it belongs to the border. Then by IF THEN ELSE rules we may specify the regions [6]. The rules can be applied to a set of unseen cases in order to estimate their classification power. Several application schemes can be envisioned. Let us consider one of the simplest which has shown to be useful in practice: 1. When a rough set classifier is presented with a new case, the rule set is scanned to find applicable rules, i.e. rules whose predecessors match the case. 2. If no rule is found (i.e. no rule is fired), the most frequent outcome in the training data is chosen. 3. If more than one rule fires, these may in turn indicate more than one possible outcome A voting process is then performed among the rules that fire in order to resolve conflicts and to rank the predicted outcomes. Here are some rough rules which in fact form a decision table: - IF Gene A is up-regulated AND Gene D is down- regulated THEN Tissue is healthy; - IF Transcription factor F binds AND Transcription factor V binds THEN Gene is co-regulated with Gene H; - IF Protein contains motif J THEN Function is magnesium ion binding OR copper ion binding; - IF Protein contains motif D AND Ligand water- octanol coeff. > c1 THEN Binding affinity is high; - IF change in frequency of alpha-helix at position X > c2 THEN Resistant to drug W. V. DICOM FORMAT AND ITS REALIZATION ACR (the American College of Radiology) and NEMA (the National Electrical Manufacturers Associ- ation) formed a joint committee to develop a Standard for Digital Imaging and Communications in Medicine [9] . This Standard is developed in liaison with other Standardization Organizations including CEN TC251 in Europe and JIRA in Japan, with review also by other organizations including IEEE, HL7 and ANSI in the USA. This Standard is now designated for almost CT, PET, MRI, Ultrasound devices used in practice. It is applicable to a networked environment. The previous versions were applicable in a point-to-point environment only; for operation in a networked environment a Net- work Interface Unit (NIU) was required. DICOM Version 3.0 supports operation in a net- worked environment using industry standard networking protocols such as OSI and TCP/IP. It specifies how devices claiming conformance to the Standard react to commands and data being exchanged. Previous versions were confined to the transfer of data, but DICOM Version 3.0 specifies, through the concept of Service Classes, the semantics of commands and associated data. DICOM Version 3.0 explicitly describes how an implementor must structure a Conformance Statement to select spe- cific options. It is structured as a multi-part document. This facilitates evolution of the Standard in a rapidly evolving environment by simplifying the addition of new features. ISO directives which define how to structure multi-part documents have been followed in the con- struction of the DICOM Standard. A single DICOM file contains both a header (which stores information about the patient’s name, the type of scan, image dimensions, etc), as well as all of the image data (which can contain information in three dimensions). The DICOM header The size of this header varies depending on how much header information is stored. Header represents an instance of a real world, referred to as Information Object. Header is constructed of Data Elements. Data Elements contain the encoded Values of Attributes of that object. The specific content and semantics of these Biomath 2 (2013), 1212127, http://dx.doi.org/10.11145/j.biomath.2012.12.127 Page 3 of 5 http://dx.doi.org/10.11145/j.biomath.2012.12.127 A Popov at al., Rough Sets in Biomedical Informatics Attributes are specified in Information Object Defini- tions (see PS 3.3 of the DICOM Standard [9]). The construction, characteristics, and encoding of a Data Set and its Data Elements are discussed in PS 3.5 of the DICOM Standard. Pixel Data, Overlays, and Curves are Data Elements whose interpretation depends on other related elements. As seen below, the data elements can be interpreted as rough set attributes. The main part of a DICOM file is the image collection reffered by the textual part described above. Data Elements A Data Element is uniquely identified by a Data Element Tag. The Data Elements in header shall be ordered by increasing Data Element Tag Number and shall occur at most once in a Data Set. A DICOM attribute or data element is composed of: • A tag, in the format of group, element (XXXX,XXXX) that identifies the element. • A Value Representation (VR) that describes the data type and format of the attribute’s value. • A value length that defines the length of the at- tribute’s value. • A value field containing the attribute’s data. The basic attribute structure is shown below. Tag VR Value Length Value Field A simple example for a single tag for a CT image is: (0028, 0004), Photometric Interpretation: MONOCHROME2 Here you can see also a genetic data representation: Tag Term Frequency Gene(s) Gene(s) (6950,0001) response to stress 16 of 106 15.1% PRX1, HSP26, PHO5, HSP30, ... (6810,0015) transport 15 of 106 14.2% GLK1, HXT7, HXT6, PIC2, STF2, ... Each data element is described by a pair of numbers (group number, data element number). Even numbered groups are elements defined by the DICOM standard and are referred to as public tags. Odd numbered groups can be defined by users of the file format, but must conform to the same structure as standard elements. These are referred to as private tags. The ACR-NEMA Version 1 and 2 standards did not use object-oriented analysis or design. Instead, attributes (or elements, as they were called) were grouped according to use. For example, there were groups of elements that carried identifying information about the patient and others consisting of elements that described the methods of image acqui- sition. Because they were developed without an entity relationship (E-R) model, these groups do not conform to conventional object-oriented definitions. Note, that the E-R data model views the real world as a set of basic objects (entities) and relationships among these objects. For example, a collection of elements used in the ACR- NEMA Version 2 standard to identify and describe a computer tomographic (CT) image would also contain the patient name. In an entity relationship (E-R) model, however, the patient name is an attribute of the patient object, not of the image object. In other words, the patient name is not needed to describe the CT image, even though it would be needed to identify the image. One might also view these complex objects as consisting of parts of more than one entity in an E-R model. A novel free DICOM viewer called MICRODICOM has been created by the second author (10). It gives good opportu- nities for finding pathological objects. After clustering by 5 features (pixel intensity, mean and standard deviation in a 7 × 7 window, the two x and y Sobel operations) four tissue clusters are specified. Then by adding rules for finding the connected components associated with the pathology cluster based on contour tracing techniques, the tumor is located, see Figure 1. VI. CONCLUSION We tried to explain the power of rough modeling in biomedicine. The MICRODICOM project is under development and further intelligent capabilities based on soft computing, and especially on rough sets theory will be included. ACKNOWLEDGMENTS This work was supported by Ministry of Educa- tion, Youth and Technologies under contract DO 02- 359/2008 “Computer Simulation and Innovative Model- Based Study of Bioprocesses”. REFERENCES [1] Popov A. T. (2007) General Definition of Fuzzy Mathematical Morphology Operations. Image and Non-image Applications, In: Nachtegael, M., Van der Weken, D., Kerre, E.E., Philips, W. (Eds.), Soft Computing in Image Processing - Recent Advances, Series: Studies in Fuzziness and Soft Computing, Vol. 210, Springer, 355–384. Biomath 2 (2013), 1212127, http://dx.doi.org/10.11145/j.biomath.2012.12.127 Page 4 of 5 http://dx.doi.org/10.11145/j.biomath.2012.12.127 A Popov at al., Rough Sets in Biomedical Informatics Figure 1. MRI slice of a human brain with a tumor detected by MICRODICOM software [2] Baeg S., S. Batman, E. R. Dougherty, V. G. Kamat, N. Kehtar- navaz, Seunghan Kim, A. Popov, K.Sivakumar, R. Shah, (1999) Unsupervised morphological granulometric texture segmenta- tion of digital mammograms, Journal of Electronic Imaging 8(1), 65–75. http://dx.doi.org/10.1117/1.482685 [3] Pawlak, Z. (1991) Rough Sets: Theoretical Aspects of Rea- soning about Data. Volume 9 of Series D: System Theory, Knowledge Engineering and Problem Solving, Kluwer. [4] Nguyen H. T., E. A.Walker (2000) A first course in fuzzy logic (2nd edition), CRC Press. [5] Thangavel K., Karnan M., and Pethalakshmi A. (2005) Per- formance Analysis of Rough Reduct Algorithms in image Mammogram, ICGST International Journal on Graphics, Vision and Image Processing 8, 13–21 [6] Hvidsten, T.R., Lgreid, A., Komorowski, J. (2003) Learning rule-based models of biological process from gene expression time profiles using gene ontology. Bioinformatics 19, 1116– 1123 http://dx.doi.org/10.1093/bioinformatics/btg047 [7] Bloch, I (2000) On links between mathematical morphology and rough sets, Pattern Recognition 33, 1487–1496 http://dx.doi.org/10.1016/S0031-3203(99)00129-6 [8] http://http://www.mammoimage.org/databases/; http://peipa.essex.ac.uk/info/mias.html [9] http://www.dclunie.com/dicom-status/status.html [10] http://www.microdicom.com Biomath 2 (2013), 1212127, http://dx.doi.org/10.11145/j.biomath.2012.12.127 Page 5 of 5 http://dx.doi.org/10.1117/1.482685 http://dx.doi.org/10.1093/bioinformatics/btg047 http://dx.doi.org/10.1016/S0031-3203(99)00129-6 http://http://www.mammoimage.org/databases/ http://peipa.essex.ac.uk/info/mias.html http://www.dclunie.com/dicom-status/status.html http://www.microdicom.com http://dx.doi.org/10.11145/j.biomath.2012.12.127 Introduction Information systems and rough sets Rough set formal definition and main properties Rough set specification by decision rules DICOM format and its realization Conclusion References