Int. J. of Computers, Communications & Control, ISSN 1841-9836, E-ISSN 1841-9844 Vol. V (2010), No. 5, pp. 684-692 Advanced Information Technology - Support of Improved Personalized Therapy of Speech Disorders M. Danubianu, S.G. Pentiuc, I. Tobolcea, O.A. Schipor Mirela Danubianu, Stefan Gheorghe Pentiuc, Ovidiu Andrei Schipor “Ştefan cel Mare” University of Suceava Romania, 720229 Suceava, 13 Universităţii E-mail: {mdanub, pentiuc, schipor}@eed.usv.ro Iolanda Tobolcea “Alexandru Ioan Cuza” University of Iaşi Romania, 700506 Iasi, 11 Bulevardul Carol I E-mail: itobolcea@yahoo.com Abstract: One of the key challenges of the Sustainable Development Strat- egy adopted by the European Council in 2006 is related to public health whose general objective envisages a good level of public health. One of the specific targets includes better treatments of diseases. It is true that there are affections which by their nature do not endanger the life of a person, however they may have a negative impact on her/his life standard. Various language or speech disorders are part of this category, but if they are discovered and treated in due time, they can be often corrected. The difficulty for researchers and ther- apists is to identify those children who have disorders that show a wide range of issues that cannot be solved spontaneously or which may lead to further significant deficiencies. Information technology in the latest years was used by specialists in order to assist and supervise speech disorder therapy. Conse- quently they have collected a considerable volume of data about the personal or familial anamnesis, regarding various disorders or regarding the process of personalized therapies. These data can be used in data mining processes that aim to discover interesting patterns which can help the design and adaptation of different therapies in order to obtain the best results in conditions of maxi- mum efficiency. The aim of this paper is to present the Logo-DM system. This is a data mining system that can be associated with TERAPERS system in order to use the data from its database as a source for analysis and to provide new information based on an improved system of therapy. Through the use of appropriate techniques of data mining Logo-DM realizes predictions on the evolution and the final status of patients undergoing therapy and enriches the knowledge data of expert system embedded in TERAPERS. Keywords: personalized therapy, data mining, classification, clustering, as- sociations rules. 1 Introduction Various forms of speech disorders affect an important percent of people. There are affections which, by their nature, do not endanger the life of a person, however may have a negative impact on her/his life standard. Discovered and treated in due time, they can be corrected, most often during childhood. The use of information technology in order to assist and supervise speech disorder ther-apy allows specialists to collect a considerable volume of data about the personal or familial anam-nesis, regarding various disorders or regarding the process of personalized therapy. Copyright c⃝ 2006-2010 by CCC Publications Advanced Information Technology - Support of Improved Personalized Therapy of Speech Disorders 685 Even if these data can provide plenty of statistical information little useful knowledge can be obtained from it. In order to get such useful knowledge it is necessary to discover patterns in the data regarding the common characteristics of children with different types of diagnosis, about the connection between antecedents, personal and family behaviour and evolution of the child, or on the connection between the anamnesis and the response to different types of treatments or to different phases of the therapeutic process. These patterns are used to establish such a future strategy so as to maximize the benefits of the therapy and to minimize the costs. What are the speech disorders? A speech disorder is a problem with fluency, voice, and/or how a person utters speech sounds. Classifying speech into normal and disorder is complex because the statistics points out that only 5% to 10% of the population has a completely normal manner of speaking, all others suffer from one disorder or another. The most common speech disorders are: stuttering, cluttering, voice disorders, dysartria and speech sound disorders. The speech disorder therapy should begin as soon as possible. Children enrolled in therapy early in their development (younger than 5 years) tend to have better outcomes than those who begin therapy later. During the therapy, speech therapists use a variety of strategies including: oral motor or feeding therapy, articu-lation therapy and language intervention activities [2]. During the language intervention activities the therapist will interact with a child by playing and talking. He may use pictures, books, objects, or ongoing events to stimulate language development. The therapist may also model correct pro-nunciation and use repetition exercises to build speech and language skills. In the area of speech disorders there are some European projects developed as part of the EU Quality of Life and Management of Living Resources program, like: OLP (Ortho-Logo-Paedia) pro-ject [8], STAR - Speech Training, Assessment, and Remediation [12] [19], Speechviewer III developed by IBM [11] or ARTUR (Articulation Tutor) [17] [18]. Currently, the priorities at the international level focus on the development of information systems that can provide a person- alised therapy. At the national level, little research has been conducted on the therapy of speech impairments [13]. TERAPERS project [1] [2], developed with the financial support granted by the National Agency for Scientific Research, contract ref. no. 56-CEEX-II03/27.07.2006 by the Research Center for Computer Science in the University "Stefan cel Mare" of Suceava, aims to assist and support the speech disorder therapists in their efforts to develop personalized programs for the therapy of dyslalia. 2 Data mining and its application in logopaedic area Data mining is defined as the process of discovering non-obvious and potentially useful pat- terns in large data volumes. As exploration and analysis technique of large amounts of data in order to de-tect patterns or rules with a specific meaning, data mining may facilitate the discovery from appar-ently unrelated data, relationships that can anticipate future problems or might solve the studied problems. Data mining represents one phase in the complex process of knowledge discovery in databases (KDD) [5]. According to CRISP-DM [15], the reference model for this process, KDD consists of a sequence of steps. These steps are presented in Figure 1. Using appropriate methods, data mining can solve two broad categories of problems: pre- diction and description [10] [14]. The most used methods for prediction are classifications and regressions, and for description, clustering, deviation detection or association rules. The specific logopaedic tasks performed by data mining fall into the following categories [3]: • classification which places the people with different speech impairments in predefinited classes. Thus it is possible to track the size and structure of various groups. We can use 686 M. Danubianu, S.G. Pentiuc, I. Tobolcea, O.A. Schipor Figure 1: Crisp_DM process of Knowledge Discovery in Databases classi-fication which is based on the information contained in many predictor variables, such as per-sonal or familial anamnesis data or related to lifestyle, to join the patients with different seg-ments. • clustering which groups people with speech disorders on the basis of similarity of different features. It is an important task because it helps therapists understand their patients. Cluster-ing aims to finding subsets of a predetermined segment, with homogeneous behavior to-wards various methods of therapy that can be effectively targeted by a specific therapy but it is not based on the previous definition of groups. • association rules aim to find out associations between different data which seem to have no semantic dependence. It may be a way to determine why a specific therapy program has been successful on a segment of patients with speech disorders and on the other was ineffective. To conclude with we state that data mining can be a useful tool. Still, there is a limitation we have to consider. Data mining applications generate information by analyzing patterns of data ob-tained from the systems which assist and supervise the speech therapy. Such patterns can help pre-dict the evolution of the individuals that are currently in the process of therapy, or design a scheme of an appropriate therapy for them. However data mining technology can not provide information about impairments, people or behaviors that are not found in the databases that provide data for analysis. 3 Logo-DM System 3.1 Objectives The idea of trying to improve the quality of logopaedic therapy by applying some data mining tech-niques started from TERAPERS project developed within the Research Center for Computer Sci-ence in the University "Stefan cel Mare" of Suceava. This project has proposed to develop a system which is able to assist speech therapists in their speech therapy of dislalya and to asses how the pa-tients respond to various personalized therapy programs. Starting in March 2008 the system is cur-rently used by the therapists from Regional Speech Therapy Center of Suceava. At present, because of the limited time and the economical aspects involved, information regard-ing the therapy for each particular case is of interest [4]: what is the predicted final state for a child or what will be his/her state at the end of various stages of therapy, which the best Advanced Information Technology - Support of Improved Personalized Therapy of Speech Disorders 687 exercises are for each case and how they can focus their effort to effectively solve these exercises or how the family receptivity - which is an important factor in the success of the therapy - is associated with other as-pects of family and personal anamnesis. All this may be the subject of predictions obtained by ap-plying data mining techniques on data collected by using a computer based therapy system. It is also interesting, as part of the knowledge discovered by data mining algorithms, to be used to enrich the knowledge base of expert system embedded. To achieve this goal we propose the development of Logo-DM system. Consequently its objectives are: • analysis of data collected and their preprocessing in order to assure a proper quality for data mining algorithms • feature selection for the elimination of those irrelevant or redundant • the use of corresponding data mining methods and algorithms that can be applied in order to find models which can answer to problems raised in speech disorders therapy • models evaluation and their validation on new cases • to find new rules which can enrich the knowledge base of the expert system embedded in TERAPERS 3.2 System Architecture Data mining aims at deriving knowledge from data. The architecture of a data mining system plays an important role in the efficiency with which data is mined. Considering the characteristic of the domain we have proposed for the system a two tier client server architecture. This architecture is presented in Figure 2. Figure 2: Logo-DM Architecture On the client side there is the user interface (GUI) which allows the user to communicate with the system in order to select the task to perform, to select and submit the datasets on which data mining needs to be applied. Pattern evaluation and the post-processing step consisting in pattern visualiza-tion are performed also on the client. The knowledge base is the module where the background knowledge is stored. 688 M. Danubianu, S.G. Pentiuc, I. Tobolcea, O.A. Schipor The more difficult computational tasks of data mining operations are carried out on the server. Here, the data mining kernel contains modules able to perform classifications and association rule detection. Supplementary the pre-processing data module allows data to become suitable for apply-ing data mining algorithms. 3.3 Some aspects regarding the system implementation It is well known that the best results of data mining algorithms are obtained by applying on data in data warehouses. But in this case the development of a data warehouse is not appropriate, so, it is used, as the primary source of data, a database that contains data collected from the different speech therapists’ offices. In order to choose the right solution for the implementation of the system we have made an analysis of available data both its structure and content. We have started from a scheme with over 60 tables and after deleting tables with irrelevant content for the intended purpose we have obtained, as underlying tables for the final data set, 27 tables as presented in Figure 3. Content analysis can reveal interesting issues related to data quality or the need for transforma- tion. We have made a first assessment of data quality through the following measures: complete- ness, conformity, accuracy, consistency and redundancy. The mechanisms provided by the used da-tabase management system have imposed a minimum, controlled redundancy and have as- sured data consistency. Values stored in fields correspond to reality, but unfortunately in some records useful data for analysis are missing. Therefore it is necessary to supplement data gaps, and where not pos-sible, the removal of the record for accurate results is suggested. Figure 3: The useful part of database schema Proper data for the analysis are subjected to the following types of transformation: transforma- tions of the structure, and changes aimed value. Structural transformations are dictated by the fact that there are fields in the database contain-ing data related to a complex of features to be addressed individually in the analysis. Values of transformations refer to the replacement of coded data by the rules, enabling, for exam- ple, the effective storage with descriptive values of characteristics allowing rapid interpretation of results. An example of these transformations is the following. An issue addressed in the anamnesis form is related to the skills of the child. In Figure 4 we can see that there is a complex of skills of interest (verbal, perceptual, numeric, psycho-motor or special skills). In the database, all these skills are in two distinct fields: one for general skills, which groups data regarding verbal, perceptual, numeric, psycho-motor and intelligence skills and one for Advanced Information Technology - Support of Improved Personalized Therapy of Speech Disorders 689 Figure 4: Sample of anamnesis data special skills (Figure 5). The field called ’aptitudini’ is numeric and is represented in the table by a string of five bits, as shown in Figure 5. These bits, positioned from left to right, have the following meaning: • the first bit - verbal skills (1- present, 0- absent) • the second bit - perceptual skills (1- present, 0- absent) • the third bit - numeric skills (1- present, 0- absent) • the fourth bit - psycho-motor skills (1- present, 0- absent) • the fifth bit - intelligence (1- normal intelligence, 0 - mental deficiency) Figure 5: Data to be transformed Since all these attributes may affect the analysis it is desirable that they can be addressed indi-vidually and explicitly in the final data set. For this purpose the original table structure is changed and values are converted to descriptive values as in Figure 6. These changes have conducted to a modified form of the relational database used by Terapers. In the first phase, construction of target data sets for each of the methods to be applied in the system is through the application of relational expressions like those presented in (1).∏ Ii (T1 ◃ ▹T2 ◃ ▹... ◃ ▹Tk) (1) where: • Ii is a superset of the attributes regarding the useful characteristics for each method • T1 . . . Tk is the set of tables containing the attributes in the list of projection. 690 M. Danubianu, S.G. Pentiuc, I. Tobolcea, O.A. Schipor Figure 6: Transformed data Each of these expressions was implemented in SQL, and has generated intermediate tables. For example, the target data set necessary to establish the profile of children with speech disorders, can be obtained by joining tables which contain: general data about children, family and personal an-amnesis, data on complex evaluation and diagnosis associated. The statement that performs that is presented in (2). The result is a table that contains 129 features. create table caract_copii as select f.∗, l.diagn_final from fise f, logopat l where f.idc = l.idc; (2) Data mining techniques were not designed to process large amounts of irrelevant features. Conse-quently before their application, a selection of the relevant features is required [6] [7]. The most im-portant objectives of feature selection are: to avoid over fitting and improve model performance. A variant of the mRMR method [9] for categorical values has been used for feature selection. It is based on mutual information criteria, formally defined, for two discrete random variables X and Y, as: I (X; Y) = ∑ y∈Y ∑ x∈X p (x, y) log ( p (x, y) p1(x)p2(y) ) (3) where p(x,y) is joint probability distribution function of X and Y, and p1(x) and p2(y) are the marginal probability distribution functions of X and Y respectively. For discrete random variable, the joint probability mass function is: p(x, y) = p(X = x, Y = y) = p(Y = y|X = x) ∗ p(X = x) = p(X = x|Y = y) ∗ p(Y = y) (4) Since these are probabilities, we have∑ x ∑ y p(X = x, Y = y) = 1 (5) The marginal probability function, p (X = x) is: p(X = x) = ∑ y p(X = x, Y = y) = ∑ yp(X = x|Y = y)p(Y = y) (6) The criterion used is related to minimizing redundancy and maximizing relevance to the chosen characteristics. The result of tests performed on data prepared as described in the example mentioned above, revealed that, for classification, the minimum error is obtained if we deal with a number between 20 and 22 features selected. The target data set, obtained after these steps, is Advanced Information Technology - Support of Improved Personalized Therapy of Speech Disorders 691 subject to data mining algorithms. For an effective implementation of algorithms we have taken into account, and we tested, two possibilities: to use the Oracle Data Mining kernel (ODM) which offers the possibility to apply algorithms for classification, clustering and association rules and to use some open source implementations of relevant algorithms adapted and integrated into our own system. We took into account the types of data included in the set and we used implementations in Oracle of Adaptive Bayes Network, Seeker Model and decision trees build with CART [16] and ID3/C4.5 for classification, for clustering the Oracle implementation of A-Clustering algorithm and for association rules Apriori algorithm. It should be noted that for the moment, the volume of data on which work is relatively low, because the system which is the main source of these data is operational for only several months. 4 Conclusions and Future Works Considering the opportunity of data mining techniques application on data collected in the process of speech therapy, we have concluded that methods such as classification, clustering or as-sociation rules can provide useful information for a more efficient therapy. Consequently, we have designed and we are currently implementing a data mining system that aims to use data provided by TERAPERS system, developed by the Research Center for Computer Science in the University "Stefan cel Mare" of Suceava, in order to achieve an optimized personalized therapy of dyslalia. We have tested the modules for data pre-processing and on target data sets obtained from these modules we have applied more algorithms for detecting the most appropriate solutions for the data mining kernel. At present efforts are directed towards the implementation of evaluation patterns and visualization modules and towards building a user friendly interface. Bibliography [1] M. Danubianu, S.G. Pentiuc, O. Schipor, I. Ungureanu, M. Nestor, Distributed Intelligent System for Personalized Therapy of Speech Disorders, in Proc. of The Third International Multi-Conference on Computing in the Global Information Technology ICCGI, July 27- Au- gust 01, Athens, Greece, 2008. [2] M. Danubianu, S.G. Pentiuc, O. Schipor, M. Nestor, I. Ungurean, D.M. Schipor, TER- APERS - Intelligent Solution for Personalized Therapy of Speech Disorders, International Journal on Ad-vances in Life Science, p.26-35, 2009. [3] M. Danubianu, T. Socaciu, Does Data Mining Techniques Optimize the Personalized Ther- apy of Speech Disorders?, Journal of Applied Computer Science and Mathematics, p.15-19, 2009 [4] M. Danubianu, S.G. Pentiuc, T. Socaciu, Towards the Optimized Personalized Therapy of Speech Disorders by Data Mining Techniques, The Fourth International Multi Conference on Computing in the Global Information Technology ICCGI 2009, Vol: CD, 23-29 August, Cannes - La Bocca, France, 2009 [5] F.G. Filip, Decizie asistata de calculator, Ed. Tehnica, Bucuresti, 2005 [6] I. Guyon, A. Elisseeff, An introduction to variable and feature selection. J. Mach Learn Res., 3, p.1157-1182, 2003 692 M. Danubianu, S.G. Pentiuc, I. Tobolcea, O.A. Schipor [7] H. Liu, H. Motoda, Feature Selection for Knowledge Discovery and Data Mining. Kluwer Aca-demic Publishers, Norwell, MA, 1998 [8] OLP (Ortho-Logo-Paedia) - Project for Speech Therapy (http://www.xanthi.ilsp.gr/olp);W.-K. Chen, Linear Networks and Systems (Book style).Belmont, CA: Wadsworth, p. 123-135, 1993 [9] H. Peng, F. Long, C. Ding, Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 8, p. 1226-1238, 2005 [10] B. Reiz, L. Csató, Bayesian Network Classifier for Medical Data Analysis. International Journal of Computers Communications & Control Vol. 4, p: 65-72, 2009 [11] Speechviewer III - (http://www.synapseadaptive.com/edmark/prod/sv3) [12] STAR Speech Training, Assessment, and Remediation (http://www.asel.udel.edu/speech) [13] Tobolcea,I., Interventii logo-terapeutice pentru corectarea formelor dislalice la copilul normal, Editura Spanda, Iasi, 2002. [14] P. Wessa, Quality Control of Statistical Learning Environments and Prediction of Learning Outcomes through Reproducible Computing, International Journal of Computers Commu- nications & Control Vol. 4, p: 185-197, 2009 [15] R. Wirth, J. Hipp, CRISP-DM: Towards a standard process model for data mining. In Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining, pages 29-39, Manchester, UK, 2000 [16] www.salford-systems.com/ last visited October 2009 [17] www.speech.kth.se/multimodal/ARTUR/index.html last visited August 2009 [18] O. Balter, O. Engwall, A.M. Oster, H. Kjellstrom, Wizard-of-Oz Test of ARTUR - a Computer-Based Speech Training System with Articulation Correction. Proceedings of the Seventh International ACM SIGACCESS Conference on Computers and Accessibility, Bal- timore, October, 2005, pp.36-43. [19] H.T. Bunnel, M.D. Yarrington, B.J. Polikoff, Articulation Training for Young Children, Proceedings of 6th International Conference on Spoken Language Processing (ICSLP 2000), Beijing, China, October 16-20, 2000, vol.4, pp. 85-88.