JOURNAL OF ENGINEERING RESEARCH AND TECHNOLOGY, VOLUME 2, ISSUE 2, JUNE 2015

Investigating the Efficiency of WordNet as Background Knowledge for Document Clustering

Iyad AlAgha 1, Rami Nafee 2
1 Faculty of Information Technology, The Islamic University of Gaza, ialagha@hotmail.com
2 Faculty of Information Technology, The Islamic University of Gaza, raminafe2002@hotmail.com

Abstract—Traditional techniques of document clustering do not consider the semantic relationships between words when assigning documents to clusters. For instance, if two documents discuss the same topic using different words, these techniques may assign them to different clusters. Many efforts have approached this problem by enriching the document's representation with background knowledge from WordNet. These efforts, however, have often reported conflicting results: while some studies claimed that WordNet had the potential to improve clustering performance through its capability to capture and estimate similarities between words, others claimed that WordNet provided little or no enhancement to the obtained clusters. This work aims to experimentally resolve this contradiction, explain why WordNet can be useful in some cases but not in others, and identify the factors that influence the use of WordNet for document clustering. We conducted a set of experiments in which WordNet was used for document clustering under various settings, including different datasets, different ways of incorporating semantics into the document's representation, and different similarity measures. Results showed that different experimental settings may yield different clusters: for example, the influence of WordNet's semantic features varies according to the dataset being used. Results also revealed that WordNet-based similarity measures do not seem to improve clustering, and that no single measure guarantees the best clustering results.

Index Terms—Document Clustering, WordNet, Similarity Measure, Ontology

I INTRODUCTION

Document clustering is a technique that aims at grouping document collections into meaningful groups. In traditional techniques of document clustering, documents are represented as bags of words and are then assigned to clusters according to the similarity scores obtained from a document similarity measure. These techniques ignore the semantic relationships between the document words, and thus cannot accurately group documents based on similarity of meaning. For example, a document that only contains the word "plane" and another that only contains the word "jet" are assigned to different clusters because the two words are treated as unrelated.

Existing research has tried to overcome this limitation by proposing clustering techniques that are based on meaning similarities. The similarity in meaning can be measured by exploiting background knowledge in the form of domain ontologies or lexical resources such as WordNet. Similarity scores obtained from WordNet can be used to enhance the document's representation by giving more weight to words that are semantically related [1]. With the enhanced document's representation, the clustering algorithm can better assign documents to clusters based on their semantic similarities to each other.
Several efforts have investigated different approaches to incorporating the semantic features of ontologies in an attempt to improve document clustering, and have shown that information semantics have the potential to improve the quality of the obtained clusters [2-5].

WordNet [6] is one of the most popularly used semantic networks for determining semantic relations between words. WordNet has an ontology-like structure: words are represented as having several meanings (each such meaning forming a synset, which is the atomic structure of WordNet), and relations between words (hyponymy, hypernymy, antonymy, and other relations) are represented as links in a graph. Many similarity measures use the relations defined in WordNet to determine the semantic relatedness between words. Due to its wide coverage compared to other restricted-domain ontologies, many efforts have used it as background knowledge for document clustering [7-9].
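As a concrete illustration of this structure, the short sketch below queries WordNet through NLTK (a toolkit assumption made here for illustration, not something prescribed by the paper): it lists the synsets (meanings) of a word, the synonymous lemmas grouped in each synset, and the hypernym links that connect synsets in the graph.

```python
# Illustrative only: inspecting WordNet's synset/relation structure via NLTK.
# Assumes NLTK and its WordNet data are installed (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

for synset in wn.synsets("plane"):              # each synset is one meaning of "plane"
    print(synset.name(), "-", synset.definition())
    print("  lemmas:", synset.lemma_names())    # synonymous words grouped in the synset
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])  # IS-A links in the graph
```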
Despite the significant amount of research on WordNet-based clustering, existing approaches have often produced conflicting results: while some approaches showed that WordNet could enhance the document's representation with semantic features, yielding better clustering [9-11], other approaches claimed that WordNet resulted in little or no improvement, or might even degrade the clustering results due to the introduced noise [7, 12, 13]. Given this contradiction, the objective of this research is to resolve this issue by seeking answers to the following questions:

- What potential factors could make WordNet useful for document clustering in particular situations but not in others?
- Do different experimental settings, i.e. different datasets, document representations and similarity measures, affect the potential of WordNet to improve clustering?
- What is the best similarity measure to use with WordNet-based clustering?

II RELATED WORK

The idea of incorporating semantic features from WordNet has been widely explored to improve document clustering techniques. However, there are major differences among the findings of these efforts: while some affirmed the value of WordNet in improving document clustering, others indicated the opposite. For example, Hotho et al. [1] discussed different strategies for representing text documents that take background knowledge from WordNet into account. Their evaluations indicated improved clustering results. Gharib, Fouad, and Aref [14] matched the stemmed keywords to terms in WordNet for word sense disambiguation. Their approach outperformed the traditional clustering techniques; however, it seemed to over-generalize the affected keywords [15]. Fodeh et al. [13] addressed the effect of incorporating polysemous and synonymous nouns into document clustering, and showed that they play an important role in clustering. Chen et al. [16] proposed a document clustering approach that combines fuzzy association rule mining with WordNet for grouping documents. They also proposed a technique to filter out noise when adding hypernyms into documents. Wei et al. [17] presented a WordNet-based semantic similarity measure for word sense disambiguation, in which lexical chains are employed to extract core semantic features that express the topic of documents.

In contrast, some studies indicated that the use of WordNet as background knowledge does not necessarily lead to better clusters, and may even produce noise that degrades the clustering performance. For example, Dave et al. [18] used synsets as features for the document's representation and subsequent clustering, and reported that WordNet synsets actually decreased or added no value to clustering performance. Amine et al. [19] found that the mapping of document words to concepts in WordNet might increase ambiguity and induce loss of information. Passos and Wainer [20] showed that many word similarity measures derived from WordNet are worse than the baseline for the purposes of text clustering, and indicated that WordNet does not provide good word similarity data. However, they worked on a single dataset, and did not examine other approaches to incorporating WordNet's features into the document's representation. Sedding and Kazakov [7] showed that synonyms and hypernyms disambiguated only by part-of-speech tags are not successful in improving clustering effectiveness. This could be attributed to the noise introduced by the incorrect senses retrieved from WordNet.

The above discussion reveals inconsistent results regarding the ability of WordNet to serve as background knowledge for document clustering. This demands further investigation into the factors and circumstances causing this inconsistency.

Approaches that exploit WordNet or any other ontology for clustering often rely on some type of semantic similarity measure to estimate the similarity between document words. These measures can be classified into four groups: path length based measures, information content based measures, feature based measures, and hybrid measures. An exhaustive overview of these approaches can be found in [21]. A previous study [22] compared the use of different similarity measures with medical ontologies for document clustering, and indicated that no particular type of similarity measure significantly outperforms the others. Our study also compares the use of similarity measures for clustering, but with WordNet rather than domain-specific ontologies. We also examine the effect of WordNet's semantics with different datasets and document representations. Amine et al. [19] compared three different clustering algorithms which were all based on the synsets of WordNet as terms for the representation of documents. While their study aimed to determine the best clustering algorithm to use with WordNet, this study aims to explain the opposing findings regarding the efficiency of WordNet for document clustering.

III USING WORDNET TO ENHANCE THE DOCUMENT'S REPRESENTATION

Clustering of a document collection typically starts by representing each document as a bag of words. The simple bag-of-words representation may be enhanced by weighting the terms according to their information content, for example with tf-idf. Subsequently, a similarity measure, such as cosine similarity, is used to assign a score to each pair of documents, and similar documents are accordingly assigned to the same cluster.
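As a concrete point of reference, the following sketch illustrates this baseline pipeline using scikit-learn; the toolkit choice, the toy corpus and the number of clusters are assumptions of this illustration rather than the exact setup used in the paper.

```python
# Minimal sketch of the baseline pipeline described above (illustrative only):
# bag-of-words with tf-idf weighting, cosine similarity, and K-means clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

docs = [
    "the plane landed at the airport",
    "a jet took off from the runway",
    "the farm produces milk and meat",
]

# tf-idf bag-of-words representation (stop-word removal included here;
# stemming would require an extra step, e.g. with NLTK).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Pairwise cosine similarities between documents.
doc_sim = cosine_similarity(X)

# K-means assigns documents to clusters based on the tf-idf vectors alone;
# no semantic relation between "plane" and "jet" is taken into account.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(doc_sim)
print(labels)
```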
There are two approaches that are commonly used to enhance the document's representation with WordNet, which are explained in what follows.

A Enhancing the Document's Representation by Replacing Synonyms

One limitation of the traditional bag-of-words representation is that words are weighted separately without considering the similarity between them. For example, terms that are synonyms of each other are weighted separately despite all denoting the same concept. This leads to information loss, as the importance of a single concept is distributed among different components of the document's representation.

Existing approaches [1, 13] addressed this issue by referring to lexical databases such as WordNet to identify synonyms. Subsequently, the document's bag of words is modified by replacing all synonyms with a single descriptor, so that all the terms of a synonym set are represented by one shared term. Afterwards, the document is represented using the tf-idf scheme. The replacing term therefore receives a cumulative weight that is equal to the sum of the tf-idf weights of the replaced synonyms. Finally, a clustering algorithm, such as K-means, is applied.
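A minimal sketch of this synonym-replacement step is given below, assuming NLTK's WordNet interface. Mapping every word to its first noun synset is a simplification made here only for illustration; the experiments in Section IV disambiguate senses using part-of-speech tags, and a full implementation would also handle stemming and stop words.

```python
# Illustrative sketch of synonym replacement (Section III.A), assuming NLTK and
# its WordNet data are installed (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer

def replace_with_synset(token):
    """Replace a token by a shared descriptor: the name of its first noun synset."""
    synsets = wn.synsets(token, pos=wn.NOUN)
    # Dots are replaced so the descriptor survives tf-idf tokenization intact.
    return synsets[0].name().replace(".", "_") if synsets else token

def normalize(doc):
    return " ".join(replace_with_synset(t) for t in doc.lower().split())

docs = ["the car stopped", "the automobile stopped"]   # toy documents
normalized = [normalize(d) for d in docs]
# "car" and "automobile" should now share the same descriptor (e.g. 'car_n_01'),
# so their tf-idf weights accumulate on a single feature.
X = TfidfVectorizer().fit_transform(normalized)
print(normalized)
```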
B Enhancing the Document's Representation by Using Similarity Measures

Having documents with different term sets does not necessarily mean that the documents are unrelated. Document terms can be semantically related even though they are syntactically different. For example, the terms cow, meat, milk and farms are all related by relations which cannot be captured without using background knowledge. As discussed earlier, the document's representation can be enhanced by identifying and replacing synonyms of the same term. However, this approach only considers synonyms, while terms that are not synonyms but are semantically related are ignored. For example, the words "cow" and "milk" above are related in meaning, and this relation will not be captured by simply identifying and replacing synonyms. To overcome this limitation, it is necessary to represent the document in a way that reflects the relatedness in meaning between the document terms.

Similarity measures have been commonly used to measure the semantic relatedness between document words, and the relatedness scores are then incorporated into the document's representation. Similarity measures exploit knowledge retrieved from a semantic network (in this case, WordNet) to estimate the similarity between term pairs according to the topological structure of WordNet. Similarity scores are then incorporated into the document's tf-idf representation so that related terms gain more weight. Reweighting terms according to their semantic relatedness may help discount the effects of class-independent general terms and amplify the effects of class-specific "core" terms [22]. This can eventually help to cluster documents based on their meanings. Employing similarity measures on WordNet is an idea that has been explored in several efforts [3, 17, 23] for the purpose of improving document clustering.

IV EXPERIMENTAL STUDY

After presenting the approaches for enhancing the document's representation with knowledge from WordNet, the following subsections report on the experimental study we conducted with the following objectives in mind: 1) compare the approaches explained above and examine their influence on document clustering by testing with different datasets; 2) examine the use of different ontology-based similarity measures in order to identify the best measure(s) to use with WordNet; 3) explain, in light of the results obtained from 1 and 2, the contradiction between existing works regarding the value of WordNet's semantics for document clustering.

A Datasets

Two datasets were used for the study: Reuters-21578 and OHSUMED. Details of each dataset are given below, together with the rationale behind choosing these particular datasets.

- Reuters-21578 [24]: The documents in the Reuters-21578 collection appeared on the Reuters newswire in 1987. The documents were assembled and indexed with categories by personnel from Reuters Ltd. [24]. The Reuters-21578 dataset has been widely used for evaluating document clustering algorithms. Its domain is not specific, and therefore it can be understood by a non-expert [7].

- OHSUMED [25]: The OHSUMED test collection is a set of 348,566 references from MEDLINE, the on-line medical information database, consisting of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991). The available fields are title, abstract, MeSH indexing terms, author, source, and publication type. The National Library of Medicine has agreed to make the MEDLINE references in the test database available for experimentation [26].

The above datasets were chosen because they have different characteristics that might lead to different clustering performance: Reuters-21578 is a heterogeneous dataset with no specific domain, covering a wide variety of dissimilar topics from the newswire. In contrast, OHSUMED is a domain-specific dataset strictly covering the domain of medicine. The intention was to explore how datasets of different homogeneity could yield different clustering performance.

B Experiments

We conducted three experiments, each of which used a different approach to representing documents. The K-means clustering algorithm was applied in all three experiments. The experiments were as follows:

- Traditional clustering without background knowledge: This represented the baseline case. Documents were pre-processed by applying tokenization, stemming and stop-word removal. Documents were then represented in tf-idf prior to applying K-means for clustering. Note that the conceptual relations between document terms were ignored, and terms were weighted only according to their frequency of occurrence in the document collection.

- Enhancing the document's representation by identifying and replacing synonyms: Before finding synonyms in documents, the following pre-processing steps were applied. First, all documents were broken down into sentences, which underwent part-of-speech tagging. Part-of-speech tags were essential to correctly identify synonyms, which should have the same POS tag. After tagging the document words, the other pre-processing steps, including tokenization, stemming and stop-word removal, were applied. The next step was to search the document collection for terms that are synonyms with the help of WordNet. Synonyms of a particular concept were replaced by a unique term in the document's bag of words. The modified bag of words of each document was then represented in tf-idf.

- Enhancing the document's representation by using ontology-based similarity measures: First, pre-processing steps consisting of tokenization, stop-word removal and stemming were applied to the document collection. Documents were then represented in the tf-idf scheme. Ontology-based similarity measures were used to measure the WordNet-based similarity between each pair of words in the document collection.
Similarity scores were then used to augment the tf-idf weights so that terms gained more weight according to their similarity to each other. This process is formally represented as follows. Let d = {w_1, w_2, w_3, ..., w_n} be the document's representation, where w_i is the weight of term t_i in document d, computed using the tf-idf scheme. The similarity between each pair of terms in the document was calculated using each of the similarity measures shown in Table 1. Afterwards, the tf-idf weights were reweighted using the following equation [27]:

w'_i = w_i + Σ_{j≠i} w_j * sim(i, j)

where w'_i stands for the augmented tf-idf weight of term t_i, w_j is the tf-idf weight of term t_j of the same document, and sim(i, j) is the semantic similarity score between terms t_i and t_j, scaled from 0 to 1, where 1 represents the highest similarity. This equation assigns more weight to terms that are semantically related. Weights of terms that are not related to any other terms, or that are not included in WordNet, remain unchanged.

After augmenting the tf-idf document representation with similarity scores, the K-means algorithm was applied. Since one of our objectives was to assess different similarity measures, the above process was repeated for every similarity measure shown in Table 1. These measures have been widely used for semantically-enhanced text clustering, and short descriptions of them are given in Table 1.

TABLE 1
Similarity measures used in the study.

ID | Description
LCH | Leacock and Chodorow [28]: relies on the length of the shortest path between two terms. It is limited to IS-A links, and the path length is scaled by the overall depth of the taxonomy.
WUP | Wu and Palmer [29]: calculates similarity by considering the depths of the two terms in WordNet, along with the depth of the least common subsumer.
JCN | Jiang and Conrath [30]: uses the notion of information content, in the form of the conditional probability of encountering an instance of a child synset given an instance of a parent synset.
LIN | Lin [31]: a slight modification of Jiang and Conrath: 2 * IC(lcs) / (IC(synset1) + IC(synset2)), where IC(x) is the information content of x.
RES | Resnik [32]: defines the similarity between two terms as the information content of their most specific common subsumer.
LESK | Banerjee and Pedersen [33]: the relatedness of two terms is proportional to the extent of overlap of their dictionary definitions.
HSO | Hirst and St-Onge [34]: two terms are semantically related if their WordNet synsets are connected by a path that is not too long and that does not change direction too often.
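The sketch below illustrates the reweighting equation above, using NLTK's implementation of the Wu and Palmer measure (WUP in Table 1) as the similarity function; any of the other measures could be substituted. The toolkit choice, the tiny vocabulary and the tf-idf weights are illustrative assumptions, not the study's actual code or data.

```python
# Illustrative sketch of the reweighting step: w'_i = w_i + sum_{j != i} w_j * sim(i, j).
# Uses NLTK's Wu-Palmer similarity between first noun synsets as sim(i, j).
from nltk.corpus import wordnet as wn

def wup_sim(word_i, word_j):
    """WordNet-based similarity in [0, 1]; 0 if either word has no noun synset."""
    si, sj = wn.synsets(word_i, pos=wn.NOUN), wn.synsets(word_j, pos=wn.NOUN)
    if not si or not sj:
        return 0.0
    return wn.wup_similarity(si[0], sj[0]) or 0.0

def reweight(tfidf):
    """Augment each term's tf-idf weight with the weights of semantically related terms."""
    augmented = {}
    for ti, wi in tfidf.items():
        augmented[ti] = wi + sum(wj * wup_sim(ti, tj)
                                 for tj, wj in tfidf.items() if tj != ti)
    return augmented

doc = {"cow": 0.5, "milk": 0.4, "election": 0.3}   # toy tf-idf weights for one document
print(reweight(doc))
```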
V RESULTS AND DISCUSSION

The clustering performance was evaluated using F-measure [35] and purity [36]. Table 2 summarizes the results, where rows indicate the three experiments and columns indicate the two datasets used. When using similarity measures with WordNet, the clustering process was repeated several times while varying the similarity measure, and the best result was considered for comparison. The ID of the similarity measure giving the best result is shown alongside the result in brackets (IDs of similarity measures are listed in Table 1).

TABLE 2
Clustering results of the three experiments.

Experiment | Reuters-21578 Purity | Reuters-21578 F-measure | OHSUMED Purity | OHSUMED F-measure
Without background knowledge | 0.57 | 0.64 | 0.36 | 0.47
With replacing synonyms | 0.64 | 0.77 | 0.49 | 0.65
With WordNet-based similarity measures | 0.59 (LCH) | 0.70 (LCH) | 0.43 (RES) | 0.60 (RES)

Table 3 lists the different similarity measures used in the third experiment, i.e. clustering with WordNet-based similarity measures, and the clustering performance for each measure. Results are discussed in the following subsections, and related efforts are revisited, where appropriate, in light of our results.

TABLE 3
Clustering results for each similarity measure.

Similarity measure | Reuters-21578 Purity | Reuters-21578 F-measure | OHSUMED Purity | OHSUMED F-measure
LCH | 0.59 | 0.70 | 0.39 | 0.49
WUP | 0.56 | 0.64 | 0.41 | 0.55
JCN | 0.40 | 0.48 | 0.30 | 0.39
LIN | 0.48 | 0.55 | 0.41 | 0.55
RES | 0.48 | 0.61 | 0.43 | 0.65
LESK | 0.54 | 0.67 | 0.42 | 0.57
HSO | 0.46 | 0.58 | 0.42 | 0.62

A Comparing the Document's Representation Techniques

In this subsection, we compare results across the three experiments, i.e. clustering without background knowledge (baseline), clustering with replacing synonyms, and clustering with similarity measures. In the case of the Reuters dataset, clustering with replacing synonyms outperformed the other approaches (F-measure = 0.77, purity = 0.64), followed by clustering with similarity measures (F-measure = 0.70, purity = 0.59). When using the OHSUMED dataset, the best results were also achieved by replacing synonyms (F-measure = 0.65, purity = 0.49), followed by clustering with similarity measures (F-measure = 0.60, purity = 0.43). In general, this result shows the potential of WordNet to improve the clustering results, either by replacing synonyms or by using similarity scores, as compared to clustering without background knowledge. This result conforms to other studies which indicated the value of WordNet semantics for document clustering [1, 7, 11, 17].

However, the use of similarity measures with WordNet unexpectedly produced results worse than those produced by replacing synonyms, though slightly better than the baseline case, i.e. clustering without background knowledge. It should be noted that Table 2 shows the top result obtained from all seven similarity measures. This result concurs with studies which indicated that the use of similarity measures with WordNet had little impact on text clustering and may produce worse results [20, 23].

The above result suggests that WordNet-based similarity measures do not seem to improve the clustering results. We think this can be attributed to the structure of the WordNet taxonomy, which is mainly designed to represent specific relations (e.g. hyponymy, hypernymy) rather than to capture similarity between arbitrary words. For example, when measuring the similarity between the words "camel" and "desert", or between the verb "sit" and the noun "chair", the similarity scores were close to 0.
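This observation can be inspected directly; the sketch below queries NLTK's WordNet interface (an assumed toolkit, not part of the original study) for the same word pairs and simply prints whatever scores the path-based measures return, while LCH, being restricted to a single part of speech, cannot score a noun-verb pair at all.

```python
# Illustrative check: taxonomic measures for associatively related but
# taxonomically distant words, and a cross-POS pair that LCH cannot score.
from nltk.corpus import wordnet as wn

camel = wn.synset("camel.n.01")
desert = wn.synset("desert.n.01")
print("path(camel, desert) =", camel.path_similarity(desert))
print("wup(camel, desert)  =", camel.wup_similarity(desert))

sit = wn.synsets("sit", pos=wn.VERB)[0]
chair = wn.synsets("chair", pos=wn.NOUN)[0]
# LCH is restricted to IS-A links within one part of speech, so a noun-verb
# pair raises an error rather than returning a similarity score.
try:
    print("lch(sit, chair) =", sit.lch_similarity(chair))
except Exception as err:
    print("lch(sit, chair) undefined:", err)
```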
B Comparing the Influence of Datasets

Comparing results across the two datasets, we can see that the improvement resulting from the semantic-based approaches (synonyms and similarity measures) was more obvious in the case of the OHSUMED dataset than in the Reuters dataset. We think this difference can be explained by the nature of the dataset in terms of the disparity of its content. The Reuters dataset is heterogeneous in the sense that it covers news from unrelated domains, which makes it difficult to identify semantic relations between the document words. It was noticed experimentally that the scores obtained by applying the similarity measures to the Reuters dataset were often low. Considering that most similarity measures rely, mainly or partially, on taxonomical distances within WordNet, the similarity scores decrease as terms grow further apart in the taxonomy. In contrast, OHSUMED is a domain-specific dataset whose classes of documents all belong to the medical domain. This makes it easier to identify terms that belong to a specific domain and to measure similarities between them, which explains the better results obtained from the OHSUMED dataset as compared to those obtained from the Reuters dataset.

The above discussion reveals that the use of different datasets can lead to different results: the more homogeneous and domain-specific the dataset is, the easier it becomes to capture similarities between the terms it contains, and hence the more influence WordNet has on the clustering results. We should also bear in mind that WordNet is a general-purpose lexical database of English terms and does not provide thorough coverage of every domain of knowledge. Although its use improved the clustering performance in our experiments, WordNet is not meant to be used with domain-specific applications; it is always recommended to use domain-specific ontologies to cover domain-specific datasets.

C Comparing the WordNet-based Similarity Measures

Comparing the use of different similarity measures, results vary as shown in Table 3: in the case of the Reuters dataset, the LCH measure achieved the best results, followed by the LESK and WUP measures. In the case of OHSUMED, the RES measure gave the best results, followed by HSO and LESK. However, the differences were not statistically significant (t-test, p > 0.05). In addition, the clustering performance with some similarity measures was even lower than the performance of the baseline case where no background knowledge was used (e.g. the JCN measure in Table 3). These results indicate that there was no single measure that could be relied upon to ensure the best clustering results. They also support our conclusion about the inadequacy of WordNet as a basis for ontology-based similarity measures. However, this result does not generalize to other types of ontologies, as our study focused strictly on WordNet.
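For completeness, the comparisons above are reported in terms of purity and F-measure. The sketch below shows one common way to compute purity from predicted cluster labels and gold class labels using scikit-learn's contingency matrix; the label arrays are placeholders, and the clustering F-measure of [35] can be derived from the same matrix.

```python
# Minimal sketch of the purity metric used in Tables 2 and 3: each cluster is
# counted as its majority class, and purity is the fraction of documents that
# fall into their cluster's majority class. Labels below are toy placeholders.
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def purity(true_labels, cluster_labels):
    cm = contingency_matrix(true_labels, cluster_labels)  # classes x clusters counts
    return np.sum(np.max(cm, axis=0)) / np.sum(cm)

true_labels = ["trade", "trade", "crude", "crude", "grain"]   # hypothetical gold classes
cluster_labels = [0, 0, 1, 1, 1]                              # hypothetical K-means output
print(purity(true_labels, cluster_labels))                    # 0.8 for this toy example
```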
VI CONCLUSIONS AND RECOMMENDATIONS

The differing outcomes of existing approaches regarding the influence of WordNet on document clustering motivated us to conduct this work. Multiple document clustering experiments were conducted with different datasets, document representations and similarity measures. In summary, our study found that the characteristics of the dataset being clustered, in terms of the disparity of its topics, may reduce the ability to capture semantic relations between terms from the WordNet taxonomy. Results also indicated that augmenting the document's representation by replacing synonyms may achieve better results than those achieved by using similarity measures or by the baseline case, i.e. clustering without background knowledge.

Based on these findings, we draw some recommendations to be considered when using WordNet as background knowledge for text clustering. First, experimenters should consider the nature of the dataset at hand and the diversity of its topics before deciding to use WordNet for measuring similarities. Second, the WordNet structure does not seem to support the application of similarity measures; WordNet may be better exploited by capturing specific types of relations, such as synonymy, hyponymy and hypernymy, and using them to enhance the document's representation. For example, capturing and replacing synonyms in the document collection outperformed the other approaches in our experiments.

REFERENCES

[1] A. Hotho, S. Staab, and G. Stumme (2003). Ontologies Improve Text Document Clustering. In Proceedings of the Third IEEE International Conference on Data Mining (ICDM 2003), IEEE, pp. 541-544.
[2] A. Charola and S. Machchhar (2013). Comparative Study on Ontology Based Text Documents Clustering Techniques. Data Mining and Knowledge Engineering, 5(12), 426.
[3] H. H. Tar and T. T. S. Nyunt (2011). Ontology-Based Concept Weighting for Text Documents. World Academy of Science, Engineering and Technology, 81, 249-253.
[4] G. Bharathi and D. Venkatesan (2012). Study of Ontology or Thesaurus Based Document Clustering and Information Retrieval. Journal of Engineering and Applied Sciences, 7(4), 342-347.
[5] Q. Dang, J. Zhang, Y. Lu, and K. Zhang (2013). WordNet-Based Suffix Tree Clustering Algorithm. In Proceedings of the 2013 International Conference on Information Science and Computer Applications (ISCA 2013), Atlantis Press.
[6] G. Miller and C. Fellbaum (1998). WordNet: An Electronic Lexical Database. MIT Press, Cambridge.
[7] J. Sedding and D. Kazakov (2004). WordNet-Based Text Document Clustering. In Proceedings of the 3rd Workshop on RObust Methods in Analysis of Natural Language Data, Association for Computational Linguistics, pp. 104-113.
[8] H.-T. Zheng, B.-Y. Kang, and H.-G. Kim (2009). Exploiting Noun Phrases and Semantic Relationships for Text Document Clustering. Information Sciences, 179(13), 2249-2262.
[9] D. R. Recupero (2007). A New Unsupervised Method for Document Clustering by Using WordNet Lexical and Conceptual Relations. Information Retrieval, 10(6), 563-579.
[10] A. Hotho, S. Staab, and G. Stumme (2003). WordNet Improves Text Document Clustering. In Proceedings of the Semantic Web Workshop at the 26th Annual International ACM SIGIR Conference, Toronto, Canada.
[11] Y. Wang and J. Hodges (2006). Document Clustering with Semantic Analysis. In Proceedings of the 39th Annual Hawaii International Conference on System Sciences (HICSS'06), IEEE, pp. 54c-54c.
[12] L. Jing, L. Zhou, M. K. Ng, and J. Z. Huang (2006). Ontology-Based Distance Measure for Text Clustering. In Proceedings of the Text Mining Workshop, SIAM International Conference on Data Mining.
[13] S. Fodeh, B. Punch, and P.-N. Tan (2011). On Ontology-Driven Document Clustering Using Core Semantic Features. Knowledge and Information Systems, 28(2), 395-421.
[14] T. F. Gharib, M. M. Fouad, and M. M. Aref (2010). Fuzzy Document Clustering Approach Using WordNet Lexical Categories. In Advanced Techniques in Computing Sciences and Software Engineering, Springer, pp. 181-186.
[15] C. Bouras and V. Tsogkas (2012). A Clustering Technique for News Articles Using WordNet. Knowledge-Based Systems, 36, 115-128.
[16] C.-L. Chen, F. S. Tseng, and T. Liang (2011). An Integration of Fuzzy Association Rules and WordNet for Document Clustering. Knowledge and Information Systems, 28(3), 687-708.
[17] T. Wei, Y. Lu, H. Chang, Q. Zhou, and X. Bao (2015). A Semantic Approach for Text Clustering Using WordNet and Lexical Chains. Expert Systems with Applications, 42(4), 2264-2275.
[18] K. Dave, S. Lawrence, and D. M. Pennock (2003). Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews. In Proceedings of the 12th International Conference on World Wide Web, ACM, pp. 519-528.
[19] A. Amine, Z. Elberrichi, and M. Simonet (2010). Evaluation of Text Clustering Methods Using WordNet. International Arab Journal of Information Technology, 7(4), 349-357.
[20] A. Passos and J. Wainer (2009). WordNet-Based Metrics Do Not Seem to Help Document Clustering. In Proceedings of the II Workshop on Web and Text Intelligence, São Carlos, Brazil.
[21] L. Meng, R. Huang, and J. Gu (2013). A Review of Semantic Similarity Measures in WordNet. International Journal of Hybrid Information Technology, 6(1), 1-12.
[22] X. Zhang, L. Jing, X. Hu, M. Ng, and X. Zhou (2007). A Comparative Study of Ontology Based Term Similarity Measures on PubMed Document Clustering. In Advances in Databases: Concepts, Systems and Applications, Springer, pp. 115-126.
[23] L. Jing, L. Zhou, M. K. Ng, and J. Z. Huang (2006). Ontology-Based Distance Measure for Text Clustering. In Proceedings of the Text Mining Workshop, SIAM International Conference on Data Mining, pp. 537-541.
[24] D. D. Lewis (1997). Reuters-21578 Text Categorization Test Collection, Distribution 1.0. http://www.research.att.com/~lewis/reuters21578.html
[25] W. Hersh, C. Buckley, T. Leone, and D. Hickam (1994). OHSUMED: An Interactive Retrieval Evaluation and New Large Test Collection for Research. In SIGIR'94, Springer, pp. 192-201.
[26] XMDV Group (2005). OHSUMED dataset [cited 1/12/2014]. Available from: http://davis.wpi.edu/xmdv/datasets/ohsumed.html
[27] G. Varelas, E. Voutsakis, P. Raftopoulou, E. G. Petrakis, and E. E. Milios (2005). Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web. In Proceedings of the 7th Annual ACM International Workshop on Web Information and Data Management, ACM, New York, pp. 10-16.
[28] C. Leacock and M. Chodorow (1998). Combining Local Context and WordNet Similarity for Word Sense Identification. WordNet: An Electronic Lexical Database, 49(2), 265-283.
[29] Z. Wu and M. Palmer (1994). Verb Semantics and Lexical Selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 133-138.
[30] J. J. Jiang and D. W. Conrath (1997). Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. arXiv preprint cmp-lg/9709008.
[31] D. Lin (1998). An Information-Theoretic Definition of Similarity. In ICML, pp. 296-304.
[32] P. Resnik (1995). Using Information Content to Evaluate Semantic Similarity in a Taxonomy. arXiv preprint cmp-lg/9511007.
[33] S. Banerjee and T. Pedersen (2002). An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In Computational Linguistics and Intelligent Text Processing, Springer, pp. 136-145.
[34] G. Hirst and D. St-Onge (1998). Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms. WordNet: An Electronic Lexical Database, 305-332.
[35] B. Larsen and C. Aone (1999). Fast and Effective Text Mining Using Linear-Time Document Clustering. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 16-22.
[36] Y. Zhao and G. Karypis (2001). Criterion Functions for Document Clustering: Experiments and Analysis. Machine Learning.

Iyad M. AlAgha received his MSc and PhD in Computer Science from the University of Durham, UK. He worked as a research associate in the Centre for Technology Enhanced Learning at the University of Durham, investigating the use of multi-touch devices for learning and teaching. He is currently an assistant professor at the Faculty of Information Technology at the Islamic University of Gaza, Palestine. His research interests are Semantic Web technology, adaptive hypermedia, human-computer interaction and technology-enhanced learning.

Rami H. Nafee received his BSc from Al-Azhar University-Gaza and his MSc in Information Technology from the Islamic University of Gaza. He works as a web programmer at the Information Technology Unit at Al-Azhar University-Gaza. He is also a lecturer at the Faculty of Intermediate Studies at Al-Azhar University. His research interests include data mining and the Semantic Web.