


Kurdistan Journal of Applied Research (KJAR) | Print-ISSN: 2411-7684 – Electronic-ISSN: 2411-7706 |  kjar.spu.edu.iq 

Volume 2 | Issue 3 | August 2017  | DOI: 10.24017/science.2017.3.3 

 
Evaluation of Data Mining Features, Features Taxonomies and their Applications

 
http://dx.doi.org/10.24017/science.2017.3.3 

Shirin Noekhah  

Faculty of Computing, Universiti Teknologi 
of Malaysia, UTM,  

81300, Johor, Malaysia 

nshirin2@live.utm.my 

 
Naomie binti Salim  

Faculty of Computing, Universiti Teknologi 
of Malaysia, UTM,  

81300, Johor, Malaysia 

naomie@utm.my 

 
Nor Hawaniah Zakaria  

Faculty of Computing, Universiti Teknologi 
of Malaysia, UTM,  

81300, Johor, Malaysia 

hawaniah@utm.my 

 
Abstract: Over the last couple of decades, the World Wide Web has brought enormous improvements to people's lives. E-commerce emerged during this period and has changed the traditional approaches to selling products and services. Businesses use different techniques to discover market trends and analyze competitors' activities by exploiting the information in reviews, while potential customers use online opinions to make their purchase decisions. Opinion mining and sentiment analysis are among the most critical and fundamental domains of data mining, and they are useful for a variety of sub-domains such as opinion summarization, recommendation systems and opinion spam detection. Opinion mining and all its sub-branches can be performed efficiently only when there is a comprehensive understanding of the most effective features applied in those domains. To achieve the best results, the most appropriate set of features must be chosen for each case study, whether the task is classification or clustering. To the best of our knowledge, there is no extensive study and taxonomy of the wide range of features and their applications in opinion mining. In this paper, we comprehensively investigate the various types of features exploited in the sub-branches of the opinion mining domain. We present the most frequently used feature sets, including structural, linguistic and relation-based features, as a complete reference for further opinion mining research. The results show that using multiple types of features improves the accuracy of opinion mining applications.

 
Keywords: Opinion mining, Feature selection, Opinion spam, Recommendation system, Meta-data and content-based Features

1. INTRODUCTION 

With the rapid growth of e-commerce and social media, companies and businesses provide facilities for their customers to express their experiences and opinions about products or services. These positive or negative opinions strongly influence the reputation of a business, and they can raise or lower a company's sales [1, 2]. As opinions reflect the experience, thoughts, motivation and emotions of people, meaningful information can be extracted from their reviews. In opinion mining, unlike the other domains of data mining, the main focus is on opinion analysis and on how people express their opinions rather than on the subject of the opinion [3, 4]. Depending on the format requested by the website, an opinion can be expressed in three different ways: Pros and Cons (e.g., Cnet.com), Pros, Cons and detailed review (e.g., Epinions.com), and free format (e.g., Amazon.com) [6]. Extracted opinions can be used in academic research, by businesses and by potential customers (before purchasing products) [5]. Businesses usually want to know about market trends and needs, and also about their competitors. Research studies, on the other hand, need opinion analysis to develop opinion mining and data analysis applications. Traditional methods such as human evaluators or manual analysis cannot be applied to extract information from customers' opinions or market feedback, since they are time-consuming and less accurate. Instead, researchers exploit opinion mining techniques and applications to extract the desired information automatically from the abundance of valuable information scattered across the World Wide Web.

The information is extracted from different sets of features. These features are categorized as content-based features (e.g., word sentiment, POS tags, text similarity, etc.) or meta-data features (e.g., date/time, helpful feedback, number of reviews for a product, etc.). Meta-data features represent the behavioural information of entities (i.e., review, reviewer, group of reviewers and product), while content-based features present the textual information about them. Together they express different characteristics of different entities. Features are a critical part of any supervised or unsupervised technique. Although a wide range of meta-data and content-based features has been exploited in different mining applications, only limited combinations of them have been used in data mining. As a result, many techniques miss the opportunity to obtain more accurate results by selecting and applying an effective combination of features.

In this study, we not only evaluate different sets of features but also present the most popular and useful features that can be applied in different domains. This detailed investigation of features can be used as a reference for future research in different domains of data mining. We propose an iterative algorithm, based on a graph structure, which combines the most important features in data mining techniques. In addition, we propose four new features which can be applied to different sets of entities in data mining applications.

2. DATA MINING AND SENTIMENT ANALYSIS

People usually express their opinions through opinionated websites. Reviews scattered across opinionated websites such as Amazon.com, epinion.com, tripadvisor.com, and sellerranking.com have changed the way we purchase and made it more effective. Opinion mining supports a variety of new applications in different domains (e.g., politics, e-commerce, health, etc.). As a sub-branch of data mining, opinion mining contains techniques which can be applied to find patterns in the data or to analyze it [53, 54, 55]. Knowledge extraction from a corpus involves seven steps: determining the type of knowledge to be extracted, defining the desired group of data, pre-processing, cleaning the dataset, data mining, pattern recognition and extraction, and applying the discovered knowledge. In opinion mining the main focus is on sentiment extraction and its analysis.

Sentiment analysis can be used in different data mining applications such as opinion mining, opinion summarization, opinion searching, recommendation systems and opinion spam detection. Opinion mining [11, 44] can be used for sentiment identification and also for feature-based opinion mining. It can be performed at the document level, sentence level and feature level. To determine the polarity at these levels, researchers use corpus-based and dictionary-based approaches. In the corpus-based approach the co-occurrence of words identifies the polarity, while in the dictionary-based approach the sentiment is determined from the synonyms and antonyms of a set of seed words, using dictionaries such as WordNet. The first research on the problem of opinion mining was conducted by Turney (2002), who proposed an unsupervised learning algorithm to classify reviews into thumbs up and thumbs down [52]. The main problem of his method was the misclassification of terms whose orientation changes with the context.
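The dictionary-based approach described above can be illustrated with a minimal sketch. It is not Turney's algorithm and it does not call WordNet: the tiny `SYNONYMS` and `ANTONYMS` tables, the seed word and the function names are hypothetical stand-ins chosen for illustration. Polarity spreads from the seeds, with synonyms inheriting a label and antonyms receiving the opposite one.

```python
from collections import deque

# Hypothetical toy thesaurus standing in for a resource such as WordNet.
SYNONYMS = {
    "good": ["great", "fine"],
    "great": ["excellent"],
    "bad": ["poor"],
    "poor": ["awful"],
}
ANTONYMS = {
    "good": ["bad"],
    "excellent": ["awful"],
}


def _neighbours(relation, word):
    """Return words linked to `word` in either direction of a relation."""
    linked = set(relation.get(word, []))
    linked.update(w for w, others in relation.items() if word in others)
    return linked


def expand_polarity(seeds):
    """Propagate polarity (+1 / -1) from seed words through the thesaurus.

    Synonyms inherit the polarity of the word they are reached from;
    antonyms receive the opposite polarity. The first label found is kept.
    """
    polarity = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        for syn in _neighbours(SYNONYMS, word):
            if syn not in polarity:
                polarity[syn] = polarity[word]
                queue.append(syn)
        for ant in _neighbours(ANTONYMS, word):
            if ant not in polarity:
                polarity[ant] = -polarity[word]
                queue.append(ant)
    return polarity


def review_polarity(text, lexicon):
    """Sum word polarities; a positive total reads as 'thumbs up'."""
    tokens = text.lower().split()
    return sum(lexicon.get(tok.strip(".,!?"), 0) for tok in tokens)


lexicon = expand_polarity({"good": 1})
```

Because each word receives a single fixed label, a sketch of this kind shares the weakness noted above: it cannot handle a term whose orientation depends on the context.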

Opinion mining analysis faces problems such as domain dependency and conflicting opinion words. Because of domain dependency, it is difficult to determine the orientation of an opinion word by considering only the word and the features it describes, without the whole context. Conflicting opinion words in a context also cause inaccurate opinion analysis. To assign an orientation to an opinion word, we can use conjunctions and disjunctions, automatically derived morphological relationships, manual syntactic dependency rule templates, and WordNet synonyms, antonyms, IS-A relationships, negation modifiers and morphological relations.

The first systematic work on opinion mining and summarization was done by Hu and Liu (2004). Unlike traditional text summarization techniques, which only summarize the sentences of the reviews, they proposed a feature-based summarization system (FBS) which summarizes customer reviews at the sentence level by considering the opinions and the features on which those opinions are expressed [43]. The results showed that applying compactness and redundancy pruning to the frequent extracted features, and considering infrequent features rather than focusing only on frequent ones, improves the precision and recall of the proposed method. However, they considered only adjectives as opinion words and focused only on explicit opinions. There is a variety of study domains in opinion mining. Some studies focus on analysis of the opinion holder (the writer of the opinion, which can be an individual or an organization), some extract and summarize the features of reviews, and others analyze the sentiment of reviews and its strength. For example, a review about a product contains various types of features related to the object (e.g., camera, printer, etc.), its components (e.g., battery, hard disk, etc.) and its attributes (e.g., battery life, disk capacity, etc.), together with the opinions which describe the sentiment towards each part. Opinions about different types of entities can be expressed implicitly (the opinion is only implied, with no subject) or explicitly (the polarity of the opinion is mentioned directly).

Review, reviewer and target are the three main entities which form opinionated documents. Researchers have shown that each pair of these entities follows a power-law relationship. Most reviewers write a small number of reviews, while a few reviewers write many. The same holds for the relation between products and reviews: a small number of products receive a large number of reviews, while most products receive few. A similar pattern can be found for the number of feedbacks and reviews [42].

3. FEATURES OF OPINIONATED DOCUMENTS’ ENTITIES

Features can be grouped according to the entity they describe. On this basis, we have review-, reviewer-, group of reviewer- and target-centric features. Review-centric features are extracted from the information of the review. They can be either review text features (e.g., sentiment, number of words, etc.) or review meta-data (e.g., rate, date/time, feedback, etc.). Some review-centric features are domain dependent, which reduces the generality of the data mining techniques that use them. Reviewer-centric features capture the behaviour and characteristics of a reviewer through a holistic investigation of all reviews written by that reviewer. The main problem is that some reviewer-centric features are not available on some opinionated websites, so they cannot be applied generally to data mining across a variety of opinionated sources. Group of reviewer-centric features reveal the relationships among reviewers who work together to express their opinion or to change the sentiment trend of the opinionated document. These features are available only when reviewers have a motivation to work in a group. Finally, target-centric features describe different aspects of the targeted entity which the reviewer describes. As with reviewer-centric features, companies do not make this type of feature available for further analysis, which reduces the generality of the methods that rely on it. Some techniques prefer to focus on reviewer-centric features rather than review-centric features because they are easier to extract and trace.

Because each category of features has its limitations, the best approach is to exploit a combination of the most effective ones, since much useful information can be collected from reviews, products, and reviewers' shared profiles and activity patterns. Efficient data mining techniques are those which consider all entities, their relations and their associated features to produce more accurate results. In section 4.2, we describe our proposed algorithm and how it aggregates the features efficiently.

In this study, we investigate three groups of features, namely content-based features, meta-data features and relational-based features, along with their sub-categories. Opinionated websites offer different types of features, and different sets or combinations of them are exploited according to the needs of each application. On these websites, each product has its own profile along with the set of reviews written by different reviewers. Some websites even provide a profile for each reviewer which includes his or her reviews, location, helpful rate, etc. Each reviewer can post multiple reviews [10]. Each review has textual content features along with meta-data features. The features can be categorized as illustrated in Figure 1.

 
Figure 1: Data Mining Feature’s Taxonomy

 
In this study, we investigate the wide range of features exploited in different areas of data mining, including opinion spam detection, opinion summarization, sentiment analysis and recommendation systems. Features have been categorized based on the concept(s) which they describe for different entities. In Tables 1-5, we present different sets of features along with their definitions, the domains in which they are applied and the references of the techniques which use each feature. Some features are self-explanatory and therefore need no description.

3.1. Content-based Features  

Review text contains a variety of features which can reveal valuable information about the reviewer's opinion on different subjects and features, and about its strength. These features reflect either a linguistic or a semantic concept. To extract the feature values, we exploit different text mining algorithms and natural language processing (NLP) techniques according to the nature of the features. The values of content-based features are collected from the review body and used to evaluate the linguistic and semantic patterns of the review content [11, 12, 13, 14, 15, 16].

Some data mining techniques use only a very shallow set of textual features which are easily extracted or calculated from the review content. This reduces the accuracy of those techniques.

3.1.1. Sentiment-driven Features  

During the last decade, opinion mining has become a very important concept in the data mining domain, as governments, the private sector and individuals often need to know the overall sentiment about a given phenomenon. To achieve this goal, many powerful techniques have been proposed and many investigations have been performed in this area. Their common target is to determine the sentiment of the words people use about different features or topics. Sentiment analysis and the extraction of its related features have therefore become important tasks.

The sentiment or polarity of reviews or user-generated content is one of the main characteristics which people consider when they make a decision. This concept refers to the feelings, experiences and ideas of a reviewer about a product or service. It can be expressed for detailed features or for the whole product or service. On opinionated websites, the opinion can be represented through semantic expressions or through stars or rates, which imply three types of polarity: positive, neutral and negative. As shown in Table 1, this polarity can be extracted by exploiting different sets of features.

Sentiment analysis is a main part of any opinion mining application. In recommendation systems, the customers' preferences (positive or negative opinions) about the products must be known first, and then the system can make the best suggestion. In text and feature summarization, the method extracts the sentiment of each feature and then builds the summary based on the polarity of the opinionated words. Sentiment classifiers perform their analysis at the document, sentence or feature level, so different sets of features from Table 1 can be used to satisfy the classifier's needs. In addition, as some opinionated websites do not provide ratings and some reviewers (especially spammers) give a rating which does not match the review content, some sentiment analysis and classification techniques evaluate the sentiment of the review and assign a rate to each review based on its content [15]. Many tools and algorithms (e.g., NTUSD (NTU Sentiment Dictionary) [56]) are exploited to identify the polarity of opinion reviews.

Table 1: Sentiment-driven features

| No | Name of Feature | Description | Domain of Study | Reference |
|----|-----------------|-------------|-----------------|-----------|
| 1 | Review sentiment | Self-explanatory | OM, SD, RS, TS | [10, 13, 16, 17, 18, 19] |
| 2 | Polarity of emotion words | Positive/Negative of adjective, adverb or verbs | OM, SD, RS, TS | [10, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25] |
| 3 | Status of review | Bad/good review is after a good/bad review | SD, OM | [10, 17] |
| 4 | Review Group agreement | Whether the review has the same polarity with surrounding reviews | SD, OM | Authors |
| 5 | Polarity of features | Self-explanatory | OM, RS, TS | [26] |
| 6 | Number of reviews in Time window | Number of Positive/Negative reviews in TW | OM, SD, RS, TS | [27] |
| 7 | Opinion strength | Opinion severity for its polarity | OM, SD, RS, TS | [28] |
| 8 | Sentiment-Rate difference | Difference of sentence sentiment and rate | SD | Authors |

*RS = Recommendation System, OM = Opinion Mining, TS = Text Summarization, SD = Spam Detection.
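Two of the features in Table 1, Sentiment-Rate difference and Review Group agreement, are described only briefly, so the following is one possible reading rather than a definitive formulation. The sketch assumes that per-sentence sentiment scores lie in [-1, 1], that ratings run from 1 to 5 stars, and that the "surrounding reviews" are the neighbours within a fixed window in posting order. The function names and these scales are assumptions made for illustration.

```python
def sentiment_rate_difference(sentence_sentiments, rating, max_rating=5):
    """Absolute gap between mean text sentiment and the star rating.

    sentence_sentiments: per-sentence scores in [-1, 1] (assumed scale).
    rating: star rating in [1, max_rating] (assumed scale).
    Both values are mapped onto [0, 1] before they are compared.
    """
    if not sentence_sentiments:
        raise ValueError("at least one sentence sentiment is required")
    mean_sentiment = sum(sentence_sentiments) / len(sentence_sentiments)
    text_score = (mean_sentiment + 1) / 2
    rate_score = (rating - 1) / (max_rating - 1)
    return abs(text_score - rate_score)


def group_agreement(polarities, index, window=2):
    """Fraction of neighbouring reviews sharing the polarity of one review.

    polarities: review polarities in posting order (e.g. +1 / -1).
    index: position of the review under study.
    window: number of neighbours inspected on each side (assumed).
    """
    lo = max(0, index - window)
    hi = min(len(polarities), index + window + 1)
    neighbours = [
        p for i, p in enumerate(polarities[lo:hi], start=lo) if i != index
    ]
    if not neighbours:
        return 0.0
    same = sum(1 for p in neighbours if p == polarities[index])
    return same / len(neighbours)
```

Under these assumptions, a review with glowing text and a one-star rating yields a difference close to 1, and a single negative review surrounded by positive ones yields an agreement of 0.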

3.1.2. Syntactic and Semantic-driven Features  

The semantics of a word or review is the meaning or concept it conveys. Researchers have exploited this set of features to build semantic language models for similarity evaluation. They found that not only exact duplicates are similar to each other; reviews which are semantically similar through synonymous words can also be considered duplicates. The syntax of a word, on the other hand, refers to its grammatical role within a sentence of the review. The first work on semantic classification of reviews in opinion mining was performed by Dave et al. (2003). They applied information retrieval techniques along with a feature scoring method to classify the opinion of features and documents. They used a machine learning approach with the Rainbow text classification tool [46], the SVMlight package and a Naïve Bayes classifier with Laplace smoothing [44].

Most document representation techniques rely on the Bag-of-Words (BOW) model, which is commonly known as the Vector Space Model (VSM). Documents are represented as vectors which describe the occurrence of words in the textual corpus. In VSM, many semantic relations among concepts and their significant information are lost, which reduces the accuracy of the technique. The other problem with VSM is that a long document is difficult to represent as a vector because of its large size. The details of syntactic and semantic-driven features and the clues which can be extracted from the review content are given in Table 2.
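The Bag-of-Words representation discussed above can be sketched in a few lines of plain Python. The whitespace tokenizer, the `tf * log(N / df)` weighting (one common TF-IDF variant among several) and the function names are assumptions made for illustration; they are not taken from any of the surveyed techniques.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a document to term counts (word order is discarded)."""
    return Counter(text.lower().split())


def tf_idf(docs):
    """Weight each term by tf * log(N / df), one common TF-IDF variant."""
    bags = [bag_of_words(d) for d in docs]
    n_docs = len(bags)
    doc_freq = Counter(term for bag in bags for term in bag)
    return [
        {t: count * math.log(n_docs / doc_freq[t]) for t, count in bag.items()}
        for bag in bags
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Two reviews containing the same words in a different order, such as "great battery life" and "battery life great", obtain a cosine similarity of 1 under this representation, which illustrates the loss of relational information noted above.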

Table 2: Syntactic and Semantic-driven features

| No | Name of Feature | Description | Domain of Study | Reference |
|----|-----------------|-------------|-----------------|-----------|
| 1 | Number of words, sentences (length of review) | Self-explanatory | OM, SD, TS | [10, 15, 16, 17, 19, 20, 21, 23, 25, 26, 29, 30, 31, 32, 33, 34] |
| 2 | Number of noun, adjective, etc. | Self-explanatory | OM, SD, RS, TS | [26, 32, 35] |
| 3 | Rate of brand/product name | Percentage or rate of repetition of brand/product name | OM, SD, RS, TS | [10, 17, 20, 21] |
| 4 | Review content similarity | Content similarity of current review with other reviews | SD | [10, 11, 15, 17, 20, 23, 24, 25, 27, 30, 31, 32, 34, 36, 37, 38, 39, 40] |
| 5 | Text generality | Whether the review is general or not | OM, SD | [38] |
| 6 | N-gram feature | N-gram noun phrases (unigram/bigram); the combination order of terms | OM, SD, RS, TS | [10, 15, 16, 17, 18, 20, 26, 27, 29, 31, 33, 34, 35, 36, 38, 39, 40, 41] |
| 7 | Percentage of capital word | Self-explanatory | OM, SD | [10, 15, 17, 20, 42] |
| 8 | Percentage of numerals word | Self-explanatory | OM, SD | [10, 16, 17, 19] |
| 9 | Distribution of POS | Self-explanatory | OM, SD, RS, TS | [18, 29, 31, 32, 35, 43, 44] |
| 10 | Term frequency-inverse document frequency (TF-IDF) and Bag-of-Words | Numerical concept which refers to how much a word is frequent within the document | OM, SD, RS, TS | [15, 16, 19, 30, 33, 35, 36, 45] |
| 11 | Subjectivity/Objectivity of review | Whether the review is objective or subjective | OM, SD, TS | [15, 20] |
| 12 | Pronoun | First/second/third person | SD, OM | [15, 16, 18, 19, 20, 21, 22, 23, 35] |
| 13 | Ratio of grammatical words | Ratio of question, exclamation, punctuation and html tags | OM, SD, TS | [15, 16, 19, 20, 26] |
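Several of the surface features in Table 2 (percentage of capital words, percentage of numerals, ratio of question and exclamation marks) are listed as self-explanatory. A minimal sketch of how they might be computed is given here. The exact definitions are assumptions: a "capital word" is taken to be a word of at least two letters written entirely in upper case, and every ratio is measured against the number of whitespace-separated tokens.

```python
import re


def surface_features(review):
    """Compute a few surface ratios over whitespace-separated tokens."""
    tokens = review.split()
    if not tokens:
        return {"capital_ratio": 0.0, "numeral_ratio": 0.0, "punct_ratio": 0.0}
    # Strip surrounding punctuation before testing the word itself.
    words = [tok.strip(".,!?;:\"'()") for tok in tokens]
    # Assumed definition: fully upper-case alphabetic words, two letters or more.
    capital = sum(1 for w in words if len(w) > 1 and w.isalpha() and w.isupper())
    # Integers or simple decimals count as numerals.
    numeral = sum(1 for w in words if re.fullmatch(r"\d+(\.\d+)?", w))
    # Question and exclamation marks anywhere in the raw text.
    punct = len(re.findall(r"[!?]", review))
    n = len(tokens)
    return {
        "capital_ratio": capital / n,
        "numeral_ratio": numeral / n,
        "punct_ratio": punct / n,
    }
```

For the review "BEST phone ever!! Bought 2 units." the sketch counts six tokens, one capital word, one numeral and two exclamation marks.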

 
For feature extraction, we need a mechanism which can parse the document according to the positions and roles of its phrases. POS tagging is such a mechanism: it identifies the syntactic or morphological role (noun, adjective, pronoun, verb, adverb, preposition, conjunction, interjection, etc.) of a specific phrase and its linguistic construction in the sentence. POS tagging is one of the most important pre-processing steps in data analysis, as it helps to determine the grammatical structure of the document. Evaluating adjectives, adverbs and pronouns through POS tagging can reveal a variety of emotions and opinions hidden in the review's sentences. It helps researchers to identify the implicit and explicit opinions expressed in the reviews.

POS tagging has a wide range of uses in data mining. A simple pre-processing task involves dividing text into meaningful segments through boundary detection. In most cases, a period (.), an exclamation mark (!) or a question mark (?) is the usual signal that indicates a sentence boundary. This step is applied in text processing approaches such as information extraction, text summarization, semantic role labelling, machine translation, syntactic parsing and plagiarism detection. NLProcessor can be used to produce POS tags and syntactic chunks. The output of NLProcessor is an XML file which shows the reviews along with their POS tags [43].
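The sentence boundary detection step mentioned above can be approximated with a regular expression. The sketch below is a naive illustration and not the behaviour of NLProcessor: it splits after a period, exclamation mark or question mark that is followed by whitespace, so abbreviations such as "e.g." would be split incorrectly. The function name is an assumption.

```python
import re

# Split at whitespace that follows '.', '!' or '?'.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Naive sentence splitter using terminal punctuation as the signal."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]
```

For example, `split_sentences("Great camera! Battery lasts long. Worth it?")` returns the three sentences with their terminal punctuation preserved.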

Punctuation and writing style are other indicators applied in many data mining applications, especially in sentiment analysis. Alongside writing style, the syntactic expressions used by the reviewer can be analyzed to determine the writing logic of the review. First, second and third person pronouns are three indicators which some researchers use in different opinion mining applications. These pronouns are widely used in opinion mining to evaluate the opinion of the reviewer or of other people who deal with the product or service. In opinion spam detection, this feature can be used to distinguish between spammers and non-spammers. For example, some researchers [22, 35, 39] believe that spammers tend to use second and third person pronouns, either to distance themselves from the responsibility of lying or because they lack personal experience of the product; both cases produce psychological distancing. Other researchers [18, 29] believe that first person pronouns are more prevalent in spam reviews, as spammers try to increase the credibility of their reviews and show that they had such an experience. In studies of psychological deception and lying, researchers [22, 34] believe that liars avoid first person pronouns so as not to take ownership of the lie; in opinion spam, by contrast, the spammer uses first person pronouns to make the review more convincing and impressive. In this respect, the nature of spam reviews differs from that of ordinary lying [34].
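The pronoun indicator discussed above can be turned into a numeric feature by measuring the share of first, second and third person pronouns among the tokens of a review. The sketch below makes its own assumptions: the pronoun lists are a small hand-written sample rather than a complete inventory, and the tokenizer is a simple regular expression.

```python
import re

# Hand-written sample lists (assumed, not exhaustive).
PRONOUNS = {
    "first": {"i", "me", "my", "mine", "myself",
              "we", "us", "our", "ours", "ourselves"},
    "second": {"you", "your", "yours", "yourself", "yourselves"},
    "third": {"he", "him", "his", "she", "her", "hers", "it", "its",
              "they", "them", "their", "theirs"},
}


def pronoun_ratios(review):
    """Return the share of tokens that are 1st, 2nd or 3rd person pronouns."""
    tokens = re.findall(r"[a-z']+", review.lower())
    if not tokens:
        return {person: 0.0 for person in PRONOUNS}
    return {
        person: sum(tok in words for tok in tokens) / len(tokens)
        for person, words in PRONOUNS.items()
    }
```

Whether a high first person ratio points towards or away from spam is exactly the disagreement described above, so such ratios are better treated as inputs to a classifier than as a decision rule on their own.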

Domain dependency is the main weakness of data mining techniques which rely on semantic and syntactic features. Such techniques can be exploited only for a specific domain of study, so more robust techniques are needed for cross-domain data mining. Although content-based features cannot provide all the information required for data mining activities, they present strong clues which can help us to develop a variety of applications in different sub-domains of data mining.

3.2. Meta-data Features  

Apart from content-based features, meta-data features describe additional information about the review, the reviewer and his or her behaviour which cannot be extracted from the text of the review. The main sub-categories of meta-data features are rating, date/time, helpfulness and position.

3.2.1. Rating-driven Features  

Rating is one of the most popular features and is widely used by a variety of data mining techniques. This feature can influence the popularity trend of a product. The review content should match its corresponding rating, so some techniques have been developed to evaluate this match. As mismatched reviews can reduce the accuracy of opinion mining techniques, detecting and filtering such irrelevant reviews (e.g., advertisement reviews or non-opinion reviews) can improve the results of those techniques. These reviews play a critical role especially in spamming activities, when spammers try to change the trend of a product's average rating without spending time on writing detailed reviews. On opinionated websites, ratings are presented in different formats, including stars (from 1 star to 5 stars), numbers (from 1 to 5) or a binary value (thumbs up or thumbs down). These different forms of rating are usually normalized to the range [0, 1]. The details of the rating-driven features are presented in Table 3.

Table 3: Rating-driven features

| No | Name of Feature | Description | Domain of Study | Reference |
|----|-----------------|-------------|-----------------|-----------|
| 1 | Review rating | Rating of the review | OM, SD, RS | [10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 23, 26, 27, 46, 47, 48, 49] |
| 2 | Rate deviation | Deviation from average rating | SD | [10, 11, 15, 17, 20, 23, 25, 26, 27, 31, 34, 36, 37, 50, 51] |
| 3 | Similar rating reviews | Number of similar rates which a reviewer gives to a product(s) | SD | Authors |
| 4 | Extremity of rating | Extremity of rating | SD, RS, TS, OM | [15, 34, 37, 38] |
| 5 | Burst review rate | Rate of reviews posted in burstiness intervals | SD, RS, OM | [27, 38] |
| 6 | Feedback number | Number of feedbacks which are assigned to the specific review | SD, OM | [10, 17] |
| 7 | Helpful Feedback number | Self-explanatory | SD, OM | [10, 14, 17, 23, 27, 32] |

 
The rating feature is important for the review, the reviewer and the product. The average rating of a product indicates its popularity. The rating deviation of a review can give an important signal about the truthfulness of that review. Finally, analysing the rating patterns of a reviewer (number of positive/negative ratings, similarity in ratings and ratings for a specific product category) reveals important characteristics of the reviewer's behaviour.
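The normalization of ratings to [0, 1] and the rate deviation feature of Table 3 can be sketched as follows. Min-max scaling and the use of the absolute difference from the product mean are assumptions made for illustration, since the surveyed techniques may define the deviation differently; the function names are also hypothetical.

```python
def normalise_rating(value, low=1, high=5):
    """Min-max scale a rating from [low, high] onto [0, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return (value - low) / (high - low)


def rate_deviation(rating, product_ratings, low=1, high=5):
    """Absolute gap between one rating and the product's mean rating.

    Both values are normalised first, so the result lies in [0, 1].
    """
    if not product_ratings:
        raise ValueError("product_ratings must not be empty")
    mean = sum(product_ratings) / len(product_ratings)
    return abs(
        normalise_rating(rating, low, high) - normalise_rating(mean, low, high)
    )
```

With this scaling a thumbs up encoded as 1 on a 0 to 1 scale maps to 1.0, and a one-star review of a product whose ratings are 5, 5, 4, 5 and 1 (mean 4.0) has a deviation of 0.75.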

Review feedback is a feature which some opinionated websites provide so that users can give their opinion about the usefulness of a review's content. Feedback can be given by assigning a binary value to the review to show whether it is helpful or not. This factor shows the level of satisfaction of the readers who find the review useful, informative and effective. Helpfulness can be considered a factor which increases the credibility of a review. Likewise, the more helpful rates a reviewer's reviews gain, the more reliable that reviewer will be. It should be noted that, like ratings, helpfulness can be affected by spammers.

3.2.2. Time-driven Features  

Time-related features are based on the posting date and time of the review. In data mining applications, we can use fixed time units such as hour, day, week, month and year, or a customized time unit (e.g., a three-week interval). Table 4 explains different types of time-driven features along with their applications in different domains of data mining.

Table 4: Time-driven features

| No | Name of Feature | Description | Domain of Study | Reference |
|----|-----------------|-------------|-----------------|-----------|
| 1 | Date/Time of review | Self-explanatory | SD, RS, TS, OM | [11, 14, 15, 16, 19, 20, 31, 34, 41, 42, 51] |
| 2 | Time window | Time interval | SD, RS, TS, OM | [11, 23, 27, 39, 40, 46, 49] |
| 3 | Time window review | The number of reviews in time interval | SD, RS, OM | [34, 38] |
| 4 | Max. number of reviews per day | Self-explanatory | OM, SD | [15, 25, 37] |
| 5 | Burstiness window | Number of days between first and last review in density time intervals | OM, SD, RS | [15, 25, 27, 36] |
| 6 | Early deviation | First review rate deviation from average rating | SD, OM | [27, 31, 49] |
| 7 | Burst review rate | Number of reviews which appear in the product's burstiness | SD, RS, TS, OM | [36, 38] |
| 8 | Early time frame | Spammers review early to increase the impact | SD | [15, 34, 37] |
| 9 | Arriving-Writing time | Time between registration and writing of reviewer | SD, RS | Authors |
| 10 | Review Rank (position) | Order among all reviews | SD | [10, 15, 16, 17, 19, 42] |
| 11 | First position review | Whether review is in first position or not | SD | [10, 16, 17, 19, 20, 23] |

 
Review distribution can be analyzed at different scales of time window. A time window is the interval between any two consecutive temporal points. Various types of information required by different data mining applications (e.g., product popularity trend analysis, recommendation systems and summarization of popular features) can be extracted by evaluating review prevalence. In some intervals, the number of reviews for a specific product increases dramatically. This burstiness (a large number of reviews within a short time interval) can have several causes, including a product release, a promotion period, or fake reviews posted by a product’s owner to change the popularity trend of his own product or a competitor’s product. If the number of reviews in a specific time interval exceeds a threshold, that interval can be considered a burstiness time interval. To investigate burstiness, we need to standardize the format of the review posting date.
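The thresholding idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the accepted date formats, the window length of seven days, the threshold value, and the names `parse_review_date` and `bursty_windows` are hypothetical choices and do not come from the surveyed studies.

```python
from collections import Counter
from datetime import date, datetime

# Assumed input formats; real review sites may use others.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")


def parse_review_date(raw: str) -> date:
    """Standardize a posting date given in one of the assumed formats."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


def bursty_windows(raw_dates, window_days=7, threshold=3):
    """Return {window_index: review_count} for windows above the threshold.

    Window 0 starts at the earliest review date of the product.
    """
    dates = sorted(parse_review_date(d) for d in raw_dates)
    if not dates:
        return {}
    start = dates[0]
    # Assign every review to a fixed-length window counted from the first review.
    counts = Counter((d - start).days // window_days for d in dates)
    return {w: n for w, n in counts.items() if n > threshold}
```

In practice, the window length and the threshold would need to be tuned per product category, since a count that is unusual for a niche product may be normal for a popular one.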

Some websites, such as Amazon.com, copy all the reviews of one version of a product to all other versions of the same product (i.e., versions that differ only in color). Identifying this duplication can improve opinion analysis results. We can perform this task by assessing the posting dates of the reviews. Duplicated reviews can also result from spamming activities, in which spammers post a huge number of reviews on the same day to change the rating trend of a product. It is therefore a complicated task to distinguish whether a duplication is due to the website’s policy or to spamming activities.
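As a rough illustration of how posting dates can help here, the following Python sketch groups reviews with identical text and labels only the clear-cut case. The rule used (same reviewer, same date, and a different product version for every copy) is an assumed heuristic for this example and is not a method taken from the surveyed papers; the record keys and the function name `classify_duplicates` are likewise hypothetical.

```python
from collections import defaultdict


def classify_duplicates(reviews):
    """Group reviews with identical text and label each duplicated group.

    Each review is a dict with the assumed keys
    'reviewer', 'product', 'date' and 'text'.
    """
    groups = defaultdict(list)
    for r in reviews:
        # Normalize case and whitespace so trivial differences are ignored.
        groups[" ".join(r["text"].lower().split())].append(r)

    labels = {}
    for text, group in groups.items():
        if len(group) < 2:
            continue  # not a duplicate
        same_author = len({r["reviewer"] for r in group}) == 1
        same_date = len({r["date"] for r in group}) == 1
        distinct_products = len({r["product"] for r in group}) == len(group)
        if same_author and same_date and distinct_products:
            labels[text] = "likely site copy"
        else:
            labels[text] = "needs inspection"
    return labels
```

Groups labeled "needs inspection" are not necessarily spam; they are simply the cases that the posting date alone cannot resolve.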

The time interval between two reviews of a reviewer or product can be an important signal for techniques that track the reviewer’s behavior (e.g., the activeness of the reviewer). Early time frame refers to how early the reviewer writes a review. This feature is important because reviews in the top positions (i.e., reviews posted soon after the product is launched) can influence the product’s popularity. If the launch date of a product is not specified on the website, we consider the date of the first review as the launch date of that product. The reviewing activity of a reviewer refers to the time period between the reviewer’s first and last reviews. Reviewers who write reviews after a reasonable time are less likely to be spammers than those who create an account, post some reviews and never use that account again. Time-driven features are therefore very important in data mining domains and, more specifically, in the spam detection field.
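The two reviewer-level quantities described above can be computed directly from posting dates. The Python sketch below is a minimal example; the function names `days_after_launch` and `activity_span_days` are hypothetical, and measuring both features in whole days is an assumption made for illustration.

```python
from datetime import date


def days_after_launch(review_date, product_review_dates, launch_date=None):
    """Days between the product launch and this review.

    If the launch date is unknown, the earliest review date of the product
    is used instead, as described in the text.
    """
    launch = launch_date or min(product_review_dates)
    return (review_date - launch).days


def activity_span_days(reviewer_review_dates):
    """Days between a reviewer's first and last review (0 for one review)."""
    return (max(reviewer_review_dates) - min(reviewer_review_dates)).days
```

A small value of `days_after_launch` corresponds to an early review, and a very short activity span corresponds to an account that was used briefly and then abandoned.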

3.3. Relational-based Features  

Some groups of features are especially significant for opinion spam detection or opinion mining. As mentioned before, the entities have relationships with each other. Considering these relationships can improve the accuracy of data mining techniques. For example, the number of reviews that a reviewer writes for a group of products reveals the relationship between reviewer, reviews and products. The relational-based features are listed in Table 5.

Table 5: Relational-based features

| No | Name of Feature | Description | Domain of Study | Reference |
|----|-----------------|-------------|-----------------|-----------|
| 1 | Singleton review | Whether the review is reviewer’s sole review or not | OM, SD | [15] |
| 2 | Ratio of Singleton reviews | Number of singleton reviews among all product’s reviews | OM, SD | [38, 46, 48] |
| 3 | Proportion of positive singleton | Number of positive singleton reviews among all the reviews | OM, SD, TS | [48] |
| 4 | Only review | Whether this review is the only product’s review | OM, SD | [10, 17] |
| 5 | Group rating deviation | Reviews’ rating deviation of a group of reviewers | OM, SD | [31, 49] |
| 6 | Group content similarity | Reviews’ content similarity of group of reviewers | OM, SD | [31, 49] |
| 7 | Group early time frame | First group of reviewers | | [31, 49] |

 
In spamming activities, reviewers are either singleton reviewers or multi-review reviewers. If a reviewer writes only one review, we call that reviewer a singleton reviewer and that review a singleton review. Multi-review reviewers can change the market trend for a specific product or group of products, so this concept should be considered in data mining applications. These two types of reviewers behave differently, so simple methods cannot detect their activities accurately.
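To make the singleton features of Table 5 concrete, the following Python sketch computes the ratio of singleton reviews and the proportion of positive singleton reviews for one product. The tuple layout, the function name `singleton_features`, and the cut-off of four stars for a positive rating are assumptions made for this example only.

```python
from collections import Counter


def singleton_features(reviews, product, positive_min=4):
    """Compute singleton-review features for one product.

    `reviews` is a list of (reviewer, product, rating) tuples covering the
    whole dataset; `positive_min` is an assumed cut-off for a positive rating.
    """
    # A reviewer is a singleton reviewer if they wrote exactly one review overall.
    per_reviewer = Counter(reviewer for reviewer, _, _ in reviews)
    product_reviews = [r for r in reviews if r[1] == product]
    if not product_reviews:
        return {"singleton_ratio": 0.0, "positive_singleton_ratio": 0.0}
    singletons = [r for r in product_reviews if per_reviewer[r[0]] == 1]
    positive = [r for r in singletons if r[2] >= positive_min]
    n = len(product_reviews)
    return {
        "singleton_ratio": len(singletons) / n,
        "positive_singleton_ratio": len(positive) / n,
    }
```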

Reviewers who write multiple reviews for a single product are more likely to be review spammers. The proportion of positive singleton reviews is a good indicator for investigating this probability. Spammers usually try to post their reviews as singleton reviews by posting different reviews under different user IDs. In this case, methods developed to detect multi-review reviewers cannot catch them. Spammers hired by companies usually try to write a bulk of reviews in a short period of time under different user IDs in order to avoid detection by existing methods. As we can see in Table 5, singleton reviewers have different characteristics, which makes them difficult to detect. On the other hand, reviewers sometimes work within a group to increase the influence of their reviews. This group activity reveals a set of significant features that can be exploited in various domains of data mining, especially in opinion spam detection and spammer group detection. In a group of spammers, the rating behavior, review content and review posting time are similar. In most cases, the group’s average rating deviates from the targeted product’s average rating. All these signals help us to improve our prediction results.
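One simple way to quantify the rating gap between a group and the remaining reviewers is sketched below in Python. This is an illustrative measure under stated assumptions, not the exact formulation used in [31, 49]: the function name `group_rating_deviation`, the dictionary layout, and the normalization by 4.0 (the largest possible gap on a 1-5 star scale) are all hypothetical.

```python
from statistics import mean


def group_rating_deviation(ratings, group, product, scale=4.0):
    """Normalized gap between a group's mean rating and everyone else's.

    `ratings` maps (reviewer, product) -> rating; `scale` is the assumed
    maximum possible gap (4.0 on a 1-5 star scale).
    """
    inside = [v for (u, p), v in ratings.items() if p == product and u in group]
    outside = [v for (u, p), v in ratings.items() if p == product and u not in group]
    if not inside or not outside:
        return 0.0  # nothing to compare against
    return abs(mean(inside) - mean(outside)) / scale
```

A value close to 1 means the group rates the product very differently from other reviewers, which is one of the signals mentioned above.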

The evaluation of existing techniques and the sets of features they exploit shows that considering features extracted from the relationships among entities can improve the accuracy and generality of a proposed technique. However, we need to know which combination of features is the most useful and informative, and which is applicable to our research. In the next section, we present our proposed graph-based model, which considers these relationships among entities.

4. RESULT AND DISCUSSION 

Different entities in opinion mining have their own characteristics, which are described through various sets of features. However, considering these individual sets of features cannot reveal all the characteristics and hidden relationships that exist among entities. Opinion mining entities can influence each other through reviewing activities (as mentioned in Section 3.3).

4.1. Feature Prevalence in Data Mining Applications

In this section, we investigate the existing sub-categories of data mining techniques. As can be seen from Tables 1-5, different domains of data mining use various sets of features. This evaluation can help researchers to know which sets of features are commonly applied in a desired domain and to what extent they are important. The results of this analysis are illustrated in Figure 2.

 
Figure 2: Prevalence of Data Mining Features in Different Domains

 
As illustrated in Figure 2, opinion mining techniques mostly use syntactic and semantic features related to the review content. Recommendation systems find that syntactic features together with sentiment analysis features give the best results for their methods. Text summarization uses all four categories almost equally, but only a few features from each category. An interesting result that can be drawn from Figure 2 is that opinion spam detection techniques use all four categories, with a high number of features from each. The most common category in the opinion spam detection domain is syntactic features, as professional reviewers write reviews in such a way that they cannot be detected easily. In this case, opinion spam detection techniques need more complicated features to capture the spamming clues. Another result from Figure 2 is that relational-based features are mostly used by opinion mining and opinion spam detection techniques, because for these two types of techniques they provide more useful and important information about the entities that work together to generate the opinionated document.


4.2. Multi-iterative Graph-based Structure for Data Feature Extraction

The graph structure is represented as a tripartite network. 

In this structure, review and product are connected 

through the “belonged relationship” link. Review and 

reviewer are connected through the “posted relationship” 

link. Reviewer and product are connected through the 

“reviewed relationship” link.  
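A tripartite structure of this kind can be represented with plain adjacency sets. The Python sketch below is one possible representation and not the authors’ implementation; the record layout, the `(kind, id)` node encoding and the function name `build_review_graph` are assumptions introduced for illustration.

```python
from collections import defaultdict


def build_review_graph(records):
    """Build adjacency sets for a reviewer-review-product graph.

    `records` is an iterable of (review_id, reviewer_id, product_id) tuples.
    Nodes are (kind, id) pairs so that ids of different entity types
    cannot collide.
    """
    graph = defaultdict(set)

    def link(a, b):
        # Links are undirected: store each one in both directions.
        graph[a].add(b)
        graph[b].add(a)

    for review_id, reviewer_id, product_id in records:
        review = ("review", review_id)
        reviewer = ("reviewer", reviewer_id)
        product = ("product", product_id)
        link(review, product)    # "belonged" relationship
        link(review, reviewer)   # "posted" relationship
        link(reviewer, product)  # "reviewed" relationship
    return graph
```

With this layout, the neighbors of a reviewer node are the reviews that reviewer posted and the products that reviewer reviewed, which is the information a relational feature needs.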

The proposed model combines a multi-iterative algorithm with a graph representation of entities to perform feature extraction for data mining. It focuses on finding the inter-relations and intra-relations among entities, their joint and disjoint features, and how they connect with each other, in order to obtain an effective feature selection process. The main advantage of this structure is that it monitors the behavior of the entities and produces more accurate feature values for different data mining applications. All entities are evaluated simultaneously, and a new set of relational-based features is produced iteratively. The graph-based model is flexible and scales linearly, so it can be generalized to other domains of data mining. The proposed model is presented in Figure 3.

 
Figure 3: Multi-Iterative Graph-based Model for Opinion Feature Extraction

 
In Figure 3, the red curve represents the concept of iteration. The multi-iterative feature extraction algorithm is introduced to capture the relations among entities that are revealed during the iteration phase. For example, when we evaluate the features of a reviewer individually, we have no knowledge of the features of the reviews he has posted. After we evaluate the whole graph structure, we are informed about their relations. The multi-iterative algorithm adjusts the values of relational-based features over several iterations (in our study, we stop once the rate of change of the feature values falls below a threshold of 0.01).

The variety of features, relations and their possible assigned values causes the graph size to become exponential and hard to control. A previous study found that a general Markov random field (MRF) model becomes useless for such a large network, so we propose an iterative algorithm. We integrate the pieces of information extracted from the proposed graph structure. The iterative algorithm then updates each entity’s feature values based on its neighboring entities’ features, the results of the previous iteration, and the inter- and intra-relationships among them.

Each entity has a set of features; in our method, their values are normalized and a score is assigned to them. The features of each entity are integrated through a linear combination. Our proposed algorithm has two main steps: initialization and iterative computation. In the initialization step, the entities are initialized using the values of the extracted features. In the iterative step, we use the previously calculated score to update the current feature score of the entity. Finally, after the algorithm converges, we use the resulting values as the final values of the entities’ features.
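The two steps can be sketched in Python as follows. The paper does not give the update formula, so this is only one plausible instance: min-max normalization, the feature weights, the blending factor `alpha`, and the rule of averaging the neighbors’ previous scores are all assumptions, as are the function names `initial_scores` and `iterate_scores`. Only the stopping threshold of 0.01 is taken from the text.

```python
def initial_scores(features, weights):
    """Initialization: linear combination of min-max normalized features.

    `features` maps entity -> list of raw feature values;
    `weights` is a list of assumed feature weights.
    """
    columns = list(zip(*features.values()))
    normalized = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant feature carries no information and is mapped to 0.0.
        normalized.append([(v - lo) / span if span else 0.0 for v in col])
    return {
        entity: sum(w * normalized[j][i] for j, w in enumerate(weights))
        for i, entity in enumerate(features)
    }


def iterate_scores(graph, init, alpha=0.5, tol=0.01, max_iter=100):
    """Iteration: blend each entity's initial score with its neighbors' scores.

    Stops when the largest change of any score drops below `tol`.
    """
    scores = dict(init)
    for _ in range(max_iter):
        updated = {}
        for node, own in init.items():
            neighbours = graph.get(node, ())
            if neighbours:
                # Use the scores from the previous iteration only.
                avg = sum(scores[n] for n in neighbours) / len(neighbours)
                updated[node] = alpha * own + (1 - alpha) * avg
            else:
                updated[node] = own
        change = max(abs(updated[n] - scores[n]) for n in scores)
        scores = updated
        if change < tol:
            break
    return scores
```

Because every update is computed from the previous iteration’s scores, all entities are effectively evaluated simultaneously, which matches the description of the model above.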

Because this model can reveal the most useful features of entities, it can serve as a framework for any data mining technique. It can be the main part of the feature extraction phase and provide the desired information for further analysis.

5. CONCLUSION 

In this paper, we performed a comprehensive study of the different types of features exploited in different domains of data mining and information retrieval. The features have been categorized into content-based features, meta-data features and relational-based features. Each feature has its own characteristics and can be effective to different degrees in various applications. The main goal of this study was therefore to provide a reference for researchers to select the most effective set of features and combine them based on the scope and application of their research. We proposed four new features which can improve data mining techniques. Finally, we proposed a graph-based model for feature extraction which can reveal the relationships among different entities.

  
6. REFERENCES

[1] NN. Ho-Dac, SJ. Carson and WL. Moore, The effects of positive and negative online customer reviews: do brand strength and category maturity matter?, Journal of Marketing, pp. 37-53, 2013.

[2] F. Zhu and X. Zhang, Impact of online consumer reviews on sales: The moderating role of product and consumer characteristics, Journal of Marketing, pp. 133-148, 2010.

[3] JW. Pennebaker and LA. King, Linguistic styles: language use as an individual difference, Journal of Personality and Social Psychology, 1999.

[4] D. Shapiro, Psychotherapy of neurotic character, Basic Books, 1999.

[5] J. Evelyn, Online shopping-Unabridged Guide, Emereo Publishing, 2012.

[6] SP. Algur, AP. Patil, PS. Hiremath and S. Shivashankar, Conceptual level similarity measure based review spam detection, In Signal and Image Processing (ICSIP), International Conference, pp. 416-423, 2010.

[7] AK. McCallum, Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering, 1998.

[8] MF. Porter, An algorithm for suffix stripping, Program, vol. 14, pp. 130-137, 1980.

[9] K. Dave, S. Lawrence and DM. Pennock, Mining the peanut gallery: Opinion extraction and semantic classification of product reviews, In Proceedings of the 12th international conference on World Wide Web, ACM, pp. 519-528, 2003.

[10] N. Jindal and B. Liu, Analyzing and detecting review spam, In Data Mining, ICDM, Seventh IEEE International Conference, pp. 547-552, 2007.

[11] G. Wang, S. Xie, B. Liu and PS. Yu, Review graph based online store review spammer detection, In Data Mining (ICDM), IEEE 11th International Conference, pp. 1242-1247, 2011.

[12] A. Ghose, PG. Ipeirotis and A. Sundararajan, Opinion mining using econometrics: A case study on reputation systems, In Annual Meeting-Association for Computational Linguistics, p. 416, 2007.

[13] L. Akoglu, R. Chandy and C. Faloutsos, Opinion Fraud Detection in Online Reviews by Network Effects, ICWSM, 2013.

[14] AA. Hammad and A. El-Halees, An approach for detecting spam in Arabic opinion reviews, International Arab Journal of Information Technology, vol. 12, no. 1, pp. 10-16, 2015.

[15] J. D’Onfro, A Whopping 20% Of Yelp Reviews Are Fake, http://read.bi/1M03jxl, 2013.

[16] YR. Chen and HH. Chen, Opinion spam detection in web forum: a real case study, In Proceedings of the 24th International Conference on World Wide Web, pp. 173-183, 2015.

[17] N. Jindal and B. Liu, Review spam detection, In Proceedings of the 16th international conference on World Wide Web, pp. 1189-1190, 2007.

[18] J. Li, M. Ott, C. Cardie and EH. Hovy, Towards a General Rule for Identifying Deceptive Opinion Spam, pp. 1566-1576, 2014.

[19] YR. Chen and HH. Chen, Opinion spammer detection in web forum, In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 759-762, 2015.

[20] F. Li, M. Huang, Y. Yang and X. Zhu, Learning to identify review spam, In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, p. 2488, 2011.

[21] KH. Yoo and U. Gretzel, Comparison of deceptive and truthful travel reviews, Information and communication technologies in tourism, pp. 37-47, 2009.

[22] ML. Newman, JW. Pennebaker, DS. Berry and JM. Richards, Lying words: Predicting deception from linguistic styles, Personality and Social Psychology Bulletin, pp. 665-75, 2003.

[23] Y. Lu, L. Zhang, Y. Xiao and Y. Li, Simultaneously detecting fake reviews and review spammers using factor graph model, In Proceedings of the 5th annual ACM web science conference, pp. 225-233, 2013.

[24] JG. Thanikkal and M. Danish, A novel approach to improve spam detection using SDS algorithm, International Journal, 2015.

[25] A. Mukherjee, V. Venkataraman, B. Liu and NS. Glance, What yelp fake review filter might be doing?, In ICWSM, 2013.

[26] S.-M. Kim, P. Pantel, T. Chklovski and M. Pennacchiotti, Automatically assessing review helpfulness, In EMNLP, 2006.

[27] EP. Lim, VA. Nguyen, N. Jindal, B. Liu and HW. Lauw, Detecting product review spammers using rating behaviors, In Proceedings of the 19th ACM international conference on Information and knowledge management, pp. 939-948, 2010.

[28] A-M. Popescu and O. Etzioni, Extracting Product Features and Opinions from Reviews, EMNLP-05, 2005.

[29] M. Ott, Y. Choi, C. Cardie and JT. Hancock, Finding deceptive opinion spam by any stretch of the imagination, In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 309-319, 2011.

[30] H. Sun, A. Morales and X. Yan, Synthetic review spamming and defense, In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1088-1096, 2013.

[31] A. Mukherjee, B. Liu and N. Glance, Spotting fake reviewer groups in consumer reviews, In Proceedings of the 21st international conference on World Wide Web, pp. 191-200, 2012.

[32] Z. Zhang and B. Varadarajan, Utility scoring of product reviews, In Proceedings of the 15th ACM international conference on Information and knowledge management, pp. 51-57, 2006.

[33] H. Li, Z. Chen, B. Liu, X. Wei and J. Shao, Spotting fake reviews via collective PU learning, In ICDM, 2014.

[34] A. Mukherjee and V. Venkataraman, Opinion spam detection: An unsupervised approach using generative models, Technical Report, UH, 2014.

[35] T. Wang and H. Zhu, Voting for Deceptive Opinion Spam Detection, arXiv preprint arXiv:1409.4504, 2014.

[36] G. Fei, A. Mukherjee, B. Liu, M. Hsu, M. Castellanos and R. Ghosh, Exploiting Burstiness in Reviews for Review Spammer Detection, ICWSM, 2013.

[37] A. Mukherjee, A. Kumar, B. Liu, J. Wang, M. Hsu, M. Castellanos and R. Ghosh, Spotting opinion spammers using behavioral footprints, In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 632-640, 2013.

[38] Y. Xu, B. Shi, W. Tian and W. Lam, A unified model for unsupervised opinion spamming detection incorporating text generality, In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

[39] Y. Lin, T. Zhu, X. Wang, J. Zhang and A. Zhou, Towards online review spam detection, In Proceedings of the 23rd International Conference on World Wide Web, pp. 341-342, 2014.

[40] Y. Lin, T. Zhu, H. Wu, J. Zhang, X. Wang and A. Zhou, Towards online anti-opinion spam: Spotting fake reviews from the review sequence, In Advances in Social Networks Analysis and Mining (ASONAM), IEEE/ACM International Conference, pp. 261-264, 2014.

[41] M. Ott, C. Cardie and J. Hancock, Estimating the prevalence of deception in online review communities, In Proceedings of the 21st international conference on World Wide Web, pp. 201-210, 2012.

[42] N. Jindal and B. Liu, Opinion spam and analysis, In Proceedings of the 2008 International Conference on Web Search and Data Mining, pp. 219-230, 2008.

[43] M. Hu and B. Liu, Mining and summarizing customer reviews, In KDD, 2004.

[44] X. Ding, B. Liu and PS. Yu, A holistic lexicon-based approach to opinion mining, In Proceedings of the 2008 international conference on web search and data mining, pp. 231-240, 2008.

[45] R. Patel and P. Thakkar, Opinion spam detection using feature selection, In Computational Intelligence and Communication Networks (CICN), International Conference, pp. 560-564, 2014.

[46] S. Xie, G. Wang, S. Lin and PS. Yu, Review spam detection via temporal pattern discovery, In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 823-83, 2012.

[47] C. Dellarocas, Immunizing online reputation reporting systems against unfair ratings and discriminatory behavior, In ACM EC, 2000.

[48] G. Wu, D. Greene, B. Smyth and P. Cunningham, Distortion as a validation criterion in the identification of suspicious reviews, Technical Report UCD-CSI-2010-04, University College Dublin, 2010.

[49] A. Mukherjee, B. Liu, J. Wang, N. Glance and N. Jindal, Detecting group review spam, In Proceedings of the 20th international conference companion on World Wide Web, pp. 93-94, 2011.

[50] M. Sahami, S. Dumais, D. Heckerman and E. Horvitz, A Bayesian Approach to Filtering Junk E-Mail, AAAI Technical Report WS-98-05, 1998.

[51] H. Li, Z. Chen, A. Mukherjee, B. Liu and J. Shao, Analyzing and Detecting Opinion Spam on a Large-scale Dataset via Temporal and Spatial Patterns, In ICWSM, pp. 634-637, 2015.

[52] PD. Turney, Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews, In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 417-424, 2002.

[53] K. Costa, P. Ribeiro, A. Camargo, V. Rossi, H. Martins, M. Neves and JP. Papa, Comparison of the Intelligent Techniques for Data Mining in Spam Detection to Computer Networks, 2014.

[54] G. Piatetsky-Shapiro, Advances in knowledge discovery and data mining, AAAI Press, 1996.

[55] CR. Narendran, Data Mining-Classification Algorithm-Evaluation, 2009.

[56] LW. Ku, HW. Ho and HH. Chen, Opinion mining and relationship discovery using CopeOpi opinion analysis system, Journal of the Association for Information Science and Technology, 2009.

 
ACKNOWLEDGMENTS 
This work is supported by Ministry of Higher Education (MOHE) and Research Management Centre (RMC) at the Universiti Teknologi Malaysia (UTM) under Research University Grant Category (R.J130000.7828.4F719).