International Journal of Interactive Mobile Technologies (iJIM) – eISSN: 1865-7923 – Vol. 14, No. 10, 2020


Paper—ECharacterize: A Novel Feature Selection-Based Framework for Characterizing Entrepreneurial… 

ECharacterize: A Novel Feature Selection-Based Framework for Characterizing Entrepreneurial Influencers in Arabic Twitter
https://doi.org/10.3991/ijim.v14i10.14807 

Bodor Moheel Almotairy(*),	Manal Abdullah  
King Abdulaziz University, Jeddah, Saudi Arabia 

balmetere0002@stu.kau.edu.sa 

Rabeeh Abbasi  
Quaid-i-Azam University, Islamabad, Pakistan  

Abstract—Social media are widely used as communication platforms in the 
world of business. Twitter, in particular, offers valuable opportunities for
collaboration due to its open nature. For this reason, many entrepreneurs use Twitter for
purposes such as mobilizing financial resources, obtaining funding, and increasing
their innovation capabilities. They therefore keep looking for local entrepreneurial
accounts that can help them. Messages from entrepreneurial influencers (opinion
leaders) increase the diffusion of information to entrepreneurs, helping them to find
more opportunities. Discovering the characteristics of entrepreneurial influencers in
Twitter networks is therefore extremely important, since it indicates how to reach
entrepreneurs. In the present paper, we propose a novel framework called
ECharacterize, based on feature selection techniques, to robustly discover the
characteristics of entrepreneurial influencers in the Saudi context. The framework
extracts a large set of influencer features and then employs seven state-of-the-art
ranking methods to determine the most relevant influencer characteristics. It then
combines the resulting ranked lists into a single final list using Robust Rank
Aggregation. The framework was examined on 233,018 real-life Arabic tweets. The
results show the ability of the proposed method to distinguish influencers by their
popularity, reliability, and activity level.

Keywords—Twitter, characteristics of influencers, entrepreneurial influencers, 
robust ranked list.	

1 Introduction 

In the last decade, a variety of social media platforms have opened up a new world
of information. It started with Myspace, which faded with the rise of Facebook and
Twitter. Then, life-sharing social networks such as Instagram, Snapchat [1], and others
gave us the capability to share our voice, make new friends, become more social, share
cultural information, and build extensive knowledge of unknown places and cultures.
Overall, these social networks allow people to be connected anytime and anywhere.

Nowadays, many entrepreneurs employ social media, Twitter in particular, for
different reasons, such as mobilizing financial resources [2], connecting with potential
investors in an attempt to get funding [3], and connecting with other startups [3].
Another use of social media lies in consulting with advisors for knowledge creation [4],
the process of innovation [5], and innovation capabilities [6], which allows them to find
more opportunities. Entrepreneurs on Twitter, especially those in their early stage, look
for information from various sources, particularly local accounts, which helps them to
find and interact with other stakeholders in the entrepreneurial ecosystem. Since they
need the advice and consulting provided by entrepreneurial ecosystem stakeholders [5],
there is a need to enhance the information flow to them [5].

Messages from key persons in the network [7], such as leaders and managers, are
more likely to be followed and shared by followers, and would thus reach the whole
community via small-world [8] and word-of-mouth [9] effects. Many studies focus on
ranking Twitter influencers [10]. However, how can we influence these networks?
Particular accounts, known as influencers, make it possible to interact with their
followers (audiences) and to increase information diffusion on Twitter efficiently.

All these reasons drive us to look for an efficient way to detect influencers.
However, it is difficult to determine the appropriate features for a given case study.
For example, features used to detect academic influencers may not be relevant when
ranking political influencers [10]. For this reason, there is a need to determine the
appropriate features of an entrepreneurial influencer.

In the literature, many methods have been proposed and developed to determine the
most pertinent attributes for particular applications. Ranking is an attribute selection
technique used to emphasize the most relevant attributes. It is also a critical task in
information retrieval applications such as search engines, advertisement systems, and
recommender systems [11]. Rank aggregation (RA) can be defined as a process that
combines multiple ranked lists and outputs a single, more accurate ranked list [12].
For example, RA makes it easier to integrate information from individual genomic
studies [12].

This article is motivated by the lack of literature identifying the characteristics of
Twitter's entrepreneurial influencers, particularly among Saudi Arabic-speaking users.
It reviews the characteristics of Twitter influencers in detail to establish the link
between these characteristics and Saudi entrepreneurial influencers. It then proposes a
novel framework called ECharacterize to determine the essential characteristics
robustly. The ECharacterize framework extracts a large set of influencer features
covering a range of characteristics drawn from different research fields, such as natural
language processing, information retrieval, and social network understanding. It is built
on eight state-of-the-art ranking and aggregation methods to ensure its efficiency.
ECharacterize was examined on a real-life data set comprising 233,018 Arabic tweets
from 656 Saudi entrepreneurial ecosystem stakeholders. Finally, the results were
evaluated by three different state-of-the-art supervised machine learning prediction
models to demonstrate its efficiency and correctness.

The rest of this paper is organized as follows: section 2 presents the theoretical
literature review. Section 3 explains the framework phases and its evaluation. Section 4
discusses the obtained results and interprets the phenomena. Finally, the conclusion
and future work are presented in section 5.

2 Literature Review 

Subsection 2.1 reviews many interesting features that could be used to characterize
Twitter users. All the discussed features are extracted to characterize entrepreneurial
influencers in the ECharacterize framework. The ranking methods embedded in the
ECharacterize framework are then explained in subsection 2.2, followed by an
explanation of the aggregation method in subsection 2.3.

2.1 Features 

The features are grouped into five categories. In fact, the categorization of features
does not follow any standard, so authors usually tend to categorize them thematically.
The following paragraphs describe these features in detail, group by group.

User profile: The first group gathers features related to user profiles. Feature 1
(Verified) indicates whether the user's account is verified by Twitter [13]. Feature 2
(Description length) is the number of characters written by the user to describe
themselves. This feature is considered a good indicator of the user's online presence on
Twitter. Generally, corporate accounts and professional bloggers tend to fill in their
profile [14]. Features 3, 4, and 5 are the URLs, usernames (mentions), and hashtags
appearing in the textual profile description. Previous studies [14] [15] show that some
users use these elements to indicate their professional, distinguished roles and to gain
visibility in a specific area. Feature 6 (Profile age) could be related to the user's visibility
on Twitter, since it takes some time to reach an influential position [14].

Activities and publications: The publishing activity category focuses on how
influencers behave when publishing tweets. Feature 7 (Tweet count) represents the total
number of tweets the user has posted in general, while Feature 8 (Topic Tweet)
corresponds to the number of tweets the user has posted on entrepreneurial issues. Tweet
count and Topic Tweet represent the user's activity on Twitter [14] [15].

Interaction and responsiveness: This category describes how the user interacts with
people. Features 9, 10, and 11 are related to the reactions caused by the user's tweets.
These features can be used as indicators of tweet quality, since a high-quality tweet may
trigger many reactions from others. Feature 9 (Retweet) represents the total number of
retweets of the user's tweets [15]. Feature 10 (Favorite) is the number of times the user's
tweets were marked as favorites [15]. Feature 11 (Reply) represents how many times
others replied to the user's tweets [15]. Feature 12 (User Favorites Count) represents
another type of interaction: the total number of favorites chosen by the user [15].


User relationship: The relationship category describes how popular the user is on
Twitter and how the user connects to the rest of the Twitter users. Features 13 and 14
indicate how much others value the user's tweets on Twitter. Feature 13 (Follower) is
the number of the user's followers [16], while Feature 14 (List) is the number of lists that
include the user's account [16]. On the other hand, Feature 15 (Friends) corresponds to
how much the user seeks information from others [16].

Lexical Aspects: The features of this category capture the lexical aspects of the
profile. These features are beneficial for distinguishing users based on the ways they
describe themselves on Twitter. For instance, if users belonging to the same class tend
to describe themselves in the same way, the selected features will be useful and allow
their identification. Feature 16 focuses on Parts of Speech (POS), while Feature 17
focuses on Named Entity Recognition (NER). POS tagging and NER are Natural
Language Processing (NLP) techniques [17]. NLP is a field at the intersection of
linguistics, computer science, and artificial intelligence that is concerned with the
interactions between computers and human languages and with making these languages
comprehensible to computers [17]. Part-of-Speech (POS) tagging is the process of
assigning a part-of-speech tag to each word in a sentence. In general, there are eight
main parts of speech: adjectives, interjections, prepositions, nouns, adverbs, verbs,
conjunctions, and pronouns, as cited in [17]. A profile may include more than one part
of speech. The output of this stage is the tagged profiles (T.P), as shown by equation 1.
 

$$T.P = \{V_{1 \dots n}, N_{1 \dots n}, Adv_{1 \dots n}, \dots, Adj_{1 \dots n}\} \quad (1)$$
 
Named Entity Recognition (NER) aims to classify the named entities mentioned in a
specific text into predefined categories, for example "cities", "companies",
"organizations", "individuals", "products", and others. NER adds a wealth of knowledge
and meaning to the given text, making it easier to understand. Thus, these features can
be used to discover the relation between the named entities mentioned in the profiles
and the users' influence. The output of NER is the Named Entities in Profiles (N.E.P),
shown by equation 2.

 
$$N.E.P = \{P_{1 \dots n}, O_{1 \dots n}, L_{1 \dots n}\} \quad (2)$$
 

2.2 Ranking Methods  

Ranking is one of the significant problems in the field of information retrieval. It
aims to assign a score to each object in a set (for example, documents); this score is then
used to sort the objects. In feature ranking, a score is given to each feature in order to
figure out the most relevant ones for a specific study. Depending on its application, the
ranking may give an idea of the relevance or importance of the studied case [11]. In the
literature, several methods for feature ranking have been proposed [11]. Based on the
state of the art, Random Forest, SVM-RFE, Information Gain, Gain Ratio, Symmetrical
Uncertainty, Correlation, and Chi-squared are chosen to rank the entrepreneurial
features. We briefly describe these methods in the following paragraphs.


Random Forest: In data science, Random Forests (RF) are considered accessible,
accurate, robust, and easy-to-use machine learning methods. RF has proved effective in
assessing feature importance. It uses decision trees, which rank features according to
their contribution to improving node purity, that is, to decreasing impurity, averaged
over all trees. RF thereby provides a helpful measure called feature importance, which
identifies the most influential variables in the dataset [18].

In a decision tree, every node is a condition on a single feature that divides the
dataset into two subsets. During the training of a tree, one can therefore compute how
much each feature decreases the impurity. For classification trees, the impurity measure
is typically based on Gini impurity or information gain/entropy; for regression trees, it
is the variance. Finally, the feature list is ordered based on this measure [18].
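
As a rough illustration of the impurity-decrease idea described above, the following sketch computes how much one threshold split on a single feature reduces Gini impurity. The function names and the toy data are hypothetical, and a real random forest averages such decreases over many nodes and trees rather than a single split.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(values, labels, threshold):
    """Weighted Gini decrease obtained by splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted_children

# Toy data (hypothetical): follower counts and influencer labels.
followers = [10, 20, 300, 400]
is_influencer = [0, 0, 1, 1]
decrease = impurity_decrease(followers, is_influencer, threshold=100)
```

Here the split separates the two classes perfectly, so the decrease equals the parent impurity. A feature whose splits produce larger decreases would be ranked higher.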

SVM-Recursive feature elimination: Support Vector Machine Recursive Feature
Elimination (SVM-RFE) is a well-known approach for feature ranking. As reported by
Guyon et al. [19], this approach has shown superior classification results compared with
other methods. Generally, this method is used to evaluate the importance of each
variable. SVM-RFE can also find the combination of features that gives the best
classification performance [20]. The method works recursively: it trains an SVM on the
current set of features, eliminates the least useful feature, and repeats, thereby ensuring
a tradeoff between accuracy and the number of features [19].
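
The recursive elimination loop can be sketched as follows. This is only an outline under stated assumptions: `fit_weights` is a hypothetical caller-supplied function standing in for training a linear SVM and returning one weight per feature, and the squared weight is used as the elimination criterion. The toy weights are made up for illustration and are not results from this study.

```python
def recursive_feature_elimination(features, fit_weights):
    """Rank features by repeatedly dropping the one with the smallest squared weight.

    fit_weights is a caller-supplied function (hypothetical here) that trains a
    linear model on the given features and returns {feature: weight}.
    """
    remaining = list(features)
    eliminated = []
    while remaining:
        weights = fit_weights(remaining)
        weakest = min(remaining, key=lambda f: weights[f] ** 2)
        remaining.remove(weakest)
        eliminated.append(weakest)
    # Features eliminated last are the most important, so reverse the order.
    return eliminated[::-1]

# Stand-in for a linear SVM: fixed, made-up weights instead of real training.
TOY_WEIGHTS = {"followers": 2.0, "retweets": -1.5, "pronouns": 0.1}

def toy_fit(selected):
    return {f: TOY_WEIGHTS[f] for f in selected}

ranking = recursive_feature_elimination(TOY_WEIGHTS, toy_fit)
```

In a real setting the weights change after every elimination because the model is retrained, which is what distinguishes this procedure from a single pass over the weights.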

Information Gain: Information Gain (IG) is a ranking method that assigns a weight
to each feature by measuring the information it provides about the class. It performs
feature selection based on Claude Shannon's information theory [21], that is, on the
information value of the analyzed message. The formula can be expressed as follows:

 
$$IG = H(Y) - H(Y \mid X) \quad (3)$$

 
where H(Y|X) is the uncertainty about Y given X, and H(Y) is the entropy of Y. IG is a
symmetrical measure: the information gained about Y after observing X is the same as
the information gained about X after observing Y. IG is biased toward features with
many branches (many distinct values), even if they are not valuable for the study.
Because of this bias, it is recommended to select a large number value for the
attributes before performing the IG method.
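
A minimal sketch of equation 3 for discrete features is given below, assuming base-2 logarithms. The function names and the toy vectors are hypothetical and serve only to show how H(Y) and H(Y|X) combine.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H of a list of discrete values, in bits."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def conditional_entropy(y, x):
    """H(Y|X): entropy of y within each group of x, weighted by group size."""
    n = len(y)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    return sum((len(g) / n) * entropy(g) for g in groups.values())

def information_gain(y, x):
    """IG = H(Y) - H(Y|X), as in equation 3."""
    return entropy(y) - conditional_entropy(y, x)

# Toy data (hypothetical): a binary feature that matches the class exactly.
verified = [1, 1, 0, 0]
influencer = [1, 1, 0, 0]
ig = information_gain(influencer, verified)
```

Swapping the two arguments gives the same value, which illustrates the symmetry noted above.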

Gain Ratio: The gain ratio (GR) is an extension of IG with less bias, since it takes
into consideration the size and number of branches when choosing a feature [22]. This
is done by normalizing the IG by the "intrinsic information" of a split. The intrinsic
information is the potential information created by splitting the dataset into n portions.
The gain ratio is given by equation 4:

$$GainRatio(A) = \frac{Gain(A)}{SplitInfo(A)} \quad (4)$$

where SplitInfo(A) is the intrinsic information. GR is biased toward unbalanced splits
in which one partition is much smaller than the others.
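
The effect of the normalization in equation 4 can be seen in a small sketch. It assumes that Gain(A) is the information gain and that SplitInfo(A) is the entropy of the feature's own value distribution; the names and toy data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy of a list of discrete values, in bits."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def information_gain(x, y):
    """IG computed as H(X) + H(Y) - H(X, Y), which equals H(Y) - H(Y|X)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def gain_ratio(x, y):
    """Information gain of feature x about class y, divided by the split info of x."""
    split_info = entropy(x)
    if split_info == 0:
        return 0.0  # a constant feature carries no information
    return information_gain(x, y) / split_info

labels = [1, 1, 0, 0]
account_id = ["a", "b", "c", "d"]  # many-valued feature (hypothetical)
verified = [1, 1, 0, 0]            # binary feature (hypothetical)

gr_id = gain_ratio(account_id, labels)
gr_verified = gain_ratio(verified, labels)
```

Both features have the same information gain, but the identifier-like feature is penalized because it splits the data into many small portions.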

Symmetrical Uncertainty: The symmetrical uncertainty (SU) criterion, given by
equation 5, was introduced to compensate for the inherent bias of IG [22].

 

$$SU = 2\left[\frac{IG}{H(Y) + H(X)}\right] \quad (5)$$

SU values are normalized to the range [0, 1]. A value of 1 means that knowledge of one
variable completely predicts the other, whereas a value of 0 means that there is no
correlation between X and Y. This method is similar to GR in its bias, because it favors
features with fewer values.
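
Equation 5 can be sketched in a few lines for discrete variables. The implementation below is illustrative only: the names are hypothetical, and IG is computed through the identity IG = H(X) + H(Y) - H(X, Y).

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy of a list of discrete values, in bits."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU = 2 * IG / (H(X) + H(Y)), as in equation 5."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant
    ig = h_x + h_y - entropy(list(zip(x, y)))
    return 2 * ig / (h_x + h_y)

su_identical = symmetrical_uncertainty([1, 1, 0, 0], [1, 1, 0, 0])
su_independent = symmetrical_uncertainty([0, 1, 0, 1], [0, 0, 1, 1])
```

The two toy cases reproduce the endpoints of the range: identical variables give 1 and independent variables give 0.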

Correlation: Correlation-based feature selection relies on symmetrical uncertainty
(SU) [23], a symmetrical measure that can be used to measure the correlation between
two features. The value of symmetrical uncertainty ranges from 0 to 1. A value of 1
indicates that one variable (either X or Y) completely predicts the other, while a value
of 0 indicates that the two variables are entirely independent. For predicting Y, the
Pearson correlation coefficient is defined by equation 6:

 
$$R(i) = \frac{\mathrm{cov}(X_i, Y)}{\sqrt{\mathrm{var}(X_i)\,\mathrm{var}(Y)}} \quad (6)$$

where cov and var designate, respectively, the covariance and the variance. 
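
Equation 6 translates directly into code. The sketch below uses population covariance and variance and assumes that neither variable is constant; the function name and the toy vectors are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation R = cov(X, Y) / sqrt(var(X) * var(Y))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov / sqrt(var_x * var_y)

# Toy data (hypothetical): a feature that grows linearly with the target.
feature = [1, 2, 3, 4]
target = [2, 4, 6, 8]
r = pearson(feature, target)
```

For ranking purposes, features would typically be ordered by the absolute value of R, since a strong negative correlation is as informative as a strong positive one.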
 
Chi-squared: Chi-squared is one of the standard methods used for feature selection
[24]. As described in equation 7, this method evaluates a feature by calculating its
chi-squared statistic. It starts from a null hypothesis H0, which assumes that there is no
relation between two (or more) variables, and performs the test using the following
formula:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \quad (7)$$

where Oij is the observed frequency and Eij is the expected (theoretical) frequency
asserted by the null hypothesis. The higher the value of χ2, the stronger the evidence
against the hypothesis H0.
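
A short sketch of equation 7 for an r x c contingency table follows. It assumes that the expected frequency of a cell is the product of its row and column totals divided by the grand total, and that no expected frequency is zero; the function name and the tables are hypothetical.

```python
def chi_squared(table):
    """Chi-squared statistic of a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the null hypothesis of independence.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: verified / not verified; columns: influencer / non-influencer (made-up counts).
independent_table = [[10, 10], [10, 10]]
dependent_table = [[20, 0], [0, 20]]
stat_independent = chi_squared(independent_table)
stat_dependent = chi_squared(dependent_table)
```

The first table matches the expected counts exactly and yields 0, while the second shows the strongest possible association for these totals.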

2.3 Ranking aggregation 

Ranking aggregation is the process of aggregating many ranked lists generated by
individual rankers into one ranked list. Each item receives a new, more reliable rank,
and the list is re-sorted based on it [12]. In general, Rank Aggregation (RA) is an
ensemble-based method for feature selection. Using this technique gives more accurate
results with different kinds of data, as reported in [12]. Furthermore, RA can be
performed in both supervised and unsupervised ways, but unsupervised RA methods
are the ones mainly used in the literature [12]. Many aggregation methods have been
used to rank features, for example median, highest rank, sum, mean, and lowest rank
aggregation [25].

Robust Rank Aggregation (RRA) is an aggregation method proposed by Dittman et
al. 2013 [25] to aggregate the results of many ranking methods in an unbiased manner.
RRA is considered a statistically stable and computationally efficient algorithm. The
authors proposed RRA to prioritize gene lists in genomic data analysis applications.
RRA assigns an importance score to each gene, providing a robust way to retain only
the relevant genes in the final list. RRA looks at how a feature is positioned in the
ranked lists and compares this to the baseline case in which all ranked lists are shuffled
randomly. RRA then assigns a P-value to every feature, which is used to decide its
significance and to re-rank the features.
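
The comparison against randomly shuffled lists can be sketched with order statistics. The code below is an assumption-laden illustration, not the implementation used in this study: it assumes that ranks are normalized to (0, 1], that under the random baseline they behave like uniform draws, and that the minimum order-statistic probability is multiplied by the number of lists as a Bonferroni-style bound. All function names are hypothetical, and the sketch is not expected to reproduce the P-values reported later in the paper.

```python
from math import comb

def order_statistic_probability(k, m, r):
    """P(k-th smallest of m uniform(0, 1) draws <= r), via a binomial tail sum."""
    return sum(comb(m, j) * r**j * (1 - r) ** (m - j) for j in range(k, m + 1))

def rra_score(normalized_ranks):
    """Corrected minimum order-statistic probability for one feature."""
    ranks = sorted(normalized_ranks)
    m = len(ranks)
    rho = min(order_statistic_probability(k, m, r) for k, r in enumerate(ranks, start=1))
    return min(rho * m, 1.0)

def aggregate(rank_lists, n_features):
    """rank_lists maps feature -> list of positions (starting at 1) in each ranking."""
    scores = {
        feature: rra_score([p / n_features for p in positions])
        for feature, positions in rank_lists.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])

# Toy input (hypothetical): positions of two features in three rankings of 10 features.
toy_ranks = {"followers": [1, 1, 1], "interjection": [10, 10, 10]}
aggregated = aggregate(toy_ranks, n_features=10)
```

A feature that is consistently near the top receives a small score, while a feature that is consistently near the bottom receives a score of 1.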

3 ECharacterize Framework 

This research proposes the ECharacterize framework in order to discover the traits
that make certain users more influential in the entrepreneurial ecosystem on Twitter.
ECharacterize assigns an importance score to each influencer feature; the features are
then evaluated by prediction validation. The feature scores are generated by aggregating
the ranked lists created by seven state-of-the-art feature ranking methods. Figure 1
shows the components of the ECharacterize framework. The next subsections explain
the components in detail.

3.1 Data Collection  

A real dataset was collected from Twitter. The Twitter Search API¹ was used to
crawl the data from the Saudi entrepreneurial hashtag “startups_saudi_forum
/ الملتقى_السعودي_للشركات_الناشئة” from Jan 2, 2018, to Dec 31, 2018. Based on the
collected tweets, the Twitter REST API² was used to get the data of the users' profiles.
As a result, we ended up with a total of 233,018 tweets from 656 users.

3.2 Features Extraction  

All seventeen features discussed in section 2.1 were extracted. Stakeholder,
Official, and Contact Channels are new features added for the purpose of this paper;
they have not been discussed before in the literature. The Stakeholder feature represents
the entrepreneurial stakeholder category to which the Twitter account belongs. The
stakeholders are categorized into six categories based on Andonova et al. 2019 [26]:
government sector, universities, startups, entrepreneurs, accelerators and incubators,
and unofficial accounts such as news and initiatives. The Official feature indicates
whether the account is official or not. Entrepreneurial influencers must be in a position
of trust, because they tweet about crucial issues such as funding, government
regulations, and others. Therefore, this paper assumes that the users in the
entrepreneurial ecosystem will be influenced by official accounts. Contact Channels
represents the availability of a contact channel in the profile, which increases the
profile's reliability. We placed the two new features, Official and Contact Channels, in
the profile category. Stakeholder is considered a separate category.

 
1 https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets - Last accessed November 2, 2019
2 https://dev.twitter.com/rest/public - Last accessed November 2, 2019


 
Fig. 1. ECharacterize Framework 

MADAMIRA was used to apply NER and POS tagging [27]. MADAMIRA is a
state-of-the-art, accurate, and fast morphological analysis tool for Arabic text.
MADAMIRA can find named entities in three categories: Person (PER), Organization
(ORG), and Location (LOC); we consider each category a separate feature. Regarding
POS, MADAMIRA can find all the eight parts of speech, representing eight new
features. The number of users tagged and the number of hashtags were calculated by
counting the @ and # symbols in the tweets. In the profile description length extraction
step, the words and spaces were kept and everything else was removed; then, the
number of letters was counted. This step must be done after extracting features such as
hashtags and tagged users, because it also removes ‘#’ and ‘@’. Finally, we ended up
with 35 features, which are listed in Table 1.
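
The order of the extraction steps described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name is hypothetical, the regular expression is one possible way to keep only words and spaces, and counting alphabetic characters is an assumption about how "letters" were counted.

```python
import re

def profile_text_features(description):
    """Count mentions and hashtags first, then clean the text and measure its length."""
    # Step 1: count the symbols before they are stripped by the cleaning step.
    mentions = description.count("@")
    hashtags = description.count("#")
    # Step 2: keep word characters and whitespace, remove everything else.
    cleaned = re.sub(r"[^\w\s]", "", description)
    # Step 3: count the letters that remain (works for Arabic and Latin scripts).
    length = sum(1 for ch in cleaned if ch.isalpha())
    return {"mentions": mentions, "hashtags": hashtags, "description_length": length}

# Hypothetical profile description used only for illustration.
features = profile_text_features("Founder @acme #startup #riyadh!")
```

Running the cleaning step first would drop the ‘@’ and ‘#’ symbols and make both counts zero, which is why the counts are taken before the text is cleaned.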

Table 1.  The extracted entrepreneurial influencers features 

Category | The Features
User’s profile | Verified, Official, Profile age, Contact Channels, Description length, and URLs, usernames (mentions), and hashtags appearing in the textual profile description
Publishing activity | Tweet count (all tweets the user posted) and Topic Tweet (the number of entrepreneurial tweets the user posted)
User’s interaction | Retweet, favourite, reply (the count of other users’ reactions on the user’s post), and User Favorites Count (considers the total number of favourites selected by the user)
User’s relationship | Followers Count, List Count, and Friend Count
Lexical Aspects | Named Entities features (PER, ORG, and LOC), and Part of Speech features (nouns, pronouns, proper nouns, adjectives, verbs, adverbs, prepositions, conjunctions and interjections)
Stakeholder | Each stakeholder category represents a separated feature (government sector, universities, startups, entrepreneurs, accelerators and incubators, and unofficial accounts)

3.3 User’s Annotation  

To ensure reliability, three expert coders were hired to annotate the top 200 users,
chosen according to the number of retweets they had gained. The first two coders
independently annotated the users as entrepreneurial influencers or non-influencers.
Cohen's kappa was used to measure their agreement [28]. It showed a ‘good’ agreement,
with a kappa value of 0.633, reflecting 85% agreement between the two annotators. The
third expert independently annotated the users on whom the first two coders had
disagreed. Based on the three coders' judgment, the dataset contained 28 influencers.
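
For readers unfamiliar with the agreement measure, the following sketch shows how Cohen's kappa corrects the raw agreement rate for agreement expected by chance. The function name and the toy annotations are hypothetical and are not the study's data.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items."""
    n = len(coder_a)
    # Observed agreement: share of items on which both coders agree.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of the coders' marginal label frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    labels = set(coder_a) | set(coder_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n) for label in labels)
    return (observed - expected) / (1 - expected)

# Toy annotations (hypothetical): 1 = influencer, 0 = non-influencer.
coder_1 = [1, 1, 0, 0]
coder_2 = [1, 0, 0, 0]
kappa = cohens_kappa(coder_1, coder_2)
```

In this toy case the coders agree on 75% of the items, but half of that agreement is expected by chance, so kappa is lower than the raw agreement rate.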

3.4 Data preprocessing  

Data preprocessing transforms the raw data for further processing [29]. The
collected dataset has no missing data, and we did not remove outliers since they reflect
some influencers’ characteristics. This research used encoding and normalization for
data preprocessing.

• Normalization: It is the process of transforming the data of different ranges into a 
uniform scale so that they can be compared [30]. Z-score was used to scale the fea-
tures due to its ability to handle the outliers. 

• Encoding: It is the process of converting categorical variables into numerical ones.
Binary encoding was used to encode the Verified and Official features: ‘0’
represents an account that is not verified or not official, while ‘1’ represents a
verified or official account. The Stakeholder feature was encoded using the one-hot
technique. One-hot encoding is a binary style of encoding in which each categorical
variable has one element per label; the element of the observed label is 1 and all
other elements are 0.
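
The two preprocessing steps above can be sketched in plain Python. The sketch assumes the population standard deviation for the z-score and a non-constant feature; the function names, category labels, and toy values are hypothetical.

```python
from math import sqrt

def z_scores(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def one_hot(label, categories):
    """Return a vector with 1 at the position of label and 0 elsewhere."""
    return [1 if label == category else 0 for category in categories]

# Hypothetical short names for the six stakeholder categories.
STAKEHOLDERS = ["government", "university", "startup",
                "entrepreneur", "accelerator", "unofficial"]

scaled_followers = z_scores([1, 2, 3])
encoded = one_hot("startup", STAKEHOLDERS)
```

After scaling, features measured on very different ranges (for example, follower counts and description lengths) become comparable, and the single Stakeholder variable becomes six binary columns.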


3.5 Features ranking and aggregation  

In this paper, we consider seven commonly used feature ranking methods, based on
learning algorithms, statistics, and entropy, that have shown excellent performance in
various domains. These are random forest, SVM-RFE, information gain, gain ratio,
symmetrical uncertainty, correlation, and chi-squared [11]. The Robust Rank
Aggregation (RRA) algorithm was used to aggregate the seven lists produced by the
ranking methods. RRA returns the final aggregated list with an associated P-value score
for each feature. The P-value is used to decide the significance of each feature and thus
to re-rank the features. Figure 2 shows the aggregated results and their P-value scores.
The more important a feature is, the smaller its P-value; a P-value smaller than 0.05 is
considered significant. Table 2 shows the results of all the ranking and aggregation
methods. The numbers indicate the position of the feature in each list, and the final
column shows the score (P-value) associated with RRA.

Table 2.  The results of all ranking methods and the RRA method.

Feature RF SVM-RFE Correlation IG GR SU Chi RRA P-value
FollowersCount 4 2 5 5 6 6 4 1 6.98E-06 

listedCount 9 3 6 6 2 2 5 2 0.000187 
All_Tweet 7 10 7 3 5 5 7 3 0.000427 
Favorite 8 1 4 2 4 3 1 7 0.009612 

UserFavoritesCount 11 15 15 8 9 8 12 5 0.009416 
Reply 3 20 2 4 1 1 2 6 0.009612 

Retweet 6 5 1 1 3 4 3 4 0.009612 
Tweet 1 11 3 7 7 7 6 8 0.014028 

Verified 30 4 8 11 8 11 8 9 0.017151 
ProfileAge 12 19 19 12 18 16 13 10 0.054688 
Desclength 2 18 23 10 11 10 10 11 0.143959 

ORG 18 9 16 20 26 24 9 12 0.545206 
Official 15 31 9 15 12 12 27 13 0.601293 

FriendsCounts 5 25 22 9 10 9 11 14 0.69346 
Verb 10 30 11 14 16 14 21 15 0.754631 

Adjective 13 28 12 17 25 21 22 16 0.934387 
Noun 14 26 30 13 22 17 20 17 1 

Preposition 16 24 29 18 21 18 23 18 1 
Mentions_in_profile 17 21 32 21 24 22 15 19 1 

Unofficial 19 35 14 30 20 28 31 20 1 
Startups 20 6 34 22 19 20 34 21 1 

Hashtag_in_profile 21 29 28 23 30 25 16 22 1 
LOC 22 14 31 26 31 30 18 23 1 

University 23 8 20 34 33 33 35 24 1 
Pronoun 24 33 17 19 13 13 25 25 1 

ContactChanel 25 13 18 28 29 27 14 26 1 
Accelerators 26 27 25 32 32 32 30 27 1 


Conjunction 27 23 33 24 23 23 24 28 1 
URL_in_Profile 28 17 13 29 28 29 26 29 1 

ProperNoune 29 16 35 16 15 15 19 30 1 
Entrepreneur 31 12 24 33 34 34 32 31 1 

PER 32 27 27 27 27 26 17 32 1 
Adverb 33 32 26 31 14 31 28 33 1 

Interjection 34 34 21 35 35 35 29 34 1 
Government 35 7 10 25 17 19 33 35 1 

 
Fig. 2. The aggregated list features and associated score (P-value) 

3.6 Evaluation  

To evaluate the final aggregated list, we used the concept of incremental feature
selection (IFS) [31]. In IFS, supervised machine learning algorithms are used to evaluate
the features, which are sorted according to their importance. It works as follows: the
algorithm is trained on only the single best feature, then on the top 2, then on the top 3,
and so on until all the features have been included. In each iteration, the algorithm
returns an evaluation score. In this paper, we used precision as the evaluation metric
[32]. As shown in equation 8, precision is the number of true positives (the number of
correctly predicted influencers) divided by the total number of elements classified as the
positive class (influencers), that is, the sum of correctly and incorrectly predicted
influencers [32]. We used it due to its ability to deal with imbalanced class distributions.
In this research, there are 28 influencers out of 200 users.


$$\text{Precision} = \frac{\text{True\_Positive}}{\text{True\_Positive} + \text{False\_Positive}} \quad (8)$$
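
As a small worked example of equation 8, the sketch below counts true and false positives from two label lists. The function name and the toy predictions are hypothetical; returning 0.0 when nothing is predicted positive is a convention chosen here, not something stated in the paper.

```python
def precision(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP) for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    if tp + fp == 0:
        return 0.0  # no positive predictions at all
    return tp / (tp + fp)

# Toy labels (hypothetical): 1 = influencer, 0 = non-influencer.
actual = [1, 0, 1, 0, 0]
predicted = [1, 1, 1, 0, 0]
score = precision(actual, predicted)
```

Three users are predicted as influencers and two of them really are, so the precision is two thirds.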
 
Three different state-of-the-art algorithms were trained in a train-test fashion:
Support Vector Machine (SVM), Naïve Bayes (NB), and Random Forest (RF). The
algorithms were fed the aggregated list incrementally. Figure 3 shows the precision
results of all iterations. Each number represents the number of features in the iteration.
For example, ‘1’ means the best feature (the one with the most significant P-value),
while ‘2’ means the two best features. The significant features are the first nine features.

As shown in the figure, the performance of NB starts at 0.86896 in the first iteration
and then increases incrementally until it reaches its highest value, 0.948365, in the ninth
iteration, after which it becomes stable. SVM provides better performance, reaching
0.95254 in the first and second iterations, but its performance declines to 0.8711111 in
the third iteration and then remains stable until the final iteration. Compared with SVM
and NB, RF starts with the lowest performance, 0.8292397; its performance then
increases until it reaches its highest value, 0.910169, in the ninth iteration, after which it
declines and becomes stable. Table 3 shows the performance of the three algorithms
based on precision for the nine significant features.
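
The incremental evaluation loop described in this subsection can be outlined as follows. This is a sketch under stated assumptions: `evaluate` is a hypothetical caller-supplied function that would train a classifier on the given feature subset and return its precision, and the toy scores are made up purely to make the example runnable.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Evaluate the top-1, top-2, ..., top-n feature subsets in ranked order."""
    results = []
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        results.append((k, evaluate(subset)))
    return results

def best_subset_size(results):
    """Smallest number of features that reaches the highest score."""
    return max(results, key=lambda item: item[1])[0]

# Stand-in for training and testing a classifier: made-up scores by subset size.
TOY_SCORES = {1: 0.87, 2: 0.87, 3: 0.89}

def toy_evaluate(subset):
    return TOY_SCORES[len(subset)]

ranked = ["FollowersCount", "listedCount", "All_Tweet"]
ifs_results = incremental_feature_selection(ranked, toy_evaluate)
best_k = best_subset_size(ifs_results)
```

Plotting the score against the number of features, one curve per classifier, gives the kind of comparison summarized in Figure 3.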

 
Fig. 3. The performance of the models based on precision 

 

Table 3.  Performance of the three algorithms based on precision (%) for the nine significant features

Algorithm 1 2 3 4 5 6 7 8 9
SVM 95.25423 95.25423 87.11111 87.11111 87.11111 87.11111 87.11111 87.11111 87.11111 
NB 86.896552 86.78362 89.25925 91.71608 91.71608 92.09876 92.09876 93.72549 94.83651 
RF 82.92397 91.01694 91.01694 91.01694 91.01694 91.01694 91.01694 91.01694 91.01694 
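The values in Table 3 can be summarized with a short Python sketch that reports, for each algorithm, its best precision and the smallest number of top-ranked features that attains it. The numbers are copied from the table; the dictionary layout and the helper name `best_subset` are assumptions made for illustration.

```python
# Precision (%) copied from Table 3; index i holds the score for the top-(i + 1) features.
table3 = {
    "SVM": [95.25423, 95.25423, 87.11111, 87.11111, 87.11111, 87.11111, 87.11111, 87.11111, 87.11111],
    "NB": [86.896552, 86.78362, 89.25925, 91.71608, 91.71608, 92.09876, 92.09876, 93.72549, 94.83651],
    "RF": [82.92397, 91.01694, 91.01694, 91.01694, 91.01694, 91.01694, 91.01694, 91.01694, 91.01694],
}


def best_subset(scores):
    """Return (best precision, smallest number of features that attains it)."""
    best = max(scores)
    return best, scores.index(best) + 1


for name, scores in table3.items():
    best, k = best_subset(scores)
    print(f"{name}: best precision {best:.2f}% with {k} feature(s)")
# SVM: best precision 95.25% with 1 feature(s)
# NB: best precision 94.84% with 9 feature(s)
# RF: best precision 91.02% with 2 feature(s)
```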

4 Discussion 

Only the first nine features, those with a significant P-value, are considered the
essential features of entrepreneurial influencers, since 0.05 is used as the cutoff for
significance. The ‘number of followers’ is the most crucial characteristic of
entrepreneurial influencers, followed by the number of lists to which the account has
been added. These two features reflect the importance of the entrepreneurial
influencer’s popularity. A user's popularity may be increased by activity level;
accordingly, the influencer’s overall activity, ‘All Tweet’, is the third essential
feature. This result agrees with Asadi and Agah (2018) [16], who found that most
influencers' conversations ranged across different topics such as personal experiences,
travel, or politics.

Ranking ‘Favorite’ fourth reflects that much of an influencer's audience is made up of
followers who act as observers rather than participants in the conversation. ‘User
Favorite Account’ and ‘Reply’ are ranked as the fifth and sixth most essential features,
reflecting the importance of the influencers’ interaction level. This also agrees with
Asadi and Agah (2018) [16], who reported that the majority of influencers spent their
time interacting with their audience. The ‘Retweet’ feature, which reflects the quality
of tweets, is the seventh feature that distinguishes entrepreneurial influencers. This
is a logical result, since entrepreneurial users, especially beginner entrepreneurs and
founders of Small and Medium Enterprises (SMEs), usually look for information to guide
them. This result corresponds with that of Kuffo et al. (2018) [33], who found that
entrepreneurs rely more on local sources for information. The activity level of
influencers again proves its importance through the ‘Tweet’ feature, the number of an
influencer's tweets related to entrepreneurial issues, which is ranked as the eighth
most important feature distinguishing entrepreneurial influencers. This finding agrees
with Kuffo et al. (2018) [33], who found that entrepreneurship-focused sources are more
popular among entrepreneurs. Finally, the profile feature ‘verified’ is ranked ninth on
the list, reflecting how reliable an influencer’s account must be.

5 Conclusions and future work 

In this paper, we focused on the problem of detecting valuable features of
entrepreneurial influencers on Twitter, in particular Saudi influencers. In the first
stage, a wide range of features was collected for investigation. These features come
from several research domains such as social media analysis, natural language
processing, and information retrieval. We then proposed a robust framework called
ECharacterize to rank the most relevant features that distinguish Saudi entrepreneurial
influencers. Three state-of-the-art supervised machine learning algorithms were used to
evaluate the final results to ensure correctness and efficiency. Based on the
experiments, we can highlight the following main results. First, entrepreneurial
influence is based on the number of followers and the number of followers who have added
the influencer to a list. Second, the level of activity distinguishes those accounts, in
terms of both entrepreneurial tweets and general tweets. Third, continued conversation
with the audience is evidence that influencers maintain strong influence, and passive
members who participate by liking tweets are also taken into account. Finally, influence
is also related to the reliability of the account.

6 References  

[1] D. J. Kuss and M. D. Griffiths, “Social Networking Sites and Addiction: Ten Lessons
Learned,” Int. J. Environ. Res. Public Health, vol. 14, no. 3, p. 311, Mar. 2017.
https://doi.org/10.3390/ijerph14030311

[2] S. Shane, “The Importance of Angel Investing in Financing the Growth of Entrepreneurial
Ventures,” Q. J. Financ., vol. 02, no. 02, p. 1250009, Jun. 2012.
https://doi.org/10.1142/s2010139212500097

[3] F. Jin, A. Wu, and L. Hitt, “Social Is the New Financial: How Startup Social Media Activity
Influences Funding Outcomes,” Acad. Manag. Proceedings, p. 13329, 2017.
https://doi.org/10.5465/ambpp.2017.13329abstract

[4] A. Papa, G. Santoro, L. Tirabeni, and F. Monge, “Social media as tool for facilitating
knowledge creation and innovation in small and medium enterprises,” Balt. J. Manag., vol.
13, no. 3, pp. 329–344, Jul. 2018. https://doi.org/10.1108/bjm-04-2017-0125

[5] Y. Motoyama, S. Goetz, and Y. Han, “Where do entrepreneurs get information? An analysis 
of twitter-following patterns,” Small Bus. Entrep., vol. 30, no. 3, pp. 253–274, 2018. 
https://doi.org/10.1080/08276331.2018.1435187 

[6] C. Riverola and F. M. On, “Entrepreneurs’ Bricolage and Social Media,” in 2018 IEEE
International Conference, 2018.

[7] C. D. M. M. C. R, “Identifying influential and susceptible members of social networks,” 
Science (80-)., vol. 329, no. 0036–8075, pp. 1194–1197, 2012. 

[8] D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small-world’ networks,” Nature,
vol. 393, no. 6684, pp. 440–442, Jun. 1998. https://doi.org/10.1038/30918

[9] B. J. Jansen, M. Zhang, K. Sobel, and A. Chowdury, “Twitter power: Tweets as electronic 
word of mouth,” J. Am. Soc. Inf. Sci. Technol., vol. 60, no. 11, pp. 2169–2188, Nov. 2009. 
https://doi.org/10.1002/asi.21149 

[10] J. V. Cossu, V. Labatut, and N. Dugué, “A review of features for the discrimination of twitter 
users: application to the prediction of offline influence,” Soc. Netw. Anal. Min., vol. 6, no. 
1, Dec. 2016. https://doi.org/10.1007/s13278-016-0329-x 

[11] I. Sangaiah, A. Vincent, A. Kumar, A. Balamurugan, and I. Sangaiah, “An Empirical Study
on Different Ranking Methods for Effective Data Classification,” J. Mod. Appl. Stat.
Methods, vol. 14, no. 2, p. 7, 2015. https://doi.org/10.22237/jmasm/1446350760

[12] X. Li, X. Wang, and G. Xiao, “A comparative study of rank aggregation methods for partial 
and top ranked lists in genomic applications,” Brief. Bioinform., vol. 20, no. 1, pp. 178–189, 
Jan. 2019. https://doi.org/10.1093/bib/bbx101 

[13] Z. Chu, S. Gianvecchio, H. Wang, and S. Jajodia, “Detecting automation of Twitter
accounts: Are you a human, bot, or cyborg?” IEEE Trans. Dependable Secur. Comput., vol.
9, no. 6, pp. 811–824, 2012. https://doi.org/10.1109/tdsc.2012.75

[14] K. Lee, P. Tamilarasan, and J. Caverlee, “Crowdturfers, Campaigns, and Social Media:
Tracking and Revealing Crowdsourced Manipulation of Social Media.”

[15] G. Ramírez-de-la-Rosa, E. Villatoro-Tello, H. Jiménez-Salazar, and C. Sánchez-Sánchez,
“Towards automatic detection of user influence in twitter by means of stylistic and
behavioral features,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif.
Intell. Lect. Notes Bioinformatics), vol. 8856, pp. 245–256, 2014.
https://doi.org/10.1007/978-3-319-13647-9_23

[16] M. Asadi and A. Agah, “Characterizing User Influence Within Twitter,” in International 
Conference on P2P, Parallel, Grid, Cloud and Internet Computing, 2018, pp. 122–132. 
https://doi.org/10.1007/978-3-319-69835-9_11 

[17] J. Mustafi, “Natural Language Processing and Machine Learning for Big Data,” in
Techniques and Environments for Big Data Analysis, Springer, Cham, 2016, pp. 53–74.
https://doi.org/10.1007/978-3-319-27520-8_4

[18] K. J. Archer and R. V. Kimes, “Empirical characterization of random forest variable
importance measures,” Comput. Stat. Data Anal., vol. 52, no. 4, pp. 2249–2260, Jan. 2008.
https://doi.org/10.1016/j.csda.2007.08.015

[19] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer classification 
using support vector machines,” Mach. Learn., vol. 46, no. 1–3, pp. 389–422, 2002. 
https://doi.org/10.1023/a:1012487302797 

[20] M. D. Shieh and C. C. Yang, “Multiclass SVM-RFE for product form feature selection,”
Expert Syst. Appl., vol. 35, no. 1–2, pp. 531–541, Jul. 2008.
https://doi.org/10.1016/j.eswa.2007.07.043

[21] W. P. Alston and F. I. Dretske, “Knowledge and the Flow of Information.,” Philos. Rev., 
vol. 92, no. 3, p. 452, Jul. 1983. 

[22] M. A. Hall and L. A. Smith, “Practical feature subset selection for machine learning,”
Comput. Sci., vol. 98, pp. 181–191, 1998.

[23] I. Guyon and A. Elisseeff, “An Introduction to Variable and Feature Selection,” 2003.

[24] H. Liu and R. Setiono, “Chi2: feature selection and discretization of numeric attributes,” in 
Proceedings of the International Conference on Tools with Artificial Intelligence, 1995, pp. 
388–391. https://doi.org/10.1109/tai.1995.479783 

[25] R. Kolde, S. Laur, P. Adler, and J. Vilo, “Robust rank aggregation for gene list integration
and meta-analysis,” Bioinformatics, vol. 28, no. 4, pp. 573–580, Feb. 2012.
https://doi.org/10.1093/bioinformatics/btr709

[26] V. Andonova, M. S. Nikolova, and D. Dimitrov, “What Is an Entrepreneurial Ecosystem?”
in Entrepreneurial Ecosystems in Unexpected Places, Cham: Springer International
Publishing, 2019, pp. 3–16. https://doi.org/10.1007/978-3-319-98219-9_1

[27] A. Pasha et al., “MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis 
and Disambiguation of Arabic,” in Proceedings of the Ninth International Conference on 
Language Resources and Evaluation (LREC), 2014, pp. 1094–1101. 

[28] R. G. Pontius and M. Millones, “Death to Kappa: birth of quantity disagreement and
allocation disagreement for accuracy assessment,” Int. J. Remote Sens., vol. 32, no. 15,
pp. 4407–4429, Aug. 2011. https://doi.org/10.1080/01431161.2011.552923

[29] A. Famili, W.-M. Shen, R. Weber, and E. Simoudis, “Data Preprocessing and Intelligent 
Data Analysis,” Intell. Data Anal., vol. 1, no. 1, pp. 3–23, Jan. 1997. 

[30] S. G. K. Patro and K. K. Sahu, “Normalization: A Preprocessing Stage,” Comput. Sci., vol. 
74, no. 5, pp. 32–40, Mar. 2015. 

[31] H. Liu and R. Setiono, “Incremental Feature Selection,” Appl. Intell., vol. 9, no. 3,
pp. 217–230, 1998.

[32] J. Fürnkranz and P. A. Flach, “An Analysis of Rule Evaluation Metrics,” in Proceedings
of the 20th international conference on machine learning (ICML-03), 2003, pp. 202–209.

[33] L. Kuffo, C. Vaca, E. Izquierdo, and J. C. Bustamante, “Mining Worldwide Entrepreneurs
Psycholinguistic Dimensions from Twitter,” in 2018 International Conference on
eDemocracy & eGovernment (ICEDEG), 2018, pp. 179–186.
https://doi.org/10.1109/icedeg.2018.8372352

7 Authors 

B. Moheel Almotairy completed her master’s degree in the Information Systems Department
at the Faculty of Computing and Information Technology, King Abdulaziz University,
Jeddah, Saudi Arabia, in 2020. She obtained her bachelor’s degree with first-class
honors from King Abdulaziz University. Her research interests include Data Science and
Social Network Analysis.

M. Abdulaziz Abdullah received her PhD in computers and systems engineering from the
Faculty of Engineering, Ain Shams University, Cairo, Egypt, in 2002. She has experience
in industrial computer networks and embedded systems. Her research interests include
Artificial Intelligence, performance evaluation, WSN, network management, Big Data
analysis, and pattern recognition. Dr. Abdullah has published more than 120 research
papers in various international journals and conferences. She has also joined many HiCi
research projects all over the world.

R. Abbasi completed his PhD at the University of Koblenz-Landau, Germany, in 2010. He is
an associate professor at the Department of Computer Science, Quaid-i-Azam University,
Islamabad, Pakistan. He has extensive research experience in social media analytics and
social network analysis. His research focuses on leveraging positive aspects of social
media, including its use in saving lives, understanding events, and analyzing
sentiments, among many others. He has published more than 35 articles in reputed
journals such as IEEE Computational Intelligence Magazine, Computers in Human Behavior,
Telematics and Informatics, Applied Soft Computing, and Scientometrics, and at
international conferences such as the ACM HyperText Conference, ACM World Wide Web
Conference, Pacific Asia Conference on Knowledge Discovery and Data Mining, and European
Conference on Information Retrieval.

Article submitted 2020-04-13. Resubmitted 2020-05-15. Final acceptance 2020-05-16. Final version
published as submitted by the authors.
