
A Corpus Study on the Difference between MOOCS and Real Classes 
 

Adel Rahimi 
Sharif University of Technology, Tehran, Iran 

Tehran, Azadi Avenue, Iran 
Tel.: +98 21 6601 3126 

Adel.rahimi@mehr.sharif.edu 
 

Parvaneh Khosravizadeh 
Sharif University of Technology, Tehran, Iran 

Tehran, Azadi Avenue, Iran 
Tel.: +98 21 6601 3126 
khosravizadeh@sharif.edu 

 
Abstract 
In this paper we examine how the language of online classes (MOOCs) differs from that of real classes. Three corpora were created for this analysis: a MOOC corpus, a Lecture Capture corpus, and a Philosophy Lecture Capture corpus. Three measures were used in the study: formality, sentiment analysis, and vocabulary analysis. The formality score was used to gauge how formal the text is. Sentiment categorization of words was used to measure how positive the words used in the classes are, and finally the top words used in the corpora were analyzed to understand vocabulary usage. We found that the formality score of real classes is slightly lower than that of online classes, that professors use more positive words in real classes than in online classes, and that vocabulary usage is heavily influenced by the subject.

 
Keywords: MOOC, Online Classes, Sentiment analysis, computer-related courses   
 
1. Introduction 
MOOC stands for Massive Open Online Course. The first word, Massive, denotes the “all-at-once-ness” of MOOCs (Johnson, Nafukho, Valentin, Lecounte & Valentin, 2014). In other words, a MOOC is created once but distributed through internet platforms many times. Stephen Downes and George Siemens created the first MOOC in 2008, and the term was used for the first time for their course titled “Connectivism and Connective Knowledge” (Downes, 2012, p. 10). This particular class had 2,000 nonpaying students enrolled (Daniel, 2012).

In 2011 Stanford University offered the course Introduction to Artificial Intelligence. Initially 160,000 students enrolled, and over 20,000 students completed the course. Udacity, a company founded by Sebastian Thrun that focuses on free education, was the first company to offer online courses. Since then, San Jose University, MIT, and Harvard, among others, have begun to offer online courses and establish MOOC platforms.

In terms of quality, early MOOC courses were just class materials published online by university professors; currently, however, instructors design high-quality, tailored materials.

Different types of MOOCs have evolved since the beginning; EdX, Khan Academy, and Coursera each employ their own style in creating class content.
 

Table 1. Comparison of key aspects of MOOCs or Open Education initiatives (from Yuan & Powell, 2013)
#   Initiative  For profit  Free to access  Certification  Institutional credits
1   EdX         ⤬           ✔               ✔              ⤬
2   Coursera    ✔           ✔               ✔              ✔
3   Udacity     ✔           ✔               ✔              ✔
4   Udemy       ✔           ✔               ✔              ✔
5   P2PU        ⤬           ✔               ⤬              ⤬

According to class-central.com, there were more than 4,000 MOOCs as of 2016.
Their distribution over subjects is as follows (data from class-central.com):


A. Rahimi, P. Khosravizadeh - A Corpus Study on the Difference between MOOCS and Real Classes 
 


 
Table 2. Comparison of MOOCs distribution in different subjects (data from: class-central.com) 
 

Subject Percent
  

Science 11.3% 
Business & Management 16.8% 
Mathematics 4.09% 
Engineering 6.11% 
Art and Design 6.73% 
Programming 7.44% 
Health and Medicine 8.27% 
Education and Teaching 9.36% 
Humanities 9.41% 
Computer Science 9.74% 
Social Sciences 10.8% 

 
As Belanger and Thornton (2013) suggest, the main reasons behind the popularity of MOOCs are:
• To support lifelong learning or gain an understanding of the subject matter, with no particular expectations for completion or achievement,
• For fun, entertainment, social experience and intellectual stimulation, 
• Convenience, often in conjunction with barriers to traditional education options, 
• To experience or explore online education. 
 
2. Related works 
Although MOOCs have a brief history, they have evolved considerably in recent years. Several studies have examined how MOOCs started. Daniel (2012) gives a short history of MOOCs and places it in the wider context of distance learning. Yuan and Powell (2013) give a history of MOOCs and an analysis of MOOC-style open education, and also describe the challenges MOOCs face. Clow (2013), in a paper titled “MOOCs and the funnel of participation”, uses the funnel as a metaphor for describing dropout rates in MOOCs. Jordan (2014) used public datasets to visualize completion rates in MOOCs.

In terms of difficulty level, Konnikova (2014)¹ states that if MOOCs were to challenge students they would likely be more effective.

Keats (2016)², in an article in Wired magazine, describes a history of MOOCs and writes that MOOCs should be expansive in order to be successful and replace formal education. In another paper, Dellarocas and Van Alstyne (2013) explain business models for MOOCs and how making money from MOOCs would work.

Chen (2014) used text mining to understand the challenges of MOOCs. The study showed that, among other challenges, MOOCs need to overcome problems of course quality, high dropout rates, unavailable course credits, ineffective assessments, and complex copyright issues.

Rodriguez (2012) classifies MOOCs into two categories: AI-Stanford and connectivist 
MOOCs (c-MOOCs). Rodriguez (2012) suggests that c-MOOCs are more social than AI-Stanford.  

Jordan (2014) also reported completion data on 24 MOOCs. The data show that the highest completion rate, 19.2%, was for Functional Programming Principles in Scala.

                                                 
¹ Konnikova, M. (2014). Will MOOCs be flukes? The New Yorker. Retrieved on July 21, 2017 from www.newyorker.com/science/mariakonnikova/moocs-failure-solutions
² Keats, J. (2016). Are MOOCs in danger of becoming irrelevant? Wired. Retrieved on August 10, 2017 from http://www.wired.co.uk/article/improving-moocs-jonathon-keats


BRAIN – Broad Research in Artificial Intelligence and Neuroscience, Volume 9, Issue1 (February, 2018), ISSN 2067-8957 
 


Jordan (2014) also reported that a typical MOOC had about 43,000 students enrolled, but that the completion rate was only 6.5%.

Although many students drop out of a course, Onah, Sinclair, and Boyatt (2014) show that many participants follow the course in their own “preferred way”. Onah et al. (2014) also suggest that the “structure of ‘a course’ may not be helpful to all participants and supporting different patterns of engagement and presentation of material may be beneficial.”

The reasons for dropout suggested by Onah et al. (2014) are: no real intention to complete, lack of time, course difficulty and lack of support, lack of digital skills or learning skills, bad experience, bad expectations of the course, starting late, and peer review.

Brinton, Chiang, Jain, Lam, Liu, and Wong (2014) analyzed discussion forums in MOOCs and identified two features of these forums: 1) a high decline rate, and 2) high-volume, noisy discussions. Brinton et al. then propose a unified generative model for discussion threads and an algorithm for “ranking thread relevance”.

Wen, Yang, and Rose (2014) use sentiment analysis to “monitor students’ trending opinions towards the course and major course tools”. Wen et al. (2014) also reported a high correlation between the number of dropouts and the sentiments expressed in the discussion forums.
 

3. MOOC/LC Corpus Design 
The main focus of this study is Computer Science and computer-related courses.
The following three corpora were prepared for this study:

1. MOOC Corpus (Computer)
2. Lecture Capture Corpus (Computer)
3. Lecture Capture Corpus (Philosophy)

 
Table 3. List of courses used in the MOOC corpus
Course                                                         Course provider                             Number of word tokens
Computer science 101                                           Stanford University                         79039
Natural Language Processing                                    Columbia University                         136215
An Introduction to Interactive Programming in Python (Part I)  Rice University                             145336
Text Retrieval and Search engines                              University of Illinois at Urbana-Champaign  77986
Neural Networks                                                University of Toronto                       122714
Digital Signal Processing                                      École Polytechnique Fédérale de Lausanne    154333
CS1 Compilers                                                  Stanford University                         196639
Computational Investing, Part I                                Georgia Institute of Technology             74173
C++ for c programmers                                          University of California, Santa Cruz        63327
Biology Meets programming: bioinformatics for beginners        University of California, San Diego         10257
Audio Signal Processing for Music Applications                 Stanford University                         144364
Total                                                                                                      1204383

 

The corpus is made of Coursera class subtitles, which transcribe exactly what the instructor says. Some files were originally in .srt format. The subtitles were first converted to .txt, and the corpus was then cleaned by removing all numbers and special characters.
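
The conversion and cleaning step described above can be sketched as follows. This is only an illustration under stated assumptions, not the exact procedure used for the corpus: the function name `srt_to_clean_text` is hypothetical, the input is assumed to follow the usual SubRip layout (cue number, timestamp line, caption text), and apostrophes are kept so that contractions survive, which the original cleaning may not have done.

```python
import re


def srt_to_clean_text(srt: str) -> str:
    """Turn SubRip (.srt) subtitle text into plain text without numbers or symbols."""
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        # Skip blank lines, cue index lines ("12"), and timestamp lines.
        if not line or line.isdigit() or "-->" in line:
            continue
        kept.append(line)
    text = " ".join(kept)
    # Assumption: keep only letters, apostrophes, and spaces; everything else
    # (digits, punctuation, other special characters) becomes a space.
    text = re.sub(r"[^A-Za-z' ]+", " ", text)
    # Collapse the runs of spaces left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


sample = """1
00:00:01,000 --> 00:00:04,000
Okay, so let's look at 2 examples.

2
00:00:04,500 --> 00:00:07,000
The first one uses a for-loop!
"""

print(srt_to_clean_text(sample))
```

Dropping the cue numbers and timestamps before the character filter keeps subtitle metadata from leaking into the word counts.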

The second corpus comes from real university classes and is referred to as Lecture Capture (LC). Its data come from MIT OCW, the CS50 website, and other course websites that offer closed captions for people who are hard of hearing.

 
Table 4. List of courses used in the LC Corpus
Course                                               Course provider     Number of word tokens
6.001                                                MIT                 121245
CS50                                                 Harvard University  342458
Computer Science E-76: Building Mobile Applications  Harvard University  133298
Computer Science E1 Computers and Internet           Harvard University  223938
CS50 (2016)                                          Harvard University  246036
Software Engineering                                 Harvard University  66783
Total                                                                    1133758
 

The third corpus is the lecture capture corpus for philosophy classes. Like the Lecture Capture corpus, it is compiled from course websites.

 
Table 5. List of courses used in the Philosophy Lecture Capture corpus
Course                                      Course provider  Number of word tokens
Philosophy and the Science of Human Nature  Yale University  159210
Introduction to Political Philosophy        Yale University  145114
Death                                       Yale University  191559
Total                                                        495883

 
The MOOC corpus contains many courses while the LC corpus contains only a few because university classes are much longer than MOOC classes; to balance the corpora, the number of MOOCs is therefore higher.

To keep the corpora comparable, courses were chosen carefully so that, firstly, they do not differ in theme, class organization, or other factors that might affect the results and, secondly, the number of word tokens in the corpora is roughly equal, so that corpus size does not affect the quantitative measures.
 
 4. Analysis 

 
4.1. Formality 
To analyze formality in the corpora, the F-score measure of Heylighen and Dewaele (1999) was employed to indicate how formal the instructors’ speech is. The formula, in which each frequency is a percentage of the total number of words, is:

F = (noun freq. + adjective freq. + preposition freq. + article freq. - pronoun freq. - verb freq. - adverb freq. - interjection freq. + 100) / 2

The formula thus requires the parts of speech in the corpora to be analyzed first. As a first step, the Stanford POS tagger was used to tag all the words in the corpora.
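
As a rough illustration of how the formula can be applied to tagged text, the sketch below computes F from a list of (word, tag) pairs. It assumes Penn Treebank tags, and the mapping in `CATEGORY_TAGS` from tags to the eight categories is our own guess rather than the mapping used in this study; for instance, the tag `IN` also covers subordinating conjunctions, and `DT` covers all determiners rather than only articles. The function names are hypothetical.

```python
from collections import Counter

# Assumed mapping from Penn Treebank tags to the categories of the formula.
CATEGORY_TAGS = {
    "noun": {"NN", "NNS", "NNP", "NNPS"},
    "adjective": {"JJ", "JJR", "JJS"},
    "preposition": {"IN"},
    "article": {"DT"},
    "pronoun": {"PRP", "PRP$", "WP", "WP$"},
    "verb": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD"},
    "adverb": {"RB", "RBR", "RBS", "WRB"},
    "interjection": {"UH"},
}
ADDED = ("noun", "adjective", "preposition", "article")
SUBTRACTED = ("pronoun", "verb", "adverb", "interjection")


def category_frequencies(tagged):
    """Return each category's frequency as a percentage of all tokens."""
    if not tagged:
        raise ValueError("cannot compute frequencies for an empty text")
    counts = Counter()
    for _word, tag in tagged:
        for category, tags in CATEGORY_TAGS.items():
            if tag in tags:
                counts[category] += 1
                break
    total = len(tagged)
    return {c: 100.0 * counts[c] / total for c in CATEGORY_TAGS}


def f_score_from_tags(tagged):
    """F = (added frequencies - subtracted frequencies + 100) / 2."""
    freq = category_frequencies(tagged)
    added = sum(freq[c] for c in ADDED)
    subtracted = sum(freq[c] for c in SUBTRACTED)
    return (added - subtracted + 100.0) / 2.0


# Hand-tagged toy sentence: one article, noun, verb, pronoun, and adverb.
tagged = [("the", "DT"), ("compiler", "NN"), ("reads", "VBZ"),
          ("it", "PRP"), ("quickly", "RB")]
print(f_score_from_tags(tagged))  # (20 + 20 - 20 - 20 - 20 + 100) / 2 = 40.0
```

Tokens whose tag falls outside the eight categories still count toward the total, so they lower every frequency without appearing in the sum.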
 


MOOC Corpus F-Score:

F = (19.466 + 1.635 + 7.144 + 3.267 + 10.456 + 12.623 - 6.09 - 18.010 - 7.144 - 3.521 + 100) / 2 = 59.9

The F-score of the MOOC corpus is slightly higher than those of the other two corpora. A difference of less than 10 points does not drastically change the formality; however, it can serve as a reference for further studies on the formality of online classes.

Lecture Capture Corpus F-Score:

F = (1.132 + 19.66 + 6.626 + 0.12 + 3.38 + 10.57 - 6.56 - 18.676 - 8.874 - 4.04 + 100) / 2 = 51.67

Philosophy LC Corpus F-Score:

F = (9.2 + 0.53 + 3.48 + 4.94 + 1.44 + 4.420 - 3.03 - 7.76 - 2.611 - 1.348 + 100) / 2 = 54.63

The formality score for the Philosophy Lecture Capture corpus is about 55, slightly higher than that of the Lecture Capture corpus. This shows that the subject, as well as the course platform, plays a role in the formality of a class.
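
The arithmetic of the three F-scores can be rechecked with a few lines of code. The sketch below simply re-adds the terms exactly as they are printed in the computations; it makes no assumption about which part of speech each term stands for, and the helper name `f_score_from_terms` is hypothetical.

```python
def f_score_from_terms(added, subtracted):
    """Apply F = (sum of added terms - sum of subtracted terms + 100) / 2."""
    return (sum(added) - sum(subtracted) + 100.0) / 2.0


# Terms copied from the three computations, in the order they are printed.
reported_terms = {
    "MOOC": ([19.466, 1.635, 7.144, 3.267, 10.456, 12.623],
             [6.09, 18.010, 7.144, 3.521]),
    "Lecture Capture": ([1.132, 19.66, 6.626, 0.12, 3.38, 10.57],
                        [6.56, 18.676, 8.874, 4.04]),
    "Philosophy LC": ([9.2, 0.53, 3.48, 4.94, 1.44, 4.420],
                      [3.03, 7.76, 2.611, 1.348]),
}

scores = {name: round(f_score_from_terms(*terms), 2)
          for name, terms in reported_terms.items()}
for name, score in scores.items():
    print(f"{name}: {score}")
```

The recomputed values agree with the reported 59.9, 51.67, and 54.63, which puts the gap between the MOOC and Lecture Capture corpora at roughly 8 points.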

 
4.2. Sentiment 
Sentiment analysis can be employed to study how positive or negative the lecturers’ speech actually is. For this measure, the AFINN word list (Nielsen, 2011) was used. AFINN is a list of words, each rated with an integer between -5 and +5 according to how negative or positive its sentiment is.
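
The counting behind the distribution tables that follow can be sketched as below. The lexicon here is a tiny hypothetical stand-in with illustrative scores, not the actual word list, and `sentiment_distribution` is a name of our own; only words found in the lexicon are counted, so the percentages are shares of the matched words rather than of the whole corpus.

```python
from collections import Counter

# Hypothetical stand-in for the word list: word -> sentiment score.
TOY_LEXICON = {"good": 3, "great": 3, "like": 2, "want": 1,
               "problem": -2, "wrong": -2}


def sentiment_distribution(tokens, lexicon):
    """Map each score to (frequency, percent of all matched words)."""
    counts = Counter(lexicon[t] for t in tokens if t in lexicon)
    total = sum(counts.values())
    return {score: (n, round(100.0 * n / total, 1))
            for score, n in sorted(counts.items())}


tokens = ("okay this is a good example but the wrong index is a problem "
          "and we want a great result").split()
distribution = sentiment_distribution(tokens, TOY_LEXICON)
for score, (frequency, percent) in distribution.items():
    print(score, frequency, percent)
```

In this toy sentence five words are matched: two with score -2, one with score 1, and two with score 3.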

 
 Table 5. Distribution of words by sentiment in the MOOC corpus

Score  Frequency  Percent
-4     16         0.0
-3     804        2.2
-2     5650       15.4
-1     4591       12.5
 1     9112       24.9
 2     13306      36.3
 3     2770       7.6
 4     357        1.0
 5     2          0.0

  
More than 69 percent of the words matched in the word list for the MOOC classes on computer-related subjects are positive. The most frequent positive category is +2, and the most frequent negative category is -2.

 
 Table 6. Distribution of words by sentiment in the Lecture Capture corpus

Score  Frequency  Percent
-4     3          0.0
-3     131        0.3
-2     1705       3.3
-1     6499       12.5
 1     6077       11.7
 2     12938      24.9
 3     19541      37.6
 4     4502       8.7
 5     637        1.2

 

In the Philosophy lecture capture corpus, the +3 category has the highest percentage of words, and the most frequent negative category is -1.

 
 Table 7. Distribution of words by sentiment in the Philosophy Lecture Capture corpus

Score  Frequency  Percent
-4     1          0.0
-3     32         0.3
-2     901        9.7
-1     1930       20.7
 1     972        10.4
 2     1909       20.5
 3     2675       28.7
 4     825        8.8
 5     85         0.9

 
The lecture capture is slightly different in terms of category distribution. Approximately 50% of the corpus is in the positive category.
 

 Table 8. Distribution of words by sentiment in all corpora (percent)

Score     MOOC  LC    Philosophy Lectures
-4        0.0   0.0   0.0
-3        2.2   0.3   0.3
-2        15.4  3.3   9.7
-1        12.5  12.5  20.7
 1        24.9  11.7  10.4
 2        36.3  24.9  20.5
 3        7.6   37.6  28.7
 4        1.0   8.7   8.8
 5        0.0   1.2   0.9
Positive  69.8  84.1  69.3

 
The table above shows that the LC corpus has the highest share of positive vocabulary, with more than 80% of its words in the positive category. The Philosophy LC and MOOC corpora are more or less the same, at about 70%.
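
The positive shares can be recomputed directly from the raw frequencies in the three per-corpus tables. The sketch below does so; the frequencies are copied from those tables, while the function name `positive_share` is our own.

```python
# Frequencies per sentiment score, copied from the three per-corpus tables.
FREQUENCIES = {
    "MOOC": {-4: 16, -3: 804, -2: 5650, -1: 4591,
             1: 9112, 2: 13306, 3: 2770, 4: 357, 5: 2},
    "LC": {-4: 3, -3: 131, -2: 1705, -1: 6499,
           1: 6077, 2: 12938, 3: 19541, 4: 4502, 5: 637},
    "Philosophy LC": {-4: 1, -3: 32, -2: 901, -1: 1930,
                      1: 972, 2: 1909, 3: 2675, 4: 825, 5: 85},
}


def positive_share(frequencies):
    """Percentage of matched words whose sentiment score is above zero."""
    total = sum(frequencies.values())
    positive = sum(n for score, n in frequencies.items() if score > 0)
    return 100.0 * positive / total


shares = {name: round(positive_share(freq), 1)
          for name, freq in FREQUENCIES.items()}
for name, share in shares.items():
    print(f"{name}: {share}%")
```

Computed this way, the shares come to about 69.8%, 84.0%, and 69.3%. The LC figure differs by 0.1 from the 84.1 obtained by adding the already rounded percentages, which does not affect the ranking.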

 
4.3. Top Vocabularies 
 

 MOOCs: 
going 
one 
let 
okay 
see 
two 
like 
use 
get 
 

 LC corpus top vocabularies: 
going 
like 
one 



let 
go 
right 
actually 
see 
get 
want 
 

 Philosophy LC corpus top vocabularies: 
think 
say 
would 
us 
life 
question 
going 
things 
way 
like 
 
In terms of top vocabulary, the difference between the MOOC and LC corpora is small, but the Philosophy LC corpus differs much more from the other two. This indicates that the choice of frequent vocabulary is influenced by the subject.
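
Lists such as those above can be produced with a simple frequency count. The following sketch assumes that function words were filtered out before counting, which the lists suggest but the text does not state; the stopword set is a short hypothetical example and `top_words` is a name of our own.

```python
from collections import Counter

# Hypothetical, deliberately short stopword list for illustration only.
STOPWORDS = {"the", "a", "is", "are", "to", "we", "and", "of", "on", "so"}


def top_words(text, k=10, stopwords=STOPWORDS):
    """Return the k most frequent words that are not stopwords."""
    tokens = [t for t in text.lower().split() if t not in stopwords]
    return [word for word, _count in Counter(tokens).most_common(k)]


sample = "we are going to see the loop and we are going to use the loop going on"
print(top_words(sample, k=2))
```

With a fuller stopword list and the cleaned corpus text as input, the same function would yield a top-ten list per corpus.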
 
 5. Results 
 The F-score shows that MOOCs are, in general, slightly more formal, and that the computer-related MOOC courses (the MOOC corpus) are the most formal of all. The sentiment distribution shows that the Lecture Capture corpus is the most positive in terms of word usage, with the MOOC corpus in second place and the Philosophy LC corpus in third, which suggests that the subject might have an impact on word usage. The top vocabulary lists indicate that the subject heavily influences the frequent words instructors choose, and that this does not change between real and virtual classes.

These data show how and why MOOCs differ from real classes and how instructors can get the most out of their class time.
 
         References  
Belanger, V., & Thornton, J. (2013). Bioelectricity: A Quantitative Approach. Duke University Press.
Brinton, C. G., Chiang, M., Jain, S., Lam, H., Liu, Z. & Wong, F. M. F. (2014). Learning about social learning in MOOCs: From statistical analysis to generative model. IEEE Transactions on Learning Technologies, 7(4), pp. 346-359.

Chen, Y. (2014). Investigating MOOCs through blog mining. The International Review of Research 
in Open and Distributed Learning, 15(2), pp. 85-106. 

Clow, D. (2013). MOOCs and the funnel of participation. In Proceedings of the Third International 
Conference on Learning Analytics and Knowledge. New York, NY, USA: ACM. pp. 185-189. 

 Daniel, J. (2012). Making sense of MOOCs: Musings in a Maze of myth, paradox and possibility. 
Journal of Interactive Media in Education. 2012(3). DOI: http://doi.org/10.5334/2012-18  

Dellarocas, C., & Van Alstyne, M. (2013). Money models for MOOCs. Communications of the 
ACM, 56(8), pp. 25-28.  

Downes, S. (2012). Connectivism and connective knowledge: Essays on meaning and learning networks. Retrieved from http://www.downes.ca/files/books/Connective_Knowledge-19May2012.pdf

Heylighen, F., & Dewaele, J. M. (1999). Formality of language: Definition, measurement and 
behavioral determinants. Internal Report, Center "Leo Apostel", Free University of Brussels. 



Johnson, D., Nafukho, F., Valentin, M., Lecounte, J., & Valentin, C. (2014). The origins of 
MOOCs: The beginning of the revolution of all at once-ness. In Proceedings of 15th 
International Conference on Human Resource Development research and practice across 
Europe. Edinburgh, UK: Edinburgh Napier University Business School. 

Jordan, K. (2014). Initial trends in enrolment and completion of massive open online courses. The 
International Review of Research in Open and Distance Learning, 15(1), 133-160. 

Nielsen, F. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. In Proceedings of the ESWC2011 Workshop on ’Making Sense of Microposts’: Big things come in small packages. Heraklion, Crete, Greece, May 30, 2011.

Onah, D. F. O., Sinclair, J. and Boyatt, R. (2014). Dropout rates of massive open online courses: 
Behavioural patterns. In proceedings of the 6th International Conference on Education and 
New Learning Technologies, Barcelona, Spain, 7-9 Jul 2014. pp. 5825-5834. 

Rodriguez, C. O. (2012). MOOCs and the AI-Stanford like courses: Two successful and distinct course formats for massive open online courses. European Journal of Open, Distance and E-Learning, 15(2).

Wen, M., Yang, D. & Rose, C. P. (2014). Sentiment analysis in MOOC discussion forums: What does it tell us? In Proceedings of the 7th International Conference on Educational Data Mining (EDM 2014), pp. 130–137.

Yuan, L., & Powell, S. (2013). MOOCs and open education: Implications for higher education. Centre for Educational Technology & Interoperability Standards. Retrieved from http://publications.cetis.ac.uk/wp-content/uploads/2013/03/MOOCs-and-Open-Education.pdf.

 
Adel RAHIMI is an MSc student of computational linguistics at Sharif University of Technology. His research interests include Machine Learning, Natural Language Processing, and Computational Linguistics. He is currently a member of the Sharif Speech and Language Processing lab.
You can visit the personal website of Mr Adel Rahimi at: http://adelr.ir/ 
 

Parvaneh KHOSRAVIZADEH is an assistant professor at Sharif University of 
Technology in the field of computational linguistics. Her research interests include 
Machine Learning, Machine Translation, Discourse analysis, and Psycholinguistics. 
You can see the Google Academic profile of Professor Khosravizadeh here: 
https://scholar.google.com/citations?user=UcH_97cAAAAJ&hl=en