Vol. 1, No. 1 | January - June 2018 
 

SJET | ISSN: 2616-7069 | Vol. 1 | No. 1 | © 2018 Sukkur IBA 


 
Estimating News Coverage Patterns using Latent Dirichlet 

Allocation (LDA) 

Batool Zehra1, Naeem Ahmed Mahoto1, Vijdan Khalique1

Abstract: 

The rapid growth of unstructured textual data poses an open challenge for knowledge discovery, which aims at extracting desired information from large collections of data. This study presents a system that derives news coverage patterns with the help of a probabilistic model, Latent Dirichlet Allocation (LDA). A pattern is an arrangement of words within the collected data that are likely to appear together in a certain context. The news coverage patterns have been computed as a function of the number of news articles containing such patterns. A prototype has been developed as a proof of concept to estimate the news coverage patterns for a newspaper, The Dawn. The news coverage patterns have been analyzed from different aspects using a multidimensional data model. Further, the extracted news coverage patterns are illustrated with visual graphs to give an in-depth understanding of the topics covered in the news. The results also assist in identifying the schema related to newspaper and journalists’ articles.

Keywords: News Coverage Pattern; Probabilistic Model; Data Visualization; Multidimensional Data Model.

1. Introduction 

The rapid growth of Web technologies has resulted in a very large number of websites. According to statistics, there were 1,800,047,111 active websites recorded in 2017 [1]. A tremendous volume of data is therefore produced on these websites, which paves the way for information processing that extracts knowledge and meaningful patterns. Such knowledge can assist decision-making in a number of scenarios and problems.

Among the active websites, there are many e-news websites with a large share of visitors who regularly check news and read articles on them. For example, in the US, around 70% of the population refer to the Internet to keep up to date with news [2]. In addition, a large number of news articles are published daily, covering various current topics. This constitutes a huge body of data that can be explored with modern computing and data processing tools to discover interesting and useful knowledge.

1 Mehran University of Engineering and Technology, Jamshoro, Pakistan
Corresponding Email: sayyid_zehra@yahoo.com

Batool Zehra (et al.), Estimating News Coverage Patterns using Latent Dirichlet Allocation (LDA) (pp. 51 - 56)
Sukkur IBA Journal of Emerging Technologies - SJET | Volume 1 No. 1 January – June 2018 © Sukkur IBA University

In this research study, useful knowledge is extracted from e-newspaper articles to obtain news coverage patterns. A news coverage pattern refers to the issues or topics discussed over a certain period of time. The aim is to determine the coverage given to a specific topic or issue in the news. The coverage patterns reveal which topics or issues remained under discussion and were trending subjects in a newspaper. The news coverage patterns have been obtained by

means of Latent Dirichlet Allocation (LDA), which is a probabilistic model for collections of discrete data [3]. LDA is a modeling technique that automatically derives topics based on patterns of word co-occurrence; it finds a set of ideas or themes that describe the entire corpus well. The scope of this research is limited to the daily news and newspaper articles published on the website of a popular newspaper, The Dawn. During the course of this research, an application prototype has been developed as a proof of concept that performs the task of estimating the news coverage patterns. Text mining and analysis are performed to check the coverage of certain news. Precisely, the goal of this research is to identify and highlight the topics or issues under discussion in numerous articles of the Dawn newspaper. Furthermore, the trending topics are statistically represented as knowledge to end-users. Information visualization is considered an emerging field in several application domains; for instance, structural information has been represented in visual formats by incorporating location parameters [7]. The paper is organized as follows: related studies are discussed in section 2; the working principle of extracting news coverage patterns is reported in section 3; the outcomes of the study are described in section 4; finally, section 5 presents conclusions.

2. Literature 

Reference [4] described a way to apply LDA to journalism. The study in [5] evaluated two novel approaches, one using a video stream and the other using the closed caption stream. An LDA approach was used to detect the text stream and the person shots. The research concluded that the individual systems gave comparable results; however, a combination of the two systems provided a significant improvement compared to either individual system.

Reference [6] treated groups of objects together as spatial visual words and investigated configurations of regions using LDA and invariant descriptors. In that research, the computation of invariant spatial signatures for pairs of objects was based on a measure of their interaction inside the scene. A simple classification was used to define spatial visual words and to extract new patterns of similar object configurations. The scene was modeled as a finite mixture over the spatial visual words using the latent Dirichlet model. Statistical analysis was carried out to better understand the spatial distributions inside the discovered semantic classes. Case studies on synthetic imagery and on real imagery were conducted using LDA, and the results showed that the model performs well with a small amount of training data. It was concluded that scene-level analysis can be done through LDA with minimal human interaction, leaving behind the traditional approaches of pixel-level or region-level analysis [6]. Reference [8] extracted significant components, for instance nouns and verbs, from a given broadcast transcript. It further computed weights for the components based on their frequency in the text.

The present study applies LDA to obtain news patterns for a better understanding of the trends of news topics and issues.

3. News Coverage Patterns Extraction Workflow

The workflow for identifying news coverage patterns, explained in this section, comprises four steps. The steps are explained in the order of their execution, as illustrated in Fig. 1. The workflow is intended to determine the coverage of various issues and topics in multiple news items and news articles. The steps are explained below.

3.1. Web Crawler 

The primary source of data is online news websites. In order to fetch data from these websites, a crawler has been designed and developed. However, the crawler targeted only one website, The Dawn (www.dawn.com). The crawler gathered daily news and news articles, which needed preprocessing before the news coverage patterns could be determined. The collected news and news articles are stored as documents, referred to as the document database or corpus.
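
As a rough illustration of the crawling step, the following Python sketch extracts hyperlinks and paragraph text from a single HTML page using only the standard library. The class name `ArticleParser`, the helper `parse_page`, the sample markup, and the choice of `<p>` tags as article text are all assumptions made for illustration; the paper does not describe the structure of the crawled pages or the code of the prototype's crawler. Downloading pages and following the collected links are deliberately left out.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects hyperlinks and paragraph text from one HTML page (illustrative)."""

    def __init__(self):
        super().__init__()
        self.links = []        # href values of <a> tags, candidates for further crawling
        self.paragraphs = []   # text found inside <p> tags, treated as article body
        self._in_paragraph = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        # Keep text only while inside a paragraph; nested inline tags are flattened.
        if self._in_paragraph:
            self._buffer.append(data)


def parse_page(html):
    """Return (links, paragraphs) for one page; paragraphs would go into the corpus."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return parser.links, parser.paragraphs


# Hypothetical markup standing in for a downloaded news page.
sample = '<html><body><a href="/news/1">More</a><p>NAB opens an <b>inquiry</b>.</p></body></html>'
links, paragraphs = parse_page(sample)
```

In a complete crawler, the returned links would feed a queue of pages still to visit, and the paragraphs of each article would be stored as one document of the corpus.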

3.2. Data Collection and Preprocessing

The preprocessing stage involves cleaning the data so that the subsequent processes can be carried out. Refining the data leads to better results, since unnecessary data is removed and only useful data elements are retained. The data refinement procedure during preprocessing performs the following activities: tokenization, stop word removal, stemming, and vector space model representation.

3.2.1. Tokenization 

Tokenization is the process of partitioning the sentences contained in the textual data into tokens (i.e., words). For example, consider the sentence ‘This study aims at extracting news coverage patterns’; tokenization results in the tokens {‘This’, ‘study’, ‘aims’, ‘at’, ‘extracting’, ‘news’, ‘coverage’, ‘patterns’}.
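
The example above can be reproduced with a short Python sketch. The regular expression used here is an assumption (the paper does not state the tokenization rule of the prototype); it keeps runs of letters and digits and drops punctuation.

```python
import re


def tokenize(text):
    """Split text into word tokens, discarding punctuation and whitespace."""
    # Letters/digits, optionally followed by an apostrophe part (e.g. "journalists'").
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)


tokens = tokenize("This study aims at extracting news coverage patterns")
```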
 

3.2.2. Stop Word Removal

News and articles are read one by one by the system, which removes stop words from each document. Stop words are commonly used words in a language, such as is, at, a, on, and of. Their presence can mislead text search and text analysis. Therefore, stop words are removed from the text during the preprocessing stage.
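
A minimal sketch of this filtering step is given below. The stop word set is an assumption: it contains the five words named above plus a few other common English words, whereas the list actually used by the prototype is not given in the paper.

```python
# Illustrative stop word list; only "is", "at", "a", "on", "of" come from the text.
STOP_WORDS = {"is", "at", "a", "on", "of", "the", "an", "in", "and", "to"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words (comparison is case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]


filtered = remove_stop_words(
    ["This", "study", "aims", "at", "extracting", "news", "coverage", "patterns"]
)
```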

3.2.3. Stemming 

In the next stage of preprocessing, stemming takes place. Stemming is the procedure of determining the base or root form of a given word. For example, extract is the root word of extracting. During stemming, each word is reduced to its root word. This is an important step, as it maps different inflected forms of a word to a single term and thereby helps remove noise from the data.
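
The idea can be illustrated with a deliberately crude suffix-stripping function. This is not the stemmer used in the study (the paper does not name one); the suffix list and the minimum stem length are assumptions chosen so that the example from the text works.

```python
# Suffixes are tried in this order; longer suffixes come first.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word, min_stem_length=3):
    """Strip one common English suffix, keeping at least min_stem_length characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_length:
            return lowered[: -len(suffix)]
    return lowered


stems = [stem(word) for word in ["extracting", "patterns", "covered"]]
```

Such a simple rule also over-stems some words (for example, it maps news to new), which is why practical systems usually rely on an established stemming algorithm.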

3.2.4. Vector Space Model 

The vector space model represents text documents as vectors. Each distinct term corresponds to one dimension of the vector space. The text data of news and news articles has been represented in the bag-of-words form.
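
A bag-of-words representation can be sketched as follows. The function names and the two toy documents are assumptions for illustration; each document is assumed to be a list of already preprocessed terms.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct term to a dimension index (sorted for reproducibility)."""
    terms = sorted({term for document in documents for term in document})
    return {term: index for index, term in enumerate(terms)}


def to_vector(document, vocabulary):
    """Return the term-count vector of one document over the vocabulary."""
    vector = [0] * len(vocabulary)
    for term, count in Counter(document).items():
        if term in vocabulary:
            vector[vocabulary[term]] = count
    return vector


# Hypothetical preprocessed documents.
docs = [["nab", "corruption", "nab"], ["army", "corruption"]]
vocab = build_vocabulary(docs)
vectors = [to_vector(document, vocab) for document in docs]
```

Word order is discarded in this representation; only the number of occurrences of each term in a document is kept.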

 
Fig. 1. Workflow of news pattern extraction and visualization.

3.3. News Pattern Extraction 

News pattern extraction refers to identifying patterns in the coverage of certain topics in news and articles. LDA determines which topics are discussed in a given article by observing the processed news data (i.e., the vector space model, bag-of-words) and produces a topic distribution. The prototype application developed in this study implements LDA for news coverage pattern extraction.

 
Latent Dirichlet Allocation (LDA): The mathematical model of the LDA algorithm is presented in [3]. The principal concept of LDA is that the documents’ database (or corpus) contains words that refer to latent topics and thus relate to the overall theme of the documents.



 
Consider a documents’ database (DD) or corpus of M documents such that $DD = \{d_1, d_2, \ldots, d_M\}$, where $d_i$ represents a document in the corpus DD, and each document d is a vector of N words $w_i$ such that $d = \{w_1, w_2, \ldots, w_N\}$. LDA performs the steps reported below for each document d in the corpus DD.

i. Select $N \sim \text{Poisson}(\xi)$.
ii. Select $\theta \sim \text{Dir}(\alpha)$, where $\theta$ is the per-document topic distribution and $\alpha$ is the parameter of its Dirichlet prior.
iii. For each of the N words $w_n$:
   a. Select a topic $z_n \sim \text{Multinomial}(\theta)$.
   b. Select a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$, where $\beta_{ij} = p(w^j = 1 \mid z^i = 1)$ is the probability of word j under topic i.

Given the parameters $\alpha$ and $\beta$, the joint distribution of a topic mixture $\theta$, a set of N topics z, and a set of N words w is given by:

$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$
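
To make the generative steps and the joint distribution concrete, the sketch below samples one synthetic document and evaluates the logarithm of its joint probability, using only the Python standard library. It illustrates the model of [3] and is not the inference procedure of the prototype. The function names, the toy values of the Dirichlet and word-probability parameters, and the random seed are assumptions; topics and words are represented by integer indices.

```python
import math
import random


def sample_poisson(rng, mean):
    """Knuth's multiplication method for a Poisson draw; adequate for small means."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_dirichlet(rng, alpha):
    """Draw theta ~ Dir(alpha) by normalising independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(rng, probabilities):
    """Draw index i with probability probabilities[i]."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


def generate_document(rng, xi, alpha, beta):
    """Steps i to iii: N ~ Poisson(xi), theta ~ Dir(alpha), then z_n and w_n."""
    n_words = sample_poisson(rng, xi)
    theta = sample_dirichlet(rng, alpha)
    topics, words = [], []
    for _ in range(n_words):
        z = sample_index(rng, theta)      # z_n ~ Multinomial(theta)
        w = sample_index(rng, beta[z])    # w_n ~ p(w_n | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words


def log_dirichlet_pdf(theta, alpha):
    """log p(theta | alpha) for a Dirichlet density."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))


def log_joint(theta, topics, words, alpha, beta):
    """log p(theta, z, w | alpha, beta): Dirichlet term plus one term pair per word."""
    total = log_dirichlet_pdf(theta, alpha)
    for z, w in zip(topics, words):
        total += math.log(theta[z]) + math.log(beta[z][w])
    return total


# Hypothetical toy setting: 2 topics over a 3-word vocabulary.
alpha = [1.0, 1.0]
beta = [[0.7, 0.2, 0.1],   # word probabilities under topic 0
        [0.1, 0.2, 0.7]]   # word probabilities under topic 1
rng = random.Random(0)
theta, topics, words = generate_document(rng, xi=8, alpha=alpha, beta=beta)
log_p = log_joint(theta, topics, words, alpha, beta)
```

In practice the direction is reversed: the words are observed and the topic mixture and topic assignments are inferred from them, which requires approximate inference as discussed in [3].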

 
3.4. News Pattern Visualization

Visualization is the final output of this 
study. The extracted news patterns are 
presented statistically using graphs, line charts 
and bar charts. The statistical representation 
gives a comprehensive view of how certain 
topics are discussed in news and articles in a 
newspaper. 
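
The data behind such charts can be prepared by counting, for each topic and each day, the number of articles in which the topic appears. The sketch below shows one way to do this; the record layout (a `date` string and a list of `terms` per article) and the sample articles are made up for illustration and are not taken from the study's corpus.

```python
from collections import defaultdict


def coverage_by_day(articles, topics):
    """Count, per topic and per day, the articles whose terms include the topic."""
    series = {topic: defaultdict(int) for topic in topics}
    for article in articles:
        terms = set(article["terms"])
        for topic in topics:
            if topic in terms:
                series[topic][article["date"]] += 1
    # Sort each series by date so it can be plotted directly as a line chart.
    return {topic: dict(sorted(counts.items())) for topic, counts in series.items()}


# Hypothetical records, not real data from the study.
articles = [
    {"date": "2017-03-01", "terms": ["nab", "corruption"]},
    {"date": "2017-03-01", "terms": ["nab"]},
    {"date": "2017-03-02", "terms": ["corruption", "politicians"]},
]
series = coverage_by_day(articles, ["nab", "corruption"])
```

Each inner dictionary maps a date to an article count, which corresponds to one line of a trend chart or one group of bars in a bar chart.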

4. Discussion 

LDA helps to find the main theme of an article and to discover the coverage of a specific news theme. The discovered themes are then chronologically ordered and presented in this section as bar graphs and line charts in order to show the trends of various news issues and topics.

Figs. 2 and 3 show the trends of topics mentioned in the newspaper. These topics were frequently discussed and remained in the news for the given period of time. The frequencies are plotted on the y-axis, while the x-axis shows time. Each topic has its own line in a different color to indicate the change in its trend. For instance, the topic named NAB in Fig. 2 remained at the top of the news during the 15 days. Similarly, corruption and politicians were the most frequent topics during the given dates. It can also be observed that these topics follow almost the same trend with slight variations. This trend helps in understanding the media coverage in newspapers. In other words, the media focuses on news about NAB, corruption and politicians. Similar trends are depicted in Fig. 3.

Fig. 4 depicts the topics (i.e., army, military, Pakistan) discussed in news articles by different columnists of The Dawn newspaper. The topics covered by these writers show a trend of certain issues. These trends show which of the considered writers focuses on which issues and topics in their writings.

Fig. 2. News coverage trend of 3 topics - 15 days. 



 
Fig. 3. Number of news regarding certain topics. 

For example, referring to Fig. 4, Cyril Almeida mostly wrote about Pakistan, military and army in his news articles. These news coverage patterns and the topics covered by the writers in their writings indicate the direction and mindset of the media personnel, since the media is considered an opinion maker in societies.

Fig. 5 provides information regarding the collected news and news articles available in the database. The prototype application allows searching for topics or terms within a specified range of time, either in the news or in the news articles of columnists.

 
Fig. 4. Frequency of topics covered in articles written by news analysts.

For instance, Fig. 6 shows that the term PPP (Pakistan Peoples Party) was trending more in the news on 5th March 2017 than on the rest of the days of March 2017 in the available database of the prototype application. The news coverage patterns and their trends help in understanding the behavior and mindset of media personnel, who play an essential role in building the opinions of the people. The patterns can provide government officials with in-depth information about the direction of the media and its agenda. Likewise, the writings of certain columnists help in understanding their targets and priorities in building the nation.
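
A term search over a specified range of time, as offered by the prototype, might look like the following sketch. The function name `term_trend`, the record layout, and the sample articles are hypothetical; the interface and storage of the actual prototype are not described in the paper.

```python
from datetime import date


def term_trend(articles, term, start, end):
    """Count articles per day that mention `term` between start and end (inclusive)."""
    needle = term.lower()
    counts = {}
    for article in articles:
        day = article["date"]
        # Whitespace split is a simplification; real text would be tokenized first.
        if start <= day <= end and needle in article["text"].lower().split():
            counts[day] = counts.get(day, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical records, not real data from the study.
articles = [
    {"date": date(2017, 3, 5), "text": "PPP leaders address rally"},
    {"date": date(2017, 3, 5), "text": "Opposition and PPP hold talks"},
    {"date": date(2017, 3, 9), "text": "PPP announces candidates"},
    {"date": date(2017, 4, 1), "text": "PPP statement on budget"},
]
trend = term_trend(articles, "PPP", date(2017, 3, 1), date(2017, 3, 31))
```

The day with the highest count in the returned dictionary is the day on which the term received the most coverage within the chosen range.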

 
Fig. 5. The news and news articles in the corpus. 

 
Fig. 6. Term PPP (Pakistan Peoples Party) coverage trend.



 
5. Conclusion  

This paper presents an approach that applies a probabilistic topic model (i.e., LDA) to find significant patterns in newspapers. The data, crawled from The Dawn newspaper’s official website, has been processed to uncover the news coverage within certain time periods. To validate the potential of understanding news coverage patterns, a prototype application has been built, which performs the necessary steps to reveal the patterns. These patterns have been presented with the help of visualization methods.

It is widely accepted that the media influences the perception of the general public and molds their sentiments and thinking about certain issues and events. This research would assist in identifying the narrative of media groups regarding certain issues and events. The point of view of article writers can be inferred by applying the analytical approach presented in this study. Consequently, the inclination of a newspaper can be judged based on the coverage it gives to various issues. In-depth knowledge about media groups and their news coverage patterns may assist government authorities like PEMRA (Pakistan Electronic Media Regulatory Authority) in regulating the electronic print media.

As future work, we plan to compare the applied approach with state-of-the-art methods and to include other newspapers, as well as research articles, in order to determine the topical coverage of scientific research articles.

 
REFERENCES 

 
[1] D. S. Fowler, “Tek Eye How Many Websites Are There In The World?,” Online: tekeye.uk/computing/how-many-websites-are-there (Accessed 30-1-2018).

[2] S. Fuller, “Topic: News Industry,” Online: www.statista.com/topics/1640/news/ (Accessed 20-1-2018).

[3] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993-1022, Jan. 2003.

[4] C. Jacobi, W. van Atteveldt, and K. Welbers, “Quantitative analysis of large amounts of journalistic texts using topic modelling,” Digital Journalism, vol. 4, no. 1, pp. 89-106, 2016.

[5] H. Misra, F. Hopfgartner, A. Goyal, P. Punitha, and J. M. Jose, “News Story Segmentation Based on Semantic Coherence and Content Similarity,” in MMM, pp. 347-357, January 2010.

[6] C. Vaduva, I. Gavat, and M. Datcu, “Latent Dirichlet Allocation for spatial analysis of satellite images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 5, pp. 2770-2786, 2013.

[7] S. Shah, V. Khalique, S. Saddar, and N. A. Mahoto, “A Framework for Visual Representation of Crime Information,” Indian Journal of Science and Technology, vol. 10, no. 40, pp. 1-8, ISSN: 0974-6846, 2017.

[8] M. J. Pickering, L. Wong, and S. Rüger, “ANSES: Summarisation of news video,” Image and Video Retrieval, LNCS 2728, pp. 425-434, 2003.

 