








key: cord-328461-3r5vycnr
authors: Chire Saire, J. E.
title: Infoveillance based on Social Sensors to Analyze the impact of Covid19 in South American Population
date: 2020-04-11
journal: nan
DOI: 10.1101/2020.04.06.20055749
sha: 
doc_id: 328461
cord_uid: 3r5vycnr

Infoveillance is an application of the Infodemiology field that aims to monitor public health and support the creation of public policies. Social sensors are people who share thoughts and ideas through electronic communication channels (i.e. the Internet). The current scenario is the effort to tackle the impact of covid19 around the world: many countries have the infrastructure and the scientists to support this effort, and countries have taken actions to decrease the impact. South American countries have a different context regarding Economy, Health and Research, so Infoveillance can be a useful tool to monitor the situation, improve decisions and be more strategic. The motivation of this work is to analyze the capitals of Spanish-speaking countries in South America using a Text Mining approach with Twitter as the data source. The preliminary results help to understand what happened during the previous two weeks and open the analysis to different perspectives, i.e. Economic and Social.

Infodemiology 1 is a new research field whose objective is to monitor public health 2 and support public policies based on electronic sources, i.e. the Internet. Usually this data is open, textual and unstructured; it comes from blogs, social networks and websites, and it is analysed in real time. Infoveillance refers to surveillance applications, e.g. monitoring the H1N1 pandemic with Twitter as data source 3 , monitoring Dengue in Brazil 4 , or monitoring covid19 symptoms in Bogota, Colombia 5 . Social sensors, in turn, refers to observing what people are doing in order to monitor the environment of citizens living in a city, state or country. Because access to the Internet and to Social Networks is open and loosely controlled, people can also share false information (fake news) 6 .

A disease caused by a coronavirus, named Coronavirus Disease 2019 (covid19), started in Wuhan, China at the end of 2019. Infections grew fast in China, Italy and many other countries in Asia and Europe during January and February. Countries in America (Central, North, South) started reporting infections in the middle of February or the beginning of March. The disease was declared a global concern at the end of January by the World Health Organization (WHO) 7 .

South America has a different context regarding economics, politics and social issues than the rest of the world, and most of its countries share a common language: Spanish. Each government made its decisions at a different time, with different dates and actions, e.g. social isolation or closing borders by air and land. However, there is no tool to monitor in real time what is happening across a whole country, how people are reacting, which actions are more effective and which problems are growing.

Given this context, the motivation of this work is to analyze the capitals of countries with Spanish as official language, in order to understand and support the response to this big challenge that we are facing every day. This paper is organized as follows: section 2 explains the methodology for the experiments, section 3 presents results and analysis, section 4 states the conclusions and section 5 introduces recommendations for related studies.

The present analysis is inspired by the steps of the Cross Industry Standard Process for Data Mining (CRISP-DM) 8 , whose phases are very common in Data Mining tasks. The steps for this analysis are the following:

• Select the scope of the analysis and the Social Network

• Find the relevant terms to search on Twitter

• Build the query for Twitter and collect data

• Clean the data to eliminate words with no relevance (stopwords)

• Visualization to understand the countries

medRxiv preprint, doi: https://doi.org/10.1101/2020.04.06.20055749. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.

Considering the countries where Spanish is the official language, there are 9 countries in South America: Argentina, Bolivia, Chile, Colombia, Ecuador, Paraguay, Perú, Uruguay and Venezuela, and every nation has a different territory size, as Tab. 1 shows. Analyzing the whole countries would take a great effort in time, so the scope of this paper is limited to the capital of each country because the highest population is found there.

At the same time, there are many Social Networks, like Facebook, Linkedin, Twitter, etc., with different kinds of objectives: Entertainment, Job Search and so on. In recent years data privacy has become an important concern and these networks have updated their policies; given this restriction, Twitter is chosen because of the open access through the Twitter API, which is used to collect the data for the present study. Although the free access is limited to the last seven days, the collecting process is performed every week.
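The weekly collection schedule can be sketched as follows. This is only an illustration under stated assumptions: the helper `weekly_windows` is a hypothetical name, it uses the collection period reported in the text (08-03-2020 to 21-03-2020), and it does not call the Twitter API; it only computes the date windows that a weekly collection would need to cover.

```python
from datetime import date, timedelta


def weekly_windows(start, end, days=7):
    """Split the inclusive range [start, end] into consecutive
    windows of at most `days` days each (hypothetical helper)."""
    windows = []
    current = start
    while current <= end:
        window_end = min(current + timedelta(days=days - 1), end)
        windows.append((current, window_end))
        current = window_end + timedelta(days=1)
    return windows


# Collection period reported in the text: 08-03-2020 to 21-03-2020.
windows = weekly_windows(date(2020, 3, 8), date(2020, 3, 21))
for first_day, last_day in windows:
    print(first_day.isoformat(), "to", last_day.isoformat())
```

With a seven-day limit on free access, each window corresponds to one collection run.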

Currently, there are hundreds of news items around the world and dozens of papers about the coronavirus, so to perform the queries it is necessary to select specific terms and consider the names that are popular among the population. The selected terms are:

Ideally, people would only use the previous terms, but citizens do not always write these official names exactly, and special characters such as @, #, -, _ are found. For this reason, variations of coronavirus and covid19 are created, i.e. { '@coronavirus', '#covid-19', '@covid_19' }
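One possible way to generate such variations is sketched below. The function `term_variants` and the exact combination rule (an optional @ or # prefix, and an optional - or _ between the letters and the trailing digits) are assumptions made for illustration; the text gives only three example variations and not the full procedure.

```python
from itertools import product

BASE_TERMS = ["coronavirus", "covid19"]  # the two terms named in the text
PREFIXES = ["", "@", "#"]                # assumption: plain, mention, hashtag
SEPARATORS = ["", "-", "_"]              # assumption: placed before trailing digits


def term_variants(term):
    """Return the sorted variations of one search term (hypothetical helper)."""
    stem = term.rstrip("0123456789")     # e.g. "covid19" -> "covid"
    digits = term[len(stem):]            # e.g. "19", or "" when there are none
    if digits:
        bodies = [stem + sep + digits for sep in SEPARATORS]
    else:
        bodies = [term]
    return sorted({prefix + body for prefix, body in product(PREFIXES, bodies)})


for base in BASE_TERMS:
    print(base, term_variants(base))
```

The three examples from the text ('@coronavirus', '#covid-19', '@covid_19') are all produced by this rule.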

Tweets are extracted through the Twitter API, using the following parameters:

• date: 08-03-2020 to 21-03-2020, the last two weeks

• Change format of date to year-month-day

• Eliminate alphanumeric symbols

• Uppercase to lowercase

• Eliminate words of length less than or equal to 3

• Add some exceptions to eliminate, i.e. 'https', 'rt'
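The cleaning steps listed above can be sketched as follows. This is a minimal illustration and not the code used in the study: the function names, the input date format and the decision to keep the @ and # characters (which appear in the extracted terms shown later in the text) are assumptions.

```python
import re
from datetime import datetime

EXTRA_STOPWORDS = {"https", "rt"}  # the exceptions named in the text


def normalize_date(raw, input_format="%d-%m-%Y"):
    """Convert a date string to year-month-day.
    The input format is an assumption; adjust it to the collected data."""
    return datetime.strptime(raw, input_format).strftime("%Y-%m-%d")


def clean_text(text, stopwords=frozenset()):
    """Lowercase, strip symbols, drop short words and stopwords."""
    text = text.lower()
    # Assumption: letters and digits are kept, @ and # are kept,
    # every other symbol (including "_") is replaced by a space.
    text = re.sub(r"[^\w@#\s]|_", " ", text)
    drop = set(stopwords) | EXTRA_STOPWORDS
    # Keep only words longer than 3 characters that are not stopwords.
    return [tok for tok in text.split() if len(tok) > 3 and tok not in drop]


example = "RT @Pagina: El Gobierno declara cuarentena https://t.co/abc #Bolivia"
print(normalize_date("08-03-2020"))
print(clean_text(example))
```

A Spanish stopword list can be passed through the `stopwords` parameter; none is bundled here because the text does not name the list that was used.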

This step helps to answer some questions in order to analyze what happens in every country.

• What is the frequency of posts every day?

• Can we trust all the posts?

• The date of user account creation

• Tweets per day to analyze the increasing number of posts

• Cloud of words to analyze the most frequent terms involved per day
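The per-day counts and the most frequent terms behind a cloud of words can be sketched as follows. The record layout (a dictionary with `date` and `tokens` keys) and the sample records are hypothetical; the default of thirty terms follows the number of terms per country mentioned in the text.

```python
from collections import Counter, defaultdict


def tweets_per_day(tweets):
    """Count how many posts fall on each date."""
    return Counter(tweet["date"] for tweet in tweets)


def top_terms_per_day(tweets, n=30):
    """Return the n most frequent terms for each date."""
    counts = defaultdict(Counter)
    for tweet in tweets:
        counts[tweet["date"]].update(tweet["tokens"])
    return {
        day: [term for term, _ in counter.most_common(n)]
        for day, counter in counts.items()
    }


# Hypothetical, already cleaned records used only to show the calls.
sample = [
    {"date": "2020-03-08", "tokens": ["cuarentena", "salud"]},
    {"date": "2020-03-08", "tokens": ["cuarentena"]},
    {"date": "2020-03-09", "tokens": ["gobierno", "salud", "salud"]},
]
print(tweets_per_day(sample))
print(top_terms_per_day(sample, n=1))
```

The term lists returned per day can then be passed to any word cloud tool for plotting.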

The following graphics present the results of the experiments and answer many questions to understand the phenomenon across the population.

To begin, a quick overview of the frequency of posts per country helps us to understand how many active users there are in every capital.

Four things are important to highlight from Fig. 2: (1) Venezuela is a smaller country but its number of posts is very similar to Argentina's, (2) Paraguay has almost a third of Peru's territory and the number of publications is very similar, (3) Chile is a small country but its number of publications is higher than Peru's and (4) Uruguay is the smallest one, with more tweets than Bolivia and Colombia; even Ecuador has more.

On the other hand, considering data from Table 1 , there is a strong relationship between Internet, Social Media and Mobile Connections and the number of tweets in Argentina and Venezuela, but a different context for Colombia. This insight shows us the level of usage in Bogota and suggests how Internet users are spread across other cities in Colombia. So, a behavior similar to the one explained previously is present in this data.


Considering Fig. 2 , the number of posts for each country, the total number of tweets is over five million (5 627 710), about 400 thousand (401 979) per day on average over the two weeks. So, the question of veracity is important in order to filter and analyze what people are thinking, because noise could limit the understanding of what truly happens. Consequently, it is necessary to consider some criterion to filter this data.
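The daily average can be checked from the reported total, and one simple way to start looking for a filtering criterion is to rank accounts by number of posts. The sketch below is an illustration only: the `user` field, the sample records and the account names are hypothetical, and ranking by volume is a possible first step rather than the criterion adopted in the study.

```python
from collections import Counter

TOTAL_TWEETS = 5_627_710  # total reported in the text
DAYS = 14                 # 08-03-2020 to 21-03-2020, both days included

average_per_day = TOTAL_TWEETS // DAYS  # integer part of the daily average


def top_users(tweets, n=12):
    """Return the n accounts with the most posts as (user, count) pairs."""
    return Counter(tweet["user"] for tweet in tweets).most_common(n)


# Hypothetical records used only to show the call.
sample_posts = [
    {"user": "account_a"},
    {"user": "account_b"},
    {"user": "account_a"},
    {"user": "account_c"},
    {"user": "account_a"},
    {"user": "account_b"},
]
print(average_per_day)
print(top_users(sample_posts, n=2))
```

Accounts at the top of such a ranking can then be inspected manually, for example to separate mass media from regular users.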

First, Argentina has the highest number of publications in the last two weeks. For example, the first dozen of the top users in Buenos Aires are:

'Portal Diario', '.', 'Clarín', 'Radio DoGo', 'Camila', 'El Intransigente', 'Agustina', 'Pablo', 'FrenteDeTodos', 'Ale', 'Lucas', 'Diario Crónica'

After a search about these users, one natural finding is that many are related to newspapers, radio or television (mass media). But there are also regular people, and people with many hundreds of tweets. Fig. 3 shows the names of users and their quantity of posts.

Frequent terms in posts related to Bolivia: #bolivia siete covid @pagina gobierno salud @larazon cruz ministro medidas cuarentena caso #coronavirusbo santa @rtp pais #esultimo emergencia personas @erboldigital anibal paciente confirma pide #lapaz informa confirmados nuevo oruro presidenta hospital primer @jeanineanez #elalto #urgente pacientes #deahora prevencion presidente ciudad evitar anuncia informo @sumaj warmi nacional nuevos dice sospechosos #loultimo alto prevenir centro medicos atencion china italia declara cuba tres horas #deultimo poblacion #santacruz ministerio propagacion video mundo hospitales enfrentar autoridades tras @yerkogarafulic #coronavirusmundo luis @luchoxbolivia #anibalcruz jeanine medico debido #mundo #ultimo virus gobernacion #videonoticias pandemia municipal estan primera manos policia reporta suspension #oruro @radiolider97 frente

To help the visualisation from Monday to Sunday during the last two weeks, a cloud of words is presented in Fig. 6 showing the first thirty terms per country. It is important to remember that every country promoted different actions on different dates.

1. Infodemiology and infoveillance: Tracking online health information and cyberbehavior for public health

2. Social web mining and exploitation for serious applications: Technosocial predictive analytics and related technologies for public health, environmental and national security surveillance

3. Pandemics in the age of Twitter: content analysis of tweets during the 2009 H1N1 outbreak

4. Building intelligent indicators to detect dengue epidemics in Brazil using social networks

5. What is the people posting about symptoms related to coronavirus in Bogota, Colombia?

6. Early epidemiological analysis of the coronavirus disease 2019 outbreak based on crowdsourced data: a population-level observational study

7. WHO. WHO statement regarding cluster of pneumonia cases in Wuhan, China. Beijing: WHO 9

8. The CRISP-DM model: The new blueprint for data mining

Infoveillance based on Social Sensors with data coming from Twitter can help to understand the trends in the population of the capitals. Besides, it is necessary to filter the posts before processing the text to get insights about frequency, top users and the most important terms. This data is useful to analyse the population from different approaches.

