How often are people mentioned in news articles published online?
How does this vary over time, and are the mentions of different people correlated?

I tried to answer these questions with a small project and a website. The results can be seen at http://in-the-news.stoeckl.ai/. The site is built with the Python microframework “Dash”, which uses the platform “Plotly” for its interactive charts.

The data comes from articles published by the Reuters news agency on its website www.reuters.com. At the moment, about 70,000–80,000 news articles in English and German are indexed. The German articles date from 2015 onward, the English ones from 2016 onward. Part of the dataset is available on kaggle.com: https://www.kaggle.com/astoeckl/newsen

For each article, Named Entity Recognition (NER) is performed with a machine learning algorithm to detect the persons mentioned in the text. I used the Python library “spaCy”, as in the following example:

Figure: Persons and organisations in the news titles
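A minimal sketch of how such an extraction could look with spaCy. The model name `en_core_web_sm` and the helper names `load_pipeline` and `persons_and_orgs` are assumptions for illustration, not taken from the project. English spaCy models label persons as PERSON, while the German models use PER, so both labels are accepted here.

```python
PERSON_LABELS = {"PERSON", "PER"}  # PERSON in English models, PER in German models
ORG_LABELS = {"ORG"}


def load_pipeline(model_name="en_core_web_sm"):
    """Load a pretrained spaCy pipeline (the model name is an assumption)."""
    import spacy  # imported here so the helper below has no hard dependency

    return spacy.load(model_name)


def persons_and_orgs(doc):
    """Split the named entities of a processed title into persons and organisations."""
    persons = [ent.text for ent in doc.ents if ent.label_ in PERSON_LABELS]
    orgs = [ent.text for ent in doc.ents if ent.label_ in ORG_LABELS]
    return persons, orgs
```

With spaCy and the model installed, `persons_and_orgs(load_pipeline()(title))` returns two lists for a news title. Which names end up in them depends on the model.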

This algorithm uses a model that was pretrained on a corpus of Google News articles for English and German. The lists of persons found in the articles are used to calculate the counts and are stored in a database.
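The counting step could be sketched as follows, using SQLite from the Python standard library. The table name `person_counts`, its columns and the helper functions are hypothetical, since the actual database schema is not described here. The sketch also assumes that every mention counts, so a person named twice in one article adds two.

```python
import sqlite3
from collections import Counter


def create_table(conn):
    """Create the (hypothetical) table of daily mention counts."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS person_counts (
               day TEXT NOT NULL,        -- publication date in ISO format
               person TEXT NOT NULL,
               mentions INTEGER NOT NULL,
               PRIMARY KEY (day, person)
           )"""
    )


def add_article(conn, day, persons):
    """Add the persons detected in one article to the counts of its day."""
    for person, n in Counter(persons).items():
        # Make sure a row exists, then increase its counter.
        conn.execute(
            "INSERT OR IGNORE INTO person_counts (day, person, mentions) "
            "VALUES (?, ?, 0)",
            (day, person),
        )
        conn.execute(
            "UPDATE person_counts SET mentions = mentions + ? "
            "WHERE day = ? AND person = ?",
            (n, day, person),
        )
    conn.commit()


def top_persons(conn, limit=10):
    """Return (person, total mentions) pairs, most mentioned first."""
    rows = conn.execute(
        """SELECT person, SUM(mentions) AS total
           FROM person_counts
           GROUP BY person
           ORDER BY total DESC, person
           LIMIT ?""",
        (limit,),
    )
    return rows.fetchall()
```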

I show a bar chart of the counts for the most frequently mentioned persons. For up to four of these persons, you can plot the time series of their counts together over a time period you select. For any two persons, you can calculate their relation (correlation) as a function of time.
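A bar chart of this kind could be set up in Dash roughly as follows. This is a sketch, not the code of the site: the function names, the component id `top-persons` and the chart title are assumptions. The figure is passed as a plain dictionary in Plotly's figure format, and Dash is imported only when the app is built.

```python
def bar_figure(totals):
    """Describe a bar chart as a plain dict in Plotly's figure format.

    totals: list of (person, count) pairs, most mentioned first.
    """
    return {
        "data": [
            {
                "type": "bar",
                "x": [person for person, _ in totals],
                "y": [count for _, count in totals],
            }
        ],
        "layout": {
            "title": {"text": "Most mentioned persons"},
            "yaxis": {"title": {"text": "Mentions"}},
        },
    }


def build_app(totals):
    """Wrap the chart in a minimal Dash app (requires the dash package)."""
    from dash import Dash, dcc, html  # Dash 2.x style imports

    app = Dash(__name__)
    app.layout = html.Div(
        [dcc.Graph(id="top-persons", figure=bar_figure(totals))]
    )
    return app
```

With Dash installed, `build_app(totals).run(debug=True)` should start a local development server on recent Dash versions.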

“Related Persons” measures how correlated two persons are, in the sense that they are mentioned in the news on the same days. If they have the same counts every day, the correlation is 1. If one person always appears on days when the other does not, they are negatively correlated, with a value near -1. If there is no correlation, the value is near zero.

This measure varies over time, since the correlation changes as the relationship between the persons changes. We calculate the correlations over a sliding time window of 30 days and plot these values as a function of time.
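The sliding-window calculation can be sketched in plain Python. The sketch assumes that the measure is the Pearson correlation of the two daily count series, which the text above does not state explicitly, and the function names are illustrative. Each series holds one count per day, in date order.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long count series.

    Returns None when either series is constant, because the
    correlation is undefined in that case.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("series must be non-empty and of equal length")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def rolling_correlation(xs, ys, window=30):
    """Correlation for every full window of `window` consecutive days."""
    if len(xs) != len(ys):
        raise ValueError("series must be of equal length")
    return [
        pearson(xs[end - window:end], ys[end - window:end])
        for end in range(window, len(xs) + 1)
    ]
```

A window in which one of the two persons has the same count every day, for example no mentions at all, yields None. Such points would have to be skipped or drawn as gaps in the plot.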

More details can be found in the article https://arxiv.org/abs/1809.06083.

University of Applied Sciences Upper Austria / School of Informatics, Communications and Media http://www.stoeckl.ai/profil/
