Sometimes it is hard even for humans to tell whether a news article is real, fake, or satire. So I asked myself whether I could train a machine learning model to decide which class (real or satire) a given article belongs to. Websites like https://www.theonion.com publish satirical news every day and can be used, together with regular news sites, to collect training data for this classification problem.
I collected large datasets of German-language news articles from news agencies and newspapers via their websites:
and from the satirical news sites:
for training and testing of the model. In total, I collected 63,868 articles from 2008 to 2018 and stored them in a local database.
To train a classifier I used the scikit-learn package with a linear Support Vector Classifier (SVC). The news texts were vectorized with a count vectorizer and Tf-idf weighting (see the code below).
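The setup described above can be sketched as a scikit-learn pipeline. This is a minimal illustration, not the original code: the toy texts, labels, and default parameters are my assumptions, since the actual configuration is not shown here.

```python
# Sketch of the described pipeline: count vectorizer -> Tf-idf -> linear SVC.
# Toy stand-in data; the real corpus holds 63,868 German articles.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = [
    "Regierung beschließt neues Gesetz zur Rente",   # hypothetical real news
    "Mann erklärt Kühlschrank zum Bundeskanzler",    # hypothetical satire
]
labels = [0, 1]  # 0 = real news, 1 = satire

model = Pipeline([
    ("counts", CountVectorizer()),   # raw token counts
    ("tfidf", TfidfTransformer()),   # Tf-idf weighting of the counts
    ("svc", LinearSVC()),            # linear Support Vector Classifier
])
model.fit(texts, labels)
prediction = model.predict(["Regierung beschließt neues Gesetz zur Rente"])
```

With real data, the pipeline is trained on the article texts and their real/satire labels in exactly this way, only at a much larger scale.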
80% of the data was used for training the classifier and 20% for testing. On the test set, I achieved an accuracy of 0.996, a precision of 0.986, a recall of 0.952, and an F1 score of 0.969. The confusion matrix below shows the distribution of correct and wrong classifications: only 11 of the real news articles were classified as satire, but 42 of the satirical texts were not detected as satire. Quite good results.
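The reported metrics can be computed with scikit-learn's metric functions. The labels below are purely illustrative stand-ins (not the actual test set), just to show how accuracy, precision, recall, F1, and the confusion matrix relate to the predictions.

```python
# Computing the evaluation metrics discussed above on illustrative labels
# (0 = real news, 1 = satire); these are NOT the original test results.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)  # satire (1) is the positive class
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)   # rows: true class, cols: predicted
```

In the confusion matrix, the off-diagonal entries correspond to the two error types mentioned above: real news misclassified as satire, and satire not detected as satire.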
I think the presented method can be applied to other languages as well, and I would expect results similar to those for the German news.
Are computers better than humans in detecting satire in texts?
More details can be found in the article https://arxiv.org/abs/1810.00593