Creating News Headlines with AI

Creating a good headline for a news article is a creative and intelligent task. Can it be done by an AI system? In this article I will build a simple neural network machine learning model which learns to generate a headline given a start phrase. For example, given the text “Trump says …”, the system should continue the headline.

The system is built with Python and the machine learning library “Keras”, which is an abstraction layer for Google’s “TensorFlow” and other machine learning backends.

To build a supervised machine learning system, I need training examples of headlines, which I collected from “Reuters” news articles. Let us load the table with the news data using the “Pandas” library.
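A minimal sketch of the loading step, assuming the collected data is stored in a CSV file (the file name “reuters_news.csv” is a placeholder):

import pandas as pd

df = pd.read_csv("reuters_news.csv")  # placeholder file name
print(df.shape)                       # (40062, 4)
df.head()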

The table contains 40,062 news articles with headline, body, category and date.

Some rows of the news table

I need the text as a stream of characters, one coming after another. The system will be trained to predict the next character from a sequence of previous characters. This type of network is called a character-level RNN. I only use the column with the headlines and concatenate them into one long string (2,628,606 characters), where the individual headlines are separated by newlines.
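Assuming the data frame from above, this is a two-liner (the column name “headline” is an assumption):

text = "\n".join(df["headline"].astype(str))
print(len(text))  # 2,628,606 characters for this dataset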

In the next step, a mapping from each character in the headlines (91 different ones in this example) to a number is built, and this mapping is stored on disk for later use. The text is then encoded with these numbers into a list of 2,628,606 integers.
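A sketch of this step; the file name “mapping.pkl” is a placeholder:

import pickle

chars = sorted(set(text))                       # 91 distinct characters here
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for c, i in char_to_int.items()}

with open("mapping.pkl", "wb") as f:            # store the mapping for later use
    pickle.dump(char_to_int, f)

encoded_text = [char_to_int[c] for c in text]   # list of 2,628,606 integers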

So the text “ Survivors of Florida school …” will be encoded as the sequence:

This sequence of numbers is now used to train a neural network. The network should learn to predict the next number (character) at the end of a given sequence of numbers (I use seqlen = 10 as the length of the training sequences in this example).

The training examples for the supervised learning are constructed in a generator function, which builds batches of sequence (X_batch) and target (y_batch) pairs by taking a sequence of 10 numbers from the encoded string for X_batch and the following number for y_batch. The sequence is then shifted by 1 for the next pair. In this way batches of size 512 are generated, batch by batch.
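A minimal sketch of such a generator (the function name and implementation details are my assumptions, not necessarily the original code):

import numpy as np

seqlen = 10
batch_size = 512

def batch_generator(encoded, seqlen=10, batch_size=512):
    # Yields (X_batch, y_batch) pairs: X_batch holds sequences of
    # seqlen numbers, y_batch the number following each sequence.
    i = 0
    while True:
        X_batch = np.zeros((batch_size, seqlen), dtype=int)
        y_batch = np.zeros(batch_size, dtype=int)
        for b in range(batch_size):
            if i + seqlen >= len(encoded):
                i = 0                            # wrap around at the end of the text
            X_batch[b] = encoded[i:i + seqlen]
            y_batch[b] = encoded[i + seqlen]
            i += 1                               # shift the sequence by 1
        yield X_batch, y_batch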

These batches are then one-hot encoded by the Keras helper function “to_categorical”. This means a number is represented by a vector of zeros (91 dimensions) with a 1 at the position of that number, e.g. 7 is encoded as:

(0,0,0,0,0,0,0,1,0,0, …)
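With the Keras helper this looks like:

from keras.utils import to_categorical

vocab_size = 91
X_onehot = to_categorical(X_batch, num_classes=vocab_size)  # shape (512, 10, 91)
y_onehot = to_categorical(y_batch, num_classes=vocab_size)  # shape (512, 91)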

With “Keras” I built a model with 3 layers of LSTM cells followed by a dense layer with softmax activation.
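A sketch of this architecture; the layer size of 256 units is my assumption, the exact parameters are shown in the model summary below:

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(256, return_sequences=True, input_shape=(seqlen, vocab_size)))
model.add(LSTM(256, return_sequences=True))
model.add(LSTM(256))                                # last LSTM layer returns a single vector
model.add(Dense(vocab_size, activation="softmax"))  # one probability per character
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()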

For an introduction to “Keras” see the article: https://medium.com/skyshidigital/getting-started-with-keras-624dbf106c87

For more about RNNs with LSTM cells see the article: https://medium.com/@kangeugine/long-short-term-memory-lstm-concept-cb3283934359

Layers and parameters of the model

Using the batch generator defined before, we can now train and save the model. Training may take a while on standard hardware without a strong GPU (about 30 minutes per epoch on my laptop).
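A sketch of the training step, wrapping the integer batches from the generator above with the one-hot encoding (the number of steps per epoch and the file name are my choices; fit_generator was the standard Keras API at the time):

def onehot_generator(encoded, seqlen=10, batch_size=512):
    for X_batch, y_batch in batch_generator(encoded, seqlen, batch_size):
        yield (to_categorical(X_batch, num_classes=vocab_size),
               to_categorical(y_batch, num_classes=vocab_size))

steps_per_epoch = (len(encoded_text) - seqlen) // batch_size
model.fit_generator(onehot_generator(encoded_text),
                    steps_per_epoch=steps_per_epoch,
                    epochs=20)
model.save("headline_model.h5")  # placeholder file name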

After loading the model, it can be used to generate a headline starting from a sequence of characters (“Trump tells” in the example). The maximum number of characters for a headline is set to 400, but the headline usually ends earlier, when the model decides to generate a “\n”.

The headline is generated character by character, sampling each new character according to the probability distribution the model defines.

The line:
probs = model.predict_proba(encoded)
generates a probability distribution over the vocabulary (the numbers 0 to 90).

The line:
yhat = random.choices(range(0,vocab_size), weights=probs[0], k=1)[0]
draws a number at random according to that distribution.
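Putting these pieces together, the generation loop might look like this (a sketch; the helper name generate_headline and the surrounding details are mine):

import random
from keras.models import load_model

model = load_model("headline_model.h5")

def generate_headline(model, seed, max_chars=400):
    # Assumes the seed is at least seqlen characters long and that
    # char_to_int / int_to_char from above are available.
    result = seed
    while len(result) < max_chars:
        encoded = [char_to_int[c] for c in result[-seqlen:]]
        encoded = to_categorical([encoded], num_classes=vocab_size)
        probs = model.predict_proba(encoded)
        yhat = random.choices(range(0, vocab_size), weights=probs[0], k=1)[0]
        if int_to_char[yhat] == "\n":  # the model ends the headline here
            break
        result += int_to_char[yhat]
    return result

print(generate_headline(model, "Trump tells"))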

After one epoch of training I did some first tests with three different starting sequences: ‘Trump tells’, ‘Erdogan is’ and ‘Clinton is’. The model came up with these headlines:

Of course the headlines do not make sense, and there are random combinations of characters that do not form correct words. But the network has already learned some things: it separates the tokens by whitespace, and the headlines end after a few characters.

After a second epoch of training:

Some correct words appear in the texts, but a lot of “random” words still do too. So a lot of additional training is needed.

After 20 epochs:

As you can see above, the system gets better and better.

A similar article working with tweets of Donald Trump: https://towardsdatascience.com/yet-another-text-generation-project-5cfb59b26255

University of Applied Sciences Upper Austria / School of Informatics, Communications and Media http://www.stoeckl.ai/profil/
