Creating a good headline for a news article is a creative and intelligent task. Can this be done by an AI system? In this article I will build a simple neural network machine learning model which learns to generate a headline given a start phrase. For example, if given the text “Trump says …”, the system should continue the headline.
To build a supervised machine learning system, I need examples of headlines for training, which I collected from Reuters news articles. Let us load the table with the news data using the “Pandas” library.
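Loading the data could look like this minimal sketch; the file name “reuters_headlines.csv” and the printed column names are assumptions, not my original code:

import pandas as pd

df = pd.read_csv("reuters_headlines.csv")   # assumed file name for the collected articles
print(df.shape)                             # number of rows and columns
print(df.columns)                           # headline, body, category, date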
The table contains 40,062 news articles with headline, body, category and date.
I need the text as a stream of characters coming one after another. The system will be trained to predict the next character from a sequence of previous characters; this type of network is called a “character-level RNN”. I only use the column with the headlines and concatenate them into one long string (2,628,606 characters), where the single headlines are separated by newlines.
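The concatenation could look like this (a sketch, assuming the DataFrame df from above with a “headline” column):

text = "\n".join(df["headline"].astype(str))   # one headline per line
print(len(text))                               # 2,628,606 characters in this example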
In the next step, a mapping from each character in the headlines (91 different characters in this example) to a number is built, and this mapping is stored on disk for later use. The text is then encoded with these numbers into a list of 2,628,606 integers.
So the text “Survivors of Florida school …” will be encoded as the sequence:
[47, 76, 73, 77, 64, 77, 70, 73, 74, 1, 70, 61, 1, 34, 67, 70, 73, 64, 59, 56, 1, 74, 58, 63, 70, 70, 67]
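A possible sketch of the mapping and encoding step; the pickle file name is an assumption:

import pickle

chars = sorted(set(text))                        # 91 distinct characters here
char_to_int = {c: i for i, c in enumerate(chars)}
with open("char_to_int.pkl", "wb") as f:         # store the mapping on disk for later use
    pickle.dump(char_to_int, f)

encoded_text = [char_to_int[c] for c in text]    # list of 2,628,606 integers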
This sequence of numbers is now used to train a neural network. The network should learn to predict a new number (character) at the end of a given sequence of numbers (I use seqlen = 10 in the example for the length of the training sequences).
The training examples for the supervised learning are constructed in a generator function, which builds batches of sequence (X_batch) and target (y_batch) pairs by taking a sequence of 10 numbers from the encoded string for X_batch and the following number for y_batch. The sequence is then shifted by 1 for the next pair. In this way, batches of size 512 are generated batch by batch.
These batches are then one-hot encoded by the Keras helper function “to_categorical”. This means a number is represented by a vector of zeros (91 dimensions) with a 1 at the position of that number, i.e. 7 is encoded as a vector of 91 zeros with a 1 at index 7: [0, 0, 0, 0, 0, 0, 0, 1, 0, …, 0].
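A sketch of how such a generator could look; the details may differ from my original code, but it follows the scheme described above:

from keras.utils import to_categorical

seqlen = 10
batch_size = 512
vocab_size = 91

def batch_generator(encoded, seqlen, batch_size, vocab_size):
    i = 0
    while True:
        X_batch, y_batch = [], []
        for _ in range(batch_size):
            if i + seqlen >= len(encoded):
                i = 0                              # wrap around at the end of the text
            X_batch.append(encoded[i:i + seqlen])  # 10 input characters
            y_batch.append(encoded[i + seqlen])    # the following character as target
            i += 1                                 # shift the window by one
        X = to_categorical(X_batch, num_classes=vocab_size)
        y = to_categorical(y_batch, num_classes=vocab_size)
        yield X, y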
With “Keras” I built a model with 3 layers of LSTM cells followed by a dense layer with softmax activation.
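A minimal sketch of such a model; the number of units per LSTM layer is an assumption, not necessarily the value I used:

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(256, return_sequences=True, input_shape=(seqlen, vocab_size)))
model.add(LSTM(256, return_sequences=True))
model.add(LSTM(256))                                 # last LSTM returns only its final output
model.add(Dense(vocab_size, activation="softmax"))   # probability for each of the 91 characters
model.compile(loss="categorical_crossentropy", optimizer="adam")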
For an introduction to “Keras”, see the article: https://medium.com/skyshidigital/getting-started-with-keras-624dbf106c87
For more about RNNs with LSTM cells, see the article: https://medium.com/@kangeugine/long-short-term-memory-lstm-concept-cb3283934359
Using the batch generator defined before, we can now train and save the model. Training may take a while on standard hardware without a strong GPU (about 30 minutes per epoch on my laptop).
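Training with the generator could look like this (a sketch; steps_per_epoch and the file name are assumptions):

gen = batch_generator(encoded_text, seqlen, batch_size, vocab_size)
steps_per_epoch = len(encoded_text) // batch_size

model.fit_generator(gen, steps_per_epoch=steps_per_epoch, epochs=1)
model.save("headline_model.h5")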
After loading a model, it can be used to generate a headline starting with a given sequence of characters (“Trump tells” in the example). The maximum number of characters for the headline is set to 400, but the headline usually ends before that, when the model decides to generate a “\n”.
The headline is generated character by character, sampling each new character according to the probability distribution the model defines.
probs = model.predict_proba(encoded)
generates a probability distribution over the vocabulary (numbers from 0 to 90).
yhat = random.choices(range(0,vocab_size), weights=probs, k=1)
draws a number at random according to that distribution.
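Putting it together, a generation loop built around these two lines could look like the following sketch. Names like int_to_char and generate_headline are assumptions; model.predict returns the same probabilities as predict_proba for a softmax model:

import random
from keras.models import load_model
from keras.utils import to_categorical

model = load_model("headline_model.h5")
int_to_char = {i: c for c, i in char_to_int.items()}

def generate_headline(seed, max_len=400):
    headline = seed
    for _ in range(max_len):
        window = headline[-seqlen:]                    # last 10 characters as input
        encoded = [char_to_int[c] for c in window]
        encoded = to_categorical([encoded], num_classes=vocab_size)
        probs = model.predict(encoded)[0]              # distribution over the 91 characters
        yhat = random.choices(range(vocab_size), weights=probs, k=1)[0]
        char = int_to_char[yhat]
        if char == "\n":                               # the model ends the headline
            break
        headline += char
    return headline

print(generate_headline("Trump tells"))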
After one epoch of training, I did some first tests with 3 different starting sequences: ‘Tump tells’, ‘Erdogan is’ and ‘Clinton is’. The model came up with the headlines:
Tump tells girgelts Tun's
Erdogan is Suntice pceriab in Trhicipats brrit to egup iw ETpitbens at Atstr0 Hra't0s prose s0ek
Clinton is'w Txucr tawtco tes insorten apkint, Xeine
Of course the headlines do not make sense, and there are random combinations of characters that do not build into correct words. But the network has already learned some things: it separates the tokens by whitespace, and the headlines end after a few characters.
After a second epoch of training:
Tump tellsttit chamed on Iran
Erdogan iswa TecuSiley hyw Whemitoan
Clinton is for by
Some correct words appear in the texts, but a lot of “random” words still do too, so a lot of additional training is needed.
After 20 epochs:
Tump tells G7 finance minister
Erdogan is 'nontagium's 'Trump 'made 'sylighter contacts
Clinton islanding renewed
As you can see above, the system gets better and better.
A similar article working with tweets of Donald Trump: https://towardsdatascience.com/yet-another-text-generation-project-5cfb59b26255