Stock-Prediction from News — A Naive Approach

Stock market prediction with machine learning is very popular these days. An example of such a platform is described in the article https://medium.com/@mobappdaily/market-sensei-expat-inc-launches-machine-learning-powered-stock-market-prediction-platform-1a9acdf8cb66.

Since news articles may influence the markets, I will try to build a model for stock prediction based on news published on the web. In a first step, I build a numerical representation of each document with the “Doc2Vec” method. In a second step, I train a regression model, a deep neural network, on these representations.

As a dataset, I use articles from www.reuters.com published between 23 February 2016 and 23 July 2018.

I only use texts with more than 500 characters, and load them into a data frame with the Python library “Pandas”. The data frame consists of columns for title, body text, category and the date the news was published.

For my prediction model, I only use the body text of each article. The total number of texts is 35,009, which I split into a set for training the model and a set for testing the predictions. All articles published before 1 June 2018 are used for training (32,901 articles, “df_train”); the remaining texts (2,199 articles, “df_test”) are used for testing.

To train a prediction model with neural networks I need a numerical representation of the texts. “Doc2Vec” (see https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e) is one such method; it represents texts as n-dimensional vectors. I used the implementation of “Doc2Vec” in the Python library “Gensim” to calculate a 300-dimensional vector for each of my texts. First, the representation model was trained on the 32,901 training texts:

In the next step, the representation vectors for all texts in the training set are calculated with this model and stored in the data frame:

To try a prediction I need some stock values, so I downloaded a data frame with the values for “Apple Inc.” from the “Quandl” API (610 values):

Then I calculated a “Daily Change” column in the data frame from the “adjusted closing” value of each day:
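A sketch of this calculation with Pandas on made-up prices. The column name `Adj. Close` and the choice of the absolute day-to-day difference are assumptions; a relative change via `pct_change()` would be an equally plausible reading of “Daily Change”.

```python
import pandas as pd

# Hypothetical prices standing in for the downloaded Apple data.
apple = pd.DataFrame({
    "Date": pd.to_datetime(["2018-05-22", "2018-05-23", "2018-05-24", "2018-05-25"]),
    "Adj. Close": [190.0, 191.5, 189.0, 192.0],
})

# Make sure the rows are in chronological order before differencing.
apple = apple.sort_values("Date").reset_index(drop=True)

# Difference between each adjusted close and the one of the previous trading day.
# The first row has no predecessor, so its value is NaN.
apple["Daily Change"] = apple["Adj. Close"].diff()
```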

Next, I joined the two tables “df_train” and “apple” on the column “Date”.

This gave 26,371 rows with representation vectors and stock values in one dataset. Not all of the 32,901 rows survive the join, because some of the news was published on days with no “Daily Change” value.
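The join can be sketched with `pandas.merge`. The toy tables below are hypothetical; an inner join is assumed because it reproduces the behaviour described above, where articles from days without a stock value drop out.

```python
import pandas as pd

# Hypothetical articles, one of them published on a Saturday.
df_train = pd.DataFrame({
    "Date": pd.to_datetime(["2018-05-25", "2018-05-26", "2018-05-29", "2018-05-29"]),
    "text": ["article a", "article b", "article c", "article d"],
})

# Hypothetical stock table that only has rows for trading days.
apple = pd.DataFrame({
    "Date": pd.to_datetime(["2018-05-25", "2018-05-29"]),
    "Daily Change": [0.5, -0.7],
})

# Inner join: keep only articles whose date also appears in the stock table.
result = pd.merge(df_train, apple, on="Date", how="inner")

# Rows whose "Daily Change" is missing (e.g. the first trading day) can be dropped too.
result = result.dropna(subset=["Daily Change"]).reset_index(drop=True)
```

Several articles published on the same day all receive the same “Daily Change” value, which is why the merged table has many more rows than there are trading days.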

To learn a mapping between the document vectors (“features”) and the “labels”, I constructed a matrix “X” with the representations (features) and a vector “y” with the target values (the “Daily Change” of the stock price) from the merged table (“result”).
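A minimal sketch of this step, assuming the vectors are stored as NumPy arrays in a column named `vector` (a hypothetical name). The toy vectors have 4 dimensions instead of 300 to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical merged table: one document vector and one target per row.
result = pd.DataFrame({
    "vector": [
        np.array([0.1, 0.2, 0.3, 0.4]),
        np.array([0.5, 0.6, 0.7, 0.8]),
        np.array([0.9, 1.0, 1.1, 1.2]),
    ],
    "Daily Change": [0.5, -0.7, 1.2],
})

# Stack the per-row vectors into a 2-D feature matrix (rows = articles).
X = np.stack(result["vector"].tolist()).astype("float32")

# The regression targets as a 1-D array aligned with the rows of X.
y = result["Daily Change"].to_numpy(dtype="float32")
```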

Next, a deep learning model with “Keras” is constructed, which gets X and y as data to learn a mapping between the “text-features” and the stock prices.

The model uses some dense layers with “relu” activation and dropout for regularisation.

Structure of the model

I trained the model with “Mean Squared Error” as the loss function and the “Adam” optimizer over 500 epochs.

The loss decreases as desired during training:
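A sketch of such a model in Keras. Only the ingredients named in the text (dense layers, “relu”, dropout, mean squared error, Adam) come from the article; the layer sizes, dropout rate, batch size, and random stand-in data are assumptions, and the example trains for 3 epochs instead of 500 to stay fast.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_model(input_dim=300, hidden_units=(256, 128, 64), dropout_rate=0.3):
    """Dense regression network; sizes and dropout rate are assumed values."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(input_dim,)))
    for units in hidden_units:
        model.add(layers.Dense(units, activation="relu"))
        model.add(layers.Dropout(dropout_rate))  # regularisation
    model.add(layers.Dense(1))  # single linear output: the predicted daily change
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model


# Random data standing in for the document vectors X and the targets y.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 300)).astype("float32")
y = rng.normal(size=(64,)).astype("float32")

model = build_model()
history = model.fit(X, y, epochs=3, batch_size=32, verbose=0)

# history.history["loss"] holds one training-loss value per epoch.
losses = history.history["loss"]
```

Passing `validation_split` or `validation_data` to `fit` would additionally record a validation loss per epoch, which makes overfitting visible during training.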

Now let us look at the correlation between predicted and real stock values on the training dataset.

I got a correlation coefficient of more than 0.92, a strong correlation, as the scatterplot also shows:

Predictions on the training data set
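The correlation coefficient between real and predicted values can be computed with NumPy. This is a sketch on made-up numbers; the helper name `pearson_r` is hypothetical.

```python
import numpy as np


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


# Made-up values: predictions that are an exact linear function of the targets.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([2.0, 4.0, 6.0, 8.0])

r_perfect = pearson_r(y_true, y_pred)          # close to 1.0
r_reversed = pearson_r(y_true, y_pred[::-1])   # close to -1.0
```

With a Keras model, the predictions would typically be obtained as `model.predict(X).ravel()` before being passed to such a function.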

It looks as if I have constructed a good prediction model for stock prices. But let us validate it on the test set of articles from 1 June 2018 to 23 July 2018. These news texts were not used in building the model.

Now I merge the test dataset with the stock values in the same way as the training data before. To test the model on this data, I construct the vector representations with the trained “Doc2Vec” model and then apply the regression model to these vectors.

We get a correlation coefficient of only about 0.02, which the scatterplot also suggests. The predicted values have nothing to do with the real values, so the model is useless.

Predictions on the test data set

With our naive approach, we clearly overfitted the model to the training dataset, and it does not generalize to the test set. So you cannot use the model to make any real predictions.
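The same pattern can be reproduced on purely synthetic data. The sketch below swaps the neural network for a plain least-squares fit and uses random features and random targets, so there is nothing to learn by construction. It only illustrates the effect and does not reproduce the experiment above.

```python
import numpy as np


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


rng = np.random.default_rng(42)
n_train, n_test, n_features = 40, 500, 60  # more parameters than training rows

# Features and targets are independent noise in both sets.
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
y_train = rng.normal(size=n_train)
y_test = rng.normal(size=n_test)

# With more features than rows, least squares can fit the training noise exactly.
weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_r = pearson_r(y_train, X_train @ weights)  # near 1: the noise is memorised
test_r = pearson_r(y_test, X_test @ weights)     # near 0: nothing generalises
```

A model with enough free parameters can match the training targets almost perfectly even when the features carry no information, which is why only the correlation on held-out data says anything about predictive value.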

So as Niels Bohr said: “It’s Difficult to Make Predictions, Especially About the Future”.

University of Applied Sciences Upper Austria / School of Informatics, Communications and Media http://www.stoeckl.ai/profil/
