Developing a machine learning model for sentiment analysis to predict trends of stock market price from news (full report)

Sep 3, 2021

(Data Science Udacity Nanodegree Project)

1. Project Definition

1.1 Project Overview

News is one of the most important factors driving stock market trends. Positive or negative news can trigger a change in stock prices. An automatic trading system needs signals from news as one of its inputs, and individual traders also rely on news to analyse the market trend.

This project uses BERT, one of the state-of-the-art deep learning models in Natural Language Processing (NLP), to develop a tool for analysing news. The tool classifies news into two classes: positive/neutral and negative. The dataset is “Sun, J. (2016, August). Daily News for Stock Market Prediction, Version 1. Retrieved from https://www.kaggle.com/aaron7sun/stocknews”. It combines news from 25 sources with the trend of the stock market price for each day.

This project will show how to develop an NLP model to understand news relating to stock market trend (positive/neutral trend and negative trend).

1.2 Problem Statement

News can be one of the most important factors in stock market prediction. Monitoring news from many sources in real time is a challenging task, so a tool that understands how news relates to the stock market trend is valuable. This project develops an NLP model that extracts a signal (positive/neutral or negative) from news.

1.3 Metrics

First, this project uses “Model Accuracy” to measure the performance of the NLP model. Accuracy shows the success rate of predicting the stock market trend from news. However, accuracy alone might not be enough. The dataset is close to balanced but not perfectly so: the labels do not have the same number of samples. A “Confusion Matrix” is therefore also considered, since it shows the accuracy for each label.

In addition, “Precision”, “Recall”, and “F1-Score” are considered. They give more insight into model performance.

Precision is the share of predicted positives that are actually positive (true positives out of all true and false positives). In this case, it is the share of predicted negative trends that really were negative trends. Precision = True Positive / (True Positive + False Positive).
Recall is the share of ground-truth positives that the model finds (true positives out of all true positives and false negatives). In this case, it is the share of actual negative trends that the model predicted as negative. Recall = True Positive / (True Positive + False Negative).
F1-Score is the harmonic mean of precision and recall. It is a useful indicator of model performance when the dataset is imbalanced. The perfect F1 score is 1. F1 = 2 * (Precision * Recall) / (Precision + Recall).
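The three formulas above can be checked with a few lines of plain Python. This is a minimal sketch rather than the project's evaluation code: the function name `binary_metrics` and the example labels are made up for illustration, and label 1 (negative trend) is treated as the positive class.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical ground truth and predictions (1 = negative trend).
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In practice a library routine such as scikit-learn's `classification_report` would normally be used; the hand-written version only makes the definitions explicit.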

2. Analysis

2.1 Data Exploration

Observations about the data are listed below.

The data covers 2008-08-08 to 2016-07-01.
There are 1,989 data samples.
The data comes from 25 sources (topics).
The data is labelled into 2 groups (0 and 1):
0: the stock market closing price rose or stayed the same
1: the stock market closing price decreased

An example of the dataset can be seen in the table below.

In terms of statistical information, there are no notable highlights. The points below are possible examples, although they repeat what is mentioned above.

2.2 Data Visualisation

The pie chart below shows the number of news samples for each label (0 or 1). Label 0 (the stock market closing price rose or stayed the same) contains 924 samples, while label 1 (the stock market closing price decreased) has 1,065 samples.
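To put these counts in proportion, the short sketch below computes the share of each label and the accuracy of a trivial classifier that always predicts the larger class. The variable names are illustrative; only the two counts come from the text.

```python
# Sample counts per label, as reported in the text.
counts = {0: 924, 1: 1065}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the most frequent label is right
# exactly as often as that label occurs.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"label {label}: {counts[label]} samples ({share:.1%})")
print(f"majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Under these counts, always predicting label 1 would be correct about 53.5% of the time, which gives a reference point for reading accuracy figures on this dataset.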

3. Methodology

3.1 Data Preprocessing

The original dataset is a .csv file. Pandas (a data processing library) is used for preprocessing: it loads the csv data, removes unexpected words, and splits the data into input data and label data.
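The preprocessing described above might look roughly like the sketch below. Several details are assumptions, not taken from the project: the column names (`Date`, `Label`, `Top1`, `Top2`, ...), the cleaning rule (stripping byte-string residue such as `b"..."`), and the choice to join all headlines of a day into one string. A tiny in-memory table stands in for the real file so the example runs on its own.

```python
import pandas as pd

# Tiny stand-in for the CSV; with the real file this would be pd.read_csv(<path>).
# Column names and contents are assumed for illustration.
raw = pd.DataFrame({
    "Date": ["2016-06-30", "2016-07-01"],
    "Label": [1, 0],
    "Top1": ['b"Markets slide after vote"', "Oil prices recover"],
    "Top2": ["Central bank holds rates", None],
})

headline_cols = [c for c in raw.columns if c.startswith("Top")]


def clean(text):
    """Strip byte-string residue such as b"..." and surrounding whitespace."""
    text = str(text).strip()
    if text[:2] in ('b"', "b'"):
        text = text[2:-1]
    return text.strip()


# Join all headlines of a day into one input string, skipping missing cells.
texts = raw[headline_cols].apply(
    lambda row: " ".join(clean(v) for v in row if pd.notna(v)), axis=1
)
labels = raw["Label"].astype(int)

print(texts.tolist())
print(labels.tolist())
```

Here `texts` plays the role of the input data and `labels` the label data.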

For text preprocessing, the BERT model already includes the word embedding step. This means that we do not need to do feature engineering and can feed text data into the pre-trained model directly.

3.2 Implementation

This project uses the BERT model (link). BERT is one of the state-of-the-art models for Natural Language Processing and is based on the Transformer (attention) architecture. It was introduced by Google AI researchers in 2018. The pre-trained model is loaded from TensorFlow Hub (tfhub.dev), and one layer is added at the end of the BERT model as the classification layer.

Hyper-parameters, model information, and training information are listed below.

The ratio of training to test data is 80:20, using a random split with random state 12.
The pre-trained model comes from TensorFlow Hub (tfhub.dev): model BERT v.2 (link).
One layer of BatchNormalization (default setting) is added at the end of BERT output.
Dropout layer (10%) is added after BatchNormalization layer (the setting value is 0.1).
Classification layer is added at the end of the model (the activation function is None).
Binary Cross Entropy is the loss function.
Adam is used as the optimizer.
Number of epochs for training is 20.
Initial value of learning rate is 0.00003.
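Because the classification layer has no activation function, its output is a raw score (a logit) rather than a probability, so the binary cross-entropy has to apply the sigmoid itself. In Keras this corresponds to `BinaryCrossentropy(from_logits=True)`; whether the project set that flag is not stated, so the sketch below only illustrates the arithmetic. The function names are made up for this example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_from_logit(y, z):
    """Binary cross-entropy for label y in {0, 1} and raw logit z.

    Uses the stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def bce_from_prob(y, p):
    """Binary cross-entropy for label y and probability p in (0, 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


z = 0.8  # hypothetical logit produced by the classification layer
loss_logit = bce_from_logit(1, z)
loss_prob = bce_from_prob(1, sigmoid(z))
print(loss_logit, loss_prob)
```

Both routes give the same loss; the logit form simply avoids computing `log` of a value that may round to 0 or 1.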

The model structure is shown below.

3.3 Refinement

In the early stage of development, the model experienced two issues: (1) a low accuracy score and (2) over-fitting (the accuracy gap between training and test data was about 50%). The model was improved by adding a Batch Normalization layer to raise accuracy and a dropout layer to reduce over-fitting.

Using Batch Normalization can speed up training. Compared at the same epoch (epoch 30), training accuracy is 87.30% with BatchNormalization but roughly 60% without it.
Using a dropout layer with a rate of 0.1 (10% dropout), the test accuracy increases from approximately 47% to 51%.

4. Results

4.1 Model Evaluation and Validation

During training, the accuracy on the training data increases, while the accuracy on the test data fluctuates slightly but overall stays the same. At the final state (epoch 30), the model accuracy is 87.30% on the training data and 51.51% on unseen data (test data). The training accuracy seems likely to keep increasing if we train for more epochs. The accuracy on unseen data behaves differently: it stays flat and is not at an acceptable level. In addition, the model still suffers from over-fitting. We may consider other techniques to address these issues, such as adding L1/L2 regularization.

4.2 Justification

This project uses the BERT model to predict the trend of the stock market price. BERT is based on the attention-based architecture known as the Transformer. It converts raw text into word vectors, and the attention mechanism then lets the model focus on the most useful words in the text. This gives the model the ability to understand the sentiment of the text and link it to the trend of the stock market price (label 0 or 1).

In addition, BatchNormalization is used to speed up training. It normalizes the activations within each batch, which keeps parameter updates from being skewed when data samples differ greatly from one another. This results in better-tuned parameters across all text data.
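As a rough illustration of what a batch normalization layer computes during training, the NumPy sketch below normalizes each feature over the batch and then applies a learnable scale and shift. It is a simplified stand-in, not the Keras layer itself: at inference time the real layer uses moving averages of the mean and variance, which are omitted here. The function name and the random input are made up for the example.

```python
import numpy as np


def batch_norm(x, gamma, beta, eps=1e-3):
    """Training-mode batch normalization over the batch axis (axis 0)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # gamma (scale) and beta (shift) are the layer's learnable parameters.
    return gamma * x_hat + beta


rng = np.random.default_rng(0)
# Hypothetical batch: 32 samples, 4 features with a large offset and spread.
features = rng.normal(loc=5.0, scale=3.0, size=(32, 4))
out = batch_norm(features, gamma=np.ones(4), beta=np.zeros(4))

print(out.mean(axis=0))  # close to 0 for every feature
print(out.std(axis=0))   # close to 1 for every feature
```

After normalization every feature has roughly zero mean and unit variance within the batch, regardless of how the raw features were scaled.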

Dropout is used to reduce the overfitting problem. Overfitting happens when the parameters are tuned too closely to the training data, whereas the model needs to generalize to unseen data. Dropout randomly excludes some nodes during training, so the model cannot rely too heavily on any single node.
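The idea of excluding nodes can be sketched in NumPy as "inverted" dropout, the variant used by common deep learning frameworks: during training each activation is zeroed with probability equal to the rate, and the survivors are scaled up so the expected value stays the same. The function name and the input array are illustrative; the rate of 0.1 matches the setting reported in this project.

```python
import numpy as np


def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero each value with probability `rate` while training."""
    if not training or rate == 0.0:
        return x  # at inference time the layer passes values through unchanged
    keep = rng.random(x.shape) >= rate
    # Scale the kept values so the expected activation is unchanged.
    return x * keep / (1.0 - rate)


rng = np.random.default_rng(12)
activations = np.ones((1000, 8))  # hypothetical layer output
dropped = dropout(activations, rate=0.1, rng=rng)

zero_fraction = (dropped == 0).mean()
print(f"fraction of nodes dropped: {zero_fraction:.3f}")
print(f"mean activation after dropout: {dropped.mean():.3f}")
```

Roughly 10% of the values are zeroed in each pass, and a different random subset is dropped every time, which is what discourages the model from depending on individual nodes.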

5. Conclusion

5.1 Reflection

This project shows a possible way to use a state-of-the-art NLP model (BERT) to develop a tool that understands news relating to trends in the stock market price.

BERT already includes word embedding. This means that we do not need to do feature engineering and can feed our cleaned text data into the model directly.

Using a pre-trained model is a straightforward approach. We do not need a large dataset or long training hours to obtain a high-performance model; we can take the pre-trained model and re-train it for our task.

However, the model accuracy on unseen data is not at an acceptable level. There is room for improvement.

5.2 Improvement

The accuracy of the model is still not at an acceptable level. One possible improvement is to add more cleaning steps in data pre-processing, since the data might contain too much noise for the BERT model.

Github Link