Predicting stock market index with gradient boosting machine ensemble, Bayesian optimization, temporal consistency analysis, market sentiment analysis, game theory and novel holdout method

Bibliographic Details
Main Author: Yeo, Jarrett Shan Wei
Other Authors: Yeo Chai Kiat
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Subjects:
Online Access: https://hdl.handle.net/10356/148294
Institution: Nanyang Technological University
Description
Summary: The potential of machine learning has sustained the interest of both academia and industry in stock market prediction for more than a decade. This project aims to integrate modern techniques used in the field into a resource-efficient and accurate stock index predictor. While Gradient Boosting Machines (GBMs) have been around for more than twenty years, they have recently seen a revival in popularity through modern gradient-boosted decision tree implementations such as XGBoost in 2014, and LightGBM and CatBoost in 2017. Additionally, the literature on stock market prediction has focused on the use of macro-economic metrics, the creation of technical financial indicators, and, more recently, the analysis of social media big data. This project unifies these techniques into an efficient yet effective ensemble of the GBMs, called the CalixBoost Ensemble, trained on the aforementioned data. The models are tuned with Bayesian Optimization, and temporal consistency analysis is used for invariant feature selection in place of random trial and error. Market sentiment analysis is then conducted using a simple, fast, and effective rule-based model tuned specifically for understanding social media posts. Finally, the feature importance and inter-feature relationships of every model will be explained with a unified game-theoretic approach based on Shapley values, to better appreciate the models' inner workings. All models will be evaluated using a novel holdout method, viz. on two separate test datasets whose datapoints are collected under different conditions: first, normal economic activity; and second, a black swan event or financial downturn.
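The two-dataset holdout described in the summary can be illustrated with a minimal sketch. This is not the project's actual code: the cutoff dates, the toy index values, the function names (`split_by_regime`, `persistence_mae`), and the use of a naive "previous value" baseline scored by mean absolute error are all assumptions made for illustration. The only idea taken from the summary is that the same model is scored separately on a normal-period test set and a downturn-period test set.

```python
from datetime import date

def split_by_regime(rows, train_end, normal_end):
    """Split (date, value) rows into train, normal-period test, and downturn-period test."""
    train = [r for r in rows if r[0] <= train_end]
    normal_test = [r for r in rows if train_end < r[0] <= normal_end]
    downturn_test = [r for r in rows if r[0] > normal_end]
    return train, normal_test, downturn_test

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def persistence_mae(rows):
    """Score a naive baseline that predicts each value as the previous one."""
    values = [v for _, v in rows]
    return mae(values[1:], values[:-1])

# Hypothetical toy index values; real data would come from a market data source.
rows = [
    (date(2019, 1, 1), 100.0),
    (date(2019, 6, 1), 102.0),
    (date(2019, 9, 1), 103.0),
    (date(2019, 10, 1), 104.0),
    (date(2019, 11, 1), 105.0),
    (date(2020, 2, 1), 95.0),
    (date(2020, 3, 1), 80.0),
    (date(2020, 4, 1), 85.0),
]

# Assumed cutoffs: training ends mid-2019, the "normal" test window ends with 2019,
# and everything after that is treated as the downturn test window.
train, normal_test, downturn_test = split_by_regime(
    rows, train_end=date(2019, 6, 30), normal_end=date(2019, 12, 31)
)

normal_score = persistence_mae(normal_test)
downturn_score = persistence_mae(downturn_test)
print(f"normal-period MAE: {normal_score}, downturn-period MAE: {downturn_score}")
```

Reporting the two scores side by side, rather than pooling them, shows how much a model's error grows when market conditions change.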