Skip to content

Latest commit

 

History

History
144 lines (109 loc) · 10.7 KB

README.md

File metadata and controls

144 lines (109 loc) · 10.7 KB

Stock Market Regression - Starbucks Corp starbucks

The goal of this notebook is to implement classic regression models and implement metrics from scratch to calculate the quality of the regressors, aslo show some interesting plots that tells the steps involved. Join me on this funny journey ☕

Tools

Python Matplotlib scikit-learn Numpy Pandas PyCharm

Dataset

In this project we use data from the SBUX stock market. The data goes from 2019-06-05 to 2024-06-05. The data was extracted directly from Yahoo Finance using the API and the library yfinance.

Metrics

Metric Formula Interpretation
MSE $$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$ Lower values indicate a better fit
RMSE $$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$ Lower values indicate a better fit, same units as ( y )
MAE $$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$ Lower values indicate a better fit
RSE $$RSE = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$ Values closer to 0 indicate a better fit
RAE $$RAE = \frac{\sum_{i=1}^{n} |y_i - \hat{{y}_i} | }{\sum_{i=1}^{n} |y_i - \bar{y}|}$$ Values closer to 0 indicate a better fit
R $$R = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}}$$ Values closer to 1 or -1 indicate a strong linear relationship
$$R^2$$ $$R^2 = 1 - \frac{ \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^2}{ \sum_{i=1}^{n} (y_i - \bar{y})^2}$$ Values closer to 1 indicate a better fit

Dataset attributes

Using import yfinance as yf and sbux_data = yf.download('SBUX', period='5y') we have the following dataset. Notice that the attributes are Date, Open, High, Low, Close and Volume. The dataset header is the following:

Date Open High Low Close Volume
2019-06-05 78.790001 79.970001 78.660004 79.959999 7437100
2019-06-06 80.029999 81.629997 79.900002 81.400002 10457200
2019-06-07 81.599998 83.330002 81.510002 82.480003 11278800
2019-06-10 82.849998 82.860001 81.379997 81.930000 8102800
2019-06-11 82.300003 82.860001 81.849998 82.370003 6226400

Plotting the historic attribute Close

Here, we take the Close attribute from the entire historic and plot it to see the evoluting of the stock over time.

historical

Make train and test splits

Becasuse we seek to implement regression over time-series data, we need to take sequences, therefore, to make train and test splits, we take a sequence for train split and another sequence for test split.

  • The train split goes from Jan 01, 2022 to Mar 01, 2024.
  • The test split goes from Mar 01, 2023 to Jun 03, 2024. We can see this in the following image:

train_test

Process data

Make chunks of time-data

In order to find paterns in the data, we give the data some special estructure that we call chunks i.e.time windows:

  • take the data and makes chunk, the chunk shape is a window of (7,5)
  • 7 are the days and 5 are the attributes (open, high, low, close, volume)
  • for wach chunk, we add a target that will be the Close value of the 8th day
  • the chunks will have an offset of 1 day i.e. the window will slide one day to make the next time-window and the next target value

Reshape data

In order to employ classic regression models, we need to give the data a valid shape to be compatible with the model. Thats why we flateen our data. Previously oru data has a shape of (7,5), now will be (35,)

Normalize data

We implement a normalization using standard-deviation like: $$z = \frac{X - \mu}{\sigma}$$

Regressor models

Random Forest Regressor

It's an ensemble learning method that operates by constructing a multitude of decision trees during training time and outputting the mean prediction of the individual trees for regression problems. It's known for its robustness against overfitting and high performance.

Gradient Boosting Regressor

It's another ensemble method that builds trees one at a time, where each new tree helps to correct errors made by the previously trained set of trees. Gradient boosting tends to be more sensitive to overfitting than Random Forests but can often yield better performance if tuned correctly.

Extra Trees Regressor

This is similar to Random Forests, but with one key difference: instead of searching for the best split point in the feature when building trees, it selects a random split point. This randomization can lead to faster training times and sometimes better performance, especially for high-dimensional data.

Train and predict

The previously regressor models where implemented directly from the scikit-learn modules.

  • the trainning was implemented with the fit() method on the train split data
  • and the prediction with the predict() method with the test split data

Pathway

The following image shows the steps summarized in this project, notice that the last plots shows the results of prediction values for Close for each regressor model.

pathway

Actual vs Predicted

In this image, we can see a comparison for the actual values and th predicted ones for each regressor.

actual-predicted

Residual analysis

In this image, we can see a comparison for the residual analysis for each regressor.

residual

Metrics chart

In this plot, we can observe a summarization of the metrics calculated for each regressors model.

metrics-plot

Results: Regressor quality

As we can observe, the regressor models have a very similar behaviour in all the seven metrics calculated. Nevertheless, the Random Forest Regressor show an slightly improvement in the metrics.

Regressors MSE RMSE MAE RSE RAE R R2
RandomForestReg 9.4459 3.0734 1.9380 3.1298 0.3270 0.8927 0.7737
GradientBoostingReg 9.9440 3.1534 2.0396 3.2113 0.3441 0.8845 0.7617
ExtraTreesReg 10.3015 3.2096 1.9351 3.2685 0.3265 0.8804 0.7532

Conclussions

Regression analysis is a powerful tool for understanding relationships between variables and making predictions. By carefully considering the assumptions and properly interpreting the results, regression models can provide valuable insights in many fields, from economics and finance to biology and engineering.

In this project, we explored various metrics to evaluate the performance of regression models. The metrics we focused on included Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Relative Squared Error (RSE), Relative Absolute Error (RAE), Mean Absolute Percentage Error (MAPE), the Correlation Coefficient (R), and the Coefficient of Determination (R2).

Each metric provides unique insights into the performance of regression models:

  • MSE and RMSE are useful for understanding the average magnitude of errors, with RMSE being particularly interpretable due to its same units as the dependent variable.
  • MAE offers a robust measure less sensitive to outliers compared to MSE and RMSE.
  • RSE and RAE provide relative measures comparing model performance to a baseline model, with values closer to 0 indicating superior performance.
  • R helps assess the strength and direction of the linear relationship between actual and predicted values, with values closer to 1 or -1 indicating stronger linear relationships.
  • R2 indicates the proportion of variance in the dependent variable explained by the independent variables, with higher values signifying better model fit.

The comprehensive evaluation using these metrics allows an understanding of model performance. For example, while a model might exhibit low MSE and RMSE, indicating small average errors, it could still have high RSE or RAE values if the baseline model performs similarly well. Similarly, a high R2 value signifies that a significant portion of the variance is explained by the model, but it doesn’t provide information about the actual size of prediction errors, which metrics like MAE and RMSE do.

In conclusion, employing a diverse set of evaluation metrics provides a holistic view of the regression model’s effectiveness, enabling more informed decisions in model selection and refinement. The combination of these metrics ensures that the models not only fit the data well but also generalize effectively to new data, ultimately leading to more accurate and reliable predictions in practical applications.