This repo exists to test and develop my hypothesis of using PCA to perform feature reduction as a pre-processing step for model creation.
Imagine you have a collection of input features that are to be used to build an ML model; let us assume Linear Regression (LR). Now suppose that some of the input features are correlated. This means that, given one of the features, you can construct an (approximately) bijective mapping that takes you from one to the other. If you then proceeded to build a model using these features, there would be a lot of redundant information in the system, potentially increasing training time and reducing accuracy.
Thinking in terms of linear algebra, our input features are our basis vectors. What we desire is for our input features to be the equivalent of a linearly independent set. If one or more of the input features are correlated with others and can be expressed in terms of them, then they do not form an independent set.
For the time being we shall assume that all the input features are continuous variables. Adaptation/extension to discrete-continuous and discrete-discrete interactions is still being worked on.
This is the procedure we propose. Once the data cleaning has been completed, produce a feature correlation matrix using a metric of your choice; we shall use Pearson/Spearman correlation. If we interpret this matrix as an adjacency matrix then we can visualise it as a graph with weighted edges. Next, choose a threshold and discard any edges whose absolute correlation falls below it; the remaining graph splits into connected components.
For the components that are not isolated nodes, we take the corresponding features from the data set and use PCA to reduce them down to a single dimension. This single component is then used for model creation in place of the features that formed it.
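A minimal sketch of this procedure, assuming the cleaned features sit in a pandas DataFrame `X` of continuous variables and assuming a user-chosen correlation threshold (the function name, threshold value, and use of networkx are illustrative, not part of the original write-up):

```python
import numpy as np
import pandas as pd
import networkx as nx
from sklearn.decomposition import PCA


def reduce_correlated_features(X: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    # Feature correlation matrix (Pearson here; Spearman also works).
    corr = X.corr(method="pearson").abs()

    # Interpret the matrix as a weighted graph: nodes are features,
    # edges are correlations at or above the threshold.
    g = nx.Graph()
    g.add_nodes_from(X.columns)
    for i, a in enumerate(X.columns):
        for b in X.columns[i + 1:]:
            if corr.loc[a, b] >= threshold:
                g.add_edge(a, b, weight=corr.loc[a, b])

    reduced = {}
    for k, component in enumerate(nx.connected_components(g)):
        cols = sorted(component)
        if len(cols) == 1:
            # Isolated node: keep the feature as-is.
            reduced[cols[0]] = X[cols[0]].to_numpy()
        else:
            # Non-isolated component: collapse its features to one PCA dimension.
            # (Scaling the features beforehand may be advisable; omitted here.)
            pc1 = PCA(n_components=1).fit_transform(X[cols]).ravel()
            reduced[f"pca_group_{k}"] = pc1
    return pd.DataFrame(reduced, index=X.index)
```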
To test/validate our method we shall run the following experiment.
We shall construct a synthetic data set, so that we possess an underlying ground truth against which we can make comparisons. We shall produce three models (a sketch of all three pipelines follows the list below):
- Natural model - a model with no PCA performed, built using all of the input features.
- Hybrid model - a model built by applying our proposed feature-reduction (FR) method.
- PCA model - a model where we apply PCA to all of the features regardless of their correlations and use the top $k$ principal components.
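A rough sketch of the three variants, assuming the synthetic data is already split into `X_train`/`X_test` DataFrames with targets `y_train`/`y_test`, and assuming the `reduce_correlated_features` helper from the sketch above implements the hybrid step; the value of `k` and the variable names are illustrative:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Natural model: all raw features, no PCA.
natural = LinearRegression().fit(X_train, y_train)

# Hybrid model: correlated groups collapsed to one component each.
# (Simplification: a proper implementation would learn the grouping and the
# per-group PCA on the training split only and then apply them to the test split.)
Xh_train = reduce_correlated_features(X_train, threshold=0.8)
Xh_test = reduce_correlated_features(X_test, threshold=0.8)
hybrid = LinearRegression().fit(Xh_train, y_train)

# PCA model: PCA over all features, keeping the top k components.
k = 5
pca = PCA(n_components=k).fit(X_train)
pca_model = LinearRegression().fit(pca.transform(X_train), y_train)

print("natural R^2:", r2_score(y_test, natural.predict(X_test)))
print("hybrid  R^2:", r2_score(y_test, hybrid.predict(Xh_test)))
print("pca     R^2:", r2_score(y_test, pca_model.predict(pca.transform(X_test))))
```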
Keeping in mind the restrictions of the current concept, we are only creating synthetic continuous data for an LR problem.
Choose an initial number of features and then decide how many of them will be correlated. They will not all be correlated with one another, but rather broken into subgroups within which the features are correlated. For example, in our initial exploration we chose to have 10 features, 5 of which were to be correlated, broken into subgroups of 2 and 3.
These subgroups are generated from a multivariate distribution, so we select a covariance matrix that produces the desired correlations. Note that in larger subgroups not all the variables have to be correlated with each other; the minimum requirement is that they form a connected component. In our initial run, our 3-feature correlated subgroup has f1~f2 and f2~f3 but not f1~f3.
Once the correlated features have been created, the remaining features can be generated from normal distributions.
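A minimal sketch of this feature generation, assuming the layout from the initial exploration (10 features: a correlated pair, a correlated triple, and 5 independent features) and assuming a multivariate normal is used for the subgroups; all covariance values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000

# Correlated pair: f0 ~ f1.
cov_pair = np.array([[1.0, 0.8],
                     [0.8, 1.0]])
pair = rng.multivariate_normal(mean=np.zeros(2), cov=cov_pair, size=n)

# Correlated triple: f2 ~ f3 and f3 ~ f4, while f2 and f4 are only weakly related,
# so the group forms a connected component without being fully correlated.
cov_triple = np.array([[1.0, 0.7, 0.2],
                       [0.7, 1.0, 0.7],
                       [0.2, 0.7, 1.0]])
triple = rng.multivariate_normal(mean=np.zeros(3), cov=cov_triple, size=n)

# Remaining independent features from standard normals.
independent = rng.normal(size=(n, 5))

X = np.hstack([pair, triple, independent])  # shape (n, 10)
```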
To create the target labels we generate a random number for each feature to serve as that feature's coefficient in the linear equation, plus an intercept term; the features are then combined linearly using these coefficients, and a little noise is added to give the corresponding target variable.
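A small sketch of this target construction, written as a standalone function that takes any feature matrix `X` (such as the one generated above); the coefficient ranges and noise scale are illustrative assumptions:

```python
import numpy as np


def make_target(X: np.ndarray, rng: np.random.Generator, noise_scale: float = 0.5) -> np.ndarray:
    # One random coefficient per feature, plus an intercept term.
    coeffs = rng.uniform(-5, 5, size=X.shape[1])
    intercept = rng.uniform(-10, 10)
    # Linear combination of the features, with a little Gaussian noise added.
    return X @ coeffs + intercept + rng.normal(scale=noise_scale, size=X.shape[0])


# Usage with the feature matrix from the previous sketch:
# y = make_target(X, rng)
```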
We started by running initial tests to see if there were any early indications that the method would be worth pursuing. Having reached that point, we went on to create a class that implements the proposed method.