Multiple regression is a statistical method used to examine the relationship between a dependent variable ($Y$) and two or more independent variables ($X_1, X_2, \dots, X_k$).
- Dependent Variable ($Y$): The outcome or response variable we aim to predict or explain.
- Independent Variables ($X_1, X_2, \dots, X_k$): The predictors or explanatory variables that influence $Y$.
- Regression Equation:
  $$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_kX_k + \epsilon$$
  where:
  - $\beta_0$: Intercept (the value of $Y$ when all $X_i = 0$)
  - $\beta_1, \beta_2, \dots, \beta_k$: Coefficients showing the effect of each $X_i$ on $Y$
  - $\epsilon$: Error term
Multiple regression relies on the following assumptions:

- Linearity: The relationship between $Y$ and each $X_i$ is linear.
- Independence: Observations are independent of each other.
- Homoscedasticity: The variance of the residuals is constant across all levels of $X_i$.
- No Multicollinearity: The independent variables are not highly correlated with each other.
- Normality: The residuals follow a normal distribution.
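Some of these assumptions can be probed directly from the fitted residuals. Below is a minimal sketch, assuming NumPy is available and using made-up data, that fits an ordinary least-squares model and verifies two algebraic consequences of a linear fit with an intercept: the residuals average to zero and are uncorrelated with each predictor. These are useful baselines when reading residual plots.

```python
import numpy as np

# Made-up illustrative data: y depends linearly on two predictors plus noise.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=50)
x2 = rng.uniform(0, 5, size=50)
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0, 1.0, size=50)

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# With an intercept, OLS residuals sum to zero and are orthogonal
# to every column of the design matrix.
print(np.isclose(residuals.mean(), 0.0))             # True
print(np.allclose(X.T @ residuals, 0.0, atol=1e-7))  # True
```

Systematic patterns in these residuals (curvature, fanning-out spread) are what the linearity and homoscedasticity checks look for.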
- Define the variables: Identify the dependent variable ($Y$) and the independent variables ($X_1, X_2, \dots, X_k$).
- Explore the data: Examine the dataset for missing values, outliers, and potential correlations among independent variables.
- Fit the model: Estimate the coefficients ($\beta_0, \beta_1, \dots, \beta_k$), typically by ordinary least squares.
- Evaluate the model: Use metrics such as $R^2$, adjusted $R^2$, and p-values to assess how well the model fits the data.
- Interpret the results: Understand the influence of each independent variable on $Y$ through its coefficient.
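The fitting step usually means ordinary least squares: choose the coefficients that minimize the sum of squared residuals, which leads to the normal equations $X^\top X\,\beta = X^\top y$. A minimal sketch with NumPy and made-up data:

```python
import numpy as np

# Made-up data: 6 observations, intercept column plus 2 predictors.
X = np.array([
    [1.0, 2.0, 1.0],
    [1.0, 4.0, 2.0],
    [1.0, 5.0, 1.0],
    [1.0, 7.0, 3.0],
    [1.0, 8.0, 2.0],
    [1.0, 9.0, 3.0],
])
y = np.array([10.0, 17.0, 18.0, 28.0, 29.0, 33.0])

# Ordinary least squares via the normal equations: (X'X) beta = X'y.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same minimization more robustly.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta, beta_lstsq))  # True
```

In practice, prefer `lstsq` (or a statistics library) over solving the normal equations directly, since it handles near-collinear predictors more gracefully.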
A company wants to predict employee salaries ($Y$) based on years of experience ($X_1$) and education level ($X_2$).
| Years of Experience ($X_1$) | Education Level ($X_2$) | Salary ($Y$) |
|---|---|---|
| 2 | 1 | 50,000 |
| 4 | 2 | 60,000 |
| 6 | 2 | 70,000 |
| 8 | 3 | 85,000 |
| 10 | 3 | 95,000 |
- Regression Equation:
  $$Y = \beta_0 + \beta_1X_1 + \beta_2X_2$$
- Fit the Model: Using statistical software or manual calculations, suppose the estimated coefficients are:
  $\beta_0 = 40{,}000$, $\beta_1 = 5{,}000$, $\beta_2 = 10{,}000$
- Fitted Equation:
  $$Y = 40{,}000 + 5{,}000X_1 + 10{,}000X_2$$
- Prediction: For an employee with 7 years of experience ($X_1 = 7$) and education level 2 ($X_2 = 2$):
  $$Y = 40{,}000 + 5{,}000(7) + 10{,}000(2) = 40{,}000 + 35{,}000 + 20{,}000 = 95{,}000$$
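A short sketch, assuming NumPy is available, that evaluates the illustrative fitted equation for this prediction and also re-fits the five table rows by least squares. Note that on such a tiny sample the least-squares estimates need not match the rounded illustrative coefficients above.

```python
import numpy as np

# The five observations from the table.
experience = np.array([2, 4, 6, 8, 10], dtype=float)  # X1
education = np.array([1, 2, 2, 3, 3], dtype=float)    # X2
salary = np.array([50_000, 60_000, 70_000, 85_000, 95_000], dtype=float)

# Prediction from the illustrative equation Y = 40,000 + 5,000*X1 + 10,000*X2.
y_hat = 40_000 + 5_000 * 7 + 10_000 * 2
print(y_hat)  # 95000

# Re-fitting by least squares gives the actual estimates for this sample:
# beta holds the intercept and the two slopes, in that order.
X = np.column_stack([np.ones(5), experience, education])
beta, *_ = np.linalg.lstsq(X, salary, rcond=None)
```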
- R-Squared ($R^2$): Measures the proportion of variance in $Y$ explained by all $X_i$ together.
- Adjusted R-Squared: Accounts for the number of predictors, providing a better measure for models with multiple variables.
- P-Values: Test the significance of each predictor. If $p \leq \alpha$ (e.g., 0.05), the predictor is statistically significant.
- Variance Inflation Factor (VIF): Detects multicollinearity. Values greater than 10 indicate high multicollinearity.
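VIF needs no special library: regress each predictor on the remaining predictors (plus an intercept), take that auxiliary $R^2$, and form $\mathrm{VIF} = 1/(1 - R^2)$. A sketch with NumPy, applied to the two predictors from the salary example:

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictor columns only, no intercept)."""
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        # Auxiliary regression: column j on an intercept plus the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        dev = target - target.mean()
        r2 = 1.0 - (resid @ resid) / (dev @ dev)
        out.append(1.0 / (1.0 - r2))  # blows up as r2 -> 1 (perfect collinearity)
    return np.array(out)

# Predictors from the salary example: experience and education are strongly
# correlated here, so both VIFs come out to about 9.33 -- high, though still
# below the common cutoff of 10.
X = np.array([[2, 1], [4, 2], [6, 2], [8, 3], [10, 3]], dtype=float)
print(vif(X))
```

With exactly two predictors the auxiliary $R^2$ is just their squared correlation, so both columns get the same VIF.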
- Business: Predicting sales based on advertising spend, price, and product quality.
- Healthcare: Estimating patient outcomes based on age, weight, and treatment type.
- Education: Analyzing student performance based on study time, attendance, and teaching methods.
- Scatter Plot Matrix: Shows pairwise relationships between $Y$ and each $X_i$.
- Residual Plot: Helps check assumptions such as linearity and homoscedasticity.
Multiple regression is a versatile tool for understanding complex relationships between a dependent variable and multiple predictors. By carefully checking assumptions and interpreting results, you can use this method to make accurate predictions and data-driven decisions.
Next Steps: Logistic Regression