Skip to content

Latest commit

 

History

History
160 lines (94 loc) · 4.07 KB

1. Simple Linear Regression.md

File metadata and controls

160 lines (94 loc) · 4.07 KB

Simple Linear Regression

Simple linear regression is a statistical method used to examine the relationship between two continuous variables: one independent variable ($X$) and one dependent variable ($Y$). It determines how $X$ can predict or explain $Y$.


Key Concepts

  1. Independent Variable ($X$):
    The predictor or explanatory variable.

  2. Dependent Variable ($Y$):
    The response or outcome variable.

  3. Regression Line:
    A line that best fits the data, showing the relationship between $X$ and $Y$.

  4. Equation of the Regression Line:

    $$Y = \beta_0 + \beta_1X + \epsilon$$

    Where:

    • $Y$: Predicted value of the dependent variable
    • $\beta_0$: Intercept (value of $Y$ when $X = 0$)
    • $\beta_1$: Slope (change in $Y$ for a one-unit increase in $X$)
    • $\epsilon$: Error term

Assumptions of Simple Linear Regression

  1. Linearity: The relationship between $X$ and $Y$ is linear.
  2. Independence: Observations are independent of each other.
  3. Homoscedasticity: The variance of residuals (errors) is constant across all levels of $X$.
  4. Normality: Residuals follow a normal distribution.

Steps to Perform Simple Linear Regression

Step 1: Formulate the Model

Define the dependent ($Y$) and independent ($X$) variables.

Step 2: Collect and Explore Data

Examine the dataset for missing values, outliers, and patterns.

Step 3: Fit the Model

Estimate the coefficients ($\beta_0, \beta_1$) using the least squares method, which minimizes the sum of squared residuals:

$$ \text{Residual} = Y_{\text{observed}} - Y_{\text{predicted}} $$

Step 4: Evaluate the Model

Assess the fit of the regression line using metrics like $R^2$, p-value, and standard error.

Step 5: Interpret Results

Understand the relationship between $X$ and $Y$ based on the slope and intercept.


Example

Problem:

A researcher wants to study the relationship between study hours ($X$) and test scores ($Y$) for a group of students. The data is:

Study Hours ($X$) Test Scores ($Y$)
2 50
4 55
6 60
8 70
10 85

Solution:

  1. Equation of Regression Line:

    $$Y = \beta_0 + \beta_1X$$

  2. Calculate Coefficients:

    Using the least squares method:

    • Slope ($\beta_1$):

      $$\beta_1 = \frac{\text{Cov}(X, Y)}{\text{Var}(X)}$$

    • Intercept ($\beta_0$):

      $$\beta_0 = \bar{Y} - \beta_1\bar{X}$$

    After calculations:

    $$\beta_1 = 3.5, \quad \beta_0 = 46$$

  3. Regression Equation:

    $$Y = 46 + 3.5X$$

  4. Predicted Values:

    For $X = 6$:

    $$Y = 46 + 3.5(6) = 67$$


Model Evaluation Metrics

  1. R-Squared ($R^2$): Measures the proportion of variance in $Y$ explained by $X$.

    • Value ranges from 0 to 1.
    • Higher values indicate a better fit.
  2. Standard Error:
    Represents the average distance of the observed values from the regression line.

  3. P-Value:
    Tests the null hypothesis that there is no relationship between $X$ and $Y$.

    • If $p \leq \alpha$ (e.g., 0.05), reject the null hypothesis.

Applications of Simple Linear Regression

  1. Business:
    Predicting sales based on advertising expenses.

  2. Healthcare:
    Estimating blood pressure changes based on age.

  3. Education:
    Analyzing the impact of study time on academic performance.


Visualization of Regression

Scatter Plot with Regression Line:

  • A scatter plot shows individual data points.
  • The regression line depicts the relationship between $X$ and $Y$.

Conclusion

Simple linear regression is a foundational statistical tool for analyzing relationships between two variables. By understanding its assumptions, calculations, and interpretations, you can derive meaningful insights and make data-driven decisions.


Next Steps: Multiple Regression