
Classification and Regression

Overview

This section covers statistical methods for classification and regression problems.

Binary Classification

Def. (Binary Classification)

Binary classification is the task of categorizing observations into one of two classes based on input features.

Examples in computer science:

  • Spam detection (spam/not spam)
  • Malware detection (malicious/benign)
  • Network intrusion detection (attack/normal)
Def. (Confusion Matrix)

A confusion matrix is a 2×2 table that summarizes the performance of a binary classifier:

                    Predicted Positive     Predicted Negative
Actual Positive     True Positive (TP)     False Negative (FN)
Actual Negative     False Positive (FP)    True Negative (TN)
  • Type I Error: False Positive (rejecting true null hypothesis)
  • Type II Error: False Negative (failing to reject false null hypothesis)
Def. (Classification Metrics)

Common metrics derived from the confusion matrix:

  • Accuracy: \frac{TP + TN}{TP + TN + FP + FN} (proportion of correct predictions)
  • Precision: \frac{TP}{TP + FP} (proportion of positive predictions that are correct)
  • Recall (Sensitivity): \frac{TP}{TP + FN} (proportion of actual positives correctly identified)
  • Specificity: \frac{TN}{TN + FP} (proportion of actual negatives correctly identified)
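These metrics translate directly into code. A minimal sketch with invented counts for a hypothetical spam classifier (the numbers are illustrative, not from the text):

```python
# Invented confusion-matrix counts for a hypothetical spam classifier
tp, fn = 40, 10   # actual positives: correctly / incorrectly classified
fp, tn = 5, 45    # actual negatives: incorrectly / correctly classified

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # (TP + TN) / total
precision   = tp / (tp + fp)                    # correct among predicted positives
recall      = tp / (tp + fn)                    # sensitivity: found among actual positives
specificity = tn / (tn + fp)                    # found among actual negatives

print(accuracy, precision, recall, specificity)
```

Note that accuracy alone can be misleading when classes are imbalanced (e.g., very little spam), which is why precision and recall are usually reported alongside it.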

Joint Distributions and Relationships

Def. (Joint Probability Distribution)

For random variables X and Y, the joint distribution specifies the probability of all possible combinations of values:

  • Discrete: Joint PMF p_{X,Y}(x,y) = P(X=x, Y=y)
  • Continuous: Joint PDF f_{X,Y}(x,y) where P((X,Y) \in A) = \iint_A f_{X,Y}(x,y) \, dx \, dy
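In the discrete case, a joint PMF for two small finite variables is just a table. A quick sketch with an invented 2×2 joint PMF, checking that it is a valid distribution:

```python
import numpy as np

# Invented joint PMF for X in {0, 1} (rows) and Y in {0, 1} (columns)
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

# P(X = 1, Y = 0) is just the corresponding table entry
print(p_xy[1, 0])

# A valid joint PMF is nonnegative and sums to 1 over all (x, y) pairs
print(p_xy.sum())
```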
Def. (Marginal Distribution)

The marginal distribution of a random variable can be obtained from a joint distribution by summing or integrating over the other variable(s):

  • Discrete: p_X(x) = \sum_y p_{X,Y}(x,y)
  • Continuous: f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dy

The marginal distribution gives the probabilities for one variable without regard to the values of the other variable(s).
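With a discrete joint PMF stored as a 2-D array (rows indexing x, columns indexing y), the marginals are just row and column sums. A sketch with invented values:

```python
import numpy as np

# Invented joint PMF: rows index x, columns index y
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

p_x = p_xy.sum(axis=1)   # sum over y for each x -> marginal of X
p_y = p_xy.sum(axis=0)   # sum over x for each y -> marginal of Y

print(p_x)   # marginal of X
print(p_y)   # marginal of Y
```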

Def. (Conditional Distribution)

The conditional distribution of Y given X = x is:

  • Discrete: p_{Y|X}(y|x) = \frac{p_{X,Y}(x,y)}{p_X(x)}, provided p_X(x) > 0
  • Continuous: f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}, provided f_X(x) > 0

This gives the distribution of Y when we know X = x.
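In the discrete sketch, conditioning on X = x means dividing row x of the joint PMF by the marginal p_X(x); each resulting row is itself a distribution and sums to 1. Values are invented:

```python
import numpy as np

# Invented joint PMF: rows index x, columns index y
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])
p_x = p_xy.sum(axis=1)   # marginal of X

# p_{Y|X}(y|x) = p_{X,Y}(x, y) / p_X(x), computed row by row
p_y_given_x = p_xy / p_x[:, None]

print(p_y_given_x.sum(axis=1))   # each row sums to 1
```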

Def. (Covariance)

Covariance measures the linear relationship between two random variables X and Y:

\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]

  • Positive covariance: X and Y tend to increase together
  • Negative covariance: when X increases, Y tends to decrease
  • Zero covariance: no linear relationship (but there may be a nonlinear relationship)
Def. (Correlation Coefficient)

The correlation coefficient \rho standardizes covariance to the range [-1, 1]:

\rho_{X,Y} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}

  • \rho = 1: perfect positive linear relationship
  • \rho = -1: perfect negative linear relationship
  • \rho = 0: no linear relationship

Important: Correlation does not imply causation!
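Both quantities can be computed straight from the definitions. A sketch on invented data that is nearly, but not exactly, linear, so \rho lands close to but below 1:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + np.array([0.1, -0.2, 0.0, 0.2, -0.1])   # invented noisy line

# Cov(X, Y) = E[XY] - E[X]E[Y] (population form)
cov = np.mean(x * y) - np.mean(x) * np.mean(y)

# rho = Cov(X, Y) / (sigma_X * sigma_Y)
rho = cov / (x.std() * y.std())

print(rho)   # close to, but less than, 1
```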

Linear Regression

Def. (Simple Linear Regression)

Simple linear regression models the relationship between a response variable Y and a predictor X:

Y = \beta_0 + \beta_1 X + \epsilon

where \epsilon is the error term with E[\epsilon] = 0.

Least Squares Estimates:

\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

Coefficient of Determination: R^2 = 1 - \frac{SS_{res}}{SS_{tot}} measures the proportion of variance explained by the model.
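The least-squares formulas translate directly into code. A sketch on invented data that roughly follows y = 2x:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # invented, roughly y = 2x

# Slope and intercept from the least-squares formulas
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Coefficient of determination: R^2 = 1 - SS_res / SS_tot
y_hat = b0 + b1 * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(b0, b1, r2)
```

Because the data were generated close to a line, R^2 comes out near 1, meaning the fitted line explains almost all of the variance in y.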

Def. (Residual)

A residual is the difference between an observed value and the predicted value from the regression model:

e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)

Residuals measure the error in the model's predictions. Analysis of residuals is used to check regression assumptions.
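Two algebraic facts about least-squares residuals make handy numeric checks: they sum to zero, and they are orthogonal to the predictor. A sketch on invented data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # invented data

# Least-squares fit, then residuals e_i = y_i - (b0 + b1 * x_i)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)

# For a least-squares fit, residuals sum to zero and are
# orthogonal to the predictor (up to floating-point error).
print(e.sum(), np.dot(e, x))
```

In practice these residuals are also what gets plotted against x (or against the fitted values) when checking the regression assumptions below; any visible pattern suggests a violation.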

Assumptions of Linear Regression

For valid inference in linear regression, the following assumptions should hold:

  1. Linearity: The relationship between X and Y is linear. The true relationship can be expressed as Y = \beta_0 + \beta_1 X + \epsilon.

  2. Independence: Observations are independent of each other. The value of one observation does not influence another.

  3. Homoscedasticity (Constant Variance): The variance of the errors \epsilon is constant across all values of X. Formally, \text{Var}(\epsilon \mid X = x) = \sigma^2 for all x.

  4. Normality: The errors \epsilon are normally distributed: \epsilon \sim N(0, \sigma^2). This assumption is particularly important for hypothesis testing and for constructing confidence intervals.

  5. No Perfect Multicollinearity (for multiple regression): Predictor variables should not be perfectly correlated. In simple linear regression with one predictor, this is not a concern.

Note: Violations of these assumptions can lead to biased estimates, incorrect standard errors, and invalid hypothesis tests. Diagnostic plots (residual plots, Q-Q plots) are used to check these assumptions.

Applications

  • Prediction: Estimate future values based on observed relationships
  • Trend Analysis: Identify and quantify relationships between variables
  • Feature Importance: Determine which variables are most predictive