
Classification and Regression

Overview

This section covers statistical methods for classification and regression problems.

Binary Classification

Def. (Binary Classification)

Binary classification is the task of categorizing observations into one of two classes based on input features.

Examples in computer science:

  • Spam detection (spam/not spam)
  • Malware detection (malicious/benign)
  • Network intrusion detection (attack/normal)
Def. (Confusion Matrix)

A confusion matrix is a 2×2 table that summarizes the performance of a binary classifier:

                    Predicted Positive     Predicted Negative
Actual Positive     True Positive (TP)     False Negative (FN)
Actual Negative     False Positive (FP)    True Negative (TN)
  • Type I Error: False Positive (rejecting true null hypothesis)
  • Type II Error: False Negative (failing to reject false null hypothesis)
Def. (Classification Metrics)

Common metrics derived from the confusion matrix:

  • Accuracy: \frac{TP + TN}{TP + TN + FP + FN} (proportion of correct predictions)
  • Precision: \frac{TP}{TP + FP} (proportion of positive predictions that are correct)
  • Recall (Sensitivity): \frac{TP}{TP + FN} (proportion of actual positives correctly identified)
  • Specificity: \frac{TN}{TN + FP} (proportion of actual negatives correctly identified)
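These metrics translate directly into code. A minimal sketch with invented counts for a hypothetical spam classifier (the numbers are illustrative, not from the text):

```python
# Invented confusion-matrix counts for a hypothetical spam classifier
tp, fn = 40, 10   # actual positives: correctly / incorrectly classified
fp, tn = 5, 45    # actual negatives: incorrectly / correctly classified

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # (TP + TN) / total
precision   = tp / (tp + fp)                    # correct among predicted positives
recall      = tp / (tp + fn)                    # sensitivity: found among actual positives
specificity = tn / (tn + fp)                    # found among actual negatives

print(accuracy, precision, recall, specificity)
```

Note that accuracy alone can be misleading when classes are imbalanced (e.g., very little spam), which is why precision and recall are usually reported alongside it.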

Joint Distributions and Relationships

Def. (Joint Probability Distribution)

For random variables X and Y, the joint distribution specifies the probability of all possible combinations of values:

  • Discrete: Joint PMF p_{X,Y}(x,y) = P(X=x, Y=y)
  • Continuous: Joint PDF f_{X,Y}(x,y) where P((X,Y) \in A) = \iint_A f_{X,Y}(x,y) \, dx \, dy
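In the discrete case, a joint PMF for two small finite variables is just a table. A quick sketch with an invented 2×2 joint PMF, checking that it is a valid distribution:

```python
import numpy as np

# Invented joint PMF for X in {0, 1} (rows) and Y in {0, 1} (columns)
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

# P(X = 1, Y = 0) is just the corresponding table entry
print(p_xy[1, 0])

# A valid joint PMF is nonnegative and sums to 1 over all (x, y) pairs
print(p_xy.sum())
```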
Def. (Marginal Distribution)

The marginal distribution of a random variable can be obtained from a joint distribution by summing or integrating over the other variable(s):

  • Discrete: p_X(x) = \sum_y p_{X,Y}(x,y)
  • Continuous: f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dy

The marginal distribution gives the probabilities for one variable without regard to the values of the other variable(s).
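With a discrete joint PMF stored as a 2-D array (rows indexing x, columns indexing y), the marginals are just row and column sums. A sketch with invented values:

```python
import numpy as np

# Invented joint PMF: rows index x, columns index y
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

p_x = p_xy.sum(axis=1)   # sum over y for each x -> marginal of X
p_y = p_xy.sum(axis=0)   # sum over x for each y -> marginal of Y

print(p_x)   # marginal of X
print(p_y)   # marginal of Y
```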

Def. (Conditional Distribution)

The conditional distribution of Y given X = x is:

  • Discrete: p_{Y|X}(y|x) = \frac{p_{X,Y}(x,y)}{p_X(x)}, provided p_X(x) > 0
  • Continuous: f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}, provided f_X(x) > 0

This gives the distribution of Y when we know X = x.
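In the discrete sketch, conditioning on X = x means dividing row x of the joint PMF by the marginal p_X(x); each resulting row is itself a distribution and sums to 1. Values are invented:

```python
import numpy as np

# Invented joint PMF: rows index x, columns index y
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])
p_x = p_xy.sum(axis=1)   # marginal of X

# p_{Y|X}(y|x) = p_{X,Y}(x, y) / p_X(x), computed row by row
p_y_given_x = p_xy / p_x[:, None]

print(p_y_given_x.sum(axis=1))   # each row sums to 1
```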

Def. (Covariance)

Covariance measures the linear relationship between two random variables X and Y:

\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]

  • Positive covariance: X and Y tend to increase together
  • Negative covariance: when X increases, Y tends to decrease
  • Zero covariance: no linear relationship (but there may be a nonlinear relationship)
Def. (Correlation Coefficient)

The correlation coefficient \rho standardizes covariance to the range [-1, 1]:

\rho_{X,Y} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}

  • \rho = 1: perfect positive linear relationship
  • \rho = -1: perfect negative linear relationship
  • \rho = 0: no linear relationship

Important: Correlation does not imply causation!
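Both quantities can be computed straight from the definitions. A sketch on invented data that is nearly, but not exactly, linear, so \rho lands close to but below 1:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + np.array([0.1, -0.2, 0.0, 0.2, -0.1])   # invented noisy line

# Cov(X, Y) = E[XY] - E[X]E[Y] (population form)
cov = np.mean(x * y) - np.mean(x) * np.mean(y)

# rho = Cov(X, Y) / (sigma_X * sigma_Y)
rho = cov / (x.std() * y.std())

print(rho)   # close to, but less than, 1
```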

Linear Regression

Def. (Simple Linear Regression)

Simple linear regression models the relationship between a response variable Y and a predictor X:

Y = \beta_0 + \beta_1 X + \epsilon

where \epsilon is the error term with E[\epsilon] = 0.

Least Squares Estimates:

\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

Coefficient of Determination: R^2 = 1 - \frac{SS_{res}}{SS_{tot}} measures the proportion of variance explained by the model.
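The least-squares formulas translate directly into code. A sketch on invented data that roughly follows y = 2x:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # invented, roughly y = 2x

# Slope and intercept from the least-squares formulas
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Coefficient of determination: R^2 = 1 - SS_res / SS_tot
y_hat = b0 + b1 * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(b0, b1, r2)
```

Because the data were generated close to a line, R^2 comes out near 1, meaning the fitted line explains almost all of the variance in y.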

Def. (Residual)

A residual is the difference between an observed value and the predicted value from the regression model:

e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)

Residuals measure the error in the model's predictions. Analysis of residuals is used to check regression assumptions.
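Two algebraic facts about least-squares residuals make handy numeric checks: they sum to zero, and they are orthogonal to the predictor. A sketch on invented data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # invented data

# Least-squares fit, then residuals e_i = y_i - (b0 + b1 * x_i)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)

# For a least-squares fit, residuals sum to zero and are
# orthogonal to the predictor (up to floating-point error).
print(e.sum(), np.dot(e, x))
```

In practice these residuals are also what gets plotted against x (or against the fitted values) when checking the regression assumptions below; any visible pattern suggests a violation.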

Assumptions of Linear Regression

For valid inference in linear regression, the following assumptions should hold:

  1. Linearity: The relationship between X and Y is linear. The true relationship can be expressed as Y = \beta_0 + \beta_1 X + \epsilon.

  2. Independence: Observations are independent of each other. The value of one observation does not influence another.

  3. Homoscedasticity (Constant Variance): The variance of the errors \epsilon is constant across all values of X. Formally, \text{Var}(\epsilon \mid X = x) = \sigma^2 for all x.

  4. Normality: The errors \epsilon are normally distributed: \epsilon \sim N(0, \sigma^2). This assumption is particularly important for hypothesis testing and for constructing confidence intervals.

  5. No Perfect Multicollinearity (for multiple regression): Predictor variables should not be perfectly correlated. In simple linear regression with one predictor, this is not a concern.

Note: Violations of these assumptions can lead to biased estimates, incorrect standard errors, and invalid hypothesis tests. Diagnostic plots (residual plots, Q-Q plots) are used to check these assumptions.

Applications

  • Prediction: Estimate future values based on observed relationships
  • Trend Analysis: Identify and quantify relationships between variables
  • Feature Importance: Determine which variables are most predictive