Classification and Regression
Overview
This section covers statistical methods for classification and regression problems.
Binary Classification
Binary classification is the task of categorizing observations into one of two classes based on input features.
Examples in computer science:
- Spam detection (spam/not spam)
- Malware detection (malicious/benign)
- Network intrusion detection (attack/normal)
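As a minimal illustration of the idea, a threshold rule on a single score already defines a binary classifier. The score values and the 0.5 threshold below are made-up assumptions for the sketch:

```python
# Minimal sketch of a binary classifier: a threshold rule on one feature.
# The spam scores and the 0.5 threshold are illustrative assumptions.

def classify(spam_score, threshold=0.5):
    """Label a message 'spam' if its score exceeds the threshold, else 'not spam'."""
    return "spam" if spam_score > threshold else "not spam"

print(classify(0.92))  # prints: spam
print(classify(0.12))  # prints: not spam
```

Real classifiers (logistic regression, decision trees, neural networks) learn the score function and threshold from data, but the final decision step has this same two-class form.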
A confusion matrix is a 2×2 table that summarizes the performance of a binary classifier:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
- Type I Error: False Positive (rejecting true null hypothesis)
- Type II Error: False Negative (failing to reject false null hypothesis)
Common metrics derived from the confusion matrix:
- Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$ (proportion of correct predictions)
- Precision: $\frac{TP}{TP + FP}$ (proportion of positive predictions that are correct)
- Recall (Sensitivity): $\frac{TP}{TP + FN}$ (proportion of actual positives correctly identified)
- Specificity: $\frac{TN}{TN + FP}$ (proportion of actual negatives correctly identified)
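The four metrics can be computed directly from label counts. The following sketch assumes labels encoded as 1 = positive, 0 = negative; the example label lists are made up:

```python
# Sketch: confusion-matrix metrics from paired actual/predicted binary labels.
# Encoding assumption: 1 = positive class, 0 = negative class.

def confusion_metrics(actual, predicted):
    """Return accuracy, precision, recall, and specificity for binary labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Illustrative labels: TP=3, FN=1, FP=2, TN=2
actual    = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 1, 1]
print(confusion_metrics(actual, predicted))
# accuracy 0.625, precision 0.6, recall 0.75, specificity 0.5
```

Note that the metrics can disagree: here recall is higher than specificity because the classifier over-predicts the positive class.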
Joint Distributions and Relationships
For random variables $X$ and $Y$, the joint distribution specifies the probability of all possible combinations of values:
- Discrete: Joint PMF $p_{X,Y}(x, y) = P(X = x, Y = y)$
- Continuous: Joint PDF $f_{X,Y}(x, y)$, where $P((X, Y) \in A) = \iint_A f_{X,Y}(x, y)\,dx\,dy$
The marginal distribution of a random variable can be obtained from a joint distribution by summing or integrating over the other variable(s):
- Discrete: $p_X(x) = \sum_{y} p_{X,Y}(x, y)$
- Continuous: $f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy$
The marginal distribution gives the probabilities for one variable without regard to the values of the other variable(s).
The conditional distribution of $Y$ given $X = x$ is:
- Discrete: $p_{Y \mid X}(y \mid x) = \dfrac{p_{X,Y}(x, y)}{p_X(x)}$, provided $p_X(x) > 0$
- Continuous: $f_{Y \mid X}(y \mid x) = \dfrac{f_{X,Y}(x, y)}{f_X(x)}$, provided $f_X(x) > 0$
This gives the distribution of $Y$ when we know $X = x$.
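For a discrete joint PMF, the marginal and conditional distributions fall out of the sum and ratio directly. The 2×3 joint table below is a made-up example:

```python
# Sketch: marginal and conditional distributions from a discrete joint PMF.
# The joint table p(x, y) for X in {0, 1}, Y in {0, 1, 2} is illustrative.

joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.15, (1, 1): 0.25, (1, 2): 0.20,
}

# Marginal of X: sum the joint PMF over all values of y.
p_x = {x: sum(p for (xi, y), p in joint.items() if xi == x) for x in (0, 1)}

# Conditional of Y given X = 1: joint divided by marginal (valid since p_X(1) > 0).
p_y_given_1 = {y: joint[(1, y)] / p_x[1] for y in (0, 1, 2)}

print(p_x)          # marginal: p_X(0) = 0.4, p_X(1) = 0.6
print(p_y_given_1)  # conditional: 0.25, ~0.417, ~0.333 (sums to 1)
```

Note that each conditional distribution sums to 1, while the marginal weights how much total probability each slice of the joint table carries.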
Covariance measures the linear relationship between two random variables $X$ and $Y$: $\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y]$
- Positive covariance: $X$ and $Y$ tend to increase together
- Negative covariance: when $X$ increases, $Y$ tends to decrease
- Zero covariance: no linear relationship (but $X$ and $Y$ may have a nonlinear relationship)
The correlation coefficient standardizes covariance to the range $[-1, 1]$: $\rho_{X,Y} = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$
- $\rho = 1$: Perfect positive linear relationship
- $\rho = -1$: Perfect negative linear relationship
- $\rho = 0$: No linear relationship
Important: Correlation does not imply causation!
Linear Regression
Simple linear regression models the relationship between a response variable $y$ and a predictor $x$:
$y = \beta_0 + \beta_1 x + \varepsilon$
where $\varepsilon$ is the error term with $E[\varepsilon] = 0$.
Least Squares Estimates:
$\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
Coefficient of Determination: $R^2 = 1 - \dfrac{SS_{\text{res}}}{SS_{\text{tot}}}$ measures the proportion of variance in $y$ explained by the model.
A residual is the difference between an observed value and the predicted value from the regression model: $e_i = y_i - \hat{y}_i$
Residuals measure the error in the model's predictions. Analysis of residuals is used to check regression assumptions.
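The least-squares formulas translate directly into code. The data points in this sketch are illustrative:

```python
# Sketch: simple linear regression via the closed-form least-squares estimates.
# The (x, y) data are made-up and roughly follow y = 2x.

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Slope: sum of deviation products over sum of squared x-deviations.
beta1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
# Intercept: forces the fitted line through the point of means (xbar, ybar).
beta0 = my - beta1 * mx

# Residuals e_i = y_i - yhat_i, and R^2 = 1 - SS_res / SS_tot.
yhat = [beta0 + beta1 * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, yhat)]
ss_res = sum(e ** 2 for e in residuals)
ss_tot = sum((yi - my) ** 2 for yi in y)
r2 = 1 - ss_res / ss_tot

print(beta0, beta1, r2)  # slope near 2, intercept near 0, R^2 near 1
```

The residuals computed here are exactly what the diagnostic checks in the next subsection examine: their spread, independence, and approximate normality.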
Assumptions of Linear Regression
For valid inference in linear regression, the following assumptions should hold:
- Linearity: The relationship between $x$ and $y$ is linear. The true relationship can be expressed as $y = \beta_0 + \beta_1 x + \varepsilon$.
- Independence: Observations are independent of each other. The value of one observation does not influence another.
- Homoscedasticity (Constant Variance): The variance of the errors is constant across all values of $x$. Formally, $\mathrm{Var}(\varepsilon_i) = \sigma^2$ for all $i$.
- Normality: The errors are normally distributed: $\varepsilon_i \sim N(0, \sigma^2)$. This assumption is particularly important for hypothesis testing and constructing confidence intervals.
- No Perfect Multicollinearity (for multiple regression): Predictor variables should not be perfectly correlated. In simple linear regression with one predictor, this is not a concern.
Note: Violations of these assumptions can lead to biased estimates, incorrect standard errors, and invalid hypothesis tests. Diagnostic plots (residual plots, Q-Q plots) are used to check these assumptions.
Applications
- Prediction: Estimate future values based on observed relationships
- Trend Analysis: Identify and quantify relationships between variables
- Feature Importance: Determine which variables are most predictive