TL;DR

Summary: This page lists strategies and techniques that can be used to add substance and depth to machine learning projects.

Audience: Hopefully useful for students in my machine learning and AI classes.

Introduction

You’re nearing the end of your first machine learning project. You collected some data, cleaned it, trained a model, and evaluated it. Like any good machine learning practitioner, you test your model on the held-out data, and you are relieved to see that it achieves an F1 score of 0.88.

“Nice, that’s a B+… I’m done!” you think to yourself. Now all you have to do is write that 10-page report, turn it in, and you’re off the hook. Easy.

That is, until you sit down to start writing. After recording your feat of generalization, the page — still nearly empty — stares back at you. “Our model achieves 0.88 F1 score” is only… six words! What in the world are you going to talk about for 9.95 more pages!?

There is an answer to this question, and it’s not “write a long introduction” or “rehash an even longer version of the introduction as the conclusion”, or “use a bigger font”, or “double space 2 inch margins”. The answer is that you are not actually at the end of your project. Having trained your first model is the end of one phase in a machine learning project and the beginning of another. There are more things to do with a model than print a confusion matrix and sail off into the sunset. The savvy machine learning student knows this, and plans in advance to do some of these things.

Without further ado, here is a list of ways you can add substance to your machine learning project.

Ways to Add Substance to Machine Learning Projects

Compare your model to a baseline

Evaluating your model’s performance against a baseline model helps determine its effectiveness and provides context for its success or failure.

A baseline model is a simple model against which the performance of a new model is compared.

Example

When developing a neural network for image classification, compare its accuracy against simpler baselines like logistic regression or a decision tree. If your complex neural network achieves 85% accuracy while logistic regression achieves 82%, the marginal improvement may not justify the added complexity.
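Even simpler than logistic regression is a majority-class baseline: always predict the most common label. As a minimal sketch (the labels and predictions below are hypothetical stand-ins for your own test set), you can compute it directly from label counts:

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most common class --
    the simplest baseline any model should beat."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical test labels and model predictions.
y_true = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 1, 0, 0, 0, 0]

baseline = majority_baseline_accuracy(y_true)  # 0.7
model = accuracy(y_true, y_pred)               # 0.8
print(f"baseline={baseline:.2f}, model={model:.2f}")
```

If your model barely beats this number, it may not have learned much beyond the class distribution.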

Perform an ablation study

An ablation study systematically removes components of your model to understand their contributions to overall performance, allowing you to identify which features or parts enhance or hinder efficacy.

Example

For a CNN with multiple convolutional layers, dropout, and batch normalization, you might:

  1. Remove batch normalization and measure performance change
  2. Remove dropout layers and observe impact
  3. Reduce the number of convolutional layers and measure performance change

This helps you understand which components are crucial and which are unnecessary.
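The bookkeeping for an ablation study can be sketched as a loop over configurations. Here `evaluate` is a hypothetical stand-in for your full training-and-scoring pipeline, and the accuracy penalties are invented purely for illustration:

```python
def evaluate(config):
    """Hypothetical stand-in for training and scoring a model under a
    given configuration; replace with your own pipeline. The penalty
    numbers are made up for illustration."""
    base = 0.90
    penalties = {"batch_norm": 0.03, "dropout": 0.02}
    return base - sum(penalties[c] for c in config["removed"])

full = {"removed": []}
ablations = [
    {"removed": ["batch_norm"]},
    {"removed": ["dropout"]},
    {"removed": ["batch_norm", "dropout"]},
]

full_score = evaluate(full)
for config in ablations:
    delta = full_score - evaluate(config)
    print(f"removing {config['removed']}: accuracy drops by {delta:.2f}")
```

The resulting table of deltas is exactly the kind of content that fills a report section.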

Perform a sensitivity analysis

A sensitivity analysis examines how variations in hyperparameters affect model performance, allowing you to understand the robustness of your model.

Here are two kinds of sensitivity analysis that you might try:

  • sensitivity to hyperparameter choice: Analyzing the impact of different hyperparameter settings can reveal how sensitive your model is to these adjustments and guide optimal tuning.

Example

Testing learning rates [0.001, 0.01, 0.1] shows your model only works well with 0.01, indicating high sensitivity to this parameter.

  • sensitivity to random seed selection: Understanding the effects of different random seeds helps assess the stability and reliability of your model’s results.

Example

Running your model with 5 different random seeds shows accuracy varies between 82-84%, suggesting stable performance.
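Both kinds of sweep share the same skeleton. In this sketch, `train_and_score` is a hypothetical stand-in for a full training run, and its accuracy numbers are invented to illustrate a model that only works well at one learning rate:

```python
import random
import statistics

def train_and_score(lr, seed):
    """Hypothetical stand-in for a full training run; replace with your
    own pipeline. Scores peak near lr=0.01, plus small seed noise."""
    rng = random.Random(seed)
    base = {0.001: 0.71, 0.01: 0.88, 0.1: 0.65}[lr]
    return base + rng.uniform(-0.01, 0.01)

# Sensitivity to learning rate (fixed seed).
for lr in [0.001, 0.01, 0.1]:
    print(f"lr={lr}: accuracy={train_and_score(lr, seed=0):.3f}")

# Sensitivity to random seed (fixed learning rate).
scores = [train_and_score(0.01, seed=s) for s in range(5)]
print(f"mean={statistics.mean(scores):.3f}, stdev={statistics.stdev(scores):.3f}")
```

Reporting the mean and standard deviation across seeds, rather than a single run, is an easy way to make your results more trustworthy.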

Interpret your model and its predictions

Model interpretability attempts to make model predictions understandable to humans, which is critical for trust, debugging, and regulatory compliance.

Example

Using LIME or SHAP to explain why a model classified an X-ray as showing pneumonia by highlighting the relevant areas of the image that influenced the decision.
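LIME and SHAP require their own libraries, but the core idea behind such model-agnostic explanations can be sketched with permutation importance: shuffle one feature's values and measure how much accuracy drops. The toy model and data below are hypothetical:

```python
import random

def model_predict(x):
    """Toy 'model' that predicts 1 when feature 0 exceeds a threshold;
    stand-in for any trained classifier."""
    return 1 if x[0] > 0.5 else 0

def permutation_importance(X, y, feature_idx, seed=0):
    """Accuracy drop when one feature's values are shuffled --
    a simple, model-agnostic importance measure."""
    rng = random.Random(seed)

    def acc(rows):
        return sum(model_predict(x) == t for x, t in zip(rows, y)) / len(y)

    baseline = acc(X)
    col = [x[feature_idx] for x in X]
    rng.shuffle(col)
    X_perm = [list(x) for x in X]
    for row, value in zip(X_perm, col):
        row[feature_idx] = value
    return baseline - acc(X_perm)

# Hypothetical data: feature 0 is informative, feature 1 is noise.
X = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9], [0.8, 0.5], [0.3, 0.2]]
y = [model_predict(x) for x in X]

print("importance of feature 0:", permutation_importance(X, y, 0))
print("importance of feature 1:", permutation_importance(X, y, 1))  # 0.0
```

Since the toy model ignores feature 1 entirely, shuffling it changes nothing, and its importance comes out as exactly zero.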

Learn more: Christoph Molnar, Interpretable Machine Learning.

Profile your model’s non-functional characteristics

Assessing non-functional aspects such as runtime performance, memory usage, and scalability is important to ensure the model’s practicality in real-world scenarios.

Example

Profiling shows your model takes 2GB RAM and 100ms per prediction, making it suitable for cloud deployment but not mobile devices.
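Python's standard library is enough for a first-pass profile. This sketch times a hypothetical `predict` function (replace it with your model's inference call) and tracks peak memory with `tracemalloc`:

```python
import time
import tracemalloc

def predict(batch):
    """Hypothetical prediction function; replace with your model's
    inference call."""
    return [sum(x) for x in batch]

batch = [[0.1] * 1000 for _ in range(100)]

tracemalloc.start()
start = time.perf_counter()
predict(batch)
elapsed_ms = (time.perf_counter() - start) * 1000
_, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"latency: {elapsed_ms:.2f} ms, peak memory: {peak_bytes / 1024:.1f} KiB")
```

For production-grade numbers you would average over many batches and warm-up runs, but even this rough profile tells you whether a deployment target is plausible.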

Compare multiple model classes

Evaluating various algorithms on the same task aids in identifying which one performs best under specific conditions, ensuring the most effective approach is selected.

Example

For a text classification task, comparing:

  • BERT (95% accuracy, slow)
  • Random Forest (92% accuracy, fast)
  • SVM (90% accuracy, medium speed)

helps you choose the best trade-off between accuracy and speed.
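Once you have measured each model class, the selection step can be made explicit. The accuracy and latency numbers below are invented to mirror the comparison above; `best_under_budget` picks the most accurate model that fits a latency budget:

```python
# Hypothetical (accuracy, latency_ms) measurements for each model class.
results = {
    "BERT": (0.95, 120.0),
    "RandomForest": (0.92, 5.0),
    "SVM": (0.90, 30.0),
}

def best_under_budget(results, max_latency_ms):
    """Most accurate model whose latency fits within the budget,
    or None if nothing qualifies."""
    ok = {m: (a, t) for m, (a, t) in results.items() if t <= max_latency_ms}
    return max(ok, key=lambda m: ok[m][0]) if ok else None

print(best_under_budget(results, 10.0))   # RandomForest
print(best_under_budget(results, 200.0))  # BERT
```

Making the constraint explicit (here, latency) turns a vague "which model is best?" into a concrete, defensible decision.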

Conduct an error analysis

An error analysis is a detailed examination of the errors your model makes. It can uncover patterns or specific areas of weakness, guiding improvements or suggesting settings where your model should not be applied.

Example

Analyzing misclassifications in a dog breed classifier reveals it frequently confuses Huskies with Malamutes, suggesting need for more training data of these breeds or additional features focusing on their distinguishing characteristics.
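A first step in any error analysis is tallying which (true, predicted) pairs account for the most mistakes. The breed labels below are hypothetical:

```python
from collections import Counter

def most_confused_pair(y_true, y_pred):
    """Return the ((true, predicted), count) pair with the most errors,
    or None if there are no errors."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(1)[0] if errors else None

# Hypothetical breed labels and predictions.
y_true = ["husky", "husky", "malamute", "husky", "poodle", "husky"]
y_pred = ["malamute", "husky", "husky", "malamute", "poodle", "malamute"]

pair, count = most_confused_pair(y_true, y_pred)
print(f"most confused: {pair[0]} -> {pair[1]} ({count} errors)")
```

Inspecting the actual examples behind the top confusion (here, huskies predicted as malamutes) is where the real insight comes from.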

Assess the statistical significance of your findings

Statistical methods like bootstrapping let you estimate confidence intervals and assess the significance of your findings, helping ensure that results are not due to random chance.

Example

Bootstrapping 1000 samples of test results shows your model’s 85% accuracy has a 95% confidence interval of [83%, 87%], significantly better than the baseline’s 80% [78%, 82%].
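A percentile bootstrap needs only the per-example correctness indicators from your test set. This sketch uses hypothetical results (85 correct out of 100):

```python
import random

def bootstrap_ci(correct, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy, given
    per-example 0/1 correctness indicators."""
    rng = random.Random(seed)
    n = len(correct)
    stats = sorted(sum(rng.choices(correct, k=n)) / n
                   for _ in range(n_resamples))
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-example test results: 85 correct out of 100.
correct = [1] * 85 + [0] * 15
lo, hi = bootstrap_ci(correct)
print(f"accuracy 0.85, 95% CI [{lo:.2f}, {hi:.2f}]")
```

If two models' confidence intervals do not overlap, you have much stronger evidence than a bare difference in point estimates.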

Test generalization in different settings

Testing your model on multiple datasets can help determine its generalization capabilities, ensuring it performs well across various data distributions.

Example

An image classification model trained on ImageNet may be tested on new pictures taken with your own camera. Subtle differences in lighting or background can sometimes cause the model to fail!
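A toy sketch of distribution shift, with all data invented: the model's learned threshold (0.5) matches the labeling rule of the first dataset, but the second dataset's labels follow a boundary at 0.3 instead, so the same model scores much worse there:

```python
def accuracy(model, dataset):
    return sum(model(x) == y for x, y in dataset) / len(dataset)

# Toy threshold "model" standing in for a trained classifier.
model = lambda x: 1 if x > 0.5 else 0

# In-distribution data matches the boundary the model learned...
in_dist = [(0.9, 1), (0.2, 0), (0.7, 1), (0.1, 0)]
# ...while the shifted data's true boundary sits at 0.3 instead of 0.5.
shifted = [(0.4, 1), (0.2, 0), (0.35, 1), (0.1, 0)]

acc_in = accuracy(model, in_dist)    # 1.0
acc_shift = accuracy(model, shifted)  # 0.5
print(f"in-distribution: {acc_in:.2f}, shifted: {acc_shift:.2f}")
```

Reporting this gap, and hypothesizing about what caused it, makes for a far more interesting conclusion than a single held-out score.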