One might view a model as an equation:

$$y = f(x; \theta, \alpha)$$
Here $x$ is a state in the world, $\theta$ is a vector of real numbers that we call the "model parameters", and $\alpha$ is another vector of real numbers that we call the "model hyperparameters". The parameters and hyperparameters are treated separately because machine learning practitioners are responsible for setting the values of $\alpha$ by hand, and usually let another algorithm, called an "optimizer", set the values of $\theta$.
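As a minimal sketch of this split (the ridge penalty and all names below are illustrative assumptions, not from the text): $\theta$ enters the prediction itself, while $\alpha$ shapes the objective that the optimizer will see.

```python
import numpy as np

def f(x, theta):
    """The model: maps a state of the world x to a prediction, given parameters theta."""
    return x @ theta

def loss(theta, X, y, alpha):
    """Training objective. alpha (here, a ridge regularization strength) is a
    hyperparameter: it shapes the objective but is set by hand, not by the optimizer."""
    residuals = f(X, theta) - y
    return np.mean(residuals ** 2) + alpha * np.sum(theta ** 2)
```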
That is, there is a function like

$$\theta = \mathrm{optimize}(\alpha, \mathrm{data})$$
Sometimes even setting the values of $\alpha$ is too tedious, and we delegate this responsibility like so:

$$\alpha^* = \arg\max_\alpha g(\alpha)$$
where $g$ is some function that can tell us whether a particular choice of $\alpha$ is good or not.
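Continuing the illustrative ridge sketch from above (gradient descent and grid search are my stand-ins; the post does not commit to any particular optimizer or search procedure), `optimize` plays the role of the optimizer that sets $\theta$, and `g` scores a choice of $\alpha$ by held-out error. Note that this `g` returns an error, so lower is better and the delegation is an argmin rather than the argmax above.

```python
import numpy as np

def optimize(alpha, X, y, steps=500, lr=0.01):
    """The 'optimizer': sets theta for a fixed alpha, via plain gradient
    descent on the ridge objective from the sketch above."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ theta - y) / len(y) + 2 * alpha * theta
        theta -= lr * grad
    return theta

def g(alpha, X_train, y_train, X_val, y_val):
    """Scores a choice of alpha: train with it, then measure held-out error."""
    theta = optimize(alpha, X_train, y_train)
    return np.mean((X_val @ theta - y_val) ** 2)

# Delegating the choice of alpha: a crude grid search standing in for the argmin.
# best_alpha = min([0.0, 0.01, 0.1, 1.0], key=lambda a: g(a, X_tr, y_tr, X_va, y_va))
```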
On the term "model":
In machine learning there are a few different ways that the word "model" is used. I list three of them here, from least to most concrete:
- **Model Class:** Roughly, the learning algorithm. E.g. "BERT" or "Decision Tree" or "Neural Network"
- **Model Configuration:** The learning algorithm along with fixed values for the hyperparameters. E.g. "BERT with X learning rate schedule and Y optimizer", or "Decision Tree with max_depth=12"
- **Model Instance:** An instance of a model configuration with fixed values for the parameters.
We can also view the above list as starting with a large set of possible concrete models (the model class), taking a subset by fixing the hyperparameter values (a model configuration), and then selecting a single element from that subset (a model instance) by applying a training procedure to the model configuration to set the model parameters.
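To make the three senses concrete in code (a sketch assuming scikit-learn, mirroring the decision-tree example in the list above):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Model class: the learning algorithm itself ("Decision Tree").
model_class = DecisionTreeClassifier

# Model configuration: the class plus fixed hyperparameter values; parameters not yet set.
model_config = DecisionTreeClassifier(max_depth=12)

# Model instance: the configuration after training has fixed the parameters
# (here, the learned split thresholds stored inside the fitted tree).
model_instance = model_config.fit(X, y)
```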
On "generalization":
"Generalization" is something that people can ascribe to each of the three kinds of "models":
- Generalization of a model instance is the most concrete of the three. It is usually evaluated by testing the model instance on unseen data (see the sketch after this list).
- When people talk about generalization of a model configuration, this usually means something like "will it generalize after having been trained?" A model configuration may leave some aspects of the training procedure unspecified; for example, "BERT with X learning rate schedule and Y optimizer" says nothing about what data this configuration will be trained on. Trained on some datasets, this configuration may generalize well; on others, it may generalize poorly. So statements about generalization for model configurations are usually statements about the range of behaviors that instances of the configuration might exhibit, or about typical or expected behavior.
- Generalization of a model class is even more abstract: most or all of the hyperparameters have not been specified, and generalization depends heavily on these unspecified hyperparameters. To view a statement like "deep learning models generalize" or "deep learning models do not generalize" as an empirical statement, we either need to make lots of assumptions to specify a single model instance that can be tested for generalization, or we need to consider the range of generalization behavior over all possible deep learning models on all possible deep learning problems.
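For the first bullet, here is the minimal sketch promised above (assuming scikit-learn and a synthetic dataset): generalization of a model instance is read off by comparing performance on the training data with performance on data the optimizer never saw.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Generalization of a *model instance*: accuracy on data that played no role in training.
instance = DecisionTreeClassifier(max_depth=12).fit(X_train, y_train)
print("train accuracy:", instance.score(X_train, y_train))
print("test accuracy: ", instance.score(X_test, y_test))
```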
My main point is that when we consider the generalization of a model class or model configuration, we're usually talking about expected or typical behavior, since we don't have a concrete model instance.
For a model class, this gets tricky. Are we talking about typical behavior for all possible configurations? Typical behavior for the few best configurations? Or typical behavior for configurations that are standard among practitioners?
To be more concrete: some configurations of BERT simply will not train successfully, and practitioners know and accept this fact. For example, I would be very surprised if a BERT fine-tune via the Adam optimizer succeeded with the learning rate set to 100.0. Such a run would likely never even satisfy the usual stopping criteria that decide when training is done.
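This failure mode is easy to see in miniature (a toy sketch, not a BERT experiment: one parameter, a quadratic loss, vanilla gradient descent). With a reasonable learning rate the parameter converges; with a learning rate of 100.0 it diverges immediately, so no sensible stopping criterion is ever met.

```python
def run(lr, steps=10):
    """Gradient descent on loss(theta) = theta**2, whose gradient is 2 * theta."""
    theta = 1.0
    for _ in range(steps):
        theta -= lr * 2 * theta
    return theta

print(run(lr=0.1))    # shrinks toward 0: each step multiplies theta by 0.8
print(run(lr=100.0))  # blows up: each step multiplies theta by -199
```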
If it doesn't train, it's not going to generalize.
Is this a strike against the model class? Or against the practitioner for picking a bad configuration?
See also:
- The section titled "Three Senses of 'Model' in Deep Learning" in Florian J. Boge's paper "Two Dimensions of Opacity and the Deep Learning Predicament"