One might view a model as an equation:

$$y = f(x; \theta, \alpha)$$
Here $x$ is a state in the world, $\theta$ is a vector of real numbers that we call the "model parameters", and $\alpha$ is another vector of real numbers that we call the "model hyperparameters". The parameters and hyperparameters are treated separately because machine learning practitioners are responsible for setting the values of $\alpha$ by hand, and usually let another algorithm, called an "optimizer", set the values of $\theta$.
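As a minimal sketch of this split (the ridge penalty and all names below are illustrative assumptions, not from the text): $\theta$ enters the prediction itself, while $\alpha$ shapes the objective that the optimizer will see.

```python
import numpy as np

def f(x, theta):
    """The model: maps a state of the world x to a prediction, given parameters theta."""
    return x @ theta

def loss(theta, X, y, alpha):
    """Training objective. alpha (here, a ridge regularization strength) is a
    hyperparameter: it shapes the objective but is set by hand, not by the optimizer."""
    residuals = f(X, theta) - y
    return np.mean(residuals ** 2) + alpha * np.sum(theta ** 2)
```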
That is, there is a function like

$$\theta = \mathrm{optimize}(\alpha, \mathrm{data})$$
Sometimes even setting the values of $\alpha$ is too tedious, and we delegate this responsibility like so:

$$\alpha^* = \arg\max_\alpha g(\alpha)$$
where $g$ is some function that can tell us whether a particular choice of $\alpha$ is good or not.
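Continuing the illustrative ridge sketch from above (gradient descent and grid search are my stand-ins; the post does not commit to any particular optimizer or search procedure), `optimize` plays the role of the optimizer that sets $\theta$, and `g` scores a choice of $\alpha$ by held-out error. Note that this `g` returns an error, so lower is better and the delegation is an argmin rather than the argmax above.

```python
import numpy as np

def optimize(alpha, X, y, steps=500, lr=0.01):
    """The 'optimizer': sets theta for a fixed alpha, via plain gradient
    descent on the ridge objective from the sketch above."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ theta - y) / len(y) + 2 * alpha * theta
        theta -= lr * grad
    return theta

def g(alpha, X_train, y_train, X_val, y_val):
    """Scores a choice of alpha: train with it, then measure held-out error."""
    theta = optimize(alpha, X_train, y_train)
    return np.mean((X_val @ theta - y_val) ** 2)

# Delegating the choice of alpha: a crude grid search standing in for the argmin.
# best_alpha = min([0.0, 0.01, 0.1, 1.0], key=lambda a: g(a, X_tr, y_tr, X_va, y_va))
```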
On the term "model":
In machine learning there are a few different ways that the word "model" is used. I list three of them here, from least to most concrete:
- **Model Class:** Roughly, the learning algorithm. E.g. "BERT" or "Decision Tree" or "Neural Network"
- **Model Configuration:** The learning algorithm along with fixed values for the hyperparameters. E.g. "BERT with X learning rate schedule and Y optimizer", or "Decision Tree with max_depth=12"
- **Model Instance:** An instance of a model configuration with fixed values for the parameters.
We can also view the above list as starting with a large set of possible concrete models (the model class), taking a subset by fixing the hyperparameter values (a model configuration), and then selecting a single element from that subset (a model instance) by applying a training procedure to the model configuration to set the model parameters.
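To make the three senses concrete in code (a sketch assuming scikit-learn, mirroring the decision-tree example in the list above):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Model class: the learning algorithm itself ("Decision Tree").
model_class = DecisionTreeClassifier

# Model configuration: the class plus fixed hyperparameter values; parameters not yet set.
model_config = DecisionTreeClassifier(max_depth=12)

# Model instance: the configuration after training has fixed the parameters
# (here, the learned split thresholds stored inside the fitted tree).
model_instance = model_config.fit(X, y)
```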
On "generalization":
"Generalization" is something that people can ascribe to each of the three kinds of "models":
- Generalization of a model instance is the most concrete of the three. It is usually evaluated by testing the model instance on unseen data (see the sketch after this list).
- When people talk about generalization of a model configuration, this usually means something like "will it generalize after having been trained?" A model configuration may leave some aspects of the training procedure unspecified; for example, "BERT with X learning rate schedule and Y optimizer" says nothing about what data this configuration will be trained on. Trained on some datasets, this configuration may generalize well; on others, it may generalize poorly. So statements about generalization for model configurations are usually statements about the range of behaviors that instances of the configuration might exhibit, or about typical or expected behavior.
- Generalization of a model class is even more abstract: most or all of the hyperparameters have not been specified, and generalization depends heavily on these unspecified hyperparameters. To view a statement like "deep learning models generalize" or "deep learning models do not generalize" as an empirical statement, we either need to make lots of assumptions to specify a single model instance that can be tested for generalization, or we need to consider the range of generalization behavior over all possible deep learning models on all possible deep learning problems.
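For the first bullet, here is the minimal sketch promised above (assuming scikit-learn and a synthetic dataset): generalization of a model instance is read off by comparing performance on the training data with performance on data the optimizer never saw.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Generalization of a *model instance*: accuracy on data that played no role in training.
instance = DecisionTreeClassifier(max_depth=12).fit(X_train, y_train)
print("train accuracy:", instance.score(X_train, y_train))
print("test accuracy: ", instance.score(X_test, y_test))
```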
My main point is that when we consider the generalization of a model class or model configuration, we're usually talking about expected or typical behavior, since we don't have a concrete model instance.
For a model class, this gets tricky. Are we talking about typical behavior for all possible configurations? Typical behavior for the few best configurations? Or typical behavior for configurations that are standard among practitioners?
To be more concrete: some configurations of BERT simply will not train successfully, and practitioners know and accept this fact. For example, I would be very surprised if a BERT fine-tune via the Adam optimizer succeeded with the learning rate set to 100.0. Such a run would likely never even satisfy the usual stopping criteria that decide when training is done.
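This failure mode is easy to see in miniature (a toy sketch, not a BERT experiment: one parameter, a quadratic loss, vanilla gradient descent). With a reasonable learning rate the parameter converges; with a learning rate of 100.0 it diverges immediately, so no sensible stopping criterion is ever met.

```python
def run(lr, steps=10):
    """Gradient descent on loss(theta) = theta**2, whose gradient is 2 * theta."""
    theta = 1.0
    for _ in range(steps):
        theta -= lr * 2 * theta
    return theta

print(run(lr=0.1))    # shrinks toward 0: each step multiplies theta by 0.8
print(run(lr=100.0))  # blows up: each step multiplies theta by -199
```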
If it doesn't train, it's not going to generalize.
Is this a strike against the model class? Or against the practitioner for picking a bad configuration?
See also:
- The section titled "Three Senses of 'Model' in Deep Learning" in Florian J. Boge's paper "Two Dimensions of Opacity and the Deep Learning Predicament"