Some notes on this course offered by Kaggle, for my poor memory.
Cross-validation
Cross-validation gives a more accurate measure of model quality, which is especially important if you are making a lot of modeling decisions.
Use pipelines when doing cross-validation; you will save a lot of time.
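A minimal sketch of the pipeline + cross-validation combo, assuming scikit-learn and made-up synthetic data (the feature matrix X and target y here are illustrative, not from the course):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)
X[::25, 0] = np.nan  # punch a few holes so the imputer has work to do

# Bundling preprocessing and the model means each CV fold re-fits the
# imputer on its own training split -- no manual bookkeeping needed.
pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer()),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# scikit-learn reports the negated MAE, so flip the sign back.
scores = -1 * cross_val_score(pipeline, X, y, cv=5,
                              scoring="neg_mean_absolute_error")
print("MAE per fold:", scores)
print("Average MAE:", scores.mean())
```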
XGBoost = Gradient boosting
We refer to the random forest method as an “ensemble method”. By definition, ensemble methods combine the predictions of several models (e.g., several trees, in the case of random forests).
Gradient boosting is also an ensemble method; it goes through cycles that iteratively add models to the ensemble:
- Build a naive model: its initial predictions (which may be quite inaccurate) serve as the starting point for the cycle.
- Make predictions: we use the current ensemble to generate a prediction for each observation in the dataset, summing the predictions of all models added so far.
- Calculate loss: we plug those predictions into a loss function (like mean squared error (MSE)).
- Train a new model: we use the loss to fit a new model that will be added to the ensemble. Specifically, we determine the model parameters so that adding this new model to the ensemble reduces the loss. (Side note: the “gradient” in “gradient boosting” refers to the fact that we use gradient descent on the loss function to determine the parameters of this new model.)
- Add the new model to the ensemble, and repeat again and again; see the sketch after this list.
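A from-scratch sketch of that cycle, assuming squared-error loss (whose negative gradient is simply the residual y minus the prediction) and shallow decision trees as the base models. This is illustrative only, not XGBoost itself:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)

learning_rate = 0.1
n_rounds = 100

# Step 1: naive model -- a constant prediction (the mean of the target).
prediction = np.full(len(y), y.mean())
ensemble = []

for _ in range(n_rounds):
    # Steps 2-3: make predictions and track the loss via its gradient;
    # for squared error, the negative gradient is just the residual.
    residuals = y - prediction
    # Step 4: fit a new model to the residuals, i.e. to what the
    # current ensemble still gets wrong.
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residuals)
    # Step 5: add it to the ensemble; the learning rate shrinks each
    # model's contribution so the ensemble improves gradually.
    prediction += learning_rate * tree.predict(X)
    ensemble.append(tree)

print("Training MSE:", np.mean((y - prediction) ** 2))
```

XGBoost implements this same cycle with regularization and heavily optimized tree building, which is why the heading above equates the two.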
Data leakage
Data leakage happens when your training data contains information about the target, but similar data will not be available when the model is used for prediction. This leads you to obtain high performance on the training set (and possibly even the validation data), but low performance in production (and a lot of frustration).
Types of data leakage:
- Target leakage: occurs when your predictors include data that will not be available at the time you make predictions.
- Train-test leakage (also called train-test contamination): occurs when you aren’t careful to distinguish training data from validation data.
How to prevent each type:
- Target leakage: Any variable updated (or created) after the target value is realized should be excluded.
- Train-test leakage: take care to separate the training and validation data properly, and keep the validation data out of any fitting, including the fitting of preprocessing steps.
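A minimal sketch of avoiding train-test contamination, again assuming scikit-learn and synthetic data: split first, then let a pipeline fit all preprocessing on the training rows only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)

# Splitting BEFORE any fitting means no statistic computed from the
# validation rows (e.g. the scaler's mean) can leak into training.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

pipeline = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])
pipeline.fit(X_train, y_train)  # scaler is fitted on X_train only
print("Validation R^2:", pipeline.score(X_valid, y_valid))
```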