Machine learning: sources of error

Before we start

What is an error?

Prediction error for one observation = Target - Prediction

Averaged over many training sets, the expected squared error = Bias² + Variance + Noise
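To make the decomposition concrete, here is a small simulation. It is only a sketch under stated assumptions: the "true" function `true_f`, the noise level, and the deliberately too-simple model (always predict the training mean) are all made up for illustration. It estimates the squared bias, the variance and the noise at one point and compares their sum with the measured mean squared error.

```python
import random
import statistics


def true_f(x):
    # Hypothetical "true" relationship, chosen only for illustration.
    return x * x


def decompose(n_trials=5000, n_train=20, noise_sd=0.5, x0=0.8, seed=0):
    """Estimate bias^2, variance, noise and the mean squared error at x0."""
    rng = random.Random(seed)
    preds, sq_errors = [], []
    for _ in range(n_trials):
        # Draw a fresh noisy training set and fit the constant (mean-only) model.
        ys = [true_f(rng.random()) + rng.gauss(0, noise_sd) for _ in range(n_train)]
        pred = statistics.fmean(ys)
        # Draw a fresh noisy target at x0 and record the squared error.
        target = true_f(x0) + rng.gauss(0, noise_sd)
        preds.append(pred)
        sq_errors.append((target - pred) ** 2)
    bias_sq = (statistics.fmean(preds) - true_f(x0)) ** 2
    variance = statistics.pvariance(preds)
    noise = noise_sd ** 2
    measured = statistics.fmean(sq_errors)
    return bias_sq, variance, noise, measured


bias_sq, variance, noise, measured = decompose()
print(f"bias^2={bias_sq:.3f} variance={variance:.3f} noise={noise:.3f}")
print(f"sum={bias_sq + variance + noise:.3f} measured MSE={measured:.3f}")
```

The two numbers on the last line should be close but not identical, because both are estimated from a finite number of trials.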

The main sources of error are:

  • Bias and variance.
  • Underfitting or overfitting.
  • Underclustering or overclustering.
  • Improper validation (after training). The error may come from a poorly chosen validation set. Keep the training and validation processes completely separate to minimize this error, and document your assumptions in detail.
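As a minimal sketch of keeping training and validation data apart, the snippet below shuffles the rows once and cuts them into two disjoint lists. The function name `holdout_split`, the 25% fraction and the row identifiers are assumptions made for this example.

```python
import random


def holdout_split(rows, validation_fraction=0.25, seed=42):
    """Shuffle once, then cut into disjoint training and validation lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


rows = list(range(20))  # hypothetical row identifiers
train, validation = holdout_split(rows)

# No row may appear on both sides; otherwise the validation score is optimistic.
assert set(train).isdisjoint(validation)
print(len(train), len(validation))
```

Fixing the seed is one simple way to document the split so that it can be reproduced later.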

Underfitting

Underfitting occurs when the model has high bias and low variance.

It typically happens when there are too few features and the resulting model is too simple to capture the pattern in the data.

How can I prevent underfitting?

  • Increase the number of features and hence the model complexity.
  • If you are using PCA, which reduces the number of dimensions, keep more components or remove the reduction step.
  • Perform cross-validation to check whether the added complexity actually helps.
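Underfitting can be seen in a toy example. The sketch below assumes a made-up linear process (`y = 3x + 2` plus noise) and compares a model with no features (always predict the training mean) with a model that uses the feature `x`. The helper names are hypothetical.

```python
import random
import statistics


def mean_squared_error(preds, ys):
    return statistics.fmean([(p - y) ** 2 for p, y in zip(preds, ys)])


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b (closed form)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx


def make_data(n, rng):
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [3 * x + 2 + rng.gauss(0, 1) for x in xs]  # hypothetical linear process
    return xs, ys


rng = random.Random(1)
x_train, y_train = make_data(100, rng)
x_test, y_test = make_data(100, rng)

# Model with no features: always predict the training mean.
mean_y = statistics.fmean(y_train)
simple_train = mean_squared_error([mean_y] * len(y_train), y_train)
simple_test = mean_squared_error([mean_y] * len(y_test), y_test)

# Model with one feature: a fitted straight line.
a, b = fit_line(x_train, y_train)
line_train = mean_squared_error([a * x + b for x in x_train], y_train)
line_test = mean_squared_error([a * x + b for x in x_test], y_test)

print(f"mean only: train={simple_train:.1f} test={simple_test:.1f}")
print(f"with x:    train={line_train:.1f} test={line_test:.1f}")
```

The too-simple model has a large error on both the training and the test data, which is the usual signature of underfitting.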

Overfitting

Overfitting occurs when the model has high variance and low bias.

It typically happens when there are too many features and the resulting model is too complex, so it fits the noise in the training data.

How can I prevent overfitting?

  • Decrease the number of features and hence the complexity of the model.
  • Perform a dimension reduction (for example, PCA).
  • Perform cross-validation.
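Overfitting can also be seen in a toy example. The sketch below uses a 1-nearest-neighbour regressor as a stand-in for an overly complex model; that choice, the sine-shaped data and the helper names are assumptions made for illustration, not something taken from the text above. With `k=1` the model memorizes the training set, and with `k=10` it averages over neighbours and is less complex.

```python
import math
import random
import statistics


def knn_predict(x, x_train, y_train, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x))[:k]
    return statistics.fmean([y_train[i] for i in nearest])


def knn_error(xs, ys, x_train, y_train, k):
    """Mean squared error of the k-NN model on the points (xs, ys)."""
    return statistics.fmean(
        [(knn_predict(x, x_train, y_train, k) - y) ** 2 for x, y in zip(xs, ys)]
    )


def make_data(n, rng):
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


rng = random.Random(7)
x_train, y_train = make_data(100, rng)
x_test, y_test = make_data(500, rng)

for k in (1, 10):
    train_err = knn_error(x_train, y_train, x_train, y_train, k)
    test_err = knn_error(x_test, y_test, x_train, y_train, k)
    print(f"k={k:2d} train={train_err:.3f} test={test_err:.3f}")
```

A training error of zero combined with a clearly higher test error is the usual signature of overfitting.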

Cross-validation

This is one of the typical methods to reduce errors in a machine learning solution. It consists of testing the model on many different subsets of the data.

Be careful when you repeatedly re-test a model on the same training/test sets: tuning against the same data often leads to underfitting or overfitting.

Cross-validation tries to mitigate these behaviors.

The typical way to do cross-validation is to divide the data set into several sections: you hold out one section for testing, train on the others, and then repeat with a different section held out. For instance, you can take a stock data set from 2010 to 2017, use the data from 2012 as the testing set, and use the remaining years to train your trading model.
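The splitting described above can be sketched as follows. This is a minimal illustration, assuming contiguous folds and a mean-only model. The names `k_fold_splits` and `cross_validate` and the sample values are hypothetical.

```python
import statistics


def k_fold_splits(n_rows, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_validate(values, k):
    """Score a mean-only model on each held-out fold; return the per-fold MSE."""
    scores = []
    for train, test in k_fold_splits(len(values), k):
        prediction = statistics.fmean([values[i] for i in train])
        scores.append(statistics.fmean([(values[i] - prediction) ** 2 for i in test]))
    return scores


values = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]  # hypothetical targets
scores = cross_validate(values, k=4)
print(scores, statistics.fmean(scores))
```

Each row is held out exactly once, and the average of the per-fold scores is a steadier estimate than a single train/test split.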

Neural networks

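Neural networks are trained with backpropagation: the prediction error is propagated backwards through the network, and each weight is adjusted in proportion to its contribution to that error. Repeating this over the training data gradually minimizes the error.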
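A minimal sketch of how training reduces the error, assuming a single linear neuron and a mean squared error loss. A real network applies the same chain-rule gradient layer by layer. The target `y = 2x + 1`, the learning rate and the function name `train_neuron` are made up for this example.

```python
def train_neuron(xs, ys, learning_rate=0.1, steps=200):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    losses = []
    for _ in range(steps):
        # Forward pass: prediction error for every training example.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        losses.append(sum(e * e for e in errors) / n)
        # Backward pass: the chain rule gives the gradient for each parameter.
        grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(2 * e for e in errors) / n
        # Move each parameter a small step against its gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, losses


xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [2 * x + 1 for x in xs]  # hypothetical target: y = 2x + 1
w, b, losses = train_neuron(xs, ys)
print(f"w={w:.3f} b={b:.3f} first loss={losses[0]:.3f} last loss={losses[-1]:.6f}")
```

The loss recorded at each step shrinks as the weight and bias approach the values that generated the data.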

 
