I’m going through the course “Intro to Machine Learning”, and I would like to keep some notes about it.
My first machine learning code
```python
# Code you have previously used to load data
import pandas as pd

# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
home_data = pd.read_csv(iowa_file_path)

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex3 import *

# Print the list of columns in the dataset to find the name of the prediction target
home_data.columns

# Set the target (price)
y = home_data.SalePrice

# Create the list of features
feature_names = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF',
                 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

# Select data corresponding to features in feature_names
X = home_data[feature_names]

# Review the data: print summary statistics of X
print(X.describe())

# Print the top few lines
print(X.head())

# Specify the model.
# For reproducibility, set a numeric value for random_state.
from sklearn.tree import DecisionTreeRegressor
iowa_model = DecisionTreeRegressor(random_state=1)

# Fit the model
iowa_model.fit(X, y)

predictions = iowa_model.predict(X)
print(predictions)
```
In almost all applications, the relevant measure of model quality is predictive accuracy. In other words: will the model’s predictions be close to what actually happens?
Mean Absolute Error (MAE) takes the absolute value of each prediction error and averages them, giving a single number that summarizes how far off the predictions are.
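As a quick illustration (not from the course), MAE can be computed by hand in a few lines:

```python
def mae(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    errors = [abs(actual - predicted) for actual, predicted in zip(y_true, y_pred)]
    return sum(errors) / len(errors)

# Toy house prices: each prediction is off by exactly 10,000
actual = [200000, 150000, 300000]
predicted = [210000, 140000, 310000]
print(mae(actual, predicted))  # 10000.0
```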
```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Split the data into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify the model
iowa_model = DecisionTreeRegressor(random_state=1)

# Fit iowa_model with the training data
iowa_model.fit(train_X, train_y)

# Predict with all validation observations
val_predictions = iowa_model.predict(val_X)

# Print the top few validation predictions
print(val_predictions[:5])

# Compute and print the validation MAE
val_mae = mean_absolute_error(val_y, val_predictions)
print(val_mae)
```
Some machine learning metrics:
- Confusion Matrix
- Area Under the ROC Curve (AUC)
- F1 Score
- Precision-Recall Curve
- Log/Cross Entropy Loss
- Mean Squared Error
- Mean Absolute Error
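As a sketch (not part of the course code), a few of these metrics can be computed with `sklearn.metrics` on toy labels:

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             mean_absolute_error, mean_squared_error)

# Classification metrics compare predicted labels to true labels
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print(confusion_matrix(y_true, y_pred))  # rows: actual class, columns: predicted class
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall

# Regression metrics compare continuous values instead of labels
print(mean_absolute_error([3.0, 5.0], [2.5, 5.5]))  # 0.5
print(mean_squared_error([3.0, 5.0], [2.5, 5.5]))   # 0.25
```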
Under-fitting and over-fitting
Models can suffer from either:
- Overfitting: capturing spurious patterns that won’t recur in the future, leading to less accurate predictions, or
- Underfitting: failing to capture relevant patterns, again leading to less accurate predictions.
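A common way to balance the two is to compare validation MAE across model sizes: a tree with very few leaves underfits, while a tree with very many leaves overfits. The sketch below uses synthetic data from `make_regression` as a stand-in for the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic regression data standing in for the housing dataset
X, y = make_regression(n_samples=500, n_features=7, noise=10, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Validation MAE of a decision tree limited to max_leaf_nodes leaves."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds = model.predict(val_X)
    return mean_absolute_error(val_y, preds)

# Small trees underfit; very large trees overfit; somewhere in between is best
for max_leaf_nodes in [5, 50, 500, 5000]:
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print(f"Max leaf nodes: {max_leaf_nodes:5d}  Validation MAE: {mae:.0f}")
```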
The 7 Steps of Machine Learning
- Step 1: Gather the data. When participating in a Kaggle competition, this step is already completed for you.
- Step 2: Prepare the data – Deal with missing values and categorical data. (Feature engineering is covered in a separate course.)
- Step 3: Select a model – Choose the type of model to train; these notes use decision trees and random forests.
- Step 4: Train the model – Fit decision trees and random forests to patterns in training data.
- Step 5: Evaluate the model – Use a validation set to assess how well a trained model performs on unseen data.
- Step 6: Tune parameters – Tune parameters to get better performance from XGBoost models.
- Step 7: Get predictions – Generate predictions with a trained model and submit your results to a Kaggle competition.
Automated machine learning (AutoML)
Read how to use Google Cloud AutoML Tables to automate the machine learning process. While Kaggle has already taken care of the data collection, AutoML Tables will take care of all remaining steps.