I’m going through the course “Intro to Machine Learning”, and I would like to keep some notes about it.
My first machine learning code
```python
# Code you have previously used to load data
import pandas as pd

# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
home_data = pd.read_csv(iowa_file_path)

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex3 import *

# Print the list of columns in the dataset to find the name of the prediction target
home_data.columns

# Set the target (price)
y = home_data.SalePrice

# Create the list of features
feature_names = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF',
                 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

# Select data corresponding to features in feature_names
X = home_data[feature_names]

# Review the data: print summary statistics of X
print(X.describe())

# Print the top few lines
print(X.head())

# Specify the model.
# For reproducibility, set a numeric value for random_state.
from sklearn.tree import DecisionTreeRegressor
iowa_model = DecisionTreeRegressor(random_state=1)

# Fit the model
iowa_model.fit(X, y)

predictions = iowa_model.predict(X)
print(predictions)
```
In almost all applications, the relevant measure of model quality is predictive accuracy. In other words: will the model’s predictions be close to what actually happens?
Mean Absolute Error (MAE) takes the absolute value of each prediction error and averages them, giving a single number that summarizes how far off the predictions are.
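As a quick illustration (not from the course), MAE can be computed by hand in a few lines:

```python
def mae(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    errors = [abs(actual - predicted) for actual, predicted in zip(y_true, y_pred)]
    return sum(errors) / len(errors)

# Toy house prices: each prediction is off by exactly 10,000
actual = [200000, 150000, 300000]
predicted = [210000, 140000, 310000]
print(mae(actual, predicted))  # 10000.0
```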
```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Split the data into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify the model
iowa_model = DecisionTreeRegressor(random_state=1)

# Fit iowa_model with the training data
iowa_model.fit(train_X, train_y)

# Predict with all validation observations
val_predictions = iowa_model.predict(val_X)

# Print the top few validation predictions
print(val_predictions[:5])

# Compute and print the validation MAE
val_mae = mean_absolute_error(val_y, val_predictions)
print(val_mae)
```
Some machine learning metrics:
- Confusion Matrix
- Area Under the ROC Curve (AUC)
- F1 Score
- Precision-Recall Curve
- Log/Cross Entropy Loss
- Mean Squared Error
- Mean Absolute Error
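As a sketch (not part of the course code), a few of these metrics can be computed with `sklearn.metrics` on toy labels:

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             mean_absolute_error, mean_squared_error)

# Classification metrics compare predicted labels to true labels
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print(confusion_matrix(y_true, y_pred))  # rows: actual class, columns: predicted class
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall

# Regression metrics compare continuous values instead of labels
print(mean_absolute_error([3.0, 5.0], [2.5, 5.5]))  # 0.5
print(mean_squared_error([3.0, 5.0], [2.5, 5.5]))   # 0.25
```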
Under-fitting and over-fitting
Models can suffer from either:
- Overfitting: capturing spurious patterns that won’t recur in the future, leading to less accurate predictions, or
- Underfitting: failing to capture relevant patterns, again leading to less accurate predictions.
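A common way to balance the two is to compare validation MAE across model sizes: a tree with very few leaves underfits, while a tree with very many leaves overfits. The sketch below uses synthetic data from `make_regression` as a stand-in for the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic regression data standing in for the housing dataset
X, y = make_regression(n_samples=500, n_features=7, noise=10, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Validation MAE of a decision tree limited to max_leaf_nodes leaves."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds = model.predict(val_X)
    return mean_absolute_error(val_y, preds)

# Small trees underfit; very large trees overfit; somewhere in between is best
for max_leaf_nodes in [5, 50, 500, 5000]:
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print(f"Max leaf nodes: {max_leaf_nodes:5d}  Validation MAE: {mae:.0f}")
```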
The 7 Steps of Machine Learning
- Step 1: Gather the data. When participating in a Kaggle competition, this step is already completed for you.
- Step 2: Prepare the data – Deal with missing values and categorical data. (Feature engineering is covered in a separate course.)
- Step 3: Select a model – Choose the type of model to train; these notes use decision trees and random forests.
- Step 4: Train the model – Fit decision trees and random forests to patterns in training data.
- Step 5: Evaluate the model – Use a validation set to assess how well a trained model performs on unseen data.
- Step 6: Tune parameters – Tune parameters to get better performance from XGBoost models.
- Step 7: Get predictions – Generate predictions with a trained model and submit your results to a Kaggle competition.
Automated machine learning (AutoML)
Read how to use Google Cloud AutoML Tables to automate the machine learning process. While Kaggle has already taken care of the data collection, AutoML Tables will take care of all remaining steps.