Why is over-fitting bad? An over-fitted model follows the existing data points too closely, so it is not necessarily the best model for predicting out-of-sample data, and out-of-sample prediction is what you actually need when you estimate the parameters of a linear regression.
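To make the in-sample versus out-of-sample point concrete, here is a minimal sketch on synthetic data (not the IMDB data used later). The sine-shaped signal, the noise level, the sample sizes, and the helper names `make_data` and `mse` are all assumptions chosen for illustration. It fits polynomials of increasing degree to a small training sample and reports the mean squared error on both the training sample and a separate test sample.

```python
import numpy as np

# Synthetic data: all settings below are illustrative assumptions
rng = np.random.default_rng(0)

def make_data(n):
    """Draw n noisy observations of a smooth non-linear signal."""
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(np.pi * x) + rng.normal(0.0, 0.3, n)
    return x, y

def mse(coefs, x, y):
    """Mean squared error of the polynomial with coefficients coefs."""
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

# Small training sample, larger held-out sample
x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

results = {}
for degree in (1, 3, 10):
    # Ordinary least squares polynomial fit on the training sample only
    coefs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coefs, x_train, y_train),
                       mse(coefs, x_test, y_test))
    print("degree %2d: train MSE = %.3f, test MSE = %.3f"
          % (degree, results[degree][0], results[degree][1]))
```

The training error can only go down as the degree increases, because each lower-degree model is a special case of the higher-degree one. The test error carries no such guarantee, and that gap is what over-fitting refers to.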

A great deal can be learned from Hammard Shaikhha's GitHub post, in which he applies regularization to the IMDB movie data. The data set can be downloaded from https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset. It contains various statistics from IMDB for over 5000 movies. For the purpose of applying regularization, only the rating and gross sales revenue data are used.

The regression model can be expressed as:
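The following is a reconstruction based on the comments in the code further down (y = B0 + B1*x + … + B5*x^5 and BetaHat = inv(X'X + Lambda*I)*X'y), not a copy of the original formula. The error term \varepsilon_i and the symbol D for the modified identity matrix are notation assumed here.

```latex
% Degree-5 polynomial model for movie i (x_i = IMDB rating, y_i = gross revenue)
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_5 x_i^5 + \varepsilon_i

% Ridge (L2) objective; the intercept \beta_0 is not penalized
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Bigl( y_i - \sum_{j=0}^{5} \beta_j x_i^j \Bigr)^2
              + \lambda \sum_{j=1}^{5} \beta_j^2

% Closed-form solution; D is the identity matrix with its first diagonal entry set to 0
\hat{\beta} = (X^\top X + \lambda D)^{-1} X^\top y
```

Setting \lambda = 0 gives the ordinary least squares estimate, which is how the code obtains the degree 1 and degree 5 fits from the same function.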

The output charts look like this:

The conclusion drawn is: “The left plot shows a simple degree 1 linear regression model (square loss) fit to the data, and the right figure illustrates a degree 5 linear regression polynomial (square loss) as well as degree 5 ridge regression (L2 loss). From the figure on the left, the line (yellow) doesn’t seem to fit the data well as it does not capture the non-linearities in gross revenues for higher rated movies, and predicts negative gross revenues for low rated movies. In contrast, although the degree 5 polynomial (black) shown to the right captures nonlinear patterns in the data, it seems to overfit the data as it predicts decreasing gross sales revenue for the highly rated movies. Finally the ridge regression (degree 5 with L2 loss) shown to the right seems to be “just correct” as it captures the non-linear increase in gross sales revenue with increasing movie rating.”

The code is:

# -*- coding: utf-8 -*-
"""
Created on Mon Apr 01 20:13:33 2019

@author: Hammard Shaikhha
"""
from IPython.display import Image, display
display(Image(filename='OverfittedData.png', embed=True))

# Import dependencies
# numpy for matrix algebra
import numpy as np
# pandas for data manipulation
import pandas as pd
# matplotlib for data visualization
import matplotlib.pyplot as plt

# Load the IMDB data
movie_data = pd.read_csv("movie_metadata.csv")

# Show structure of data set
movie_data.head()

# Data cleaning
# Drop rows with missing gross revenue data (standard practice in regression analysis)
movie_data = movie_data.dropna(subset=['gross'])
# Only keep data on movies from the US
movie_data = movie_data[movie_data['country'] == "USA"].copy()
# Assuming an average US inflation rate of 2.5%, convert gross revenue to 2017 dollars
movie_data['gross'] = (1.025**(2017 - movie_data['title_year'])) * movie_data['gross']
# Only keep the variables of interest, 'imdb_score' and 'gross'
movie_data = movie_data[['gross', 'imdb_score']]
# Scale the gross revenue to millions of dollars so it is easier to read
movie_data['gross'] = movie_data['gross'] / 1000000

# Randomly drop 90% of the data so overfitting from a high degree polynomial can be seen on the scatter plot
# This is not done in practice, we are just doing it to better visualize regularization methods
# Set seed so we get the same random allocation on each run of the code
np.random.seed(2017)
# Add a column with observations generated randomly from the U[0,1] distribution
movie_data["uniform"] = np.random.uniform(0, 1, len(movie_data))
# Only keep observations if uniform < 0.1 (this randomly drops about 90% of the data)
movie_data = movie_data[movie_data["uniform"] < 0.1]
# Drop the uniform column, it was only added to randomly drop observations
movie_data = movie_data[['gross', 'imdb_score']]

# Summary statistics (mean, stdev, min, max)
movie_data.describe()

# Visualize data
plt.scatter(movie_data['imdb_score'], movie_data['gross'])
# Chart title
plt.title('IMDB Rating and Gross Sales')
# y-label
plt.ylabel('Gross sales revenue ($ millions)')
# x-label
plt.xlabel('IMDB Rating (0 – 10)')
# Show scatter plot
plt.show()

# Implement closed form solutions for linear and L2 norm regression
def estimate_model(y, X, Lambda):
    # X transpose
    Xtranspose = X.T
    # Identity matrix (number of parameters is the dimension)
    Identity = np.identity(X.shape[1])
    # We don't add a penalty to the intercept
    Identity[0, 0] = 0
    # Closed form solution is BetaHat = inv(X'X + Lambda*I)*X'y
    # Estimate model parameters (if Lambda = 0, we get the standard square loss result)
    BetaHat = np.dot(np.linalg.inv(np.dot(Xtranspose, X) + Lambda * Identity),
                     np.dot(Xtranspose, y))
    return BetaHat

# Estimate a degree 1 linear regression model (using standard square loss function)
# Simple linear regression is y = B0 + B1*x
# Define outcome vector (gross movie sales revenue)
outcome = np.array(movie_data['gross'])
# Define covariate (IMDB movie rating)
imdb_score = np.array(movie_data['imdb_score'])
# Vector of ones (for B0); the original used the undefined name y here
ones = np.ones(len(outcome))
# Define design matrix
design_simple = np.column_stack((ones, imdb_score))
# Estimate (Beta0, Beta1) for simple linear regression model
betahat_simple = estimate_model(outcome, design_simple, 0)
print(betahat_simple)

# Estimate a degree 5 linear regression model (using standard square loss function)
# Multiple linear regression is y = B0 + B1*x + B2*x^2 + … + B5*x^5
# Define higher order covariates
imdb_score2 = np.power(imdb_score, 2)
imdb_score3 = np.power(imdb_score, 3)
imdb_score4 = np.power(imdb_score, 4)
imdb_score5 = np.power(imdb_score, 5)
# Define design matrix
design_multiple = np.column_stack((ones, imdb_score, imdb_score2, imdb_score3,
                                   imdb_score4, imdb_score5))
# Estimate (Beta0, ..., Beta5) for multiple linear regression model
betahat_multiple = estimate_model(outcome, design_multiple, 0)
print(betahat_multiple)

# Estimate an L2 regularized loss function regression (also known as ridge regression)
# Same degree 5 model: y = B0 + B1*x + B2*x^2 + … + B5*x^5
# We set Lambda = 5 as the tuning parameter for L2 regularization
betahat_multiple_L2 = estimate_model(outcome, design_multiple, 5)
print(betahat_multiple_L2)

# Visualize simple linear regression (degree 1), figure on the left
plt.subplot(1, 2, 1)
plt.scatter(movie_data['imdb_score'], movie_data['gross'])
# Chart title
plt.title('IMDB Rating and Gross Sales (Linear)')
# y-label
plt.ylabel('Gross sales revenue ($ millions)')
# x-label
plt.xlabel('IMDB Rating (0 – 10)')
# Plot simple linear regression (degree 1)
simple, = plt.plot(imdb_score, betahat_simple[0] + imdb_score * betahat_simple[1], 'y')
# Legend for simple linear regression scatter plot, plot on left
plt.legend([simple], ['Degree 1'])

# Visualize multiple linear regression (degree 5) and L2 loss function (degree 5), figure on the right
plt.subplot(1, 2, 2)
plt.scatter(movie_data['imdb_score'], movie_data['gross'])
# Chart title
plt.title('IMDB Rating and Gross Sales (Polynomial and L2)')
# y-label
plt.ylabel('Gross sales revenue ($ millions)')
# x-label
plt.xlabel('IMDB Rating (0 – 10)')
# Sort the ratings so the fitted curves are drawn from left to right,
# and build the matching design matrix with columns 1, x, x^2, ..., x^5
score_sorted = np.sort(imdb_score)
design_sorted = np.column_stack([score_sorted**k for k in range(6)])
# Plot multiple linear regression (degree 5)
multiple, = plt.plot(score_sorted, np.dot(design_sorted, betahat_multiple), '-k')
# Plot ridge regression (L2 loss function with degree 5)
ridge, = plt.plot(score_sorted, np.dot(design_sorted, betahat_multiple_L2), 'r')
# Set legend for plot on the right
plt.legend([multiple, ridge], ['Degree 5', 'L2 Reg (Degree 5)'])
# Show scatter plots
plt.show()
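The closed-form estimator can also be checked without the Kaggle file. The sketch below is an illustration on synthetic data, not part of the original post. The function name `ridge_closed_form`, the data-generating process, and the choice to scale x to [-1, 1] are assumptions. It uses `np.linalg.solve` in place of an explicit matrix inverse, which is a common substitution and generally more stable numerically.

```python
import numpy as np

def ridge_closed_form(y, X, lam):
    """Solve (X'X + lam*D) beta = X'y, where D is the identity matrix
    with a zero in the top-left corner so the intercept is not penalized."""
    penalty = np.identity(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)

# Synthetic data (assumed for illustration): a quadratic signal plus noise
rng = np.random.default_rng(2017)
x = rng.uniform(-1.0, 1.0, 60)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0.0, 0.5, x.size)

# Degree 5 design matrix with columns 1, x, x^2, ..., x^5
X = np.vander(x, N=6, increasing=True)

beta_ols = ridge_closed_form(y, X, 0.0)    # lam = 0: ordinary least squares
beta_ridge = ridge_closed_form(y, X, 5.0)  # lam = 5: ridge regression

# Reference least squares solution from NumPy for comparison
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print("OLS matches lstsq:", np.allclose(beta_ols, beta_lstsq, atol=1e-6))
print("sum of squared slopes, OLS:  ", np.sum(beta_ols[1:] ** 2))
print("sum of squared slopes, ridge:", np.sum(beta_ridge[1:] ** 2))
```

With Lambda = 0 the function reproduces ordinary least squares, and with Lambda > 0 the sum of squared slope coefficients cannot exceed the least squares value. That shrinkage is what damps the swings of the degree 5 polynomial.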