Overview

by Nathaniel del Rosario and Steven Luong

Our exploratory data analysis on this dataset can be found here.

Framing the Problem

We will build a classification model to predict the rating of a recipe, since our earlier bivariate analysis suggests a correlation between ‘rating’ and features such as ‘cooking_time’ and ‘number_of_ingredients’. Because a recipe can receive any of 5 different star ratings, this is a multi-class prediction problem rather than a binary one.
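A minimal sketch of this framing is below. It assumes a recipes table with the ‘rating’, ‘cooking_time’, and ‘number_of_ingredients’ columns mentioned above; the file name, DataFrame name, and split settings are illustrative assumptions, not the exact code we ran.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical recipes table with one row per recipe.
recipes = pd.read_csv("recipes.csv")

# The target is the integer star rating (1-5), so this is a 5-class problem.
X = recipes[["cooking_time", "number_of_ingredients"]]
y = recipes["rating"].astype(int)

# Hold out a test set so train vs. test precision and F1 can be compared later.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```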

Baseline Model

As we can see, there is a large drop in testing precision, meaning the baseline does not generalize well at avoiding incorrect predictions, specifically false positives.
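For context, a baseline of roughly this shape could be evaluated as in the sketch below, reusing the train/test split from the earlier sketch. The choice of a decision tree as the baseline estimator and the weighted averaging for the multi-class metrics are assumptions for illustration, not necessarily the exact baseline we used.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, f1_score

# Simple baseline pipeline: scale the numeric features, then fit a classifier.
baseline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", DecisionTreeClassifier(random_state=42)),
])
baseline.fit(X_train, y_train)

# Compare train vs. test scores to check the generalization gap noted above.
for name, X_, y_ in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    preds = baseline.predict(X_)
    print(
        name,
        "Precision:", precision_score(y_, preds, average="weighted", zero_division=0),
        "F1:", f1_score(y_, preds, average="weighted"),
    )
```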

Final Model

To start, we tried boosting and decision tree classifiers, but hyperparameter selection through GridSearchCV took so long that both the DataHub and local kernels died, and when we retried with a small subset of hyperparameters, both models ended up overfitting, a common tendency for these two approaches. Because of this, we decided to use a random forest, which handles imbalanced data better and is more resistant to overfitting.
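A sketch of the random forest search we describe is below. The specific grid values, the `class_weight="balanced"` setting, the weighted-F1 scoring, and the 5-fold cross-validation are illustrative assumptions; the actual search we ran may have differed.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small, illustrative hyperparameter grid; the real search would cover more values.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="f1_weighted",  # weighted F1 accounts for the imbalance across ratings
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
final_model = search.best_estimator_
```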

Recall from DSC 40A that adding more features can only improve (or leave unchanged) how well a model fits the training data; better generalization to unseen data is not guaranteed, which is why we compare train and test scores below. The added features likely improved our model for a few reasons.

Baseline Model

  | Metric    | Train Score | Test Score |
  | --------- | ----------- | ---------- |
  | Precision | 0.7380      | 0.6401     |
  | F1 Score  | 0.6829      | 0.6758     |

Final Model

  | Metric    | Train Score | Test Score |
  | --------- | ----------- | ---------- |
  | Precision | 0.7801      | 0.7601     |
  | F1 Score  | 0.7529      | 0.7231     |

Fairness Analysis

We ask the question, “Does our final model perform better for recipes with 5-star ratings than it does for recipes with all other ratings?” To explore this question, we will run a permutation test where we shuffle the group labels between group X, recipes rated 5 stars, and group Y, recipes rated 4 stars and below. Our evaluation metric will be accuracy, because we want to measure the proportion of correct predictions (TP + TN) out of all predictions, and our hypotheses are as follows:

- Null hypothesis: our model is fair; its accuracy for 5-star recipes and for recipes rated 4 stars and below is roughly the same, and any observed difference is due to random chance.
- Alternative hypothesis: our model is unfair; its accuracy for 5-star recipes is higher than its accuracy for recipes rated 4 stars and below.
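A sketch of this permutation test is below, assuming the `final_model` and test split from the earlier sketches. The difference in group accuracies as the test statistic, the 10,000 repetitions, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

preds = final_model.predict(X_test)
is_five_star = (y_test == 5).to_numpy()  # group X mask; ~mask is group Y

def accuracy_gap(group_mask):
    # Difference in accuracy: 5-star group minus all other ratings.
    acc_x = accuracy_score(y_test[group_mask], preds[group_mask])
    acc_y = accuracy_score(y_test[~group_mask], preds[~group_mask])
    return acc_x - acc_y

observed = accuracy_gap(is_five_star)

# Shuffle group membership to simulate the null hypothesis of a fair model.
rng = np.random.default_rng(42)
simulated = np.array([
    accuracy_gap(rng.permutation(is_five_star)) for _ in range(10_000)
])

p_value = np.mean(simulated >= observed)
print("Observed gap:", observed, "p-value:", p_value)
```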

After looking at the distribution of simulated statistics and where the observed statistic lay, we saw that the result was statistically significant with a p-value of 0.0, which is less than our significance level of 0.01, so we reject the null hypothesis. This suggests our model is not fair, as it appears statistically biased towards the 5-star rating.