What are New York's best and worst pizza restaurants?

Sep 30, 2019

TidyTuesday / bar graphs / small multiples / tutorial

In this post, I analyze the Pizza Party dataset from TidyTuesday, a project that shares a new dataset each week to give R users a way to apply and practice their skills. This week’s data is about survey ratings of New York pizza restaurants.

Setup

First, let’s load the tidyverse, change our default ggplot2 theme, and load the data. (I named the dataframe pizza_barstool_raw because I’ll probably add some cleaning steps and I like to have the original data on hand.)

library(tidyverse)

theme_set(theme_light())

pizza_barstool_raw <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-01/pizza_barstool.csv")

Data inspection

Let’s take a look!

pizza_barstool_raw %>%
  glimpse()

## Rows: 463
## Columns: 22
## $ name                                 <chr> "Pugsley's Pizza", "Williamsbu...
## $ address1                             <chr> "590 E 191st St", "265 Union A...
## $ city                                 <chr> "Bronx", "Brooklyn", "New York...
## $ zip                                  <dbl> 10458, 11211, 10017, 10036, 10...
## $ country                              <chr> "US", "US", "US", "US", "US", ...
## $ latitude                             <dbl> 40.85877, 40.70808, 40.75370, ...
## $ longitude                            <dbl> -73.88484, -73.95090, -73.9741...
## $ price_level                          <dbl> 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, ...
## $ provider_rating                      <dbl> 4.5, 3.0, 4.0, 4.0, 3.0, 3.5, ...
## $ provider_review_count                <dbl> 121, 281, 118, 1055, 143, 28, ...
## $ review_stats_all_average_score       <dbl> 8.011111, 7.774074, 5.666667, ...
## $ review_stats_all_count               <dbl> 27, 27, 9, 2, 1, 4, 5, 17, 14,...
## $ review_stats_all_total_score         <dbl> 216.3, 209.9, 51.0, 11.2, 7.1,...
## $ review_stats_community_average_score <dbl> 7.992000, 7.742308, 5.762500, ...
## $ review_stats_community_count         <dbl> 25, 26, 8, 0, 0, 3, 4, 16, 13,...
## $ review_stats_community_total_score   <dbl> 199.8, 201.3, 46.1, 0.0, 0.0, ...
## $ review_stats_critic_average_score    <dbl> 8.8, 0.0, 0.0, 4.3, 0.0, 0.0, ...
## $ review_stats_critic_count            <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, ...
## $ review_stats_critic_total_score      <dbl> 8.8, 0.0, 0.0, 4.3, 0.0, 0.0, ...
## $ review_stats_dave_average_score      <dbl> 7.7, 8.6, 4.9, 6.9, 7.1, 3.2, ...
## $ review_stats_dave_count              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ review_stats_dave_total_score        <dbl> 7.7, 8.6, 4.9, 6.9, 7.1, 3.2, ...

We have three major categories of data:

Location: where the restaurants are, including city (borough) and geographic coordinates
Restaurant: name and price level
Reviews: review counts scores for four types of reviewers
- Provider
- Community
- Critics
- Dave (Barstool)

We could investigate a lot of things with this data, including where in New York the best pizza places are, how consistent the different types of reviewers are, and whether higher prices mean better pizza.

I want to know, What’s the best (and worst) pizza in New York, for each price level? Next time I visit, I want to know the must-go pizza places (and which ones to avoid) and I want options depending on how much I’m looking to spend.

Data cleaning

Now that we have our research question, let’s clean the data. We will:

Filter out restaurants outside of New York.
Add pizzeria_id as unique ID field for each restaurant. This will help if there are multiple restaurants with the same name.
Ditch location data – it doesn’t help answer our question.
Rename variables that start with review_stats_. They’re cumbersome.

pizza_barstool <- pizza_barstool_raw %>%
  filter(city %in% c("New York", # New York is coded two different ways
                     "New York City",
                     "Brooklyn",
                     "Bronx",
                     "Staten Island",
                     "Hoboken")) %>%
  transmute(pizzeria_id = row_number(),
            pizzeria_name = name,
            price_level,
            provider_rating,
            provider_reviews = provider_review_count,
            all_rating = review_stats_all_average_score,
            all_reviews = review_stats_all_count,
            community_rating = review_stats_community_average_score,
            community_reviews = review_stats_community_count,
            critic_rating = review_stats_critic_average_score,
            critic_reviews = review_stats_critic_count,
            dave_rating = review_stats_dave_average_score,
            dave_reviews = review_stats_dave_count)

Two more cleaning steps. First, this data is in a wide format. I want it to be in a tidy format, which is easier to work with. We can do this using gather and spread. After that, we we’ll filter out data that isn’t useful to us – restaurants without zero reviews and the aggregated reviewer “all” (we want the reviewer type to stay unaggregated).

pizza_barstool_tidy <- pizza_barstool %>%
  gather("category", "value", provider_rating:dave_reviews) %>%
  separate(category, into = c("reviewer", "measure"), sep = "_") %>%
  spread(measure, value) %>%
  filter(reviews > 0,
         reviewer != "all")

We’re looking at ratings, which is a continuous variable, so let’s look at the distribution of average scores for each type of reviewer using histograms.

pizza_barstool_tidy %>%
  ggplot(aes(rating)) +
  geom_histogram(binwidth = 1) +  # binwidth = 1 to get clean breaks in histogram
  facet_wrap(~ reviewer) +
  labs(title = "How did different reviewers score restaurants?")

Most reviews are in the 7-8 range, except for provider, which has a lot of 3s and 4s and none over 5. I suspect provider ratings are on a 5-point scale.

pizza_barstool_tidy %>%
  filter(reviewer == "provider") %>%
  count(rating)

## # A tibble: 7 x 2
##   rating     n
##    <dbl> <int>
## 1    2       1
## 2    2.5     9
## 3    3      53
## 4    3.5    96
## 5    4      98
## 6    4.5    25
## 7    5       3

Yep. Let’s double the provider scores so they are on the same scale as the other reviewers. We will lose some nuance, since people may score differently when presented with a 5-point scale compared to a 10-point scale, but I’m comfortable making the that trade-off to get all the scores on the same scale.

pizza_barstool_tidy <- pizza_barstool_tidy %>%
  filter(reviews > 0) %>%
  mutate(rating = if_else(reviewer == "provider", rating * 2, rating))

Visualization

We want to know, What’s the best (and worst) pizza in New York, for each price level? One way to answer that is to determine a typical score for restaurants overall, then calculate which restaurants are farthest away from that typical score. I’m going to use the median as the typical score, since it is less susceptible to extreme values. (For example, some restaurants have a rating of zero and I prefer that those ratings not drag down the typical score too much.)

new_york_pizza <- pizza_barstool_tidy %>%
  mutate(overall_median_rating = median(rating)) %>%
  group_by(pizzeria_id, pizzeria_name, overall_median_rating, price_level) %>%
  summarise(avg_rating = weighted.mean(rating, reviews)) %>%
  ungroup() %>%
  mutate(diff_from_median = avg_rating - overall_median_rating)

## `summarise()` regrouping output by 'pizzeria_id', 'pizzeria_name', 'overall_median_rating' (override with `.groups` argument)

new_york_extremes <- new_york_pizza %>%
  mutate(direction = ifelse(diff_from_median > 0, "higher", "lower")) %>%
  group_by(price_level, direction) %>%
  top_n(5, wt = abs(diff_from_median)) %>%  # Take top 5 best and worst
  ungroup()

Great! The data is in the right format to graph. Let’s do a couple clean-up steps so that our graph looks nice:

Convert numeric price levels (0-4) to dollar signs ($, $$, etc.).
Convert pizzeria_name to a factor and then re-order it so that, when graphed, they will appear in order from highest- to lowest-rated.
Find nice colours and their hex values from ColorBrewer (I use blue for high score and red for low score).
Fiddle with some chart elements, like legend, gridlines, and text size.
Add titles and annotations.

library(tidytext) # Needed for reorder_within and scale_x_reordered functions

new_york_extremes %>%
  # 1. Convert price levels with dollars signs
  mutate(price_level = case_when(price_level == 0 ~ "$",
                                 price_level == 1 ~ "$$",
                                 price_level == 2 ~ "$$$",
                                 price_level == 3 ~ "$$$$",
                                 TRUE ~ NA_character_),
  # 2. Re-order restaurants so they appear from best to worst
         pizzeria_name = reorder_within(pizzeria_name, diff_from_median, price_level),
         avg_rating = round(avg_rating, 1)) %>%
  ggplot(aes(pizzeria_name, diff_from_median, fill = direction)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ price_level, scales = "free_y") +
  scale_x_reordered() +
  # 3. Get nice colours from colorbrewer2.org
  scale_fill_manual(values = c("#3288bd", "#d53e4f")) +
  # 4. Fiddle with some chart elements
  theme(legend.position = "none", # No legend
        panel.grid.minor = element_blank(), # No minor gridlines 
        strip.background = element_rect(fill = NA), # Blank facet title background
        strip.text = element_text(colour = "black", face = "bold", size = 12)) +
  # 5. Add titles and annotations
  labs(title = "What's the best and worst pizza in New York?",
       subtitle = "Broken down by price level",
       x = "",
       y = "Score difference from the median",
       caption = "Based on Barstool, community, critic, and Dave reviews")

I’m definitely going to L’industrie Pizzeria next time I’m in New York!

Conclusion