dragracer
Packagelibrary(dragracer)
#> Now sissy that walk
library(tibble)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
The dragracer
package has three data sets. The first is episode-level data (rpdr_ep
). These data contain some more granular information about each episode that may not be discernible from how episodes are typically summarized on Wikipedia (e.g. mini-challenge winners, runway themes [where applicable], lip-sync song and artist). The second data set is contestant-level (rpdr_contestants
). This data frame includes the contestant name, hometown, and purported date of birth and age by the start of the show. The third data set is episode-contestant-level data (rpdr_contep
). This is the most familiar form of the data that a reader of the show’s Wikipedia entries could discern. They include information about how a contestant fared in a particular episode (i.e. whether they won, scored high, were safe, scored low, or were in the bottom). The show’s fans are accustomed to seeing this form of the data as akin to a pyramid. However, I convert the data from wide to long, making the data akin to a survival data-generating process.
Here are some potential uses of the data.
A user can learn about how to summarize data. Here, we can get the average age of the contestants by season from the rpdr_contestants
data.
%>%
rpdr_contestants group_by(season) %>%
summarize(mean_age = mean(age))
#> # A tibble: 14 × 2
#> season mean_age
#> <chr> <dbl>
#> 1 S01 31
#> 2 S02 27.6
#> 3 S03 28.2
#> 4 S04 29.2
#> 5 S05 28
#> 6 S06 29.3
#> 7 S07 30.2
#> 8 S08 29.8
#> 9 S09 30.4
#> 10 S10 28.3
#> 11 S11 29.3
#> 12 S12 28.4
#> 13 S13 29.6
#> 14 S14 26.4
A user can also see which musical artists have appeared most for lip-syncs. The answer here is, unsurprisingly, RuPaul.
%>%
rpdr_ep group_by(lipsyncartist) %>%
summarize(n = n()) %>%
%>%
na.omit arrange(-n) %>% head(10)
#> # A tibble: 10 × 2
#> lipsyncartist n
#> <chr> <int>
#> 1 RuPaul 11
#> 2 Britney Spears 6
#> 3 Madonna 5
#> 4 Aretha Franklin 4
#> 5 Lady Gaga 4
#> 6 Whitney Houston 4
#> 7 Ariana Grande 3
#> 8 Cher 3
#> 9 Donna Summer 3
#> 10 Blondie 2
A user can also see how Jinkx Monsoon, the GOAT, fared in all her episodes.
%>%
rpdr_contep filter(contestant == "Jinkx Monsoon") %>%
select(season, contestant, episode, outcome, finale)
#> # A tibble: 12 × 5
#> season contestant episode outcome finale
#> <chr> <chr> <dbl> <chr> <dbl>
#> 1 S05 Jinkx Monsoon 1 SAFE 0
#> 2 S05 Jinkx Monsoon 2 HIGH 0
#> 3 S05 Jinkx Monsoon 3 HIGH 0
#> 4 S05 Jinkx Monsoon 4 HIGH 0
#> 5 S05 Jinkx Monsoon 5 WIN 0
#> 6 S05 Jinkx Monsoon 6 HIGH 0
#> 7 S05 Jinkx Monsoon 7 HIGH 0
#> 8 S05 Jinkx Monsoon 8 HIGH 0
#> 9 S05 Jinkx Monsoon 9 WIN 0
#> 10 S05 Jinkx Monsoon 10 HIGH 0
#> 11 S05 Jinkx Monsoon 11 BTM 0
#> 12 S05 Jinkx Monsoon 14 WIN 1
Previous versions of the data included all sorts of information at the contestant-level. For release, I decided to strip that information from the data in order to allow the user to learn how to do this. For example, if you were interested in summarizing how each contestant did in their particular season on various metrics, here’s how you might do that.
First, let’s merge in the mini-challenge data. Mini-challenges are irregular; not every episode has them. Indeed, Season 12 had very few of them. So, they get special treatment in the episode-level data.
%>%
rpdr_ep select(season, minicw1:minicw3) %>%
group_by(season) %>%
gather(Category, contestant, minicw1:minicw3) %>%
%>%
na.omit group_by(season, contestant) %>%
summarize(minicwins = n()) %>%
left_join(rpdr_contestants, .) %>%
mutate(minicwins = ifelse(is.na(minicwins), 0, minicwins)) -> D
#> `summarise()` has grouped output by 'season'. You can override using the `.groups` argument.
#> Joining, by = c("season", "contestant")
Now, let’s merge in data from the episode-contestant-level about how each contestant fared, excluding finales and specials. We’ll calculate all sorts of things here, including estimated “points per episode” and “Dusted or Busted” scores.
%>%
rpdr_contep filter(participant == 1 & finale == 0 & penultimate == 0) %>%
mutate(high = ifelse(outcome == "HIGH", 1, 0),
win = ifelse(outcome == "WIN", 1, 0),
low = ifelse(outcome == "LOW", 1, 0),
safe = ifelse(outcome == "SAFE", 1, 0),
highsafe = ifelse(outcome %in% c("HIGH", "SAFE"), 1, 0),
winhigh = ifelse(outcome %in% c("HIGH", "WIN"), 1, 0),
btm = ifelse(outcome == "BTM", 1, 0),
lowbtm = ifelse(outcome %in% c("BTM", "LOW"), 1, 0)) %>%
group_by(season,contestant,rank) %>%
mutate(numcontests = n()) %>%
group_by(season,contestant, numcontests, rank) %>%
summarize(perc_high = sum(high)/unique(numcontests),
perc_win = sum(win)/unique(numcontests),
perc_winhigh = sum(winhigh)/unique(numcontests),
perc_low = sum(low)/unique(numcontests),
perc_btm = sum(btm)/unique(numcontests),
perc_lowbtm = sum(lowbtm)/unique(numcontests),
num_high = sum(high),
num_win = sum(win),
num_winhigh = sum(winhigh),
num_btm = sum(btm),
num_low = sum(low),
num_lowbtm = sum(lowbtm),
db_score = 2*sum(win, na.rm=T) +
1*sum(high, na.rm=T) +
sum(safe, na.rm=T)*0) +
(sum(low, na.rm=T)*-1) + (sum(btm, na.rm=T)*-2)) %>%
(ungroup() %>%
mutate(points = (2*num_win + num_high - num_low + (-2)*num_btm),
ppe = points/numcontests) %>%
full_join(D, .) -> D
#> `summarise()` has grouped output by 'season', 'contestant', 'numcontests'. You can override using the `.groups` argument.
#> Joining, by = c("season", "contestant")
How, let’s look at who had the highest “Dusted or Busted” score across all seasons.
%>%
D arrange(-db_score) %>%
head(10) %>%
select(season, contestant, rank, db_score)
#> # A tibble: 10 × 4
#> season contestant rank db_score
#> <chr> <chr> <dbl> <dbl>
#> 1 S06 Bianca Del Rio 1 10
#> 2 S05 Jinkx Monsoon 1 9
#> 3 S09 Shea Couleé 3 9
#> 4 S13 Gottmik 3 9
#> 5 S13 Rosé 3 9
#> 6 S09 Sasha Velour 1 8
#> 7 S13 Symone 1 8
#> 8 S02 Tyra Sanchez 1 7
#> 9 S03 Raja 1 7
#> 10 S03 Manila Luzon 2 7
Let’s also see who has the highest “points per episode” score.
%>%
D arrange(-ppe) %>%
head(10) %>%
select(season, contestant, rank, ppe)
#> # A tibble: 10 × 4
#> season contestant rank ppe
#> <chr> <chr> <dbl> <dbl>
#> 1 S06 Bianca Del Rio 1 1
#> 2 S05 Jinkx Monsoon 1 0.818
#> 3 S09 Shea Couleé 3 0.818
#> 4 S01 Ongina 5 0.8
#> 5 S02 Tyra Sanchez 1 0.778
#> 6 S09 Sasha Velour 1 0.727
#> 7 S01 Nina Flowers 2 0.667
#> 8 S13 Gottmik 3 0.643
#> 9 S13 Rosé 3 0.643
#> 10 S04 Sharon Needles 1 0.636
Feel free to use the data for your own ends or learn R from it.