generated from andreashandel/online-portfolio-template
-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathtidytuesday_exercise.qmd
177 lines (125 loc) · 8.46 KB
/
tidytuesday_exercise.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
---
title: "Tidy Tuesday Exercise"
output:
html_document:
toc: FALSE
---
This is my first Tidy Tuesday exercise! I feel like this is such a cool community to be a part of, and I'm excited to get into it.
First thing's first, let's load in the data.
```{r, include=FALSE}
library(tidytuesdayR)
library(tidyverse)
```
```{r, echo=FALSE}
movies<-read.csv("data/movies.csv")
glimpse(movies)
```
## Data Exploration
Alrighty, off the bat it looks like the first actor is usually a guy while the second one is a mix of men and women, but I want to check this out.
```{r}
unique(movies$Actor.1.Gender) #There's both! How many of each?
sum(movies$Actor.1.Gender == "man") #1139
sum(movies$Actor.1.Gender == "woman") #16
unique(movies$Actor.2.Gender)
sum(movies$Actor.2.Gender == "man") #17
sum(movies$Actor.2.Gender == "woman") #1138
```
Do we have any overlap in the few where men/women are flipped?
```{r}
movies %>%
filter(Actor.1.Name %in% Actor.2.Name)%>%
distinct(Actor.1.Name)
```
Ok so we have 19 actors that are in both Actor.1 and Actor.2. Right now we might not need to adjust for this, but it's good to know for the future.
So there seems to be a flip-flop of leading men/ladies, and about 1 of each gender per film. So, the order of Actor 1 and Actor 2 it's not exclusively men and women, is Actor 1 the older one?
```{r}
movies%>%
mutate(act1.agediff = Actor.1.Age - Actor.2.Age)%>%
count(act1.agediff < 0)
```
We have 186 instances where Actor 2 is older than Actor 1. It seems like for this dataset is a bit arbitrary in terms of who is listed 1st and 2nd - unless it's by whoever is paid most which is information we don't have here.
Welp, we'll figure out what to do with this later. Until I know what I'm doing with the data I won't mess with it. To continue data exploration, I want to see how many unique actors there are across the board.
```{r, results = FALSE}
unique(movies$Actor.1.Name) #491
unique(movies$Actor.2.Name) #559
```
Cool, so we have a wide range of different actors! We're still going to have some duplicates, so who are the most common/popular actors across both?
```{r, out.lines = 10}
head(movies%>%
count(movies$Actor.1.Name)%>%
arrange(desc(n)), n=10)
```
```{r, out.lines = 10}
head(movies%>%
count(movies$Actor.2.Name)%>%
arrange(desc(n)), n=10)
```
We have the incredible Keanu Reeves and Keria Knightly leading the actors with 27 and 14 movies each, respectively.
## Analysis
Alright, now that we've explored the data a bit, let's get into the juicy stuff -- looking at these age differences.
```{r}
ggplot()+
geom_point(aes(x=Actor.1.Age, y=Actor.2.Age), data=movies)+
geom_smooth(aes(x=Actor.1.Age, y=Actor.2.Age), data=movies)+
theme_bw()
```
The good news is it seems relatively steady in terms of age gaps between the two actors. Does this age difference change over time?
```{r}
ggplot()+
geom_point(aes(x=Release.Year, y=Age.Difference), data=movies)+
theme_bw()
```
It seems pretty hard to see trends since we have such a large number of movies released more recently. Let's see how else we can visualize this.
Since we have such a long time-frame, I'm going to create a dummy variable for decade of release.
```{r, warning=FALSE}
movies.decade<-movies%>%
mutate(Decade = ifelse(Release.Year %in% c(1930:1939), "1930",
ifelse(Release.Year %in% c(1940:1949), "1940",
ifelse(Release.Year %in% c(1950:1959), "1950",
ifelse(Release.Year %in% c(1960:1969), "1960",
ifelse(Release.Year %in% c(1970:1979), "1970",
ifelse(Release.Year %in% c(1980:1989), "1980",
ifelse(Release.Year %in% c(1990:1999), "1990",
ifelse(Release.Year %in% c(2000:2009), "2000",
ifelse(Release.Year %in% c(2010:2019), "2010", "2020"))))))))))
```
Next, I want to plot the Age Differences relative to Decade and see if we can notice any trends when the scales are standardized to a proportion to try and mitigate the influx of movie production in the recent years.
```{r, warning=FALSE}
ggplot()+
geom_bar(aes(x=Age.Difference, fill=Decade), data=movies.decade, position = "fill", width = 2)+
theme_bw()
```
A barchart such as this helps us see which decades have certain age gaps, and using a proportional approach helps mitigate the large number of more recent movies. However, we still run into the problem where the number of movies created in each decade influences how we read the chart. For example, those movies made in the 2020s don't seem like a large impact although these recent movies have over 20 years age differences, and those from the 1930s are barely visible.
```{r}
movies.age<-movies.decade%>%
mutate(age.diff = ifelse(Age.Difference %in% c(0:9), "<10",
ifelse(Age.Difference %in% c(10:19), "10-20",
ifelse(Age.Difference %in% c(20:29), "20+",
ifelse(Age.Difference %in% c(30:39), "30+",
ifelse(Age.Difference %in% c(40:49), "40+", "50+"))))))
ggplot()+
geom_bar(aes(x=Decade, fill=age.diff), data=movies.age, position = position_fill(reverse = TRUE))+
theme_bw()
```
Switching our (in)dependent variables helped us see how the movies in each trended towards different age differences. Shockingly, 2020s had some drastic age differences of 20+ years. Maybe not shockingly, overall, the older movies tended to have more age-gap couples, specifically the 1940s and 50s with most of their respective movies having couples that had over a 20 year age gap.
I'm curious to see if any specific directors are guilty of leading these movies, or if it's an industry-wide concern. Because of the variation by decade and the limited longevity of people's careers, I'm going to keep the decade consideration as a grouping with these directors.
```{r, warning = FALSE}
movies.direct<-movies.decade%>%
group_by(Director, Decade)%>%
summarize(mean.diff = mean(Age.Difference))%>%
count(mean.diff)%>%
arrange(desc(mean.diff))
head(movies.direct, n=10)
```
So, I was able to ultimately do this by Director and by decade but not both. There's definitely a way to do it (and probably very easily), but if I'm honest I'm very tired and would rather do it separately and call it a day.
Anyways, we see that our top 10 directors have age gaps greater than 30 years, but they only directed one movie. The most our directors have in this dataset is 2 movies, so I think it's safe to say a few directors aren't exclusively responsible for these age-gap couples.
Our last bit of exploration is getting a bit intense. Age gaps can be acceptable (or at least legal), between two consenting adults, but do we have any couples who we should call the police on? Per Romeo and Juliet laws, we'll give a 5 year buffer.
```{r}
movies%>%
filter(Actor.1.Age < 18 | Actor.2.Age <18,
Age.Difference > 5)
```
We have 6 lovely movies that are questionable. And all of which were made pretty recently, a bit shocking. Our women are Hollywood IT girls: Drew Barrymore, Dominque Swain, Scarlett Johansson, and Kate Bosworth. Our men... are old (comparatively). We have *three* 20+ age differences, making these men in their 40s-50s as the love interest of 17 year olds. We also have Cate Blanchett (37) with Andrew Simpson (17) in Notes on a Scandal -- which is fitting -- and Cary Elwes and Alicia Silverstone in The Crush. While not appropriate, these are the premises of the movies.
## Conclusions
This was a fun and slightly scandalous first Tidy Tuesday for me! I'm intruiged to see what more seasoned participants do with this information, as I feel like there's a lot of fun ways you could spin this data.
Ultimately, by my elementary findings, we can't find an immediate rhyme or reason for these age gaps, or patterns among movies by Release Year, director, or specific actors. Of 1155 movies, only 3 really called legality into question which is slightly affirming; however, they were all filmed pretty recently (1990s+). We also didn't look at those situations on the cusp, like an 18/19 year old with and older costar. With the given pop news, maybe I should have looked more into if Leonardo DiCaprio is mentioned anywhere. That might be a subject for another day.