Remaining stringsAsFactors removed. #152

Open
wants to merge 1 commit into
base: main
96 changes: 59 additions & 37 deletions episodes/04-data-structures-part2.Rmd
@@ -37,10 +37,10 @@ So far, you have seen the basics of manipulating data frames with our nordic dat

::::::::::::::::::::::::::::::::::::::::: instructor

Pay attention to and explain the errors and warnings generated from the
Pay attention to and explain the errors and warnings generated from the
examples in this episode.

:::::::::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::::::

```{r, echo=TRUE}
gapminder <- read.csv("data/gapminder_data.csv")
@@ -70,12 +70,12 @@ gapminder <- read.csv("data/gapminder_data.csv")
your computer. For example,

```{r, eval=FALSE, echo=TRUE}
gapminder <- read.csv("https://datacarpentry.org/r-intro-geospatial/data/gapminder_data.csv", stringsAsFactors = TRUE) #in R version 4.0.0 the default stringsAsFactors changed from TRUE to FALSE. But because below we use some examples to show what is a factor, we need to add the stringAsFactors = TRUE to be able to perform the below examples with factor.
gapminder <- read.csv("https://datacarpentry.org/r-intro-geospatial/data/gapminder_data.csv")
```

- You can read directly from Excel spreadsheets without
converting them to plain text first by using the [readxl](https://cran.r-project.org/package=readxl) package.


::::::::::::::::::::::::::::::::::::::::::::::::::

@@ -193,11 +193,10 @@ gapminder[sample(nrow(gapminder), 5), ]

## Challenge 2

Read the output of `str(gapminder)` again; this time, use what you've learned
about factors and vectors, as well as the output of functions like `colnames`
and `dim` to explain what everything that `str` prints out for `gapminder`
means. If there are any parts you can't interpret, discuss with your
neighbors!
Read the output of `str(gapminder)` again; this time, use what you've learned,
as well as the output of functions like `colnames` and `dim` to explain what
everything that `str` prints out for `gapminder` means. If there are any parts
you can't interpret, discuss with your neighbors!

::::::::::::::: solution

@@ -219,7 +218,6 @@ We would like to create a new column to hold information on whether the life exp

```{r}
below_average <- gapminder$lifeExp < 70.5
head(gapminder)
```

We can then add this as a column via:
@@ -228,10 +226,6 @@ We can then add this as a column via:
cbind(gapminder, below_average)
```

```{r, eval=TRUE, echo=FALSE}
head(cbind(gapminder, below_average))
```

We probably don't want to print the entire dataframe each time, so
let's put our `cbind` command within a call to `head` to return
only the first six lines of the output.
@@ -267,7 +261,7 @@ The sequence `TRUE,TRUE,FALSE` is repeated over all the gapminder rows.
Let's overwrite the content of gapminder with our new data frame.

```{r}
below_average <- as.logical(gapminder$lifeExp<70.5)
below_average <- as.logical(gapminder$lifeExp < 70.5)
gapminder <- cbind(gapminder, below_average)
```
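As a minimal, self-contained sketch of the same idea (the `toy` data frame, its values and the name `toy_below` are made up for illustration and are not part of the gapminder data): a comparison already returns a logical vector, so wrapping it in `as.logical()` is optional.

```r
# Hypothetical toy data, not taken from gapminder
toy <- data.frame(lifeExp = c(65.2, 72.8, 80.1))

# A comparison already yields a logical vector
toy_below <- toy$lifeExp < 70.5
is.logical(toy_below)

# cbind() names the new column after the vector that was passed in
toy <- cbind(toy, toy_below)
str(toy)
```

Because the vector has one element per row, no recycling takes place here.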

@@ -279,39 +273,68 @@ gapminder_norway <- rbind(gapminder, new_row)
tail(gapminder_norway)
```

To understand why R is giving us a warning when we try to add this row, let's learn a little more about factors.

## Factors

Here is another thing to look out for: in a `factor`, each different value
represents what is called a `level`. In our case, the `factor` "continent" has 5
levels: "Africa", "Americas", "Asia", "Europe" and "Oceania". R will only accept
values that match one of the levels. If you add a new value, it will become
`NA`.

The warning is telling us that we unsuccessfully added "Nordic" to our
*continent* factor, but 2016 (a numeric), 5000000 (a numeric), 80.3 (a numeric),
49400\.0 (a numeric) and `FALSE` (a logical) were successfully added to
*year*, *pop*, *lifeExp*, *gdpPercap* and *below\_average*
respectively, since those variables are not factors. 'Norway' was also
successfully added to *country* since it corresponds to an existing level. To successfully
add a gapminder row with a "Nordic" *continent*, add "Nordic" as a *level* of
the factor:
represents what is called a `level`.

Let's convert the columns continent and country into factors:

```{r}
gapminder$continent <- factor(gapminder$continent)
gapminder$country <- factor(gapminder$country)
str(gapminder)
```

In our case, the `factor` "continent" has 5 levels: "Africa", "Americas",
"Asia", "Europe" and "Oceania":

```{r}
levels(gapminder$continent)
```
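To see what levels are, here is a small sketch on a made-up vector (the name `toy_continent` and its values are illustrative only). A factor stores each value as an integer code that points into its set of levels, which are sorted alphabetically by default:

```r
# Hypothetical example values, not taken from gapminder
toy_continent <- factor(c("Europe", "Asia", "Europe"))

levels(toy_continent)      # the distinct categories
as.integer(toy_continent)  # integer codes that index into the levels
nlevels(toy_continent)     # how many levels there are
```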

A factor is not a character vector. For example, if we try to add the same row
from above to our data.frame, some values will become `NA`. This happens because
"continent" and "country" are now factors, and R will only accept new values
that match one of the factor's levels:

```{r}
new_row <- list('Norway', 2016, 5000000, 'Nordic', 80.3, 49400.0, FALSE)
gapminder_norway <- rbind(gapminder, new_row)
```

This warning is telling us that we unsuccessfully added "Nordic" to our
*continent* factor (see below), but 2016 (a numeric), 5000000 (a numeric), 80.3
(a numeric), 49400\.0 (a numeric) and `FALSE` (a logical) were successfully
added to *year*, *pop*, *lifeExp*, *gdpPercap* and *below\_average*
respectively, since those variables are not factors. 'Norway' was also
successfully added to the *country* factor since it corresponds to an existing level.

```{r}
tail(gapminder_norway, n = 1)
```
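The same behaviour can be reproduced in isolation. The following is a minimal sketch on a made-up factor (the name `toy_continent` is an assumption for illustration), showing that a value outside the existing levels is stored as `NA`:

```r
# Hypothetical factor with two levels: "Asia" and "Europe"
toy_continent <- factor(c("Europe", "Asia"))

# "Nordic" is not an existing level, so R issues a warning
# and stores NA in that position instead
toy_continent[3] <- "Nordic"

toy_continent
is.na(toy_continent[3])
```

The set of levels is unchanged by the failed assignment.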

To successfully add a row with a "Nordic" *continent*, add "Nordic" as a
*level* of the factor:

```{r}
levels(gapminder$continent) <- c(levels(gapminder$continent), "Nordic")
```

And then add the Norway row again:

```{r}
gapminder_norway <- rbind(gapminder,
list('Norway', 2016, 5000000, 'Nordic', 80.3,49400.0, FALSE))
tail(gapminder_norway)
list('Norway', 2016, 5000000, 'Nordic', 80.3,49400.0, FALSE))
tail(gapminder_norway, n = 1)
```
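The two steps above can also be tried on a small made-up data frame. This is only a sketch: `toy` and its contents are hypothetical, and the new value is assigned directly to the column rather than through `rbind`.

```r
# Hypothetical toy data frame with a factor column
toy <- data.frame(
  country = c("Norway", "Japan"),
  continent = factor(c("Europe", "Asia"))
)

# Step 1: extend the set of allowed levels
levels(toy$continent) <- c(levels(toy$continent), "Nordic")

# Step 2: the new value is now accepted without a warning
toy$continent[1] <- "Nordic"
toy
```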

Alternatively, we can change a factor into a character vector; we lose the handy
categories of the factor, but we can subsequently add any word we want to the
column without babysitting the factor levels:
Alternatively, we can change the "continent" factor into a character vector. In
this way, we lose the handy categories of the factor, but we can subsequently
add any word we want to the column without babysitting the factor levels:

```{r}
str(gapminder)
gapminder$continent <- as.character(gapminder$continent)
str(gapminder)
```
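As a rough illustration of that trade-off on a made-up factor (the names below are assumptions, not part of the lesson data), a character vector accepts any new string, but it no longer has levels:

```r
# Hypothetical factor, converted to a plain character vector
toy_continent <- factor(c("Europe", "Asia"))
toy_chr <- as.character(toy_continent)

# Any string can be appended; there are no levels to maintain
toy_chr[3] <- "Nordic"
toy_chr

# The categories are gone: levels() of a character vector is NULL
levels(toy_chr)
```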
@@ -324,7 +347,7 @@ vectors and rows are lists.* We can also glue two data frames together with

```{r}
gapminder <- rbind(gapminder, gapminder)
tail(gapminder, n=3)
tail(gapminder, n = 3)
```

But now the row names are unnecessarily complicated (not consecutive numbers).
@@ -386,4 +409,3 @@ df <- cbind(df, coffeetime = c(TRUE, TRUE))

::::::::::::::::::::::::::::::::::::::::::::::::::

