diseasy coding standard
The diseasy coding standard largely follows the tidyverse style guide, with some clarifications and deviations.
Code should pass devtools::lint() with no issues.
We also like The Zen of Python.
A maximum line length of 80 characters is an old convention from a time when widescreen monitors were much less widespread. To keep up with the times, lines of up to 120 characters are accepted.
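Since devtools::lint() runs lintr, whose default line length limit is 80 characters, the limit has to be raised for 120-character lines to pass. The following is a minimal sketch, assuming lintr is installed; the variable names are illustrative only. It checks one 100-character line against both limits.

```r
# A 100 character line: too long for the 80 character default, fine at 120
long_line <- paste0("x <- \"", strrep("a", 93), "\"")

# Only run the comparison if lintr is available
if (requireNamespace("lintr", quietly = TRUE)) {
  lints_at_80 <- lintr::lint(
    text = long_line,
    linters = lintr::line_length_linter(80L),
    parse_settings = FALSE
  )
  lints_at_120 <- lintr::lint(
    text = long_line,
    linters = lintr::line_length_linter(120L),
    parse_settings = FALSE
  )
  print(length(lints_at_80))
  print(length(lints_at_120))
}
```

The same linter can also be configured once per project in a .lintr file, so that every contributor lints against the same limit.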
We prefer variable names to be concise yet unambiguous.
Good:
start_date <- as.Date("2020-01-01")
for (row_index in seq_len(10)) {
  print(mtcars[row_index, ])
}
first_day_of_the_month # Not ambiguous
Bad:
sd <- as.Date("2020-01-01")
for (i in seq(10)) {
  print(mtcars[i, ])
}
day_one # Unclear: day one of what?
We prefer to validate the input to functions and to give the user easily understood feedback.
Good:
do_something <- function(character_variable, character_vector_variable, date_variable) {
  coll <- checkmate::makeAssertCollection()
  checkmate::assert_character(character_variable, len = 1, add = coll)
  checkmate::assert_character(character_vector_variable, add = coll)
  checkmate::assert_date(date_variable, lower = as.Date("2020-01-01"), add = coll)
  checkmate::reportAssertions(coll)
  ...
}
Bad:
do_something <- function(character_variable, character_vector_variable, date_variable) {
  stopifnot(is.character(character_variable) && length(character_variable) == 1)
  stopifnot(is.character(character_vector_variable))
  stopifnot(inherits(date_variable, "Date"))
  stopifnot(date_variable >= as.Date("2020-01-01"))
  ...
}
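One practical difference between the two approaches is that an assert collection reports every failed check at once, whereas stopifnot stops at the first failure. The sketch below illustrates this with a reduced, hypothetical version of do_something; it assumes checkmate is installed and only runs the call if it is.

```r
# Hypothetical, reduced version of do_something with two validated arguments
do_something <- function(character_variable, date_variable) {
  coll <- checkmate::makeAssertCollection()
  checkmate::assert_character(character_variable, len = 1, add = coll)
  checkmate::assert_date(date_variable, lower = as.Date("2020-01-01"), add = coll)
  checkmate::reportAssertions(coll)
  return(invisible(TRUE))
}

# Both arguments are invalid: a number instead of a string, a string instead of a Date
error_message <- NULL
if (requireNamespace("checkmate", quietly = TRUE)) {
  error_message <- tryCatch(
    do_something(character_variable = 1, date_variable = "2020-01-01"),
    error = function(e) conditionMessage(e)
  )
  # The single error message lists the problems with both arguments
  cat(error_message, "\n")
}
```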
Code should use the minimum nesting needed.
Utilise early returns and refactoring as needed to reduce nesting.
Good:
test_valid_date_range <- function(start_date, end_date) {
  if (start_date < as.Date("2020-01-01")) {
    return(FALSE)
  }
  if (end_date > as.Date("2023-12-31")) {
    return(FALSE)
  }
  if (end_date < start_date) {
    return(FALSE)
  }
  return(TRUE)
}
Bad:
test_valid_date_range <- function(start_date, end_date) {
  if (start_date >= as.Date("2020-01-01")) {
    if (end_date <= as.Date("2023-12-31")) {
      if (end_date >= start_date) {
        return(TRUE)
      } else {
        return(FALSE)
      }
    } else {
      return(FALSE)
    }
  } else {
    return(FALSE)
  }
}
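The examples above show early returns. Refactoring can reduce nesting in a similar way: the body of a loop or branch moves into a small helper function, which can then return early itself. The following is a minimal sketch; summarise_group and the other names are hypothetical.

```r
# Hypothetical helper: summarise one group, returning early when there is nothing to summarise
summarise_group <- function(group_data) {
  if (nrow(group_data) == 0) {
    return(NULL)
  }
  return(data.frame(cyl = group_data$cyl[1], mean_mpg = mean(group_data$mpg)))
}

# The calling code stays flat: split, apply the helper, combine
groups <- split(mtcars, mtcars$cyl)
summaries <- lapply(groups, summarise_group)
result <- do.call(rbind, summaries)
print(result)
```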
The tidyverse style guide says "& and | should never be used inside of an if clause" as they may return vectors.
However, as long as you make sure the condition of the if statement evaluates to a single TRUE or FALSE rather than a longer vector, this is allowed.
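One way to satisfy this is to collapse the vectorised comparison with all() or any() before it reaches the if clause. The following is a minimal sketch; the function name and messages are illustrative only.

```r
# Hypothetical helper: checks that every value lies within [lower, upper]
describe_range <- function(values, lower, upper) {
  # `&` compares element-wise and returns a vector; all() collapses it to a single TRUE/FALSE
  if (all(values >= lower & values <= upper)) {
    return("all values in range")
  }
  return("some values out of range")
}

print(describe_range(c(1, 2, 3), lower = 0, upper = 5))
print(describe_range(c(1, 2, 30), lower = 0, upper = 5))
```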
Functions should explicitly show what they return by using the return function.
In some cases, it may be acceptable to omit return.
Good:
do_something <- function(data) {
  out <- another_function(data)
  ...
  return(out)
}
Acceptable:
do_something <- function(data) {
  data |>
    another_function() |>
    ...
}
Bad:
do_something <- function(data) {
  out <- another_function(data)
  ...
  out
}
Following the Zen of Python, we prefer to be explicit.
Many examples of R code use inherently implicit syntax, which should be avoided by using the .data$ pronoun and double quotes (").
# Good
iris |>
  tidyr::gather("measure", "value", !"Species") |>
  dplyr::arrange(-.data$value)
# Better
iris |>
  tidyr::gather(
    key = "measure",
    value = "value",
    !"Species"
  ) |>
  dplyr::arrange(-.data$value)
# Bad
iris |>
  gather(measure, value, -Species) |>
  arrange(-value)
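The same explicit style carries over to tidyr::pivot_longer(), which supersedes tidyr::gather() in current tidyr releases. The sketch below is not part of the standard itself; it assumes tidyr and dplyr are installed and only runs if they are.

```r
# Only run if both packages are available
if (requireNamespace("tidyr", quietly = TRUE) && requireNamespace("dplyr", quietly = TRUE)) {
  iris_long <- iris |>
    tidyr::pivot_longer(
      cols = !"Species",       # Columns selected with quoted names
      names_to = "measure",
      values_to = "value"
    ) |>
    dplyr::arrange(dplyr::desc(.data$value))  # Explicit .data pronoun and explicit sort order

  print(head(iris_long))
}
```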
The %>% pipe (from magrittr) is widely used and also allowed, but consider whether %>% provides any advantage over the native |> pipe, introduced in R 4.1.0.
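One case where the two pipes differ is passing the left-hand side to an argument other than the first. The sketch below compares the two under the assumption of R 4.1.0 or later; the magrittr part only runs if magrittr is installed, and the variable names are illustrative.

```r
# The native pipe passes the left-hand side as the first argument
total <- c(1, 4, 9) |> sqrt() |> sum()

# To pass it elsewhere, the native pipe can call an anonymous function
label_native <- "a" |> (\(x) paste("b", x))()

# magrittr offers the `.` placeholder for the same purpose
if (requireNamespace("magrittr", quietly = TRUE)) {
  `%>%` <- magrittr::`%>%`
  label_magrittr <- "a" %>% paste("b", .)
}

print(total)
print(label_native)
```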
The tidyverse style guide mentions three forms of assignment.
We prefer the variable name and the assignment on the same line, using <-:
# Good
iris_long <- iris |>
  tidyr::gather("measure", "value", !"Species") |>
  dplyr::arrange(-.data$value)
# Acceptable
accidentally_long_variable_name_i_made <-
  also_long_dataset_name |>
  dplyr::filter(.data$box_owner == "Pandora")
# Bad
iris |>
  tidyr::gather("measure", "value", !"Species") |>
  dplyr::arrange(-.data$value) ->
  iris_long
When piping into ggplot, indenting the additional layers (added with +) one level further makes the code more readable:
# Allowed
iris |>
  dplyr::filter(.data$Species == "setosa") |>
  ggplot(aes(x = Sepal.Width, y = Sepal.Length)) +
    geom_point()
When writing tests, group them in smaller rather than larger sections.
Each test should test one functionality.
Typically, this comes up when testing R6 classes.
Instead of one large test called "MyClass works", write several smaller tests, each covering an individual functionality of MyClass.
Good:
test_that("MyClass initialize works", {
  # Test with defaults
  expect_no_condition(MyClass$new())
  # Test with non-defaults
  expect_no_condition(MyClass$new(argument = TRUE))
  # Test malformed inputs
  expect_error(MyClass$new(non_existent_argument = FALSE))
})
test_that("MyClass test_function works", {
  # Create new object for testing
  m <- MyClass$new()
  # Test function with defaults
  expect_no_condition(m$test_function())
  # Test malformed inputs
  expect_error(m$test_function(non_existent_argument = FALSE))
})
Bad:
test_that("MyClass works", {
  # Test with defaults
  expect_no_condition(MyClass$new())
  # Test with non-defaults
  expect_no_condition(MyClass$new(argument = TRUE))
  # Test malformed inputs
  expect_error(MyClass$new(non_existent_argument = FALSE))
  # Create new object for testing
  m <- MyClass$new()
  # Test function with defaults
  expect_no_condition(m$test_function())
  # Test malformed inputs
  expect_error(m$test_function(non_existent_argument = FALSE))
})
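The tests above refer to a class called MyClass without defining it. To make them concrete, here is a minimal, hypothetical R6 class that they could run against. It assumes the R6 package is installed; the field and method bodies are placeholders, not part of diseasy.

```r
# Only define the class if R6 is available
if (requireNamespace("R6", quietly = TRUE)) {
  # Hypothetical class, only here to make the tests above concrete
  MyClass <- R6::R6Class(
    "MyClass",
    public = list(
      argument = NULL,
      initialize = function(argument = FALSE) {
        self$argument <- argument
      },
      test_function = function() {
        return(self$argument)
      }
    )
  )

  # An unknown argument to the constructor raises an error, as the tests expect
  m <- MyClass$new(argument = TRUE)
  attempt <- try(MyClass$new(non_existent_argument = FALSE), silent = TRUE)
  unknown_argument_fails <- inherits(attempt, "try-error")
}
```

Because each method is small and independent, each one maps naturally onto its own test_that block.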