Add row_count() to count specific values row-wise #553

Merged
merged 13 commits on Oct 11, 2024
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,7 +1,7 @@
Type: Package
Package: datawizard
Title: Easy Data Wrangling and Statistical Transformations
-Version: 0.13.0.2
+Version: 0.13.0.4
Authors@R: c(
person("Indrajeet", "Patil", , "[email protected]", role = "aut",
comment = c(ORCID = "0000-0003-1995-6531")),
1 change: 1 addition & 0 deletions NAMESPACE
@@ -296,6 +296,7 @@ export(reshape_longer)
export(reshape_wider)
export(reverse)
export(reverse_scale)
export(row_count)
export(row_means)
export(row_to_colnames)
export(rowid_as_column)
2 changes: 2 additions & 0 deletions NEWS.md
@@ -6,6 +6,8 @@ CHANGES
variables, can now also be a character vector with quoted variable names,
including a colon to indicate a range of several variables (e.g. `"cyl:gear"`).

* New function `row_count()`, to calculate row-wise sums of specific values.

BUG FIXES

* `describe_distribution()` no longer errors if the sample was too sparse to compute
101 changes: 101 additions & 0 deletions R/row_count.R
@@ -0,0 +1,101 @@
#' @title Count specific values row-wise
#' @name row_count
#' @description `row_count()` mimics base R's `rowSums()`, but counts how often
#' a specific value, indicated by `count`, occurs in each row. Hence, it is
#' equivalent to `rowSums(x == count, na.rm = TRUE)`, but offers some more
#' options, including strict comparisons. Comparisons using `==` coerce both
#' operands to a common type, thus both `2 == 2` and `"2" == 2` are `TRUE`. In
#' `row_count()`, it is also possible to make "type safe" comparisons using the
#' `exact` argument, where `"2" == 2` is not treated as identical.
#'
#' @param data A data frame with at least two columns, in which the number of
#' occurrences of a specific value is counted row-wise.
#' @param count The value for which the row sum should be computed. May be a
#' numeric value, a character string (for factors or character vectors), `NA` or
#' `Inf`.
#' @param exact Logical, if `TRUE`, `count` matches only values of the same type
#' (i.e. when `count = 2`, the value `"2"` is not counted and vice versa).
#' By default, when `exact = FALSE`, `count = 2` also matches `"2"`. See
#' 'Examples'.
#'
#' @inheritParams extract_column_names
#' @inheritParams row_means
#'
#' @return A vector with row-wise counts of values specified in `count`.
#'
#' @examples
#' dat <- data.frame(
#' c1 = c(1, 2, NA, 4),
#' c2 = c(NA, 2, NA, 5),
#' c3 = c(NA, 4, NA, NA),
#' c4 = c(2, 3, 7, 8)
#' )
#'
#' # count all 2s per row
#' row_count(dat, count = 2)
#' # count all missing values per row
#' row_count(dat, count = NA)
#'
#' dat <- data.frame(
#' c1 = c("1", "2", NA, "3"),
#' c2 = c(NA, "2", NA, "3"),
#' c3 = c(NA, 4, NA, NA),
#' c4 = c(2, 3, 7, Inf)
#' )
#' # count all 2s and "2"s per row
#' row_count(dat, count = 2)
#' # only count 2s, but not "2"s
#' row_count(dat, count = 2, exact = TRUE)
#'
#' @export
row_count <- function(data,
select = NULL,
exclude = NULL,
count = NULL,
exact = FALSE,
ignore_case = FALSE,
regex = FALSE,
verbose = TRUE) {
# evaluate arguments
select <- .select_nse(select,
data,
exclude,
ignore_case = ignore_case,
regex = regex,
verbose = verbose
)

if (is.null(count)) {
insight::format_error("`count` must be a valid value (including `NA` or `Inf`), but not `NULL`.")
}

if (is.null(select) || length(select) == 0) {
insight::format_error("No columns selected.")
}

data <- .coerce_to_dataframe(data[select])

# check if we have a data frame with at least two columns
if (ncol(data) < 2) {
insight::format_error("`data` must be a data frame with at least two columns.")
}

# special case: count missing
if (is.na(count)) {
rowSums(is.na(data))
} else {
# comparisons in R using == coerce both values to a common type, i.e.
# 2 == "2" is TRUE. If `exact = TRUE`, we only want 2 == 2 or "2" == "2".
# To achieve this, we simply compute the comparison on numeric or
# non-numeric columns only, depending on the type of `count`.
if (isTRUE(exact)) {
numeric_columns <- vapply(data, is.numeric, TRUE)
if (is.numeric(count)) {
data <- data[numeric_columns]
} else {
data <- data[!numeric_columns]
}
}
rowSums(data == count, na.rm = TRUE)
}
}
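
The comparison logic of `row_count()` above can be tried without datawizard. The following is a minimal base R sketch under stated assumptions: the helper name `count_rowwise()` and the example data are hypothetical, and column selection and argument checks are left out.

```r
# Hypothetical helper (not part of datawizard): mirrors only the core
# comparison logic of row_count(), using base R.
count_rowwise <- function(data, count, exact = FALSE) {
  # `data == NA` is NA for every cell, so missing values need is.na()
  if (is.na(count)) {
    return(rowSums(is.na(data)))
  }
  if (isTRUE(exact)) {
    # keep only the columns whose type matches the type of `count`
    numeric_columns <- vapply(data, is.numeric, logical(1))
    data <- if (is.numeric(count)) data[numeric_columns] else data[!numeric_columns]
  }
  rowSums(data == count, na.rm = TRUE)
}

dat <- data.frame(
  c1 = c("1", "2", NA, "3"),
  c2 = c(NA, "2", NA, "3"),
  c3 = c(NA, 4, NA, NA),
  c4 = c(2, 3, 7, Inf),
  stringsAsFactors = FALSE
)

# `==` coerces before comparing, so the string "2" equals the number 2
coerced <- "2" == 2

loose <- count_rowwise(dat, count = 2)                  # counts 2 and "2"
strict <- count_rowwise(dat, count = 2, exact = TRUE)   # counts numeric 2 only
strings <- count_rowwise(dat, count = "2", exact = TRUE) # counts "2" only
missing <- count_rowwise(dat, count = NA)               # counts NA per row
```

Filtering columns by type keeps the comparison itself a plain `==`, at the cost that a column is either compared entirely or skipped entirely.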
120 changes: 120 additions & 0 deletions man/row_count.Rd

1 change: 1 addition & 0 deletions pkgdown/_pkgdown.yaml
@@ -71,6 +71,7 @@ reference:
- kurtosis
- smoothness
- skewness
- row_count
- row_means
- weighted_mean
- mean_sd
39 changes: 39 additions & 0 deletions tests/testthat/test-row_count.R
@@ -0,0 +1,39 @@
test_that("row_count", {
d_mn <- data.frame(
c1 = c(1, 2, NA, 4),
c2 = c(NA, 2, NA, 5),
c3 = c(NA, 4, NA, NA),
c4 = c(2, 3, 7, 8)
)
expect_identical(row_count(d_mn, count = 2), c(1, 2, 0, 0))
expect_identical(row_count(d_mn, count = NA), c(2, 0, 3, 1))
d_mn <- data.frame(
c1 = c("a", "b", NA, "c"),
c2 = c(NA, "b", NA, "d"),
c3 = c(NA, 4, NA, NA),
c4 = c(2, 3, 7, Inf),
stringsAsFactors = FALSE
)
expect_identical(row_count(d_mn, count = "b"), c(0, 2, 0, 0))
expect_identical(row_count(d_mn, count = Inf), c(0, 0, 0, 1))
})

test_that("row_count, errors or messages", {
data(iris)
expect_error(expect_warning(row_count(iris, select = "abc")), regexp = "must be a valid")
expect_error(expect_warning(row_count(iris, select = "abc", count = 3)), regexp = "No columns")
expect_error(row_count(iris[1], count = 3), regexp = "with at least")
})

test_that("row_count, exact match", {
d_mn <- data.frame(
c1 = c("1", "2", NA, "3"),
c2 = c(NA, "2", NA, "3"),
c3 = c(NA, 4, NA, NA),
c4 = c(2, 3, 7, Inf),
stringsAsFactors = FALSE
)
expect_identical(row_count(d_mn, count = 2, exact = FALSE), c(1, 2, 0, 0))
expect_identical(row_count(d_mn, count = 2, exact = TRUE), c(1, 0, 0, 0))
expect_identical(row_count(d_mn, count = "2", exact = TRUE), c(0, 2, 0, 0))
})
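
The tests above cover `NA`, `Inf` and string values for `count`. As a rough illustration of why only `NA` needs its own branch, here is a base R sketch that does not call `row_count()`. The data frame is copied from the first test and the variable names are hypothetical.

```r
d <- data.frame(
  c1 = c("a", "b", NA, "c"),
  c2 = c(NA, "b", NA, "d"),
  c3 = c(NA, 4, NA, NA),
  c4 = c(2, 3, 7, Inf),
  stringsAsFactors = FALSE
)

# comparing with NA returns NA for every cell, so nothing is counted
na_via_equals <- rowSums(d == NA, na.rm = TRUE)

# is.na() is what actually detects the missing values
na_via_is_na <- rowSums(is.na(d))

# Inf and strings work with a plain `==` comparison
inf_count <- rowSums(d == Inf, na.rm = TRUE)
b_count <- rowSums(d == "b", na.rm = TRUE)
```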