Rdatatable · venom1204 · Dec 22, 2024
@@ -1014,3 +1014,5 @@ rowwiseDT(
 20. Some clarity is added to `?GForce` for the case when subtle changes to `j` produce different results because of differences in locale. Because `data.table` _always_ uses the "C" locale, small changes to queries which activate/deactivate GForce might cause confusingly different results when sorting is involved, [#5331](https://github.com/Rdatatable/data.table/issues/5331). The inspirational example compared `DT[, .(max(a), max(b)), by=grp]` and `DT[, .(max(a), max(tolower(b))), by=grp]` -- in the latter case, GForce is deactivated owing to the _ad-hoc_ column, so the result for `max(a)` might differ for the two queries. An example is added to `?GForce`. As always, there are several options to guarantee consistency, for example (1) use namespace qualification to deactivate GForce: `DT[, .(base::max(a), base::max(b)), by=grp]`; (2) turn off all optimizations with `options(datatable.optimize = 0)`; or (3) set your R session to always sort in C locale with `Sys.setlocale("LC_COLLATE", "C")` (or temporarily with e.g. `withr::with_locale()`). Thanks @markseeto for the example and @michaelchirico for the improved documentation.
 
 # data.table v1.14.10 (Dec 2023) back to v1.10.0 (Dec 2016) has been moved to [NEWS.1.md](https://github.com/Rdatatable/data.table/blob/master/NEWS.1.md)
+
+`merge()` now gives a more specific error when `by` names columns that do not exist: the message lists which names are missing from `x` and which are missing from `y`. Fixes #6556. Thanks @venom1204 for the PR.
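To make the behaviour described in this entry concrete, here is a minimal base-R sketch of the same check. The helper name `check_by` and the sample column names are assumptions made for illustration; this is not the function used inside `merge.data.table`, and it calls `stop()` rather than the package's own `stopf()`.

```r
# Hypothetical helper (not part of data.table): mirrors the `by` check in base R.
check_by = function(by, nm_x, nm_y) {
  missing_in_x = setdiff(by, nm_x)  # names in `by` that x does not have
  missing_in_y = setdiff(by, nm_y)  # names in `by` that y does not have
  if (length(missing_in_x) == 0L && length(missing_in_y) == 0L)
    return(invisible(TRUE))
  lines = "Columns listed in `by` must be valid column names in both data.tables."
  if (length(missing_in_x))
    lines = c(lines, paste0("Missing in x: ", toString(missing_in_x)))
  if (length(missing_in_y))
    lines = c(lines, paste0("Missing in y: ", toString(missing_in_y)))
  stop(paste(lines, collapse = "\n"), call. = FALSE)
}

# x has columns (x, y) and y has columns (a, b), so each `by` name is missing from one side.
msg = tryCatch(
  check_by(c("x", "a"), nm_x = c("x", "y"), nm_y = c("a", "b")),
  error = function(e) conditionMessage(e)
)
cat(msg, "\n")
```

Computing the two `setdiff()` results separately is what lets the message say which side is at fault, which a single test against `intersect(nm_x, nm_y)` cannot do.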
@@ -50,8 +50,19 @@
       by = intersect(nm_x, nm_y)
     if (length(by) == 0L || !is.character(by))
       stopf("A non-empty vector of column names for `by` is required.")
-    if (!all(by %chin% intersect(nm_x, nm_y)))
-      stopf("Elements listed in `by` must be valid column names in x and y")
+
+    # Report exactly which `by` columns are absent from x and from y
+    missing_in_x = setdiff(by, nm_x)
+    missing_in_y = setdiff(by, nm_y)
+    if (length(missing_in_x) > 0L || length(missing_in_y) > 0L) {
+      error_msg = "Columns listed in `by` must be valid column names in both data.tables.\n"
+      if (length(missing_in_x) > 0L)
+        error_msg = paste0(error_msg, sprintf("✖ Missing in x: %s\n", paste(missing_in_x, collapse = ", ")))
+      if (length(missing_in_y) > 0L)
+        error_msg = paste0(error_msg, sprintf("✖ Missing in y: %s", paste(missing_in_y, collapse = ", ")))
+      # Pass the assembled text as an argument, not as the format, so a "%" in a column name is not parsed
+      stopf("%s", error_msg)
+    }
+
     by = unname(by)
     by.x = by.y = by
   }
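One detail of the hunk above deserves a sketch: the error text is assembled from user-supplied column names. Assuming `stopf()` treats its first argument as an `sprintf`-style format (as its name suggests), a name containing `%` could be misread as a conversion specification. The column name `rate%s` below is hypothetical, and base `sprintf()` stands in for the package helper.

```r
# Hypothetical column name containing a percent sign.
msg = paste0("Missing in x: ", "rate%s")

# Used as the format string, "%s" is parsed as a conversion with no argument to fill it.
as_format = tryCatch(sprintf(msg), error = function(e) "error")

# Used as an argument to a fixed "%s" format, the text passes through unchanged.
as_argument = sprintf("%s", msg)

print(as_format)
print(as_argument)
```

Passing the assembled message through a constant `"%s"` format keeps the output identical for ordinary names while avoiding the misparse for unusual ones.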

@@ -20697,3 +20697,46 @@ if (test_bit64) {
   test(2300.3, DT1[DT2, on='id'], error="Incompatible join types")
   test(2300.4, DT2[DT1, on='id'], error="Incompatible join types")
 }
+
+if (test_bit64) {
+  # Test for identifying missing columns in the `by` argument
+  DT1 = data.table(x = as.integer64(1:5), y = letters[1:5])
+  DT2 = data.table(a = as.integer64(6:10), b = letters[6:10])
+
+  # Missing column in both data tables
+  test(2301.1, {
+    tryCatch({
+      merge.data.table(DT1, DT2, by = "z")
+    }, error = function(e) {
+      e$message
+    })
+  }, "Columns listed in `by` must be valid column names in both data.tables.\n✖ Missing in x: z\n✖ Missing in y: z")
+
+  # Multiple missing columns
+  test(2301.2, {
+    tryCatch({
+      merge.data.table(DT1, DT2, by = c("x", "a"))
+    }, error = function(e) {
+      e$message
+    })
+  }, "Columns listed in `by` must be valid column names in both data.tables.\n✖ Missing in x: a\n✖ Missing in y: x")
+
+  # Valid join columns: `by` needs each name in both tables, so use by.x/by.y here
+  # (`y` exists only in DT1, `b` only in DT2); no values match, so the result has 0 rows
+  test(2301.3, nrow(merge.data.table(DT1, DT2, by.x = "y", by.y = "b")), 0L)
+
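+  # Incompatible join types. As written, `by = c("x", "a")` names a column missing from each table, so the column check errors before any type comparison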
+  DT2[, a := as.numeric(a)]
+  test(2301.4, {
+    tryCatch({
+      merge.data.table(DT1, DT2, by = c("x", "a"))
+    }, error = function(e) {
+      e$message
+    })
+  }, "Incompatible join types")
+}