Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rationalize (unify if possible) retrieval of symbol names for error messages #65

Open
gmbecker opened this issue Aug 27, 2023 · 8 comments
Labels
Analyses Issues related to analyses C Involves C code Interest From R-core Interest/support has been shown from at least one R-core member

Comments

@gmbecker
Copy link

Endorsed by Luke

This is done (at least) 3 different ways in base C code:

   2 translateChar(PRINTNAME(...))
  27 EncodeChar(PRINTNAME(...))
 122 CHAR(PRINTNAME(...))

 Can we standardize only one? If not, make it easier to see which
 why a particular one is needed?
@gmbecker gmbecker added Analyses Issues related to analyses C Involves C code Interest From R-core Interest/support has been shown from at least one R-core member labels Aug 27, 2023
@aitap
Copy link
Contributor

aitap commented Aug 28, 2023 via email

@gmbecker
Copy link
Author

gmbecker commented Aug 28, 2023 via email

@ltierney
Copy link

That is part of the question, i.e. is translateChar ever needed and if it is, should it be used more often?

EncodeString does something different as described in printutils.c. But again one question is whether it should be used more or less.

@aitap
Copy link
Contributor

aitap commented Aug 29, 2023

I can work on checking this. I think it's possible using a combination of static call graph analysis and adding a check to SET_PRINTNAME() and seeing what breaks.

On the other hand, nothing currently prevents the C code from creating a symbol in non-native encoding:

#define R_NO_REMAP
#include <R.h>
#include <Rinternals.h>
#include <R_ext/Rdynload.h>

LibExtern Rboolean utf8locale;

SEXP ohno(void) {
	SEXP sym = PROTECT(Rf_allocSExp(SYMSXP));
	SET_PRINTNAME(sym,
		utf8locale ? Rf_mkCharCE("\xff", CE_LATIN1)
		: Rf_mkCharCE("\xc3\xbf", CE_UTF8)
	);
	SET_SYMVALUE(sym, R_UnboundValue);
	SET_DDVAL(sym, 0);
	UNPROTECT(1);
	return sym;
}

Depending on whether the current locale is UTF-8 or not, it breaks in different ways.

@aitap
Copy link
Contributor

aitap commented Sep 2, 2023

R symbols begin their life by being allocated using allocSExp. The
following Coccinelle script locates all uses of allocSExp where the
type argument is the constant SYMSXP or a variable:

@@
SEXPTYPE x;
@@
*allocSExp(
(
  SYMSXP
|
  x
)
 )
spatch --recursive-includes allocSExp.cocci --dir R-devel/src

(Coccinelle speaks in unified diffs, and it indicates matches by marking
them for removal.)

Let's follow the functions that allocate symbols:

allocSExp(SYMSXP)

  • main/names.c
    • static SEXP mkSymMarker(SEXP pname)
      This function sets the PRINTNAME of the symbol to pname. It is
      only called with R_NilValue or mkChar(...), so is guaranteed not
      to produce symbols with non-native encoding.
  • main/dstruct.c
    • SEXP mkSYMSXP(SEXP name, SEXP value)
      This function sets the PRINTNAME of the symbol to name. Let's
      locate its callers:
      @@
      expression x, y;
      @@
      *mkSYMSXP(x,y)
      
    • main/names.c
      • SEXP install(const char *name)
        Uses mkChar(name), so always returns a native-encoding symbol.
      • SEXP installNoTrChar(SEXP charSXP)
        Documented to be equivalent to install(CHAR(charSXP)). May
        create symbols with UTF-8 encoding when a UTF-8 locale is in
        effect or Latin-1 symbols when a Latin-1 locale is in effect. A
        CHARSXP in the wrong encoding will be reinterpreted as if it was
        in the native encoding, so any use of it with incompatible
        CHARSXP is a bug. A comment also says that installTrChar
        should be used instead in almost all cases.
        @@
        expression x;
        @@
        *installNoTrChar(x)
        
      • main/sysutils.c: SEXP installTrChar(SEXP x)
        This function delegates to installNoTrChar if no translation is
        needed, otherwise translates the CHARSXP and calls install.
  • There are a few uses of allocSExp in main/saveload.c and
    main/serialize.c, but they all allocate other types. In particular,
    the code in these two files always creates symbols using either
    install() or installTrChar().

SEXP installNoTrChar(SEXP charSXP) is also mentioned in Rinternals.h, so could conceivably be called by packages. Is this enough of a proof that symbol names are valid in the native encoding?

@aitap
Copy link
Contributor

aitap commented Sep 11, 2023

Symbols ought to be treated as immutable, but it's worth checking for uses of SET_PRINTNAME just in case:

@@
SEXP x, y;
@@
*SET_PRINTNAME(x,y)

The only two users are static SEXP mkSymMarker(SEXP pname) and attribute_hidden SEXP mkSYMSXP(SEXP name, SEXP value), which we've already discussed.

Searching for allocSExp on GitHub CRAN mirror gives a few hits, but almost all of them use a constant argument that is not SYMSXP and the remaining uses are obviously not creating a symbol either. mkSYMSXP gives 0 results. SET_PRINTNAME gives 1 result that is not about setting a non-native-encoding symbol. installNoTrChar gives a few Rust declarations. (All search links require logging in.)

This was about encoding safety. What about other operations on PRINTNAME?

The EncodeChar calls are there for a reason. They are used when formatting error messages, and they are needed even for native-encoding symbols in order to reproduce special characters like \n when printing Error: object '\n' not found.

The two uses of translateChar may need to be changed.

  • One is in src/main/subscript.c, OneIndex(SEXP x, SEXP s, R_xlen_t nx, int partial, SEXP *newname, int pos, SEXP call), used in e.g. cars[[as.name('dist')]] <- NA. Since symbols normally exist in the native encoding, translateChar is harmless but extraneous.
  • The other is in src/main/debug.c, static void memtrace_stack_dump(void), used to print the stack trace when a memory-traced object is duplicated. Since a function could concievably be named something weird like \r\n\t\v, this should probably be changed to EncodeChar().

What about the majority of the CHAR(PRINTNAME(.)) uses? Many of these are used in error messages, so should probably be changed to EncodeChar() too. I can prepare a patch to do that.

@ltierney
Copy link

A patch would be great. Happy to review it.

@aitap
Copy link
Contributor

aitap commented Oct 17, 2023

Work in progress on R-devel, but involves a bit of yak shaving. The best way to present a symbol in an error message is EncodeChar, but it uses a statically-allocated buffer that may be overwritten any time R is called. The only safe way to use it with errorcall() is errorcall_cpy(). It's much safer to make errorcall() and warningcall() stringify their format arguments before calling R (like currently done by error() and warning() using a temporary string buffer). We can save a lot of stack space if we rework multiple functions in errors.c to perform the formatting only once and share the buffer while doing that. And so on...

Once this is done, I intend to implement the function EncodeSymbol(sym, buf, buf_size) suggested by Luke Tierney to make it possible to EncodeChar() multiple different arguments in a single call to error(...).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Analyses Issues related to analyses C Involves C code Interest From R-core Interest/support has been shown from at least one R-core member
Projects
None yet
Development

No branches or pull requests

3 participants