Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should we allow more identifiers instead of having @name? #21

Open
tlively opened this issue Aug 31, 2023 · 13 comments
Open

Should we allow more identifiers instead of having @name? #21

tlively opened this issue Aug 31, 2023 · 13 comments

Comments

@tlively
Copy link
Member

tlively commented Aug 31, 2023

The utility of the @name annotation comes from being able to use more than the restricted set of ASCII characters allowed in the identifiers already present in the text format. But it's kind of awkward to have both an identifier and a name annotation in the text format when only one can appear in the binary name section. The status quo where you can have arbitrary strings in the name section but cannot represent them as identifiers is also awkward. Neither allow full bidirectional round-tripping of identifiers and names.

I propose that instead of adding a @name annotation, we expand the grammar for identifiers to allow arbitrary strings. With the extended name section proposal, this would allow full bidirectional round tripping of identifiers/names.

The specific grammar for identifiers could be:

id ::= '$' idchar+ | '$' string

There would be no ambiguity in this grammar because strings always start with ", which is not included in idchar.

If this is the direction we decide to go in, I would suggest that the extension of the grammar be part of the extended name section proposal. cc @ashleynh

@yuri91
Copy link
Contributor

yuri91 commented Sep 1, 2023

Just thinking out loud about the implementation of this in the new custom section support in the interpreter:

We can still use the current custom section handler for the name section, and for text parsing, instead of getting the names from @name annotations, get them from the identifiers in the text.

It is a bit annoying that the AST does not keep the names, so they need to be re-parsed from text, but it's doable.

@rossberg
Copy link
Member

rossberg commented Sep 6, 2023

This seems fine, as long as there is no expectation that we are getting into any of the Unicode messes surrounding normalisation etc., i.e., for determining id equality, we are comparing just raw code points (which may or may not be what users might expect).

@rossberg
Copy link
Member

rossberg commented Sep 6, 2023

PS: Also, a reminder that Unicode spoofing has become a real security concern. Allowing completely arbitrary Unicode strings in something as central and sensitive as identifiers is elevating that risk. It's hard to predict how relevant that is for Wasm, but I could imagine security-sensitive hand-written Wasm in the future, which makes me feel uncomfortable about this. Perhaps we should try to get the opinion of some knowledgeable security experts?

@dzfrias
Copy link

dzfrias commented Sep 19, 2023

Just to jump into this, I've been working on an implementation of the @name annotation at WebAssembly/wabt#2297, and I think this new syntax is a good idea from both a grammar standpoint and I believe is just more intuitive in general.

Having a "binary-specific" name (from the @name annotation) detached from the actual identifier posed a few of questions while I was been working on the implementation:

  • Should @name be valid when there's no corresponding $name when writing an identifier for a function, for example? I'd assume not, or else the grammar would have to change to support @name as a substitute for $name
  • When writing names from the binary format to the text format (maybe an identifier that has spaces in it), how should it be represented in the $name form if both $name and @name are required?

Given these problems, I don't think @name helps very much when trying to roundtrip invalid names. I think the $"name" syntax is a much better idea, both for implementing a tool like wasm2wat and wat2wasm and for when writing WAT files in general.

@alexcrichton
Copy link
Contributor

Currently in wasm-tools when a wasm file is printed any names of functions which are not valid identifiers cause a (@name ...) annotation to get printed along with a synthesized $... identifer that is as well valid. I personally find the idea here, being able to print $"...", is a great idea as it would help simplify this and make it easier to read as well.

One gotcha though is that the custom name section has no requirements on uniqueness of identifiers but the text format does have requirements on this. For example this module:

(module
  (func (@name "foo"))
  (func (@name "foo"))
)

is not valid to print with $foo (or $"foo") for both identifiers. This means that if the (@name ...) annotation were removed then I don't think there would be a remaining means to print a module like this in the text representation.

For comparison, currently wasm-tools print over the above module prints:

(module
  (type (;0;) (func))
  (func $foo (;0;) (type 0))
  (func $#func1<foo> (@name "foo") (;1;) (type 0))
)

where the second function gets @name to indicate its name in addition to a synthesized $-name.

@tlively
Copy link
Member Author

tlively commented Oct 12, 2023

Ahh, good point that we so far would not be able to round-trip binaries with repeated names in the name section. We could solve that by allowing repeated identifiers in the text format and using indices (with comments) rather than identifiers to disambiguate at the use sites.

tlively added a commit to WebAssembly/binaryen that referenced this issue Feb 6, 2024
In addition to normal identifiers, support parsing identifiers of the format
`$"..."`. This format is not yet allowed by the standard, but it is a popular
proposed extension (see WebAssembly/spec#617 and
WebAssembly/annotations#21).

Binaryen has historically allowed a similar format and has supported arbitrary
non-standard identifier characters, so it's much easier to support this extended
syntax than to fix everything to use the restricted standard syntax.
tlively added a commit to WebAssembly/binaryen that referenced this issue Feb 6, 2024
In addition to normal identifiers, support parsing identifiers of the format
`$"..."`. This format is not yet allowed by the standard, but it is a popular
proposed extension (see WebAssembly/spec#617 and
WebAssembly/annotations#21).

Binaryen has historically allowed a similar format and has supported arbitrary
non-standard identifier characters, so it's much easier to support this extended
syntax than to fix everything to use the restricted standard syntax.
tlively added a commit to WebAssembly/binaryen that referenced this issue Feb 6, 2024
In addition to normal identifiers, support parsing identifiers of the format
`$"..."`. This format is not yet allowed by the standard, but it is a popular
proposed extension (see WebAssembly/spec#617 and
WebAssembly/annotations#21).

Binaryen has historically allowed a similar format and has supported arbitrary
non-standard identifier characters, so it's much easier to support this extended
syntax than to fix everything to use the restricted standard syntax.
@tlively
Copy link
Member Author

tlively commented Apr 3, 2024

@rossberg, what do we need to make progress here? Given that the names section is used only as a debugging aid, I haven't heard anyone else voice concerns about unicode spoofing risks.

@rossberg
Copy link
Member

rossberg commented Apr 4, 2024

I'm okay doing this change, but it requires some spec work. The easy part is to extend the definition of identifiers to

id ::= '$' idchar+ | '$' name

But that's far from enough. The text format spec currently relies on equality of identifiers being completely syntactic. If we allow strings, then syntax and denotation can differ, so we'll need to introduce equivalence classes, and need to define that elaboration. For example $"\41" is equal not just to itself but also to $"A" (and presumably $A).

@bvisness
Copy link
Contributor

After today's CG discussion, I think it makes more sense to keep the @name annotation, since it doesn't seem like identifiers will ever be able to express the full range of what the name section allows.

I would also suggest that we shouldn't worry about round-tripping identifiers. It's not a bad thing to put identifiers in the name section, or to pull identifiers from the name section, but identifiers don't really correspond to anything in the source language and hence I don't think it's that critical to preserve them. (To that end, I don't think I see the value of quoted identifiers if also keeping @name.)

@rossberg
Copy link
Member

I was about to merge #23, but I don't mind either way. They do add some complexity to the spec, o.t.o.h. it's handy to have an escape mechanism for richer Wasm identifiers.

How strongly does everybody feel?

@tlively
Copy link
Member Author

tlively commented May 21, 2024

I very much want the extended identifiers, even though we're also keeping the name annotation. For the common case where there are not multiple or repeated names, the names can match 1:1, so being able to use the arbitrary names as identifiers is useful. If we don't have the extended identifier syntax, then tools would have to do their best to synthesize meaningful identifiers for names that are not also valid identifiers.

@alexcrichton
Copy link
Contributor

I would also cast my vote in favor of extended identifiers. Synthesized identifiers are (subjectively to me) much harder to read than $"foo" and not having to synthesize an identifier at all would be useful.

@rossberg
Copy link
Member

Okay, thanks, I'll move ahead with merging the PR then.

radekdoulik pushed a commit to dotnet/binaryen that referenced this issue Jul 12, 2024
In addition to normal identifiers, support parsing identifiers of the format
`$"..."`. This format is not yet allowed by the standard, but it is a popular
proposed extension (see WebAssembly/spec#617 and
WebAssembly/annotations#21).

Binaryen has historically allowed a similar format and has supported arbitrary
non-standard identifier characters, so it's much easier to support this extended
syntax than to fix everything to use the restricted standard syntax.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants