Support Unicode in "Name" Token #68

Open
DER-SSt opened this issue Dec 8, 2024 · 1 comment
DER-SSt commented Dec 8, 2024

Is your feature request related to a problem? Please describe.

Current Definition of "Name" Token

Only the ASCII letters a-z and A-Z, the characters '_' and '$', and (after the first character) the digits 0-9 are allowed.

token Name =
( 'a'..'z' | 'A'..'Z' | '_' | '$' )
( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '$' )*;
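
To make the restriction concrete, here is a minimal Java sketch that transcribes the token definition above into a regular expression. The class name `AsciiNameCheck` and the method `isName` are hypothetical and exist only for this illustration; this is not MontiCore code.

```java
import java.util.regex.Pattern;

public class AsciiNameCheck {
    // Regex transcription of the Name token quoted above (illustration only):
    // first character: ASCII letter, '_' or '$'; then any number of ASCII letters, '_', digits or '$'.
    static final Pattern NAME = Pattern.compile("[a-zA-Z_$][a-zA-Z_0-9$]*");

    // Returns true if the whole string matches the current Name token.
    static boolean isName(String s) {
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isName("Cheese1")); // accepted: ASCII only
        System.out.println(isName("Käse"));    // rejected: contains "ä"
    }
}
```

Any name containing a non-ASCII letter is rejected by this pattern, which is the behavior the examples below run into.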

Limitation for UML-Languages

The Name token is used in nearly all MontiCore languages. Its restriction to ASCII prevents users from naming model elements in the vocabulary of their own domain and language.

For example, in class diagrams (CDs):

class Käse {      // "ä" not allowed
  bool flüßig;   // "ü" and "ß"
}

or in object diagrams (ODs):

object Époisses: Käse {    // "É"
  flüßig = false; 
}

and so on.

Limitation for General Languages

Other languages define names much more broadly. A MontiCore grammar for such a language must either keep the restrictive Name token, and then cannot parse all valid instances, or redefine the Name token, which makes the grammar hard to combine with other MontiCore languages.

Java:

https://docs.oracle.com/javase/specs/jls/se23/html/jls-3.html#jls-3.8

Letters and digits may be drawn from the entire Unicode character set, which supports most writing scripts in use in the world today, including the large sets for Chinese, Japanese, and Korean. This allows programmers to use identifiers in their programs that are written in their native languages.

αρετη is explicitly listed in the Java Language Specification as an example of a valid identifier.

XML:

XML also allows Unicode characters in identifiers. As a consequence, the MontiCore XML language overrides the Name token:
https://github.com/MontiCore/xml/blob/ed432849540eab55c952aabfa748b923c541b55c/src/main/grammars/de/monticore/lang/XMLBasis.mc4#L22-L47

Describe the solution you'd like?

Allow Unicode characters in the Name token in MCBasis.mc4. This lets developers write models in their native language and ensures that general-purpose languages such as Java and XML can be parsed without overriding the Name token.

The Unicode identifier standard (UAX #31) can serve as a language-independent basis: https://www.unicode.org/reports/tr31/

The Java runtime library also supports the Unicode identifier standard: https://docs.oracle.com/en/java/javase/23/docs/api/java.base/java/lang/Character.html#isUnicodeIdentifierStart(int)
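
As a rough sketch of what a check based on these methods could look like, the following Java snippet tests a string code point by code point with `Character.isUnicodeIdentifierStart` and `Character.isUnicodeIdentifierPart`. The class `UnicodeNameCheck` and the method `isUnicodeName` are hypothetical names for this illustration; how such a rule would be expressed in a MontiCore token is not shown here.

```java
public class UnicodeNameCheck {
    // Returns true if s is non-empty, starts with a Unicode identifier start
    // character, and continues only with Unicode identifier part characters.
    static boolean isUnicodeName(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isUnicodeIdentifierStart(first)) {
            return false;
        }
        // Iterate by code point so characters outside the BMP are handled correctly.
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isUnicodeIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Käse", "flüßig", "Époisses", "αρετη", "$x"}) {
            System.out.println(name + " -> " + isUnicodeName(name));
        }
    }
}
```

One design point to keep in mind: these methods do not treat '_' or '$' as identifier start characters, and '$' is not an identifier part either, whereas the current Name token allows both. A definition that should remain a superset of today's Name would presumably have to add them explicitly.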

Describe alternatives you've considered

No response

Additional context

No response

rumpe commented Dec 20, 2024

For continuity: redefining the content of "Name" would be disruptive, so that should not happen.

However, it makes sense to allow alternative definitions (extensions) of "Name", e.g. "UnicodeName", with strong integration into (a) parsing and (b) the symbol infrastructure; e.g. symbol X = UnicodeName ... should also work.
The restricted "Name" should remain usable in the same models, in different places.

A further point to clarify: is it enough to define one fixed new nonterminal (like "UnicodeName"), or do we need a general "Name" extension/redefinition mechanism that allows introducing such nonterminals on the fly? (I hope not.)

PS: Some languages (e.g. SysML) allow spaces within names but require such names to be enclosed in "...", as in "Driver Seat".
