Support Unicode in "Name" Token #68

DER-SSt · 2024-12-08T16:06:47Z

Is your feature request related to a problem? Please describe.

Current Definition of "Name" Token

Only ASCII letters (a-z, A-Z), '_' and '$' are allowed, plus the digits 0-9 after the first character.

monticore/monticore-grammar/src/main/grammars/de/monticore/MCBasics.mc4

Lines 20 to 22 in 585f2d6

    
           token Name = 
        
             ( 'a'..'z' | 'A'..'Z' | '_' | '$' ) 
        
             ( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '$' )*;

Limitation for UML-Languages

The Name token is used in nearly all MontiCore languages. Its restrictions make it harder for users to describe their problem domain in their own language.

e.g. CDs:

class Käse {      // "ä" not allowed
  bool flüßig;   // "ü" and "ß"
}

or ODs:

object Époisses: Käse {    // "É"
  flüßig = false; 
}

and so on.
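
The rejection of these names can be reproduced outside MontiCore with a small sketch. It assumes that a Java regular expression with the same character ranges is a fair stand-in for the Name token shown above; the class and method names are hypothetical and this is not MontiCore's generated lexer.

```java
import java.util.regex.Pattern;

// Hypothetical helper: approximates the current MCBasics "Name" token
// with an equivalent ASCII-only regular expression.
public final class CurrentNameToken {
    // First character: a-z, A-Z, '_' or '$'; following characters may also be 0-9.
    private static final Pattern NAME =
        Pattern.compile("[a-zA-Z_$][a-zA-Z_0-9$]*");

    private CurrentNameToken() {}

    /** Returns true if the whole string matches the ASCII-only Name pattern. */
    public static boolean matches(String candidate) {
        return candidate != null && NAME.matcher(candidate).matches();
    }
}
```

Under this approximation, an ASCII transliteration such as Kaese is accepted, while Käse, flüßig and Époisses are rejected because of their non-ASCII letters.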

Limitation for General Languages

Other languages define names much more broadly. A MontiCore grammar for such a language is either more restrictive and cannot parse all valid instances, or it redefines the Name token and becomes hard to combine with other MontiCore languages.

Java:

https://docs.oracle.com/javase/specs/jls/se23/html/jls-3.html#jls-3.8

Letters and digits may be drawn from the entire Unicode character set, which supports most writing scripts in use in the world today, including the large sets for Chinese, Japanese, and Korean. This allows programmers to use identifiers in their programs that are written in their native languages.

αρετη is explicitly mentioned in the Java specification as an allowed identifier.

XML:

XML also allows Unicode characters in identifiers. As a consequence, the MontiCore XML language overrides the Name token:
https://github.com/MontiCore/xml/blob/ed432849540eab55c952aabfa748b923c541b55c/src/main/grammars/de/monticore/lang/XMLBasis.mc4#L22-L47

Describe the solution you'd like?

Allow Unicode characters in the Name token in MCBasics.mc4. This lets developers create models closer to their native language, and ensures that general languages such as Java and XML can be parsed without overriding the Name token.

The Unicode identifier standard (UAX #31) can serve as a language-independent basis: https://www.unicode.org/reports/tr31/

The Java runtime also supports the Unicode identifier standard: https://docs.oracle.com/en/java/javase/23/docs/api/java.base/java/lang/Character.html#isUnicodeIdentifierStart(int)
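
To illustrate what a Unicode-aware name check could look like, here is a minimal sketch built on Character.isUnicodeIdentifierStart(int) and Character.isUnicodeIdentifierPart(int). The class and method names are hypothetical. One assumption is labeled in the code: '_' and '$' are accepted explicitly, because the existing Name token allows them but the Unicode identifier methods do not accept '$' at all, nor '_' as a first character.

```java
// Hypothetical helper: checks a candidate name against the JDK's
// Unicode identifier predicates, iterating by code point so that
// characters outside the Basic Multilingual Plane are handled too.
public final class UnicodeNames {
    private UnicodeNames() {}

    // Assumption: keep '_' and '$' legal so every name accepted by the
    // current ASCII-only Name token stays valid.
    private static boolean isLegacyExtra(int codePoint) {
        return codePoint == '_' || codePoint == '$';
    }

    /** Returns true if the string is a non-empty Unicode identifier. */
    public static boolean isUnicodeName(String candidate) {
        if (candidate == null || candidate.isEmpty()) {
            return false;
        }
        int first = candidate.codePointAt(0);
        if (!Character.isUnicodeIdentifierStart(first) && !isLegacyExtra(first)) {
            return false;
        }
        int i = Character.charCount(first);
        while (i < candidate.length()) {
            int cp = candidate.codePointAt(i);
            if (!Character.isUnicodeIdentifierPart(cp) && !isLegacyExtra(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }
}
```

With this check, Käse, flüßig, Époisses and αρετη are all accepted, while names that start with a digit or contain a space are still rejected. Whether MontiCore's lexer could delegate to such predicates is a separate question that this sketch does not address.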

Describe alternatives you've considered

No response

Additional context

No response


rumpe · 2024-12-20T08:13:42Z

For continuity: redefining the content of "Name" would be disruptive, so that should not happen.

But it makes sense to allow alternative definitions (extensions) of "Name", e.g. "UnicodeName", with strong integration into
(a) parsing and (b) the symbol infrastructure, e.g. symbol X = UnicodeName ... should also work
(while still allowing the restricted "Name" to be used in other places of the same models).

Furthermore, to clarify: is it enough to define one fixed new nonterminal (like "UnicodeName"), or do we need a general "Name" extension/redefinition mechanism that allows introducing new such nonterminals on the fly? (I hope not.)

PS: Some languages (SysML) allow spaces within names, but enclose such names in quotes, as in "Driver Seat".
