Common HTML named entity handled badly within JSX (.tsx) files #47030

Closed
tshinnic opened this issue Dec 6, 2021 · 6 comments
Labels
Working as Intended: The behavior described is the intended behavior; this is not a bug

Comments

@tshinnic

tshinnic commented Dec 6, 2021

Bug Report

I've encountered a known HTML named entity that TSC does not recognize when it appears within a React JSX file (aka .tsx). Instead, it is emitted unchanged into the HTML page in its original entity text form.

Specifically, using &numsp; within JSX code does not work: the literal text "&numsp;" is displayed in the browser window. &numsp; does work when placed directly within index.html, and the numeric form &#x2007; works everywhere.

Looking at the intersection between Unicode "General Punctuation" (2000–206F) sections "Spaces" (2000-200A) and "Format Characters" (200B-200F) chart@Unicode...

and the list of known HTML entity names table@whatwg...

and looking at the definition of HTML entity names known to TSC TypeScript/src/compiler/transformers/jsx.ts...

TSC has 7 of these named HTML entities defined:

        ensp: 0x2002,
        emsp: 0x2003,
        thinsp: 0x2009,
        zwnj: 0x200C,
        zwj: 0x200D,
        lrm: 0x200E,
        rlm: 0x200F,

I have used 6 of these in projects.

The other known HTML entity names for 'spaces' are:

    2004   emsp13
    2005   emsp14
    2007   numsp
    2008   puncsp
    200A   hairsp
    200B   ZeroWidthSpace

I have used the last 4 in projects as well, the last only for experiments.
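The effect of a fixed name table can be sketched as follows. This is only an illustration, not the code in jsx.ts: the function name decodeEntities and the table name knownEntities are hypothetical, the table holds just the seven names quoted above, and the sketch handles only characters in the Basic Multilingual Plane.

```typescript
// Hypothetical subset of the name table quoted above; not the real jsx.ts source.
const knownEntities: Record<string, number> = {
    ensp: 0x2002,
    emsp: 0x2003,
    thinsp: 0x2009,
    zwnj: 0x200c,
    zwj: 0x200d,
    lrm: 0x200e,
    rlm: 0x200f,
};

// Decode entity references in a piece of text.
// Numeric forms (&#x...; and &#...;) always decode.
// Named forms decode only when the name is present in the table.
// Assumption: code points fit in one UTF-16 unit (BMP only).
function decodeEntities(text: string): string {
    return text.replace(
        /&(?:#x([0-9a-fA-F]+)|#(\d+)|(\w+));/g,
        (match: string, hex?: string, dec?: string, name?: string) => {
            if (hex !== undefined) {
                return String.fromCharCode(parseInt(hex, 16));
            }
            if (dec !== undefined) {
                return String.fromCharCode(parseInt(dec, 10));
            }
            if (name !== undefined &&
                Object.prototype.hasOwnProperty.call(knownEntities, name)) {
                return String.fromCharCode(knownEntities[name]);
            }
            // Unknown name: the reference is left exactly as written.
            return match;
        },
    );
}
```

With this table, "&thinsp;" and "&#x2007;" both decode, while "&numsp;" passes through as literal text, which matches the behavior reported here.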

I believe the list of entities in jsx.ts has not changed since that file was created (rbuckton committed on Feb 16, 2016; ca. line 232).

And above I have identified named entities missing from only two very small sections within Unicode.

It is certainly true that the workaround for entity names missing from TSC is to use the numeric entity references, such as &#x2007;, but then usability concerns arise: the numeric forms are cryptic and confusing.
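One way to make the workaround less cryptic, sketched here under assumptions, is to give the missing characters readable names in code and interpolate them as expressions. The object name SPACES and its keys are hypothetical and simply mirror the entity names listed above; nothing here is provided by TypeScript or React.

```typescript
// Hypothetical constants, named after the HTML entities they stand in for.
const SPACES = {
    emsp13: "\u2004", // three-per-em space
    emsp14: "\u2005", // four-per-em space
    numsp: "\u2007",  // figure space
    puncsp: "\u2008", // punctuation space
    hairsp: "\u200A", // hair space
} as const;

// In a .tsx file these can be interpolated as expressions, for example:
//   <span>1{SPACES.numsp}000</span>
// The same strings work in plain TypeScript:
const label = `1${SPACES.numsp}000`;
```

Because the value is an ordinary string expression rather than JSX text, it does not depend on which entity names the JSX transform recognizes.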

I am wondering what policy you would use in deciding whether to include or not include additional entity names. I am hoping you can be somewhat less severe than WHATWG's "additions are bad" stance.

In any case, it would be useful to document somewhere that TSC does not handle every HTML named entity, for example "CounterClockwiseContourIntegral", "leftrightsquigarrow", or "angrtvbd".

🔎 Search Terms

entity numsp thinsp HTML

🕗 Version & Regression Information

TSC 4.5.2

Inspecting source on Github shows this code section in jsx.ts has not changed in (4?) years.

🙁 Actual behavior

The HTML entity &numsp;, when present in a JSX file, is echoed unchanged to the web page and displayed there as the literal text &numsp;

🙂 Expected behavior

Use of &numsp; should have exactly the same result as &#x2007;. Instead, JSX (.tsx) source containing, for example,

      ! ! !
      ! ! !

appears in browser window as

! ! ! ! ! !

which is not surprising given the generated JS code has:

children:[Object(s.jsx)(p,{}),Object(s.jsx)("br",{}),"! !\u2007! !\u2009!\u2009!"]
@nmain

nmain commented Dec 6, 2021

Typescript appears to match Babel here:

Babel REPL

Typescript Playground

@tshinnic
Author

tshinnic commented Dec 7, 2021

Interesting, yes. I see the original commit for Babel's src/plugins/jsx/xhtml.js from 7(?) years ago; the file has since been renamed to packages/babel-parser/src/plugins/jsx/xhtml.js

The same content is found both there and in TypeScript's src/compiler/transformers/jsx.ts: the same number of entities (253) in the same order. One can assume that one is derived from the other, or that both are derived from the same third source.

Ah, I think "from where" is solved. The list of entities contains all the named entities from HTML 4.01, with the addition of apos: 0x0027.

HTML 4.01 was ... a while ago. 2000?

The number currently supported, 253, indicates the scale of the problem. There are over 2100 HTML5 entity names for over 1500 separate Unicode characters. TSC handles roughly one-eighth of those names.

I would not suggest attempting to handle all 2100+ of the HTML named entities.
Something like the currently supported &‌aelig; æ is much more likely to be useful than &‌angrtvbd; ⦝ .

I would hope that a review of named entity support could add names beyond the bare-bones HTML 4.01 list. (Surely somewhere there is data on frequency of entity name usage?) But in any case, documenting that TSC is currently limited to those entity names published 21 years ago would be a good thing (even if potentially embarrassing).

@MartinJohns
Contributor

TypeScript doesn't own the JSX specification. It follows whatever Facebook defines. TypeScript doesn't support a limited set of HTML entities; it supports JSX.

There's some traction happening regarding this subject: facebook/jsx#132 (comment)

@RyanCavanaugh added the Working as Intended label Dec 7, 2021

@RyanCavanaugh
Member

RyanCavanaugh commented Dec 7, 2021

> documenting that TSC is currently limited to those entity names published 21 years ago would be a good thing (even if potentially embarrassing).

I find zero embarrassment in exactly matching the behavior of the de facto reference implementation of the JSX transform 😃

The issue linked above (or its parent repo) would be the place to track changes to this behavior. As-is, I don't want people randomly seeing ‐ vs - in their program depending on whether they use Babel or TypeScript -- that's a far more deleterious problem than being surprised at an entity not being supported at all.

@typescript-bot
Collaborator

This issue has been marked 'Working as Intended' and has seen no recent activity. It has been automatically closed for house-keeping purposes.

@RyanCavanaugh
Member

This is now the specified behavior: facebook/jsx#136
