HTML in output (proof-of-concept) #5172

Open
PerBothner opened this issue Sep 30, 2024 · 20 comments
Labels
type/proposal A proposal that needs some discussion before proceeding

Comments

@PerBothner
Contributor

My html-blocks fork allows an application to "print" output lines containing HTML to an xterm.js terminal. This is a generalization of existing extensions to "print" images, such as using Sixel. However, using HTML is often preferable, as the output can be scaled, selected (copied), can include clickable links and buttons, is usually more compact, and is more easily serialized. The output can also re-flow (based on terminal width and zoom), and can also react to style changes, such as light vs dark mode.

The current implementation is restricted to HTML blocks that extend the full width of the screen. It is also limited to output that gets appended to the end of the buffer. While limited, this is basically what you need to support a REPL with "rich" output. For example: a graphing program that displays plots using SVG, a program that emits nicer-looking and copyable tables, or a symbolic math program that emits formulas using MathML.

Replacing/updating previously-printed HTML blocks would be a straightforward extension. Another natural extension would be a protocol for buttons that, when clicked, send a string to the application.

This is a very preliminary proof-of-concept, not ready for real use. It is based on my buffer-cell-cursor fork, which is mostly usable (though there are still bugs to fix). I think of the html-blocks branch as an example of, and motivation for, the buffer-cell-cursor branch: The former adds a new class ElementBufferLine that extends BufferLine.

Screenshots and examples

Gnuplot is a plotting program that can emit plots in a number of formats, including SVG and "domterm" (which is just SVG wrapped in an escape sequence). Gnuplot defaults to "domterm" output when the DOMTERM environment variable is set. The following shows an example, running gnuplot in batch (rather than interactive) mode.

Screenshot from 2024-09-29 17-49-50

More examples later.

Issues

  • Before polishing and finishing the html-blocks branch we need to polish and finish the buffer-cell-cursor branch that it depends on.

  • Scrolling in xterm.js is done in multiples of rows. This doesn't work well when lines have different heights. A work-around is to treat an HTML block as multiple rows, rounding up the height divided by the standard row height. However, this leads to ugly excess space. It is also not good long-term. For example, one might want to allow plain-text lines with a mix of font sizes:
    we don't want each line to be some multiple of the "standard row size" depending on the font used.

    Probably the best solution is to change the scrolling API and implementation to work in terms of pixels rather than rows.
    This is not inherently complicated, but it is extensive. I added a scrollPartialLines option to enable scrolling by fractional rows, but it does not do much yet. Getting it working should be a separate issue and PR.

  • If lines are no longer a fixed height, mapping between pixel offsets and (row, char) offsets is no longer a simple
    multiplication or division. Linear or binary search may be needed, augmented with caching. However, note that while (for example) mapping a mouse click to a (row, char) offset may require a linear or binary search, the constant factors are small, since we are restricted to the visible screen.

  • Only the DOM renderer has the needed support, but I see no reason the WebGL renderer should be a problem.

  • Selection is not implemented. Ideally, one would want selection to extend across both regular rows and parts of HTML blocks.

  • Re-flow on screen resize is not implemented.

  • Truncation of output is not implemented. (Most people who want rich HTML output will probably want infinite scrollback, so it is a lesser priority.)

  • Serialization of HTML segments has not been implemented, though no complicated issues are foreseen.
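The pixel-to-row mapping mentioned in the list above can be sketched with a cached prefix sum of row heights plus a binary search. This is only an illustration under assumptions: the class RowOffsets, its methods, and the pixel heights are hypothetical and do not exist in xterm.js or in either branch.

```typescript
// Hypothetical helper: none of these names exist in xterm.js.
class RowOffsets {
  // tops[i] is the pixel offset of the top of row i; tops[n] is the total height.
  private readonly tops: number[];

  constructor(rowHeights: readonly number[]) {
    this.tops = [0];
    for (const h of rowHeights) {
      this.tops.push(this.tops[this.tops.length - 1] + h);
    }
  }

  /** Pixel offset of the top of the given row. */
  topOf(row: number): number {
    return this.tops[row];
  }

  /** Index of the row containing pixel offset y, or -1 if y is out of range. */
  rowAt(y: number): number {
    const n = this.tops.length - 1;
    if (n === 0 || y < 0 || y >= this.tops[n]) return -1;
    let lo = 0;
    let hi = n - 1;
    // Invariant: tops[lo] <= y < tops[hi + 1].
    while (lo < hi) {
      const mid = (lo + hi + 1) >> 1;
      if (this.tops[mid] <= y) lo = mid;
      else hi = mid - 1;
    }
    return lo;
  }
}

// Three 17px text rows, a 130px HTML block, then another text row.
const offsets = new RowOffsets([17, 17, 17, 130, 17]);
const clickedRow = offsets.rowAt(100); // 3: the click lands inside the HTML block
```

Building the prefix sums is linear in the number of rows, but it only has to be redone when a row height changes, and for hit-testing it only needs to cover the visible screen.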

Trying it out

Let me know if you want to try it out.

My current test-bed uses xterm.js embedded in DomTerm. DomTerm provides the "safety-scrubbing" of the HTML that most people will want, and some other features that make this easier to use. I can provide instructions, if requested.

The next step is to make this feature not depend on DomTerm. Specifically, it should be accessible from the Demo. This would probably involve a new addon (which we might call addon-html-blocks). This would include customizable safety-scrubbing.

@jerch
Member

jerch commented Sep 30, 2024

@PerBothner The first question coming to my mind is indeed about security - can we get this secure enough within a browser env? Linked to this:

  • What about JS in html snippets? Can we forbid any JS? Could we allow JS but keep it separated from the terminal context (ECMA spec speaks here of "realms")?
  • What about asset sources? Can we forbid side channel loading, force them all to be inlined? If not, can we force them to be routed through the TE connection?

Currently I have my doubts about both JS execution and asset loading. We might get somewhere here by putting the snippets into iframes and applying very strict security policies; re-routing of asset sources can be achieved with a service worker. Still, that type of "isolation" is subject to many browser bugs, so I feel a bit uneasy about the whole concept.

To not only argue with FUD from my side - the main attack vector I see here is breaking out of any secluded area and gaining access to the terminal's JS context and thus direct shell access. Such a break could happen through any weak security setup in our countermeasures or through browser bugs, which means that we need perfect knowledge/control here (eww what a burden).
A second vector would be the ability to track the terminal user through side-channel loading of foreign assets, which violates a fundamental pattern of current terminal interaction (everything has to come through the TE connection).

So any ideas, how not to fall into these traps?

@Tyriar
Member

Tyriar commented Sep 30, 2024

Something I tried and gave up on, as it was a bigger change than I wanted at the time, was to add the concept of monaco editor-style "view zones" to the terminal. Basically the embedder would register a view zone that inserts a chunk of empty space in the renderer and manages a DOM element that is put there. The embedder could dispose of this view zone whenever it was done with it and maybe also change properties on it (like height).

The great thing about this approach is that it works similarly to decorations (which were also inspired by monaco), it's entirely up to the embedder to implement their sequence handler/feature (eg. clicking a link could open the image inline) and it also doesn't get into the mess of touching how the buffer works and making it more complex by having multiple buffer line types. The main challenges are scrolling, as you've called out in your example, and having the renderer support the gap properly.

I found the WIP branch I was working on: master...Tyriar:xterm.js:zone_wip. I think the conclusion I came to was that we must work out smooth scrolling and then pixel-based scrolling, where the scroll bar would be able to land in between buffer rows, instead of always having the top of the terminal show the top of some row.

Tyriar added the type/proposal label Sep 30, 2024
@jerch
Member

jerch commented Sep 30, 2024

Idk if inline placing of HTML snippets is a good idea. It basically can be arbitrary in height/width - how is that supposed to work inline? How will too wide content be handled (typically we only have a top-down scrollbar)? What about overprinting with text content?
To me it seems that inlining complex content into the terminal buffer output will lead to a "hard to comprehend" UI pattern.

How about this - always render complex content into a separate buffer, thus make it a full-viewport view. This way the content size does not matter, as we can place scrollbars as needed for both directions. This "other buffer" could be accessible through a link-like annotation in the original REPL context, and even could be held as long as the marker in the original buffer stays active (thus gets auto-evicted on scrolling off).

@PerBothner
Contributor Author

About security, there are various approaches. I suggest we support some level of configurability, since different applications may prefer different approaches. I would focus on what should be default (standard) for a general-purpose terminal-emulator, either a stand-alone application or embedded in an IDE.

One approach is to remove "bad stuff" using a blacklist or (preferably) a whitelist of allowed HTML. DomTerm uses this scrubHtml function in domterm-utils.js. It has a white-list of allowed elements (see the HTMLinfo table) and it also restricts attributes (see the allowAttribute function). Specifically, it disallows <script> elements, on* event handlers, and javascript: URLs in <a>, <base>, and <img> elements.
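To make the whitelist idea concrete, here is a minimal sketch of that style of scrubbing. It is not DomTerm's scrubHtml: the HtmlNode tree type, the allow-lists, and the function bodies are assumptions made for illustration (only the name allowAttribute is borrowed from the description above), and a real sanitizer would have to cover much more (styles, SVG, other URL-bearing attributes).

```typescript
// Hypothetical sketch, not DomTerm's implementation.
interface HtmlNode {
  tag: string;
  attrs: Record<string, string>;
  children: (HtmlNode | string)[];
}

// Assumed allow-lists for the example only.
const ALLOWED_TAGS = new Set(["div", "span", "p", "a", "img", "table", "tr", "td", "b", "i"]);
const URL_ATTRS = new Set(["href", "src"]);

function allowAttribute(name: string, value: string): boolean {
  const lower = name.toLowerCase();
  if (lower.startsWith("on")) return false; // onclick, onload, ...
  if (URL_ATTRS.has(lower)) {
    // Conservatively drop whitespace and control characters before checking the scheme.
    const compact = value.replace(/[\s\u0000-\u001f]/g, "").toLowerCase();
    if (compact.startsWith("javascript:")) return false;
  }
  return true;
}

/** Returns a scrubbed copy of the node, or null if the element itself is not allowed. */
function scrub(node: HtmlNode): HtmlNode | null {
  if (!ALLOWED_TAGS.has(node.tag.toLowerCase())) return null; // drops <script>, <iframe>, ...
  const attrs: Record<string, string> = {};
  for (const [name, value] of Object.entries(node.attrs)) {
    if (allowAttribute(name, value)) attrs[name] = value;
  }
  const children: (HtmlNode | string)[] = [];
  for (const child of node.children) {
    if (typeof child === "string") {
      children.push(child);
    } else {
      const kept = scrub(child);
      if (kept !== null) children.push(kept);
    }
  }
  return { tag: node.tag, attrs, children };
}
```

The point of a whitelist over a blacklist is visible in scrub: anything not explicitly listed is dropped, so an element the author never thought about is rejected by default.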

Not allowing JavaScript has worked fine for the "REPL" use cases I'm most interested in. However, JavaScript might be useful in some applications - for example you might want animated or interactive output. One possibility is to allow the terminal to install "extensions" under explicit user control, but that may be too limited.

Another approach is to wrap all HTML inside an <iframe>, as you suggest, and as YAET (see issue #5110) does. This should be fine for most REPL-style use cases, and in that case you could allow JavaScript.

A possible problem using <iframe> is that it is presumably more resource-heavy. This might be an issue if you have a long session, with hundreds of interactions, each producing a frame. One possibility is to have two different OSC codes - one that wraps the HTML in an <iframe> (and doesn't scrub or only scrubs minimally), and another code that inserts directly (and scrubs heavily, including removing all JavaScript).

Another possibility is that emitted HTML needs to include a session-specific randomly-generated passkey, passed in via an environment variable. I haven't really thought about this, but if I remember correctly this is the approach used by GraphTerm.

@PerBothner
Contributor Author

By the way, a very interesting related project was GraphTerm, written by R. Saravanan. He also wrote the even older XMLTerm, which was a big inspiration to me.

@PerBothner
Contributor Author

@Tyriar The "view zone" approach might work. However, I don't know anything about Monaco view zones, nor have I looked into the decorator API and implementation.

I'm unclear if it saves us anything in terms of implementation complexity - we still need to deal with scrolling, and (preferably) zones that are a non-integral number of rows high. Furthermore, it seems like it would be helpful to have this extra content be part of the buffer. I think it may simplify some of the logic - and I think it ties in with being able to support lines with different heights and fonts (which I think at some point we should). The user model should be that "rich output" is part of the buffer. Serialization is probably simpler if the DOM element is directly accessible from the buffer structure.

@PerBothner
Contributor Author

@jerch "Idk if inline placing of HTML snippets is a good idea."

If by "inline" you mean in the CSS display: inline sense, I'm inclined to agree: It's quite a bit more complicated (both conceptually and in implementation), and I think less useful.

If you mean vertical interleaving of user input lines, fixed-width output lines, and "rich" HTML output lines, I disagree: I think that is very useful. It's the paradigm of REPL updated to allow rich text, graphics, images, math etc in the "print" part.

That said, having the option to individually show commands (or just their output) in a separate window, delete commands, "fold" command output etc is useful. That can be built on a "shell integration" protocol such that the terminal can understand the concepts of commands as consisting of prompt, input, and output (with possible nesting).

@jerch
Member

jerch commented Sep 30, 2024

> If you mean vertical interleaving of user input lines, fixed-width output lines, and "rich" HTML output lines, I disagree: I think that is very useful. It's the paradigm of REPL updated to allow rich text, graphics, images, math etc in the "print" part.

Yes, with "inline" I mean within the normal text buffer progression. Plz note, that my argument was not about usefulness - I also think that's quite useful. My argument is about breaking the UI so badly, that ppl will get confused, how to properly interact with it. Especially for stuff, that needs a proper width/height to show up correctly.

@PerBothner
Contributor Author

Another demo/screenshot - the man command outputting rich text instead of mono-space text:

Screenshot from 2024-09-30 14-36-46

The xt-man command is just this script:

#!/bin/bash
# Wrap man's HTML output in an OSC 72 escape sequence (terminated by BEL).
echo -en '\e]72;'
man -Hcat "$1" 2>/dev/null
echo -en '\007'

This assumes the patch in issue #5173 is applied. Without it, literal newlines in the man -Hcat output have to be converted to &#xA;. This is tricky because in HTML it is context-dependent whether whitespace is ignored.
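For an application that builds the sequence itself, the framing used by the script can be sketched as a small function. Only the "ESC ] 72 ;" prefix and the BEL terminator come from the script; the function name htmlBlockSequence and the decision to reject payloads containing ESC or BEL (either of which could end the OSC sequence early) are assumptions of this sketch, and it presumes literal newlines are accepted, as with the patch mentioned above.

```typescript
const ESC = "\x1b";
const BEL = "\x07";

// Hypothetical helper; only the "ESC ] 72 ;" prefix and the BEL terminator
// are taken from the xt-man script.
function htmlBlockSequence(html: string): string {
  if (html.includes(ESC) || html.includes(BEL)) {
    // BEL would terminate the OSC sequence early, and ESC could start a new one.
    throw new Error("HTML payload must not contain ESC or BEL");
  }
  return `${ESC}]72;${html}${BEL}`;
}

const sequence = htmlBlockSequence("<b>hello</b>");
// In Node.js this could then be written with process.stdout.write(sequence).
```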

@PerBothner
Contributor Author

@jerch "My argument is about breaking the UI so badly, that ppl will get confused, how to properly interact with it."

The use-case I'm focusing on (for now) is a REPL (including a shell) with rich output. This uses the normal scrollable screen buffer, with full-width output. The cursor never moves backwards, except within the prompt+input area. I believe this would be straightforward and intuitive.

Mixing regular column-based output with rich text using the alternate screen buffer, where the cursor can jump between column-based and rich output - that is more complicated. Not necessarily for the user, but for the application programmer. We would have to define the cursor semantics of rich output (see below).

I think if an application wants to combine rich text with interactive full-screen alternate-buffer use, the preferred way to do that would be to clear the screen and output HTML to cover the entire screen (scrollable as needed). Instead of using row+column addressing to navigate to sections to update, one should use id attributes to specify a chunk to replace.

Of course we still want to define how row+column addressing is defined when moving through rich-text HTML blocks - even if that's not the recommended way to do things. The simplest would be to define each HTML block as a single large character-cell: It might be an extra-tall line containing a single extra-wide character. This model is more-or-less what the prototype does.

@sawka
Contributor

sawka commented Dec 4, 2024

This is really interesting. I've been thinking a lot about this problem as well. I think trying to make it "pixel perfect" or out of band with the current row x col model, is not the right approach as it will cause a lot of problems with scrolling, cursor positioning, resizing, etc.

I would propose a very simple change. Just allow the terminal to reserve a number of "rows" as a portal to 3rd-party content. These rows/portals would participate in scrolling. If any of the cells inside of the reserved rows/portal are modified (e.g. a terminal command wants to write text into that zone or clears that zone, etc.) then we delete the portal, and convert the reserved area back to just regular cells. Then we just need an API to access that portal -- maybe it can be as simple as providing a reference to an HTML div, with a dispose event (resize and scrolling events are not even necessary as the client can use a ResizeObserver, an IntersectionObserver, or getBoundingClientRect to manage the div).

This could work great for the image addon, and for @PerBothner HTML content (and for a bunch of ideas that I'm trying to implement as well).

The key though is not linking this core functionality to what gets put inside these portals. That can be handled outside of core xterm code through addons or 3rd party code. By dealing with this at the "row" level, we also remove all of the complexity around terminal resizing.
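The bookkeeping for such portals might look roughly like the following. This is a sketch of the proposal only: Portal, PortalRegistry, and their methods are hypothetical names, not xterm.js APIs, and the DOM element that would be handed to the client is left out.

```typescript
// Hypothetical model of the proposal; none of these names are xterm.js APIs.
interface Portal {
  startRow: number;
  rowCount: number;
  onDispose: () => void;
}

class PortalRegistry {
  private portals: Portal[] = [];

  /** Reserve rows [startRow, startRow + rowCount) for third-party content. */
  reserve(startRow: number, rowCount: number, onDispose: () => void): Portal {
    const portal: Portal = { startRow, rowCount, onDispose };
    this.portals.push(portal);
    return portal;
  }

  /** Called when the terminal writes to or clears a cell in the given row. */
  cellModified(row: number): void {
    const hit = (p: Portal): boolean => row >= p.startRow && row < p.startRow + p.rowCount;
    const removed = this.portals.filter(hit);
    this.portals = this.portals.filter((p) => !hit(p));
    // The reserved rows revert to regular cells; tell the owner to clean up.
    for (const p of removed) p.onDispose();
  }

  get count(): number {
    return this.portals.length;
  }
}
```

Because the registry only knows about row ranges and a dispose callback, what is rendered inside a portal stays entirely outside the core.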

I think the implementation of this could be straightforward. I've been thinking about trying to make this change, but as I'm not as familiar with the codebase, I'm worried about some edge cases (especially with the webgl renderer).

@Tyriar
Member

Tyriar commented Dec 5, 2024

This has come up a few times recently for me wrt notebooks/python REPLs, where we could show a webview inline in the terminal in a view zone (making room for the element between 2 lines). Where I'm at currently with that though is that there are 2 types of users this stuff is mainly targeting, programmers and data scientists, and the rich UI is primarily for the data science crowd, who would usually opt for a full notebook interface, of which VS Code has a fully featured one. So it doesn't feel worth the effort on my end to invest in richer terminal UI for that.

This is similar to the case of printing images to the console beyond the simple image sequence support we already have. For that you could instead run code <imgfile> to open it in the rich image viewer/editor (which even works in a remote connected window!), or <imgfile> to open it in the default editor.

Also view zones were originally considered for the AI chat view, and for that I'm pretty happy with where I landed where I shift the terminal depending on the position of the cursor or overlay:

Overlay:

Image

Shifted (cursor on last line):

Image

TLDR I'm not sure it's worth myself investing in this stuff further. A generic view zone implementation would be a great addition to the project though imo.

@PerBothner
Contributor Author

We all agree that the VT100-terminal model has been very flexible and useful but we would like to support images, math, rich text, and complex scripts (which do not work well with fixed columns). One solution is the notebook interface, which can be very nice. However, I believe it would be preferable if we can combine or integrate the terminal interface and the notebook interface into a single tool:

  • The notebook protocol is not as standardized or well-specified.
  • You have to decide in advance if you want to use the terminal interface or the notebook interface. Each has their advantages.
  • The notebook interface does not (typically) support dynamic interaction and updates. Basically there is an input area (which you can edit) and an output area (which cannot be updated or edited).
  • One nice thing about the terminal protocol is that it is mostly standardized and it works remotely. An application can also be incrementally enhanced from barebones terminal to rich interaction depending on application support and terminal support.
  • It is nice to have the standard terminal escape sequences (colors and styles) be available in the notebook interface.
  • For those who need both terminals and notebooks, a single tool is easier to master. For example: what are the keyboard shortcuts for scrolling the window?

@PerBothner
Contributor Author

If we want to add rich output (and maybe notebook functionality) to the terminal, then I believe limiting ourselves to fixed-size cells is not acceptable:

  • If the space for an image (or math formula or whatever) needs to be rounded up to a whole number of rows, you will get variable amounts of padding, in a way that is ugly and unexpected to regular users.
  • Restricting text to monospace is ugly and not needed. As people want to use new Unicode symbols, they run into problems: not only how many columns to use, but also how to render them.
  • Worse, many of the world's writing systems do not work well with discrete monospace characters.
  • Breaking of lines into rows should be done in the terminal, not the application. Having the application re-flow the line on window width change is inherently racy and glitchy.
  • The input editor of a REPL (for example a shell) needs to move the cursor within the input area. There is no way to do that if the input area is taller than the logical terminal height.
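The variable padding from the first point in the list above is simple arithmetic. The sketch below uses a hypothetical function name and made-up pixel sizes (17px rows) purely to show how the wasted space jumps around as the content height changes.

```typescript
// Illustrative numbers only; the function name is hypothetical.
function paddingWhenRoundedToRows(
  contentHeightPx: number,
  rowHeightPx: number,
): { rows: number; paddingPx: number } {
  // Round the content height up to a whole number of standard rows.
  const rows = Math.ceil(contentHeightPx / rowHeightPx);
  return { rows, paddingPx: rows * rowHeightPx - contentHeightPx };
}

const a = paddingWhenRoundedToRows(120, 17); // { rows: 8, paddingPx: 16 }
const b = paddingWhenRoundedToRows(136, 17); // { rows: 8, paddingPx: 0 }
const c = paddingWhenRoundedToRows(137, 17); // { rows: 9, paddingPx: 16 }
```

A one-pixel change in content height (136px to 137px) moves the padding from nothing to almost a full row, which is the "variable amounts of padding" effect.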

To fix these problems, we can still work with conceptual "cells" but cells may have different sizes. A cell may consist of one or more combined characters, or some image. A "line" is a number of cells in one or more rows. The terminal divides lines into rows. The primary cursor addressing is by line plus cell or character (depending on context) within line. Wrapped lines are considered a single line in this mode.
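One way to picture this model is a logical line of variable-width cells that the terminal breaks into rows, with the cursor addressed by cell index within the line. The sketch below is an assumption-laden illustration: the Cell type, the greedy breaking rule, and both function names are hypothetical and are not the data model of xterm.js or of the buffer-cell-cursor branch.

```typescript
// Hypothetical types for illustration only.
interface Cell {
  text: string;
  widthPx: number;
}

/** Greedily split one logical line into rows no wider than maxWidthPx. */
function breakLineIntoRows(line: readonly Cell[], maxWidthPx: number): Cell[][] {
  const rows: Cell[][] = [[]];
  let used = 0;
  for (const cell of line) {
    const current = rows[rows.length - 1];
    // Start a new row when the cell does not fit, unless the row is empty
    // (an over-wide cell still has to go somewhere).
    if (current.length > 0 && used + cell.widthPx > maxWidthPx) {
      rows.push([cell]);
      used = cell.widthPx;
    } else {
      current.push(cell);
      used += cell.widthPx;
    }
  }
  return rows;
}

/** Map a line-relative cell index to its (row, column-in-row) position. */
function locateCell(rows: readonly Cell[][], cellIndex: number): { row: number; col: number } | null {
  let remaining = cellIndex;
  for (let row = 0; row < rows.length; row++) {
    if (remaining < rows[row].length) return { row, col: remaining };
    remaining -= rows[row].length;
  }
  return null; // index is past the end of the line
}
```

Since the terminal owns the row breaks, a width change only requires calling breakLineIntoRows again; the line-plus-cell address of the cursor stays the same.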

To support cursor movement beyond the nominal terminal height we can allow relative movement above the home position. Alternatively, we can have a command to disable moving the home position automatically on scrolling.

The "legacy" (row, column) addressing will of course be supported using traditional escape sequences.

It is possible to fix some of the problems mentioned above without modifying the application using a special shell-integration mode, but I'll leave out the details.

@PerBothner
Contributor Author

Of course a reasonable response is "this sounds interesting, but it is too researchy and quite beyond the scope of the xterm.js project". I agree - but I believe xterm.js provides a good and performant base for working on these problems, at least if you're open to the BufferLine re-write (PR #4928) when it gets polished, passes the tests, and assuming performance isn't hurt. (I'm still working on it, on and off, but various things have slowed me down.)

@Tyriar
Member

Tyriar commented Dec 5, 2024

@PerBothner to be clear, the view zone concept isn't fixed size, you as an xterm.js embedder would be able to create as many as you want, as large as you want and afterwards resize them to whatever size you want. This is a view zone in monaco for reference:

Image

So instead of having different types of buffer lines, all buffer lines are standard monospaced and don't really change at all, other than the fact that they can have gaps in between them where a HTML container is rendered and positioned for you via the API. Then the contents of the container is up to the embedder. Another very important place where this just works will be the webgl renderer where all that it needs to know about is the gap position(s) and size(s).

> Of course a reasonable response is "this sounds interesting, but it is too researchy and quite beyond the scope of the xterm.js project"

Some of the ideas in #4928 are quite interesting; for reflow especially, not needing to jump through hoops to wrap/unwrap lines like we do now, by having something own the whole unwrapped line, would be great. However, the PR is enormous and extremely scary as it changes a lot of core functionality. When I consider the possibility of shipping such a large refactor in VS Code (the time to study the PR, verify it, deal with inevitable regressions, handle the complexities that come from cells suddenly no longer being the same size, etc.), it's hard for me to reconcile. Especially when I weigh it against both the impact it has for VS Code and other work I'm involved in right now which has much more impact (native terminal completions, GPU renderer in VS Code's editor, AI terminal integration, improving shell integration).

So at least for the state that PR is in right now, I can't see myself having the time any time soon to be able to review and merge it unfortunately.

What the zones idea does provide in contrast is:

  • A relatively small change in a new isolated component, therefore much lower risk and little maintenance burden.
  • A simple interface for embedders to add gaps in the buffer of any size, what goes into them xterm.js doesn't care. I think this gives the power to embedders for the HTML in output scenario entirely.
  • No dealing with HTML and therefore none of the security concerns that opens up. We cannot set via .innerHTML in VS Code for example, and any foreign HTML content must be presented in a webview (/iframe for web).
  • Should be a lot easier to implement and test.

@PerBothner
Contributor Author

PerBothner commented Dec 6, 2024

Zones: If I'm understanding "zones" correctly, it's basically a 100%-width HTMLElement attached to a marker. (However, if that is all it is, it doesn't explain the amount of WebGL code, so I assume I'm missing something.)

It looks like IImageSpec in the image addon is also an element attached to a marker. Would it make sense to re-factor the image addon to use something like a Zone concept? (I haven't looked enough at either.)

If it makes sense, I can take a look.

There is a lot of WebGL code (which I don't understand), but I don't see any DomRenderer code. Is that handled automagically, or is it just not implemented yet?

Why does IZoneWidget duplicate IWidgett? Is this a TypeScript thing I'm missing?

For me too it is important to be able to serialize images and embedded HTML. I see issue #4470 - is that likely to happen? I don't know WASM (yet) but I might be able to help out.

@PerBothner
Contributor Author

The BufferLine re-write: Some of the complexity is because of desire to be able to switch at run-time between the old and the new implementation (defaulting to the old) until we are sufficiently comfortable with the new implementation. That reduces the risk, but it makes the code harder to understand and review. (Though it's been a while since I tested if the code for the old implementation still works.)

I am certainly not asking for a review until the code is more polished, the tests pass, and performance has been benchmarked. (Though early feedback is certainly welcome.) Plus some concrete benefit or feature that it enables. (This Issue was supposed to be an example of such a feature.)

@sawka
Contributor

sawka commented Dec 6, 2024

@Tyriar Appreciate your insight (and the link to the zone_wip branch)! The arbitrarily sized view zones idea is really interesting. I'll noodle on that a bit, and maybe see if I can make any progress on it over the holidays. My biggest worry is around how that will mess with the terminal size. If the terminal was 25 rows, and then I opened up a zone which was approximately 5 rows... I think we'd have to send a SIGWINCH saying the terminal now only has 20 rows. As the zone scrolls off the top, the terminal would re-gain rows again (and get resized). Also laughing a bit at what it would look like to run something like "vi" with a zone opened up in the middle (although I guess that is one of the use-cases).

@Tyriar
Member

Tyriar commented Dec 12, 2024

> Zones: If I'm understanding "zones" correctly, it's basically a 100%-width HTMLElement attached to a marker. (However, if that is all it is, it doesn't explain the amount of WebGL code, so I assume I'm missing something.)

@PerBothner any change to rendering will need a bunch of webgl code as it's so verbose. The branch isn't doing much particularly complicated, just translating the gaps so we know what rows to render.

> It looks like IImageSpec in the image addon is also an element attached to a marker. Would it make sense to re-factor the image addon to use something like a Zone concept? (I haven't looked enough at either.)

💯 it makes a lot of sense to do this as it would make the image addon a lot simpler. I don't think it would mess with any hard requirements of images taking up actual room in the buffer?

> There is a lot of WebGL code (which I don't understand), but I don't see any DomRenderer code. Is that handled automagically, or is it just not implemented yet?

I think the DOM code isn't done yet as it was just a hacked up prototype and I was tackling the harder webgl side first.

> Why does IZoneWidget duplicate IWidgett? Is this a TypeScript thing I'm missing?

I guess I was thinking maybe there would be more widgets, I think I'd remove that if I was to clean up the PR.

> If it makes sense, I can take a look.

That would be great if you could investigate this. If you do find the time to look at improving the idea in the branch, feel free to ignore the webgl side. Once it's in a good state I should be able to implement that part without too much effort.

> I am certainly not asking for a review until the code is more polished, the tests pass, and performance has been benchmarked.

I just want to set expectations so you don't waste your time; if the PR is really large, unfortunately it's going to have a hard time getting merged due to my time available vs the impact in VS Code.

> For me too it is important to be able to serialize images and embedded HTML. I see issue #4470 - is that likely to happen? I don't know WASM (yet) but I might be able to help out.

I like how simple the serialize addon is right now. Another benefit of the view zone approach is that it puts the onus on embedders to restore the view zones. So when you call into the serialize addon, you would also record all the view zones and then recreate them when you restore the serialized content.
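On the embedder side, that split of responsibilities could be sketched as below. Everything here is hypothetical: view zones are only a proposal in this thread, so ZoneRecord, SavedSession, and both functions are invented for illustration, and the terminal text is just passed in as a string (for example, whatever the serialize addon returned).

```typescript
// Hypothetical embedder-side bookkeeping; view zones are only a proposal in this thread.
interface ZoneRecord {
  afterRow: number;
  heightPx: number;
  html: string;
}

interface SavedSession {
  text: string;
  zones: ZoneRecord[];
}

/** Bundle the serialized terminal text with the embedder's own zone records. */
function saveSession(serializedText: string, zones: readonly ZoneRecord[]): string {
  const session: SavedSession = { text: serializedText, zones: [...zones] };
  return JSON.stringify(session);
}

/** Replay the text first, then let the embedder recreate each zone itself. */
function restoreSession(
  json: string,
  writeText: (text: string) => void,
  createZone: (zone: ZoneRecord) => void,
): void {
  const session = JSON.parse(json) as SavedSession;
  writeText(session.text);
  session.zones.forEach((zone) => createZone(zone));
}
```

The serialized terminal content stays a plain string; the zone records travel next to it rather than inside it.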

This also aligns with the philosophy I know we've talked about before in some issue about the purpose of xterm.js which is to provide a good baseline terminal with features that all terminals expect, with the capability to extend it to add more modern non-standard functionality.


> If the terminal was 25 rows, and then I opened up a zone which was approximately 5 rows... I think we'd have to send a SIGWINCH saying the terminal now only has 20 rows.

@sawka view zones would be purely front end, there shouldn't be any interaction between them and the pty size. Using a view zone in the viewport would mean that parts of the viewport according to the pty would maybe be offscreen, this is just a consequence of the feature.
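The 25-row example quoted above can be worked through numerically. The sketch below assumes fixed-height rows with gaps inserted after given rows; the ViewZone shape, the function names, and the 20px row height are assumptions for illustration, not the zone_wip branch's code.

```typescript
// Hypothetical shape of a view zone: a gap of heightPx inserted after a buffer row.
interface ViewZone {
  afterRow: number;
  heightPx: number;
}

/** Pixel offset of the top of each buffer row, given fixed-height rows and zones between them. */
function rowTops(rowCount: number, rowHeightPx: number, zones: readonly ViewZone[]): number[] {
  const gapAfter = new Map<number, number>();
  for (const z of zones) {
    gapAfter.set(z.afterRow, (gapAfter.get(z.afterRow) ?? 0) + z.heightPx);
  }
  const tops: number[] = [];
  let y = 0;
  for (let row = 0; row < rowCount; row++) {
    tops.push(y);
    y += rowHeightPx + (gapAfter.get(row) ?? 0);
  }
  return tops;
}

/** Number of buffer rows whose full height fits in the viewport. */
function visibleRowCount(tops: readonly number[], rowHeightPx: number, viewportHeightPx: number): number {
  return tops.filter((top) => top + rowHeightPx <= viewportHeightPx).length;
}

// 25 rows of 20px in a 500px viewport, with a 100px (5-row) zone after row 9.
const tops = rowTops(25, 20, [{ afterRow: 9, heightPx: 100 }]);
const visible = visibleRowCount(tops, 20, 500); // 20
```

The pty still believes it has 25 rows; the last 5 are simply pushed below the bottom of the viewport, which is the "offscreen" consequence described above.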

Related, if there's a pty resize that renders the view zone/IMarker invalid (eg. show view zone at row 20 and pty resize rows to 10), it would be up to the embedder to recreate it as appropriate.

> Also laughing a bit at what it would look like to run something like "vi" with a zone opened up in the middle (although I guess that is one of the use-cases).

The alt buffer wasn't really in mind when I was conceiving the feature. I think you could do it, it would just need custom resize logic on the embedder side which may get complicated.

4 participants