[BUG] HTML is modified when saved #593
Comments
What are your eXide > Edit > Preferences > Serialization settings? The default is to apply indentation, for backward compatibility. But you can disable indentation there, for whitespace fidelity.
@joewiz the issue is not indentation. It is auto-closing tags and swallowing CDATA sections.
@joewiz I clarified the issue description.
Have you tested other editors? How much of this behavior is eXide-specific, versus universal for eXist serializing HTML stored as XML (as is the default) instead of as binary (as is possible with effort)?
Any other editor I use behaves as expected (vim, vscode, IDEA). None of them auto-closes tags or swallows CDATA. I mean, I was able to create, upload and read the test document. exist-db does not do this either. This is why I opened the issue here.
I should add that I spent plenty of time testing this (but could not put my finger on it) and it was reported to me by several other parties. Please also see the linked exist-db issue.
To give you one example:
The scenario above gets worse when the tainted HTML page gets synced back to disk and committed into a repository. Also, think about the amount of debugging necessary if the project in question has an additional layer of indirection. It could be making use of the templating module, for instance.
I think there are some things we can do to help the situation, particularly in eXide, with its uniquely customizable facility for setting the serialization parameters used when a document is loaded from the database (i.e., in
Specifically, we can detect that the file being loaded has a mime type of
Other interfaces, such as REST, WebDAV, and XML-RPC are not so customizable; their parameters are set in
To demonstrate that eXide can be patched to fix both the auto-closing tags and the CDATA issue, replace https://github.com/eXist-db/eXide/blob/develop/modules/load.xq#L60-L72 with the following:
serialize(
doc($path),
map:merge((
map {
"indent": $indent,
"exist:expand-xincludes": $expand-xincludes
},
if ($mime eq "text/html") then
map {
"method": "xhtml",
"html-version": 5.0,
"cdata-section-elements": xs:QName("script")
}
else
()
))
)
As you'll notice, this patch switches
declare namespace output="http://www.w3.org/2010/xslt-xquery-serialization";
declare option output:method "text";
A PR adding this to eXide should probably fold this into the serialization preferences. Users may like the settings above, or they may wish to retain the existing approach, i.e., serializing HTML using the XML output method. Support for cdata-section-elements also has use cases outside of HTML serialization and should be exposed.
Also, as mentioned, I think eXide has much finer-grained serialization options than other interfaces. A fuller investigation of these other interfaces' serialization limitations and approaches for giving users finer-grained control may be warranted. The other interfaces have one advantage: they don't need to be told which elements may have CDATA blocks. I wonder if eXist could add some sort of serialization option that presents CDATA blocks without requiring them to be specified.
Just a small addition: eXide, and probably the other interfaces mentioned by Joe, is also auto-closing web component tags when they are empty. This is a serious problem that leads to documents being run in quirks mode and can cause errors, especially when those web components are siblings in a document. Self-closing elements are nowadays called 'void elements' in the HTML5 spec: https://html.spec.whatwg.org/multipage/syntax.html#void-elements
I want to focus solely on the storage side of the equation first, as Serialization is a different matter and is configurable by the user to achieve whatever they wish. There is an important concern here missing from the "Reproducible Example" which will possibly answer the proposed issue.
How "exactly" is that file being stored into the database? Different tools will do this differently! Ultimately eXist-db can only store XML or Binary documents (i.e. its storage sub-system has no concept of HTML):
The example file is stored as XML. As @joewiz was able to show, it is possible to load my example in eXide without the CDATA section being replaced.
The real world example is a template that is opened with
Sorry, with all due respect, but the fact that it has been like this from the start of eXist-db does not mean it is correct these days. And "is expected" by whom? Certainly not by me writing
Here a browser would create a nested structure instead of siblings, which completely breaks functionality. Web Components did not exist in XHTML but were introduced with HTML5. Unfortunately, it was decided that they cannot be 'void', to use the WHATWG term, but need a closing tag. Thus an HTML serializer must not modify those elements just because they have no content, as an XML serializer would do. If it does, it is not in sync with the standard. @ALL the CDATA sections are of much lesser practical concern IMO, as inline script (at least in production) should be forbidden anyway. However, breaking it is still an issue, as it disallows the quick dev-time hack.
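The difference between the XML and XHTML output methods for empty custom elements can be checked with the standard fn:serialize function. The following is only a minimal sketch, not eXide's code: the element names my-first and my-second are hypothetical, and processors may differ in how they implement the xhtml method, so no expected output is shown.

```xquery
xquery version "3.1";

(: Hypothetical fragment with two empty sibling custom elements. :)
let $fragment :=
    <div>
        <my-first></my-first>
        <my-second></my-second>
    </div>
return (
    (: XML output method: empty elements may be minimized, e.g. <my-first/>. :)
    serialize($fragment, map { "method": "xml", "indent": false() }),
    (: XHTML output method targeting HTML5: only void elements are meant
       to be minimized, so the custom elements should keep their end tags. :)
    serialize($fragment, map {
        "method": "xhtml",
        "html-version": 5.0,
        "indent": false()
    })
)
```

Comparing the two returned strings shows whether a given processor keeps the explicit end tags that browsers need in order to treat the two elements as siblings.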
This page is a good reference on void elements (the ones allowed to be serialized as empty) and self-closing tags: https://developer.mozilla.org/en-US/docs/Glossary/Void_element It clearly says:
@JoernT @adamretter I believe that my patch to eXide comprehensively fixes both the CDATA and self-closing tags problems with opening XHTML documents in eXide. Opening Juri's XHTML test document (with one added
<!DOCTYPE html>
<html>
<head>
<title>foo</title>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/@jinntec/fore@latest/resources/fore.css" />
</head>
<body>
<main></main>
<script><![CDATA[ console.log(true && 1 < 2) ]]></script>
</body>
</html>
I believe that this is what he was after, in terms of meeting his expectations for opening HTML documents in eXide. All that's left for a thorough fix in eXide is to implement the user preferences for controlling this behavior.
If total fidelity to an HTML source document that is not well-formed XML is required, then the file needs to be stored and handled as a binary file. eXide should respect those preferences, too, although I haven't tested them.
From my perspective, the patch I put together will let users view (* re: editing, see my next post) HTML documents as XHTML, with sane defaults for CDATA section elements. These documents can always be serialized as HTML, with its non-well-formed XML elements (namely, the "void elements" Joern mentioned, e.g.,
eXist's REST, WebDAV, and XML-RPC interfaces, however, remain XML-centric and do not expose any method for XHTML-specific serialization preferences that users might expect or desire. What these interfaces do appear to provide is a feature not exposed to
As for saving HTML files that contain CDATA blocks, I think the problem @line-o described is inherent to applications that rely on XQuery to parse XML. When eXide stores a file, the eXide client submits a PUT request to eXide's
So it receives the payload and tries to parse it as XML and return a document node. Effectively, it performs the following:
xquery version "3.1";
``[
<!DOCTYPE html>
<html>
<head>
<title>foo</title>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/@jinntec/fore@latest/resources/fore.css" />
</head>
<body>
<main></main>
<script><![CDATA[ console.log(true && 1 < 2) ]]></script>
</body>
</html>
]``
=> parse-xml()
In both eXist and BaseX, this query returns:
<html>
<head>
<title>foo</title>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/@jinntec/fore@latest/resources/fore.css"/>
</head>
<body>
<main/>
<script> console.log(true &amp;&amp; 1 &lt; 2) </script>
</body>
</html>
In other words, the CDATA is stripped and its contents are escaped. Until XQuery offers a method to parse a document and preserve its CDATA section elements, I don't see a workaround for this behavior available to eXide.
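To see where the CDATA section is lost, and how the cdata-section-elements parameter can wrap the text again on output, the round trip can be reduced to a few lines. This is a sketch using only the standard fn:parse-xml and fn:serialize functions; the shortened test document is an assumption made for brevity.

```xquery
xquery version "3.1";

(: A string constructor is used so that & and < need no escaping in the query. :)
let $source := ``[<body><script><![CDATA[ console.log(true && 1 < 2) ]]></script></body>]``
(: Parsing keeps only the text; CDATA markers are not part of the data model. :)
let $doc := parse-xml($source)
return (
    (: Default XML serialization escapes the text content of script. :)
    serialize($doc, map { "method": "xml", "omit-xml-declaration": true() }),
    (: Naming script as a CDATA section element asks the serializer
       to wrap its text in a CDATA section instead of escaping it. :)
    serialize($doc, map {
        "method": "xml",
        "omit-xml-declaration": true(),
        "cdata-section-elements": xs:QName("script")
    })
)
```

The second call does not recover the original markers. It only re-creates a CDATA section for every element listed, which is why the elements have to be specified up front.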
As a band-aid for eXide, I propose storing HTML by other means (like the REST API). At least the XML-RPC endpoints do check for well-formedness and for appropriate permissions as well.
@JoernT Sorry if you misunderstood my comment. I was not saying that it is "correct these days", rather I was saying that eXist-db only has an XML DOM and a Blob Store. If you want to store HTML and not have it XML'ified, then the only option is to store it as a Blob. There is nothing wrong with storing and retrieving HTML as a Blob in eXist-db, and that will ensure the HTML is not modified. By saying it is not a bug, I am trying to say that eXist-db is not at fault. It is likely the manner of the user or tool that is storing that data into eXist-db that is at fault. If you want to have eXist-db store HTML natively, then you need to add a plethora of new features like we did in FusionDB to introduce a native HTML DOM, native storage for it, and a mapping onto the XML DOM for XPath and XQuery.
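As a rough illustration of the Blob route, the sketch below stores an HTML string as a binary resource and reads it back. It assumes eXist-db's xmldb and util extension modules, an existing collection /db/apps/test (the path from the issue description), and a user with write access. It writes to the database, so treat it as an experiment rather than a recommended workflow.

```xquery
xquery version "3.1";

import module namespace xmldb = "http://exist-db.org/xquery/xmldb";
import module namespace util = "http://exist-db.org/xquery/util";

(: Assumed collection and resource name, taken from the issue's test path. :)
let $collection := "/db/apps/test"
let $name := "test.html"
let $html := ``[<!DOCTYPE html>
<html><body><main></main><script>console.log(true && 1 < 2)</script></body></html>]``
(: Store the bytes as a binary resource, so no XML parsing takes place. :)
let $path := xmldb:store-as-binary($collection, $name, util:string-to-binary($html))
(: Read the blob back and decode it to a string for comparison. :)
let $stored := util:binary-to-string(util:binary-doc($path))
(: Should be true if the content was stored unmodified. :)
return $stored eq $html
```

The trade-off is the one described above: a binary resource cannot be queried as XML, so the templating and XPath use cases no longer apply to it directly.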
@joewiz Each of these APIs allows you to store either XML or Binary documents. It is the choice of the user or application as to whether they store their HTML as either:
Describe the bug
NOTE: This issue is about
<script>
Both are likely due to some serialization that takes place.
Expected behavior
A source file is opened and displayed as it is stored within the database.
A source file created or modified in eXide is stored unmodified.
To Reproduce
/db/apps/test/test.html in the database with the following contents
NOTE: You have to use means other than eXide to do that, obviously. Think: WebDAV, REST (curl), XML-RPC (xst)
(curl http://localhost:8080/exist/rest/db/apps/test/test.html)
NOTE: That the file has an XML declaration now might be due to the serialization on retrieval via curl
Integration Test (TODO)
For UI and browser based testing we use cypress.js
Context (please always complete the following information):
Build: eXist-6.2.0
Java: 1.8.0_362 (Azul Systems, Inc.)
OS: Mac OS X 12.6.3 (aarch64)
App Version: 3.5.0
Additional context
conf.xml? none