docs: script to convert HTML manual pages to markdown #4620
Script to convert recursively all .html files to .md (GitHub flavoured Markdown). (see related OSGeo#3849)
Is this only to have the script in the repo? After that, is it supposed to be used once or every time? (I assume that the docs in the repo will soon be in Markdown instead of HTML.) If it's supposed to be used only once, does it need to be part of the repo? If we ever convert all of our HTML files to Markdown (plus formatting), I suggest an intermediate PR that only does the rename, so that the GitHub history and blame follow the file, instead of deleting and adding files (which would happen if the Markdown formatting involved a lot of changes on top of the renames).
I see the following use cases:
Yes, an idea is to do that with multiple PRs:
I don't fully understand the "rename" comment.
I see the value of having this available for use outside of a one-time use, like add-ons outside of the osgeo/grass-addons repo.
I imagined that the switch from HTML to Markdown as the docs source would be done at once, with no interim period where both formats are in the repo. To help navigate the history in the future, I was suggesting an interim commit in the main branch that renames all HTML files, changing the extension from .html to .md without any content changes, followed directly by a commit that applies the conversion to these .md files that are in fact still HTML. HTML can be used in Markdown to some extent. However, the two commits must come right after each other, as I don't expect the HTML builds to work at that interim commit. I think it will greatly help when navigating history, since one can keep going back through the commits of the renamed file instead of stopping at the conversion.
The history would be nice and the HTML==MD is a nice trick, but...
...I'm afraid we can't just replace the server infrastructure for HTML with Markdown/mkdocs at the same time as merging the PR, so I think the change needs to be gradual in one way or another.
If the build process generates the HTML from the Markdown, what changes would be needed on the web server infrastructure? Maybe a test with a staging instance would help. But I'll keep thinking about what might be best here.
Valid point about the git history. Suggestion:
That's what I was thinking about if we needed to have HTML too in parallel (vs. a clean cutoff, switching to Markdown at once).
We could prepare that in one PR and then "Allow rebase merging" for a couple of minutes to merge that PR with its two commits. This would neither break nor work around the CI.
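The rename-first idea discussed above can be sketched in a few lines. This is only an illustration, not part of the PR's script: the function names `plan_renames` and `apply_renames` are hypothetical, and in a real checkout the rename would be done with `git mv` and committed on its own before the conversion commit.

```python
from pathlib import Path


def plan_renames(root: Path) -> list[tuple[Path, Path]]:
    """Pair every .html file under root with the .md path it would get."""
    pairs = []
    for src in sorted(root.rglob("*.html")):
        pairs.append((src, src.with_suffix(".md")))
    return pairs


def apply_renames(pairs: list[tuple[Path, Path]]) -> None:
    """Rename files on disk without touching their content."""
    for src, dst in pairs:
        if dst.exists():
            # Hypothetical safety check: never clobber an existing .md file.
            raise FileExistsError(f"refusing to overwrite {dst}")
        src.rename(dst)
```

Because the first commit changes only file names, git sees each file as 100% similar to its predecessor, which is what lets history and blame follow the file across the later conversion commit.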
The rename way also seems appealing because it is more natural: even with several conversions done already here, I get plenty of HTML tags, and some perhaps need to stay:

    cd dist.x86_64-pc-linux-gnu/docs/mkdocs/source
    grep -Eor '<[a-zA-Z][^>]*>' *.md | grep -Ev '<https?://[^>]+>'
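For a per-tag summary rather than a raw list of matches, the same check can be sketched in Python. The two patterns are copied from the grep call above; the helper name `leftover_tags` is hypothetical, and the sketch only roughly mirrors the pipeline (it works on one string instead of recursing over files).

```python
import re
from collections import Counter

# Same patterns as in the grep pipeline: opening HTML-like tags,
# minus Markdown autolinks such as <https://example.org>.
TAG_RE = re.compile(r"<[a-zA-Z][^>]*>")
AUTOLINK_RE = re.compile(r"<https?://[^>]+>")


def leftover_tags(markdown: str) -> Counter:
    """Count remaining HTML-like opening tags by lowercase tag name."""
    counts = Counter()
    for match in TAG_RE.finditer(markdown):
        token = match.group(0)
        if AUTOLINK_RE.fullmatch(token):
            continue  # an autolink, not an HTML tag
        name = re.match(r"<([a-zA-Z][a-zA-Z0-9]*)", token).group(1).lower()
        counts[name] += 1
    return counts
```

Calling `leftover_tags(path.read_text())` for each converted file and summing the counters would show which tags dominate, which may help decide which ones need to stay.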
… in relative URLs; simplify path to utils
Using the latest version of this script, I still get a lot of warnings:
The reason is (example, see
Seems we overlooked that in #3849? Any idea, @landam? As the script of this PR runs before the parser is invoked, a change in
Change in f91a111 to avoid undesired escaping of Example: Before: After:
…ll URLs as before; fix %20 to dash for mkdocs
Is the link conversion regex only applied to our links, not all external references to other websites?
With the
I didn't catch the "not" part in the sed call/syntax, so your explanation makes sense in this case.
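To illustrate the scoping question above (rewrite only local manual-page links, leave external references alone), here is a minimal Python sketch. It is not the regex or sed call used by the PR's script; the helper name `rewrite_local_links` and the link pattern are assumptions made for the example.

```python
import re

# Markdown inline link: [text](target); a deliberately simple pattern.
LINK_RE = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")
# Anything starting with a URL scheme (https:, mailto:, ...) counts as external.
SCHEME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*:")


def rewrite_local_links(markdown: str) -> str:
    """Change .html to .md in relative link targets only."""

    def fix(match: re.Match) -> str:
        text, target = match.group(1), match.group(2)
        if SCHEME_RE.match(target) or target.startswith("//"):
            return match.group(0)  # external link: leave untouched
        path, sep, fragment = target.partition("#")
        if path.endswith(".html"):
            path = path[: -len(".html")] + ".md"
        return f"[{text}]({path}{sep}{fragment})"

    return LINK_RE.sub(fix, markdown)
```

The negative test on the URL scheme plays the same role as the "not" condition mentioned for the sed call: links that look external are skipped, everything else is rewritten.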
To prepare for the migration, it is a good idea to have this merged and properly tested in the further PRs that also prepare the migration. Adjustments can then be added as more edge cases are found before the final switch. As I see it, this script is an internal migration helper tool.
Test submission of the conversion of all HTML manual pages to Markdown using the `pandoc`-based converter script (see OSGeo#4620). For figure code conversion issues, see OSGeo#4864.
The initial commit simply uncomments the Markdown compilation lines from OSGeo#3849. This requires the actual Markdown files to be present, so this breaks the CI. This will stay like this for now. The PR is meant to be tested together with the Markdown files generated by the script in OSGeo#4620 (switch to this branch after generating the files). Subsequent commits will add fixes for the compilation (missing tools, files, etc.).
Not necessarily for this PR, but I found some issues with link conversion in these files:
A parenthesis with "see" is turned into a link, or a simple textual reference is turned into a link that happens to be wrong:
There is more, but I checked these specifically while exploring the current issues. To be fixed manually, but it is not immediately clear to me how:
@wenzeslaus, all: In #4864 (comment) @ninsbl has suggested converting all HTML manual pages into Markdown with the markitdown converter. Shall we continue with pandoc or rather use markitdown?
Hi guys, the markitdown package by Microsoft is just a very thin wrapper around markdownify, meaning it basically calls markdownify when converting HTML documents... Thus I would probably go with markdownify rather than markitdown.

The markdownify developers were relatively fast to address issues I raised with the tool when converting GRASS GIS manuals:

So, pros for markdownify would be its handling of images and its being a lightweight Python tool; cons are the lack of link conversions and the not yet perfect definition lists. An option would be to tweak markdownify with custom converters...

That said, I merely explored options when stumbling upon one, and issues arose with images. Since markdownify is not near-perfect either, please feel free to continue with pandoc...
@ninsbl I've started working on additional pandoc Lua filters that enforce markdownlint rules and fix other issues with embedded HTML. Can we merge this PR so I can open a new one with the updates? If we are not satisfied with pandoc after reviewing the new outputs, then I think we can explore other tooling.
Sure. I did not mean to delay. Sorry for the confusion...
I looked at some of the markitdown outputs and saw some basic issues here and there which I think you (@neteler) already checked/fixed in the pandoc version. (Perhaps it is a different set of issues than with pandoc, but definitely not perfect as is.)
Script to convert recursively all .html files to .md (GitHub flavoured Markdown).
This is not only relevant for the conversion in GRASS-core but also for GRASS-Addons.
(see related #3849)
Suggestions needed for: