docs: script to convert HTML manual pages to markdown #4620

neteler · 2024-10-31T08:32:19Z

Script to convert recursively all .html files to .md (GitHub flavoured Markdown).
This is not only relevant for the conversion in GRASS-core but also for GRASS-Addons.

(see related #3849)

Suggestions needed for:

path treatment of the Lua file (e.g. on Windows)
quality of generated markdown
figure conversion issues, see [Bug] Conversion of HTML manual pages to markdown fails for HTML figure code #4864

echoix · 2024-10-31T19:55:24Z

Is this only to have the script in the repo? After that, is it supposed to be one time use or always used? (I assume that the docs in the repo will be in markdown instead of html soon..)

If it's supposed to be used only once, does it need to be part of the repo?

If ever we convert all of our html files to markdown (+formatting), I suggest to have an intermediate PR that only does the rename, to have the GitHub history +blame follow the file, instead of deleting and adding files (which would happen if the markdown formatting would have a lot of changes + renames)

neteler · 2024-10-31T22:35:23Z

> Is this only to have the script in the repo? After that, is it supposed to be one time use or always used? (I assume that the docs in the repo will be in markdown instead of html soon..)
>
> If it's supposed to be used only once, does it need to be part of the repo?

I see the following use cases:

bulk conversion of all core manual pages (one time unless we need to optimize this script for better markdown output/modifications of HTML residuals, ...)
bulk conversion of all addon manual pages (same as for core)
helper script for addons in non-standard repos (see for example the "grass-gis-addons" tag: https://github.com/topics/grass-gis-addons). This should be offered long-term, IMHO.

> If ever we convert all of our html files to markdown (+formatting), I suggest to have an intermediate PR that only does the rename, to have the GitHub history +blame follow the file, instead of deleting and adding files (which would happen if the markdown formatting would have a lot of changes + renames)

Yes, an idea is to do that with multiple PRs:

convert HTML to MD, keep the HTML (PR 1, 2, ... n, i.e., submit in chunks to avoid too large PRs)
have a (short?) interim phase of both HTML and MD in parallel, esp. for quality control
remove the HTML files in a different PR, keeping MD only

The "rename" comment I don't fully understand.

echoix · 2024-10-31T22:47:02Z

> > Is this only to have the script in the repo? After that, is it supposed to be one time use or always used? (I assume that the docs in the repo will be in markdown instead of html soon..)
> >
> > If it's supposed to be used only once, does it need to be part of the repo?
>
> I see the following use cases:
>
> bulk conversion of all core manual pages (one time unless we need to optimize this script for better markdown output/modifications of HTML residuals, ...)
>
> bulk conversion of all addon manual pages (same as for core)
>
> helper script for addons in non-standard repos (see for example the "grass-gis-addons" tag: https://github.com/topics/grass-gis-addons). This should be offered long-term, IMHO.

I see the value of having this available for use outside of a one-time use, like add-ons outside of the osgeo/grass-addons repo.

> > If ever we convert all of our html files to markdown (+formatting), I suggest to have an intermediate PR that only does the rename, to have the GitHub history +blame follow the file, instead of deleting and adding files (which would happen if the markdown formatting would have a lot of changes + renames)
>
> Yes, an idea is to do that with multiple PRs:
>
> convert HTML to MD, keep the HTML (PR 1, 2, ... n, i.e., submit in chunks to avoid too large PRs)
>
> have a (short?) interim phase of both HTML and MD in parallel, esp. for quality control
>
> remove the HTML files in a different PR, keeping MD only
>
> The "rename" comment I don't fully understand.

I imagined that passing from html to markdown as the docs source would be done at once. Thus, no interim with both formats in the repo. Thus, to help navigating the history for the future, I was suggesting to have an interim commit in the main branch that renamed all html files to change the extension of html to md, without any content changes, and directly after, applying the conversion to md in these md files that are in fact html. Html can be used in md to some extent. However these two must be done right after the other, as I don't expect the html builds to be working in that interim commit. But I think it will greatly help navigating history by allowing to continue going back commits of the renamed file (instead of stopping there).

wenzeslaus · 2024-11-01T17:47:36Z

> I was suggesting to have an interim commit in the main branch that renamed all html files to change the extension of html to md, without any content changes, and directly after, applying the conversion to md in these md files that are in fact html.

The history would be nice and the HTML==MD is a nice trick, but...

> However these two must be done right after the other, as I don't expect the html builds to be working in that interim commit.

...I'm afraid we can't just replace the server infrastructure for HTML with Markdown/mkdocs at the same time as merging the PR, so I think the change needs to be gradual in one way or another.

echoix · 2024-11-01T21:50:52Z

If the build process generates the html from the md, what changes would be needed on the web server infrastructure?

Maybe a test with a staging instance might help. But I'll keep thinking about what might be best here..

neteler · 2024-11-08T16:27:05Z

Valid point about the git history.

Suggestion:

we git move the .html to .md (so history is kept) and commit
we copy it back to .html (to keep them for a little while in parallel; these files do not have the git history), add and commit
we replace HTML content of the now .md file with the markdown content and commit
we compare both (quality control)
eventually we drop the .html file
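
The first three steps of this suggestion can be sketched as a list of commands for a single manual page. This is only an illustration under assumptions: the helper name `rename_plan`, the commit messages and the example path are hypothetical and not part of this PR, the conversion itself is not shown, and nothing is executed. Steps 4 and 5 (comparison and later removal of the .html copy) are left out.

```python
import shlex
from pathlib import PurePosixPath


def rename_plan(html_path: str) -> list[list[str]]:
    """Build (but do not run) the commands for one manual page.

    Hypothetical helper for illustration; commit messages are made up.
    """
    html = PurePosixPath(html_path)
    md = html.with_suffix(".md")  # r.in.gdal.html -> r.in.gdal.md
    return [
        # Step 1: pure rename, no content change, so git can follow the file
        ["git", "mv", str(html), str(md)],
        ["git", "commit", "-m", f"docs: rename {html.name} to {md.name}"],
        # Step 2: bring back an .html copy for the parallel phase
        # (this copy starts a new history)
        ["cp", str(md), str(html)],
        ["git", "add", str(html)],
        ["git", "commit", "-m", f"docs: keep {html.name} during the interim phase"],
        # Step 3: after the .md content has been replaced by the converted
        # Markdown (conversion not shown here), commit the result
        ["git", "add", str(md)],
        ["git", "commit", "-m", f"docs: convert {md.name} to Markdown"],
    ]


if __name__ == "__main__":
    # Example path is an assumption chosen for illustration
    for command in rename_plan("raster/r.in.gdal/r.in.gdal.html"):
        print(shlex.join(command))
```

Git does not record renames; it detects them from content similarity, which is why the first commit should not change the file content.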

echoix · 2024-11-08T17:40:10Z

> Valid point about the git history.
>
> Suggestion:
>
> we git move the .html to .md (so history is kept) and commit
>
> we copy it back to .html (to keep them for a little while in parallel; these files do not have the git history), add and commit
>
> we replace HTML content of the now .md file with the markdown content and commit
>
> we compare both (quality control)
>
> eventually we drop the .html file

That's what I was thinking about if we needed to have html too in parallel (vs a clean cutoff, switching to Md at once).

wenzeslaus · 2024-11-08T19:05:46Z

> we git move the .html to .md (so history is kept) and commit
>
> we copy it back to .html (to keep them for a little while in parallel; these files do not have the git history), add and commit

We could prepare that in one PR and then "Allow rebase merging" for a couple of minutes to merge that PR with its two commits. This would not break or work around the CI.

wenzeslaus · 2024-11-08T19:26:25Z

The rename way seems also appealing because it is more natural: Even with several conversions done already here, I get plenty of HTML tags, some perhaps need to stay.

cd dist.x86_64-pc-linux-gnu/docs/mkdocs/source
grep -Eor '<[a-zA-Z][^>]*>' *.md | grep -Ev '<https?://[^>]+>'

...
r.in.wms.md:<div data-align="center" style="margin: 10px">
r.in.xyz.md:<sup>
r.li.cwed.md:<span class="small">
...
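
The same check can be sketched in Python, for example to count residual tags per page. This is a minimal sketch that mirrors the two grep expressions above; the function name `residual_tags` is hypothetical, and like the grep call it does not skip tags that appear inside code blocks.

```python
import re
from collections import Counter

# Same patterns as in the grep pipeline above
TAG = re.compile(r"<[a-zA-Z][^>]*>")
AUTOLINK = re.compile(r"<https?://[^>]+>")


def residual_tags(markdown: str) -> Counter:
    """Count opening HTML tags left in Markdown text, ignoring <http(s)://...> autolinks."""
    counts = Counter()
    for match in TAG.finditer(markdown):
        tag = match.group(0)
        if AUTOLINK.fullmatch(tag):
            continue  # Markdown autolink, not an HTML tag
        counts[tag] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "See <https://grass.osgeo.org>.\n"
        '<div data-align="center" style="margin: 10px">\n'
        "x<sup>2</sup> and y<sup>3</sup>\n"
    )
    for tag, count in residual_tags(sample).most_common():
        print(count, tag)
```

Closing tags such as `</sup>` are not counted because the pattern requires a letter directly after `<`.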

neteler · 2024-11-15T12:48:33Z

Using the latest version of this script, I still get a lot of warnings:

...
WARNING -  Doc file 'db.execute.md' contains a link 'topic_attribute_table.html', but the target is not found among documentation files.
WARNING -  Doc file 'db.execute.md' contains a link 'keywords.html#SQL', but the target 'keywords.html' is not found among documentation files. Did
           you mean 'keywords.md#SQL'?
WARNING -  Doc file 'db.in.ogr.md' contains a link 'database.html', but the target is not found among documentation files. Did you mean
           'database.md'?
WARNING -  Doc file 'db.in.ogr.md' contains a link 'topic_import.html', but the target is not found among documentation files. Did you mean
           'topic_import.md'?
WARNING -  Doc file 'db.in.ogr.md' contains a link 'keywords.html#attribute%20table', but the target 'keywords.html' is not found among documentation
           files. Did you mean 'keywords.md#attribute%20table'?
WARNING -  Doc file 'db.login.md' contains a link 'database.html', but the target is not found among documentation files. Did you mean 'database.md'?
WARNING -  Doc file 'db.login.md' contains a link 'topic_connection_settings.html', but the target is not found among documentation files.
...

The reason is (example, see KEYWORDS section where .md should be present rather than .html):

---
name: db.in.ogr
description: Imports attribute tables in various formats.
keywords: database, import, attribute table
---

# db.in.ogr

## NAME

***db.in.ogr*** - Imports attribute tables in various formats.

### KEYWORDS

[database](database.html),
[import](topic_import.html),
[attribute table](keywords.html#attribute%20table)

### SYNOPSIS

**db.in.ogr**
...

Seems we overlooked that in #3849? Any idea, @landam?

As the script of this PR runs before the parser is invoked a change in man/build_md.py or around might be needed?

landam · 2024-11-23T14:55:58Z

> Seems we overlooked that in #3849? Any idea, @landam?
>
> As the script of this PR runs before the parser is invoked a change in man/build_md.py or around might be needed?

For the record, solved by #4740

neteler · 2024-11-24T17:07:41Z

Change in f91a111 to avoid undesired escaping of $ in plain text:

Example: dist.x86_64-pc-linux-gnu/docs/mkdocs/site/grass.html#examples

Before:

After:

echoix · 2024-12-05T23:29:07Z

Is the link conversion regex only applied to our links, not all external references to other websites?

neteler · 2024-12-06T00:20:24Z

> Is the link conversion regex only applied to our links, not all external references to other websites?

With the sed part, URLs starting with "http[s]" are not modified; only relative links pointing to other internal manual pages are converted.
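
The rule described here can be sketched in Python. This is not the sed expression from the script, only an illustration of the described behaviour under assumptions: relative targets ending in .html are switched to .md (with or without an #anchor), targets starting with http:// or https:// are left alone, and %20 in the anchor of a relative link becomes a dash. The function names are hypothetical, and link targets containing parentheses or spaces are not handled.

```python
import re

# Inline Markdown link target: ...](target)
LINK_TARGET = re.compile(r"\]\(([^)\s]+)\)")


def convert_link_target(target: str) -> str:
    """Rewrite one relative link target; full http(s) URLs are returned unchanged."""
    if re.match(r"https?://", target):
        return target
    path, sep, anchor = target.partition("#")
    if path.endswith(".html"):
        path = path[: -len(".html")] + ".md"
    # Assumption: the generated anchors use dashes where the HTML used %20
    anchor = anchor.replace("%20", "-")
    return path + sep + anchor


def convert_links(markdown: str) -> str:
    """Apply convert_link_target to every inline link in the text."""
    return LINK_TARGET.sub(
        lambda m: "](" + convert_link_target(m.group(1)) + ")", markdown
    )


if __name__ == "__main__":
    keywords = (
        "[database](database.html),\n"
        "[import](topic_import.html),\n"
        "[attribute table](keywords.html#attribute%20table)\n"
    )
    print(convert_links(keywords))
```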

Requested changes were addressed

echoix · 2024-12-09T22:20:25Z

> > Is the link conversion regex only applied to our links, not all external references to other websites?
>
> With the sed part URLs with "http[s]" will not be modified, only relative links pointing to other internal manual pages.

I didn't catch the "not" part in the sed call/syntax, so your explanation makes sense in this case

echoix

To prepare for the migration, it is a good idea to have this merged, and properly tested inside the further PRs that also prepare the migration. Adjustments could then be added when finding more edge cases before the final switch. This script is an internal migration tool helper as I see it.

Test submission of conversion of all HTML manual pages to markdown using the `pandoc` based converter script (see OSGeo#4620). For figure code conversion issues, see OSGeo#4864

The initial commit simply uncomments the Markdown compilation lines from OSGeo#3849. This requires the actual Markdown files to be present, so this breaks the CI. This will stay like this for now. The PR is meant to be tested together with the Markdown files generated by the script in OSGeo#4620 (switch to this branch after generating the files). Subsequent commits will add fixes of the compilation (missing tools, files, etc.).

wenzeslaus · 2025-02-05T01:57:45Z

Not necessarily for this PR, but I found some issues with link conversion in these files:

.html extension not converted to .md (other links in the same file have .md):

'imageryintro.md' contains a link 'r.series.html'
'r.in.gdal.md' contains a link 'i.rectify.html'
'r.semantic.label.md' contains a link 'r.support'

.html dropped (Markdown does not have any extension):

'r.lake.md' contains a link 'r.grow' & 'r.lake.md' contains a link 'r.mapcalc' & 'r.lake.md' contains a link 'r.mapcalc'

Parentheses with "see" is turned into a link or simple textual reference is turned into a link which happens to be wrong:

'i.smap.md' contains a link '#ref1' & 'i.smap.md' contains a link '#ref2' & 'i.smap.md' contains a link '#mflag.md'
'r.in.poly.md' contains a link '#format' (anchor name is )

There is more, but I checked these specifically while exploring the current issues.

To be fixed manually, but it is not immediately clear to me how:

Use of custom id attribute in the HTML in r.surf.idw.html: <h3 id="minuse">Surface-generation error analysis</h3>
More special characters removed from anchor than with HTML: 'v.random.md' contains a link '#stratified-random-sampling:-random-sampling-from-vector-map-by-attribute'
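
The anchor mismatch in the last item can be illustrated with a small slug function. This is a sketch under assumptions: it approximates the default slug rule of the Python-Markdown `toc` extension (ASCII folding, dropping characters other than word characters, whitespace and hyphens, then joining with dashes) and should be checked against the installed version. The name `approx_slug` is hypothetical, and the heading text is inferred from the link above.

```python
import re
import unicodedata


def approx_slug(heading: str, separator: str = "-") -> str:
    """Approximate the heading-to-anchor rule; not the library's own function."""
    # Fold to ASCII, dropping characters that have no ASCII form
    value = unicodedata.normalize("NFKD", heading)
    value = value.encode("ascii", "ignore").decode("ascii")
    # Remove everything except word characters, whitespace and hyphens
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    # Collapse runs of whitespace and hyphens into one separator
    return re.sub(r"[-\s]+", separator, value)


if __name__ == "__main__":
    # Heading text assumed from the reported link in v.random.md
    heading = "Stratified random sampling: Random sampling from vector map by attribute"
    link_anchor = (
        "stratified-random-sampling:-random-sampling-from-vector-map-by-attribute"
    )
    print(approx_slug(heading))
    print(approx_slug(heading) == link_anchor)
```

Under this rule the colon is dropped from the generated anchor, so a link that keeps `:-` does not match it.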

neteler · 2025-02-05T11:35:51Z

@wenzeslaus, all:

In #4864 (comment) @ninsbl has suggested to convert all HTML manual pages into markdown with the markitdown converter.

Shall we continue with pandoc or rather use markitdown?

ninsbl · 2025-02-05T11:55:53Z

> @wenzeslaus, all:
>
> In #4864 (comment) @ninsbl has suggested to convert all HTML manual pages into markdown with the markitdown converter.
>
> Shall we continue with pandoc or rather use markitdown?

Hi guys, the markitdown package by Microsoft is just a very thin wrapper around markdownify, meaning it basically calls markdownify when converting HTML documents... Thus I would probably go rather with markdownify than markitdown

The markdownify developers were relatively fast to address issues I raised with the tool when converting GRASS GIS manuals:
matthewwithanm/python-markdownify#176
matthewwithanm/python-markdownify#173

So, pros for markdownify would be handling of images, it being a light-weight Python tool; cons are lack of link conversions, not yet perfect definition lists. An option would be to tweak markdownify with custom converters...

That said, I merely explored options when stumbling upon one and issues arose with images. Since markdownify is not near-perfect either, please feel free to continue with pandoc...

cwhite911 · 2025-02-05T14:27:23Z

@ninsbl I've started working on additional pandoc Lua filters that enforce markdownlint rules and fix other issues with embedded HTML.

Can we merge this PR so I can open a new one with the updates?

If we are not satisfied with pandoc after reviewing the new outputs, then I think we can explore other tooling.

ninsbl · 2025-02-05T14:40:52Z

Sure. I did not mean to delay. Sorry for the confusion...

wenzeslaus · 2025-02-06T03:51:29Z

> Shall we continue with pandoc or rather use markitdown?

I looked at some of the markitdown outputs and saw basic issues here and there which I think you (@neteler) already checked/fixed in the pandoc conversion. (Perhaps it is a different set of issues than with pandoc, but definitely not perfect as is.)

docs: script to convert HTML manual pages to markdown

cbb2324

Script to convert recursively all .html files to .md (GitHub flavoured Markdown). (see related OSGeo#3849)

neteler added the manual (Documentation related issues) and docs labels Oct 31, 2024

neteler added this to the 8.5.0 milestone Oct 31, 2024

neteler self-assigned this Oct 31, 2024

landam mentioned this pull request Oct 31, 2024

docs: Generate manual from markdown using mkdocs #3849

Merged

31 tasks

wenzeslaus requested changes Nov 8, 2024

wenzeslaus previously requested changes Nov 8, 2024

neteler and others added 2 commits November 15, 2024 11:52

Merge branch 'main' into manual_html2md

5972b2f

HTML: Process the tmp file to selectively replace .html with .md only in relative URLs; simplify path to utils

023f943

neteler added 2 commits November 20, 2024 00:14

Merge branch 'main' into manual_html2md

4d2f822

Merge branch 'main' into manual_html2md

2076251

landam requested a review from wenzeslaus November 23, 2024 14:56

neteler and others added 4 commits November 23, 2024 18:41

Merge branch 'main' into manual_html2md

7524c52

added signal handler to cleanup at user break

8ccc15d

switch to bash

80c5ce5

un-escape in text (e.g. in grass.html#examples)

f91a111

Merge branch 'main' into manual_html2md

df635eb

neteler mentioned this pull request Dec 5, 2024

[Bug] docs: remaining issues with markdown manual using mkdocs #4748

Open

15 tasks

grass_html2md.sh: also convert relative URLs with #anchor but keep full URLs as before; fix %20 to dash for mkdocs

dcf1572

echoix previously approved these changes Dec 9, 2024

neteler mentioned this pull request Dec 20, 2024

[Bug] Conversion of HTML manual pages to markdown fails for HTML figure code #4864

Open

4 tasks

neteler mentioned this pull request Dec 20, 2024

manual: conversion of all HTML manual pages to markdown #4865

Draft

change markdown fenced code blocks from bash to shell

8f04336

neteler dismissed echoix’s stale review via 8f04336 December 22, 2024 11:42

wenzeslaus mentioned this pull request Feb 4, 2025

doc: Enable Markdown doc compilation #5048

Open

Merge branch 'main' into manual_html2md

ec845be

petrasovaa approved these changes Feb 5, 2025

petrasovaa removed the request for review from wenzeslaus February 5, 2025 17:22

petrasovaa merged commit 560e6d2 into OSGeo:main Feb 5, 2025
28 checks passed

neteler deleted the manual_html2md branch February 5, 2025 17:33

cwhite911 mentioned this pull request Feb 5, 2025

docs: HTML to Markdown lua filters #5054

Merged

wenzeslaus mentioned this pull request Feb 6, 2025

doc: Add Markdown files #5057

Merged
