Replies: 8 comments
-
Use YAML config for blocklist generation
Instead of using one hardcoded .py file for each list (adblock.py, ublacklist.py...), use a single Python script and store the list configurations in YAML files.
The command would look like this:
It would parse YAML files looking like this:
There would be one YAML file per list. I have no idea when dashes should be used. What the "includes" values refer to would be temporarily hardcoded (see another feature request). They refer to specific TXT files within the /sources/ folder (domains.txt, urls.txt...).
Most of the formats are simple TXT files, but the script would also need to support generating CSV (Fediblockhole) and JSON (Peertube Isolation) files. The JSON format includes values such as the time of update and the action (add or remove, e.g. for a false positive). For CSV, it should also be able to use placeholder values such as {list name} (Mastodon uses a comment field in its blocklist format).
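A minimal sketch of the single-script idea, assuming the YAML file has already been loaded into a dict (for example with PyYAML's `yaml.safe_load`). Every key name here (`name`, `format`, `includes`, `comment`) and the CSV column names are hypothetical. The real Fediblockhole and Peertube Isolation schemas are not reproduced, so only TXT and a generic CSV are shown.

```python
import csv
import io

# Hypothetical list configuration, as it might look once loaded from YAML.
LIST_CONFIG = {
    "name": "Example list",
    "format": "csv",
    "includes": ["domains"],
    "comment": "Blocked by {list name}",
}


def render_list(config, sources):
    """Render the entries named in config["includes"] in the configured format.

    `sources` maps an include name to a list of entries, standing in for the
    TXT files of the /sources/ folder.
    """
    entries = [entry for key in config["includes"] for entry in sources[key]]
    fmt = config["format"]
    if fmt == "txt":
        return "\n".join(entries) + "\n"
    if fmt == "csv":
        # Fill the {list name} placeholder in the comment field.
        comment = config.get("comment", "").replace("{list name}", config["name"])
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(["domain", "comment"])  # assumed column names
        for entry in entries:
            writer.writerow([entry, comment])
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt}")
```

Adding a new output format would then mean adding one branch (or one renderer function) rather than one more script.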
-
Better header generation
Currently, headers are "hardcoded" TXT files. They should instead use placeholders for the different values. Some values would be dynamic (date, versioning); others should be stored in a configuration file, so that changing the name or description of the list in one general config file changes all the headers. Example for the adblock header:
Those values would be stored in a general YAML config file. Most of those values should be a single line, as a line break in the header can cause headaches. Users should also be able to add new values, as long as they use the same keyword in a custom header and in the general YAML config file.
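The placeholder idea could be sketched as below. The template text, the `{name}`/`{description}`/`{date}` keywords and the function name are assumptions made for illustration, not the project's actual header; the config dict stands in for the general YAML file.

```python
from datetime import date

# Hypothetical adblock-style header template with placeholders.
HEADER_TEMPLATE = (
    "! Title: {name}\n"
    "! Description: {description}\n"
    "! Last modified: {date}\n"
)


def render_header(template, config, today=None):
    """Fill a header template from static config values plus a dynamic date."""
    values = dict(config)
    values["date"] = (today or date.today()).isoformat()
    for key, value in values.items():
        # A line break inside a header value would break the header layout.
        if "\n" in str(value):
            raise ValueError(f"header value {key!r} must be a single line")
    # Raises KeyError if the template uses a keyword missing from the config.
    return template.format_map(values)
```

Because any key present in the config is available to the template, a custom keyword only needs to appear in both places to work.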
-
Proper source data automatic formatting
The current cleanup process (e.g. reducing a full URL to a domain) is handled by a bunch of bloated sed commands. We need better tools for that. The blocklist generation supports different kinds of data: domain, domain+path (quasi-URL), regexes, TLDs... All of them should be properly formatted so they can be put in the generated blocklists. For example, domain+path should delete the last
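As one possible replacement for the sed-based cleanup, the standard library's `urllib.parse` can reduce a full URL to a domain or to a domain+path entry. This is only a sketch: the function names are hypothetical, and dropping the query string and fragment from domain+path entries is an assumption, not a rule stated above.

```python
from urllib.parse import urlsplit


def _split(entry):
    """Parse an entry that may or may not start with a scheme."""
    entry = entry.strip()
    if "://" not in entry:
        # Without a scheme, urlsplit would treat the host as part of the path.
        entry = "//" + entry
    return urlsplit(entry)


def to_domain(entry):
    """Reduce a URL or quasi-URL to its lowercased host name."""
    return _split(entry).hostname or ""


def to_domain_path(entry):
    """Reduce a URL to host + path (scheme, query and fragment dropped)."""
    parts = _split(entry)
    return (parts.hostname or "") + parts.path
```

For instance, `to_domain("https://Example.com/some/page?x=1")` gives `example.com`.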
-
Separate support for subdomains
Most blocklist formats automatically block all the subdomains of a blocked domain. However, older DNS filters such as hosts need an explicit mention of every blocked subdomain. Currently, subdomains are manually added to the domains list when found. This causes two problems:
The ideal solution would be a regularly scheduled automatic search for registered subdomains for all the domains in the
This still raises a couple of issues:
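If subdomains were stored separately from the main domains list, only the formats that need them would have to expand them. A rough sketch for a hosts-style output, assuming a hypothetical mapping from each domain to its known subdomains and `0.0.0.0` as the sink address:

```python
def hosts_lines(domains, subdomains_by_domain, sink="0.0.0.0"):
    """Build hosts-format lines for each domain and its known subdomains.

    `subdomains_by_domain` is a hypothetical mapping kept apart from the
    main domains list, e.g. filled by a scheduled subdomain search.
    """
    lines = []
    seen = set()
    for domain in domains:
        for name in [domain, *subdomains_by_domain.get(domain, [])]:
            if name not in seen:  # avoid duplicate lines
                seen.add(name)
                lines.append(f"{sink} {name}")
    return lines
```

Formats that already block subdomains implicitly would simply ignore the mapping.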
-
Culling inactive addresses
The list is long and full of long-dead websites, and I'm not doing the cleanup by hand. Some of them might also rise from the dead, so it's better to keep them in the source. However, reducing the size of the distributed lists by removing currently inactive sites and URLs is a benefit.
This should be a task run regularly (maybe monthly) that pings all the entries from domains and domains+path. If an entry returns an error, it should be written to a TXT file along with its error information. This file is then used as an "allowlist" in order to remove these entries during the list pre-generation process (when the
If a site becomes active again, it should be removed from this culling list. There should also be an override mechanism, in case there is an error in the test process (site temporarily offline or broken during the test, or weird web shenanigans).
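The culling logic described above could be sketched as two small functions. The names are hypothetical, and the actual network check is passed in as a `check` callable (returning an error string, or `None` when the site responds) so that the logic stays testable without network access.

```python
def build_cull_list(entries, check):
    """Return {entry: error} for every entry whose check reports an error.

    The list is rebuilt from scratch on each run, so a site that becomes
    active again drops out of it automatically.
    """
    culled = {}
    for entry in entries:
        error = check(entry)
        if error is not None:
            culled[entry] = error
    return culled


def filter_entries(entries, culled, overrides=()):
    """Drop culled entries before generation, except manually overridden ones."""
    keep = set(overrides)
    return [e for e in entries if e not in culled or e in keep]
```

The source files are never modified; only the entries handed to the generation step are filtered.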
-
Better import
The current import mechanism just downloads a file, saves it somewhere as a .txt and runs the awful sed-based normalization on it. It works, but it lacks a lot of things. It should use YAML files to define which files are downloaded and what to do with them. The current script supports:
Other information might be useful, depending on how we decide to implement it.
There are two important problems to tackle, both related to the issue of source formatting.
When there is an error downloading a list, it should not replace the list with a blank file (currently implemented in bash). When a YAML config file is removed, should the data from the downloaded list be automatically removed from the sources?
Information from the YAML file should also be used to generate the aggregated-lists part of the repo's readme, using a markdown file with placeholder values as a template. This information can also be extracted directly from the list when it has a header (adblock format).
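The rule about failed downloads could look like the following sketch: the new content is written to a temporary file and only swapped in when the download succeeded and is not blank. The function name and the `fetch` callable are hypothetical; a real `fetch` might wrap `urllib.request.urlopen`, whose `URLError` is a subclass of `OSError`.

```python
import os
import tempfile


def replace_if_valid(path, fetch):
    """Replace `path` with fetched text only if the download worked.

    Returns True when the file was replaced, False when it was left alone.
    """
    try:
        data = fetch()
    except OSError:
        return False  # download failed: keep the previous list
    if not data.strip():
        return False  # blank response: keep the previous list
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(data)
    os.replace(tmp, path)  # swap the complete file into place
    return True
```

Writing to a temporary file in the same directory first means a crash mid-download never leaves a half-written list behind.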
-
Readme generation
Typing is hard, especially the part with all the different formats and the links. We should make that automatic using a YAML config (same as for the header generation) and markdown files (like the ones used for generating the list of imported lists). There should be a way to define the order in which the different markdown files are written into the readme, either via a YAML file or via their naming (1_intro.md, etc). The part listing the different lists should get its info from the list generation YAMLs (name, URL); the description lives in another md file with placeholder values, since writing in YAML is not a good experience.
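A minimal sketch of the naming-based ordering option, assuming markdown parts named with a numeric prefix (such as 1_intro.md) and `{keyword}` placeholders. The function name and the placeholder syntax are assumptions.

```python
from pathlib import Path


def build_readme(parts_dir, values):
    """Concatenate markdown parts in numeric-prefix order and fill placeholders."""

    def order(path):
        prefix = path.name.split("_", 1)[0]
        # Files without a numeric prefix are placed last, sorted by name.
        return (int(prefix) if prefix.isdigit() else float("inf"), path.name)

    parts = sorted(Path(parts_dir).glob("*.md"), key=order)
    text = "\n".join(
        part.read_text(encoding="utf-8").rstrip("\n") + "\n" for part in parts
    )
    for key, value in values.items():
        text = text.replace("{" + key + "}", str(value))
    return text
```

Sorting on the integer prefix rather than the raw file name keeps 10_lists.md after 2_formats.md.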
-
Sources definition
Currently, sources are organized in folders and what they contain is hardcoded in The Awful Bash Script. This should be handled with YAML files. They define values used in the list generation process (the
This example demonstrates the crude tagging system that the list currently uses. It relies on filenames to find the files to group. Should this be moved to a YAML frontmatter in the TXT files? That could enable a more robust optional tagging system with links and dependencies, plus a more granular list generation. It could also group and comment content automatically, as a lot of other blocking lists do (mostly those outputting one or two formats).
-
Feature wishlist: