[subscribestar] Better extraction of content
The content is structured like this:

```
<div class="post-content" data-role="post_content-text">
    <div class="trix-content">
        <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
        <html>
        <body>
        <div>
            Unspeakable thing are written here<br />
            <br />
            haiiiiiiiiiiiiiiii hi hi hiii its meee back againnn, plspls leave a comment if uuuu liked it mwah <3
        </div>
        </body>
        </html>
    </div>
</div>
<div class="post-uploads
```

Currently we extract the content with:

```
(extr('<div class="post-content', '<div class="post-uploads').partition(">")[2])
```

I propose we take only the body:

```
extr('<body>', '</body>')
```

The `<body>` tags only appear around the actual content. This makes the content easier to use in a filename with the `!H` formatter, e.g. `{content[:160]!H}`. The content as currently extracted can't be decoded with it.
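To make the difference concrete, here is a minimal, self-contained sketch of what each extraction returns. It does not use the project's own `extr` helper: `extract_between` and `strip_html` are hypothetical stand-ins written for this illustration, the sample markup is shortened, and `strip_html` is only a rough approximation of what an HTML-stripping formatter such as `!H` might do.

```python
import html
import re

# Shortened sample page, modelled on the structure shown above (illustrative only).
SAMPLE = (
    '<div class="post-content" data-role="post_content-text">'
    '<div class="trix-content">'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" '
    '"http://www.w3.org/TR/REC-html40/loose.dtd">'
    '<html><body><div>Hello<br /><br />world &lt;3</div></body></html>'
    '</div></div>'
    '<div class="post-uploads">'
)


def extract_between(page, begin, end):
    """Hypothetical helper: text between the first `begin` and the next `end`.

    Returns an empty string when either marker is missing.
    """
    start = page.find(begin)
    if start < 0:
        return ""
    start += len(begin)
    stop = page.find(end, start)
    if stop < 0:
        return ""
    return page[start:stop]


def strip_html(fragment):
    """Rough stand-in for an HTML-stripping formatter (an assumption, not the real one):
    replace tags with spaces, unescape entities, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", fragment)
    return " ".join(html.unescape(text).split())


# Current approach: everything between the two wrapper divs, minus the rest
# of the opening tag. The doctype and <html> wrapper are still included.
old_content = extract_between(
    SAMPLE, '<div class="post-content', '<div class="post-uploads'
).partition(">")[2]

# Proposed approach: only what sits inside <body>...</body>.
new_content = extract_between(SAMPLE, "<body>", "</body>")

print(old_content)
print(new_content)
print(strip_html(new_content)[:160])
```

The sketch only shows what each extraction returns. It does not attempt to reproduce how the real `!H` formatter behaves on the old content.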