[subscribestar] Better extraction of content

The structure of the content looks like this:

```
<div class="post-content" data-role="post_content-text">
  <div class="trix-content">
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html>
      <body>
        <div>
          Unspeakable thing are written here<br />
          <br />
          haiiiiiiiiiiiiiiii hi hi hiii its meee back againnn, plspls leave a comment if uuuu liked it mwah <3
        </div>
      </body>
    </html>
  </div>
</div>
<div class="post-uploads
```

Currently we extract the content with:

```
(extr('<div class="post-content', '<div class="post-uploads').partition(">")[2])
```

I propose we take just the body part instead:

```
extr('<body>', '</body>')
```

The `<body>` tags only appear around the actual content. The result is then easier to use in the filename with the `!H` formatter: `{content[:160]!H}`. The content as currently extracted can't be decoded with it.
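The difference between the two extractions can be sketched in isolation. This is a minimal, self-contained sketch, not gallery-dl code: `extract_between` is a hypothetical stand-in for a single `extr(begin, end)` call, `to_plain_text` only approximates what the `!H` conversion is understood to do (remove HTML tags and unescape entities), and `SAMPLE` is a shortened, made-up version of the markup above.

```python
import html
import re

# Shortened, made-up stand-in for the post markup shown above.
SAMPLE = (
    '<div class="post-content" data-role="post_content-text">\n'
    '<div class="trix-content">\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" '
    '"http://www.w3.org/TR/REC-html40/loose.dtd">\n'
    '<html><body><div>Hello<br /><br />world &lt;3</div></body></html>\n'
    '</div>\n'
    '</div>\n'
    '<div class="post-uploads">'
)


def extract_between(txt, begin, end, default=""):
    """Hypothetical helper: text between the first `begin` and the next `end`."""
    try:
        first = txt.index(begin) + len(begin)
        last = txt.index(end, first)
    except ValueError:
        # One of the markers is missing; fall back to the default.
        return default
    return txt[first:last]


def to_plain_text(markup):
    """Rough approximation of `!H`: drop tags, unescape entities, tidy spaces."""
    without_tags = re.sub(r"<[^>]+>", " ", markup)
    return " ".join(html.unescape(without_tags).split())


# Old approach: everything between the wrapper divs, minus the rest of the
# opening tag. This still carries the doctype and the wrapper markup.
old_content = extract_between(
    SAMPLE, '<div class="post-content', '<div class="post-uploads'
).partition(">")[2]

# Proposed approach: only what sits inside <body>...</body>.
new_content = extract_between(SAMPLE, "<body>", "</body>")

plain = to_plain_text(new_content)
```

In this sketch `new_content` holds only the post's own markup, while `old_content` also includes the doctype declaration and the surrounding `<html>` and `<div>` wrappers.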
mikf · Jan 3, 2025 · 22d4e84
1 parent 5767c08
commit 22d4e84
Showing 1 changed file with 2 additions and 6 deletions.
diff --git a/gallery_dl/extractor/subscribestar.py b/gallery_dl/extractor/subscribestar.py
@@ -137,9 +137,7 @@ def _data_from_post(self, html):
             "author_nick": text.unescape(extr('>', '<')),
             "date"       : self._parse_datetime(extr(
                 'class="post-date">', '</').rpartition(">")[2]),
-            "content"    : (extr(
-                '<div class="post-content', '<div class="post-uploads')
-                .partition(">")[2]),
+            "content"    : extr('<body>', '</body>')
         }
 
     def _parse_datetime(self, dt):
@@ -196,7 +194,5 @@ def _data_from_post(self, html):
             "author_nick": text.unescape(extr('alt="', '"')),
             "date"       : self._parse_datetime(extr(
                 '<span class="star_link-types">', '<')),
-            "content"    : (extr(
-                '<div class="post-content', '<div class="post-uploads')
-                .partition(">")[2]),
+            "content"    : extr('<body>', '</body>')
         }