[subscribestar] Better extraction of content
The post content is structured like this:

```
<div class="post-content" data-role="post_content-text">
                <div class="trix-content">
                    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
                    <html>
                        <body>
                            <div>
                                Unspeakable thing are written here<br />
                                <br />
                                haiiiiiiiiiiiiiiii hi hi hiii its meee back againnn, plspls leave a comment if uuuu liked it mwah
                                &lt;3
                            </div>
                        </body>
                    </html>
                </div>
            </div>
            <div class="post-uploads
```

Currently we extract content with:

```
(extr('<div class="post-content', '<div class="post-uploads').partition(">")[2])
```

I propose we take just the contents of the `<body>` element:

```
extr('<body>', '</body>')
```

These tags only occur around the actual content.
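To illustrate the difference, here is a minimal sketch that runs both extractions on a shortened copy of the markup above. `make_extr` is a hypothetical, simplified stand-in for the extractor's sequential `extr` helper (it is not the gallery-dl implementation), and `SAMPLE` is an abbreviated version of the example post with the closing of the `post-uploads` tag assumed.

```python
def make_extr(txt):
    """Hypothetical stand-in for a sequential extractor.

    Each call searches from where the previous match ended and
    returns the text between `begin` and `end`.
    """
    pos = 0

    def extr(begin, end, default=""):
        nonlocal pos
        try:
            first = txt.index(begin, pos) + len(begin)
            last = txt.index(end, first)
        except ValueError:
            return default
        pos = last + len(end)
        return txt[first:last]

    return extr


# Abbreviated copy of the example markup (assumed closing of the last tag).
SAMPLE = (
    '<div class="post-content" data-role="post_content-text">\n'
    '<div class="trix-content">\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" '
    '"http://www.w3.org/TR/REC-html40/loose.dtd">\n'
    '<html><body><div>Unspeakable thing are written here<br />'
    '&lt;3</div></body></html>\n'
    '</div>\n</div>\n'
    '<div class="post-uploads">'
)

# Old approach: everything between the two wrapper divs, minus the
# remainder of the opening tag. Wrapper markup and DOCTYPE are kept.
old_content = make_extr(SAMPLE)(
    '<div class="post-content', '<div class="post-uploads').partition(">")[2]

# Proposed approach: only what sits inside <body>...</body>.
new_content = make_extr(SAMPLE)('<body>', '</body>')
```

In this sketch `old_content` still carries the `trix-content` wrapper, the DOCTYPE declaration, and the trailing closing tags, while `new_content` holds only the inner `<div>` with the post text.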

The content is then easier to use in filenames with the `!H`
formatter: `{content[:160]!H}`. The content as currently extracted
can't be decoded with it.
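As a rough illustration of what such a format string is meant to produce, the sketch below approximates the conversion with the standard library. It assumes `!H` removes HTML tags and unescapes HTML entities; `strip_html` is a hypothetical helper written for this example, not gallery-dl's own code, and the sample `body` is shortened from the post above.

```python
import html
import re


def strip_html(txt):
    """Rough approximation: drop tags, collapse whitespace, unescape entities."""
    no_tags = re.sub(r"<[^>]+>", " ", txt)
    return html.unescape(" ".join(no_tags.split()))


# Body-only content, as returned by extr('<body>', '</body>') (shortened).
body = '<div>Unspeakable thing are written here<br /><br />mwah &lt;3</div>'

# Mirrors the order in `{content[:160]!H}`: slice first, then convert.
label = strip_html(body[:160])
```

Note that in this sketch the slice is applied to the raw HTML before tags are removed, so the resulting text can be noticeably shorter than 160 characters.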
WyohKnott committed Jan 3, 2025
1 parent 5767c08 commit 22d4e84
Showing 1 changed file with 2 additions and 6 deletions.
8 changes: 2 additions & 6 deletions gallery_dl/extractor/subscribestar.py
@@ -137,9 +137,7 @@ def _data_from_post(self, html):
     "author_nick": text.unescape(extr('>', '<')),
     "date" : self._parse_datetime(extr(
         'class="post-date">', '</').rpartition(">")[2]),
-    "content" : (extr(
-        '<div class="post-content', '<div class="post-uploads')
-        .partition(">")[2]),
+    "content" : extr('<body>', '</body>')
 }

 def _parse_datetime(self, dt):
@@ -196,7 +194,5 @@ def _data_from_post(self, html):
     "author_nick": text.unescape(extr('alt="', '"')),
     "date" : self._parse_datetime(extr(
         '<span class="star_link-types">', '<')),
-    "content" : (extr(
-        '<div class="post-content', '<div class="post-uploads')
-        .partition(">")[2]),
+    "content" : extr('<body>', '</body>')
 }
