V1
This is a complete rewrite of the library to use xmltodict and pydantic

Notable changes:
- Ditched bs4
- Now using xmltodict and pydantic
- Removed limit option
- Parser now uses classmethods

* 6cc67f9 Uncomment ci stuff
* fb139f0 Add better Tag docs
* 9b14325 Fix tests after refactor
* e0b6a3a Rewrite Parser to classmethods, add basic docs
* 7708c77 Update Tag docstring and run doctests in ci.yml
* 3130ca1 Rename RSSFeed->RSS, RSSBaseModel->XMLBaseModel
* 8f763d5 Scrap all of the wrap/unwrap work; Improve conftest fixtures; Add support for self-closing tags; Set every field to be a Tag; Add json/dict_plain and tests for it; Ignore unused imports for all inits
* e9e841a Update sample jsons
* fc02cf1 Add wrap/unwrap population tests
* e02a007 Add tests for wrap/unwrap chaining (renamed from with/without)
* c436ce4 Add autogenerated dunder methods to Tag
* c88388c Fix windows charmap for tests
* 329765a Fix datetime tests
* 2147f9a Remove push rule from ci until V2 is done
* 1e44298 Add with/without_tags factory to all schemas
* bd31f3c Fix tests with item, add apology_line tests
* d5a80f4 Add items to channel [WIP]
* 49db408 Add datetime comparison tests; Refactor CI a bit; Allow schema object mutation; Add current and future todos; Add IPython to dev deps; Clean up README a bit; [WIP] Add more rss samples for test
* 5a2fcb4 Remove 3.10 syntax
* a07aa9c bump setup python to v4
* 955b1ff Fix 3.12 version
* b9d64c6 Replace flake8 with ruff
* 908d2b0 Fix ci.yml
* dd75c66 Update cron
* 461eb82 Add no category attr test, remove unused file
* c99b985 More updates to V2
* 1a1d20e Backup before os reinstall
* 2cad195 Temp commit, reword later
* e96faba Intermediate commit, added models, fixing linting and them
dhvcc authored May 31, 2023
2 parents a13e8fa + 6cc67f9 commit 5c20e0e
Showing 35 changed files with 3,308 additions and 323 deletions.
5 changes: 0 additions & 5 deletions .flake8

This file was deleted.

8 changes: 0 additions & 8 deletions .github/dependabot.yml

This file was deleted.

24 changes: 11 additions & 13 deletions .github/workflows/ci.yml
@@ -1,21 +1,22 @@
name: Lint and test

on:
schedule:
- cron: "0 0 1 * *"
# TODO: Uncomment after V2 is finished
push:
paths-ignore:
- '.github/**'
- '!.github/workflows/ci.yml'
- '.gitignore'
- 'README.md'
- ".gitignore"
- "README.md"
pull_request:

jobs:
build:
test:
strategy:
max-parallel: 6
matrix:
os: [ "ubuntu-latest", "windows-latest", "macos-latest" ]
python-version: [ 3.7, 3.8, 3.9, '3.10' ]
python-version: [ "3.8", "3.9", "3.10", "3.11"]

runs-on: ${{ matrix.os }}

@@ -27,7 +28,7 @@ jobs:

- name: Set up Python ${{ matrix.python-version }} on ${{ matrix.os }}
id: setup-python
uses: actions/setup-python@v3
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
cache-dependency-path: pyproject.toml
@@ -37,14 +38,11 @@ jobs:
if: steps.setup-python.outputs.cache-hit != 'true'
run: poetry install

- name: Lint code with flake8
run: poetry run flake8

- name: Lint code with black
run: poetry run black --check .

- name: Lint code with isort
run: poetry run isort --check-only .
- name: Lint code with ruff
run: poetry run ruff check .

- name: Test code with pytest
run: poetry run pytest
run: poetry run pytest --doctest-modules
7 changes: 4 additions & 3 deletions .github/workflows/publish_to_pypi.yml
@@ -1,9 +1,10 @@
name: Publish Package to PyPI with poetry
name: Publish to PyPI

on:
push:
tags:
- 'v*'
- "v*"
# TODO: Only on CI success

jobs:
build-and-test-publish:
@@ -13,4 +14,4 @@ jobs:
- name: Build and publish to pypi
uses: JRubics/[email protected]
with:
pypi_token: ${{ secrets.pypi_password }}
pypi_token: ${{ secrets.pypi_password }}
2 changes: 1 addition & 1 deletion .gitignore
@@ -113,4 +113,4 @@ venv.bak/
.mypy_cache/

.rss-parser
poetry.lock
.ruff_cache
24 changes: 6 additions & 18 deletions .pre-commit-config.yaml
@@ -14,7 +14,7 @@ repos:
- repo: local

hooks:
- id: black
- id: black-format-staged
name: black
entry: poetry
args:
@@ -23,26 +23,14 @@ repos:
language: system
types: [ python ]
stages: [ commit ]
# Black should use the config from the pyproject.toml file

- id: isort
name: isort
- id: ruff-check-global
name: ruff
entry: poetry
args:
- run
- isort
- ruff
- check
language: system
types: [ python ]
stages: [ commit ]
# isort's config is also stored in pyproject.toml

- id: flake8
name: flake8
entry: poetry
args:
- run
- flake8
language: system
always_run: true
pass_filenames: false
stages: [ push ]
stages: [ commit, push ]
148 changes: 134 additions & 14 deletions README.md
@@ -10,11 +10,12 @@
[![License](https://img.shields.io/pypi/l/rss-parser?color=success)](https://github.com/dhvcc/rss-parser/blob/master/LICENSE)
[![GitHub Pages](https://badgen.net/github/status/dhvcc/rss-parser/gh-pages?label=docs)](https://dhvcc.github.io/rss-parser#documentation)

[![Pypi publish](https://github.com/dhvcc/rss-parser/workflows/Pypi%20publish/badge.svg)](https://github.com/dhvcc/rss-parser/actions?query=workflow%3A%22Pypi+publish%22)
![CI](https://github.com/dhvcc/rss-parser/actions/workflows/ci.yml/badge.svg?branch=master)
![PyPi publish](https://github.com/dhvcc/rss-parser/actions/workflows/publish_to_pypi.yml/badge.svg?branch=master)

## About

`rss-parser` is typed python RSS parsing module built using `BeautifulSoup` and `pydantic`
`rss-parser` is a typed Python RSS parsing module built using [pydantic](https://github.com/pydantic/pydantic) and [xmltodict](https://github.com/martinblech/xmltodict)

## Installation

@@ -27,34 +28,153 @@ or
```bash
git clone https://github.com/dhvcc/rss-parser.git
cd rss-parser
pip install .
poetry build
pip install dist/*.whl
```

## Usage

### Quickstart

```python
from rss_parser import Parser
from requests import get

rss_url = "https://feedforall.com/sample.xml"
xml = get(rss_url)
rss_url = "https://rss.art19.com/apology-line"
response = get(rss_url)

# Limit feed output to 5 items
# To disable limit simply do not provide the argument or use None
parser = Parser(xml=xml.content, limit=5)
feed = parser.parse()
rss = Parser.parse(response.text)

# Print out feed meta data
print(feed.language)
print(feed.version)
# Print out rss meta data
print("Language", rss.channel.language)
print("RSS", rss.version)

# Iteratively print feed items
for item in feed.feed:
for item in rss.channel.items:
    print(item.title)
    print(item.description)
    print(item.description[:50])

# Language en
# RSS 2.0
# Wondery Presents - Flipping The Bird: Elon vs Twitter
# <p>When Elon Musk posted a video of himself arrivi
# Introducing: The Apology Line
# <p>If you could call a number and say you’re sorry
```

Note that the description still contains `<p>` tags. This is because it is stored as [CDATA](https://www.w3resource.com/xml/CDATA-sections.php), like so:

```xml
<![CDATA[<p>If you could call ...</p>]]>
```
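
If you only need the plain text of such a description, the markup can be stripped after parsing. The following is a minimal sketch using only the standard library; `TextExtractor` and `strip_html` are hypothetical helper names and are not part of `rss-parser`.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are skipped
        self.parts.append(data)


def strip_html(fragment: str) -> str:
    """Return the text of an HTML fragment with all tags removed."""
    extractor = TextExtractor()
    extractor.feed(fragment)
    extractor.close()  # flush any text that is still buffered
    return "".join(extractor.parts)


description = "<p>If you could call a number and say you're sorry</p>"
print(strip_html(description))
```

In the quickstart loop this could be applied as `strip_html(item.description)`, assuming the description is available as a plain string.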

### Overriding schema

To customize the schema or provide your own, use the `schema` keyword argument of the parser

```python
from rss_parser import Parser
from rss_parser.models import XMLBaseModel
from rss_parser.models.rss import RSS
from rss_parser.models.types import Tag

class CustomSchema(RSS, XMLBaseModel):
    channel: None = None  # Removing previous channel field
    custom: Tag[str]

with open("tests/samples/custom.xml") as f:
    data = f.read()

rss = Parser.parse(data, schema=CustomSchema)

print("RSS", rss.version)
print("Custom", rss.custom)

# RSS 2.0
# Custom Custom tag data
```

### xmltodict

This library uses [xmltodict](https://github.com/martinblech/xmltodict) to parse XML data. You can see the detailed documentation [here](https://github.com/martinblech/xmltodict#xmltodict)

The basic thing you should know is that your data is processed into dictionaries

For example, this data

```xml
<tag>content</tag>
```

will result in the following

```python
{
    "tag": "content"
}
```

*But* when a tag has attributes, its content will also be a dictionary

```xml
<tag attr="1" data-value="data">content</tag>
```

Turns into

```python
{
    "tag": {
        "@attr": "1",
        "@data-value": "data",
        "#text": "content"
    }
}
```
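
The mapping convention described above can be illustrated without installing anything. The sketch below mimics the `@attribute` / `#text` convention for a single leaf tag using the standard library's `xml.etree.ElementTree`; it is a simplified illustration, not xmltodict's implementation, and `element_to_value` and `to_dict` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def element_to_value(element):
    """Map a leaf element to a plain string, or to a dict when it has attributes."""
    text = (element.text or "").strip()
    if not element.attrib:
        # No attributes: the tag maps directly to its text
        return text
    # Attributes get an "@" prefix; the text moves under the "#text" key
    value = {f"@{name}": attr for name, attr in element.attrib.items()}
    if text:
        value["#text"] = text
    return value


def to_dict(xml: str) -> dict:
    """Convert a single-tag XML document (no child elements) to a dict."""
    root = ET.fromstring(xml)
    return {root.tag: element_to_value(root)}


print(to_dict("<tag>content</tag>"))
print(to_dict('<tag attr="1" data-value="data">content</tag>'))
```

Nested elements, repeated tags and namespaces are deliberately left out; the real library handles those cases.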

### Tag field

This is a generic field that handles tags given either as raw data or as a dictionary that includes attributes

*Although this is a complex class, it forwards most methods to its `content` attribute, so you won't notice a difference if you're only after the `.content` value*

Example

```python
from rss_parser.models import XMLBaseModel
from rss_parser.models.types import Tag

class Model(XMLBaseModel):
    number: Tag[int]
    string: Tag[str]

m = Model(
    number=1,
    string={'@attr': '1', '#text': 'content'},
)

m.number.content == 1  # Content value is an integer, as per the generic type

m.number.content + 10 == m.number + 10  # But you're still able to use the Tag itself in common operators

m.number.bit_length() == 1  # The same goes for methods/attributes not found in the Tag itself

type(m.number), type(m.number.content)  # (<class 'rss_parser.models.image.Tag[int]'>, <class 'int'>): the types are NOT the same, but the interfaces are very similar most of the time

m.number.attributes == {}  # The attributes are empty by default

m.string.attributes == {'attr': '1'}  # But are populated when provided. Note that the @ symbol is trimmed from the beginning, however, camelCase is not converted

# Generic argument types are handled by pydantic - let's try to provide a string for a Tag[int] number

m = Model(number='not_a_number', string={'@customAttr': 'v', '#text': 'str tag value'})  # This results in the following traceback

# Traceback (most recent call last):
# ...
# pydantic.error_wrappers.ValidationError: 1 validation error for Model
# number -> content
# value is not a valid integer (type=type_error.integer)
```

**If you wish to avoid all of the method/attribute forwarding "magic" - you should use `rss_parser.models.types.TagRaw`**
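
To give a rough idea of what the forwarding means, here is a minimal sketch of the general technique in plain Python. `ForwardingTag` is a hypothetical stand-in written for illustration only; it is not the library's `Tag` class and does not involve pydantic.

```python
class ForwardingTag:
    """Hypothetical wrapper that forwards unknown attributes to its content."""

    def __init__(self, value):
        if isinstance(value, dict):
            # Keys starting with "@" are attributes; "#text" holds the content
            self.attributes = {k[1:]: v for k, v in value.items() if k.startswith("@")}
            self.content = value.get("#text")
        else:
            self.attributes = {}
            self.content = value

    def __getattr__(self, name):
        # Only called when normal lookup fails, so "content" and
        # "attributes" are found first and never reach this method
        if name == "content":
            raise AttributeError(name)  # avoid recursion before __init__ has run
        return getattr(self.content, name)

    def __add__(self, other):
        # Operators bypass __getattr__, so each one needs an explicit dunder method
        return self.content + other


number = ForwardingTag(1)
string = ForwardingTag({"@attr": "1", "#text": "content"})

print(number.bit_length())  # forwarded to int.bit_length
print(number + 10)          # handled by __add__
print(string.upper())       # forwarded to str.upper
print(string.attributes)    # "@" prefix trimmed
```

Python looks up operators on the type rather than the instance, which is why `__getattr__` alone is not enough and why a wrapper like this needs explicit dunder methods for each operator it supports.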

## Contributing

Pull requests are welcome. For major changes, please open an issue first