V1

This is a complete rewrite of the library to use xmltodict and pydantic

Notable changes:
- Ditched bs4
- Now using xmltodict and pydantic
- Removed limit option
- Parser now uses classmethods

* 6cc67f9 Uncomment ci stuff
* fb139f0 Add better Tag docs
* 9b14325 Fix tests after refactor
* e0b6a3a Rewrite Parser to classmethods, add basic docs
* 7708c77 Update Tag docstring and run doctests in ci.yml
* 3130ca1 Rename RSSFeed->RSS, RSSBaseModel->XMLBaseModel
* 8f763d5 Scrap all of the wrap/unwrap work
  Improve conftest fixtures
  Add support for self-closing tags
  Set every field to be a Tag
  Add json/dict_plain and tests for it
  Ignore unused imports for all inits
* e9e841a Update sample jsons
* fc02cf1 Add wrap/unwrap population tests
* e02a007 Add tests for wrap/unwrap chaining (renamed from with/without)
* c436ce4 Add autogenerated dunder methods to Tag
* c88388c Fix windows charmap for tests
* 329765a Fix datetime tests
* 2147f9a Remove push rule from ci until V2 is done
* 1e44298 Add with/without_tags factory to all schemas
* bd31f3c Fix tests with item, add apology_line tests
* d5a80f4 Add items to channel [WIP]
* 49db408 Add datetime comparison tests
  Refactor CI a bit
  Allow schema object mutation
  Add current and future todos
  Add IPython to dev deps
  Clean up README a bit [WIP]
  Add more rss samples for test
* 5a2fcb4 Remove 3.10 syntax
* a07aa9c bump setup python to v4
* 955b1ff Fix 3.12 version
* b9d64c6 Replace flake8 with ruff
* 908d2b0 Fix ci.yml
* dd75c66 Update cron
* 461eb82 Add no category attr test, remove unused file
* c99b985 More updates to V2
* 1a1d20e Backup before os reinstall
* 2cad195 Temp commit, reword later
* e96faba Intermediate commit, added models, fixing linting and them
dhvcc · May 31, 2023 · 5c20e0e
2 parents a13e8fa + 6cc67f9
commit 5c20e0e
Showing 35 changed files with 3,308 additions and 323 deletions.
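
The commit message describes a move to xmltodict, and the README changes in this diff explain that tags become dictionaries with `@`-prefixed attributes and a `#text` key. The sketch below imitates that convention using only the standard library. It is a simplified stand-in for illustration, not xmltodict itself, and the names `element_to_value` and `parse` are hypothetical.

```python
import xml.etree.ElementTree as ET


def element_to_value(element):
    """Convert an element to a string or dict, imitating the xmltodict convention."""
    text = (element.text or "").strip()
    children = list(element)
    if not element.attrib and not children:
        # Plain tag: the value is just its text (None for an empty tag)
        return text or None
    # Attributes are stored under "@"-prefixed keys
    result = {f"@{name}": value for name, value in element.attrib.items()}
    for child in children:
        value = element_to_value(child)
        if child.tag in result:
            # Repeated child tags are collected into a list
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    if text:
        # Text next to attributes or children goes under "#text"
        result["#text"] = text
    return result


def parse(xml_string):
    """Hypothetical helper: wrap the converted root under its tag name."""
    root = ET.fromstring(xml_string)
    return {root.tag: element_to_value(root)}


plain = parse("<tag>content</tag>")
with_attrs = parse('<tag attr="1" data-value="data">data</tag>')
print(plain)       # {'tag': 'content'}
print(with_attrs)  # {'tag': {'@attr': '1', '@data-value': 'data', '#text': 'data'}}
```

The point of the sketch is the shape change: a tag's value is a plain string until it gains attributes, at which point it becomes a dictionary. That is the case the `Tag` field in this commit is meant to absorb.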
diff --git a/.flake8 b/.flake8
diff --git a/.github/dependabot.yml b/.github/dependabot.yml
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -1,21 +1,22 @@
 name: Lint and test
 
 on:
+  schedule:
+    - cron: "0 0 1 * *"
+  # TODO: Uncomment after V2 is finished
   push:
     paths-ignore:
-      - '.github/**'
-      - '!.github/workflows/ci.yml'
-      - '.gitignore'
-      - 'README.md'
+      - ".gitignore"
+      - "README.md"
   pull_request:
 
 jobs:
-  build:
+  test:
     strategy:
       max-parallel: 6
       matrix:
         os: [ "ubuntu-latest", "windows-latest", "macos-latest" ]
-        python-version: [ 3.7, 3.8, 3.9, '3.10' ]
+        python-version: [ "3.8", "3.9", "3.10", "3.11"]
 
     runs-on: ${{ matrix.os }}
 
@@ -27,7 +28,7 @@ jobs:
 
       - name: Set up Python ${{ matrix.python-version }} on ${{ matrix.os }}
         id: setup-python
-        uses: actions/setup-python@v3
+        uses: actions/setup-python@v4
         with:
           python-version: ${{ matrix.python-version }}
           cache-dependency-path: pyproject.toml
@@ -37,14 +38,11 @@ jobs:
         if: steps.setup-python.outputs.cache-hit != 'true'
         run: poetry install
 
-      - name: Lint code with flake8
-        run: poetry run flake8
-
       - name: Lint code with black
         run: poetry run black --check .
 
-      - name: Lint code with isort
-        run: poetry run isort --check-only .
+      - name: Lint code with ruff
+        run: poetry run ruff check .
 
       - name: Test code with pytest
-        run: poetry run pytest
+        run: poetry run pytest --doctest-modules
diff --git a/.github/workflows/publish_to_pypi.yml b/.github/workflows/publish_to_pypi.yml
@@ -1,9 +1,10 @@
-name: Publish Package to PyPI with poetry
+name: Publish to PyPI
 
 on:
   push:
     tags:
-      - 'v*'
+      - "v*"
+  # TODO: Only on CI success
 
 jobs:
   build-and-test-publish:
@@ -13,4 +14,4 @@ jobs:
       - name: Build and publish to pypi
         uses: JRubics/[email protected]
         with:
-          pypi_token: ${{ secrets.pypi_password }}
+          pypi_token: ${{ secrets.pypi_password }}
diff --git a/.gitignore b/.gitignore
@@ -113,4 +113,4 @@ venv.bak/
 .mypy_cache/
 
 .rss-parser
-poetry.lock
+.ruff_cache
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -14,7 +14,7 @@ repos:
   - repo: local
 
     hooks:
-      - id: black
+      - id: black-format-staged
         name: black
         entry: poetry
         args:
@@ -23,26 +23,14 @@ repos:
         language: system
         types: [ python ]
         stages: [ commit ]
-        # Black should use the config from the pyproject.toml file
 
-      - id: isort
-        name: isort
+      - id: ruff-check-global
+        name: ruff
         entry: poetry
         args:
           - run
-          - isort
+          - ruff
+          - check
         language: system
         types: [ python ]
-        stages: [ commit ]
-        # isort's config is also stored in pyproject.toml
-
-      - id: flake8
-        name: flake8
-        entry: poetry
-        args:
-          - run
-          - flake8
-        language: system
-        always_run: true
-        pass_filenames: false
-        stages: [ push ]
+        stages: [ commit, push ]
diff --git a/README.md b/README.md
@@ -10,11 +10,12 @@
 [![License](https://img.shields.io/pypi/l/rss-parser?color=success)](https://github.com/dhvcc/rss-parser/blob/master/LICENSE)
 [![GitHub Pages](https://badgen.net/github/status/dhvcc/rss-parser/gh-pages?label=docs)](https://dhvcc.github.io/rss-parser#documentation)
 
-[![Pypi publish](https://github.com/dhvcc/rss-parser/workflows/Pypi%20publish/badge.svg)](https://github.com/dhvcc/rss-parser/actions?query=workflow%3A%22Pypi+publish%22)
+![CI](https://github.com/dhvcc/rss-parser/actions/workflows/ci.yml/badge.svg?branch=master)
+![PyPi publish](https://github.com/dhvcc/rss-parser/actions/workflows/publish_to_pypi.yml/badge.svg?branch=master)
 
 ## About
 
-`rss-parser` is typed python RSS parsing module built using `BeautifulSoup` and `pydantic`
+`rss-parser` is a typed Python RSS parsing module built using [pydantic](https://github.com/pydantic/pydantic) and [xmltodict](https://github.com/martinblech/xmltodict)
 
 ## Installation
 
@@ -27,34 +28,153 @@ or
 ```bash
 git clone https://github.com/dhvcc/rss-parser.git
 cd rss-parser
-pip install .
+poetry build
+pip install dist/*.whl
 ```
 
 ## Usage
 
+### Quickstart
+
 ```python
 from rss_parser import Parser
 from requests import get
 
-rss_url = "https://feedforall.com/sample.xml"
-xml = get(rss_url)
+rss_url = "https://rss.art19.com/apology-line"
+response = get(rss_url)
 
-# Limit feed output to 5 items
-# To disable limit simply do not provide the argument or use None
-parser = Parser(xml=xml.content, limit=5)
-feed = parser.parse()
+rss = Parser.parse(response.text)
 
-# Print out feed meta data
-print(feed.language)
-print(feed.version)
+# Print out rss meta data
+print("Language", rss.channel.language)
+print("RSS", rss.version)
 
 # Iteratively print feed items
-for item in feed.feed:
+for item in rss.channel.items:
     print(item.title)
-    print(item.description)
+    print(item.description[:50])
+
+# Language en
+# RSS 2.0
+# Wondery Presents - Flipping The Bird: Elon vs Twitter
+# <p>When Elon Musk posted a video of himself arrivi
+# Introducing: The Apology Line
+# <p>If you could call a number and say you’re sorry
+```
+
+Note that the description still contains `<p>` tags. This is because it is wrapped in a [CDATA](https://www.w3resource.com/xml/CDATA-sections.php) section, like so
+
+```xml
+<![CDATA[<p>If you could call ...</p>]]>
+```
+
+### Overriding schema
+
+If you want to customize the schema or provide your own, use the `schema` keyword argument of the parser
+
+```python
+from rss_parser import Parser
+from rss_parser.models import XMLBaseModel
+from rss_parser.models.rss import RSS
+from rss_parser.models.types import Tag
+
+class CustomSchema(RSS, XMLBaseModel):
+    channel: None = None # Removing previous channel field
+    custom: Tag[str]
+
+with open("tests/samples/custom.xml") as f:
+    data = f.read()
+
+rss = Parser.parse(data, schema=CustomSchema)
+
+print("RSS", rss.version)
+print("Custom", rss.custom)
+
+# RSS 2.0
+# Custom Custom tag data
+```
+
+### xmltodict
+
+This library uses [xmltodict](https://github.com/martinblech/xmltodict) to parse XML data. You can see the detailed documentation [here](https://github.com/martinblech/xmltodict#xmltodict)
+
+The basic thing you should know is that your data is processed into dictionaries
+
+For example, this data
+
+```xml
+<tag>content</tag>
+```
+
+will result in the following
+
+```python
+{
+    "tag": "content"
+}
+```
+
+*But*, when a tag has attributes, its content will also be a dictionary
+
+```xml
+<tag attr="1" data-value="data">data</tag>
+```
+
+Turns into
+
+```python
+{
+    "tag": {
+        "@attr": "1",
+        "@data-value": "data",
+        "#text": "data"
+    }
+}
+```
+
+### Tag field
+
+`Tag` is a generic field that handles a tag given either as raw data or as a dictionary that also carries attributes
+
+*Although this is a complex class, it forwards most methods to its `content` attribute, so you won't notice a difference if you're only after the `.content` value*
+
+Example
 
+```python
+from rss_parser.models import XMLBaseModel
+from rss_parser.models.types import Tag
+
+class Model(XMLBaseModel):
+    number: Tag[int]
+    string: Tag[str]
+
+m = Model(
+    number=1,
+    string={'@attr': '1', '#text': 'content'},
+)
+
+m.number.content == 1  # Content value is an integer, as per the generic type
+
+m.number.content + 10 == m.number + 10  # But you're still able to use the Tag itself in common operators
+
+m.number.bit_length() == 1  # As it's the case for methods/attributes not found in the Tag itself
+
+type(m.number), type(m.number.content)  # (<class 'rss_parser.models.image.Tag[int]'>, <class 'int'>): the types are NOT the same, however, the interfaces are very similar most of the time
+
+m.number.attributes == {}  # The attributes are empty by default
+
+m.string.attributes == {'attr': '1'}  # But are populated when provided. Note that the @ symbol is trimmed from the beginning, however, camelCase is not converted
+
+# Generic argument types are handled by pydantic - let's try to provide a string for a Tag[int] number
+
+m = Model(number='not_a_number', string={'@customAttr': 'v', '#text': 'str tag value'})  # This results in the following traceback
+
+# Traceback (most recent call last):
+#     ...
+# pydantic.error_wrappers.ValidationError: 1 validation error for Model
+# number -> content
+#     value is not a valid integer (type=type_error.integer)
 ```
 
+**If you wish to avoid all of the method/attribute forwarding "magic", use `rss_parser.models.types.TagRaw` instead**
+
 ## Contributing
 
 Pull requests are welcome. For major changes, please open an issue first
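
The `Tag field` section added to the README in this diff describes a wrapper that forwards methods and operators to its `content` while keeping attributes separately. The sketch below shows one way such forwarding can work in plain Python. `MiniTag` and `from_raw` are hypothetical names; this is an assumption-laden illustration, not the library's pydantic-based implementation.

```python
class MiniTag:
    """Hypothetical stand-in for Tag: wraps content and keeps attributes apart."""

    def __init__(self, content, attributes=None):
        self.content = content
        self.attributes = dict(attributes or {})

    @classmethod
    def from_raw(cls, raw):
        """Accept raw data or a dict using "@attr" and "#text" keys."""
        if isinstance(raw, dict):
            # Strip the leading "@" from attribute names
            attributes = {key[1:]: value for key, value in raw.items() if key.startswith("@")}
            return cls(raw.get("#text"), attributes)
        return cls(raw)

    def __getattr__(self, name):
        # Called only when normal lookup fails, so forward to the content.
        # Guard our own fields to avoid infinite recursion before __init__ runs.
        if name in ("content", "attributes"):
            raise AttributeError(name)
        return getattr(self.content, name)

    # Operators bypass __getattr__, so they need explicit dunder methods
    def __add__(self, other):
        return self.content + other

    def __eq__(self, other):
        if isinstance(other, MiniTag):
            other = other.content
        return self.content == other


number = MiniTag.from_raw(1)
string = MiniTag.from_raw({"@attr": "1", "#text": "content"})

print(number + 10)          # 11
print(number.bit_length())  # 1
print(string.upper())       # CONTENT
print(string.attributes)    # {'attr': '1'}
print(number.attributes)    # {}
```

Python looks up special methods such as `__add__` on the type rather than the instance, which is why `__getattr__` alone is not enough. That may be why the commit list mentions adding autogenerated dunder methods to `Tag`.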