feat: to_xml and read_xml
dmyersturnbull committed Jul 20, 2021
1 parent d4a5562 commit 9836690
Showing 12 changed files with 419 additions and 213 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -33,7 +33,7 @@ repos:
- hooks:
- id: black
repo: https://github.com/psf/black
rev: 21.7b0
- repo: https://github.com/asottile/blacken-docs
rev: v1.10.0
hooks:
33 changes: 28 additions & 5 deletions CHANGELOG.md
@@ -3,9 +3,16 @@
Adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) and
[Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [0.7.1] - 2021-07-19

### Added

- Support for `to_xml` and `read_xml`

## [0.7.0] - 2021-06-08

### Added

- `can_read` and `can_write` on `BaseDf` to get supported file formats
- Write (and read) to "flex" fixed-width;
currently, this is only used for ".flexwf" as a preview
@@ -15,10 +22,12 @@ Adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) and
- Methods to set default read_file/to_file args

### Removed

- All args from `read_file` and `to_file`
- `comment` from `to_lines`; it was too confusing because no other write functions had one

### Changed

- `dtype` values in `TypedDfBuilder` are now used;
specifically, `TypedDf.convert` calls `pd.Series.astype` with them.
- Overrode `assign` to handle indices
@@ -31,92 +40,106 @@ Adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) and
- Empty DataFrames are read via `BaseDf.read_csv`, etc. without issue (`pd.read_csv` normally fails)

### Fixed

- `to_lines` and `read_lines` are fully inverses
- Read/write are inverses for _untyped_ DFs for all formats
- Deleted .dockerignore and codemeta.json
- `check` workflow no longer errors on push
- Better read/write tests; enabled Parquet-format tests

## [0.6.1] - 2021-03-31

### Added

- `vanilla_reset`

### Removed

- Unused Sphinx/readthedocs files

### Fixed

- Not passing kwargs to `UntypedDf.to_csv`
- Simplified some read/write code

## [0.6.0] - 2021-03-30

### Added

- Read/write wrappers for Feather, Parquet, and JSON
- Added general functions `read_file` and `write_file`
- `TypedDfs.wrap` and `FinalDf`

### Fixed

- `to_csv` was not passing along `args` and `kwargs`
- Slightly better build config

## [0.5.0] - 2021-01-19

### Changed

- Made `tables` an optional dependency; use `typeddfs[hdf5]`
- `natsort` is no longer pinned to version 7; it's now `>=7`.
  Added a note in the readme that this just requires some caution.

### Fixed

- Slight improvement to build and metadata

## [0.4.0] - 2020-08-29

### Removed

- Support for Python 3.7

### Changed

- Bumped Pandas to 1.2
- Updated build


## [0.3.0] - 2020-08-29

### Removed

- `require_full` argument
- Support for Pandas <1.1

### Changed

- `convert` now keeps non-reserved indices in the index as long as `more_indices_allowed` is false
- Moved builder to a separate module
- Changed or added type annotations using `__qualname__`
- Moved some basic functions from `AbsFrame` to its superclass `PrettyFrame`

### Added

- A method on `BaseFrame` called `such_that` to do type-retaining slicing

### Fixed

- A bug in `only`
- A bug in checking symmetry
- Dropped unnecessary imports
- Clarified that `detype` is needed for functions like `applymap` if the returned value would fail the requirements
- Improved test coverage
- Added docstrings


## [0.2.0] - 2020-05-19

### Added

- Builder and static factory for new classes
- Symmetry and custom conditions

### Changed

- Renamed most classes
- Renamed `to_vanilla` to `vanilla`, dropping the former
- Split code into several files


## [0.1.0] - 2020-05-12

### Added

- Main code.
51 changes: 26 additions & 25 deletions README.md
@@ -12,9 +12,8 @@
[![Scrutinizer Code Quality](https://scrutinizer-ci.com/g/dmyersturnbull/typed-dfs/badges/quality-score.png?b=main)](https://scrutinizer-ci.com/g/dmyersturnbull/typed-dfs/?branch=main)
[![Created with Tyrannosaurus](https://img.shields.io/badge/Created_with-Tyrannosaurus-0000ff.svg)](https://github.com/dmyersturnbull/tyrannosaurus)


Pandas DataFrame subclasses that enforce structure and self-organize.
*Because your functions can’t exactly accept **any** DataFrame.*
`pip install typeddfs[feather,fwf]`

Stop passing `index_cols=` and `header=` to `to_csv` and `read_csv`.
@@ -23,23 +22,23 @@ That means columns are used for the index, string columns are always read as str
and custom constraints are verified.

Need to read a tab-delimited file? `read_file("myfile.tab")`.
Feather? Parquet? HDF5? .json.zip? Gzipped fixed-width? XML?
Use `read_file`. Write a file? Use `write_file`.
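
For instance, a quick sketch of that round trip (the `untyped` wrapper and the file names
here are illustrative; XML support is new in 0.7.1 and needs Pandas 1.3+ with `lxml`):

```python
from typeddfs._entries import TypedDfs

MyDf = TypedDfs.untyped("MyDf")  # a minimal wrapper; .typed(...) classes work the same way

df = MyDf.read_file("myfile.tab")  # format inferred from the suffix
df.write_file("myfile.csv.gz")  # compression is inferred, too
df.write_file("myfile.xml")  # new in 0.7.1
same = MyDf.read_file("myfile.xml")  # reads back what was written
```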

Some useful extra functions, plus various Pandas issues fixed:

- `read_csv`/`to_csv`, `read_json`/`to_json`, etc., are inverses.
  `read_file`/`write_file`, too.
- You can always read and write empty DataFrames -- that doesn't raise weird exceptions (see the sketch after this list).
  Typed-dfs will always read in what you wrote out.
- No more empty `.feather`/`.snappy`/`.h5` files written on error.
- You can write fixed-width as well as read.
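
A minimal sketch of the empty-DataFrame guarantee (the `Empty` class and file name are illustrative):

```python
from typeddfs._entries import TypedDfs

Empty = TypedDfs.untyped("Empty")

df = Empty()  # no rows and no columns
df.write_file("empty.csv")  # plain pd.read_csv typically raises EmptyDataError on read-back
assert len(Empty.read_file("empty.csv")) == 0  # typed-dfs reads back what it wrote
```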

```python

from typeddfs._entries import TypedDfs

MyDfType = (
TypedDfs.typed("MyDfType")
TypedDfs.typed("MyDfType")
.require("name", index=True) # always keep in index
.require("value", dtype=float) # require a column and type
.drop("_temp") # auto-drop a column
@@ -54,17 +53,16 @@ df.sort_natural().write_file("myfile.feather")

For a CSV like this:

| key | value | note |
| --- | ----- | ---- |
| abc | 123 | ? |

```python

from typeddfs._entries import TypedDfs

# Build me a Key-Value-Note class!
KeyValue = (
TypedDfs.typed("KeyValue") # With enforced reqs / typing
TypedDfs.typed("KeyValue") # With enforced reqs / typing
.require("key", dtype=str, index=True) # automagically add to index
.require("value") # required
.reserve("note") # permitted but not required
@@ -83,7 +81,7 @@ print(df.index_names(), df.column_names()) # ["key"], ["value", "note"]
# And now, we can type a function to require a KeyValue,
# and let it raise an `InvalidDfError` (here, a `MissingColumnError`):
def my_special_function(df: KeyValue) -> float:
    return KeyValue(df)["value"].sum()
```

All of the normal DataFrame methods are available.
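
For example, a sketch using the `KeyValue` class built above (the sample data is illustrative;
`convert` validates the frame and applies the declared dtypes):

```python
import pandas as pd

df = KeyValue.convert(pd.DataFrame({"key": ["a", "b"], "value": [1.0, 2.0]}))
print(df["value"].sum())  # ordinary Series arithmetic: 3.0
print(df.sort_values("value").head(1))  # ordinary DataFrame methods work as usual
```
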
@@ -103,6 +101,7 @@ Serialization is provided through Pandas, and some formats require additional pa
Pandas does not specify compatible versions, so
[extras](https://python-poetry.org/docs/pyproject/#extras) are provided in typed-dfs
to ensure that those packages are installed with compatible versions.

- To install with [Feather](https://arrow.apache.org/docs/python/feather.html) support,
use `pip install typeddfs[feather]`.
- To install with support for all serialization formats,
@@ -121,21 +120,23 @@ Feather is the preferred format for most cases.

**⚠ Note:** The `hdf5` and `parquet` extras are currently disabled.

| format  | packages              | extra     | compatibility | performance |
| ------- | --------------------- | --------- | ------------- | ----------- |
| pickle  | none                  | none      | ❗             | �           |
| csv     | none                  | none      | ✓             | −−          |
| json    | none                  | none      | /             | −−−         |
| xml     | `lxml`                | `xml`     | .             | −−−         |
| .npy †  | none                  | none      | †             | +           |
| .npz †  | none                  | none      | †             | +           |
| flexwf  | none                  | `fwf`     | ✓             | −−−         |
| Feather | `pyarrow`             | `feather` | ✓             | ++++        |
| Parquet | `pyarrow,fastparquet` | `parquet` | ✓             | +++         |
| HDF5    | `tables`              | `hdf5`    | ✓             | �           |

❗ == Pickle is explicitly not supported due to vulnerabilities and other issues.
/ == Mostly. JSON has inconsistent handling of `None`.
† == .npy and .npz only serialize numpy objects and therefore skip indices.
. == Requires Pandas 1.3+
Note: `.flexwf` is fixed-width with optional delimiters; `.fwf` is not used
to avoid a potential future conflict with `pd.DataFrame.to_fwf` (which does not exist yet).

Expand Down

