feat: to_xml and read_xml
dmyersturnbull committed Jul 20, 2021
1 parent d4a5562 commit 9836690
Showing 12 changed files with 419 additions and 213 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -33,7 +33,7 @@ repos:
- hooks:
- id: black
repo: https://github.com/psf/black
rev: 21.7b0
- repo: https://github.com/asottile/blacken-docs
rev: v1.10.0
hooks:
33 changes: 28 additions & 5 deletions CHANGELOG.md
@@ -3,9 +3,16 @@
Adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) and
[Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [0.7.1] - 2021-07-19

### Added

- Support for `to_xml` and `read_xml`

## [0.7.0] - 2021-06-08

### Added

- `can_read` and `can_write` on `BaseDf` to get supported file formats
- Write (and read) to "flex" fixed-width;
currently, this is only used for ".flexwf" as a preview
@@ -15,10 +22,12 @@ Adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) and
- Methods to set default read_file/to_file args

### Removed

- All args from `read_file` and `to_file`
- `comment` from `to_lines`; it was too confusing because no other write functions had one

### Changed

- `dtype` values in `TypedDfBuilder` are now used;
specifically, `TypedDf.convert` calls `pd.Series.astype` with them.
- Overrode `assign` to handle indices
@@ -31,92 +40,106 @@ Adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) and
- Empty DataFrames are read via `BaseDf.read_csv`, etc. without issue (`pd.read_csv` normally fails)

### Fixed

- `to_lines` and `read_lines` are fully inverses
- Read/write are inverses for _untyped_ DFs for all formats
- Deleted .dockerignore and codemeta.json
- `check` workflow no longer errors on push
- Better read/write tests; enabled Parquet-format tests

## [0.6.1] - 2021-03-31

### Added

- `vanilla_reset`

### Removed

- Unused Sphinx/readthedocs files

### Fixed

- Not passing kwargs to `UntypedDf.to_csv`
- Simplified some read/write code

## [0.6.0] - 2021-03-30

### Added

- Read/write wrappers for Feather, Parquet, and JSON
- Added general functions `read_file` and `write_file`
- `TypedDfs.wrap` and `FinalDf`

### Fixed

- `to_csv` was not passing along `args` and `kwargs`
- Slightly better build config

## [0.5.0] - 2021-01-19

### Changed

- Made `tables` an optional dependency; use `typeddfs[hdf5]`
- `natsort` is no longer pinned to version 7; it's now `>=7`.
  Added a note in the readme that this just requires some caution.

### Fixed

- Slight improvement to build and metadata

## [0.4.0] - 2020-08-29

### Removed

- Support for Python 3.7

### Changed

- Bumped Pandas to 1.2
- Updated build


## [0.3.0] - 2020-08-29

### Removed

- `require_full` argument
- Support for Pandas <1.1

### Changed

- `convert` now keeps non-reserved indices in the index as long as `more_indices_allowed` is false
- Moved builder to a separate module
- Changed or added type annotations using `__qualname__`
- Moved some basic functions from `AbsFrame` to its superclass `PrettyFrame`

### Added

- A method on `BaseFrame` called `such_that` to do type-retaining slicing

### Fixed

- A bug in `only`
- A bug in checking symmetry
- Dropped unnecessary imports
- Clarified that `detype` is needed for functions like `applymap` if the returned value would fail the requirements
- Improved test coverage
- Added docstrings


## [0.2.0] - 2020-05-19

### Added

- Builder and static factory for new classes
- Symmetry and custom conditions

### Changed

- Renamed most classes
- Renamed `to_vanilla` to `vanilla`, dropping the former
- Split code into several files


## [0.1.0] - 2020-05-12

### Added

- Main code.
51 changes: 26 additions & 25 deletions README.md
@@ -12,9 +12,8 @@
[![Scrutinizer Code Quality](https://scrutinizer-ci.com/g/dmyersturnbull/typed-dfs/badges/quality-score.png?b=main)](https://scrutinizer-ci.com/g/dmyersturnbull/typed-dfs/?branch=main)
[![Created with Tyrannosaurus](https://img.shields.io/badge/Created_with-Tyrannosaurus-0000ff.svg)](https://github.com/dmyersturnbull/tyrannosaurus)


Pandas DataFrame subclasses that enforce structure and self-organize.
*Because your functions can’t exactly accept **any** DataFrame.*
`pip install typeddfs[feather,fwf]`

Stop passing `index_cols=` and `header=` to `to_csv` and `read_csv`.
@@ -23,23 +22,23 @@ That means columns are used for the index, string columns are always read as str
and custom constraints are verified.

Need to read a tab-delimited file? `read_file("myfile.tab")`.
Feather? Parquet? HDF5? .json.zip? Gzipped fixed-width? XML?
Use `read_file`. Write a file? Use `write_file`.
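
For instance, a quick sketch of that round trip (the `untyped` wrapper and the file names
here are illustrative; XML support is new in 0.7.1 and needs Pandas 1.3+ with `lxml`):

```python
from typeddfs._entries import TypedDfs

MyDf = TypedDfs.untyped("MyDf")  # a minimal wrapper; .typed(...) classes work the same way

df = MyDf.read_file("myfile.tab")  # format inferred from the suffix
df.write_file("myfile.csv.gz")  # compression is inferred, too
df.write_file("myfile.xml")  # new in 0.7.1
same = MyDf.read_file("myfile.xml")  # reads back what was written
```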

Some useful extra functions, plus various Pandas issues fixed:

- `read_csv`/`to_csv`, `read_json`/`to_json`, etc., are inverses.
  `read_file`/`write_file`, too.
- You can always read and write empty DataFrames -- that doesn't raise weird exceptions (see the sketch after this list).
  Typed-dfs will always read in what you wrote out.
- No more empty `.feather`/`.snappy`/`.h5` files written on error.
- You can write fixed-width as well as read.
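
A minimal sketch of the empty-DataFrame guarantee (the `Empty` class and file name are illustrative):

```python
from typeddfs._entries import TypedDfs

Empty = TypedDfs.untyped("Empty")

df = Empty()  # no rows and no columns
df.write_file("empty.csv")  # plain pd.read_csv typically raises EmptyDataError on read-back
assert len(Empty.read_file("empty.csv")) == 0  # typed-dfs reads back what it wrote
```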

```python

from typeddfs._entries import TypedDfs

MyDfType = (
TypedDfs.typed("MyDfType")
TypedDfs.typed("MyDfType")
.require("name", index=True) # always keep in index
.require("value", dtype=float) # require a column and type
.drop("_temp") # auto-drop a column
@@ -54,17 +53,16 @@ df.sort_natural().write_file("myfile.feather")

For a CSV like this:

| key | value | note |
| --- | ----- | ---- |
| abc | 123 | ? |

```python

from typeddfs._entries import TypedDfs

# Build me a Key-Value-Note class!
KeyValue = (
TypedDfs.typed("KeyValue") # With enforced reqs / typing
TypedDfs.typed("KeyValue") # With enforced reqs / typing
.require("key", dtype=str, index=True) # automagically add to index
.require("value") # required
.reserve("note") # permitted but not required
@@ -83,7 +81,7 @@ print(df.index_names(), df.column_names()) # ["key"], ["value", "note"]
# And now, we can type a function to require a KeyValue,
# and let it raise an `InvalidDfError` (here, a `MissingColumnError`):
def my_special_function(df: KeyValue) -> float:
    return KeyValue(df)["value"].sum()
```

All of the normal DataFrame methods are available.
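
For example, a sketch using the `KeyValue` class built above (the sample data is illustrative;
`convert` validates the frame and applies the declared dtypes):

```python
import pandas as pd

df = KeyValue.convert(pd.DataFrame({"key": ["a", "b"], "value": [1.0, 2.0]}))
print(df["value"].sum())  # ordinary Series arithmetic: 3.0
print(df.sort_values("value").head(1))  # ordinary DataFrame methods work as usual
```
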
@@ -103,6 +101,7 @@ Serialization is provided through Pandas, and some formats require additional pa
Pandas does not specify compatible versions, so
[extras](https://python-poetry.org/docs/pyproject/#extras) are provided in typed-dfs
to ensure that those packages are installed with compatible versions.

- To install with [Feather](https://arrow.apache.org/docs/python/feather.html) support,
use `pip install typeddfs[feather]`.
- To install with support for all serialization formats,
@@ -121,21 +120,23 @@ Feather is the preferred format for most cases.

**⚠ Note:** The `hdf5` and `parquet` extras are currently disabled.

| format  | packages              | extra     | compatibility | performance |
| ------- | --------------------- | --------- | ------------- | ----------- |
| pickle  | none                  | none      | ❗             | �           |
| csv     | none                  | none      | ✓             | −−          |
| json    | none                  | none      | /             | −−−         |
| xml     | `lxml`                | `xml`     | .             | −−−         |
| .npy †  | none                  | none      | †             | +           |
| .npz †  | none                  | none      | †             | +           |
| flexwf  | none                  | `fwf`     | ✓             | −−−         |
| Feather | `pyarrow`             | `feather` | ✓             | ++++        |
| Parquet | `pyarrow,fastparquet` | `parquet` | ✓             | +++         |
| HDF5    | `tables`              | `hdf5`    | ✓             | �           |

❗ == Pickle is explicitly not supported due to vulnerabilities and other issues.
/ == Mostly. JSON has inconsistent handling of `None`.
† == .npy and .npz only serialize numpy objects and therefore skip indices.
. == Requires Pandas 1.3+
Note: `.flexwf` is fixed-width with optional delimiters; `.fwf` is not used
to avoid a potential future conflict with `pd.DataFrame.to_fwf` (which does not exist yet).

Expand Down

