DOC: Update performance comparison section of io docs #28890

Merged Nov 9, 2019 · 26 commits

Changes from 20 commits:
4e85c6d  Merge pull request #1 from pandas-dev/master  (WuraolaOyewusi, Aug 21, 2019)
44df2ee  Merge pull request #2 from pandas-dev/master  (WuraolaOyewusi, Aug 22, 2019)
b887983  Merge pull request #3 from pandas-dev/master  (WuraolaOyewusi, Aug 23, 2019)
9554ea6  Merge pull request #4 from pandas-dev/master  (WuraolaOyewusi, Sep 17, 2019)
fd27a6f  Merge pull request #5 from pandas-dev/master  (WuraolaOyewusi, Sep 24, 2019)
3425a0a  Merge pull request #6 from pandas-dev/master  (WuraolaOyewusi, Oct 2, 2019)
e53bce0  Update io.rst  (WuraolaOyewusi, Oct 10, 2019)
76ccef3  Update io.rst  (WuraolaOyewusi, Oct 10, 2019)
9672526  Update io.rst  (WuraolaOyewusi, Oct 10, 2019)
d2c1e20  Update io.rst  (WuraolaOyewusi, Oct 10, 2019)
709d571  Update io.rst  (WuraolaOyewusi, Oct 10, 2019)
ddd39f6  Update io.rst  (WuraolaOyewusi, Oct 10, 2019)
26b5db1  Update io.rst  (WuraolaOyewusi, Oct 10, 2019)
cf85f95  Update io.rst  (WuraolaOyewusi, Oct 10, 2019)
3d71d40  Update io.rst  (WuraolaOyewusi, Oct 10, 2019)
8c8ed93  Update io.rst  (WuraolaOyewusi, Oct 10, 2019)
3e62c8f  Update io.rst  (WuraolaOyewusi, Oct 11, 2019)
1af539c  Update io.rst  (WuraolaOyewusi, Oct 11, 2019)
ce51d5e  Update io.rst  (WuraolaOyewusi, Oct 11, 2019)
524c7e0  Update io.rst  (WuraolaOyewusi, Oct 12, 2019)
2b77c5d  Update io.rst  (WuraolaOyewusi, Oct 12, 2019)
2224738  restore indentation  (jorisvandenbossche, Oct 21, 2019)
df377c1  fixup  (jorisvandenbossche, Oct 21, 2019)
e3eba95  Update io.rst  (WuraolaOyewusi, Nov 8, 2019)
3aa5dea  Update io.rst  (WuraolaOyewusi, Nov 8, 2019)
0af75a0  Merge branch 'master' into Update-Performance-Comparison-section-of-I…  (WuraolaOyewusi, Nov 8, 2019)
218 changes: 110 additions & 108 deletions doc/source/user_guide/io.rst
Performance considerations
--------------------------

This is an informal comparison of various IO methods, using pandas
0.24.2. Timings are machine dependent and small differences should be
ignored.

.. code-block:: ipython

    ...

Given the next test set:

Comment (Contributor), on the line ``from numpy.random import randn``:

Can you change this example to use our formatting, meaning don't import like this; rather use ``np.random.randn`` directly, and ``np.random.seed``.

Comment (Contributor Author):

Done
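The formatting the reviewer asks for would look roughly like this — a sketch of the suggested style (using ``np.random`` directly), not necessarily the exact text that was committed:

```python
import numpy as np
import pandas as pd

# Same test set as in the docs, written in the suggested np.random style.
sz = 1000000
np.random.seed(42)  # seed so the frame is reproducible across runs
df = pd.DataFrame({'A': np.random.randn(sz), 'B': [1] * sz})
```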

.. code-block:: python

    from numpy.random import randn
    from numpy.random import seed

    sz = 1000000
    seed(42)
    df = pd.DataFrame({'A': randn(sz), 'B': [1] * sz})

    def test_sql_write(df):
        if os.path.exists('test.sql'):
            os.remove('test.sql')
        sql_db = sqlite3.connect('test.sql')
        df.to_sql(name='test_table', con=sql_db)
        sql_db.close()

    def test_sql_read():
        sql_db = sqlite3.connect('test.sql')
        pd.read_sql_query("select * from test_table", sql_db)
        sql_db.close()

    def test_hdf_fixed_write(df):
        df.to_hdf('test_fixed.hdf', 'test', mode='w')

    def test_hdf_fixed_read():
        pd.read_hdf('test_fixed.hdf', 'test')

    def test_hdf_fixed_write_compress(df):
        df.to_hdf('test_fixed_compress.hdf', 'test', mode='w', complib='blosc')

    def test_hdf_fixed_read_compress():
        pd.read_hdf('test_fixed_compress.hdf', 'test')

    def test_hdf_table_write(df):
        df.to_hdf('test_table.hdf', 'test', mode='w', format='table')

    def test_hdf_table_read():
        pd.read_hdf('test_table.hdf', 'test')

    def test_hdf_table_write_compress(df):
        df.to_hdf('test_table_compress.hdf', 'test', mode='w',
                  complib='blosc', format='table')

    def test_hdf_table_read_compress():
        pd.read_hdf('test_table_compress.hdf', 'test')

    def test_csv_write(df):
        df.to_csv('test.csv', mode='w')

    def test_csv_read():
        pd.read_csv('test.csv', index_col=0)

    def test_feather_write(df):
        df.to_feather('test.feather')

    def test_feather_read():
        pd.read_feather('test.feather')

    def test_pickle_write(df):
        df.to_pickle('test.pkl')

    def test_pickle_read():
        pd.read_pickle('test.pkl')

    def test_pickle_write_compress(df):
        df.to_pickle('test.pkl.compress', compression='xz')

    def test_pickle_read_compress():
        pd.read_pickle('test.pkl.compress', compression='xz')

    def test_parquet_write(df):
        df.to_parquet('test.parquet')

    def test_parquet_read():
        pd.read_parquet('test.parquet')
When writing, the top three functions in terms of speed are ``test_feather_write``, ``test_hdf_fixed_write`` and ``test_hdf_fixed_write_compress``.

.. code-block:: ipython

    In [4]: %timeit test_sql_write(df)
    3.29 s ± 43.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [5]: %timeit test_hdf_fixed_write(df)
    19.4 ms ± 560 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [6]: %timeit test_hdf_fixed_write_compress(df)
    19.6 ms ± 308 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [7]: %timeit test_hdf_table_write(df)
    449 ms ± 5.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [8]: %timeit test_hdf_table_write_compress(df)
    448 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [9]: %timeit test_csv_write(df)
    3.66 s ± 26.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [10]: %timeit test_feather_write(df)
    9.75 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [11]: %timeit test_pickle_write(df)
    30.1 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [12]: %timeit test_pickle_write_compress(df)
    4.29 s ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [13]: %timeit test_parquet_write(df)
    67.6 ms ± 706 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
When reading, the top three are ``test_feather_read``, ``test_pickle_read`` and
``test_hdf_fixed_read``.

.. code-block:: ipython

    In [14]: %timeit test_sql_read()
    1.77 s ± 17.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [15]: %timeit test_hdf_fixed_read()
    19.4 ms ± 436 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [16]: %timeit test_hdf_fixed_read_compress()
    19.5 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [17]: %timeit test_hdf_table_read()
    38.6 ms ± 857 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [18]: %timeit test_hdf_table_read_compress()
    38.8 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [19]: %timeit test_csv_read()
    452 ms ± 9.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [20]: %timeit test_feather_read()
    12.4 ms ± 99.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [21]: %timeit test_pickle_read()
    18.4 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [22]: %timeit test_pickle_read_compress()
    915 ms ± 7.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [23]: %timeit test_parquet_read()
    24.4 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

For this test case ``test.pkl.compress``, ``test.parquet`` and ``test.feather`` took the least space on disk.
Space on disk (in bytes):

.. code-block:: none

    29519500 Oct 10 06:45 test.csv
    16000248 Oct 10 06:45 test.feather
     8281983 Oct 10 06:49 test.parquet
    16000857 Oct 10 06:47 test.pkl
     7552144 Oct 10 06:48 test.pkl.compress
    34816000 Oct 10 06:42 test.sql
    24009288 Oct 10 06:43 test_fixed.hdf
    24009288 Oct 10 06:43 test_fixed_compress.hdf
    24458940 Oct 10 06:44 test_table.hdf
    24458940 Oct 10 06:44 test_table_compress.hdf
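A listing like the one above can be collected programmatically with a small helper — a hypothetical sketch (the docs show ``ls -l``-style output, not this script):

```python
import os

def sizes(paths):
    """Map each existing path to its on-disk size in bytes."""
    return {p: os.path.getsize(p) for p in paths if os.path.exists(p)}

# Demo with a throwaway file standing in for the real benchmark outputs;
# paths that don't exist are simply skipped.
with open('demo.bin', 'wb') as f:
    f.write(b'\0' * 1024)
result = sizes(['demo.bin', 'missing.parquet'])
print(result)  # → {'demo.bin': 1024}
os.remove('demo.bin')
```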
Comment (Member):
I just noticed this here: it seems something went wrong with the compression (as it is exactly the same size as the non-compressed one; and also the timing is not slower). Maybe it did fall back to non-compressed because you didn't have the compression lib installed?
(however, if it does that silently, that feels like a bug to me)

Comment (Contributor Author):
@jorisvandenbossche I found out the original code had 3-space indentation; aligning my update with the previous code was the reason some checks failed. When I made the indentation 4-space, it passed.

Let me check the notebook again about the compression.

Comment (Contributor Author):
You are right about the compression.

(attached screenshot: IMG_20191021_103019)

Comment (Contributor Author):
@jorisvandenbossche, @datapythonista

> I just noticed this here: it seems something went wrong with the compression (as it is exactly the same size as the non-compressed one; and also the timing is not slower). Maybe it did fall back to non-compressed because you didn't have the compression lib installed?
> (however, if it does that silently, that feels like a bug to me)

I ran the code again, tried version '0.25.0', and it's still the same. It seems like a bug.
What can I do to fix it?

Comment (Member):
Can you open a separate issue, referencing this one, explaining the lack of compression in HDF.

@jorisvandenbossche probably worth merging this as is and fixing that in a separate PR, since it's an unrelated change. It looks like a bug in the code, and I guess the fix won't be trivial.
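One quick way to sanity-check whether a writer really compressed its output is to compare the file size against an uncompressed write of the same data. This sketch is illustrative only — it uses stdlib ``lzma`` and ``pickle`` rather than HDF5/blosc, so it demonstrates the check, not the HDF bug discussed here:

```python
import lzma
import os
import pickle

data = {'A': list(range(100000)), 'B': [1] * 100000}

with open('t.pkl', 'wb') as f:
    pickle.dump(data, f)       # uncompressed baseline
with lzma.open('t.pkl.xz', 'wb') as f:
    pickle.dump(data, f)       # xz-compressed variant

raw, comp = os.path.getsize('t.pkl'), os.path.getsize('t.pkl.xz')
# If the "compressed" file matched the raw size, compression silently
# did not happen -- the symptom seen above for the HDF fixed format.
print(raw, comp, comp < raw)
os.remove('t.pkl')
os.remove('t.pkl.xz')
```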

Comment (Contributor Author):
Ok Marc


Space on disk (in bytes), previous listing removed by this PR:

.. code-block:: none

    34816000 Aug 21 18:00 test.sql
    24009240 Aug 21 18:00 test_fixed.hdf
     7919610 Aug 21 18:00 test_fixed_compress.hdf
    24458892 Aug 21 18:00 test_table.hdf
     8657116 Aug 21 18:00 test_table_compress.hdf
    28520770 Aug 21 18:00 test.csv
    16000248 Aug 21 18:00 test.feather
    16000848 Aug 21 18:00 test.pkl
     7554108 Aug 21 18:00 test.pkl.compress