Merge branch 'development'
hkage committed May 27, 2021
2 parents 60f244e + c05efc7 commit 661435a
Showing 10 changed files with 322 additions and 226 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
Expand Up @@ -2,6 +2,10 @@

## Development

## 0.4.1 (2021-05-27)

* [#19](https://github.com/rheinwerk-verlag/postgresql-anonymizer/pull/19): Make chunk size in the table definition dynamic ([halilkaya](https://github.com/halilkaya))

## 0.4.0 (2021-05-05)

* [#18](https://github.com/rheinwerk-verlag/postgresql-anonymizer/pull/18): Specify (SQL WHERE) search_condition, to filter the table for rows to be anonymized ([bobslee](https://github.com/bobslee))
Expand Down
4 changes: 2 additions & 2 deletions README.rst
Expand Up @@ -146,7 +146,7 @@ After that you can pass a schema file to the container, using Docker volumes, an
.. _schema documentation: https://python-postgresql-anonymizer.readthedocs.io/en/latest/schema.html
.. _YAML sample schema: https://github.com/rheinwerk-verlag/postgresql-anonymizer/blob/master/sample_schema.yml

.. |python| image:: https://img.shields.io/pypi/pyversions/pganonymize
.. |python| image:: https://img.shields.io/pypi/pyversions/pganonymize
:alt: PyPI - Python Version

.. |license| image:: https://img.shields.io/badge/license-MIT-green.svg
Expand All @@ -158,6 +158,6 @@ After that you can pass a schema file to the container, using Docker volumes, an
.. |downloads| image:: https://static.pepy.tech/personalized-badge/pganonymize?period=total&units=international_system&left_color=blue&right_color=black&left_text=Downloads
:target: https://pepy.tech/project/pganonymize
:alt: Download count

.. |build| image:: https://github.com/rheinwerk-verlag/postgresql-anonymizer/workflows/Test/badge.svg
:target: https://github.com/rheinwerk-verlag/postgresql-anonymizer/actions
141 changes: 97 additions & 44 deletions docs/schema.rst
Expand Up @@ -3,14 +3,56 @@ Schema

``pganonymize`` uses a YAML based schema definition for the anonymization rules.

Top level
---------

``tables``
~~~~~~~~~~

On the top level a list of tables can be defined with the ``tables`` keyword. This will define
which tables should be anonymized.

On the table level you can specify the table's primary key with the keyword ``primary_key`` if it
isn't the default ``id``.
**Example**::

tables:
- table_a:
fields:
- field_a: ...
- field_b: ...
- table_b:
fields:
- field_a: ...
- field_b: ...

``truncate``
~~~~~~~~~~~~

You can also specify a list of tables that should be cleared instead of anonymized with the ``truncate`` key. This is
useful if you don't need the table data for development purposes or want to reduce the size of the database dump.

**Example**::

truncate:
- django_session
- my_other_table

If two tables have a foreign key relation and you don't need to keep one of the table's data, just add the
second table and they will be truncated at once, without causing a constraint error.
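
A minimal sketch of why listing both tables avoids the constraint error: PostgreSQL accepts several table names in a single ``TRUNCATE`` statement, so tables that reference each other can be emptied together. The helper name ``build_truncate_statement`` is hypothetical and only illustrates the idea; it is not taken from ``pganonymize``.

```python
def build_truncate_statement(tables):
    """Return one TRUNCATE statement covering every listed table, or None for an empty list."""
    if not tables:
        return None
    # Assumption: the table names come from a trusted schema file, so they are not quoted here.
    return 'TRUNCATE TABLE {}'.format(', '.join(tables))


# Table names taken from the ``truncate`` example above.
statement = build_truncate_statement(['django_session', 'my_other_table'])
print(statement)
```

Because both tables appear in the same statement, the database can clear them in one step instead of rejecting the first truncate over a foreign key that still points at the table.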

Table level
-----------

``primary_key``
~~~~~~~~~~~~~~~

Defines the name of the primary key field for the current table. The default is ``id``.

**Example**::

tables:
- my_table:
primary_key: my_primary_key
fields: ...

``fields``
~~~~~~~~~~
Expand All @@ -23,7 +65,6 @@ be treated.

tables:
- auth_user:
primary_key: id
fields:
- first_name:
provider:
Expand All @@ -39,7 +80,7 @@ be treated.
~~~~~~~~~~~~

For each table you can also specify a list of ``excludes``. Each entry has to be a field name which contains
a list of exclude patterns. If one of these patterns matches, the whole table row won't ne anonymized.
a list of exclude patterns. If one of these patterns matches, the whole table row won't be anonymized.

**Example**::

Expand All @@ -57,22 +98,6 @@ a list of exclude patterns. If one of these patterns matches, the whole table ro
This will exclude all data from the table ``auth_user`` that have an ``email`` field which matches the
regular expression pattern (the backslash is to escape the string for YAML).
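
The exclude check described above could look roughly like the following sketch. It assumes that the parsed ``excludes`` entry is a list of one-key dictionaries mapping a field name to its patterns, and that patterns are matched from the start of the value with ``re.match``; the function name ``is_excluded`` and the sample pattern are hypothetical.

```python
import re


def is_excluded(row, excludes):
    """Return True if any exclude pattern matches the corresponding field of the row."""
    for definition in excludes:
        for column, patterns in definition.items():
            value = row.get(column)
            if value is None:
                continue
            # Assumption: a pattern has to match from the beginning of the value.
            if any(re.match(pattern, str(value)) for pattern in patterns):
                return True
    return False


# Hypothetical pattern: keep every row whose email belongs to example.com.
excludes = [{'email': [r'\S[^@]*@example\.com']}]
print(is_excluded({'email': 'admin@example.com'}, excludes))
print(is_excluded({'email': 'jane@other.org'}, excludes))
```

A row for which this returns ``True`` would be left untouched, matching the "whole table row" behaviour described in the text.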

``truncate``
~~~~~~~~~~~~

In addition to the field level providers you can also specify a list of tables that should be cleared with
the ``truncate`` key. This is useful if you don't need the table data for development purposes or want to reduce
the size of the database dump.

**Example**::

truncate:
- django_session
- my_other_table

If two tables have a foreign key relation and you don't need to keep one of the table's data, just add the
second table and they will be truncated at once, without causing a constraint error.

``search``
~~~~~~~~~~

Expand All @@ -89,11 +114,59 @@ This is useful if you need to anonymize one or more specific records, eg for "Ri
provider:
name: clear

Providers
---------
``chunk_size``
~~~~~~~~~~~~~~

Defines how many data rows should be fetched for each iteration of anonymizing the current table. The default is 2000.

**Example**::

tables:
- auth_user:
chunk_size: 5000
fields: ...
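
As a rough illustration of how the per-table setting falls back to the default, the sketch below resolves ``chunk_size`` for each table of an already parsed schema. The dictionary shape is an assumption about what the YAML above looks like after loading, and ``chunk_sizes`` is a hypothetical helper; only the ``.get('chunk_size', DEFAULT_CHUNK_SIZE)`` lookup mirrors the change in ``pganonymizer/utils.py``.

```python
# Mirrors the constant this commit adds to pganonymizer/constants.py.
DEFAULT_CHUNK_SIZE = 2000

# Assumed shape of the parsed YAML: a list of one-key dictionaries, one per table.
schema = {
    'tables': [
        {'auth_user': {'chunk_size': 5000, 'fields': []}},
        {'my_table': {'fields': []}},
    ]
}


def chunk_sizes(schema):
    """Map each table name to its configured chunk size, or the default if none is set."""
    sizes = {}
    for definition in schema.get('tables', []):
        for table_name, table_definition in definition.items():
            sizes[table_name] = table_definition.get('chunk_size', DEFAULT_CHUNK_SIZE)
    return sizes


print(chunk_sizes(schema))
```

Tables without a ``chunk_size`` key keep the previous behaviour of 2000 rows per fetch, so existing schema files do not need to change.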

Field level
-----------

``provider``
~~~~~~~~~~~~

Providers are the functions used to alter the data within the database. You can specify at the field level which
provider should be used to alter a specific field. To reference a provider, use the ``name`` attribute.

**Example**::

tables:
- auth_user:
fields:
- first_name:
provider:
name: set
value: "Foo"

Providers are the tools, which means functions, used to alter the data within the database.
The following providers are currently supported:

For a complete list of providers see the next section.

``append``
~~~~~~~~~~

This argument will append a value at the end of the altered value:

**Example usage**::

tables:
- auth_user:
fields:
- email:
provider:
name: md5
append: "@example.com"
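
The effect of combining a hashing provider with ``append`` can be sketched as follows. This is not the library's implementation: ``md5_with_append`` is a hypothetical helper that assumes the ``md5`` provider returns the hexadecimal MD5 digest of the original value and that ``append`` is concatenated afterwards.

```python
import hashlib


def md5_with_append(value, append=None):
    """Hash the value with MD5 and optionally append a fixed suffix to the digest."""
    hashed = hashlib.md5(value.encode('utf-8')).hexdigest()
    # Assumption: the suffix is added only when an ``append`` argument was configured.
    return hashed + append if append else hashed


anonymized = md5_with_append('jane.doe@company.org', append='@example.com')
print(anonymized)
```

With the suffix in place the anonymized value still looks like an email address, which helps when the column is validated by the application.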


Provider
--------

``choice``
~~~~~~~~~~
Expand Down Expand Up @@ -225,23 +298,3 @@ The value can also be a dictionary for JSONB columns::
provider:
name: set
value: '{"foo": "bar", "baz": 1}'

Arguments
---------

In addition to the providers there is also a list of arguments that can be added to each provider:

``append``
~~~~~~~~~~

This argument will append a value at the end of the altered value:

**Example usage**::

tables:
- auth_user:
fields:
- email:
provider:
name: md5
append: "@example.com"
3 changes: 3 additions & 0 deletions pganonymizer/constants.py
Expand Up @@ -9,3 +9,6 @@

# Filename of the default schema
DEFAULT_SCHEMA_FILE = 'schema.yml'

# Default chunk size for data fetch
DEFAULT_CHUNK_SIZE = 2000
11 changes: 7 additions & 4 deletions pganonymizer/utils.py
Expand Up @@ -14,7 +14,7 @@
from psycopg2.errors import BadCopyFileFormat, InvalidTextRepresentation
from six import StringIO

from pganonymizer.constants import COPY_DB_DELIMITER, DEFAULT_PRIMARY_KEY
from pganonymizer.constants import COPY_DB_DELIMITER, DEFAULT_CHUNK_SIZE, DEFAULT_PRIMARY_KEY
from pganonymizer.exceptions import BadDataFormat
from pganonymizer.providers import get_provider

Expand All @@ -37,11 +37,13 @@ def anonymize_tables(connection, definitions, verbose=False):
column_dict = get_column_dict(columns)
primary_key = table_definition.get('primary_key', DEFAULT_PRIMARY_KEY)
total_count = get_table_count(connection, table_name)
data, table_columns = build_data(connection, table_name, columns, excludes, search, total_count, verbose)
chunk_size = table_definition.get('chunk_size', DEFAULT_CHUNK_SIZE)
data, table_columns = build_data(connection, table_name, columns, excludes, search, total_count, chunk_size,
verbose)
import_data(connection, column_dict, table_name, table_columns, primary_key, data)


def build_data(connection, table, columns, excludes, search, total_count, verbose=False):
def build_data(connection, table, columns, excludes, search, total_count, chunk_size, verbose=False):
"""
Select all data from a table and return it together with a list of table columns.
Expand All @@ -51,6 +53,7 @@ def build_data(connection, table, columns, excludes, search, total_count, verbos
:param list[dict] excludes: A list of exclude definitions.
:param str search: A SQL WHERE (search_condition) to filter and keep only the searched rows.
:param int total_count: The amount of rows for the current table
:param int chunk_size: Number of data rows to fetch with the cursor
:param bool verbose: Display logging information and a progress bar.
:return: A tuple containing the data list and a complete list of all table columns.
:rtype: (list, list)
Expand All @@ -67,7 +70,7 @@ def build_data(connection, table, columns, excludes, search, total_count, verbos
data = []
table_columns = None
while True:
records = cursor.fetchmany(size=2000)
records = cursor.fetchmany(size=chunk_size)
if not records:
break
for row in records:
Expand Down
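
The chunked fetch loop changed in ``build_data`` can be tried out in isolation. The sketch below uses the standard library ``sqlite3`` module instead of ``psycopg2`` so that it runs without a PostgreSQL server; both cursors provide ``fetchmany`` as defined by the Python DB-API. The generator name ``fetch_in_chunks`` and the sample table contents are assumptions made for the example.

```python
import sqlite3


def fetch_in_chunks(cursor, chunk_size):
    """Yield lists of at most ``chunk_size`` rows until the cursor is exhausted."""
    while True:
        records = cursor.fetchmany(size=chunk_size)
        if not records:
            break
        yield records


# In-memory stand-in for a real table, filled with five rows.
connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE auth_user (id INTEGER PRIMARY KEY, first_name TEXT)')
connection.executemany(
    'INSERT INTO auth_user (first_name) VALUES (?)',
    [('user{}'.format(i),) for i in range(5)],
)

cursor = connection.cursor()
cursor.execute('SELECT id, first_name FROM auth_user ORDER BY id')
chunk_lengths = [len(chunk) for chunk in fetch_in_chunks(cursor, chunk_size=2)]
print(chunk_lengths)
```

A larger chunk size means fewer round trips per table at the cost of holding more rows in memory per iteration, which is why the commit makes the value configurable per table.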
2 changes: 1 addition & 1 deletion pganonymizer/version.py
@@ -1,3 +1,3 @@
# -*- coding: utf-8 -*-

__version__ = '0.4.0'
__version__ = '0.4.1'