Merge branch 'development'

rheinwerk-verlag · May 27, 2021 · 661435a
2 parents 60f244e + c05efc7

Showing 10 changed files with 322 additions and 226 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,10 @@
 
 ## Development
 
+## 0.4.1 (2021-05-27)
+
+* [#19](https://github.com/rheinwerk-verlag/postgresql-anonymizer/pull/19): Make chunk size in the table definition dynamic ([halilkaya](https://github.com/halilkaya))
+
 ## 0.4.0 (2021-05-05)
 
 * [#18](https://github.com/rheinwerk-verlag/postgresql-anonymizer/pull/18): Specify (SQL WHERE) search_condition, to filter the table for rows to be anonymized ([bobslee](https://github.com/bobslee))

diff --git a/README.rst b/README.rst
@@ -146,7 +146,7 @@ After that you can pass a schema file to the container, using Docker volumes, an
 .. _schema documentation: https://python-postgresql-anonymizer.readthedocs.io/en/latest/schema.html
 .. _YAML sample schema: https://github.com/rheinwerk-verlag/postgresql-anonymizer/blob/master/sample_schema.yml
 
-.. |python| image:: https://img.shields.io/pypi/pyversions/pganonymize 
+.. |python| image:: https://img.shields.io/pypi/pyversions/pganonymize
     :alt: PyPI - Python Version
 
 .. |license| image:: https://img.shields.io/badge/license-MIT-green.svg
@@ -158,6 +158,6 @@ After that you can pass a schema file to the container, using Docker volumes, an
 .. |downloads| image:: https://static.pepy.tech/personalized-badge/pganonymize?period=total&units=international_system&left_color=blue&right_color=black&left_text=Downloads
     :target: https://pepy.tech/project/pganonymize
     :alt: Download count
-    
+
 .. |build| image:: https://github.com/rheinwerk-verlag/postgresql-anonymizer/workflows/Test/badge.svg
     :target: https://github.com/rheinwerk-verlag/postgresql-anonymizer/actions
diff --git a/docs/schema.rst b/docs/schema.rst
@@ -3,14 +3,56 @@ Schema
 
 ``pganonymize`` uses a YAML based schema definition for the anonymization rules.
 
+Top level
+---------
+
 ``tables``
 ~~~~~~~~~~
 
 On the top level a list of tables can be defined with the ``tables`` keyword. This will define
 which tables should be anonymized.
 
-On the table level you can specify the tables primary key with the keyword ``primary_key`` if it
-isn't the default ``id``.
+**Example**::
+
+    tables:
+     - table_a:
+        fields:
+         - field_a: ...
+         - field_b: ...
+     - table_b:
+        fields:
+         - field_a: ...
+         - field_b: ...
+
+``truncate``
+~~~~~~~~~~~~
+
+You can also specify a list of tables that should be cleared instead of anonymized with the ``truncate`` key. This is
+useful if you don't need the table data for development purposes or to reduce the size of the database dump.
+
+**Example**::
+
+    truncate:
+     - django_session
+     - my_other_table
+
+If two tables have a foreign key relation and you don't need to keep one of the table's data, just add the
+second table and they will be truncated at once, without causing a constraint error.
+
+Table level
+-----------
+
+``primary_key``
+~~~~~~~~~~~~~~~
+
+Defines the name of the primary key field for the current table. The default is ``id``.
+
+**Example**::
+
+    tables:
+     - my_table:
+        primary_key: my_primary_key
+        fields: ...
 
 ``fields``
 ~~~~~~~~~~
@@ -23,7 +65,6 @@ be treated.
 
     tables:
      - auth_user:
-        primary_key: id
         fields:
          - first_name:
             provider:
@@ -39,7 +80,7 @@ be treated.
 ~~~~~~~~~~~~
 
 For each table you can also specify a list of ``excludes``. Each entry has to be a field name which contains
-a list of exclude patterns. If one of these patterns matches, the whole table row won't ne anonymized.
+a list of exclude patterns. If one of these patterns matches, the whole table row won't be anonymized.
 
 **Example**::
 
@@ -57,22 +98,6 @@ a list of exclude patterns. If one of these patterns matches, the whole table ro
 This will exclude all data from the table ``auth_user`` that have an ``email`` field which matches the
 regular expression pattern (the backslash is to escape the string for YAML).
 
-``truncate``
-~~~~~~~~~~~~
-
-In addition to the field level providers you can also specify a list of tables that should be cleared with
-the  `truncated` key. This is useful if you don't need the table data for development purposes or the reduce
-the size of the database dump.
-
-**Example**::
-
-    truncate:
-     - django_session
-     - my_other_table
-
-If two tables have a foreign key relation and you don't need to keep one of the table's data, just add the
-second table and they will be truncated at once, without causing a constraint error.
-
 ``search``
 ~~~~~~~~~~
 
@@ -89,11 +114,59 @@ This is useful if you need to anonymize one or more specific records, eg for "Ri
             provider:
               name: clear
 
-Providers
----------
+``chunk_size``
+~~~~~~~~~~~~~~
+
+Defines how many data rows should be fetched for each iteration of anonymizing the current table. The default is 2000.
+
+**Example**::
+
+    tables:
+     - auth_user:
+        chunk_size: 5000
+        fields: ...
+
+Field level
+-----------
+
+``provider``
+~~~~~~~~~~~~
+
+Providers are the tools, that is, the functions used to alter the data within the database. You can specify at the
+field level which provider should be used to alter a specific field. To reference a provider, use the ``name``
+attribute.
+
+**Example**::
+
+    tables:
+     - auth_user:
+        fields:
+         - first_name:
+            provider:
+              name: set
+              value: "Foo"
 
-Providers are the tools, which means functions, used to alter the data within the database.
-The following provider are currently supported:
+
+For a complete list of providers see the next section.
+
+``append``
+~~~~~~~~~~
+
+This argument will append a value at the end of the altered value:
+
+**Example usage**::
+
+    tables:
+     - auth_user:
+        fields:
+         - email:
+            provider:
+              name: md5
+            append: "@example.com"
+
+
+Provider
+--------
 
 ``choice``
 ~~~~~~~~~~
@@ -225,23 +298,3 @@ The value can also be a dictionary for JSONB columns::
             provider:
               name: set
               value: '{"foo": "bar", "baz": 1}'
-
-Arguments
----------
-
-In addition to the providers there is also a list of arguments that can be added to each provider:
-
-``append``
-~~~~~~~~~~
-
-This argument will append a value at the end of the altered value:
-
-**Example usage**::
-
-    tables:
-     - auth_user:
-        fields:
-         - email:
-            provider:
-              name: md5
-            append: "@example.com"
diff --git a/pganonymizer/constants.py b/pganonymizer/constants.py
@@ -9,3 +9,6 @@
 
 # Filename of the default schema
 DEFAULT_SCHEMA_FILE = 'schema.yml'
+
+# Default chunk size for data fetch
+DEFAULT_CHUNK_SIZE = 2000
diff --git a/pganonymizer/utils.py b/pganonymizer/utils.py
@@ -14,7 +14,7 @@
 from psycopg2.errors import BadCopyFileFormat, InvalidTextRepresentation
 from six import StringIO
 
-from pganonymizer.constants import COPY_DB_DELIMITER, DEFAULT_PRIMARY_KEY
+from pganonymizer.constants import COPY_DB_DELIMITER, DEFAULT_CHUNK_SIZE, DEFAULT_PRIMARY_KEY
 from pganonymizer.exceptions import BadDataFormat
 from pganonymizer.providers import get_provider
 
@@ -37,11 +37,13 @@ def anonymize_tables(connection, definitions, verbose=False):
         column_dict = get_column_dict(columns)
         primary_key = table_definition.get('primary_key', DEFAULT_PRIMARY_KEY)
         total_count = get_table_count(connection, table_name)
-        data, table_columns = build_data(connection, table_name, columns, excludes, search, total_count, verbose)
+        chunk_size = table_definition.get('chunk_size', DEFAULT_CHUNK_SIZE)
+        data, table_columns = build_data(connection, table_name, columns, excludes, search, total_count, chunk_size,
+                                         verbose)
         import_data(connection, column_dict, table_name, table_columns, primary_key, data)
 
 
-def build_data(connection, table, columns, excludes, search, total_count, verbose=False):
+def build_data(connection, table, columns, excludes, search, total_count, chunk_size, verbose=False):
     """
     Select all data from a table and return it together with a list of table columns.
 
@@ -51,6 +53,7 @@ def build_data(connection, table, columns, excludes, search, total_count, verbos
     :param list[dict] excludes: A list of exclude definitions.
     :param str search: A SQL WHERE (search_condition) to filter and keep only the searched rows.
     :param int total_count: The amount of rows for the current table
+    :param int chunk_size: Number of data rows to fetch with the cursor
     :param bool verbose: Display logging information and a progress bar.
     :return: A tuple containing the data list and a complete list of all table columns.
     :rtype: (list, list)
@@ -67,7 +70,7 @@ def build_data(connection, table, columns, excludes, search, total_count, verbos
     data = []
     table_columns = None
     while True:
-        records = cursor.fetchmany(size=2000)
+        records = cursor.fetchmany(size=chunk_size)
         if not records:
             break
         for row in records:

diff --git a/pganonymizer/version.py b/pganonymizer/version.py
@@ -1,3 +1,3 @@
 # -*- coding: utf-8 -*-
 
-__version__ = '0.4.0'
+__version__ = '0.4.1'
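
The `chunk_size` change in `pganonymizer/utils.py` can be sketched in isolation as follows. This is a minimal illustration, not the project's code: `fetch_in_chunks` is a hypothetical helper, and the standard library's `sqlite3` stands in for a psycopg2 connection because both follow the DB-API `cursor.fetchmany(size=...)` convention. Only the lookup `table_definition.get('chunk_size', DEFAULT_CHUNK_SIZE)` and the `fetchmany` loop are taken from the diff above.

```python
import sqlite3

# Mirrors the constant added in pganonymizer/constants.py
DEFAULT_CHUNK_SIZE = 2000


def fetch_in_chunks(cursor, table_definition):
    """Yield lists of rows, each at most ``chunk_size`` rows long.

    Hypothetical helper: the real ``build_data`` collects the rows into a
    single list instead of yielding them chunk by chunk.
    """
    # Fall back to the default when the schema does not set ``chunk_size``
    chunk_size = table_definition.get('chunk_size', DEFAULT_CHUNK_SIZE)
    while True:
        records = cursor.fetchmany(size=chunk_size)
        if not records:
            break
        yield records


# sqlite3 is used here only as a stand-in for a PostgreSQL connection
connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE auth_user (id INTEGER PRIMARY KEY, first_name TEXT)')
connection.executemany(
    'INSERT INTO auth_user (first_name) VALUES (?)',
    [('user{}'.format(i),) for i in range(12)],
)

# A table definition with an explicit chunk size, as in the schema example
cursor = connection.cursor()
cursor.execute('SELECT id, first_name FROM auth_user ORDER BY id')
chunk_lengths = [len(chunk) for chunk in fetch_in_chunks(cursor, {'chunk_size': 5})]

# A table definition without ``chunk_size`` uses DEFAULT_CHUNK_SIZE
cursor.execute('SELECT id, first_name FROM auth_user ORDER BY id')
default_lengths = [len(chunk) for chunk in fetch_in_chunks(cursor, {})]

print(chunk_lengths)    # [5, 5, 2]
print(default_lengths)  # [12]
```

A smaller `chunk_size` means more round trips through the loop with fewer rows held per `fetchmany` call, while a larger one does the opposite. Which value suits a given table depends on row width and available memory, which is presumably why the setting was moved into the table definition.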