Merge branch 'development'

rheinwerk-verlag · May 5, 2021 · 60f244e
2 parents 9e7a921 + ab37d70

Showing 10 changed files with 73 additions and 18 deletions.
diff --git a/.github/workflows/python_test.yml b/.github/workflows/python_test.yml
@@ -11,7 +11,7 @@ jobs:
     strategy:
       max-parallel: 4
       matrix:
-        python-version: [2.7, 3.5, 3.6, 3.7, 3.8]
+        python-version: [2.7, 3.5, 3.6, 3.7, 3.8, 3.9]
 
     steps:
     - uses: actions/checkout@v2

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,11 @@
 
 ## Development
 
+## 0.4.0 (2021-05-05)
+
+* [#18](https://github.com/rheinwerk-verlag/postgresql-anonymizer/pull/18): Specify a (SQL WHERE) search condition to restrict which rows of a table are anonymized ([bobslee](https://github.com/bobslee))
+* [#17](https://github.com/rheinwerk-verlag/postgresql-anonymizer/pull/17): Fix anonymizing error if there is a JSONB column in a table ([koptelovav](https://github.com/koptelovav))
+
 ## 0.3.3 (2021-04-16)
 
 * [#16](https://github.com/rheinwerk-verlag/postgresql-anonymizer/issues/16): Preserve column and table cases during the copy process

diff --git a/LICENSE.rst b/LICENSE.rst
@@ -3,7 +3,7 @@ License
 
 The MIT License
 
-Copyright (c) 2019-2020, Rheinwerk Verlag GmbH
+Copyright (c) 2019-2021, Rheinwerk Verlag GmbH
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

diff --git a/README.rst b/README.rst
@@ -1,12 +1,13 @@
 PostgreSQL Anonymizer
 =====================
 
-This commandline tool makes PostgreSQL database anonymization easy. It uses a YAML definition file
-to define which tables and fields should be anonymized and provides various methods of anonymization.
+A command-line tool to anonymize PostgreSQL databases for GDPR purposes.
+
+It uses a YAML definition file to define which tables and fields should be anonymized and provides various methods of anonymization.
 
 .. class:: no-web no-pdf
 
-    |license| |pypi| |downloads| |build|
+    |python| |license| |pypi| |downloads| |build|
 
 .. contents::
 
@@ -15,6 +16,7 @@ to define which tables and fields should be anonymized and provides various meth
 Features
 --------
 
+* Intentionally compatible with Python 2.7 (for legacy production platforms)
 * Anonymize PostgreSQL tables at the data entry level with various methods (see table below)
 * Exclude data for anonymization depending on regular expressions
 * Truncate entire tables for unwanted data
@@ -144,15 +146,18 @@ After that you can pass a schema file to the container, using Docker volumes, an
 .. _schema documentation: https://python-postgresql-anonymizer.readthedocs.io/en/latest/schema.html
 .. _YAML sample schema: https://github.com/rheinwerk-verlag/postgresql-anonymizer/blob/master/sample_schema.yml
 
+.. |python| image:: https://img.shields.io/pypi/pyversions/pganonymize 
+    :alt: PyPI - Python Version
+
 .. |license| image:: https://img.shields.io/badge/license-MIT-green.svg
     :target: https://github.com/rheinwerk-verlag/postgresql-anonymizer/blob/master/LICENSE.rst
 
 .. |pypi| image:: https://badge.fury.io/py/pganonymize.svg
     :target: https://badge.fury.io/py/pganonymize
 
-.. |downloads| image:: https://pepy.tech/badge/pganonymize
+.. |downloads| image:: https://static.pepy.tech/personalized-badge/pganonymize?period=total&units=international_system&left_color=blue&right_color=black&left_text=Downloads
     :target: https://pepy.tech/project/pganonymize
     :alt: Download count
-
+    
 .. |build| image:: https://github.com/rheinwerk-verlag/postgresql-anonymizer/workflows/Test/badge.svg
     :target: https://github.com/rheinwerk-verlag/postgresql-anonymizer/actions
diff --git a/docs/schema.rst b/docs/schema.rst
@@ -61,7 +61,7 @@ regular expression pattern (the backslash is to escape the string for YAML).
 ~~~~~~~~~~~~
 
 In addition to the field level providers you can also specify a list of tables that should be cleared with
-the  `truncated` key. This is useful if you don't need the table data for development purposes or to reduce 
+the  `truncated` key. This is useful if you don't need the table data for development purposes or to reduce
 the size of the database dump.
 
 **Example**::
@@ -73,6 +73,22 @@ the size of the database dump.
 If two tables have a foreign key relation and you don't need to keep one of the table's data, just add the
 second table and they will be truncated at once, without causing a constraint error.
 
+``search``
+~~~~~~~~~~
+
+You can also specify a SQL WHERE search condition with the `search` key to restrict which rows of the table are anonymized.
+This is useful if you need to anonymize one or more specific records, e.g. for "right to be forgotten" (GDPR etc.) requests.
+
+**Example**::
+
+    tables:
+     - auth_user:
+        search: id BETWEEN 18 AND 140 AND user_type = 'customer'
+        fields:
+         - first_name:
+            provider:
+              name: clear
+
 Providers
 ---------
 
@@ -149,12 +165,12 @@ the provider with ``fake`` and then use the function name from the Faker library
 ``mask``
 ~~~~~~~~
 
-This provider will replace each character with a static sign.
-
 **Arguments:**
 
 * ``sign``: The sign to be used to replace the original characters (default ``X``).
 
+This provider will replace each character with a static sign.
+
 **Example usage**::
 
     tables:
@@ -200,6 +216,15 @@ This provider will hash the given field value with the MD5 algorithm.
               name: set
               value: "Foo"
 
+The value can also be a dictionary for JSONB columns::
+
+    tables:
+     - auth_user:
+        fields:
+         - first_name:
+            provider:
+              name: set
+              value: '{"foo": "bar", "baz": 1}'
 
 Arguments
 ---------
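
Reading aid for the `search` key documented in this file: a minimal sketch of how such a condition could be appended to the SELECT statement. It follows the string formatting that this commit adds to `build_data` in `pganonymizer/utils.py`, but `build_select` is a hypothetical helper name used only for illustration and is not part of the package. Note that the condition is interpolated verbatim, so it should only come from a trusted schema file.

```python
def build_select(table, search=None):
    """Return a SELECT statement for ``table``, optionally filtered by ``search``.

    Hypothetical helper for illustration only. The search condition is
    inserted verbatim into the SQL string, so it must come from a trusted
    source such as the YAML schema file.
    """
    sql_select = "SELECT * FROM {table}".format(table=table)
    if search:
        # A non-empty condition becomes the WHERE clause.
        return "{select} WHERE {search_condition};".format(
            select=sql_select, search_condition=search
        )
    # No condition (None or empty string): select every row.
    return "{select};".format(select=sql_select)


if __name__ == "__main__":
    print(build_select("auth_user"))
    print(build_select("auth_user", "id BETWEEN 18 AND 140 AND user_type = 'customer'"))
```

With the schema example from the docs, the second call would produce a statement that selects only customers with an `id` between 18 and 140; tables without a `search` key keep the previous "select everything" behaviour.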

diff --git a/pganonymizer/utils.py b/pganonymizer/utils.py
@@ -3,6 +3,7 @@
 from __future__ import absolute_import
 
 import csv
+import json
 import logging
 import re
 import subprocess
@@ -32,29 +33,35 @@ def anonymize_tables(connection, definitions, verbose=False):
         table_definition = definition[table_name]
         columns = table_definition.get('fields', [])
         excludes = table_definition.get('excludes', [])
+        search = table_definition.get('search')
         column_dict = get_column_dict(columns)
         primary_key = table_definition.get('primary_key', DEFAULT_PRIMARY_KEY)
         total_count = get_table_count(connection, table_name)
-        data, table_columns = build_data(connection, table_name, columns, excludes, total_count, verbose)
+        data, table_columns = build_data(connection, table_name, columns, excludes, search, total_count, verbose)
         import_data(connection, column_dict, table_name, table_columns, primary_key, data)
 
 
-def build_data(connection, table, columns, excludes, total_count, verbose=False):
+def build_data(connection, table, columns, excludes, search, total_count, verbose=False):
     """
     Select all data from a table and return it together with a list of table columns.
 
     :param connection: A database connection instance.
     :param str table: Name of the table to retrieve the data.
     :param list columns: A list of table fields
     :param list[dict] excludes: A list of exclude definitions.
+    :param str search: A SQL WHERE search condition; only rows matching it are selected for anonymization.
     :param int total_count: The amount of rows for the current table
     :param bool verbose: Display logging information and a progress bar.
     :return: A tuple containing the data list and a complete list of all table columns.
     :rtype: (list, list)
     """
     if verbose:
         progress_bar = IncrementalBar('Anonymizing', max=total_count)
-    sql = "SELECT * FROM {table};".format(table=table)
+    sql_select = "SELECT * FROM {table}".format(table=table)
+    if search:
+        sql = "{select} WHERE {search_condition};".format(select=sql_select, search_condition=search)
+    else:
+        sql = "{select};".format(select=sql_select)
     cursor = connection.cursor(cursor_factory=psycopg2.extras.DictCursor, name='fetch_large_result')
     cursor.execute(sql)
     data = []
@@ -189,7 +196,19 @@ def data2csv(data):
     """
     buf = StringIO()
     writer = csv.writer(buf, delimiter=COPY_DB_DELIMITER, lineterminator='\n', quotechar='~')
-    [writer.writerow([(x is None and '\\N' or (x.strip() if type(x) == str else x)) for x in row]) for row in data]
+    for row in data:
+        row_data = []
+        for x in row:
+            if x is None:
+                val = '\\N'
+            elif type(x) == str:
+                val = x.strip()
+            elif type(x) == dict:
+                val = json.dumps(x)
+            else:
+                val = x
+            row_data.append(val)
+        writer.writerow(row_data)
     buf.seek(0)
     return buf
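
Reading aid for the `data2csv` change: a self-contained sketch of the per-value conversion, showing how a `dict` (as returned for a JSONB column) ends up as a JSON string in the CSV buffer. The names `rows_to_csv` and `DELIMITER` are hypothetical; the real delimiter constant is defined elsewhere in the package and its value is not shown in this diff. The sketch targets Python 3 only and uses `isinstance` instead of the `type(x) ==` comparison in the commit, which is a deliberate deviation rather than the committed code.

```python
import csv
import json
from io import StringIO

# Assumption: stand-in for the package's COPY_DB_DELIMITER constant,
# whose actual value is not visible in this diff.
DELIMITER = '\x1f'


def rows_to_csv(data):
    """Write rows to an in-memory CSV buffer, one converted value per column."""
    buf = StringIO()
    writer = csv.writer(buf, delimiter=DELIMITER, lineterminator='\n', quotechar='~')
    for row in data:
        row_data = []
        for x in row:
            if x is None:
                val = '\\N'           # literal backslash-N marks a NULL value
            elif isinstance(x, str):
                val = x.strip()       # trim surrounding whitespace from strings
            elif isinstance(x, dict):
                val = json.dumps(x)   # serialize JSONB values as JSON text
            else:
                val = x               # numbers etc. are written as-is
            row_data.append(val)
        writer.writerow(row_data)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    sample = [[1, None, '  Foo ', {"foo": "bar", "baz": 1}]]
    print(repr(rows_to_csv(sample).getvalue()))
```

Splitting the branches out this way makes it straightforward to add further type handling later (for example lists), which the previous one-line list comprehension did not allow for.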
 

diff --git a/pganonymizer/version.py b/pganonymizer/version.py
@@ -1,3 +1,3 @@
 # -*- coding: utf-8 -*-
 
-__version__ = '0.3.3'
+__version__ = '0.4.0'
diff --git a/pyproject.toml b/pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "postgresql-anonymizer"
-version = "0.3.0"
+version = "0.4.0"
 description = "Commandline tool to anonymize PostgreSQL databases"
 authors = [
     "Henning Kage <[email protected]>"

diff --git a/setup.py b/setup.py
@@ -69,7 +69,7 @@ def run(self):
     maintainer='Rheinwerk Verlag GmbH Webteam',
     maintainer_email='[email protected]',
     url='https://github.com/rheinwerk-verlag/postgresql-anonymizer',
-    license='Proprietary',
+    license='MIT license',
     classifiers=[
         'Development Status :: 2 - Pre-Alpha',
         'Intended Audience :: Developers',
@@ -81,6 +81,7 @@ def run(self):
         'Programming Language :: Python :: 3.6',
         'Programming Language :: Python :: 3.7',
         'Programming Language :: Python :: 3.8',
+        'Programming Language :: Python :: 3.9',
         'Environment :: Console',
         'Topic :: Database'
     ],

diff --git a/tox.ini b/tox.ini
@@ -1,5 +1,5 @@
 [tox]
-envlist = flake8,py27,py35,py36,py37,py38
+envlist = flake8,py27,py35,py36,py37,py38,py39
 skip_missing_interpreters=True
 
 [testenv:flake8]