Merge branch 'development'
hkage committed May 5, 2021
2 parents 9e7a921 + ab37d70 commit 60f244e
Showing 10 changed files with 73 additions and 18 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/python_test.yml
@@ -11,7 +11,7 @@ jobs:
strategy:
max-parallel: 4
matrix:
python-version: [2.7, 3.5, 3.6, 3.7, 3.8]
python-version: [2.7, 3.5, 3.6, 3.7, 3.8, 3.9]

steps:
- uses: actions/checkout@v2
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,11 @@

## Development

## 0.4.0 (2021-05-05)

* [#18](https://github.com/rheinwerk-verlag/postgresql-anonymizer/pull/18): Specify (SQL WHERE) search_condition, to filter the table for rows to be anonymized ([bobslee](https://github.com/bobslee))
* [#17](https://github.com/rheinwerk-verlag/postgresql-anonymizer/pull/17): Fix anonymizing error if there is a JSONB column in a table ([koptelovav](https://github.com/koptelovav))

## 0.3.3 (2021-04-16)

* [#16](https://github.com/rheinwerk-verlag/postgresql-anonymizer/issues/16): Preserve column and table cases during the copy process
2 changes: 1 addition & 1 deletion LICENSE.rst
@@ -3,7 +3,7 @@ License

The MIT License

Copyright (c) 2019-2020, Rheinwerk Verlag GmbH
Copyright (c) 2019-2021, Rheinwerk Verlag GmbH

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
15 changes: 10 additions & 5 deletions README.rst
@@ -1,12 +1,13 @@
PostgreSQL Anonymizer
=====================

This commandline tool makes PostgreSQL database anonymization easy. It uses a YAML definition file
to define which tables and fields should be anonymized and provides various methods of anonymization.
A commandline tool to anonymize PostgreSQL databases for GDPR purposes.

It uses a YAML definition file to define which tables and fields should be anonymized and provides various methods of anonymization.

.. class:: no-web no-pdf

|license| |pypi| |downloads| |build|
|python| |license| |pypi| |downloads| |build|

.. contents::

@@ -15,6 +16,7 @@ to define which tables and fields should be anonymized and provides various meth
Features
--------

* Intentionally compatible with Python 2.7 (for old, productive platforms)
* Anonymize PostgreSQL tables on data level entry with various methods (s. table below)
* Exclude data for anonymization depending on regular expressions
* Truncate entire tables for unwanted data
@@ -144,15 +146,18 @@ After that you can pass a schema file to the container, using Docker volumes, an
.. _schema documentation: https://python-postgresql-anonymizer.readthedocs.io/en/latest/schema.html
.. _YAML sample schema: https://github.com/rheinwerk-verlag/postgresql-anonymizer/blob/master/sample_schema.yml

.. |python| image:: https://img.shields.io/pypi/pyversions/pganonymize
:alt: PyPI - Python Version

.. |license| image:: https://img.shields.io/badge/license-MIT-green.svg
:target: https://github.com/rheinwerk-verlag/postgresql-anonymizer/blob/master/LICENSE.rst

.. |pypi| image:: https://badge.fury.io/py/pganonymize.svg
:target: https://badge.fury.io/py/pganonymize

.. |downloads| image:: https://pepy.tech/badge/pganonymize
.. |downloads| image:: https://static.pepy.tech/personalized-badge/pganonymize?period=total&units=international_system&left_color=blue&right_color=black&left_text=Downloads
:target: https://pepy.tech/project/pganonymize
:alt: Download count

.. |build| image:: https://github.com/rheinwerk-verlag/postgresql-anonymizer/workflows/Test/badge.svg
:target: https://github.com/rheinwerk-verlag/postgresql-anonymizer/actions
31 changes: 28 additions & 3 deletions docs/schema.rst
@@ -61,7 +61,7 @@ regular expression pattern (the backslash is to escape the string for YAML).
~~~~~~~~~~~~

In addition to the field level providers you can also specify a list of tables that should be cleared with
the `truncated` key. This is useful if you don't need the table data for development purposes or to reduce
the `truncated` key. This is useful if you don't need the table data for development purposes or to reduce
the size of the database dump.

**Example**::
@@ -73,6 +73,22 @@ the size of the database dump.
If two tables have a foreign key relation and you don't need to keep one of the table's data, just add the
second table and they will be truncated at once, without causing a constraint error.

``search``
~~~~~~~~~~

You can also specify a (SQL WHERE) `search_condition`, to filter the table for rows to be anonymized.
This is useful if you need to anonymize one or more specific records, e.g. for "Right to be forgotten" (GDPR etc.) purposes.

**Example**::

tables:
- auth_user:
search: id BETWEEN 18 AND 140 AND user_type = 'customer'
fields:
- first_name:
provider:
name: clear

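As a rough illustration of how a ``search`` value could end up in the query, the sketch below mirrors the string formatting this commit adds to ``build_data``. The helper name ``build_select`` is hypothetical (the commit builds the statement inline). Note that the condition is interpolated verbatim rather than bound as a query parameter, so it should presumably only come from a trusted schema file.

```python
def build_select(table, search=None):
    """Return the SELECT statement used to fetch the rows to anonymize.

    `build_select` is a hypothetical helper for illustration only; the
    formatting logic follows what the commit adds to `build_data`.
    """
    sql_select = "SELECT * FROM {table}".format(table=table)
    if search:
        # The search condition is pasted into the statement as-is.
        return "{select} WHERE {search_condition};".format(
            select=sql_select, search_condition=search)
    return "{select};".format(select=sql_select)


if __name__ == "__main__":
    # Condition taken from the schema example above.
    print(build_select("auth_user", "id BETWEEN 18 AND 140 AND user_type = 'customer'"))
    # Without a condition, every row of the table is selected.
    print(build_select("auth_user"))
```

With a ``search`` value only the matching rows are fetched and anonymized; without one the whole table is processed, as before this change.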
Providers
---------

@@ -149,12 +165,12 @@ the provider with ``fake`` and then use the function name from the Faker library
``mask``
~~~~~~~~

This provider will replace each character with a static sign.

**Arguments:**

* ``sign``: The sign to be used to replace the original characters (default ``X``).

This provider will replace each character with a static sign.

**Example usage**::

tables:
@@ -200,6 +216,15 @@ This provider will hash the given field value with the MD5 algorithm.
name: set
value: "Foo"

The value can also be a dictionary for JSONB columns::

tables:
- auth_user:
fields:
- first_name:
provider:
name: set
value: '{"foo": "bar", "baz": 1}'

Arguments
---------
27 changes: 23 additions & 4 deletions pganonymizer/utils.py
@@ -3,6 +3,7 @@
from __future__ import absolute_import

import csv
import json
import logging
import re
import subprocess
@@ -32,29 +33,35 @@ def anonymize_tables(connection, definitions, verbose=False):
table_definition = definition[table_name]
columns = table_definition.get('fields', [])
excludes = table_definition.get('excludes', [])
search = table_definition.get('search')
column_dict = get_column_dict(columns)
primary_key = table_definition.get('primary_key', DEFAULT_PRIMARY_KEY)
total_count = get_table_count(connection, table_name)
data, table_columns = build_data(connection, table_name, columns, excludes, total_count, verbose)
data, table_columns = build_data(connection, table_name, columns, excludes, search, total_count, verbose)
import_data(connection, column_dict, table_name, table_columns, primary_key, data)


def build_data(connection, table, columns, excludes, total_count, verbose=False):
def build_data(connection, table, columns, excludes, search, total_count, verbose=False):
"""
Select all data from a table and return it together with a list of table columns.
:param connection: A database connection instance.
:param str table: Name of the table to retrieve the data.
:param list columns: A list of table fields
:param list[dict] excludes: A list of exclude definitions.
:param str search: A SQL WHERE (search_condition) to filter and keep only the searched rows.
:param int total_count: The amount of rows for the current table
:param bool verbose: Display logging information and a progress bar.
:return: A tuple containing the data list and a complete list of all table columns.
:rtype: (list, list)
"""
if verbose:
progress_bar = IncrementalBar('Anonymizing', max=total_count)
sql = "SELECT * FROM {table};".format(table=table)
sql_select = "SELECT * FROM {table}".format(table=table)
if search:
sql = "{select} WHERE {search_condition};".format(select=sql_select, search_condition=search)
else:
sql = "{select};".format(select=sql_select)
cursor = connection.cursor(cursor_factory=psycopg2.extras.DictCursor, name='fetch_large_result')
cursor.execute(sql)
data = []
@@ -189,7 +196,19 @@ def data2csv(data):
"""
buf = StringIO()
writer = csv.writer(buf, delimiter=COPY_DB_DELIMITER, lineterminator='\n', quotechar='~')
[writer.writerow([(x is None and '\\N' or (x.strip() if type(x) == str else x)) for x in row]) for row in data]
for row in data:
row_data = []
for x in row:
if x is None:
val = '\\N'
elif type(x) == str:
val = x.strip()
elif type(x) == dict:
val = json.dumps(x)
else:
val = x
row_data.append(val)
writer.writerow(row_data)
buf.seek(0)
return buf

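The per-value conversion in the new `data2csv` loop can be tried in isolation. The following is a minimal sketch, not the project's code: `normalize_value` and `rows_to_csv` are hypothetical names, the delimiter value is an assumed placeholder for `COPY_DB_DELIMITER`, and `isinstance` is used here where the commit compares with `type(x) ==`, which behaves differently for subclasses of `str` and `dict`.

```python
import csv
import json
from io import StringIO

# Assumed placeholder; the real COPY_DB_DELIMITER is defined elsewhere in the project.
DELIMITER = '\x1f'


def normalize_value(x):
    """Convert one cell into the value written to the CSV buffer."""
    if x is None:
        return '\\N'          # NULL marker, as in the commit
    if isinstance(x, str):
        return x.strip()      # strings are stripped of surrounding whitespace
    if isinstance(x, dict):
        return json.dumps(x)  # dicts (e.g. from JSONB columns) become JSON text
    return x                  # everything else is passed through unchanged


def rows_to_csv(data, delimiter=DELIMITER):
    """Write all rows to an in-memory buffer and rewind it for reading."""
    buf = StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator='\n', quotechar='~')
    for row in data:
        writer.writerow([normalize_value(x) for x in row])
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(repr(rows_to_csv([[1, None, ' bob ', {"foo": "bar"}]]).read()))
```

Serializing dictionaries with `json.dumps` produces valid JSON text instead of Python's `repr` of the dict, which appears to be what fixes the JSONB error mentioned in the changelog.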
2 changes: 1 addition & 1 deletion pganonymizer/version.py
@@ -1,3 +1,3 @@
# -*- coding: utf-8 -*-

__version__ = '0.3.3'
__version__ = '0.4.0'
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "postgresql-anonymizer"
version = "0.3.0"
version = "0.4.0"
description = "Commandline tool to anonymize PostgreSQL databases"
authors = [
"Henning Kage <[email protected]>"
3 changes: 2 additions & 1 deletion setup.py
@@ -69,7 +69,7 @@ def run(self):
maintainer='Rheinwerk Verlag GmbH Webteam',
maintainer_email='[email protected]',
url='https://github.com/rheinwerk-verlag/postgresql-anonymizer',
license='Proprietary',
license='MIT license',
classifiers=[
'Development Status :: 2 - Pre-Alpha',
'Intended Audience :: Developers',
@@ -81,6 +81,7 @@ def run(self):
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: 3.9',
'Environment :: Console',
'Topic :: Database'
],
2 changes: 1 addition & 1 deletion tox.ini
@@ -1,5 +1,5 @@
[tox]
envlist = flake8,py27,py35,py36,py37,py38
envlist = flake8,py27,py35,py36,py37,py38,py39
skip_missing_interpreters=True

[testenv:flake8]