Merge branch 'development'
hkage committed Dec 20, 2019
2 parents ce2b3ce + d25e107 commit 5cffe06
Showing 9 changed files with 291 additions and 42 deletions.
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,13 @@

## Development

## 0.2.0 (2019-12-20)

* Added provider classes
* Added new providers:
  * choice - returns a random list element
  * mask - replaces the original value with a static sign

## 0.1.1 (2019-12-18)

Changed setup.py
79 changes: 65 additions & 14 deletions README.rst
@@ -1,18 +1,21 @@
PostgreSQL Anonymizer
=====================

.. image:: https://travis-ci.org/hkage/postgresql-anonymizer.svg?branch=master
:target: https://travis-ci.org/hkage/postgresql-anonymizer
.. image:: https://travis-ci.org/rheinwerk-verlag/postgresql-anonymizer.svg?branch=master
:target: https://travis-ci.org/rheinwerk-verlag/postgresql-anonymizer

.. image:: https://img.shields.io/badge/license-MIT-green.svg
:target: https://github.com/hkage/postgresql-anonymizer/blob/master/LICENSE.rst
:target: https://github.com/rheinwerk-verlag/postgresql-anonymizer/blob/master/LICENSE.rst

.. image:: https://badge.fury.io/py/pganonymize.svg
:target: https://badge.fury.io/py/pganonymize

A commandline tool to anonymize PostgreSQL databases.

Installation
------------

``pganonymize`` is Python 2 and Python 3 compatible.

The default installation method is to use ``pip``::

$ pip install pganonymize
@@ -60,11 +63,13 @@ be treated.
primary_key: id
fields:
- first_name:
provider: clear
provider:
name: clear
- customer_email:
fields:
- email:
provider: md5
provider:
name: md5
append: "@localhost"


@@ -74,6 +79,29 @@ Providers
Providers are the tools, i.e. the functions, used to alter the data within the database.
The following providers are currently supported:

``choice``
~~~~~~~~~~

This provider will define a list of possible values for a database field and will randomly make a choice
from this list.

**Arguments:**

* ``values``: A list of values

**Example usage**::

tables:
- auth_user:
fields:
- first_name:
provider:
name: choice
values:
- "John"
- "Lisa"
- "Tom"

``clear``
~~~~~~~~~

@@ -90,7 +118,8 @@ The ``clear`` provider will set a database field to ``null``.
- auth_user:
fields:
- first_name:
provider: clear
provider:
name: clear


``fake``
@@ -114,7 +143,27 @@ the provider with ``faker`` and use the provider function from the Faker library
- auth_user:
fields:
- email:
provider: fake.email
provider:
name: fake.email

``mask``
~~~~~~~~

This provider will replace each character with a static sign.

**Arguments:**

* ``sign``: The sign to be used to replace the original characters (default ``X``).

**Example usage**::

tables:
- auth_user:
fields:
- last_name:
provider:
name: mask
sign: '?'


``md5``
@@ -130,7 +179,8 @@ This provider will hash the given field value with the MD5 algorithm.
- auth_user:
fields:
- password:
provider: md5
provider:
name: md5


``set``
@@ -146,8 +196,9 @@ This provider will hash the given field value with the MD5 algorithm.
- auth_user:
fields:
- first_name:
provider: set
value: "Foo"
provider:
name: set
value: "Foo"


Arguments
@@ -166,7 +217,8 @@ This argument will append a value at the end of the altered value:
- auth_user:
fields:
- email:
provider: md5
provider:
name: md5
append: "@example.com"

Quickstart
@@ -250,9 +302,8 @@ TODOs
-----
* Add tests
* Add exceptions for certain field values
* Make the providers more pluggable (e.g. as own classes with a unqiue character id)
* Add option to create a database dump
* Add ``choice`` provider to randomly choice from a list of values
* Add a commandline argument to list all available providers


.. _Faker: https://faker.readthedocs.io/en/master/providers.html
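
The README hunks above change each ``provider`` entry from a bare string to a mapping with a ``name`` key. A minimal sketch of how an old-style field definition could be converted follows. The helper ``upgrade_field_definition`` is hypothetical and not part of pganonymize, and it assumes that ``append`` stays at field level (as in the md5 example) while every other key becomes a provider argument (as in the ``set`` example).

```python
def upgrade_field_definition(definition):
    """Convert a 0.1.x field definition to the nested 0.2.0 provider form.

    Hypothetical helper for illustration; not part of pganonymize.
    """
    provider = definition.get('provider')
    if isinstance(provider, dict):
        # Already in the new format, return an unchanged copy.
        return dict(definition)
    upgraded = {'provider': {'name': provider}}
    for key, value in definition.items():
        if key == 'provider':
            continue
        if key == 'append':
            # Assumption: "append" is a field-level argument in both formats.
            upgraded['append'] = value
        else:
            # Provider arguments such as "value" move under "provider".
            upgraded['provider'][key] = value
    return upgraded


old_field = {'provider': 'set', 'value': 'Foo'}
new_field = upgrade_field_definition(old_field)
```

Definitions that already use the nested form pass through unchanged, so the helper could be applied to a mixed schema.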
10 changes: 9 additions & 1 deletion pganonymizer/exceptions.py
@@ -3,7 +3,15 @@ class PgAnonymizeException(Exception):


class InvalidFieldProvider(PgAnonymizeException):
"""Raised if an unknown field provider was used."""
"""Raised if an unknown field provider was used in the schema."""


class InvalidProvider(PgAnonymizeException):
"""Raised if an unknown provider class was requested."""


class InvalidProviderArgument(PgAnonymizeException):
"""Raised if an argument is unknown or invalid for a provider."""


class BadDataFormat(PgAnonymizeException):
125 changes: 125 additions & 0 deletions pganonymizer/providers.py
@@ -0,0 +1,125 @@
import random
from hashlib import md5

from faker import Faker
from six import with_metaclass

from pganonymizer.exceptions import InvalidProvider, InvalidProviderArgument


PROVIDERS = []

fake_data = Faker()


def get_provider(provider_config):
    """
    Return a provider instance, according to the schema definition of a field.
    :param dict provider_config: A provider configuration for a single field, e.g.:
        {'name': 'set', 'value': 'Foo'}
    :return: A provider instance
    :rtype: Provider
    """
    def get_provider_class(cid):
        # Return the first registered provider class matching the id, or None.
        for cls in PROVIDERS:
            if cls.matches(cid):
                return cls
    name = provider_config['name']
    cls = get_provider_class(name)
    if cls is None:
        raise InvalidProvider('Could not find provider with id %s' % name)
    return cls(**provider_config)


class ProviderMeta(type):
    """Metaclass to register all provider classes."""

    def __new__(cls, clsname, bases, attrs):
        newclass = super(ProviderMeta, cls).__new__(cls, clsname, bases, attrs)
        if clsname != 'Provider':
            PROVIDERS.append(newclass)
        return newclass


class Provider(object):
    """Base class for all providers."""

    id = None

    def __init__(self, **kwargs):
        self.kwargs = kwargs

    @classmethod
    def matches(cls, name):
        return cls.id.lower() == name.lower()

    def alter_value(self, value):
        raise NotImplementedError


class ChoiceProvider(with_metaclass(ProviderMeta, Provider)):
    """Provider that returns a random value from a list of choices."""

    id = 'choice'

    def alter_value(self, value):
        return random.choice(self.kwargs.get('values'))


class ClearProvider(with_metaclass(ProviderMeta, Provider)):
    """Provider to set a field value to None."""

    id = 'clear'

    def alter_value(self, value):
        return None


class FakeProvider(with_metaclass(ProviderMeta, Provider)):
    """Provider to generate fake data."""

    id = 'fake'

    @classmethod
    def matches(cls, name):
        return cls.id.lower() == name.split('.')[0].lower()

    def alter_value(self, value):
        func_name = self.kwargs['name'].split('.')[1]
        try:
            func = getattr(fake_data, func_name)
        except AttributeError as exc:
            raise InvalidProviderArgument(exc)
        return func()


class MaskProvider(with_metaclass(ProviderMeta, Provider)):
    """Provider that masks the original value."""

    id = 'mask'
    default_sign = 'X'

    def alter_value(self, value):
        sign = self.kwargs.get('sign', self.default_sign) or self.default_sign
        return sign * len(value)


class MD5Provider(with_metaclass(ProviderMeta, Provider)):
    """Provider to hash a value with the md5 algorithm."""

    id = 'md5'

    def alter_value(self, value):
        return md5(value.encode('utf-8')).hexdigest()


class SetProvider(with_metaclass(ProviderMeta, Provider)):
    """Provider to set a static value."""

    id = 'set'

    def alter_value(self, value):
        return self.kwargs.get('value')
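
The registration pattern in this file can be tried without ``six`` or ``faker``. The following is a minimal sketch in Python 3 syntax, not the module itself: ``REGISTRY``, ``RegisteringMeta``, ``BaseProvider``, ``MaskSketch``, ``SetSketch`` and ``lookup`` are illustrative names. Unlike the original, the sketch puts the metaclass on the base class and raises a plain ``LookupError``.

```python
REGISTRY = []  # stand-in for the module-level PROVIDERS list


class RegisteringMeta(type):
    """Metaclass that records every concrete subclass in REGISTRY."""

    def __new__(mcs, clsname, bases, attrs):
        newclass = super().__new__(mcs, clsname, bases, attrs)
        if clsname != 'BaseProvider':
            # The abstract base class itself is not a usable provider.
            REGISTRY.append(newclass)
        return newclass


class BaseProvider(metaclass=RegisteringMeta):
    id = None

    def __init__(self, **kwargs):
        self.kwargs = kwargs

    @classmethod
    def matches(cls, name):
        return cls.id.lower() == name.lower()

    def alter_value(self, value):
        raise NotImplementedError


class MaskSketch(BaseProvider):
    id = 'mask'

    def alter_value(self, value):
        sign = self.kwargs.get('sign') or 'X'
        return sign * len(value)


class SetSketch(BaseProvider):
    id = 'set'

    def alter_value(self, value):
        return self.kwargs.get('value')


def lookup(provider_config):
    """Return an instance of the first registered class matching the name."""
    for cls in REGISTRY:
        if cls.matches(provider_config['name']):
            return cls(**provider_config)
    raise LookupError('Could not find provider with id %s' % provider_config['name'])


masked = lookup({'name': 'mask', 'sign': '?'}).alter_value('Smith')
fixed = lookup({'name': 'set', 'value': 'Foo'}).alter_value('anything')
```

Defining a new subclass is enough to make it available to ``lookup``, which is the point of registering classes through a metaclass instead of extending an ``if``/``elif`` chain.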
23 changes: 5 additions & 18 deletions pganonymizer/utils.py
@@ -1,18 +1,15 @@
import csv
import logging
from cStringIO import StringIO
from hashlib import md5

import psycopg2
import psycopg2.extras
from faker import Faker
from progress.bar import IncrementalBar
from psycopg2.errors import BadCopyFileFormat, InvalidTextRepresentation

from pganonymizer.constants import COPY_DB_DELIMITER, DATABASE_ARGS, DEFAULT_PRIMARY_KEY
from pganonymizer.exceptions import BadDataFormat, InvalidFieldProvider

fake_data = Faker()
from pganonymizer.exceptions import BadDataFormat
from pganonymizer.providers import get_provider


def anonymize_tables(connection, definitions, verbose=False):
@@ -208,23 +205,13 @@ def get_column_values(row, columns):
for definition in columns:
column_name = definition.keys()[0]
column_definition = definition[column_name]
provider = column_definition.get('provider')
provider_config = column_definition.get('provider')
orig_value = row.get(column_name)
if not orig_value:
# Skip the current column if there is no value to be altered
continue
if provider.startswith('fake'):
func_name = provider.split('.')[1]
func = getattr(fake_data, func_name)
value = func()
elif provider == 'md5':
value = md5(orig_value).hexdigest()
elif provider == 'clear':
value = None
elif provider == 'set':
value = column_definition.get('value')
else:
raise InvalidFieldProvider('Unknown provider for field {}: {}'.format(column_name, provider))
provider = get_provider(provider_config)
value = provider.alter_value(orig_value)
append = column_definition.get('append')
if append:
value = value + append
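
After this change, the per-column logic reduces to four steps: read the provider configuration, skip empty values, let the provider alter the value, then apply ``append``. The sketch below isolates that flow for a single column. ``alter_column`` and ``Md5Sketch`` are hypothetical names, the provider lookup is passed in as a parameter, and two details are assumptions rather than behavior of the commit: ``next(iter(...))`` replaces ``definition.keys()[0]`` so the code also runs on Python 3, and the suffix is skipped when a provider returns ``None``.

```python
from hashlib import md5


class Md5Sketch(object):
    """Stand-in for a provider instance; only alter_value() is needed here."""

    def alter_value(self, value):
        return md5(value.encode('utf-8')).hexdigest()


def alter_column(definition, row, get_provider):
    """Return (column_name, new_value), or None if the column is skipped."""
    # next(iter(...)) works on Python 2 and 3, unlike definition.keys()[0].
    column_name = next(iter(definition))
    column_definition = definition[column_name]
    orig_value = row.get(column_name)
    if not orig_value:
        # Nothing to alter for empty or missing values.
        return None
    provider = get_provider(column_definition.get('provider'))
    value = provider.alter_value(orig_value)
    append = column_definition.get('append')
    if append and value is not None:
        # Assumption: skip the suffix when the provider returned None.
        value = value + append
    return column_name, value


definition = {'email': {'provider': {'name': 'md5'}, 'append': '@localhost'}}
result = alter_column(definition, {'email': 'jane@example.com'}, lambda config: Md5Sketch())
skipped = alter_column(definition, {'email': ''}, lambda config: Md5Sketch())
```

Here ``result`` holds the column name and a 32-character hex digest followed by ``@localhost``, and ``skipped`` is ``None`` because the original value is empty.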
2 changes: 1 addition & 1 deletion pganonymizer/version.py
@@ -1,3 +1,3 @@
# -*- coding: utf-8 -*-

__version__ = '0.1.1'
__version__ = '0.2.0'
12 changes: 7 additions & 5 deletions sample_schema.yml
@@ -3,11 +3,13 @@ tables:
primary_key: id
fields:
- first_name:
provider: set
value: "Foo"
provider:
name: fake.first_name
- last_name:
provider: set
value: "Bar"
provider:
name: set
value: "Bar"
- email:
provider: md5
provider:
name: md5
append: "@localhost"
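
The sample schema now uses ``fake.first_name``, a dotted name in which the first part selects the provider and the second part names the function to call. A small sketch of that resolution follows. ``resolve_fake_function`` and ``FakeDataSketch`` are hypothetical; the stand-in class replaces a real ``Faker`` instance so the example has no third-party dependency, and the explicit check for a missing function part is an addition that the commit itself does not make.

```python
class FakeDataSketch(object):
    """Stand-in for a Faker instance so the example needs no third-party package."""

    def first_name(self):
        return 'Alex'


def resolve_fake_function(name, source):
    """Split 'fake.<function>' and return the bound function from source."""
    provider_id, _, func_name = name.partition('.')
    if provider_id.lower() != 'fake' or not func_name:
        raise ValueError('Expected a name of the form fake.<function>: %r' % name)
    try:
        return getattr(source, func_name)
    except AttributeError:
        raise ValueError('Unknown fake function: %s' % func_name)


func = resolve_fake_function('fake.first_name', FakeDataSketch())
first_name = func()
```

With a real ``Faker`` instance passed as ``source``, the same lookup would return Faker's own ``first_name`` method.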