Tools for faster, optimized interaction with Teradata and large datasets. This package was initially created to make Teradata's FastExport utility easier to use and to work around the spool space issues that can arise with large tables.
fastteradata can easily be downloaded straight from github via the following:
pip install git+https://github.com/mark-hoffmann/fastteradata.git
The main task this package accomplishes right now is exporting large tables faster and more easily than loading them directly. It does this by connecting to a Teradata account, reading metadata about the table of interest, and autogenerating and executing the appropriate FastExport script. You then have the option to read the data into memory, clean it, and serialize it as a feather or pickle file with the correct data types and columns.
Major Updates
Update 11/10/2017
Basic fastloading capabilities have been added for doing a fastload from a pandas dataframe
Update 11/24/2017
Major update to data type robustness on export
Update 12/1/2017
Auto horizontal scaling merged
Feather serialization merged
- Additional components will continue to be worked on, such as:
- Better optimization of auto parallelization
- The ability to join different tables for an export
- Plus popular suggestions from others in the issues
Setup
Before you can use fastteradata, you must create a file in your home directory called ".fastteradata", such that the path is "~/.fastteradata". This file follows JSON formatting rules, and an example can be found here. All fields surrounded with underscores need to be updated. You don't necessarily need two different environments.
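For illustration only, here is a minimal sketch of writing that file from Python. The field names and environment names below are hypothetical placeholders, not the package's actual schema; copy the linked example file and fill in the fields surrounded by underscores.
import json
import os

# Hypothetical layout -- the real key names come from the example file in the repository
config = {
    "test": {"server": "_TEST_SERVER_", "username": "_USERNAME_", "password": "_PASSWORD_"},
    "prod": {"server": "_PROD_SERVER_", "username": "_USERNAME_", "password": "_PASSWORD_"}
}

with open(os.path.expanduser("~/.fastteradata"), "w") as f:
    json.dump(config, f, indent=4)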
Once you have your credential file set up, you are ready to go.
Extracting a table
An example of exporting is as follows:
import fastteradata as ftd
ftd.extract_table("/absolute/path/to/output", "TABLE_NAME", "ENV_NAME", "DB_NAME", nrows=50, connector="pyodbc")
- This particular call will
- Create a data and serialized folder in the directory of your absolute path argument
- Connect to Teradata via the connector "pyodbc"
- Read metadata about the table
- Generate a fast export script in your absolute path directory
- Execute the script and populate a data file with the first 50 rows of the table into the data/ subdirectory
- Read in the data file and attempt to clean it, inserting np.nan where appropriate and applying the appropriate typecasting
- Save the resulting pandas dataframe into the serialized/ subdirectory
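Once serialization is done, the cleaned dataframe can be read back with pandas. A minimal sketch, assuming the default feather serialization (the exact file name below is a guess; check the serialized/ subdirectory for the name fastteradata actually wrote):
import pandas as pd

# Hypothetical file name -- inspect the serialized/ subdirectory for the real one
df = pd.read_feather("/absolute/path/to/output/serialized/TABLE_NAME.feather")
print(df.dtypes)
print(df.head())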
Loading a table
An example of loading is as follows:
import fastteradata as ftd
ftd.load_table("/absolute/path/for/files", df, "TABLE_NAME", "ENV_NAME", "DB_NAME", clear_table=True)
- This call will
- Take the datatypes from your dataframe and generate a fastload script file
- Save your dataframe to the appropriate csv file that fastload expects
- Since the clear_table flag is True (which is also the default), it will DROP THE CURRENT TABLE if one with that name exists in that database and recreate a new table to fill in
- Execute the script to populate the database table
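Putting it together, a minimal end-to-end sketch might look like the following. The dataframe contents here are made up purely for illustration; in practice you would use whatever frame you want pushed up to Teradata.
import pandas as pd
import fastteradata as ftd

# Toy dataframe purely for illustration
df = pd.DataFrame({
    "id": [1, 2, 3],
    "amount": [10.5, 20.0, 7.25],
})

# Drops and recreates TABLE_NAME (clear_table=True is also the default), then fastloads the rows
ftd.load_table("/absolute/path/for/files", df, "TABLE_NAME", "ENV_NAME", "DB_NAME", clear_table=True)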
While this project is small, the method signatures can be found below. If this starts to become much larger, I will generate Sphinx docs.
extract_table(abs_path, table_name, env, db, nrows=-1, connector="teradata", columns=[], clean_and_serialize="feather", partition_key="", partition_type="year", primary_keys=[], meta_table="", where_clause="", distinct=False)
Summary
Extracts table information from Teradata and saves / executes the appropriate files
Args
abs_path (str): Absolute path where you want your scripts to reside and data and serialized subdirectories made
table_name (str): Teradata table name you wish to query
env (str): Environment that you want to connect to as specified in your .fastteradata file. (People usually have a testing and production environment)
db (str): Database name to connect to
nrows (int): default = -1 The default of -1 means ALL rows. Otherwise, you can specify a subset of rows, such as 20
connector (str): default = 'teradata' The default uses the teradata python module to connect to the cluster. Valid options include 'teradata' and 'pyodbc'
columns (list(str)): default = [] Subset of columns to use, default is all of the columns found in the metadata, however subsets can be selected by passing in ['col1','col2','col3']
clean_and_serialize (str): default = "feather" Specifies whether to read the resulting data file into memory, clean it, and then serialize it into your serialized subdirectory. Available options include 'feather', 'pickle', and False. Use False if you do not want to serialize the resulting dataframe and only want the raw text file.
partition_key (str): default = '' There is no partitioning by default. When you define a partition key, it MUST BE A DATE COLUMN AS DEFINED IN TERADATA. The export is broken into partitions according to the partition_type argument, generating multiple fexp scripts and executing them in parallel across the available cores. This helps to break up extremely large datasets or increase speed. When a partition key is defined, after all of the partition files have finished loading from Teradata, the resulting data is COMBINED into a SINGLE DATA FILE and then continues through the usual cleaning, data type specification, and serializing (see the sketch following the Returns section below).
partition_type (str): default = 'year' The default is to partition the partition_key by distinct YEAR. Valid options include "year" or "month"
primary_keys (list(str)): default = [] This is required whenever horizontal partitioning is needed. Horizontal partitioning is done automatically when you are trying to extract more than 100 columns from a table. The list of column names should be the columns that define a unique row in the dataset. If they do not define a unique row, you will receive unexpected behavior, with unwanted rows appearing in your dataset. This scaling feature exists because of the limitations of Teradata's FastExport utility; it abstracts away the headache of dealing with more columns than the utility can handle.
meta_table (str): default = "" This is used if you want to override which table metadata is taken from. For example, views do not have metadata associated with them in Teradata, so you can create a temp table without data from a view to get the metadata registered in dbc.tables and dbc.columns, then point to the temp table's metadata while pulling from the correct view. Follows the format "dbname.tablename"
where_clause (str): default = "" Where clause to use in the generated SQL scripts. You do not need to include the WHERE keyword, e.g. "col1 > 15"
distinct (bool): default = False Whether to put a distinct keyword in the select statement
Returns
Column list received from the metadata if clean_and_serialize is set to False, else nothing. Column names are returned in this case so you can save them and use them to read the raw data file later with appropriate columns.
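To make the partitioning and horizontal-scaling arguments above more concrete, here is a hedged sketch of a partitioned extract. The column names (load_dt, acct_id, txn_id) and the where clause are invented for illustration; substitute columns from your own table.
import fastteradata as ftd

# load_dt is assumed to be a DATE column in Teradata; the export is split by year and run in parallel
ftd.extract_table(
    "/absolute/path/to/output",
    "TABLE_NAME",
    "ENV_NAME",
    "DB_NAME",
    connector="pyodbc",
    partition_key="load_dt",
    partition_type="year",
    primary_keys=["acct_id", "txn_id"],  # only needed when the table has more than 100 columns
    where_clause="load_dt >= DATE '2015-01-01'",
    clean_and_serialize="feather",
)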
load_table(abs_path, df, table_name, env, db, connector="teradata", clear_table=True)
Summary
Loads a pandas dataframe from memory into Teradata via the optimized fastload functionality.
Args
abs_path (str): Absolute path where you want your scripts to reside and appropriate subdirectories made (Most of the time should be same absolute path as the extract_table abs_path)
df (pandas DataFrame): The pandas DataFrame that you want to save to Teradata
table_name (str): The desired table name
env (str): Environment that you want to connect to as specified in your .fastteradata file. (People usually have a testing and production environment)
db (str): Database name to connect to
connector (str): default = 'teradata' The default uses the teradata python module to connect to the cluster. Valid options include 'teradata' and 'pyodbc'
clear_table (bool): default = True This specifies whether the table you specify in your db should be dropped before loading. Right now, this is the recommended way of using this function; otherwise, you have to get the columns exactly right, and error handling has not been robustly built out for that case yet.
Returns
Nothing
fastteradata currently supports Python 3.6
fastteradata was written by Mark Hoffmann.