This solution implements an automated data quality check and correction system using AWS services and Large Language Models (LLM). The architecture combines AWS Glue for ETL processing and Amazon Bedrock for LLM-based data correction, and Amazon Redshift for data storage.
-
AWS Glue ETL Job
- Processes source data from S3
- Performs data quality checks
- Integrates with Amazon Bedrock for corrections
-
Amazon Bedrock
- Provides LLM capabilities
- Suggests corrections for identified issues
-
Amazon Redshift
- Stores the processed and corrected data
- AWS Account with appropriate permissions
- Amazon Bedrock access
- AWS Glue service role with necessary permissions
- Redshift cluster and connection details
-
Clone the repository:
git clone https://github.com/fhuthmacher/data-quality-automation.git cd data-quality-automation
-
Set up a virtual environment:
python -m venv venv source venv/bin/activate # On Windows, use `venv\Scripts\activate`
-
Create dev.env file and populate with the appropriate values for the environment variables
REGION=us-east-1 SQL_DATABASE=REDSHIFT SQL_DIALECT=PostgreSQL DATABASE_SECRET_NAME=RedshiftCreds S3_BUCKET_NAME=XXX GLUE_IAM_ROLE_ARN=XXX
-
Set up Redshift connection in AWS Glue
-
Ensure you have sufficient permissions to access AWS Glue, Amazon Bedrock, and Amazon Redshift
-
Go through the notebooks and run them in the order they are listed.
-
Refer to the blog for further details.
- /01_dq-etl.ipynb: This notebook creates a table in Redshift, along with an AWS Glue ETL job that includes data quality checks and anomaly detection.
- /02_dq-etl with correction.ipynb: creates a new AWS Glue ETL job that includes LLM auto-correction.
- /03_create_custom_reusable_visual_transform.ipynb: This notebook creates a custom reusable visual transform that can be used in Glue Studio to fix data quality issues with the help of an LLM.
- /04_dq-llm-etl.ipynb: This notebook evaluates different prompt engineering techniques for LLM-based data quality checks.
- Felix Huthmacher - Initial work - github
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.