All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to semantic versioning.
- Added link and description of `easy_pipeline_run` repo to `README.md`.
- Modified `list_files` function in `cdp/helpers/s3_utils.py` to use pagination when listing objects from S3 buckets, improving handling of large buckets.
- Added test cases for new pagination functionality in `list_files` function in `tests/cdp/helpers/test_s3_utils.py`.
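As a minimal sketch of the pagination approach described above: this is not the project's `list_files` implementation. The function name `list_keys_paginated` and its parameters are hypothetical, and `client` is assumed to be a boto3 S3 client, whose `list_objects_v2` paginator returns results one page at a time.

```python
from typing import Any, List


def list_keys_paginated(client: Any, bucket_name: str, prefix: str = "") -> List[str]:
    """Collect every object key under a prefix, one page at a time.

    Hypothetical helper for illustration only; `client` is assumed to be
    a boto3 S3 client, e.g. the result of boto3.client("s3").
    """
    paginator = client.get_paginator("list_objects_v2")
    keys: List[str] = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # "Contents" is absent from a page when no objects match.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys
```

Passing the client in as an argument keeps the sketch free of credentials and makes it easy to exercise with a stub in unit tests.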
- Modified `insert_df_to_hive_table` function in `cdp/io/output.py`. Added support for creating non-existent Hive tables, repartitioning by column or partition count, and handling missing columns with explicit type casting.
- Update `CODEOWNERS` file, changed email to GitHub username.
- Updated `ons-mkdocs-theme` version from `1.1.2` to `1.1.3` to fix issues with the crest not showing in the footer of documentation site.
- Updated the `ons-mkdocs-theme` version number in `doc` requirements in `setup.cfg`.
- Unpinned `pandas` version in `setup.cfg` to allow for more flexibility in dependency management.
- Removed `numpy` from `setup.cfg` as it will be installed automatically by `pandas`.
- Added `write_csv` function inside `cdp/helpers/s3_utils.py`.
- Changed `cut_lineage` function inside `helpers/pyspark.py` to make it compatible with newer PySpark versions.
- Added "How the Project is Organised" section to `README.md`.
- Fix docstring for `test_load_json_with_encoding` in `test_s3_utils.py`.
- Added `load_json` to `s3_utils.py`.
- Added `InvalidS3FilePathError` to `exceptions.py`.
- Added `validate_s3_file_path` to `s3_utils.py`.
- Fixed docstring for `load_csv` in `helpers/pyspark.py`.
- Call `validate_s3_file_path` function inside `save_csv_to_s3`.
- Call `validate_bucket_name` and `validate_s3_file_path` function inside `cdp/helpers/s3_utils/load_csv`.
- Improved `truncate_external_hive_table` to handle both partitioned and non-partitioned Hive tables, with enhanced error handling and support for table identifiers in `<database>.<table>` or `<table>` formats.
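To illustrate the two identifier formats mentioned above, here is a minimal sketch of how such a string might be split. The helper `split_table_identifier` is hypothetical and is not taken from the project; the actual parsing inside `truncate_external_hive_table` may differ.

```python
from typing import Optional, Tuple


def split_table_identifier(identifier: str) -> Tuple[Optional[str], str]:
    """Split '<database>.<table>' or '<table>' into (database, table).

    Hypothetical helper for illustration only. The database part is
    None when the identifier contains just a table name.
    """
    parts = identifier.split(".")
    if len(parts) == 1 and parts[0]:
        return None, parts[0]
    if len(parts) == 2 and all(parts):
        return parts[0], parts[1]
    raise ValueError(
        f"Expected '<database>.<table>' or '<table>', got: {identifier!r}"
    )
```

Rejecting empty parts and extra dots up front means a malformed identifier fails with a clear message before any Hive statement is issued.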
- Added `load_csv` to `helpers/pyspark.py` with kwargs parameter.
- Added `truncate_external_hive_table` to `helpers/pyspark.py`.
- Added `get_tables_in_database` to `cdp/io/input.py`.
- Added `load_csv` to `cdp/helpers/s3_utils.py`. This loads a CSV from S3 bucket into a Pandas DataFrame.
- Removed `.config("spark.shuffle.service.enabled", "true")` from `create_spark_session()` as it is not compatible with CDP. Added `.config("spark.dynamicAllocation.shuffleTracking.enabled", "true")` & `.config("spark.sql.adaptive.enabled", "true")`.
- Change `mkdocs` theme from `mkdocs-tech-docs-template` to `ons-mkdocs-theme`.
- Added more parameters to `load_and_validate_table()` in `cdp/io/input.py`.
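The Spark configuration change in the first entry above can be sketched as follows. This is not the body of `create_spark_session()`; the names `CDP_SPARK_OPTIONS` and `apply_options` are hypothetical, and `builder` is assumed to be a PySpark `SparkSession.builder` object, whose `config(key, value)` method returns the builder.

```python
from typing import Any, Dict

# The two options named in the changelog entry.
# "spark.shuffle.service.enabled" is deliberately left out.
CDP_SPARK_OPTIONS: Dict[str, str] = {
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.sql.adaptive.enabled": "true",
}


def apply_options(builder: Any, options: Dict[str, str]) -> Any:
    """Apply each option to a builder via its chained .config() method.

    Hypothetical helper for illustration only; `builder` is assumed to
    behave like pyspark.sql.SparkSession.builder.
    """
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark installed, usage would look like `apply_options(SparkSession.builder.appName("example"), CDP_SPARK_OPTIONS).getOrCreate()`.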
- Temporarily pin `numpy==1.24.4` due to https://github.com/numpy/numpy/issues/267100
- Added `zip_folder` function to `io/output.py`.
- Modified `gcp_utils.py`, added more helper functions for GCS.
- Modified docstring for `InvalidBucketNameError` in `exceptions.py`.
- Added `.isort.cfg` to configure `isort` with the `black` profile and recognize `rdsa-utils` as a local repository.
- Reformatted the entire codebase using `black` and `isort`.
- Updated `.pre-commit-config.yaml` to include `black` and `isort` as pre-commit hooks for code formatting.
- Updated `setup.cfg` to include `black` and `isort` in the `dev` requirements.
- Updated `README.md` to include `black` formatting badge.
- Updated `ruff.toml` to align with `black`'s formatting rules.
- Added `save_csv_to_s3` function in `cdp/io/output.py`.
- Modified docstrings in `cdp/helpers/s3_utils.py`; remove type-hints from docstrings, type-hints already in function signatures.
- Add Examples section in `delete_folder` function in `s3_utils.py`.
- Modified docstrings in `cdp/io/input.py` & `cdp/io/output.py`; remove type-hints from docstrings, type-hints already in function signatures.
- Updated `.gitignore` to exclude `metastore_db/` directory.
- Standardised parameter names for consistency across S3 utility functions in `s3_utils.py`.
- Added `s3_utils.py` module located in `cdp/helpers/`.
- Updated `reference.md`; included `s3_utils.py`.
- Updated `README.md`; added Ruff and Python versions badges.
- Revised the "Further Reading on Reproducible Analytical Pipelines" section in the `README.md` for clarity.
- Breaking Change: Renamed module `cdsw` to `cdp` (Cloudera Data Platform).
- Added a "Further Reading on Reproducible Analytical Pipelines" section to `README.md` to enhance resources on RAP best practices.
- Added section on synchronising the `development` branch with `main` to the `branch_and_deploy_guide.md` file.
- Updated `contribution_guide.md`; fix code block rendering issue in `mkdocs` by removing extra whitespaces.
- Updated `branch_and_deploy_guide.md`, added section titled: "Merging Development to Main: A Guide for Maintainers"
- Updated `README.md` to include new badges for Deployment Status and PyPI version.
- Added `mkdocs-mermaid2-plugin` to the `doc` extras_require in `setup.cfg`, enhancing documentation with MermaidJS diagram support.
- Added `gitleaks` and local `restrict-filenames` hooks to `.pre-commit-config.yaml`.
- Enhanced `README.md` headers with relevant emojis for improved readability and engagement.
- Modified `README.md`: Added Installation section and Git Workflow Diagram section with a MermaidJS diagram.
- Improved the `branch_and_deploy_guide.md` and `contribution_guide.md` documentation on branching strategy.
- Updated `python_requires` in `setup.cfg` to support Python versions `>=3.8` and `<3.12`, including all `3.11.x` versions.
- Modified `pull_request_workflow.yaml` to add Python `3.11` to the testing matrix.
- Moved `pyspark` from primary dependencies to `dev` section in `extras_require` to streamline installation for users with pre-installed environments, requiring manual installation where necessary.
- Renamed `isdir` function in `cdsw/helpers/hdfs_utils` to `is_dir` for improved compliance with PEP 8 naming conventions.
- Removed line stopping existing SparkSession in `create_spark_session` to prevent Py4JError and enable seamless SparkContext management on GCP.
- Refactor `save_csv_to_hdfs` to use functions in `/cdsw/helpers/hdfs_utils.py`
- Add function `delete_path` in `/cdsw/helpers/hdfs_utils.py`, and refactor docstring for `delete_file` and `delete_dir`.
- Modified `CHANGELOG.md`; added note on missing `pre-v0.1.8` releases due to `deploy_pypi.yaml` issues
- Added `pyproject.toml` and `setup.cfg`.
- Removed `requirements.txt` now in `setup.cfg`.
- Added `build` dependency in `.github/workflows/deploy_pypi.yaml`
- Modified Workflow Trigger in `.github/workflows/deploy_pypi.yaml`
- Removed `.github/workflows/version_check.yaml`
- Fix GitHub Branch Reference for deployment.
- Remove check of branch for deployment.
- Take workflows out of nested folder to have PyPI listing on merge to main branch.
- Workflows to have PyPI listing on merge to main branch.
- Typo in the documentation to install Python.
- `parametrize_cases` and `Case` code for use in test scripts.
- Add in PR template.
- README with additional information and guidelines for contributors.
- Pull Request Workflow includes `test` job which installs Poetry and Run Tests.
- Add `.pre-commit-config.yaml` for pre-commit hooks.
- Add CODEOWNERS file to repository.
- Add mkdocs; `deploy_mkdocs.yaml` and `docs` Folder.
- Add the helpers_spark.py and test_helpers_spark.py modules from cprices-utils.
- Add logging.py and test_logging.py module from cprices-utils.
- Add the helpers_python.py and test_helpers_python.py modules from cprices-utils.
- Add averaging_methods.py and test_averaging_methods.py.
- Add `init_logger_advanced` in `helpers/logging.py` module.
- Add in the general validation functions from cprices-utils.
- Add `invalidate_impala_metadata` function to the `cdsw/impala.py` module.
- Add "search" Plugin and mkdocs GOV UK Theme via `mkdocs-tech-docs-template`.
- Add `pipeline_runlog.py` and `hdfs_utils.py` modules from `epds_utils`.
- Add common custom exceptions.
- Add config load class.
- Add generic IO input functions.
- Add `docs/contribution_guide.md`
- Add functions from `epds_utils` into `helpers/pyspark.py`, `io/input.py`, `io/output.py`.
- Add various I/O functions from the io.py module in cprices-utils.
- Add modules to `docs/reference.md`
- Add mkdocs Plugins: `mkdocs-git-revision-date-localized-plugin`, `mkdocs-jupyter`.
- Add better navigation to `mkdocs.yml`.
- Add `save_csv_to_hdfs` function to `cdsw/io/output.py`.
- Add `docs/branch_and_deploy_guide.md`.
- Add `.github/workflows/deploy_pypi/version_check.yaml` and `.github/workflows/deploy_pypi/deploy_pypi.yaml`.
- Renamed `_typing` module to `typing`.
- Renamed modules in helpers directory to remove `helper_` from names.
- Relocated `logging.py` and `validation.py` to root level.
- Relocated `Getting Started for Developers` into `docs/contribution_guide.md`.
- Migrated from `poetry` to `setup.py` for Python Code Packaging.
- Upgrade `mkdocs-tech-docs-template` to `0.1.2`.
- Moved CDSW related from `io/input.py` & `io/output.py` into `cdsw/io/input.py` & `cdsw/io/output.py`
- Pin `pytest` version `<8.0.0` due to TvoroG/pytest-lazy-fixture#65
- Updated the license information.
- Fix paths for `get_window_spec` in `averaging_methods.py`.
- Fix `deploy_mkdocs.yaml`, changed `mkdocs-material` to `mkdocs-tech-docs-template`.
- Fix module paths for unit test patches in `tests/cdsw/`.
- Fix `pull_request_workflow.yaml`; ensured pytest failures are accurately reported in GitHub workflow by removing `|| true` condition.
- Fix `deploy_mkdocs.yaml`, fixed Python version to `3.10`.
- Fix `deploy_mkdocs.yaml`, missing quotes for Python version.
- Remove `_version.py`.
- Remove all references to Poetry.
Note: Releases prior to v0.1.8 are not available on GitHub Releases and PyPI due to bugs in the GitHub Action `deploy_pypi.yaml`, which deploys to PyPI and GitHub Releases.
- rdsa-utils v0.5.0: GitHub Release | PyPI
- rdsa-utils v0.4.4: GitHub Release | PyPI
- rdsa-utils v0.4.3: GitHub Release | PyPI
- rdsa-utils v0.4.2: GitHub Release | PyPI
- rdsa-utils v0.4.1: GitHub Release | PyPI
- rdsa-utils v0.4.0: GitHub Release | PyPI
- rdsa-utils v0.3.7: GitHub Release | PyPI
- rdsa-utils v0.3.6: GitHub Release | PyPI
- rdsa-utils v0.3.5: GitHub Release | PyPI
- rdsa-utils v0.3.4: GitHub Release | PyPI
- rdsa-utils v0.3.3: GitHub Release | PyPI
- rdsa-utils v0.3.2: GitHub Release | PyPI
- rdsa-utils v0.3.1: GitHub Release | PyPI
- rdsa-utils v0.3.0: GitHub Release | PyPI
- rdsa-utils v0.2.3: GitHub Release | PyPI
- rdsa-utils v0.2.2: GitHub Release | PyPI
- rdsa-utils v0.2.1: GitHub Release | PyPI
- rdsa-utils v0.2.0: GitHub Release | PyPI
- rdsa-utils v0.1.10: GitHub Release | PyPI
- rdsa-utils v0.1.9: GitHub Release | PyPI
- rdsa-utils v0.1.8: GitHub Release | PyPI
- rdsa-utils v0.1.7 - Not available on GitHub Releases or PyPI
- rdsa-utils v0.1.6 - Not available on GitHub Releases or PyPI
- rdsa-utils v0.1.5 - Not available on GitHub Releases or PyPI
- rdsa-utils v0.1.4 - Not available on GitHub Releases or PyPI
- rdsa-utils v0.1.3 - Not available on GitHub Releases or PyPI
- rdsa-utils v0.1.2 - Not available on GitHub Releases or PyPI
- rdsa-utils v0.1.1 - Not available on GitHub Releases or PyPI
- rdsa-utils v0.1.0 - Not available on GitHub Releases or PyPI