[FEATURE] Support for Delta table history #163

Open
wants to merge 2 commits into
base: main

zarembat

Description

  • Added describe_history() method to DeltaTableStep, enabling fetching of Delta table history as a Spark DataFrame.
  • Added is_date_stale() method to assess if data in a specified table is stale based on defined time intervals or a specific intended refresh day.
  • Added DTInterval class for management of date and time intervals.

Motivation and Context

This change makes it possible to retrieve a Delta table's history (based on the Delta log) as a Spark DataFrame. It also provides a way to check whether data in a Delta table is stale, based on defined time intervals or on a specific weekday designated for refreshing.
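
To make the interval-based staleness idea concrete, here is a minimal sketch in plain Python. The function name `is_stale`, its parameters, and the timestamps are hypothetical and chosen only for illustration; this is not the PR's `is_date_stale()` implementation, and it does not cover the refresh-weekday case.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_stale(last_modified: datetime, max_age: timedelta, now: Optional[datetime] = None) -> bool:
    """Return True if the last write is older than the allowed age.

    Hypothetical helper for illustration; not the PR's is_date_stale().
    """
    current = now if now is not None else datetime.now()
    return current - last_modified > max_age


# Made-up timestamp standing in for the latest commit found in the table history.
last_write = datetime(2025, 1, 28, 9, 0, 0)
check_time = datetime(2025, 1, 31, 14, 0, 0)

# The data is 3 days and 5 hours old at check_time.
stale_after_two_days = is_stale(last_write, timedelta(days=2), now=check_time)
stale_after_one_week = is_stale(last_write, timedelta(weeks=1), now=check_time)
print(stale_after_two_days, stale_after_one_week)
```

Passing `now` explicitly keeps the check deterministic, which is convenient for unit tests.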

How Has This Been Tested?

All methods and classes have been unit tested with pytest.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@zarembat zarembat requested a review from a team as a code owner January 31, 2025 14:26
@dannymeijer dannymeijer added the enhancement New feature or request label Jan 31, 2025
@dannymeijer dannymeijer added this to the 0.10.0 milestone Jan 31, 2025
…e describe_history() and added some log messages for debugging
@YevIgn

YevIgn commented Jan 31, 2025

Why not use relativedelta - https://dateutil.readthedocs.io/en/stable/examples.html#relativedelta-examples - it would allow a precise staleness period calculation instead of a month approximation.
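
A small sketch of the difference, assuming a month is approximated as 30 days (the approximation actually used in the PR is not shown in this conversation, so that figure is only an assumption). The timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

from dateutil.relativedelta import relativedelta

now = datetime(2025, 3, 31, 12, 0, 0)
last_modified = datetime(2025, 2, 28, 18, 0, 0)

# Approximation (assumed): one month == 30 days -> cutoff is 2025-03-01 12:00.
approx_cutoff = now - timedelta(days=30)

# Calendar-aware: one month before 31 March is clamped to 28 February 2025, 12:00.
exact_cutoff = now - relativedelta(months=1)

# The same table is judged differently depending on the cutoff used.
stale_by_approximation = last_modified < approx_cutoff
stale_by_relativedelta = last_modified < exact_cutoff
print(approx_cutoff, exact_cutoff, stale_by_approximation, stale_by_relativedelta)
```

With these inputs the 30-day approximation flags the table as stale, while the calendar-aware cutoff does not.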

As for the date-time parsing in DTInterval itself, maybe we could consider using the ISO 8601 format instead? It's already supported out of the box by Pydantic - https://docs.pydantic.dev/2.2/usage/types/datetime/ - and, as far as the docs say, it now supports date intervals as well.

Update: yes, I just tested it with the string 'P1000DT12H30M5S'. Not as neat as regular words, but standard nonetheless.
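
For reference, a minimal sketch of that test, assuming Pydantic is installed. The model name `StalenessConfig` and the field `refresh_interval` are hypothetical; only the duration string comes from the comment above.

```python
from datetime import timedelta

from pydantic import BaseModel


class StalenessConfig(BaseModel):
    # Hypothetical model: Pydantic coerces ISO 8601 duration strings into timedelta.
    refresh_interval: timedelta


cfg = StalenessConfig(refresh_interval="P1000DT12H30M5S")

# 1000 days, 12 hours, 30 minutes and 5 seconds.
expected = timedelta(days=1000, hours=12, minutes=30, seconds=5)
print(cfg.refresh_interval == expected)
```

Note that the parsed value is a fixed-length `timedelta`, so this route suits day/hour/minute intervals rather than calendar-aware months.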

Member

@dannymeijer dannymeijer left a comment

Looking great! Just a few small things

if err_msg.startswith("[table_or_view_not_found]") or err_msg.startswith("table or view not found"):
    if self.create_if_not_exists:

Would it be possible to put these logs back in another spot? I kind of like it when the "create table" process gives info about what it is doing.

"""

if not any((months, weeks, days, hours, minutes, seconds)) and dt_interval is None:
    raise ValueError(

Also add a Raises section to the docstring, for completeness' sake.
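
One way such a section could look, sketched on a hypothetical stand-in function (`resolve_interval` and its parameters are not from the PR, and the NumPy-style docstring layout is an assumption about the project's conventions):

```python
from datetime import timedelta
from typing import Optional


def resolve_interval(days: int = 0, hours: int = 0, interval: Optional[timedelta] = None) -> timedelta:
    """Build the staleness interval from components or a ready-made timedelta.

    Hypothetical example function, used only to show a Raises section.

    Raises
    ------
    ValueError
        If no time component is given and `interval` is None.
    """
    if not any((days, hours)) and interval is None:
        raise ValueError("Provide at least one time component or an interval.")
    # An explicit interval takes precedence over individual components.
    return interval if interval is not None else timedelta(days=days, hours=hours)


print(resolve_interval(days=1, hours=6))
```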
