Skip to content

coalesceio/Deferred-Merge

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

72 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Deferred Merge Package

The Deferred Merge Package includes mechanisms for handling merge operations with different streaming strategies:


Deferred Merge Append Stream

The Deferred Merge - Append Stream Node includes several key steps to ensure efficient and up-to-date data processing. First, a stream is created to capture row inserts. Then a target table is created and initially loaded with data. A hybrid view is established to provide access to the most current data by combining initial updates.

Finally, a scheduled task manages ongoing updates by merging changes into the target table. This process ensures that the target table remains synchronized with the source data maintaining accuracy and timeliness.

Deferred Merge Append Stream Node Configuration

The Deferred Merge Append node has the following configuration groups:

Append Stream General Options

Append General Options

Option Description
Development Mode True/False toggle that determines task creation:
- True: Table created and SQL executes as Run action
- False: SQL wraps into task with Scheduling Options
Create Target As Choose target object type:
- Table: Creates table
- Transient Table: Creates transient table

Prior to creating a task, it is helpful to test the SQL the task will execute to make sure it runs without errors and returns the expected data.

Append Stream Options

Append Stream Options

Option Description
Source Object Type of object for stream creation (Required):
- Table
- View
Show Initial Rows True: Returns only rows that existed when stream created
False: Returns DML changes since most recent offset
Redeployment Behavior Determines stream recreation behavior
Redeployment Behavior Stage Executed
Create Stream if not exists Re-Create Stream at existing offset
Create or Replace Create Stream
Create at existing stream Re-Create Stream at existing offset

Append Stream Target Loading Options

Append Target Loading Options

Option Description
Table keys Business keys columns for merging into target table
Record Versioning Date Time or Date and Timestamp column for latest record tracking

Append Stream Target Row DML Operations

Append Target Row DML Operations

Option Description
Column Identifier Column identifying DML operations
Include Value for Update For records flagged under Update, the existing records in the target table are updated with the corresponding values from the source table.
Insert Value It indicates that the corresponding record is meant to be inserted into the target table. This condition ensures that only records flagged for insertion are actually inserted into the target table during the merge operation.
Delete Value This value indicates that the corresponding record should either be soft-deleted (if the condition is met by enabling the soft delete toggle) or hard-deleted from the target table.

Append Stream Target Delete Options

Append Target Delete Options

Option Description
Soft Delete Toggle to maintain deleted data record
Retain Last Non-Deleted Values Preserves most recent non-deleted record, even as other records are marked as deleted or become inactive.

Append Stream Target Clustering Options

Append Target Clustering Options

Option Description
Cluster key True: Specify clustering column and allow expressions
False: No clustering implemented
Allow Expressions Cluster Key Aadd an expression to the specified cluster key

Append Stream Scheduling Options

If development mode is set to false then Scheduling Options can be used to configure how and when the task will run.

image

Option Description
Scheduling Mode Choose compute type:
- Warehouse Task: User managed warehouse executes tasks
- Serverless Task: Uses serverless compute
When Source Stream has Data Flag True/False toggle to check for stream data
True - Only run task if source stream has capture change data
False - Run task on schedule regardless of whether the source stream has data. If the source is not a stream should set this to false.
Select Warehouse Visible if Scheduling Mode is set to Warehouse Task. Enter the name of the warehouse you want the task to run on without quotes.
Select initial serverless size Visible when Scheduling Mode is set to Serverless Task.
Select the initial compute size on which to run the task. Snowflake will adjust size from there based on target schedule and task run times.
Task Schedule Choose schedule type:
- Minutes - Specify interval in minutes. Enter a whole number from 1 to 11520 which represents the number of minutes between task runs.
- Cron - Uses Cron expressions. Specifies a cron expression and time zone for periodically running the task. Supports a subset of standard cron utility syntax.
- Predecessor - Specify dependent tasks
Enter predecessor tasks separated by a comma Visible when Task Schedule is set to Predecessor.
One or more task names that precede the task being created in the current node. Task names are case sensitive, should not be quoted and must exist in the same schema in which the current task is being created. If there are multiple predecessor task separate the task names using a comma and no spaces.
Root task name Visible when Task Schedule is set to Predecessor.
Name of the root task that controls scheduling for the DAG of tasks. Task names are case sensitive, should not be quoted and must exist in the same schema in which the current task is being created. If there are multiple predecessor task separate the task names using a comma and no spaces.

Append Stream Limitations

🚧 Appyling Transformation This node can't apply transformations to the columns for this node type.

Append Stream Deployment

Append Stream Parameters

It includes an environment parameter that allows you to specify a different warehouse to run a task in different environments. The parameter name is targetTaskWarehouse, and the default value is DEV ENVIRONMENT.

When set to DEV ENVIRONMENT, the value entered in the Scheduling Options config "Select Warehouse on which to run the task" will be used when creating the task.

{"targetTaskWarehouse": "DEV ENVIRONMENT"}

When set to any value other than DEV ENVIRONMENT, the node will attempt to create the task using a Snowflake warehouse with the specified value. For example, with the setting below for the parameter in a QA environment, the task will execute using a warehouse named compute_wh.

{"targetTaskWarehouse": "compute_wh"}

Append Stream Initial Deployment

When deployed for the first time into an environment, executes the following stages:

Stage Description
Create Stream Executes CREATE OR REPLACE statement to create Stream in target environment
Create Target Table Creates destination table for processed data storage
Target Table Initial Load Populates the target table with the existing data from the source object. This step ensures that the target table starts with a complete set of data before any further changes are applied
Create Hybrid View Provides access to the most up-to-date data by combining the initial data load with any subsequent changes captured by the stream. The hybrid view ensures that users always have access to the latest version of the data without waiting for batch updates.
Create Task Creates merge operation task for stream changes
Resume Task Activates the created task for scheduled execution
Apply Table Clustering Alters table with cluster key if enabled
Resume Recluster Clustering Enables periodic reclustering to maintain optimal clustering
Append Stream Predecessor Task Deployment

When deploying with predecessor tasks, executes:

Stage Description
Suspend Root Task Suspends root task to add task into DAG
Create Task Creates task for loading target table

πŸ“˜ Task DAG Note

For tasks in a DAG, include Task Dag Resume Root node type to resume root node after dependent task creation. Tasks use CREATE OR REPLACE only, with no ALTER capabilities.

Append Stream Redeployment
Redeployment Behavior Stage Executed
Create Stream if not exists Re-Create Stream at existing offset
Create or Replace Create Stream
Create at existing stream Re-Create Stream at existing offset
Append Stream Table Redeployment

The following column/table changes trigger ALTER statements:

  • Table name changes
  • Column drops
  • Column data type alterations
  • New column additions

Executes these stages:

Stage Description
Alter Table Operations Executes appropriate ALTER statements for schema changes
Target Initial Load Executes load based on configuration:
- If initial load enabled + "Create or Replace": Uses INSERT OVERWRITE
- All other scenarios: Uses INSERT INTO
Append Stream View Redeployment

Stream or table changes trigger Hybrid View recreation.

Append Stream Task Redeployment

Stream or table changes trigger:

  1. Task recreation
  2. Task resumption

🚧 Redeployment Behavior

Redeployment with changes in Stream/Table/Task properties will result in execution of all steps mentioned in inital deployment.

Append Stream Undeployment

When node is deleted, executes:

  • Drop Stream
  • Drop Table/Transient Table
  • Drop View
  • Drop Current Task

Deferred Merge Delta Stream

The Deferred Merge - Delta Stream Node includes several key steps to ensure efficient and up-to-date data processing. First, a stream is created to capture changes from the source object to tracks all DML changes to the source object, including inserts, updates, and deletes. Then a target table is created and initially loaded with data. A hybrid view is established to provide access to the most current data by combining initial updates.

Finally, a scheduled task manages ongoing updates by merging changes into the target table. This process ensures that the target table remains synchronized with the source data maintaining accuracy and timeliness

Deferred Merge Delta Stream Node Configuration

The Deferred Merge Append node has the following configuration groups:

Delta Stream Stream General Options

Append General Options

Option Description
Development Mode True/False toggle that determines task creation:
- True: Table created and SQL executes as Run action
- False: SQL wraps into task with Scheduling Options
Create Target As Choose target object type:
- Table: Creates table
- Transient Table: Creates transient table

Prior to creating a task, it is helpful to test the SQL the task will execute to make sure it runs without errors and returns the expected data.

Delta Stream Options

Append Stream Options

Option Description
Source Object Type of object for stream creation (Required):
- Table
- View
Show Initial Rows True: Returns only rows that existed when stream created
False: Returns DML changes since most recent offset
Redeployment Behavior Determines stream recreation behavior
Redeployment Behavior Stage Executed
Create Stream if not exists Re-Create Stream at existing offset
Create or Replace Create Stream
Create at existing stream Re-Create Stream at existing offset

Delta Stream Target Loading Options

Append Target Loading Options

Option Description
Table keys Business keys columns for merging into target table
Record Versioning Date Time or Date and Timestamp column for latest record tracking

Delta Stream Target Row DML Operations

Append Target Row DML Operations

Option Description
Column Identifier Column identifying DML operations
Include Value for Update For records flagged under Update, the existing records in the target table are updated with the corresponding values from the source table.
Insert Value It indicates that the corresponding record is meant to be inserted into the target table. This condition ensures that only records flagged for insertion are actually inserted into the target table during the merge operation.
Delete Value This value indicates that the corresponding record should either be soft-deleted (if the condition is met by enabling the soft delete toggle) or hard-deleted from the target table.

Delta Stream Target Delete Options

Append Target Delete Options

Option Description
Soft Delete Toggle to maintain deleted data record
Retain Last Non-Deleted Values Preserves most recent non-deleted record, even as other records are marked as deleted or become inactive.

Delta Stream Target Clustering Options

Append Target Clustering Options

Option Description
Cluster key True: Specify clustering column and allow expressions
False: No clustering implemented
Allow Expressions Cluster Key Aadd an expression to the specified cluster key

Delta Stream Scheduling Options

If development mode is set to false then Scheduling Options can be used to configure how and when the task will run.

image

Option Description
Scheduling Mode Choose compute type:
- Warehouse Task: User managed warehouse executes tasks
- Serverless Task: Uses serverless compute
When Source Stream has Data Flag True/False toggle to check for stream data
True - Only run task if source stream has capture change data
False - Run task on schedule regardless of whether the source stream has data. If the source is not a stream should set this to false.
Select Warehouse Visible if Scheduling Mode is set to Warehouse Task. Enter the name of the warehouse you want the task to run on without quotes.
Select initial serverless size Visible when Scheduling Mode is set to Serverless Task.
Select the initial compute size on which to run the task. Snowflake will adjust size from there based on target schedule and task run times.
Task Schedule Choose schedule type:
- Minutes - Specify interval in minutes. Enter a whole number from 1 to 11520 which represents the number of minutes between task runs.
- Cron - Uses Cron expressions. Specifies a cron expression and time zone for periodically running the task. Supports a subset of standard cron utility syntax.
- Predecessor - Specify dependent tasks
Enter predecessor tasks separated by a comma Visible when Task Schedule is set to Predecessor.
One or more task names that precede the task being created in the current node. Task names are case sensitive, should not be quoted and must exist in the same schema in which the current task is being created. If there are multiple predecessor task separate the task names using a comma and no spaces.
Root task name Visible when Task Schedule is set to Predecessor.
Name of the root task that controls scheduling for the DAG of tasks. Task names are case sensitive, should not be quoted and must exist in the same schema in which the current task is being created. If there are multiple predecessor task separate the task names using a comma and no spaces.

Delta Stream Limitations

🚧 Appyling Transformation This node can't apply transformations to the columns for this node type.

Delta Stream Deployment

Delta Stream Parameters

It includes an environment parameter that allows you to specify a different warehouse to run a task in different environments. The parameter name is targetTaskWarehouse, and the default value is DEV ENVIRONMENT.

When set to DEV ENVIRONMENT, the value entered in the Scheduling Options config "Select Warehouse on which to run the task" will be used when creating the task.

{"targetTaskWarehouse": "DEV ENVIRONMENT"}

When set to any value other than DEV ENVIRONMENT, the node will attempt to create the task using a Snowflake warehouse with the specified value. For example, with the setting below for the parameter in a QA environment, the task will execute using a warehouse named compute_wh.

{"targetTaskWarehouse": "compute_wh"}

Delta Stream Initial Deployment

When deployed for the first time into an environment, executes the following stages:

Stage Description
Create Stream Executes CREATE OR REPLACE statement to create Stream in target environment
Create Target Table Creates destination table for processed data storage
Target Table Initial Load Populates the target table with the existing data from the source object. This step ensures that the target table starts with a complete set of data before any further changes are applied
Create Hybrid View Provides access to the most up-to-date data by combining the initial data load with any subsequent changes captured by the stream. The hybrid view ensures that users always have access to the latest version of the data without waiting for batch updates.
Create Task Creates merge operation task for stream changes
Resume Task Activates the created task for scheduled execution
Apply Table Clustering Alters table with cluster key if enabled
Resume Recluster Clustering Enables periodic reclustering to maintain optimal clustering
Delta Stream Predecessor Task Deployment

When deploying with predecessor tasks, executes:

Stage Description
Suspend Root Task Suspends root task to add task into DAG
Create Task Creates task for loading target table

πŸ“˜ Task DAG Note

For tasks in a DAG, include Task Dag Resume Root node type to resume root node after dependent task creation. Tasks use CREATE OR REPLACE only, with no ALTER capabilities.

Delta Stream Redeployment
Redeployment Behavior Stage Executed
Create Stream if not exists Re-Create Stream at existing offset
Create or Replace Create Stream
Create at existing stream Re-Create Stream at existing offset
Delta Stream Table Redeployment

The following column/table changes trigger ALTER statements:

  • Table name changes
  • Column drops
  • Column data type alterations
  • New column additions

Executes these stages:

Stage Description
Alter Table Operations Executes appropriate ALTER statements for schema changes
Target Initial Load Executes load based on configuration:
- If initial load enabled + "Create or Replace": Uses INSERT OVERWRITE
- All other scenarios: Uses INSERT INTO
Delta Stream View Redeployment

Stream or table changes trigger Hybrid View recreation.

Delta Stream Task Redeployment

Stream or table changes trigger:

  1. Task recreation
  2. Task resumption

🚧 Redeployment Behavior

Redeployment with changes in Stream/Table/Task properties will result in execution of all steps mentioned in inital deployment.

Delta Stream Undeployment

When node is deleted, executes:

  • Drop Stream
  • Drop Table/Transient Table
  • Drop View
  • Drop Current Task

Code

Deferred Merge - Append Stream

Deferred Merge - Delta Stream

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages