feat: support python submissions #248

Merged · 102 commits · Feb 6, 2024

Commits
751c84c
Submitted python model successfully
Avinash-1394 Mar 17, 2023
7737000
Rebase and resolved conflicts
Avinash-1394 Mar 25, 2023
c1d1ab2
Execution successful but table not saved
Avinash-1394 Mar 26, 2023
fd6a135
Add incremental model support
Avinash-1394 Mar 27, 2023
ed72a5b
Merge branch 'main' into add-python-model-support
Avinash-1394 Mar 27, 2023
514ad5c
Merge branch 'main' into add-python-model-support
Avinash-1394 Apr 2, 2023
7913646
Fixed location and incremental model rerun
Avinash-1394 Apr 3, 2023
233ab98
Added docs
Avinash-1394 Apr 5, 2023
d2ad09e
Updated README
Avinash-1394 Apr 5, 2023
af38061
Merge branch 'main' into add-python-model-support
Avinash-1394 Apr 5, 2023
fed0dfa
fix: return empty if table does not exist in get_columns (#199)
Jrmyy Apr 5, 2023
f5c8167
fix: check for empty workgroup in profiles (#194)
nicor88 Apr 6, 2023
785ba5f
chore: add credits section (#201)
mattiamatrix Apr 6, 2023
ee8bf95
fix: enable database in policy to support cross-account queries (#200)
Jrmyy Apr 6, 2023
414c739
fix: glue column types (#196)
svdimchenko Apr 7, 2023
6ad67e7
docs: add nicor88 as a contributor for code (#205)
allcontributors[bot] Apr 7, 2023
1c35b6e
docs: add jessedobbelaere as a contributor for bug (#210)
allcontributors[bot] Apr 7, 2023
3150170
docs: add lemiffe as a contributor for design (#209)
allcontributors[bot] Apr 7, 2023
4074bc7
docs: add jessedobbelaere as a contributor for maintenance (#212)
allcontributors[bot] Apr 7, 2023
439bc75
docs: add Jrmyy as a contributor for maintenance (#207)
allcontributors[bot] Apr 7, 2023
3d31ab9
docs: add dbt-athena-logo (#211)
mattiamatrix Apr 8, 2023
02c21a3
docs: update readme (#204)
mattiamatrix Apr 8, 2023
8c850a9
docs: add Tomme as a contributor for maintenance (#214)
allcontributors[bot] Apr 8, 2023
0f50a01
docs: add mattiamatrix as a contributor for maintenance (#216)
allcontributors[bot] Apr 8, 2023
e73ef90
chore: Update pytest requirement from ~=7.2 to ~=7.3 (#218)
dependabot[bot] Apr 10, 2023
8f17ea9
chore: Update pyathena requirement from ~=2.23 to ~=2.24 (#217)
dependabot[bot] Apr 14, 2023
e461f0a
fix: broken drop view query (#221)
jessedobbelaere Apr 14, 2023
386ec47
fix: allow to set table location when output location is configured b…
juliansteger-sc Apr 14, 2023
067d66f
fix: reading README.md (#225)
CommonCrisis Apr 14, 2023
8e21fcb
Merge branch 'main' into add-python-model-support
Avinash-1394 Apr 20, 2023
1beb024
Added tests
Avinash-1394 Apr 21, 2023
84c9d50
Added more tests
Avinash-1394 Apr 21, 2023
355948a
Merge branch 'main' into add-python-model-support
Avinash-1394 Apr 21, 2023
0693fa9
Fixed mypy errors
Avinash-1394 Apr 21, 2023
eac0504
Formatting
Avinash-1394 Apr 21, 2023
589715e
Simplified patches
Avinash-1394 Apr 21, 2023
4c43d49
Fixed imports
Avinash-1394 Apr 21, 2023
de12fd2
Added docs to tests and added kwargs
Avinash-1394 Apr 23, 2023
8e1bf6f
Fixed whitespace
Avinash-1394 Apr 23, 2023
352e1f6
Fixed whitespace
Avinash-1394 Apr 23, 2023
25531c9
Added overrides to source and ref
Avinash-1394 Apr 23, 2023
f86159d
Added support for configuring table output
Avinash-1394 Apr 24, 2023
883519e
Renamed functions and fixed minor bugs
Avinash-1394 Apr 24, 2023
890b601
Fixed CI
Avinash-1394 Apr 24, 2023
ee9bfd8
Fix names in test
Avinash-1394 Apr 24, 2023
24cb7ff
Handle botocore client error
Avinash-1394 Apr 26, 2023
e1b19e4
Merge branch 'main' into support-python-submissions
Avinash-1394 Apr 30, 2023
08ec29f
Fix conflict
Avinash-1394 Apr 30, 2023
2720dfc
Added threading support
Avinash-1394 May 3, 2023
71cd4ee
Added tests for python job helper
Avinash-1394 May 5, 2023
e61c4d9
Fix flake8 tests
Avinash-1394 May 5, 2023
82296ab
Merge branch 'main' into support-python-submissions
Avinash-1394 May 5, 2023
edfec3f
Merge branch 'main' into support-python-submissions
Avinash-1394 May 5, 2023
2dc8afd
Added tests for listing and updating sessions
Avinash-1394 May 7, 2023
f281a06
Added docs to tests
Avinash-1394 May 8, 2023
22ce37b
Merge branch 'main' into support-python-submissions
Avinash-1394 May 8, 2023
10461ed
Merge branch 'main' into support-python-submissions
Avinash-1394 May 26, 2023
0038628
Split the class objects into separate modules. Handled threading dead…
Avinash-1394 May 28, 2023
3a33445
Merge branch 'main' into support-python-submissions
Jrmyy Jun 7, 2023
76c5c88
Extracted session_count and spark work group methods
Avinash-1394 Jun 10, 2023
1fd436e
fix: BatchDeletePartitions only accepts up to 25 partitions (#328)
juliansteger-sc Jun 9, 2023
0bdaa7e
feat: enable mypy pre-commit check (#329)
svdimchenko Jun 9, 2023
ed56cfe
chore: Update dbt-tests-adapter requirement from ~=1.5.0 to ~=1.5.1 (…
dependabot[bot] Jun 9, 2023
c24d65a
Fixed readme. Moved some defaults to constants.
Avinash-1394 Jun 11, 2023
49ba924
Merge branch 'main' into support-python-submissions
Avinash-1394 Jun 11, 2023
209e02c
Add more docs and functional tests.
Avinash-1394 Jun 14, 2023
d044c35
Merge branch 'main' into support-python-submissions
Avinash-1394 Jun 14, 2023
cad13db
Merge branch 'main' into support-python-submissions
Sep 26, 2023
f56c2e8
Used default logger
Sep 28, 2023
81d499a
Used default logger
Sep 28, 2023
213ee65
Fix isort
Sep 28, 2023
9aead54
Merge branch 'main' into support-python-submissions
Avinash-1394 Sep 28, 2023
a3da02b
Merge branch 'main' into support-python-submissions
nicor88 Oct 5, 2023
6c72e7d
Merge branch 'main' into support-python-submissions
Avinash-1394 Oct 6, 2023
ea38eee
Support additional arguments for create_table_as with dispatch. Call …
Avinash-1394 Oct 10, 2023
8d1848d
Merge branch 'main' into support-python-submissions
Avinash-1394 Oct 10, 2023
5060ae5
Provide language as argument instead of kwarg. Reduce diff in connect…
Avinash-1394 Oct 10, 2023
88e8e44
Restore diff in connections and fix functional test for constraint
Avinash-1394 Oct 10, 2023
855bd6d
Restore test_constraint
Avinash-1394 Oct 10, 2023
8153d20
Disable python functional test until spark work group is added and IA…
Avinash-1394 Oct 10, 2023
1effdbc
Reduce diff in connections
Avinash-1394 Oct 10, 2023
a49102a
Merge branch 'main' into support-python-submissions
Avinash-1394 Nov 23, 2023
9edb519
Break engine config into three variables
Avinash-1394 Nov 23, 2023
eb2d92d
Add method to create session from credentials. Supply boto3 config to…
Avinash-1394 Nov 23, 2023
6e16e45
Merge branch 'main' into support-python-submissions
Avinash-1394 Nov 29, 2023
c2efdb2
Merge branch 'main' into support-python-submissions
Avinash-1394 Jan 2, 2024
4dd9f94
feat: enable iceberg table format for athena spark
Dec 30, 2023
200a8fe
fix: empty code submissions
Jan 3, 2024
a24b1fe
Merge branch 'support-python-submissions' into support-python-submiss…
sankeerthnagapuri Jan 3, 2024
89d2d92
fix: remove references to obsolete config - spark_threads
Jan 10, 2024
9de42be
Merge branch 'support-python-submissions' of https://github.com/sanke…
Jan 10, 2024
217d751
fix: precommit
Jan 11, 2024
5368981
fix: 3.9 compatability
Jan 13, 2024
e1e80dc
chore: readme update config for spark iceberg model example
Jan 13, 2024
e3eb149
fix: test config and remove obselete functions' tests
Jan 15, 2024
7c45844
fix: precommit
Jan 15, 2024
e571f6a
fix: unit tests
Jan 17, 2024
c6fcf80
Merge pull request #1 from sankeerthnagapuri/support-python-submissions
Avinash-1394 Jan 17, 2024
65625ab
Merge branch 'main' into support-python-submissions
Avinash-1394 Jan 20, 2024
cb4a999
Comment out python submissions functional test
Avinash-1394 Jan 20, 2024
016f5f5
Merge branch 'main' into support-python-submissions
Avinash-1394 Feb 5, 2024
c256e9e
Merge branch 'main' into support-python-submissions
Avinash-1394 Feb 5, 2024
1 change: 1 addition & 0 deletions .env.example
@@ -7,3 +7,4 @@ DBT_TEST_ATHENA_DATABASE=
DBT_TEST_ATHENA_SCHEMA=
DBT_TEST_ATHENA_WORK_GROUP=
DBT_TEST_ATHENA_AWS_PROFILE_NAME=
DBT_TEST_ATHENA_SPARK_WORK_GROUP=
150 changes: 149 additions & 1 deletion README.md
@@ -63,7 +63,7 @@
- Supports two incremental update strategies: `insert_overwrite` and `append`
- Does **not** support the use of `unique_key`
- Supports [snapshots][snapshots]
- Does not support [Python models][python-models]
- Supports [Python models][python-models]

[seeds]: https://docs.getdbt.com/docs/building-a-dbt-project/seeds

@@ -132,6 +132,7 @@ A dbt profile can be configured to run against AWS Athena using the following co
| aws_profile_name | Profile to use from your AWS shared credentials file. | Optional | `my-profile` |
| work_group | Identifier of Athena workgroup | Optional | `my-custom-workgroup` |
| num_retries | Number of times to retry a failing query | Optional | `3` |
| spark_work_group | Identifier of Athena Spark workgroup | Optional | `my-spark-workgroup` |
| num_boto3_retries | Number of times to retry boto3 requests (e.g. deleting S3 files for materialized tables) | Optional | `5` |
| seed_s3_upload_args | Dictionary containing boto3 ExtraArgs when uploading to S3 | Optional | `{"ACL": "bucket-owner-full-control"}` |
| lf_tags_database | Default LF tags for new database if it's created by dbt | Optional | `tag_key: tag_value` |
@@ -151,8 +152,10 @@ athena:
region_name: eu-west-1
schema: dbt
database: awsdatacatalog
threads: 4
aws_profile_name: my-profile
work_group: my-workgroup
spark_work_group: my-spark-workgroup
seed_s3_upload_args:
  ACL: bucket-owner-full-control
```
@@ -546,6 +549,151 @@ You may find the following links useful to manage that:
* [terraform aws_lakeformation_resource_lf_tags](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lakeformation_resource_lf_tags)
<!-- markdownlint-restore -->

## Python Models

The adapter supports Python models using [Athena Spark](https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark.html).

### Setup

- A Spark-enabled workgroup created in Athena
- A Spark execution role granted access to Athena, Glue, and S3
- The Spark workgroup is added to the `~/.dbt/profiles.yml` file (via `spark_work_group`, see above) and the
  profile is referenced in `dbt_project.yml`

### Spark specific table configuration

- `timeout` (`default=43200`)
  - Timeout in seconds for each Python model execution. Defaults to 12 hours (43200 seconds).
- `spark_encryption` (`default=false`)
  - If set to true, encrypts data in transit between Spark nodes, as well as data at rest stored locally by Spark.
- `spark_cross_account_catalog` (`default=false`)
  - Spark can query a catalog in an external account if the consumer account is configured to access the
    producer catalog.
  - If set to true, `/` can be used as the Glue catalog separator,
    e.g. `999999999999/mydatabase.cloudfront_logs` (where `999999999999` is the external catalog ID).
- `spark_requester_pays` (`default=false`)
  - When an Amazon S3 bucket is configured as requester pays, the account of the user running the query is charged
    for data access and data transfer fees associated with the query.
  - If set to true, requester-pays S3 buckets are enabled in Athena for Spark.

### Spark notes

- A session is created for each unique engine configuration defined in the models that are part of the invocation.
- A session's idle timeout is set to 10 minutes. Within that window, if a new calculation (Spark Python model) is
  ready for execution and its engine configuration matches, the process reuses the same session.
- The number of Python models running at a time depends on `threads`. The number of sessions created for the entire
  run depends on the number of unique engine configurations and on the availability of sessions to maintain thread
  concurrency.
- For Iceberg tables, it is recommended to use the `table_properties` configuration to set `format_version` to 2,
  to maintain compatibility between Iceberg tables created by Trino and those created by Spark (see the sketch
  below).
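
A hedged sketch of that Iceberg recommendation, assuming the `table_properties` model config is passed through to
the created table (the `format_version` key follows the note above, not this PR's diff):

```python
def model(dbt, spark_session):
    dbt.config(
        materialized="table",
        table_type="iceberg",
        # format_version 2 keeps Spark-created Iceberg tables compatible
        # with tables created by Trino (assumption: passed through as-is)
        table_properties={"format_version": "2"},
    )

    data = [(1,), (2,), (3,), (4,)]

    return spark_session.createDataFrame(data, ["A"])
```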

### Example models

#### Simple pandas model

```python
import pandas as pd


def model(dbt, session):
    dbt.config(materialized="table")

    model_df = pd.DataFrame({"A": [1, 2, 3, 4]})

    return model_df
```

#### Simple spark

```python
def model(dbt, spark_session):
    dbt.config(materialized="table")

    data = [(1,), (2,), (3,), (4,)]

    df = spark_session.createDataFrame(data, ["A"])

    return df
```

#### Spark incremental

```python
def model(dbt, spark_session):
    dbt.config(materialized="incremental")
    df = dbt.ref("model")

    if dbt.is_incremental:
        max_from_this = (
            f"select max(run_date) from {dbt.this.schema}.{dbt.this.identifier}"
        )
        df = df.filter(df.run_date >= spark_session.sql(max_from_this).collect()[0][0])

    return df
```

#### Config spark model

```python
def model(dbt, spark_session):
    dbt.config(
        materialized="table",
        engine_config={
            "CoordinatorDpuSize": 1,
            "MaxConcurrentDpus": 3,
            "DefaultExecutorDpuSize": 1,
        },
        spark_encryption=True,
        spark_cross_account_catalog=True,
        spark_requester_pays=True,
        polling_interval=15,
        timeout=120,
    )

    data = [(1,), (2,), (3,), (4,)]

    df = spark_session.createDataFrame(data, ["A"])

    return df
```

#### Create a PySpark UDF using imported external Python files

```python
def model(dbt, spark_session):
    dbt.config(
        materialized="incremental",
        incremental_strategy="merge",
        unique_key="num",
    )
    sc = spark_session.sparkContext
    sc.addPyFile("s3://athena-dbt/test/file1.py")
    sc.addPyFile("s3://athena-dbt/test/file2.py")

    def func(iterator):
        from file2 import transform

        return [transform(i) for i in iterator]

    from pyspark.sql.functions import col, udf

    udf_with_import = udf(func)

    data = [(1, "a"), (2, "b"), (3, "c")]
    cols = ["num", "alpha"]
    df = spark_session.createDataFrame(data, cols)

    return df.withColumn("udf_test_col", udf_with_import(col("alpha")))
```

#### Known issues in python models

- Incremental models do not fully utilize Spark capabilities. They depend partially on existing SQL-based logic that
  runs on Trino.
- Snapshot materializations are not supported.
- Spark can only reference tables within the same catalog.

### Working example

seed file - employent_indicators_november_2022_csv_tables.csv
147 changes: 147 additions & 0 deletions dbt/adapters/athena/config.py
@@ -1,12 +1,159 @@
import importlib.metadata
from functools import lru_cache
from typing import Any, Dict

from botocore import config

from dbt.adapters.athena.constants import (
    DEFAULT_CALCULATION_TIMEOUT,
    DEFAULT_POLLING_INTERVAL,
    DEFAULT_SPARK_COORDINATOR_DPU_SIZE,
    DEFAULT_SPARK_EXECUTOR_DPU_SIZE,
    DEFAULT_SPARK_MAX_CONCURRENT_DPUS,
    DEFAULT_SPARK_PROPERTIES,
    LOGGER,
)


@lru_cache()
def get_boto3_config(num_retries: int) -> config.Config:
    return config.Config(
        user_agent_extra="dbt-athena-community/" + importlib.metadata.version("dbt-athena-community"),
        retries={"max_attempts": num_retries, "mode": "standard"},
    )


class AthenaSparkSessionConfig:
    """
    A helper class to manage Athena Spark Session Configuration.
    """

    def __init__(self, config: Dict[str, Any], **session_kwargs: Any) -> None:
        self.config = config
        self.session_kwargs = session_kwargs

    def set_timeout(self) -> int:
        """
        Get the timeout value.

        This function retrieves the timeout value from the parsed model's configuration. If the timeout value
        is not defined, it falls back to the default timeout value. If the retrieved timeout value is less than
        or equal to 0, a ValueError is raised, as the timeout must be a positive integer.

        Returns:
            int: The timeout value in seconds.

        Raises:
            TypeError: If the timeout value is not an integer.
            ValueError: If the timeout value is not a positive integer.
        """
        timeout = self.config.get("timeout", DEFAULT_CALCULATION_TIMEOUT)
        if not isinstance(timeout, int):
            raise TypeError("Timeout must be an integer")
        if timeout <= 0:
            raise ValueError("Timeout must be a positive integer")
        LOGGER.debug(f"Setting timeout: {timeout}")
        return timeout

    def get_polling_interval(self) -> Any:
        """
        Get the polling interval for the configuration.

        Looks up `polling_interval` in the model configuration first, then in the session kwargs,
        and finally falls back to DEFAULT_POLLING_INTERVAL.

        Returns:
            Any: The polling interval value.
        """
        try:
            return self.config["polling_interval"]
        except KeyError:
            try:
                return self.session_kwargs["polling_interval"]
            except KeyError:
                return DEFAULT_POLLING_INTERVAL

    def set_polling_interval(self) -> float:
        """
        Set the polling interval for the configuration.

        Returns:
            float: The polling interval value.

        Raises:
            ValueError: If the polling interval is not a positive number.
        """
        polling_interval = self.get_polling_interval()
        if not isinstance(polling_interval, (int, float)) or polling_interval <= 0:
            raise ValueError(f"Polling_interval must be a positive number. Got: {polling_interval}")
        LOGGER.debug(f"Setting polling_interval: {polling_interval}")
        return float(polling_interval)

    def set_engine_config(self) -> Dict[str, Any]:
        """Set the engine configuration.

        Returns:
            Dict[str, Any]: The engine configuration.

        Raises:
            TypeError: If the engine configuration is not of type dict.
            KeyError: If the keys of the engine configuration dictionary do not match the expected format.
        """
        table_type = self.config.get("table_type", "hive")
        spark_encryption = self.config.get("spark_encryption", False)
        spark_cross_account_catalog = self.config.get("spark_cross_account_catalog", False)
        spark_requester_pays = self.config.get("spark_requester_pays", False)

        default_spark_properties: Dict[str, str] = dict(
            **DEFAULT_SPARK_PROPERTIES.get(table_type)
            if table_type.lower() in ["iceberg", "hudi", "delta_lake"]
            else {},
            **DEFAULT_SPARK_PROPERTIES.get("spark_encryption") if spark_encryption else {},
            **DEFAULT_SPARK_PROPERTIES.get("spark_cross_account_catalog") if spark_cross_account_catalog else {},
            **DEFAULT_SPARK_PROPERTIES.get("spark_requester_pays") if spark_requester_pays else {},
        )

        default_engine_config = {
            "CoordinatorDpuSize": DEFAULT_SPARK_COORDINATOR_DPU_SIZE,
            "MaxConcurrentDpus": DEFAULT_SPARK_MAX_CONCURRENT_DPUS,
            "DefaultExecutorDpuSize": DEFAULT_SPARK_EXECUTOR_DPU_SIZE,
            "SparkProperties": default_spark_properties,
        }
        engine_config = self.config.get("engine_config", None)

        if engine_config:
            provided_spark_properties = engine_config.get("SparkProperties", None)
            if provided_spark_properties:
                default_spark_properties.update(provided_spark_properties)
                default_engine_config["SparkProperties"] = default_spark_properties
                engine_config.pop("SparkProperties")
            default_engine_config.update(engine_config)
        engine_config = default_engine_config

        if not isinstance(engine_config, dict):
            raise TypeError("Engine configuration has to be of type dict")

        expected_keys = {
            "CoordinatorDpuSize",
            "MaxConcurrentDpus",
            "DefaultExecutorDpuSize",
            "SparkProperties",
            "AdditionalConfigs",
        }

        if set(engine_config.keys()) - expected_keys:
            raise KeyError(
                f"The engine configuration keys provided do not match the expected athena engine keys: {expected_keys}"
            )

        if engine_config["MaxConcurrentDpus"] == 1:
            raise KeyError("The lowest value supported for MaxConcurrentDpus is 2")
        LOGGER.debug(f"Setting engine configuration: {engine_config}")
        return engine_config
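
For orientation, a hedged sketch of how this helper might be exercised (the class, function, and constants come
from the diff above; the configuration values are illustrative only, not from this PR):

```python
from dbt.adapters.athena.config import AthenaSparkSessionConfig, get_boto3_config

# Model-level config as dbt would parse it, plus session kwargs as a fallback.
spark_config = AthenaSparkSessionConfig(
    {"timeout": 600, "engine_config": {"MaxConcurrentDpus": 4}},
    polling_interval=10,
)

spark_config.set_timeout()           # 600 (model config beats DEFAULT_CALCULATION_TIMEOUT)
spark_config.set_polling_interval()  # 10.0 (no model value, so the session kwarg is used)
spark_config.set_engine_config()     # defaults merged with {"MaxConcurrentDpus": 4}

# Cached botocore Config carrying the adapter's user agent and retry policy.
boto3_config = get_boto3_config(num_retries=3)
```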