
Add support for Spark calculations #493

Closed
Avinash-1394 opened this issue Nov 23, 2023 · 8 comments · Fixed by #497
Comments

@Avinash-1394

Description

Add support for running Spark calculations via a cursor.

Related docs

  1. Implementation in awswrangler - https://github.com/aws/aws-sdk-pandas/blob/main/awswrangler/athena/_spark.py
  2. Official AWS docs - https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark.html
  3. PR in dbt-athena-community - feat: support python submissions (dbt-labs/dbt-athena#248)


I am currently working on adding support for running Python models using the dbt-athena-community adapter, and it would be much easier to accomplish if the pyathena library supported this first. I don't think mock_athena supports these yet, so testing is actually much more difficult than I thought.

@Avinash-1394
Author

Avinash-1394 commented Nov 24, 2023

> If you pass a flag that uses Spark in some way, I think it would be possible to implement the query to be executed by Spark.

The way dbt has handled that is by treating the query as generic compiled_code and using the file extension to route it to either the query engine or the Spark engine. I have just started looking into this library, but what would you say is the entry point that receives the query? Is it the Cursor class?

> Refactoring of the cursor class may be necessary, though.

Definitely. It seems like we need to add calls to the new Spark API endpoints to the BaseCursor and create something similar to AthenaQueryExecution, like an AthenaSparkExecution. I don't think that will be enough, but those seem like the first steps.
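To make the idea concrete, a result object along the lines of AthenaQueryExecution might wrap the response of Athena's GetCalculationExecution API. This is a hypothetical sketch, not PyAthena's actual API; the class and attribute names here are illustrative only:

```python
# Hypothetical sketch of an "AthenaCalculationExecution" result object,
# modeled on AthenaQueryExecution and wrapping the response shape of
# Athena's GetCalculationExecution API. Names are assumptions.


class AthenaCalculationExecution:
    STATE_COMPLETED = "COMPLETED"
    STATE_FAILED = "FAILED"
    STATE_CANCELED = "CANCELED"

    def __init__(self, response: dict) -> None:
        self.calculation_execution_id = response.get("CalculationExecutionId")
        status = response.get("Status", {})
        self.state = status.get("State")
        self.state_change_reason = status.get("StateChangeReason")

    @property
    def is_terminal(self) -> bool:
        # COMPLETED, FAILED, and CANCELED are the terminal calculation states
        return self.state in (
            self.STATE_COMPLETED,
            self.STATE_FAILED,
            self.STATE_CANCELED,
        )
```

A cursor executing a calculation would then hold one of these instead of an AthenaQueryExecution.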

@laughingman7743
Owner

laughingman7743 commented Jan 5, 2024

I am trying to implement a cursor class that executes Spark calculations in the following branch.
#497

It looks like the PySpark code can be executed as follows.

import textwrap

from pyathena import connect

# CalcCursor is the calculation cursor class implemented in the #497 branch
# (it was later released as SparkCursor)
conn = connect(work_group="spark-primary", cursor_class=CalcCursor)
with conn.cursor() as cursor:
    cursor.execute(
        textwrap.dedent(
            """
            spark.sql("create database if not exists spark_demo_database")
            """
        )
    )

Since it would be difficult to add features to a regular cursor, I have implemented a different cursor class. If you have any ideas, please feel free to suggest them.
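Presumably such a cursor polls the calculation's state until it reaches a terminal state, much as the regular cursor polls query execution. A minimal, hypothetical sketch of that loop, where the fetch function stands in for a call to Athena's GetCalculationExecution API:

```python
import time
from typing import Callable

# Terminal states of an Athena Spark calculation
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELED"}


def wait_for_calculation(
    fetch_state: Callable[[], str], interval: float = 0.5
) -> str:
    """Poll fetch_state() until it returns a terminal state.

    fetch_state is a stand-in for calling GetCalculationExecution
    and reading Status.State from the response.
    """
    while True:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        time.sleep(interval)
```

The real implementation would also need to surface StateChangeReason on failure and honor cancellation, but the core loop is this simple.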

@Avinash-1394
Author

Avinash-1394 commented Jan 6, 2024

@laughingman7743 Thank you so much for that. I have reviewed the PR.

There are a couple of additional models you can test to check whether they cause issues.

Pandas dataframe

import pandas as pd
return pd.DataFrame({"A": [1, 2, 3, 4]})

Spark dataframe

return spark.createDataFrame(data, ["A"])

I think you can also import pyspark and return a PySpark DataFrame, but I haven't tested that one out.
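For reference, the pandas model body above can be sanity-checked locally before submitting it to Athena. The function wrapper below is only illustrative of how a model body returning a DataFrame would be invoked (dbt's actual wrapper differs):

```python
import pandas as pd


def model():
    # The pandas model body from the comment above
    return pd.DataFrame({"A": [1, 2, 3, 4]})


df = model()
print(len(df), df["A"].sum())  # 4 10
```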

@laughingman7743
Owner

The code for the Athena Example notebook is as follows:

Spark Dataframes:

file_name = "s3://athena-examples-us-east-1/notebooks/yellow_tripdata_2016-01.parquet"

taxi_df = (spark.read.format("parquet")
     .option("header", "true")
     .option("inferSchema", "true")
     .load(file_name))

print("Read parquet file" + " complete")

taxi1_df = taxi_df.groupBy("VendorID", "passenger_count").count()
taxi1_df.show()

var1 = taxi1_df.collect()
%table var1

taxi1_df.coalesce(1).write.mode('overwrite').csv("s3://aws-athena-query-results-****-us-west-2-hl3rhzkk/select_taxi")
print("Write to s3 " + "complete")

Spark SQL:

spark.sql("create database if not exists spark_demo_database")
spark.sql("show databases").show()

spark.sql("use spark_demo_database")
taxi1_df.write.mode("overwrite").format("parquet").option("path","s3://aws-athena-query-results-****-us-west-2-hl3rhzkk/select_taxi").saveAsTable("select_taxi_table")
print("Create new table" + " complete")

spark.sql("show tables").show()

spark.sql("select * from select_taxi_table").show()

spark.sql("DROP TABLE if exists select_taxi_table")
spark.sql("DROP DATABASE if exists spark_demo_database")
print("Clean resources" + " complete")

@laughingman7743
Owner

The DataFrame would be a Spark DataFrame, not pandas. I am not sure of the use case for returning values; you will probably be running code that writes data out to S3.

@Avinash-1394
Author

That was mainly to check whether the import causes any issues, but I think we can skip that feedback 👍🏽

laughingman7743 added a commit that referenced this issue Jan 9, 2024
Implement SparkCursor to support Spark calculations (fix #493)
@laughingman7743
Owner

I have just released v3.1.0. 🎉
https://pypi.org/project/PyAthena/3.1.0/
https://github.com/laughingman7743/PyAthena/releases/tag/v3.1.0
