
Commit

Merge branch 'current' into serp
mirnawong1 authored Feb 3, 2025
2 parents f443914 + a6f3d6c commit 032cd13
Showing 2 changed files with 33 additions and 6 deletions.
37 changes: 32 additions & 5 deletions website/docs/docs/build/python-models.md
@@ -816,13 +816,40 @@
storage.objects.create
storage.objects.delete
```

**Installing packages:** If you are using a Dataproc Cluster (as opposed to Dataproc Serverless), you can add third-party packages while creating the cluster.
**Installing packages:**

Google recommends installing Python packages on Dataproc clusters via initialization actions:
- [How initialization actions are used](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/README.md#how-initialization-actions-are-used)
- [Actions for installing via `pip` or `conda`](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/python)
Installation of third-party packages on Dataproc varies depending on whether it's a [cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) or [serverless](https://cloud.google.com/dataproc-serverless/docs).

You can also install packages at cluster creation time by [defining cluster properties](https://cloud.google.com/dataproc/docs/tutorials/python-configuration#image_version_20): `dataproc:pip.packages` or `dataproc:conda.packages`.
- **Dataproc Cluster** — Google recommends installing Python packages while creating the cluster via initialization actions:
  - [How initialization actions are used](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/README.md#how-initialization-actions-are-used)
  - [Actions for installing via `pip` or `conda`](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/python)

  You can also install packages at cluster creation time by [defining cluster properties](https://cloud.google.com/dataproc/docs/tutorials/python-configuration#image_version_20): `dataproc:pip.packages` or `dataproc:conda.packages`. An example `gcloud` command follows this section.

- **Dataproc Serverless** — Google recommends using a [custom Docker image](https://cloud.google.com/dataproc-serverless/docs/guides/custom-containers) to install third-party packages. The image needs to be hosted in [Google Artifact Registry](https://cloud.google.com/artifact-registry/docs). It can then be used by providing the image path in dbt profiles:

<File name='profiles.yml'>
```yml
my-profile:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: abc-123
      dataset: my_dataset
      # for dbt Python models to be run on Dataproc Serverless
      gcs_bucket: dbt-python
      dataproc_region: us-central1
      submission_method: serverless
      dataproc_batch:
        runtime_config:
          container_image: {HOSTNAME}/{PROJECT_ID}/{IMAGE}:{TAG}
```

</File>
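Before the profile above can reference a custom image, the image has to be built and pushed to Artifact Registry. A minimal sketch, assuming a hypothetical Artifact Registry repository `dbt-images` and image name `dbt-dataproc` in the `abc-123` project from the profile:

```sh
# Build the custom image; the Dockerfile is where the third-party
# Python packages get installed.
docker build -t us-central1-docker.pkg.dev/abc-123/dbt-images/dbt-dataproc:latest .

# Authenticate Docker against Artifact Registry, then push.
gcloud auth configure-docker us-central1-docker.pkg.dev
docker push us-central1-docker.pkg.dev/abc-123/dbt-images/dbt-dataproc:latest
```

The pushed path is what replaces the `{HOSTNAME}/{PROJECT_ID}/{IMAGE}:{TAG}` placeholder in `container_image`.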

<Lightbox src="/img/docs/building-a-dbt-project/building-models/python-models/dataproc-pip-packages.png" title="Adding packages to install via pip at cluster startup"/>
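For the cluster option above, both installation approaches can be combined in a single cluster-creation command. This is a hedged sketch, not taken from the original docs: the cluster name, region, and package pins are illustrative.

```sh
# Create a Dataproc cluster with Python packages preinstalled.
# The pip-install initialization action reads a space-separated
# PIP_PACKAGES list from instance metadata, while the
# dataproc:pip.packages property takes comma-separated versioned pins.
gcloud dataproc clusters create dbt-python-cluster \
  --region=us-central1 \
  --image-version=2.1 \
  --initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh \
  --metadata='PIP_PACKAGES=pandas scikit-learn' \
  --properties='dataproc:pip.packages=holidays==0.28'
```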

2 changes: 1 addition & 1 deletion website/snippets/_cloud-environments-info.md
@@ -49,7 +49,7 @@
For more info, check out this [FAQ page on this topic](/faqs/Environments/custom
### Extended attributes

:::note
Extended attributes are are currently _not_ supported for SSH tunneling
Extended attributes are currently _not_ supported for SSH tunneling
:::

Extended attributes let users set a flexible [profiles.yml](/docs/core/connect-data-platform/profiles.yml) snippet in their dbt Cloud environment settings. The feature gives users more control over environments (both deployment and development) and extends how dbt Cloud connects to the data platform within a given environment.
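For example, a minimal extended attributes snippet (values are hypothetical) is plain profile YAML, where each key overrides the matching connection field for that environment:

```yml
# Hypothetical overrides: keys mirror profiles.yml connection fields.
dataset: reporting_eu
threads: 8
```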
