Skip to content

Terraform module for terraform-gcp-tamr-config

License

Notifications You must be signed in to change notification settings

Datatamer/terraform-gcp-tamr-config

Repository files navigation

Terraform Module Template

This terraform module automates populating some Tamr config variables that are generated as outputs from other GCP scale-out modules.

Examples

Minimal

Smallest complete fully working example. This example might require extra resources to run the example.

Resources Created

This modules creates:

  • A template_file data source which renders the contents of a populated Tamr config.

Requirements

Name Version
terraform >= 1.0.0
google >= 4.6.0

Providers

No provider.

Inputs

Name Description Type Default Required
tamr_bigtable_cluster_id Bigtable cluster ID string n/a yes
tamr_bigtable_instance_id Bigtable instance ID string n/a yes
tamr_bigtable_max_nodes Max number of nodes to scale up to string n/a yes
tamr_bigtable_min_nodes Min number of nodes to scale down to string n/a yes
tamr_cloud_sql_location location for cloud sql instance. NOTE: this is either a region or a zone. string n/a yes
tamr_cloud_sql_name name of cloud sql instance string n/a yes
tamr_dataproc_bucket GCS bucket to use for the tamr dataproc cluster string n/a yes
tamr_dataproc_region Region the dataproc uses. string n/a yes
tamr_filesystem_bucket GCS bucket to use for the tamr default file system string n/a yes
tamr_instance_internal_ip internal ip of tamr vm string n/a yes
tamr_instance_project The project to launch the tamr VM instance in. string n/a yes
tamr_instance_service_account email of service account to attach to the tamr instance string n/a yes
tamr_instance_subnet subnetwork to attach instance too string n/a yes
tamr_instance_zone zone to deploy tamr vm string n/a yes
tamr_sql_password password for the cloud sql user string n/a yes
dataproc_network_tags list of network tags to attach to the dataproc nodes list(string) [] no
tamr_bigtable_project_id The google project that the bigtable instance lives in. If not set will use the tamr_instance_project as the default value. string "" no
tamr_cloud_sql_project project containing cloudsql instance. If not set will use the tamr_instance_project as the default value. string "" no
tamr_dataproc_cluster_config If you do not want to use the default dataproc configuration template, pass in a complete dataproc configuration file to variable.
If you are passing in a dataproc configure it should not be left padded, we will handle that inside of our template. It is expected to
a yaml document of a dataproc cluster config
Refrence spec is https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig
string "" no
tamr_dataproc_cluster_enable_stackdriver_logging Enabled stackdriver logging on dataproc clusters. This only used if using the built in tamr_dataproc_cluster_config configuration bool true no
tamr_dataproc_cluster_master_disk_size Size of disk to use on dataproc master disk This only used if using the built in tamr_dataproc_cluster_config configuration number 1000 no
tamr_dataproc_cluster_master_instance_type Instance type to use as dataproc master This only used if using the built in tamr_dataproc_cluster_config configuration string "n1-highmem-4" no
tamr_dataproc_cluster_service_account Service account to attach to dataproc workers. If not set will use the tamr_instance_service_account as the default value. This only used if using the built in tamr_dataproc_cluster_config configuration string "" no
tamr_dataproc_cluster_subnetwork_uri Subnetwork URI for dataproc to use. If not set will use the tamr_instance_subnet as the default value. This only used if using the built in tamr_dataproc_cluster_config configuration string "" no
tamr_dataproc_cluster_version Version of dataproc to use. This only used if using the built in tamr_dataproc_cluster_config configuration string "2.0" no
tamr_dataproc_cluster_worker_machine_type machine type of default worker pool. This only used if using the built in tamr_dataproc_cluster_config configuration string "n1-standard-16" no
tamr_dataproc_cluster_worker_num_instances Number of default workers to use. This only used if using the built in tamr_dataproc_cluster_config configuration number 4 no
tamr_dataproc_cluster_worker_num_local_ssds Number of localssds to attach to each worker node. This only used if using the built in tamr_dataproc_cluster_config configuration number 2 no
tamr_dataproc_cluster_worker_preemptible_machine_type machine type of preemptible worker pool. This only used if using the built in tamr_dataproc_cluster_config configuration string "n1-standard-16" no
tamr_dataproc_cluster_worker_preemptible_num_instances Number of preemptible workers to use. This only used if using the built in tamr_dataproc_cluster_config configuration number 0 no
tamr_dataproc_cluster_worker_preemptible_num_local_ssds Number of localssds to attach to each preemptible worker node. This only used if using the built in tamr_dataproc_cluster_config configuration number 2 no
tamr_dataproc_cluster_zone Zone to launch dataproc cluster into. If not set will use the tamr_instance_zone as the default value. This only used if using the built in tamr_dataproc_cluster_config configuration string "" no
tamr_dataproc_image_version Dataproc image versionmage string "2.0" no
tamr_dataproc_project_id Project for the dataproc cluster. If not set will use the tamr_instance_project as the default value. string "" no
tamr_es_apihost The hostname and port of the REST API endpoint of the Elasticsearch cluster to use. If unset will use < ip of vm>:9200 string "" no
tamr_es_enabled Whether Tamr will index user data in Elasticsearch or not. Elasticsearch is used to power Tamr's interactive data UI, so when this is set to false Tamr will run 'headless,' that is, without its core UI capabilities. It can be useful to disable Elasticsearch in production settings where the models are trained on a separate instance and the goal is to maximize pipeline throughput. bool true no
tamr_es_number_of_shards The number of shards to set when creating the Tamr index in Elasticsearch. Default value is the number of cores on the local host machine, so this should be overridden when using a remote Elasticsearch cluster. Note: this value is only applied when the index is created. number 1 no
tamr_es_password Password to use to authenticate to Elasticsearch, using basic authentication. Not required unless the Elasticsearch cluster you're using has security and authentication enabled. The value passed in may be encrypted. string null no
tamr_es_socket_timeout Defines the socket timeout for Elasticsearch clients, in milliseconds. This is the timeout for waiting for data or, put differently, a maximum period of inactivity between two consecutive data packets. A timeout value of zero is interpreted as an infinite timeout. A negative value is interpreted as undefined (system default). The default value is 900000, i.e., fifteen minutes. number 900000 no
tamr_es_ssl_enabled Whether to connect to Elasticsearch over https or not. Default is false (http). bool false no
tamr_es_user Username to use to authenticate to Elasticsearch. Not required unless the Elasticsearch cluster you're using has security and authentication enabled. string "" no
tamr_hbase_namespace HBase namespace to user, for bigtable this will be the table prefix. string "ns0" no
tamr_json_logging Toggle json formatting for tamr logs. bool false no
tamr_license_key Set a tamr license key string "" no
tamr_spark_driver_memory Amount of memory spark should allocate to spark driver string "12G" no
tamr_spark_executor_cores Amount of cores spark should allocate to each spark executor number 5 no
tamr_spark_executor_instances number of spark executor instances number 12 no
tamr_spark_executor_memory Amount of memory spark should allocate to each spark executor string "13G" no
tamr_spark_properties_override json blob of spark properties to override, if not set will use a default set of properties that should work for most use cases string "" no
tamr_sql_user username for the cloud sql user string "tamr" no

Outputs

Name Description
tamr_config_file full tamr config file
tmpl_dataproc_config dataproc config

References

This repo is based on:

Development

Generating Docs

Run make terraform/docs to generate the section of docs around terraform inputs, outputs and requirements.

Checkstyles

Run make lint, this will run terraform fmt, in addition to a few other checks to detect whitespace issues. NOTE: this requires having docker working on the machine running the test

Releasing new versions

  • Update version contained in VERSION
  • Document changes in CHANGELOG.md
  • Create a tag in github for the commit associated with the version

License

Apache 2 Licensed. See LICENSE for full details.