feature: support GCP cloud storage #17

Status: Open. Wants to merge 4 commits into `main`.

46 changes: 26 additions & 20 deletions README.md
@@ -3,43 +3,49 @@ Export OpenCost data in parquet format

This script was created to export data from opencost in PARQUET format.

It supports exporting the data to S3 and local directory.
It supports exporting the data to S3, Azure Blob Storage, GCP Cloud Storage, and local directory.

# Dependencies
This script depends on boto3, pandas, numpy and python-dateutil.
This script depends on boto3, pandas, numpy, python-dateutil, azure-identity, azure-storage-blob, and google-cloud-storage.

The file requirements.txt has all the dependencies specified.

# Configuration:
The script supports the following environment variables:
* OPENCOST_PARQUET_SVC_HOSTNAME: Hostname of the opencost service. By default it assume the opencost service is on localhost.
* OPENCOST_PARQUET_SVC_PORT: Port of the opencost service, by default it assume it is 9003
* OPENCOST_PARQUET_WINDOW_START: Start window for the export, by default it is None, which results in exporting the data for yesterday. Date needs to be set in RFC3339 format. i.e `2024-05-27T00:00:00Z`.
* OPENCOST_PARQUET_WINDOW_END: End of export window, by default it is None, which results in exporting the data for yesterday. Date needs to be set in RFC3339 format. i.e `2024-05-27T00:00:00Z`.
* OPENCOST_PARQUET_S3_BUCKET: S3 bucket that will be used to store the export. By default this is None, and S3 export is not done. If set to a bucket use s3://bucket-name and make sure there is an AWS Role with access to the s3 bucket attached to the container that is running the export. This also respect the environment variables AWS_PROFILE, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY. see: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
* OPENCOST_PARQUET_FILE_KEY_PREFIX: This is the prefix used for the export, by default it is '/tmp'. The export is going to be saved inside this prefix, in the following structure: year=window_start.year/month=window_start.month/day=window_start.day , ex: tmp/year=2024/month=1/date=15
* OPENCOST_PARQUET_AGGREGATE: This is the dimentions used to aggregate the data. by default we use "namespace,pod,container" which is the same dimensions used for the CSV native export.
* OPENCOST_PARQUET_STEP: This is the Step for the export, by default we use 1h steps, which result in 24 steps in a day and make easier to match the exported data to AWS CUR, since cur also export on hourly base.
* OPENCOST_PARQUET_RESOLUTION: Duration to use as resolution in Prometheus queries. Smaller values (i.e. higher resolutions) will provide better accuracy, but worse performance (i.e. slower query time, higher memory use). Larger values (i.e. lower resolutions) will perform better, but at the expense of lower accuracy for short-running workloads.
* OPENCOST_PARQUET_ACCUMULATE: If `"true"`, sum the entire range of time intervals into a single set. Default value is `"false"`.
* OPENCOST_PARQUET_SVC_HOSTNAME: Hostname of the opencost service. By default, it assumes the opencost service is on localhost.
* OPENCOST_PARQUET_SVC_PORT: Port of the opencost service; by default, 9003 is assumed.
* OPENCOST_PARQUET_WINDOW_START: Start window for the export. By default it is None, which results in exporting the data for yesterday. Date needs to be set in RFC3339 format, e.g., `2024-05-27T00:00:00Z`.
* OPENCOST_PARQUET_WINDOW_END: End of the export window. By default it is None, which results in exporting the data for yesterday. Date needs to be set in RFC3339 format, e.g., `2024-05-27T23:59:59Z`.
* OPENCOST_PARQUET_S3_BUCKET: S3 bucket that will be used to store the export. By default this is None, and S3 export is not done. If set to a bucket, use `s3://bucket-name` and make sure there is an AWS Role with access to the S3 bucket attached to the container running the export. This also respects the environment variables AWS_PROFILE, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY. See: [Boto3 Documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html).
* OPENCOST_PARQUET_FILE_KEY_PREFIX: This is the prefix used for the export. By default it is `/tmp`. The export will be saved inside this prefix in the following structure: `year=window_start.year/month=window_start.month/day=window_start.day`, e.g., `tmp/year=2024/month=1/day=15`.
* OPENCOST_PARQUET_AGGREGATE: Dimensions used to aggregate the data. By default, "namespace,pod,container" is used, which matches the dimensions of the native CSV export.
* OPENCOST_PARQUET_STEP: Step size for the export. By default, we use 1h steps, which results in 24 steps in a day and makes it easier to match the exported data to AWS CUR since CUR also exports on an hourly basis.
* OPENCOST_PARQUET_RESOLUTION: Duration to use as resolution in Prometheus queries. Smaller values (i.e., higher resolutions) will provide better accuracy, but worse performance (i.e., slower query time, higher memory use). Larger values (i.e., lower resolutions) will perform better but at the expense of lower accuracy for short-running workloads.
* OPENCOST_PARQUET_ACCUMULATE: If `"true"`, sum the entire range of time intervals into a single set. Default value is `"false"`.
* OPENCOST_PARQUET_INCLUDE_IDLE: Whether to return the calculated __idle__ field for the query. Default is `"false"`.
* OPENCOST_PARQUET_IDLE_BY_NODE: If `"true"`, idle allocations are created on a per node basis. Which will result in different values when shared and more idle allocations when split. Default is `"false"`.
* OPENCOST_PARQUET_STORAGE_BACKEND: The storage backend to use. Supports `aws`, `azure`. See below for Azure specific variables.
* OPENCOST_PARQUET_JSON_SEPARATOR: The OpenCost API returns nested objects. The used [JSON normalization method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) allows for a custom separator. Use this to specify the separator of your choice.
* OPENCOST_PARQUET_IDLE_BY_NODE: If `"true"`, idle allocations are created on a per-node basis, which will result in different values when shared and more idle allocations when split. Default is `"false"`.
* OPENCOST_PARQUET_STORAGE_BACKEND: The storage backend to use. Supports `aws`, `azure`, `gcp`. See below for Azure and GCP-specific variables.
* OPENCOST_PARQUET_JSON_SEPARATOR: The OpenCost API returns nested objects. The used [JSON normalization method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) allows for a custom separator. Use this to specify the separator of your choice.

## Azure Specific Environment Variables
* OPENCOST_PARQUET_AZURE_STORAGE_ACCOUNT_NAME: Name of the Azure Storage Account you want to export the data to.
* OPENCOST_PARQUET_AZURE_CONTAINER_NAME: The container within the storage account you want to save the data to. The service principal requires write permissions on the container
* OPENCOST_PARQUET_AZURE_TENANT: You Azure Tenant ID
* OPENCOST_PARQUET_AZURE_APPLICATION_ID: ClientID of the Service Principal
* OPENCOST_PARQUET_AZURE_APPLICATION_SECRET: Secret of the Service Principal
* OPENCOST_PARQUET_AZURE_CONTAINER_NAME: The container within the storage account you want to save the data to. The service principal requires write permissions on the container.
* OPENCOST_PARQUET_AZURE_TENANT: Your Azure Tenant ID.
* OPENCOST_PARQUET_AZURE_APPLICATION_ID: Client ID of the Service Principal.
* OPENCOST_PARQUET_AZURE_APPLICATION_SECRET: Secret of the Service Principal.

## GCP Specific Environment Variables
* OPENCOST_PARQUET_GCP_BUCKET_NAME: Name of the GCP bucket you want to export the data to.
* OPENCOST_PARQUET_GCP_CREDENTIALS_JSON: JSON-formatted string of your GCP credentials (optional, uses `GOOGLE_APPLICATION_CREDENTIALS` if not set).
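
For reference, here is a minimal sketch of driving a GCP export from Python using the variables above. It assumes the script is run from the `src/` directory; the bucket name, hostname, and window values are placeholders.

```python
import os

from opencost_parquet_exporter import get_config

# Select the GCP backend and point it at a bucket (placeholder name).
os.environ["OPENCOST_PARQUET_STORAGE_BACKEND"] = "gcp"
os.environ["OPENCOST_PARQUET_GCP_BUCKET_NAME"] = "my-opencost-exports"
# Optional inline key; if omitted, GOOGLE_APPLICATION_CREDENTIALS or
# Workload Identity is used instead.
# os.environ["OPENCOST_PARQUET_GCP_CREDENTIALS_JSON"] = '{"type": "service_account"}'
os.environ["OPENCOST_PARQUET_SVC_HOSTNAME"] = "opencost.opencost.svc"  # placeholder
os.environ["OPENCOST_PARQUET_WINDOW_START"] = "2024-05-27T00:00:00Z"
os.environ["OPENCOST_PARQUET_WINDOW_END"] = "2024-05-27T23:59:59Z"

config = get_config()  # reads the variables set above
print(config["storage_backend"], config["gcp_bucket_name"])
```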

# Prerequisites
## AWS IAM

## Azure RBAC
The current implementation allows for authentication via [Service Principals](https://learn.microsoft.com/en-us/entra/identity-platform/app-objects-and-service-principals?tabs=browser) on the Azure Storage Account. Therefore, to use the Azure storage backend you need an existing service principal with according role assignments. Azure RBAC has built-in roles for Storage Account Blob Storage operations. The [Storage-Blob-Data-Contributor](https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles/storage#storage-blob-data-contributor) allows to write data to a Azure Storage Account container. A less permissivie custom role can be built and is encouraged!
The current implementation allows for authentication via [Service Principals](https://learn.microsoft.com/en-us/azure/active-directory/develop/app-objects-and-service-principals) on the Azure Storage Account. Therefore, to use the Azure storage backend, you need an existing service principal with the appropriate role assignments. Azure RBAC has built-in roles for Storage Account Blob Storage operations. The [Storage Blob Data Contributor](https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles/storage#storage-blob-data-contributor) allows writing data to an Azure Storage Account container. A less permissive custom role can be built and is encouraged!

## GCP IAM
The current implementation allows for authentication using service account keys or Workload Identity. Ensure that the service account has the `Storage Object Creator` role or equivalent permissions to write data to the GCP bucket.
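
As a quick sanity check (not part of the exporter), the hedged snippet below verifies that the identity picked up by Application Default Credentials can create objects in the target bucket; `my-opencost-exports` is a placeholder bucket name.

```python
from google.cloud import storage

# Uses Application Default Credentials: a key file referenced by
# GOOGLE_APPLICATION_CREDENTIALS, or Workload Identity when running in-cluster.
client = storage.Client()
blob = client.bucket("my-opencost-exports").blob("opencost-write-check.txt")
blob.upload_from_string("ok")  # needs roles/storage.objectCreator or broader
print("write check succeeded for", blob.name)
```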

# Usage:

12 changes: 1 addition & 11 deletions requirements-dev.txt
@@ -1,14 +1,4 @@
numpy==1.26.3
pandas==2.2.3
boto3==1.35.16
requests==2.32.0
python-dateutil==2.8.2
pytz==2023.3.post1
six==1.16.0
tzdata==2023.4
pyarrow==14.0.1
azure-storage-blob==12.19.1
azure-identity==1.16.1
-r requirements.txt
# The dependencies below are only used for development.
freezegun==1.4.0
pylint==3.0.3
7 changes: 7 additions & 0 deletions requirements.txt
@@ -9,3 +9,10 @@ tzdata==2023.4
pyarrow==14.0.1
azure-storage-blob==12.19.1
azure-identity==1.16.1
google-api-core==2.19.2
google-auth==2.34.0
google-cloud-core==2.4.1
google-cloud-storage==2.18.2
google-crc32c==1.6.0
google-resumable-media==2.7.2
googleapis-common-protos==1.65.0
6 changes: 6 additions & 0 deletions src/opencost_parquet_exporter.py
@@ -145,6 +145,12 @@ def get_config(
            'azure_application_id': os.environ.get('OPENCOST_PARQUET_AZURE_APPLICATION_ID'),
            'azure_application_secret': os.environ.get('OPENCOST_PARQUET_AZURE_APPLICATION_SECRET'),
        })
    if config['storage_backend'] == 'gcp':
        config.update({
            # pylint: disable=C0301
            'gcp_bucket_name': os.environ.get('OPENCOST_PARQUET_GCP_BUCKET_NAME'),
            'gcp_credentials': json.loads(os.environ.get('OPENCOST_PARQUET_GCP_CREDENTIALS_JSON', '{}')),
        })

    # If window is not specified assume we want yesterday data.
    if window_start is None or window_end is None:
89 changes: 89 additions & 0 deletions src/storage/gcp_storage.py
@@ -0,0 +1,89 @@
"""
This module provides an implementation of the BaseStorage class for Google Cloud Storage.
"""

from io import BytesIO
import logging
from google.cloud import storage
from google.oauth2 import service_account
from google.api_core import exceptions as gcp_exceptions
import pandas as pd
from .base_storage import BaseStorage

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


# pylint: disable=R0903
class GCPStorage(BaseStorage):
"""
A class to handle data storage in Google Cloud Storage.
"""

def _get_client(self, config) -> storage.Client:
"""
Returns a Google Cloud Storage client using credentials provided in the config.

Parameters:
config (dict): Configuration dictionary that may contain 'gcp_credentials'
for service account keys and other authentication-related keys.

Returns:
storage.Client: An authenticated Google Cloud Storage client.
"""
        if config.get('gcp_credentials'):  # an empty dict (env var unset) falls back to default credentials
            credentials_info = config['gcp_credentials']
            credentials = service_account.Credentials.from_service_account_info(
                credentials_info)
            client = storage.Client(credentials=credentials)
        else:
            # Use default credentials
            client = storage.Client()

        return client

    def save_data(self, data: pd.core.frame.DataFrame, config) -> str | None:
        """
        Saves a DataFrame to Google Cloud Storage.

        Parameters:
        data (pd.core.frame.DataFrame): The DataFrame to be saved.
        config (dict): Configuration dictionary containing necessary information for storage.
                       Expected keys include 'gcp_bucket_name',
                       'file_key_prefix', and 'window_start'.

        Returns:
        str | None: The URL of the saved object if successful, None otherwise.
        """
        client = self._get_client(config)

        file_name = 'k8s_opencost.parquet'
        window = pd.to_datetime(config['window_start'])
        blob_prefix = f"{config['file_key_prefix']}/{window.year}/{window.month}/{window.day}"
        bucket_name = config['gcp_bucket_name']
        blob_name = f"{blob_prefix}/{file_name}"

        bucket = client.bucket(bucket_name)
        blob = bucket.blob(blob_name)
        parquet_file = BytesIO()
        data.to_parquet(parquet_file, engine='pyarrow', index=False)
        parquet_file.seek(0)

        try:
            blob.upload_from_file(
                parquet_file, content_type='application/octet-stream')
            return blob.public_url
        except gcp_exceptions.BadRequest as e:
            logger.error("Bad Request Error: %s", e)
        except gcp_exceptions.Forbidden as e:
            logger.error("Forbidden Error: %s", e)
        except gcp_exceptions.NotFound as e:
            logger.error("Not Found Error: %s", e)
        except gcp_exceptions.TooManyRequests as e:
            logger.error("Too Many Requests Error: %s", e)
        except gcp_exceptions.InternalServerError as e:
            logger.error("Internal Server Error: %s", e)
        except gcp_exceptions.GoogleAPIError as e:
            logger.error("Google API Error: %s", e)

        return None
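
For context, a hypothetical call site for the class above; the bucket name and DataFrame contents are placeholders, and the config keys mirror the ones produced by `get_config()` for the `gcp` backend. Run it from the `src/` directory so the `storage` package resolves.

```python
import pandas as pd

from storage.gcp_storage import GCPStorage

config = {
    "gcp_bucket_name": "my-opencost-exports",   # placeholder bucket
    "file_key_prefix": "opencost",
    "window_start": "2024-05-27T00:00:00Z",
    # omit "gcp_credentials" to fall back to Application Default Credentials
}
df = pd.DataFrame([{"namespace": "default", "pod": "web-0", "totalCost": 0.42}])

url = GCPStorage().save_data(df, config)
print(url if url else "upload failed, see the logged GCP error")
```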
9 changes: 6 additions & 3 deletions src/storage_factory.py
@@ -5,17 +5,18 @@

from storage.aws_s3_storage import S3Storage
from storage.azure_storage import AzureStorage
from storage.gcp_storage import GCPStorage # New import


def get_storage(storage_backend):
"""
Factory function to create and return a storage object based on the given backend.

This function abstracts the creation of storage objectss. It supports 'azure' for
Azure Storage and 's3' for AWS S3 Storage.
This function abstracts the creation of storage objects. It supports 'azure' for
Azure Storage, 's3' for AWS S3 Storage, and 'gcp' for Google Cloud Storage.

Parameters:
storage_backend (str): The name of the storage backend. SUpported:'azure','s3'.
storage_backend (str): The name of the storage backend. Supported: 'azure', 's3', 'gcp'.

Returns:
An instance of the specified storage backend class.
Expand All @@ -27,5 +28,7 @@ def get_storage(storage_backend):
return AzureStorage()
if storage_backend in ['s3', 'aws']:
return S3Storage()
if storage_backend == 'gcp':
return GCPStorage()

raise ValueError("Unsupported storage backend")
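
A small usage sketch of the factory as it is expected to be called; it assumes imports resolve as they do under `src/`.

```python
from storage_factory import get_storage

storage = get_storage("gcp")  # returns a GCPStorage instance
# storage.save_data(dataframe, config) then performs the parquet upload;
# an unknown value, e.g. get_storage("gcs"), raises ValueError.
```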
30 changes: 30 additions & 0 deletions src/test_opencost_parquet_exporter.py
@@ -69,6 +69,36 @@ def test_get_azure_config_with_env_vars(self):
        self.assertEqual(config['params'][1][1], 'true')
        self.assertEqual(config['params'][2][1], 'true')

    def test_get_gcp_config_with_env_vars(self):
        """Test get_config returns the correct GCP configuration based on environment variables."""
        with patch.dict(os.environ, {
                'OPENCOST_PARQUET_SVC_HOSTNAME': 'testhost',
                'OPENCOST_PARQUET_SVC_PORT': '8080',
                'OPENCOST_PARQUET_WINDOW_START': '2020-01-01T00:00:00Z',
                'OPENCOST_PARQUET_WINDOW_END': '2020-01-01T23:59:59Z',
                'OPENCOST_PARQUET_S3_BUCKET': 's3://test-bucket',
                'OPENCOST_PARQUET_FILE_KEY_PREFIX': 'test-prefix/',
                'OPENCOST_PARQUET_AGGREGATE': 'namespace',
                'OPENCOST_PARQUET_STEP': '1m',
                'OPENCOST_PARQUET_STORAGE_BACKEND': 'gcp',
                'OPENCOST_PARQUET_GCP_BUCKET_NAME': 'testbucket',
                'OPENCOST_PARQUET_GCP_CREDENTIALS_JSON': '{"type": "service_account"}',
                'OPENCOST_PARQUET_IDLE_BY_NODE': 'true',
                'OPENCOST_PARQUET_INCLUDE_IDLE': 'true'}, clear=True):
            config = get_config()

        self.assertEqual(
            config['url'], 'http://testhost:8080/allocation/compute')
        self.assertEqual(config['params'][0][1],
                         '2020-01-01T00:00:00Z,2020-01-01T23:59:59Z')
        self.assertEqual(config['storage_backend'], 'gcp')
        self.assertEqual(
            config['gcp_bucket_name'], 'testbucket')
        self.assertEqual(config['gcp_credentials'], {
            'type': 'service_account'})
        self.assertEqual(config['params'][1][1], 'true')
        self.assertEqual(config['params'][2][1], 'true')

    @freeze_time("2024-01-31")
    def test_get_config_defaults_last_day_of_month(self):
        """Test get_config returns correct defaults when no env vars are set."""