Overview
This project report outlines the successful execution and setup of the Open Cap Stack "Lake House," integrating key technologies such as Apache Spark, Delta Lake, PostgreSQL, MinIO object storage, and Apache Airflow for orchestrating workflows. The lake house was designed to handle large-scale data storage, querying, and processing efficiently. Each step in the setup process was thoroughly tested and verified to ensure stability and functionality. Below is a detailed breakdown of each task accomplished, with step-by-step instructions, tools used, and versions tested.
1. Apache Spark Installation and Delta Lake Setup
Objective: Install Apache Spark and Delta Lake for scalable data processing and storage.
Step 1: Install Apache Spark.
wget https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
tar xvf spark-3.3.0-bin-hadoop3.tgz
sudo mv spark-3.3.0-bin-hadoop3 /usr/local/spark-3.3.0
Step 2: Set up environment variables for Apache Spark.
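The exact variables are not reproduced in the report; a minimal sketch, assuming the install path from Step 1 and a bash shell (appended to ~/.bashrc or an equivalent profile):
export SPARK_HOME=/usr/local/spark-3.3.0
export PATH=$SPARK_HOME/bin:$PATH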
Step 3: Verify Spark installation.
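With the PATH updated, one quick check is to print the version banner:
spark-submit --version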
Step 4: Install Delta Lake for transactional storage in Spark.
Start the Spark shell with Delta Lake integration:
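The report does not state the Delta Lake version; delta-core 2.1.x is one release that pairs with Spark 3.3, so the launch command would look roughly like this:
/usr/local/spark-3.3.0/bin/spark-shell \
  --packages io.delta:delta-core_2.12:2.1.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"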
Step 5: Test Delta Lake setup by writing data.
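Inside the shell, a minimal write test in Scala (the /tmp/delta-table path is only an example):
val data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")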
Step 6: Verify Delta Table contents.
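Reading the same path back confirms the Delta table contents:
val df = spark.read.format("delta").load("/tmp/delta-table")
df.show()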
2. PostgreSQL Metadata Database Setup
Objective: Set up PostgreSQL to manage the metadata for the lake house.
Step 1: Log in as the postgres user.
Step 2: Create a new database user and grant privileges.
GRANT ALL PRIVILEGES ON DATABASE lakehouse_metadata TO lakehouse_user;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO lakehouse_user;
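The report reproduces only the GRANT statements for this step; a minimal sketch of the login and user creation, assuming a local install with peer authentication (the password is a placeholder):
sudo -u postgres psql
-- inside psql, before running the GRANT statements above:
CREATE USER lakehouse_user WITH PASSWORD 'change_me';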
Step 3: Create the metadata database and tables.
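The database name lakehouse_metadata comes from the GRANT statements above; a sketch of this step (making lakehouse_user the owner is an assumption):
CREATE DATABASE lakehouse_metadata OWNER lakehouse_user;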
Step 4: Create the necessary tables to manage datasets, schema, and ingestion logs.
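The table definitions themselves are not included in the report; the sketch below covers only the datasets table, whose columns are implied by the sample INSERT in Step 6 (the id and created_at columns are assumptions, and the schema and ingestion-log tables would follow the same pattern):
-- connect to the metadata database first: \c lakehouse_metadata
CREATE TABLE datasets (
    id SERIAL PRIMARY KEY,                           -- assumed surrogate key
    dataset_name VARCHAR(255) NOT NULL,              -- used by the sample INSERT
    description TEXT,                                -- used by the sample INSERT
    storage_location VARCHAR(255),                   -- used by the sample INSERT
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP   -- assumed audit column
);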
Step 5: Verify user privileges.
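One way to check the grants from psql, using standard meta-commands rather than anything quoted in the report:
\l lakehouse_metadata   -- shows the database's access privileges
\du lakehouse_user      -- shows the role and its attributes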
Step 6: Insert sample data to test the setup.
INSERT INTO datasets (dataset_name, description, storage_location)
VALUES ('Sample Dataset', 'This is a sample dataset for testing', '/data/sample-dataset');
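A quick read-back (not part of the report's snippet) confirms the inserted row:
SELECT * FROM datasets;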
3. MinIO Object Storage Integration
Objective: Set up MinIO for scalable object storage to integrate with the lake house.
Step 1: Download and install MinIO.
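The download commands are not reproduced in the report; the standalone binary route, assuming a Linux amd64 host, looks like this:
wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod +x minio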
Step 2: Start MinIO.
./minio server /data --console-address ":9001"
Step 3: Access the MinIO web console at http://localhost:9001 and log in with the default credentials minioadmin:minioadmin.
Step 4: Create a bucket named lakehouse-bucket to store dataset files.
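The bucket can be created through the web console, or from the command line with the MinIO client (mc), assuming it is installed and the server is listening on its default API port 9000:
mc alias set local http://localhost:9000 minioadmin minioadmin
mc mb local/lakehouse-bucket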
4. Apache Airflow Orchestration
Objective: Set up Apache Airflow for orchestrating data ingestion, processing, and monitoring workflows.
Step 1: Create a Python virtual environment and install Apache Airflow.
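The report does not list the exact commands or pin an Airflow version; a minimal sketch (in practice you would pin a version and install it with the matching official constraints file):
python3 -m venv airflow-venv
source airflow-venv/bin/activate
pip install apache-airflow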
Step 2: Initialize the Airflow database.
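Assuming Airflow 2.x with the default SQLite metadata database:
airflow db init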
Step 3: Create an Airflow admin user.
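A sketch using the Airflow CLI; every value below is a placeholder, and the CLI will ask for a password if one is not supplied with --password:
airflow users create \
  --username admin \
  --firstname Admin \
  --lastname User \
  --role Admin \
  --email admin@example.com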
Step 4: Start the Airflow web server and scheduler.
airflow webserver --port 8080
airflow scheduler
Final Notes
Through the integration of Apache Spark, Delta Lake, PostgreSQL for metadata management, MinIO object storage, and Apache Airflow for workflow orchestration, we successfully set up the foundation for the Open Cap Stack "Lake House." Each component was installed, configured, and tested to ensure full functionality and stability. This system provides a scalable and flexible infrastructure for managing, storing, and querying large datasets efficiently.