# Auto-Tuning for ChatQnA: Optimizing Resource Allocation in Kubernetes

This document describes the Auto-Tuning framework, a tool designed to streamline deployment strategies for resource-intensive services, particularly in ChatQnA environments. It leverages Kubernetes for container orchestration and combines experimental data with prior knowledge to fine-tune deployments for optimal performance.

## Key Features
* Hardware Efficiency: Focuses on adjusting replica counts and maximizing the utilization of CPU and HPU (Habana Processing Unit) resources.
* Theoretical and Experimental Optimization: Integrates theoretical best practices with prior experimental knowledge to ensure optimal resource allocation for services.

## Usage

To generate the strategy.json configuration file for deployment, use the following command:


```bash
# Kubernetes Deployment
python3 tuning.py --tuning_config replica_tuning_config.json --hardware_info hardware_info_gaudi.json --service_info chatqna_neuralchat_rerank_latest.yaml

# Note: Add --config_only to output deployment configs only.
```
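
For example, to emit only the deployment configuration files without running the full tuning flow (a usage sketch based on the note above):

```bash
python3 tuning.py --tuning_config replica_tuning_config.json --hardware_info hardware_info_gaudi.json --service_info chatqna_neuralchat_rerank_latest.yaml --config_only
```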

## Configuration Files
1. hardware_info_gaudi.json: Specifies the hardware details (CPU, HPU, etc.).

2. chatqna_neuralchat_rerank_latest.yaml: Contains service deployment information.
3. tuning_config.json: Customizes tuning parameters for replica counts and granularity.

### hardware_info_gaudi.json
This file lists the hardware devices to be used in deployment. Please include only the devices you want the framework to use.

```json
{
  "device_0": {
    "ip": ["10.239.1.1", "10.239.10.2"],
    "type": "cpu",
    "sockets": 2,
    "cores_per_socket": 64
  },
  "device_1": {
    "ip": ["10.239.1.3"],
    "type": "cpu",
    "sockets": 2,
    "cores_per_socket": 56
  },
  "device_2": {
    "ip": ["10.239.1.5", "10.239.10.6"],
    "type": "hpu",
    "sockets": 2,
    ... ...
  }
}
```
Please refer to `hardware_info_gaudi.json` for more details.
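
As an illustration of how this schema can be consumed (a minimal sketch, assuming each device entry carries `ip`, `type`, `sockets`, and `cores_per_socket`; this is not the framework's actual parsing code):

```python
# Minimal sketch: derive per-device capacity from hardware_info_gaudi.json.
# Assumes the schema shown above; not the framework's own parser.
import json

with open("hardware_info_gaudi.json") as f:
    devices = json.load(f)

for name, dev in devices.items():
    cores_per_node = dev["sockets"] * dev["cores_per_socket"]
    print(f'{name}: {len(dev["ip"])} node(s) of type {dev["type"]}, '
          f'{cores_per_node} cores per node')
```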

### chatqna_neuralchat_rerank_latest.yaml
This file includes all services that will be deployed.
```yaml
opea_micro_services:
  data_prep:
    ... ...
  embedding:
    ... ...

  reranking:
    ... ...

  llm:
    opea/llm-tgi:
      tag: latest
      ... ...
      type: hpu
      requirements:
        model_id: "Intel/neural-chat-7b-v3-3"
    ghcr.io/huggingface/text-generation-inference:
      tag: 1.4
      type: cpu
      requirements:
        model_id: "Intel/neural-chat-7b-v3-3"

opea_mega_service:
  opea/chatqna:
    tag: latest
    type: cpu
```
Please refer to `chatqna_neuralchat_rerank_latest.yaml` for more details.

### Tuning Config Parameters

`embedding_replicas_granularity = 1`: This defines the step size for scaling the number of replicas for the embedding server.
* Value (1): Each scaling operation increases or decreases the number of replicas by 1 at a time.

`embedding_replicas_min = 1`: This sets the minimum number of replicas allowed for the embedding server.
* Value (1): The service will always have at least 1 replica running, ensuring it remains available.

`embedding_replicas_max = 4`: This defines the maximum number of replicas allowed for the embedding server.
* Value (4): The service can be scaled up to a maximum of 4 replicas, limiting resource consumption and avoiding over-provisioning.

`microservice_replicas_granularity = 1`: This specifies the scaling step size for other microservices (such as retrieval, dataprep, etc.).
* Value (1): As with `embedding_replicas_granularity`, the number of replicas for these microservices scales by 1 replica at a time.

`microservice_replicas_min = 1`: This parameter sets the minimum number of replicas for these microservices.
* Value (1): Ensures that each microservice always has at least 1 replica running.

`microservice_replicas_max = 4`: This defines the upper limit for scaling replicas for these microservices.
* Value (4): The maximum number of replicas allowed for the microservices is 4.
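
Taken together, each (granularity, min, max) triple defines the replica counts the tuner may consider for a service. A minimal sketch of that search space (illustrative only, not the framework's actual search logic):

```python
# Illustrative only: enumerate the replica counts allowed by a
# (min, max, granularity) triple; not the framework's implementation.
def candidate_replicas(minimum: int, maximum: int, granularity: int) -> list[int]:
    return list(range(minimum, maximum + 1, granularity))

# With the defaults above, both the embedding server and the other
# microservices may run 1, 2, 3, or 4 replicas.
print(candidate_replicas(1, 4, 1))  # [1, 2, 3, 4]
```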


If you want to adjust the default tuning parameters, create a replica_tuning_config.json file. For example:

```json
{
  "embedding_replicas_granularity": 1,
  "embedding_replicas_min": 1,
  "embedding_replicas_max": 4,
  "microservice_replicas_granularity": 1,
  "microservice_replicas_min": 1,
  "microservice_replicas_max": 4
}
```
The system will apply these parameters during the tuning process. Please refer to `replica_tuning_config.json` for more details.

## Output

The output of the auto-tuning process includes two key components:
1. strategy_files: Contains optimized configurations for deploying services, such as replica counts and hardware resource allocations.
2. K8S manifests: Provides the Kubernetes deployment specifications, including pod definitions and resource limits, ready for deployment.

Example of a strategy file:
```json
{
  "embedding-dependency": {
    "type": "cpu",
    ... ...
  },
  ... ...
}
```

Both the K8S manifests and strategy files are generated in the current directory, providing everything needed for deployment.

Deployment: simply run `kubectl apply -f` on the newly generated `*_run.yaml` files and the `chatqna_config_map`.
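
A sketch of that flow (file names below are illustrative; use the manifests actually generated in your run):

```bash
# Apply the generated config map first, then each generated service manifest.
# File names are illustrative; substitute the *_run.yaml files from your run.
kubectl apply -f chatqna_config_map.yaml
kubectl apply -f embedding_run.yaml
kubectl apply -f reranking_run.yaml
kubectl apply -f llm_run.yaml
```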
