# Compute

Compute resource requests in Cortex follow the syntax and meaning of [compute resources in Kubernetes](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container).

For example:

```yaml
- name: my-api
  ...
  compute:
    cpu: 1
    gpu: 1
    mem: 1G
```

CPU, GPU, Inf, and memory requests in Cortex correspond to compute resource requests in Kubernetes. In the example above, the API will only be scheduled once 1 CPU, 1 GPU, and 1G of memory are available on any instance, and it will be guaranteed to have access to those resources throughout its execution. In some cases, resource requests can be (or may default to) `Null`.

## CPU

One unit of CPU corresponds to one virtual CPU on AWS. Fractional requests are allowed, and can be specified as a floating point number or via the "m" suffix (`0.2` and `200m` are equivalent).

## GPU

One unit of GPU corresponds to one virtual GPU. Fractional requests are not allowed.

See [GPU documentation](gpus.md) for more information.

## Memory

One unit of memory is one byte. Memory can be expressed as an integer or by using one of these suffixes: `K`, `M`, `G`, `T` (or their power-of-two counterparts: `Ki`, `Mi`, `Gi`, `Ti`). For example, the following values represent roughly the same memory: `128974848`, `129e6`, `129M`, `123Mi`.
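
For example, a `compute` block that requests a fractional CPU and uses a power-of-two memory suffix might look like this (the API name and values are illustrative):

```yaml
- name: my-api
  ...
  compute:
    cpu: 200m   # equivalent to 0.2 of a virtual CPU
    mem: 256Mi  # 256 mebibytes
```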

## Inf

One unit of Inf corresponds to one Inferentia ASIC with 4 NeuronCores _(not the same thing as `cpu`)_ and 8GB of cache memory _(not the same thing as `mem`)_. Fractional requests are not allowed.

# Using GPUs

To use GPUs:

1. Make sure your AWS account is subscribed to the [EKS-optimized AMI with GPU Support](https://aws.amazon.com/marketplace/pp/B07GRHFXGM).
2. You may need to [file an AWS support ticket](https://console.aws.amazon.com/support/cases#/create?issueType=service-limit-increase&limitType=ec2-instances) to increase the limit for your desired instance type.
3. Set the instance type to an AWS GPU instance (e.g. `g4dn.xlarge`) when installing Cortex.
4. Set the `gpu` field in the `compute` configuration for your API, as shown in the sketch below. One unit of GPU corresponds to one virtual GPU. Fractional requests are not allowed.
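
For reference, a minimal `compute` block requesting one GPU might look like this (the API name is a placeholder):

```yaml
- name: my-api
  ...
  compute:
    gpu: 1
```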

## Tips

### If using `processes_per_replica` > 1, TensorFlow-based models, and the Python Predictor

When using `processes_per_replica` > 1 with TensorFlow-based models (including Keras) in the Python Predictor, loading the model in separate processes at the same time will throw a `CUDA_ERROR_OUT_OF_MEMORY: out of memory` error. This is because the first process that loads the model will allocate all of the GPU's memory and leave none for the other processes. To prevent this from happening, the per-process GPU memory usage can be limited. There are two methods:

1) Configure the model to allocate only as much memory as it requires, via [tf.config.experimental.set_memory_growth()](https://www.tensorflow.org/api_docs/python/tf/config/experimental/set_memory_growth):

```python
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```

2) Impose a hard limit on how much memory the model can use, via [tf.config.set_logical_device_configuration()](https://www.tensorflow.org/api_docs/python/tf/config/set_logical_device_configuration):

```python
mem_limit_mb = 1024
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.set_logical_device_configuration(
        gpu, [tf.config.LogicalDeviceConfiguration(memory_limit=mem_limit_mb)]
    )
```

See the [TensorFlow GPU guide](https://www.tensorflow.org/guide/gpu) and this [blog post](https://medium.com/@starriet87/tensorflow-2-0-wanna-limit-gpu-memory-10ad474e2528) for additional information.

# Using Inferentia

To use [Inferentia ASICs](https://aws.amazon.com/machine-learning/inferentia/):

1. You may need to [file an AWS support ticket](https://console.aws.amazon.com/support/cases#/create?issueType=service-limit-increase&limitType=ec2-instances) to increase the limit for your desired instance type.
2. Set the instance type to an AWS Inferentia instance (e.g. `inf1.xlarge`) when creating your Cortex cluster.
3. Set the `inf` field in the `compute` configuration for your API, as shown in the sketch below. One unit of `inf` corresponds to one Inferentia ASIC with 4 NeuronCores _(not the same thing as `cpu`)_ and 8GB of cache memory _(not the same thing as `mem`)_. Fractional requests are not allowed.
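
For reference, a minimal `compute` block requesting one Inferentia chip might look like this (the API name is a placeholder):

```yaml
- name: my-api
  ...
  compute:
    inf: 1
```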

## Neuron

Inferentia ASICs come in different sizes depending on the instance type:

* `inf1.xlarge`/`inf1.2xlarge` - each has 1 Inferentia ASIC
* `inf1.6xlarge` - has 4 Inferentia ASICs
* `inf1.24xlarge` - has 16 Inferentia ASICs

Each Inferentia ASIC comes with 4 NeuronCores and 8GB of cache memory. To better understand how Inferentia ASICs work, read these [technical notes](https://github.com/aws/aws-neuron-sdk/blob/master/docs/technotes/README.md) and this [FAQ](https://github.com/aws/aws-neuron-sdk/blob/master/FAQ.md).

### NeuronCore Groups

A [NeuronCore Group](https://github.com/aws/aws-neuron-sdk/blob/master/docs/tensorflow-neuron/tutorial-NeuronCore-Group.md) (NCG) is a set of NeuronCores that is used to load and run a compiled model. NCGs exist to aggregate NeuronCores to improve hardware performance. Models can be shared within an NCG, but this would require the device driver to dynamically context switch between the models, which degrades performance. Therefore, we've decided to allow only one model per NCG (unless you are using a [multi-model endpoint](../guides/multi-model.md), in which case there will be multiple models on a single NCG, and there will be context switching).

Each Cortex API process will have its own copy of the model and will run on its own NCG (the number of API processes is configured by the [`processes_per_replica`](../deployments/realtime-api/autoscaling.md#replica-parallelism) field in the API configuration for Realtime APIs). Each NCG will have an equal share of NeuronCores. Therefore, the size of each NCG will be `4 * inf / processes_per_replica` (`inf` refers to your API's `compute` request, and it's multiplied by 4 because there are 4 NeuronCores per Inferentia chip).

For example, if your API requests 2 `inf` chips, there will be 8 NeuronCores available. If you set `processes_per_replica` to 1, there will be one copy of your model running on a single NCG of size 8 NeuronCores. If `processes_per_replica` is 2, there will be two copies of your model, each running on a separate NCG of size 4 NeuronCores. If `processes_per_replica` is 4, there will be 4 NCGs of size 2 NeuronCores, and if `processes_per_replica` is 8, there will be 8 NCGs of size 1 NeuronCore. In this scenario, these are the only valid values for `processes_per_replica`; in other words, the total number of requested NeuronCores (which equals 4 times the number of requested Inferentia chips) must be divisible by `processes_per_replica`.
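
The arithmetic above can be summarized with a small sketch (the helper function below is purely illustrative, not part of Cortex):

```python
NEURONCORES_PER_CHIP = 4

def ncg_size(inf: int, processes_per_replica: int) -> int:
    """Return the number of NeuronCores in each NeuronCore Group."""
    total_cores = NEURONCORES_PER_CHIP * inf
    if total_cores % processes_per_replica != 0:
        raise ValueError("total NeuronCores must be divisible by processes_per_replica")
    return total_cores // processes_per_replica

# with `inf: 2` (8 NeuronCores total), the valid configurations are:
assert ncg_size(2, 1) == 8  # one process on a single NCG of 8 NeuronCores
assert ncg_size(2, 2) == 4  # two processes, each on an NCG of 4 NeuronCores
assert ncg_size(2, 4) == 2
assert ncg_size(2, 8) == 1
```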

The 8GB cache memory is shared between all 4 NeuronCores of an Inferentia chip. Therefore, an NCG with 8 NeuronCores (i.e. 2 Inf chips) will have access to 16GB of cache memory. An NCG with 2 NeuronCores will have access to 8GB of cache memory, which will be shared with the other NCG of size 2 running on the same Inferentia chip.

### Compiling models

Before a model can be deployed on Inferentia chips, it must be compiled for Inferentia. The Neuron compiler can be used to convert a regular TensorFlow SavedModel or PyTorch model into the hardware-specific instruction set for Inferentia. Inferentia currently supports compiled models from TensorFlow and PyTorch.

By default, the Neuron compiler will compile a model to use 1 NeuronCore, but it can be manually set to a different size (1, 2, 4, etc.).

For optimal performance, your model should be compiled to run on the number of NeuronCores available to it. The number of NeuronCores will be `4 * inf / processes_per_replica` (`inf` refers to your API's `compute` request, and it's multiplied by 4 because there are 4 NeuronCores per Inferentia chip). See [NeuronCore Groups](inferentia.md#neuron-core-groups) above for an example, and see [Improving performance](inferentia.md#improving-performance) below for a discussion of choosing the appropriate number of NeuronCores.

Here is an example of compiling a TensorFlow SavedModel for Inferentia:

```python
import tensorflow.neuron as tfn

tfn.saved_model.compile(
    model_dir,
    compiled_model_dir,
    batch_size,
    compiler_args=["--num-neuroncores", "1"],
)
```

Here is an example of compiling a PyTorch model for Inferentia:

```python
import torch
import torch_neuron

model.eval()
example_input = torch.zeros([batch_size] + input_shape, dtype=torch.float32)
model_neuron = torch.neuron.trace(
    model,
    example_inputs=[example_input],
    compiler_args=["--num-neuroncores", "1"]
)
model_neuron.save(compiled_model)
```

The versions of `tensorflow-neuron` and `torch-neuron` that are used by Cortex are found in the [Realtime API pre-installed packages list](../deployments/realtime-api/predictors.md#inferentia-equipped-apis) and the [Batch API pre-installed packages list](../deployments/batch-api/predictors.md#inferentia-equipped-apis). When installing these packages with `pip` to compile models of your own, use the extra index URL `--extra-index-url=https://pip.repos.neuron.amazonaws.com`.

A list of model compilation examples for Inferentia can be found on the [`aws/aws-neuron-sdk`](https://github.com/aws/aws-neuron-sdk) repo for [TensorFlow](https://github.com/aws/aws-neuron-sdk/blob/master/docs/tensorflow-neuron/) and for [PyTorch](https://github.com/aws/aws-neuron-sdk/blob/master/docs/pytorch-neuron/README.md). Here are two examples implemented with Cortex:

1. [ResNet50 in TensorFlow](https://github.com/cortexlabs/cortex/tree/0.22/examples/tensorflow/image-classifier-resnet50)
2. [ResNet50 in PyTorch](https://github.com/cortexlabs/cortex/tree/0.22/examples/pytorch/image-classifier-resnet50)

### Improving performance

A few things can be done to improve performance using compiled models on Cortex:

1. There's a minimum number of NeuronCores for which a model can be compiled. That number depends on the model's architecture. Generally, compiling a model for more cores than its required minimum helps to distribute the model's operators across multiple cores, which in turn [can lead to lower latency](https://github.com/aws/aws-neuron-sdk/blob/master/docs/technotes/neuroncore-pipeline.md). However, compiling a model for more NeuronCores means that you'll have to set `processes_per_replica` to be lower so that the NeuronCore Group has access to the number of NeuronCores for which you compiled your model. This is acceptable if latency is your top priority, but if throughput is more important to you, this tradeoff is usually not worth it. To maximize throughput, compile your model for as few NeuronCores as possible and increase `processes_per_replica` to the maximum possible (see above for a sample calculation).
2. Try to achieve a near [100% placement](https://github.com/aws/aws-neuron-sdk/blob/b28262e3072574c514a0d72ad3fe5ca48686d449/src/examples/tensorflow/keras_resnet50/pb2sm_compile.py#L59) of your model's graph onto the NeuronCores. During the compilation phase, any operators that can't execute on NeuronCores will be compiled to execute on the machine's CPU and memory instead. Even if just a few percent of the operations reside on the host's CPU/memory, the maximum throughput of the instance can be significantly limited.
3. Use the [`--static-weights` compiler option](https://github.com/aws/aws-neuron-sdk/blob/master/docs/technotes/performance-tuning.md#compiling-for-pipeline-optimization) when possible. This option tells the compiler to cache the entire model onto the NeuronCores, which avoids a lot of back-and-forth between the machine's CPU/memory and the Inferentia ASICs (see the sketch below).
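
For illustration, assuming the option is forwarded to the Neuron compiler via `compiler_args` (as with `--num-neuroncores` in the TensorFlow example above), enabling it could look something like this:

```python
import tensorflow.neuron as tfn

# illustrative sketch: forward --static-weights to the Neuron compiler so that
# the model's weights are cached on the NeuronCores
tfn.saved_model.compile(
    model_dir,
    compiled_model_dir,
    batch_size,
    compiler_args=["--static-weights", "--num-neuroncores", "1"],
)
```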

# Networking

![api architecture diagram](https://user-images.githubusercontent.com/808475/84695323-8507dd00-aeff-11ea-8b32-5a55cef76c79.png)

APIs are deployed with a public API Gateway by default (the API Gateway forwards requests to the API load balancer). Each API can be independently configured to not create the API Gateway endpoint by setting `api_gateway: none` in the `networking` field of the [Realtime API configuration](../deployments/realtime-api/api-configuration.md) and [Batch API configuration](../deployments/batch-api/api-configuration.md). If the API Gateway endpoint is not created, your API can still be accessed via the API load balancer; `cortex get API_NAME` will show the load balancer endpoint if API Gateway is disabled. API Gateway is enabled by default, and is generally recommended unless it doesn't support your use case due to limitations such as the 29 second request timeout, or if you are keeping your APIs private to your VPC. See below for common configurations. To disable API Gateway cluster-wide (thereby enforcing that all APIs cannot create API Gateway endpoints), set `api_gateway: none` in your [cluster configuration](../cluster-management/config.md) file (before creating your cluster).

By default, the API load balancer is public. You can configure your API load balancer to be private by setting `api_load_balancer_scheme: internal` in your [cluster configuration](../cluster-management/config.md) file (before creating your cluster). This will force external traffic to go through your API Gateway endpoint, or, if you disabled API Gateway for your API, it will make your API only accessible through VPC Peering. Note that if API Gateway is used, endpoints will be public regardless of `api_load_balancer_scheme`. See below for common configurations.

The API Gateway that Cortex creates in AWS is the "HTTP" type. If you need to use AWS's "REST" API Gateway, see [here](../guides/rest-api-gateway.md).

## Common API networking configurations

### Public https endpoint (with API Gateway)

This is the most common configuration for public APIs. [Custom domains](../guides/custom-domain.md) can be used with this setup, but are not required.

```yaml
# cluster.yaml

api_load_balancer_scheme: internal
```

```yaml
# cortex.yaml

- name: my-api
  ...
  networking:
    api_gateway: public  # this is the default, so can be omitted
```

### Private https endpoint

You can configure your API to be private. If you do this, you must use [VPC Peering](../guides/vpc-peering.md) to connect to your APIs.

The SSL certificate on the API load balancer is autogenerated during installation using `localhost` as the Common Name (CN). Therefore, clients will need to skip certificate verification when making HTTPS requests (e.g. `curl -k`). Alternatively, you can set up a [custom domain](../guides/custom-domain.md), which will use ACM to provision SSL certs for your domain.

```yaml
# cluster.yaml

api_load_balancer_scheme: internal

# use this to configure a custom domain
# if you don't use a custom domain, clients will need to skip certificate verification when making HTTPS requests (e.g. `curl -k`)
ssl_certificate_arn: arn:aws:acm:us-west-2:***:certificate/***
```

```yaml
# cortex.yaml

- name: my-api
  ...
  networking:
    api_gateway: none
```

### Private http endpoint

You can configure your API to be private. If you do this, you must use [VPC Peering](../guides/vpc-peering.md) to connect to your APIs.

```yaml
# cluster.yaml

api_load_balancer_scheme: internal
```

```yaml
# cortex.yaml

- name: my-api
  ...
  networking:
    api_gateway: none
```

### Public https endpoint (without API Gateway)

API Gateway is generally recommended for public https APIs, but there may be a situation where you don't wish to use it (e.g. requests take longer than 29 seconds to complete, which is the max for API Gateway). In this case, clients can connect directly to the API load balancer.

The SSL certificate on the API load balancer is autogenerated during installation using `localhost` as the Common Name (CN). Therefore, clients will need to skip certificate verification when making HTTPS requests (e.g. `curl -k`). Alternatively, you can set up a [custom domain](../guides/custom-domain.md), which will use ACM to provision SSL certs for your domain.

```yaml
# cluster.yaml

api_load_balancer_scheme: internet-facing  # this is the default, so can be omitted

# use this to configure a custom domain
# if you don't use a custom domain, clients will need to skip certificate verification when making HTTPS requests (e.g. `curl -k`)
ssl_certificate_arn: arn:aws:acm:us-west-2:***:certificate/***
```

```yaml
# cortex.yaml

- name: my-api
  ...
  networking:
    api_gateway: none
```

### Public http endpoint

If you don't wish to use https for your public API, you can simply disable API Gateway (your API will be accessed directly via the API load balancer):

```yaml
# cluster.yaml

api_load_balancer_scheme: internet-facing  # this is the default, so can be omitted
```

```yaml
# cortex.yaml

- name: my-api
  ...
  networking:
    api_gateway: none
```