sync 13-06-24
aisi-inspect committed Jun 13, 2024
1 parent f829d7e commit 8141ef2
Showing 63 changed files with 2,174 additions and 886 deletions.
8 changes: 5 additions & 3 deletions CHANGELOG.md
@@ -2,17 +2,19 @@

## v0.3.15 (Unreleased)

- [Tool Environments](https://ukgovernmentbeis.github.io/inspect_ai/tools.html#tool-environments) for executing tool code in sandboxed containers.
- [Tool Environments](https://ukgovernmentbeis.github.io/inspect_ai/agents.html#sec-tool-environments) for executing tool code in a sandbox.
- [Caching](https://ukgovernmentbeis.github.io/inspect_ai/caching.html) to reduce the number of model API calls made.
- The `multiple_choice()` solver now has support for questions with multiple correct answers.
- More fine grained handling of Claude `BadRequestError` (400) errors (which were formerly all treated as content moderation errors).
- Filter out empty TextBlockParam when playing messages back to Claude.
- Automatically combine Claude user messages that include tool content.
- Revert to "auto" rather than "none" after forced tool call.
- Automatically combine Anthropic user messages that include tool content.
- Support all Llama series models on Bedrock.
- Provide `TaskState.tools` getter/setter (where the setter automatically syncs the system messages to the specified set of tools).
- The `use_tools()` function now uses the `TaskState.tools` setter, so replaces the current set of tools entirely rather than appending to it.
- Set `state.completed = False` when `max_messages` is reached.
- Allow tools to be declared with no parameters.
- Allow for null `bytes` field in `Logprobs` and `TopLogprobs`
- Support all Llama series models on Bedrock.
- Added `truthfulqa` benchmark.
- Added `intercode-ctf` example.

1 change: 1 addition & 0 deletions docs/_quarto.yml
@@ -67,6 +67,7 @@ book:

    - part: "Advanced"
      chapters:
        - caching.qmd
        - eval-logs.qmd
        - eval-suites.qmd
        - eval-tuning.qmd
65 changes: 57 additions & 8 deletions docs/agents.qmd
@@ -432,7 +432,7 @@ class ToolEnvironment:
There are two tool environments built in to Inspect:

| Environment Type | Description |
|--------------------------|----------------------------------------------|
|---------------------------|---------------------------------------------|
| `local` | Run `tool_environment()` methods in the same file system as the running evaluation (should *only be used* if you are already running your evaluation in another sandbox). |
| `docker` | Run `tool_environment()` methods within a Docker container (see the [Docker Configuration](#sec-docker-configuration) section below for additional details). |
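
As a quick illustration of selecting between these for an evaluation (a minimal sketch: the task-level `tool_environment` parameter shown here is an assumption mirroring the `--tool-environment` CLI option discussed below, and the dataset file is hypothetical):

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset

@task
def ctf():
    return Task(
        # hypothetical dataset of CTF challenges
        dataset=json_dataset("challenges.json"),
        # assumption: mirrors the --tool-environment CLI option
        tool_environment="docker",
    )
```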

@@ -475,22 +475,53 @@ While `--tool-environment` can be a default un-configured environment (e.g. “
Here is how Docker tool environments are created based on the presence of `Dockerfile` and/or `compose.yml` in the task directory:

| Config Files | Behavior |
|---------------------|---------------------------------------------------|
|-----------------------|------------------------------------------------|
| None | Creates a tool environment based on the official [python:3.12-bookworm](https://hub.docker.com/_/python) image. |
| `Dockerfile` | Creates a tool environment by building the image. |
| `compose.yaml` | Creates tool environment(s) based on `compose.yaml`. |

Here is what a simple `compose.yaml` would look like for a single tool environment that uses the `ctf-agent-environment` Docker image:
If you have a `Dockerfile`, then `compose.yaml` is not strictly required; however, you may still want to provide one (e.g. to set compute resource limits). For example:

``` {.yaml filename="compose.yaml"}
services:
  default:
    image: ctf-agent-environment
    build: .
    command: tail -f /dev/null
    cpus: 1.0
    mem_limit: 0.5gb
```

Note that we've also chosen to limit the CPU and memory usage of the container (see the [Docker Compose](https://docs.docker.com/compose/compose-file/) documentation for information on these and other container options).
The `command` is provided to prevent the container from exiting.

Here is what a simple `compose.yaml` would look like for a local pre-built image named `ctf-agent-environment` (resource limits excluded for brevity):

``` {.yaml filename="compose.yaml"}
services:
  default:
    image: ctf-agent-environment
    x-local: true
    command: tail -f /dev/null
```

The `ctf-agent-environment` image does not exist on a remote registry, so we add `x-local: true` to indicate that it should not be pulled. If a local image is tagged, it will also not be pulled by default (so `x-local: true` is not required). For example:

``` {.yaml filename="compose.yaml"}
services:
  default:
    image: ctf-agent-environment:1.0.0
    command: tail -f /dev/null
```

If we are using an image from a remote registry we similarly don't need to include `x-local`:

``` {.yaml filename="compose.yaml"}
services:
  default:
    image: python:3.12-bookworm
    command: tail -f /dev/null
```

See the [Docker Compose](https://docs.docker.com/compose/compose-file/) documentation for information on all available container options.

#### Multiple Environments

@@ -500,10 +531,12 @@ In some cases you may want to create multiple tool environments (e.g. if one env
services:
  default:
    image: ctf-agent-environment
    x-local: true
    cpus: 1.0
    mem_limit: 0.5gb
  ghidra:
    image: ctf-ghidra-environment
  victim:
    image: ctf-victim-environment
    x-local: true
    cpus: 1.0
    mem_limit: 1gb
```
@@ -512,13 +545,15 @@ The first environment listed is the “default” environment, and can be access

``` python
tool_environment() # default tool environment
tool_environment("ghidra") # named tool environment
tool_environment("victim") # named tool environment
```

::: {.callout-note appearance="simple"}
If you define multiple tool environments you are *required* to name one of them "default" so that Inspect knows which environment to copy sample files into and which to resolve for calls to `tool_environment()` with no argument.
:::

#### Files

Sample `files` will be copied into the default tool environment unless their name contains a prefix mapping them into another environment (e.g. `"victim:flag.txt": "flag.txt"`).
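
For example, here is a sketch of a sample that maps files into two environments (the file names and the `victim` environment are hypothetical; the prefix syntax is as described above):

``` python
from inspect_ai.dataset import Sample

sample = Sample(
    input="Recover the flag from the victim host.",
    target="flag{example}",
    files={
        # copied into the default tool environment
        "challenge.txt": "challenge.txt",
        # prefix maps this file into the 'victim' environment
        "victim:flag.txt": "flag.txt",
    },
)
```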

#### Infrastructure
@@ -529,11 +564,13 @@ Note that in many cases you’ll want to provision additional infrastructure (e.
services:
  default:
    image: ctf-agent-environment
    x-local: true
    volumes:
      - ctf-challenge-volume:/shared-data

  writer:
    image: ctf-challenge-writer
    x-local: true
    volumes:
      - ctf-challenge-volume:/shared-data

volumes:
@@ -552,6 +589,18 @@ As described above, each `Sample` is provisioned its own container. The number o

Use `max_samples` to dial up or down the number of containers running at any given time. Note that a running container does not necessarily use CPU resources unless it has active background processes.
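
For example, here is a sketch of capping concurrency from Python (the task file name is hypothetical, and the `max_samples` argument to `eval()` is assumed to correspond to the CLI's `--max-samples` option):

``` python
from inspect_ai import eval

# run at most 8 samples (and therefore at most 8 containers) at a time
eval("ctf.py", model="openai/gpt-4-turbo", max_samples=8)
```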

Use a `compose.yaml` file to limit the resources consumed by each running container. For example:

``` {.yaml filename="compose.yaml"}
services:
  default:
    image: ctf-agent-environment
    x-local: true
    command: tail -f /dev/null
    cpus: 1.0
    mem_limit: 0.5gb
```

#### Concurrent Execution

The `ToolEnvironment.exec()` method runs a command within a tool environment, typically consuming CPU resources. To protect against overwhelming the system's CPUs, the implementation of `exec()` uses Inspect's `subprocess()` function, which automatically limits concurrent child processes to the number of CPUs on your system (`os.cpu_count()`).
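
As a sketch of what this means for tool code (assumptions: the import path for `tool_environment()`, the list-of-strings `exec()` signature, and the `stdout` field on its result are not confirmed here):

``` python
import asyncio

from inspect_ai.solver import tool_environment  # import path is an assumption

async def gather_recon() -> list[str]:
    # these exec() calls are issued concurrently, but Inspect's subprocess()
    # limits how many child processes actually run at once (os.cpu_count())
    results = await asyncio.gather(
        tool_environment().exec(["uname", "-a"]),
        tool_environment().exec(["ls", "-la", "/challenge"]),
        tool_environment().exec(["whoami"]),
    )
    return [result.stdout for result in results]
```
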
150 changes: 150 additions & 0 deletions docs/caching.qmd
@@ -0,0 +1,150 @@
# Caching {#sec-caching}

## Overview

Caching stores model output so that repeated identical requests can be served without additional API calls, saving both time and expense. Caching is also often useful during development---for example, when you are iterating on a scorer you may want model outputs served from the cache, both to save time and to increase determinism.

## Caching Basics

Use the `cache` parameter on calls to `generate()` to activate the use of the cache. The keys for caching (what determines if a request can be fulfilled from the cache) are as follows:

- Model name and base URL (e.g. `openai/gpt-4-turbo`)
- Model prompt (i.e. message history)
- Epoch number (for ensuring distinct generations per epoch)
- Generate configuration (e.g. `temperature`, `top_p`, etc.)
- Active `tools` and `tool_choice`

If all of these inputs are identical, then the model response will be served from the cache. By default, model responses are cached for 1 week (see [Cache Policy](#cache-policy) below for details on customising this).

For example, here we are iterating on our self critique template, so we cache the main call to `generate()`:

``` python
@task
def theory_of_mind():
    return Task(
        dataset=example_dataset("theory_of_mind"),
        plan=[
            chain_of_thought(),
            generate(cache = True),
            self_critique(CRITIQUE_TEMPLATE)
        ],
        scorer=model_graded_fact(),
    )
```

You can similarly do this with the `generate` function passed into a `Solver`:

``` python
@solver
def custom_solver(cache):

    async def solve(state, generate):

        # (custom solver logic prior to generate)

        return await generate(state, cache)

    return solve
```

You don't strictly need to provide a `cache` argument for a custom solver that uses caching, but it's generally good practice to enable users of the function to control caching behaviour.
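
For instance (a minimal sketch reusing the `custom_solver()` definition above together with the earlier `theory_of_mind` task), a caller can then decide whether and how to cache:

``` python
@task
def theory_of_mind_cached():
    return Task(
        dataset=example_dataset("theory_of_mind"),
        plan=[
            chain_of_thought(),
            custom_solver(cache=True),
            self_critique(CRITIQUE_TEMPLATE)
        ],
        scorer=model_graded_fact(),
    )
```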

You can also use caching with lower-level `generate()` calls (e.g. a model instance you have obtained with `get_model()`). There are some special considerations around epochs for this case; see the [Model API](#model-api) section below for further discussion.

### Model Versions

The model name (e.g. `openai/gpt-4-turbo`) is used as part of the cache key. Note though that many model names are aliases to specific model versions. For example, `gpt-4` and `gpt-4-turbo` may resolve to different versions over time as updates are released.

If you want the cache to be invalidated when a model version is updated, use an explicitly versioned model name. For example:

``` bash
$ inspect eval ctf.py --model openai/gpt-4-turbo-2024-04-09
```

If you do this, then when a new version of the model is deployed, calls will go to the model rather than being resolved from the cache.

## Cache Policy {#cache-policy}

By default, if you specify `cache = True` then the cache will expire in 1 week. You can customise this by passing a `CachePolicy` rather than a boolean. For example:

``` python
cache = CachePolicy(expiry="3h")
cache = CachePolicy(expiry="4D")
cache = CachePolicy(expiry="2W")
cache = CachePolicy(expiry="3M")
```

You can use `s`, `m`, `h`, `D`, `W`, `M`, and `Y` as abbreviations for `expiry` values.

If you want the cache to *never* expire, specify `None`. For example:

``` python
cache = CachePolicy(expiry = None)
```

You can also define scopes for cache expiration (e.g. cache for a specific task or usage pattern). Use the `scopes` parameter to add named scopes to the cache key:

``` python
cache = CachePolicy(
    expiry = "1M",
    scopes = {"role": "attacker", "team": "red"}
)
```

## Model API {#model-api}

You can use the `cache` parameter in lower-level Model API calls (e.g. models obtained via `get_model()`). For example:

``` python
model = get_model("anthropic/claude-3-opus-20240229")
output = await model.generate(input, cache = True)
```

If you are using Model APIs directly and your evaluation has multiple epochs, you will likely also want to add an `epoch` to the `CachePolicy` to ensure that outputs vary across epochs. For example, in a custom scorer:

``` python
@scorer(metrics=[accuracy(), bootstrap_std()])
def custom_scorer(model: str | Model | None = None):

    # resolve model
    grader_model = get_model(model)

    async def score(state: TaskState, target: Target):

        # (create input for grading)

        # use caching but also forward epoch
        output = await grader_model.generate(
            input,
            cache = CachePolicy(
                expiry = "4W",
                epoch = state.epoch
            )
        )

        # (use output to determine score value)

        return value

    return score
```

## Management

Use the `inspect cache` command to view the current contents of the cache, prune expired entries, or clear entries entirely. For example:

``` bash
# list the current contents of the cache
$ inspect cache list

# clear the cache (globally or by model)
$ inspect cache clear
$ inspect cache clear --model openai/gpt-4-turbo-2024-04-09

# prune expired entries from the cache
$ inspect cache list --prunable
$ inspect cache prune
$ inspect cache prune --model openai/gpt-4-turbo-2024-04-09
```

See `inspect cache --help` for further details on management commands.
