sync 13-06-24
aisi-inspect committed Jun 13, 2024
1 parent f829d7e commit 8141ef2
Showing 63 changed files with 2,174 additions and 886 deletions.
8 changes: 5 additions & 3 deletions CHANGELOG.md
@@ -2,17 +2,19 @@

## v0.3.15 (Unreleased)

- [Tool Environments](https://ukgovernmentbeis.github.io/inspect_ai/tools.html#tool-environments) for executing tool code in sandboxed containers.
- [Tool Environments](https://ukgovernmentbeis.github.io/inspect_ai/agents.html#sec-tool-environments) for executing tool code in a sandbox.
- [Caching](https://ukgovernmentbeis.github.io/inspect_ai/caching.html) to reduce the number of model API calls made.
- The `multiple_choice()` solver now has support for questions with multiple correct answers.
- More fine grained handling of Claude `BadRequestError` (400) errors (which were formerly all treated as content moderation errors).
- Filter out empty TextBlockParam when playing messages back to Claude.
- Automatically combine Claude user messages that include tool content.
- Revert to "auto" rather than "none" after forced tool call.
- Automatically combine Anthropic user messages that include tool content.
- Support all Llama series models on Bedrock.
- Provide `TaskState.tools` getter/setter (where the setter automatically syncs the system messages to the specified set of tools).
- The `use_tools()` function now uses the `TaskState.tools` setter, so replaces the current set of tools entirely rather than appending to it.
- Set `state.completed = False` when `max_messages` is reached.
- Allow tools to be declared with no parameters.
- Allow for null `bytes` field in `Logprobs` and `TopLogprobs`
- Support all Llama series models on Bedrock.
- Added `truthfulqa` benchmark.
- Added `intercode-ctf` example.

1 change: 1 addition & 0 deletions docs/_quarto.yml
@@ -67,6 +67,7 @@ book:

    - part: "Advanced"
      chapters:
        - caching.qmd
        - eval-logs.qmd
        - eval-suites.qmd
        - eval-tuning.qmd
65 changes: 57 additions & 8 deletions docs/agents.qmd
@@ -432,7 +432,7 @@ class ToolEnvironment:
There are two tool environments built in to Inspect:

| Environment Type | Description |
|--------------------------|----------------------------------------------|
|---------------------------|---------------------------------------------|
| `local` | Run `tool_environment()` methods in the same file system as the running evaluation (should *only be used* if you are already running your evaluation in another sandbox). |
| `docker` | Run `tool_environment()` methods within a Docker container (see the [Docker Configuration](#sec-docker-configuration) section below for additional details). |
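
As a quick illustration of selecting between these for an evaluation (a minimal sketch: the task-level `tool_environment` parameter shown here is an assumption mirroring the `--tool-environment` CLI option discussed below, and the dataset file is hypothetical):

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset

@task
def ctf():
    return Task(
        # hypothetical dataset of CTF challenges
        dataset=json_dataset("challenges.json"),
        # assumption: mirrors the --tool-environment CLI option
        tool_environment="docker",
    )
```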

@@ -475,22 +475,53 @@ While `--tool-environment` can be a default un-configured environment (e.g. “
Here is how Docker tool environments are created based on the presence of `Dockerfile` and/or `compose.yml` in the task directory:

| Config Files | Behavior |
|---------------------|---------------------------------------------------|
|-----------------------|------------------------------------------------|
| None | Creates a tool environment based on the official [python:3.12-bookworm](https://hub.docker.com/_/python) image. |
| `Dockerfile` | Creates a tool environment by building the image. |
| `compose.yaml` | Creates tool environment(s) based on `compose.yaml`. |

Here is what a simple `compose.yaml` would look like for a single tool environment that uses the `ctf-agent-environment` Docker image:
If you have a `Dockerfile`, then `compose.yaml` is not strictly required; however, you may still want to provide one (e.g. to set compute resource limits). For example:

``` {.yaml filename="compose.yaml"}
services:
  default:
    image: ctf-agent-environment
    build: .
    command: tail -f /dev/null
    cpus: 1.0
    mem_limit: 0.5gb
```

Note that we've also chosen to limit the CPU and memory usage of the container (see the [Docker Compose](https://docs.docker.com/compose/compose-file/) documentation for information on these and other container options).
The `command` is provided to prevent the container from exiting.

Here is what a simple `compose.yaml` would look like for a local pre-built image named `ctf-agent-environment` (resource limits excluded for brevity):

``` {.yaml filename="compose.yaml"}
services:
  default:
    image: ctf-agent-environment
    x-local: true
    command: tail -f /dev/null
```

The `ctf-agent-environment` image does not exist on a remote registry, so we add `x-local: true` to indicate that it should not be pulled. If a local image is tagged, it will also not be pulled by default (so `x-local: true` is not required). For example:

``` {.yaml filename="compose.yaml"}
services:
  default:
    image: ctf-agent-environment:1.0.0
    command: tail -f /dev/null
```

If we are using an image from a remote registry we similarly don't need to include `x-local`:

``` {.yaml filename="compose.yaml"}
services:
  default:
    image: python:3.12-bookworm
    command: tail -f /dev/null
```

See the [Docker Compose](https://docs.docker.com/compose/compose-file/) documentation for information on all available container options.

#### Multiple Environments

@@ -500,10 +531,12 @@ In some cases you may want to create multiple tool environments (e.g. if one env
services:
  default:
    image: ctf-agent-environment
    x-local: true
    cpus: 1.0
    mem_limit: 0.5gb
  ghidra:
    image: ctf-ghidra-environment
  victim:
    image: ctf-victim-environment
    x-local: true
    cpus: 1.0
    mem_limit: 1gb
```
@@ -512,13 +545,15 @@ The first environment listed is the “default” environment, and can be access

``` python
tool_environment() # default tool environment
tool_environment("ghidra") # named tool environment
tool_environment("victim") # named tool environment
```

::: {.callout-note appearance="simple"}
If you define multiple tool environments you are *required* to name one of them "default" so that Inspect knows which environment to copy sample files into and which to resolve for calls to `tool_environment()` with no argument.
:::

#### Files

Sample `files` will be copied into the default tool environment unless their name contains a prefix mapping them into another environment (e.g. `"victim:flag.txt": "flag.txt"`).
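
For example, here is a sketch of a sample that maps files into two environments (the file names and the `victim` environment are hypothetical; the prefix syntax is as described above):

``` python
from inspect_ai.dataset import Sample

sample = Sample(
    input="Recover the flag from the victim host.",
    target="flag{example}",
    files={
        # copied into the default tool environment
        "challenge.txt": "challenge.txt",
        # prefix maps this file into the 'victim' environment
        "victim:flag.txt": "flag.txt",
    },
)
```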

#### Infrastructure
@@ -529,11 +564,13 @@ Note that in many cases you’ll want to provision additional infrastructure (e.
services:
  default:
    image: ctf-agent-environment
    x-local: true
    volumes:
      - ctf-challenge-volume:/shared-data

  writer:
    image: ctf-challenge-writer
    x-local: true
    volumes:
      - ctf-challenge-volume:/shared-data

volumes:
@@ -552,6 +589,18 @@ As described above, each `Sample` is provisioned its own container. The number o

Use `max_samples` to dial up or down the number of containers running at any given time. Note that a running container does not necessarily use CPU resources unless it has active background processes.
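
For example, here is a sketch of capping concurrency from Python (the task file name is hypothetical, and the `max_samples` argument to `eval()` is assumed to correspond to the CLI's `--max-samples` option):

``` python
from inspect_ai import eval

# run at most 8 samples (and therefore at most 8 containers) at a time
eval("ctf.py", model="openai/gpt-4-turbo", max_samples=8)
```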

Use a `compose.yaml` file to limit the resources consumed by each running container. For example:

``` {.yaml filename="compose.yaml"}
services:
  default:
    image: ctf-agent-environment
    x-local: true
    command: tail -f /dev/null
    cpus: 1.0
    mem_limit: 0.5gb
```

#### Concurrent Execution

The `ToolEnvironment.exec()` method runs a command within a tool environment, typically consuming CPU resources. To protect against overwhelming the system's CPUs, the implementation of `exec()` uses Inspect's `subprocess()` function, which automatically limits concurrent child processes to the number of CPUs on your system (`os.cpu_count()`).
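
As a sketch of what this means for tool code (assumptions: the import path for `tool_environment()`, the list-of-strings `exec()` signature, and the `stdout` field on its result are not confirmed here):

``` python
import asyncio

from inspect_ai.solver import tool_environment  # import path is an assumption

async def gather_recon() -> list[str]:
    # these exec() calls are issued concurrently, but Inspect's subprocess()
    # limits how many child processes actually run at once (os.cpu_count())
    results = await asyncio.gather(
        tool_environment().exec(["uname", "-a"]),
        tool_environment().exec(["ls", "-la", "/challenge"]),
        tool_environment().exec(["whoami"]),
    )
    return [result.stdout for result in results]
```
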
150 changes: 150 additions & 0 deletions docs/caching.qmd
@@ -0,0 +1,150 @@
# Caching {#sec-caching}

## Overview

Caching stores model output so that repeated identical requests can be served without additional API calls, saving both time and expense. Caching is also often useful during development---for example, when you are iterating on a scorer you may want model outputs served from the cache, both to save time and to increase determinism.

## Caching Basics

Use the `cache` parameter on calls to `generate()` to activate the use of the cache. The keys for caching (what determines if a request can be fulfilled from the cache) are as follows:

- Model name and base URL (e.g. `openai/gpt-4-turbo`)
- Model prompt (i.e. message history)
- Epoch number (for ensuring distinct generations per epoch)
- Generate configuration (e.g. `temperature`, `top_p`, etc.)
- Active `tools` and `tool_choice`

If all of these inputs are identical, then the model response will be served from the cache. By default, model responses are cached for 1 week (see [Cache Policy](#cache-policy) below for details on customising this).

For example, here we are iterating on our self critique template, so we cache the main call to `generate()`:

``` python
@task
def theory_of_mind():
    return Task(
        dataset=example_dataset("theory_of_mind"),
        plan=[
            chain_of_thought(),
            generate(cache = True),
            self_critique(CRITIQUE_TEMPLATE)
        ],
        scorer=model_graded_fact(),
    )
```

You can similarly do this with the `generate` function passed into a `Solver`:

``` python
@solver
def custom_solver(cache):

    async def solve(state, generate):

        # (custom solver logic prior to generate)

        return await generate(state, cache)

    return solve
```

You don't strictly need to provide a `cache` argument for a custom solver that uses caching, but it's generally good practice to enable users of the function to control caching behaviour.
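
For instance (a minimal sketch reusing the `custom_solver()` definition above together with the earlier `theory_of_mind` task), a caller can then decide whether and how to cache:

``` python
@task
def theory_of_mind_cached():
    return Task(
        dataset=example_dataset("theory_of_mind"),
        plan=[
            chain_of_thought(),
            custom_solver(cache=True),
            self_critique(CRITIQUE_TEMPLATE)
        ],
        scorer=model_graded_fact(),
    )
```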

You can also use caching with lower-level `generate()` calls (e.g. a model instance you have obtained with `get_model()`). There are some special considerations around epochs for this case; see the [Model API](#model-api) section below for further discussion.

### Model Versions

The model name (e.g. `openai/gpt-4-turbo`) is used as part of the cache key. Note though that many model names are aliases to specific model versions. For example, `gpt-4` and `gpt-4-turbo` may resolve to different versions over time as updates are released.

If you want the cache to be invalidated when a model version is updated, use an explicitly versioned model name. For example:

``` bash
$ inspect eval ctf.py --model openai/gpt-4-turbo-2024-04-09
```

If you do this, then when a new version of the model is deployed, calls will go to the model rather than being resolved from the cache.

## Cache Policy {#cache-policy}

By default, if you specify `cache = True` then the cache will expire in 1 week. You can customise this by passing a `CachePolicy` rather than a boolean. For example:

``` python
cache = CachePolicy(expiry="3h")
cache = CachePolicy(expiry="4D")
cache = CachePolicy(expiry="2W")
cache = CachePolicy(expiry="3M")
```

You can use `s`, `m`, `h`, `D`, `W`, `M`, and `Y` as abbreviations for `expiry` values.

If you want the cache to *never* expire, specify `None`. For example:

``` python
cache = CachePolicy(expiry = None)
```

You can also define scopes for cache expiration (e.g. cache for a specific task or usage pattern). Use the `scopes` parameter to add named scopes to the cache key:

``` python
cache = CachePolicy(
    expiry = "1M",
    scopes = {"role": "attacker", "team": "red"}
)
```

## Model API {#model-api}

You can use the `cache` parameter in lower-level Model API calls (e.g. models obtained via `get_model()`). For example:

``` python
model = get_model("anthropic/claude-3-opus-20240229")
output = await model.generate(input, cache = True)
```

If you are using Model APIs directly and your evaluation has multiple epochs, you will likely also want to add an `epoch` to the `CachePolicy` to ensure that outputs vary across epochs. For example, in a custom scorer:

``` python
@scorer(metrics=[accuracy(), bootstrap_std()])
def custom_scorer(model: str | Model | None = None):

    # resolve model
    grader_model = get_model(model)

    async def score(state: TaskState, target: Target):

        # (create input for grading)

        # use caching but also forward epoch
        output = await grader_model.generate(
            input,
            cache = CachePolicy(
                expiry = "4W",
                epoch = state.epoch
            )
        )

        # (use output to determine score value)

        return value

    return score
```

## Management

Use the `inspect cache` command to view the current contents of the cache, prune expired entries, or clear entries entirely. For example:

``` bash
# list the current contents of the cache
$ inspect cache list

# clear the cache (globally or by model)
$ inspect cache clear
$ inspect cache clear --model openai/gpt-4-turbo-2024-04-09

# prune expired entries from the cache
$ inspect cache list --prunable
$ inspect cache prune
$ inspect cache prune --model openai/gpt-4-turbo-2024-04-09
```

See `inspect cache --help` for further details on management commands.
