
Commit 2eae7e3

Merge remote-tracking branch 'origin/main' into feature/display-interaction
jjallaire committed Jan 9, 2025
2 parents 7b1c457 + e2add88 commit 2eae7e3
Showing 42 changed files with 840 additions and 328 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/build.yml
@@ -90,4 +90,4 @@ jobs:
- name: Delete knowingly duplicated files
run: rm src/inspect_ai/_view/www/favicon.svg

- uses: hynek/build-and-inspect-python-package@v1
- uses: hynek/build-and-inspect-python-package@v2
18 changes: 9 additions & 9 deletions .pre-commit-config.yaml
@@ -4,15 +4,15 @@
default_language_version:
python: python3.11
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.8.5
hooks:
# Run the linter.
- id: ruff
args: [ --fix ]
# Run the formatter.
- id: ruff-format
- repo: https://github.com/pre-commit/pre-commit-hooks
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.9.0
hooks:
# Run the linter.
- id: ruff
args: [--fix]
# Run the formatter.
- id: ruff-format
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.5.0
hooks:
- id: check-added-large-files
17 changes: 16 additions & 1 deletion CHANGELOG.md
@@ -2,11 +2,26 @@

## Unreleased

- Print model conversations to terminal with `--display=conversation` (was formerly `--trace`, which is now deprecated).

## v0.3.57 (09 January 2025)

- [Tracing API](https://inspect.ai-safety-institute.org.uk/tracing.html#tracing-api) for custom trace logging.
- Inspect View: never truncate tool result images and display at default width of 800px.
- Inspect View: display tool error messages in transcript when tool errors occur.
- Inspect View: display any completed samples even if the task fails because of an error.
- Inspect View: don't display the 'input' column heading if there isn't an input.
- Open AI: Handle additional bad request status codes (mapping them to appropriate `StopReason`)
- Open AI: Use new `max_completion_tokens` option for o1 full.
- Web Browser: raise error when both `error` and `web_at` fields are present in response.
- Sandboxes: Apply dataset filters (limit and sample id) prior to sandbox initialisation.
- Docker: Prevent issue with container/project names that have a trailing underscore.
- Store: initialise `Store` from existing dictionary.
- Log: provide `metadata_as` and `store_as` typed accessors for sample metadata and store.
- Tool parameters with a default of `None` are now supported.
- Print model conversations to terminal with `--display=conversation` (was formerly `--trace`, which is now deprecated).
- More fine-grained HTML escaping for sample transcripts displayed in the terminal.
- Bugfix: prevent errors when a state or storage value uses a tilde or slash in the key name.
- Bugfix: Include input in sample summary when the sample input contains a simple string.

## v0.3.56 (01 January 2025)

2 changes: 1 addition & 1 deletion docs/scorers.qmd
@@ -100,7 +100,7 @@ def model_graded_qa(

The default model graded QA scorer is tuned to grade answers to open ended questions. The default `template` and `instructions` ask the model to produce a grade in the format `GRADE: C` or `GRADE: I`, and this grade is extracted using the default `grade_pattern` regular expression. The grading is by default done with the model currently being evaluated. There are a few ways you can customise the default behaviour:

1. Provide alternate `instructions`—the default instructions ass the model to use chain of thought reasoning and provide grades in the format `GRADE: C` or `GRADE: I`. Note that if you provide instructions that ask the model to format grades in a different way, you will also want to customise the `grade_pattern`.
1. Provide alternate `instructions`—the default instructions ask the model to use chain of thought reasoning and provide grades in the format `GRADE: C` or `GRADE: I`. Note that if you provide instructions that ask the model to format grades in a different way, you will also want to customise the `grade_pattern`.
2. Specify `include_history = True` to include the full chat history in the presented question (by default only the original sample input is presented). You may optionally instead pass a function that enables customising the presentation of the chat history.
3. Specify `partial_credit = True` to prompt the model to assign partial credit to answers that are not entirely right but come close (metrics by default convert this to a value of 0.5). Note that this parameter is only valid when using the default `instructions`.
4. Specify an alternate `model` to perform the grading (e.g. a more powerful model or a model fine tuned for grading).
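Combining several of these options, here is a minimal sketch (the grader model name is illustrative, and the default `instructions` are retained so that `partial_credit` remains valid):

``` python
from inspect_ai.scorer import model_graded_qa

# grade with a separate model, presenting the full chat history
# and allowing partial credit (the model name is an assumption)
scorer = model_graded_qa(
    include_history=True,
    partial_credit=True,
    model="openai/gpt-4o",
)
```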
40 changes: 38 additions & 2 deletions docs/tracing.qmd
@@ -31,7 +31,7 @@ Trace logs are written using [JSON Lines](https://jsonlines.org/) format and are
inspect trace dump trace-86396.log.gz
```

## Anomalies
## Anomalies {#anomalies}

If an evaluation is running and is not terminating, you can execute the following command to list instances of actions (e.g. model API generates, docker compose commands, tool calls, etc.) that are still running:

@@ -55,4 +55,40 @@ By default, the `inspect trace anomalies` command prints only currently running
inspect trace anomalies --all
```

Note that errors and timeouts are not by themselves evidence of problems, since both occur in the normal course of running evaluations (e.g. model generate calls can return errors that are retried and Docker or S3 can also return retryable errors or timeout when they are under heavy load).

## Tracing API {#tracing-api}

In addition to the standard set of actions which are trace logged, you can do your own custom trace logging using the `trace_action()` and `trace_message()` APIs. Trace logging is a great way to make sure that logging context is *always captured* (since the last 10 trace logs are always available) without cluttering up the console or eval transcripts.

### trace_action()

Use the `trace_action()` context manager to collect data on the resolution (e.g. succeeded, cancelled, failed, timed out, etc.) and duration of actions. For example, let's say you are interacting with a remote content database:

``` python
from inspect_ai.util import trace_action

from logging import getLogger
logger = getLogger(__name__)

server = "https://contentdb.example.com"
query = "<content-db-query>"

with trace_action(logger, "ContentDB", f"{server}: {query}"):
    # perform content database query
    ...
```

Your custom trace actions will be reported alongside the standard traced actions in `inspect trace anomalies`, `inspect trace dump`, etc.

### trace_message()

Use the `trace_message()` function to trace events that don't fall into the enter/exit pattern supported by `trace_action()`. For example, let's say you want to track every invocation of a custom tool:

``` python
from inspect_ai.util import trace_message

from logging import getLogger
logger = getLogger(__name__)

trace_message(logger, "MyTool", "message related to tool")
```
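As with standard library logging, the message can be a format string whose arguments are passed through, for example (the input value here is illustrative):

``` python
trace_message(logger, "MyTool", "invoked with input %s", "example-input")
```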
11 changes: 11 additions & 0 deletions docs/typing.qmd
@@ -16,3 +16,14 @@ The sample store and sample metadata interfaces are weakly typed to accommodate

{{< include _metadata_typing.md >}}

## Log Samples

The `store_as()` and `metadata_as()` typed accessors are also available when reading samples from the eval log. Continuing from the examples above, you access typed interfaces as follows from an `EvalLog`:

```python
# typed store
activity = log.samples[0].store_as(Activity)

# typed metadata
metadata = log.samples[0].metadata_as(PopularityMetadata)
```
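For reference, here is a minimal sketch of what the typed classes from the earlier examples might look like (these stand-in definitions are assumptions, not the originals):

``` python
from pydantic import BaseModel, Field

from inspect_ai.util import StoreModel

class Activity(StoreModel):
    active: bool = Field(default=False)
    tries: int = Field(default=0)

class PopularityMetadata(BaseModel):
    category: str
    label: str
```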
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -129,7 +129,7 @@ dev = [
"pytest-cov",
"pytest-dotenv",
"pytest-xdist",
"ruff==0.8.5", # match version specified in .pre-commit-config.yaml
"ruff==0.9.0", # match version specified in .pre-commit-config.yaml
"textual-dev>=0.86.2",
"types-PyYAML",
"types-beautifulsoup4",
2 changes: 1 addition & 1 deletion src/inspect_ai/_display/core/panel.py
@@ -112,7 +112,7 @@ def tasks_title(completed: int, total: int) -> str:
def task_title(profile: TaskProfile, show_model: bool) -> str:
eval_epochs = profile.eval_config.epochs or 1
epochs = f" x {profile.eval_config.epochs}" if eval_epochs > 1 else ""
samples = f"{profile.samples//eval_epochs:,}{epochs} sample{'s' if profile.samples != 1 else ''}"
samples = f"{profile.samples // eval_epochs:,}{epochs} sample{'s' if profile.samples != 1 else ''}"
title = f"{registry_unqualified_name(profile.name)} ({samples})"
if show_model:
title = f"{title}: {profile.model}"
27 changes: 16 additions & 11 deletions src/inspect_ai/_eval/run.py
@@ -42,7 +42,7 @@
from .task.run import TaskRunOptions, task_run
from .task.rundir import task_run_dir_switching
from .task.sandbox import TaskSandboxEnvironment, resolve_sandbox_for_task
from .task.util import task_run_dir
from .task.util import slice_dataset, task_run_dir

log = logging.getLogger(__name__)

@@ -70,12 +70,23 @@ async def eval_run(
# get cwd before switching to task dir
eval_wd = os.getcwd()

# ensure sample ids
for resolved_task in tasks:
# add sample ids to dataset if they aren't there (start at 1 not 0)
task = resolved_task.task
for id, sample in enumerate(task.dataset):
if sample.id is None:
sample.id = id + 1

# Ensure sample ids are unique
ensure_unique_ids(task.dataset)

# run startup pass for the sandbox environments
shutdown_sandbox_environments: Callable[[], Awaitable[None]] | None = None
if has_sandbox:
cleanup = eval_config.sandbox_cleanup is not False
shutdown_sandbox_environments = await startup_sandbox_environments(
resolve_sandbox_environment(eval_sandbox), tasks, cleanup
resolve_sandbox_environment(eval_sandbox), tasks, eval_config, cleanup
)

# resolve solver and solver spec
@@ -146,14 +157,6 @@
else:
task.fail_on_error = task_eval_config.fail_on_error

# add sample ids to dataset if they aren't there (start at 1 not 0)
for id, sample in enumerate(task.dataset):
if sample.id is None:
sample.id = id + 1

# Ensure sample ids are unique
ensure_unique_ids(task.dataset)

# create and track the logger
logger = TaskLogger(
task_name=task.name,
@@ -340,13 +343,15 @@ async def worker() -> None:
async def startup_sandbox_environments(
eval_sandbox: SandboxEnvironmentSpec | None,
tasks: list[ResolvedTask],
config: EvalConfig,
cleanup: bool,
) -> Callable[[], Awaitable[None]]:
# find unique sandboxenvs
sandboxenvs: Set[TaskSandboxEnvironment] = set()
for task in tasks:
# resolve each sample and add to sandboxenvs
for sample in task.task.dataset:
dataset = slice_dataset(task.task.dataset, config.limit, config.sample_id)
for sample in dataset:
sandbox = resolve_sandbox_for_task(eval_sandbox, task.task, sample)
if sandbox is not None and sandbox not in sandboxenvs:
sandboxenvs.add(sandbox)
2 changes: 1 addition & 1 deletion src/inspect_ai/_util/datetime.py
@@ -4,7 +4,7 @@

def iso_now(
timespec: Literal[
"auto", "hours", "minutes", "seconds", "milliseconds" "microseconds"
"auto", "hours", "minutes", "seconds", "milliseconds", "microseconds"
] = "seconds",
) -> str:
return datetime.now().astimezone().isoformat(timespec=timespec)
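The fix above addresses a classic Python gotcha: adjacent string literals concatenate implicitly, so the missing comma silently fused two `Literal` options into one. A quick illustration:

``` python
# without the comma, the two literals fuse into a single string
assert "milliseconds" "microseconds" == "millisecondsmicroseconds"
```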
2 changes: 1 addition & 1 deletion src/inspect_ai/_util/deprecation.py
@@ -174,7 +174,7 @@ def default_deprecation_msg(

_qual = getattr(obj, "__qualname__", "") or ""
if _qual.endswith(".__init__") or _qual.endswith(".__new__"):
_obj = f' class ({_qual.rsplit(".", 1)[0]})'
_obj = f" class ({_qual.rsplit('.', 1)[0]})"
elif _qual and _obj:
_obj += f" ({_qual})"

12 changes: 11 additions & 1 deletion src/inspect_ai/_util/json.py
@@ -103,10 +103,20 @@ def json_changes(
paths = json_change.path.split("/")[1:]
replaced = before
for path in paths:
index: Any = int(path) if path.isnumeric() else path
decoded_path = decode_json_pointer_segment(path)
index: Any = (
int(decoded_path) if decoded_path.isnumeric() else decoded_path
)
replaced = replaced[index]
json_change.replaced = replaced
changes.append(json_change)
return changes
else:
return None


def decode_json_pointer_segment(segment: str) -> str:
"""Decode a single JSON Pointer segment."""
# JSON pointers encode ~ and / because they are special characters
# this decodes these values (https://www.rfc-editor.org/rfc/rfc6901)
return segment.replace("~1", "/").replace("~0", "~")
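A quick sketch of the decoding behaviour (the replacement order matters: `~1` before `~0`, so `~01` decodes to `~1` rather than `/`):

``` python
assert decode_json_pointer_segment("a~1b") == "a/b"  # ~1 -> /
assert decode_json_pointer_segment("m~0n") == "m~n"  # ~0 -> ~
assert decode_json_pointer_segment("~01") == "~1"    # order prevents over-decoding
```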
3 changes: 2 additions & 1 deletion src/inspect_ai/_util/logger.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
import atexit
import os
import re
from logging import (
DEBUG,
INFO,
@@ -182,7 +183,7 @@ def notify_logger_record(record: LogRecord, write: bool) -> None:
if write:
transcript()._event(LoggerEvent(message=LoggingMessage.from_log_record(record)))
global _rate_limit_count
if (record.levelno <= INFO and "429" in record.getMessage()) or (
if (record.levelno <= INFO and re.search(r"\b429\b", record.getMessage())) or (
record.levelno == DEBUG
# See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html#validating-retry-attempts
# for boto retry logic / log messages (this is tracking standard or adaptive retries)
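The switch from a substring test to `re.search(r"\b429\b", ...)` avoids false positives when 429 appears inside a longer number. A small illustration:

``` python
import re

message = "request id 14290 completed"
assert "429" in message                    # old check: false positive
assert not re.search(r"\b429\b", message)  # new check: no match
assert re.search(r"\b429\b", "HTTP 429 Too Many Requests")
```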
29 changes: 29 additions & 0 deletions src/inspect_ai/_util/trace.py
@@ -33,6 +33,22 @@ def inspect_trace_file() -> Path:
def trace_action(
logger: Logger, action: str, message: str, *args: Any, **kwargs: Any
) -> Generator[None, None, None]:
"""Trace a long running or poentially unreliable action.
Trace actions for which you want to collect data on the resolution
(e.g. succeeded, cancelled, failed, timed out, etc.) and duration of.
Traces are written to the `TRACE` log level (which is just below
`HTTP` and `INFO`). List and read trace logs with `inspect trace list`
and related commands (see `inspect trace --help` for details).
Args:
logger (Logger): Logger to use for tracing (e.g. from `getLogger(__name__)`)
action (str): Name of action to trace (e.g. 'Model', 'Subprocess', etc.)
message (str): Message describing action (can be a format string w/ args or kwargs)
*args (Any): Positional arguments for `message` format string.
**kwargs (Any): Named args for `message` format string.
"""
trace_id = uuid()
start_monotonic = time.monotonic()
start_wall = time.time()
@@ -117,6 +133,19 @@ def trace_message(event: str) -> str:
def trace_message(
logger: Logger, category: str, message: str, *args: Any, **kwargs: Any
) -> None:
"""Log a message using the TRACE log level.
The `TRACE` log level is just below `HTTP` and `INFO`). List and
read trace logs with `inspect trace list` and related commands
(see `inspect trace --help` for details).
Args:
logger (Logger): Logger to use for tracing (e.g. from `getLogger(__name__)`)
category (str): Category of trace message.
message (str): Trace message (can be a format string w/ args or kwargs)
*args (Any): Positional arguments for `message` format string.
**kwargs (Any): Named args for `message` format string.
"""
logger.log(TRACE, f"[{category}] {message}", *args, **kwargs)


Expand Down
43 changes: 36 additions & 7 deletions src/inspect_ai/_util/transcript.py
@@ -1,4 +1,5 @@
import html
import re
from typing import Any

from rich.align import AlignMethod
@@ -19,13 +20,43 @@ def transcript_code_theme() -> str:
def transcript_markdown(content: str, *, escape: bool = False) -> Markdown:
code_theme = transcript_code_theme()
return Markdown(
html.escape(content) if escape else content,
html_escape_markdown(content) if escape else content,
code_theme=code_theme,
inline_code_lexer="python",
inline_code_theme=code_theme,
)


def html_escape_markdown(content: str) -> str:
"""Escape markdown lines that aren't in a code block."""
codeblock_pattern = re.compile("`{3,}")
current_codeblock = ""
escaped: list[str] = []
lines = content.splitlines()
for line in lines:
# look for matching end of codeblock
if current_codeblock:
if current_codeblock in line:
current_codeblock = ""
escaped.append(line)
continue

# look for beginning of codeblock
match = codeblock_pattern.search(line)
if match:
current_codeblock = match[0]
escaped.append(line)
continue

# escape if we are not in a codeblock
if current_codeblock:
escaped.append(line)
else:
escaped.append(html.escape(line, quote=False))

return "\n".join(escaped)


def set_transcript_markdown_options(markdown: Markdown) -> None:
code_theme = transcript_code_theme()
markdown.code_theme = code_theme
@@ -89,12 +120,10 @@ def transcript_function(function: str, arguments: dict[str, Any]) -> RenderableT
return transcript_markdown("```python\n" + call + "\n```\n")


DOUBLE_LINE = Box(
" ══ \n" " \n" " \n" " \n" " \n" " \n" " \n" " \n"
)
DOUBLE_LINE = Box(" ══ \n \n \n \n \n \n \n \n")

LINE = Box(" ── \n" " \n" " \n" " \n" " \n" " \n" " \n" " \n")
LINE = Box(" ── \n \n \n \n \n \n \n \n")

DOTTED = Box(" ·· \n" " \n" " \n" " \n" " \n" " \n" " \n" " \n")
DOTTED = Box(" ·· \n \n \n \n \n \n \n \n")

NOBORDER = Box(" \n" " \n" " \n" " \n" " \n" " \n" " \n" " \n")
NOBORDER = Box(" \n \n \n \n \n \n \n \n")
