
Commit 2eae7e3

Merge remote-tracking branch 'origin/main' into feature/display-interaction
jjallaire committed Jan 9, 2025
2 parents 7b1c457 + e2add88 commit 2eae7e3
Showing 42 changed files with 840 additions and 328 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/build.yml
@@ -90,4 +90,4 @@ jobs:
- name: Delete knowingly duplicated files
run: rm src/inspect_ai/_view/www/favicon.svg

- uses: hynek/build-and-inspect-python-package@v1
- uses: hynek/build-and-inspect-python-package@v2
18 changes: 9 additions & 9 deletions .pre-commit-config.yaml
@@ -4,15 +4,15 @@
default_language_version:
python: python3.11
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.8.5
hooks:
# Run the linter.
- id: ruff
args: [ --fix ]
# Run the formatter.
- id: ruff-format
- repo: https://github.com/pre-commit/pre-commit-hooks
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.9.0
hooks:
# Run the linter.
- id: ruff
args: [--fix]
# Run the formatter.
- id: ruff-format
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.5.0
hooks:
- id: check-added-large-files
17 changes: 16 additions & 1 deletion CHANGELOG.md
@@ -2,11 +2,26 @@

## Unreleased

- Print model conversations to terminal with `--display=conversation` (was formerly `--trace`, which is now deprecated).

## v0.3.57 (09 January 2025)

- [Tracing API](https://inspect.ai-safety-institute.org.uk/tracing.html#tracing-api) for custom trace logging.
- Inspect View: never truncate tool result images and display at default width of 800px.
- Inspect View: display tool error messages in transcript when tool errors occur.
- Inspect View: display any completed samples even if the task fails because of an error.
- Inspect View: don't display the 'input' column heading if there isn't an input.
- Open AI: Handle additional bad request status codes (mapping them to appropriate `StopReason`)
- Open AI: Use new `max_completion_tokens` option for o1 full.
- Web Browser: raise error when both `error` and `web_at` fields are present in response.
- Sandboxes: Apply dataset filters (limit and sample id) prior to sandbox initialisation.
- Docker: Prevent issue with container/project names that have a trailing underscore.
- Store: initialise `Store` from existing dictionary.
- Log: provide `metadata_as` and `store_as` typed accessors for sample metadata and store.
- Tool parameters with a default of `None` are now supported.
- Print model conversations to terminal with `--display=conversation` (was formerly `--trace`, which is now deprecated).
- More fine-grained HTML escaping for sample transcripts displayed in the terminal.
- Bugfix: prevent errors when a state or storage value uses a tilde or slash in the key name.
- Bugfix: Include input in sample summary when the sample input contains a simple string.

## v0.3.56 (01 January 2025)

2 changes: 1 addition & 1 deletion docs/scorers.qmd
@@ -100,7 +100,7 @@ def model_graded_qa(

The default model graded QA scorer is tuned to grade answers to open ended questions. The default `template` and `instructions` ask the model to produce a grade in the format `GRADE: C` or `GRADE: I`, and this grade is extracted using the default `grade_pattern` regular expression. The grading is by default done with the model currently being evaluated. There are a few ways you can customise the default behaviour:

1. Provide alternate `instructions`—the default instructions ass the model to use chain of thought reasoning and provide grades in the format `GRADE: C` or `GRADE: I`. Note that if you provide instructions that ask the model to format grades in a different way, you will also want to customise the `grade_pattern`.
1. Provide alternate `instructions`—the default instructions ask the model to use chain of thought reasoning and provide grades in the format `GRADE: C` or `GRADE: I`. Note that if you provide instructions that ask the model to format grades in a different way, you will also want to customise the `grade_pattern`.
2. Specify `include_history = True` to include the full chat history in the presented question (by default only the original sample input is presented). You may optionally instead pass a function that enables customising the presentation of the chat history.
3. Specify `partial_credit = True` to prompt the model to assign partial credit to answers that are not entirely right but come close (metrics by default convert this to a value of 0.5). Note that this parameter is only valid when using the default `instructions`.
4. Specify an alternate `model` to perform the grading (e.g. a more powerful model or a model fine tuned for grading).
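Combining several of these options, here is a minimal sketch (the grader model name is illustrative, and the default `instructions` are retained so that `partial_credit` remains valid):

``` python
from inspect_ai.scorer import model_graded_qa

# grade with a separate model, presenting the full chat history
# and allowing partial credit (the model name is an assumption)
scorer = model_graded_qa(
    include_history=True,
    partial_credit=True,
    model="openai/gpt-4o",
)
```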
40 changes: 38 additions & 2 deletions docs/tracing.qmd
@@ -31,7 +31,7 @@ Trace logs are written using [JSON Lines](https://jsonlines.org/) format and are
inspect trace dump trace-86396.log.gz
```

## Anomalies
## Anomalies {#anomalies}

If an evaluation is running and is not terminating, you can execute the following command to list instances of actions (e.g. model API generates, docker compose commands, tool calls, etc.) that are still running:

@@ -55,4 +55,40 @@ By default, the `inspect trace anomalies` command prints only currently running
inspect trace anomalies --all
```

Note that errors and timeouts are not by themselves evidence of problems, since both occur in the normal course of running evaluations (e.g. model generate calls can return errors that are retried and Docker or S3 can also return retryable errors or timeout when they are under heavy load).

## Tracing API {#tracing-api}

In addition to the standard set of actions which are trace logged, you can do your own custom trace logging using the `trace_action()` and `trace_message()` APIs. Trace logging is a great way to make sure that logging context is *always captured* (since the last 10 trace logs are always available) without cluttering up the console or eval transcripts.

### trace_action()

Use the `trace_action()` context manager to collect data on the resolution (e.g. succeeded, cancelled, failed, timed out, etc.) and duration of actions. For example, let's say you are interacting with a remote content database:

``` python
from inspect_ai.util import trace_action

from logging import getLogger
logger = getLogger(__name__)

server = "https://contentdb.example.com"
query = "<content-db-query>"

with trace_action(logger, "ContentDB", f"{server}: {query}"):
    # perform content database query
    ...
```

Your custom trace actions will be reported alongside the standard traced actions in `inspect trace anomalies`, `inspect trace dump`, etc.

### trace_message()

Use the `trace_message()` function to trace events that don't fall into the enter/exit pattern supported by `trace_action()`. For example, let's say you want to track every invocation of a custom tool:

``` python
from inspect_ai.util import trace_message

from logging import getLogger
logger = getLogger(__name__)

trace_message(logger, "MyTool", "message related to tool")
```
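As with standard library logging, the message can be a format string whose arguments are passed through, for example (the input value here is illustrative):

``` python
trace_message(logger, "MyTool", "invoked with input %s", "example-input")
```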
11 changes: 11 additions & 0 deletions docs/typing.qmd
@@ -16,3 +16,14 @@ The sample store and sample metadata interfaces are weakly typed to accommodate

{{< include _metadata_typing.md >}}

## Log Samples

The `store_as()` and `metadata_as()` typed accessors are also available when reading samples from the eval log. Continuing from the examples above, you access typed interfaces as follows from an `EvalLog`:

```python
# typed store
activity = log.samples[0].store_as(Activity)

# typed metadata
metadata = log.samples[0].metadata_as(PopularityMetadata)
```
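For reference, here is a minimal sketch of what the typed classes from the earlier examples might look like (these stand-in definitions are assumptions, not the originals):

``` python
from pydantic import BaseModel, Field

from inspect_ai.util import StoreModel

class Activity(StoreModel):
    active: bool = Field(default=False)
    tries: int = Field(default=0)

class PopularityMetadata(BaseModel):
    category: str
    label: str
```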
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -129,7 +129,7 @@ dev = [
"pytest-cov",
"pytest-dotenv",
"pytest-xdist",
"ruff==0.8.5", # match version specified in .pre-commit-config.yaml
"ruff==0.9.0", # match version specified in .pre-commit-config.yaml
"textual-dev>=0.86.2",
"types-PyYAML",
"types-beautifulsoup4",
2 changes: 1 addition & 1 deletion src/inspect_ai/_display/core/panel.py
@@ -112,7 +112,7 @@ def tasks_title(completed: int, total: int) -> str:
def task_title(profile: TaskProfile, show_model: bool) -> str:
eval_epochs = profile.eval_config.epochs or 1
epochs = f" x {profile.eval_config.epochs}" if eval_epochs > 1 else ""
samples = f"{profile.samples//eval_epochs:,}{epochs} sample{'s' if profile.samples != 1 else ''}"
samples = f"{profile.samples // eval_epochs:,}{epochs} sample{'s' if profile.samples != 1 else ''}"
title = f"{registry_unqualified_name(profile.name)} ({samples})"
if show_model:
title = f"{title}: {profile.model}"
27 changes: 16 additions & 11 deletions src/inspect_ai/_eval/run.py
@@ -42,7 +42,7 @@
from .task.run import TaskRunOptions, task_run
from .task.rundir import task_run_dir_switching
from .task.sandbox import TaskSandboxEnvironment, resolve_sandbox_for_task
from .task.util import task_run_dir
from .task.util import slice_dataset, task_run_dir

log = logging.getLogger(__name__)

@@ -70,12 +70,23 @@ async def eval_run(
# get cwd before switching to task dir
eval_wd = os.getcwd()

# ensure sample ids
for resolved_task in tasks:
# add sample ids to dataset if they aren't there (start at 1 not 0)
task = resolved_task.task
for id, sample in enumerate(task.dataset):
if sample.id is None:
sample.id = id + 1

# Ensure sample ids are unique
ensure_unique_ids(task.dataset)

# run startup pass for the sandbox environments
shutdown_sandbox_environments: Callable[[], Awaitable[None]] | None = None
if has_sandbox:
cleanup = eval_config.sandbox_cleanup is not False
shutdown_sandbox_environments = await startup_sandbox_environments(
resolve_sandbox_environment(eval_sandbox), tasks, cleanup
resolve_sandbox_environment(eval_sandbox), tasks, eval_config, cleanup
)

# resolve solver and solver spec
@@ -146,14 +157,6 @@
else:
task.fail_on_error = task_eval_config.fail_on_error

# add sample ids to dataset if they aren't there (start at 1 not 0)
for id, sample in enumerate(task.dataset):
if sample.id is None:
sample.id = id + 1

# Ensure sample ids are unique
ensure_unique_ids(task.dataset)

# create and track the logger
logger = TaskLogger(
task_name=task.name,
@@ -340,13 +343,15 @@ async def worker() -> None:
async def startup_sandbox_environments(
eval_sandbox: SandboxEnvironmentSpec | None,
tasks: list[ResolvedTask],
config: EvalConfig,
cleanup: bool,
) -> Callable[[], Awaitable[None]]:
# find unique sandboxenvs
sandboxenvs: Set[TaskSandboxEnvironment] = set()
for task in tasks:
# resolve each sample and add to sandboxenvs
for sample in task.task.dataset:
dataset = slice_dataset(task.task.dataset, config.limit, config.sample_id)
for sample in dataset:
sandbox = resolve_sandbox_for_task(eval_sandbox, task.task, sample)
if sandbox is not None and sandbox not in sandboxenvs:
sandboxenvs.add(sandbox)
2 changes: 1 addition & 1 deletion src/inspect_ai/_util/datetime.py
@@ -4,7 +4,7 @@

def iso_now(
timespec: Literal[
"auto", "hours", "minutes", "seconds", "milliseconds" "microseconds"
"auto", "hours", "minutes", "seconds", "milliseconds", "microseconds"
] = "seconds",
) -> str:
return datetime.now().astimezone().isoformat(timespec=timespec)
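The fix above addresses a classic Python gotcha: adjacent string literals concatenate implicitly, so the missing comma silently fused two `Literal` options into one. A quick illustration:

``` python
# without the comma, the two literals fuse into a single string
assert "milliseconds" "microseconds" == "millisecondsmicroseconds"
```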
2 changes: 1 addition & 1 deletion src/inspect_ai/_util/deprecation.py
@@ -174,7 +174,7 @@ def default_deprecation_msg(

_qual = getattr(obj, "__qualname__", "") or ""
if _qual.endswith(".__init__") or _qual.endswith(".__new__"):
_obj = f' class ({_qual.rsplit(".", 1)[0]})'
_obj = f" class ({_qual.rsplit('.', 1)[0]})"
elif _qual and _obj:
_obj += f" ({_qual})"

12 changes: 11 additions & 1 deletion src/inspect_ai/_util/json.py
@@ -103,10 +103,20 @@ def json_changes(
paths = json_change.path.split("/")[1:]
replaced = before
for path in paths:
index: Any = int(path) if path.isnumeric() else path
decoded_path = decode_json_pointer_segment(path)
index: Any = (
int(decoded_path) if decoded_path.isnumeric() else decoded_path
)
replaced = replaced[index]
json_change.replaced = replaced
changes.append(json_change)
return changes
else:
return None


def decode_json_pointer_segment(segment: str) -> str:
"""Decode a single JSON Pointer segment."""
# JSON pointers encode ~ and / because they are special characters
# this decodes these values (https://www.rfc-editor.org/rfc/rfc6901)
return segment.replace("~1", "/").replace("~0", "~")
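A quick sketch of the decoding behaviour (the replacement order matters: `~1` before `~0`, so `~01` decodes to `~1` rather than `/`):

``` python
assert decode_json_pointer_segment("a~1b") == "a/b"  # ~1 -> /
assert decode_json_pointer_segment("m~0n") == "m~n"  # ~0 -> ~
assert decode_json_pointer_segment("~01") == "~1"    # order prevents over-decoding
```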
3 changes: 2 additions & 1 deletion src/inspect_ai/_util/logger.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
import atexit
import os
import re
from logging import (
DEBUG,
INFO,
@@ -182,7 +183,7 @@ def notify_logger_record(record: LogRecord, write: bool) -> None:
if write:
transcript()._event(LoggerEvent(message=LoggingMessage.from_log_record(record)))
global _rate_limit_count
if (record.levelno <= INFO and "429" in record.getMessage()) or (
if (record.levelno <= INFO and re.search(r"\b429\b", record.getMessage())) or (
record.levelno == DEBUG
# See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html#validating-retry-attempts
# for boto retry logic / log messages (this is tracking standard or adaptive retries)
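The switch from a substring test to `re.search(r"\b429\b", ...)` avoids false positives when 429 appears inside a longer number. A small illustration:

``` python
import re

message = "request id 14290 completed"
assert "429" in message                    # old check: false positive
assert not re.search(r"\b429\b", message)  # new check: no match
assert re.search(r"\b429\b", "HTTP 429 Too Many Requests")
```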
29 changes: 29 additions & 0 deletions src/inspect_ai/_util/trace.py
@@ -33,6 +33,22 @@ def inspect_trace_file() -> Path:
def trace_action(
logger: Logger, action: str, message: str, *args: Any, **kwargs: Any
) -> Generator[None, None, None]:
"""Trace a long running or poentially unreliable action.
Trace actions for which you want to collect data on the resolution
(e.g. succeeded, cancelled, failed, timed out, etc.) and duration of.
Traces are written to the `TRACE` log level (which is just below
`HTTP` and `INFO`). List and read trace logs with `inspect trace list`
and related commands (see `inspect trace --help` for details).
Args:
logger (Logger): Logger to use for tracing (e.g. from `getLogger(__name__)`)
action (str): Name of action to trace (e.g. 'Model', 'Subprocess', etc.)
message (str): Message describing action (can be a format string w/ args or kwargs)
*args (Any): Positional arguments for `message` format string.
**kwargs (Any): Named args for `message` format string.
"""
trace_id = uuid()
start_monotonic = time.monotonic()
start_wall = time.time()
@@ -117,6 +133,19 @@ def trace_message(event: str) -> str:
def trace_message(
logger: Logger, category: str, message: str, *args: Any, **kwargs: Any
) -> None:
"""Log a message using the TRACE log level.
The `TRACE` log level is just below `HTTP` and `INFO`). List and
read trace logs with `inspect trace list` and related commands
(see `inspect trace --help` for details).
Args:
logger (Logger): Logger to use for tracing (e.g. from `getLogger(__name__)`)
category (str): Category of trace message.
message (str): Trace message (can be a format string w/ args or kwargs)
*args (Any): Positional arguments for `message` format string.
**kwargs (Any): Named args for `message` format string.
"""
logger.log(TRACE, f"[{category}] {message}", *args, **kwargs)


Expand Down
43 changes: 36 additions & 7 deletions src/inspect_ai/_util/transcript.py
@@ -1,4 +1,5 @@
import html
import re
from typing import Any

from rich.align import AlignMethod
@@ -19,13 +20,43 @@ def transcript_code_theme() -> str:
def transcript_markdown(content: str, *, escape: bool = False) -> Markdown:
code_theme = transcript_code_theme()
return Markdown(
html.escape(content) if escape else content,
html_escape_markdown(content) if escape else content,
code_theme=code_theme,
inline_code_lexer="python",
inline_code_theme=code_theme,
)


def html_escape_markdown(content: str) -> str:
"""Escape markdown lines that aren't in a code block."""
codeblock_pattern = re.compile("`{3,}")
current_codeblock = ""
escaped: list[str] = []
lines = content.splitlines()
for line in lines:
# look for matching end of codeblock
if current_codeblock:
if current_codeblock in line:
current_codeblock = ""
escaped.append(line)
continue

# look for beginning of codeblock
match = codeblock_pattern.search(line)
if match:
current_codeblock = match[0]
escaped.append(line)
continue

# escape if we are not in a codeblock
if current_codeblock:
escaped.append(line)
else:
escaped.append(html.escape(line, quote=False))

return "\n".join(escaped)


def set_transcript_markdown_options(markdown: Markdown) -> None:
code_theme = transcript_code_theme()
markdown.code_theme = code_theme
@@ -89,12 +120,10 @@ def transcript_function(function: str, arguments: dict[str, Any]) -> RenderableT
return transcript_markdown("```python\n" + call + "\n```\n")


DOUBLE_LINE = Box(
" ══ \n" " \n" " \n" " \n" " \n" " \n" " \n" " \n"
)
DOUBLE_LINE = Box(" ══ \n \n \n \n \n \n \n \n")

LINE = Box(" ── \n" " \n" " \n" " \n" " \n" " \n" " \n" " \n")
LINE = Box(" ── \n \n \n \n \n \n \n \n")

DOTTED = Box(" ·· \n" " \n" " \n" " \n" " \n" " \n" " \n" " \n")
DOTTED = Box(" ·· \n \n \n \n \n \n \n \n")

NOBORDER = Box(" \n" " \n" " \n" " \n" " \n" " \n" " \n" " \n")
NOBORDER = Box(" \n \n \n \n \n \n \n \n")
