
Inference usage of model tuned on task_response tool #124

Closed
leonardmq opened this issue Jan 20, 2025 · 3 comments
Labels: enhancement (New feature or request)

@leonardmq (Contributor)
For a Task with Structured Output, creating a fine-tuning job via the Fine Tune UI for an OpenAI model will compile the dataset in a tool_call format, where the structured output of the task is modeled as arguments to a task_response tool.

If the model is trained to output arguments to a task_response tool, inference-time usage should follow the same format to get the most benefit from the training. If the user calls the model without a task_response tool, for example, or with response_format instead, the model may behave inconsistently due to the mismatch with the training samples.

task_response as a tool does not seem to be documented at the moment.

[Sample format sent to OpenAI for fine-tuning]
{
  "messages":[
    { "role":"system", "content":"..." },
    { "role":"user", "content":"..." },
    {
      "role":"assistant",
      "content":null,
      "tool_calls":[
        {
          "id":"call_1",
          "type":"function",
          "function":{
            "name":"task_response",
            "arguments":"{\"x\": 1, \"y\": 2}"
          }
        }
      ]
    }
  ]
}

To align with the sample format, the code calling the model at inference time might need to look like this:

const completion = await this.openai.chat.completions.create({
  model: 'my-tuned-model',
  messages: [...],

  // response_format: { ... }, // no response_format - the schema lives in the tool definition below

  tool_choice: 'required', // the model must always call the tool
  tools: [
    {
      type: 'function',
      function: {
        name: 'task_response', // define a function tool called task_response, like in the training samples
        strict: true, // strict schema adherence belongs inside the function definition
        parameters: {
          type: 'object',
          properties: { x: { type: 'number' }, y: { type: 'number' } }, // our Task's structured output schema goes here
          ...
        },
      },
    },
  ],
});

// read the response from the function-call arguments of the `task_response` tool call
const output = JSON.parse(completion.choices[0].message.tool_calls[0].function.arguments) as { x: number; y: number };

Could you please clarify how callers are expected to use the tuned model at inference time? I'd be happy to help document this once the intended usage is confirmed.

About the format in general, the OpenAI docs suggest that response_format might be more suitable when the structured-output use case does not require making actual function calls; they also show an example of fine-tuning for structured output by including the serialized JSON in the assistant content. However, the documentation does not elaborate on whether there are any meaningful differences beyond a slightly different interface and slightly more convenient parsing with the SDK.
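
For comparison, here is a rough sketch of what the response_format variant might look like at inference time (an assumption on my part, not confirmed intended usage). It presumes a model fine-tuned with the serialized JSON in the assistant content rather than a tool call; the model id and the x/y schema are placeholders carried over from the example above:

// Hypothetical sketch, not confirmed intended usage: inference with response_format
// instead of a task_response tool, assuming a model fine-tuned with the serialized
// JSON in the assistant `content`. Model id and x/y schema are placeholders.
import OpenAI from 'openai';

const openai = new OpenAI();

const completion = await openai.chat.completions.create({
  model: 'my-tuned-model', // placeholder fine-tuned model id
  messages: [
    { role: 'system', content: '...' },
    { role: 'user', content: '...' },
  ],
  // Structured Outputs via response_format, no tools needed
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'task_response',
      strict: true,
      schema: {
        type: 'object',
        properties: { x: { type: 'number' }, y: { type: 'number' } },
        required: ['x', 'y'],
        additionalProperties: false,
      },
    },
  },
});

// with response_format, the structured output comes back in `content`, not in tool_calls
const output = JSON.parse(completion.choices[0].message.content ?? '{}') as { x: number; y: number };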

@scosman (Collaborator) commented Jan 21, 2025

I'm away from my computer for a week, so this is from memory, but from what I recall:

  1. used tools instead of JSON for tighter schema checks/guarantees
  2. used the task_response tool to get a JSON structure instead of an ordered list of params

I don't know why I didn't use response_format - that might be better, but I'd have to dig in. If there aren't any downsides, that sounds better.

The only docs are the code for now.

@scosman (Collaborator) commented Jan 21, 2025

Checked my notes and I had a note to potentially move from tool calling to response_format 😀. It fixed some issues with Ollama and Qwen on Fireworks.

@leonardmq (Contributor, Author) commented Jan 22, 2025

I seem to be noticing negative side-effects at inference time with a GPT-4o-mini model tuned on the task_response format - both when calling it with a task_response tool and parsing the arguments, and when trying it with response_format. For example, a model tuned on that format for a tokenizer task often drops words and punctuation (this happened occasionally before, but that model does it a lot more), and sometimes it swaps out words (which is novel behavior).

In contrast, using response_format at inference time, the same model before fine-tuning seems to perform better, and after fine-tuning (on the same ~500 samples) with the content: "{ json in here }" format it seems to perform noticeably better.

Admittedly a handwavy evaluation, but there seems to be a noticeable difference.

@scosman added the enhancement (New feature or request) label Jan 25, 2025
@scosman self-assigned this Jan 25, 2025