
Inference usage of model tuned on task_response tool #124

Closed
leonardmq opened this issue Jan 20, 2025 · 3 comments
Labels: enhancement (New feature or request)

@leonardmq (Contributor)
For a Task with Structured Output, creating a fine-tuning job via the Fine Tune UI for an OpenAI model will compile the dataset in a tool_call format, where the structured output of the task is modeled as arguments to a task_response tool.

If the model is trained to output arguments to a task_response tool, inference-time usage should follow the same format to get the most benefit from the training. If the user calls the model without a task_response tool, for example, or with response_format instead, the model may behave inconsistently due to the mismatch with the training samples.

task_response as a tool does not seem to be documented at the moment.

[Sample format sent to OpenAI for fine-tuning]
{
  "messages":[
    { "role":"system", "content":"..." },
    { "role":"user", "content":"..." },
    {
      "role":"assistant",
      "content":null,
      "tool_calls":[
        {
          "id":"call_1",
          "type":"function",
          "function":{
            "name":"task_response",
            "arguments":"{\"x\": 1, \"y\": 2}"
          }
        }
      ]
    }
  ]
}

To align with the sample format, the code calling the model at inference time might need to look like this:

const completion = await this.openai.chat.completions.create({
  model: 'my-tuned-model',
  messages: [...],

  // response_format: { ... }, // no response_format - the schema lives in the tool definition below

  tool_choice: 'required', // the model must always call the tool
  tools: [
    {
      type: 'function',
      function: {
        name: 'task_response', // define a function tool called task_response, like in the training samples
        strict: true, // strict schema adherence belongs inside the function definition
        parameters: {
          type: 'object',
          properties: { x: { type: 'number' }, y: { type: 'number' } }, // our Task's structured output schema goes here
          ...
        },
      },
    },
  ],
});

// read the response from the function-call arguments of the `task_response` tool call
const output = JSON.parse(completion.choices[0].message.tool_calls[0].function.arguments) as { x: number; y: number };

Could you please clarify how callers are expected to use the tuned model at inference time? I'd be happy to help document this once the intended usage is confirmed.

About the format in general, the OpenAI docs suggest that response_format might be more suitable when the structured-output use case does not require making actual function calls; they also show an example of fine-tuning for structured output by including the serialized JSON in the assistant content. However, the documentation does not elaborate on whether there are any meaningful differences beyond a slightly different interface and slightly more convenient parsing with the SDK.
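
For comparison, here is a rough sketch of what the response_format variant might look like at inference time (an assumption on my part, not confirmed intended usage). It presumes a model fine-tuned with the serialized JSON in the assistant content rather than a tool call; the model id and the x/y schema are placeholders carried over from the example above:

// Hypothetical sketch, not confirmed intended usage: inference with response_format
// instead of a task_response tool, assuming a model fine-tuned with the serialized
// JSON in the assistant `content`. Model id and x/y schema are placeholders.
import OpenAI from 'openai';

const openai = new OpenAI();

const completion = await openai.chat.completions.create({
  model: 'my-tuned-model', // placeholder fine-tuned model id
  messages: [
    { role: 'system', content: '...' },
    { role: 'user', content: '...' },
  ],
  // Structured Outputs via response_format, no tools needed
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'task_response',
      strict: true,
      schema: {
        type: 'object',
        properties: { x: { type: 'number' }, y: { type: 'number' } },
        required: ['x', 'y'],
        additionalProperties: false,
      },
    },
  },
});

// with response_format, the structured output comes back in `content`, not in tool_calls
const output = JSON.parse(completion.choices[0].message.content ?? '{}') as { x: number; y: number };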

@scosman (Collaborator) commented Jan 21, 2025

I'm away from my computer for a week, so this is from memory, but from what I recall:

  1. used tools instead of JSON for tighter schema checks/guarantees
  2. used the task_response tool to get a JSON structure instead of an ordered list of params

I don't know why I didn't use response_format - that might be better, but I'd have to dig in. If there aren't any downsides, that sounds better.

The only docs are the code for now.

@scosman (Collaborator) commented Jan 21, 2025

Checked my notes and I had a note to potentially move from tool calling to response_format 😀. It fixed some issues with Ollama and Qwen on Fireworks.

@leonardmq (Contributor, Author) commented Jan 22, 2025

I seem to be noticing negative side-effects at inference time with a GPT-4o-mini model tuned on the task_response format - both when calling it with a task_response tool and parsing the arguments, and when trying it with response_format. For example, a model tuned on that format for a tokenizer task often drops words and punctuation (this happened occasionally before, but that model does it a lot more), and sometimes it swaps out words (which is novel behavior).

In contrast, using response_format at inference time, the same model before fine-tuning seems to perform better, and after fine-tuning (on the same ~500 samples) with the content: "{ json in here }" format it seems to perform noticeably better.

Admittedly a handwavy evaluation, but there seems to be a noticeable difference.

@scosman added the enhancement (New feature or request) label Jan 25, 2025
@scosman self-assigned this Jan 25, 2025