Evaluating Tool Calling in Qwen3-4B: Base, Prompting and LoRA

Feb 10, 2026


Overview

In this blog post, we look at the performance of a fine-tuned Qwen3-4B-Base variant on a well-defined, structured task. The motivation behind this exercise is to evaluate how well a small language model can learn the structure, formatting, and parsing behaviour that are paramount for agentic tool calling. We also look at a few hyperparameter choices and the hypotheses behind them.

Task

For this task, we use Salesforce's xlam-function-calling-60k dataset from HuggingFace. The dataset maps natural language queries to the precise function calls needed to fetch the user's requested data. Each example contains the following fields:

  • query - The user's question or request in natural language.
  • tools - The list of available tools an agent can use to fetch the relevant data, including each tool's name, description, and expected parameters.
  • answers - An array of tool calls with parameters extracted from the user's query. This is the target label for the task, and we expect our model to generate it as a structured JSON object.
DatasetDict({
    train: Dataset({
        features: ['id', 'query', 'answers', 'tools'],
        num_rows: 60000
    })
})

Here is an example quoted from the dataset card at https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k

{
  "query": "Find the sum of all the multiples of 3 and 5 between 1 and 1000.
    Also find the product of the first five prime numbers.",
  "tools": [
    {
      "name": "math_toolkit.sum_of_multiples",
      "description": "Find the sum of all multiples of specified numbers within a
        specified range.",
      "parameters": {
        "lower_limit": {
          "type": "int",
          "description": "The start of the range (inclusive).",
          "required": true
        },
        "upper_limit": {
          "type": "int",
          "description": "The end of the range (inclusive).",
          "required": true
        },
        "multiples": {
          "type": "list",
          "description": "The numbers to find multiples of.",
          "required": true
        }
      }
    },
    {
      "name": "math_toolkit.product_of_primes",
      "description": "Find the product of the first n prime numbers.",
      "parameters": {
        "count": {
          "type": "int",
          "description": "The number of prime numbers to multiply together.",
          "required": true
        }
      }
    }
  ],
  "answers": [
    {
      "name": "math_toolkit.sum_of_multiples",
      "arguments": {
        "lower_limit": 1,
        "upper_limit": 1000,
        "multiples": [3, 5]
      }
    },
    {
      "name": "math_toolkit.product_of_primes",
      "arguments": {
        "count": 5
      }
    }
  ]
}

Looking at the dataset example, we can see that the expected response from our model must follow a strict JSON structure, which is why we will set a few generation hyperparameters to keep "creativity" to a minimum.

There are 60k examples in the dataset. Across the entire dataset, there are about 3,600 unique tools, a few of which appear in many examples. It is interesting to note that the average number of tool calls per query is about 2, while the maximum for a single query is around 50.
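As a rough sketch, these statistics can be computed directly from the dataset (this assumes the tools and answers fields may be stored as JSON-encoded strings, so the helper parses them if needed):

import json
from datasets import load_dataset

ds = load_dataset("Salesforce/xlam-function-calling-60k", split="train")

def as_list(field):
    # tools/answers may be JSON-encoded strings; parse them if so
    return json.loads(field) if isinstance(field, str) else field

tool_names = set()
call_counts = []
for example in ds:
    tool_names.update(tool["name"] for tool in as_list(example["tools"]))
    call_counts.append(len(as_list(example["answers"])))

print(f"unique tools: {len(tool_names)}")
print(f"avg tool calls per query: {sum(call_counts) / len(call_counts):.2f}")
print(f"max tool calls for a single query: {max(call_counts)}")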

For the experiments in this article, 5 percent of the data was used for evaluation, and the rest for training. However, we will see that only a small fraction of the training data was needed for the model to learn the desired formatting and parsing.
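The split itself is a one-liner with the datasets library (the seed is my own choice, not from the post):

from datasets import load_dataset

ds = load_dataset("Salesforce/xlam-function-calling-60k", split="train")

# Hold out 5% of the 60k examples for evaluation; the seed is arbitrary
splits = ds.train_test_split(test_size=0.05, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
print(len(train_ds), len(eval_ds))  # 57000 / 3000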

Sequence lengths range from roughly 100 to 3,000, with an average of about 400.

The dataset itself was generated by DeepSeek-V2-Chat and Mixtral-8x22B-Inst. You can find more details about this dataset in the XLAM paper.

Qwen3-xB? Base or Instruct?

For this exercise, we will go with one of the Qwen3 models: it's open source and highly performant across various tasks. You can read more about Qwen3's capabilities in this paper. Since I have access to an A100 GPU, I chose the 4B variant.

But the question is: do we want the Base or the Instruct variant? The task at hand can quite likely be solved by Qwen3-4B-Instruct with some curated prompt tuning. However, the goal in this exercise is to parse the correct arguments from natural language and generate an accurate, JSON-conforming object, so there is less need for general instruction-following. In addition, we want to evaluate the pretrained base model's behaviour after fine-tuning with LoRA.

https://huggingface.co/Qwen/Qwen3-4B-Base

Before LoRA

Let's see the Base model's behaviour on a query. For this, we will run the example shown above from https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k.

For inference, we will set the following generation parameters:

  • max_new_tokens=200
  • do_sample=False
  • temperature=None
  • top_p=None

But why? For function-calling tasks, we want a model that generates precise, deterministic outputs: the emphasis is less on "creativity" and more on structural accuracy. Hence, we opt for deterministic, greedy decoding.
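For concreteness, here is a minimal sketch of this greedy-decoding setup using the transformers library and the Qwen3-4B-Base checkpoint linked above (the example.json path, holding the raw record shown earlier, is hypothetical):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-4B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The raw dataset record shown above, saved locally (hypothetical path)
with open("example.json") as f:
    prompt = f.read()

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=False,   # greedy decoding: no sampling "creativity"
    temperature=None,  # explicitly unset so sampling settings are ignored
    top_p=None,
)
# Decode only the newly generated tokens
completion = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(completion)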

Running inference on this raw example produces the following output.

This JSON object contains a query and two tools that can be used to solve the problem.
The first tool, "math_toolkit.sum_of_multiples", can be used to find the sum of all multiples
of 3 and 5 between 1 and 1000. The second tool, "math_toolkit.product_of_primes",
can be used to find the product of the first five prime numbers.
The answers section contains the arguments that need to be passed to each tool
to solve the problem.

Well, this is not a JSON object. However, the base model itself is semantically quite capable. It would be interesting to see the output if we format the prompt and introduce a small system instruction.

Let's introduce a format for each prompt. Each prompt will look as follows:

<system_prompt>
Only use the tools provided and get the args from the query.
And only output a valid json which shows the function call and the args.
</system_prompt>
<query>
…
</query>
<answer>

As you can see, we did not apply a closing tag for <answer>. The reason is that the model's completion after this tag should be just the JSON object.
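As a sketch, the template can be wrapped in a small helper (the function name is mine, and since the template above does not show where the serialized tool list goes, folding it into the query block is an assumption):

import json

SYSTEM_PROMPT = (
    "Only use the tools provided and get the args from the query.\n"
    "And only output a valid json which shows the function call and the args."
)

def build_prompt(query: str, tools: list) -> str:
    # Where exactly the tool list is injected is not shown in the post's
    # template, so placing it alongside the query is an assumption.
    body = f"Tools: {json.dumps(tools)}\n\nQuery: {query}"
    # No closing </answer> tag: the completion should be the JSON object itself.
    return (
        f"<system_prompt>\n{SYSTEM_PROMPT}\n</system_prompt>\n"
        f"<query>\n{body}\n</query>\n"
        "<answer>\n"
    )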

Let’s evaluate the output of the base model on a formatted input prompt. The result?

{
  "function": "math_toolkit.sum_of_multiples",
  "args": {
    "lower_limit": 1,
    "upper_limit": 1000,
    "multiples": [3, 5]
  }
},
{
  "function": "math_toolkit.product_of_primes",
  "args": {
    "count": 5
  }
}

As you can see, this time the base model correctly parsed the function calls and arguments, and it structured the response with only minor errors (for instance, the objects are not wrapped in an enclosing array and the key names differ from the reference schema). While we could get around this with some ad hoc heuristics on the raw response (one is sketched below), it is worth observing the behaviour after running LoRA on some training examples and comparing it with the raw base output.
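For instance, one such heuristic, sketched under the assumption that the base output is a comma-separated list of objects without an enclosing array:

import json

def parse_loose_json(raw: str):
    """Ad hoc recovery for outputs like the base model's response above."""
    raw = raw.strip()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # e.g. two objects separated by a comma: wrap in brackets to get a list
        return json.loads(f"[{raw}]")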

LoRA Configuration

We will apply LoRA to the following projection modules:

"q_proj",
"k_proj",
"v_proj",
"o_proj",
"gate_proj",
"up_proj",
"down_proj"

We apply LoRA to the attention mechanism (the q, k, v projections and the attention output projection) so it can learn new attention patterns, and to the feed-forward layers for further downstream adaptation.

To be more memory efficient, we will use QLoRA with the following configuration. We choose lora_r=32 with bf16 precision. It is common practice to keep alpha at 2× the rank, so we set alpha=64.

The full configuration set is as follows (the LoRA values go into the PEFT config and the rest into the SFTTrainer setup):

# LoRA
lora_r=32,
lora_alpha=64,
lora_dropout=0.05,

# Training
num_train_epochs=1,
per_device_train_batch_size=2,
per_device_eval_batch_size=2,
gradient_accumulation_steps=4,
gradient_checkpointing=True,

completion_only_loss=True,
learning_rate=1e-4,
lr_scheduler_type="cosine",
max_seq_length=1024,
warmup_ratio=0.03,
eval_steps=100
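Putting it together, here is a hedged sketch of how these settings might map onto PEFT's LoraConfig and TRL's SFTConfig/SFTTrainer, with 4-bit quantization via bitsandbytes for QLoRA. The output directory is a placeholder, train_ds/eval_ds are the splits from earlier (assumed to be formatted into prompt/completion pairs), and a few argument names differ slightly across TRL/transformers versions:

import torch
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# 4-bit NF4 quantization with bf16 compute (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Base",
    quantization_config=bnb_config,
    device_map="auto",
)

peft_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="qwen3-4b-xlam-lora",   # placeholder directory name
    num_train_epochs=1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    max_seq_length=1024,               # called max_length in newer TRL releases
    completion_only_loss=True,         # loss on the completion (answer) only
    bf16=True,
    eval_strategy="steps",             # evaluation_strategy in older transformers
    eval_steps=100,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,  # prompt/completion pairs, e.g. built with build_prompt above
    eval_dataset=eval_ds,
    peft_config=peft_config,
)
trainer.train()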

While the number of epochs was set to 1, we stopped at around 41% of the training run (2,900 of ~7k steps) and then started evaluation.

At step=100:

  • Training loss = 0.0352
  • Validation loss = 0.0472

At step=2900:

  • Training loss = 0.01956
  • Validation loss = 0.02986

Here are the loss curves:

[Figure: training and validation loss curves]

Interestingly, the training loss fluctuated quite a bit, while the validation loss improved consistently.

Next, we evaluate the LoRA adapter checkpointed at step=2900. The size of this LoRA adapter is ~300 MB, which showcases the memory efficiency of using LoRA for fine-tuning.
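For evaluation, the adapter can be loaded on top of the base model with PEFT; a sketch, with the checkpoint path following the placeholder output directory from the training sketch above:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Base", torch_dtype=torch.bfloat16, device_map="auto"
)
# Attach the LoRA adapter saved at step 2900 (path is a placeholder)
model = PeftModel.from_pretrained(base, "qwen3-4b-xlam-lora/checkpoint-2900")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Base")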

Evaluation

Function calling is a task that can be judged accurately on output structure, tool names, and arguments. Hence, there are three metrics we will use to compare the model's output against the expected output for a given query.

JSON Validity

This checks whether the generated output has a valid JSON structure. This check is strictly enforced, as a broken JSON object can lead to downstream errors during tool chaining.

Exact JSON Match

This checks whether the generated output exactly matches the expected output. This includes checking the tools called and the arguments passed to those tools.

Hallucination Score

This checks whether the generated output introduces tool names or argument values that do not exist in the query or toolset (reported below as "Tools Exist Accuracy").
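A rough sketch of how these three checks might be implemented (the function names are mine; exact match compares parsed JSON rather than raw strings, and the hallucination check verifies that every generated call names a tool from the provided toolset):

import json

def json_validity(output: str) -> bool:
    """Does the generated text parse as JSON at all?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def exact_json_match(output: str, expected: list) -> bool:
    """Do the generated tool calls match the reference calls exactly?"""
    try:
        return json.loads(output) == expected
    except json.JSONDecodeError:
        return False

def tools_exist(output: str, tools: list) -> bool:
    """Does every generated call use a tool from the provided toolset?"""
    allowed = {tool["name"] for tool in tools}
    try:
        calls = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(call.get("name") in allowed for call in calls)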

After running the inference loop on 100 eval examples with Qwen3-4B-Base and the LoRA-tuned version, we get the following scores:

--- Accuracy Scores ---
Base Model - JSON Validity Accuracy: 63.37%
Base Model - Exact JSON Match Accuracy: 0.00%
Base Model - Tools Exist Accuracy: 63.37%

LoRA Model - JSON Validity Accuracy: 100.00%
LoRA Model - Exact JSON Match Accuracy: 80.20%
LoRA Model - Tools Exist Accuracy: 100.00%

As we can see, the base model generated valid JSON in about 63% of the cases, and the numbers suggest that whenever the JSON was valid, it also picked the correct tools. However, none of the base model's outputs exactly matched the expected output.

On the other hand, the LoRA-tuned version accurately learnt the JSON structure, and in over 80% of the cases there was an exact match with the expected output. The LoRA model also did not hallucinate, meaning that for every example it only called the tools and arguments specified in the query and toolset.

Let us take a look at the LoRA-tuned model's output for the prompt we used earlier.

[
  {
    "name": "math_toolkit.sum_of_multiples",
    "arguments": {
      "lower_limit": 1,
      "upper_limit": 1000,
      "multiples": [3, 5]
    }
  },
  {
    "name": "math_toolkit.product_of_primes",
    "arguments": {
      "count": 5
    }
  }
]

It is an exact match.

Conclusion

Qwen3-4B-Base already shows strong semantic understanding, with modest gains from a lightweight system prompt. However, achieving consistent format adherence and reliable argument parsing requires more than prompt engineering alone. LoRA fine-tuning provides that extra control, noticeably improving structural reliability while preserving the base model’s behavior.

While an instruct-tuned model would likely be preferable in production, this experiment was scoped to isolate the effect of LoRA. We could extend this analysis to smaller Qwen3 variants and different LoRA ranks, but even within this limited setup, the impact of LoRA on structured generation is quite evident.

References