Rubra v0.1: Introducing Tool Calling to Top Open-Weight LLMs

Jul 10, 2024 by Sanjay Nadhavajhala


The trend is shifting from simple chatbots that provide text-based assistance to AI agents capable of performing tasks. A key component of these AI agents is the ability to call and execute external APIs, tools, and functions, commonly known as “tool calling” or “function calling”. AI agents use LLMs as the reasoning engine to decide when and how to call tools to accomplish tasks or retrieve more information, leveraging their ability to connect to real-time data and systems. In our experience, tool calling with open-weight LLMs, such as Meta’s Llama-3 or Microsoft’s Phi-3, has been unreliable and significantly underperformed in agent frameworks that depend on precise task execution from LLMs.

While major cloud LLM providers like OpenAI, Anthropic, Cohere, Google, and Mistral have embraced function calling, it has remained largely underdeveloped in open-weight models. As a result, LLM app developers have had to rely on flaky prompt engineering. Therefore, we decided to further train these models.

Rubra v0.1 introduces tool calling to top open-weight Large Language Models, including:

  1. Meta Llama3 8B
  2. Meta Llama3 70B
  3. Google Gemma 1.1 2B
  4. Mistral 7B Instruct v0.3
  5. Mistral 7B Instruct v0.2
  6. Microsoft Phi-3 Mini 128k Instruct
  7. Alibaba Qwen2 7B Instruct

Additionally, we extend the popular inferencing libraries vLLM and llama.cpp so that the output of Rubra models serves as a drop-in replacement for OpenAI-format tool calling.
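For illustration, this is roughly what OpenAI-format tool calling looks like end to end. The `get_current_weather` function and the response payload below are hypothetical examples constructed for this sketch, not output captured from a Rubra model or its serving endpoints:

```python
import json

# An OpenAI-format tool definition, as an agent framework would send it.
# The weather function here is purely illustrative.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"],
            },
        },
    }
]

# A response in the same format that an OpenAI-compatible server
# would return when the model decides to call the tool.
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "arguments": json.dumps({"city": "Paris"}),
            },
        }
    ],
}

def extract_tool_calls(message):
    """Return (name, parsed_args) pairs from an OpenAI-format assistant message."""
    return [
        (tc["function"]["name"], json.loads(tc["function"]["arguments"]))
        for tc in message.get("tool_calls") or []
    ]

calls = extract_tool_calls(assistant_message)
```

Because the format matches, an agent framework written against the OpenAI client can dispatch these calls without model-specific parsing.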

How We Trained Rubra Models

Developing LLMs generally involves two main steps:

1. Pre-training: In this step, the model is trained using a broad and diverse range of data to build a solid foundation of knowledge.

2. Post-training:

   - Supervised Fine-Tuning: The model's knowledge is refined with specific instructions and examples to improve its understanding and performance in various tasks.

   - Alignment: Techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are used to align and optimize the model's responses based on human preference data, making them more accurate, useful, and safe.
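The DPO objective mentioned above can be sketched in a few lines: the loss pushes the policy to prefer the chosen response over the rejected one more strongly than a frozen reference model does. The log-probabilities below are made-up numbers for illustration, not Rubra training data:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Inputs are total log-probabilities of the chosen/rejected responses
    under the trained policy and the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(x)) written stably as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))

# The loss shrinks as the policy favors the chosen response more
# strongly than the reference does.
weak = dpo_loss(-10.0, -10.0, -10.0, -10.0)   # no preference shift: log(2)
strong = dpo_loss(-8.0, -12.0, -10.0, -10.0)  # chosen up, rejected down
```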

In Rubra v0.1, we further post-trained instruction-tuned models. Instead of training LoRA adapters or running a full fine-tune, we opted for a novel approach designed to avoid the common pitfalls of both: catastrophic forgetting and overfitting. We achieved this through block expansion, interleaving additional transformer layers into the existing network and training only those new layers. This preserves the integrity of the original parameters, preventing overfitting and mitigating catastrophic forgetting of skills the model already had. We then employed DPO to align the model more closely with our targeted outcomes, enhancing its adaptability without compromising its established capabilities.
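The property that makes block expansion safe is that a new residual block with zero-initialized weights computes the identity, so the expanded network starts out behaving exactly like the original and training only moves it gradually away from that point. The 1-D "blocks" below are toy stand-ins for transformer layers, not Rubra's actual architecture:

```python
def linear(w, b):
    """A toy 1-D block body: x -> w*x + b."""
    return lambda x: w * x + b

def residual(f):
    """Wrap a block body with a residual connection: x -> x + f(x)."""
    return lambda x: x + f(x)

def run(blocks, x):
    """Apply a stack of blocks in sequence."""
    for blk in blocks:
        x = blk(x)
    return x

# A frozen "pretrained" stack of residual blocks (toy values).
pretrained = [residual(linear(0.5, 0.1)), residual(linear(-0.2, 0.3))]

def expand(blocks):
    """Interleave a new residual block after each original one.

    The new block's weights start at zero, so x + f(x) = x and the
    expanded network computes exactly the same function as before.
    Only these new blocks would be trained; the originals stay frozen.
    """
    expanded = []
    for blk in blocks:
        expanded.append(blk)                          # frozen original
        expanded.append(residual(linear(0.0, 0.0)))   # new zero-init block
    return expanded

expanded = expand(pretrained)
```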

To confirm the effectiveness of our post-training approach, we rigorously validated the models against a variety of benchmarks. These benchmarks, which assess both the enhanced capabilities and overall performance of our Rubra models as well as other function-calling models, are available here. The results show that Rubra models are capable of tool calling and the other tasks AI agents require, whereas models fine-tuned specifically for function calling, such as gorilla-openfunctions and NexusRaven, struggle with both complex function calling and the related benchmarks, likely because of overfitting and catastrophic forgetting.

The benchmark results indicate that models proficient in reasoning also tend to excel in complex, agentic function calling tasks.

Benchmarks

| Model | Function Calling | MMLU (5-shot) | GPQA (0-shot) | GSM-8K (8-shot, CoT) | MATH (4-shot, CoT) | MT-Bench |
|---|---|---|---|---|---|---|
| Rubra Llama-3 70B Instruct | 97.85% | 75.90 | 33.93 | 82.26 | 34.24 | 8.36 |
| Rubra Llama-3 8B Instruct | 89.28% | 64.39 | 31.70 | 68.99 | 23.76 | 8.03 |
| Rubra Qwen2-7B-Instruct | 85.71% | 68.88 | 30.36 | 75.82 | 28.72 | 8.08 |
| Rubra Mistral 7B Instruct v0.3 | 73.57% | 59.12 | 29.91 | 43.29 | 11.14 | 7.69 |
| Rubra Phi-3 Mini 128k Instruct | 70.00% | 67.87 | 29.69 | 79.45 | 30.80 | 8.21 |
| Rubra Mistral 7B Instruct v0.2 | 69.28% | 58.90 | 29.91 | 34.12 | 8.36 | 7.36 |
| Rubra Gemma-1.1 2B Instruct | 45.00% | 38.85 | 24.55 | 6.14 | 2.38 | 5.75 |

The benchmark results indicate that Rubra models successfully mitigated the catastrophic forgetting often found in fine-tuned models. Let's take a closer look by comparing Rubra Llama-3 8B against the unmodified Llama-3 8B on MT-Bench, a benchmark that uses GPT-4 to judge a model's ability to engage in coherent, informative, and engaging conversations:

| Model | Function Calling Ability (OpenAI format) | Win | Loss | Tie | Win Rate | Loss Rate | Adjusted Win Rate |
|---|---|---|---|---|---|---|---|
| Llama-3 8B Instruct | 0% | 41 | 42 | 77 | 0.25625 | 0.2625 | 0.496875 |
| Rubra Enhanced Llama-3 8B Instruct | 89.28% | 42 | 41 | 77 | 0.2625 | 0.25625 | 0.503125 |
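The rate columns follow from the win/loss/tie counts by simple arithmetic, with each tie counted as half a win for both sides in the adjusted rate. A quick check against the numbers above (160 head-to-head games):

```python
def win_rates(wins, losses, ties):
    """Compute (win rate, loss rate, adjusted win rate) from game counts.

    The adjusted win rate counts each tie as half a win.
    """
    total = wins + losses + ties
    win_rate = wins / total
    loss_rate = losses / total
    adjusted = (wins + ties / 2) / total
    return win_rate, loss_rate, adjusted

# Counts from the MT-Bench head-to-head table.
base = win_rates(41, 42, 77)    # Llama-3 8B Instruct
rubra = win_rates(42, 41, 77)   # Rubra Enhanced Llama-3 8B Instruct
```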

The two models are nearly identical in performance, with a slight bias to the Rubra enhanced model. Additionally, the Rubra enhanced model demonstrates a significant improvement in function calling ability, which aligns with our goal to enhance this specific capability without sacrificing overall performance.

Every model we trained has a head-to-head comparison like this, which can be found in the docs under Models.

Conclusion

Rubra v0.1 addresses a crucial gap in the current landscape of open-weight LLMs by introducing robust tool-calling capabilities. This enhancement ensures that developers no longer need to rely on unreliable prompt engineering, and can instead leverage a more structured and consistent method for task execution.

By integrating our models with popular inferencing libraries such as vLLM and llama.cpp, we provide seamless support for OpenAI-format tool calling. This compatibility not only streamlines the development process but also expands the potential applications of Rubra models.

Explore and leverage the capabilities of Rubra models in your projects. You can try them out directly in Hugging Face Spaces – it's free and requires no login. We look forward to seeing the innovative ways you use these tools with Rubra models!

Sanjay Nadhavajhala is an AI engineer at Acorn Labs.