Function calling is the fundamental feature that powers our flagship project, GPTScript. This makes an LLM’s ability to call functions the primary consideration when determining its suitability as a drop-in replacement for OpenAI’s gpt-4o (the current default model used by GPTScript). To quantify this ability, we decided to sink some time into building out function-calling-test-suite (FCTS), a shiny new test framework!
We’ll drop another blog post that delves into the specifics of FCTS’s design and key features shortly. The short version: it runs a suite of categorized, YAML-defined function calling test cases against any model GPTScript can talk to, repeats each case to account for non-determinism, and aggregates the results into per-model pass rates.
Now that introductions are out of the way, here’s what we’ve found with FCTS so far:
We tested six major models with function calling support across four major platforms. To account for the non-deterministic nature of generative models, we ran every test case 10 times per model, then ranked the models by overall pass rate.
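For the curious, there’s nothing fancy behind that ranking. Here’s a toy sketch (in Python, not FCTS’s actual harness) of how an overall pass rate falls out of repeated runs; the test outcomes below are made up purely for illustration:

```python
# Toy sketch, not FCTS's harness: compute an overall pass rate from repeated runs.
# The outcomes below are invented -- one list of 10 boolean results per test case.
runs_per_test = {
    "01_basic.yaml-0":    [True] * 10,
    "05_chained.yaml-1":  [True] * 7 + [False] * 3,
    "07_semantic.yaml-6": [True] * 9 + [False],
}

total_runs = sum(len(runs) for runs in runs_per_test.values())
passed_runs = sum(sum(runs) for runs in runs_per_test.values())
print(f"overall pass rate: {100 * passed_runs / total_runs:.2f}%")  # 86.67% for this toy data
```

With that out of the way, here’s how the models stacked up: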
Rank | Pass Rate | Model | Platform |
---|---|---|---|
1 | 98.24% | gpt-4o-2024-05-13 | OpenAI |
2 | 94.71% | gpt-4-turbo-2024-04-09 | OpenAI |
3 | 87.65% | claude-3-5-sonnet-20240620 | Anthropic |
4 | 72.94% | claude-3-opus-20240229 | Anthropic |
5 | 51.18% | mistral-large-2402 | La Plateforme (Mistral AI) |
6 | 48.82% | gemini-1.5-pro | Vertex AI (Google) |
As mentioned earlier, GPTScript uses gpt-4o — which referenced gpt-4o-2024-05-13 at the time these rankings were compiled — by default, so we were already confident in its ability to satisfy our use cases. But to get a rough idea of how well these results stack up to reality, we also ran GPTScript on a selection of example scripts and recorded the pass rate for each model.
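If you want to reproduce a table like the one below yourself, a loop over the example scripts is all it takes. This sketch is not our actual harness: it assumes the gptscript CLI’s --default-model flag, treats a non-zero exit code as a crude stand-in for pass/fail (properly judging a script usually means inspecting its output), and leaves the provider shim setup needed for non-OpenAI models as an exercise:

```python
# Illustrative sketch, not our actual harness. Assumes the gptscript CLI's
# --default-model flag and that provider API keys are already in the environment.
import subprocess

MODELS = ["gpt-4o-2024-05-13", "gpt-4-turbo-2024-04-09"]  # non-OpenAI models need provider shims
EXAMPLES = ["examples/bob.gpt", "examples/echo.gpt", "examples/fac.gpt"]

results = {}
for model in MODELS:
    for script in EXAMPLES:
        proc = subprocess.run(
            ["gptscript", "--default-model", model, script],
            capture_output=True, text=True, timeout=300,
        )
        # Exit code is only a rough proxy for pass/fail; real grading means
        # checking that the script's output is actually correct.
        results[(model, script)] = "pass" if proc.returncode == 0 else "fail"

for (model, script), outcome in sorted(results.items()):
    print(f"{model:24} {script:28} {outcome}")
```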
Example | gpt-4o-2024-05-13 | gpt-4-turbo-2024-04-09 | claude-3-5-sonnet-20240620 | claude-3-opus-20240229 | mistral-large-2402 | gemini-1.5-pro |
---|---|---|---|---|---|---|
bob-as-shell.gpt | pass | pass | pass | pass | pass | pass |
bob.gpt | pass | pass | pass | pass | pass | pass |
echo.gpt | pass | pass | pass | pass | pass | pass |
fac.gpt | pass | pass | pass | pass | pass | fail |
helloworld.gpt | pass | pass | pass | pass | pass | pass |
describe-code.gpt | pass | pass | fail | fail | fail | fail |
add-go-mod-dep.gpt | pass | pass | fail | fail | fail | fail |
hacker-news-headlines.gpt | pass | pass | pass | fail | fail | fail |
search.gpt | pass | pass | pass | pass | fail | pass |
json-notebook | pass | pass | pass | fail | fail | fail |
sqlite-download.gpt | pass | pass | pass | pass | fail | fail |
syntax-from-code.gpt | pass | pass | pass | pass | fail | pass |
git-commit.gpt | pass | pass | pass | fail | pass | fail |
sentiments.gpt | pass | pass | pass | fail | pass | fail |
Rank | Example Pass Rate | FCTS Pass Rate | Model |
---|---|---|---|
1 | 100% | 98.24% | gpt-4o-2024-05-13 |
2 | 100% | 94.71% | gpt-4-turbo-2024-04-09 |
3 | 85.71% | 87.65% | claude-3-5-sonnet-20240620 |
4 | 57.14% | 72.94% | claude-3-opus-20240229 |
5 | 50.00% | 51.18% | mistral-large-2402 |
6 | 42.86% | 48.82% | gemini-1.5-pro |
With the exception of claude-3-opus-20240229, which differs by ~16%, each model’s example pass rate is within 6% of its FCTS pass rate. Although this isn’t exactly an apples-to-apples comparison, we feel the congruence is enough to warrant some confidence that FCTS is a reasonable approximation of a model’s potential performance with GPTScript.
Huzzah!
Now that we’ve convinced ourselves that our results pass muster, let’s take a closer look at the test cases.
The initial test suite spans six categories and contains a relatively small number of test cases, but we feel they cover a wide mix of typical use cases without being too overwhelming.
Category | Description |
---|---|
basic | Tests that a model can make the most basic function calls |
sequenced | Tests that a model can make function calls in a specific order |
chained | Tests that a model can pass the result of a function call to another function |
grouped | Tests that a model can identify and make groups of function calls |
semantic | Tests that a model can infer and make the correct function calls given natural language prompts and descriptions |
gptscript | Tests that a model can perform more complex tasks found in GPTScript’s example scripts. |
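To make those categories a bit more concrete, here’s a rough illustration of the kind of exchange a chained test exercises, written directly against OpenAI’s Chat Completions API rather than FCTS’s own spec format. The tools, prompt, and stubbed results are invented for the example: to pass, the model has to call get_city first, feed its result into get_weather, and then report the final answer in plain language.

```python
# Illustration of a "chained" scenario using OpenAI's Chat Completions tool calling.
# The tools, prompt, and stubbed results are made up for this example.
import json
from openai import OpenAI

client = OpenAI()

tools = [
    {"type": "function", "function": {
        "name": "get_city",
        "description": "Returns the city the user is currently in",
        "parameters": {"type": "object", "properties": {}},
    }},
    {"type": "function", "function": {
        "name": "get_weather",
        "description": "Returns the current weather for a city",
        "parameters": {"type": "object",
                       "properties": {"city": {"type": "string"}},
                       "required": ["city"]},
    }},
]

def call_tool(name: str, args: dict) -> str:
    # Stubbed results; a real harness would execute real functions.
    return {"get_city": "Phoenix", "get_weather": "41C and sunny"}[name]

messages = [{"role": "user", "content": "What's the weather like where I am right now?"}]
for _ in range(5):  # a passing model needs two rounds of calls plus a final answer
    resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:  # no more calls: this is the model's final answer
        print(msg.content)
        break
    for call in msg.tool_calls:  # run each requested call and hand back its result
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": call_tool(call.function.name, json.loads(call.function.arguments)),
        })
```

With that picture in mind, here are the individual test cases: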
Test ID | Description | Categories |
---|---|---|
01_basic.yaml-0 | Asserts that the model can make a function call with a given argument and conveys the result to the user | basic |
01_basic.yaml-1 | Asserts that the model can make a function call with an ordered set of arguments and conveys the result to the user | basic |
03_sequenced.yaml-0 | Asserts that the model can make a sequence of function calls in the correct order and conveys the results to the user | sequenced |
03_sequenced.yaml-1 | Asserts that the model can make a mix of ordered and unordered function calls and conveys the result to the user | sequenced |
05_chained.yaml-0 | Asserts that the model can use the result of a function call as the argument for a specified function and conveys the result to the user | chained |
05_chained.yaml-1 | Asserts that the model can use the results of a group of function calls as arguments for a single function call and conveys the result to the user | chained, grouped |
05_chained.yaml-2 | Asserts that the model can use the results of a group of function calls as arguments for successive groups of function calls and conveys the result to the user | chained, grouped |
07_semantic.yaml-0 | Asserts that the model can derive and make a function call with one argument from a prompt and conveys the result to the user | semantic, basic |
07_semantic.yaml-1 | Asserts that the model can derive and make a function call with two arguments from a prompt and conveys the result to the user | semantic, basic |
07_semantic.yaml-2 | Asserts that the model can derive and make an ordered sequence of function calls from a prompt and conveys the results to the user | sequenced, semantic |
07_semantic.yaml-3 | Asserts that the model can derive and make two function calls from the prompt, using the result of the first call as the argument for the second, and convey the result to the user | semantic, chained |
07_semantic.yaml-4 | Asserts that the model can derive and make a series of function calls from a prompt, where the results of an initial group of calls are used as arguments for a final function call, and conveys the result to the user | semantic, chained |
07_semantic.yaml-5 | Asserts that the model can interpret and execute a complex series of chained steps related to creating a database and creating entries in it. | semantic, chained |
07_semantic.yaml-6 | Asserts that the model can parse a comma delimited list from one function, pass each entry to a second function, and send the gathered results of those calls to a third function. | chained, semantic, grouped |
07_semantic.yaml-7 | Asserts that the model can parse a large csv style response and make a series of chained calls for each row in the csv | semantic, chained |
07_semantic.yaml-8 | Asserts that the model can parse and transform user input based on the instructions in its system prompt. | sequenced, gptscript, semantic, chained |
07_semantic.yaml-9 | Asserts that the model can build a chain of grouped function calls. | sequenced, semantic, chained, grouped, gptscript |
Note: Test ID refers to the spec file name and YAML stream index that a given spec originated from. The “gaps” in the indices above are because we’ve elided the nascent negative test category from our analysis; we’re not yet fully confident that category is meaningful. The full spec files for the entire suite, including negatives, are available for review in the FCTS repo.
Plotting the number of passed runs for each test case as a heat map makes the major differences between models stand out.
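Here’s one way you could produce that kind of heat map with matplotlib. The three rows below are reconstructed from the per-model failure tables later in this post (10 runs per cell); the real plot covers every test case:

```python
# One way to plot passed runs per test case as a heat map.
# Rows are test cases, columns are models, each cell is passes out of 10 runs.
import matplotlib.pyplot as plt
import numpy as np

models = ["gpt-4o", "gpt-4-turbo", "claude-3.5-sonnet",
          "claude-3-opus", "mistral-large", "gemini-1.5-pro"]
tests = ["01_basic.yaml-0", "05_chained.yaml-1", "07_semantic.yaml-6"]
passes = np.array([
    [10, 10, 10, 10, 10, 10],  # 01_basic.yaml-0: no recorded failures
    [10, 10,  0,  0,  0,  0],  # 05_chained.yaml-1
    [10,  2,  0,  0,  0,  0],  # 07_semantic.yaml-6
])

fig, ax = plt.subplots(figsize=(8, 3))
im = ax.imshow(passes, cmap="viridis", vmin=0, vmax=10)
ax.set_xticks(range(len(models)))
ax.set_xticklabels(models, rotation=45, ha="right")
ax.set_yticks(range(len(tests)))
ax.set_yticklabels(tests)
fig.colorbar(im, label="passed runs (out of 10)")
fig.tight_layout()
plt.show()
```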
From the heat map, we can see that the gulf in performance between OpenAI and the other providers is mostly caused by failing chained and semantic test cases. Interestingly, with the exception of claude-3.5-sonnet, the non-OpenAI providers fail the same two chained and two semantic test cases across the board (05_chained.yaml-1, 05_chained.yaml-2, 07_semantic.yaml-6, and 07_semantic.yaml-8). These failures represent a whopping ~66% and 20% of the total test runs in their respective categories!
But to compare each model’s deficits with any greater fidelity, we’ll need to understand why they failed on a test-by-test basis.
gpt-4o-2024-05-13
Test ID | Fail Rate | Failure Pathology |
---|---|---|
07_semantic.yaml-4 | 30% | - Fails to properly chain groups of function calls - Hallucinates function arguments |
gpt-4-turbo-2024-04-09
Test ID | Fail Rate | Failure Pathology |
---|---|---|
07_semantic.yaml-6 | 80% | - Returns an incorrect argument after a large number of function calls |
07_semantic.yaml-9 | 10% | - Makes an unnecessary duplicate function call |
claude-3-5-sonnet-20240620
Test ID | Fail Rate | Failure Pathology |
---|---|---|
05_chained.yaml-1 | 100% | - Chains correctly - Final answer enumerates the chain of function calls invoked instead of the final evaluated result |
05_chained.yaml-2 | 10% | - Chains correctly - Final answer enumerates the chain of function calls invoked instead of the final evaluated result |
07_semantic.yaml-6 | 100% | - Halts after the first call - Responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
claude-3-opus-20240229
Test ID | Fail Rate | Failure Pathology |
---|---|---|
05_chained.yaml-0 | 60% | - Makes chained calls in parallel - Passes a “placeholder” instead of a “real” argument |
05_chained.yaml-1 | 100% | - Chains correctly - Final answer enumerates the chain of function calls invoked instead of the final evaluated result |
05_chained.yaml-2 | 100% | - Makes chained calls in parallel - Passes a “placeholder” instead of a “real” argument |
07_semantic.yaml-6 | 100% | - Halts after the first call - Responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
07_semantic.yaml-8 | 100% | - Halts without making any calls - Responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
mistral-large-2402
Test ID | Fail Rate | Failure Pathology |
---|---|---|
05_chained.yaml-1 | 100% | - Halts without making any calls - Responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
05_chained.yaml-2 | 100% | - Makes chained calls in parallel - Passes a “placeholder” instead of a “real” argument |
07_semantic.yaml-2 | 30% | - Halts after the first call - Responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
07_semantic.yaml-4 | 100% | - Halts after the first call - Responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
07_semantic.yaml-5 | 100% | - Halts without making any calls - Responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
07_semantic.yaml-6 | 100% | - Makes chained calls in parallel - Passes a “placeholder” instead of a “real” argument |
07_semantic.yaml-7 | 100% | - Makes chained calls in parallel - Hallucinates arguments instead of using the results of the initial call |
07_semantic.yaml-8 | 100% | - Halts after the first call - Responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
07_semantic.yaml-9 | 100% | - Halts after the first call - Responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
gemini-1.5-pro
Test ID | Fail Rate | Failure Pathology |
---|---|---|
01_basic.yaml-1 | 10% | - Makes the correct tool call - Returns the raw JSON of what looks like the internal “google representation” of the call result |
05_chained.yaml-0 | 100% | - Fails to derive chain call order - Passes the literal “unknown” as an argument |
05_chained.yaml-1 | 100% | - Fails to derive chain call order - Passes given arguments to the wrong function |
05_chained.yaml-2 | 100% | - Makes no function calls - Returns a 500 error |
07_semantic.yaml-2 | 100% | - Chains correctly - Final answer doesn’t contain the chain’s result |
07_semantic.yaml-3 | 60% | - Chains correctly - Final answer is missing required information |
07_semantic.yaml-4 | 100% | - Chains correctly - Final answer is missing required information |
07_semantic.yaml-6 | 100% | - Returns an incorrect argument after a large number of function calls |
07_semantic.yaml-8 | 100% | - Fails to derive the chain order - Hallucinates initial argument |
07_semantic.yaml-9 | 100% | - Begins chain correctly - Adds extra escape characters to newlines |
Thumbing through the failure pathologies above reveals a few common threads between models:
claude-3-5-sonnet-20240620, claude-3-opus-20240229, and mistral-large-2402 all frequently halt before completing their tasks. Instead of executing the plan, they just describe what should be done. For example, claude-3-opus-20240229 stops after the first call in 07_semantic.yaml-6 and makes no calls at all in 07_semantic.yaml-8, while mistral-large-2402 exhibits similar behavior in several tests, like 05_chained.yaml-1, 07_semantic.yaml-4, and 07_semantic.yaml-5.
claude-3-opus-20240229 and mistral-large-2402 tend to make parallel calls when they should be sequential, leading to incorrect results. This problem is evident in tests like 05_chained.yaml-2 and 07_semantic.yaml-6. gemini-1.5-pro also encounters this issue, especially in 05_chained.yaml-0 and 05_chained.yaml-1, failing to derive the correct call order.
Hallucinating function arguments is another prevalent issue, and gpt-4o-2024-05-13, claude-3-opus-20240229, mistral-large-2402, and gemini-1.5-pro all exhibit it in some form. In 07_semantic.yaml-4, gpt-4o-2024-05-13 generates arguments that were not part of the original input; claude-3-opus-20240229 passes placeholder values instead of real arguments in 05_chained.yaml-0 and 05_chained.yaml-2; and mistral-large-2402 and gemini-1.5-pro make up inputs on the fly in 07_semantic.yaml-7 and 07_semantic.yaml-8, respectively.
At the moment, one factor that could throw off our results is the use of GPTScript provider shims for model providers that don’t support OpenAI’s Chat Completion API, e.g. claude-3-opus-20240229 and gemini-1.5-pro. While we’re fairly confident in our shims, there’s always the potential for unknown bugs to skew our test results. That said, since we’ve tested the shims pretty thoroughly, we expect confounding from this source to be minimal.
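To give a sense of where a shim bug could hide, here’s a sketch of the core schema translation an Anthropic-facing shim has to perform. This isn’t the actual GPTScript shim code (the real thing also deals with streaming, message history, and error handling); it just maps OpenAI-style tool definitions into Anthropic’s tool format and Anthropic tool_use blocks back into OpenAI-style tool_calls:

```python
# Not the actual GPTScript shim code -- just a sketch of the schema translation
# an Anthropic shim has to perform, to show where subtle bugs could creep in.
import json

def openai_tools_to_anthropic(tools: list[dict]) -> list[dict]:
    # OpenAI: {"type": "function", "function": {"name", "description", "parameters"}}
    # Anthropic: {"name", "description", "input_schema"}
    return [
        {
            "name": t["function"]["name"],
            "description": t["function"].get("description", ""),
            "input_schema": t["function"].get(
                "parameters", {"type": "object", "properties": {}}
            ),
        }
        for t in tools
    ]

def anthropic_tool_use_to_openai(content_blocks: list[dict]) -> list[dict]:
    # Anthropic returns tool calls as "tool_use" content blocks; OpenAI clients
    # expect a "tool_calls" array with JSON-encoded argument strings.
    return [
        {
            "id": block["id"],
            "type": "function",
            "function": {"name": block["name"], "arguments": json.dumps(block["input"])},
        }
        for block in content_blocks
        if block.get("type") == "tool_use"
    ]
```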
The exercise of building a function calling test framework has been a fruitful one. It’s given us a much deeper grasp of the strengths and weaknesses of the current ecosystem’s top models, and it’s unveiled several real-world takeaways that we’ve already put to use in our other work-streams (e.g. using an LLM to test GPTScript). To us, the results indicate a real gap in performance between OpenAI and the other providers, which supports our initial decision to build GPTScript around OpenAI’s models. They’ve also made it clear who the best providers are and that their models keep getting better (e.g. gpt-4o vs gpt-4-turbo and claude-3.5-sonnet vs claude-3-opus).
If you’ve found this post interesting, you may want to check out the FCTS repo and give it a spin for yourself. Feel free to join our Discord server to chat with us about it too!