Unlocking Higher Accuracy using Iterative Prompting with GPT-4o mini

GPT-4o mini represents a significant leap forward in deploying GenAI within our apps. It offers substantial cost savings (60% cheaper than GPT-3.5 Turbo!) while still providing a solid base model, which leaves headroom to apply advanced prompting techniques that enhance performance. My tests over the weekend show that, with such techniques, GPT-4o mini can outperform GPT-4 Turbo.

This, I believe, is where the real power of GPT-4o mini lies—leveraging this model to explore multi-step interactions that refine and enhance the outputs.

Take, for example, the Monte Carlo Tree Self-Refine (MCTSR) technique developed by Di Zhang, Xiaoshui Huang, Dongzhan Zhou, and Yuqiang Li at the Shanghai Artificial Intelligence Laboratory. They took a smaller model, Llama-3-8B, and got it to perform at levels comparable to GPT-4 by building on initial responses and refining them through iterative prompting, applying principles from Monte Carlo Tree Search (the same high-level idea used in AlphaGo).
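To make the idea concrete, here is a minimal sketch of the MCTSR loop: each tree node holds a candidate answer, selection walks the tree by UCT, and each expansion asks the model to critique and rewrite the selected answer before back-propagating a self-evaluated reward. This is a simplified illustration, not the paper's exact algorithm; the `ask` and `score` callables are hypothetical stand-ins for chat-completion calls.

```python
import math

# Minimal sketch of Monte Carlo Tree Self-Refine (MCTSR).
# A node is one candidate answer; its children are refined rewrites of it.

class Node:
    def __init__(self, answer, parent=None):
        self.answer = answer
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

    def uct(self, c=1.41):
        # Unvisited children are explored first.
        if self.visits == 0:
            return float("inf")
        exploit = self.total_reward / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits + 1) / self.visits)
        return exploit + explore

def mctsr(question, ask, score, rollouts=8):
    """Iteratively refine an answer: select a node by UCT, have the model
    critique and rewrite it, score the rewrite, and back-propagate the reward.
    `ask(prompt)` and `score(answer)` are caller-supplied (e.g. LLM calls)."""
    root = Node(ask(f"Answer: {question}"))
    for _ in range(rollouts):
        # Selection: walk down by UCT until we reach a leaf.
        node = root
        while node.children:
            node = max(node.children, key=lambda n: n.uct())
        # Self-reflect + refine: critique the current answer, then rewrite it.
        critique = ask(f"Critique this answer to '{question}': {node.answer}")
        refined = ask(f"Rewrite the answer using this critique: {critique}")
        child = Node(refined, parent=node)
        node.children.append(child)
        # Self-evaluate the rewrite and back-propagate the reward to the root.
        reward = score(refined)
        while child:
            child.visits += 1
            child.total_reward += reward
            child = child.parent
    # Return the best-scoring answer found anywhere in the tree.
    best, stack = root, [root]
    while stack:
        n = stack.pop()
        stack.extend(n.children)
        if n.visits and n.total_reward / n.visits > best.total_reward / max(best.visits, 1):
            best = n
    return best.answer
```

In practice `ask` would wrap a chat-completions call with `model="gpt-4o-mini"`, and `score` would be another model call that grades the answer (the paper uses the model's own self-evaluation).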

Over the past weekend, my experimentation with GPT-4o mini and the MCTSR technique revealed some interesting findings: GPT-4o mini, with iterative prompting, surpasses GPT-4 Turbo on several math benchmarks.

Benchmark             GPT-4o mini + MCTSR    GPT-4 Turbo
GSM8K                 96.7%                  93% (few-shot, k = 5, CoT)
GAIC (MathOdyssey)    56.3%                  49.1%
MATH                  80.3%                  73.4%

Here are the results I got with GPT-4o mini on some of the other benchmarks used in the MCTSR paper:

Benchmark        GPT-4o mini + MCTSR
GSM-Hard         63.4%
AIME             37.8%
OlympiadBench    30.4%

My tests cost about $140 in total, which still works out more economical than running the same benchmarks against GPT-4 Turbo.

Here are the question counts for each benchmark attempted:

Benchmark        Question count
GSM8K            1319
GSM-Hard         1319
GAIC             389
MATH             5000
AIME             933
OlympiadBench    1275

Code:
All the code and JSONs from my benchmark runs with gpt-4o-mini are saved at https://github.com/SidU/MathBlackBox.

Summary:
The ability of GPT-4o mini to tackle complex problems iteratively highlights its potential. Start simple, but consider iterative prompting techniques when you need better reasoning on complex problems and domains. If you do use iterative prompting, make sure your UX notifies users asynchronously when reasoning is complete, rather than making them wait on your webpage/view for results: the model emits tokens to 'think' and 'reflect', which takes time.
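The async UX point above can be sketched as a simple fire-and-forget pattern: the request handler returns immediately and a background task notifies the user when reasoning finishes. This is a minimal illustration with asyncio; `run_mctsr` and `notify_user` are hypothetical stand-ins for the slow reasoning job and your notification channel (webhook, push, email, etc.).

```python
import asyncio

async def run_mctsr(question):
    # Stands in for many slow model round-trips of iterative prompting.
    await asyncio.sleep(0.01)
    return f"answer to {question}"

async def notify_user(user_id, answer):
    # Stands in for a webhook / push notification to the user.
    print(f"notify {user_id}: {answer}")

async def reason_and_notify(user_id, question):
    answer = await run_mctsr(question)
    await notify_user(user_id, answer)

async def handle_request(user_id, question):
    # Acknowledge immediately; the reasoning continues in the background.
    asyncio.create_task(reason_and_notify(user_id, question))
    return "Working on it - we'll ping you when the answer is ready."

async def main():
    ack = await handle_request("u1", "2+2?")
    print(ack)               # the user sees this right away
    await asyncio.sleep(0.1)  # keep the loop alive for the demo

asyncio.run(main())
```

In a real service the background task would run in a job queue or worker process rather than the request's event loop, but the shape is the same: acknowledge fast, notify when done.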

Learn more:
Monte Carlo Tree Self-Refine paper
YouTube video from Trelis Research with an explanation of the above paper
Great overview of Monte Carlo Tree Search
