GPT-4o mini represents a significant leap forward in deploying GenAI within our apps: it offers a solid base model at a substantial discount (60% cheaper than GPT-3.5 Turbo). That cost headroom lets us apply advanced prompting techniques to boost performance, and my tests over the weekend show that GPT-4o mini can outperform GPT-4 Turbo.
This, I believe, is where the real power of GPT-4o mini lies—leveraging this model to explore multi-step interactions that refine and enhance the outputs.
Take, for example, the Monte Carlo Tree Self-Refine (MCTSR) technique developed by Di Zhang, Xiaoshui Huang, Dongzhan Zhou, and Yuqiang Li at the Shanghai Artificial Intelligence Lab. They took a smaller model, Llama-3-8B, and got it to perform at levels comparable to GPT-4 by building on initial responses and refining them through iterative prompting, applying principles from Monte Carlo Tree Search (the same high-level idea used in AlphaGo).
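To make the idea concrete, here is a minimal sketch of the select-critique-refine loop at the heart of MCTSR. This is a simplified illustration, not the paper's exact algorithm: the `llm` callable is a hypothetical stand-in for whatever model API you use, and the scoring prompt is an assumption for demonstration purposes.

```python
import math

def mctsr(question, llm, iterations=8, c=1.41):
    """Simplified MCTS self-refine loop: each node is a candidate answer;
    expansion critiques the selected answer and rewrites it."""
    # Root node: a first draft answer to the question.
    nodes = [{"answer": llm(f"Answer: {question}"), "visits": 0, "reward": 0.0}]
    for _ in range(iterations):
        total = sum(n["visits"] for n in nodes) + 1

        # Selection: UCT balances exploiting high-scoring answers
        # against exploring less-visited ones.
        def uct(n):
            if n["visits"] == 0:
                return float("inf")
            exploit = n["reward"] / n["visits"]
            explore = c * math.sqrt(math.log(total) / n["visits"])
            return exploit + explore

        node = max(nodes, key=uct)

        # Expansion: self-critique, then rewrite guided by the critique.
        critique = llm(f"Critique this answer to '{question}': {node['answer']}")
        refined = llm(f"Rewrite the answer using this critique: {critique}")

        # Evaluation: have the model score the refined answer 0-10.
        score = float(llm(f"Score 0-10 (number only): {refined}"))

        # Backpropagation: credit the parent and register the new candidate.
        node["visits"] += 1
        node["reward"] += score
        nodes.append({"answer": refined, "visits": 1, "reward": score})

    # Return the candidate with the best average score.
    return max(nodes, key=lambda n: n["reward"] / max(n["visits"], 1))["answer"]
```

The extra quality comes purely from spending more inference-time tokens on critique and refinement, which is why a cheap model like GPT-4o mini makes this trade-off attractive.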
Over the past weekend, my experimentation with GPT-4o-mini and the MCTSR technique revealed some interesting findings: GPT-4o-mini, with iterative prompting, surpasses GPT-4 Turbo.
| Benchmark | GPT-4o mini + MCTSR | GPT-4 Turbo |
| --- | --- | --- |
| GSM8K | 96.7% | 93% (few-shot, k = 5, CoT) |
| GAIC (MathOdyssey) | 56.3% | 49.1% |
| MATH | 80.3% | 73.4% |
Here are the results I got with GPT-4o mini on some of the other benchmarks used in the MCTSR paper:
| Benchmark | GPT-4o mini + MCTSR |
| --- | --- |
| GSM-Hard | 63.4% |
| AIME | 37.8% |
| OlympiadBench | 30.4% |
Even though my tests cost about $140, that is still more economical than running the same benchmarks on GPT-4 Turbo.

Here are the counts of the questions that were attempted:
| Benchmark | Question count |
| --- | --- |
| GSM8K | 1319 |
| GSM-Hard | 1319 |
| GAIC | 389 |
| MATH | 5000 |
| AIME | 933 |
| OlympiadBench | 1275 |
Code:
All the code and JSONs from my benchmark runs with GPT-4o mini are saved at https://github.com/SidU/MathBlackBox.
Summary:
The ability of GPT-4o mini to tackle complex problems iteratively highlights its potential. Start simple, but consider iterative prompting techniques for better reasoning in complex problems and domains. If you do use iterative prompting, make sure your UX notifies users asynchronously when reasoning is complete rather than making them wait on your webpage/view for results: the model emits tokens to 'think' and 'reflect', which takes some time.
Learn more:
Monte Carlo Tree – Self Refine paper
YouTube video from Trelis Research with an explanation of the above paper
Great overview of Monte Carlo Tree Search
