Model Evaluation
Comprehensive evaluation of AI models tested for production use in educational apps in Ghana. Focus areas include tool-calling reliability, formatting quality, cost-effectiveness, and instruction following.
Benchmark Comparisons
Benchmarks are embedded inline where available, with a pop-out link for each.
- UC Berkeley's Gorilla ToolBench leaderboard for complex tool-use tasks.
- Interactive model comparison with throughput, cost, and latency views.
- Marketplace of models with live pricing and usage routes.
Model Comparison
| Model | Status | Tool Calling | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) |
|---|---|---|---|---|
| Claude Sonnet 4.5 | baseline | N/A | $3 | $15 |
| Kimi K2 | selected | High - best among cheap models | N/A | N/A |
| Gemini 2.5 Flash: Preview | disqualified | Poor - ~4% failure rate | N/A | N/A |
| Qwen | under evaluation | Good | N/A | N/A |
| GPT-5 Mini | rejected | Good, but unusable due to latency | N/A | N/A |
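For context on what the cost columns mean in practice, here is a minimal sketch of converting the baseline's per-million-token pricing into per-interaction spend. The token counts and interaction volume are hypothetical placeholders, not measured values.

```python
# Rough cost estimate per interaction, using the baseline pricing above.
INPUT_COST_PER_M = 3.00    # USD per 1M input tokens (baseline)
OUTPUT_COST_PER_M = 15.00  # USD per 1M output tokens (baseline)

def interaction_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single model interaction."""
    return (input_tokens / 1_000_000) * INPUT_COST_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_COST_PER_M

# Example: a tutoring exchange with a 2,000-token prompt and a 500-token reply.
cost = interaction_cost(2_000, 500)
print(f"~${cost:.4f} per interaction, ~${cost * 10_000:.2f} per 10k interactions")
```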
Individual Model Pages
Claude Sonnet 4.5
Baseline/reference model used for comparison. Strong natural formatting abilities due to fine-tuning and training data.
Kimi K2
Selected as the primary model for production. Best tool-calling ability among low-cost models. Requires explicit formatting instructions to match Claude's natural formatting; a sketch of such instructions follows.
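The sketch below illustrates the kind of explicit formatting instructions Kimi K2 needs. The system prompt text, the OpenRouter base URL, and the `moonshotai/kimi-k2` model id are assumptions for illustration, not the production configuration.

```python
# Illustrative only: a system prompt that spells out formatting rules the model
# does not apply by default. Model id and endpoint are assumptions; adjust for
# whichever provider or router hosts the model in production.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")

FORMATTING_RULES = (
    "Format every answer with short paragraphs, bulleted lists for steps, "
    "and bold key terms. Never return a single unbroken block of text."
)

response = client.chat.completions.create(
    model="moonshotai/kimi-k2",  # assumed model id; verify with your provider
    messages=[
        {"role": "system", "content": FORMATTING_RULES},
        {"role": "user", "content": "Explain photosynthesis to a junior high student."},
    ],
)
print(response.choices[0].message.content)
```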
Gemini 2.5 Flash: Preview
Disqualified outright due to an unacceptable tool-calling failure rate (~4%, roughly 1 in 25 interactions).
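A failure rate like this can be measured with a simple repetition harness. The sketch below is illustrative only; `run_interaction` and `called_expected_tool` are hypothetical stand-ins for the actual test client, not the harness used here.

```python
# Minimal sketch of a tool-calling reliability check: run the same tool-use
# prompt N times and count how often the model fails to emit a valid tool call.

def run_interaction(prompt: str) -> dict:
    """Send one prompt to the model under test and return its raw response."""
    raise NotImplementedError  # wire up to the provider SDK in a real harness

def called_expected_tool(response: dict, tool_name: str) -> bool:
    """True if the response contains a well-formed call to `tool_name`."""
    calls = response.get("tool_calls") or []
    return any(c.get("name") == tool_name for c in calls)

def failure_rate(prompt: str, tool_name: str, trials: int = 100) -> float:
    failures = 0
    for _ in range(trials):
        try:
            if not called_expected_tool(run_interaction(prompt), tool_name):
                failures += 1
        except Exception:  # malformed JSON, schema violations, transport errors
            failures += 1
    return failures / trials

# A rate around 0.04 (1 failure in 25 interactions) was treated as disqualifying.
```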
Qwen
Still under evaluation. Generally performs well but requires refinement to control unwanted behaviors.
GPT-5 Mini
Tested but rejected due to latency: Time to First Token averaged 25 seconds, and 2 of 5 test responses hit network errors.
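For reference, Time to First Token can be measured over a streaming response roughly as follows. This sketch assumes an OpenAI-compatible chat completions endpoint and uses placeholder model and prompt values; it is not the exact procedure behind the figure above.

```python
# Rough TTFT measurement over a streaming chat completion.
import time
from openai import OpenAI

client = OpenAI(api_key="...")

def time_to_first_token(model: str, prompt: str) -> float:
    """Return seconds from request start until the first content chunk arrives."""
    start = time.monotonic()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.monotonic() - start
    return float("inf")  # no content ever arrived (e.g. network error)

# Average over several runs to get a stable TTFT estimate.
```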