
Model Evaluation

Comprehensive evaluation of AI models tested for production use in Ghanaian educational apps. Focus areas are tool-calling reliability, formatting quality, cost-effectiveness, and instruction following.

Benchmark Comparisons

Benchmarks are embedded inline where available, with a pop-out link for each.

Gorilla ToolBench leaderboard for complex tool-use tasks from UC Berkeley.

Interactive model comparison with throughput, cost, and latency views.

Marketplace of models with live pricing and usage routes.

Model Comparison

Model | Status | Tool Calling | Input Cost (USD per million tokens) | Output Cost (USD per million tokens)
Claude Sonnet 4.5 | baseline | N/A | $3 | $15
Kimi K2 | selected | High - best among cheap models | N/A | N/A
Gemini 2.5 Flash: Preview | disqualified | Poor - ~4% failure rate | N/A | N/A
Qwen | under evaluation | Good | N/A | N/A
GPT-5 Mini | rejected | Good but unusable due to latency | N/A | N/A
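
To make the per-token pricing concrete, here is a minimal sketch of how a per-request cost could be estimated from the baseline prices listed for Claude Sonnet 4.5 ($3 per million input tokens, $15 per million output tokens). The `Pricing` class, the `request_cost_usd` helper, and the token counts are hypothetical and chosen only for illustration; they are not part of the evaluation itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in USD per one million tokens (hypothetical helper type)."""
    input_usd_per_million: float
    output_usd_per_million: float


def request_cost_usd(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    total = (
        input_tokens * pricing.input_usd_per_million
        + output_tokens * pricing.output_usd_per_million
    )
    return total / 1_000_000


# Baseline prices taken from the comparison table.
CLAUDE_SONNET_4_5 = Pricing(input_usd_per_million=3.0, output_usd_per_million=15.0)

# Assumed request size: 2,000 input tokens and 500 output tokens.
cost = request_cost_usd(CLAUDE_SONNET_4_5, input_tokens=2_000, output_tokens=500)
print(f"${cost:.4f}")  # prints "$0.0135"
```

Because output tokens cost five times as much as input tokens at these prices, the 500 output tokens in this assumed request account for more of the total than the 2,000 input tokens.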

Individual Model Pages