Model Evaluation
Comprehensive evaluation of AI models tested for production use in educational apps in Ghana. Focus areas include tool-calling reliability, formatting quality, cost-effectiveness, and instruction following.
Benchmark Comparisons
Benchmarks are embedded inline where available, with a pop-out link for each.
- UC Berkeley's Gorilla ToolBench leaderboard for complex tool-use tasks.
- Interactive model comparison with throughput, cost, and latency views.
- Marketplace of models with live pricing and usage routes.
Model Comparison
| Model | Status | Tool Calling | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) |
|---|---|---|---|---|
| Claude Sonnet 4.5 | baseline | N/A | $3 | $15 |
| Kimi K2 | selected | High - best among cheap models | N/A | N/A |
| Gemini 2.5 Flash: Preview | disqualified | Poor - ~4% failure rate | N/A | N/A |
| Qwen | under evaluation | Good | N/A | N/A |
| GPT-5 Mini | rejected | Good, but unusable due to latency | N/A | N/A |
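For context on what the cost columns mean in practice, here is a minimal sketch of converting the baseline's per-million-token pricing into per-interaction spend. The token counts and interaction volume are hypothetical placeholders, not measured values.

```python
# Rough cost estimate per interaction, using the baseline pricing above.
INPUT_COST_PER_M = 3.00    # USD per 1M input tokens (baseline)
OUTPUT_COST_PER_M = 15.00  # USD per 1M output tokens (baseline)

def interaction_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single model interaction."""
    return (input_tokens / 1_000_000) * INPUT_COST_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_COST_PER_M

# Example: a tutoring exchange with a 2,000-token prompt and a 500-token reply.
cost = interaction_cost(2_000, 500)
print(f"~${cost:.4f} per interaction, ~${cost * 10_000:.2f} per 10k interactions")
```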
Individual Model Pages
Claude Sonnet 4.5
Baseline/reference model used for comparison. Strong natural formatting abilities due to fine-tuning and training data.
Kimi K2
Selected as the primary model for production. Best tool-calling ability among low-cost models. Requires explicit formatting instructions to match Claude's natural formatting; a sketch of such instructions follows.
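The sketch below illustrates the kind of explicit formatting instructions Kimi K2 needs. The system prompt text, the OpenRouter base URL, and the `moonshotai/kimi-k2` model id are assumptions for illustration, not the production configuration.

```python
# Illustrative only: a system prompt that spells out formatting rules the model
# does not apply by default. Model id and endpoint are assumptions; adjust for
# whichever provider or router hosts the model in production.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")

FORMATTING_RULES = (
    "Format every answer with short paragraphs, bulleted lists for steps, "
    "and bold key terms. Never return a single unbroken block of text."
)

response = client.chat.completions.create(
    model="moonshotai/kimi-k2",  # assumed model id; verify with your provider
    messages=[
        {"role": "system", "content": FORMATTING_RULES},
        {"role": "user", "content": "Explain photosynthesis to a junior high student."},
    ],
)
print(response.choices[0].message.content)
```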
Gemini 2.5 Flash: Preview
Disqualified outright due to an unacceptable tool-calling failure rate (~4%, roughly 1 in 25 interactions).
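A failure rate like this can be measured with a simple repetition harness. The sketch below is illustrative only; `run_interaction` and `called_expected_tool` are hypothetical stand-ins for the actual test client, not the harness used here.

```python
# Minimal sketch of a tool-calling reliability check: run the same tool-use
# prompt N times and count how often the model fails to emit a valid tool call.

def run_interaction(prompt: str) -> dict:
    """Send one prompt to the model under test and return its raw response."""
    raise NotImplementedError  # wire up to the provider SDK in a real harness

def called_expected_tool(response: dict, tool_name: str) -> bool:
    """True if the response contains a well-formed call to `tool_name`."""
    calls = response.get("tool_calls") or []
    return any(c.get("name") == tool_name for c in calls)

def failure_rate(prompt: str, tool_name: str, trials: int = 100) -> float:
    failures = 0
    for _ in range(trials):
        try:
            if not called_expected_tool(run_interaction(prompt), tool_name):
                failures += 1
        except Exception:  # malformed JSON, schema violations, transport errors
            failures += 1
    return failures / trials

# A rate around 0.04 (1 failure in 25 interactions) was treated as disqualifying.
```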
Qwen
Still under evaluation. Generally performs well but requires refinement to control unwanted behaviors.
GPT-5 Mini
Tested but rejected due to latency: Time to First Token averaged 25 seconds, and 2 of 5 test responses hit network errors.
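For reference, Time to First Token can be measured over a streaming response roughly as follows. This sketch assumes an OpenAI-compatible chat completions endpoint and uses placeholder model and prompt values; it is not the exact procedure behind the figure above.

```python
# Rough TTFT measurement over a streaming chat completion.
import time
from openai import OpenAI

client = OpenAI(api_key="...")

def time_to_first_token(model: str, prompt: str) -> float:
    """Return seconds from request start until the first content chunk arrives."""
    start = time.monotonic()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.monotonic() - start
    return float("inf")  # no content ever arrived (e.g. network error)

# Average over several runs to get a stable TTFT estimate.
```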