Playlab

Technical Details

Technical implementation details covering tool calling, formatting standards, error handling, and performance metrics. This section provides deep technical insights for engineers working on similar systems.

Tool Calling Implementation

Tool Call Format

Tool calls must follow the platform's expected format exactly. When prompt-wide formatting rules are allowed to apply to tool functions, models attempt to bold or otherwise format their tool calls, which breaks functionality.

In the UI, tool calls should appear as a collapsed "tools" indicator, revealing the underlying operation only when clicked. Raw JSON should never be displayed to users.
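
As a rough illustration, assuming an OpenAI-style function-calling schema (Playlab's exact schema may differ, and the tool name here is hypothetical), a well-formed call is plain JSON with no formatting applied:

// Minimal sketch of a well-formed tool call. The payload is plain JSON:
// no markdown, no bold, no XML tags wrapped around it.
const toolCall = {
  name: "search_references", // hypothetical tool name
  arguments: { query: "Year 2 subtraction strategies" },
};
// The UI renders this as a collapsed "tools" chip; users never see the JSON.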

Tool Call Reliability

Kimi K2: High reliability; the best among low-cost models. Occasional retries are needed, but it self-corrects.

Gemini 2.5 Flash: ~4% failure rate. It attempted to use XML tags (an incorrect method) and displayed raw JSON to users. Disqualified.

Claude Sonnet 4.5: Excellent reliability, but high cost and restrictive rate limits.

Tool Call Workflow

Critical requirement: tool calls MUST be executed before any step that needs reference content. This prevents models from responding as if they had read references when they haven't. A sketch of enforcing this order follows the steps below.

  1. User requests information
  2. Model calls search/reference tool
  3. Tool executes and returns results
  4. Model uses retrieved content in response
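
A minimal sketch of this workflow in TypeScript; callModel and executeTool are hypothetical helpers standing in for the actual model and tool runtime, not a real API:

type ToolCall = { name: string; arguments: Record<string, unknown> };
type Message = { role: "user" | "tool"; content: string };
type ModelTurn = { text?: string; toolCalls?: ToolCall[] };

declare function callModel(messages: Message[]): Promise<ModelTurn>;
declare function executeTool(call: ToolCall): Promise<string>;

// Resolve every tool call before letting the model answer, so reference
// content is always retrieved before it is cited in the response.
async function respond(userMessage: string): Promise<string> {
  const messages: Message[] = [{ role: "user", content: userMessage }];
  let turn = await callModel(messages);
  while (turn.toolCalls?.length) {
    const results = await Promise.all(turn.toolCalls.map(executeTool));
    messages.push({ role: "tool", content: JSON.stringify(results) });
    turn = await callModel(messages);
  }
  return turn.text ?? "";
}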

Formatting Standards

Text Generation Rules

Formatting rules apply to text generation only, NOT to tool calls (a sample prompt clause scoping them follows the list):

  • Use clear markdown headings
  • Bullet points and numbered lists where appropriate
  • Horizontal lines for section separation
  • Adequate spacing for readability
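
A clause like the following scopes the rules explicitly; the wording is illustrative, not Playlab's actual prompt:

  FORMATTING RULES (apply to user-facing text ONLY, never to tool calls):
  - Use clear markdown headings, bullet points, and numbered lists.
  - Separate sections with horizontal lines; keep spacing readable.
  - Never apply bold, markdown, or XML tags to tool call payloads.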

Question Formatting

  • All questions must appear on a new line
  • All questions must be in bold
  • Maximum 2 questions at a time
  • Rationale: Bold formatting guides the eye to important prompts (see the example after this list)
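
For example, a compliant turn ends with at most two bolded questions, each on its own line:

  I can build a weekly plan around that.

  **Which topic is the focus this week?**
  **How many minutes is each session?**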

Mathematical Notation

For Mathematics, Intervention Math, and Economics apps:

  • Use Unicode symbols, NOT LaTeX or Markdown math
  • Reason: Preserves formatting across different word processors
  • When users copy-paste into Word or Google Docs, LaTeX/Markdown often breaks
  • Unicode maintains a consistent appearance across platforms (see the comparison below)
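
For example:

  LaTeX (often breaks when pasted into Word or Google Docs): \frac{3}{4}x^{2} + \sqrt{16}
  Unicode (renders consistently after copy-paste): ¾x² + √16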

Verbosity Control

A recurring instruction throughout the prompts: "Do not be overly verbose; state only what is necessary." This reduces token usage and improves the user experience.

Error Handling and Recovery

Failed Tool Call Recovery

Kimi K2 demonstrates good error recovery: when a tool call fails the first time, it self-corrects and calls the tool again. Observed pattern:

Model attempts tool call → Tool call fails → Model recognizes failure → Model retries with corrected format → Success
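
A sketch of that retry loop; executeTool and reformatToolCall are hypothetical helpers, and the attempt limit is an assumption:

type ToolCall = { name: string; arguments: Record<string, unknown> };
declare function executeTool(call: ToolCall): Promise<string>;
declare function reformatToolCall(call: ToolCall): ToolCall;

// Retry a failed tool call a bounded number of times, correcting the
// format between attempts, mirroring the self-correction seen in Kimi K2.
async function callWithRetry(call: ToolCall, maxAttempts = 3): Promise<string> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await executeTool(call);
    } catch (err) {
      lastError = err;
      call = reformatToolCall(call); // correct the format before retrying
    }
  }
  throw lastError;
}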

Common Error Patterns

  • XML tag usage: Some models attempt to use XML tags for tool calls (incorrect method)
  • Raw JSON display: Models sometimes display raw JSON as the first message to users (terrible UX)
  • Formatting in tool calls: Models attempt to bold or format tool calls, breaking functionality
  • Reference hallucination: Models respond as if they read references when tool calls haven't been executed

Error Prevention Strategies

  • Explicit tool call format instructions
  • Clear separation between formatting rules and tool instructions
  • Workflow instructions requiring tool execution before reference
  • Lower variability settings for consistency
  • Model selection prioritizing tool calling reliability (a sample configuration follows this list)
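
A hypothetical configuration tying these strategies together; the field names follow common LLM API conventions but are assumptions here, not Playlab's actual settings:

const agentConfig = {
  model: "kimi-k2",   // selected for tool-calling reliability
  temperature: 0.2,   // lower variability for consistent tool calls
  systemPrompt: [
    "Tool calls must use the exact JSON format specified; never apply formatting to them.",
    "Formatting rules apply to user-facing text generation only.",
    "Execute the reference tool BEFORE any step that cites reference content.",
  ].join("\n"),
};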

Performance Metrics

Latency Metrics

GPT-5 Mini: Time to First Token averaging 25 seconds (unacceptable)

Kimi K2: Acceptable latency for real-time use

Claude Sonnet 4.5: Good latency but rate limits at scale

Token Usage

  • Progressive disclosure: 30-40% token reduction
  • Format specification upfront: 15-20% reduction in rework
  • Smart defaults: 20-25% reduction in turns
  • Response length controls: 30-40% token reduction
  • Combined potential impact: 50-60% reduction in operational costs (see the worked example after this list)
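
These reductions do not add linearly. A minimal sketch, assuming two of the larger optimizations are independent and therefore combine multiplicatively, shows how the 50-60% figure is plausible:

// Two independent ~35% token reductions combine to ~58%, not 70%.
const reductions = [0.35, 0.35]; // e.g., progressive disclosure + length controls
const combined = 1 - reductions.reduce((kept, r) => kept * (1 - r), 1);
console.log(`combined reduction ≈ ${(combined * 100).toFixed(0)}%`); // ≈ 58%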

Reliability Metrics

Tool calling success rate:

  • Kimi K2: High (best among cheap models)
  • Claude Sonnet 4.5: Excellent
  • Gemini 2.5 Flash: ~96% success (≈4% failure rate; unacceptable)

Network errors: GPT-5 Mini had network errors in 2/5 responses, likely due to input token limit issues.

Code Examples

Prompt Structure Example

{
  "context": {
    "year": "Year 2",
    "subject": "Intervention Math",
    "learningPlanType": "Weekly learning activities"
  },
  "workflow": [
    {
      "step": 1,
      "action": "Gather information",
      "instructions": "Ask at most 2 questions, use bold formatting"
    }
  ],
  "guidelines": {
    "dok": "Integrate Depth of Knowledge levels",
    "formatting": "Use Unicode for math, markdown for text",
    "toolCalls": "Execute before referencing results"
  }
}