Key Learnings

Critical insights, recommendations, and best practices discovered through the optimization process, distilled into actionable takeaways for improving prompt engineering, RAG systems, and model selection.

Critical Insights

Model Selection Criteria

  • Tool calling reliability is non-negotiable: Even a 4% failure rate (as observed with Gemini 2.5 Flash) is unacceptable for production use.
  • Formatting consistency requires explicit instructions: Models like Kimi K2 need more guidance than Claude Sonnet 4.5, but can achieve similar results with proper prompt engineering.
  • Cost vs. capability trade-offs: Kimi K2 provides excellent tool calling at low cost, making it ideal for production despite requiring more explicit formatting instructions.
  • Latency matters: GPT-5 Mini's 25-second time to first token (TTFT) makes it unsuitable for real-time educational applications.

Prompt Engineering Insights

  • Categorization improves comprehension: Organizing related guidelines together helps models understand how they relate and reduces the chance that individual guidelines are overlooked.
  • Repetition for emphasis works: Repeating critical instructions throughout the prompt increases the probability of compliance.
  • Separation of concerns: Formatting rules must be clearly separated from tool call instructions to prevent functional breakage (see the prompt sketch after this list).
  • Lower variability improves consistency: This is especially important for Kimi K2 and other models that benefit from reduced temperature settings.
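
A minimal sketch of how these techniques can be combined in a system prompt; the section names, wording, and the reference-search instruction are illustrative, not the production prompt:

```python
# Illustrative system prompt skeleton: categorized sections, a repeated
# critical instruction, and formatting rules kept separate from tool rules.
CRITICAL = "Call the reference-search tool BEFORE citing any source."

SYSTEM_PROMPT = f"""
## Tool Use
{CRITICAL}
Wait for the tool result before continuing your answer.

## Formatting (never applies to tool calls)
- Bold each question; ask at most 2 questions at a time.
- Use Unicode for math (x², ½), never LaTeX or Markdown math.

## Response Style
Keep answers concise; expand only when the user asks.

Reminder: {CRITICAL}
"""
```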

RAG System Insights

  • Tool execution order matters: Source material must be retrieved before the model cites it, otherwise the model hallucinates references (see the sketch after this list).
  • Models can confuse internal knowledge with retrieved content: Explicit workflow instructions prevent this confusion.
  • Search References Tool improves consistency: Migrating from traditional RAG to the Search References Tool provides better control over retrieval.
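
One way to enforce that ordering programmatically; a sketch assuming an OpenAI-style message log, a tool named search_references, and a `[ref:` citation marker, all of which are hypothetical names rather than the project's actual conventions:

```python
def references_retrieved(messages: list[dict]) -> bool:
    """True only if a search_references tool call has already executed
    in this conversation (tool name is hypothetical)."""
    return any(
        m.get("role") == "tool" and m.get("name") == "search_references"
        for m in messages
    )

def guard_response(messages: list[dict], draft: str) -> str:
    # Block citation-style text when no retrieval actually happened,
    # forcing a retry instead of letting the model invent sources.
    if "[ref:" in draft and not references_retrieved(messages):
        raise RuntimeError("Draft cites references before retrieval ran")
    return draft
```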

Recommendations for Improvement

For Prompt Engineering

  • Use categorization to organize related guidelines
  • Repeat critical instructions multiple times
  • Separate formatting rules from tool call instructions
  • Specify question formatting requirements (bold, max 2 at a time)
  • Use Unicode for math notation instead of LaTeX/Markdown (a few examples follow this list)
  • Include explicit verbosity control instructions
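
Unicode renders reliably in plain chat interfaces where LaTeX or Markdown math may not. A few illustrative substitutions of the kind that could be listed in the prompt's formatting section:

```python
# Illustrative only: common LaTeX fragments and their Unicode equivalents.
UNICODE_MATH = {
    r"x^2": "x²",
    r"\frac{1}{2}": "½",
    r"\sqrt{x}": "√x",
    r"\pi": "π",
    r"\le": "≤",
}
```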

For RAG Systems

  • Always execute tool calls before referencing their results
  • Add confidence labels and source tagging to retrieved content (sketched after this list)
  • Implement retrieval verification steps
  • Lower variability for better instruction following
  • Provide explicit workflow instructions
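
A sketch of what source tagging plus a retrieval verification step could look like; the field names and the 0.75 confidence threshold are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    text: str
    source: str    # e.g. document title or URL
    score: float   # retrieval similarity score

def tag_chunks(chunks: list[RetrievedChunk], threshold: float = 0.75) -> list[str]:
    """Verify retrieval returned usable content, then label each chunk
    with its source and a coarse confidence band for the model to echo."""
    if not chunks:
        raise ValueError("Retrieval returned nothing; do not answer from memory")
    return [
        f"[source: {c.source} | confidence: "
        f"{'high' if c.score >= threshold else 'low'}]\n{c.text}"
        for c in chunks
    ]
```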

For Cost Optimization

  • Implement progressive disclosure to reduce token usage (see the sketch after this list)
  • Add format specification early in conversation flow
  • Create smart defaults for common patterns
  • Use response length controls
  • Migrate to cost-effective models with similar capabilities
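
A sketch of progressive disclosure combined with a hard response length cap, assuming an OpenAI-compatible client and a `kimi-k2` model identifier (both illustrative):

```python
def answer(client, question: str, detail: bool = False) -> str:
    # Progressive disclosure: short answer by default, full detail only
    # when the user explicitly asks, which cuts output tokens on most turns.
    style = ("Give the complete, detailed response." if detail
             else "Answer in at most 3 sentences; offer to expand.")
    resp = client.chat.completions.create(
        model="kimi-k2",                     # assumed model identifier
        max_tokens=1200 if detail else 300,  # hard response length control
        messages=[
            {"role": "system", "content": style},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```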

For User Experience

  • Do not provide answer keys automatically; ask the user to reupload or paste the assessment instead
  • Add an automatic MCQ linter that runs before responses are sent (a minimal version follows this list)
  • Standardize assessment blueprints per subject/level
  • Implement context persistence with confirmation shortcuts
  • Default to single optimal activity design with alternatives option
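
A minimal version of such a linter; the specific checks are illustrative and would be tuned to the app's question guidelines:

```python
import re

def lint_mcq(question: str, options: list[str], answer_key: set[str]) -> list[str]:
    """Return a list of problems found in a multiple-choice question;
    an empty list means the question passes the linter."""
    problems = []
    if not question.strip().endswith("?"):
        problems.append("Stem should be phrased as a question")
    if len(options) != 4:
        problems.append(f"Expected 4 options, got {len(options)}")
    if len(answer_key) != 1:
        problems.append("Exactly one option must be keyed correct")
    if len({o.strip().lower() for o in options}) != len(options):
        problems.append("Duplicate options detected")
    if any(re.search(r"all of the above", o, re.I) for o in options):
        problems.append("Avoid 'all of the above' distractors")
    return problems
```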

Technical Challenges and Solutions

Challenge: Tool Calling Failures

Problem: Gemini 2.5 Flash attempted to use XML tags for tool calls, breaking functionality and displaying raw JSON to users.

Solution: Disqualified Gemini and selected Kimi K2 for reliable tool calling. Implemented explicit tool call format instructions in prompts (illustrated below).
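
The kind of explicit instruction this refers to, as an illustrative prompt fragment (the production wording may differ):

```python
TOOL_FORMAT_RULE = (
    "When you need a tool, emit the call through the API's native "
    "tool-calling mechanism only. Never write XML tags such as "
    "<tool_call>, and never print raw JSON in your visible reply."
)
```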

Challenge: Reference Hallucination

Problem: Kimi K2 responds as though it has already read references even when the retrieval tool calls have not been executed.

Solution: Explicit workflow instructions requiring tool calls to be executed before referencing their results. Lower variability settings improve compliance.

Challenge: Formatting Inconsistencies

Problem: Kimi K2 does not format output as naturally as Claude Sonnet 4.5.

Solution: Added explicit formatting instructions including markdown rules, question formatting requirements, and Unicode for math notation.

Challenge: High Costs

Problem: Claude Sonnet 4.5 costs $3 per million input tokens and $15 per million output tokens, and hits rate limits at scale.

Solution: Migrated to Kimi K2 for cost-effective tool calling, implemented token reduction strategies (a 30-40% reduction), and added progressive disclosure (a rough cost illustration follows).
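
A rough illustration of the arithmetic, using the list prices above and assumed per-conversation token counts:

```python
# Claude Sonnet 4.5 list pricing: $3 per 1M input tokens, $15 per 1M output.
IN_RATE, OUT_RATE = 3 / 1_000_000, 15 / 1_000_000

def conversation_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * IN_RATE + output_tokens * OUT_RATE

baseline = conversation_cost(10_000, 2_000)  # $0.06 (assumed token counts)
reduced  = conversation_cost(6_500, 1_300)   # 35% fewer tokens -> $0.039
print(f"Token reduction alone saves {1 - reduced / baseline:.0%} per conversation")
```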

Best Practices Discovered

Prompt Engineering

  • Use clear categorization
  • Repeat critical instructions
  • Separate formatting from tool calls
  • Specify exact formatting requirements

Model Selection

  • Prioritize tool calling reliability
  • Consider cost vs. capability trade-offs
  • Test latency for real-time use cases
  • Evaluate formatting quality

RAG Implementation

  • Execute tool calls before referencing
  • Lower variability for consistency
  • Add explicit workflow instructions
  • Verify retrieval before use

Cost Optimization

  • Implement progressive disclosure
  • Add format specification early
  • Create smart defaults
  • Control response length

Actionable Takeaways for Anthropic Team

  • Claude's natural formatting is a competitive advantage: The system prompt and training data give Claude strong formatting abilities that other models struggle to match without explicit instructions.
  • Tool calling reliability is critical: Even small failure rates create terrible user experiences; models must handle tool calls correctly and consistently.
  • Prompt engineering can bridge capability gaps: With proper prompt engineering, cheaper models can match more expensive ones, though they require more explicit instructions.
  • RAG systems need explicit workflow control: Models can confuse internal knowledge with retrieved content. Explicit instructions about tool execution order are essential.
  • Cost optimization requires multiple strategies: Model migration, token reduction, and workflow optimization all contribute to significant cost savings.