Playlab

Technical Details

Technical implementation details covering tool calling, formatting standards, error handling, and performance metrics. This section provides deep technical insights for engineers working on similar systems.

Tool Calling Implementation

Tool Call Format

Tool calls must follow the platform's expected format exactly. When prompt-wide formatting rules are allowed to apply to tool functions, models attempt to bold or otherwise format their tool calls, which breaks functionality.

In the UI, tool calls should appear as a collapsed "tools" indicator, revealing the underlying operation only when clicked. Raw JSON should never be displayed to users.
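
As a rough illustration, assuming an OpenAI-style function-calling schema (Playlab's exact schema may differ, and the tool name here is hypothetical), a well-formed call is plain JSON with no formatting applied:

// Minimal sketch of a well-formed tool call. The payload is plain JSON:
// no markdown, no bold, no XML tags wrapped around it.
const toolCall = {
  name: "search_references", // hypothetical tool name
  arguments: { query: "Year 2 subtraction strategies" },
};
// The UI renders this as a collapsed "tools" chip; users never see the JSON.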

Tool Call Reliability

Kimi K2: High reliability; the best among low-cost models. Occasional retries are needed, but it self-corrects.

Gemini 2.5 Flash: ~4% failure rate. It attempted to use XML tags (an incorrect method) and displayed raw JSON to users. Disqualified.

Claude Sonnet 4.5: Excellent reliability, but high cost and restrictive rate limits.

Tool Call Workflow

Critical requirement: tool calls MUST be executed before any step that needs reference content. This prevents models from responding as if they had read references when they haven't. A sketch of enforcing this order follows the steps below.

  1. User requests information
  2. Model calls search/reference tool
  3. Tool executes and returns results
  4. Model uses retrieved content in response
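
A minimal sketch of this workflow in TypeScript; callModel and executeTool are hypothetical helpers standing in for the actual model and tool runtime, not a real API:

type ToolCall = { name: string; arguments: Record<string, unknown> };
type Message = { role: "user" | "tool"; content: string };
type ModelTurn = { text?: string; toolCalls?: ToolCall[] };

declare function callModel(messages: Message[]): Promise<ModelTurn>;
declare function executeTool(call: ToolCall): Promise<string>;

// Resolve every tool call before letting the model answer, so reference
// content is always retrieved before it is cited in the response.
async function respond(userMessage: string): Promise<string> {
  const messages: Message[] = [{ role: "user", content: userMessage }];
  let turn = await callModel(messages);
  while (turn.toolCalls?.length) {
    const results = await Promise.all(turn.toolCalls.map(executeTool));
    messages.push({ role: "tool", content: JSON.stringify(results) });
    turn = await callModel(messages);
  }
  return turn.text ?? "";
}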

Formatting Standards

Text Generation Rules

Formatting rules apply to text generation only, NOT to tool calls (a sample prompt clause scoping them follows the list):

  • Use clear markdown headings
  • Bullet points and numbered lists where appropriate
  • Horizontal lines for section separation
  • Adequate spacing for readability
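
A clause like the following scopes the rules explicitly; the wording is illustrative, not Playlab's actual prompt:

  FORMATTING RULES (apply to user-facing text ONLY, never to tool calls):
  - Use clear markdown headings, bullet points, and numbered lists.
  - Separate sections with horizontal lines; keep spacing readable.
  - Never apply bold, markdown, or XML tags to tool call payloads.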

Question Formatting

  • All questions must appear on a new line
  • All questions must be in bold
  • Maximum 2 questions at a time
  • Rationale: Bold formatting guides the eye to important prompts (see the example after this list)
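
For example, a compliant turn ends with at most two bolded questions, each on its own line:

  I can build a weekly plan around that.

  **Which topic is the focus this week?**
  **How many minutes is each session?**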

Mathematical Notation

For Mathematics, Intervention Math, and Economics apps:

  • Use Unicode symbols, NOT LaTeX or Markdown math
  • Reason: Preserves formatting across different word processors
  • When users copy-paste into Word or Google Docs, LaTeX/Markdown often breaks
  • Unicode maintains a consistent appearance across platforms (see the comparison below)
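
For example:

  LaTeX (often breaks when pasted into Word or Google Docs): \frac{3}{4}x^{2} + \sqrt{16}
  Unicode (renders consistently after copy-paste): ¾x² + √16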

Verbosity Control

A recurring instruction throughout the prompts: "Do not be overly verbose; state only what is necessary." This reduces token usage and improves the user experience.

Error Handling and Recovery

Failed Tool Call Recovery

Kimi K2 demonstrates good error recovery: when a tool call fails the first time, it self-corrects and calls the tool again. Observed pattern:

Model attempts tool call → Tool call fails → Model recognizes failure → Model retries with corrected format → Success
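
A sketch of that retry loop; executeTool and reformatToolCall are hypothetical helpers, and the attempt limit is an assumption:

type ToolCall = { name: string; arguments: Record<string, unknown> };
declare function executeTool(call: ToolCall): Promise<string>;
declare function reformatToolCall(call: ToolCall): ToolCall;

// Retry a failed tool call a bounded number of times, correcting the
// format between attempts, mirroring the self-correction seen in Kimi K2.
async function callWithRetry(call: ToolCall, maxAttempts = 3): Promise<string> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await executeTool(call);
    } catch (err) {
      lastError = err;
      call = reformatToolCall(call); // correct the format before retrying
    }
  }
  throw lastError;
}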

Common Error Patterns

  • XML tag usage: Some models attempt to use XML tags for tool calls (incorrect method)
  • Raw JSON display: Models sometimes display raw JSON as the first message to users (terrible UX)
  • Formatting in tool calls: Models attempt to bold or format tool calls, breaking functionality
  • Reference hallucination: Models respond as if they read references when tool calls haven't been executed

Error Prevention Strategies

  • Explicit tool call format instructions
  • Clear separation between formatting rules and tool instructions
  • Workflow instructions requiring tool execution before reference
  • Lower variability settings for consistency
  • Model selection prioritizing tool calling reliability (a sample configuration follows this list)
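
A hypothetical configuration tying these strategies together; the field names follow common LLM API conventions but are assumptions here, not Playlab's actual settings:

const agentConfig = {
  model: "kimi-k2",   // selected for tool-calling reliability
  temperature: 0.2,   // lower variability for consistent tool calls
  systemPrompt: [
    "Tool calls must use the exact JSON format specified; never apply formatting to them.",
    "Formatting rules apply to user-facing text generation only.",
    "Execute the reference tool BEFORE any step that cites reference content.",
  ].join("\n"),
};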

Performance Metrics

Latency Metrics

GPT-5 Mini: Time to First Token averaging 25 seconds (unacceptable)

Kimi K2: Acceptable latency for real-time use

Claude Sonnet 4.5: Good latency but rate limits at scale

Token Usage

  • Progressive disclosure: 30-40% token reduction
  • Format specification upfront: 15-20% reduction in rework
  • Smart defaults: 20-25% reduction in turns
  • Response length controls: 30-40% token reduction
  • Combined potential impact: 50-60% reduction in operational costs (see the worked example after this list)
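
These reductions do not add linearly. A minimal sketch, assuming two of the larger optimizations are independent and therefore combine multiplicatively, shows how the 50-60% figure is plausible:

// Two independent ~35% token reductions combine to ~58%, not 70%.
const reductions = [0.35, 0.35]; // e.g., progressive disclosure + length controls
const combined = 1 - reductions.reduce((kept, r) => kept * (1 - r), 1);
console.log(`combined reduction ≈ ${(combined * 100).toFixed(0)}%`); // ≈ 58%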

Reliability Metrics

Tool calling success rate:

  • Kimi K2: High (best among cheap models)
  • Claude Sonnet 4.5: Excellent
  • Gemini 2.5 Flash: ~96% success (≈4% failure rate; unacceptable)

Network errors: GPT-5 Mini had network errors in 2/5 responses, likely due to input token limit issues.

Code Examples

Prompt Structure Example

{
  "context": {
    "year": "Year 2",
    "subject": "Intervention Math",
    "learningPlanType": "Weekly learning activities"
  },
  "workflow": [
    {
      "step": 1,
      "action": "Gather information",
      "instructions": "Ask at most 2 questions, use bold formatting"
    }
  ],
  "guidelines": {
    "dok": "Integrate Depth of Knowledge levels",
    "formatting": "Use Unicode for math, markdown for text",
    "toolCalls": "Execute before referencing results"
  }
}