Troubleshoot AI Agents

Use this page to diagnose and fix common issues with AI agents, including deployment failures, runtime behavior problems, tool execution errors, and integration issues.

The Agentic Data Plane is supported on BYOC clusters running on AWS with Redpanda version 25.3 or later.

Deployment issues

Fix issues that prevent agents from connecting to required resources.

MCP server connection failures

Symptoms: Agent starts, but tools don't respond or they return connection errors.

Causes:

  • MCP server stopped or crashed after agent creation

  • Network connectivity issues between agent and MCP server

  • MCP server authentication or permission issues

Solution:

  1. Verify MCP server status in Agentic AI > Remote MCP.

  2. Check MCP server logs for errors.

  3. Restart the MCP server if needed.

  4. Verify agent has permission to access the MCP server.

Prevention:

  • Monitor MCP server health

  • Use appropriate retry logic in tools

Runtime behavior issues

Resolve problems with agent decision-making, tool selection, and response generation.

Agent not calling tools

Symptoms: Agent responds without calling any tools, or fabricates information instead of using tools.

Causes:

  • System prompt doesn’t clearly specify when to use tools

  • Tool descriptions are vague or missing

  • LLM model lacks sufficient reasoning capability

  • Max iterations is too low

Solution:

  1. Strengthen tool usage guidance in your system prompt:

    ALWAYS use get_order_status when customer mentions an order ID.
    NEVER respond about order status without calling the tool first.
  2. Review tool descriptions in your MCP server configuration.

  3. Use a more capable model from the supported list for your gateway.

  4. Increase max iterations if the agent is stopping before reaching tools.

Prevention:

  • Write explicit tool selection criteria in system prompts

  • Test agents with the systematic testing approach

  • Use models appropriate for your task complexity

Calling wrong tools

Symptoms: Agent selects incorrect tools for the task, or calls tools with invalid parameters.

Causes:

  • Tool descriptions are ambiguous or overlap

  • Too many similar tools confuse the LLM

  • System prompt doesn’t provide clear tool selection guidance

Solution:

  1. Make tool descriptions more specific and distinct (see the sketch after these steps).

  2. Add "when to use" guidance to your system prompt:

    Use get_order_status when:
    - Customer provides an order ID (ORD-XXXXX)
    - You need to check current order state
    
    Use get_shipping_info when:
    - Order status is "shipped"
    - Customer asks about delivery or tracking
  3. Reduce the number of tools you expose to the agent.

  4. Use subagents to partition tools by domain.
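
As an example of step 1, the following sketch contrasts ambiguous and distinct tool descriptions. The tools and wording are illustrative:

# Ambiguous: the descriptions overlap, so the LLM may pick either tool
description: "Get order information"
description: "Get order data"

# Distinct: each description states what the tool returns and when to use it
description: "Get the current status of one order by its order ID (ORD-XXXXX). Use when the customer asks about order state."
description: "Get carrier and tracking details for an order that has already shipped. Use when the customer asks about delivery."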

Prevention:

  • Follow tool design patterns in MCP Tool Patterns

  • Limit each agent to 10-15 tools maximum

  • Test boundary cases where multiple tools might apply

Stuck in loops or exceeding max iterations

Symptoms: Agent reaches max iterations without completing the task, or repeatedly calls the same tool with the same parameters.

Causes:

  • Tool returns errors that the agent doesn’t know how to handle

  • Agent doesn’t recognize when the task is complete

  • Tool returns incomplete data that prompts another call

  • System prompt encourages exhaustive exploration

Solution:

  1. Add completion criteria to your system prompt:

    When you have retrieved all requested information:
    1. Present the results to the user
    2. Stop calling additional tools
    3. Do not explore related data unless asked
  2. Add error handling guidance:

    If a tool fails after 2 attempts:
    - Explain what went wrong
    - Do not retry the same tool again
    - Move on or ask for user guidance
  3. Review tool output to ensure it signals completion clearly.

  4. Increase max iterations if the task legitimately requires many steps.

Prevention:

  • Design tools to return complete information in one call

  • Set max iterations appropriate for task complexity (see Why iterations matter)

  • Test with ambiguous requests that might cause loops

Making up information

Symptoms: Agent provides plausible-sounding answers without calling tools, or invents data when tools fail.

Causes:

  • System prompt doesn’t explicitly forbid fabrication

  • Agent treats tool failures as suggestions rather than requirements

  • Model is hallucinating due to lack of constraints

Solution:

  1. Add explicit constraints to your system prompt:

    Critical rules:
    - NEVER make up order numbers, tracking numbers, or customer data
    - If a tool fails, explain the failure - do not guess
    - If you don't have information, say so explicitly
  2. Test error scenarios by temporarily disabling tools.

  3. Use a more capable model that follows instructions better.

Prevention:

  • Include "never fabricate" rules in all system prompts

  • Test with requests that require unavailable data

  • Monitor Transcripts and the session topic for fabricated responses

Analyzing conversation patterns

Symptoms: Agent behavior is inconsistent or produces unexpected results.

Solution:

Review conversation history in Transcripts to identify problematic patterns:

  • Agents calling the same tool repeatedly: Indicates loop detection is needed

  • Large gaps between messages: Suggests tool timeout or slow execution

  • Agent responses without tool calls: Indicates a tool selection issue

  • Fabricated information: Suggests a missing "never make up data" constraint

  • Truncated early messages: Indicates the context window was exceeded

Analysis workflow:

  1. Use Inspector to reproduce the issue.

  2. Review full conversation including tool invocations.

  3. Identify where the agent's behavior diverged from what you expected.

  4. Check system prompt for missing guidance.

  5. Verify tool responses are formatted correctly.

Performance issues

Diagnose and fix issues related to agent speed and resource consumption.

Slow response times

Symptoms: Agent takes 10+ seconds to respond to simple queries.

Causes:

  • LLM model is slow (large context processing)

  • Too many tool calls in sequence

  • Tools themselves are slow (database queries, API calls)

  • Large context window from long conversation history

Solution:

  1. Use a faster, lower-latency model tier for simple queries and reserve larger models for complex reasoning.

  2. Review conversation history in the Inspector tab to identify unnecessary tool calls.

  3. Optimize tool implementations:

    1. Add caching where appropriate (see the caching sketch after these steps)

    2. Reduce query complexity

    3. Return only needed data (use pagination, filters)

  4. Clear the conversation history if the context is very large.
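
For the caching suggestion in step 3, the sketch below uses a Redpanda Connect in-memory cache around an HTTP lookup. It assumes the tool is implemented as a Connect processor pipeline; the cache label, key, and URL are illustrative:

cache_resources:
  - label: tool_cache
    memory:
      default_ttl: 60s  # Cached entries expire after one minute

processors:
  - cached:
      cache: tool_cache
      key: '${! json("order_id") }'  # One cache entry per order ID
      processors:
        - http:
            url: 'https://api.example.com/orders/${! json("order_id") }'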

Prevention:

  • Right-size model selection based on task complexity

  • Design tools to execute quickly (< 2 seconds ideal)

  • Set appropriate max iterations to prevent excessive exploration

  • Monitor token usage and conversation length

High token costs

Symptoms: Token usage is higher than expected, and costs are increasing rapidly.

Causes:

  • Max iterations configured too high

  • Agent making unnecessary tool calls

  • Large tool results filling context window

  • Long conversation history not being managed

  • Using expensive models for simple tasks

Solution:

  1. Review token usage in Transcripts.

  2. Lower max iterations for this agent.

  3. Optimize tool responses to return less data (see the schema sketch after these steps):

    Bad:  Return all 10,000 customer records
    Good: Return paginated results, 20 records at a time
  4. Add cost control guidance to system prompt:

    Efficiency guidelines:
    - Request only the data you need
    - Stop when you have enough information
    - Do not call tools speculatively
  5. Switch to a more cost-effective model for simple queries.

  6. Clear conversation history periodically in the Inspector tab.
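
For the pagination suggestion in step 3, a tool schema might expose page parameters like the following sketch, written in the same input_schema style used elsewhere on this page. Parameter names are illustrative:

input_schema:
  properties:
    customer_id:
      type: string
      description: "Customer whose records to return"
    page:
      type: integer
      description: "Page number, starting at 1"
    page_size:
      type: integer
      description: "Records per page (maximum 20)"
  required:
    - customer_id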

Prevention:

  • Set appropriate max iterations (10-20 for simple, 30-40 for complex)

  • Design tools to return minimal necessary data

  • Monitor token usage trends

  • See Cost calculation for guidance

Tool execution issues

Fix problems with timeouts, invalid parameters, and error responses.

Tool timeouts

Symptoms: Tools fail with timeout errors, or the agent receives incomplete results.

Causes:

  • External API is slow or unresponsive

  • Database query is too complex

  • Network latency between tool and external system

  • Tool processing large datasets in memory

Solution:

  1. Add timeout handling to tool implementation:

    http:
      url: https://api.example.com/data
      timeout: "5s"  # Set explicit timeout
  2. Optimize external queries:

    1. Add database indexes

    2. Reduce query scope

    3. Cache frequent queries

  3. Increase the tool timeout if the operation legitimately takes longer.

  4. Add retry logic for transient failures.
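
Combining steps 1 and 4: if the tool is implemented with the Redpanda Connect http processor, it supports retries alongside the explicit timeout. A minimal sketch with an illustrative URL:

http:
  url: https://api.example.com/data
  timeout: "5s"       # Fail fast instead of hanging
  retries: 3          # Retry transient failures
  retry_period: "1s"  # Wait between attempts
  backoff_on:
    - 429             # Back off when the API rate-limits requests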

Prevention:

  • Set explicit timeouts in all tool configurations

  • Test tools under load

  • Monitor external API performance

  • Design tools to fail fast on unavailable services

Invalid parameters

Symptoms: Tools return validation errors about missing or incorrectly formatted parameters.

Causes:

  • Tool schema doesn’t match implementation

  • Agent passes wrong data types

  • Required parameters not marked as required in schema

  • Agent misunderstands parameter purpose

Solution:

  1. Verify tool schema matches implementation:

    input_schema:
      properties:
        order_id:
          type: string  # Must match what the tool expects
          description: "Order ID in format ORD-12345"
      required:
        - order_id  # Mark mandatory parameters as required
  2. Add parameter validation to tools.

  3. Improve parameter descriptions in tool schema.

  4. Add examples to tool descriptions:

    description: |
      Get order status by order ID.
      Example: get_order_status(order_id="ORD-12345")

Prevention:

  • Write detailed parameter descriptions

  • Include format requirements and examples

  • Test tools with invalid inputs to verify error messages

  • Use JSON Schema validation in tool implementations

Tool returns errors

Symptoms: Tools execute but return error responses or unexpected data formats.

Causes:

  • External API returned error

  • Tool implementation has bugs

  • Data format changed in external system

  • Tool lacks error handling

Solution:

  1. Check tool logs in MCP server.

  2. Test the tool directly (outside the agent context).

  3. Verify external system is operational.

  4. Add error handling to tool implementation:

    processors:
      - try:
          - http:
              url: ${API_URL}
      - catch:
          - mapping: |
              root.error = "API unavailable: " + error()
  5. Update the agent's system prompt to handle this error type, as in the snippet below.
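
For step 5, the prompt can tell the agent how to react to the specific error string that the catch block above produces. A hypothetical snippet:

If a tool returns an error starting with "API unavailable":
- Tell the user the external system is temporarily unreachable
- Do not retry the same tool more than once
- Do not invent a substitute answer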

Prevention:

  • Implement comprehensive error handling in tools

  • Monitor external system health

  • Add retries for transient failures

  • Log all tool errors for analysis

Integration issues

Fix problems with external applications calling agents and pipeline-to-agent integration.

Agent card does not contain a URL

Symptoms: Pipeline fails with an error such as "agent card does not contain a URL" or "failed to init processor <no label> path root.pipeline.processors.0".

Causes:

  • The agent_card_url points to the base agent endpoint instead of the agent card JSON file

Solution:

The agent_card_url must point to the agent card JSON file, not the base agent endpoint.

Incorrect configuration:

processors:
  - a2a_message:
      agent_card_url: "https://your-agent-id.ai-agents.your-cluster-id.cloud.redpanda.com"
      prompt: "Analyze this transaction: ${!content()}"

Correct configuration:

processors:
  - a2a_message:
      agent_card_url: "https://your-agent-id.ai-agents.your-cluster-id.cloud.redpanda.com/.well-known/agent-card.json"
      prompt: "Analyze this transaction: ${!content()}"

The agent card is always available at /.well-known/agent-card.json according to the A2A protocol standard.

Prevention:

  • Always append /.well-known/agent-card.json to the agent endpoint URL

  • Test the agent card URL in a browser before using it in pipeline configuration

  • See Agent card location for details

Pipeline integration failures

Symptoms: Pipelines using a2a_message processor fail or timeout.

Causes:

  • Agent is not running or restarting

  • Agent timeout is too low for pipeline workload

  • Authentication issues between pipeline and agent

  • High event volume overwhelming agent

Solution:

  1. Check agent status and resource allocation.

  2. Increase agent resource tier for high-volume pipelines.

  3. Add error handling in pipeline:

    processors:
      - try:
          - a2a_message:
              agent_card_url: "https://your-agent-url/.well-known/agent-card.json"
      - catch:
          - log:
              message: "Agent invocation failed: ${! error() }"

Prevention:

  • Test pipeline-agent integration with low volume first

  • Size agent resources appropriately for event rate

  • See integration patterns in Pipeline Integration Patterns

Monitor and debug agents

For comprehensive guidance on monitoring agent activity, analyzing conversation history, tracking token usage, and debugging issues, see Monitor Agent Activity.