Troubleshoot AI Agents

Use this page to diagnose and fix common issues with AI agents, including deployment failures, runtime behavior problems, tool execution errors, and integration issues.

The Agentic Data Plane is supported on BYOC clusters running on AWS with Redpanda version 25.3 or later.

Deployment issues

Fix issues that prevent agents from connecting to required resources.

MCP server connection failures

Symptoms: Agent starts, but tools don't respond or they return connection errors.

Causes:

  • MCP server stopped or crashed after agent creation

  • Network connectivity issues between agent and MCP server

  • MCP server authentication or permission issues

Solution:

  1. Verify MCP server status in Agentic AI > Remote MCP.

  2. Check MCP server logs for errors.

  3. Restart the MCP server if needed.

  4. Verify agent has permission to access the MCP server.

Prevention:

  • Monitor MCP server health

  • Use appropriate retry logic in tools

Runtime behavior issues

Resolve problems with agent decision-making, tool selection, and response generation.

Agent not calling tools

Symptoms: Agent responds without calling any tools, or fabricates information instead of using tools.

Causes:

  • System prompt doesn’t clearly specify when to use tools

  • Tool descriptions are vague or missing

  • LLM model lacks sufficient reasoning capability

  • Max iterations is too low

Solution:

  1. Strengthen tool usage guidance in your system prompt:

    ALWAYS use get_order_status when customer mentions an order ID.
    NEVER respond about order status without calling the tool first.
  2. Review tool descriptions in your MCP server configuration.

  3. Use a more capable model from the supported list for your gateway.

  4. Increase max iterations if the agent is stopping before reaching tools.

Prevention:

  • Write explicit tool selection criteria in system prompts

  • Test agents with the systematic testing approach

  • Use models appropriate for your task complexity

Calling wrong tools

Symptoms: Agent selects incorrect tools for the task, or calls tools with invalid parameters.

Causes:

  • Tool descriptions are ambiguous or overlap

  • Too many similar tools confuse the LLM

  • System prompt doesn’t provide clear tool selection guidance

Solution:

  1. Make tool descriptions more specific and distinct (see the sketch after these steps).

  2. Add "when to use" guidance to your system prompt:

    Use get_order_status when:
    - Customer provides an order ID (ORD-XXXXX)
    - You need to check current order state
    
    Use get_shipping_info when:
    - Order status is "shipped"
    - Customer asks about delivery or tracking
  3. Reduce the number of tools you expose to the agent.

  4. Use subagents to partition tools by domain.
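
As an example of step 1, the following sketch contrasts ambiguous and distinct tool descriptions. The tools and wording are illustrative:

# Ambiguous: the descriptions overlap, so the LLM may pick either tool
description: "Get order information"
description: "Get order data"

# Distinct: each description states what the tool returns and when to use it
description: "Get the current status of one order by its order ID (ORD-XXXXX). Use when the customer asks about order state."
description: "Get carrier and tracking details for an order that has already shipped. Use when the customer asks about delivery."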

Prevention:

  • Follow tool design patterns in MCP Tool Patterns

  • Limit each agent to 10-15 tools maximum

  • Test boundary cases where multiple tools might apply

Stuck in loops or exceeding max iterations

Symptoms: Agent reaches max iterations without completing the task, or repeatedly calls the same tool with the same parameters.

Causes:

  • Tool returns errors that the agent doesn’t know how to handle

  • Agent doesn’t recognize when the task is complete

  • Tool returns incomplete data that prompts another call

  • System prompt encourages exhaustive exploration

Solution:

  1. Add completion criteria to your system prompt:

    When you have retrieved all requested information:
    1. Present the results to the user
    2. Stop calling additional tools
    3. Do not explore related data unless asked
  2. Add error handling guidance:

    If a tool fails after 2 attempts:
    - Explain what went wrong
    - Do not retry the same tool again
    - Move on or ask for user guidance
  3. Review tool output to ensure it signals completion clearly.

  4. Increase max iterations if the task legitimately requires many steps.

Prevention:

  • Design tools to return complete information in one call

  • Set max iterations appropriate for task complexity (see Why iterations matter)

  • Test with ambiguous requests that might cause loops

Making up information

Symptoms: Agent provides plausible-sounding answers without calling tools, or invents data when tools fail.

Causes:

  • System prompt doesn’t explicitly forbid fabrication

  • Agent treats tool failures as suggestions rather than requirements

  • Model is hallucinating due to lack of constraints

Solution:

  1. Add explicit constraints to your system prompt:

    Critical rules:
    - NEVER make up order numbers, tracking numbers, or customer data
    - If a tool fails, explain the failure - do not guess
    - If you don't have information, say so explicitly
  2. Test error scenarios by temporarily disabling tools.

  3. Use a more capable model that follows instructions better.

Prevention:

  • Include "never fabricate" rules in all system prompts

  • Test with requests that require unavailable data

  • Monitor Transcripts and the session topic for fabricated responses

Analyzing conversation patterns

Symptoms: Agent behavior is inconsistent or produces unexpected results.

Solution:

Review conversation history in Transcripts to identify problematic patterns:

  • Agents calling the same tool repeatedly: Indicates loop detection is needed

  • Large gaps between messages: Suggests tool timeout or slow execution

  • Agent responses without tool calls: Indicates a tool selection issue

  • Fabricated information: Suggests a missing "never make up data" constraint

  • Truncated early messages: Indicates the context window was exceeded

Analysis workflow:

  1. Use Inspector to reproduce the issue.

  2. Review full conversation including tool invocations.

  3. Identify where the agent's behavior diverged from what you expected.

  4. Check system prompt for missing guidance.

  5. Verify tool responses are formatted correctly.

Performance issues

Diagnose and fix issues related to agent speed and resource consumption.

Slow response times

Symptoms: Agent takes 10+ seconds to respond to simple queries.

Causes:

  • LLM model is slow (large context processing)

  • Too many tool calls in sequence

  • Tools themselves are slow (database queries, API calls)

  • Large context window from long conversation history

Solution:

  1. Use a faster, lower-latency model tier for simple queries and reserve larger models for complex reasoning.

  2. Review conversation history in the Inspector tab to identify unnecessary tool calls.

  3. Optimize tool implementations:

    1. Add caching where appropriate (see the caching sketch after these steps)

    2. Reduce query complexity

    3. Return only needed data (use pagination, filters)

  4. Clear the conversation history if the context is very large.
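
For the caching suggestion in step 3, the sketch below uses a Redpanda Connect in-memory cache around an HTTP lookup. It assumes the tool is implemented as a Connect processor pipeline; the cache label, key, and URL are illustrative:

cache_resources:
  - label: tool_cache
    memory:
      default_ttl: 60s  # Cached entries expire after one minute

processors:
  - cached:
      cache: tool_cache
      key: '${! json("order_id") }'  # One cache entry per order ID
      processors:
        - http:
            url: 'https://api.example.com/orders/${! json("order_id") }'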

Prevention:

  • Right-size model selection based on task complexity

  • Design tools to execute quickly (< 2 seconds ideal)

  • Set appropriate max iterations to prevent excessive exploration

  • Monitor token usage and conversation length

High token costs

Symptoms: Token usage is higher than expected, and costs are increasing rapidly.

Causes:

  • Max iterations configured too high

  • Agent making unnecessary tool calls

  • Large tool results filling context window

  • Long conversation history not being managed

  • Using expensive models for simple tasks

Solution:

  1. Review token usage in Transcripts.

  2. Lower max iterations for this agent.

  3. Optimize tool responses to return less data (see the schema sketch after these steps):

    Bad:  Return all 10,000 customer records
    Good: Return paginated results, 20 records at a time
  4. Add cost control guidance to system prompt:

    Efficiency guidelines:
    - Request only the data you need
    - Stop when you have enough information
    - Do not call tools speculatively
  5. Switch to a more cost-effective model for simple queries.

  6. Clear conversation history periodically in the Inspector tab.
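
For the pagination suggestion in step 3, a tool schema might expose page parameters like the following sketch, written in the same input_schema style used elsewhere on this page. Parameter names are illustrative:

input_schema:
  properties:
    customer_id:
      type: string
      description: "Customer whose records to return"
    page:
      type: integer
      description: "Page number, starting at 1"
    page_size:
      type: integer
      description: "Records per page (maximum 20)"
  required:
    - customer_id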

Prevention:

  • Set appropriate max iterations (10-20 for simple, 30-40 for complex)

  • Design tools to return minimal necessary data

  • Monitor token usage trends

  • See Cost calculation for guidance

Tool execution issues

Fix problems with timeouts, invalid parameters, and error responses.

Tool timeouts

Symptoms: Tools fail with timeout errors, or the agent receives incomplete results.

Causes:

  • External API is slow or unresponsive

  • Database query is too complex

  • Network latency between tool and external system

  • Tool processing large datasets in memory

Solution:

  1. Add timeout handling to tool implementation:

    http:
      url: https://api.example.com/data
      timeout: "5s"  # Set explicit timeout
  2. Optimize external queries:

    1. Add database indexes

    2. Reduce query scope

    3. Cache frequent queries

  3. Increase the tool timeout if the operation legitimately takes longer.

  4. Add retry logic for transient failures.
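
Combining steps 1 and 4: if the tool is implemented with the Redpanda Connect http processor, it supports retries alongside the explicit timeout. A minimal sketch with an illustrative URL:

http:
  url: https://api.example.com/data
  timeout: "5s"       # Fail fast instead of hanging
  retries: 3          # Retry transient failures
  retry_period: "1s"  # Wait between attempts
  backoff_on:
    - 429             # Back off when the API rate-limits requests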

Prevention:

  • Set explicit timeouts in all tool configurations

  • Test tools under load

  • Monitor external API performance

  • Design tools to fail fast on unavailable services

Invalid parameters

Symptoms: Tools return validation errors about missing or incorrectly formatted parameters.

Causes:

  • Tool schema doesn’t match implementation

  • Agent passes wrong data types

  • Required parameters not marked as required in schema

  • Agent misunderstands parameter purpose

Solution:

  1. Verify tool schema matches implementation:

    input_schema:
      properties:
        order_id:
          type: string  # Must match what the tool expects
          description: "Order ID in format ORD-12345"
      required:
        - order_id  # Mark mandatory parameters as required
  2. Add parameter validation to tools.

  3. Improve parameter descriptions in tool schema.

  4. Add examples to tool descriptions:

    description: |
      Get order status by order ID.
      Example: get_order_status(order_id="ORD-12345")

Prevention:

  • Write detailed parameter descriptions

  • Include format requirements and examples

  • Test tools with invalid inputs to verify error messages

  • Use JSON Schema validation in tool implementations

Tool returns errors

Symptoms: Tools execute but return error responses or unexpected data formats.

Causes:

  • External API returned error

  • Tool implementation has bugs

  • Data format changed in external system

  • Tool lacks error handling

Solution:

  1. Check tool logs in MCP server.

  2. Test the tool directly (outside the agent context).

  3. Verify external system is operational.

  4. Add error handling to tool implementation:

    processors:
      - try:
          - http:
              url: ${API_URL}
      - catch:
          - mapping: |
              root.error = "API unavailable: " + error()
  5. Update the agent's system prompt to handle this error type, as in the snippet below.
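
For step 5, the prompt can tell the agent how to react to the specific error string that the catch block above produces. A hypothetical snippet:

If a tool returns an error starting with "API unavailable":
- Tell the user the external system is temporarily unreachable
- Do not retry the same tool more than once
- Do not invent a substitute answer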

Prevention:

  • Implement comprehensive error handling in tools

  • Monitor external system health

  • Add retries for transient failures

  • Log all tool errors for analysis

Integration issues

Fix problems with external applications calling agents and pipeline-to-agent integration.

Agent card does not contain a URL

Symptoms: Pipeline fails with an error such as "agent card does not contain a URL" or "failed to init processor <no label> path root.pipeline.processors.0".

Causes:

  • The agent_card_url points to the base agent endpoint instead of the agent card JSON file

Solution:

The agent_card_url must point to the agent card JSON file, not the base agent endpoint.

Incorrect configuration:

processors:
  - a2a_message:
      agent_card_url: "https://your-agent-id.ai-agents.your-cluster-id.cloud.redpanda.com"
      prompt: "Analyze this transaction: ${!content()}"

Correct configuration:

processors:
  - a2a_message:
      agent_card_url: "https://your-agent-id.ai-agents.your-cluster-id.cloud.redpanda.com/.well-known/agent-card.json"
      prompt: "Analyze this transaction: ${!content()}"

The agent card is always available at /.well-known/agent-card.json according to the A2A protocol standard.

Prevention:

  • Always append /.well-known/agent-card.json to the agent endpoint URL

  • Test the agent card URL in a browser before using it in pipeline configuration

  • See Agent card location for details

Pipeline integration failures

Symptoms: Pipelines using a2a_message processor fail or timeout.

Causes:

  • Agent is not running or restarting

  • Agent timeout is too low for pipeline workload

  • Authentication issues between pipeline and agent

  • High event volume overwhelming agent

Solution:

  1. Check agent status and resource allocation.

  2. Increase agent resource tier for high-volume pipelines.

  3. Add error handling in pipeline:

    processors:
      - try:
          - a2a_message:
              agent_card_url: "https://your-agent-url/.well-known/agent-card.json"
      - catch:
          - log:
              message: "Agent invocation failed: ${! error() }"

Prevention:

  • Test pipeline-agent integration with low volume first

  • Size agent resources appropriately for event rate

  • See integration patterns in Pipeline Integration Patterns

Monitor and debug agents

For comprehensive guidance on monitoring agent activity, analyzing conversation history, tracking token usage, and debugging issues, see Monitor Agent Activity.