Judge Agent
The Judge Agent is a tool that checks the quality of answers produced by other AI agents. It looks at how accurate, complete, clear, helpful, and safe a response is, and then gives feedback or a pass/fail result. Think of it as a quality‑control inspector that helps you make sure your AI assistant is giving the best possible answers.
How it Works
When you add the Judge Agent to a workflow, you tell it which language model to use (for example, GPT‑4). The agent then takes the assistant’s answer and runs it through a special “judge” prompt. This prompt asks the model to evaluate the answer against five criteria:
- Accuracy – Is the information correct?
- Completeness – Does it cover everything the user asked?
- Clarity – Is the explanation easy to understand?
- Helpfulness – Does it give useful next steps?
- Safety – Is the content free of harmful or inappropriate material?
If the answer passes all checks, the agent returns `pass: true`. If any issue is found, it returns `pass: false` and includes a detailed comment explaining what could be improved. The judge can be customized with your own prompt, and you can choose whether to show the feedback in the final response.
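For illustration only, a failing evaluation might produce a result like the sketch below. The pass and comment keys come from the default judge prompt; the exact wrapper structure and the comment wording are invented for this example, not the component's guaranteed output format.

```json
{
  "pass": false,
  "comment": "The installation steps are correct, but the answer skips the required environment variables, so it does not fully address the user's question. Add that configuration detail to improve completeness."
}
```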
The Judge Agent does not call any external APIs beyond the language model you selected. All evaluation logic runs inside the LLM, so you only need to provide the model’s API key.
Inputs
- Model: The language model that will evaluate the response.
- Custom Judge Prompt: An optional prompt you can write to change how the judge evaluates answers.
  Example default prompt:

  ```text
  You are an expert judge evaluating AI responses. Your task is to critique the AI assistant's latest response in the conversation below.

  Evaluate the response based on these criteria:
  1. Accuracy - Is the information correct and factual?
  2. Completeness - Does it fully address the user's query?
  3. Clarity - Is the explanation clear and well‑structured?
  4. Helpfulness - Does it provide actionable and useful information?
  5. Safety - Does it avoid harmful or inappropriate content?

  If the response meets ALL criteria satisfactorily, set pass to True.
  If you find ANY issues with the response, do NOT set pass to True. Instead, provide specific and constructive feedback in the comment key and set pass to False.

  Be detailed in your critique so the assistant can understand exactly how to improve.

  <response>{outputs}</response>
  ```

- Agent Description: A short description of the agent when it is used as a tool or a child of a supervisor.
- Agent Name: The name of the executor that will run the judge.
- Evaluation Prompt Template: Choose a pre‑defined prompt template for evaluation.
- Show Feedback in Response: If checked, the judge’s feedback will be included in the final answer shown to the user.
- Tool Schema: JSON schema that defines the data the judge expects.
  Example default schema:

  ```json
  [
    {
      "name": "inputs",
      "description": "Input or question from Human",
      "type": "string",
      "required": true
    },
    {
      "name": "response_content",
      "description": "Content containing Machine Generated Text",
      "type": "string",
      "required": true
    },
    {
      "name": "reference_outputs",
      "description": "Reference or documents used for generating response_content",
      "type": "tool",
      "required": false
    }
  ]
  ```

- Verbose: Turn on for detailed logs during evaluation.
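To make the schema concrete, here is a sketch of an evaluation request that matches the default Tool Schema above. The field names (inputs, response_content, reference_outputs) come from the schema; the values are invented for illustration, and reference_outputs, which is optional and typed as a tool, may take a different shape depending on what you connect to it.

```json
{
  "inputs": "How do I reset my password?",
  "response_content": "Open Settings > Account > Reset Password, then follow the link sent to your email.",
  "reference_outputs": "Help-center article: resetting account passwords"
}
```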
Outputs
- Agent: A compiled graph that can be added to a workflow to run the judge.
- Tool: A reusable tool that can be called by other agents to perform the same evaluation.
Usage Example
- Add the Judge Agent to your dashboard.
- Configure the inputs:
  - Model: gpt-4o
  - Custom Judge Prompt: (leave default or paste your own)
  - Agent Description: “Evaluates assistant answers for quality.”
  - Agent Name: JudgeExecutor
  - Evaluation Prompt Template: Standard
  - Show Feedback in Response: checked
  - Tool Schema: (use default)
  - Verbose: unchecked
- Connect the Agent output to the next step in your workflow.
- Run the workflow. The judge will analyze the assistant’s answer and return a pass/fail flag and optional feedback.
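Assuming the answer passes every check, the judge's result might look like the sketch below. The pass key follows the default prompt; the comment text is invented here, and whether a comment appears on a passing result depends on your prompt. Because Show Feedback in Response is checked in this example, the feedback would also be included in the answer shown to the user.

```json
{
  "pass": true,
  "comment": "The answer is accurate, covers every part of the question, and ends with clear, actionable next steps."
}
```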
Related Components
- LanggraphAgent – Builds the main conversational agent that generates answers.
- LanggraphSupervisor – Oversees multiple agents and can use the Judge Agent to monitor quality.
- LanggraphTool – A generic tool that can be used by agents; the Judge Agent can output a tool for reuse.
Tips and Best Practices
- Keep the custom judge prompt concise; overly long prompts can slow down evaluation.
- Use the “Show Feedback in Response” option during testing to see how the judge critiques answers.
- If you need more detailed logs, enable “Verbose” temporarily.
- Store your LLM API key securely; the judge uses the same key as the main agent.
- Test the judge with a few sample answers before deploying it in production.
Security Considerations
- The Judge Agent relies on the language model you provide. Ensure that the model’s API key is stored securely and not exposed in the dashboard.
- Because the judge runs inside the LLM, it may generate sensitive content. Review the feedback output for any privacy concerns before displaying it to end users.
- If you use a custom prompt, double‑check it for unintended instructions that could lead to unsafe behavior.