Agent evaluations are not available for agents built and deployed using the Agent Development Kit (ADK).
Agent Evaluation Metrics
Last verified 29 Jun 2026
Inference provides a single control plane for managing inference workflows. It includes a Model Catalog where you can view available foundation models, including both DigitalOcean-hosted and third-party commercial models, compare model capabilities and pricing, use routing to match inference requests to the best-fit model, and run inference using serverless or dedicated deployments.
Agent evaluation metrics measure agent response quality, safety, instruction following, context use, tool use, and user goal progress. Use these metrics to create evaluation test cases that reflect the behavior you want to measure.
General Agent Quality
General agent quality metrics measure the overall quality of an agent’s responses, including correctness, instruction following, safety, and whether the agent advances or completes the user’s goal. These metrics help ensure that agents provide accurate, relevant, and safe responses to user queries.
| Metric | Description | Returns | Recommendations |
|---|---|---|---|
| Correctness (General Hallucinations) | Measures how factually accurate the agent’s response is without using context. High score = likely accurate; low score = possible hallucinations or errors. | Number (0-100); high = likely accurate | Flag low scores, adjust prompt, prevent non-factual answers |
| Instruction Following | Measures how well the agent follows instructions. High = followed closely; low = ignored parts. | Number (0-100); Boolean (Yes/No) | Flag ignored instructions, reword or vary instructions, add safeguards |
| Ground Truth Faithfulness | Compares the response to a known correct output. High = semantically equivalent; low = different meaning. | Percentage of Yes judgments | Use with other metrics for a full picture |
| PII Leaks | Detects whether the input or output contains personally identifiable information (PII). | Boolean (Yes/No) | No recommendations |
| Toxicity | Flags hateful, offensive, or harmful content. | Boolean (Yes/No) | Apply guardrails, retrain agents, change models |
| Sexism | Flags sexist content; identifies harmful gender-based language. | Boolean (Yes/No) | Apply guardrails, retrain agents |
| Prompt Injection | Flags input designed to manipulate agent behavior. | Boolean (Yes/No) | Apply guardrails |
| User Goal Progress (Action Advancement) | Measures whether the agent advanced the user’s task or question, for example by giving a partial or full answer, asking for clarification, or confirming an action. | Number (0 to 100); 100 = advanced or accomplished at least one goal | No Recommendations |
| User Goal Completion (Action Completion) | Measures if the agent fully accomplished the user’s goal; must be accurate, comprehensive, aligned with tool outputs. | Number (0 to 100) | No Recommendations |
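
The table above mixes numeric scores, where higher is better, with Boolean safety flags, where Yes signals a problem. The following is a minimal sketch of how you might triage both kinds of result in one pass. The result dictionary, the `find_issues` helper, and the threshold of 70 are assumptions made for illustration. They are not part of any documented API or export format.

```python
from typing import Union

# Hypothetical result shape: metric name -> numeric score (0-100) or Boolean flag.
Result = Union[bool, int, float]

# Boolean metrics from the table where True (Yes) signals a problem.
SAFETY_FLAGS = {"PII Leaks", "Toxicity", "Sexism", "Prompt Injection"}


def find_issues(results: dict[str, Result], threshold: float = 70.0) -> list[str]:
    """Return the names of metrics that need review.

    The threshold is an illustrative default, not a documented cutoff.
    """
    issues = []
    for name, value in results.items():
        # Check bool first: in Python, bool is a subclass of int.
        if isinstance(value, bool):
            if value and name in SAFETY_FLAGS:
                issues.append(name)
        elif value < threshold:
            # Numeric metrics: higher is better, so flag low scores.
            issues.append(name)
    return issues


# Made-up example values, not real evaluation output.
example = {
    "Correctness": 92,
    "Instruction Following": 55,
    "Toxicity": False,
    "PII Leaks": True,
    "User Goal Completion": 80,
}
print(find_issues(example))
```

Choose a threshold that matches your own tolerance for errors. A stricter value surfaces more test cases for manual review.
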
RAG Context Quality
RAG context quality metrics evaluate how well agents use retrieved context to provide accurate, relevant, and grounded responses. These metrics help ensure that agents effectively use external knowledge from Retrieval-Augmented Generation (RAG) pipelines.
| Metric | Description | Returns | Recommendations |
|---|---|---|---|
| Context Adherence (Context Hallucinations) | Measures whether the agent stays within the retrieved context when generating a response. High = relies only on provided facts; low = introduces unsupported information. | Number (0 to 1); score close to 1 means fully adherent; close to 0 means hallucinations likely. | No Recommendations |
| Response-Context Completeness (Completeness) | Measures how thoroughly the agent covers key details from the provided context. | Number (0 to 1) | Rewrite the prompt to explicitly ask for full inclusion of relevant information; adjust the prompt to encourage thorough coverage of key details |
| Retrieved Context Relevance (Context Relevance) | Measures how relevant the retrieved context is to the input prompt; checks if the context supports the query. | Number (0 or 100); high = significant similarity or relevance. | No Recommendations |
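
Context Adherence and Response-Context Completeness are reported on a 0 to 1 scale, while most general quality metrics use 0 to 100. If you compare them side by side, it can help to rescale them first. The sketch below shows one way to do that. The `to_percent` helper and the sample scores are hypothetical and only illustrate the arithmetic.

```python
def to_percent(score: float, scale_max: float) -> float:
    """Rescale a score from the range [0, scale_max] to [0, 100]."""
    if scale_max <= 0:
        raise ValueError("scale_max must be positive")
    if not 0 <= score <= scale_max:
        raise ValueError(f"score {score} is outside 0 to {scale_max}")
    return score * 100 / scale_max


# Made-up scores: a 0-to-1 RAG metric next to a 0-to-100 general metric.
context_adherence = to_percent(0.25, scale_max=1)
correctness = to_percent(88, scale_max=100)
print(context_adherence, correctness)
```
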
Tool Use Quality
Tool use quality metrics evaluate whether agents select the right tools and use them successfully. These metrics help you test agents that rely on tools, functions, APIs, or external systems to complete user requests.
| Metric | Description | Returns | Recommendations |
|---|---|---|---|
| Tool Selection | Measures whether the agent selected the correct tool for the user’s request. A high score means the selected tool was appropriate for the task. A low score means the agent selected the wrong tool, failed to select a needed tool, or used a tool unnecessarily. | Number (0-100) | Use to identify prompts, tool descriptions, or routing logic that need clearer tool selection guidance. |
| Tool Success | Measures whether the selected tool was used successfully and produced a useful result for the user’s request. A high score means the tool call helped complete the task. A low score means the tool call failed, returned unusable results, or was not handled correctly by the agent. | Number (0-100) | Use to identify failed tool workflows, incomplete tool handling, or cases where the agent does not recover from tool errors. |
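
Reading the two tool metrics together can narrow down where a failure occurs. A low Tool Selection score points at prompts, tool descriptions, or routing logic. A low Tool Success score with adequate selection points at the tool workflow or error handling. The following sketch encodes that reading. The `triage_tool_use` function, its return labels, and the threshold of 70 are assumptions for illustration, not documented behavior.

```python
def triage_tool_use(selection: float, success: float, threshold: float = 70.0) -> str:
    """Suggest where to look first, given the two tool metrics (0-100 each).

    The threshold is an illustrative default, not a documented cutoff.
    """
    if selection < threshold:
        # The wrong tool, a missing tool, or an unnecessary tool was chosen.
        return "review tool selection"
    if success < threshold:
        # The right tool was chosen, but the call failed or was mishandled.
        return "review tool execution"
    return "ok"


# Made-up scores for three hypothetical test cases.
print(triage_tool_use(selection=40, success=90))
print(triage_tool_use(selection=95, success=30))
print(triage_tool_use(selection=95, success=85))
```

Selection is checked first because a tool call made with the wrong tool is unlikely to produce a meaningful Tool Success score.
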