Agent Evaluation Metrics

Last verified 29 Jun 2026

Inference provides a single control plane for managing inference workflows. It includes a Model Catalog where you can view available foundation models, including both DigitalOcean-hosted and third-party commercial models, compare model capabilities and pricing, use routing to match inference requests to the best-fit model, and run inference using serverless or dedicated deployments.

Agent evaluation metrics measure agent response quality, safety, instruction following, context use, tool use, and user goal progress. Use these metrics to create evaluation test cases that reflect the behavior you want to measure.

Note

Agent evaluations are not available for agents built and deployed using the Agent Development Kit (ADK).

General Agent Quality

General agent quality metrics measure the overall quality of an agent’s responses, including correctness, instruction following, safety, and whether the agent advances or completes the user’s goal. These metrics help ensure that agents provide accurate, relevant, and safe responses to user queries.

Metric | Description | Returns | Recommendations
--- | --- | --- | ---
Correctness (General Hallucinations) | Measures how factually accurate the agent’s response is without using context. High score = likely accurate; low score = possible hallucinations or errors. | Number (0-100); high = likely accurate | Flag low scores, adjust prompt, prevent non-factual answers
Instruction Following | Measures how well the agent follows instructions. High = followed closely; low = ignored parts. | Number (0-100); Boolean (Yes/No) | Flag ignored instructions, reword or vary instructions, add safeguards
Ground Truth Faithfulness | Compares response to known correct output. High = semantically equivalent; low = different meaning. | % Yes judgments | Use with other metrics for full picture
PII Leaks | Detects if input/output contains personally identifiable info (PII). | Boolean (Yes/No) | No recommendations
Toxicity | Flags hateful, offensive, or harmful content. | Boolean (Yes/No) | Apply guardrails, retrain agents, change models
Sexism | Flags sexist content; identifies harmful gender-based language. | Boolean (Yes/No) | Apply guardrails, retrain agents
Prompt Injection | Flags input designed to manipulate agent behavior. | Boolean (Yes/No) | Apply guardrails
User Goal Progress (Action Advancement) | Measures if the agent advanced the user’s task or question (partial/full answer, clarification, confirm action). | Number (0 to 100); 100 = advanced or accomplished at least one goal | No recommendations
User Goal Completion (Action Completion) | Measures if the agent fully accomplished the user’s goal; must be accurate, comprehensive, aligned with tool outputs. | Number (0 to 100) | No recommendations
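
The general quality metrics mix numeric scores with Yes/No flags, so a review script needs to treat the two kinds differently. The following is a minimal sketch, assuming you have copied one test case's results into a plain dictionary. The dictionary shape, the `find_issues` helper, and the threshold of 70 are illustrative assumptions, not part of the platform or its export format.

```python
from typing import Union

# Hypothetical result for one test case: metric name -> score or Yes/No flag.
# This shape is an assumption for illustration, not a documented export format.
MetricValue = Union[float, bool]

# Boolean safety metrics from the table; True ("Yes") means the check fired.
SAFETY_FLAGS = {"PII Leaks", "Toxicity", "Sexism", "Prompt Injection"}


def find_issues(result: dict[str, MetricValue], min_score: float = 70.0) -> list[str]:
    """Return the names of metrics that need review, sorted alphabetically.

    min_score is an illustrative threshold on the 0-100 scale,
    not a documented default.
    """
    issues = []
    for name, value in result.items():
        # Check bool first: in Python, bool is a subclass of int.
        if isinstance(value, bool):
            if name in SAFETY_FLAGS:
                # Safety checks are a problem when they return Yes.
                if value:
                    issues.append(name)
            elif not value:
                # Other Yes/No metrics (for example Instruction Following)
                # are a problem when they return No.
                issues.append(name)
        elif value < min_score:
            issues.append(name)
    return sorted(issues)


example = {
    "Correctness": 42.0,
    "Instruction Following": True,
    "Toxicity": True,
    "PII Leaks": False,
    "User Goal Completion": 90.0,
}
print(find_issues(example))
```

In this sketch a low Correctness score and a fired Toxicity check are both reported, which lines up with the "flag low scores" and "apply guardrails" recommendations in the table.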

RAG Context Quality

RAG context quality metrics evaluate how well agents use retrieved context to provide accurate, relevant, and grounded responses. These metrics help ensure that agents effectively use external knowledge from Retrieval-Augmented Generation (RAG) pipelines.

Metric | Description | Returns | Recommendations
--- | --- | --- | ---
Context Adherence (Context Hallucinations) | Measures whether the agent stays within the retrieved context when generating a response. High = relies only on provided facts; low = introduces unsupported information. | Number (0 to 1); score close to 1 means fully adherent; close to 0 means hallucinations likely. | No recommendations
Response-Context Completeness (Completeness) | Measures how thoroughly the agent covers key details from the provided context. | Number (0 to 1) | Rewrite the prompt to explicitly ask for full inclusion of relevant info; adjust prompt to encourage thorough coverage of key details
Retrieved Context Relevance (Context Relevance) | Measures how relevant the retrieved context is to the input prompt; checks if the context supports the query. | Number (0 or 100); high = significant similarity or relevance. | No recommendations
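
The RAG metrics do not share a scale: two are reported from 0 to 1, while Retrieved Context Relevance is reported on a 100-point scale. If you compare them side by side, it can help to convert them to a common scale first. The sketch below assumes the scale maxima listed in the Returns column. The `SCALE_MAX` mapping and the `to_percent` helper are hypothetical names introduced here for illustration.

```python
# Scale maxima taken from the "Returns" column of the table above.
# The mapping itself is an illustrative assumption, not a platform API.
SCALE_MAX = {
    "Context Adherence": 1.0,
    "Response-Context Completeness": 1.0,
    "Retrieved Context Relevance": 100.0,
}


def to_percent(metric: str, score: float) -> float:
    """Convert a RAG metric score to a 0-100 scale for comparison."""
    max_value = SCALE_MAX[metric]
    if not 0.0 <= score <= max_value:
        raise ValueError(f"{metric} score {score} is outside 0 to {max_value}")
    return 100.0 * score / max_value


print(to_percent("Context Adherence", 0.5))
print(to_percent("Retrieved Context Relevance", 100.0))
```

Rejecting out-of-range values is a deliberate choice in this sketch: a score of 85 passed for a 0-to-1 metric usually means the scales were mixed up upstream.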

Tool Use Quality

Tool use quality metrics evaluate whether agents select the right tools and use them successfully. These metrics help you test agents that rely on tools, functions, APIs, or external systems to complete user requests.

Metric | Description | Returns | Recommendations
--- | --- | --- | ---
Tool Selection | Measures whether the agent selected the correct tool for the user’s request. A high score means the selected tool was appropriate for the task. A low score means the agent selected the wrong tool, failed to select a needed tool, or used a tool unnecessarily. | Number (0-100) | Use to identify prompts, tool descriptions, or routing logic that need clearer tool selection guidance.
Tool Success | Measures whether the selected tool was used successfully and produced a useful result for the user’s request. A high score means the tool call helped complete the task. A low score means the tool call failed, returned unusable results, or was not handled correctly by the agent. | Number (0-100) | Use to identify failed tool workflows, incomplete tool handling, or cases where the agent does not recover from tool errors.
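
Read together, the two tool metrics suggest where to look first: a low Tool Selection score points at prompts, tool descriptions, or routing, while a good selection score with a low Tool Success score points at the tool workflow itself. The following sketch encodes that reading. The `triage_tool_use` function and the threshold of 70 are assumptions made for illustration, not documented behavior.

```python
def triage_tool_use(selection: float, success: float, threshold: float = 70.0) -> str:
    """Suggest a first place to look, given both tool scores (0-100).

    threshold is an illustrative cutoff, not a documented default.
    """
    if selection < threshold:
        # The wrong tool was likely chosen, so a low success score
        # may only be a consequence of that choice.
        return "review tool selection guidance"
    if success < threshold:
        # The right tool was chosen, but the call failed or its
        # result was not handled well.
        return "review tool workflow and error handling"
    return "ok"


print(triage_tool_use(selection=40.0, success=35.0))
print(triage_tool_use(selection=92.0, success=35.0))
print(triage_tool_use(selection=92.0, success=88.0))
```

Checking selection before success is a design choice in this sketch: when the agent picks the wrong tool, the success score says little about the tool workflow.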
