How to Evaluate Agent Performance on DigitalOcean Inference

Last verified 29 Jun 2026

Inference provides a single control plane for managing inference workflows. It includes a Model Catalog where you can view available foundation models (both DigitalOcean-hosted and third-party commercial models), compare model capabilities and pricing, use routing to match inference requests to the best-fit model, and run inference using serverless or dedicated deployments.

Agent evaluations help you measure how well your agent performs across criteria like response quality, safety, instruction following, and context relevance. You can run evaluations on individual agents or across all agents in a workspace using customizable test cases and datasets that define what you want to measure.

Note

Agent evaluations are not available for agents built and deployed using the Agent Development Kit (ADK).

Create an Evaluation Test Case

Test cases are templates you configure with the metrics and data you want to use to test your agents. For example, you can create one test case to measure factual accuracy and safety, and another test case to measure context quality for Retrieval-Augmented Generation (RAG) agents.

Each test case is tied to the workspace you create it in, and you cannot move it to another workspace.

To create an evaluation test case, go to the DigitalOcean Control Panel. In the left menu, click INFERENCE, and then click Agent Platform.

In the Workspaces section, click the workspace that contains the agent you want to evaluate, and then click the Evaluations tab.

Note

You must have at least one agent in the workspace before you can create an evaluation test case.

In the Test Cases section, in the top-right corner, click Create test case to open the Set up an evaluation test case page.

Test cases are available to all agents in the workspace.

Set Up Test Case Goals

In the Test case goals section, select one or more metric categories that match what you want to evaluate:

Category | Measures | Includes | Requirements
Correctness | Whether the agent gives factually and contextually correct answers. | Retrieved context relevance, correctness, and response-context completeness. | Some metrics require a knowledge base or other retrieved context.
User Outcomes | Whether the agent moves the conversation, action, or user goal toward completion. | User goal progress and instruction following. | None.
Safety & Security | Whether the agent avoids unsafe, sensitive, or inappropriate responses. | Personally identifiable information leaks, toxicity, sexism, and prompt injection. | None.
Context Quality | The quality of the context the agent retrieves and how well the agent output follows the retrieved context. | Context adherence. | Some metrics require a knowledge base or other retrieved context.

For detailed descriptions of each metric and how they’re scored, see Agent Evaluation Metrics.

Note

You can’t create custom metrics for agent evaluation test cases.

To customize the metrics in a category, click Customize test case. In the Customize test case metrics window, select or deselect individual metrics, and then click Save metrics.

Evaluation test cases require at least one selected metric. If you select a metric that your agent does not have the required resources to score, such as a Context Quality metric for an agent without a knowledge base or tool output, that metric is not scored and you are not charged token usage for that metric.

If you select Ground truth faithfulness, your dataset must include an additional column for expected responses.

Select a Star Metric and Threshold

In the Recommended Star Metric section, review the selected star metric. The star metric is the primary metric used to determine whether the test case passes. The recommended star metric is based on the metric categories you selected.

To change the star metric, click Change. In the Select star metrics window, choose one of the available metrics, and then click Update Star Metric.

In the Define the test case scoring rules section, review how the test case is scored. Evaluations use a neutral LLM to score your agent’s responses against the selected metrics.

In the Star metric pass threshold field, enter the minimum passing score for the star metric. For metrics where higher scores indicate better output quality, scores below this threshold fail the test case.

You may need to review the results of a few runs and adjust the threshold before it matches what you consider a passing or failing score.
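As a rough illustration of how a pass threshold separates passing from failing scores, the sketch below applies a candidate threshold to star metric scores collected from earlier runs. The function names and the sample scores are hypothetical, and the code assumes a metric where higher scores are better and where a score equal to the threshold passes. It is not DigitalOcean's scoring implementation.

```python
def passes(score: float, threshold: float) -> bool:
    """Assumed rule: scores below the threshold fail, everything else passes."""
    return score >= threshold


def pass_rate(scores: list[float], threshold: float) -> float:
    """Fraction of earlier star metric scores that would pass at this threshold."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(passes(s, threshold) for s in scores) / len(scores)


# Hypothetical star metric scores from four earlier runs.
previous_scores = [0.6, 0.8, 0.9, 0.7]

for candidate in (0.6, 0.7, 0.85):
    print(candidate, pass_rate(previous_scores, candidate))
```

Comparing a few candidate thresholds this way can help you pick a value that fails the runs you consider unacceptable without failing the ones you consider good.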

Upload a Dataset

In the What Data Should We Use For This Test Case? section, add the dataset that contains the inputs you want to evaluate. To evaluate your agent, DigitalOcean sends the dataset inputs to the agent similarly to how the agent receives inputs from end users or applications in production.

Your dataset should have 50 to 100 inputs to get a representative sample of your agent’s behavior. Agent evaluation datasets must be CSV files. Evaluations are limited to 500 prompts.

Datasets without ground truth values must include a column named query, like this:

query
"What makes DigitalOcean different from other cloud providers?"
"Explain the benefits of using DigitalOcean for startups."
...

Datasets that use ground truth metrics must also include a column for expected responses. For additional guidance on how to write datasets for evaluations, see How to Create an Evaluation Dataset.
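The dataset requirements above can be checked locally before you upload a file. The following is a minimal sketch using Python's standard csv module. The helper names build_dataset and check_dataset are hypothetical, and the checks cover only the rules stated in this guide (a query column, at most 500 prompts, and the suggested 50 to 100 inputs). It does not check the expected-responses column, because the required column name is not given here.

```python
import csv
import io

MAX_PROMPTS = 500               # evaluation limit stated in this guide
RECOMMENDED_RANGE = (50, 100)   # suggested dataset size from this guide


def build_dataset(queries: list[str]) -> str:
    """Serialize queries into CSV text with a single 'query' column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["query"])
    writer.writerows([q] for q in queries)
    return buffer.getvalue()


def check_dataset(csv_text: str) -> list[str]:
    """Return a list of problems found in the dataset (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if "query" not in (reader.fieldnames or []):
        return ["missing required 'query' column"]
    rows = [row for row in reader if (row["query"] or "").strip()]
    problems = []
    low, high = RECOMMENDED_RANGE
    if not rows:
        problems.append("no non-empty queries")
    elif len(rows) > MAX_PROMPTS:
        problems.append(f"{len(rows)} queries exceeds the {MAX_PROMPTS}-prompt limit")
    elif not low <= len(rows) <= high:
        problems.append(f"{len(rows)} queries is outside the recommended {low} to {high}")
    return problems


# Example: a generated dataset of 60 placeholder queries.
dataset = build_dataset([f"Example question {i}?" for i in range(60)])
print(check_dataset(dataset))
```

To produce a file for upload, write the returned text to a path of your choice, for example with open("dataset.csv", "w", newline="").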

Click Add dataset to open the Select dataset window. Then, drag and drop your dataset file, or click Upload to select a file. After the file uploads, click Add.

You can add only one dataset to a test case.

Specify Test Case Name and Description

In the Test case name field, enter a unique, identifiable name that describes the test case’s goal or context. For example, Support accuracy test or Safety regression test.

Optionally, in the Test case description textbox, enter a short description to make the test case easier to identify later. For example:

Evaluates whether the support agent answers billing questions accurately and avoids unsupported refund promises.

Finalize Details

In the Final Details section, review the test case summary, including the selected metric categories and number of selected metrics.

To review possible cost factors, expand Estimating run costs. Evaluation costs depend on token usage. Token usage can vary based on the agent’s model, configuration, attached resources, settings, and query and response length. Larger datasets usually incur more token usage.
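To get a feel for how dataset size drives token usage, the sketch below multiplies per-query token counts by per-token prices. Every number and name in it is a hypothetical placeholder: the function estimate_run_cost, the average token counts, and the prices are assumptions for illustration, not DigitalOcean pricing. It also ignores any tokens used by the scoring model. Use the Estimating run costs panel for actual cost factors.

```python
def estimate_run_cost(
    num_queries: int,
    avg_input_tokens: float,
    avg_output_tokens: float,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Rough agent-side cost estimate; prices are per one million tokens."""
    input_cost = num_queries * avg_input_tokens * input_price_per_million / 1_000_000
    output_cost = num_queries * avg_output_tokens * output_price_per_million / 1_000_000
    return input_cost + output_cost


# Hypothetical figures: doubling the dataset size doubles the estimate.
small = estimate_run_cost(100, 1000, 500, 2.0, 4.0)
large = estimate_run_cost(200, 1000, 500, 2.0, 4.0)
print(small, large)
```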

If an agent does not have the required resources for a selected metric, such as a knowledge base or tools, you are not charged tokens for that metric.

Then, click Create test case.

After you create the test case, it appears in the workspace’s Evaluations tab.

Edit an Evaluation Test Case

To edit a test case, open the menu next to the test case, and then click Edit test case configuration to open the Edit test case configuration page.

Update any of the test case configuration fields you set when you created the test case. You can update the selected metrics, star metric, pass threshold, dataset, test case name, and description.

After you make your changes, click Save.

Run an Evaluation

After you create a test case, you can use it to run an evaluation on an agent in your workspace.

To run an evaluation, go to the DigitalOcean Control Panel. In the left menu, click INFERENCE, and then click Agent Platform.

In the Workspaces section, click the workspace that contains the agent you want to evaluate, and then click the Evaluations tab.

In the list of test cases, open the menu next to the test case you want to run, and then click Run Evaluation to open the Run an evaluation page.

In the Agent to evaluate section, review the selected agent and version.

Note

To ensure high-quality run results, avoid making changes to agent configurations while a run is in progress.

In the Select test case for evaluation run section, select the test case you want to use for the evaluation.

In the Run name field, enter a name for the run. DigitalOcean uses this name as a prefix and appends a unique ID after the run starts. Run names can contain letters, numbers, and dashes.
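If you generate run names in a script or naming convention, a quick check against the allowed characters can catch invalid names early. This is a minimal sketch: the helper names are hypothetical, and it assumes "letters" means ASCII letters, which this guide does not state explicitly.

```python
import re

# Assumption: letters are ASCII A-Z and a-z; digits and dashes are also allowed.
RUN_NAME_PATTERN = re.compile(r"[A-Za-z0-9-]+")


def is_valid_run_name(name: str) -> bool:
    """True if the name uses only letters, numbers, and dashes."""
    return RUN_NAME_PATTERN.fullmatch(name) is not None


def to_run_name(label: str) -> str:
    """Replace each run of disallowed characters with a single dash."""
    return re.sub(r"[^A-Za-z0-9-]+", "-", label).strip("-")


print(is_valid_run_name("support-accuracy-v2"))
print(to_run_name("Support accuracy: v2"))
```

Because DigitalOcean appends a unique ID to the name after the run starts, a short descriptive prefix is usually enough.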

In the Final Details section, review the test case summary, selected agent, model, and pricing information.

To review possible cost factors, expand Estimating run costs. Evaluation costs depend on token usage. Token usage can vary based on the agent’s model, configuration, attached resources, settings, and query and response length. Larger datasets usually incur more token usage.

To evaluate your data, DigitalOcean uses a third-party LLM-as-judge for scoring. By continuing, you acknowledge that your agent input and output may be sent to and processed by OpenAI to generate scores and score rationale. Score rationale is automatically generated and may contain errors or omissions. Review and verify results before relying on them.

Then, click Run evaluation to start the run. You are redirected to the run's overview page, where you can view the run's progress. Runs may take several minutes to complete, depending on the complexity of your agent's configuration and prompts.

Review Run Results

After a run finishes, go to the workspace’s Evaluations tab, and then click the test case you ran.

In the Test case runs section, review the runs for the test case. The table includes the evaluation run name, agent, status, and star metric score.

To review a run’s results, click the evaluation run name to open the run overview page. You can also click + next to a run to expand additional run details.

The run overview page shows a completion summary with the total tokens used, number of prompts evaluated, and run cost.

In the Test case configuration section, review the agent evaluated, test case, selected metrics, dataset, and star metric.

In the PERFORMANCE METRICS section, review the total run time, number of prompts evaluated, and run cost. To view more cost details, click Cost breakdown to open the Evaluation run cost breakdown window. This window shows the evaluated agent and the token count and estimated cost for the agent configuration.

The Scores tab shows the average score for each selected metric across all evaluated queries. The table includes each metric’s category, metric name, and average score.
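The averages on the Scores tab can be thought of as a per-metric mean over all evaluated queries. The sketch below shows that calculation on hand-entered data. The data structure, the function name average_scores, and the score values are hypothetical, since this guide does not describe an export format. Only the metric names come from the categories listed earlier.

```python
from collections import defaultdict
from statistics import mean


def average_scores(query_results: list[dict]) -> dict[str, float]:
    """Average each metric's score across all queries that were scored on it."""
    by_metric: dict[str, list[float]] = defaultdict(list)
    for result in query_results:
        for metric, score in result["scores"].items():
            by_metric[metric].append(score)
    return {metric: mean(scores) for metric, scores in by_metric.items()}


# Hypothetical per-query scores copied by hand from query details pages.
results = [
    {"query": "How do I reset my password?",
     "scores": {"Instruction following": 1.0, "Context adherence": 0.5}},
    {"query": "What is your refund policy?",
     "scores": {"Instruction following": 0.5, "Context adherence": 1.0}},
]

print(average_scores(results))
```

A metric that was not scored for a query, for example because the agent has no knowledge base, is simply left out of that query's scores and does not affect the average in this sketch.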

The Queries tab shows the dataset queries used in the run and lets you review how the agent performed on each query. Each query card shows the selected metric, score, input, token usage, and a link to Query details.

Click Query details to open the query details page. The query details page shows the full input, agent response, metric scores, score rationale, retrieved knowledge base data if available, and token usage. You can expand individual metric scores to review the judge model’s rationale for each score.

After reviewing the run results, you can adjust your agent's configuration to improve its responses. Then, run the evaluation again to see whether the agent's responses improved.
