How to Evaluate Models

Last verified 29 Jun 2026

Inference provides a single control plane for managing inference workflows. It includes a Model Catalog where you can view available foundation models, including both DigitalOcean-hosted and third-party commercial models, compare model capabilities and pricing, use routing to match inference requests to the best-fit model, and run inference using serverless or dedicated deployments.

Using DigitalOcean Evaluations, you can validate model and inference router configurations on your own datasets before production using an LLM-as-a-judge framework. For pricing information, see the pricing page.

DigitalOcean Evaluations are advisory tools. We recommend manually reviewing evaluated model outputs before deploying models to production.

To run an evaluation, go to the DigitalOcean Control Panel, in the left menu, click INFERENCE, and then click Evaluations. In the top-right corner, click New Evaluation to open the Create an Evaluation page.

Configure the evaluation by selecting a dataset, candidate model, judge model, evaluation metrics, and star metric. You can also save the configuration as a preset to reuse later.

You can create an evaluation by either loading a preset or configuring the evaluation manually.

Choose a Preset

Presets let you save and reuse evaluation configurations, including the dataset, candidate model, system prompt, judge model, metrics, and star metric. When you load a preset, you can change the configuration without overwriting the preset you used. For more guidance, see Use Presets.

If you have saved presets, in the Presets section, click the Load from preset dropdown menu, and then select a preset.

In the Load preset? window, review the preset configuration, and then click Load preset.

After the preset loads, in the top-right corner, click Copy to form & edit to copy the configuration to the form. You can then update the configuration as needed before running the evaluation. If you make changes, you can save the modified configuration as a new preset.

To clear a loaded preset, in the top-right corner, click Clear. Then, either choose a different preset or manually configure the evaluation.

If needed, you can name the evaluation.

If you do not have any saved presets, click Configure without a preset and configure the evaluation manually. You can save the configuration as a preset before running the evaluation.

Configure an Evaluation Manually

Configure an evaluation manually when you want to start from a blank configuration instead of loading a saved preset.

Click Configure without a preset to manually configure the evaluation.

Configure Your Candidate

Click the Configure your candidate section to choose the candidate you want to evaluate. The candidate is the model, inference router, or dedicated inference deployment that generates responses for the evaluation dataset.

Then, click the Select a candidate type dropdown menu, and then select one of the following candidate types:

  • Serverless Inference
  • Third Party
  • Dedicated Inference
  • Model Router

Then, click the Select a candidate dropdown menu, and then select the specific candidate you want to evaluate.

For the selected candidate, click Show advanced settings to optionally configure supported model parameters:

  • Max Tokens: The maximum number of tokens the model can generate in its response.
  • Temperature: The amount of randomness in the model’s responses. Lower values make responses more predictable, while higher values make responses more varied.
  • Stop token: A text sequence that tells the model when to stop generating tokens. This setting is only available for specific models.
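
To build intuition for the Temperature setting, the following sketch applies temperature-scaled softmax to a few hypothetical logits. This is the common textbook formulation, not a description of how any particular model on the platform samples tokens. The function name and logit values are illustrative assumptions.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, scaled by a sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive for this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.5]
low = softmax_with_temperature(logits, 0.2)   # sharper distribution
high = softmax_with_temperature(logits, 2.0)  # flatter distribution
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

With the lower temperature, nearly all probability goes to the highest-scoring token, which is why low values tend to give more predictable responses.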

In the System prompt textbox, enter the system prompt that the candidate uses to interpret the dataset prompts. For best practices, see system prompt best practices.

Choose a Dataset

Click the Choose Dataset section to select the dataset for the evaluation.

Then, click the Select a dataset dropdown menu, and then select the dataset you want to use. Datasets must be in CSV or JSONL format, have fewer than 1,000 rows, and be less than 1 GB in size.
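
Before uploading, you may want to check a file against these limits locally. The following is a minimal sketch, not an official validator. It assumes that "1 GB" means 10**9 bytes, that the first CSV row is a header that does not count toward the row limit, and that the file extension reflects the format. The function name `check_dataset` is hypothetical.

```python
import csv
import os

MAX_ROWS = 1000            # datasets must have fewer than 1,000 rows
MAX_BYTES = 1_000_000_000  # assumption: "1 GB" read as 10**9 bytes

def check_dataset(path):
    """Return a list of problems found; an empty list means the file looks OK."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in (".csv", ".jsonl"):
        problems.append(f"unsupported format: {ext or 'no extension'}")
        return problems
    if os.path.getsize(path) >= MAX_BYTES:
        problems.append("file is not smaller than 1 GB")
    with open(path, newline="", encoding="utf-8") as handle:
        if ext == ".csv":
            # Assumption: the first CSV row is a header and is not counted.
            rows = max(sum(1 for _ in csv.reader(handle)) - 1, 0)
        else:
            # Count non-empty JSONL lines.
            rows = sum(1 for line in handle if line.strip())
    if rows >= MAX_ROWS:
        problems.append(f"{rows} rows; must be fewer than {MAX_ROWS}")
    return problems
```

For example, `check_dataset("dataset.jsonl")` returns an empty list when the file passes all three checks.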

If you need to add a dataset, click Manage datasets.

After you select a dataset, click Add dataset. In the Dataset Preview section, review the selected dataset. To view more rows, click Show full preview.

Upload a Dataset for a Model

To upload an evaluation dataset for a model, in the Datasets tab, click Upload dataset to open the Select dataset window.

In the Choose File section, either drag and drop your dataset file, or click Upload.

Then, click Add.

Configure Your Judge

Click the Configure judge & metrics section to choose the judge model and evaluation metrics.

In the Configure your LLM judge sub-section, click the Select a judge model dropdown menu, and then choose the model that scores the candidate’s responses. For information on which models can be used as judge models, see Available Models.

We send your inputs, outputs, and ground truth values, if available in the dataset, to the model provider for scoring and score rationale. Score rationale is automatically generated and may contain errors or omissions. Review and verify evaluation results before relying on them.

Select Evaluation Metrics

In the Evaluation metrics sub-section, either click Select all metrics to use all available metrics, or select the specific metrics you want the judge model to use.

Note

For metrics that measure desired behavior, such as Correctness, Completeness, and Ground Truth Faithfulness, higher scores are better. A row passes when the score is greater than or equal to the pass threshold.

For metrics that measure risk, such as Bias, Toxicity, and PII Leakage, lower scores are better. A row passes when the score is less than or equal to the pass threshold.
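
The pass rule in this note can be expressed as a small function. This is an illustrative sketch only. The mapping covers the built-in metrics named on this page, and the 0.7 threshold is a hypothetical value.

```python
# Metric direction for the built-in metrics; custom metrics are not listed.
HIGHER_IS_BETTER = {
    "Correctness": True,
    "Completeness": True,
    "Ground Truth Faithfulness": True,
    "Bias": False,
    "Toxicity": False,
    "PII Leakage": False,
}

def row_passes(metric, score, threshold):
    """Apply the pass rule for one dataset row and one metric."""
    if HIGHER_IS_BETTER[metric]:
        return score >= threshold
    return score <= threshold

# The same score passes for a quality metric but fails for a risk metric.
print(row_passes("Correctness", 0.8, 0.7))
print(row_passes("Toxicity", 0.8, 0.7))
```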

You can select from the following metrics:

Metric: Correctness
Description: Measures how factually accurate or consistent the model response is. Correctness does not use the dataset ground truth field. Scores range from 0 to 1, where higher scores indicate likely accuracy.
Recommendation: Flag low scores and adjust the system prompt to make the model explain its reasoning, cite sources, or avoid unsupported claims. You can also try a different candidate model.

Metric: Completeness
Description: Measures how thoroughly the response covers key details. The key details are based on the reference when provided, or inferred from the input when no reference is provided. Scores range from 0 to 1.
Recommendation: Flag low scores and adjust the system prompt to require comprehensive answers, specific required details, or a structured response format.

Metric: Ground Truth Faithfulness
Description: Measures whether the response is both correct and complete compared to the dataset ground truth. This metric requires ground truth values in the dataset. If the dataset does not include ground truth values, this metric is skipped. Scores range from 0 to 1.
Recommendation: Use this metric when you have expected answers or reference outputs. Flag low scores and compare the candidate response against the ground truth to identify missing or incorrect details.

Metric: Harmfulness
Description: Evaluates whether the candidate produces harmful, toxic, or unsafe content. Harmfulness includes the following sub-metrics, which you can select or deselect individually:

  • Bias: Detects whether the candidate response contains unfair or discriminatory language toward individuals or groups.
  • Toxicity: Detects whether the candidate response contains harmful, offensive, or abusive content.
  • PII Leakage: Detects whether the candidate response contains or exposes personally identifiable information.

If you deselect all Harmfulness sub-metrics, Harmfulness is deselected.
Recommendation: Flag high scores and review the selected sub-metric results to identify the type of harmful content detected. Update the system prompt to require safer, more neutral, or more privacy-preserving responses.

Select Custom Metrics

Custom metrics let you define your own metrics for evaluations. Use custom metrics when the default metrics don’t cover the behavior, policy, format, or use case you want to evaluate.

To select a custom metric, in the Custom metrics sub-section, click the Add a custom metric(s) dropdown menu, and then select the metrics you want to use.

To deselect a metric, clear the checkbox next to the metric. You can also click the menu icon next to the metric, and then click Remove metric.

If you want to update or delete a metric, see Manage Custom Metrics.

If you don’t have any custom metrics, or if your existing custom metrics don’t cover what you need to evaluate, create a new custom metric.

Create a Custom Metric

To create a custom metric, in the Custom metrics sub-section, click Add metric to open the Create metric window. For guidance on writing effective custom metrics, see Create Custom Metrics.

In the Metric name field, enter a short, descriptive name for the custom metric. For example, Support Policy Compliance.

Optionally, in the Description field, describe what the metric measures. For example, Measures whether the response follows the required support policy and escalation guidance.

In the Scoring prompt textbox, provide detailed instructions for scoring the metric. Explain what the judge model should evaluate, what a good response includes, and what should lower the score. For example:

Evaluate whether the response follows the support policy provided in the prompt. A strong response follows the policy accurately, includes required escalation steps, and does not promise unsupported outcomes. Lower the score if the response skips required policy steps, contradicts the policy, or gives unsupported guarantees.

If the metric requires an expected answer or reference value from the dataset, select Requires ground truth. If the evaluation dataset does not include ground truth values, the metric is skipped.
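
The skip behavior for metrics that require ground truth can be sketched as follows. The `Metric` class and `split_runnable` function are hypothetical names for illustration, and marking Support Policy Compliance as requiring ground truth is an assumption made for the example.

```python
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    requires_ground_truth: bool = False

def split_runnable(metrics, dataset_has_ground_truth):
    """Return (runnable, skipped) metric names for a dataset."""
    runnable, skipped = [], []
    for metric in metrics:
        if metric.requires_ground_truth and not dataset_has_ground_truth:
            skipped.append(metric.name)
        else:
            runnable.append(metric.name)
    return runnable, skipped

selected = [
    Metric("Correctness"),
    Metric("Ground Truth Faithfulness", requires_ground_truth=True),
    Metric("Support Policy Compliance", requires_ground_truth=True),
]
# With no ground truth column, only Correctness would be scored.
print(split_runnable(selected, dataset_has_ground_truth=False))
```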

Then, click Create.

Manage Custom Metrics

To edit a custom metric, click the menu icon next to the metric, and then click Edit metric to open the Edit metric window.

Update any of the metric configuration fields you set when you created the metric, and then click Save.

To delete a custom metric, click the menu icon next to the metric, and then click Delete to open the Delete metric window.

Deleting a custom metric removes it from future evaluation runs. This action is irreversible.

Then, enter the metric name, and then click Delete metric to confirm deletion.

Select a Star Metric

In the Star Metric sub-section, either keep the default star metric, or click Change to open the Select star metric window.

The star metric is the primary metric you want to use to evaluate the candidate. The default star metric is Correctness.

In the Select star metric window, select one of the available metrics. You can choose from the built-in metrics or any custom metric you selected for the evaluation.

Then, click Update star metric to confirm your selection.

In the Star metric pass threshold field, set the passing score threshold for the star metric. This threshold determines whether each dataset row passes or fails for the star metric.

For metrics where higher scores are better, a row passes when its score is greater than or equal to the threshold. For metrics where lower scores are better, a row passes when its score is less than or equal to the threshold.

Review the results of several runs and adjust the threshold based on what you consider a passing or failing score. For example, if you want to ensure your model does not generate unsafe content, you can select Toxicity as the star metric. For Toxicity, lower scores are better, so a row passes when its Toxicity score is less than or equal to the threshold.
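
One way to choose a threshold is to see how the pass rate changes across a few candidate values. The following sketch assumes you have collected per-row Toxicity scores from earlier runs. The scores and thresholds shown are hypothetical.

```python
def pass_rate(scores, threshold, higher_is_better):
    """Fraction of rows that pass at the given threshold."""
    if not scores:
        return 0.0
    if higher_is_better:
        passed = sum(1 for s in scores if s >= threshold)
    else:
        passed = sum(1 for s in scores if s <= threshold)
    return passed / len(scores)

# Hypothetical Toxicity scores from earlier runs (lower is better).
toxicity_scores = [0.02, 0.05, 0.10, 0.30, 0.65]
for threshold in (0.05, 0.10, 0.50):
    rate = pass_rate(toxicity_scores, threshold, higher_is_better=False)
    print(f"threshold={threshold:.2f} pass_rate={rate:.0%}")
```

A stricter (lower) Toxicity threshold fails more rows, so the sweep shows the trade-off between strictness and pass rate.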

Name the Evaluation

Click the Optional settings section to name the evaluation run.

In the Name your evaluation run (optional) field, enter a name for the evaluation.

You can also save the evaluation configuration as a preset.

Save an Evaluation as a Preset

Presets let you save evaluation configurations for later use.

To save an evaluation as a preset, click Optional settings, and then select Save this configuration as a preset.

In the Preset name field, enter a name for the preset.

In the Choose what to save sub-section, either keep Select all sections selected to save the full configuration, or deselect it and select the configuration sections you want to include in the preset. You can save the Dataset, Candidate model and configuration, Judge model, and Evaluation metrics sections independently.

Run the Evaluation

After configuring your evaluation, click Run Evaluation.

Note

To cancel a queued evaluation, go to the Evaluations page, click the menu icon next to the evaluation, and then click Cancel Evaluation.

After the run completes, the evaluation results page shows a summary of the run. The summary includes:

  • Overall Score: The evaluation run’s overall pass score.
  • Dataset: The dataset used for the evaluation.
  • Star Metric: The primary metric used to evaluate the candidate.
  • Pass Threshold: The score threshold required to pass the star metric.
  • Candidate Tokens: The candidate model’s input, output, and total token usage.
  • Judge Tokens: The judge model’s input, output, and total token usage.
  • Candidate Latency: The candidate model’s average, percentile, minimum, and maximum latency.
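
As a rough illustration of the latency figures in the summary, the sketch below computes an average, a percentile, a minimum, and a maximum from per-request latencies. Which percentile the results page reports is not stated here, so the 95th percentile and the nearest-rank method are assumptions, as are the sample values.

```python
import math
import statistics

def latency_summary(latencies_ms, percentile=95):
    """Summarize per-request latencies; percentile uses the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = max(math.ceil(percentile / 100 * len(ordered)), 1)
    return {
        "average": statistics.mean(ordered),
        "percentile": ordered[rank - 1],
        "minimum": ordered[0],
        "maximum": ordered[-1],
    }

samples = [120, 135, 150, 160, 900]  # hypothetical latencies in milliseconds
print(latency_summary(samples))
```

A single slow request pulls the average well above the typical value, which is why the percentile and maximum are worth reading alongside it.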

The results chart shows the pass and fail percentage for each selected metric.

The Evaluation results table shows each dataset entry with the candidate output, selected metrics, scores, and judge rationale. The table includes the following fields:

  • Entry: The dataset row number.
  • Input: The prompt or input from the dataset. To view the full input, click Show full input.
  • Output: The candidate model’s response. To view the full output, click Show full output.
  • Metric: The selected evaluation metric.
  • Score: The score for the metric.
  • Reason: The judge model’s rationale for the score.
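
The pass percentage per metric can be derived from row-level results like those in the table. The following is a sketch under assumptions: the row dictionaries, their key names, and the per-metric thresholds are hypothetical, and the downloaded results file may be structured differently.

```python
from collections import defaultdict

def pass_percentages(rows, thresholds, lower_is_better=()):
    """Return {metric: percent of rows passing} for result-table rows."""
    passed, total = defaultdict(int), defaultdict(int)
    for row in rows:
        metric, score = row["metric"], row["score"]
        limit = thresholds[metric]
        ok = score <= limit if metric in lower_is_better else score >= limit
        total[metric] += 1
        passed[metric] += int(ok)
    return {m: 100.0 * passed[m] / total[m] for m in total}

rows = [
    {"entry": 1, "metric": "Correctness", "score": 0.9},
    {"entry": 1, "metric": "Toxicity", "score": 0.1},
    {"entry": 2, "metric": "Correctness", "score": 0.4},
    {"entry": 2, "metric": "Toxicity", "score": 0.0},
]
result = pass_percentages(
    rows,
    thresholds={"Correctness": 0.7, "Toxicity": 0.2},  # hypothetical thresholds
    lower_is_better={"Toxicity"},
)
print(result)
```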

In the top-right corner, click Download results to download a JSON file with additional details about the evaluation results.

Our data privacy policy describes zero data retention for this flow, so your data is never stored outside of DigitalOcean, and your data is not used to train models.

Compare Evaluations

Compare evaluations to review differences between two evaluation runs, including overall score, configuration, metric performance, token usage, latency, and per-entry results.

In the top-right corner, click Compare Evaluation to open the Compare Evaluations page.

In the Evaluation A section, the current evaluation is selected by default. To compare a different evaluation, click its dropdown menu, and then select another evaluation.

In the Evaluation B section, click the Select an evaluation dropdown menu, and then select the evaluation you want to compare against Evaluation A.

After you select both evaluations, the comparison page shows the following sections:

  • Overall Score: Compares each evaluation’s overall score, star metric, and pass threshold.
  • Configuration: Compares the candidate, judge model, dataset, temperature, max tokens, and runtime for each evaluation.
  • Metric Comparison: Shows each evaluation’s pass percentage for each selected metric.
  • Performance: Compares token usage and latency, including candidate input tokens, candidate output tokens, judge input tokens, judge output tokens, average latency, and percentile latency.
  • Per-Entry Results: Compares row-level inputs and candidate outputs side by side when both evaluations use the same dataset. If the evaluations use different datasets, per-entry results can’t be compared side by side.
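
If you work with exported results, a similar comparison can be sketched in code. The run dictionaries and their keys below are hypothetical and do not represent the platform's data format. The sketch only mirrors two ideas from the list above: per-metric pass percentage differences, and pairing rows only when both runs use the same dataset.

```python
def compare_runs(run_a, run_b):
    """Summarize metric deltas (B minus A) and whether rows can be paired."""
    shared = sorted(set(run_a["pass_pct"]) & set(run_b["pass_pct"]))
    deltas = {m: run_b["pass_pct"][m] - run_a["pass_pct"][m] for m in shared}
    same_dataset = run_a["dataset"] == run_b["dataset"]
    return {"metric_delta": deltas, "per_entry_comparable": same_dataset}

# Hypothetical summaries for two runs on the same dataset.
run_a = {"dataset": "faq.jsonl", "pass_pct": {"Correctness": 70.0, "Toxicity": 100.0}}
run_b = {"dataset": "faq.jsonl", "pass_pct": {"Correctness": 85.0, "Toxicity": 95.0}}
print(compare_runs(run_a, run_b))
```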

Edit an Evaluation Name

To edit an evaluation’s name, go to the DigitalOcean Control Panel, in the left menu, click INFERENCE, and then click Evaluations.

Then, click the menu icon next to the evaluation, and then click Edit name to open the Edit evaluation name window.

Then, update the evaluation name, and then click Save.

Delete an Evaluation

To delete an evaluation, go to the DigitalOcean Control Panel, in the left menu, click INFERENCE, and then click Evaluations.

Then, click the menu icon next to the evaluation, and then click Delete Evaluation to open the Delete evaluation window.

Then, to confirm deletion, enter the name of the evaluation, and then click Delete evaluation.

Run an Evaluation Using the API

To run an evaluation using the API, first upload an evaluation dataset by sending a POST request to the /v2/gen-ai/model_evaluation/datasets/file_upload_presigned_urls endpoint:

How to Upload an Evaluation Dataset Using the DigitalOcean API

Create a personal access token and save it for use with the API.

Send a POST request to https://api.digitalocean.com/v2/gen-ai/model_evaluation/datasets/file_upload_presigned_urls.

Using cURL:

curl -X POST \
  -H "Content-Type: application/json"  \
  -H "Authorization: Bearer $DIGITALOCEAN_TOKEN" \
  "https://api.digitalocean.com/v2/gen-ai/model_evaluation/datasets/file_upload_presigned_urls" \
  -d '{
    "files": [{
      "file_name": "dataset.jsonl",
      "file_size": 2048
    }]
  }'
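
Rather than typing the file name and size by hand, you can derive them from the local file. The sketch below builds the same request body as the cURL example. It assumes `file_size` is the size in bytes. The helper name `presigned_upload_body` and the sample dataset row are hypothetical, and the sketch does not send the request.

```python
import json
import os
import tempfile

def presigned_upload_body(paths):
    """Build the JSON body for the file_upload_presigned_urls request."""
    files = [
        {"file_name": os.path.basename(p), "file_size": os.path.getsize(p)}
        for p in paths
    ]
    return json.dumps({"files": files})

# Demonstrate with a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "dataset.jsonl")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write('{"input": "What is a Droplet?"}\n')  # hypothetical row
    print(presigned_upload_body([path]))
```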

Then, create an evaluation run for the model. To create an evaluation run, you need the candidate and judge model IDs. Use the List Available Models endpoint to get supported text models from serverless inference and dedicated inference, as well as available inference routers.

Note

Not all models support evaluations.

After you have the supported candidate and judge model IDs, send the following request to start an evaluation run:

How to Create a Model Evaluation Run Using the DigitalOcean API

Send a POST request to https://api.digitalocean.com/v2/gen-ai/model_evaluation_runs.

Using cURL:

curl -X POST \
  -H "Content-Type: application/json"  \
  -H "Authorization: Bearer $DIGITALOCEAN_TOKEN" \
  "https://api.digitalocean.com/v2/gen-ai/model_evaluation_runs" \
  -d '{
    "name": "my-evaluation-run",
    "candidate_model_uuid": "123e4567-e89b-12d3-a456-426614174000",
    "judge_model_uuid": "223e4567-e89b-12d3-a456-426614174001",
    "dataset_uuid": "323e4567-e89b-12d3-a456-426614174002",
    "metric_uuids": [
      "423e4567-e89b-12d3-a456-426614174003"
    ]
  }'
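
A malformed UUID is an easy mistake to make in this request body. The following sketch builds the same body as the cURL example and checks each UUID locally first. The helper name is hypothetical, and the requirement of at least one metric UUID is an assumption rather than a documented rule. The sketch does not send the request.

```python
import json
import uuid

def evaluation_run_body(name, candidate, judge, dataset, metrics):
    """Build the JSON body for POST /v2/gen-ai/model_evaluation_runs."""
    def as_uuid(value):
        # uuid.UUID raises ValueError for malformed input.
        return str(uuid.UUID(value))
    if not metrics:
        # Assumption: a run needs at least one metric.
        raise ValueError("at least one metric UUID is required")
    return json.dumps({
        "name": name,
        "candidate_model_uuid": as_uuid(candidate),
        "judge_model_uuid": as_uuid(judge),
        "dataset_uuid": as_uuid(dataset),
        "metric_uuids": [as_uuid(m) for m in metrics],
    })

# Placeholder UUIDs copied from the cURL example.
body = evaluation_run_body(
    "my-evaluation-run",
    "123e4567-e89b-12d3-a456-426614174000",
    "223e4567-e89b-12d3-a456-426614174001",
    "323e4567-e89b-12d3-a456-426614174002",
    ["423e4567-e89b-12d3-a456-426614174003"],
)
print(body)
```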

After the run completes, retrieve the evaluation metrics:

How to List Model Evaluation Metrics Using the DigitalOcean API

Send a GET request to https://api.digitalocean.com/v2/gen-ai/model_evaluation_metrics.

Using cURL:

curl -X GET \
  -H "Content-Type: application/json"  \
  -H "Authorization: Bearer $DIGITALOCEAN_TOKEN" \
  "https://api.digitalocean.com/v2/gen-ai/model_evaluation_metrics"

You can download a JSON file with additional details about the evaluation results:

How to Download Evaluation Results Using the DigitalOcean API

Send a GET request to https://api.digitalocean.com/v2/gen-ai/model_evaluation_runs/{eval_run_uuid}/results/download_url.

Using cURL:

curl -X GET \
  -H "Content-Type: application/json"  \
  -H "Authorization: Bearer $DIGITALOCEAN_TOKEN" \
  "https://api.digitalocean.com/v2/gen-ai/model_evaluation_runs/123e4567-e89b-12d3-a456-426614174000/results/download_url"
