Prompt caches are held for a limited period of time. Treat prompt caching as opportunistic; do not depend on it for predictable cost savings.
How to Use Prompt Caching in the Chat Completions and Responses APIs
Last verified 1 Jul 2026
Inference provides a single control plane for managing inference workflows. It includes a Model Catalog where you can view available foundation models, including both DigitalOcean-hosted and third-party commercial models, compare model capabilities and pricing, use routing to match inference requests to the best-fit model, and run inference using serverless or dedicated deployments.
Use prompt caching with the chat completions and responses APIs to cache context and reuse it in future requests. If part of your request is already cached, you are charged a lower price for those cached tokens and the standard price for the remaining input tokens. This can significantly reduce the cost of inference for requests that share context.
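The billing rule above can be sketched as a small calculation. This is a minimal illustration, not the platform's billing logic: the function name and the per-million-token prices are hypothetical placeholders, so substitute the rates listed for your model.

```python
def estimate_input_cost(uncached_tokens, cached_tokens,
                        standard_price_per_million, cached_price_per_million):
    """Estimate input cost when part of a request is served from cache.

    Prices are hypothetical placeholders expressed per one million tokens.
    """
    if uncached_tokens < 0 or cached_tokens < 0:
        raise ValueError("token counts must be non-negative")
    standard_cost = uncached_tokens * standard_price_per_million / 1_000_000
    cached_cost = cached_tokens * cached_price_per_million / 1_000_000
    return standard_cost + cached_cost


# Hypothetical prices: 1.00 per million standard tokens, 0.10 per million cached.
with_cache = estimate_input_cost(200, 1000, 1.00, 0.10)
without_cache = estimate_input_cost(1200, 0, 1.00, 0.10)
print(f"with cache: {with_cache:.6f}, without cache: {without_cache:.6f}")
```

The function takes the cached and uncached counts as separate arguments because the way a response reports them can differ between model families, as the usage examples on this page show.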
Anthropic Models
Use prompt caching for Anthropic models in the chat completions API. Specify the cache_control parameter with type: ephemeral and ttl in your JSON request body. The ttl value can be 5m (default) or 1h. The following request body examples show how to use the cache_control parameter.
...
{
"role": "user",
"content": {
"type": "text",
"text": "This is cached for 1h.",
"cache_control": {
"type": "ephemeral",
"ttl": "1h"
}
}
}
...
{
"role": "developer",
"content": [
{
"type": "text",
"text": "Cache this segment for 5 minutes.",
"cache_control": {
"type": "ephemeral",
"ttl": "5m"
}
},
{
"type": "text",
"text": "Do not cache this segment"
}
]
}
...
{
"role": "tool",
"tool_call_id": "tool_call_id",
"content": [
{
"type": "text",
"text": "Tool output cached for 5m.",
"cache_control": {
"type": "ephemeral",
"ttl": "5m"
}
}
]
}
The JSON response looks similar to the following and shows the number of input tokens cached during this request:
"usage": {
"cache_created_input_tokens": 1043,
"cache_creation": {
"ephemeral_1h_input_tokens": 0,
"ephemeral_5m_input_tokens": 1043
},
"cache_read_input_tokens": 0,
"completion_tokens": 100,
"prompt_tokens": 14,
"total_tokens": 114
}
If you send the request again, cached input tokens are used and the response looks like this:
"usage": {
"cache_created_input_tokens": 0,
"cache_creation": {
"ephemeral_1h_input_tokens": 0,
"ephemeral_5m_input_tokens": 0
},
"cache_read_input_tokens": 1043,
"completion_tokens": 100,
"prompt_tokens": 14,
"total_tokens": 114
}
OpenAI Models
Use prompt caching for OpenAI models with prompts containing 1024 tokens or more in both the chat completions and responses APIs. Caching applies when the input tokens of a request match the prefix of a previous request, though this is best-effort and not guaranteed.
To use prompt caching, specify the prompt_cache_retention parameter as either in_memory or 24h. The following request body example shows how to use the prompt_cache_retention parameter:
...
{
"model": "gpt-4o-mini",
"prompt_cache_retention": "24h",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant that summarizes text."
},
{
"role": "user",
"content": "Summarize the following text:\n\nArtificial intelligence is transforming industries by automating tasks, improving efficiency, and enabling new innovations..."
}
],
"temperature": 0.2
}
The JSON response looks similar to the following and shows the number of input tokens cached during this request:
{
"id": "chatcmpl-xyz789",
"object": "chat.completion",
"created": 1772134300,
"model": "gpt-4o-mini",
"choices": [
{
"index": 0,
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": "Artificial intelligence is reshaping industries by automating processes, increasing efficiency, and enabling innovation."
}
}
],
"usage": {
"prompt_tokens": 1200,
"completion_tokens": 35,
"total_tokens": 1235,
"cache_read_input_tokens": 0,
"cache_created_input_tokens": 1200,
"cache_creation": {
"ephemeral_5m_input_tokens": 0,
"ephemeral_1h_input_tokens": 1200
}
}
}
If you send the request again within the retention window, cached input tokens are used and the response looks like this:
"usage": {
"prompt_tokens": 1200,
"completion_tokens": 34,
"total_tokens": 1234,
"cache_read_input_tokens": 1200,
"cache_created_input_tokens": 0,
"cache_creation": {
"ephemeral_5m_input_tokens": 0,
"ephemeral_1h_input_tokens": 0
}
}
Open-Source Models
Prompt caching for open-source models is in public preview. Open-source models, such as DeepSeek V3.2, support prompt caching automatically. You do not need to set cache_control or prompt_cache_retention.
Caches are isolated per customer account. The platform derives cache entries from a per-account key, so the cache for one account is never reused by another, and another account cannot read or infer your cached content.
Caching is applied on a best-effort basis when the input tokens of a request match the prefix of a previous request. The number of cached tokens served from a previous request is reported in two places in the usage object: cache_read_input_tokens and prompt_tokens_details.cached_tokens:
"prompt_tokens_details": {
"cached_tokens": 128
}
When cached_tokens and cache_read_input_tokens are greater than 0, those input tokens were served from cache and billed at the discounted cached rate. A value of 0 means no cached prefix was matched and all input tokens were billed at the standard rate.
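Because caching is best-effort, it can help to track how often it actually applies. The sketch below aggregates the two fields described above across several usage objects. It assumes, as in the open-source example on this page, that prompt_tokens includes the cached tokens; the function names are illustrative and not part of any SDK.

```python
def cached_tokens(usage):
    """Return the cached input token count from a usage dict.

    Reads prompt_tokens_details.cached_tokens first and falls back to
    cache_read_input_tokens; missing fields count as 0.
    """
    details = usage.get("prompt_tokens_details") or {}
    value = details.get("cached_tokens")
    if value is None:
        value = usage.get("cache_read_input_tokens", 0)
    return value or 0


def cache_hit_rate(usages):
    """Fraction of input tokens served from cache across many responses."""
    total_prompt = sum(u.get("prompt_tokens", 0) for u in usages)
    if total_prompt == 0:
        return 0.0
    total_cached = sum(cached_tokens(u) for u in usages)
    return total_cached / total_prompt


usages = [
    {"prompt_tokens": 214, "cache_read_input_tokens": 0,
     "prompt_tokens_details": {"cached_tokens": 0}},    # cache miss
    {"prompt_tokens": 214, "cache_read_input_tokens": 128,
     "prompt_tokens_details": {"cached_tokens": 128}},  # cache hit
]
print(f"cache hit rate: {cache_hit_rate(usages):.1%}")
```

A rate that stays near zero for requests you expect to share a prefix suggests the prefix is changing between requests.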
Caching matches on an exact token prefix, so a single differing token at the start of a request prevents a cache match. To maximize cache hits, structure your prompts so that static content, such as the system prompt, tool definitions, and reference documents, comes first, and dynamic content, such as the user’s latest message, comes last. Avoid placing frequently changing values like timestamps or request IDs near the beginning of the prompt.
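One way to apply this ordering is to assemble each request from a fixed prefix and a variable suffix. The sketch below is illustrative only: the helper names, the choice of message roles, and the sample strings are assumptions, and whether a given request hits the cache still depends on the platform.

```python
def build_messages(system_prompt, reference_docs, user_message):
    """Place static content first and the changing user message last."""
    messages = [{"role": "system", "content": system_prompt}]
    # Reference documents change rarely, so they stay in the shared prefix.
    for doc in reference_docs:
        messages.append({"role": "user", "content": doc})
    # The dynamic part goes last so it does not disturb the prefix.
    messages.append({"role": "user", "content": user_message})
    return messages


def shared_prefix_length(a, b):
    """Count the leading messages that two requests have in common."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count


SYSTEM = "You are a helpful assistant that summarizes text."
DOCS = ["Reference document A ...", "Reference document B ..."]

first = build_messages(SYSTEM, DOCS, "Summarize document A.")
second = build_messages(SYSTEM, DOCS, "Summarize document B.")
print(shared_prefix_length(first, second))  # 3
```

The comparison here works at the message level for readability; the cache itself matches on tokens, so the same principle applies within a single message as well.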
The following example request sends a prompt to a DeepSeek V3.2 model:
curl https://inference.do-ai.run/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $MODEL_ACCESS_TOKEN" \
-d '{
"model": "deepseek-3.2",
"messages": [
{
"role": "user",
"content": "Summarize this: For synchronous proxy: Expose a POST /v1/chat endpoint. When a request comes in, route the payload to the Primary LLM using the DigitalOcean Serverless Inference API and return the response to the user immediately. For asynchronous execution: In the background, concurrently route the exact same request to the Candidate LLM endpoint. The latency, errors, or failure of the candidate model must never delay or impact the primary response returned to the user. For deterministic evaluation: Once the Candidate model responds, evaluate and compare both outputs. Implement a heuristic rule checking: 1) Did both models return valid, parseable JSON payloads? 2) Extract the {\"action\": \"...\"} key from both models and assert whether they match exactly. For observability API: Expose a GET /metrics endpoint returning a real-time summary of total requests processed, shadow execution errors or timeouts, and the exact match rate percentage (%) between the Primary and Candidate outputs. Breakdown of prompt tokens per the OpenAI-compatible usage schema."
}
],
"max_tokens": 5000
}'
The response looks similar to the following and includes the model output and a usage block. On a cache hit, cache_read_input_tokens and prompt_tokens_details.cached_tokens report the number of input tokens served from cache:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"content": "Here's a concise summary of the process described:\n\n- **Primary synchronous flow**: User POST requests to `/v1/chat` go to a **Primary LLM** via DigitalOcean Serverless Inference API. The response is returned immediately.\n\n- **Asynchronous shadow execution**: The same request is sent concurrently to a **Candidate LLM** in the background. Failures or latency here **never affect** the primary user response.\n\n- **Deterministic evaluation**: Once the Candidate responds, outputs are compared using rules:\n 1. Both responses must be valid JSON.\n 2. The \"action\" field is extracted and must match exactly between models.\n\n- **Observability**: A `GET /metrics` endpoint provides real-time metrics on:\n - Total requests\n - Shadow execution errors/timeouts\n - Match rate percentage between Primary and Candidate\n - Token usage breakdown (OpenAI-compatible format)\n\n**Purpose**: Safely test a new model (Candidate) against the current model (Primary) without impacting users, while tracking performance and consistency.",
"reasoning_content": null,
"refusal": null,
"role": "assistant"
}
}
],
"created": 1782342911,
"id": "chatcmpl-b17862c6-333c-4a1d-aec0-cebd14bc93ad",
"model": "deepseek-3.2",
"object": "chat.completion",
"usage": {
"cache_created_input_tokens": 0,
"cache_creation": {
"ephemeral_1h_input_tokens": 0,
"ephemeral_5m_input_tokens": 0
},
"cache_read_input_tokens": 128,
"completion_tokens": 217,
"prompt_tokens": 214,
"prompt_tokens_details": {
"cached_tokens": 128
},
"speed": null,
"total_tokens": 431
}
}
In this example, prompt_tokens is 214 and cached_tokens is 128, indicating that 128 of the 214 input tokens were served from cache at the discounted rate, and the remaining 86 were billed at the standard input rate.
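The arithmetic in this example can be reproduced from the usage block. This is a minimal sketch, assuming prompt_tokens includes the cached tokens as it does in the response above; the function name is hypothetical.

```python
def split_input_tokens(usage):
    """Return (cached, uncached) input token counts for one response."""
    prompt_tokens = usage["prompt_tokens"]
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    if not 0 <= cached <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    return cached, prompt_tokens - cached


usage = {
    "prompt_tokens": 214,
    "prompt_tokens_details": {"cached_tokens": 128},
}
cached, uncached = split_input_tokens(usage)
print(cached, uncached)  # 128 86
```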