Overview

LemonData provides an intelligent caching system that can significantly reduce your API costs and response latency. Our caching goes beyond simple request matching: it understands the semantic meaning of your prompts.

Cost Savings

Cache hits are billed at a fraction of the normal cost.

Faster Responses

Cached responses are returned instantly; no model inference is needed.

Context-Aware

Semantic matching finds similar requests even with different wording.

Privacy Controls

Full control over what gets cached and shared.

How It Works

LemonData uses a two-layer caching system:

Layer 1: Response Cache (Exact Match)

For deterministic requests (temperature=0), we cache the exact response:
  • Match: Identical model, messages, and parameters
  • Speed: Instant (microseconds)
  • Best for: Repeated identical queries
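The exact keying scheme is internal to LemonData, but conceptually the exact-match layer can be pictured as hashing the canonicalized request. The sketch below is illustrative only (the function name and hashing choice are assumptions, not the production implementation):

```python
import hashlib
import json

def exact_cache_key(model: str, messages: list, **params) -> str:
    """Illustrative exact-match cache key: hash the canonical JSON of
    model + messages + parameters. Any change to any field yields a
    different key, so only truly identical requests hit this layer."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,          # canonical ordering so dict key order doesn't matter
        separators=(",", ":"),   # stable, compact serialization
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the key covers every parameter, adding `"temperature": 0.5` or editing one character of a message produces a different key, which is why this layer only serves repeated identical queries.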

Layer 2: Semantic Cache (Similarity Match)

For all requests, we also check semantic similarity using a two-stage matching algorithm:
  • Stage 1 (Query-only): ≥95% similarity on user query
  • Stage 2 (Full context): ≥85% similarity including conversation context
  • Best for: FAQ-style queries, common questions
Example:
User A: "What is the capital of France?"
User B: "Tell me the capital city of France"
→ Same cached response (high semantic similarity)
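The two-stage check can be sketched as follows. The real service uses an embedding model that is not public, so the bag-of-words cosine similarity here is only a stand-in to illustrate the threshold logic; function names are hypothetical:

```python
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity (stand-in for a real embedding model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_cache_match(query: str, context: str,
                         cached_query: str, cached_context: str) -> bool:
    """Two-stage match: the user query alone must clear the stricter
    95% threshold, then the full conversation must clear 85%."""
    if cosine_sim(query, cached_query) < 0.95:
        return False  # Stage 1 failed: queries not similar enough
    full = context + " " + query
    cached_full = cached_context + " " + cached_query
    return cosine_sim(full, cached_full) >= 0.85  # Stage 2: full context
```

The stricter query-only stage acts as a cheap filter; the full-context stage guards against serving a cached answer whose conversation history differs from the current one.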

Cache Control

Request-Level Control

Control caching behavior per-request using the cache_control parameter in the request body:
# Skip cache lookup, always call the model
curl https://api.lemondata.cc/v1/chat/completions \
  -H "Authorization: Bearer sk-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}],
    "cache_control": {"type": "no_cache"}
  }'
Supported types:
  • no_cache: Skip cache lookup, always get a fresh response
  • no_store: Don’t store this response in the cache
  • response_only: Only use the exact-match cache (skip semantic)
  • semantic_only: Only use the semantic cache (skip exact match)
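When assembling requests in code, it can help to validate the mode before sending. The helper below is a sketch (the name `build_chat_request` and the validation set are ours; the `cache_control` field itself matches the request body shown above):

```python
# Documented cache_control types for the request body
CACHE_MODES = {"no_cache", "no_store", "response_only", "semantic_only"}

def build_chat_request(messages, model="gpt-4o", cache_mode=None):
    """Assemble a chat-completions request body, optionally attaching a
    cache_control directive validated against the documented types."""
    if cache_mode is not None and cache_mode not in CACHE_MODES:
        raise ValueError(f"unknown cache_control type: {cache_mode!r}")
    body = {"model": model, "messages": messages}
    if cache_mode is not None:
        body["cache_control"] = {"type": cache_mode}
    return body
```

Omitting `cache_mode` leaves `cache_control` out of the body entirely, so the default two-layer caching behavior applies.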

Response Headers

Every response includes cache status:
X-Cache-Status: HIT    # Response served from cache
X-Cache-Status: MISS   # Fresh response from model

Checking Cache Status

from openai import OpenAI

client = OpenAI(
    api_key="sk-your-key",
    base_url="https://api.lemondata.cc/v1"
)

# Use with_raw_response to access HTTP headers alongside the parsed body
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}]
)

print(f"Cache: {raw.headers.get('X-Cache-Status')}")
response = raw.parse()  # the usual ChatCompletion object

Cache Billing

Cache hits are significantly cheaper than fresh requests:
  • Cache HIT: 90% off
  • Cache MISS: Full price
The exact discount is shown in your dashboard usage logs.
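To estimate savings, combine your hit rate with the discount. This helper assumes a flat 90% discount on hits (the figure above); your dashboard shows the exact numbers:

```python
def estimated_spend(n_requests: int, hit_rate: float,
                    price_per_request: float, hit_discount: float = 0.90) -> float:
    """Blended cost: cache hits are billed at (1 - hit_discount) of full
    price, misses at full price. hit_rate is a fraction in [0, 1]."""
    hits = n_requests * hit_rate
    misses = n_requests - hits
    return hits * price_per_request * (1 - hit_discount) + misses * price_per_request
```

For example, 1,000 requests at $0.01 each with a 60% hit rate cost about $4.60 instead of $10.00.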

Privacy Controls

Organization / User Level

Configure caching behavior in your dashboard settings:
  • Shared: Cache enabled; responses may be shared across users (default for personal accounts)
  • Isolated: Cache enabled, but responses are private to your organization (default for organizations)
  • Disabled: No caching at all
Additional settings available:
  • Similarity Threshold: Adjust semantic matching sensitivity (default: 92%)
  • Custom TTL: Override cache expiration time
  • Excluded Models: Disable caching for specific models

Request Level

Override per-request using the cache_control parameter:
# Disable caching for this request
curl https://api.lemondata.cc/v1/chat/completions \
  -H "Authorization: Bearer sk-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "..."}],
    "cache_control": {"type": "no_store"}
  }'

Cache Feedback

If you receive an incorrect cached response, you can report it:
curl -X POST https://api.lemondata.cc/v1/cache/feedback \
  -H "Authorization: Bearer sk-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "cache_entry_id": "abc123",
    "feedback_type": "wrong_answer",
    "description": "Response was outdated"
  }'
Feedback types:
  • wrong_answer - Factually incorrect
  • outdated - Information is stale
  • irrelevant - Doesn’t match the question
  • other - Other issues
When a cache entry receives enough negative feedback, it’s automatically invalidated.
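In Python, the same feedback payload can be assembled and validated before posting with any HTTP client. The builder below is a sketch; its name is ours, but the fields and the validation set mirror the documented request:

```python
# Documented feedback types for the /v1/cache/feedback endpoint
FEEDBACK_TYPES = {"wrong_answer", "outdated", "irrelevant", "other"}

def build_feedback(cache_entry_id: str, feedback_type: str,
                   description: str = "") -> dict:
    """Validate and assemble a cache-feedback payload matching the
    curl example above. Raises on an unknown feedback type."""
    if feedback_type not in FEEDBACK_TYPES:
        raise ValueError(f"unknown feedback_type: {feedback_type!r}")
    return {
        "cache_entry_id": cache_entry_id,
        "feedback_type": feedback_type,
        "description": description,
    }
```

POST the resulting dict as JSON to `https://api.lemondata.cc/v1/cache/feedback` with your `Authorization: Bearer` header, exactly as in the curl example.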

Best Practices

  • Use deterministic settings (temperature=0) to maximize cache hit rates.
  • Format prompts consistently to improve semantic matching.
  • Skip the cache for current events and other real-time data.
  • Check your dashboard for cache statistics and savings.
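The first two practices can be applied programmatically. This sketch pins temperature=0 and collapses whitespace before sending (the function name is illustrative, not an SDK API):

```python
def cache_friendly_request(messages: list, model: str = "gpt-4o") -> dict:
    """Normalize a request to improve cache hit rates:
    - temperature=0 makes the request eligible for the exact-match layer
    - collapsing whitespace gives identical prompts identical text,
      which also helps semantic matching"""
    normalized = [
        {"role": m["role"], "content": " ".join(m["content"].split())}
        for m in messages
    ]
    return {"model": model, "messages": normalized, "temperature": 0}
```

Pass the resulting dict to your HTTP client or unpack it into your SDK call; two users typing the same question with different spacing will now produce byte-identical requests.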

When NOT to Cache

Disable caching for:
  • Real-time information: Stock prices, weather, news
  • Personalized content: User-specific recommendations
  • Creative tasks: When variety is desired
  • Sensitive data: Confidential information
# For time-sensitive queries
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the current stock price of AAPL?"}],
    extra_body={"cache_control": {"type": "no_cache"}}
)