OpenAI Embedding Deployment Guide
Production operating guide for the OpenAI embedding provider (and OpenAI-compatible endpoints) covering authentication, usage-tier rate limiting, batching, retries, and observability.
Authentication & Secrets
| Parameter | Description |
|---|---|
| `openai_api_key` / `api_key` | OpenAI API key. Use `${secrets:...}` to resolve from a configured secret store. |
| `openai_org_id` / `org_id` | OpenAI organization ID (optional). |
| `openai_project_id` / `project_id` | OpenAI project ID (optional). |
| `openai_usage_tier` / `usage_tier` | OpenAI account usage tier. |
| `endpoint` | Endpoint override. Defaults to `https://api.openai.com/v1`. Set for OpenAI-compatible providers (Azure OpenAI, etc.). |
In production, source API keys from a secret store rather than inline values. Each credential parameter has a short alias: `api_key` ↔ `openai_api_key`, `org_id` ↔ `openai_org_id`, `project_id` ↔ `openai_project_id`, `usage_tier` ↔ `openai_usage_tier`.
OpenAI-Compatible Providers
Set endpoint to route embeddings through any OpenAI-compatible provider (Azure OpenAI, Together, vLLM, Groq, local Ollama with the OpenAI-compat endpoint). Verify the provider implements /v1/embeddings.
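One way to check that a provider serves the embeddings route is to build the request by hand and inspect it before sending. The sketch below is a minimal illustration using only the Python standard library. It assumes the provider accepts the standard OpenAI request shape (`model` and `input` in a JSON body, bearer-token authorization), and `build_embeddings_request` is a hypothetical helper, not part of the client.

```python
import json
import urllib.request
from typing import List


def build_embeddings_request(endpoint: str, api_key: str, model: str,
                             inputs: List[str]) -> urllib.request.Request:
    """Build (but do not send) a POST to <endpoint>/embeddings."""
    # The endpoint already carries the /v1 prefix, as in the default
    # https://api.openai.com/v1, so only /embeddings is appended.
    url = endpoint.rstrip("/") + "/embeddings"
    body = json.dumps({"model": model, "input": inputs}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# To send it, pass the request to urllib.request.urlopen(req); a provider
# that does not implement the route typically answers with a 404.
```

Azure OpenAI and some other providers use different authentication headers or URL layouts, so treat this as a starting point rather than a universal probe.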
Resilience Controls
Usage Tier Rate Limiting
Tier selection governs the internal rate controller:
| Tier | Max concurrency | Requests / minute |
|---|---|---|
| `free` | 1 | 100 |
| `tier1` | 35 | 3,000 |
| `tier2` | 60 | 5,000 |
| `tier3` | 60 | 5,000 |
| `tier4` | 125 | 10,000 |
| `tier5` | 125 | 10,000 |
Batching
The embeddings client automatically chunks input into batches bounded by:
- 256 inputs per batch (a client-side per-request input cap).
- ~512 KiB of string bytes per request batch (safeguard against oversized requests).
Large embedding jobs are transparently split across multiple API calls.
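The two bounds above can be sketched as a simple greedy chunker. This is an illustration of the described behavior, not the client's actual implementation: `chunk_inputs` is a hypothetical name, byte size is assumed to mean UTF-8 encoded length, and an input that alone exceeds the byte bound is assumed to be sent in a batch of its own.

```python
from typing import Iterator, List

MAX_INPUTS_PER_BATCH = 256      # input-count bound stated above
MAX_BATCH_BYTES = 512 * 1024    # ~512 KiB of string bytes per batch


def chunk_inputs(inputs: List[str],
                 max_inputs: int = MAX_INPUTS_PER_BATCH,
                 max_bytes: int = MAX_BATCH_BYTES) -> Iterator[List[str]]:
    """Yield batches that respect both the count bound and the byte bound."""
    batch: List[str] = []
    batch_bytes = 0
    for text in inputs:
        size = len(text.encode("utf-8"))
        # Close the current batch if adding this input would break a bound.
        # A single oversized input still goes out alone (assumption).
        if batch and (len(batch) >= max_inputs or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(text)
        batch_bytes += size
    if batch:
        yield batch
```

For example, 600 short strings split into batches of 256, 256, and 88 inputs, while three 300 KiB strings each land in their own batch because any two together exceed the byte bound.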
Retry Behavior
Embedding requests are retried with Fibonacci backoff, up to 10 retries. Retriable conditions:
- HTTP 429 (rate limit, throttling)
- HTTP 500, 503 (transient server errors)
- Transient `reqwest` errors (connect failures, timeouts)
Throttling (429 with rate-limit body) is detected explicitly and surfaces as a structured rate-limit error after retries are exhausted.
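The retry policy can be sketched as follows. This is a minimal illustration under stated assumptions, not the client's implementation: the base delay of 100 ms is invented for the example (the guide does not state one), `send` is a hypothetical callable returning a status code and body, and transport-level errors are left out for brevity.

```python
import time
from typing import Callable, Iterator, Tuple

RETRIABLE_STATUSES = {429, 500, 503}  # HTTP statuses listed above
MAX_RETRIES = 10                      # retry limit stated above


def fibonacci_delays(base_seconds: float, count: int) -> Iterator[float]:
    """Yield `count` delays following 1, 1, 2, 3, 5, ... times the base."""
    a, b = 1, 1
    for _ in range(count):
        yield a * base_seconds
        a, b = b, a + b


def call_with_retries(send: Callable[[], Tuple[int, str]],
                      base_seconds: float = 0.1,  # assumed base delay
                      sleep: Callable[[float], None] = time.sleep
                      ) -> Tuple[int, str]:
    """Call `send`, retrying retriable statuses up to MAX_RETRIES times."""
    status, body = send()
    for delay in fibonacci_delays(base_seconds, MAX_RETRIES):
        if status not in RETRIABLE_STATUSES:
            break
        sleep(delay)
        status, body = send()
    # After the last retry the final response is returned as-is; a real
    # client would map a remaining 429 to its structured rate-limit error.
    return status, body
```

With these assumed values the ten delays sum to 14.3 seconds, which gives a sense of how long a request can stall before the error surfaces.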
Capacity & Sizing
- Vector dimensions: Bounded by the selected embedding model (e.g., `text-embedding-3-small`: 1536, `text-embedding-3-large`: 3072). Choose based on downstream storage and retrieval cost.
- Concurrency budget: Estimate achievable throughput as tier concurrency divided by typical per-request latency (~100-300 ms), capped by the tier's requests-per-minute limit. Embedding requests are IO-bound and scale well with concurrency up to the budget.
- Token limits: Each input is bounded by the model's context window (8192 tokens for `text-embedding-3-*`). Inputs longer than the window fail with a `400`; truncate or chunk at the caller.
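As a rough sizing aid, the sketch below combines the tier table with a per-request latency estimate. It assumes steady-state throughput is bounded both by concurrency divided by latency and by the tier's requests-per-minute limit. `TIER_LIMITS` and `estimate_requests_per_second` are illustrative names, not part of the client.

```python
from typing import Dict, Tuple

# (max concurrency, requests per minute), copied from the tier table.
TIER_LIMITS: Dict[str, Tuple[int, int]] = {
    "free": (1, 100),
    "tier1": (35, 3000),
    "tier2": (60, 5000),
    "tier3": (60, 5000),
    "tier4": (125, 10000),
    "tier5": (125, 10000),
}


def estimate_requests_per_second(tier: str, latency_seconds: float) -> float:
    """Upper bound on sustained requests per second for a tier."""
    if latency_seconds <= 0:
        raise ValueError("latency_seconds must be positive")
    max_concurrency, requests_per_minute = TIER_LIMITS[tier]
    # With N requests in flight and each taking L seconds, at most N / L
    # requests complete per second.
    concurrency_bound = max_concurrency / latency_seconds
    # The rate controller also caps requests per minute.
    rate_bound = requests_per_minute / 60
    return min(concurrency_bound, rate_bound)
```

For `tier1` at 200 ms latency this gives 50 requests per second. Across the 100-300 ms range, the requests-per-minute limit is the binding constraint for every tier in the table, so multiplying the result by the batch size is a reasonable first estimate of inputs embedded per second.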
Metrics
Embedding requests use a dedicated metric namespace separate from chat/LLM metrics:
| Metric | Type | Labels | Description |
|---|---|---|---|
| `embeddings_requests` | Counter | `model`, `encoding_format`, optional `user`, optional `dimensions` | Total embedding requests issued. |
| `embeddings_failures` | Counter | same as above | Total embedding request failures. |
| `embeddings_internal_request_duration_ms` | Histogram | same as above | Request latency (client-side). |
| `embeddings_load_errors` | Counter | - | Runtime load-time errors. |
| `embeddings_active_count` | Gauge | - | Currently-loaded embedding models. |
| `embeddings_load_state` | Gauge | - | Load state (0/1). |
See Component Metrics for enabling and exporting metrics.
Task History
Embedding request operations emit text_embed spans in task history, with fields:
- `input` (truncated)
- Labels (`model`, `encoding_format`, optional `user`, optional `dimensions`)
- `outputs_produced` (number of vectors returned)
- Errors (when applicable)
Known Limitations
- No automatic truncation: Inputs longer than the model's context window fail with a 400 error; truncate or chunk at the caller.
- No token-level rate limiting: The rate controller counts requests; token-level TPM limits imposed by OpenAI may still be hit and surface as 429.
- Provider compatibility varies: OpenAI-compatible providers may not implement every parameter (dimensions, user, encoding_format).
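Because truncation is left to the caller, a small pre-processing step is usually needed for long inputs. The sketch below works on token IDs that have already been produced by a tokenizer matching the model. Which tokenizer to use is outside this guide and is an assumption of the example, and `split_tokens` and `truncate_tokens` are hypothetical helper names.

```python
from typing import List, Sequence

MAX_TOKENS = 8192  # context window stated above for text-embedding-3-*


def truncate_tokens(tokens: Sequence[int],
                    max_tokens: int = MAX_TOKENS) -> List[int]:
    """Keep only the first `max_tokens` tokens."""
    return list(tokens[:max_tokens])


def split_tokens(tokens: Sequence[int],
                 max_tokens: int = MAX_TOKENS) -> List[List[int]]:
    """Split a token sequence into consecutive windows of at most `max_tokens`."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [list(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]
```

Truncation loses the tail of the document, while splitting yields one vector per window that the caller must then store or combine. Which trade-off fits depends on the retrieval design.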
Troubleshooting
| Symptom | Likely cause | Resolution |
|---|---|---|
| `401 Unauthorized` | Wrong / revoked API key. | Rotate the key; update the secret store. |
| Sustained `429 rate_limit_exceeded` | Tier budget too low or burst exceeds concurrency. | Raise `openai_usage_tier`, reduce `max_concurrency`, or upgrade the OpenAI tier. |
| `400` with "maximum context length" | Input exceeds model context window. | Truncate or chunk inputs at the caller. |
| Embeddings much slower than expected | Single-threaded caller, no batching. | Batch inputs; the client chunks into 256-input / 512 KiB batches but the caller must parallelize embedding jobs. |
| Latency spikes every few hundred requests | Transient 429s recovering through Fibonacci backoff. | Expected at the tier ceiling; raise the tier or reduce load. |
