Version: Next

OpenAI Embedding Deployment Guide

Production operating guide for the OpenAI embedding provider (and OpenAI-compatible endpoints) covering authentication, usage-tier rate limiting, batching, retries, and observability.

Authentication & Secrets

| Parameter | Description |
| --- | --- |
| openai_api_key / api_key | OpenAI API key. Use ${secrets:...} to resolve from a configured secret store. |
| openai_org_id / org_id | OpenAI organization ID (optional). |
| openai_project_id / project_id | OpenAI project ID (optional). |
| openai_usage_tier / usage_tier | OpenAI account usage tier. |
| endpoint | Endpoint override. Defaults to https://api.openai.com/v1. Set for OpenAI-compatible providers (Azure OpenAI, etc.). |

API keys must be sourced from a secret store in production. Aliases exist for credential parameters: api_key → openai_api_key, org_id → openai_org_id, etc.

OpenAI-Compatible Providers

Set endpoint to route embeddings through any OpenAI-compatible provider (Azure OpenAI, Together, vLLM, Groq, local Ollama with the OpenAI-compat endpoint). Verify the provider implements /v1/embeddings.
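
As a rough illustration of what "OpenAI-compatible" means on the wire, the sketch below builds (but does not send) a POST to {endpoint}/embeddings using only the Python standard library. The helper name build_embeddings_request, the placeholder key, and the local URL are hypothetical. The Authorization, OpenAI-Organization, and OpenAI-Project headers follow OpenAI's public REST conventions; other providers may ignore the optional ones or expect a different authentication scheme.

```python
import json
import urllib.request

DEFAULT_ENDPOINT = "https://api.openai.com/v1"  # default endpoint stated above


def build_embeddings_request(api_key, inputs, model,
                             endpoint=DEFAULT_ENDPOINT,
                             org_id=None, project_id=None):
    """Build (without sending) a POST request to {endpoint}/embeddings."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    # Optional OpenAI-specific headers; a compatible provider may ignore them.
    if org_id:
        headers["OpenAI-Organization"] = org_id
    if project_id:
        headers["OpenAI-Project"] = project_id
    body = json.dumps({"model": model, "input": inputs}).encode("utf-8")
    return urllib.request.Request(
        endpoint.rstrip("/") + "/embeddings",
        data=body, headers=headers, method="POST",
    )


# Hypothetical local provider URL; substitute your provider's endpoint.
req = build_embeddings_request(
    "sk-placeholder", ["hello world"], "text-embedding-3-small",
    endpoint="http://localhost:8000/v1",
)
```

Sending req with urllib.request.urlopen against a candidate provider is a quick way to confirm that it actually serves the embeddings route before pointing production traffic at it.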

Resilience Controls

Usage Tier Rate Limiting

Tier selection governs the internal rate controller:

| Tier | Max concurrency | Requests / minute |
| --- | --- | --- |
| free | 1 | 100 |
| tier1 | 35 | 3,000 |
| tier2 | 60 | 5,000 |
| tier3 | 60 | 5,000 |
| tier4 | 125 | 10,000 |
| tier5 | 125 | 10,000 |
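
The internal rate controller's algorithm is not described here, so the following is only a minimal sketch of how the two limits in the table could interact: a concurrency cap plus a sliding 60-second request window. The class name RequestWindow and the sliding-window approach are assumptions for illustration; only the tier numbers come from the table.

```python
from collections import deque

# (max concurrency, requests per minute) per tier, copied from the table above.
TIER_LIMITS = {
    "free": (1, 100),
    "tier1": (35, 3000),
    "tier2": (60, 5000),
    "tier3": (60, 5000),
    "tier4": (125, 10000),
    "tier5": (125, 10000),
}


class RequestWindow:
    """Illustrative limiter: concurrency cap plus sliding per-minute window."""

    def __init__(self, tier, window_s=60.0):
        self.max_concurrency, self.rpm = TIER_LIMITS[tier]
        self.window_s = window_s
        self.sent = deque()   # timestamps of requests started in the window
        self.in_flight = 0    # requests started but not yet released

    def try_acquire(self, now):
        """Return True and record the request if both limits allow it."""
        # Drop timestamps that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window_s:
            self.sent.popleft()
        if self.in_flight >= self.max_concurrency or len(self.sent) >= self.rpm:
            return False
        self.sent.append(now)
        self.in_flight += 1
        return True

    def release(self):
        """Mark one in-flight request as finished."""
        self.in_flight -= 1
```

Passing the clock in as now keeps the sketch deterministic; a real controller would typically read a monotonic clock and block or queue rather than return False.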

Batching

The embeddings client automatically chunks input into batches bounded by:

  • 256 inputs per batch (OpenAI's per-request input cap).
  • ~512 KiB of string bytes per request batch (safeguard against oversized requests).

Large embedding jobs are transparently split across multiple API calls.
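
The two bounds above can be sketched as a greedy chunker. This is an illustration of the rule, not the client's actual code: the function name chunk_inputs is hypothetical, and the choice to measure size as UTF-8 bytes and to send a single oversized input in a batch of its own are assumptions.

```python
MAX_INPUTS_PER_BATCH = 256        # per-batch input count from the list above
MAX_BYTES_PER_BATCH = 512 * 1024  # ~512 KiB of string bytes per batch


def chunk_inputs(inputs, max_inputs=MAX_INPUTS_PER_BATCH,
                 max_bytes=MAX_BYTES_PER_BATCH):
    """Yield lists of inputs that respect both the count and byte bounds."""
    batch, batch_bytes = [], 0
    for text in inputs:
        size = len(text.encode("utf-8"))
        # Close the current batch if adding this input would break a bound.
        if batch and (len(batch) >= max_inputs
                      or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(text)
        batch_bytes += size
    if batch:
        yield batch


batch_sizes = [len(b) for b in chunk_inputs(["a"] * 600)]
```

With 600 one-byte inputs, only the count bound applies, so the job splits into three API calls.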

Retry Behavior

Embedding requests are retried with Fibonacci backoff, up to 10 retries. Retriable conditions:

  • HTTP 429 (rate limit, throttling)
  • HTTP 500, 503 (transient server errors)
  • Transient reqwest errors (connect failures, timeouts)

Throttling (429 with rate-limit body) is detected explicitly and surfaces as a structured rate-limit error after retries are exhausted.
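
A minimal sketch of Fibonacci backoff with the retriable status codes listed above. The base delay unit (1 second here) is an assumption, since the guide does not state it, and the names fibonacci_delays and call_with_retry are hypothetical. Transport-level errors and jitter are left out for brevity.

```python
import time

RETRIABLE_STATUSES = {429, 500, 503}  # from the list above


def fibonacci_delays(max_retries=10, unit_s=1.0):
    """Yield delays of 1, 1, 2, 3, 5, ... times unit_s, one per retry."""
    a, b = 1, 1
    for _ in range(max_retries):
        yield a * unit_s
        a, b = b, a + b


def call_with_retry(send, max_retries=10, unit_s=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retriable status or retries run out.

    send is any zero-argument callable returning (status_code, body).
    """
    status, body = send()
    for delay in fibonacci_delays(max_retries, unit_s):
        if status not in RETRIABLE_STATUSES:
            break
        sleep(delay)
        status, body = send()
    return status, body
```

With a 1-second unit, ten retries add up to 143 seconds of waiting in the worst case (1+1+2+3+5+8+13+21+34+55), which is worth keeping in mind when setting caller-side timeouts.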

Capacity & Sizing

  • Vector dimensions: Bounded by the selected embedding model (e.g., text-embedding-3-small: 1536, text-embedding-3-large: 3072). Choose based on downstream storage and retrieval cost.
  • Concurrency budget: Estimate achievable throughput as tier concurrency ÷ typical per-request latency (~100-300 ms), capped by the tier's requests-per-minute limit. Embedding requests are IO-bound and scale well with concurrency up to the budget.
  • Token limits: Each input is bounded by the model's context window (8192 tokens for text-embedding-3-*). Inputs longer than the window fail with a 400; truncate or chunk at the caller.
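
The sizing guidance above can be turned into a back-of-the-envelope calculation. This sketch assumes steady-state traffic and, for storage, 4-byte float32 values per dimension; neither assumption comes from this guide, and the function names are hypothetical.

```python
def estimate_requests_per_minute(max_concurrency, latency_s, rpm_limit):
    """Throughput bound: concurrency / latency, capped by the tier's RPM."""
    concurrency_bound = max_concurrency * 60.0 / latency_s
    return min(concurrency_bound, rpm_limit)


def vector_bytes(dimensions, bytes_per_value=4):
    """Raw storage per vector, assuming float32 values (an assumption)."""
    return dimensions * bytes_per_value


# tier1 (35 concurrent, 3,000 RPM) at 250 ms per request:
# 35 * 60 / 0.25 = 8,400 requests/minute, so the 3,000 RPM cap binds first.
tier1_rpm = estimate_requests_per_minute(35, 0.25, 3000)

# text-embedding-3-small (1536 dimensions) at 4 bytes per value.
small_vector_bytes = vector_bytes(1536)
```

In this example the requests-per-minute limit, not concurrency, is the binding constraint, which suggests packing more inputs into each request rather than adding workers.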

Metrics

Embedding requests use a dedicated metric namespace separate from chat/LLM metrics:

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| embeddings_requests | Counter | model, encoding_format, optional user, optional dimensions | Total embedding requests issued. |
| embeddings_failures | Counter | same as above | Total embedding request failures. |
| embeddings_internal_request_duration_ms | Histogram | same as above | Request latency (client-side). |
| embeddings_load_errors | Counter | - | Runtime load-time errors. |
| embeddings_active_count | Gauge | - | Currently-loaded embedding models. |
| embeddings_load_state | Gauge | - | Load state (0/1). |

See Component Metrics for enabling and exporting metrics.
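
As an illustration of how the label set and the two request counters might be used together, the sketch below builds the labels for one request and derives a failure ratio from counter values. The helper names are hypothetical, and how the counter values are obtained depends on the metrics exporter in use.

```python
def embedding_labels(model, encoding_format, user=None, dimensions=None):
    """Build the label set; user and dimensions appear only when provided."""
    labels = {"model": model, "encoding_format": encoding_format}
    if user is not None:
        labels["user"] = user
    if dimensions is not None:
        labels["dimensions"] = str(dimensions)
    return labels


def failure_ratio(requests_total, failures_total):
    """embeddings_failures / embeddings_requests, or 0.0 with no traffic."""
    return failures_total / requests_total if requests_total else 0.0


labels = embedding_labels("text-embedding-3-small", "float", dimensions=512)
```

Because the optional labels change the label set, series with and without user or dimensions are distinct and should be aggregated explicitly when computing an overall failure ratio.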

Task History

Embedding request operations emit text_embed spans in task history, with fields:

  • input (truncated)
  • Labels (model, encoding_format, optional user, optional dimensions)
  • outputs_produced (number of vectors returned)
  • Errors (when applicable)

Known Limitations

  • No automatic truncation: Inputs longer than the model's context window fail with a 400 error; truncate or chunk at the caller.
  • No token-level rate limiting: The rate controller counts requests; token-level TPM limits imposed by OpenAI may still be hit and surface as 429.
  • Provider compatibility varies: OpenAI-compatible providers may not implement every parameter (dimensions, user, encoding_format).
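
Since truncation and chunking are left to the caller, here is a tokenizer-agnostic sketch of both. The function names are hypothetical, and encode and decode are whatever tokenizer functions the caller supplies; the 8192 default mirrors the context window quoted earlier for text-embedding-3-*.

```python
def chunk_by_tokens(text, encode, decode, max_tokens=8192):
    """Split text into pieces of at most max_tokens tokens each."""
    tokens = encode(text)
    return [decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]


def truncate_to_tokens(text, encode, decode, max_tokens=8192):
    """Return text unchanged if it fits, else only its first max_tokens tokens."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text
    return decode(tokens[:max_tokens])


# Toy whitespace tokenizer, used here only to keep the example self-contained.
pieces = chunk_by_tokens("a b c d e", str.split, " ".join, max_tokens=2)
```

For real inputs, one option is the tiktoken package: tiktoken.get_encoding("cl100k_base") returns an object whose encode and decode methods can be passed in directly, assuming that encoding matches the model being called. Cutting on token boundaries can split a word or multi-byte character, so splitting on sentence or paragraph boundaries first may give cleaner chunks.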

Troubleshooting

| Symptom | Likely cause | Resolution |
| --- | --- | --- |
| 401 Unauthorized | Wrong / revoked API key. | Rotate the key; update the secret store. |
| Sustained 429 rate_limit_exceeded | Tier budget too low or burst exceeds concurrency. | Raise openai_usage_tier, reduce max_concurrency, or upgrade the OpenAI tier. |
| 400 with "maximum context length" | Input exceeds model context window. | Truncate or chunk inputs at the caller. |
| Embeddings much slower than expected | Single-threaded caller, no batching. | Batch inputs; the client chunks into 256-input / 512 KiB batches but the caller must parallelize embedding jobs. |
| Latency spikes every few hundred requests | Transient 429 with Fibonacci backoff recovering. | Expected at tier ceiling; raise tier or reduce load. |
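
For the "much slower than expected" case, a caller can parallelize embedding jobs with a small thread pool. This is a sketch under assumptions: embed_batch stands in for whatever function actually issues one embedding call and returns one vector per input, and the worker count of 8 is arbitrary and should stay within the tier's concurrency budget.

```python
from concurrent.futures import ThreadPoolExecutor


def embed_all(batches, embed_batch, max_workers=8):
    """Run embed_batch over batches concurrently and flatten the results.

    Executor.map returns results in input order, so output vectors line up
    with the original inputs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(embed_batch, batches)
        return [vector for batch_vectors in results for vector in batch_vectors]


# Stand-in for a real embedding call: one fake 1-dimensional vector per input.
def fake_embed_batch(batch):
    return [[float(len(text))] for text in batch]


vectors = embed_all([["a", "bb"], ["ccc"], ["dddd", "eeeee"]], fake_embed_batch)
```

Threads suit this workload because embedding requests are IO-bound; if embed_batch raises, the exception surfaces when the results are consumed inside embed_all.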