Skip to main content
Every request to IronLabs follows the same five-state lifecycle. Understanding these states is the difference between debugging “why did it pick that model?” in minutes versus hours.

State-by-state

Received

Trigger: Your SDK or HTTP client sends a POST /completions (or /model-select) call. The API gateway authenticates the bearer token, validates the payload schema, and assigns a responseMessageId (UUID). Observable: No content yet — only the request ID. If auth fails or schema is invalid, the lifecycle terminates here with a 4xx and never reaches Scored.

Scored

Trigger: The router embeds the user message and runs the classifier (either a pre-trained router for general use or your Custom Router if routerId is set). Each candidate model in models gets a score that blends three signals:
SignalSource
Capability fitclassifier output — “is this model likely to answer correctly?”
Costnormalized per-million-token price
Latencyrolling p50 time-to-first-token from the last hour
Your tradeoff setting ("latency", "cost", "quality") controls the weighting. Observable: Internal — surfaced in the Studio activity log as the decision trace.

Routed

Trigger: Top-k ranking is reduced to a single primary plus an ordered fallback list. The system reserves capacity at the chosen provider. Observable: Streaming requests start seeing the chosen provider/model echoed in the first SSE chunk:
{ "provider": "anthropic", "model": "claude-3-5-sonnet-20240620", "type": "metadata" }

Executed

Trigger: IronLabs forwards the request to the provider with normalized parameters. Streaming responses are pass-through SSE; non-streaming responses are buffered. Observable: Either the full assistant message (non-streaming) or a stream of text-delta events (streaming). The usage block — {prompt_tokens, completion_tokens, total_tokens} — is attached at the close of execution.

Fallback (optional)

Trigger: The provider returned a 5xx, timed out, exceeded the request’s maxTokens, hit a content-policy block, or violated your maxRetries budget. IronLabs walks to the next entry in the fallback list and re-enters Executed. Observable: The final response includes an errorTrace array listing each attempt that failed. Inspect this when you see a slower-than-expected response — most often, a primary timed out and a fallback served.
"errorTrace": [
  { "provider": "openai",    "model": "gpt-4o", "error": "timeout (5s)" },
  { "provider": "anthropic", "model": "claude-3-5-sonnet-20240620", "error": null }
]
If the entire chain (primaries + fallbacks) is exhausted, the lifecycle exits Fallback → Logged with HTTP 502 and an errorTrace describing every attempt.

Logged

Trigger: Whether the request succeeded or failed, IronLabs writes a single row to ModelSelectUsage with: userId, customRouterId (if any), provider, model, cost, latency_ms, errorTrace, responseMessageId. Observable: Visible in the Studio → Activity tab within ~10 seconds. Cost rolls up into your Balance via DebitTransaction.

Edge cases

  • previous_session for context continuity. When set, the router prefers the same model that handled the prior turn — Scored is short-circuited to maintain conversation coherence. The trace records routingReason: "session_continuity".
  • stream: false waits for full execution before returning. Network round-trip is faster but you lose progressive output.
  • No fallback configured. Fallback → Executed is skipped — a single provider failure surfaces as 502 immediately. Set fallback_models for production traffic.
  • Custom Router cold start. First request after 5 minutes idle loads the XGBoost classifier from Modal Volume; expect +200–500ms on the cold call. Subsequent calls are warm.

Fallback Configuration

How to set primaries vs fallbacks.

Custom Router

Replace the pre-trained scorer with your own.

Completions API

Every parameter that affects routing.

Sandbox Isolation

How tenant data stays separated across requests.