IronlabsAI supports client-side load balancing, which lets you distribute requests across multiple language models using either weights or a failover chain. The client routes each request to an available model, retries failed calls, and falls back to another model when one fails, which improves reliability and can improve performance.

When to use client-side load balancing

Use client-side load balancing when you want to:
  • Distribute requests across multiple models to optimize for reliability, latency, or cost.
  • Implement failover mechanisms to switch to alternative models if one fails.
  • Handle retries with exponential backoff for robust request processing.
  • Customize model-specific behavior with additional messages or timeouts.

Configuring weighted load balancing

Weighted load balancing lets you assign a weight to each model; the weight determines the probability that the model is selected for a given request. You can also set per-model retry limits, timeouts, backoff values, and extra messages.
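To make the idea of weight-based selection concrete, here is a minimal sketch of how a client might pick a model in proportion to its weight. It is not the IronlabsAI implementation: `pick_model` and `sample_counts` are hypothetical helpers, and the only library call used is the standard library's `random.choices`.

```python
import random
from collections import Counter

# Weights mirror the configuration example in this section. The selection
# logic below is an illustrative assumption, not the library's actual code.
WEIGHTS = {
    "openai/gpt-4o": 0.7,
    "anthropic/claude-3-5-haiku-20241022": 0.3,
}


def pick_model(weights, rng=random):
    """Return one model name, chosen with probability proportional to its weight."""
    models = list(weights)
    return rng.choices(models, weights=[weights[m] for m in models], k=1)[0]


def sample_counts(weights, n, seed=0):
    """Pick n models with a seeded generator and count how often each was chosen."""
    rng = random.Random(seed)
    return Counter(pick_model(weights, rng) for _ in range(n))
```

Calling `sample_counts(WEIGHTS, 10_000)` should show roughly 70% of picks going to `openai/gpt-4o`; the split is probabilistic, so individual runs of a few requests can deviate noticeably from the configured weights.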

Example: Weighted load balancing

from ironlabsai import IronlabsAI

client = IronlabsAI(
    reliability={
        "weights": {
            "openai/gpt-4o": 0.7,
            "anthropic/claude-3-5-haiku-20241022": 0.3
        },
        "max_retries": {
            "openai/gpt-4o": 2,
            "anthropic/claude-3-5-haiku-20241022": 1
        },
        "timeout": {
            "openai/gpt-4o": 5.0,
            "anthropic/claude-3-5-haiku-20241022": 5.0
        },
        "backoff": {
            "openai/gpt-4o": 2.0,
            "anthropic/claude-3-5-haiku-20241022": 1.0
        },
        "model_messages": {
            "openai/gpt-4o": [{"role": "user", "content": "Please provide a concise response."}],
            "anthropic/claude-3-5-haiku-20241022": [{"role": "user", "content": "Keep it short."}]
        }
    }
)

response = client.completions.create(
    messages=[{"role": "user", "content": "Write a haiku about the ocean."}]
)
print(response.choices[0].message.content)
print("Model used:", response.model)
In this example:
  • Each request has a 70% chance of going to openai/gpt-4o and a 30% chance of going to anthropic/claude-3-5-haiku-20241022.
  • Each model has specific retry limits, timeouts, and backoff settings for exponential retry delays.
  • A model-specific message is appended to the request to ask the selected model for a concise response.
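The exact formula behind the backoff setting is not stated here. As a rough sketch, assuming the configured value is a base delay in seconds that doubles on every retry, the wait schedule could be computed as follows. `backoff_delays` and its `factor` parameter are hypothetical names used only for illustration.

```python
def backoff_delays(base, max_retries, factor=2.0):
    """Return the wait before each retry, in seconds.

    Assumption: the delay starts at `base` and is multiplied by `factor`
    after every failed attempt. The real client may use a different formula
    or add jitter.
    """
    return [base * factor ** attempt for attempt in range(max_retries)]


# Settings taken from the weighted example: backoff 2.0 with 2 retries,
# and backoff 1.0 with 1 retry.
gpt_delays = backoff_delays(2.0, 2)    # [2.0, 4.0]
haiku_delays = backoff_delays(1.0, 1)  # [1.0]
```

Under this assumption, the worst-case time spent on one model is the sum of its delays plus one timeout per attempt, which is worth checking against your own latency budget.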

Configuring ordered failover

Ordered failover allows you to specify a chain of models to try in sequence if a model fails. Each model in the chain can have its own retries, timeouts, and additional messages.

Example: Ordered failover

from ironlabsai import IronlabsAI

client = IronlabsAI(
    reliability={
        "failover_chain": [
            {
                "model": "openai/gpt-4o",
                "max_retries": 2,
                "timeout": 5.0,
                "backoff": 1.0,
                "messages": [{"role": "user", "content": "Respond concisely."}]
            },
            {
                "model": "anthropic/claude-3-5-haiku-20241022",
                "max_retries": 1,
                "timeout": 5.0,
                "backoff": 2.0,
                "messages": [{"role": "user", "content": "Keep it brief."}]
            }
        ]
    }
)

response = client.completions.create(
    messages=[{"role": "user", "content": "Describe a sunset."}]
)
print(response.choices[0].message.content)
print("Model used:", response.model)
In this example:
  • The client first tries openai/gpt-4o, retrying up to 2 times if the call fails.
  • If every attempt fails, the client falls back to anthropic/claude-3-5-haiku-20241022, which gets up to 1 retry.
  • Each model has its own timeout, backoff, and additional messages for tailored behavior.
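The control flow described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: `run_failover` and the `call_model` callable are hypothetical, `max_retries` is read as the number of retries after the first attempt, and the backoff is assumed to double after each failure.

```python
import time


def run_failover(chain, call_model, sleep=time.sleep):
    """Try each chain entry in order and return (model, result) on first success.

    `call_model(model, timeout)` is a hypothetical callable that raises an
    exception on failure. Each entry gets 1 + max_retries attempts.
    """
    errors = []
    for entry in chain:
        attempts = 1 + entry.get("max_retries", 0)
        for attempt in range(attempts):
            try:
                result = call_model(entry["model"], entry.get("timeout"))
                return entry["model"], result
            except Exception as exc:
                errors.append((entry["model"], exc))
                # Wait before retrying the same model; move on to the next
                # model immediately once its attempts are exhausted.
                if attempt < attempts - 1:
                    sleep(entry.get("backoff", 0.0) * 2 ** attempt)
    raise RuntimeError(f"all models in the failover chain failed: {errors}")
```

Passing `sleep` as a parameter keeps the sketch testable without real delays. A production client would likely also distinguish retryable errors (timeouts, rate limits) from non-retryable ones, which this sketch does not attempt.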