Client-Side Load Balancing

IronlabsAI supports client-side load balancing, which distributes requests across multiple language models either by weight or through an ordered failover chain. This improves reliability and performance: the client routes each request to an available model, retries failed calls, and falls back to another model when one fails.

When to use client-side load balancing

Use client-side load balancing when you want to:

Distribute requests across multiple models to optimize for reliability, latency, or cost.
Implement failover mechanisms to switch to alternative models if one fails.
Handle retries with exponential backoff for robust request processing.
Customize model-specific behavior with additional messages or timeouts.

Configuring weighted load balancing

Weighted load balancing lets you assign a weight to each model, which determines the probability that the model is selected for a given request. You can also set per-model retry limits, timeouts, backoff values, and additional messages.
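Conceptually, weighted selection is a weighted random draw over the configured models. The sketch below only illustrates that idea and is not IronlabsAI's internal implementation. The helper `pick_model` is hypothetical, and the sketch assumes weights are treated as relative probabilities, so they do not need to sum to 1.

```python
import random
from collections import Counter

# Hypothetical helper for illustration only; not part of the IronlabsAI API.
def pick_model(weights: dict[str, float], rng: random.Random) -> str:
    """Return one model name, chosen with probability proportional to its weight."""
    models = list(weights)
    return rng.choices(models, weights=[weights[m] for m in models], k=1)[0]

weights = {
    "openai/gpt-4o": 0.7,
    "anthropic/claude-3-5-haiku-20241022": 0.3,
}

rng = random.Random(42)  # fixed seed so the simulation is repeatable
counts = Counter(pick_model(weights, rng) for _ in range(10_000))
for model, n in counts.most_common():
    print(f"{model}: {n / 10_000:.1%}")
```

Over many requests the observed shares approach the configured 70/30 split, although any single request may go to either model.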

Example: Weighted load balancing

from ironlabsai import IronlabsAI

client = IronlabsAI(
    reliability={
        "weights": {
            "openai/gpt-4o": 0.7,
            "anthropic/claude-3-5-haiku-20241022": 0.3
        },
        "max_retries": {
            "openai/gpt-4o": 2,
            "anthropic/claude-3-5-haiku-20241022": 1
        },
        "timeout": {
            "openai/gpt-4o": 5.0,
            "anthropic/claude-3-5-haiku-20241022": 5.0
        },
        "backoff": {
            "openai/gpt-4o": 2.0,
            "anthropic/claude-3-5-haiku-20241022": 1.0
        },
        "model_messages": {
            "openai/gpt-4o": [{"role": "user", "content": "Please provide a concise response."}],
            "anthropic/claude-3-5-haiku-20241022": [{"role": "user", "content": "Keep it short."}]
        }
    }
)

response = client.completions.create(
    messages=[{"role": "user", "content": "Write a haiku about the ocean."}]
)
print(response.choices[0].message.content)
print("Model used:", response.model)

In this example:

Requests are distributed with a 70% chance to openai/gpt-4o and 30% to anthropic/claude-3-5-haiku-20241022.
Each model has specific retry limits, timeouts, and backoff settings for exponential retry delays.
Model-specific messages are appended to the request to ask each model for a concise response.
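The retry settings can be read as an exponential backoff schedule. This page does not state exactly how `backoff` is applied, so the sketch below assumes it is a base delay in seconds that doubles after each failed attempt. `backoff_delays`, `call_with_retries`, and `flaky` are hypothetical names used for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def backoff_delays(backoff: float, max_retries: int) -> list[float]:
    """Delays (seconds) before each retry, assuming delay = backoff * 2**attempt."""
    return [backoff * 2**attempt for attempt in range(max_retries)]

def call_with_retries(fn: Callable[[], T], max_retries: int, backoff: float,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn once, then retry up to max_retries times, sleeping between attempts."""
    delays = backoff_delays(backoff, max_retries)
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the last error
            sleep(delays[attempt])
    raise AssertionError("unreachable")

print(backoff_delays(2.0, 2))  # openai/gpt-4o settings from the example
print(backoff_delays(1.0, 1))  # anthropic/claude-3-5-haiku-20241022 settings

# Simulated call that fails twice, then succeeds on the last allowed retry.
attempts = {"n": 0}
def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated failure")
    return "ok"

print(call_with_retries(flaky, max_retries=2, backoff=2.0, sleep=lambda s: None))
```

Under this assumption, `max_retries: 2` with `backoff: 2.0` means one initial attempt plus two retries, waiting 2 and then 4 seconds. The `sleep` parameter is injectable only so the demo runs without actually waiting.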

Configuring ordered failover

Ordered failover allows you to specify a chain of models to try in sequence if a model fails. Each model in the chain can have its own retries, timeouts, and additional messages.

Example: Ordered failover

from ironlabsai import IronlabsAI

client = IronlabsAI(
    reliability={
        "failover_chain": [
            {
                "model": "openai/gpt-4o",
                "max_retries": 2,
                "timeout": 5.0,
                "backoff": 1.0,
                "messages": [{"role": "user", "content": "Respond concisely."}]
            },
            {
                "model": "anthropic/claude-3-5-haiku-20241022",
                "max_retries": 1,
                "timeout": 5.0,
                "backoff": 2.0,
                "messages": [{"role": "user", "content": "Keep it brief."}]
            }
        ]
    }
)

response = client.completions.create(
    messages=[{"role": "user", "content": "Describe a sunset."}]
)
print(response.choices[0].message.content)
print("Model used:", response.model)

In this example:

The client first tries openai/gpt-4o with up to 2 retries.
If all of those attempts fail, it falls back to anthropic/claude-3-5-haiku-20241022 with up to 1 retry.
Each model has its own timeout, backoff, and additional messages for tailored behavior.
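The failover behavior can be pictured as a loop over the chain: try each entry with its own retry budget and move to the next entry only when that budget is exhausted. This is a simplified sketch, not IronlabsAI's implementation. `run_failover`, `fake_send`, and the `send(model, messages, timeout)` callback are hypothetical; backoff sleeps are omitted for brevity; and appending each entry's `messages` after the caller's messages is an assumption.

```python
from typing import Any, Callable

Message = dict[str, str]

def run_failover(chain: list[dict[str, Any]], messages: list[Message],
                 send: Callable[[str, list[Message], float], str]) -> tuple[str, str]:
    """Try each chain entry in order; return (model, reply) from the first success."""
    errors: list[str] = []
    for entry in chain:
        # Assumption: per-model messages are appended after the caller's messages.
        full = messages + entry.get("messages", [])
        # One initial attempt plus max_retries retries for this entry.
        for attempt in range(entry["max_retries"] + 1):
            try:
                return entry["model"], send(entry["model"], full, entry["timeout"])
            except Exception as exc:
                errors.append(f"{entry['model']} attempt {attempt + 1}: {exc}")
    raise RuntimeError("all models in the failover chain failed:\n" + "\n".join(errors))

def fake_send(model: str, msgs: list[Message], timeout: float) -> str:
    # Hypothetical transport that simulates an outage of the first model.
    if model == "openai/gpt-4o":
        raise TimeoutError(f"no response within {timeout}s")
    return f"{model} saw {len(msgs)} messages"

chain = [
    {"model": "openai/gpt-4o", "max_retries": 2, "timeout": 5.0,
     "messages": [{"role": "user", "content": "Respond concisely."}]},
    {"model": "anthropic/claude-3-5-haiku-20241022", "max_retries": 1, "timeout": 5.0,
     "messages": [{"role": "user", "content": "Keep it brief."}]},
]

model, reply = run_failover(chain, [{"role": "user", "content": "Describe a sunset."}], fake_send)
print("Model used:", model)
print(reply)
```

Here openai/gpt-4o is tried three times (one attempt plus two retries) before the chain moves on, and the fallback model receives the user message plus its own additional message.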
