AgentOpt - IronLabs Docs

IronLabs AgentOpt uses Claude as a proposer to iteratively rewrite your agent’s system prompt, benchmark each variant in an isolated sandbox, and return the best-scoring version — all without manual prompt engineering.

Python SDK

pip install ironlabs

Node.js SDK

npm install ironlabs

When to use AgentOpt

Automated prompt engineering — replace manual trial-and-error with a data-driven optimization loop
Agent quality improvement — boost task accuracy without changing your agent’s code structure
Benchmark-driven development — optimize against your own evaluation function and dataset
Model-specific tuning — find the best system prompt for a specific target model

Prerequisites

Before you start, make sure you have:

An IronLabs API key from the Settings page
A ZIP bundle containing agent.py, eval.py, and dataset.json hosted at a publicly accessible URL
A dataset with at least 10 rows

Installation

Install the SDK for your language:

pip install ironlabs

Initialize the client

Set your API key as an environment variable:

export IRONLABS_API_KEY="your_api_key_here"

Then initialize the optimizer in your code:

from ironlabs import AgentOptimizer

optimizer = AgentOptimizer()

The client automatically picks up IRONLABS_API_KEY from your environment — no need to pass it explicitly.

Running an Optimization

Prepare your ZIP bundle

AgentOpt requires three files packed into a single ZIP:

File	Purpose
`agent.py`	Your agent — defines `run_batch(inputs, api_key)` and uses EDITABLE/FIXED markers
`eval.py`	Scoring function — defines `score(expected, predicted) -> float` in [0, 1]
`dataset.json`	Array of `{"input": str, "answer": str}` objects (minimum 10 rows)

agent.py

The EDITABLE section is what AgentOpt rewrites each iteration. The FIXED section defines the interface contract and is never modified.

# ===== EDITABLE SECTION START =====
SYSTEM_PROMPT = "You are a helpful assistant. Answer concisely."
# ===== EDITABLE SECTION END =====

# ===== FIXED BOUNDARY - DO NOT MODIFY BELOW =====
import openai

DEPENDENCIES = ["openai"]

def run_batch(inputs: list[str], api_key: str) -> tuple[list[str], dict]:
    """Run agent on a batch of inputs. Returns (predictions, usage_dict)."""
    client = openai.OpenAI(
        api_key=api_key,
        base_url="https://openrouter.ai/api/v1",
    )
    predictions = []
    total_tokens = 0
    for inp in inputs:
        resp = client.chat.completions.create(
            model="target_model",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": inp},
            ],
        )
        predictions.append(resp.choices[0].message.content)
        total_tokens += resp.usage.total_tokens
    usage = {"tokens": total_tokens, "cost": total_tokens * 0.00000015}
    return predictions, usage
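
To make the role of the markers concrete, here is a minimal sketch that splits an agent.py source string at the EDITABLE marker lines shown above. The helper name `split_agent_source` is hypothetical, and this is only an illustration of the layout, not necessarily how AgentOpt parses the file internally.

```python
EDITABLE_START = "# ===== EDITABLE SECTION START ====="
EDITABLE_END = "# ===== EDITABLE SECTION END ====="


def split_agent_source(source: str) -> tuple[str, str]:
    """Return (editable_body, rest_of_file) for an agent.py source string.

    Hypothetical helper for local inspection only.
    """
    start = source.find(EDITABLE_START)
    end = source.find(EDITABLE_END)
    if start == -1 or end == -1 or end < start:
        raise ValueError("EDITABLE markers missing or out of order")
    body = source[start + len(EDITABLE_START):end].strip()
    rest = source[end + len(EDITABLE_END):].lstrip("\n")
    return body, rest


example = "\n".join([
    EDITABLE_START,
    'SYSTEM_PROMPT = "You are a helpful assistant. Answer concisely."',
    EDITABLE_END,
    "",
    "# ===== FIXED BOUNDARY - DO NOT MODIFY BELOW =====",
    "import openai",
])
editable, fixed = split_agent_source(example)
print(editable)
```

Running a check like this before packaging is a cheap way to confirm that the markers are present and in the right order.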

eval.py

Must define a score function that returns a float between 0.0 and 1.0:

def score(expected: str, predicted: str) -> float:
    """Return a score in [0, 1]. 1.0 = perfect match."""
    return 1.0 if expected.strip().lower() == predicted.strip().lower() else 0.0
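
Exact match gives no credit for near misses. As an alternative, the sketch below scores by token-level F1, which still satisfies the `score(expected, predicted) -> float` contract and stays within [0, 1]. Whether partial credit suits your task is a judgment call, and nothing here is specific to AgentOpt beyond the function signature.

```python
from collections import Counter


def score(expected: str, predicted: str) -> float:
    """Token-level F1 in [0, 1]; 1.0 means identical token multisets."""
    exp_tokens = expected.strip().lower().split()
    pred_tokens = predicted.strip().lower().split()
    if not exp_tokens or not pred_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return 1.0 if exp_tokens == pred_tokens else 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(exp_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)


print(score("Paris", "paris"))
print(score("Paris", "London"))
```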

dataset.json

A JSON array of input/answer pairs. A real dataset needs at least 10 rows; the example below is shortened to three for brevity:

[
  {"input": "What is 2 + 2?", "answer": "4"},
  {"input": "Capital of France?", "answer": "Paris"},
  {"input": "What color is the sky?", "answer": "blue"}
]
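
Before uploading, it can help to check the dataset against the stated requirements (a JSON array, string `input` and `answer` fields, at least 10 rows). The following is a minimal local check; `validate_dataset` is a hypothetical helper and not part of the ironlabs SDK.

```python
import json


def validate_dataset(rows: object, min_rows: int = 10) -> list[str]:
    """Return a list of problems; an empty list means the dataset looks valid."""
    if not isinstance(rows, list):
        return ["top-level JSON value must be an array"]
    problems = []
    if len(rows) < min_rows:
        problems.append(f"need at least {min_rows} rows, found {len(rows)}")
    for i, row in enumerate(rows):
        if not isinstance(row, dict):
            problems.append(f"row {i}: not an object")
            continue
        for key in ("input", "answer"):
            if not isinstance(row.get(key), str):
                problems.append(f"row {i}: '{key}' must be a string")
    return problems


sample = json.loads('[{"input": "What is 2 + 2?", "answer": "4"}]')
print(validate_dataset(sample))
```

For a file on disk, load it with `json.load(open("dataset.json"))` and pass the result to the same function.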

Pack the three files into a ZIP and host it at a publicly accessible URL:

zip agent_bundle.zip agent.py eval.py dataset.json
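
If the `zip` command is not available (for example on a stock Windows machine), the same bundle can be built with Python's standard `zipfile` module. This is a sketch under the assumption that the three files sit together in one directory; `build_bundle` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path

REQUIRED_FILES = ("agent.py", "eval.py", "dataset.json")


def build_bundle(src_dir: Path, out_path: Path) -> Path:
    """Zip the three required files from src_dir into out_path.

    Files are stored at the archive root, matching the zip command above.
    """
    missing = [name for name in REQUIRED_FILES if not (src_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing bundle files: {', '.join(missing)}")
    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_DEFLATED) as bundle:
        for name in REQUIRED_FILES:
            bundle.write(src_dir / name, arcname=name)
    return out_path
```

Calling `build_bundle(Path("."), Path("agent_bundle.zip"))` from the directory that holds the files produces the same archive layout as the shell command.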

Submit the optimization job

Pass the ZIP URL, target model, and number of iterations to start the job.

result = optimizer.fit(
    input_url="https://example.com/agent_bundle.zip",
    target_model="target_model",
    n_iterations=15,
)

job_id = result["job_id"]
print(f"Job submitted. Job ID: {job_id}")

Parameters:

Parameter	Required	Default	Description
`input_url`	Yes	—	Public URL to your ZIP bundle
`target_model`	Yes	—	OpenRouter model string to optimize for (e.g. `target_model`)
`n_iterations`	No	15	Number of optimization iterations (1–50)
`overall_timeout_seconds`	No	3600	Total job timeout in seconds (300–7200)
`llm_call_timeout_seconds`	No	300	Timeout per LLM call (30–600)
`sandbox_timeout_seconds`	No	600	Timeout per sandbox benchmark run (60–1800)
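
Out-of-range values are rejected by the API with a 422 error, so a client-side check can save a round trip. The sketch below copies the ranges from the table above and assumes the bounds are inclusive; `check_fit_params` is a hypothetical helper, not an SDK function.

```python
# Ranges copied from the parameter table; bounds assumed inclusive.
PARAM_RANGES = {
    "n_iterations": (1, 50),
    "overall_timeout_seconds": (300, 7200),
    "llm_call_timeout_seconds": (30, 600),
    "sandbox_timeout_seconds": (60, 1800),
}


def check_fit_params(**params: int) -> list[str]:
    """Return a list of range violations; an empty list means all values pass."""
    errors = []
    for name, value in params.items():
        if name not in PARAM_RANGES:
            errors.append(f"{name}: unknown parameter")
            continue
        low, high = PARAM_RANGES[name]
        if not low <= value <= high:
            errors.append(f"{name} must be {low}-{high}, got {value}")
    return errors


print(check_fit_params(n_iterations=15, overall_timeout_seconds=3600))
print(check_fit_params(n_iterations=0))
```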

Response:

{
  "job_id": "uuid",
  "status": "queued",
  "version": "x.x.x"
}

Monitor progress

Poll get_status() every 30 seconds. The response includes live per-iteration progress once the job starts running.

import time

while True:
    status_data = optimizer.get_status()
    status = status_data.get("status", "unknown")
    current = status_data.get("current_iteration")
    total = status_data.get("n_iterations")
    best = status_data.get("best_score")
    baseline = status_data.get("baseline_score")

    progress = f"iter {current}/{total}" if current is not None else "pending"
    scores = ""
    if baseline is not None:
        scores += f"  baseline={baseline:.4f}"
    if best is not None:
        scores += f"  best={best:.4f}"

    print(f"[{status}] {progress}{scores}")

    if status in ("completed", "failed", "interrupted"):
        break

    time.sleep(30)

Status values:

Status	Description
`queued`	Job is waiting to start
`running`	Optimization is active
`completed`	Optimization finished successfully
`interrupted`	Job timed out or was cancelled
`failed`	Internal error — check `error_message` field

AgentOpt-specific status fields:

Field	Description
`current_iteration`	Latest iteration completed (0 = baseline)
`best_score`	Best score seen so far (0.0–1.0)
`baseline_score`	Score before any optimization
`n_iterations`	Total iterations requested
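
The polling loop above runs until the job reaches a terminal status. If you also want a client-side upper bound on waiting, one option is a small wrapper like the sketch below. `poll_until_done` is a hypothetical helper that accepts any zero-argument callable returning a status dict; the terminal status names are taken from the table above.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed", "interrupted"}


def poll_until_done(
    get_status: Callable[[], dict],
    interval_seconds: float = 30.0,
    max_wait_seconds: float = 7200.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call get_status() until a terminal status or until max_wait_seconds elapses."""
    deadline = time.monotonic() + max_wait_seconds
    while True:
        status_data = get_status()
        if status_data.get("status") in TERMINAL_STATUSES:
            return status_data
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not reach a terminal status in time")
        sleep(interval_seconds)
```

With the SDK client from earlier, this would be called as `poll_until_done(optimizer.get_status)`. The `sleep` parameter exists only so the wait can be replaced in tests.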

Get results

Retrieve the optimized prompt and performance metrics once the job completes.

results_data = optimizer.get_results()

for entry in results_data.get("results", []):
    print(f"Model            : {entry.get('model')}")
    print(f"Train score      : {entry.get('train_score')}")
    print(f"Test score       : {entry.get('test_score')}")
    print(f"Iterations run   : {entry.get('iterations_run')}")
    print(f"Iterations kept  : {entry.get('iterations_kept')}")
    print(f"Agent code URL   : {entry.get('agent_code_url')}")
    print(f"\nOriginal prompt:\n{entry.get('original_prompt')}")
    print(f"\nOptimized prompt:\n{entry.get('optimized_prompt')}")

Response:

{
  "job_id": "uuid",
  "status": "completed",
  "results": [
    {
      "model": "target_model",
      "optimizer": "AGENTOPT",
      "original_prompt": "You are a helpful assistant. Answer concisely.",
      "optimized_prompt": "You are a precise assistant. Answer with a single word or number when possible. Do not add explanations unless the question requires them.",
      "train_score": 0.91,
      "test_score": 0.87,
      "iterations_run": 15,
      "iterations_kept": 6,
      "agent_code_url": "https://storage.example.com/best_agent.py"
    }
  ]
}

Result fields:

Field	Description
`optimized_prompt`	Best system prompt found across all iterations
`original_prompt`	System prompt from your original `agent.py`
`train_score`	Score on the 70% training split
`test_score`	Score on the 30% held-out test split
`iterations_run`	Total iterations executed
`iterations_kept`	Iterations where score improved
`agent_code_url`	Public URL to download the best `agent.py`
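
Two derived numbers are often useful when reading a result entry: the gap between train and test scores (a large gap can suggest the prompt overfits the training split) and the share of iterations that were kept. The sketch below computes both from the fields listed above; `summarize_entry` is a hypothetical helper and the interpretation is a rule of thumb, not part of the API.

```python
def summarize_entry(entry: dict) -> dict:
    """Derive a train/test gap and an acceptance rate from one results entry."""
    train = entry["train_score"]
    test = entry["test_score"]
    run = entry["iterations_run"]
    kept = entry["iterations_kept"]
    return {
        "generalization_gap": round(train - test, 4),
        "acceptance_rate": round(kept / run, 4) if run else 0.0,
    }


# Values taken from the example response above.
entry = {"train_score": 0.91, "test_score": 0.87, "iterations_run": 15, "iterations_kept": 6}
print(summarize_entry(entry))
```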

Complete example


import time
from ironlabs import AgentOptimizer

def main():
    optimizer = AgentOptimizer()

    # 1. Submit job
    print("Submitting AgentOpt job...")
    result = optimizer.fit(
        input_url="https://example.com/agent_bundle.zip",
        target_model="target_model",
        n_iterations=15,
    )
    job_id = result["job_id"]
    print(f"Job submitted. Job ID: {job_id}\n")

    # 2. Poll until complete
    print("Polling status every 30s...")
    while True:
        status_data = optimizer.get_status()
        status = status_data.get("status", "unknown")
        current = status_data.get("current_iteration")
        total = status_data.get("n_iterations")
        best = status_data.get("best_score")
        baseline = status_data.get("baseline_score")

        progress = f"iter {current}/{total}" if current is not None else "pending"
        scores = ""
        if baseline is not None:
            scores += f"  baseline={baseline:.4f}"
        if best is not None:
            scores += f"  best={best:.4f}"

        print(f"  [{status}] {progress}{scores}")

        if status in ("completed", "failed", "interrupted"):
            break

        time.sleep(30)

    if status != "completed":
        print(f"\nJob ended with status: {status}")
        error = status_data.get("error_message")
        if error:
            print(f"Error: {error}")
        return

    # 3. Get results
    print("\nFetching results...")
    results_data = optimizer.get_results()
    for entry in results_data.get("results", []):
        print("=" * 60)
        print(f"Model            : {entry.get('model')}")
        print(f"Train score      : {entry.get('train_score')}")
        print(f"Test score       : {entry.get('test_score')}")
        print(f"Iterations run   : {entry.get('iterations_run')}")
        print(f"Iterations kept  : {entry.get('iterations_kept')}")
        print(f"Agent code URL   : {entry.get('agent_code_url')}")
        print("\nOriginal prompt:")
        print(entry.get("original_prompt", ""))
        print("\nOptimized prompt:")
        print(entry.get("optimized_prompt", ""))

if __name__ == "__main__":
    main()

How AgentOpt works

AgentOpt runs a closed optimization loop:

1. Baseline: runs your original agent.py on the dataset to establish a starting score
2. Propose: Claude reads the current system prompt and proposes an improved version
3. Benchmark: the proposed variant runs in an isolated sandbox against the dataset
4. Accept or reject: improvements are kept; regressions are discarded
5. Repeat: steps 2–4 repeat for n_iterations iterations

The best-scoring agent.py (with the winning system prompt embedded) is returned at the end.
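
The loop described above can be sketched as a greedy accept/reject search. This is an illustration only: the `propose` and `benchmark` callables are toy stand-ins for the Claude proposer and the sandboxed benchmark, and the sketch assumes each proposal starts from the best prompt so far, which the description does not spell out.

```python
from typing import Callable


def optimize_prompt(
    initial_prompt: str,
    propose: Callable[[str], str],
    benchmark: Callable[[str], float],
    n_iterations: int,
) -> tuple[str, float, int]:
    """Greedy loop: keep a proposal only if it beats the best score so far."""
    best_prompt = initial_prompt
    best_score = benchmark(initial_prompt)  # baseline run
    kept = 0
    for _ in range(n_iterations):
        candidate = propose(best_prompt)  # propose a variant
        candidate_score = benchmark(candidate)  # benchmark the variant
        if candidate_score > best_score:  # accept improvements only
            best_prompt, best_score = candidate, candidate_score
            kept += 1
    return best_prompt, best_score, kept


# Toy stand-ins: the "proposer" appends a character and the "benchmark"
# rewards length up to 5 characters.
result = optimize_prompt(
    "abc",
    propose=lambda prompt: prompt + "!",
    benchmark=lambda prompt: min(len(prompt), 5) / 5,
    n_iterations=3,
)
print(result)  # → ('abc!!', 1.0, 2)
```

In the toy run, two proposals improve the score and are kept, and the third ties the best score and is discarded, which mirrors the difference between `iterations_run` and `iterations_kept` in the results.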

Error handling

Error	Cause	Fix
`422 agentopt requires input_url`	Missing `input_url` in request	Upload ZIP and set the URL
`422 agentopt requires target_models`	Empty target models list	Set `target_model` to a model string
`422 n_iterations must be 1–50`	Iterations out of range	Use a value between 1 and 50
`422 overall_timeout_seconds must be 300–7200`	Timeout out of range	Use a value between 300 and 7200
`status: interrupted`	Job timed out	Check `error_message`; retry with higher `overall_timeout_seconds`
`status: failed`	Internal error	Check `error_message`; verify `agent.py` runs locally without error
