IronLabs AgentOpt uses Claude as a proposer to iteratively rewrite your agent’s system prompt, benchmark each variant in an isolated sandbox, and return the best-scoring version — all without manual prompt engineering.

Python SDK

pip install ironlabs

Node.js SDK

npm install ironlabs

When to use AgentOpt

  • Automated prompt engineering — replace manual trial-and-error with a data-driven optimization loop
  • Agent quality improvement — boost task accuracy without changing your agent’s code structure
  • Benchmark-driven development — optimize against your own evaluation function and dataset
  • Model-specific tuning — find the best system prompt for a specific target model

Prerequisites

Before you start, make sure you have:
  • An IronLabs API key from the Settings page
  • A ZIP bundle containing agent.py, eval.py, and dataset.json, hosted at a publicly accessible URL
  • A dataset with at least 10 rows

Installation

Install the SDK for your language:
pip install ironlabs

Initialize the client

Set your API key as an environment variable:
export IRONLABS_API_KEY="your_api_key_here"
Then initialize the optimizer in your code:
from ironlabs import AgentOptimizer

optimizer = AgentOptimizer()
The client automatically picks up IRONLABS_API_KEY from your environment — no need to pass it explicitly.
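Because the client reads the key from the environment, a missing variable is easiest to catch before the optimizer is created. The sketch below is one way to fail early; `require_api_key` is a hypothetical helper, not part of the ironlabs SDK.

```python
import os


def require_api_key(env_var: str = "IRONLABS_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; the SDK itself reads the
    variable automatically.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it before creating the optimizer."
        )
    return key
```

Calling `require_api_key()` at the top of a script surfaces a configuration problem before any job is submitted.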

Running an Optimization

1. Prepare your ZIP bundle

AgentOpt requires three files packed into a single ZIP:
File | Purpose
agent.py | Your agent: defines run_batch(inputs, api_key) and uses EDITABLE/FIXED markers
eval.py | Scoring function: defines score(expected, predicted) -> float in [0, 1]
dataset.json | Array of {"input": str, "answer": str} objects (minimum 10 rows)

agent.py

The EDITABLE section is what AgentOpt rewrites each iteration. The FIXED section defines the interface contract and is never modified.
# ===== EDITABLE SECTION START =====
SYSTEM_PROMPT = "You are a helpful assistant. Answer concisely."
# ===== EDITABLE SECTION END =====

# ===== FIXED BOUNDARY - DO NOT MODIFY BELOW =====
import openai

DEPENDENCIES = ["openai"]

def run_batch(inputs: list[str], api_key: str) -> tuple[list[str], dict]:
    """Run agent on a batch of inputs. Returns (predictions, usage_dict)."""
    client = openai.OpenAI(
        api_key=api_key,
        base_url="https://openrouter.ai/api/v1",
    )
    predictions = []
    total_tokens = 0
    for inp in inputs:
        resp = client.chat.completions.create(
            model="target_model",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": inp},
            ],
        )
        predictions.append(resp.choices[0].message.content)
        total_tokens += resp.usage.total_tokens
    # Cost is estimated from a hard-coded per-token price; adjust it for your target model.
    usage = {"tokens": total_tokens, "cost": total_tokens * 0.00000015}
    return predictions, usage

eval.py

Must define a score function that returns a float between 0.0 and 1.0:
def score(expected: str, predicted: str) -> float:
    """Return a score in [0, 1]. 1.0 = perfect match."""
    return 1.0 if expected.strip().lower() == predicted.strip().lower() else 0.0
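Since the scorer must stay within [0, 1], it can be worth exercising it locally on a few pairs before bundling. The following is a minimal sketch of such a check; `check_score_fn` is a hypothetical helper, and the `score` function is copied from the example above.

```python
import math
from typing import Callable, Iterable, Tuple


def check_score_fn(
    score: Callable[[str, str], float],
    pairs: Iterable[Tuple[str, str]],
) -> None:
    """Raise if score() returns a non-number or a value outside [0, 1]."""
    for expected, predicted in pairs:
        value = score(expected, predicted)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"score returned {type(value).__name__}, expected float")
        if math.isnan(value) or not 0.0 <= value <= 1.0:
            raise ValueError(f"score returned {value!r}, expected a value in [0, 1]")


def score(expected: str, predicted: str) -> float:
    """Exact-match scorer from the eval.py example."""
    return 1.0 if expected.strip().lower() == predicted.strip().lower() else 0.0


# Passes silently: every result is 0.0 or 1.0.
check_score_fn(score, [("Paris", " paris "), ("4", "four"), ("", "")])
```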

dataset.json

A JSON array of input/answer pairs. The example below shows only three rows for brevity; a real dataset needs at least 10:
[
  {"input": "What is 2 + 2?", "answer": "4"},
  {"input": "Capital of France?", "answer": "Paris"},
  {"input": "What color is the sky?", "answer": "blue"}
]
Pack the three files into a ZIP and host it at a publicly accessible URL:
zip agent_bundle.zip agent.py eval.py dataset.json
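The same packing step can be done from Python, which also leaves room to check the dataset requirements first. This is a sketch using only the standard library; `validate_dataset` and `build_bundle` are hypothetical names, and the checks mirror only the rules stated above (three required files, at least 10 rows, string `input` and `answer` fields).

```python
import json
import zipfile
from pathlib import Path

REQUIRED_FILES = ("agent.py", "eval.py", "dataset.json")
MIN_ROWS = 10


def validate_dataset(rows) -> None:
    """Check that the dataset is a list of at least MIN_ROWS input/answer pairs."""
    if not isinstance(rows, list) or len(rows) < MIN_ROWS:
        raise ValueError(f"dataset.json must be an array of at least {MIN_ROWS} rows")
    for i, row in enumerate(rows):
        ok = (
            isinstance(row, dict)
            and isinstance(row.get("input"), str)
            and isinstance(row.get("answer"), str)
        )
        if not ok:
            raise ValueError(f"row {i} must have string 'input' and 'answer' fields")


def build_bundle(src_dir, out_path="agent_bundle.zip") -> Path:
    """Validate the three files in src_dir and write them to a flat ZIP."""
    src = Path(src_dir)
    missing = [name for name in REQUIRED_FILES if not (src / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing bundle files: {', '.join(missing)}")
    validate_dataset(json.loads((src / "dataset.json").read_text(encoding="utf-8")))
    out = Path(out_path)
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in REQUIRED_FILES:
            # arcname keeps the files at the top level of the archive.
            zf.write(src / name, arcname=name)
    return out
```

Running `build_bundle(".")` in the directory that holds the three files produces `agent_bundle.zip`, which still has to be uploaded to a public URL separately.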
2. Submit the optimization job

Pass the ZIP URL, target model, and number of iterations to start the job.
result = optimizer.fit(
    input_url="https://example.com/agent_bundle.zip",
    target_model="target_model",
    n_iterations=15,
)

job_id = result["job_id"]
print(f"Job submitted. Job ID: {job_id}")
Parameters:
Parameter | Required | Default | Description
input_url | Yes | - | Public URL to your ZIP bundle
target_model | Yes | - | OpenRouter model string to optimize for (e.g. target_model)
n_iterations | No | 15 | Number of optimization iterations (1–50)
overall_timeout_seconds | No | 3600 | Total job timeout in seconds (300–7200)
llm_call_timeout_seconds | No | 300 | Timeout per LLM call in seconds (30–600)
sandbox_timeout_seconds | No | 600 | Timeout per sandbox benchmark run in seconds (60–1800)
Response:
{
  "job_id": "uuid",
  "status": "queued",
  "version": "x.x.x"
}
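The numeric ranges in the parameter table can be checked on the client before a job is submitted. The sketch below encodes those ranges; `validate_fit_params` is a hypothetical helper and does not reflect how the service itself validates requests.

```python
# Allowed ranges copied from the parameter table (inclusive bounds).
RANGES = {
    "n_iterations": (1, 50),
    "overall_timeout_seconds": (300, 7200),
    "llm_call_timeout_seconds": (30, 600),
    "sandbox_timeout_seconds": (60, 1800),
}


def validate_fit_params(**params) -> list:
    """Return a list of human-readable problems; empty means all values are in range."""
    problems = []
    for name, value in params.items():
        if name not in RANGES:
            continue  # input_url and target_model are not range-checked here
        low, high = RANGES[name]
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not low <= value <= high:
            problems.append(f"{name} must be an integer in {low}-{high}, got {value!r}")
    return problems
```

For example, `validate_fit_params(n_iterations=15, overall_timeout_seconds=100)` returns a single message about the timeout, which can be shown to the user instead of waiting for a 422 response.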
3. Monitor progress

Poll get_status() every 30 seconds. The response includes live per-iteration progress once the job starts running.
import time

while True:
    status_data = optimizer.get_status()
    status = status_data.get("status", "unknown")
    current = status_data.get("current_iteration")
    total = status_data.get("n_iterations")
    best = status_data.get("best_score")
    baseline = status_data.get("baseline_score")

    progress = f"iter {current}/{total}" if current is not None else "pending"
    scores = ""
    if baseline is not None:
        scores += f"  baseline={baseline:.4f}"
    if best is not None:
        scores += f"  best={best:.4f}"

    print(f"[{status}] {progress}{scores}")

    if status in ("completed", "failed", "interrupted"):
        break

    time.sleep(30)
Status values:
Status | Description
queued | Job is waiting to start
running | Optimization is active
completed | Optimization finished successfully
interrupted | Job timed out or was cancelled
failed | Internal error; check the error_message field
AgentOpt-specific status fields:
Field | Description
current_iteration | Latest iteration completed (0 = baseline)
best_score | Best score seen so far (0.0–1.0)
baseline_score | Score before any optimization
n_iterations | Total iterations requested
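The polling loop above runs until the job reaches a terminal status. A variant with a client-side deadline can guard against waiting forever if a status never changes. This is a sketch under that assumption; `poll_until_done` is a hypothetical helper that accepts any zero-argument callable returning a status dictionary, and the `sleep` and `clock` parameters exist only to make it testable.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed", "interrupted"}


def poll_until_done(
    get_status: Callable[[], dict],
    interval: float = 30.0,
    max_wait: float = 7200.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call get_status() until a terminal status appears or max_wait elapses."""
    deadline = clock() + max_wait
    while True:
        data = get_status()
        if data.get("status") in TERMINAL_STATUSES:
            return data
        if clock() >= deadline:
            raise TimeoutError(f"job still {data.get('status')!r} after {max_wait}s")
        sleep(interval)
```

With the client from this guide, the call would look like `final = poll_until_done(optimizer.get_status)`.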
4. Get results

Retrieve the optimized prompt and performance metrics once the job completes.
results_data = optimizer.get_results()

for entry in results_data.get("results", []):
    print(f"Model            : {entry.get('model')}")
    print(f"Train score      : {entry.get('train_score')}")
    print(f"Test score       : {entry.get('test_score')}")
    print(f"Iterations run   : {entry.get('iterations_run')}")
    print(f"Iterations kept  : {entry.get('iterations_kept')}")
    print(f"Agent code URL   : {entry.get('agent_code_url')}")
    print(f"\nOriginal prompt:\n{entry.get('original_prompt')}")
    print(f"\nOptimized prompt:\n{entry.get('optimized_prompt')}")
Response:
{
  "job_id": "uuid",
  "status": "completed",
  "results": [
    {
      "model": "target_model",
      "optimizer": "AGENTOPT",
      "original_prompt": "You are a helpful assistant. Answer concisely.",
      "optimized_prompt": "You are a precise assistant. Answer with a single word or number when possible. Do not add explanations unless the question requires them.",
      "train_score": 0.91,
      "test_score": 0.87,
      "iterations_run": 15,
      "iterations_kept": 6,
      "agent_code_url": "https://storage.example.com/best_agent.py"
    }
  ]
}
Result fields:
Field | Description
optimized_prompt | Best system prompt found across all iterations
original_prompt | System prompt from your original agent.py
train_score | Score on the 70% training split
test_score | Score on the 30% held-out test split
iterations_run | Total iterations executed
iterations_kept | Iterations where score improved
agent_code_url | Public URL to download the best agent.py
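Two derived numbers can help when reading a result entry: the gap between train and test scores (a large gap may suggest the prompt is overfitted to the training split) and the share of iterations that were kept. The sketch below computes both from the fields listed above; `summarize_entry` is a hypothetical helper, and the interpretation of the gap is a general rule of thumb rather than something AgentOpt reports.

```python
def summarize_entry(entry: dict) -> dict:
    """Derive a train/test gap and an acceptance rate from one result entry."""
    train = entry["train_score"]
    test = entry["test_score"]
    run = entry["iterations_run"]
    kept = entry["iterations_kept"]
    return {
        # Positive gap: the prompt scores better on training rows than held-out rows.
        "generalization_gap": round(train - test, 4),
        # Fraction of iterations whose proposal improved the score.
        "acceptance_rate": round(kept / run, 4) if run else 0.0,
    }


# Values taken from the sample response above.
sample = {"train_score": 0.91, "test_score": 0.87, "iterations_run": 15, "iterations_kept": 6}
print(summarize_entry(sample))  # {'generalization_gap': 0.04, 'acceptance_rate': 0.4}
```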

Complete example

import time
from ironlabs import AgentOptimizer

def main():
    optimizer = AgentOptimizer()

    # 1. Submit job
    print("Submitting AgentOpt job...")
    result = optimizer.fit(
        input_url="https://example.com/agent_bundle.zip",
        target_model="target_model",
        n_iterations=15,
    )
    job_id = result["job_id"]
    print(f"Job submitted. Job ID: {job_id}\n")

    # 2. Poll until complete
    print("Polling status every 30s...")
    while True:
        status_data = optimizer.get_status()
        status = status_data.get("status", "unknown")
        current = status_data.get("current_iteration")
        total = status_data.get("n_iterations")
        best = status_data.get("best_score")
        baseline = status_data.get("baseline_score")

        progress = f"iter {current}/{total}" if current is not None else "pending"
        scores = ""
        if baseline is not None:
            scores += f"  baseline={baseline:.4f}"
        if best is not None:
            scores += f"  best={best:.4f}"

        print(f"  [{status}] {progress}{scores}")

        if status in ("completed", "failed", "interrupted"):
            break

        time.sleep(30)

    if status != "completed":
        print(f"\nJob ended with status: {status}")
        error = status_data.get("error_message")
        if error:
            print(f"Error: {error}")
        return

    # 3. Get results
    print("\nFetching results...")
    results_data = optimizer.get_results()
    for entry in results_data.get("results", []):
        print("=" * 60)
        print(f"Model            : {entry.get('model')}")
        print(f"Train score      : {entry.get('train_score')}")
        print(f"Test score       : {entry.get('test_score')}")
        print(f"Iterations run   : {entry.get('iterations_run')}")
        print(f"Iterations kept  : {entry.get('iterations_kept')}")
        print(f"Agent code URL   : {entry.get('agent_code_url')}")
        print("\nOriginal prompt:")
        print(entry.get("original_prompt", ""))
        print("\nOptimized prompt:")
        print(entry.get("optimized_prompt", ""))

if __name__ == "__main__":
    main()

How AgentOpt works

AgentOpt runs a closed optimization loop:
  1. Baseline — runs your original agent.py on the dataset to establish a starting score
  2. Propose — Claude reads the current system prompt and proposes an improved version
  3. Benchmark — the proposed variant runs in an isolated sandbox against the dataset
  4. Accept or reject — improvements are kept; regressions are discarded
  5. Repeat — steps 2–4 repeat for n_iterations iterations
The best-scoring agent.py (with the winning system prompt embedded) is returned at the end.
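The loop described above can be sketched in a few lines. This is an illustration of the accept-or-reject pattern only, not AgentOpt's implementation: `propose` and `benchmark` are hypothetical callables standing in for the Claude proposer and the sandboxed benchmark, and treating ties as rejections is an assumption.

```python
from typing import Callable


def optimize_prompt(
    initial_prompt: str,
    propose: Callable[[str], str],
    benchmark: Callable[[str], float],
    n_iterations: int,
) -> dict:
    """Greedy loop: keep a proposed prompt only if it beats the best score so far."""
    baseline = benchmark(initial_prompt)  # step 1: baseline
    best_prompt, best_score = initial_prompt, baseline
    kept = 0
    for _ in range(n_iterations):
        candidate = propose(best_prompt)  # step 2: propose
        score = benchmark(candidate)      # step 3: benchmark
        if score > best_score:            # step 4: accept (ties rejected: assumption)
            best_prompt, best_score = candidate, score
            kept += 1
    return {
        "baseline_score": baseline,
        "best_score": best_score,
        "best_prompt": best_prompt,
        "iterations_kept": kept,
    }
```

Because rejected candidates never replace `best_prompt`, the final score can never fall below the baseline in this sketch.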

Error handling

Error | Cause | Fix
422 agentopt requires input_url | Missing input_url in request | Upload ZIP and set the URL
422 agentopt requires target_models | Empty target models list | Add at least one model string
422 n_iterations must be 1–50 | Iterations out of range | Use a value between 1 and 50
422 overall_timeout_seconds must be 300–7200 | Timeout out of range | Use a value in range
status: interrupted | Job timed out | Check error_message; retry with higher overall_timeout_seconds
status: failed | Internal error | Check error_message; verify agent.py runs locally without error
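The two status-based rows in the table can be turned into a small decision helper. The sketch below is one possible approach; both function names are hypothetical, doubling the timeout is an arbitrary choice, and the 7200-second cap comes from the documented range for overall_timeout_seconds.

```python
from typing import Optional


def next_overall_timeout(current: int, factor: float = 2.0, cap: int = 7200) -> Optional[int]:
    """Return a larger timeout for a retry, or None if already at the cap."""
    if current >= cap:
        return None
    return min(cap, int(current * factor))


def describe_outcome(status_data: dict, overall_timeout_seconds: int = 3600) -> str:
    """Map a final status payload to a suggested next step."""
    status = status_data.get("status")
    message = status_data.get("error_message") or "no error_message provided"
    if status == "completed":
        return "completed: fetch results"
    if status == "interrupted":
        retry = next_overall_timeout(overall_timeout_seconds)
        if retry is None:
            return f"interrupted ({message}): already at the maximum timeout"
        return f"interrupted ({message}): retry with overall_timeout_seconds={retry}"
    if status == "failed":
        return f"failed ({message}): verify agent.py runs locally"
    return f"not finished (status={status})"
```

For instance, `describe_outcome({"status": "interrupted"}, 3600)` suggests retrying with `overall_timeout_seconds=7200`.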