
AgentOpt Skill

End-to-end workflow: user files → ZIP → upload → optimize → poll → results.

Trigger

Activate when the user says things like:
  • “optimize my agent / prompt”
  • “run agentopt on my files”
  • “improve my system prompt using the optimizer”
  • /AgentOpt

Prerequisites

Ask for these if not already provided:
| Item | Where to get |
| --- | --- |
| IRONA_API_KEY | User’s bearer token (env var or prompted) |
| agent.py | User’s agent file (path or content) |
| eval.py | Scoring function (path or content) |
| dataset.json | Array of {input, answer} objects (path or content) |
| target_model | OpenRouter model string, default openai/gpt-4o-mini |
| n_iterations | 1–50, default 5 for quick test / 15 for full run |

Step 1 — Validate & convert files

agent.py requirements

Must have EDITABLE and FIXED boundary markers, and export run_batch:
# ===== EDITABLE SECTION START =====
import asyncio
import os

MODEL = "openai/gpt-4o-mini"  # Set by benchmark config — do not change
MAX_STEPS = 1
CONCURRENT_REQUESTS = 5
DEPENDENCIES = []

SYSTEM_PROMPT = "Your system prompt here"

async def run_batch(inputs: list[str], api_key: str) -> tuple[list[str], dict]:
    """
    Process a batch of inputs. Return (predictions, usage).
    usage = {"tokens": int, "cost": float}
    """
    if not api_key:
        return ["unknown"] * len(inputs), {"tokens": 0, "cost": 0.0}

    from openai import AsyncOpenAI
    client = AsyncOpenAI(api_key=api_key, base_url="https://openrouter.ai/api/v1")
    semaphore = asyncio.Semaphore(CONCURRENT_REQUESTS)
    total_tokens = 0
    total_cost = 0.0

    async def _call(inp: str) -> str:
        nonlocal total_tokens, total_cost
        async with semaphore:
            response = await client.chat.completions.create(
                model=MODEL,
                temperature=0,
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": inp},
                ],
            )
            if response.usage:
                total_tokens += response.usage.total_tokens
                extra = getattr(response.usage, "model_extra", None) or {}
                total_cost += float(extra.get("cost", 0.0))
            return (response.choices[0].message.content or "").strip()

    predictions = list(await asyncio.gather(*(_call(inp) for inp in inputs)))
    return predictions, {"tokens": total_tokens, "cost": total_cost}

# ===== FIXED BOUNDARY - DO NOT MODIFY BELOW =====
if __name__ == "__main__":
    import json
    import sys

    inputs_file = sys.argv[1]
    with open(inputs_file) as f:
        inputs = json.load(f)

    api_key = os.environ.get("OPENROUTER_API_KEY", "")
    predictions, usage = asyncio.run(run_batch(inputs, api_key))

    print(json.dumps(predictions))
    print(json.dumps(usage), file=sys.stderr)
Conversion rules:
  • If user’s code has no EDITABLE SECTION START marker → wrap their SYSTEM_PROMPT + run_batch in the EDITABLE section, append the FIXED boundary block verbatim.
  • If run_batch doesn’t exist → wrap their inference logic inside the template above.
  • MODEL must be set to the target model string.
  • The FIXED boundary block must not be modified.
Reference examples: examples/gaia/agent.py, examples/finance_agent/agent.py
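
The agent.py requirements above can be checked mechanically before building the ZIP. The sketch below is one possible pre-flight check, not part of AgentOpt itself: `check_agent_source` is a hypothetical helper name, and it only inspects the source text and syntax tree, so it does not prove the agent runs.

```python
import ast

EDITABLE_MARKER = "# ===== EDITABLE SECTION START ====="
FIXED_MARKER = "# ===== FIXED BOUNDARY - DO NOT MODIFY BELOW ====="


def check_agent_source(source: str, target_model: str) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if EDITABLE_MARKER not in source:
        problems.append("missing EDITABLE SECTION START marker")
    if FIXED_MARKER not in source:
        problems.append("missing FIXED BOUNDARY marker")

    tree = ast.parse(source)

    # run_batch must be a top-level coroutine function.
    has_run_batch = any(
        isinstance(node, ast.AsyncFunctionDef) and node.name == "run_batch"
        for node in tree.body
    )
    if not has_run_batch:
        problems.append("no top-level 'async def run_batch'")

    # MODEL must be a top-level constant equal to the target model string.
    model_value = None
    for node in tree.body:
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == "MODEL":
                    model_value = node.value.value
    if model_value != target_model:
        problems.append(f"MODEL is {model_value!r}, expected {target_model!r}")
    return problems
```

Calling it with the contents of agent.py and the chosen target model returns the rules that still need a conversion step.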

eval.py requirements

Must define score(expected: str, predicted: str) -> float returning 0.0–1.0.
# Minimal eval — exact/partial match
import re

def _normalize(s: str) -> str:
    return re.sub(r"[.,;:!?\-]$", "", s.strip().lower())

def score(expected: str, predicted: str) -> float:
    exp = _normalize(expected)
    pred = _normalize(predicted)
    if exp == pred:
        return 1.0
    if exp in pred:
        return 0.5
    return 0.0
Conversion rules:
  • If user has a different metric (e.g. F1, BLEU, exact_match) → wrap it in a score(expected, predicted) function that returns float 0–1.
  • If user has no eval → use the exact/partial match template above, noting that they should customize it for their task.
Reference examples: examples/gaia/eval.py, examples/finance_agent/eval.py, examples/trail/eval.py
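
As an illustration of the first conversion rule, a different metric can be wrapped so that it satisfies the score(expected, predicted) contract. The sketch below uses a simple whitespace-token F1; `token_f1` is a hypothetical stand-in for whatever metric the user already has, and the clamp in `score` only guards against metrics that might stray outside 0–1.

```python
from collections import Counter


def token_f1(expected: str, predicted: str) -> float:
    """Whitespace-token F1 between two strings (case-insensitive)."""
    exp_tokens = expected.lower().split()
    pred_tokens = predicted.lower().split()
    if not exp_tokens or not pred_tokens:
        # Two empty strings match; one empty string does not.
        return float(exp_tokens == pred_tokens)
    overlap = sum((Counter(exp_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)


def score(expected: str, predicted: str) -> float:
    """Entry point with the required signature; result is clamped to 0.0-1.0."""
    return max(0.0, min(1.0, token_f1(expected, predicted)))
```

For example, score("new york city", "new york") gives 0.8: both predicted tokens are correct (precision 1.0) but one expected token is missing (recall 2/3).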

dataset.json requirements

JSON array of {"input": str, "answer": str} objects. Minimum 10 rows.
[
  {"input": "What is the capital of France?", "answer": "Paris"},
  {"input": "2 + 2", "answer": "4"}
]
Conversion rules:
  • CSV with input/answer columns → python3 -c "import csv,json,sys; rows=list(csv.DictReader(open('data.csv'))); json.dump([{'input':r['input'],'answer':r['answer']} for r in rows],sys.stdout,indent=2)"
  • Different column names → remap to input/answer.
  • JSONL → python3 -c "import json,sys; data=[json.loads(l) for l in open('data.jsonl') if l.strip()]; json.dump(data,sys.stdout,indent=2)" (blank lines are skipped; remap keys afterwards if they are not already input/answer)
  • Fewer than 10 rows → warn the user; optimizer requires minimum 10.
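
The dataset rules above can be verified after conversion with a few lines of Python. This is a minimal sketch: `validate_dataset` and `remap_columns` are hypothetical helper names, and the checks cover only what this section states (array of objects, string input and answer fields, at least 10 rows).

```python
MIN_ROWS = 10


def remap_columns(rows: list[dict], input_key: str, answer_key: str) -> list[dict]:
    """Rename arbitrary column names to the required input/answer keys."""
    return [{"input": str(r[input_key]), "answer": str(r[answer_key])} for r in rows]


def validate_dataset(rows: object) -> list[str]:
    """Return a list of problems; an empty list means the dataset looks valid."""
    if not isinstance(rows, list):
        return ["top-level JSON value must be an array"]
    problems = []
    if len(rows) < MIN_ROWS:
        problems.append(f"only {len(rows)} rows; at least {MIN_ROWS} are required")
    for i, row in enumerate(rows):
        if not isinstance(row, dict):
            problems.append(f"row {i}: not an object")
            continue
        for field in ("input", "answer"):
            if not isinstance(row.get(field), str):
                problems.append(f"row {i}: '{field}' missing or not a string")
    return problems
```

Load the file with json.load, pass the result to validate_dataset, and show any returned messages to the user before building the ZIP.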

Step 2 — Build ZIP

cd /tmp && mkdir -p agentopt_job
cp <agent.py path>  agentopt_job/agent.py
cp <eval.py path>   agentopt_job/eval.py
cp <dataset.json path> agentopt_job/dataset.json
cd agentopt_job && zip -q ../agentopt_input.zip agent.py eval.py dataset.json
echo "ZIP size: $(du -sh ../agentopt_input.zip | cut -f1)"
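
Where the zip command is not installed, the same archive can be produced with Python's standard zipfile module. The sketch below assumes the optimizer expects the three files at the archive root under exactly these names, as in the shell commands above; `build_zip` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def build_zip(agent: str, eval_file: str, dataset: str, out_path: str) -> str:
    """Write agent.py, eval.py and dataset.json at the root of a new ZIP."""
    members = {"agent.py": agent, "eval.py": eval_file, "dataset.json": dataset}

    # Check all inputs first so a missing file never leaves a partial archive.
    missing = [src for src in members.values() if not Path(src).is_file()]
    if missing:
        raise FileNotFoundError(f"missing input file(s): {missing}")

    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for arcname, src in members.items():
            # arcname renames the file inside the archive, whatever its source name.
            zf.write(src, arcname=arcname)
    return out_path
```

Because each source file is stored under a fixed archive name, the user's files can have any names on disk.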

Step 3 — Upload ZIP to Cloudflare R2

Requires R2 credentials in environment:
| Variable | Value |
| --- | --- |
| CF_ENDPOINT_URL | https://<account_id>.r2.cloudflarestorage.com |
| CF_ACCESS_KEY_ID | R2 API token (access key) |
| CF_SECRET_ACCESS_KEY | R2 API token (secret) |
| CF_ACCOUNT_ID | Cloudflare account ID |
| CF_BUCKET_NAME | R2 bucket name |
R2_KEY="router-training/$(python3 -c 'import uuid; print(uuid.uuid4().hex)')/agentopt_input.zip"

# Upload once and capture the printed URL.
# R2_KEY is a plain shell variable, so it is passed to Python explicitly.
INPUT_URL=$(R2_KEY="$R2_KEY" python3 - <<'EOF'
import boto3, os
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["CF_ENDPOINT_URL"],
    aws_access_key_id=os.environ["CF_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["CF_SECRET_ACCESS_KEY"],
    config=Config(signature_version="s3v4"),
    region_name="auto",
)
bucket = os.environ["CF_BUCKET_NAME"]
key = os.environ["R2_KEY"]
s3.upload_file("/tmp/agentopt_input.zip", bucket, key)
print(f"https://{bucket}.{os.environ['CF_ACCOUNT_ID']}.r2.dev/{key}")
EOF
)
echo "Input URL: $INPUT_URL"

Step 4 — Submit optimization job

API_KEY="${IRONA_API_KEY:-<paste your key>}"
BASE_URL="https://irona-ai--optimize-dev.modal.run"
TARGET_MODEL="${TARGET_MODEL:-openai/gpt-4o-mini}"
N_ITERATIONS="${N_ITERATIONS:-15}"

JOB_RESPONSE=$(curl -s -X POST "$BASE_URL" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d "{
    \"optimizer\": \"agentopt\",
    \"input_url\": \"$INPUT_URL\",
    \"target_models\": [\"$TARGET_MODEL\"],
    \"n_iterations\": $N_ITERATIONS
  }")

echo "$JOB_RESPONSE" | python3 -m json.tool
JOB_ID=$(echo "$JOB_RESPONSE" | python3 -c "import json,sys; print(json.load(sys.stdin)['job_id'])")
echo "Job ID: $JOB_ID"
Full request body:
| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| optimizer | string | | must be "agentopt" |
| input_url | string | | public ZIP download URL |
| target_models | string[] | | exactly one model, e.g. ["openai/gpt-4o-mini"] |
| n_iterations | int | 15 | 1–50 |
| overall_timeout_seconds | int | 3600 | 300–7200 |
| llm_call_timeout_seconds | int | 300 | 30–600 |
| sandbox_timeout_seconds | int | 600 | 60–1800 |
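
Building the request body with a JSON library avoids shell-quoting mistakes and lets the ranges in the table be checked before the request is sent. The following is a minimal sketch under that assumption; `build_request_body` and `LIMITS` are hypothetical names, and the server remains the authority on what it accepts.

```python
import json

# Allowed ranges copied from the request-body table.
LIMITS = {
    "n_iterations": (1, 50),
    "overall_timeout_seconds": (300, 7200),
    "llm_call_timeout_seconds": (30, 600),
    "sandbox_timeout_seconds": (60, 1800),
}


def build_request_body(input_url: str, target_model: str, **options: int) -> str:
    """Return the JSON body for an agentopt job, validating optional fields."""
    if not input_url:
        raise ValueError("input_url is required")
    if not target_model:
        raise ValueError("target_models needs exactly one model")
    body = {
        "optimizer": "agentopt",
        "input_url": input_url,
        "target_models": [target_model],
    }
    for name, value in options.items():
        if name not in LIMITS:
            raise ValueError(f"unknown option: {name}")
        low, high = LIMITS[name]
        if not low <= value <= high:
            raise ValueError(f"{name} must be {low}-{high}, got {value}")
        body[name] = value
    return json.dumps(body)
```

Omitted options are left out of the body so the server-side defaults apply.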
Response:
{"job_id": "uuid", "status": "queued", "version": "x.x.x"}

Step 5 — Poll status until complete

Poll every 30 seconds. Show live progress each iteration.
STATUS_URL="https://irona-ai--optimize-status-dev.modal.run"

while true; do
  STATUS=$(curl -s "$STATUS_URL?job_id=$JOB_ID" \
    -H "Authorization: Bearer $API_KEY")
  
  STATE=$(echo "$STATUS" | python3 -c "import json,sys; d=json.load(sys.stdin); print(d.get('status',''))")
  ITER=$(echo "$STATUS"  | python3 -c "import json,sys; d=json.load(sys.stdin); print(d.get('current_iteration','?'))")
  BEST=$(echo "$STATUS"  | python3 -c "import json,sys; d=json.load(sys.stdin); print(d.get('best_score','?'))")
  
  echo "[$(date +%H:%M:%S)] status=$STATE  iteration=$ITER  best_score=$BEST"
  
  case "$STATE" in
    completed|interrupted|failed) break ;;
  esac
  sleep 30
done
Status values: queued → running → completed | interrupted | failed
AgentOpt-specific status fields:
| Field | Description |
| --- | --- |
| current_iteration | Latest iteration completed (0 = baseline) |
| best_score | Best score seen so far (0.0–1.0) |
| baseline_score | Score before any optimization |
| n_iterations | Total iterations requested |
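
The status fields can be turned into a single progress line, including the gain over the baseline. The sketch below assumes the status response is a JSON object whose score fields are numbers when present; `is_terminal` and `progress_line` are hypothetical helper names.

```python
TERMINAL_STATES = {"completed", "interrupted", "failed"}


def is_terminal(status: dict) -> bool:
    """True once the job has reached a state where polling should stop."""
    return status.get("status") in TERMINAL_STATES


def progress_line(status: dict) -> str:
    """Format one human-readable progress line from a status response."""
    parts = [f"status={status.get('status', 'unknown')}"]
    iteration = status.get("current_iteration")
    total = status.get("n_iterations")
    best = status.get("best_score")
    baseline = status.get("baseline_score")
    if iteration is not None and total is not None:
        parts.append(f"iteration={iteration}/{total}")
    if best is not None:
        parts.append(f"best_score={best:.3f}")
    if best is not None and baseline is not None:
        # Signed difference between the best score so far and the baseline.
        parts.append(f"gain={best - baseline:+.3f}")
    return "  ".join(parts)
```

Fields that are missing, for example while the job is still queued, are left out of the line.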

Step 6 — Fetch and display results

RESULT_URL="https://irona-ai--optimize-result-dev.modal.run"

RESULT=$(curl -s "$RESULT_URL?job_id=$JOB_ID" \
  -H "Authorization: Bearer $API_KEY")

echo "$RESULT" | python3 -c "
import json, sys
data = json.load(sys.stdin)
results = data.get('results', [])
if not results:
    print('No results yet. Status:', data.get('status'))
    sys.exit(0)
for r in results:
    print('=== AgentOpt Result ===')
    print(f'Model:              {r[\"model\"]}')
    print(f'Train score:        {r.get(\"train_score\")}')
    print(f'Test score:         {r.get(\"test_score\")}')
    print(f'Iterations run:     {r.get(\"iterations_run\")}')
    print(f'Iterations kept:    {r.get(\"iterations_kept\")}')
    print()
    print('--- Original prompt ---')
    print(r['original_prompt'])
    print()
    print('--- Optimized prompt ---')
    print(r['optimized_prompt'])
    print()
    print(f'Best agent code:    {r.get(\"agent_code_url\")}')
"
Result fields (agentopt):
| Field | Description |
| --- | --- |
| optimized_prompt | Best system prompt found |
| original_prompt | System prompt from original agent.py |
| train_score | Score on 70% train split |
| test_score | Score on 30% held-out test split |
| iterations_run | Total iterations executed |
| iterations_kept | Iterations where score improved |
| agent_code_url | Public URL to download best agent.py |

Complete end-to-end script

Save to /tmp/run_agentopt.sh and run:
#!/usr/bin/env bash
set -euo pipefail

# ── Config ──────────────────────────────────────────────────────────────────
API_KEY="${IRONA_API_KEY:?Set IRONA_API_KEY}"
AGENT_FILE="${1:?Usage: $0 <agent.py> <eval.py> <dataset.json> [model] [n_iter]}"
EVAL_FILE="${2:?}"
DATASET_FILE="${3:?}"
TARGET_MODEL="${4:-openai/gpt-4o-mini}"
N_ITERATIONS="${5:-5}"

BASE_URL="https://irona-ai--optimize-dev.modal.run"
STATUS_URL="https://irona-ai--optimize-status-dev.modal.run"
RESULT_URL="https://irona-ai--optimize-result-dev.modal.run"

# ── Build ZIP ────────────────────────────────────────────────────────────────
TMP=$(mktemp -d)
cp "$AGENT_FILE" "$TMP/agent.py"
cp "$EVAL_FILE"  "$TMP/eval.py"
cp "$DATASET_FILE" "$TMP/dataset.json"
ZIP="$TMP/input.zip"
(cd "$TMP" && zip -q "$ZIP" agent.py eval.py dataset.json)
echo "ZIP: $(du -sh "$ZIP" | cut -f1)"

# ── Upload to Cloudflare R2 ──────────────────────────────────────────────────
echo "Uploading to R2..."
R2_KEY="router-training/$(python3 -c 'import uuid; print(uuid.uuid4().hex)')/agentopt_input.zip"
# ZIP and R2_KEY are shell variables, so both are passed to Python explicitly.
INPUT_URL=$(R2_KEY="$R2_KEY" ZIP="$ZIP" python3 - <<'PYEOF'
import boto3, os
from botocore.config import Config
s3 = boto3.client("s3", endpoint_url=os.environ["CF_ENDPOINT_URL"],
    aws_access_key_id=os.environ["CF_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["CF_SECRET_ACCESS_KEY"],
    config=Config(signature_version="s3v4"), region_name="auto")
bucket = os.environ["CF_BUCKET_NAME"]
key = os.environ["R2_KEY"]
s3.upload_file(os.environ["ZIP"], bucket, key)
print(f"https://{bucket}.{os.environ['CF_ACCOUNT_ID']}.r2.dev/{key}")
PYEOF
)
echo "URL: $INPUT_URL"

# ── Submit ──────────────────────────────────────────────────────────────────
echo "Submitting job (model=$TARGET_MODEL, n_iterations=$N_ITERATIONS)..."
JOB=$(curl -s -X POST "$BASE_URL" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d "{
    \"optimizer\": \"agentopt\",
    \"input_url\": \"$INPUT_URL\",
    \"target_models\": [\"$TARGET_MODEL\"],
    \"n_iterations\": $N_ITERATIONS
  }")
echo "$JOB" | python3 -m json.tool
JOB_ID=$(echo "$JOB" | python3 -c "import json,sys; print(json.load(sys.stdin)['job_id'])")
echo "Job ID: $JOB_ID"

# ── Poll ─────────────────────────────────────────────────────────────────────
echo "Polling..."
while true; do
  S=$(curl -s "$STATUS_URL?job_id=$JOB_ID" -H "Authorization: Bearer $API_KEY")
  STATE=$(echo "$S" | python3 -c "import json,sys; d=json.load(sys.stdin); print(d.get('status',''))")
  ITER=$(echo "$S"  | python3 -c "import json,sys; d=json.load(sys.stdin); print(d.get('current_iteration','?'))")
  BEST=$(echo "$S"  | python3 -c "import json,sys; d=json.load(sys.stdin); print(d.get('best_score','?'))")
  BASE=$(echo "$S"  | python3 -c "import json,sys; d=json.load(sys.stdin); print(d.get('baseline_score','?'))")
  echo "[$(date +%H:%M:%S)] $STATE  iter=$ITER  best=$BEST  baseline=$BASE"
  case "$STATE" in completed|interrupted|failed) break ;; esac
  sleep 30
done

# ── Results ──────────────────────────────────────────────────────────────────
echo ""
echo "Fetching results..."
curl -s "$RESULT_URL?job_id=$JOB_ID" \
  -H "Authorization: Bearer $API_KEY" | python3 -c "
import json, sys
data = json.load(sys.stdin)
for r in (data.get('results') or []):
    print('=== RESULT ===')
    print(f'Train: {r.get(\"train_score\")}  Test: {r.get(\"test_score\")}')
    print(f'Iterations: {r.get(\"iterations_run\")} run / {r.get(\"iterations_kept\")} kept')
    print()
    print('ORIGINAL PROMPT:')
    print(r['original_prompt'])
    print()
    print('OPTIMIZED PROMPT:')
    print(r['optimized_prompt'])
    print()
    print('Best agent.py:', r.get('agent_code_url'))
"

rm -rf "$TMP"
Usage:
IRONA_API_KEY=sk_... bash /tmp/run_agentopt.sh \
  agent.py eval.py dataset.json \
  openai/gpt-4o-mini 5

Reference examples

Pre-built working examples in examples/:
| Benchmark | agent.py | eval.py | dataset.json |
| --- | --- | --- | --- |
| GAIA (general knowledge + tool use) | examples/gaia/agent.py | examples/gaia/eval.py | examples/gaia/dataset.json |
| Finance Q&A | examples/finance_agent/agent.py | examples/finance_agent/eval.py | examples/finance_agent/dataset.json |
| TRAIL (LLM trace classification) | examples/trail/agent.py | examples/trail/eval.py | examples/trail/dataset.json |
Quick test with a pre-built example:
IRONA_API_KEY=sk_... bash /tmp/run_agentopt.sh \
  examples/gaia/agent.py \
  examples/gaia/eval.py \
  examples/gaia/dataset.json \
  openai/gpt-4o-mini 2

Error handling

| Error | Cause | Fix |
| --- | --- | --- |
| 422 agentopt requires input_url | Missing input_url in request | Ensure ZIP uploaded and URL is set |
| 422 agentopt requires target_models | Empty target_models list | Add at least one model string |
| 422 overall_timeout_seconds must be 300–7200 | Timeout out of range | Use value in range |
| 422 n_iterations must be 1–50 | Iterations out of range | Use value in range |
| 400 R2 upload failed | input_url not reachable or R2 key missing | Verify R2 credentials and bucket public-read access |
| status: interrupted | Job timed out or cancelled | Check error_message field in status response; retry with higher overall_timeout_seconds |
| status: failed | Internal error | Check error_message; ensure agent.py runs without error locally |
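
The advice for interrupted jobs can be expressed as a small decision helper. This is only a sketch: `plan_retry` is a hypothetical name, and doubling the timeout is an assumed policy rather than documented AgentOpt behaviour. Only the 7200-second ceiling and the error_message field come from this document.

```python
MAX_TIMEOUT = 7200  # upper bound for overall_timeout_seconds


def plan_retry(status: dict, current_timeout: int = 3600) -> dict:
    """Decide whether to resubmit a job and with which overall timeout."""
    state = status.get("status")
    reason = status.get("error_message") or "no error_message provided"
    if state == "interrupted" and current_timeout < MAX_TIMEOUT:
        # Assumed policy: double the timeout, capped at the allowed maximum.
        new_timeout = min(current_timeout * 2, MAX_TIMEOUT)
        return {"retry": True, "overall_timeout_seconds": new_timeout, "reason": reason}
    # Failed jobs are not retried: the agent should be fixed locally first.
    return {"retry": False, "overall_timeout_seconds": current_timeout, "reason": reason}
```

A job interrupted at the default 3600 seconds would be retried once at 7200; after that, the error_message should be reported to the user instead.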

Validating agent.py locally before submitting

# Smoke test: force an empty OPENROUTER_API_KEY so no model call is made (requires bash)
OPENROUTER_API_KEY= python3 <agent.py path> <(echo '["test question"]')
# Should print: ["unknown"] to stdout, {"tokens": 0, "cost": 0.0} to stderr