IronLabs Prompt Optimization ships with two optimization algorithms — GEPA and MIPROv2 — and runs them in parallel by default. This page explains what each does, when to prefer one, and how the winner is selected.

At a glance

| Question | GEPA | MIPROv2 |
| --- | --- | --- |
| What does it tune? | Instruction wording and structure | In-context examples (BootstrapFewShot) |
| Best when… | Your prompt is fundamentally wrong or vague | Your prompt is right but examples could be better |
| Typical cost | Higher (more reflection loops per iteration) | Lower (bootstrap is cheaper than reflection) |
| Surprising behavior | Can drastically shorten prompts that seemed essential | Can pick one very specific example that lifts the whole eval |
| Underlying paper / library | DSPy GEPA (Genetic-Pareto) | DSPy MIPROv2 (Multiprompt Instruction PRoposal Optimizer, version 2) |
| Typical iterations | 8–16 | 4–8 |

When to prefer each

Pick GEPA when:
  • Your prompt was written quickly and you’re not sure the wording is right.
  • The task requires multi-step reasoning that the model isn’t currently doing.
  • You want the optimizer to explore structural changes (re-ordering instructions, adding a chain-of-thought scaffold, removing redundant constraints).
Pick MIPROv2 when:
  • Your prompt’s wording is solid and you’re trying to lift the last 5–10% of accuracy.
  • The task has a clear input/output schema and the failure mode is “almost right but format-wrong”.
  • You have a small, high-quality bootstrap set and want the optimizer to leverage in-context examples rather than rewrite instructions.
Pick both (default) when you don’t know which applies — the parallel run costs ~1.5× a single run (not 2×, because they share embedding and dataset prep) and lets the data decide.

How IronLabs picks the winner

Both optimizers run against the same dataset with the same metric (json_match, exact_match, bleu, rouge, meteor, or facility). At convergence:
  1. Each optimizer reports its best candidate prompt and the eval score on the held-out portion of your dataset.
  2. IronLabs compares scores with a tie threshold of 0.005. If both are within the threshold, MIPROv2 wins (lower-cost tiebreaker).
  3. The winner’s candidate prompt and score are written to the job’s result. The runner-up’s score is still recorded for reference.
You’ll see this in the optimization job result:
{
  "job_id": "opt-abc123-def456",
  "status": "completed",
  "results": [
    {
      "model": "openai/gpt-4o-mini",
      "optimizer": "gepa",
      "score": 0.847,
      "is_winner": true,
      "optimized_prompt": "..."
    },
    {
      "model": "openai/gpt-4o-mini",
      "optimizer": "miprov2",
      "score": 0.812,
      "is_winner": false,
      "optimized_prompt": "..."
    }
  ]
}
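
The selection rule described above can be sketched in a few lines of Python. This is an illustration only, not IronLabs' actual implementation: the `Candidate` class and `pick_winner` function are hypothetical names, and treating a difference exactly equal to the threshold as a tie is an assumption the text does not settle.

```python
from dataclasses import dataclass

TIE_THRESHOLD = 0.005  # tie threshold quoted in the text


@dataclass
class Candidate:
    """Hypothetical container for one optimizer's best result."""
    optimizer: str  # "gepa" or "miprov2"
    score: float    # eval score on the held-out portion of the dataset


def pick_winner(gepa: Candidate, miprov2: Candidate,
                tie_threshold: float = TIE_THRESHOLD) -> Candidate:
    """Return the winning candidate under the rule described above."""
    # Scores within the threshold count as a tie; MIPROv2 wins ties
    # because it is the lower-cost optimizer.
    # Assumption: a difference exactly equal to the threshold is a tie.
    if abs(gepa.score - miprov2.score) <= tie_threshold:
        return miprov2
    # Otherwise the higher held-out score wins.
    return gepa if gepa.score > miprov2.score else miprov2


# Scores taken from the sample job result above: the gap is 0.035,
# well outside the threshold, so GEPA wins.
winner = pick_winner(Candidate("gepa", 0.847), Candidate("miprov2", 0.812))
print(winner.optimizer)
```

With the sample scores the gap is far larger than the threshold, which matches `"is_winner": true` on the `gepa` entry. Had GEPA scored 0.815 instead, the 0.003 gap would count as a tie and MIPROv2 would be selected.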

Disabling one optimizer

If you’ve measured that one optimizer consistently wins for your domain and you want to save the cost of the other, pass the optimizers argument with only the one you want to run:
optimizer.fit(
    prompt_url="https://example.com/prompt.txt",
    dataset_url="https://example.com/dataset.json",
    metric="exact_match",
    target_models=["openai/gpt-4o-mini"],
    optimizers=["gepa"],  # default is ["gepa", "miprov2"]
)

Cost model

Token spend per job is roughly:
total_tokens ≈ (dataset_size × iterations × tokens_per_call) × num_optimizers
A typical run with 100 examples, 12 iterations, a GPT-4o-mini target, and both optimizers enabled lands in the $0.40–$1.20 range. Reflection-heavy GEPA accounts for ~70% of that. Because the two optimizers share embedding and dataset prep, dropping one does not save its full share: disabling GEPA roughly halves cost, and disabling MIPROv2 saves ~25%.
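
As a rough worked example, the token formula above can be written as a small helper. The function name is hypothetical, and the `tokens_per_call` value of 500 is an illustrative assumption, not a measured IronLabs figure. The result is a token count only, so converting it to dollars depends on your model's current pricing.

```python
def estimate_total_tokens(dataset_size: int, iterations: int,
                          tokens_per_call: int, num_optimizers: int = 2) -> int:
    """Rough token estimate: (dataset_size x iterations x tokens_per_call) x num_optimizers."""
    return dataset_size * iterations * tokens_per_call * num_optimizers


# 100 examples, 12 iterations, and an assumed 500 tokens per call.
both = estimate_total_tokens(100, 12, 500, num_optimizers=2)
single = estimate_total_tokens(100, 12, 500, num_optimizers=1)
print(both, single)  # 1200000 600000
```

The formula treats the optimizers as equally expensive, so it is best read as an order-of-magnitude estimate. In practice the split is uneven, as the GEPA and MIPROv2 shares quoted above indicate.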
