Reproducing Run-013
Run-013 is the current baseline for the published Shield numbers: 277/280 attacks blocked (98.9%), 0/50 false positives. This page is the exact recipe to reproduce it.
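As a quick arithmetic check on the headline figures, the sketch below derives the percentages from the raw counts. It assumes that the attack success rate (ASR) used later on this page is the share of attacks that were not blocked, which is consistent with the 1.1% overall figure in the variance table.

```python
# Headline run-013 counts quoted above.
attacks_total = 280
attacks_blocked = 277
benign_total = 50
false_positives = 0

block_rate = attacks_blocked / attacks_total
asr = 1 - block_rate  # assumed definition: share of attacks not blocked
fp_rate = false_positives / benign_total

print(f"blocked: {block_rate * 100:.1f}%")  # 98.9%
print(f"ASR:     {asr * 100:.1f}%")         # 1.1%
print(f"FP:      {fp_rate * 100:.1f}%")     # 0.0%
```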
What you need
- A clean OpenParallax checkout at the same commit (or later) that produced run-013. The frozen result JSONs live at eval-results/runs/run-013/.
- A workspace initialized via openparallax init (named anything; the examples below use "atlas").
- Two LLM API keys: one for the chat role and one for the Shield evaluator role. Cross-provider is recommended for diversity (an attack against one model is less likely to also work against another).
- The ONNX classifier installed (run-013 was produced with the classifier enabled).
- The default policy (security/shield/default.yaml, the one shipped with init).
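The file-based prerequisites above can be checked before starting. The following is a minimal sketch, not part of OpenParallax: the helper name preflight is hypothetical, and it only confirms that the paths mentioned on this page exist. It does not check API keys or the classifier install.

```python
from pathlib import Path


def preflight(repo_root: Path, workspace: Path) -> list[str]:
    """Return human-readable problems; an empty list means every path was found."""
    expected = {
        "frozen run-013 results": repo_root / "eval-results" / "runs" / "run-013",
        "workspace directory": workspace,
        "default Shield policy": workspace / "security" / "shield" / "default.yaml",
    }
    return [
        f"{label} not found: {path}"
        for label, path in expected.items()
        if not path.exists()
    ]


# Example: run from the repository root with the "atlas" workspace used on this page.
for problem in preflight(Path("."), Path.home() / ".openparallax" / "atlas"):
    print("MISSING:", problem)
```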
Step-by-step
1. Build everything
git clone https://github.com/openparallax/openparallax.git
cd openparallax
make build-all
go build -o dist/openparallax-eval ./cmd/eval
2. Initialize a workspace
./dist/openparallax init
The wizard prompts you through:
| Step | Run-013 setting |
|---|---|
| Agent name | atlas |
| LLM provider (chat role) | anthropic |
| Chat model | claude-sonnet-4-6-20250514 |
| Shield evaluator provider | openai (or any OpenAI-compatible endpoint) |
| Shield evaluator model | claude-haiku-4-5-20251001 (run-013 used Haiku via an OpenAI-compatible proxy) |
| Embedding provider | optional, skip if you only care about reproducing the eval |
| Workspace path | ~/.openparallax/atlas (default) |
| Download Tier 1 classifier | Yes — base model (~700MB) |
If your provider differs, override at run time with --model, --base-url, --api-key-env. The exact LLM matters for Tier 2 verdicts — different models will produce slightly different ALLOW/BLOCK/ESCALATE decisions on the ambiguous edge cases.
3. Verify the workspace config
./dist/openparallax doctor
Check for:
- Shield: policy loaded, Tier 2: 100/day budget
- Tier 1: classifier enabled (local mode, 7 action type(s) bypassed)
- Sandbox: ... (any mode)
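If Tier 1 reports heuristic-only, the classifier is not active. The install command that previously appeared here has been removed (see the roadmap for the sidecar); follow the "Classifier not connected" item in the list of common causes below, then restart.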
The skip list of 7 action types (write_file, delete_file, move_file, copy_file, send_email, send_message, http_request) is the default. Do not override shield.classifier_skip_types in your config.yaml if you want to match run-013.
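A small sketch of how the skip list could be compared against the default. The function name is hypothetical, and it assumes you have already parsed config.yaml yourself and pass in the value of shield.classifier_skip_types (or None when the key is absent, in which case the default is assumed to apply).

```python
# The seven default skip types listed above.
DEFAULT_SKIP_TYPES = frozenset({
    "write_file", "delete_file", "move_file", "copy_file",
    "send_email", "send_message", "http_request",
})


def skip_list_matches_default(configured) -> bool:
    """configured: the parsed shield.classifier_skip_types value, or None if unset."""
    if configured is None:
        return True  # assumption: an absent key means the default list is used
    return frozenset(configured) == DEFAULT_SKIP_TYPES


print(skip_list_matches_default(None))  # key not set
print(skip_list_matches_default([]))    # explicit empty override
```

An explicit empty list is reported as a mismatch, which corresponds to the "Custom skip list" cause described later on this page.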
4. Run all 10 suites
OUT="eval-results/playground/repro-$(date +%Y%m%d)"
mkdir -p "$OUT"
for suite in eval-results/test-suite/c1_direct_injection.yaml \
             eval-results/test-suite/c2_indirect_injection.yaml \
             eval-results/test-suite/c3_multistep_context.yaml \
             eval-results/test-suite/c4_toolchain_attacks.yaml \
             eval-results/test-suite/c5_encoding_obfuscation.yaml \
             eval-results/test-suite/c6_multiagent.yaml \
             eval-results/test-suite/c7_validator_targeted.yaml \
             eval-results/test-suite/c8_helpfulness_bypass.yaml \
             eval-results/test-suite/c9_tier3_human_approval.yaml \
             eval-results/test-suite/fp_false_positives.yaml; do
  name=$(basename "$suite" .yaml)
  echo "=== $name ==="
  ./dist/openparallax-eval \
    --suite "$suite" \
    --config C \
    --mode inject \
    --workspace ~/.openparallax/atlas \
    --output "$OUT/$name.json"
done
Total elapsed time: 8-12 minutes, depending on Tier 2 LLM latency. The bottleneck is Tier 2: ~70 calls × ~2s each.
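If you would rather drive the ten suites from Python (for example to add timing or retries), the loop above can be sketched as follows. The flags and paths are copied from the shell loop; the function names are hypothetical, and the script assumes it is run from the repository root after step 1 has built dist/openparallax-eval.

```python
import subprocess
from pathlib import Path

SUITES = [
    "c1_direct_injection", "c2_indirect_injection", "c3_multistep_context",
    "c4_toolchain_attacks", "c5_encoding_obfuscation", "c6_multiagent",
    "c7_validator_targeted", "c8_helpfulness_bypass", "c9_tier3_human_approval",
    "fp_false_positives",
]


def build_command(suite: str, out_dir: Path, workspace: Path) -> list[str]:
    """Build the same argv that the shell loop passes to openparallax-eval."""
    return [
        "./dist/openparallax-eval",
        "--suite", f"eval-results/test-suite/{suite}.yaml",
        "--config", "C",
        "--mode", "inject",
        "--workspace", str(workspace),
        "--output", str(out_dir / f"{suite}.json"),
    ]


def run_all(out_dir: Path, workspace: Path) -> None:
    """Run every suite in order; unlike the shell loop, stop at the first failure."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for suite in SUITES:
        print(f"=== {suite} ===")
        subprocess.run(build_command(suite, out_dir, workspace), check=True)


# Example call (commented out so that importing this file runs nothing):
# run_all(Path("eval-results/playground/repro-manual"),
#         Path.home() / ".openparallax" / "atlas")
```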
5. Diff against the published run
The frozen run-013 results are stored as JSON. A simple per-suite diff:
# Assumes $OUT is still set from step 4.
for s in c1_direct_injection c2_indirect_injection c3_multistep_context \
         c4_toolchain_attacks c5_encoding_obfuscation c6_multiagent \
         c7_validator_targeted c8_helpfulness_bypass c9_tier3_human_approval \
         fp_false_positives; do
  echo "=== $s ==="
  python3 -c "
import json
mine = json.load(open('$OUT/$s.json'))
ref = json.load(open('eval-results/runs/run-013/$s.json'))
m_asr = mine['summary']['overall_asr']
r_asr = ref['summary']['overall_asr']
m_fp = mine['summary']['false_positive_rate']
r_fp = ref['summary']['false_positive_rate']
print(f' ASR: {m_asr*100:.1f}% (run-013: {r_asr*100:.1f}%)')
print(f' FP: {m_fp*100:.1f}% (run-013: {r_fp*100:.1f}%)')
"
done
What "matching" means
Exact byte equality is not the goal — Tier 2 LLM calls are non-deterministic. The same evaluator can give a slightly different verdict on edge cases between runs. What you should match:
| Metric | Run-013 | Acceptable variance |
|---|---|---|
| C1-C7 ASR | 0.0% | exact, these resolve at Tier 0 or Tier 1 deterministic paths |
| C8 ASR | 1.7% (1/60) | ±2% — Tier 2 judgment calls |
| C9 ASR | 0.0% | exact, deterministic policy resolution |
| Overall attack ASR | 1.1% | ±2% |
| FP rate | 0.0% | ±2% — also Tier 2 dependent |
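The table can be turned into an automated check. This is a sketch under two assumptions: "±2%" is read as two percentage points of absolute deviation, and the summary values are fractions (step 5 multiplies them by 100 before printing). The names EXPECTED and matches_run_013 are illustrative, not part of the eval tool.

```python
DETERMINISTIC = [
    "c1_direct_injection", "c2_indirect_injection", "c3_multistep_context",
    "c4_toolchain_attacks", "c5_encoding_obfuscation", "c6_multiagent",
    "c7_validator_targeted", "c9_tier3_human_approval",
]

# suite -> (summary key, run-013 value in %, allowed deviation in percentage points)
EXPECTED = {name: ("overall_asr", 0.0, 0.0) for name in DETERMINISTIC}
EXPECTED["c8_helpfulness_bypass"] = ("overall_asr", 100 * 1 / 60, 2.0)
EXPECTED["fp_false_positives"] = ("false_positive_rate", 0.0, 2.0)


def matches_run_013(suite: str, summary: dict) -> bool:
    """True if the suite's metric is within the tolerance given in the table."""
    key, reference_pct, tolerance_pts = EXPECTED[suite]
    observed_pct = summary[key] * 100  # assumed: rates are stored as fractions
    # Small epsilon so a value exactly on the boundary is not lost to rounding.
    return abs(observed_pct - reference_pct) <= tolerance_pts + 1e-9


# Example: two successful C8 attacks out of 60 is still inside the band.
print(matches_run_013("c8_helpfulness_bypass", {"overall_asr": 2 / 60}))
```

Pass the "summary" object from each result JSON; the overall attack ASR row is not covered here because it spans several suites.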
If your numbers are dramatically different, the most common causes are:
- Wrong workspace policy: your security/shield/default.yaml was edited or you copied a different one. The eval uses the workspace's security/shield/default.yaml. Run diff ~/.openparallax/atlas/security/shield/default.yaml internal/templates/files/security/shield/default.yaml to verify.
- Classifier not connected: openparallax doctor shows heuristic-only. If you want ML classification, configure classifier_enabled: true and point classifier_addr at a running classifier sidecar, then restart.
- Different evaluator model: run-013 used claude-haiku-4-5-20251001. A weaker or stronger model produces different judgments on the ~5-10 cases that reach Tier 2.
- Custom skip list: the default is 7 action types. If you set shield.classifier_skip_types: [], ONNX runs on every action type and you'll see the run-010-era 36% FP rate (and possibly slightly better ASR).
- Modified test cases: the suite at HEAD may differ from run-013. To match exactly, check out the commit that produced run-013; git log --all -- eval-results/runs/run-013/ lists the commits that touched the frozen results.
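For the first cause, the policy comparison can also be done from Python when you want the differing lines inside a script. This is a rough equivalent of the diff command above using the standard difflib module; the function name policy_diff is hypothetical.

```python
import difflib
from pathlib import Path


def policy_diff(workspace_policy: Path, shipped_policy: Path) -> list[str]:
    """Return unified-diff lines; an empty list means the two files are identical."""
    mine = workspace_policy.read_text().splitlines()
    shipped = shipped_policy.read_text().splitlines()
    return list(difflib.unified_diff(
        mine, shipped,
        fromfile=str(workspace_policy),
        tofile=str(shipped_policy),
        lineterm="",
    ))


# Example with the paths from the list above (commented out; run from the repo root):
# for line in policy_diff(
#     Path.home() / ".openparallax/atlas/security/shield/default.yaml",
#     Path("internal/templates/files/security/shield/default.yaml"),
# ):
#     print(line)
```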
What if your numbers are better?
Submit a PR. Include:
- Your full result JSONs in eval-results/playground/<your-name>-<date>/
- A short note explaining what changed (different model, tightened policy, new heuristic rule)
- Confirmation that the FP rate didn't get worse
This is exactly the kind of contribution we want.
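Before opening a PR, the "better ASR, FP not worse" condition can be checked mechanically. The sketch below compares two "summary" objects using the keys from step 5; the function name is hypothetical. Because Tier 2 verdicts vary between runs, a small difference from a single run may be noise.

```python
def is_improvement(mine: dict, ref: dict) -> bool:
    """True if ASR is strictly lower and the false positive rate is not higher."""
    asr_better = mine["overall_asr"] < ref["overall_asr"]
    fp_not_worse = mine["false_positive_rate"] <= ref["false_positive_rate"]
    return asr_better and fp_not_worse


# Example with made-up numbers for illustration only (not real results).
reference = {"overall_asr": 0.02, "false_positive_rate": 0.0}
candidate = {"overall_asr": 0.01, "false_positive_rate": 0.0}
print(is_improvement(candidate, reference))
```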
See also
- Run History — all 13 historical runs and what changed between them
- Reports — narrative writeups of every architectural decision
- Methodology — what configs A/B/C measure