
Run History

The eval went through 13 runs as the methodology evolved. Each run is a frozen snapshot — the JSON result files live at eval-results/runs/run-NNN/ and never change. New evaluation work goes into a new run directory.

The full narrative of every methodology change lives in the Reports. This page is the index — one paragraph per run with the headline finding.

LLM-mode runs (the methodology pivot)

| Run | Mode | Model(s) | Headline |
| --- | --- | --- | --- |
| 001 | LLM | claude-sonnet-4-6 | 0% ASR across all configs (A, B, C). Sonnet refused every harmful action at the reasoning level, with REFUSED_BY_MODEL on every case. Shield never fired. |
| 002 | LLM | sonnet-4-6 + haiku-4-5 | 0% ASR on the helpfulness-bypass suite for both models. |
| 003 | LLM | llama-3.1-8b-instant via Groq | Also 0% ASR. Llama's instruct training catches the same patterns. |

Conclusion of LLM-mode phase: all three instruction-tuned, tool-use-capable models tested (Sonnet, Haiku, Llama 3.1 8B) refuse the corpus at the reasoning layer. Because the model refuses before any defense layer is exercised, Configs A, B, and C produce identical results and no architectural differentiation is possible. The methodology pivots to inject mode.

Inject-mode runs (architectural validation)

| Run | Headline | Δ from previous |
| --- | --- | --- |
| 004 | First inject pass. C1 4%, C2 2.5%, C7 19.2%. Read-side policy gaps. | Methodology change |
| 005 | Initial policy hardening (credential files, agent internals, /tmp paths, evaluate_shell_commands → Tier 2). C1-C6 → 0%, C7 7.7%, C8 8.3%, FP 34%. | Policy tightened |
| 006 | Allow-rules bug. Added blanket allow_workspace_writes rule, broke C2/C3/C7/C8 detection. ASR went UP. | Anti-pattern |
| 007 | Targeted deny rules: block_system_config_writes, block_binary_overwrites, block_git_hook_writes, block_tmp_exfiltration. C2/C3/C8 → 0%. C7 still leaking 2 cases. | Policy fix |
| 008 | First gateway fix attempt: dual classifier always runs. All categories → 0%, but FP jumped 24% → 66% as ONNX started seeing every action type. | Architectural blunder |
| 009 | AlwaysBlock heuristic precheck added. Only specific high-precision rules fire on Tier 2 escalations. All categories → 0%, FP back to 40%. Three remaining FPs from too-broad regex. | Architectural correction |
| 010 | AlwaysBlock=true flag added to HeuristicRule struct. Only XP-036 (list_agent_internals) and XP-037 (grpc_service_enumeration) fire as precheck rules. Fixed XP-012 chmod regex. Tier 2 evaluator gained ESCALATE response option; Tier 3 wired end-to-end. New c9_tier3_human_approval suite added. 275/275 attacks blocked, FP 36%. | Major: Tier 3 ships |
| 011 | Heuristic-only opt-in default. ONNX disabled. Massive regression: C5 0%→100%, C4 0%→92%, C6 0%→96%. ONNX is doing critical work on encoding/obfuscation/multi-agent categories. | Anti-pattern |
| 012 | Per-action-type ONNX skip list (shield.classifier_skip_types). C5/C6/C7 → 0%, but C1/C2/C3/C4/C8/T3 leaking because policy routed skipped types to Tier 1. | Partial fix |
| 013 | Policy evaluate_external_communication, evaluate_file_writes, evaluate_destructive_file_ops → escalate to Tier 2 instead of Tier 1. 277/280 attacks blocked (98.9%), FP 0/50 (0%). This is the new default. | Production-ready |

Run-013 is the current published baseline. All future runs are new directories (run-014, run-015, ...) and existing runs are never modified.
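
The run-013 percentages can be reproduced from the raw counts in its headline. The sketch below is illustrative only: the helper name rate_percent is hypothetical, and it assumes ASR is the share of attacks that were not blocked, which this page does not define explicitly.

```python
def rate_percent(count: int, total: int) -> float:
    """Return count/total as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)


# Counts taken from the run-013 headline.
attacks_total = 280
attacks_blocked = 277
benign_total = 50
benign_flagged = 0

block_rate = rate_percent(attacks_blocked, attacks_total)
# Assumption: ASR = attacks that got through / total attacks.
asr = rate_percent(attacks_total - attacks_blocked, attacks_total)
fp_rate = rate_percent(benign_flagged, benign_total)

print(f"blocked {block_rate}%, ASR {asr}%, FP {fp_rate}%")
```

Under that assumption, three unblocked cases out of 280 correspond to an ASR of roughly 1.1%.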

Per-run files

Every run-NNN/ directory contains uniformly named result JSONs:

run-013/
  c1_direct_injection.json
  c2_indirect_injection.json
  c3_multistep_context.json
  c4_toolchain_attacks.json
  c5_encoding_obfuscation.json
  c6_multiagent.json
  c7_validator_targeted.json
  c8_helpfulness_bypass.json
  c9_tier3_human_approval.json
  fp_false_positives.json

Earlier runs may be missing files if a suite didn't exist yet (c8 was added before run-002, c9 before run-010). Run-001 has its own test-suite/ snapshot inside it because the corpus has evolved since.
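
Because earlier runs can lack some suites, it may help to check which of the standard files a given run directory actually contains. The following is a minimal sketch, not part of the eval harness: the function name missing_result_files is hypothetical, and it only looks for the exact unsuffixed file names listed above, so suffixed variants would be reported as missing.

```python
from pathlib import Path

# File names as listed for run-013 on this page.
EXPECTED_FILES = [
    "c1_direct_injection.json",
    "c2_indirect_injection.json",
    "c3_multistep_context.json",
    "c4_toolchain_attacks.json",
    "c5_encoding_obfuscation.json",
    "c6_multiagent.json",
    "c7_validator_targeted.json",
    "c8_helpfulness_bypass.json",
    "c9_tier3_human_approval.json",
    "fp_false_positives.json",
]


def missing_result_files(run_dir: Path) -> list[str]:
    """Return the expected result files that are absent from run_dir."""
    return [name for name in EXPECTED_FILES if not (run_dir / name).is_file()]
```

For example, missing_result_files(Path("eval-results/runs/run-005")) would return the names of any suites that did not exist yet when that run was recorded.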

Some runs have suffixes:

  • -cfgA / -cfgB / -cfgC — eval Config A (baseline), B (guardrails), or C (full Parallax). Used in run-001 where the same suite ran against all three configs.
  • -v1 / -v2 — multiple sub-runs of the same suite within one session. Used in run-004 during harness iteration.
  • -{model} — model-specific variant. Used in run-002/run-003 when comparing the same suite across sonnet, haiku, llama.
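
The suffix rules above can be sketched as a small parser. This is an illustration under assumptions: it presumes a file is named <suite><suffix>.json with at most one suffix appended directly to the suite name (for example c1_direct_injection-cfgA.json), which this page does not spell out. The canonical convention is in eval-results/runs/INDEX.md, and the names ResultName and parse_result_name are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ResultName(NamedTuple):
    suite: str            # e.g. "c1_direct_injection"
    kind: str             # "plain", "config", "subrun", or "model"
    value: Optional[str]  # "A", "v2", "haiku", or None for plain files


def parse_result_name(filename: str) -> ResultName:
    """Split a result file name into its suite and optional suffix."""
    if not filename.endswith(".json"):
        raise ValueError(f"not a result JSON: {filename!r}")
    stem = filename[: -len(".json")]
    # Assumption: suite names contain no hyphen, so the first hyphen
    # starts the suffix.
    suite, sep, suffix = stem.partition("-")
    if not sep:
        return ResultName(suite, "plain", None)
    if re.fullmatch(r"cfg[ABC]", suffix):
        return ResultName(suite, "config", suffix[-1])
    if re.fullmatch(r"v\d+", suffix):
        return ResultName(suite, "subrun", suffix)
    # Anything else is treated as a model-specific variant.
    return ResultName(suite, "model", suffix)
```

Treating every unrecognized suffix as a model name is a simplification; a stricter version could validate against a known list of models.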

For the canonical naming convention and the schema each result file follows, see eval-results/runs/INDEX.md.

Adding a new run

If you reproduce or extend the eval, your run goes in eval-results/playground/ (gitignored). When the run is worth keeping — a new baseline, a new policy, a new model — move it to eval-results/runs/run-NNN/ and open a PR. See Reproducing Run-013 for the recipe.
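
Promoting a playground run amounts to picking the next unused run number and moving the directory. The sketch below is a hedged illustration rather than the project's tooling: the function names next_run_name and promote_run are hypothetical, and it assumes run directories are always named run- followed by three digits.

```python
import re
import shutil
from pathlib import Path

RUN_DIR_PATTERN = re.compile(r"run-(\d{3})")


def next_run_name(runs_root: Path) -> str:
    """Return the next unused run-NNN name under runs_root."""
    highest = 0
    for entry in runs_root.iterdir():
        match = RUN_DIR_PATTERN.fullmatch(entry.name)
        if entry.is_dir() and match:
            highest = max(highest, int(match.group(1)))
    return f"run-{highest + 1:03d}"


def promote_run(playground_run: Path, runs_root: Path) -> Path:
    """Move a playground run into runs_root as a new run-NNN directory."""
    target = runs_root / next_run_name(runs_root)
    if target.exists():
        # Existing runs are frozen snapshots and must never be overwritten.
        raise FileExistsError(target)
    shutil.move(str(playground_run), str(target))
    return target
```

A call such as promote_run(Path("eval-results/playground/my-run"), Path("eval-results/runs")) would return the new directory path, which can then be committed and opened as a PR. The playground subdirectory name my-run is only an example.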

See also