The Challenge
Anthropic's VLIW CPU performance take-home: optimize a tree traversal kernel for a simulated 5-engine VLIW processor. The goal is to minimize clock cycles by exploiting instruction-level parallelism, hiding memory latency, and keeping all execution units busy.
I achieved 1,137 cycles*, beating Claude Opus 4.5's best published result of 1,363 cycles, despite having zero prior knowledge of kernel optimization or VLIW architectures.
At the time of submission (January 24th, 2025), this result was tied for 11th in the world and was the best among submitters with no domain knowledge.
*The 1,137-cycle version uses beam search and omits index storage, and the beam search caused it to time out, so it couldn't be submitted as-is. The submittable version (which stores indices) achieves 1,152 cycles.
What is a kernel? How is this one different?
What's a "kernel" in this context?
A kernel is a small piece of code that runs over and over, usually the innermost loop of some larger program. If your code spends 90% of its time in one function, that function is the kernel. The term comes from GPU programming, where you write a kernel and the GPU runs it across thousands of threads. Matrix multiply, image blur, sorting algorithms: these are all kernels.
Optimizing a kernel means understanding what the hardware can actually do each cycle, and making sure you're using all of it.
What does this kernel do?
We have 256 items that each walk down a binary tree (each node has two children). The tree is 10 levels deep with 2,047 nodes. At each node:
- Hash the current value (scramble it into a new number)
- Look at one bit of that hash to go left or right
- Step down to that child; after stepping past the bottom, wrap back to the root
Each hash-and-step is one "round". Every one of the 256 items goes through 16 rounds, and the goal is to process all of them in as few cycles as possible.
How is this different from normal optimization?
Normally you optimize for real hardware (GPUs, x86 CPUs) and the compiler does the hard scheduling work. Here, you're writing code for a made-up processor, and there's no compiler. You have to do the scheduling yourself.
VLIW (Very Long Instruction Word): the processor runs multiple operations per cycle, but only if you pack them together into "bundles" yourself. Empty slots = wasted cycles.
SIMD (Single Instruction, Multiple Data): one instruction works on 8 values at once. Adding 8 numbers takes the same time as adding 1.
| Typical GPU/CPU Kernel | This VLIW Simulator |
|---|---|
| Hardware figures out what can run together | You manually pack operations into slots |
| Thousands of threads hide wait time | Only 8-wide SIMD, so you interleave work yourself |
| Can load scattered memory locations easily | Can only load consecutive addresses |
| Big register file | 1,536-word scratch space (registers + cache, combined) |
| Compiler schedules for you | You are the compiler |
So it's not just "make it fast." You have to figure out how to use a weird architecture that nobody's written a compiler for. That means:
- Packing independent ops into VLIW slots (12 scalar math, 6 vector math, 2 load, 2 store, 1 control-flow per cycle)
- Doing useful work while waiting for memory loads to finish
- Deciding what lives in scratch vs. what gets recomputed
- Changing the algorithm to fit the machine (like replacing random loads with math)
It's a test of whether you can think at the hardware level and keep every part of the processor busy.
AI-Assisted Methodology
This result was achieved through a systematic multi-agent approach using Claude Code (Opus), Codex 5.2 xhigh, and GPT 5.2 Pro.
Explore the Visualization
Click any card below to dive into different aspects of the optimization journey:
Detailed Technical Writeup
Full optimization analysis
1. The True Bottleneck: Cycles = Bundles
The machine model is a resource-constrained scheduler, not a scalar CPU. Each cycle executes one VLIW bundle containing up to:
- ALU: 12 slots/cycle (scalar ops on scratch)
- VALU: 6 slots/cycle (vector ops, VLEN=8 lanes)
- LOAD: 2 slots/cycle (scalar load / vload / const)
- STORE: 2 slots/cycle
- FLOW: 1 slot/cycle (select/vselect/jumps/pause)
All reads happen before any writes commit within a cycle. No same-cycle forwarding: if op B needs the value from op A, they must be in different cycles.
The optimization problem is: minimize number of cycles subject to (a) slot limits per engine per cycle, (b) dependency constraints between ops (RAW, WAW/WAR), (c) scratch capacity and aliasing constraints.
Per-Engine Throughput Floors
If you have N_valu vector ops, even with perfect scheduling: cycles ≥ ⌈N_valu / 6⌉. Similarly for loads: cycles ≥ ⌈N_load / 2⌉. Runtime is bounded below by the max of these floors plus unavoidable structural bubbles from dependencies.
The optimizations repeatedly: (1) reduce ops in the limiting engine, and/or (2) increase packing efficiency so actual cycles approach the max-engine floor.
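As a rough illustration of how these floors combine, a helper like the following computes the bound. The VALU count matches the final analysis later in this writeup; the other op counts are made-up placeholders.

```python
import math

# Per-cycle slot limits of the simulated machine (see the engine list above).
SLOTS = {"alu": 12, "valu": 6, "load": 2, "store": 2, "flow": 1}

def cycle_floor(op_counts):
    """Lower bound on cycles: the busiest engine's ceil(ops / slots)."""
    return max(math.ceil(n / SLOTS[engine]) for engine, n in op_counts.items())

# Hypothetical op mix: the VALU floor dominates, so shaving VALU ops (or moving
# work onto ALU/FLOW) is what actually lowers the bound.
print(cycle_floor({"valu": 6563, "alu": 4000, "load": 900, "store": 70, "flow": 700}))  # 1094
```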
2. Baseline: Why 147K Cycles
The reference kernel processes each element sequentially:
- idx = indices[i], val = values[i]
- node_val = tree[idx]
- val = myhash(val ^ node_val)
- idx = 2*idx + (1 if val even else 2)
- if idx >= n_nodes: idx = 0
- write back values[i], indices[i]
Compiled mechanically, this produces heavy LOAD/STORE traffic every round, a serial hash dependency chain per element, per-element control flow, and very little exploited ILP.
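A minimal Python model of that reference loop, for orientation (myhash here is a placeholder stand-in; the take-home defines its own multi-stage hash):

```python
def myhash(a):
    # Placeholder only: the real take-home specifies a particular multi-stage hash.
    return (a * 2654435761 + 0x9E3779B9) & 0xFFFFFFFF

def reference_kernel(values, indices, tree, n_rounds=16):
    """One node step per element per round, processed fully sequentially."""
    n_nodes = len(tree)
    for _ in range(n_rounds):
        for i in range(len(values)):                    # 256 elements
            idx = indices[i]
            val = values[i]
            node_val = tree[idx]                        # the scattered load (a gather)
            val = myhash(val ^ node_val)                # serial hash chain
            idx = 2 * idx + (1 if val % 2 == 0 else 2)  # step left or right
            if idx >= n_nodes:                          # stepped past the leaves: wrap to root
                idx = 0
            values[i] = val                             # per-round store
            indices[i] = idx                            # per-round store
    return values, indices
```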
3. The Four Structural Wins
- SIMD tiling: 8 elements per VALU op → ~8× reduction in arithmetic work
- Scratch residency: values live in scratch across rounds → eliminate almost all per-round value loads/stores
- Replace early gathers: preloaded / arithmetic / vselect selection at depths 0–3 → huge reduction in load pressure
- Real list scheduler: packs independent ops across tiles/rounds → converts ILP into fewer bundles
4. SIMD Tiling + Scratch Residency
VLEN=8, batch_size=256, so n_tiles=32. Tile t corresponds to values in lanes [8t...8t+7]. vals_scratch stores all 256 values contiguously.
The kernel:
- vloads input values into vals_scratch in init
- operates only on vals_scratch for all rounds
- vstores results back only at the end
This kills the most crippling baseline behavior: per-round loads/stores of values. Loads are scarce (2/cycle). Eliminating "load values, store values" per element per round removes a massive bandwidth floor.
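A NumPy-flavored sketch of the layout, purely to illustrate the data movement (the real kernel emits VLIW bundles, and the per-tile update below is a stand-in for the actual hash/index work):

```python
import numpy as np

VLEN, BATCH = 8, 256
N_TILES = BATCH // VLEN                        # 32 tiles

values = np.arange(BATCH, dtype=np.uint32)     # placeholder inputs

# init: vload all 256 values into scratch once; tile t owns elements [8t .. 8t+7]
vals_scratch = values.reshape(N_TILES, VLEN).copy()

for _ in range(16):
    for t in range(N_TILES):
        # all per-round work reads and writes scratch only; no value loads/stores
        vals_scratch[t] = vals_scratch[t] * np.uint32(4097) + np.uint32(12345)

# only at the very end: vstore the results back to memory
values_out = vals_scratch.reshape(BATCH)
```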
5. Compile-Time Control Flow
The reference kernel has: idx = 0 if idx >= n_nodes else idx. A faithful implementation needs compare + conditional select in the dependency chain feeding the next gather address.
Key observation: traversal depth is deterministic by round number. In a perfect binary tree of height H=10, depth increases by 1 each round; after processing leaves (depth 10), the next update wraps.
Rather than compute wrap dynamically, treat "depth per round" as known. At leaf depth, simply set do_idx_update=False. The next round is depth0 which ignores idx and uses root value directly. The wrap check is replaced with compile-time structure—no runtime compare/select.
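A sketch of how the generator can bake the depth schedule in at emission time (this assumes, as the argument above requires, that every index starts at the root; emit_round is a hypothetical emitter name):

```python
H = 10                 # perfect binary tree: leaves live at depth 10
N_ROUNDS = 16

# Depth is a pure function of the round number: 0, 1, ..., 10, then wrap to 0.
depth_for_round = [r % (H + 1) for r in range(N_ROUNDS)]
# -> [0, 1, 2, ..., 10, 0, 1, 2, 3, 4]   (depth 3 shows up twice: rounds 3 and 14)

for r, depth in enumerate(depth_for_round):
    do_idx_update = depth < H     # at leaf depth, skip the update entirely;
                                  # the following depth-0 round ignores idx anyway
    # emit_round(r, depth, do_idx_update=do_idx_update)
```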
6. Hash Optimization
6.1 Algebraic Fusion
Hash stages include patterns like (a + C) + (a << k) which equals a * (1 + 2^k) + C. So stages 0, 2, 4 become:
- stage0: a*4097 + c1
- stage2: a*33 + c3
- stage4: a*9 + c5
The machine has multiply_add(dest, a, b, c) = a*b + c lane-wise. Those stages emit as one VALU op each instead of (add + shift + add). Direct reduction in N_valu.
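The identity also holds exactly in fixed-width lane arithmetic, since both forms reduce the same expression mod 2^w. A quick check (the additive constant here is an arbitrary placeholder, not the kernel's real c1/c3/c5):

```python
MASK = (1 << 32) - 1                              # assume 32-bit lanes

def unfused(a, k, c):
    return ((a + c) + (a << k)) & MASK            # add + shift + add

def fused(a, k, c):
    return (a * (1 + (1 << k)) + c) & MASK        # one multiply_add: a*(1+2^k) + c

for k, mult in [(12, 4097), (5, 33), (3, 9)]:     # stage0, stage2, stage4 multipliers
    assert 1 + (1 << k) == mult
    for a in (0, 1, 0xDEADBEEF, MASK):
        assert unfused(a, k, 0x1234) == fused(a, k, 0x1234)
```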
6.2 Offload Constant Ops to ALU
In the final regime the kernel is VALU-bound (~6563 VALU ops → theoretical min ~1094 cycles). Many hash stages include "val op constant" patterns that can instead be done per-lane in the ALU: the constant XORs in stages 1 and 5, and the constant add in stage 3.
With alu_xor_offload=True and alu_add_offload=True, stage1 becomes:
- compute tmp = val >> 19 in VALU (vector op)
- do val ^= c2 via 8 scalar ALU XORs (one per lane)
- do val ^= tmp via VALU
Why this wins even though it's "more ops": ALU has 12 slots/cycle, VALU only 6. The ALU operations can be scheduled in the same cycles as VALU ops from other tiles/stages. You trade 1 VALU op for 8 ALU ops. If ALU isn't saturated, this moves work off the critical resource.
alu_add_offload removes ~1 VALU op per tile per round for stage3, which is ~512 VALU ops total across 32 tiles × 16 rounds → ~85 cycles of VALU floor reduction.
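Back-of-the-envelope accounting for why the trade wins (the ~6,563 VALU-op total appears in the final analysis below; the offload counts follow directly from the text above):

```python
N_TILES, N_ROUNDS = 32, 16
VALU_SLOTS, ALU_SLOTS = 6, 12

# stage3 offload: 1 VALU op per tile per round becomes 8 scalar ALU ops
moved_valu_ops = N_TILES * N_ROUNDS               # 512
added_alu_ops = moved_valu_ops * 8                # 4096

valu_floor_saved = moved_valu_ops / VALU_SLOTS    # ~85 cycles off the VALU floor
alu_capacity_used = added_alu_ops / ALU_SLOTS     # ~341 cycles of ALU capacity,
                                                  # effectively free while ALU isn't saturated
print(round(valu_floor_saved), round(alu_capacity_used))   # 85 341
```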
6.3 Stage-Major Interleaving
Each tile's hash is a strict dependency chain: stage0 → stage1 → stage2 → stage3 → stage4 → stage5. If you emit "do tile0 all stages, then tile1 all stages..." you're constantly blocked by RAW dependencies—the scheduler can't fill VALU slots because the next op for that tile isn't ready until the previous stage commits.
emit_hash_interleaved emits stage-major order: stage0 for all tiles, then stage1 shift for all tiles, then stage1 constant-xor for all tiles, then stage1 xor-with-tmp for all tiles, then stage2 for all tiles...
This (1) increases distance between dependent ops of the same tile so the scheduler has many ready ops each cycle, and (2) aligns with throughput: at any point there are enough independent VALU ops to fill 6 slots/cycle.
Why per-tile tmp vectors are mandatory: interleaving only works if temporaries don't alias. If tiles share the same tmp buffer, you get WAW hazards. So allocate 32 separate VLEN vectors, one per tile. It's the classic register-allocation-for-ILP tradeoff: spend scratch capacity to buy parallelism.
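A sketch of the two emission orders; the ops are just (tile, stage) labels, since the point is only the order the scheduler sees:

```python
N_TILES = 32
STAGES = ["s0", "s1_shift", "s1_xor_const", "s1_xor_tmp", "s2", "s3", "s4", "s5"]

def tile_major():
    # tile0's whole chain, then tile1's, ...: consecutive ops depend on each
    # other, so few ops are ready at any moment and VALU slots go unfilled.
    return [(t, s) for t in range(N_TILES) for s in STAGES]

def stage_major():
    # stage s for every tile before stage s+1 for any tile: up to 32 independent
    # ops are ready at once, plenty to fill 6 VALU slots per cycle.
    return [(t, s) for s in STAGES for t in range(N_TILES)]

assert sorted(tile_major()) == sorted(stage_major())   # same work, different order
# Per-tile temporaries (tmp[t]) are what make the reorder legal: with one shared
# tmp buffer, two tiles' s1_shift / s1_xor_tmp ops would collide (WAW).
```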
7. Index Update
Index update sits on the critical path to future gathers, so it must be low-op-count and dependency-short.
Three Representations by Depth
- depth0: 1-bit (bit0). After hashing at root, compute bit0 = val & 1 and store it. Avoid idx = 2*idx + ... entirely.
- depth1: 2-bit partial index (0..3). idx_partial = 2*bit0 + bit1 via VALU multiply_add.
- depth2+: either 3-bit partial (if depth3 vselect is coming), or pointer-form absolute address for gather rounds.
Pointer-Form Recurrence
For depths ≥4, store idx_scratch as the absolute memory address of the node value: p = forest_values_p + idx. Loads become load node_val_lane = mem[p_lane]. No address-add required.
Reference update: idx_next = 2*idx + (bit + 1). In pointer form: p_next = 2p + (bit+1-base). So pointer update is: compute bit = val & 1, compute addend = (1-base) if bit==0 else (2-base), p = p*2 + addend.
FLOW vselect Trick
Computing addend naively would be a VALU add. But since bit ∈ {0,1}, addend is either (1-base) or (2-base). Use flow.vselect to pick between precomputed vectors. Saves a VALU op and consumes FLOW instead. Trade scarce VALU slots for slack resources.
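The pointer-form recurrence is pure algebra, so it can be sanity-checked directly (base here is an arbitrary placeholder for forest_values_p):

```python
base = 10_000                         # placeholder address of the node-value array

def idx_update(idx, bit):
    return 2 * idx + (bit + 1)        # reference form

def ptr_update(p, bit):
    addend = (1 - base) if bit == 0 else (2 - base)   # the two precomputed vselect options
    return 2 * p + addend             # p_next = 2p + (bit + 1 - base)

for idx in range(1023):
    for bit in (0, 1):
        assert base + idx_update(idx, bit) == ptr_update(base + idx, bit)
```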
8. Node Selection: Remove Gathers
The killer cost is gathering node values. No vector gathers; must do 8 scalar loads per tile. The only way to beat the load floor is to avoid loads at early depths via preloading + arithmetic selection.
Depth 0
Root node is a scalar load + vbroadcast once. In depth0 rounds, do val ^= root_node_vec in VALU. No gathers.
Depth 1
Nodes 1 and 2 are loaded once and broadcast. Selection based on bit0: either flow.vselect(bit0, n2, n1) or arithmetic: node = (n2-n1)*bit0 + n1 via multiply_add. Final config uses arithmetic because FLOW is scarce and depth3 vselect uses a lot of FLOW.
Depth 2
Nodes 3..6 preloaded and broadcast. 2-bit index (0..3). Let X = low bit, Y = high bit. Selection: choose base node via X, choose diff via X, final node = base + diff * Y. Uses two flow.vselect + one valu.multiply_add. Reduces VALU pressure relative to two multiply_adds.
Depth 3: The Big Win
Nodes 7..14 (8 nodes) are small enough to preload and select in registers.
Preload: vload nodes 7..14 into a temp vector, vbroadcast each scalar into node_vecs_d3[0..7].
Select: each lane has a 3-bit index (0..7). Use bits: bit0 = idx & 1, bit1 = idx & 2, bit2 = idx >> 2. Balanced tree of flow.vselect (sketched after this list):
- Pair select by bit0: p0 = select(bit0, n8, n7), p1 = select(bit0, n10, n9), ...
- Group select by bit1: g0 = select(bit1, p1, p0), g1 = select(bit1, p3, p2)
- Final select by bit2: node = select(bit2, g1, g0)
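A scalar sketch of that select tree for one lane, where select(cond, a, b) means "a if cond is nonzero else b":

```python
def select(cond, a, b):
    return a if cond else b            # stands in for flow.vselect on a single lane

def depth3_node(idx, n):               # n[0..7] hold the preloaded nodes 7..14
    bit0, bit1, bit2 = idx & 1, idx & 2, idx >> 2
    p0 = select(bit0, n[1], n[0])      # node 8 vs node 7
    p1 = select(bit0, n[3], n[2])      # node 10 vs node 9
    p2 = select(bit0, n[5], n[4])
    p3 = select(bit0, n[7], n[6])
    g0 = select(bit1, p1, p0)
    g1 = select(bit1, p3, p2)
    return select(bit2, g1, g0)

nodes = list(range(7, 15))             # pretend node i's value is just i
assert all(depth3_node(i, nodes) == 7 + i for i in range(8))
```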
Why this matters: A depth3 gather round costs 8 loads/tile × 32 tiles = 256 loads. LOAD has 2 slots/cycle → 128-cycle hard floor per depth3 round. Depth3 occurs twice in 16 rounds → 256 cycles minimum just for those gathers. Replacing with FLOW+VALU removes that 256-cycle floor.
Tiles are processed one at a time at depth3: FLOW is scarce, the vselect tree needs scratch temporaries, and emitting depth3 for many tiles with shared buffers creates WAW collisions. It's the classic software-pipelining / register-allocation tradeoff under scratch constraints.
9. Gather Rounds: Pipeline Loads with Compute
For depths ≥4, each tile needs 8 loads. If you emit "load all nodes for all tiles" then "xor+hash+idx update for all tiles", you get long stretches where either LOAD is saturated but VALU is underutilized, or vice versa.
The kernel builds a producer/consumer pipeline:
- Start loading the first tile's node vector
- As soon as a tile's node vector is available, enqueue its hash+idx ops in hash_queue
- While loading subsequent tiles lane-by-lane, periodically drain some hash ops from the queue (parameter hash_per_load_pair)
Loads are emitted in lane order; after every 2 loads (the 2 load slots per cycle), roughly hash_between VALU/ALU/FLOW ops are drained from the hash queue. This forces the stream to contain lots of independent work around the loads, which gives the list scheduler freedom to pack both engines every cycle.
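A sketch of that emission pattern (ops are just labels; hash_between is a stand-in name for whichever knob controls the drain rate, and the per-tile op count is arbitrary):

```python
from collections import deque

N_TILES, VLEN = 8 * 4, 8               # 32 tiles, 8 lanes each
hash_between = 3                       # hash ops drained per pair of loads

hash_queue = deque()                   # hash/idx ops for tiles whose nodes are loaded
stream = []                            # op stream handed to the list scheduler

for tile in range(N_TILES):
    loads = [("load", tile, lane) for lane in range(VLEN)]
    for i in range(0, VLEN, 2):        # 2 load slots per cycle, so emit loads in pairs
        stream.extend(loads[i:i + 2])
        for _ in range(min(hash_between, len(hash_queue))):
            stream.append(hash_queue.popleft())      # drain compute between load pairs
    # this tile's node vector is now (logically) loaded: queue its hash + idx ops
    hash_queue.extend(("hash_or_idx", tile, step) for step in range(10))

stream.extend(hash_queue)              # drain whatever compute remains at the end
```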
10. Wavefront Scheduling
Even with perfect per-round pipelining, cycles are wasted at round boundaries because different depths have different resource mixes (LOAD-heavy vs FLOW-heavy vs VALU-heavy). If all tiles are synchronized, you get phase behavior and bubbles.
Wavefront: partition tiles into wavefront_groups (20), scheduled so group g is approximately g rounds behind group 0. At a global cycle level, the program executes multiple rounds at once across different tile groups, like a diagonal wavefront in (tile_group, round) space.
Why it helps: different depths stress different engines. While one group is in a gather phase (keeping LOAD busy), another might be in a compute-only phase (keeping VALU busy). The scheduler sees more independent ops overall, which smooths utilization and reduces "low-VALU" and "low-LOAD" cycles.
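A sketch of the wavefront emission order over (tile_group, round) space, using the group and round counts described above:

```python
N_GROUPS, N_ROUNDS = 20, 16

# Group g runs roughly g rounds behind group 0, so at any global step the
# groups sit at different depths and stress different engines.
order = []
for step in range(N_ROUNDS + N_GROUPS - 1):          # sweep the diagonal
    for g in range(N_GROUPS):
        r = step - g
        if 0 <= r < N_ROUNDS:
            order.append((g, r))                     # emit round r of tile group g

# Sanity check: every (group, round) cell is emitted exactly once.
assert len(order) == len(set(order)) == N_GROUPS * N_ROUNDS
```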
11. Storing Results
Stores capped at 2/cycle, vstore writes 8 lanes at once. Final output is 32 vstores → at least 16 store cycles if serialized.
The kernel computes out_ptrs[tile] in init. With stream_store: if a tile is in its last round and its value is finalized, emit its vstore early so the scheduler can overlap it with other tiles' compute. Reduces tail effects.
12. The List Scheduler
A simple scanline packer only looks locally: it tries to add slots to the current bundle until a slot limit or RAW hazard triggers a flush. This leaves enormous performance on the table, because many later ops are independent and could fill the empty slots, but the packer never reaches forward to find them.
Dependency Graph Construction
Build a real dependency graph over ops:
- RAW: strict edge (must be later cycle)
- WAW: strict in final config
- WAR: weak edge (can be satisfied same cycle—all reads happen before writes commit)
Heuristics
- criticality = longest path length to a sink. Approximates "how urgent is this op for overall makespan".
- dist_to_load = min dependency distance to any load. Avoids starving the LOAD engine.
- unblocks_load: ops with load successors.
Per-Cycle Scheduling
Each cycle, select ready ops such that: slot limits per engine are respected, no two ops write the same scratch address in the same cycle, strict deps only decremented at end-of-cycle, weak deps can be decremented within cycle (enabling same-cycle WAR packing).
Priority order: loads first when load-starved, otherwise mix of load proximity, load-unblocking, criticality, and successor count.
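A stripped-down sketch of the greedy per-cycle loop. The real scheduler also handles WAW/WAR edges, same-cycle WAR packing, dist_to_load, and the repair pass; here only slot limits, strict dependencies, and a load-first criticality order are modeled:

```python
from dataclasses import dataclass, field

SLOTS = {"alu": 12, "valu": 6, "load": 2, "store": 2, "flow": 1}

@dataclass
class Op:
    engine: str                                  # which engine's slots this op consumes
    preds: set = field(default_factory=set)      # strict (RAW/WAW) predecessor op ids
    criticality: int = 0                         # longest path to a sink, precomputed

def schedule(ops):
    """Greedy list scheduling: one bundle per cycle, slot limits respected,
    strict dependencies released only at end of cycle."""
    unscheduled, done, bundles = set(ops), set(), []
    while unscheduled:
        ready = [i for i in unscheduled if ops[i].preds <= done]
        # loads first (the scarce engine), then the most critical ops
        ready.sort(key=lambda i: (ops[i].engine != "load", -ops[i].criticality))
        used = {e: 0 for e in SLOTS}
        bundle = []
        for i in ready:
            if used[ops[i].engine] < SLOTS[ops[i].engine]:
                used[ops[i].engine] += 1
                bundle.append(i)
        bundles.append(bundle)
        done |= set(bundle)                      # successors become ready next cycle
        unscheduled -= set(bundle)
    return bundles

# Tiny example: op 1 feeds op 2, so they land in consecutive bundles.
ops = {1: Op("valu"), 2: Op("valu", preds={1}), 3: Op("load")}
print(schedule(ops))                             # [[3, 1], [2]]
```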
Beam Search
Pure greedy list scheduling gets stuck in local optima early, so the scheduler runs a multi-cycle beam search for the first schedule_beam_cycles=25 cycles, keeping schedule_beam_width=2 states and exploring schedule_beam_bundles=3 candidate bundles at each step.
Partial schedules are scored by loads issued (rewarded heavily), total ops scheduled (rewarded), FLOW usage (penalized, since FLOW is scarce), and the sum of dist_to_load (penalized). In effect it's a tiny A*-style lookahead on the scheduling problem.
Ready-Set Repair
After building a bundle, a repair pass fixes a common pathology: the bundle didn't fill all of its load slots, but could have if a different VALU op (one that was blocking future loads) had been chosen. The pass can evict a low-priority VALU op and insert a candidate that sits near loads in the dependency graph, explicitly shaping the schedule to keep the scarcest engines saturated.
13. Dead Code Elimination
The generator has many conditionals ("if last round", "if depth==3", etc.). Even careful emission produces scratch writes that never feed anything (e.g., leaf rounds where idx update isn't needed).
DCE walks slots backwards with a liveness set: keeps stores and control-flow ops (side effects), keeps ops whose writes feed later reads, drops everything else. Applied separately to init and body, with init's live-out computed from body reads.
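A sketch of that backward liveness walk, modeling each slot as (reads, writes, has_side_effect):

```python
def dead_code_eliminate(slots, live_out=frozenset()):
    """Keep stores/flow (side effects) and ops whose writes feed a later read.
    live_out: addresses read downstream (e.g. the body's reads, when cleaning init)."""
    live = set(live_out)
    keep = [False] * len(slots)
    for i in range(len(slots) - 1, -1, -1):       # walk the scheduled slots backwards
        reads, writes, side_effect = slots[i]
        if side_effect or (set(writes) & live):
            keep[i] = True
            live -= set(writes)                   # defined here, so dead above this point
            live |= set(reads)                    # this op's inputs become live
    return [s for s, k in zip(slots, keep) if k]

# Example: the second write to "tmp" is never read again, so it gets dropped.
slots = [
    (["a"], ["tmp"], False),
    (["tmp"], ["val"], False),
    (["val"], ["tmp"], False),                    # dead: tmp is never read after this
    (["val"], [], True),                          # store: kept for its side effect
]
assert len(dead_code_eliminate(slots)) == 3
```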
14. Scratch Management
SCRATCH_SIZE = 1536 words. Every vector temp consumes 8 words. Per-tile temps and per-tile node buffers quickly eat scratch.
Explicit tradeoffs: allocate per-tile temps when it buys ILP (hash temps), reuse temps when safe (depth3 sequential processing), choose node buffer mode automatically (per_tile, pingpong, group) based on remaining scratch.
Aliasing kills correctness under out-of-order scheduling. If two ops from different tiles write the same scratch vector in overlapping lifetimes, the list scheduler can interleave them and you get schedule-dependent wrong answers. Depth3 vselect is emitted tile-by-tile with dedicated buffers to eliminate WAW/WAR hazards.
15. Why 1137 Cycles: VALU-Throughput-Limited
Final analysis:
- Total VALU ops: ~6563
- VALU theoretical min: 6563 / 6 ≈ 1094 cycles
- Actual: ~1137 cycles
- Overhead: ~43 cycles (~3.9%)
The kernel is no longer dominated by wasted memory traffic or poor packing. It's dominated by the fundamental throughput limit of the VALU engine.
Remaining 43 cycles explained by: init phase being load-bound (constants, preload nodes, initial vloads), depth/round transitions creating unavoidable structural gaps (phases requiring FLOW but FLOW is 1/cycle), tail stores, and residual dependency-induced bubbles that even beam search can't eliminate without changing the algorithm or ISA.
16. Causal Chain of ~130× Speedup
- SIMD tiling + scratch residency: 4096 scalar iterations → 32 tiles × 16 rounds of vector ops, removes per-round value loads/stores
- Hash fusion: cuts hash vector op count, reduces dependency length
- Stage-major hash interleaving: turns per-tile dependency chains into wide ready set, enables near-full VALU utilization
- Early-depth gather elimination + depth3 vselect: removes huge numbers of scalar loads, especially ~256 cycles of load floor at depth3
- Pointer-form indices + vselect addend: eliminates address formation ops, shortens hash→idx→load critical path, shifts work to FLOW
- Load/hash pipelining: keeps LOAD and VALU busy simultaneously, transforms load throughput limit into steady pipeline
- Wavefront scheduling: reduces synchronization phase bubbles between depths/rounds, improves global utilization
- List scheduling + beam lookahead + ready-set repair: converts the above ILP into fewer cycles
- ALU offload: reduces VALU pressure in final regime where VALU is roofline, gives direct cycle reduction
- DCE + pause coalescing + store overlap: removes tail overhead and nonessential work
That is how you go from ~147k cycles to ~1137 cycles on this architecture.
Optimization Milestones
Click a milestone, or use arrow keys to navigate between them.
Fig. 1. Cycle count reduction across optimization milestones (log scale). Each point represents a discrete improvement in scheduling strategy.
Engine Utilization
Utilization across milestones by engine type.
Fig. 3. Engine utilization percentage across optimization milestones. Higher values indicate better exploitation of available parallelism.
Kernel Schedule Explorer
Scroll to pan, ctrl/cmd + wheel to zoom. Click ops or bubbles to inspect dependencies.
Fig. 4. Full kernel schedule visualization. Each row represents an engine; columns are clock cycles.
Live Ranges & Register Pressure
Derived from full schedule export. Highlight conflict ranges and pressure hot spots.
Fig. 5. Register live ranges showing value lifetimes and potential conflicts.