Wordmaze: a one-shot, verifier-graded puzzle for RL

Wordmaze is a single-turn word-ladder puzzle with a password constraint, built to be trivially gradable for RL.

By Dip

June 3, 2026 24 min read

I wanted an RL task I could grade with a plain Python function ^{[1]Lambert et al. (2024) · Tülu 3 (RLVR) · arXiv:2411.15124}: no judge model, no human labels. Something a frontier model finds easy and a small one finds hard, so the gap between them is what RL has to close. Wordle is the usual pick, but it drags along multi-turn state and a reward surface that's easy to hack ^{[2]Amodei et al. (2016) · Concrete Problems in AI Safety · arXiv:1606.06565}. So I built a different one.

Numbers first. On the hard config, gemini-3.1-pro-preview solves 99.5% of these puzzles. gemini-2.5-flash gets 36.5%, and it burns a median of 19,374 reasoning tokens per puzzle to do it, about five times what 3.1-pro spends to win. Qwen3.5-9B lands at 9.5%, Qwen3.5-4B at 2.5%. So the task is cheap to write down, cheap to grade, and it pulls models apart. That's what you want from an RL environment.

This post is about the environment: the puzzle, the verifier, how I generate the data, and what the baselines say. The training story, what happens when you actually point GRPO ^{[3]Shao et al. (2024) · DeepSeekMath (GRPO) · arXiv:2402.03300} ^{[4]DeepSeek-AI (2025) · DeepSeek-R1 · arXiv:2501.12948} at it, is a separate post.

The puzzle

You get a start word and a goal word of the same length. Build a path between them under three rules:

Change exactly one letter per move.
Every word in the path must be a valid English word of that length.
Reach the goal in at most max_moves moves.

That much is a classic word ladder. The twist is a fourth rule, the password:

For each move, look at the letter that changed. If it moved forward in the alphabet (a → z), write F. If it moved backward (z → a), write B. Concatenate those into a string. It must equal the puzzle's password.

So the path isn't just any route from start to goal. It's a route whose shape (the forward/backward pattern of its letter changes) spells a specific string. Here's the canonical example, cold → warm with password FFBF:

cold -> cord -> word -> ward -> warm
        F       F       B       F

l→r forward, c→w forward, o→a backward, d→m forward: FFBF. The model sees the puzzle as plain text and must reply with only the path, inside a tag:

start: cold
goal: warm
word_length: 4
max_moves: 4
password: FFBF

<answer>cold -> cord -> word -> ward -> warm</answer>

That output contract is deliberate. Parsing <answer>...</answer> and splitting on -> is all a verifier needs: no free-text to interpret, no second model in the loop.

One dataset row (JSON)

{
  "id": "wordmaze-13-01800",
  "start": "fold",
  "goal": "gold",
  "word_length": 4,
  "max_moves": 3,
  "password": "BFB",
  "solution": "fold -> bold -> sold -> gold",
  "answer": "<answer>fold -> bold -> sold -> gold</answer>",
  "path": ["fold", "bold", "sold", "gold"],
  "num_moves": 3
}

Each row also carries a ready-to-use prompt and chat messages (system + user), so you can evaluate or train without assembling anything.

Design choices

My goal was to build a task such that i could run RL on it with limited compute, and it can't be easily cheated through.

Single turn, not a multi-turn dialogue. Wordle is a multi-turn game: the model guesses, gets feedback, guesses again, and the prompt accumulates state. Now, during RL, you can still treat it as single turn, by giving previous guesses as part of the prompt. But, Reward isn't very clear in this setting. Wordmaze is designed to be a single shot. The model reads a fixed prompt and emits a complete answer. Everything the verifier needs is in that one string, which keeps rollout cost low and grading simple. So, i don't end up spending my savings on RL training.

A constraint the model can't bluff past. The password is the point. Without it, a word ladder has many valid solutions and a model can stumble into any one of them. Even with password, there are still many valid solutions, but at least the model has to plan. The password forces planning: you have to find a path whose letter-change directions spell a specific pattern, so most valid ladders are wrong answers. It also makes the task character-level. To get the password right you have to know which letter changed between two words and whether it went up or down the alphabet.

There are lot of knobs that can be tuned when you generate the dataset. For example, Password length (number of moves), Word length, Vocabulary size, etc. This knobs can be tuned to generate different difficulty levels of the task. We will talk about this in more detail in the later sections.

The verifier

Grading is a pure function. Extract the path from <answer>...</answer>, split on ->, and run a fixed list of checks. There are no learned components and no partial-credit heuristics baked into the truth label. Every check is a boolean:

Check	Passes when
`valid_format`	parses to ≥2 same-length words
`starts_correctly`	first word == start
`reaches_goal`	last word == goal
`within_max_moves`	≤ `max_moves` steps
`exact_moves`	exactly `max_moves` steps
`one_letter_changes`	each step changes one letter
`all_valid_words`	every word in the dictionary
`correct_word_length`	every word the right length
`password_matches`	computed F/B == password
`fully_valid`	all of the above

fully_valid is the metric we care about in the evaluation. The per-check booleans are what make the dataset useful for RL: instead of a sparse 0/1 signal, you have multiple booleans which can be used to shape a overall reward.

at most versus exactly. The prompt asks the model to reach the goal "in at most max_moves moves," but the verifier tracks both within_max_moves and exact_moves. That difference matters. If the reward only says "valid path that reaches the goal," a model can learn the shortcut: emit a two-word answer like <answer>cold -> warm</answer>, pass the loose checks, and ignore the password. More on that in the training post. For now, this is why the verifier reports both checks.

Here's the verifier as a toy. The start and goal are fixed; fill the blanks between them, one word per step, and watch every check resolve as you go:

fill the ladder in moves · password

Build a word ladder: change exactly one letter each step to get from the first word to the last, in the number of moves shown. For each step, F means the changed letter moved forward in the alphabet and B means backward; the F/B string has to spell the password.

Type each step one letter per box; it auto-advances. The changed letter lights up green; the box turns red if the step isn't a valid one-letter word. The F/B under each arrow builds the password. Words are checked against a wordfreq English list, loaded in the background.

password

The widget checks each word against a wordfreq English list (the common 4- and 5-letter words), fetched on demand. The grader is a touch more lenient: it accepts any word wordfreq has frequency data for, so an obscure word the widget rejects might still pass there.

Generating the dataset

Puzzles are built backwards, the way a mystery writer starts with the culprit and works back to the clues. This is how your favourite murder mystery novel was written probably.

Start with a solution path, then read the constraints off it. The generator (create_dataset.py) does this:

Vocabulary. Take the top 10,000 English words by frequency (via wordfreq), keep only pure [a-z] strings, and drop proper nouns: a blocklist of countries, major cities, US states, plus first names from NLTK. Common words keep the puzzles readable and keep the neighbor graph dense.
Neighbor graph. For each word length, build a graph where words are nodes and an edge connects two words that differ in exactly one position. The trick is bucketing by wildcard pattern (c*ld, co*d, …) so you find neighbors in linear passes instead of comparing every pair. There are many ways to do this and honestly, this is itself a leetcode coding interview question.
Random walks. Sample simple paths (no repeated words) of the requested length, then read off the start, goal, and password from each path. Because the password is derived from the walk, every generated puzzle is solvable by construction.
Dedup and split. Drop duplicates keyed on (start, goal, password, length, moves), shuffle, assign stable IDs, and split 80/10/10 into train/validation/test.

A few notes to keep in mind.

Six-letter graphs are sparse, so random walks dead-end far more often than four-letter ones, and the generator logs thousands of failed attempts to fill the longer buckets.
Your sampled path is not the only solution, a model that finds a different valid path is still correct, which is why you grade with the verifier and never string-match against the reference.
Verifier has different dictionary than our dataset generation script, in dataset generation, we only use top 10k words by frequency, but the verifier uses the entire wordfreq English list.

Two configs are released, both 2,000 puzzles split 1,600 / 200 / 200:

m3-4: 4-6 letter words, 3-4 moves. The easier curriculum config.
m4-6: 4-6 letter words, 4-6 moves. The harder one, and the default.

Both live on the Hub at immortal3/wordmaze, together with the generator script. It's self-contained (dependencies declared inline), so uv run builds the environment on the fly. Regenerate m4-6 with:

uv run create_dataset.py \
  --move-counts 4-6 --word-lengths 4-6 \
  --num-examples 2000 --seed 42

Baselines

This is the most fun part, also where you lose money if you don't have free credits.

Here is the baseline on the hard m4-6 test split:

Wordmaze m4-6 test leaderboard: gemini-3.1-pro 99.5%, 3.5-flash 83.5%, 2.5-flash 36.5%, Qwen3.5-9B 9.5%, Qwen3.5-4B 2.5%

The top models basically solve it: 3.1-pro at 99.5%, 3.5-flash at 83.5%. The small open models are stuck in single digits, 9.5% for the 9B and 2.5% for the 4B. That gap, near-perfect at the top and almost nothing at the bottom, is the whole reason I made the dataset.

Accuracy alone hides the more interesting part, though. Here it is plotted against how much each model thinks (its reasoning tokens):

Fully-valid rate versus median reasoning tokens for the three Gemini models on Wordmaze m4-6

Look at the weakest of the three. gemini-2.5-flash scores 36.5%, and not because it ran out of room to think. The opposite. It uses a median of 19,374 reasoning tokens ^{[5]Wei et al. (2022) · Chain-of-Thought Prompting · arXiv:2201.11903} per puzzle, against 3.1-pro's 4,100, and still loses. On six-move puzzles its median goes past 60k. In the plot, the dot is each model's median and the diamond is its slow tail (the 95th percentile). Both flash models hit the 63k token cap on their worst puzzles; 3.1-pro's worst (about 10k) is still below 2.5-flash's typical puzzle. So 2.5-flash isn't really reasoning, it's flailing: it searches, second-guesses, and never lands. More thinking time isn't the fix. A better model is. ^{[6]Snell et al. (2024) · Scaling LLM Test-Time Compute · arXiv:2408.03314} The stronger model gets there with less thinking, not more.

Where do the failures concentrate? Format and word-validity are nearly free for everyone. Even 2.5-flash gets 98% valid format and 80% valid words. The puzzle lives in two checks: one_letter_changes and password_matches. 2.5-flash matches the password only 44% of the time. The character-level bookkeeping, which letter changed and in which direction, is what separates the models.

The open models fail the same way, only harder. Stripped of a reasoning budget the 4B collapses toward zero, 4% on one_letter_changes in direct-answer mode. A concrete 4B failure, on a four-move puzzle:

<answer>stark -> shark -> shore -> shire -> share</answer>

Every word is real. But shark -> shore changes two letters, not one, so the path is not a valid ladder and the password can't even be computed. The model produced plausible English and quietly broke the rule that makes it track characters.

Difficulty as a dial: m3-4 vs m4-6

Everything above is the hard config. Wordmaze also ships an easier one, and together they form a clean difficulty ladder: m3-4 is 3-4 moves, m4-6 is 4-6, same vocabulary and rules, just more steps to plan and a longer password to satisfy. Running the full frontier sweep on both turns the difficulty knob and shows who actually holds up.

model	m3-4 (3-4 moves)	m4-6 (4-6 moves)
gemini-3.1-pro-preview	99.5%	99.5%
gemini-3.5-flash	95.5%	83.5%
gemini-2.5-flash	56.5%	36.5%
Qwen3.5-9B (open)	32.5%	9.5%
Qwen3.5-4B (open)	18.5%	2.5%

The strongest model barely notices: 3.1-pro is 99.5% on both configs. The weaker ones drop as puzzles get longer, and 2.5-flash drops the most. But the two averages hide what's really going on. It's clearer move by move.

Accuracy and median reasoning tokens versus move count for the three Gemini models

The chart plots accuracy and thinking against the number of moves, one line per config. The two lines meet at four moves, which tells us the difficulty comes from how many moves a puzzle needs, not from which config it came from. That's the dial I lean on in the training post. 3.1-pro stays flat near 100% from three moves to six, and its thinking only creeps up, about 1.4k tokens to 5.7k. 2.5-flash goes the other way on both: accuracy falls from 65% to 17%, and its thinking jumps from 6k to 63k on the six-move puzzles. Ten times the effort, a sixth of the wins. Harder puzzles should cost more thought, sure, but for a good model that cost stays small, and for a weak one it blows up ^{[7]Shojaee et al. (2025) · The Illusion of Thinking · arXiv:2506.06941}.

Mean versus median says the same thing more sharply. For 3.1-pro the two are close (on m4-6, median 4.1k, mean 4.7k), so almost every puzzle costs about the same. For 2.5-flash the mean sits well above the median (median 19k, mean 31k; on m3-4 the mean is triple the median). That gap means a handful of puzzles send it spiraling into tens of thousands of tokens with nothing to show. 3.5-flash has those bad puzzles too, but it usually still gets the answer. 2.5-flash usually doesn't.

Wall-clock time tells the same story. On the m3-4 runs I also logged how long each request took. At the median, 3.5-flash is fastest (12s), 3.1-pro a bit slower at 19s (bigger model, fewer tokens), and 2.5-flash slowest at 33s. The tail is where it hurts: 3.1-pro and 3.5-flash stay under about 40 seconds 90% of the time, but 2.5-flash's slowest 10% run over four minutes. Its average is 92 seconds, nearly triple its median, the same runaway tail, now in seconds instead of tokens. The best model isn't the quickest per puzzle, but it's the only one you'd trust on a deadline.

The open models ride the same ladder, one tier down (the bottom two rows above): they slide with difficulty exactly like the flash models, only starting far lower.

The tokenization barrier

Remember the two checks every model flunked: one_letter_changes and password_matches. Both are about letters, and that turns out to be the whole problem.

Here's a real answer from gemini-2.5-flash:

<answer>chips -> ships -> shops -> shots -> steps</answer>

It reads like a clean ladder. But look at the last step. shots -> steps changes three letters at once (h→t, o→e, t→p), not one. The model wrote it down anyway. To you the mistake is obvious, you just look at the spelling. The model can't look at the spelling that easily, and that's the catch.

A language model doesn't read letters, it reads tokens ^{[8]Sennrich et al. (2016) · Subword Units (BPE) · arXiv:1508.07909}. A short word like shots or steps is usually a single token, one number to the model. And those two numbers are unrelated: nothing in them tells the model which letters the words share or where they differ. So to answer "did exactly one letter change?", the model can't just glance at the spelling. It has to recall each word letter by letter, line the two up, and compare, all in its head, in words.

The password is harder still. For every move the model has to find which letter changed and decide whether it moved forward or backward in the alphabet (l → r is forward, so F). That's two small character puzzles per step, on top of searching for a valid path. Frontier models grind through it with thousands of tokens of scratch work. A 4B can't keep that up, so it falls apart.

So here's a test. If the tokenizer is the problem, handing the model the letters should help. I reran the 4B with every word spelled out with spaces, c o l d instead of cold. Now each letter is its own token, which is the idea behind character-level models ^{[9]Xue et al. (2021) · ByT5 · arXiv:2105.13626} ^{[10]Clark et al. (2021) · CANINE · arXiv:2103.06874}:

Char-spacing roughly doubles a 4B's character-level pass rates on Wordmaze m3-4

It worked. One-letter steps doubled, password matches nearly doubled, and fully-valid went from 18.5% to 26.5%, even though spacing makes the text about twice as long, so the model has less room to work, not more. Handing it the letters helped more than the extra length hurt. That's a strong hint the tokenizer, not raw smarts, is a real part of why small models struggle here.

What's next

That's the environment: a one-shot puzzle, a pure-function verifier with a per-check signal, a reproducible generator, and a baseline gap that runs from 99.5% at the frontier down to single digits for a 4B. The gap is the whole motivation. A small open model can almost do this when you give it room to reason, which is the regime where RL might help: there's a non-zero success rate to amplify.

The next post is the training story: what happens when you point GRPO at a 4B and try to close that gap. It does not go the way you'd hope. The first honest result is that a naive pipeline made the model worse than its untrained base, by learning to game the very checks above. That failure, what it says about reward design, and whether RL can push a model past what its base already does ^{[11]Yue et al. (2025) · Limits of RLVR · arXiv:2504.13837}, is part two.

The dataset is on the Hub if you want to point a model at it. If you beat 3.1-pro, I'd like to know how.

References

Tülu 3: Pushing Frontiers in Open Language Model Post-Training. Lambert et al. (2024). Introduces RLVR (RL from verifiable rewards). arXiv:2411.15124
Concrete Problems in AI Safety. Amodei et al. (2016). On reward hacking and specification gaming. arXiv:1606.06565
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. Shao et al. (2024). Introduces Group Relative Policy Optimization (GRPO). arXiv:2402.03300
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. DeepSeek-AI (2025). arXiv:2501.12948
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Wei et al. (2022). arXiv:2201.11903
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. Snell et al. (2024). arXiv:2408.03314
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Shojaee et al. (2025). Reasoning models vs puzzle complexity. arXiv:2506.06941
Neural Machine Translation of Rare Words with Subword Units. Sennrich et al. (2016). Byte-pair encoding. arXiv:1508.07909
ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. Xue et al. (2021). arXiv:2105.13626
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. Clark et al. (2021). arXiv:2103.06874
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?. Yue et al. (2025). arXiv:2504.13837

📏 Bits per byte (BPB) for LLMs: tokenizer-agnostic loss
BPB measures LLM loss per UTF-8 byte, making tokenizer comparisons fair. Formula, worked example, and PyTorch-style implementation.
🔀 GPT-5 router: why chat apps will route across models
Why GPT-5 style routers send simple prompts to smaller models and hard prompts to reasoning models, and what that means for chat UX.
☁️ AWS Bedrock Converse API: one endpoint for Claude, Llama, and Mistral
How to call Amazon Bedrock Converse API, pass messages and inference config, and handle common model-specific validation errors.