Synthetitastic is an accelerationist set of synthetic data. It's packed with a variety of programmatically generated cases that can be used for evals and for RL. Synthetitastic isn't an ordinary benchmark: it wants to be saturated, so that LLMs can get better and more efficient at these seemingly easy tasks.
These are the kinds of tasks that computers are great at, reasoning LLMs are okay at, and normal LLMs are bad at. With Synthetitastic, LLMs can hone their skills on these tasks, aiming for perfect accuracy with fewer tokens and less time.
The test cases are JSONL files, where each line has keys like
- input: Input text
- output: Expected output text
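As a rough illustration of how these files could be consumed, here is a minimal sketch that parses JSONL lines and scores a model by exact match. The function names (`load_cases`, `score`, `dummy_model`) are hypothetical, and exact-match scoring after trimming whitespace is an assumption, not necessarily how the results here were graded.

```python
import json


def load_cases(lines):
    """Parse JSONL text lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def score(cases, answer_fn):
    """Return (correct, total) using exact match after trimming whitespace.

    Exact match is an assumption made for this sketch.
    """
    correct = sum(
        1 for case in cases
        if answer_fn(case["input"]).strip() == case["output"].strip()
    )
    return correct, len(cases)


# Two sample records shaped like the keys described above.
sample_lines = [
    json.dumps({
        "input": "How many 'e' are in 'bestseller'? Just say the number.",
        "output": "3",
    }),
    json.dumps({
        "input": "What is the product of 96933 and 90409? Just say the number.",
        "output": str(96933 * 90409),
    }),
]
cases = load_cases(sample_lines)


def dummy_model(prompt):
    """Stand-in for a real LLM call; always answers "3"."""
    return "3"


correct, total = score(cases, dummy_model)
print(f"{correct}/{total}")
```

In practice `answer_fn` would wrap whatever API or local model is being tested; the rest stays the same.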
The current tests are:
- Day of week (e.g. "What day of the week is 2084-09-26? Just say the day.")
- Epoch conversion (e.g. "I want to make a Discord timestamp for 19:24:17 UTC on 2129-11-29. Just say the Unix time (seconds) I should use.")
- Large multiplication (e.g. "What is the product of 96933 and 90409? Just say the number.")
- Largest number (e.g. "What is the largest number without the letter o? Reply with just the decimal number. Exclude numbers like googolplex.")
- Letter counting (e.g. "How many 'e' are in 'bestseller'? Just say the number.")
- Wordle (e.g. "I guessed raxes and got 🟨🟨🟨🟨⬛ - so what's the word?")
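Since a computer can answer several of these tasks exactly, cases like them can be generated in a few lines. The sketch below covers four of the tasks, reusing the prompt wording from the examples above. It is not the repo's actual generator: every function name, the date range, and the number ranges are assumptions for illustration.

```python
import json
import random
from datetime import date, datetime, timedelta, timezone

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_of_week_case(d):
    """Build a day-of-week case; date.weekday() returns 0 for Monday."""
    return {
        "input": f"What day of the week is {d.isoformat()}? Just say the day.",
        "output": DAYS[d.weekday()],
    }


def epoch_case(dt):
    """Build a Unix-time case from a timezone-aware UTC datetime."""
    seconds = (dt - EPOCH) // timedelta(seconds=1)
    return {
        "input": (
            f"I want to make a Discord timestamp for {dt:%H:%M:%S} UTC on "
            f"{dt:%Y-%m-%d}. Just say the Unix time (seconds) I should use."
        ),
        "output": str(seconds),
    }


def multiplication_case(a, b):
    """Build a multiplication case; Python ints are arbitrary precision."""
    return {
        "input": f"What is the product of {a} and {b}? Just say the number.",
        "output": str(a * b),
    }


def letter_count_case(letter, word):
    """Build a letter-counting case."""
    return {
        "input": f"How many '{letter}' are in '{word}'? Just say the number.",
        "output": str(word.count(letter)),
    }


def random_cases(n, seed=0):
    """Generate n rounds of three cases each. All ranges are assumed, not from the repo."""
    rng = random.Random(seed)
    start = datetime(2000, 1, 1, tzinfo=timezone.utc)
    cases = []
    for _ in range(n):
        dt = start + timedelta(seconds=rng.randrange(200 * 365 * 86400))
        cases.append(day_of_week_case(dt.date()))
        cases.append(epoch_case(dt))
        cases.append(multiplication_case(rng.randrange(10000, 100000),
                                         rng.randrange(10000, 100000)))
    return cases


# Print one round of generated cases as JSONL.
for case in random_cases(1):
    print(json.dumps(case))
```

Seeding the generator keeps a test set reproducible, while changing the seed gives fresh cases for RL.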
Here are the results, grouped by task in the same order as the list above:

Day of week
- GPT-4.5: 8/10
- Llama 70b: 2/10
- Llama Maverick: 7/10
- Qwen 235B (thinking): 5/10
- Claude 3.7: 0/10
- R1 0528: 8/10
- R1: 8/10

Epoch conversion
- GPT-4.5: 0/10
- Llama 70b: 0/10
- Llama Maverick: 0/10
- Qwen 235B (thinking): 0/10
- Claude 3.7: 0/10
- R1 0528: 1/10
- R1: 0/10

Large multiplication
- GPT-4.5: 0/10
- Llama 70b: 0/10
- Llama Maverick: 2/10
- Qwen 235B (thinking): 0/10
- Claude 3.7: 0/10
- R1 0528: 10/10
- R1: 10/10

Largest number
- GPT-4.5: 8/10
- Llama 70b: 3/21
- Llama Maverick: 6/21
- Qwen 235B (after 6k tokens of thinking): 17/21
- Claude 3.7: 5/10
- R1 0528 (after 14.5k tokens of thinking): 20/21
- R1: 7/10

Letter counting
- GPT-4.5: 10/10
- Llama 70b: 6/10
- Llama Maverick: 5/10
- Qwen 235B (thinking): 8/10
- Claude 3.7: 6/10
- R1 0528: 9/10
- R1: 10/10

Wordle
- GPT-4.5: 2/10
- Llama 70b: 0/10
- Llama Maverick: 0/10
- Qwen 235B (thinking): 0/10
- Claude 3.7: 2/10
- R1 0528: 10/10
- R1: 9/10
There are also image-based tasks, which are closer to real life. In theory LLMs should be perfect at these, but currently they aren't that great.
- Angle identification
- Clock reading
- Icon recognition
- Object counting
- Point identification
The test cases are JSONL files, where each line has keys like
- input: Input text
- input_image: Base64-encoded input image (PNG)
- output: Expected output text
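To show how the image field could be unpacked, here is a minimal sketch that decodes `input_image` and checks for the PNG signature using only the standard library. The function name `decode_image_case` is hypothetical, and the sample record (including its prompt text and its stand-in image bytes) is made up for illustration rather than taken from the dataset.

```python
import base64
import json

# Every PNG file starts with these eight bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_image_case(line):
    """Parse one JSONL line and return (prompt, image_bytes, expected_output)."""
    case = json.loads(line)
    image_bytes = base64.b64decode(case["input_image"])
    if not image_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("input_image does not look like a PNG")
    return case["input"], image_bytes, case["output"]


# Stand-in record: the signature plus filler bytes, not a viewable image.
fake_png = PNG_SIGNATURE + b"placeholder"
line = json.dumps({
    "input": "What time does the clock show?",  # hypothetical prompt
    "input_image": base64.b64encode(fake_png).decode("ascii"),
    "output": "10:10",  # hypothetical expected answer
})

prompt, image_bytes, expected = decode_image_case(line)
print(prompt, len(image_bytes), expected)
```

The decoded bytes can then be written to a `.png` file or passed to whatever image input the model API expects.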
If you have an idea, PR it.
In the future, I might add more things, like test cases that mirror my workflow or reward functions for drawing things.