Synthetitastic

Synthetitastic is an accelerationist set of synthetic data. It's packed with a variety of programmatically generated cases that models can be evaluated on and trained on with RL. Synthetitastic isn't ordinary: it wants to be saturated, so that LLMs can get better and more efficient at these seemingly easy tasks.

Basic tasks

These are the kinds of tasks that computers are great at, reasoning LLMs are okay at, and normal LLMs are bad at. With Synthetitastic, LLMs can hone their skills on these tasks, reaching perfect accuracy and speed.

The test cases are JSONL files, where each line is a JSON object with keys like:

input: Input text
output: Expected output text
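
As a rough illustration of that format, here is a minimal sketch of loading cases and grading a reply. The `TestCase` type, `parseJsonl`, and `isCorrect` are hypothetical names, and exact match after trimming is an assumed grading rule, not necessarily what eval.ts does. The sample outputs were computed by hand for this example, not copied from the dataset files.

```typescript
// Hypothetical shape of one test case; only the `input` and `output` keys
// come from the readme.
interface TestCase {
  input: string;
  output: string;
}

// Parse JSONL text: one JSON object per non-empty line.
function parseJsonl(text: string): TestCase[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as TestCase);
}

// Assumed grading rule: exact match after trimming whitespace.
function isCorrect(testCase: TestCase, modelReply: string): boolean {
  return modelReply.trim() === testCase.output.trim();
}

// Two illustrative lines in the format described above.
const sample = [
  '{"input": "How many \'e\' are in \'bestseller\'? Just say the number.", "output": "3"}',
  '{"input": "What is the product of 96933 and 90409? Just say the number.", "output": "8763615597"}',
].join("\n");

const cases = parseJsonl(sample);
console.log(cases.length, isCorrect(cases[0], " 3\n"));
```

Exact matching fits prompts that end in "Just say the number." A looser comparison (case-insensitive, for instance) might suit tasks like day of week better.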

And we have these tests:

Day of week (eg "What day of the week is 2084-09-26? Just say the day.")
Epoch conversion (eg "I want to make a Discord timestamp for 19:24:17 UTC on 2129-11-29. Just say the Unix time (seconds) I should use.")
Large multiplication (eg "What is the product of 96933 and 90409? Just say the number.")
Largest number (eg "What is the largest number without the letter o? Reply with just the decimal number. Exclude numbers like googolplex.")
Letter counting (eg "How many 'e' are in 'bestseller'? Just say the number.")
Wordle (eg "I guessed raxes and got 🟨🟨🟨🟨⬛ - so what's the word?")
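
Most of these have a deterministic reference answer, which is what makes programmatic generation possible. The sketch below shows one way to compute expected outputs for the example prompts above. All function names are hypothetical and this is not the repository's generator. Largest number is left out because it needs a number-naming routine.

```typescript
const DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"];

// Day of week for an ISO date (YYYY-MM-DD), interpreted in UTC.
function dayOfWeek(isoDate: string): string {
  const [y, m, d] = isoDate.split("-").map(Number);
  return DAY_NAMES[new Date(Date.UTC(y, m - 1, d)).getUTCDay()];
}

// Unix time in seconds for a UTC date and HH:MM:SS time.
function unixSeconds(isoDate: string, time: string): number {
  const [y, m, d] = isoDate.split("-").map(Number);
  const [hh, mm, ss] = time.split(":").map(Number);
  return Date.UTC(y, m - 1, d, hh, mm, ss) / 1000;
}

// Count occurrences of a single letter in a word.
function countLetter(word: string, letter: string): number {
  return word.split("").filter((c) => c === letter).length;
}

// Standard Wordle feedback: greens first, then yellows limited by the
// number of unmatched copies of each letter left in the answer.
function wordleFeedback(guess: string, answer: string): string {
  const result: string[] = new Array(guess.length).fill("⬛");
  const unmatched: Record<string, number> = {};
  for (let i = 0; i < guess.length; i++) {
    if (guess[i] === answer[i]) {
      result[i] = "🟩";
    } else {
      unmatched[answer[i]] = (unmatched[answer[i]] ?? 0) + 1;
    }
  }
  for (let i = 0; i < guess.length; i++) {
    if (result[i] === "🟩") continue;
    const c = guess[i];
    if ((unmatched[c] ?? 0) > 0) {
      result[i] = "🟨";
      unmatched[c] -= 1;
    }
  }
  return result.join("");
}

console.log(dayOfWeek("2084-09-26")); // Tuesday
console.log(unixSeconds("2129-11-29", "19:24:17")); // 5046348257
console.log(96933 * 90409); // 8763615597 (well below Number.MAX_SAFE_INTEGER)
console.log(countLetter("bestseller", "e")); // 3
console.log(wordleFeedback("raxes", "extra")); // 🟨🟨🟨🟨⬛
```

"extra" is one word consistent with the Wordle example's feedback. A generator would presumably pick the answer first and derive the feedback from it, as `wordleFeedback` does here.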

Here are the results:

Day of week

GPT-4.5: 8/10

Llama 70b: 2/10

Llama Maverick: 7/10

Qwen 235B (thinking): 5/10

Claude 3.7: 0/10

R1 0528: 8/10

R1: 8/10

Epoch conversion

GPT-4.5: 0/10

Llama 70b: 0/10

Llama Maverick: 0/10

Qwen 235B (thinking): 0/10

Claude 3.7: 0/10

R1 0528: 1/10

R1: 0/10

Large multiplication

GPT-4.5: 0/10

Llama 70b: 0/10

Llama Maverick: 2/10

Qwen 235B (thinking): 0/10

Claude 3.7: 0/10

R1 0528: 10/10

R1: 10/10

Largest number

GPT-4.5: 8/10

Llama 70b: 3/21

Llama Maverick: 6/21

Qwen 235B (after 6k tokens of thinking): 17/21

Claude 3.7: 5/10

R1 0528 (after 14.5k tokens of thinking): 20/21

R1: 7/10

Letter counting

GPT-4.5: 10/10

Llama 70b: 6/10

Llama Maverick: 5/10

Qwen 235B (thinking): 8/10

Claude 3.7: 6/10

R1 0528: 9/10

R1: 10/10

Wordle

GPT-4.5: 2/10

Llama 70b: 0/10

Llama Maverick: 0/10

Qwen 235B (thinking): 0/10

Claude 3.7: 2/10

R1 0528: 10/10

R1: 9/10

Multimodal tasks

These tasks are closer to real life. LLMs should be perfect at these in theory, but currently aren't that great.

Angle identification
Clock reading
Icon recognition
Object counting
Point identification

The test cases are JSONL files, where each line is a JSON object with keys like:

input: Input text
input_image: Base 64 encoded input image (PNG)
output: Expected output text
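
A multimodal case could be handled along these lines. This is a sketch only: `MultimodalTestCase`, `looksLikePng`, and `toDataUrl` are hypothetical names, and the sample prompt and image string are placeholders, not real dataset content. Many vision APIs accept images as data URLs, so the helper wraps the base64 string in one.

```typescript
// Hypothetical shape of one multimodal case; the three keys come from the readme.
interface MultimodalTestCase {
  input: string;
  input_image: string; // base64-encoded PNG
  output: string;
}

// The 8-byte PNG signature always encodes to this base64 prefix.
const PNG_BASE64_PREFIX = "iVBORw0KGgo";

// Cheap sanity check that the payload is plausibly a PNG.
function looksLikePng(base64: string): boolean {
  return base64.startsWith(PNG_BASE64_PREFIX);
}

// Wrap the base64 payload in a data URL.
function toDataUrl(testCase: MultimodalTestCase): string {
  return `data:image/png;base64,${testCase.input_image}`;
}

// Placeholder case: the image string is only the PNG signature, not a real image.
const clockCase: MultimodalTestCase = {
  input: "What time does the clock show?",
  input_image: "iVBORw0KGgo=",
  output: "10:10",
};

console.log(looksLikePng(clockCase.input_image), toDataUrl(clockCase));
```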

More

If you have an idea, PR it.

In the future, I might add more things, like test cases that mirror my workflow or reward functions for drawing things.

KTibow/synthetitastic

Folders and files

Latest commit

History

Repository files navigation

Synthetitastic

Basic tasks

Day of week

Epoch conversion

Large multiplication

Largest number

Letter counting

Wordle

Multimodal tasks

More

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Sponsor this project

Uh oh!

Packages 0

Languages

Packages