2048 BENCH

What happens when you make LLMs play 2048? We tested several models to see which would achieve the highest score.

Each model plays multiple games, and we track its average score, maximum tile achieved, and win rate.

Model          Score  Max Tile  Win Rate
GPT 4          56.80  11.2      0.00%
GPT-3.5-Turbo  55.20  10.4      0.00%
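
As a rough illustration of how per-model figures like these could be computed from individual games, here is a minimal Python sketch. It is not the benchmark's actual code: the `GameResult` record, the `summarize` function, and the `WIN_TILE` constant are hypothetical names, and two assumptions are built in. First, a game counts as a win when its largest tile is at least 2048. Second, the Max Tile column is treated as the mean of each game's largest tile, which the fractional values in the table suggest but the page does not state.

```python
from dataclasses import dataclass
from statistics import mean

# Assumption: a game is a "win" once the 2048 tile is reached.
WIN_TILE = 2048


@dataclass
class GameResult:
    """Outcome of one finished game (hypothetical record layout)."""
    score: int     # final score of the game
    max_tile: int  # largest tile on the board when the game ended


def summarize(results: list[GameResult]) -> dict[str, float]:
    """Aggregate one model's games into the three tracked metrics."""
    if not results:
        raise ValueError("at least one game result is required")
    wins = sum(1 for r in results if r.max_tile >= WIN_TILE)
    return {
        "avg_score": mean(r.score for r in results),
        # Assumption: the Max Tile column is averaged across games.
        "avg_max_tile": mean(r.max_tile for r in results),
        "win_rate": wins / len(results),
    }


if __name__ == "__main__":
    # Made-up games for illustration only, not benchmark data.
    games = [GameResult(score=100, max_tile=16), GameResult(score=20, max_tile=4)]
    stats = summarize(games)
    print(f"{stats['avg_score']:.2f}  {stats['avg_max_tile']:.1f}  {stats['win_rate']:.2%}")
```

Averaging the largest tile is only one possible reading. If the column were meant as the single best tile across all games, `max` would replace `mean` on that line.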