ChemSets Leaderboard

How Capable Are Chemistry LLMs?

⚠️ This leaderboard is outdated

Please check out the new leaderboard for the updated benchmark MolecularIQ:
https://huggingface.co/spaces/ml-jku/molecularIQ_leaderboard

Rank  Model                Size         Reasoning  SymMolic  ChemIQ  Ether0
🥇    GPT-oss-120b-high    120B (A5B)   Yes          57.200  65.600  18.900
🥈    GPT-oss-20b-high     20B (A4B)    Yes          51.100  47.400  13.500
🥉    Qwen3-Think-235B     235B (A22B)  Yes          50.100  65.500   9.200
4     GPT-oss-120b-medium  120B (A5B)   Yes          43.100  36.900  15.900
5     Qwen3-Think-30B      30B (A3B)    Yes          34.800  31.700   4.100
6     GPT-oss-20b-medium   20B (A4B)    Yes          33.800  20.800  10.000
7     Qwen3-32b            32B          Yes          28.200  22.600   2.800
8     Qwen3-14b            14B          Yes          24.900  12.200   3.700
9     Qwen3-8b             8B           Yes          19.300  12.000   4.100
10    Llama-molinst        8B           No            8.900   3.200   0.600
11    LlaSMol-Mistral      7B           No            3.600   1.600   0.400
12    ChemDFM-8B           8B           No            3.300   1.100   1.900
13    Txgemma-27b          27B          No            3.000   4.000   3.000
14    ChemLLM-7B           7B           No            2.400   0.700   0.400
15    Ether0               24B          Yes           2.400  13.100  45.900
16    ChemDFM-13B          13B          No            2.300   1.400   0.900
17    Txgemma-9b           9B           No            0.700   2.600   3.900

Scoring: Each model answer receives a binary reward (1 for correct, 0 for incorrect). A question's score is the mean reward over three rollouts, and each column value is the mean of these question scores over all questions in that category.
Rank: Based on SymMolic score (descending)
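
The Scoring and Rank notes above can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual evaluation code: the function names `question_score`, `benchmark_score`, and `rank_models` are hypothetical, and scaling the result by 100 is an assumption based on the table values looking like percentages.

```python
from fractions import Fraction


def question_score(rollouts):
    """Mean binary reward (1 correct, 0 incorrect) over one question's rollouts."""
    if not rollouts or any(r not in (0, 1) for r in rollouts):
        raise ValueError("expected a non-empty list of 0/1 rewards")
    return Fraction(sum(rollouts), len(rollouts))


def benchmark_score(per_question_rollouts):
    """Mean of the per-question scores, scaled to 0-100.

    The scaling by 100 is an assumption, not stated in the scoring note.
    """
    scores = [question_score(r) for r in per_question_rollouts]
    return float(100 * sum(scores) / len(scores))


def rank_models(symmolic_scores):
    """Model names ordered by SymMolic score, highest first."""
    return sorted(symmolic_scores, key=symmolic_scores.get, reverse=True)


# Four made-up questions with three rollouts each (illustrative data only).
rewards = [[1, 1, 0], [0, 0, 0], [1, 1, 1], [1, 0, 0]]
print(benchmark_score(rewards))

# SymMolic values for three rows taken from the table above.
print(rank_models({"Ether0": 2.4, "Qwen3-8b": 19.3, "GPT-oss-120b-high": 57.2}))
```

Exact fractions are used so that averaging thirds does not pick up floating-point rounding before the final conversion. Python's `sorted` is stable, so models with equal SymMolic scores (such as ChemLLM-7B and Ether0 at 2.400) would keep their input order; how the leaderboard itself breaks such ties is not stated.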