Chemistry Leaderboard

ChemSets Leaderboard

How Capable Are Chemistry LLMs?

⚠️ This leaderboard is outdated

Please check out the new leaderboard for the updated benchmark MolecularIQ:
https://huggingface.co/spaces/ml-jku/molecularIQ_leaderboard

Rank	Model	Size	Reasoning	SymMolic	ChemIQ	Ether0
🥇	GPT-oss-120b-high	120B (A5B)	Yes	57.200	65.600	18.900
🥈	GPT-oss-20b-high	20B (A4B)	Yes	51.100	47.400	13.500
🥉	Qwen3-Think-235B	235B (A22B)	Yes	50.100	65.500	9.200
4	GPT-oss-120b-medium	120B (A5B)	Yes	43.100	36.900	15.900
5	Qwen3-Think-30B	30B (A3B)	Yes	34.800	31.700	4.100
6	GPT-oss-20b-medium	20B (A4B)	Yes	33.800	20.800	10.000
7	Qwen3-32b	32B	Yes	28.200	22.600	2.800
8	Qwen3-14b	14B	Yes	24.900	12.200	3.700
9	Qwen3-8b	8B	Yes	19.300	12.000	4.100
10	Llama-molinst	8B	No	8.900	3.200	0.600
11	LlaSMol-Mistral	7B	No	3.600	1.600	0.400
12	ChemDFM-8B	8B	No	3.300	1.100	1.900
13	Txgemma-27b	27B	No	3.000	4.000	3.000
14	ChemLLM-7B	7B	No	2.400	0.700	0.400
15	Ether0	24B	Yes	2.400	13.100	45.900
16	ChemDFM-13B	13B	No	2.300	1.400	0.900
17	Txgemma-9b	9B	No	0.700	2.600	3.900

Scoring: Each rollout receives a binary reward (1 for correct, 0 for incorrect). The score for a question is the mean reward over three rollouts. Each column value is the mean of these per-question scores over all questions in that category.
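
As a rough illustration of the scoring rule, the two averaging steps might look like the following minimal sketch. The function names and the example rewards are hypothetical and are not taken from the benchmark code. The table values appear to be on a 0 to 100 scale, but that scaling is an assumption and is not applied here.

```python
def question_score(rollout_rewards):
    """Mean of the binary rewards (1 = correct, 0 = incorrect) for one question."""
    return sum(rollout_rewards) / len(rollout_rewards)


def category_score(rewards_per_question):
    """Mean of the per-question scores over all questions in one category."""
    scores = [question_score(rewards) for rewards in rewards_per_question]
    return sum(scores) / len(scores)


# Hypothetical rewards: three questions with three rollouts each.
example_rewards = [
    [1, 1, 0],  # correct in two of three rollouts
    [0, 0, 0],  # never correct
    [1, 1, 1],  # always correct
]
example_score = category_score(example_rewards)
```

Averaging per question first and then across questions gives every question equal weight. With a fixed three rollouts per question this equals the plain mean over all rollouts.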

Rank: Models are ordered by SymMolic score, highest first.
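
The ordering can be reproduced with a short sketch such as the one below, which uses a few rows copied from the table. The helper name `rank_by_symmolic` is hypothetical. The page does not state how ties are broken (ChemLLM-7B and Ether0 both score 2.400 on SymMolic), so this sketch simply relies on Python's stable sort, which keeps tied rows in their input order.

```python
# A few rows copied from the table: (model, SymMolic, ChemIQ, Ether0).
rows = [
    ("Ether0", 2.4, 13.1, 45.9),
    ("Qwen3-Think-235B", 50.1, 65.5, 9.2),
    ("GPT-oss-120b-high", 57.2, 65.6, 18.9),
    ("GPT-oss-20b-high", 51.1, 47.4, 13.5),
]


def rank_by_symmolic(rows):
    """Return (rank, model) pairs ordered by SymMolic score, highest first."""
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(position, row[0]) for position, row in enumerate(ranked, start=1)]


ranking = rank_by_symmolic(rows)
```

Ranking on SymMolic alone means a model can place low despite leading another column: Ether0 has the highest Ether0 score in the table (45.900) but ranks 15th overall.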