AI Large Model Rankings: Artificial Analysis AI Model Leaderboard

The ranking data on this page comes from Artificial Analysis, which compares and ranks the performance of more than 100 AI models (LLMs) on metrics including intelligence and price. The table also aggregates results from other authoritative AI benchmarks for reference.

AI Large Model Rankings (based on Artificial Analysis)

Model info | Artificial Analysis benchmark results | Other AI benchmark results
Rank  Model  Organization  Composite Index  Coding  Math  Price ($/1M tokens)  MMLU Pro  GPQA  HLE  LiveCodeBench  SciCode  Math 500  AIME
1 Gemini 3.1 Pro Preview Google 57 55.5 - $4.5 - 0.941 0.447 - 0.589 - -
2 Claude Opus 4.6 (Adaptive Reasoning, Max Effort) Anthropic 53 48.1 - $10 - 0.896 0.367 - 0.519 - -
3 Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) Anthropic 51.3 50.9 - $6 - 0.875 0.3 - 0.468 - -
4 GPT-5.2 (xhigh) OpenAI 51.2 48.7 99 $4.813 0.874 0.903 0.354 0.889 0.521 - -
5 Claude Opus 4.5 (Reasoning) Anthropic 49.7 47.8 91.3 $10 0.895 0.866 0.284 0.871 0.495 - -
6 GLM-5 (Reasoning) Z AI 49.6 44.2 - $1.55 - 0.82 0.272 - 0.462 - -
7 GPT-5.2 Codex (xhigh) OpenAI 49 43 - $4.813 - 0.899 0.335 - 0.546 - -
8 Gemini 3 Pro Preview (high) Google 48.4 46.5 95.7 $4.5 0.898 0.908 0.372 0.917 0.561 - -
9 GPT-5.1 (high) OpenAI 47.6 44.7 94 $3.438 0.87 0.873 0.265 0.868 0.433 - -
10 Kimi K2.5 (Reasoning) Kimi 46.7 39.5 - $1.2 - 0.879 0.294 - 0.49 - -
11 GPT-5.2 (medium) OpenAI 46.6 44.2 96.7 $4.813 0.859 0.864 0.249 0.894 0.462 - -
12 Gemini 3 Flash Preview (Reasoning) Google 46.4 42.6 97 $1.125 0.89 0.898 0.347 0.908 0.506 - -
13 Claude Opus 4.6 (Non-reasoning, High Effort) Anthropic 46.4 47.6 - $10 - 0.84 0.186 - 0.457 - -
14 Qwen3.5 397B A17B (Reasoning) Alibaba 45 41.3 - $1.35 - 0.893 0.273 - 0.42 - -
15 GPT-5 (high) OpenAI 44.6 36 94.3 $3.438 0.871 0.854 0.265 0.846 0.429 0.994 0.957
16 GPT-5 Codex (high) OpenAI 44.5 38.9 98.7 $3.438 0.865 0.837 0.256 0.84 0.409 - -
17 Claude Sonnet 4.6 (Non-reasoning, High Effort) Anthropic 44.3 46.4 - $6 - 0.799 0.132 - 0.469 - -
18 Claude Opus 4.5 (Non-reasoning) Anthropic 43 42.9 62.7 $10 0.889 0.81 0.129 0.738 0.47 - -
19 Claude 4.5 Sonnet (Reasoning) Anthropic 42.9 38.6 88 $6 0.875 0.834 0.173 0.714 0.447 - -
20 Claude Sonnet 4.6 (Non-reasoning, Low Effort) Anthropic 42.5 43 - $6 - 0.797 0.108 - 0.441 - -
21 GPT-5.1 Codex (high) OpenAI 42.2 36.6 95.7 $3.438 0.86 0.86 0.234 0.849 0.402 - -
22 MiniMax-M2.5 MiniMax 42 37.4 - $0.525 - 0.848 0.191 - 0.426 - -
23 GLM-4.7 (Reasoning) Z AI 42 36.3 95 $0.938 0.856 0.859 0.251 0.894 0.451 - -
24 GPT-5 (medium) OpenAI 41.8 39 91.7 $3.438 0.867 0.842 0.235 0.703 0.411 0.991 0.917
25 DeepSeek V3.2 (Reasoning) DeepSeek 41.6 36.7 92 $0.315 0.862 0.84 0.222 0.862 0.389 - -
26 Grok 4 xAI 41.4 40.5 92.7 $6 0.866 0.877 0.239 0.819 0.457 0.99 0.943
27 MiMo-V2-Flash (Feb 2026) Xiaomi 41.4 33.5 - $0.15 - 0.835 0.2 - 0.383 - -
28 Gemini 3 Pro Preview (low) Google 41.1 39.4 86.7 $4.5 0.895 0.887 0.276 0.857 0.499 - -
29 GPT-5 mini (high) OpenAI 41 35.3 90.7 $0.688 0.837 0.828 0.197 0.838 0.392 - -
30 o3-pro OpenAI 40.7 - - $35 - 0.845 - - - - -
31 Kimi K2 Thinking Kimi 40.7 34.8 94.7 $1.075 0.848 0.838 0.223 0.853 0.424 - -
32 GLM-5 (Non-reasoning) Z AI 40.5 39 - $1.55 - 0.666 0.072 - 0.383 - -
33 Qwen3.5 397B A17B (Non-reasoning) Alibaba 39.9 37.4 - $1.35 - 0.861 0.188 - 0.411 - -
34 Qwen3 Max Thinking Alibaba 39.7 30.5 - $2.4 - 0.861 0.262 - 0.431 - -
35 MiniMax-M2.1 MiniMax 39.5 32.8 82.7 $0.525 0.875 0.83 0.222 0.81 0.407 - -
36 MiMo-V2-Flash (Reasoning) Xiaomi 39.2 31.8 96.3 $0.15 0.843 0.846 0.211 0.868 0.394 - -
37 GPT-5 (low) OpenAI 39 30.7 83 $3.438 0.86 0.808 0.184 0.763 0.391 0.987 0.83
38 GPT-5 mini (medium) OpenAI 38.8 32.9 85 $0.688 0.828 0.803 0.146 0.692 0.41 - -
39 Claude 4 Sonnet (Reasoning) Anthropic 38.6 34.1 74.3 $6 0.842 0.777 0.096 0.655 0.4 0.991 0.773
40 GPT-5.1 Codex mini (high) OpenAI 38.5 36.4 91.7 $0.688 0.82 0.813 0.169 0.836 0.426 - -
41 Grok 4.1 Fast (Reasoning) xAI 38.5 30.9 89.3 $0.275 0.854 0.853 0.176 0.822 0.442 - -
42 o3 OpenAI 38.3 38.4 88.3 $3.5 0.853 0.827 0.2 0.808 0.41 0.992 0.903
43 Kimi K2.5 (Non-reasoning) Kimi 37.2 25.8 - $1.2 - 0.789 0.123 - 0.396 - -
44 Claude 4.5 Sonnet (Non-reasoning) Anthropic 37.1 33.5 37 $6 0.86 0.727 0.071 0.59 0.428 - -
45 Claude 4.5 Haiku (Reasoning) Anthropic 37 32.6 83.7 $2 0.76 0.672 0.097 0.615 0.433 - -
46 KAT-Coder-Pro V1 KwaiKAT 36.1 18.3 94.7 $0.525 0.813 0.764 0.334 0.747 0.366 - -
47 MiniMax-M2 MiniMax 36 29.2 78.3 $0.525 0.82 0.777 0.125 0.826 0.361 - -
48 Nova 2.0 Pro Preview (medium) Amazon 35.6 30.4 89 $3.438 0.83 0.785 0.089 0.73 0.427 - -
49 Gemini 3 Flash Preview (Non-reasoning) Google 35.1 37.8 55.7 $1.125 0.882 0.812 0.141 0.797 0.499 - -
50 Grok 4 Fast (Reasoning) xAI 34.9 27.4 89.7 $0.275 0.85 0.847 0.17 0.832 0.442 - -
51 Claude 3.7 Sonnet (Reasoning) Anthropic 34.6 27.6 56.3 $6 0.837 0.772 0.103 0.473 0.403 0.947 0.487
52 Gemini 2.5 Pro Google 34.5 31.9 87.7 $3.438 0.862 0.844 0.211 0.801 0.428 0.967 0.887
53 DeepSeek V3.2 Speciale DeepSeek 34.1 37.9 96.7 $0.425 0.863 0.871 0.261 0.896 0.44 - -
54 GLM-4.7 (Non-reasoning) Z AI 34.1 32 48 $0.938 0.794 0.664 0.061 0.562 0.354 - -
55 DeepSeek V3.1 Terminus (Reasoning) DeepSeek 33.8 33.7 89.7 $0.8 0.851 0.792 0.152 0.798 0.406 - -
56 GPT-5.2 (Non-reasoning) OpenAI 33.5 34.7 51 $4.813 0.814 0.712 0.073 0.669 0.404 - -
57 Doubao Seed Code ByteDance Seed 33.5 31.3 79.3 $0 0.854 0.764 0.133 0.766 0.407 - -
58 gpt-oss-120B (high) OpenAI 33.3 28.6 93.4 $0.263 0.808 0.782 0.185 0.878 0.389 - -
59 o4-mini (high) OpenAI 33 25.6 90.7 $1.925 0.832 0.784 0.175 0.859 0.465 0.989 0.94
60 Claude 4 Sonnet (Non-reasoning) Anthropic 33 30.6 38 $6 0.837 0.683 0.04 0.449 0.373 0.934 0.407
61 DeepSeek V3.2 Exp (Reasoning) DeepSeek 32.9 33.3 87.7 $0.315 0.85 0.797 0.138 0.789 0.377 - -
62 Qwen3 Max Thinking (Preview) Alibaba 32.5 24.5 82.3 $2.4 0.824 0.776 0.12 0.535 0.387 - -
63 GLM-4.6 (Reasoning) Z AI 32.5 29.5 86 $0.963 0.829 0.78 0.133 0.695 0.384 - -
64 DeepSeek V3.2 (Non-reasoning) DeepSeek 32.1 34.6 59 $0.315 0.837 0.751 0.105 0.593 0.387 - -
65 K-EXAONE (Reasoning) LG AI Research 32.1 27 90.3 $0 0.838 0.783 0.131 0.768 0.356 - -
66 Grok 3 mini Reasoning (high) xAI 32 25.2 84.7 $0.35 0.828 0.791 0.111 0.696 0.406 0.992 0.933
67 Nova 2.0 Pro Preview (low) Amazon 31.9 24.5 63.3 $3.438 0.822 0.751 0.052 0.638 0.387 - -
68 Claude 4.1 Opus (Reasoning) Anthropic 31.9 36.5 80.3 $30 0.88 0.809 0.119 0.654 0.409 - -
69 Qwen3 Max Alibaba 31.3 26.4 80.7 $2.4 0.841 0.764 0.111 0.767 0.383 - -
70 Gemini 2.5 Flash Preview (Sep '25) (Reasoning) Google 31.1 24.6 78.3 $0.85 0.842 0.793 0.127 0.713 0.405 - -
71 Claude 4.5 Haiku (Non-reasoning) Anthropic 31 29.6 39 $2 0.8 0.646 0.043 0.511 0.344 - -
72 Claude 3.7 Sonnet (Non-reasoning) Anthropic 30.8 26.7 21 $6 0.803 0.656 0.048 0.394 0.376 0.85 0.223
73 Kimi K2 0905 Kimi 30.8 25.9 57.3 $1.2 0.819 0.767 0.063 0.61 0.307 - -
74 o1 OpenAI 30.7 20.5 - $26.25 0.841 0.747 0.077 0.679 0.358 0.97 0.723
75 MiMo-V2-Flash (Non-reasoning) Xiaomi 30.6 25.8 67.7 $0.15 0.744 0.656 0.08 0.402 0.259 - -
76 Gemini 2.5 Pro Preview (Mar '25) Google 30.3 46.7 - $0 0.858 0.836 0.171 0.778 0.395 0.98 0.87
77 GLM-4.7-Flash (Reasoning) Z AI 30.1 25.9 - $0.152 - 0.581 0.071 - 0.337 - -
78 GLM-4.6 (Non-reasoning) Z AI 30.1 30.2 44.3 $1 0.784 0.632 0.052 0.561 0.331 - -
79 Nova 2.0 Lite (medium) Amazon 29.6 23.9 88.7 $0.85 0.813 0.768 0.086 0.663 0.368 - -
80 Qwen3 235B A22B 2507 (Reasoning) Alibaba 29.5 23.2 91 $2.625 0.843 0.79 0.15 0.788 0.424 0.984 0.94
81 Gemini 2.5 Pro Preview (May '25) Google 29.5 - - $3.438 0.837 0.822 0.154 0.77 0.416 0.986 0.843
82 ERNIE 5.0 Thinking Preview Baidu 29.1 29.2 85 $0 0.83 0.777 0.127 0.812 0.375 - -
83 Grok Code Fast 1 xAI 28.7 23.7 43.3 $0.525 0.793 0.727 0.075 0.657 0.362 - -
84 DeepSeek V3.1 Terminus (Non-reasoning) DeepSeek 28.4 31.9 53.7 $0.8 0.836 0.751 0.084 0.529 0.321 - -
85 DeepSeek V3.2 Exp (Non-reasoning) DeepSeek 28.3 30 57.7 $0.315 0.836 0.738 0.086 0.554 0.399 - -
86 Apriel-v1.5-15B-Thinker ServiceNow 28.3 18.7 87.5 $0 0.773 0.713 0.12 0.728 0.348 - -
87 Qwen3 Coder Next Alibaba 28.1 22.9 - $0.525 - 0.737 0.093 - 0.323 - -
88 DeepSeek V3.1 (Non-reasoning) DeepSeek 28 28.4 49.7 $0.84 0.833 0.735 0.063 0.577 0.367 - -
89 Nova 2.0 Omni (medium) Amazon 27.9 15.1 89.7 $0.85 0.809 0.76 0.068 0.66 0.362 - -
90 DeepSeek V3.1 (Reasoning) DeepSeek 27.6 29.7 89.7 $0.865 0.851 0.779 0.13 0.784 0.391 - -
91 Apriel-v1.6-15B-Thinker ServiceNow 27.5 22 88 $0 0.79 0.733 0.098 0.807 0.373 - -
92 Qwen3 VL 235B A22B (Reasoning) Alibaba 27.5 20.9 88.3 $2.625 0.836 0.772 0.101 0.646 0.399 - -
93 GPT-5.1 (Non-reasoning) OpenAI 27.4 27.3 38 $3.438 0.801 0.643 0.052 0.494 0.365 - -
94 Claude 4 Opus (Reasoning) Anthropic 27.4 34 73.3 $30 0.873 0.796 0.117 0.636 0.398 0.982 0.757
95 Magistral Medium 1.2 Mistral 27 21.7 82 $2.75 0.815 0.739 0.096 0.75 0.392 - -
96 DeepSeek R1 0528 (May '25) DeepSeek 27 24 76 $2.362 0.849 0.813 0.149 0.77 0.403 0.983 0.893
97 Gemini 2.5 Flash (Reasoning) Google 26.8 22.2 73.3 $0.85 0.832 0.79 0.111 0.695 0.394 0.981 0.823
98 GPT-5 nano (high) OpenAI 26.7 20.3 83.7 $0.138 0.78 0.676 0.082 0.789 0.366 - -
99 Qwen3 Next 80B A3B (Reasoning) Alibaba 26.5 19.5 84.3 $1.875 0.824 0.759 0.117 0.784 0.388 - -
100 Kimi K2 Kimi 26.2 22.1 57 $1.075 0.824 0.766 0.07 0.556 0.345 0.971 0.693
101 GLM-4.5 (Reasoning) Z AI 26.2 26.3 73.7 $1 0.835 0.782 0.122 0.738 0.348 0.979 0.873
102 o3-mini OpenAI 25.9 17.9 - $1.925 0.791 0.748 0.087 0.717 0.399 0.973 0.77
103 Qwen3 Max (Preview) Alibaba 25.9 25.5 75 $2.4 0.838 0.764 0.093 0.651 0.37 - -
104 o1-pro OpenAI 25.8 - - $262.5 - - - - - - -
105 GPT-5 nano (medium) OpenAI 25.7 22.9 78.3 $0.138 0.772 0.67 0.076 0.763 0.338 - -
106 GPT-4.1 OpenAI 25.6 21.8 34.7 $3.5 0.806 0.666 0.046 0.457 0.381 0.913 0.437
107 Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) Google 25.5 22.1 56.7 $0.85 0.836 0.766 0.078 0.625 0.375 - -
108 o3-mini (high) OpenAI 25.1 17.3 - $1.925 0.802 0.773 0.123 0.734 0.398 0.985 0.86
109 Grok 3 xAI 25 19.8 58 $6 0.799 0.693 0.051 0.425 0.368 0.87 0.33
110 Seed-OSS-36B-Instruct ByteDance Seed 25 16.7 84.7 $0.3 0.815 0.726 0.091 0.765 0.365 - -
111 Qwen3 235B A22B 2507 Instruct Alibaba 24.7 22.1 71.7 $1.225 0.828 0.753 0.106 0.524 0.36 0.98 0.717
112 Qwen3 Coder 480B A35B Instruct Alibaba 24.6 24.6 39.3 $3 0.788 0.618 0.044 0.585 0.359 0.942 0.477
113 Sonar Reasoning Pro Perplexity 24.6 - - $0 - - - - - 0.957 0.79
114 gpt-oss-20B (high) OpenAI 24.5 18.5 89.3 $0.1 0.748 0.688 0.098 0.777 0.344 - -
115 K2 Think V2 MBZUAI Institute of Foundation Models 24.5 15.5 - $0 - 0.713 0.095 - 0.33 - -
116 Qwen3 VL 32B (Reasoning) Alibaba 24.5 14.5 84.7 $2.625 0.818 0.733 0.096 0.738 0.285 - -
117 NVIDIA Nemotron 3 Nano 30B A3B (Reasoning) NVIDIA 24.3 19 91 $0.105 0.794 0.757 0.102 0.741 0.296 - -
118 Gemini 2.5 Flash Preview (Reasoning) Google 24.3 - - $0 0.8 0.698 0.116 0.505 0.359 0.981 0.843
119 MiniMax M1 80k MiniMax 24.3 14.5 61 $0.963 0.816 0.697 0.082 0.711 0.374 0.98 0.847
120 Nova 2.0 Lite (low) Amazon 24.2 13.6 46.7 $0.85 0.788 0.698 0.042 0.469 0.333 - -
121 gpt-oss-120B (low) OpenAI 23.9 15.5 66.7 $0.263 0.775 0.672 0.052 0.707 0.36 - -
122 HyperCLOVA X SEED Think (32B) Naver 23.7 17.5 59 $0 0.785 0.615 0.055 0.629 0.284 - -
123 o1-preview OpenAI 23.7 34 - $28.875 - - - - - 0.924 -
124 GPT-5 (minimal) OpenAI 23.7 25.1 31.7 $3.438 0.806 0.673 0.054 0.558 0.388 0.861 0.367
125 Claude 4.1 Opus (Non-reasoning) Anthropic 23.6 - - $30 - - - - - - -
126 Grok 4.1 Fast (Non-reasoning) xAI 23.5 19.5 34.3 $0.275 0.743 0.637 0.05 0.399 0.296 - -
127 GLM-4.6V (Reasoning) Z AI 23.5 19.7 85.3 $0.45 0.799 0.719 0.089 0.16 0.304 - -
128 Nova 2.0 Omni (low) Amazon 23.2 13.9 56 $0.85 0.798 0.699 0.04 0.592 0.343 - -
129 GLM-4.5-Air Z AI 23.2 23.8 80.7 $0.425 0.815 0.733 0.068 0.684 0.306 0.965 0.673
130 K-EXAONE (Non-reasoning) LG AI Research 23 13.5 44 $0 0.81 0.695 0.054 - 0.27 - -
131 Mi:dm K 2.5 Pro Korea Telecom 23 12.6 76.7 $0 0.809 0.701 0.077 0.656 0.332 - -
132 Nova 2.0 Pro Preview (Non-reasoning) Amazon 22.9 20.5 30.7 $3.438 0.772 0.636 0.04 0.473 0.281 - -
133 Mistral Large 3 Mistral 22.7 22.7 38 $0.75 0.807 0.68 0.041 0.465 0.362 - -
134 Grok 4 Fast (Non-reasoning) xAI 22.6 19 41.3 $0.275 0.73 0.606 0.05 0.401 0.329 - -
135 Ring-1T InclusionAI 22.5 16.8 89.3 $0 0.806 0.774 0.102 0.643 0.367 - -
136 Qwen3 30B A3B 2507 (Reasoning) Alibaba 22.4 14.7 56.3 $0.75 0.805 0.707 0.098 0.707 0.333 0.976 0.907
137 GPT-4.1 mini OpenAI 22.4 18.5 46.3 $0.7 0.781 0.664 0.046 0.483 0.404 0.925 0.43
138 Claude 4 Opus (Non-reasoning) Anthropic 22.2 - 36.3 $30 0.86 0.701 0.059 0.542 0.409 0.941 0.563
139 INTELLECT-3 Prime Intellect 22.1 19.1 88 $0.425 0.822 0.761 0.121 0.777 0.391 - -
140 Devstral 2 Mistral 22 23.7 36.7 $0 0.762 0.594 0.036 0.448 0.331 - -
141 GPT-5 (ChatGPT) OpenAI 21.8 21.2 48.3 $3.438 0.82 0.686 0.058 0.543 0.378 - -
142 DeepSeek V3 0324 DeepSeek 21.8 22 41 $1.25 0.819 0.655 0.052 0.405 0.358 0.942 0.52
143 Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning) Google 21.6 18.1 68.7 $0.175 0.808 0.709 0.066 0.688 0.287 - -
144 Solar Open 100B (Reasoning) Upstage 21.6 10.5 - $0 - 0.657 0.092 - 0.269 - -
145 Grok 3 Reasoning Beta xAI 21.6 - - $0 - - - - - - -
146 GLM-4.7-Flash (Non-reasoning) Z AI 21.5 11 - $0.152 - 0.452 0.049 - 0.255 - -
147 Mistral Medium 3.1 Mistral 21.1 18.3 38.3 $0.8 0.683 0.588 0.044 0.406 0.338 - -
148 MiniMax M1 40k MiniMax 20.9 14.1 13.7 $0 0.808 0.682 0.075 0.657 0.378 0.972 0.813
149 gpt-oss-20B (low) OpenAI 20.8 14.4 62.3 $0.1 0.718 0.611 0.051 0.652 0.34 - -
150 K2-V2 (high) MBZUAI Institute of Foundation Models 20.7 16.1 78.3 $0 0.786 0.681 0.098 0.694 0.286 - -
151 GPT-5 mini (minimal) OpenAI 20.7 21.9 46.7 $0.688 0.775 0.687 0.05 0.545 0.369 - -
152 Qwen3 VL 235B A22B Instruct Alibaba 20.6 16.5 70.7 $1.225 0.823 0.712 0.063 0.594 0.359 - -
153 Tri-21B-think Preview Trillion Labs 20.5 7.4 - $0 - 0.538 0.057 - 0.178 - -
154 Gemini 2.5 Flash (Non-reasoning) Google 20.5 17.8 60.3 $0.85 0.809 0.683 0.051 0.495 0.291 0.932 0.5
155 o1-mini OpenAI 20.4 - - $0 0.742 0.603 0.049 0.576 0.323 0.944 0.603
156 Qwen3 Next 80B A3B Instruct Alibaba 20.1 15.3 66.3 $0.875 0.819 0.738 0.073 0.684 0.307 - -
157 Qwen3 Coder 30B A3B Instruct Alibaba 20 19.4 29 $0.9 0.706 0.516 0.04 0.403 0.278 0.893 0.297
158 GPT-4.5 (Preview) OpenAI 20 - - $0 - - - - - - -
159 Qwen3 235B A22B (Reasoning) Alibaba 19.8 17.4 82 $2.625 0.828 0.7 0.117 0.622 0.399 0.93 0.84
160 QwQ 32B Alibaba 19.7 - 29 $0.473 0.764 0.593 0.082 0.631 0.358 0.957 0.78
161 Qwen3 VL 30B A3B (Reasoning) Alibaba 19.6 13.1 82.3 $0.75 0.807 0.72 0.087 0.697 0.288 - -
162 Gemini 2.0 Flash Thinking Experimental (Jan '25) Google 19.6 24.1 - $0 0.798 0.701 0.071 0.321 0.329 0.944 0.5
163 Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning) Google 19.4 14.5 46.7 $0.175 0.796 0.651 0.046 0.641 0.285 - -
164 Devstral Small 2 Mistral 19.3 20.7 34.3 $0 0.678 0.532 0.034 0.348 0.288 - -
165 Motif-2-12.7B-Reasoning Motif Technologies 19.1 11.9 80.3 $0 0.796 0.695 0.082 0.651 0.282 - -
166 Ling-1T InclusionAI 19 18.8 71.3 $0 0.822 0.719 0.072 0.677 0.352 - -
167 Nova Premier Amazon 18.9 13.8 17.3 $5 0.733 0.569 0.047 0.317 0.279 0.839 0.17
168 GPT-4o (Aug '24) OpenAI 18.8 16.6 - $4.375 - 0.521 0.029 0.317 0.331 0.795 0.117
169 DeepSeek R1 (Jan '25) DeepSeek 18.8 15.9 68 $2.362 0.844 0.708 0.093 0.617 0.357 0.966 0.683
170 Solar Pro 2 (Preview) (Reasoning) Upstage 18.8 - - $0 0.768 0.578 0.057 0.462 0.164 0.9 0.663
171 K2-V2 (medium) MBZUAI Institute of Foundation Models 18.7 14 64.7 $0 0.761 0.598 0.044 0.541 0.252 - -
172 Claude 3.5 Haiku Anthropic 18.7 10.7 - $1.6 0.634 0.408 0.035 0.314 0.274 0.721 0.033
173 Mistral Medium 3 Mistral 18.7 13.6 30.3 $0.8 0.76 0.578 0.043 0.4 0.331 0.907 0.44
174 Magistral Medium 1 Mistral 18.7 16 40.3 $2.75 0.753 0.679 0.095 0.527 0.297 0.917 0.7
175 Llama Nemotron Super 49B v1.5 (Reasoning) NVIDIA 18.6 15.2 76.7 $0.175 0.814 0.748 0.068 0.737 0.348 0.983 0.86
176 Hermes 4 - Llama-3.1 405B (Reasoning) Nous Research 18.6 16 69.7 $1.5 0.829 0.727 0.103 0.686 0.252 - -
177 Tri-21B-Think Trillion Labs 18.6 6.3 - $0 - 0.601 0.061 - 0.174 - -
178 Qwen3 4B 2507 (Reasoning) Alibaba 18.6 9.5 82.7 $0 0.743 0.667 0.059 0.641 0.256 - -
179 GPT-4o (March 2025, chatgpt-4o-latest) OpenAI 18.6 - 25.7 $7.5 0.803 0.655 0.05 0.425 0.366 0.893 0.327
180 Devstral Medium Mistral 18.6 15.9 4.7 $0.8 0.708 0.492 0.038 0.337 0.294 0.707 0.067
181 Llama 3.3 Nemotron Super 49B v1 (Reasoning) NVIDIA 18.5 9.4 54.7 $0 0.785 0.643 0.065 0.277 0.282 0.959 0.583
182 Gemini 2.0 Flash (Feb '25) Google 18.5 13.6 21.7 $0.263 0.779 0.623 0.053 0.334 0.333 0.93 0.33
183 Llama 4 Maverick Meta 18.3 15.6 19.3 $0.461 0.809 0.671 0.048 0.397 0.331 0.889 0.39
184 Magistral Small 1.2 Mistral 18.1 14.8 80.3 $0.75 0.768 0.663 0.061 0.723 0.352 - -
185 Gemini 2.0 Pro Experimental (Feb '25) Google 18.1 25.5 - $0 0.805 0.622 0.068 0.347 0.312 0.923 0.36
186 Devstral Small (May '25) Mistral 18 12.2 - $0.15 0.632 0.434 0.04 0.258 0.245 0.684 0.067
187 Nova 2.0 Lite (Non-reasoning) Amazon 17.9 12.5 33.7 $0.85 0.743 0.603 0.03 0.346 0.24 - -
188 Sonar Reasoning Perplexity 17.9 - - $0 - 0.623 - - - 0.921 0.77
189 Gemini 2.5 Flash Preview (Non-reasoning) Google 17.8 - - $0 0.783 0.594 0.05 0.406 0.233 0.926 0.433
190 Hermes 4 - Llama-3.1 405B (Non-reasoning) Nous Research 17.6 18.1 15.3 $1.5 0.729 0.536 0.042 0.546 0.346 - -
191 Gemini 2.5 Flash-Lite (Reasoning) Google 17.4 9.5 53.3 $0.175 0.759 0.625 0.064 0.593 0.193 0.969 0.703
192 GPT-4o (Nov '24) OpenAI 17.3 16.7 6 $4.375 0.748 0.543 0.033 0.309 0.333 0.759 0.15
193 Qwen3 VL 32B Instruct Alibaba 17.2 15.6 68.3 $1.225 0.791 0.671 0.063 0.514 0.301 - -
194 DeepSeek R1 Distill Qwen 32B DeepSeek 17.2 - 63 $0.27 0.739 0.615 0.055 0.27 0.376 0.941 0.687
195 GLM-4.6V (Non-reasoning) Z AI 17.1 11.1 26.3 $0.45 0.752 0.566 0.037 0.411 0.272 - -
196 Qwen3 235B A22B (Non-reasoning) Alibaba 16.9 14 23.7 $1.225 0.762 0.613 0.047 0.343 0.299 0.902 0.327
197 Gemini 2.0 Flash (experimental) Google 16.8 - - $0 0.782 0.636 0.047 0.21 0.34 0.911 0.3
198 Magistral Small 1 Mistral 16.8 11.1 41.3 $0 0.746 0.641 0.072 0.514 0.241 0.963 0.713
199 Nova 2.0 Omni (Non-reasoning) Amazon 16.6 13.8 37 $0.85 0.719 0.555 0.039 0.305 0.279 - -
200 EXAONE 4.0 32B (Reasoning) LG AI Research 16.6 14 80 $0.7 0.818 0.739 0.105 0.747 0.344 0.977 0.843
201 Qwen3 VL 8B (Reasoning) Alibaba 16.6 9.8 30.7 $0.66 0.749 0.579 0.033 0.353 0.219 - -
202 Qwen3 32B (Reasoning) Alibaba 16.5 13.8 73 $2.625 0.798 0.668 0.083 0.546 0.354 0.961 0.807
203 DeepSeek R1 0528 Qwen3 8B DeepSeek 16.4 7.8 63.7 $0 0.739 0.612 0.056 0.513 0.204 0.932 0.65
204 DeepSeek V3 (Dec '24) DeepSeek 16.4 16.4 26 $0.625 0.752 0.557 0.036 0.359 0.354 0.887 0.253
205 Qwen2.5 Max Alibaba 16.3 - - $2.8 0.762 0.587 0.045 0.359 0.337 0.835 0.233
206 Qwen3 14B (Reasoning) Alibaba 16.2 13.1 55.7 $1.313 0.774 0.604 0.043 0.523 0.316 0.961 0.763
207 Ministral 3 14B Mistral 16 10.9 30 $0.2 0.693 0.572 0.046 0.351 0.236 - -
208 DeepSeek R1 Distill Llama 70B DeepSeek 16 11.4 53.7 $0.875 0.795 0.402 0.061 0.266 0.312 0.935 0.67
209 Hermes 4 - Llama-3.1 70B (Reasoning) Nous Research 16 14.4 68.7 $0.198 0.811 0.699 0.079 0.653 0.341 - -
210 Qwen3 VL 30B A3B Instruct Alibaba 16 14.3 72.3 $0.35 0.764 0.695 0.064 0.476 0.308 - -
211 GPT-4o (May '24) OpenAI 16 24.2 - $7.5 0.74 0.526 0.028 0.334 0.309 0.791 0.11
212 Gemini 1.5 Pro (Sep '24) Google 16 23.6 - $0 0.75 0.589 0.049 0.316 0.295 0.876 0.23
213 Solar Pro 2 (Preview) (Non-reasoning) Upstage 16 - - $0 0.725 0.544 0.038 0.385 0.272 0.871 0.297
214 Claude 3.5 Sonnet (Oct '24) Anthropic 15.9 30.2 - $6 0.772 0.599 0.039 0.381 0.366 0.771 0.157
215 Falcon-H1R-7B TII UAE 15.8 9.8 80 $0 0.725 0.661 0.108 0.724 0.249 - -
216 DeepSeek R1 Distill Qwen 14B DeepSeek 15.8 - 55.7 $0 0.74 0.484 0.044 0.376 0.239 0.949 0.667
217 Qwen3 Omni 30B A3B (Reasoning) Alibaba 15.6 12.7 74 $0.43 0.792 0.726 0.073 0.679 0.306 - -
218 Qwen2.5 Instruct 72B Alibaba 15.6 11.9 14 $0 0.72 0.491 0.042 0.276 0.267 0.858 0.16
219 Ling-flash-2.0 InclusionAI 15.5 16.7 65.3 $0.247 0.777 0.657 0.063 0.589 0.289 - -
220 Sonar Perplexity 15.5 - - $1 0.689 0.471 0.073 0.295 0.229 0.817 0.487
221 Step3 VL 10B StepFun 15.4 13.9 - $0 - 0.69 0.102 - 0.311 - -
222 Qwen3 30B A3B (Reasoning) Alibaba 15.3 11 72.3 $0.75 0.777 0.616 0.066 0.506 0.285 0.959 0.753
223 Devstral Small (Jul '25) Mistral 15.2 12.1 29.3 $0.15 0.622 0.414 0.037 0.254 0.243 0.635 0.003
224 Sonar Pro Perplexity 15.2 - - $6 0.755 0.578 0.079 0.275 0.226 0.745 0.29
225 QwQ 32B-Preview Alibaba 15.2 - - $0.135 0.648 0.557 0.048 0.337 0.038 0.91 0.453
226 Mistral Large 2 (Nov '24) Mistral 15.1 13.8 14 $3 0.697 0.486 0.04 0.293 0.292 0.736 0.11
227 Mistral Small 3.2 Mistral 15 13.3 27 $0.15 0.681 0.505 0.043 0.275 0.264 0.883 0.323
228 Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) NVIDIA 15 13.1 63.7 $0.9 0.825 0.728 0.081 0.641 0.347 0.952 0.747
229 Qwen3 30B A3B 2507 Instruct Alibaba 15 14.2 66.3 $0.35 0.777 0.659 0.068 0.515 0.304 0.975 0.727
230 Solar Pro 2 (Reasoning) Upstage 14.9 12.1 61.3 $0 0.805 0.687 0.07 0.616 0.302 0.967 0.69
231 ERNIE 4.5 300B A47B Baidu 14.9 14.5 41.3 $0.485 0.776 0.811 0.035 0.467 0.315 0.931 0.493
232 GLM-4.5V (Reasoning) Z AI 14.9 10.9 73 $0.9 0.788 0.684 0.059 0.604 0.221 - -
233 NVIDIA Nemotron Nano 9B V2 (Reasoning) NVIDIA 14.8 8.3 69.7 $0.07 0.742 0.57 0.046 0.724 0.22 - -
234 NVIDIA Nemotron Nano 12B v2 VL (Reasoning) NVIDIA 14.8 11.8 75 $0.3 0.759 0.572 0.053 0.694 0.262 - -
235 Gemini 2.0 Flash-Lite (Feb '25) Google 14.7 - - $0 0.724 0.535 0.036 0.185 0.25 0.873 0.277
236 Ministral 3 8B Mistral 14.6 10 31.7 $0.15 0.642 0.471 0.043 0.303 0.208 - -
237 Llama Nemotron Super 49B v1.5 (Non-reasoning) NVIDIA 14.5 10.5 8 $0.175 0.692 0.481 0.043 0.29 0.238 0.77 0.137
238 Gemini 2.0 Flash-Lite (Preview) Google 14.5 - - $0 - 0.542 0.044 0.179 0.247 0.873 0.303
239 Qwen3 32B (Non-reasoning) Alibaba 14.5 - 19.7 $1.225 0.727 0.535 0.043 0.288 0.28 0.869 0.303
240 Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning) NVIDIA 14.4 - 50 $0 0.556 0.408 0.051 0.493 0.101 0.947 0.707
241 Kimi Linear 48B A3B Instruct Kimi 14.4 14.2 36.3 $0 0.585 0.412 0.027 0.378 0.199 - -
242 K2-V2 (low) MBZUAI Institute of Foundation Models 14.4 10.5 35.3 $0 0.713 0.541 0.039 0.393 0.223 - -
243 Llama 3.3 Nemotron Super 49B v1 (Non-reasoning) NVIDIA 14.3 7.6 7.7 $0 0.698 0.517 0.035 0.28 0.229 0.775 0.193
244 Qwen3 VL 8B Instruct Alibaba 14.3 7.3 27.3 $0.31 0.686 0.427 0.029 0.332 0.174 - -
245 Llama 3.3 Instruct 70B Meta 14.2 10.7 7.7 $0.675 0.713 0.498 0.04 0.288 0.26 0.773 0.3
246 Llama 3.1 Instruct 405B Meta 14.2 14.5 3 $4.188 0.732 0.515 0.042 0.305 0.299 0.703 0.213
247 Olmo 3.1 32B Think Allen Institute for AI 14.2 9.8 77.3 $0 0.763 0.591 0.06 0.695 0.293 - -
248 Claude 3.5 Sonnet (June '24) Anthropic 14.2 26 - $6 0.751 0.56 0.037 - 0.316 0.695 0.097
249 Qwen3 4B (Reasoning) Alibaba 14.2 - 22.3 $0.398 0.696 0.522 0.051 0.465 0.035 0.933 0.657
250 GPT-4o (ChatGPT) OpenAI 14.1 - - $7.5 0.773 0.511 0.037 - 0.334 0.797 0.103
251 Llama 3.1 Tulu3 405B Allen Institute for AI 14.1 - - $0 0.716 0.516 0.035 0.291 0.302 0.778 0.133
252 Ring-flash-2.0 InclusionAI 14 10.6 83.7 $0.247 0.793 0.725 0.089 0.628 0.168 - -
253 Pixtral Large Mistral 14 - 2.3 $3 0.701 0.505 0.036 0.261 0.292 0.714 0.07
254 Mistral Small 3.1 Mistral 14 13.9 3.7 $0.15 0.659 0.454 0.048 0.212 0.265 0.707 0.093
255 Grok 2 (Dec '24) xAI 13.9 - - $0 0.709 0.51 0.038 0.267 0.285 0.778 0.133
256 Gemini 1.5 Flash (Sep '24) Google 13.8 - - $0 0.68 0.463 0.035 0.273 0.267 0.827 0.18
257 Qwen3 VL 4B (Reasoning) Alibaba 13.7 6.7 25.7 $0 0.7 0.494 0.044 0.32 0.171 - -
258 GPT-4 Turbo OpenAI 13.7 21.5 - $15 0.694 - 0.033 0.291 0.319 0.737 0.15
259 GPT-5 nano (minimal) OpenAI 13.7 14.2 27.3 $0.138 0.556 0.428 0.041 0.47 0.291 - -
260 Hermes 4 - Llama-3.1 70B (Non-reasoning) Nous Research 13.6 9.2 11.3 $0.198 0.664 0.491 0.036 0.269 0.277 - -
261 Llama 4 Scout Meta 13.5 6.7 14 $0.287 0.752 0.587 0.043 0.299 0.17 0.844 0.283
262 Solar Pro 2 (Non-reasoning) Upstage 13.5 11.3 30 $0 0.75 0.561 0.038 0.424 0.248 0.889 0.407
263 Nova Pro Amazon 13.5 11 7 $1.4 0.691 0.499 0.034 0.233 0.208 0.786 0.107
264 Llama 3.1 Nemotron Instruct 70B NVIDIA 13.4 10.8 11 $1.2 0.69 0.465 0.046 0.169 0.233 0.733 0.247
265 Command A Cohere 13.4 9.9 13 $4.375 0.712 0.527 0.046 0.287 0.281 0.819 0.097
266 NVIDIA Nemotron 3 Nano 30B A3B (Non-reasoning) NVIDIA 13.3 15.8 13.3 $0.096 0.579 0.399 0.046 0.36 0.23 - -
267 Grok Beta xAI 13.3 - - $0 0.703 0.471 0.047 0.241 0.295 0.737 0.103
268 Qwen3 4B 2507 Instruct Alibaba 13.2 9.1 52.3 $0 0.672 0.517 0.047 0.377 0.181 - -
269 Qwen2.5 Instruct 32B Alibaba 13.2 - - $0 0.697 0.466 0.038 0.248 0.229 0.805 0.11
270 NVIDIA Nemotron Nano 9B V2 (Non-reasoning) NVIDIA 13.1 7.5 62.3 $0.102 0.739 0.557 0.04 0.701 0.209 - -
271 Qwen3 8B (Reasoning) Alibaba 13.1 9 19 $0.66 0.743 0.589 0.042 0.406 0.226 0.904 0.747
272 Mistral Large 2 (Jul '24) Mistral 13 - 0 $3 0.683 0.472 0.032 0.267 0.271 0.714 0.093
273 GPT-4.1 nano OpenAI 12.9 11.2 24 $0.175 0.657 0.512 0.039 0.326 0.259 0.848 0.237
274 Qwen2.5 Coder Instruct 32B Alibaba 12.9 - - $0.141 0.635 0.417 0.038 0.295 0.271 0.767 0.12
275 GPT-4 OpenAI 12.8 13.1 - $37.5 - - - - - - -
276 Mistral Small 3 Mistral 12.7 - 4.3 $0.15 0.652 0.462 0.041 0.252 0.236 0.715 0.08
277 Qwen3 14B (Non-reasoning) Alibaba 12.7 12.4 58 $0.613 0.675 0.47 0.042 0.28 0.265 0.871 0.28
278 GPT-4o mini OpenAI 12.6 - 14.7 $0.263 0.648 0.426 0.04 0.234 0.229 0.789 0.117
279 Gemini 2.5 Flash-Lite (Non-reasoning) Google 12.5 7.4 35.3 $0.175 0.724 0.474 0.037 0.4 0.177 0.926 0.5
280 Claude 3 Opus Anthropic 12.5 19.5 - $30 0.696 0.489 0.031 0.279 0.233 0.641 0.033
281 DeepSeek-V2.5 (Dec '24) DeepSeek 12.5 - - $0 - - - - - 0.763 -
282 GLM-4.5V (Non-reasoning) Z AI 12.5 10.8 15.3 $0.9 0.751 0.573 0.036 0.352 0.188 - -
283 Qwen3 4B (Non-reasoning) Alibaba 12.5 - - $0.188 0.586 0.398 0.037 0.233 0.167 0.843 0.213
284 Nova Lite Amazon 12.4 5.1 7 $0.105 0.59 0.433 0.046 0.167 0.139 0.765 0.107
285 Qwen3 30B A3B (Non-reasoning) Alibaba 12.4 13.3 21.7 $0.35 0.71 0.515 0.046 0.322 0.264 0.863 0.26
286 Gemini 2.0 Flash Thinking Experimental (Dec '24) Google 12.3 - - $0 - - - - - 0.48 -
287 DeepSeek-V2.5 DeepSeek 12.3 - - $0 - - - - - - -
288 Llama 3.1 Instruct 70B Meta 12.2 10.9 4 $0.56 0.676 0.409 0.046 0.232 0.267 0.649 0.173
289 Claude 3 Haiku Anthropic 12.1 6.7 - $0.5 - 0.374 0.039 0.154 0.186 0.394 0.01
290 Mistral Saba Mistral 12.1 - - $0 0.611 0.424 0.041 - 0.241 0.677 0.13
291 DeepSeek R1 Distill Llama 8B DeepSeek 12.1 - 41.3 $0 0.543 0.302 0.042 0.233 0.119 0.853 0.333
292 R1 1776 Perplexity 12 - - $0 - - - - - 0.954 -
293 Olmo 3.1 32B Instruct Allen Institute for AI 12 5.6 - $0.3 - 0.539 0.049 - 0.167 - -
294 Gemini 1.5 Pro (May '24) Google 12 19.8 - $0 0.657 0.371 0.039 0.244 0.274 0.673 0.08
295 Olmo 3 32B Think Allen Institute for AI 12 10.5 73.7 $0 0.759 0.61 0.059 0.672 0.286 - -
296 Reka Flash (Sep '24) Reka AI 12 - - $0.35 - - - - - 0.529 -
297 Qwen2.5 Turbo Alibaba 12 - - $0.087 0.633 0.41 0.042 0.163 0.153 0.805 0.12
298 Llama 3.2 Instruct 90B (Vision) Meta 11.9 - - $0.72 0.671 0.432 0.049 0.214 0.24 0.629 0.05
299 Grok-1 xAI 11.7 - - $0 - - - - - - -
300 Llama 3.1 Instruct 8B Meta 11.7 4.9 4.3 $0.1 0.476 0.259 0.051 0.116 0.132 0.519 0.077
301 Qwen2 Instruct 72B Alibaba 11.7 - - $0 0.622 0.371 0.037 0.159 0.229 0.701 0.147
302 EXAONE 4.0 32B (Non-reasoning) LG AI Research 11.5 9.4 39.3 $0.7 0.768 0.628 0.049 0.472 0.252 0.939 0.47
303 Ministral 3 3B Mistral 11.2 4.8 22 $0.1 0.524 0.358 0.053 0.247 0.144 - -
304 Gemini 1.5 Flash-8B Google 11.1 - - $0 0.569 0.359 0.045 0.217 0.229 0.689 0.033
305 Phi-4 Mini Instruct Microsoft Azure 10.9 3.6 6.7 $0 0.465 0.331 0.042 0.126 0.108 0.696 0.03
306 DeepHermes 3 - Mistral 24B Preview (Non-reasoning) Nous Research 10.9 - - $0 0.58 0.382 0.039 0.195 0.228 0.595 0.047
307 Granite 4.0 H Small IBM 10.8 8.5 13.7 $0.107 0.624 0.416 0.037 0.251 0.209 - -
308 Qwen3 Omni 30B A3B Instruct Alibaba 10.7 7.2 52.3 $0.43 0.725 0.62 0.051 0.422 0.186 - -
309 Jamba 1.5 Large AI21 Labs 10.7 - - $3.5 0.572 0.427 0.04 0.143 0.163 0.606 0.047
310 DeepSeek-Coder-V2 DeepSeek 10.6 - - $0 - - - - - 0.743 -
311 OLMo 2 32B Allen Institute for AI 10.6 2.7 3.3 $0 0.511 0.328 0.037 0.068 0.08 - -
312 Hermes 3 - Llama-3.1 70B Nous Research 10.6 - - $0.3 0.571 0.401 0.041 0.188 0.231 0.538 0.023
313 Jamba 1.6 Large AI21 Labs 10.6 - - $3.5 0.565 0.387 0.04 0.172 0.184 0.58 0.047
314 Qwen3 8B (Non-reasoning) Alibaba 10.6 7.1 24.3 $0.31 0.643 0.452 0.028 0.202 0.168 0.828 0.243
315 Phi-4 Microsoft Azure 10.5 11.2 18 $0.219 0.714 0.575 0.041 0.231 0.26 0.81 0.143
316 Gemini 1.5 Flash (May '24) Google 10.5 - - $0 0.574 0.324 0.042 0.196 0.181 0.554 0.093
317 Nova Micro Amazon 10.3 4.1 6 $0.061 0.531 0.358 0.047 0.14 0.094 0.703 0.08
318 Jamba Reasoning 3B AI21 Labs 10.3 2.5 10.7 $0 0.577 0.333 0.046 0.21 0.059 - -
319 Claude 3 Sonnet Anthropic 10.3 - - $6 0.579 0.4 0.038 0.175 0.229 0.414 0.047
320 Gemma 3 27B Instruct Google 10.2 9.6 20.7 $0 0.669 0.428 0.047 0.137 0.212 0.883 0.253
321 Mistral Small (Sep '24) Mistral 10.2 - - $0.3 0.529 0.381 0.043 0.141 0.156 0.563 0.063
322 NVIDIA Nemotron Nano 12B v2 VL (Non-reasoning) NVIDIA 10.1 5.9 26.7 $0.3 0.649 0.439 0.045 0.345 0.176 - -
323 Gemini 1.0 Ultra Google 10.1 17.6 - $0 - - - - - - -
324 Gemma 3n E4B Instruct Preview (May '25) Google 10.1 - - $0 0.483 0.278 0.049 0.138 0.086 0.749 0.107
325 Phi-3 Mini Instruct 3.8B Microsoft Azure 10.1 3 0.3 $0.228 0.435 0.319 0.044 0.116 0.09 0.457 0.04
326 Phi-4 Multimodal Instruct Microsoft Azure 10 - - $0 0.485 0.315 0.044 0.131 0.11 0.693 0.093
327 Qwen2.5 Coder Instruct 7B Alibaba 10 - - $0 0.473 0.339 0.048 0.126 0.148 0.66 0.053
328 Mistral Large (Feb '24) Mistral 9.9 - - $6 0.515 0.351 0.034 0.178 0.208 0.527 0
329 Mixtral 8x22B Instruct Mistral 9.8 - - $0 0.537 0.332 0.041 0.148 0.188 0.545 0
330 Llama 3.2 Instruct 3B Meta 9.7 - 3.3 $0.08 0.347 0.255 0.052 0.083 0.052 0.489 0.067
331 Olmo 3 7B Think Allen Institute for AI 9.5 7.6 70.7 $0.14 0.655 0.516 0.057 0.617 0.212 - -
332 Reka Flash 3 Reka AI 9.5 8.9 33.7 $0.35 0.669 0.529 0.051 0.435 0.267 0.893 0.51
333 Qwen3 VL 4B Instruct Alibaba 9.5 4.5 37 $0 0.634 0.371 0.037 0.29 0.137 - -
334 Qwen1.5 Chat 110B Alibaba 9.5 - - $0 - 0.289 - - - - -
335 Jamba 1.7 Large AI21 Labs 9.3 7.8 2.3 $3.5 0.577 0.39 0.038 0.181 0.188 0.6 0.057
336 Claude 2.1 Anthropic 9.3 14 - $0 0.495 0.319 0.042 0.195 0.184 0.374 0.033
337 OLMo 2 7B Allen Institute for AI 9.3 1.2 0.7 $0 0.282 0.288 0.055 0.041 0.037 - -
338 Molmo 7B-D Allen Institute for AI 9.2 1.2 0 $0 0.371 0.24 0.051 0.039 0.036 - -
339 Claude 2.0 Anthropic 9.1 12.9 - $0 0.486 0.344 - 0.171 0.194 - 0
340 DeepSeek R1 Distill Qwen 1.5B DeepSeek 9.1 - 22 $0 0.269 0.098 0.033 0.07 0.066 0.687 0.177
341 DeepSeek-V2-Chat DeepSeek 9.1 - - $0 - - - - - - -
342 GPT-3.5 Turbo OpenAI 9 10.7 - $0.75 0.462 0.297 - - - 0.441 -
343 Mistral Small (Feb '24) Mistral 9 - - $1.5 0.419 0.302 0.044 0.111 0.134 0.562 0.007
344 Mistral Medium Mistral 9 - - $4.088 0.491 0.349 0.034 0.099 0.118 0.405 0.037
345 Ling-mini-2.0 InclusionAI 8.9 5 49.3 $0.122 0.671 0.562 0.05 0.429 0.135 - -
346 Llama 3.2 Instruct 11B (Vision) Meta 8.8 4.3 1.7 $0.16 0.464 0.221 0.052 0.11 0.112 0.516 0.093
347 Gemma 3 12B Instruct Google 8.8 6.3 18.3 $0 0.595 0.349 0.048 0.137 0.174 0.853 0.22
348 Llama 3 Instruct 70B Meta 8.8 6.8 - $0.871 0.574 0.379 0.044 0.198 0.189 0.483 0
349 LFM 40B Liquid AI 8.8 - - $0 0.425 0.327 0.049 0.096 0.071 0.48 0.023
350 Arctic Instruct Snowflake 8.8 - - $0 - - - - - - -
351 Qwen Chat 72B Alibaba 8.8 - - $0 - - - - - - -
352 PALM-2 Google 8.6 4.6 - $0 - - - - - - -
353 Gemini 1.0 Pro Google 8.5 - - $0 0.431 0.277 0.046 0.116 0.117 0.403 0.007
354 DeepSeek Coder V2 Lite Instruct DeepSeek 8.5 - - $0 0.429 0.319 0.053 0.158 0.139 - -
355 DeepSeek LLM 67B Chat (V1) DeepSeek 8.4 - - $0 - - - - - - -
356 Exaone 4.0 1.2B (Reasoning) LG AI Research 8.3 3.1 50.3 $0 0.588 0.515 0.058 0.516 0.093 - -
357 OpenChat 3.5 (1210) OpenChat 8.3 - - $0 0.31 0.23 0.048 0.115 - 0.307 0
358 DBRX Instruct Databricks 8.3 - - $0 0.397 0.331 0.066 0.093 0.118 0.279 0.03
359 Command-R+ (Apr '24) Cohere 8.3 - - $6 0.432 0.323 0.045 0.122 0.118 0.279 0.007
360 LFM2.5-1.2B-Thinking Liquid AI 8.1 1.4 - $0 - 0.339 0.061 - 0.042 - -
361 Olmo 3 7B Instruct Allen Institute for AI 8.1 3.4 41.3 $0.125 0.522 0.4 0.058 0.266 0.103 - -
362 Exaone 4.0 1.2B (Non-reasoning) LG AI Research 8.1 2.5 24 $0 0.5 0.424 0.058 0.293 0.074 - -
363 LFM2.5-1.2B-Instruct Liquid AI 8 0.8 - $0 - 0.326 0.068 - 0.023 - -
364 Granite 4.0 H 1B IBM 8 2.7 6.3 $0 0.277 0.263 0.05 0.115 0.082 - -
365 Solar Mini Upstage 8 - - $0.15 - - - - - 0.331 -
366 Jamba 1.5 Mini AI21 Labs 8 - - $0.25 0.371 0.302 0.051 0.062 0.08 0.357 0.01
367 LFM2 2.6B Liquid AI 7.9 1.4 8.3 $0 0.298 0.306 0.052 0.081 0.025 - -
368 Qwen3 1.7B (Reasoning) Alibaba 7.9 1.4 38.7 $0.398 0.57 0.356 0.048 0.308 0.043 0.894 0.51
369 Jamba 1.6 Mini AI21 Labs 7.9 - - $0.25 0.367 0.3 0.046 0.071 0.101 0.257 0.033
370 Granite 4.0 Micro IBM 7.7 5 6 $0 0.447 0.336 0.051 0.18 0.119 - -
371 Mixtral 8x7B Instruct Mistral 7.7 - - $0.54 0.387 0.292 0.045 0.066 0.028 0.299 0
372 DeepHermes 3 - Llama-3.1 8B Preview (Non-reasoning) Nous Research 7.6 - - $0 0.365 0.27 0.043 0.085 0.091 0.218 0
373 Gemma 3 270M Google 7.5 0 2.3 $0 0.055 0.224 0.042 0.003 0 - -
374 Qwen Chat 14B Alibaba 7.4 - - $0 - - - - - - -
375 Claude Instant Anthropic 7.4 7.8 - $0 0.434 0.33 0.038 0.109 - 0.264 0
376 Mistral 7B Instruct Mistral 7.4 - - $0.25 0.245 0.177 0.043 0.046 0.024 0.121 0
377 Command-R (Mar '24) Cohere 7.4 - - $0.75 0.338 0.284 0.048 0.048 0.062 0.164 0.007
378 Granite 4.0 1B IBM 7.3 2.9 6.3 $0 0.325 0.281 0.051 0.047 0.087 - -
379 Jamba 1.7 Mini AI21 Labs 7.3 3.1 0.3 $0.25 0.388 0.322 0.045 0.061 0.093 0.258 0.013
380 Llama 2 Chat 70B Meta 7 - - $0 0.406 0.327 0.05 0.098 - 0.323 0
381 LFM2 8B A1B Liquid AI 6.8 2.3 25.3 $0 0.505 0.344 0.049 0.151 0.068 - -
382 Qwen3 1.7B (Non-reasoning) Alibaba 6.8 2.3 7.3 $0.188 0.411 0.283 0.052 0.126 0.069 0.717 0.097
383 Granite 3.3 8B (Non-reasoning) IBM 6.8 3.4 6.7 $0.085 0.468 0.338 0.042 0.127 0.101 0.665 0.047
384 Granite 4.0 350M IBM 6.6 0.3 0 $0 0.124 0.261 0.057 0.024 0.009 - -
385 Qwen3 0.6B (Reasoning) Alibaba 6.4 0.9 18 $0.398 0.347 0.239 0.057 0.121 0.028 0.75 0.1
386 LFM2 1.2B Liquid AI 6.4 0.8 3.3 $0 0.257 0.228 0.057 0.02 0.025 - -
387 Gemma 3 4B Instruct Google 6.3 2.9 12.7 $0 0.417 0.291 0.052 0.112 0.073 0.766 0.063
388 Gemma 3n E4B Instruct Google 6.3 4.2 14.3 $0.025 0.488 0.296 0.044 0.146 0.081 0.771 0.137
389 Llama 3 Instruct 8B Meta 6.3 4 - $0.07 0.405 0.296 0.051 0.096 0.119 0.499 0
390 Llama 3.2 Instruct 1B Meta 6.3 0.6 0 $0.1 0.2 0.196 0.053 0.019 0.017 0.14 0
391 LFM2.5-VL-1.6B Liquid AI 6.1 1 - $0 - 0.289 0.051 - 0.03 - -
392 Qwen3 0.6B (Non-reasoning) Alibaba 5.6 1.4 10.3 $0.188 0.231 0.231 0.052 0.073 0.041 0.521 0.017
393 Gemma 3 1B Instruct Google 5.4 0.2 3.3 $0 0.135 0.237 0.052 0.017 0.007 0.484 0
394 Granite 4.0 H 350M IBM 5.3 0.6 1.3 $0 0.127 0.257 0.064 0.019 0.017 - -
395 Llama 2 Chat 13B Meta 5 - - $0 0.406 0.321 0.047 0.098 0.118 0.329 0.017
396 Gemma 3n E2B Instruct Google 4.7 2.2 10.3 $0 0.378 0.229 0.04 0.095 0.052 0.691 0.09
397 Tiny Aya Global Cohere 4.7 1.2 - $0 - 0.305 0.052 - 0.036 - -
398 Llama 65B Meta 4 - - $0 - - - - - - -
399 Llama 2 Chat 7B Meta 4 - - $0.1 0.164 0.227 0.058 0.002 0 0.059 0
400 Grok Voice Agent xAI - - - $0 - - - - - - -
401 Molmo2-8B Allen Institute for AI - 4.4 - $0 - 0.425 0.044 - 0.133 - -
402 Cogito v2.1 (Reasoning) Deep Cogito - 24.8 72.7 $1.25 0.849 0.768 0.11 0.688 0.41 - -
403 Mi:dm K 2.5 Pro Preview Korea Telecom - 11.9 78.7 $0 0.813 0.722 0.088 0.576 0.297 - -
404 GPT-4o Realtime (Dec '24) OpenAI - - - $0 - - - - - - -
405 GPT-4o mini Realtime (Dec '24) OpenAI - - - $0 - - - - - - -
406 GPT-3.5 Turbo (0613) OpenAI - - - $0 - - - - - - -

* Prices are blended per-million-token prices (3:1 input/output ratio).

About the Artificial Analysis AI Model Rankings

Artificial Analysis is an independent AI benchmarking and analysis company whose results support developers, researchers, enterprises, and other AI users. It tests both proprietary and open-weight models, and centers its evaluation on the end-to-end user experience, measuring response time, output speed, and cost as actually experienced in use.

Quality benchmarks cover language understanding and reasoning; performance benchmarks focus on metrics users can actually perceive, such as time to first token, output speed, and end-to-end response time. To enable uniform, fair comparison across models, Artificial Analysis distinguishes OpenAI tokens from each model's native tokens, and computes a blended price at a 3:1 input/output ratio. Benchmark targets include models, endpoints, systems, and providers, spanning language models, speech, image generation, and more, with the goal of helping users understand the real-world performance and cost-effectiveness of different AI services.

Artificial Analysis Benchmark Metrics Explained

Context Window

The maximum combined number of input and output tokens. The limit on output tokens is usually much lower, and the exact figure varies by model.

Output Speed

Tokens received per second while the model is generating tokens (i.e., for models that support streaming, measured after the first chunk is received from the API).

Latency (Time to First Token)

The time in seconds from sending the API request to receiving the first token. For reasoning models that stream their reasoning tokens, this is the first reasoning token. For models that do not support streaming, it is the time until the completed response is received.
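As a rough sketch, both time to first token and output speed can be measured around any streaming client by timestamping the first chunk and counting tokens afterward. The stream below is a simulated stand-in, not a real API; only the timing logic is the point.

```python
import time

def measure_stream(chunks):
    """Return (ttft_seconds, tokens_per_second) for an iterable of token chunks."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for chunk in chunks:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first chunk arrived: TTFT mark
        n_tokens += len(chunk)
    end = time.perf_counter()
    ttft = first_token_at - start
    # Output speed is counted over the generation phase after the first chunk.
    gen_time = end - first_token_at
    speed = n_tokens / gen_time if gen_time > 0 else float("nan")
    return ttft, speed

def fake_stream(n_tokens=20):
    """Hypothetical stream: ~50 ms before the first token, then ~5 ms per token."""
    time.sleep(0.05)
    for _ in range(n_tokens):
        yield ["tok"]        # one token per chunk
        time.sleep(0.005)

ttft, speed = measure_stream(fake_stream())
print(f"TTFT: {ttft:.3f}s, speed: {speed:.0f} tok/s")
```

With these simulated delays, TTFT lands near 0.05 s and speed near 200 tokens/s; against a real endpoint, the loop body would consume the provider's streaming chunks instead.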

Price

Price per token, expressed in USD per million tokens. The price is a blend of the input-token and output-token prices at a 3:1 ratio.
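The 3:1 blend is a simple weighted average: three parts input price to one part output price. A minimal sketch, using example per-million-token prices (not taken from any provider's price list):

```python
def blended_price(input_price: float, output_price: float) -> float:
    """Blended $/1M-token price, weighting input 3:1 over output."""
    return (3 * input_price + 1 * output_price) / 4

# Example: $1.25/1M input and $10/1M output blend to $3.4375.
print(blended_price(1.25, 10.0))  # 3.4375
```

This weighting reflects the assumption that typical workloads consume roughly three input tokens for every output token.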

Common AI Model Benchmarks Explained

MMLU Pro

Massive Multitask Language Understanding Professional. An enhanced version of MMLU designed to evaluate the reasoning ability of large language models. It addresses the limitations of the original MMLU by filtering out easy questions, increasing the number of answer options from 4 to 10, and emphasizing complex multi-step reasoning. It covers roughly 12,000 questions across 14 domains.

GPQA

Graduate-Level Google-Proof Q&A Benchmark. A challenging graduate-level question-answering benchmark that evaluates an AI system's ability to provide accurate information in complex scientific fields such as physics, chemistry, and biology. The questions are designed to be "Google-proof": they require deep understanding and reasoning rather than simple factual recall.

HLE

Humanity's Last Exam. A comprehensive evaluation framework designed to test AI systems on human-level reasoning, problem solving, and knowledge integration. It contains 2,500 to 3,000 expert-level questions spanning more than 100 subjects, emphasizing multi-step reasoning and the ability to handle novel scenarios.

LiveCodeBench

A contamination-free benchmark of LLM coding ability. It continuously collects new problems from contests on platforms such as LeetCode, AtCoder, and Codeforces to prevent training-set contamination. Beyond code generation, it also evaluates self-repair, code execution, and test-output prediction.

SciCode

A benchmark that evaluates language models' ability to generate code for real scientific research problems. It covers 16 subfields across 6 domains, including physics, mathematics, materials science, biology, and chemistry. Problems are drawn from real scientific workflows and typically require knowledge recall, reasoning, and code synthesis.

Math 500

A benchmark designed to evaluate language models' mathematical reasoning and problem-solving ability. It contains 500 challenging problems drawn from high-level high-school math competitions such as AMC and AIME, spanning algebra, combinatorics, geometry, number theory, and precalculus.

AIME

American Invitational Mathematics Examination. A benchmark based on problems from the American Invitational Mathematics Examination, regarded as one of the most challenging AI tests of advanced mathematical reasoning. It contains 30 "olympiad-level" integer-answer math problems that test multi-step reasoning, abstraction, and problem solving.