AI Model Rankings: The Artificial Analysis LLM Leaderboard

The ranking data on this page comes from Artificial Analysis, which compares and ranks more than 100 AI models (LLMs) on metrics including intelligence and price. The rankings also aggregate results from other authoritative AI benchmarks for reference.

AI Model Rankings (Based on Artificial Analysis)

Model Information / Artificial Analysis Benchmark Results / Other AI Benchmark Results
Rank Model Organization Composite Index Coding Math Price ($/1M) MMLU Pro GPQA HLE LiveCodeBench SciCode Math 500 AIME
1 GPT-5.2 (xhigh) OpenAI 50.5 46.7 99 $4.813 0.874 0.903 0.354 0.889 0.521 - -
2 Claude Opus 4.5 (Reasoning) Anthropic 49.1 45.8 91.3 $10 0.895 0.866 0.284 0.871 0.495 - -
3 Gemini 3 Pro Preview (high) Google 47.9 44.7 95.7 $4.5 0.898 0.908 0.372 0.917 0.561 - -
4 GPT-5.1 (high) OpenAI 47 42.8 94 $3.438 0.87 0.873 0.265 0.868 0.433 - -
5 Gemini 3 Flash Preview (Reasoning) Google 45.9 41 97 $1.125 0.89 0.898 0.347 0.908 0.506 - -
6 GPT-5.2 (medium) OpenAI 45.3 42.3 96.7 $4.813 0.859 0.864 0.249 0.894 0.462 - -
7 GPT-5 (high) OpenAI 44.1 34.6 94.3 $3.438 0.871 0.854 0.265 0.846 0.429 0.994 0.957
8 GPT-5 Codex (high) OpenAI 44 37.3 98.7 $3.438 0.865 0.837 0.256 0.84 0.409 - -
9 Claude Opus 4.5 (Non-reasoning) Anthropic 42.5 41.2 62.7 $10 0.889 0.81 0.129 0.738 0.47 - -
10 Claude 4.5 Sonnet (Reasoning) Anthropic 42.4 37.1 88 $6 0.875 0.834 0.173 0.714 0.447 - -
11 GLM-4.7 (Reasoning) Z AI 41.7 34.9 95 $0.938 0.856 0.859 0.251 0.894 0.451 - -
12 GPT-5 (medium) OpenAI 41.6 37.8 91.7 $3.438 0.867 0.842 0.235 0.703 0.411 0.991 0.917
13 GPT-5.1 Codex (high) OpenAI 41.5 35.1 95.7 $3.438 0.86 0.86 0.234 0.849 0.402 - -
14 Grok 4 xAI 41.3 40.3 92.7 $6 0.866 0.877 0.239 0.819 0.457 0.99 0.943
15 DeepSeek V3.2 (Reasoning) DeepSeek 41.2 35.2 92 $0.315 0.862 0.84 0.222 0.862 0.389 - -
16 o3 OpenAI 40.9 36.8 88.3 $3.5 0.853 0.827 0.2 0.808 0.41 0.992 0.903
17 o3-pro OpenAI 40.7 - - $35 - 0.845 - - - - -
18 GPT-5 mini (high) OpenAI 40.6 33.9 90.7 $0.688 0.837 0.828 0.197 0.838 0.392 - -
19 Gemini 3 Pro Preview (low) Google 40.6 37.9 86.7 $4.5 0.895 0.887 0.276 0.857 0.499 - -
20 Kimi K2 Thinking Kimi 40.3 33.5 94.7 $1.075 0.848 0.838 0.223 0.853 0.424 - -
21 MiniMax-M2.1 MiniMax 39.3 31.6 82.7 $0.525 0.875 0.83 0.222 0.81 0.407 - -
22 MiMo-V2-Flash (Reasoning) Xiaomi 39 30.6 96.3 $0.15 0.843 0.846 0.211 0.868 0.394 - -
23 GPT-5 (low) OpenAI 38.7 29.6 83 $3.438 0.86 0.808 0.184 0.763 0.391 0.987 0.83
24 GPT-5 mini (medium) OpenAI 38.6 31.6 85 $0.688 0.828 0.803 0.146 0.692 0.41 - -
25 Claude 4 Sonnet (Reasoning) Anthropic 38.4 33.2 74.3 $6 0.842 0.777 0.096 0.655 0.4 0.991 0.773
26 Grok 4.1 Fast (Reasoning) xAI 38.2 29.9 89.3 $0.275 0.854 0.853 0.176 0.822 0.442 - -
27 GPT-5.1 Codex mini (high) OpenAI 38 35 91.7 $0.688 0.82 0.813 0.169 0.836 0.426 - -
28 Claude 4.5 Haiku (Reasoning) Anthropic 36.6 31.4 83.7 $2 0.76 0.672 0.097 0.615 0.433 - -
29 Claude 4.5 Sonnet (Non-reasoning) Anthropic 36.6 32.2 37 $6 0.86 0.727 0.071 0.59 0.428 - -
30 KAT-Coder-Pro V1 KwaiKAT 35.9 17.9 94.7 $0 0.813 0.764 0.334 0.747 0.366 - -
31 MiniMax-M2 MiniMax 35.6 28.1 78.3 $0.525 0.82 0.777 0.125 0.826 0.361 - -
32 Nova 2.0 Pro Preview (medium) Amazon 35.3 29.4 89 $3.438 0.83 0.785 0.089 0.73 0.427 - -
33 Doubao-Seed-1.8 ByteDance Seed 34.8 28.7 84.7 $0.152 0.85 0.801 0.148 0.745 0.449 - -
34 Gemini 3 Flash Preview (Non-reasoning) Google 34.7 36.5 55.7 $1.125 0.882 0.812 0.141 0.797 0.499 - -
35 Grok 4 Fast (Reasoning) xAI 34.6 26.6 89.7 $0.275 0.85 0.847 0.17 0.832 0.442 - -
36 Claude 3.7 Sonnet (Reasoning) Anthropic 34.4 26.7 56.3 $6 0.837 0.772 0.103 0.473 0.403 0.947 0.487
37 Gemini 2.5 Pro Google 34.1 30.8 87.7 $3.438 0.862 0.844 0.211 0.801 0.428 0.967 0.887
38 DeepSeek V3.2 Speciale DeepSeek 34.1 36.4 96.7 $0.315 0.863 0.871 0.261 0.896 0.44 - -
39 GLM-4.7 (Non-reasoning) Z AI 33.7 30.7 48 $0.938 0.794 0.664 0.061 0.562 0.354 - -
40 DeepSeek V3.1 Terminus (Reasoning) DeepSeek 33.4 32.5 89.7 $0.8 0.851 0.792 0.152 0.798 0.406 - -
41 Doubao Seed Code ByteDance Seed 33.2 30.1 79.3 $0.407 0.854 0.764 0.133 0.766 0.407 - -
42 GPT-5.2 (Non-reasoning) OpenAI 33.1 33.3 51 $4.813 0.814 0.712 0.073 0.669 0.404 - -
43 gpt-oss-120B (high) OpenAI 32.9 27.6 93.4 $0.263 0.808 0.782 0.185 0.878 0.389 - -
44 o4-mini (high) OpenAI 32.9 25 90.7 $1.925 0.832 0.784 0.175 0.859 0.465 0.989 0.94
45 Claude 4 Sonnet (Non-reasoning) Anthropic 32.6 29.4 38 $6 0.837 0.683 0.04 0.449 0.373 0.934 0.407
46 DeepSeek V3.2 Exp (Reasoning) DeepSeek 32.5 32 87.7 $0.315 0.85 0.797 0.138 0.789 0.377 - -
47 Qwen3 Max Thinking Alibaba 32.4 23.8 82.3 $2.4 0.824 0.776 0.12 0.535 0.387 - -
48 Grok 3 mini Reasoning (high) xAI 32.3 24.4 84.7 $0.35 0.828 0.791 0.111 0.696 0.406 0.992 0.933
49 GLM-4.6 (Reasoning) Z AI 32.2 28.4 86 $0.963 0.829 0.78 0.133 0.695 0.384 - -
50 Nova 2.0 Pro Preview (low) Amazon 32 23.8 63.3 $3.438 0.822 0.751 0.052 0.638 0.387 - -
51 K-EXAONE (Reasoning) LG AI Research 31.9 26.1 90.3 $0 0.838 0.783 0.131 0.768 0.356 - -
52 Claude 4.1 Opus (Reasoning) Anthropic 31.9 35.1 80.3 $30 0.88 0.809 0.119 0.654 0.409 - -
53 DeepSeek V3.2 (Non-reasoning) DeepSeek 31.8 33.2 59 $0.315 0.837 0.751 0.105 0.593 0.387 - -
54 Qwen3 Max Alibaba 31 25.5 80.7 $2.4 0.841 0.764 0.111 0.767 0.383 - -
55 Gemini 2.5 Flash Preview (Sep '25) (Reasoning) Google 30.8 23.9 78.3 $0.85 0.842 0.793 0.127 0.713 0.405 - -
56 Claude 4.5 Haiku (Non-reasoning) Anthropic 30.5 28.5 39 $2 0.8 0.646 0.043 0.511 0.344 - -
57 Claude 3.7 Sonnet (Non-reasoning) Anthropic 30.5 25.8 21 $6 0.803 0.656 0.048 0.394 0.376 0.85 0.223
58 Gemini 2.5 Pro Preview (Mar '25) Google 30.3 46.7 - $3.438 0.858 0.836 0.171 0.778 0.395 0.98 0.87
59 DeepSeek V3.1 (Reasoning) DeepSeek 30.2 29.1 89.7 $0.855 0.851 0.779 0.13 0.784 0.391 - -
60 Nova 2.0 Lite (medium) Amazon 29.8 23.1 88.7 $0.85 0.813 0.768 0.086 0.663 0.368 - -
61 GLM-4.6 (Non-reasoning) Z AI 29.8 29 44.3 $1 0.784 0.632 0.052 0.561 0.331 - -
62 Gemini 2.5 Pro Preview (May '25) Google 29.5 - - $3.438 0.837 0.822 0.154 0.77 0.416 0.986 0.843
63 Qwen3 235B A22B 2507 (Reasoning) Alibaba 29.3 22.6 91 $2.625 0.843 0.79 0.15 0.788 0.424 0.984 0.94
64 ERNIE 5.0 Thinking Preview Baidu 28.9 28.1 85 $1.472 0.83 0.777 0.127 0.812 0.375 - -
65 Qwen3 VL 32B (Reasoning) Alibaba 28.6 14.2 84.7 $2.625 0.818 0.733 0.096 0.738 0.285 - -
66 Seed-OSS-36B-Instruct ByteDance Seed 28.4 16.4 84.7 $0.3 0.815 0.726 0.091 0.765 0.365 - -
67 Apriel-v1.5-15B-Thinker ServiceNow 28.3 18.2 87.5 $0 0.773 0.713 0.12 0.728 0.348 - -
68 DeepSeek V3.2 Exp (Non-reasoning) DeepSeek 28.1 28.9 57.7 $0.315 0.836 0.738 0.086 0.554 0.399 - -
69 DeepSeek V3.1 Terminus (Non-reasoning) DeepSeek 27.9 30.5 53.7 $0.8 0.836 0.751 0.084 0.529 0.321 - -
70 Nova 2.0 Omni (medium) Amazon 27.9 14.9 89.7 $0.85 0.809 0.76 0.068 0.66 0.362 - -
71 Kimi K2 0905 Kimi 27.7 25.4 57.3 $1.2 0.819 0.767 0.063 0.61 0.307 - -
72 Apriel-v1.6-15B-Thinker ServiceNow 27.7 21.4 88 $0 0.79 0.733 0.098 0.807 0.373 - -
73 o3-mini (high) OpenAI 27.7 42.1 - $1.925 0.802 0.773 0.123 0.734 0.398 0.985 0.86
74 DeepSeek V3.1 (Non-reasoning) DeepSeek 27.6 27.4 49.7 $0.834 0.833 0.735 0.063 0.577 0.367 - -
75 Qwen3 VL 235B A22B (Reasoning) Alibaba 27.4 20.4 88.3 $2.625 0.836 0.772 0.101 0.646 0.399 - -
76 Claude 4 Opus (Reasoning) Anthropic 27.4 32.7 73.3 $30 0.873 0.796 0.117 0.636 0.398 0.982 0.757
77 Magistral Medium 1.2 Mistral 27.3 21.6 82 $2.75 0.815 0.739 0.096 0.75 0.392 - -
78 GPT-5.1 (Non-reasoning) OpenAI 27.2 26.3 38 $3.438 0.801 0.643 0.052 0.494 0.365 - -
79 DeepSeek R1 0528 (May '25) DeepSeek 27 23.4 76 $2.362 0.849 0.813 0.149 0.77 0.403 0.983 0.893
80 Gemini 2.5 Flash (Reasoning) Google 27 21.6 73.3 $0.85 0.832 0.79 0.111 0.695 0.394 0.981 0.823
81 GPT-5 nano (high) OpenAI 26.6 19.8 83.7 $0.138 0.78 0.676 0.082 0.789 0.366 - -
82 Qwen3 Next 80B A3B (Reasoning) Alibaba 26.5 19.1 84.3 $1.875 0.824 0.759 0.117 0.784 0.388 - -
83 GLM-4.5 (Reasoning) Z AI 26.5 25.8 73.7 $1 0.835 0.782 0.122 0.738 0.348 0.979 0.873
84 Grok Code Fast 1 xAI 26.2 22.9 43.3 $0.525 0.793 0.727 0.075 0.657 0.362 - -
85 Qwen3 Max (Preview) Alibaba 26.1 24.6 75 $2.4 0.838 0.764 0.093 0.651 0.37 - -
86 o3-mini OpenAI 25.9 17.6 - $1.925 0.791 0.748 0.087 0.717 0.399 0.973 0.77
87 Kimi K2 Kimi 25.9 21.4 57 $1.075 0.824 0.766 0.07 0.556 0.345 0.971 0.693
88 o1-pro OpenAI 25.8 - - $262.5 - - - - - - -
89 GPT-4.1 OpenAI 25.7 21.2 34.7 $3.5 0.806 0.666 0.046 0.457 0.381 0.913 0.437
90 GPT-5 nano (medium) OpenAI 25.7 22.1 78.3 $0.138 0.772 0.67 0.076 0.763 0.338 - -
91 Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) Google 25.5 21.5 56.7 $0.85 0.836 0.766 0.078 0.625 0.375 - -
92 o1 OpenAI 25.2 20 - $26.25 0.841 0.747 0.077 0.679 0.358 0.97 0.723
93 Grok 3 xAI 25.1 19.4 58 $6 0.799 0.693 0.051 0.425 0.368 0.87 0.33
94 Nova 2.0 Lite (low) Amazon 24.8 13.5 46.7 $0.85 0.788 0.698 0.042 0.469 0.333 - -
95 Qwen3 Coder 480B A35B Instruct Alibaba 24.6 23.8 39.3 $3 0.788 0.618 0.044 0.585 0.359 0.942 0.477
96 Sonar Reasoning Pro Perplexity 24.6 - - $0 - - - - - 0.957 0.79
97 gpt-oss-20B (high) OpenAI 24.5 18.1 89.3 $0.1 0.748 0.688 0.098 0.777 0.344 - -
98 NVIDIA Nemotron 3 Nano 30B A3B (Reasoning) NVIDIA 24.5 18.4 91 $0.105 0.794 0.757 0.102 0.741 0.296 - -
99 MiMo-V2-Flash (Non-reasoning) Xiaomi 24.5 24.7 67.7 $0.15 0.744 0.656 0.08 0.402 0.259 - -
100 Qwen3 235B A22B 2507 Instruct Alibaba 24.5 21.4 71.7 $1.225 0.828 0.753 0.106 0.524 0.36 0.98 0.717
101 MiniMax M1 80k MiniMax 24.4 14.3 61 $0.825 0.816 0.697 0.082 0.711 0.374 0.98 0.847
102 Gemini 2.5 Flash Preview (Reasoning) Google 24.3 - - $0 0.8 0.698 0.116 0.505 0.359 0.981 0.843
103 GPT-5 (minimal) OpenAI 24.2 24.7 31.7 $3.438 0.806 0.673 0.054 0.558 0.388 0.861 0.367
104 Motif-2-12.7B-Reasoning Motif Technologies 23.9 11.8 80.3 $0 0.796 0.695 0.082 0.651 0.282 - -
105 HyperCLOVA X SEED Think (32B) Naver 23.9 17 59 $0 0.785 0.615 0.055 0.629 0.284 - -
106 gpt-oss-120B (low) OpenAI 23.8 15.3 66.7 $0.263 0.775 0.672 0.052 0.707 0.36 - -
107 Nova 2.0 Omni (low) Amazon 23.8 13.8 56 $0.85 0.798 0.699 0.04 0.592 0.343 - -
108 GLM-4.6V (Reasoning) Z AI 23.7 19.1 85.3 $0.45 0.799 0.719 0.089 0.16 0.304 - -
109 Qwen3 Next 80B A3B Instruct Alibaba 23.7 14.9 66.3 $0.875 0.819 0.738 0.073 0.684 0.307 - -
110 o1-preview OpenAI 23.7 34 - $28.875 - - - - - 0.924 -
111 Claude 4.1 Opus (Non-reasoning) Anthropic 23.6 - - $30 - - - - - - -
112 Grok 4.1 Fast (Non-reasoning) xAI 23.4 18.9 34.3 $0.275 0.743 0.637 0.05 0.399 0.296 - -
113 GLM-4.5-Air Z AI 23.3 22.9 80.7 $0.425 0.815 0.733 0.068 0.684 0.306 0.965 0.673
114 Nova 2.0 Pro Preview (Non-reasoning) Amazon 23.2 19.8 30.7 $3.438 0.772 0.636 0.04 0.473 0.281 - -
115 K-EXAONE (Non-reasoning) LG AI Research 23.2 13.2 44 $0 0.81 0.695 0.054 - 0.27 - -
116 DeepSeek R1 (Jan '25) DeepSeek 23.1 15.7 68 $2.362 0.844 0.708 0.093 0.617 0.357 0.966 0.683
117 Qwen3 4B 2507 (Reasoning) Alibaba 22.8 9.5 82.7 $0 0.743 0.667 0.059 0.641 0.256 - -
118 GPT-4.1 mini OpenAI 22.8 18.2 46.3 $0.7 0.781 0.664 0.046 0.483 0.404 0.925 0.43
119 Grok 4 Fast (Non-reasoning) xAI 22.7 18.5 41.3 $0.275 0.73 0.606 0.05 0.401 0.329 - -
120 Qwen3 30B A3B 2507 (Reasoning) Alibaba 22.6 14.4 56.3 $0.75 0.805 0.707 0.098 0.707 0.333 0.976 0.907
121 Magistral Small 1.2 Mistral 22.5 14.6 80.3 $0.75 0.768 0.663 0.061 0.723 0.352 - -
122 Mistral Large 3 Mistral 22.5 22 38 $0.75 0.807 0.68 0.041 0.465 0.362 - -
123 DeepSeek V3 0324 DeepSeek 22.4 21.4 41 $1.25 0.819 0.655 0.052 0.405 0.358 0.942 0.52
124 EXAONE 4.0 32B (Reasoning) LG AI Research 22.3 13.8 80 $0.7 0.818 0.739 0.105 0.747 0.344 0.977 0.843
125 Claude 4 Opus (Non-reasoning) Anthropic 22.2 - 36.3 $30 0.86 0.701 0.059 0.542 0.409 0.941 0.563
126 Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning) Google 22.1 17.6 68.7 $0.175 0.808 0.709 0.066 0.688 0.287 - -
127 GPT-5 (ChatGPT) OpenAI 21.8 20.6 48.3 $3.438 0.82 0.686 0.058 0.543 0.378 - -
128 Devstral 2 Mistral 21.7 22.9 36.7 $0 0.762 0.594 0.036 0.448 0.331 - -
129 Hermes 4 - Llama-3.1 405B (Reasoning) Nous Research 21.7 15.5 69.7 $1.5 0.829 0.727 0.103 0.686 0.252 - -
130 Grok 3 Reasoning Beta xAI 21.6 - - $0 - - - - - - -
131 Qwen3 VL 32B Instruct Alibaba 21.4 15.2 68.3 $1.225 0.791 0.671 0.063 0.514 0.301 - -
132 GPT-5 mini (minimal) OpenAI 21.2 21.3 46.7 $0.688 0.775 0.687 0.05 0.545 0.369 - -
133 gpt-oss-20B (low) OpenAI 21.1 14.2 62.3 $0.1 0.718 0.611 0.051 0.652 0.34 - -
134 Mistral Medium 3.1 Mistral 21.1 17.9 38.3 $0.8 0.683 0.588 0.044 0.406 0.338 - -
135 K2-V2 (high) MBZUAI Institute of Foundation Models 21 15.7 78.3 $0 0.786 0.681 0.098 0.694 0.286 - -
136 MiniMax M1 40k MiniMax 20.9 14 13.7 $0.825 0.808 0.682 0.075 0.657 0.378 0.972 0.813
137 Qwen3 Omni 30B A3B (Reasoning) Alibaba 20.8 12.6 74 $0.43 0.792 0.726 0.073 0.679 0.306 - -
138 Qwen3 VL 235B A22B Instruct Alibaba 20.6 16.2 70.7 $1.225 0.823 0.712 0.063 0.594 0.359 - -
139 Ring-flash-2.0 InclusionAI 20.6 10.3 83.7 $0.247 0.793 0.725 0.089 0.628 0.168 - -
140 Gemini 2.5 Flash (Non-reasoning) Google 20.5 17.3 60.3 $0.85 0.809 0.683 0.051 0.495 0.291 0.932 0.5
141 Hermes 4 - Llama-3.1 70B (Reasoning) Nous Research 20.4 14.2 68.7 $0.198 0.811 0.699 0.079 0.653 0.341 - -
142 o1-mini OpenAI 20.4 - - $0 0.742 0.603 0.049 0.576 0.323 0.944 0.603
143 Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) NVIDIA 20 13 63.7 $0.9 0.825 0.728 0.081 0.641 0.347 0.952 0.747
144 Qwen3 VL 30B A3B Instruct Alibaba 20 14 72.3 $0.35 0.764 0.695 0.064 0.476 0.308 - -
145 Qwen3 Coder 30B A3B Instruct Alibaba 20 18.7 29 $0.9 0.706 0.516 0.04 0.403 0.278 0.893 0.297
146 GPT-4.5 (Preview) OpenAI 20 - - $0 - - - - - - -
147 Ling-flash-2.0 InclusionAI 19.9 16.3 65.3 $0.247 0.777 0.657 0.063 0.589 0.289 - -
148 Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning) Google 19.8 14.2 46.7 $0.175 0.796 0.651 0.046 0.641 0.285 - -
149 Qwen3 235B A22B (Reasoning) Alibaba 19.8 17.1 82 $2.625 0.828 0.7 0.117 0.622 0.399 0.93 0.84
150 QwQ 32B Alibaba 19.7 - 29 $0.473 0.764 0.593 0.082 0.631 0.358 0.957 0.78
151 Gemini 2.0 Flash Thinking Experimental (Jan '25) Google 19.6 24.1 - $0 0.798 0.701 0.071 0.321 0.329 0.944 0.5
152 Qwen3 VL 30B A3B (Reasoning) Alibaba 19.5 12.9 82.3 $0.75 0.807 0.72 0.087 0.697 0.288 - -
153 Ling-1T InclusionAI 19.5 18.4 71.3 $0 0.822 0.719 0.072 0.677 0.352 - -
154 GLM-4.5V (Reasoning) Z AI 19.3 10.7 73 $0.9 0.788 0.684 0.059 0.604 0.221 - -
155 Devstral Small 2 Mistral 19.1 20 34.3 $0 0.678 0.532 0.034 0.348 0.288 - -
156 Nova Premier Amazon 19.1 13.6 17.3 $5 0.733 0.569 0.047 0.317 0.279 0.839 0.17
157 Llama Nemotron Super 49B v1.5 (Reasoning) NVIDIA 19.1 14.9 76.7 $0.175 0.814 0.748 0.068 0.737 0.348 0.983 0.86
158 K2-V2 (medium) MBZUAI Institute of Foundation Models 19 13.6 64.7 $0 0.761 0.598 0.044 0.541 0.252 - -
159 OLMo 3 32B Think Allen Institute for AI 18.9 10.5 73.7 $0.237 0.759 0.61 0.059 0.672 0.286 - -
160 Solar Pro 2 (Preview) (Reasoning) Upstage 18.8 - - $0 0.768 0.578 0.057 0.462 0.164 0.9 0.663
161 Devstral Medium Mistral 18.7 15.5 4.7 $0.8 0.708 0.492 0.038 0.337 0.294 0.707 0.067
162 Llama 4 Maverick Meta 18.6 15.3 19.3 $0.422 0.809 0.671 0.048 0.397 0.331 0.889 0.39
163 GPT-4o (March 2025, chatgpt-4o-latest) OpenAI 18.6 - 25.7 $7.5 0.803 0.655 0.05 0.425 0.366 0.893 0.327
164 Llama 3.3 Nemotron Super 49B v1 (Reasoning) NVIDIA 18.5 9.4 54.7 $0 0.785 0.643 0.065 0.277 0.282 0.959 0.583
165 Nova 2.0 Lite (Non-reasoning) Amazon 18.4 12.2 33.7 $0.85 0.743 0.603 0.03 0.346 0.24 - -
166 Gemini 2.0 Pro Experimental (Feb '25) Google 18.1 25.5 - $0 0.805 0.622 0.068 0.347 0.312 0.923 0.36
167 Sonar Reasoning Perplexity 17.9 - - $2 - 0.623 - - - 0.921 0.77
168 Gemini 2.5 Flash Preview (Non-reasoning) Google 17.8 - - $0 0.783 0.594 0.05 0.406 0.233 0.926 0.433
169 Gemini 2.0 Flash (Feb '25) Google 17.6 13.5 21.7 $0.175 0.779 0.623 0.053 0.334 0.333 0.93 0.33
170 Gemini 2.5 Flash-Lite (Reasoning) Google 17.6 9.3 53.3 $0.175 0.759 0.625 0.064 0.593 0.193 0.969 0.703
171 Mistral Medium 3 Mistral 17.6 13.4 30.3 $0.8 0.76 0.578 0.043 0.4 0.331 0.907 0.44
172 Magistral Medium 1 Mistral 17.4 15.6 40.3 $2.75 0.753 0.679 0.095 0.527 0.297 0.917 0.7
173 Llama 3.1 Instruct 405B Meta 17.3 14.2 3 $4.188 0.732 0.515 0.042 0.305 0.299 0.703 0.213
174 ERNIE 4.5 300B A47B Baidu 17.3 14.3 41.3 $0.485 0.776 0.811 0.035 0.467 0.315 0.931 0.493
175 GLM-4.6V (Non-reasoning) Z AI 17.3 11 26.3 $0.45 0.752 0.566 0.037 0.411 0.272 - -
176 DeepSeek R1 Distill Qwen 32B DeepSeek 17.2 - 63 $0.285 0.739 0.615 0.055 0.27 0.376 0.941 0.687
177 Hermes 4 - Llama-3.1 405B (Non-reasoning) Nous Research 17.1 17.7 15.3 $1.5 0.729 0.536 0.042 0.546 0.346 - -
178 Qwen3 235B A22B (Non-reasoning) Alibaba 17.1 13.7 23.7 $1.225 0.762 0.613 0.047 0.343 0.299 0.902 0.327
179 DeepSeek V3 (Dec '24) DeepSeek 16.9 16.1 26 $0.625 0.752 0.557 0.036 0.359 0.354 0.887 0.253
180 Qwen3 32B (Reasoning) Alibaba 16.9 13.7 73 $2.625 0.798 0.668 0.083 0.546 0.354 0.961 0.807
181 Nova 2.0 Omni (Non-reasoning) Amazon 16.8 13.6 37 $0.85 0.719 0.555 0.039 0.305 0.279 - -
182 OLMo 3 7B Think Allen Institute for AI 16.8 7.5 70.7 $0.14 0.655 0.516 0.057 0.617 0.212 - -
183 Qwen3 VL 8B (Reasoning) Alibaba 16.8 9.7 30.7 $0.66 0.749 0.579 0.033 0.353 0.219 - -
184 Gemini 2.0 Flash (experimental) Google 16.8 - - $0 0.782 0.636 0.047 0.21 0.34 0.911 0.3
185 Magistral Small 1 Mistral 16.8 10.9 41.3 $0.75 0.746 0.641 0.072 0.514 0.241 0.963 0.713
186 Qwen3 14B (Reasoning) Alibaba 16.6 12.9 55.7 $1.313 0.774 0.604 0.043 0.523 0.316 0.961 0.763
187 DeepSeek R1 0528 Qwen3 8B DeepSeek 16.4 7.7 63.7 $0.068 0.739 0.612 0.056 0.513 0.204 0.932 0.65
188 Qwen2.5 Max Alibaba 16.3 - - $2.8 0.762 0.587 0.045 0.359 0.337 0.835 0.233
189 Ministral 14B (Dec '25) Mistral 16.2 10.7 30 $0.2 0.693 0.572 0.046 0.351 0.236 - -
190 Qwen3 Omni 30B A3B Instruct Alibaba 16.1 7.2 52.3 $0.43 0.725 0.62 0.051 0.422 0.186 - -
191 Qwen3 4B 2507 Instruct Alibaba 16.1 8.9 52.3 $0 0.672 0.517 0.047 0.377 0.181 - -
192 DeepSeek R1 Distill Llama 70B DeepSeek 16 11.4 53.7 $0.875 0.795 0.402 0.061 0.266 0.312 0.935 0.67
193 Gemini 1.5 Pro (Sep '24) Google 16 23.6 - $0 0.75 0.589 0.049 0.316 0.295 0.876 0.23
194 Solar Pro 2 (Preview) (Non-reasoning) Upstage 16 - - $0 0.725 0.544 0.038 0.385 0.272 0.871 0.297
195 Claude 3.5 Sonnet (Oct '24) Anthropic 15.9 30.2 - $6 0.772 0.599 0.039 0.381 0.366 0.771 0.157
196 DeepSeek R1 Distill Qwen 14B DeepSeek 15.8 - 55.7 $0.15 0.74 0.484 0.044 0.376 0.239 0.949 0.667
197 GPT-4o (Aug '24) OpenAI 15.6 - - $4.375 - 0.521 0.029 0.317 - 0.795 0.117
198 Qwen2.5 Instruct 72B Alibaba 15.6 11.8 14 $0 0.72 0.491 0.042 0.276 0.267 0.858 0.16
199 Qwen3 30B A3B (Reasoning) Alibaba 15.6 10.9 72.3 $0.75 0.777 0.616 0.066 0.506 0.285 0.959 0.753
200 Devstral Small (Jul '25) Mistral 15.5 11.9 29.3 $0.15 0.622 0.414 0.037 0.254 0.243 0.635 0.003
201 Sonar Perplexity 15.5 - - $1 0.689 0.471 0.073 0.295 0.229 0.817 0.487
202 Solar Pro 2 (Reasoning) Upstage 15.4 12 61.3 $0.5 0.805 0.687 0.07 0.616 0.302 0.967 0.69
203 NVIDIA Nemotron Nano 9B V2 (Reasoning) NVIDIA 15.3 8.3 69.7 $0.07 0.742 0.57 0.046 0.724 0.22 - -
204 Qwen3 30B A3B 2507 Instruct Alibaba 15.3 13.9 66.3 $0.35 0.777 0.659 0.068 0.515 0.304 0.975 0.727
205 Qwen3 8B (Reasoning) Alibaba 15.3 8.9 19 $0.66 0.743 0.589 0.042 0.406 0.226 0.904 0.747
206 Sonar Pro Perplexity 15.2 - - $6 0.755 0.578 0.079 0.275 0.226 0.745 0.29
207 QwQ 32B-Preview Alibaba 15.2 - - $0.135 0.648 0.557 0.048 0.337 0.038 0.91 0.453
208 Llama Nemotron Super 49B v1.5 (Non-reasoning) NVIDIA 15.1 10.3 8 $0.175 0.692 0.481 0.043 0.29 0.238 0.77 0.137
209 Ling-mini-2.0 InclusionAI 15.1 5 49.3 $0.122 0.671 0.562 0.05 0.429 0.135 - -
210 Mistral Small 3.2 Mistral 15 13.1 27 $0.15 0.681 0.505 0.043 0.275 0.264 0.883 0.323
211 K2-V2 (low) MBZUAI Institute of Foundation Models 15 10.3 35.3 $0 0.713 0.541 0.039 0.393 0.223 - -
212 Ministral 8B (Dec '25) Mistral 14.9 9.8 31.7 $0.15 0.642 0.471 0.043 0.303 0.208 - -
213 Qwen3 VL 4B (Reasoning) Alibaba 14.9 6.7 25.7 $0 0.7 0.494 0.044 0.32 0.171 - -
214 Llama 3.3 Instruct 70B Meta 14.8 10.6 7.7 $0.64 0.713 0.498 0.04 0.288 0.26 0.773 0.3
215 NVIDIA Nemotron Nano 12B v2 VL (Reasoning) NVIDIA 14.8 11.6 75 $0.3 0.759 0.572 0.053 0.694 0.262 - -
216 GPT-4o (Nov '24) OpenAI 14.8 16.3 6 $4.375 0.748 0.543 0.033 0.309 0.333 0.759 0.15
217 Gemini 2.0 Flash-Lite (Feb '25) Google 14.7 - - $0.131 0.724 0.535 0.036 0.185 0.25 0.873 0.277
218 Mistral Large 2 (Nov '24) Mistral 14.7 13.5 14 $3 0.697 0.486 0.04 0.293 0.292 0.736 0.11
219 Qwen3 30B A3B (Non-reasoning) Alibaba 14.6 13.1 21.7 $0.35 0.71 0.515 0.046 0.322 0.264 0.863 0.26
220 Qwen3 VL 8B Instruct Alibaba 14.5 7.2 27.3 $0.31 0.686 0.427 0.029 0.332 0.174 - -
221 GPT-4o (May '24) OpenAI 14.5 24.2 - $7.5 0.74 0.526 0.028 0.334 0.309 0.791 0.11
222 Gemini 2.0 Flash-Lite (Preview) Google 14.5 - - $0.131 - 0.542 0.044 0.179 0.247 0.873 0.303
223 Qwen3 32B (Non-reasoning) Alibaba 14.5 - 19.7 $1.225 0.727 0.535 0.043 0.288 0.28 0.869 0.303
224 Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning) NVIDIA 14.4 - 50 $0 0.556 0.408 0.051 0.493 0.101 0.947 0.707
225 Kimi Linear 48B A3B Instruct Kimi 14.4 13.7 36.3 $0 0.585 0.412 0.027 0.378 0.199 - -
226 Llama 3.3 Nemotron Super 49B v1 (Non-reasoning) NVIDIA 14.3 7.6 7.7 $0 0.698 0.517 0.035 0.28 0.229 0.775 0.193
227 Reka Flash 3 Reka AI 14.3 8.9 33.7 $0.35 0.669 0.529 0.051 0.435 0.267 0.893 0.51
228 Claude 3.5 Sonnet (June '24) Anthropic 14.2 26 - $6 0.751 0.56 0.037 - 0.316 0.695 0.097
229 Qwen3 4B (Reasoning) Alibaba 14.2 - 22.3 $0.398 0.696 0.522 0.051 0.465 0.035 0.933 0.657
230 Solar Pro 2 (Non-reasoning) Upstage 14.1 11.1 30 $0.5 0.75 0.561 0.038 0.424 0.248 0.889 0.407
231 Qwen3 VL 4B Instruct Alibaba 14.1 4.5 37 $0 0.634 0.371 0.037 0.29 0.137 - -
232 GPT-4o (ChatGPT) OpenAI 14.1 - - $7.5 0.773 0.511 0.037 - 0.334 0.797 0.103
233 Llama 3.1 Tulu3 405B Allen Institute for AI 14.1 - - $0 0.716 0.516 0.035 0.291 0.302 0.778 0.133
234 Pixtral Large Mistral 14 - 2.3 $3 0.701 0.505 0.036 0.261 0.292 0.714 0.07
235 Mistral Small 3.1 Mistral 14 13.6 3.7 $0.15 0.659 0.454 0.048 0.212 0.265 0.707 0.093
236 Nova Pro Amazon 14 10.7 7 $1.4 0.691 0.499 0.034 0.233 0.208 0.786 0.107
237 Grok 2 (Dec '24) xAI 13.9 - - $0 0.709 0.51 0.038 0.267 0.285 0.778 0.133
238 Gemini 1.5 Flash (Sep '24) Google 13.8 - - $0 0.68 0.463 0.035 0.273 0.267 0.827 0.18
239 Llama 4 Scout Meta 13.7 6.6 14 $0.282 0.752 0.587 0.043 0.299 0.17 0.844 0.283
240 Llama 3.1 Nemotron Instruct 70B NVIDIA 13.7 10.6 11 $1.2 0.69 0.465 0.046 0.169 0.233 0.733 0.247
241 GPT-4 Turbo OpenAI 13.7 21.5 - $15 0.694 - 0.033 0.291 0.319 0.737 0.15
242 NVIDIA Nemotron 3 Nano 30B A3B (Non-reasoning) NVIDIA 13.6 15.2 13.3 $0.105 0.579 0.399 0.046 0.36 0.23 - -
243 Hermes 4 - Llama-3.1 70B (Non-reasoning) Nous Research 13.6 9.2 11.3 $0.198 0.664 0.491 0.036 0.269 0.277 - -
244 GPT-5 nano (minimal) OpenAI 13.6 13.9 27.3 $0.138 0.556 0.428 0.041 0.47 0.291 - -
245 NVIDIA Nemotron Nano 9B V2 (Non-reasoning) NVIDIA 13.4 7.5 62.3 $0.102 0.739 0.557 0.04 0.701 0.209 - -
246 Command A Cohere 13.4 9.8 13 $4.375 0.712 0.527 0.046 0.287 0.281 0.819 0.097
247 Grok Beta xAI 13.3 - - $0 0.703 0.471 0.047 0.241 0.295 0.737 0.103
248 Phi-4 Microsoft Azure 13.2 11 18 $0.219 0.714 0.575 0.041 0.231 0.26 0.81 0.143
249 Qwen2.5 Instruct 32B Alibaba 13.2 - - $0 0.697 0.466 0.038 0.248 0.229 0.805 0.11
250 Qwen3 8B (Non-reasoning) Alibaba 13.2 7 24.3 $0.31 0.643 0.452 0.028 0.202 0.168 0.828 0.243
251 Qwen3 14B (Non-reasoning) Alibaba 13.2 12.1 58 $0.613 0.675 0.47 0.042 0.28 0.265 0.871 0.28
252 Qwen3 1.7B (Reasoning) Alibaba 13.1 1.4 38.7 $0.398 0.57 0.356 0.048 0.308 0.043 0.894 0.51
253 GPT-4.1 nano OpenAI 13.1 11 24 $0.175 0.657 0.512 0.039 0.326 0.259 0.848 0.237
254 Llama 3.1 Instruct 70B Meta 13.1 10.8 4 $0.56 0.676 0.409 0.046 0.232 0.267 0.649 0.173
255 Mistral Large 2 (Jul '24) Mistral 13 - 0 $3 0.683 0.472 0.032 0.267 0.271 0.714 0.093
256 GLM-4.5V (Non-reasoning) Z AI 12.9 10.5 15.3 $0.9 0.751 0.573 0.036 0.352 0.188 - -
257 Qwen2.5 Coder Instruct 32B Alibaba 12.9 - - $0.141 0.635 0.417 0.038 0.295 0.271 0.767 0.12
258 GPT-4 OpenAI 12.8 13.1 - $37.5 - - - - - - -
259 Nova Lite Amazon 12.8 5.1 7 $0.105 0.59 0.433 0.046 0.167 0.139 0.765 0.107
260 Gemini 2.5 Flash-Lite (Non-reasoning) Google 12.7 7.3 35.3 $0.175 0.724 0.474 0.037 0.4 0.177 0.926 0.5
261 Mistral Small 3 Mistral 12.7 - 4.3 $0.15 0.652 0.462 0.041 0.252 0.236 0.715 0.08
262 Jamba Reasoning 3B AI21 Labs 12.6 2.4 10.7 $0 0.577 0.333 0.046 0.21 0.059 - -
263 GPT-4o mini OpenAI 12.6 - 14.7 $0.263 0.648 0.426 0.04 0.234 0.229 0.789 0.117
264 Claude 3 Opus Anthropic 12.5 19.5 - $30 0.696 0.489 0.031 0.279 0.233 0.641 0.033
265 DeepSeek-V2.5 (Dec '24) DeepSeek 12.5 - - $0 - - - - - 0.763 -
266 Qwen3 4B (Non-reasoning) Alibaba 12.5 - - $0.188 0.586 0.398 0.037 0.233 0.167 0.843 0.213
267 Gemini 2.0 Flash Thinking Experimental (Dec '24) Google 12.3 - - $0 - - - - - 0.48 -
268 Claude 3.5 Haiku Anthropic 12.3 - - $1.6 0.634 0.408 0.035 0.314 0.274 0.721 0.033
269 DeepSeek-V2.5 DeepSeek 12.3 - - $0 - - - - - - -
270 Devstral Small (May '25) Mistral 12.1 - - $0.15 0.632 0.434 0.04 0.258 0.245 0.684 0.067
271 Mistral Saba Mistral 12.1 - - $0 0.611 0.424 0.041 - 0.241 0.677 0.13
272 DeepSeek R1 Distill Llama 8B DeepSeek 12.1 - 41.3 $0 0.543 0.302 0.042 0.233 0.119 0.853 0.333
273 R1 1776 Perplexity 12 - - $0 - - - - - 0.954 -
274 Gemini 1.5 Pro (May '24) Google 12 19.8 - $0 0.657 0.371 0.039 0.244 0.274 0.673 0.08
275 Reka Flash (Sep '24) Reka AI 12 - - $0.35 - - - - - 0.529 -
276 Qwen2.5 Turbo Alibaba 12 - - $0.087 0.633 0.41 0.042 0.163 0.153 0.805 0.12
277 Llama 3.2 Instruct 90B (Vision) Meta 11.9 - - $0.72 0.671 0.432 0.049 0.214 0.24 0.629 0.05
278 Llama 3.1 Instruct 8B Meta 11.9 4.9 4.3 $0.1 0.476 0.259 0.051 0.116 0.132 0.519 0.077
279 Solar Mini Upstage 11.9 - - $0.15 - - - - - 0.331 -
280 Ministral 3B (Dec '25) Mistral 11.8 4.8 22 $0.1 0.524 0.358 0.053 0.247 0.144 - -
281 Grok-1 xAI 11.7 - - $0 - - - - - - -
282 EXAONE 4.0 32B (Non-reasoning) LG AI Research 11.7 9.4 39.3 $0.7 0.768 0.628 0.049 0.472 0.252 0.939 0.47
283 Qwen2 Instruct 72B Alibaba 11.7 - - $0 0.622 0.371 0.037 0.159 0.229 0.701 0.147
284 Nova Micro Amazon 11.6 4.1 6 $0.061 0.531 0.358 0.047 0.14 0.094 0.703 0.08
285 LFM2 8B A1B Liquid AI 11.4 2.3 25.3 $0 0.505 0.344 0.049 0.151 0.068 - -
286 Granite 4.0 H Small IBM 11.2 8.4 13.7 $0.107 0.624 0.416 0.037 0.251 0.209 - -
287 Granite 4.0 Micro IBM 11.1 4.9 6 $0 0.447 0.336 0.051 0.18 0.119 - -
288 Gemini 1.5 Flash-8B Google 11.1 - - $0 0.569 0.359 0.045 0.217 0.229 0.689 0.033
289 Llama 3.2 Instruct 11B (Vision) Meta 10.9 4.2 1.7 $0.16 0.464 0.221 0.052 0.11 0.112 0.516 0.093
290 Gemma 3n E4B Instruct Google 10.9 4.1 14.3 $0.025 0.488 0.296 0.044 0.146 0.081 0.771 0.137
291 Phi-4 Mini Instruct Microsoft Azure 10.9 3.6 6.7 $0 0.465 0.331 0.042 0.126 0.108 0.696 0.03
292 DeepHermes 3 - Mistral 24B Preview (Non-reasoning) Nous Research 10.9 - - $0 0.58 0.382 0.039 0.195 0.228 0.595 0.047
293 Granite 3.3 8B (Non-reasoning) IBM 10.8 3.4 6.7 $0.085 0.468 0.338 0.042 0.127 0.101 0.665 0.047
294 Jamba 1.5 Large AI21 Labs 10.7 - - $3.5 0.572 0.427 0.04 0.143 0.163 0.606 0.047
295 Qwen3 1.7B (Non-reasoning) Alibaba 10.6 2.3 7.3 $0.188 0.411 0.283 0.052 0.126 0.069 0.717 0.097
296 DeepSeek-Coder-V2 DeepSeek 10.6 - - $0 - - - - - 0.743 -
297 OLMo 2 32B Allen Institute for AI 10.6 2.7 3.3 $0 0.511 0.328 0.037 0.068 0.08 - -
298 Hermes 3 - Llama-3.1 70B Nous Research 10.6 - - $0.3 0.571 0.401 0.041 0.188 0.231 0.538 0.023
299 Jamba 1.6 Large AI21 Labs 10.6 - - $3.5 0.565 0.387 0.04 0.172 0.184 0.58 0.047
300 Qwen3 0.6B (Reasoning) Alibaba 10.5 0.9 18 $0.398 0.347 0.239 0.057 0.121 0.028 0.75 0.1
301 Gemini 1.5 Flash (May '24) Google 10.5 - - $0 0.574 0.324 0.042 0.196 0.181 0.554 0.093
302 Gemma 3 27B Instruct Google 10.3 9.4 20.7 $0 0.669 0.428 0.047 0.137 0.212 0.883 0.253
303 Claude 3 Sonnet Anthropic 10.3 - - $6 0.579 0.4 0.038 0.175 0.229 0.414 0.047
304 Llama 3 Instruct 70B Meta 10.2 - - $0.88 0.574 0.379 0.044 0.198 0.189 0.483 0
305 Mistral Small (Sep '24) Mistral 10.2 - - $0.3 0.529 0.381 0.043 0.141 0.156 0.563 0.063
306 NVIDIA Nemotron Nano 12B v2 VL (Non-reasoning) NVIDIA 10.1 5.9 26.7 $0.3 0.649 0.439 0.045 0.345 0.176 - -
307 Gemma 3n E4B Instruct Preview (May '25) Google 10.1 - - $0 0.483 0.278 0.049 0.138 0.086 0.749 0.107
308 Gemini 1.0 Ultra Google 10.1 17.6 - $0 - - - - - - -
309 Phi-3 Mini Instruct 3.8B Microsoft Azure 10.1 3 0.3 $0.228 0.435 0.319 0.044 0.116 0.09 0.457 0.04
310 Phi-4 Multimodal Instruct Microsoft Azure 10 - - $0 0.485 0.315 0.044 0.131 0.11 0.693 0.093
311 Qwen2.5 Coder Instruct 7B Alibaba 10 - - $0 0.473 0.339 0.048 0.126 0.148 0.66 0.053
312 Mistral Large (Feb '24) Mistral 9.9 - - $6 0.515 0.351 0.034 0.178 0.208 0.527 0
313 Mixtral 8x22B Instruct Mistral 9.8 - - $0 0.537 0.332 0.041 0.148 0.188 0.545 0
314 Gemma 3n E2B Instruct Google 9.7 2.2 10.3 $0 0.378 0.229 0.04 0.095 0.052 0.691 0.09
315 Llama 3.2 Instruct 3B Meta 9.7 - 3.3 $0.06 0.347 0.255 0.052 0.083 0.052 0.489 0.067
316 Llama 2 Chat 7B Meta 9.7 - - $0.1 0.164 0.227 0.058 0.002 0 0.059 0
317 Qwen1.5 Chat 110B Alibaba 9.5 - - $0 - 0.289 - - - - -
318 Jamba 1.7 Large AI21 Labs 9.4 7.7 2.3 $3.5 0.577 0.39 0.038 0.181 0.188 0.6 0.057
319 Claude 3 Haiku Anthropic 9.3 - - $0.5 - - - 0.154 0.186 0.394 0.01
320 Claude 2.1 Anthropic 9.3 14 - $0 0.495 0.319 0.042 0.195 0.184 0.374 0.033
321 OLMo 2 7B Allen Institute for AI 9.3 1.2 0.7 $0 0.282 0.288 0.055 0.041 0.037 - -
322 Molmo 7B-D Allen Institute for AI 9.2 1.2 0 $0 0.371 0.24 0.051 0.039 0.036 - -
323 Gemma 3 12B Instruct Google 9.1 6.3 18.3 $0 0.595 0.349 0.048 0.137 0.174 0.853 0.22
324 Llama 3.2 Instruct 1B Meta 9.1 0.6 0 $0.053 0.2 0.196 0.053 0.019 0.017 0.14 0
325 Claude 2.0 Anthropic 9.1 12.9 - $0 0.486 0.344 - 0.171 0.194 - 0
326 DeepSeek R1 Distill Qwen 1.5B DeepSeek 9.1 - 22 $0 0.269 0.098 0.033 0.07 0.066 0.687 0.177
327 DeepSeek-V2-Chat DeepSeek 9.1 - - $0 - - - - - - -
328 GPT-3.5 Turbo OpenAI 9 10.7 - $0.75 0.462 0.297 - - - 0.441 -
329 Mistral Small (Feb '24) Mistral 9 - - $1.5 0.419 0.302 0.044 0.111 0.134 0.562 0.007
330 Mistral Medium Mistral 9 - - $4.088 0.491 0.349 0.034 0.099 0.118 0.405 0.037
331 LFM 40B Liquid AI 8.8 - - $0 0.425 0.327 0.049 0.096 0.071 0.48 0.023
332 Arctic Instruct Snowflake 8.8 - - $0 - - - - - - -
333 Qwen Chat 72B Alibaba 8.8 - - $0 - - - - - - -
334 Llama 3 Instruct 8B Meta 8.7 - - $0.07 0.405 0.296 0.051 0.096 0.119 0.499 0
335 Gemma 3 1B Instruct Google 8.6 0.2 3.3 $0 0.135 0.237 0.052 0.017 0.007 0.484 0
336 PALM-2 Google 8.6 4.6 - $0 - - - - - - -
337 Gemini 1.0 Pro Google 8.5 - - $0 0.431 0.277 0.046 0.116 0.117 0.403 0.007
338 DeepSeek Coder V2 Lite Instruct DeepSeek 8.5 - - $0 0.429 0.319 0.053 0.158 0.139 - -
339 Gemma 3 270M Google 8.4 0.1 2.3 $0 0.055 0.224 0.042 0.003 0 - -
340 Exaone 4.0 1.2B (Reasoning) LG AI Research 8.4 3.1 50.3 $0 0.588 0.515 0.058 0.516 0.093 - -
341 Llama 2 Chat 70B Meta 8.4 - - $0 0.406 0.327 0.05 0.098 - 0.323 0
342 Llama 2 Chat 13B Meta 8.4 - - $0 0.406 0.321 0.047 0.098 0.118 0.329 0.017
343 DeepSeek LLM 67B Chat (V1) DeepSeek 8.4 - - $0 - - - - - - -
344 Exaone 4.0 1.2B (Non-reasoning) LG AI Research 8.3 2.5 24 $0 0.5 0.424 0.058 0.293 0.074 - -
345 OpenChat 3.5 (1210) OpenChat 8.3 - - $0 0.31 0.23 0.048 0.115 - 0.307 0
346 DBRX Instruct Databricks 8.3 - - $0 0.397 0.331 0.066 0.093 0.118 0.279 0.03
347 Command-R+ (Apr '24) Cohere 8.3 - - $6 0.432 0.323 0.045 0.122 0.118 0.279 0.007
348 Granite 4.0 H 1B IBM 8.2 2.7 6.3 $0 0.277 0.263 0.05 0.115 0.082 - -
349 OLMo 3 7B Instruct Allen Institute for AI 8.1 3.4 41.3 $0.125 0.522 0.4 0.058 0.266 0.103 - -
350 Jamba 1.5 Mini AI21 Labs 8 - - $0.25 0.371 0.302 0.051 0.062 0.08 0.357 0.01
351 LFM2 2.6B Liquid AI 7.9 1.3 8.3 $0 0.298 0.306 0.052 0.081 0.025 - -
352 Jamba 1.6 Mini AI21 Labs 7.9 - - $0.25 0.367 0.3 0.046 0.071 0.101 0.257 0.033
353 Mixtral 8x7B Instruct Mistral 7.7 - - $0.54 0.387 0.292 0.045 0.066 0.028 0.299 0
354 DeepHermes 3 - Llama-3.1 8B Preview (Non-reasoning) Nous Research 7.6 - - $0 0.365 0.27 0.043 0.085 0.091 0.218 0
355 Jamba 1.7 Mini AI21 Labs 7.5 3.1 0.3 $0.25 0.388 0.322 0.045 0.061 0.093 0.258 0.013
356 Llama 65B Meta 7.4 - - $0 - - - - - - -
357 Qwen Chat 14B Alibaba 7.4 - - $0 - - - - - - -
358 Claude Instant Anthropic 7.4 7.8 - $0 0.434 0.33 0.038 0.109 - 0.264 0
359 Mistral 7B Instruct Mistral 7.4 - - $0.25 0.245 0.177 0.043 0.046 0.024 0.121 0
360 Command-R (Mar '24) Cohere 7.4 - - $0.75 0.338 0.284 0.048 0.048 0.062 0.164 0.007
361 Granite 4.0 1B IBM 7.3 2.9 6.3 $0 0.325 0.281 0.051 0.047 0.087 - -
362 Granite 4.0 350M IBM 6.8 0.3 0 $0 0.124 0.261 0.057 0.024 0.009 - -
363 LFM2 1.2B Liquid AI 6.5 0.8 3.3 $0 0.257 0.228 0.057 0.02 0.025 - -
364 Gemma 3 4B Instruct Google 6.3 2.9 12.7 $0 0.417 0.291 0.052 0.112 0.073 0.766 0.063
365 Qwen3 0.6B (Non-reasoning) Alibaba 5.8 1.4 10.3 $0.188 0.231 0.231 0.052 0.073 0.041 0.521 0.017
366 Granite 4.0 H 350M IBM 5.7 0.6 1.3 $0 0.127 0.257 0.064 0.019 0.017 - -
367 DeepSeek-OCR DeepSeek - - - $0.048 - - - - - - -
368 Grok Voice Agent xAI - - - $0 - - - - - - -
369 OLMo 3.1 32B Think Allen Institute for AI - 9.8 77.3 $0 0.763 0.591 0.06 0.695 0.293 - -
370 Cogito v2.1 (Reasoning) Deep Cogito - 24.1 72.7 $1.25 0.849 0.768 0.11 0.688 0.41 - -
371 Mi:dm K 2.5 Pro Korea Telecom - 12.5 76.7 $0 0.809 0.701 0.077 0.656 0.332 - -
372 Mi:dm K 2.5 Pro Preview Korea Telecom - 11.8 78.7 $0 0.813 0.722 0.088 0.576 0.297 - -
373 GPT-3.5 Turbo (0613) OpenAI - - - $0 - - - - - - -
374 GPT-4o Realtime (Dec '24) OpenAI - - - $0 - - - - - - -
375 GPT-4o mini Realtime (Dec '24) OpenAI - - - $0 - - - - - - -

* Prices are blended prices per million tokens (3:1 input/output ratio).

About the Artificial Analysis AI Model Rankings

Artificial Analysis is an independent AI benchmarking and analysis company whose benchmarks and analysis support developers, researchers, enterprises, and other AI users. It tests both proprietary and open-weights models, and centers on the end-to-end user experience, measuring response time, output speed, and cost in real-world use.

Its quality benchmarks cover language understanding and reasoning, while its performance benchmarks focus on metrics users can actually perceive: time to first token, output speed, and end-to-end response time. To enable consistent, fair comparison across models, it distinguishes OpenAI tokens from native tokens and computes a blended price at a 3:1 input/output ratio. Benchmarked subjects include models, endpoints, systems, and providers, spanning language models, speech, image generation, and more, with the aim of showing users the real-world performance and cost-effectiveness of different AI services.
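To make the token-normalization idea concrete, here is a minimal Python sketch using the tiktoken library to count text in OpenAI tokens. The rescaling formula in the final comment is an illustrative assumption, not Artificial Analysis's exact method:

```python
# pip install tiktoken
import tiktoken

def openai_token_count(text: str) -> int:
    """Count `text` in OpenAI tokens (cl100k_base encoding here), giving a
    common unit for comparing models whose native tokenizers differ."""
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))

# Hypothetical normalization: if a model reports `native_count` tokens and
# `native_speed` tokens/s for the same text, its speed in OpenAI tokens is
#   native_speed * openai_token_count(text) / native_count
```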

Artificial Analysis Benchmark Metrics

Context Window

The maximum combined number of input and output tokens. The limit on output tokens is usually much lower (the exact figure varies by model).

Output Speed

Tokens received per second while the model is generating (i.e., measured after the first chunk has been received from the API, for models that support streaming).

Latency (Time to First Token)

The time in seconds from when the API request is sent until the first token is received. For reasoning models that stream their reasoning tokens, this is the first reasoning token; for models that do not support streaming, it is the time until the completed response is received.
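A minimal sketch of how these two streaming metrics could be measured, assuming `stream` is any iterable that yields chunks as they arrive from an API (a hypothetical stand-in for a real streaming client):

```python
import time

def measure_stream(stream):
    """Return (time_to_first_token, output_tokens_per_second) for a stream."""
    start = time.perf_counter()
    first = None
    n = 0
    for _chunk in stream:
        if first is None:
            first = time.perf_counter()  # latency ends at the first chunk
        n += 1
    end = time.perf_counter()
    if first is None:
        raise ValueError("stream produced no chunks")
    ttft = first - start    # "Latency (Time to First Token)"
    gen_time = end - first  # time spent generating after the first chunk
    speed = (n - 1) / gen_time if gen_time > 0 else float("inf")
    return ttft, speed
```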

Price

Price per token, expressed in USD per million tokens. The figure is a blend of the input-token and output-token prices at a 3:1 input/output ratio.
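In other words, the blended figure is a 3:1 weighted average of the input and output prices. A quick sketch (the $1.25/$10 per-1M prices below are illustrative, though they do reproduce the $3.438 shown for several OpenAI models in the table):

```python
def blended_price(input_per_m: float, output_per_m: float, ratio: float = 3.0) -> float:
    """Blend per-1M-token input/output prices at a `ratio`:1 input/output mix."""
    return (ratio * input_per_m + output_per_m) / (ratio + 1)

print(blended_price(1.25, 10.0))  # 3.4375 -> rounds to the $3.438 seen above
```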

Common LLM Benchmarks

MMLU Pro

Massive Multitask Language Understanding Professional: an enhanced version of MMLU designed to evaluate the reasoning ability of large language models. It addresses the limitations of the original MMLU by filtering out easy questions, expanding the answer options from 4 to 10, and emphasizing complex multi-step reasoning. It covers roughly 12,000 questions across 14 domains.

GPQA

Graduate-Level Google-Proof Q&A Benchmark: a challenging graduate-level question-answering benchmark that evaluates an AI system's ability to give accurate answers in complex scientific domains such as physics, chemistry, and biology. The questions are designed to be "Google-proof": answering them requires deep understanding and reasoning rather than simple fact recall.

HLE

Humanity's Last Exam: a comprehensive evaluation designed to test AI systems on human-level reasoning, problem-solving, and knowledge integration. It contains roughly 2,500 to 3,000 expert-level questions spanning more than 100 subjects, emphasizing multi-step reasoning and the ability to handle novel scenarios.

LiveCodeBench

A contamination-free benchmark of LLM coding ability. It continuously collects new problems from contests on platforms such as LeetCode, AtCoder, and Codeforces to prevent training-data contamination. Beyond code generation, it also evaluates self-repair, code execution, and test-output prediction.

SciCode

A benchmark that evaluates language models' ability to generate code for real scientific research problems. It spans 16 subfields across 6 domains, including physics, mathematics, materials science, biology, and chemistry. Problems are drawn from real scientific workflows and typically require knowledge recall, reasoning, and code synthesis.

Math 500

A benchmark designed to evaluate language models' mathematical reasoning and problem-solving. It contains 500 hard problems drawn from high-level high-school math competitions such as AMC and AIME, covering algebra, combinatorics, geometry, number theory, and precalculus.

AIME

American Invitational Mathematics Examination: a benchmark based on problems from the American Invitational Mathematics Examination, regarded as one of the most challenging AI tests of advanced mathematical reasoning. It contains 30 "olympiad-level" math problems with integer answers, testing multi-step reasoning, abstraction, and problem-solving.
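Because every AIME answer is an integer from 0 to 999, scoring reduces to exact match on the final answer. A minimal sketch, assuming each model prediction has already been extracted as a final-answer string:

```python
def aime_accuracy(predictions, answers):
    """Exact-match accuracy over integer-answer problems (AIME answers
    are integers from 0 to 999)."""
    def to_int(x):
        try:
            return int(str(x).strip())
        except ValueError:
            return None  # an unparseable prediction counts as wrong
    correct = sum(to_int(p) == int(a) for p, a in zip(predictions, answers))
    return correct / len(answers)

# Example: aime_accuracy(["204", "25", "oops"], [204, 113, 371]) == 1/3
```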