AI Large Model Rankings: the Artificial Analysis AI Model Leaderboard

The ranking data on this page comes from Artificial Analysis, which compares and ranks the performance of more than 100 AI models (LLMs) on metrics including intelligence and price. The leaderboard also aggregates results from other authoritative AI benchmarks for reference.
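Because the table below is just rows of scores and prices, it is straightforward to re-rank or filter it for your own needs. The following is a minimal sketch, not the site's actual code; the `ModelRow` type, the sample values, and the $3 budget cutoff are illustrative assumptions.

```python
# Minimal sketch: sort leaderboard-style rows by composite index and
# filter by price. Field names and sample data are illustrative only.
from dataclasses import dataclass

@dataclass
class ModelRow:
    name: str
    org: str
    composite: float  # Artificial Analysis composite index
    price: float      # blended price, $/1M tokens

rows = [
    ModelRow("Model A", "Lab X", 57.2, 5.625),
    ModelRow("Model B", "Lab Y", 51.4, 2.15),
    ModelRow("Model C", "Lab Z", 49.6, 0.525),
]

# Rank by composite index, descending (Python's sort is stable,
# so tied models keep their input order).
ranked = sorted(rows, key=lambda r: r.composite, reverse=True)

# Shortlist: models under an assumed $3 per 1M tokens budget.
budget = [r.name for r in ranked if r.price < 3]
print(budget)  # -> ['Model B', 'Model C']
```

The same pattern extends to any column in the table, e.g. sorting by a coding score or filtering out models missing a given benchmark result.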

AI Large Model Rankings (based on Artificial Analysis)

Model Info | Artificial Analysis Benchmark Results | Other AI Benchmark Results
Rank | Model | Organization | Composite Index | Coding | Math | Price ($/1M) | MMLU Pro | GPQA | HLE | LiveCodeBench | SciCode | Math 500 | AIME
1 GPT-5.4 (xhigh) OpenAI 57.2 57.3 - $5.625 - 0.92 0.416 - 0.566 - -
2 Gemini 3.1 Pro Preview Google 57.2 55.5 - $4.5 - 0.941 0.447 - 0.589 - -
3 GPT-5.3 Codex (xhigh) OpenAI 54 53.1 - $4.813 - 0.915 0.399 - 0.532 - -
4 Claude Opus 4.6 (Adaptive Reasoning, Max Effort) Anthropic 53 48.1 - $10 - 0.896 0.367 - 0.519 - -
5 Muse Spark Meta 52.1 47.5 - $0 - 0.884 0.399 - 0.515 - -
6 Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) Anthropic 51.7 50.9 - $6 - 0.875 0.3 - 0.468 - -
7 GLM-5.1 (Reasoning) Z AI 51.4 43.4 - $2.15 - 0.868 0.28 - 0.438 - -
8 GPT-5.2 (xhigh) OpenAI 51.3 48.7 99 $4.813 0.874 0.903 0.354 0.889 0.521 - -
9 Qwen3.6 Plus Alibaba 50 42.9 - $1.125 - 0.882 0.257 - 0.407 - -
10 GLM-5 (Reasoning) Z AI 49.8 44.2 - $1.55 - 0.82 0.272 - 0.462 - -
11 Claude Opus 4.5 (Reasoning) Anthropic 49.7 47.8 91.3 $10 0.895 0.866 0.284 0.871 0.495 - -
12 MiniMax-M2.7 MiniMax 49.6 41.9 - $0.525 - 0.874 0.281 - 0.47 - -
13 Grok 4.20 0309 v2 (Reasoning) xAI 49.3 40.5 - $3 - 0.911 0.322 - 0.456 - -
14 MiMo-V2-Pro Xiaomi 49.2 41.4 - $1.5 - 0.87 0.283 - 0.425 - -
15 GPT-5.2 Codex (xhigh) OpenAI 49 43 - $4.813 - 0.899 0.335 - 0.546 - -
16 Grok 4.20 0309 (Reasoning) xAI 48.5 42.2 - $3 - 0.885 0.3 - 0.447 - -
17 Gemini 3 Pro Preview (high) Google 48.4 46.5 95.7 $4.5 0.898 0.908 0.372 0.917 0.561 - -
18 GPT-5.4 mini (xhigh) OpenAI 48.1 51.5 - $1.688 - 0.875 0.266 - 0.499 - -
19 GPT-5.1 (high) OpenAI 47.7 44.7 94 $3.438 0.87 0.873 0.265 0.868 0.433 - -
20 Kimi K2.5 (Reasoning) Kimi 46.8 39.5 - $1.2 - 0.879 0.294 - 0.49 - -
21 GLM-5-Turbo Z AI 46.8 36.8 - $0 - 0.847 0.254 - 0.436 - -
22 GPT-5.2 (medium) OpenAI 46.6 44.2 96.7 $4.813 0.859 0.864 0.249 0.894 0.462 - -
23 Claude Opus 4.6 (Non-reasoning, High Effort) Anthropic 46.5 47.6 - $10 - 0.84 0.186 - 0.457 - -
24 Gemini 3 Flash Preview (Reasoning) Google 46.4 42.6 97 $1.125 0.89 0.898 0.347 0.908 0.506 - -
25 Qwen3.5 397B A17B (Reasoning) Alibaba 45 41.3 - $1.35 - 0.893 0.273 - 0.42 - -
26 MiMo-V2-Omni-0327 Xiaomi 44.9 36.9 - $0 - 0.855 0.204 - 0.395 - -
27 GPT-5 (high) OpenAI 44.6 36 94.3 $3.438 0.871 0.854 0.265 0.846 0.429 0.994 0.957
28 GPT-5 Codex (high) OpenAI 44.6 38.9 98.7 $3.438 0.865 0.837 0.256 0.84 0.409 - -
29 GPT-5.4 nano (xhigh) OpenAI 44.4 43.9 - $0.463 - 0.817 0.265 - 0.469 - -
30 Claude Sonnet 4.6 (Non-reasoning, High Effort) Anthropic 44.4 46.4 - $6 - 0.799 0.132 - 0.469 - -
31 KAT Coder Pro V2 KwaiKAT 43.8 45.6 - $0.525 - 0.855 0.16 - 0.383 - -
32 MiMo-V2-Omni Xiaomi 43.4 35.5 - $0 - 0.828 0.199 - 0.367 - -
33 GPT-5.1 Codex (high) OpenAI 43.1 36.6 95.7 $3.438 0.86 0.86 0.234 0.849 0.402 - -
34 Claude Opus 4.5 (Non-reasoning) Anthropic 43.1 42.9 62.7 $10 0.889 0.81 0.129 0.738 0.47 - -
35 Claude 4.5 Sonnet (Reasoning) Anthropic 43 38.6 88 $6 0.875 0.834 0.173 0.714 0.447 - -
36 GLM 5V Turbo (Reasoning) Z AI 42.9 36.2 - $0 - 0.809 0.158 - 0.435 - -
37 Claude Sonnet 4.6 (Non-reasoning, Low Effort) Anthropic 42.6 43 - $6 - 0.797 0.108 - 0.441 - -
38 Qwen3.5 27B (Reasoning) Alibaba 42.1 34.9 - $0.825 - 0.858 0.222 - 0.395 - -
39 GLM-4.7 (Reasoning) Z AI 42.1 36.3 95 $1 0.856 0.859 0.251 0.894 0.451 - -
40 GPT-5 (medium) OpenAI 42 39 91.7 $3.438 0.867 0.842 0.235 0.703 0.411 0.991 0.917
41 Claude 4.1 Opus (Reasoning) Anthropic 42 36.5 80.3 $30 0.88 0.809 0.119 0.654 0.409 - -
42 MiniMax-M2.5 MiniMax 41.9 37.4 - $0.525 - 0.848 0.191 - 0.426 - -
43 DeepSeek V3.2 (Reasoning) DeepSeek 41.7 36.7 92 $0.315 0.862 0.84 0.222 0.862 0.389 - -
44 Qwen3.5 122B A10B (Reasoning) Alibaba 41.6 34.7 - $1.1 - 0.857 0.234 - 0.42 - -
45 MiMo-V2-Flash (Feb 2026) Xiaomi 41.5 33.5 - $0.15 - 0.835 0.2 - 0.383 - -
46 Grok 4 xAI 41.5 40.5 92.7 $6 0.866 0.877 0.239 0.819 0.457 0.99 0.943
47 Gemini 3 Pro Preview (low) Google 41.3 39.4 86.7 $4.5 0.895 0.887 0.276 0.857 0.499 - -
48 GPT-5 mini (high) OpenAI 41.2 35.3 90.7 $0.688 0.837 0.828 0.197 0.838 0.392 - -
49 Kimi K2 Thinking Kimi 40.9 34.8 94.7 $1.075 0.848 0.838 0.223 0.853 0.424 - -
50 o3-pro OpenAI 40.7 - - $35 - 0.845 - - - - -
51 GLM-5 (Non-reasoning) Z AI 40.6 39 - $1.55 - 0.666 0.072 - 0.383 - -
52 Qwen3.5 397B A17B (Non-reasoning) Alibaba 40.1 37.4 - $1.35 - 0.861 0.188 - 0.411 - -
53 Qwen3 Max Thinking Alibaba 39.9 30.5 - $2.4 - 0.861 0.262 - 0.431 - -
54 MiniMax-M2.1 MiniMax 39.4 32.8 82.7 $0.525 0.875 0.83 0.222 0.81 0.407 - -
55 Gemma 4 31B (Reasoning) Google 39.2 38.7 - $0 - 0.857 0.227 - 0.434 - -
56 GPT-5 (low) OpenAI 39.2 30.7 83 $3.438 0.86 0.808 0.184 0.763 0.391 0.987 0.83
57 MiMo-V2-Flash (Reasoning) Xiaomi 39.2 31.8 96.3 $0.15 0.843 0.846 0.211 0.868 0.394 - -
58 Claude 4 Opus (Reasoning) Anthropic 39 34 73.3 $30 0.873 0.796 0.117 0.636 0.398 0.982 0.757
59 GPT-5 mini (medium) OpenAI 38.9 32.9 85 $0.688 0.828 0.803 0.146 0.692 0.41 - -
60 Claude 4 Sonnet (Reasoning) Anthropic 38.7 34.1 74.3 $6 0.842 0.777 0.096 0.655 0.4 0.991 0.773
61 Grok 4.1 Fast (Reasoning) xAI 38.6 30.9 89.3 $0.275 0.854 0.853 0.176 0.822 0.442 - -
62 Qwen3.5 Omni Plus Alibaba 38.6 27.6 - $1.5 - 0.826 0.139 - 0.405 - -
63 GPT-5.1 Codex mini (high) OpenAI 38.6 36.4 91.7 $0.688 0.82 0.813 0.169 0.836 0.426 - -
64 o3 OpenAI 38.4 38.4 88.3 $3.5 0.853 0.827 0.2 0.808 0.41 0.992 0.903
65 GPT-5.4 nano (medium) OpenAI 38.1 35 - $0.463 - 0.761 0.147 - 0.384 - -
66 Step 3.5 Flash StepFun 37.8 31.6 - $0.15 - 0.831 0.191 - 0.404 - -
67 GPT-5.4 mini (medium) OpenAI 37.7 37.5 - $1.688 - 0.823 0.171 - 0.442 - -
68 Kimi K2.5 (Non-reasoning) Kimi 37.3 25.8 - $1.2 - 0.789 0.123 - 0.396 - -
69 Qwen3.5 27B (Non-reasoning) Alibaba 37.2 33.4 - $0.825 - 0.842 0.132 - 0.367 - -
70 Claude 4.5 Haiku (Reasoning) Anthropic 37.1 32.6 83.7 $2 0.76 0.672 0.097 0.615 0.433 - -
71 Qwen3.5 35B A3B (Reasoning) Alibaba 37.1 30.3 - $0.688 - 0.845 0.197 - 0.377 - -
72 Claude 4.5 Sonnet (Non-reasoning) Anthropic 37.1 33.5 37 $6 0.86 0.727 0.071 0.59 0.428 - -
73 MiniMax-M2 MiniMax 36.1 29.2 78.3 $0.525 0.82 0.777 0.125 0.826 0.361 - -
74 NVIDIA Nemotron 3 Super 120B A12B (Reasoning) NVIDIA 36 31.2 - $0.412 - 0.8 0.192 - 0.36 - -
75 KAT-Coder-Pro V1 KwaiKAT 36 18.3 94.7 $0.525 0.813 0.764 0.334 0.747 0.366 - -
76 Claude 4.1 Opus (Non-reasoning) Anthropic 36 - - $30 - - - - - - -
77 Qwen3.5 122B A10B (Non-reasoning) Alibaba 35.9 31.6 - $1.1 - 0.827 0.148 - 0.356 - -
78 Nova 2.0 Pro Preview (medium) Amazon 35.7 30.4 89 $3.438 0.83 0.785 0.089 0.73 0.427 - -
79 GPT-5.4 (Non-reasoning) OpenAI 35.4 41 - $5.625 - 0.748 0.106 - 0.471 - -
80 Grok 4 Fast (Reasoning) xAI 35.1 27.4 89.7 $0.275 0.85 0.847 0.17 0.832 0.442 - -
81 Gemini 3 Flash Preview (Non-reasoning) Google 35 37.8 55.7 $1.125 0.882 0.812 0.141 0.797 0.499 - -
82 Claude 3.7 Sonnet (Reasoning) Anthropic 34.7 27.6 56.3 $6 0.837 0.772 0.103 0.473 0.403 0.947 0.487
83 Gemini 2.5 Pro Google 34.6 31.9 87.7 $3.438 0.862 0.844 0.211 0.801 0.428 0.967 0.887
84 Nova 2.0 Lite (high) Amazon 34.5 23.4 94.3 $0.85 0.818 0.811 0.109 0.711 0.369 - -
85 GLM-4.7 (Non-reasoning) Z AI 34.2 32 48 $0.938 0.794 0.664 0.061 0.562 0.354 - -
86 DeepSeek V3.1 Terminus (Reasoning) DeepSeek 33.9 33.7 89.7 $0.8 0.851 0.792 0.152 0.798 0.406 - -
87 GPT-5.2 (Non-reasoning) OpenAI 33.6 34.7 51 $4.813 0.814 0.712 0.073 0.669 0.404 - -
88 Gemini 3.1 Flash-Lite Preview Google 33.5 30.1 - $0.563 - 0.822 0.162 - 0.419 - -
89 Doubao Seed Code ByteDance Seed 33.5 31.3 79.3 $0 0.854 0.764 0.133 0.766 0.407 - -
90 gpt-oss-120B (high) OpenAI 33.3 28.6 93.4 $0.263 0.808 0.782 0.185 0.878 0.389 - -
91 o4-mini (high) OpenAI 33.1 25.6 90.7 $1.925 0.832 0.784 0.175 0.859 0.465 0.989 0.94
92 Claude 4 Opus (Non-reasoning) Anthropic 33 - 36.3 $30 0.86 0.701 0.059 0.542 0.409 0.941 0.563
93 Claude 4 Sonnet (Non-reasoning) Anthropic 33 30.6 38 $6 0.837 0.683 0.04 0.449 0.373 0.934 0.407
94 DeepSeek V3.2 Exp (Reasoning) DeepSeek 32.9 33.3 87.7 $0.315 0.85 0.797 0.138 0.789 0.377 - -
95 Mercury 2 Inception 32.8 30.6 - $0.375 - 0.77 0.155 - 0.387 - -
96 GLM-4.6 (Reasoning) Z AI 32.5 29.5 86 $0.981 0.829 0.78 0.133 0.695 0.384 - -
97 Qwen3 Max Thinking (Preview) Alibaba 32.5 24.5 82.3 $2.4 0.824 0.776 0.12 0.535 0.387 - -
98 Qwen3.5 9B (Reasoning) Alibaba 32.4 25.3 - $0.096 - 0.806 0.133 - 0.275 - -
99 Gemma 4 31B (Non-reasoning) Google 32.3 33.9 - $0 - 0.763 0.115 - 0.411 - -
100 DeepSeek V3.2 (Non-reasoning) DeepSeek 32.1 34.6 59 $0.315 0.837 0.751 0.105 0.593 0.387 - -
101 Grok 3 mini Reasoning (high) xAI 32.1 25.2 84.7 $0.35 0.828 0.791 0.111 0.696 0.406 0.992 0.933
102 K-EXAONE (Reasoning) LG AI Research 32.1 27 90.3 $0 0.838 0.783 0.131 0.768 0.356 - -
103 Nova 2.0 Pro Preview (low) Amazon 31.9 24.5 63.3 $3.438 0.822 0.751 0.052 0.638 0.387 - -
104 Trinity Large Thinking Arcee AI 31.9 27.2 - $0.395 - 0.752 0.147 - 0.361 - -
105 Qwen3 Max Alibaba 31.4 26.4 80.7 $2.4 0.841 0.764 0.111 0.767 0.383 - -
106 Gemma 4 26B A4B (Reasoning) Google 31.2 22.4 - $0.198 - 0.792 0.183 - 0.4 - -
107 Claude 4.5 Haiku (Non-reasoning) Anthropic 31.1 29.6 39 $2 0.8 0.646 0.043 0.511 0.344 - -
108 Gemini 2.5 Flash Preview (Sep '25) (Reasoning) Google 31.1 24.6 78.3 $0 0.842 0.793 0.127 0.713 0.405 - -
109 Kimi K2 0905 Kimi 30.9 25.9 57.3 $1.137 0.819 0.767 0.063 0.61 0.307 - -
110 o1 OpenAI 30.8 20.5 - $26.25 0.841 0.747 0.077 0.679 0.358 0.97 0.723
111 Claude 3.7 Sonnet (Non-reasoning) Anthropic 30.8 26.7 21 $6 0.803 0.656 0.048 0.394 0.376 0.85 0.223
112 Qwen3.5 35B A3B (Non-reasoning) Alibaba 30.7 16.8 - $0.688 - 0.819 0.128 - 0.293 - -
113 MiMo-V2-Flash (Non-reasoning) Xiaomi 30.4 25.8 67.7 $0.15 0.744 0.656 0.08 0.402 0.259 - -
114 Gemini 2.5 Pro Preview (Mar '25) Google 30.3 46.7 - $0 0.858 0.836 0.171 0.778 0.395 0.98 0.87
115 GLM-4.6 (Non-reasoning) Z AI 30.2 30.2 44.3 $1 0.784 0.632 0.052 0.561 0.331 - -
116 GLM-4.7-Flash (Reasoning) Z AI 30.1 25.9 - $0.152 - 0.581 0.071 - 0.337 - -
117 Nova 2.0 Lite (medium) Amazon 29.7 23.9 88.7 $0.85 0.813 0.768 0.086 0.663 0.368 - -
118 Grok 4.20 0309 (Non-reasoning) xAI 29.7 25.4 - $3 - 0.785 0.225 - 0.322 - -
119 Gemini 2.5 Pro Preview (May '25) Google 29.5 - - $3.438 0.837 0.822 0.154 0.77 0.416 0.986 0.843
120 Qwen3 235B A22B 2507 (Reasoning) Alibaba 29.5 23.2 91 $2.625 0.843 0.79 0.15 0.788 0.424 0.984 0.94
121 DeepSeek V3.2 Speciale DeepSeek 29.4 37.9 96.7 $0 0.863 0.871 0.261 0.896 0.44 - -
122 ERNIE 5.0 Thinking Preview Baidu 29.1 29.2 85 $0 0.83 0.777 0.127 0.812 0.375 - -
123 Grok 4.20 0309 v2 (Non-reasoning) xAI 29 22 - $3 - 0.776 0.242 - 0.328 - -
124 Grok Code Fast 1 xAI 28.7 23.7 43.3 $0.525 0.793 0.727 0.075 0.657 0.362 - -
125 DeepSeek V3.1 Terminus (Non-reasoning) DeepSeek 28.5 31.9 53.7 $0.626 0.836 0.751 0.084 0.529 0.321 - -
126 DeepSeek V3.2 Exp (Non-reasoning) DeepSeek 28.4 30 57.7 $0.315 0.836 0.738 0.086 0.554 0.399 - -
127 Qwen3 Coder Next Alibaba 28.3 22.9 - $0.6 - 0.737 0.093 - 0.323 - -
128 Apriel-v1.5-15B-Thinker ServiceNow 28.3 18.7 87.5 $0 0.773 0.713 0.12 0.728 0.348 - -
129 DeepSeek V3.1 (Non-reasoning) DeepSeek 28.1 28.4 49.7 $0.84 0.833 0.735 0.063 0.577 0.367 - -
130 Nova 2.0 Omni (medium) Amazon 28 15.1 89.7 $0.85 0.809 0.76 0.068 0.66 0.362 - -
131 Nemotron Cascade 2 30B A3B NVIDIA 27.7 25.1 - $0 - 0.763 0.114 - 0.33 - -
132 DeepSeek V3.1 (Reasoning) DeepSeek 27.7 29.7 89.7 $0.875 0.851 0.779 0.13 0.784 0.391 - -
133 Apriel-v1.6-15B-Thinker ServiceNow 27.6 22 88 $0 0.79 0.733 0.098 0.807 0.373 - -
134 Qwen3 VL 235B A22B (Reasoning) Alibaba 27.6 20.9 88.3 $2.625 0.836 0.772 0.101 0.646 0.399 - -
135 GPT-5.1 (Non-reasoning) OpenAI 27.4 27.3 38 $3.438 0.801 0.643 0.052 0.494 0.365 - -
136 Qwen3.5 9B (Non-reasoning) Alibaba 27.3 21.4 - $0.08 - 0.786 0.086 - 0.277 - -
137 Mistral Small 4 (Reasoning) Mistral 27.2 24.3 - $0.263 - 0.769 0.095 - 0.38 - -
138 Gemma 4 26B A4B (Non-reasoning) Google 27.1 29.1 - $0 - 0.714 0.107 - 0.373 - -
139 Magistral Medium 1.2 Mistral 27.1 21.7 82 $2.75 0.815 0.739 0.096 0.75 0.392 - -
140 DeepSeek R1 0528 (May '25) DeepSeek 27.1 24 76 $2.362 0.849 0.813 0.149 0.77 0.403 0.983 0.893
141 Qwen3.5 4B (Reasoning) Alibaba 27.1 17.5 - $0.06 - 0.771 0.078 - 0.161 - -
142 Gemini 2.5 Flash (Reasoning) Google 27 22.2 73.3 $0.85 0.832 0.79 0.111 0.695 0.394 0.981 0.823
143 GPT-5 nano (high) OpenAI 26.8 20.3 83.7 $0.138 0.78 0.676 0.082 0.789 0.366 - -
144 Qwen3 Next 80B A3B (Reasoning) Alibaba 26.7 19.5 84.3 $1.875 0.824 0.759 0.117 0.784 0.388 - -
145 GLM-4.5 (Reasoning) Z AI 26.4 26.3 73.7 $0.843 0.835 0.782 0.122 0.738 0.348 0.979 0.873
146 GPT-4.1 OpenAI 26.3 21.8 34.7 $3.5 0.806 0.666 0.046 0.457 0.381 0.913 0.437
147 Kimi K2 Kimi 26.3 22.1 57 $1.002 0.824 0.766 0.07 0.556 0.345 0.971 0.693
148 Qwen3 Max (Preview) Alibaba 26.1 25.5 75 $2.4 0.838 0.764 0.093 0.651 0.37 - -
149 Solar Pro 3 Upstage 25.9 13.3 - $0 - 0.724 0.101 - 0.247 - -
150 Qwen3.5 Omni Flash Alibaba 25.9 14 - $0.275 - 0.742 0.071 - 0.255 - -
151 GPT-5 nano (medium) OpenAI 25.9 22.9 78.3 $0.138 0.772 0.67 0.076 0.763 0.338 - -
152 o3-mini OpenAI 25.9 17.9 - $1.925 0.791 0.748 0.087 0.717 0.399 0.973 0.77
153 o1-pro OpenAI 25.8 - - $262.5 - - - - - - -
154 Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) Google 25.7 22.1 56.7 $0 0.836 0.766 0.078 0.625 0.375 - -
155 o3-mini (high) OpenAI 25.2 17.3 - $1.925 0.802 0.773 0.123 0.734 0.398 0.985 0.86
156 Grok 3 xAI 25.2 19.8 58 $6 0.799 0.693 0.051 0.425 0.368 0.87 0.33
157 Seed-OSS-36B-Instruct ByteDance Seed 25.2 16.7 84.7 $0.3 0.815 0.726 0.091 0.765 0.365 - -
158 Qwen3 235B A22B 2507 Instruct Alibaba 25 22.1 71.7 $1.225 0.828 0.753 0.106 0.524 0.36 0.98 0.717
159 Qwen3 Coder 480B A35B Instruct Alibaba 24.8 24.6 39.3 $3 0.788 0.618 0.044 0.585 0.359 0.942 0.477
160 Qwen3 VL 32B (Reasoning) Alibaba 24.7 14.5 84.7 $2.625 0.818 0.733 0.096 0.738 0.285 - -
161 Nova 2.0 Lite (low) Amazon 24.6 13.6 46.7 $0.85 0.788 0.698 0.042 0.469 0.333 - -
162 Sonar Reasoning Pro Perplexity 24.6 - - $0 - - - - - 0.957 0.79
163 gpt-oss-20B (high) OpenAI 24.5 18.5 89.3 $0.094 0.748 0.688 0.098 0.777 0.344 - -
164 gpt-oss-120B (low) OpenAI 24.5 15.5 66.7 $0.263 0.775 0.672 0.052 0.707 0.36 - -
165 GPT-5.4 nano (Non-Reasoning) OpenAI 24.4 27.9 - $0.463 - 0.558 0.042 - 0.352 - -
166 MiniMax M1 80k MiniMax 24.4 14.5 61 $0.963 0.816 0.697 0.082 0.711 0.374 0.98 0.847
167 NVIDIA Nemotron 3 Nano 30B A3B (Reasoning) NVIDIA 24.3 19 91 $0.105 0.794 0.757 0.102 0.741 0.296 - -
168 Gemini 2.5 Flash Preview (Reasoning) Google 24.3 - - $0 0.8 0.698 0.116 0.505 0.359 0.981 0.843
169 K2 Think V2 MBZUAI Institute of Foundation Models 24.1 15.5 - $0 - 0.713 0.095 - 0.33 - -
170 LongCat Flash Lite LongCat 23.9 16.5 - $0 - 0.636 0.06 - 0.284 - -
171 GPT-5 (minimal) OpenAI 23.9 25.1 31.7 $3.438 0.806 0.673 0.054 0.558 0.388 0.861 0.367
172 HyperCLOVA X SEED Think (32B) Naver 23.7 17.5 59 $0 0.785 0.615 0.055 0.629 0.284 - -
173 o1-preview OpenAI 23.7 34 - $28.875 - - - - - 0.924 -
174 Grok 4.1 Fast (Non-reasoning) xAI 23.6 19.5 34.3 $0.275 0.743 0.637 0.05 0.399 0.296 - -
175 K-EXAONE (Non-reasoning) LG AI Research 23.4 13.5 44 $0 0.81 0.695 0.054 - 0.27 - -
176 GLM-4.6V (Reasoning) Z AI 23.4 19.7 85.3 $0.45 0.799 0.719 0.089 0.16 0.304 - -
177 GPT-5.4 mini (Non-Reasoning) OpenAI 23.3 25.3 - $1.688 - 0.606 0.057 - 0.396 - -
178 Nova 2.0 Omni (low) Amazon 23.2 13.9 56 $0.85 0.798 0.699 0.04 0.592 0.343 - -
179 GLM-4.5-Air Z AI 23.2 23.8 80.7 $0.425 0.815 0.733 0.068 0.684 0.306 0.965 0.673
180 Nova 2.0 Pro Preview (Non-reasoning) Amazon 23.1 20.5 30.7 $3.438 0.772 0.636 0.04 0.473 0.281 - -
181 Mi:dm K 2.5 Pro Korea Telecom 23.1 12.6 76.7 $0 0.809 0.701 0.077 0.656 0.332 - -
182 Grok 4 Fast (Non-reasoning) xAI 23.1 19 41.3 $0.275 0.73 0.606 0.05 0.401 0.329 - -
183 GPT-4.1 mini OpenAI 22.9 18.5 46.3 $0.7 0.781 0.664 0.046 0.483 0.404 0.925 0.43
184 Mistral Large 3 Mistral 22.8 22.7 38 $0.75 0.807 0.68 0.041 0.465 0.362 - -
185 Ring-1T InclusionAI 22.8 16.8 89.3 $0 0.806 0.774 0.102 0.643 0.367 - -
186 Qwen3.5 4B (Non-reasoning) Alibaba 22.6 13.7 - $0.06 - 0.712 0.075 - 0.183 - -
187 Qwen3 30B A3B 2507 (Reasoning) Alibaba 22.4 14.7 56.3 $0.75 0.805 0.707 0.098 0.707 0.333 0.976 0.907
188 DeepSeek V3 0324 DeepSeek 22.3 22 41 $1.25 0.819 0.655 0.052 0.405 0.358 0.942 0.52
189 INTELLECT-3 Prime Intellect 22.2 19.1 88 $0 0.822 0.761 0.121 0.777 0.391 - -
190 GLM-4.7-Flash (Non-reasoning) Z AI 22.1 11 - $0.152 - 0.452 0.049 - 0.255 - -
191 Devstral 2 Mistral 22 23.7 36.7 $0 0.762 0.594 0.036 0.448 0.331 - -
192 GPT-5 (ChatGPT) OpenAI 21.8 21.2 48.3 $3.438 0.82 0.686 0.058 0.543 0.378 - -
193 Solar Open 100B (Reasoning) Upstage 21.7 10.5 - $0 - 0.657 0.092 - 0.269 - -
194 Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning) Google 21.6 18.1 68.7 $0.175 0.808 0.709 0.066 0.688 0.287 - -
195 Grok 3 Reasoning Beta xAI 21.6 - - $0 - - - - - - -
196 Mistral Medium 3.1 Mistral 21.3 18.3 38.3 $0.8 0.683 0.588 0.044 0.406 0.338 - -
197 MiniMax M1 40k MiniMax 20.9 14.1 13.7 $0 0.808 0.682 0.075 0.657 0.378 0.972 0.813
198 gpt-oss-20B (low) OpenAI 20.8 14.4 62.3 $0.094 0.718 0.611 0.051 0.652 0.34 - -
199 Qwen3 VL 235B A22B Instruct Alibaba 20.8 16.5 70.7 $1.225 0.823 0.712 0.063 0.594 0.359 - -
200 GPT-5 mini (minimal) OpenAI 20.7 21.9 46.7 $0.688 0.775 0.687 0.05 0.545 0.369 - -
201 K2-V2 (high) MBZUAI Institute of Foundation Models 20.6 16.1 78.3 $0 0.786 0.681 0.098 0.694 0.286 - -
202 Gemini 2.5 Flash (Non-reasoning) Google 20.6 17.8 60.3 $0.85 0.809 0.683 0.051 0.495 0.291 0.932 0.5
203 o1-mini OpenAI 20.4 - - $0 0.742 0.603 0.049 0.576 0.323 0.944 0.603
204 Qwen3 Next 80B A3B Instruct Alibaba 20.1 15.3 66.3 $0.875 0.819 0.738 0.073 0.684 0.307 - -
205 Tri-21B-think Preview Trillion Labs 20 7.4 - $0 - 0.538 0.057 - 0.178 - -
206 GPT-4.5 (Preview) OpenAI 20 - - $0 - - - - - - -
207 Qwen3 Coder 30B A3B Instruct Alibaba 20 19.4 29 $0.9 0.706 0.516 0.04 0.403 0.278 0.893 0.297
208 Qwen3 235B A22B (Reasoning) Alibaba 19.8 17.4 82 $2.625 0.828 0.7 0.117 0.622 0.399 0.93 0.84
209 QwQ 32B Alibaba 19.7 - 29 $0.745 0.764 0.593 0.082 0.631 0.358 0.957 0.78
210 Qwen3 VL 30B A3B (Reasoning) Alibaba 19.7 13.1 82.3 $0.75 0.807 0.72 0.087 0.697 0.288 - -
211 Gemini 2.0 Flash Thinking Experimental (Jan '25) Google 19.6 24.1 - $0 0.798 0.701 0.071 0.321 0.329 0.944 0.5
212 Devstral Small 2 Mistral 19.5 20.7 34.3 $0 0.678 0.532 0.034 0.348 0.288 - -
213 Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning) Google 19.4 14.5 46.7 $0.175 0.796 0.651 0.046 0.641 0.285 - -
214 Motif-2-12.7B-Reasoning Motif Technologies 19.1 11.9 80.3 $0 0.796 0.695 0.082 0.651 0.282 - -
215 Nova Premier Amazon 19 13.8 17.3 $5 0.733 0.569 0.047 0.317 0.279 0.839 0.17
216 Ling-1T InclusionAI 19 18.8 71.3 $0 0.822 0.719 0.072 0.677 0.352 - -
217 Gemma 4 E4B (Reasoning) Google 18.8 13.7 - $0 - 0.576 0.037 - 0.244 - -
218 Mistral Medium 3 Mistral 18.8 13.6 30.3 $0.8 0.76 0.578 0.043 0.4 0.331 0.907 0.44
219 Magistral Medium 1 Mistral 18.8 16 40.3 $0 0.753 0.679 0.095 0.527 0.297 0.917 0.7
220 DeepSeek R1 (Jan '25) DeepSeek 18.8 15.9 68 $2.362 0.844 0.708 0.093 0.617 0.357 0.966 0.683
221 Solar Pro 2 (Preview) (Reasoning) Upstage 18.8 - - $0 0.768 0.578 0.057 0.462 0.164 0.9 0.663
222 Llama Nemotron Super 49B v1.5 (Reasoning) NVIDIA 18.7 15.2 76.7 $0.175 0.814 0.748 0.068 0.737 0.348 0.983 0.86
223 K2-V2 (medium) MBZUAI Institute of Foundation Models 18.7 14 64.7 $0 0.761 0.598 0.044 0.541 0.252 - -
224 Claude 3.5 Haiku Anthropic 18.7 10.7 - $1.6 0.634 0.408 0.035 0.314 0.274 0.721 0.033
225 Devstral Medium Mistral 18.7 15.9 4.7 $0.8 0.708 0.492 0.038 0.337 0.294 0.707 0.067
226 Mistral Small 4 (Non-reasoning) Mistral 18.6 16.4 - $0.263 - 0.571 0.037 - 0.281 - -
227 Hermes 4 - Llama-3.1 405B (Reasoning) Nous Research 18.6 16 69.7 $1.5 0.829 0.727 0.103 0.686 0.252 - -
228 Tri-21B-Think Trillion Labs 18.6 6.3 - $0 - 0.601 0.061 - 0.174 - -
229 GPT-4o (Aug '24) OpenAI 18.6 16.6 - $4.375 - 0.521 0.029 0.317 0.331 0.795 0.117
230 GPT-4o (March 2025, chatgpt-4o-latest) OpenAI 18.6 - 25.7 $0 0.803 0.655 0.05 0.425 0.366 0.893 0.327
231 Llama 3.3 Nemotron Super 49B v1 (Reasoning) NVIDIA 18.5 9.4 54.7 $0 0.785 0.643 0.065 0.277 0.282 0.959 0.583
232 Gemini 2.0 Flash (Feb '25) Google 18.5 13.6 21.7 $0.263 0.779 0.623 0.053 0.334 0.333 0.93 0.33
233 Llama 4 Maverick Meta 18.4 15.6 19.3 $0.487 0.809 0.671 0.048 0.397 0.331 0.889 0.39
234 Magistral Small 1.2 Mistral 18.2 14.8 80.3 $0.75 0.768 0.663 0.061 0.723 0.352 - -
235 Sarvam 105B (high) Sarvam 18.2 9.8 - $0 - 0.738 0.101 - 0.264 - -
236 Qwen3 4B 2507 (Reasoning) Alibaba 18.2 9.5 82.7 $0 0.743 0.667 0.059 0.641 0.256 - -
237 Gemini 2.0 Pro Experimental (Feb '25) Google 18.1 25.5 - $0 0.805 0.622 0.068 0.347 0.312 0.923 0.36
238 Nova 2.0 Lite (Non-reasoning) Amazon 18 12.5 33.7 $0.85 0.743 0.603 0.03 0.346 0.24 - -
239 Claude 3 Opus Anthropic 18 19.5 - $30 0.696 0.489 0.031 0.279 0.233 0.641 0.033
240 Devstral Small (May '25) Mistral 18 12.2 - $0.075 0.632 0.434 0.04 0.258 0.245 0.684 0.067
241 Sonar Reasoning Perplexity 17.9 - - $0 - 0.623 - - - 0.921 0.77
242 Gemini 2.5 Flash Preview (Non-reasoning) Google 17.8 - - $0 0.783 0.594 0.05 0.406 0.233 0.926 0.433
243 Hermes 4 - Llama-3.1 405B (Non-reasoning) Nous Research 17.6 18.1 15.3 $1.5 0.729 0.536 0.042 0.546 0.346 - -
244 Gemini 2.5 Flash-Lite (Reasoning) Google 17.6 9.5 53.3 $0.175 0.759 0.625 0.064 0.593 0.193 0.969 0.703
245 Llama 3.1 Instruct 405B Meta 17.4 14.5 3 $3.688 0.732 0.515 0.042 0.305 0.299 0.703 0.213
246 GPT-4o (Nov '24) OpenAI 17.3 16.7 6 $4.375 0.748 0.543 0.033 0.309 0.333 0.759 0.15
247 DeepSeek R1 Distill Qwen 32B DeepSeek 17.2 - 63 $0.27 0.739 0.615 0.055 0.27 0.376 0.941 0.687
248 Qwen3 VL 32B Instruct Alibaba 17.2 15.6 68.3 $1.225 0.791 0.671 0.063 0.514 0.301 - -
249 GLM-4.6V (Non-reasoning) Z AI 17.1 11.1 26.3 $0.45 0.752 0.566 0.037 0.411 0.272 - -
250 Qwen3 235B A22B (Non-reasoning) Alibaba 17 14 23.7 $1.225 0.762 0.613 0.047 0.343 0.299 0.902 0.327
251 Gemini 2.0 Flash (experimental) Google 16.8 - - $0 0.782 0.636 0.047 0.21 0.34 0.911 0.3
252 Magistral Small 1 Mistral 16.8 11.1 41.3 $0 0.746 0.641 0.072 0.514 0.241 0.963 0.713
253 EXAONE 4.0 32B (Reasoning) LG AI Research 16.7 14 80 $0 0.818 0.739 0.105 0.747 0.344 0.977 0.843
254 Qwen3 VL 8B (Reasoning) Alibaba 16.7 9.8 30.7 $0.66 0.749 0.579 0.033 0.353 0.219 - -
255 Nova 2.0 Omni (Non-reasoning) Amazon 16.6 13.8 37 $0.85 0.719 0.555 0.039 0.305 0.279 - -
256 DeepSeek V3 (Dec '24) DeepSeek 16.5 16.4 26 $0.625 0.752 0.557 0.036 0.359 0.354 0.887 0.253
257 Qwen3 32B (Reasoning) Alibaba 16.5 13.8 73 $2.625 0.798 0.668 0.083 0.546 0.354 0.961 0.807
258 DeepSeek R1 0528 Qwen3 8B DeepSeek 16.4 7.8 63.7 $0 0.739 0.612 0.056 0.513 0.204 0.932 0.65
259 Qwen3.5 2B (Reasoning) Alibaba 16.3 3.5 - $0.04 - 0.456 0.021 - 0.028 - -
260 Qwen2.5 Max Alibaba 16.3 - - $2.8 0.762 0.587 0.045 0.359 0.337 0.835 0.233
261 Qwen3 14B (Reasoning) Alibaba 16.2 13.1 55.7 $1.313 0.774 0.604 0.043 0.523 0.316 0.961 0.763
262 Nanbeige4.1-3B Nanbeige 16.1 8.9 - $0 - 0.849 0.1 - 0.266 - -
263 Qwen3 VL 30B A3B Instruct Alibaba 16.1 14.3 72.3 $0.35 0.764 0.695 0.064 0.476 0.308 - -
264 Ministral 3 14B Mistral 16 10.9 30 $0.2 0.693 0.572 0.046 0.351 0.236 - -
265 DeepSeek R1 Distill Llama 70B DeepSeek 16 11.4 53.7 $0.875 0.795 0.402 0.061 0.266 0.312 0.935 0.67
266 Hermes 4 - Llama-3.1 70B (Reasoning) Nous Research 16 14.4 68.7 $0.198 0.811 0.699 0.079 0.653 0.341 - -
267 Gemini 1.5 Pro (Sep '24) Google 16 23.6 - $0 0.75 0.589 0.049 0.316 0.295 0.876 0.23
268 Solar Pro 2 (Preview) (Non-reasoning) Upstage 16 - - $0 0.725 0.544 0.038 0.385 0.272 0.871 0.297
269 Claude 3.5 Sonnet (Oct '24) Anthropic 15.9 30.2 - $6 0.772 0.599 0.039 0.381 0.366 0.771 0.157
270 Falcon-H1R-7B TII UAE 15.8 9.8 80 $0 0.725 0.661 0.108 0.724 0.249 - -
271 DeepSeek R1 Distill Qwen 14B DeepSeek 15.8 - 55.7 $0 0.74 0.484 0.044 0.376 0.239 0.949 0.667
272 Ling-flash-2.0 InclusionAI 15.7 16.7 65.3 $0.247 0.777 0.657 0.063 0.589 0.289 - -
273 Qwen3 Omni 30B A3B (Reasoning) Alibaba 15.6 12.7 74 $0.43 0.792 0.726 0.073 0.679 0.306 - -
274 Qwen2.5 Instruct 72B Alibaba 15.6 11.9 14 $0 0.72 0.491 0.042 0.276 0.267 0.858 0.16
275 Sonar Perplexity 15.5 - - $0 0.689 0.471 0.073 0.295 0.229 0.817 0.487
276 Step3 VL 10B StepFun 15.4 13.9 - $0 - 0.69 0.102 - 0.311 - -
277 Qwen3 30B A3B (Reasoning) Alibaba 15.3 11 72.3 $0.75 0.777 0.616 0.066 0.506 0.285 0.959 0.753
278 Gemma 4 E2B (Reasoning) Google 15.2 9 - $0 - 0.433 0.048 - 0.209 - -
279 Devstral Small (Jul '25) Mistral 15.2 12.1 29.3 $0.15 0.622 0.414 0.037 0.254 0.243 0.635 0.003
280 Sonar Pro Perplexity 15.2 - - $0 0.755 0.578 0.079 0.275 0.226 0.745 0.29
281 QwQ 32B-Preview Alibaba 15.2 - - $0.135 0.648 0.557 0.048 0.337 0.038 0.91 0.453
282 Mistral Large 2 (Nov '24) Mistral 15.1 13.8 14 $3 0.697 0.486 0.04 0.293 0.292 0.736 0.11
283 Mistral Small 3.2 Mistral 15.1 13.3 27 $0.15 0.681 0.505 0.043 0.275 0.264 0.883 0.323
284 GLM-4.5V (Reasoning) Z AI 15.1 10.9 73 $0.9 0.788 0.684 0.059 0.604 0.221 - -
285 Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) NVIDIA 15 13.1 63.7 $0.9 0.825 0.728 0.081 0.641 0.347 0.952 0.747
286 ERNIE 4.5 300B A47B Baidu 15 14.5 41.3 $0.485 0.776 0.811 0.035 0.467 0.315 0.931 0.493
287 Qwen3 30B A3B 2507 Instruct Alibaba 15 14.2 66.3 $0.35 0.777 0.659 0.068 0.515 0.304 0.975 0.727
288 Solar Pro 2 (Reasoning) Upstage 14.9 12.1 61.3 $0 0.805 0.687 0.07 0.616 0.302 0.967 0.69
289 NVIDIA Nemotron Nano 12B v2 VL (Reasoning) NVIDIA 14.9 11.8 75 $0.3 0.759 0.572 0.053 0.694 0.262 - -
290 Gemma 4 E4B (Non-reasoning) Google 14.8 6.4 - $0 - 0.549 0.047 - 0.039 - -
291 Ministral 3 8B Mistral 14.8 10 31.7 $0.15 0.642 0.471 0.043 0.303 0.208 - -
292 NVIDIA Nemotron Nano 9B V2 (Reasoning) NVIDIA 14.8 8.3 69.7 $0.07 0.742 0.57 0.046 0.724 0.22 - -
293 NVIDIA Nemotron 3 Nano 4B NVIDIA 14.7 10 - $0 - 0.513 0.048 - 0.164 - -
294 Qwen3.5 2B (Non-reasoning) Alibaba 14.7 4.9 - $0.04 - 0.438 0.049 - 0.072 - -
295 Gemini 2.0 Flash-Lite (Feb '25) Google 14.7 - - $0 0.724 0.535 0.036 0.185 0.25 0.873 0.277
296 Llama Nemotron Super 49B v1.5 (Non-reasoning) NVIDIA 14.6 10.5 8 $0.175 0.692 0.481 0.043 0.29 0.238 0.77 0.137
297 Llama 3.3 Instruct 70B Meta 14.5 10.7 7.7 $0.64 0.713 0.498 0.04 0.288 0.26 0.773 0.3
298 GPT-4o (May '24) OpenAI 14.5 24.2 - $7.5 0.74 0.526 0.028 0.334 0.309 0.791 0.11
299 Gemini 2.0 Flash-Lite (Preview) Google 14.5 - - $0 - 0.542 0.044 0.179 0.247 0.873 0.303
300 Mistral Small 3.1 Mistral 14.5 13.9 3.7 $0.15 0.659 0.454 0.048 0.212 0.265 0.707 0.093
301 Qwen3 32B (Non-reasoning) Alibaba 14.5 - 19.7 $1.225 0.727 0.535 0.043 0.288 0.28 0.869 0.303
302 Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning) NVIDIA 14.4 - 50 $0 0.556 0.408 0.051 0.493 0.101 0.947 0.707
303 Kimi Linear 48B A3B Instruct Kimi 14.4 14.2 36.3 $0 0.585 0.412 0.027 0.378 0.199 - -
304 K2-V2 (low) MBZUAI Institute of Foundation Models 14.4 10.5 35.3 $0 0.713 0.541 0.039 0.393 0.223 - -
305 Llama 3.3 Nemotron Super 49B v1 (Non-reasoning) NVIDIA 14.3 7.6 7.7 $0 0.698 0.517 0.035 0.28 0.229 0.775 0.193
306 Qwen3 VL 8B Instruct Alibaba 14.3 7.3 27.3 $0.31 0.686 0.427 0.029 0.332 0.174 - -
307 Claude 3.5 Sonnet (June '24) Anthropic 14.2 26 - $6 0.751 0.56 0.037 - 0.316 0.695 0.097
308 Qwen3 4B (Reasoning) Alibaba 14.2 - 22.3 $0.398 0.696 0.522 0.051 0.465 0.035 0.933 0.657
309 GPT-4o (ChatGPT) OpenAI 14.1 - - $0 0.773 0.511 0.037 - 0.334 0.797 0.103
310 Llama 3.1 Tulu3 405B Allen Institute for AI 14.1 - - $0 0.716 0.516 0.035 0.291 0.302 0.778 0.133
311 Ring-flash-2.0 InclusionAI 14 10.6 83.7 $0.247 0.793 0.725 0.089 0.628 0.168 - -
312 Pixtral Large Mistral 14 - 2.3 $3 0.701 0.505 0.036 0.261 0.292 0.714 0.07
313 Olmo 3.1 32B Think Allen Institute for AI 13.9 9.8 77.3 $0 0.763 0.591 0.06 0.695 0.293 - -
314 Grok 2 (Dec '24) xAI 13.9 - - $0 0.709 0.51 0.038 0.267 0.285 0.778 0.133
315 GPT-5 nano (minimal) OpenAI 13.8 14.2 27.3 $0.138 0.556 0.428 0.041 0.47 0.291 - -
316 Gemini 1.5 Flash (Sep '24) Google 13.8 - - $0 0.68 0.463 0.035 0.273 0.267 0.827 0.18
317 GPT-4 Turbo OpenAI 13.7 21.5 - $15 0.694 - 0.033 0.291 0.319 0.737 0.15
318 Qwen3 VL 4B (Reasoning) Alibaba 13.7 6.7 25.7 $0 0.7 0.494 0.044 0.32 0.171 - -
319 Solar Pro 2 (Non-reasoning) Upstage 13.6 11.3 30 $0 0.75 0.561 0.038 0.424 0.248 0.889 0.407
320 Llama 4 Scout Meta 13.5 6.7 14 $0.292 0.752 0.587 0.043 0.299 0.17 0.844 0.283
321 Command A Cohere 13.5 9.9 13 $4.375 0.712 0.527 0.046 0.287 0.281 0.819 0.097
322 Nova Pro Amazon 13.5 11 7 $1.4 0.691 0.499 0.034 0.233 0.208 0.786 0.107
323 Llama 3.1 Nemotron Instruct 70B NVIDIA 13.4 10.8 11 $1.2 0.69 0.465 0.046 0.169 0.233 0.733 0.247
324 Grok Beta xAI 13.3 - - $0 0.703 0.471 0.047 0.241 0.295 0.737 0.103
325 NVIDIA Nemotron Nano 9B V2 (Non-reasoning) NVIDIA 13.2 7.5 62.3 $0.086 0.739 0.557 0.04 0.701 0.209 - -
326 NVIDIA Nemotron 3 Nano 30B A3B (Non-reasoning) NVIDIA 13.2 15.8 13.3 $0.087 0.579 0.399 0.046 0.36 0.23 - -
327 Qwen2.5 Instruct 32B Alibaba 13.2 - - $0 0.697 0.466 0.038 0.248 0.229 0.805 0.11
328 Qwen3 8B (Reasoning) Alibaba 13.2 9 19 $0.66 0.743 0.589 0.042 0.406 0.226 0.904 0.747
329 GPT-4.1 nano OpenAI 13 11.2 24 $0.175 0.657 0.512 0.039 0.326 0.259 0.848 0.237
330 Mistral Large 2 (Jul '24) Mistral 13 - 0 $3 0.683 0.472 0.032 0.267 0.271 0.714 0.093
331 Qwen2.5 Coder Instruct 32B Alibaba 12.9 - - $0 0.635 0.417 0.038 0.295 0.271 0.767 0.12
332 Qwen3 4B 2507 Instruct Alibaba 12.9 9.1 52.3 $0 0.672 0.517 0.047 0.377 0.181 - -
333 GPT-4 OpenAI 12.8 13.1 - $37.5 - - - - - - -
334 Qwen3 14B (Non-reasoning) Alibaba 12.8 12.4 58 $0.613 0.675 0.47 0.042 0.28 0.265 0.871 0.28
335 Gemini 2.5 Flash-Lite (Non-reasoning) Google 12.7 7.4 35.3 $0.175 0.724 0.474 0.037 0.4 0.177 0.926 0.5
336 Mistral Small 3 Mistral 12.7 - 4.3 $0.15 0.652 0.462 0.041 0.252 0.236 0.715 0.08
337 Nova Lite Amazon 12.7 5.1 7 $0.105 0.59 0.433 0.046 0.167 0.139 0.765 0.107
338 GLM-4.5V (Non-reasoning) Z AI 12.7 10.8 15.3 $0.9 0.751 0.573 0.036 0.352 0.188 - -
339 Hermes 4 - Llama-3.1 70B (Non-reasoning) Nous Research 12.6 9.2 11.3 $0.198 0.664 0.491 0.036 0.269 0.277 - -
340 GPT-4o mini OpenAI 12.6 - 14.7 $0.263 0.648 0.426 0.04 0.234 0.229 0.789 0.117
341 Llama 3.1 Instruct 70B Meta 12.5 10.9 4 $0.56 0.676 0.409 0.046 0.232 0.267 0.649 0.173
342 DeepSeek-V2.5 (Dec '24) DeepSeek 12.5 - - $0 - - - - - 0.763 -
343 Qwen3 30B A3B (Non-reasoning) Alibaba 12.5 13.3 21.7 $0.35 0.71 0.515 0.046 0.322 0.264 0.863 0.26
344 Qwen3 4B (Non-reasoning) Alibaba 12.5 - - $0.188 0.586 0.398 0.037 0.233 0.167 0.843 0.213
345 Sarvam 30B (high) Sarvam 12.3 7.9 - $0 - 0.633 0.07 - 0.192 - -
346 Gemini 2.0 Flash Thinking Experimental (Dec '24) Google 12.3 - - $0 - - - - - 0.48 -
347 Claude 3 Haiku Anthropic 12.3 6.7 - $0.5 - 0.374 0.039 0.154 0.186 0.394 0.01
348 DeepSeek-V2.5 DeepSeek 12.3 - - $0 - - - - - - -
349 Olmo 3.1 32B Instruct Allen Institute for AI 12.2 5.6 - $0.3 - 0.539 0.049 - 0.167 - -
350 Gemma 4 E2B (Non-reasoning) Google 12.1 8.3 - $0 - 0.405 0.045 - 0.204 - -
351 Mistral Saba Mistral 12.1 - - $0 0.611 0.424 0.041 - 0.241 0.677 0.13
352 DeepSeek R1 Distill Llama 8B DeepSeek 12.1 - 41.3 $0 0.543 0.302 0.042 0.233 0.119 0.853 0.333
353 Olmo 3 32B Think Allen Institute for AI 12.1 10.5 73.7 $0 0.759 0.61 0.059 0.672 0.286 - -
354 R1 1776 Perplexity 12 - - $0 - - - - - 0.954 -
355 Gemini 1.5 Pro (May '24) Google 12 19.8 - $0 0.657 0.371 0.039 0.244 0.274 0.673 0.08
356 Reka Flash (Sep '24) Reka AI 12 - - $0.35 - - - - - 0.529 -
357 Qwen2.5 Turbo Alibaba 12 - - $0.087 0.633 0.41 0.042 0.163 0.153 0.805 0.12
358 Llama 3.2 Instruct 90B (Vision) Meta 11.9 - - $0.72 0.671 0.432 0.049 0.214 0.24 0.629 0.05
359 Solar Mini Upstage 11.9 - - $0.15 - - - - - 0.331 -
360 Llama 3.1 Instruct 8B Meta 11.8 4.9 4.3 $0.1 0.476 0.259 0.051 0.116 0.132 0.519 0.077
361 Grok-1 xAI 11.7 - - $0 - - - - - - -
362 EXAONE 4.0 32B (Non-reasoning) LG AI Research 11.7 9.4 39.3 $0 0.768 0.628 0.049 0.472 0.252 0.939 0.47
363 Qwen2 Instruct 72B Alibaba 11.7 - - $0 0.622 0.371 0.037 0.159 0.229 0.701 0.147
364 Ministral 3 3B Mistral 11.2 4.8 22 $0.1 0.524 0.358 0.053 0.247 0.144 - -
365 Gemini 1.5 Flash-8B Google 11.1 - - $0 0.569 0.359 0.045 0.217 0.229 0.689 0.033
366 DeepHermes 3 - Mistral 24B Preview (Non-reasoning) Nous Research 10.9 - - $0 0.58 0.382 0.039 0.195 0.228 0.595 0.047
367 Jamba 1.7 Large AI21 Labs 10.9 7.8 2.3 $3.5 0.577 0.39 0.038 0.181 0.188 0.6 0.057
368 Granite 4.0 H Small IBM 10.8 8.5 13.7 $0.107 0.624 0.416 0.037 0.251 0.209 - -
369 Qwen3 Omni 30B A3B Instruct Alibaba 10.7 7.2 52.3 $0.43 0.725 0.62 0.051 0.422 0.186 - -
370 Jamba 1.5 Large AI21 Labs 10.7 - - $3.5 0.572 0.427 0.04 0.143 0.163 0.606 0.047
371 DeepSeek-Coder-V2 DeepSeek 10.6 - - $0 - - - - - 0.743 -
372 OLMo 2 32B Allen Institute for AI 10.6 2.7 3.3 $0 0.511 0.328 0.037 0.068 0.08 - -
373 Hermes 3 - Llama-3.1 70B Nous Research 10.6 - - $0.3 0.571 0.401 0.041 0.188 0.231 0.538 0.023
374 Jamba 1.6 Large AI21 Labs 10.6 - - $3.5 0.565 0.387 0.04 0.172 0.184 0.58 0.047
375 Qwen3 8B (Non-reasoning) Alibaba 10.6 7.1 24.3 $0.31 0.643 0.452 0.028 0.202 0.168 0.828 0.243
376 LFM2 24B A2B Liquid AI 10.5 3.6 - $0.052 - 0.474 0.044 - 0.109 - -
377 Qwen3.5 0.8B (Reasoning) Alibaba 10.5 0 - $0.02 - 0.111 0.012 - 0 - -
378 Gemini 1.5 Flash (May '24) Google 10.5 - - $0 0.574 0.324 0.042 0.196 0.181 0.554 0.093
379 Phi-4 Microsoft Azure 10.4 11.2 18 $0.219 0.714 0.575 0.041 0.231 0.26 0.81 0.143
380 Nova Micro Amazon 10.3 4.1 6 $0.061 0.531 0.358 0.047 0.14 0.094 0.703 0.08
381 Gemma 3 27B Instruct Google 10.3 9.6 20.7 $0 0.669 0.428 0.047 0.137 0.212 0.883 0.253
382 Claude 3 Sonnet Anthropic 10.3 - - $6 0.579 0.4 0.038 0.175 0.229 0.414 0.047
383 Mistral Small (Sep '24) Mistral 10.2 - - $0.3 0.529 0.381 0.043 0.141 0.156 0.563 0.063
384 NVIDIA Nemotron Nano 12B v2 VL (Non-reasoning) NVIDIA 10.1 5.9 26.7 $0.3 0.649 0.439 0.045 0.345 0.176 - -
385 Gemma 3n E4B Instruct Preview (May '25) Google 10.1 - - $0 0.483 0.278 0.049 0.138 0.086 0.749 0.107
386 Gemini 1.0 Ultra Google 10.1 17.6 - $0 - - - - - - -
387 Phi-3 Mini Instruct 3.8B Microsoft Azure 10.1 3 0.3 $0 0.435 0.319 0.044 0.116 0.09 0.457 0.04
388 Phi-4 Multimodal Instruct Microsoft Azure 10 - - $0 0.485 0.315 0.044 0.131 0.11 0.693 0.093
389 Qwen2.5 Coder Instruct 7B Alibaba 10 - - $0 0.473 0.339 0.048 0.126 0.148 0.66 0.053
390 Qwen3.5 0.8B (Non-reasoning) Alibaba 9.9 1 - $0.02 - 0.236 0.049 - 0.029 - -
391 Mistral Large (Feb '24) Mistral 9.9 - - $6 0.515 0.351 0.034 0.178 0.208 0.527 0
392 Mixtral 8x22B Instruct Mistral 9.8 - - $0 0.537 0.332 0.041 0.148 0.188 0.545 0
393 Llama 3.2 Instruct 3B Meta 9.7 - 3.3 $0.085 0.347 0.255 0.052 0.083 0.052 0.489 0.067
394 Llama 2 Chat 7B Meta 9.7 - - $0.1 0.164 0.227 0.058 0.002 0 0.059 0
395 Jamba Reasoning 3B AI21 Labs 9.6 2.5 10.7 $0 0.577 0.333 0.046 0.21 0.059 - -
396 Qwen3 VL 4B Instruct Alibaba 9.6 4.5 37 $0 0.634 0.371 0.037 0.29 0.137 - -
397 Reka Flash 3 Reka AI 9.5 8.9 33.7 $0.35 0.669 0.529 0.051 0.435 0.267 0.893 0.51
398 Qwen1.5 Chat 110B Alibaba 9.5 - - $0 - 0.289 - - - - -
399 Olmo 3 7B Think Allen Institute for AI 9.4 7.6 70.7 $0 0.655 0.516 0.057 0.617 0.212 - -
400 Claude 2.1 Anthropic 9.3 14 - $0 0.495 0.319 0.042 0.195 0.184 0.374 0.033
401 OLMo 2 7B Allen Institute for AI 9.3 1.2 0.7 $0 0.282 0.288 0.055 0.041 0.037 - -
402 Molmo 7B-D Allen Institute for AI 9.2 1.2 0 $0 0.371 0.24 0.051 0.039 0.036 - -
403 Ling-mini-2.0 InclusionAI 9.2 5 49.3 $0 0.671 0.562 0.05 0.429 0.135 - -
404 Claude 2.0 Anthropic 9.1 12.9 - $0 0.486 0.344 - 0.171 0.194 - 0
405 DeepSeek R1 Distill Qwen 1.5B DeepSeek 9.1 - 22 $0 0.269 0.098 0.033 0.07 0.066 0.687 0.177
406 DeepSeek-V2-Chat DeepSeek 9.1 - - $0 - - - - - - -
407 GPT-3.5 Turbo OpenAI 9 10.7 - $0.75 0.462 0.297 - - - 0.441 -
408 Mistral Small (Feb '24) Mistral 9 - - $1.5 0.419 0.302 0.044 0.111 0.134 0.562 0.007
409 Mistral Medium Mistral 9 - - $4.088 0.491 0.349 0.034 0.099 0.118 0.405 0.037
410 Llama 3 Instruct 70B Meta 8.9 6.8 - $0.871 0.574 0.379 0.044 0.198 0.189 0.483 0
411 Gemma 3 12B Instruct Google 8.8 6.3 18.3 $0 0.595 0.349 0.048 0.137 0.174 0.853 0.22
412 LFM 40B Liquid AI 8.8 - - $0 0.425 0.327 0.049 0.096 0.071 0.48 0.023
413 Arctic Instruct Snowflake 8.8 - - $0 - - - - - - -
414 Qwen Chat 72B Alibaba 8.8 - - $0 - - - - - - -
415 Llama 3.2 Instruct 11B (Vision) Meta 8.7 4.3 1.7 $0.16 0.464 0.221 0.052 0.11 0.112 0.516 0.093
416 PaLM 2 Google 8.6 4.6 - $0 - - - - - - -
417 Gemini 1.0 Pro Google 8.5 - - $0 0.431 0.277 0.046 0.116 0.117 0.403 0.007
418 DeepSeek Coder V2 Lite Instruct DeepSeek 8.5 - - $0 0.429 0.319 0.053 0.158 0.139 - -
419 Phi-4 Mini Instruct Microsoft Azure 8.4 3.6 6.7 $0 0.465 0.331 0.042 0.126 0.108 0.696 0.03
420 Llama 2 Chat 70B Meta 8.4 - - $0 0.406 0.327 0.05 0.098 - 0.323 0
421 Llama 2 Chat 13B Meta 8.4 - - $0 0.406 0.321 0.047 0.098 0.118 0.329 0.017
422 DeepSeek LLM 67B Chat (V1) DeepSeek 8.4 - - $0 - - - - - - -
423 Sarvam M (Reasoning) Sarvam 8.4 7.5 - $0 0.696 0.416 0.033 0.295 0.178 0.847 0.203
424 Exaone 4.0 1.2B (Reasoning) LG AI Research 8.3 3.1 50.3 $0 0.588 0.515 0.058 0.516 0.093 - -
425 OpenChat 3.5 (1210) OpenChat 8.3 - - $0 0.31 0.23 0.048 0.115 - 0.307 0
426 DBRX Instruct Databricks 8.3 - - $0 0.397 0.331 0.066 0.093 0.118 0.279 0.03
427 Command-R+ (Apr '24) Cohere 8.3 - - $6 0.432 0.323 0.045 0.122 0.118 0.279 0.007
428 Olmo 3 7B Instruct Allen Institute for AI 8.2 3.4 41.3 $0.125 0.522 0.4 0.058 0.266 0.103 - -
429 LFM2.5-1.2B-Thinking Liquid AI 8.1 1.4 - $0 - 0.339 0.061 - 0.042 - -
430 Exaone 4.0 1.2B (Non-reasoning) LG AI Research 8.1 2.5 24 $0 0.5 0.424 0.058 0.293 0.074 - -
431 Jamba 1.7 Mini AI21 Labs 8.1 3.1 0.3 $0 0.388 0.322 0.045 0.061 0.093 0.258 0.013
432 LFM2 2.6B Liquid AI 8 1.4 8.3 $0 0.298 0.306 0.052 0.081 0.025 - -
433 LFM2.5-1.2B-Instruct Liquid AI 8 0.8 - $0 - 0.326 0.068 - 0.023 - -
434 Granite 4.0 H 1B IBM 8 2.7 6.3 $0 0.277 0.263 0.05 0.115 0.082 - -
435 Jamba 1.5 Mini AI21 Labs 8 - - $0.25 0.371 0.302 0.051 0.062 0.08 0.357 0.01
436 Qwen3 1.7B (Reasoning) Alibaba 8 1.4 38.7 $0.398 0.57 0.356 0.048 0.308 0.043 0.894 0.51
437 Jamba 1.6 Mini AI21 Labs 7.9 - - $0.25 0.367 0.3 0.046 0.071 0.101 0.257 0.033
438 Gemma 3 270M Google 7.7 0 2.3 $0 0.055 0.224 0.042 0.003 0 - -
439 Granite 4.0 Micro IBM 7.7 5 6 $0 0.447 0.336 0.051 0.18 0.119 - -
440 Apertus 70B Instruct Swiss AI Initiative 7.7 1.9 - $1.345 - 0.272 0.055 - 0.057 - -
441 Mixtral 8x7B Instruct Mistral 7.7 - - $0.54 0.387 0.292 0.045 0.066 0.028 0.299 0
442 DeepHermes 3 - Llama-3.1 8B Preview (Non-reasoning) Nous Research 7.6 - - $0 0.365 0.27 0.043 0.085 0.091 0.218 0
443 Llama 65B Meta 7.4 - - $0 - - - - - - -
444 Qwen Chat 14B Alibaba 7.4 - - $0 - - - - - - -
445 Claude Instant Anthropic 7.4 7.8 - $0 0.434 0.33 0.038 0.109 - 0.264 0
446 Mistral 7B Instruct Mistral 7.4 - - $0.25 0.245 0.177 0.043 0.046 0.024 0.121 0
447 Command-R (Mar '24) Cohere 7.4 - - $0.75 0.338 0.284 0.048 0.048 0.062 0.164 0.007
448 Molmo2-8B Allen Institute for AI 7.3 4.4 - $0 - 0.425 0.044 - 0.133 - -
449 Granite 4.0 1B IBM 7.3 2.9 6.3 $0 0.325 0.281 0.051 0.047 0.087 - -
450 LFM2 8B A1B Liquid AI 7 2.3 25.3 $0 0.505 0.344 0.049 0.151 0.068 - -
451 Granite 3.3 8B (Non-reasoning) IBM 7 3.4 6.7 $0.085 0.468 0.338 0.042 0.127 0.101 0.665 0.047
452 Qwen3 1.7B (Non-reasoning) Alibaba 6.8 2.3 7.3 $0.188 0.411 0.283 0.052 0.126 0.069 0.717 0.097
453 Qwen3 0.6B (Reasoning) Alibaba 6.5 0.9 18 $0.398 0.347 0.239 0.057 0.121 0.028 0.75 0.1
454 Llama 3 Instruct 8B Meta 6.4 4 - $0.07 0.405 0.296 0.051 0.096 0.119 0.499 0
455 Gemma 3n E4B Instruct Google 6.4 4.2 14.3 $0.025 0.488 0.296 0.044 0.146 0.081 0.771 0.137
456 Llama 3.2 Instruct 1B Meta 6.3 0.6 0 $0.1 0.2 0.196 0.053 0.019 0.017 0.14 0
457 Gemma 3 4B Instruct Google 6.3 2.9 12.7 $0 0.417 0.291 0.052 0.112 0.073 0.766 0.063
458 LFM2 1.2B Liquid AI 6.3 0.8 3.3 $0 0.257 0.228 0.057 0.02 0.025 - -
459 LFM2.5-VL-1.6B Liquid AI 6.2 1 - $0 - 0.289 0.051 - 0.03 - -
460 Granite 4.0 350M IBM 6.1 0.3 0 $0 0.124 0.261 0.057 0.024 0.009 - -
461 Apertus 8B Instruct Swiss AI Initiative 5.9 1.4 - $0.125 - 0.256 0.05 - 0.041 - -
462 Qwen3 0.6B (Non-reasoning) Alibaba 5.7 1.4 10.3 $0.188 0.231 0.231 0.052 0.073 0.041 0.521 0.017
463 Gemma 3 1B Instruct Google 5.5 0.2 3.3 $0 0.135 0.237 0.052 0.017 0.007 0.484 0
464 Granite 4.0 H 350M IBM 5.4 0.6 1.3 $0 0.127 0.257 0.064 0.019 0.017 - -
465 Gemma 3n E2B Instruct Google 4.8 2.2 10.3 $0 0.378 0.229 0.04 0.095 0.052 0.691 0.09
466 Tiny Aya Global Cohere 4.7 1.2 - $0 - 0.305 0.052 - 0.036 - -
467 GPT-5.4 Pro (xhigh) OpenAI - - - $67.5 - - - - - - -
468 Gemini 3 Deep Think Google - - - $0 - - - - - - -
469 Cogito v2.1 (Reasoning) Deep Cogito - 24.8 72.7 $1.25 0.849 0.768 0.11 0.688 0.41 - -
470 Mi:dm K 2.5 Pro Preview Korea Telecom - 11.9 78.7 $0 0.813 0.722 0.088 0.576 0.297 - -
471 GPT-4o mini Realtime (Dec '24) OpenAI - - - $0 - - - - - - -
472 GPT-4o Realtime (Dec '24) OpenAI - - - $0 - - - - - - -
473 GPT-3.5 Turbo (0613) OpenAI - - - $0 - - - - - - -

* Prices are blended prices per million tokens (3:1 input/output ratio).

About the Artificial Analysis AI Model Rankings

Artificial Analysis is an independent AI benchmarking and analysis company whose benchmarks and analysis support developers, researchers, enterprises, and other AI users. It tests both proprietary and open-weights models, focusing on the end-to-end user experience by measuring real-world response time, output speed, and cost.

Its quality benchmarks cover language understanding and reasoning; its performance benchmarks focus on user-perceptible metrics such as time to first token, output speed, and end-to-end response time. To enable uniform, fair comparison across models, it distinguishes OpenAI tokens from each model's native tokens and computes a blended price at a 3:1 input/output ratio. Benchmark targets include models, endpoints, systems, and providers, spanning language models, speech, image generation, and more, with the aim of helping users accurately understand the real-world performance and cost-effectiveness of different AI services.

Artificial Analysis Metric Definitions

Context Window

The maximum total number of input and output tokens. The limit on output tokens alone is usually much lower (and varies by model).

Output Speed

The number of tokens received per second while the model is generating, measured after the first chunk is received from the API (for models that support streaming).

Latency (Time to First Token)

The time, in seconds, from sending the API request to receiving the first token of the response. For reasoning models that stream their reasoning tokens, this is the first reasoning token. For models that do not support streaming, this is the time until the completed response is received.
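The two definitions above can be sketched as a single measurement over any streaming response. This is a minimal illustration, not Artificial Analysis's actual harness: `stream_metrics` and the token stream it consumes are hypothetical stand-ins for whatever iterable of chunks a streaming API returns.

```python
import time

def stream_metrics(token_stream):
    """Measure TTFT and output speed for an iterable yielding tokens.

    TTFT: seconds from the call until the first token arrives.
    Output speed: tokens/second counted only after the first token,
    matching the definitions above.
    """
    start = time.perf_counter()
    first_at = None
    count = 0
    for _ in token_stream:
        now = time.perf_counter()
        if first_at is None:
            first_at = now  # first token (or reasoning token) received
        count += 1
    end = time.perf_counter()
    if first_at is None:
        return None, 0.0  # empty stream: no TTFT, no speed
    ttft = first_at - start
    # Speed is measured over the (count - 1) tokens after the first one.
    speed = (count - 1) / (end - first_at) if count > 1 and end > first_at else 0.0
    return ttft, speed
```

In practice the stream would come from an API client's streaming iterator; wrapping it in this function yields both metrics from one pass.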

Price

Price per token, expressed in USD per million tokens. The price is a 3:1 blend of the input-token and output-token prices.
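The 3:1 blend works out to a simple weighted average: three parts input price to one part output price. A minimal sketch (the function name and the example prices are illustrative, not taken from the table):

```python
def blended_price(input_price: float, output_price: float) -> float:
    """Blended $/1M-token price at a 3:1 input/output ratio."""
    return (3 * input_price + 1 * output_price) / 4

# Hypothetical model priced at $1.25/1M input and $10/1M output tokens:
print(blended_price(1.25, 10.0))  # 3.4375 -> shown in the table as $3.438
```

This is why a model with expensive output tokens can still show a modest blended price: input tokens carry three times the weight.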

Common LLM Benchmarks

MMLU Pro

Massive Multitask Language Understanding Professional. An enhanced version of MMLU designed to evaluate the reasoning ability of large language models. It addresses limitations of the original MMLU by filtering out easy questions, increasing the number of answer options from 4 to 10, and emphasizing complex multi-step reasoning. It covers roughly 12,000 questions across 14 domains.

GPQA

Graduate-Level Google-Proof Q&A Benchmark. A challenging graduate-level question-answering benchmark that evaluates an AI system's ability to provide accurate information in complex scientific domains such as physics, chemistry, and biology. The questions are designed to be "Google-proof": answering them requires deep understanding and reasoning rather than simple fact recall.

HLE

Humanity's Last Exam. A comprehensive evaluation framework that tests AI systems on human-level reasoning, problem solving, and knowledge integration. It contains 2,500 to 3,000 expert-level questions spanning more than 100 subjects, emphasizing multi-step reasoning and the ability to handle novel scenarios.

LiveCodeBench

A contamination-free benchmark for evaluating LLM coding ability. It continuously collects new problems from contests on platforms such as LeetCode, AtCoder, and Codeforces to prevent training-set contamination. Beyond code generation, it also evaluates self-repair, code execution, and test-output prediction.

SciCode

A benchmark that evaluates language models' ability to generate code for real scientific research problems. It covers 16 subfields across 6 domains, including physics, mathematics, materials science, biology, and chemistry. Problems are drawn from real scientific workflows and typically require knowledge recall, reasoning, and code synthesis.

Math 500

A benchmark for evaluating the mathematical reasoning and problem-solving ability of language models. It contains 500 challenging problems drawn from high-level high-school math competitions such as AMC and AIME, covering algebra, combinatorics, geometry, number theory, and precalculus.

AIME

American Invitational Mathematics Examination. A benchmark based on problems from the AIME, considered one of the most challenging AI tests of advanced mathematical reasoning. It contains 30 "olympiad-level" math problems with integer answers, testing multi-step reasoning, abstraction, and problem solving.