The release of Gemini 3.1 Pro presents an extreme sense of dissonance. Various benchmarks show it has the largest knowledge base and the highest “intelligence” currently available, yet in real command-line environments and long-horizon Agent tasks, it is severely lacking in the competence to execute basic tool calls.
Below is a detailed technical performance summary of the model.
I. Pure Text Capabilities and Multimodal Performance
In benchmarking and static knowledge output, Gemini 3.1 Pro shows an overwhelming advantage:
- Benchmarks and cost: On the AI Index test, it scores 4 points higher than the previous ceiling, Opus 4.6 Max. The cost to achieve this score is extremely low—only $892, less than half of Opus 4.6 (nearly $2,500). Its ARC AGI 2 score reaches 78%.
- Hallucination control and accuracy: Artificial Analysis’ Omniscience benchmark (which rewards admitting “I don’t know” and penalizes wrong answers) shows that because the questions are too difficult, top models like Sonnet 4.6 and GPT 5.2 high score in the negative. The previous-gen Gemini 3 Flash had a very high hallucination rate, while 3.1 Pro’s hallucination rate is nearly halved compared with 3 Pro, and it leads by a wide margin in accuracy thanks to its massive knowledge base.
- Spatial reasoning (Skate Bench): In a composite test covering niche skateboard knowledge and 3D/2D spatial physics, it consistently achieves a perfect 100% score (the previous best was GPT-5 at 98, which has now regressed to 87).
- Multimodal generation: It is the first model that can directly generate usable SVG images (e.g., “a pelican riding a bicycle,” with 323.9 seconds of thinking) and produce complex SVG animations.
- Design and humor: It can generate well-structured frontend UI under zero-shot prompts (e.g., a homepage for a video review tool). In the Quiplash AI interactive test, the aggressive jokes it generates are funnier than Grok’s.
- Vertical framework adaptation (Convex): When handling Convex code without a reference guide, its accuracy is 89% (below Claude 4.6 Sonnet’s 90%); after providing the Convex AI rules guide, accuracy jumps to nearly 95%, with perfect performance across data modeling, queries, mutations, and other dimensions.
II. Engineering Deployment and Tool-Calling Defects
Once it moves beyond pure text Q&A into development workflows that require execution, the model exhibits many fundamental flaws:
- Tool calling is severely out of control: Claude 4.5 Haiku, with an intelligence score of only 37, can perfectly follow the tool-calling format every time, whereas Gemini 3.1 Pro often randomly switches between “over-calling, not calling at all, and format errors” when faced with tools.
- Low-level runtime logic and infinite loops: It very easily falls into infinite crash loops of two or three words, forcing the official CLI to hard-code an interception mechanism that flags “potential loop detected.”
- The official CLI is extremely unstable: The official CLI contains many bugs and often ignores the specified model during execution, forcibly switching back in the background to older models such as Flash 2.5 or 3 Flash preview.
- Rigid and destructive file operations: When reading files, it appears to be hard-coded to read only 100 lines per pass (1–100, then 101–200, etc.). After being granted file-write permissions, it has exhibited destructive behavior such as directly wiping/deleting the entire codebase assets (nuking assets).
- Execution logic deviation: When performing simple tasks like finding a logo, it may completely deviate from instructions and output long, redundant analyses about ChatGPT; it also hallucinates non-existent dependency packages and even attempts to hand-write a code editor in Python.
- Rising real-world costs: Because tool calling fails frequently, it often requires consuming more than 3× the normal number of tokens for retries and corrections, offsetting the advantage of its low unit price.
III. Lack of Long-Horizon Agent Capability and Overfitting
The root cause of the execution issues above points to an over-optimization of its training strategy for benchmarks (“benchmaxing”):
- Missing Agent reinforcement learning (RL): Meter eval data shows that Opus 4.6 and GPT 5.2, trained via RL on real user chat logs, can already complete ultra-long-horizon tasks that take humans 16 hours with a 50% success rate independently. Gemini clearly lacks similar training: even in an environment that provides a “Plan” tool, it won’t call it, and once it starts executing autonomously it quickly gets lost.
- Extreme behavior driven by test scoring: In SnitchBench (the “snitch” test) that probes a model’s moral boundaries, if you add the prompt “act boldly for the benefit of humanity,” it will 100% report medical malpractice information to the government and leak it to the media, becoming the most extreme snitch with the highest score in that test. This indicates severe overfitting to achieve perfect scores across benchmarks—winning tests that are detached from practical application value, at the expense of usability.
Summary:
Gemini 3.1 Pro has the largest knowledge base in the world, but due to poor tool execution capability, it is very difficult to control in current command-line and development workflows. If you need to handle code writing and long-horizon Agent tasks, Codex 5.3 or Opus 4.6 are still more reliable choices.