GLM-5.2: Open-Weights Model Challenges Claude Opus on Coding

On June 13, 2026, Zhipu AI released GLM-5.2: a 744-billion-parameter open-weights model with a 1-million-token context window and an MIT license. It surpasses GPT-5.5 on most of the coding benchmarks reported so far. On FrontierSWE, which tests long-horizon tasks that run for hours, it trails Claude Opus 4.8 by less than one point (74.4 vs 75.1). The developer community has started calling it an "Opus alternative."

GLM-5.2 Specs: 744B Parameters, 1M Context, MIT License

GLM-5.2 is a Mixture-of-Experts model with 744 billion total parameters and 40 billion active per token. It routes each token to 8 of 256 experts, so you pay the compute cost of a medium model while accessing the knowledge of a very large one. The context window is 1,048,576 tokens. Maximum output is 131,072 tokens.
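To make the routing idea concrete, here is a minimal sketch of top-k expert selection for a single token. The expert count (256) and the number selected (8) come from the figures above. The gating function is an assumption: the article does not say how GLM-5.2 weights its chosen experts, so the sketch uses a softmax renormalized over the selected experts, a common choice in MoE models. The name `route_token` is hypothetical.

```python
import math
import random

NUM_EXPERTS = 256  # total experts per MoE layer (from the article)
TOP_K = 8          # experts selected per token (from the article)


def route_token(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_ids, weights), where weights are a softmax over the
    selected experts only. This gating rule is an assumption, not a
    documented detail of GLM-5.2.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
experts, weights = route_token(logits)
print("experts used:", len(experts), "of", NUM_EXPERTS)
print(f"active parameter share: {40 / 744:.1%}")  # 40B active of 744B total
```

Only the selected experts run for a given token, which is why the per-token compute tracks the 40 billion active parameters rather than the full 744 billion.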

It ships under an MIT license. You can download the weights from Hugging Face or ModelScope, fine-tune them, deploy them commercially, and modify them however you want. No regional restrictions.

API pricing is $1.40 per million input tokens and $4.40 per million output tokens. Cached input costs $0.26 per million tokens. The GLM Coding Plan starts at $10/month for Lite access.

GLM-5.2 Coding Benchmarks: SWE-bench, Terminal-Bench, and FrontierSWE

GLM-5.2 improves on GLM-5.1 on every coding benchmark Zhipu reported, though the size of the gain varies. On SWE-bench Pro the score rose modestly, from 58.4 to 62.1. On Terminal-Bench 2.1 it went from 63.5 to 81.0. On DeepSWE, a benchmark where GLM-5.1 managed only 18.0, GLM-5.2 reaches 46.2.

The real story is in the extended-duration benchmarks. FrontierSWE evaluates whether an agent can complete open-ended technical projects across hours of work. GLM-5.2 jumped from 30.5 to 74.4 on this benchmark, putting it within 1 point of Claude Opus 4.8 (75.1) and ahead of GPT-5.5 (72.6).

PostTrainBench measures how much an agent can improve a small model when given an H100 and a training pipeline. GLM-5.2 scored 34.3, up from 20.1. On SWE-Marathon, which tests tasks like building compilers and optimizing compute kernels, GLM-5.2 moved from 1.0 to 13.0.

GLM-5.2 vs Claude Opus 4.8 and GPT-5.5

Claude Opus 4.8 still leads on most benchmarks, but the gap has narrowed considerably. On Terminal-Bench 2.1, GLM-5.2 scores 81.0 to Opus's 85.0. On MCP-Atlas, a tool-use evaluation, it scores 76.8 to Opus's 77.8. On FrontierSWE, the gap is 0.7 points (74.4 vs 75.1).

GLM-5.2 beats GPT-5.5 on SWE-bench Pro (62.1 vs 58.6), FrontierSWE (74.4 vs 72.6), PostTrainBench (34.3 vs 28.4), and AIME 2026 (99.2 vs 98.3). GPT-5.5 pulls well ahead on DeepSWE (70.0 vs 46.2), where its score is roughly 1.5 times GLM-5.2's.

Against other open models, GLM-5.2 is the clear leader. It outscores Qwen3.7-Max, MiniMax M3, and DeepSeek-V4-Pro across the board. Artificial Analysis ranked it the strongest open-weights model on their Intelligence Index at 51 points.

IndexShare: How GLM-5.2 Makes 1M Context Practical

Standard attention scales quadratically with context length. At a million tokens, that becomes impractical. Zhipu AI builds on DeepSeek Sparse Attention (DSA), which attends only to the most relevant tokens rather than computing across the full sequence. The problem was that even DSA's indexer was expensive at scale.

IndexShare solves this by reusing the same indexer across every four sparse-attention layers. The indexer computes at the first layer, and the top-k indices carry through the next three. This cuts per-token FLOPs by 2.9x at the 1M context length. Zhipu reports throughput improvements of 3% to 192% over GLM-5.1, with the largest gains at longer contexts.
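The sharing scheme described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Zhipu's implementation: `toy_indexer` is a hypothetical stand-in that returns one relevance score per context token, and the layer count, context length, and k are arbitrary small numbers. The sketch only counts indexer calls. It does not reproduce the reported 2.9x FLOPs figure, which depends on details the article does not give.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores: the token positions to attend to."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def run_layers(num_layers, share_every, indexer, k):
    """Walk the sparse-attention layers, running the indexer only at the
    first layer of each group of `share_every` layers and reusing its
    top-k indices for the rest of the group.

    Returns (indices used at each layer, number of indexer calls).
    """
    per_layer, calls, cached = [], 0, None
    for layer in range(num_layers):
        if layer % share_every == 0:
            cached = top_k_indices(indexer(layer), k)
            calls += 1
        per_layer.append(cached)
    return per_layer, calls


def toy_indexer(layer, context_len=16):
    """Hypothetical indexer: deterministic pseudo-scores per context token."""
    return [(i * 7 + layer * 3) % 11 for i in range(context_len)]


shared, shared_calls = run_layers(8, 4, toy_indexer, k=4)      # IndexShare-style
baseline, baseline_calls = run_layers(8, 1, toy_indexer, k=4)  # indexer every layer
print("indexer calls with sharing:", shared_calls)
print("indexer calls without sharing:", baseline_calls)
```

With groups of four, the indexer runs once per four layers instead of once per layer. The trade-off is that layers two through four of each group attend to positions chosen from the first layer's scores rather than their own.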

They also improved the multi-token prediction (MTP) layer used for speculative decoding, increasing acceptance length by up to 20%. The MTP parameters are shared across prediction steps, and the system uses rejection sampling with end-to-end TV loss during training.

Agentic Training: RL and Offline Policy Distillation

GLM-5.2 uses two training approaches for agentic capability. Reinforcement Learning covers tasks with clear reward signals: math, coding, structured reasoning. Offline Policy Distillation (OPD) handles the messier work of merging expertise from specialized models back into a general model.

Zhipu's slime framework manages the infrastructure. It supports white-box rollout, black-box rollout, compact trajectory, and sub-agent workflow modes. During post-training, the framework ran parallel OPD to merge more than ten expert models into GLM-5.2. The entire OPD process took about two days.

The model supports two reasoning effort levels: High and Max. Use High for tasks where you want faster responses. Switch to Max when the problem needs deeper thinking. This is similar to the effort-level controls other frontier models now offer.

Deployment Options

You can download the model weights in BF16 or FP8 precision. The full BF16 model exceeds 1.5 TB in Safetensors format, which means multi-GPU infrastructure for self-hosting. FP8 reduces the footprint but still needs substantial hardware.
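A back-of-the-envelope estimate shows why self-hosting needs many GPUs. The sketch below counts weight storage only, from the 744 billion parameter figure, at 2 bytes per parameter for BF16 and 1 byte for FP8. The 80 GB per-GPU memory is an assumption chosen for illustration, and the result ignores the KV cache, activations, and any extra tensors in the published checkpoint, so real requirements will be higher.

```python
import math

TOTAL_PARAMS = 744e9  # total parameter count from the article


def weight_bytes(params, bytes_per_param):
    """Raw storage for the weights alone."""
    return params * bytes_per_param


def min_gpus_for_weights(total_bytes, gpu_mem_gb=80):
    """Lower bound on GPU count to hold the weights.

    gpu_mem_gb=80 is an illustrative assumption. The KV cache and
    activations are not included.
    """
    return math.ceil(total_bytes / (gpu_mem_gb * 1e9))


for name, width in (("BF16", 2), ("FP8", 1)):
    size = weight_bytes(TOTAL_PARAMS, width)
    print(f"{name}: {size / 1e12:.2f} TB of weights, "
          f"at least {min_gpus_for_weights(size)} GPUs with 80 GB each")
```

Treat the GPU counts as a floor: long-context serving adds substantial KV-cache memory on top of the weights.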

For inference frameworks, GLM-5.2 supports vLLM, SGLang, transformers, xLLM, and ktransformers. It integrates with coding agents including Claude Code, OpenCode, and ZCode. Zhipu completed Day 0 inference adaptation for Huawei Ascend, T-Head, Moore Threads, Cambricon, Kunlun Core, MetaX, Hygon, and Biren chip platforms.

Pricing Compared

For developers using the API, the cost difference is significant. A typical agentic coding task with 100K input tokens and 10K output tokens costs about $0.18 on GLM-5.2. The same task on Claude Opus 4.8 costs about $0.75. That is roughly 4x cheaper for comparable work.

Cached input is cheaper too, though by a smaller margin. At $0.26 per million tokens versus $0.625 per million for Opus 4.8's prompt cache, teams that repeatedly query against the same codebase context pay about 58% less for cached tokens.
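The arithmetic behind these comparisons is easy to reproduce. The sketch below uses only prices quoted in this article. The Opus 4.8 per-task cost is taken as the article's $0.75 figure rather than derived, because Opus's per-token input and output prices are not listed here. The function name `task_cost` is illustrative.

```python
# Prices in USD per million tokens, as quoted in the article.
GLM_INPUT = 1.40
GLM_OUTPUT = 4.40
GLM_CACHED = 0.26
OPUS_CACHED = 0.625
OPUS_TASK_COST = 0.75  # article's figure for the same 100K-in / 10K-out task


def task_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost in USD of one request, given per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


glm = task_cost(100_000, 10_000, GLM_INPUT, GLM_OUTPUT)
print(f"GLM-5.2 task cost: ${glm:.3f}")
print(f"Opus 4.8 / GLM-5.2 cost ratio: {OPUS_TASK_COST / glm:.1f}x")
print(f"cached-input saving vs Opus 4.8: {1 - GLM_CACHED / OPUS_CACHED:.0%}")
```

Swapping in your own token counts shows how the ratio shifts for output-heavy tasks, where the $4.40 output price dominates.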

What You Trade

Claude Opus 4.8 still leads on the hardest tasks. On SWE-Marathon, which includes building compilers and kernel optimization, Opus scores 26.0 to GLM-5.2's 13.0. On Humanity's Last Exam, Opus leads by about 9 points (49.8 vs 40.5).

GLM-5.2 is text-only. It does not accept images. GPT-5.5 handles multimodal inputs. If your workflow involves screenshots, diagrams, or visual UI testing, GLM-5.2 is not the right tool today.

One more thing to know: the benchmark numbers have not been independently verified. The figures cited here come from Zhipu's Hugging Face model card and from third-party reporting. Community verification is expected within two to three weeks of release.

Key Takeaways

Strongest open-weights coding model across SWE-bench Pro, Terminal-Bench 2.1, FrontierSWE, and MCP-Atlas. It outperforms GPT-5.5 on most coding benchmarks.

1M-token context with real throughput. IndexShare reduces per-token FLOPs by 2.9x at full context, making the million-token window practical for production workloads.

4x cheaper than Opus 4.8 per agentic task. At $1.40 input and $4.40 output per million tokens, teams that run many coding agents save significantly.

Try GLM-5.2 in Tarsk

Connect your Z.ai or OpenRouter API key in Settings and select GLM-5.2 as your model. The 1M context window means you can feed entire repositories into a single prompt.
