
BitNet b1.58: Microsoft's 1-Bit LLM That Runs a 100B Model on a Single CPU

Introduction

[Figure: BitNet b1.58 performance benchmarks, 1-bit vs. full precision across latency, memory, and accuracy]

In April 2025, Microsoft Research quietly challenged one of AI's longest-held assumptions: that good performance from a large language model requires full-precision floating-point weights. They released BitNet b1.58 2B4T, the first open-source large language model at the 2-billion-parameter scale trained from scratch using only ternary weights, with values in {-1, 0, +1}, and reported results on par with leading full-precision open-weight models of similar size across a range of benchmarks.

The name "1.58-bit" comes from a simple insight from information theory: representing three distinct states requires log₂(3) ≈ 1.58 bits. By restricting every parameter to those three values, BitNet caps the information content of each weight at exactly that figure. The "b1.58" designation is no marketing gimmick: it is the measurable information content of a ternary weight.

The numbers are striking. A 2-billion-parameter model that fits in under 700 MB of disk space. A 100-billion-parameter model that runs at 5–7 tokens per second on a single CPU — approximately human reading speed. An energy efficiency gain of up to 82.2% on x86 CPUs compared to full-precision baselines. This is not an incremental optimization. This is a new point on the Pareto frontier.

In this article, we break down how BitNet works under the hood, where it stands against competitive models like Qwen2.5, Gemma, and SmolLM2, how you can get it running locally today, and what the future holds for 1-bit AI infrastructure.

The Problem: LLMs Are Too Expensive to Be Everywhere

To understand why BitNet matters, you have to start with a hard truth: state-of-the-art open LLMs are impractical for most real-world deployment scenarios.

The numbers tell the story. Running a 7-billion-parameter model at FP16 precision requires around 14 GB of VRAM for the weights alone (7 billion parameters × 2 bytes). Quantize it down to 4-bit and you still need close to 4 GB. Either way, most consumer laptops, edge devices, and microservers are locked out. Even modest inference servers cost hundreds of dollars a month in GPU hours. For a startup building a chatbot, a team deploying an internal knowledge assistant, or a developer running experiments on a laptop, the model quality may be there, but the infrastructure is not.

Existing quantization methods such as INT4, INT8, GPTQ, and AWQ were designed as post-training steps applied to full-precision models. They compress the memory footprint effectively, but the quantized weights still approximate a model that was trained in floating point, and inference still performs multiply-accumulate arithmetic on multi-bit values. They reduce the cost of scale; they do not change the geometry of the problem.

What the industry really needs is a model architecture designed from the ground up for minimal-precision representation — one where the training process itself produces weights that are naturally discrete. That is exactly what BitNet delivers.

The Solution: BitNet b1.58 — Architecture Born for Ternary Weights

BitNet b1.58 is not a quantized version of a full-precision model. It was trained from scratch on a 4-trillion-token corpus, with all linear layers replaced by a custom BitLinear layer that enforces ternary weights throughout the entire training process. This distinction matters enormously: post-training quantization always loses something in translation. Training natively at 1.58-bit precision means the model never has to be converted after the fact, so there is no post-hoc quantization error to absorb.

The Ternary Weight: {-1, 0, +1}

The core quantization uses an absmean scheme that maps floating-point weight values to a signed integer ternary set during each forward pass. The scale factor is computed as the inverse of the mean absolute value across the weight tensor:

scale_w = 1 / mean(|W_ij|)
W_quantized = clamp(round(W × scale_w), -1, 1)

The zero value is not merely convenient: it introduces useful sparsity. Roughly 40-60% of weights in a model quantized this way land at or near zero, and a weight of exactly zero contributes nothing to a dot product, so the matrix multiplication can skip it. The effect is loosely comparable to the conditional computation in Mixture-of-Experts models, except that here the sparsity is a property of the quantization scheme, not a deliberate architectural routing choice.
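The absmean rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the BitLinear implementation: the function name `absmean_ternary` and the `eps` guard against an all-zero tensor are assumptions introduced here, and the weights are a made-up toy example.

```python
def absmean_ternary(weights, eps=1e-8):
    """Quantize a flat list of float weights to {-1, 0, +1} with absmean scaling.

    Returns (ternary_weights, scale), where scale = 1 / mean(|w|).
    eps is an assumed guard against division by zero for an all-zero tensor.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    scale = 1.0 / max(mean_abs, eps)
    # Round to the nearest integer, then clamp into [-1, 1].
    ternary = [max(-1, min(1, round(w * scale))) for w in weights]
    return ternary, scale


# Toy example: small weights collapse to 0, large ones saturate at +/-1.
weights = [0.9, -0.05, 0.3, -1.2, 0.02, 0.6]
ternary, scale = absmean_ternary(weights)
zero_fraction = ternary.count(0) / len(ternary)
print(ternary, round(zero_fraction, 2))
```

Weights whose magnitude is well below the tensor's mean absolute value round to zero, which is where the sparsity comes from.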

BitLinear: The Building Block

Every torch.nn.Linear in the transformer is replaced with a BitLinear layer with three modifications:

Weight quantization to ternary {-1, 0, +1} via absmean (above)
Activation quantization to INT8 via absmax, applied per-token — keeps per-row maximum absolute activation mapped to 127, shifting the entire token's activations into the INT8 range without losing relative information
SubLayerNorm (a simplified variant of LayerNorm) placed before activation quantization for training stability in the quantized regime
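The per-token absmax step in the list above can be sketched as follows. This is an illustrative sketch only: `absmax_int8` and `dequantize` are hypothetical helper names, the `eps` guard is an assumption, and the clamp range of [-128, 127] is the ordinary signed INT8 range, not a detail taken from the BitNet code.

```python
def absmax_int8(token_activations, eps=1e-8):
    """Quantize one token's activation vector to INT8 with absmax scaling.

    Returns (int8_values, scale) where scale = 127 / max(|x|), so the largest
    magnitude in the row maps to +/-127. eps is an assumed zero-division guard.
    """
    max_abs = max(abs(x) for x in token_activations)
    scale = 127.0 / max(max_abs, eps)
    quantized = [max(-128, min(127, round(x * scale))) for x in token_activations]
    return quantized, scale


def dequantize(int8_values, scale):
    """Map INT8 values back to approximate floats."""
    return [q / scale for q in int8_values]


# Each row is one token; every token gets its own scale.
batch = [[0.5, -2.0, 1.5], [0.01, 0.02, -0.04]]
quantized_rows = [absmax_int8(row) for row in batch]
for values, scale in quantized_rows:
    print(values, dequantize(values, scale))
```

Because the scale is computed per row, a token with tiny activations still uses the full INT8 range instead of being crushed by a larger token elsewhere in the batch.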

The training pipeline uses a straight-through estimator (STE) to handle the non-differentiable round() in the quantization function: during the backward pass, the rounding step is treated as the identity (typically implemented with a detach), so gradients flow to the underlying full-precision latent weights as if quantization were differentiable. Combined with squared ReLU activation functions in the feed-forward layers and rotary positional embeddings (RoPE), the architecture converges stably at this extreme precision level.
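To make the straight-through idea concrete, here is a small hand-written sketch for a single linear neuron, with the gradient derived manually instead of through an autograd framework. It is a toy under stated assumptions: `quantize_ternary` and `ste_step` are hypothetical names, the loss is plain squared error, and the output rescaling that a real BitLinear layer applies is omitted for brevity.

```python
def quantize_ternary(latent, eps=1e-8):
    """Absmean ternary quantization of the latent full-precision weights."""
    mean_abs = sum(abs(w) for w in latent) / len(latent)
    scale = 1.0 / max(mean_abs, eps)
    return [max(-1, min(1, round(w * scale))) for w in latent]


def ste_step(latent, x, target, lr=0.03):
    """One SGD step on a single neuron whose forward pass uses ternary weights."""
    # Forward pass: the prediction only ever sees the quantized weights.
    q = quantize_ternary(latent)
    prediction = sum(qi * xi for qi, xi in zip(q, x))
    error = prediction - target
    # Backward pass for squared error: d(loss)/d(q_i) = 2 * error * x_i.
    grad_q = [2.0 * error * xi for xi in x]
    # Straight-through: treat d(q_i)/d(latent_i) as 1 and update the latent weights.
    updated = [w - lr * g for w, g in zip(latent, grad_q)]
    return updated, error ** 2


latent = [0.4, -0.1, 0.7]
x = [1.0, 2.0, -1.0]
losses = []
for _ in range(8):
    latent, loss = ste_step(latent, x, target=3.0)
    losses.append(loss)
final_ternary = quantize_ternary(latent)
print(losses, final_ternary)
```

In this toy run the latent weights drift in small steps while the ternary weights stay fixed for several iterations; the loss only drops once a latent value crosses a rounding threshold and the corresponding ternary weight flips.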

What "1.58 Bits" Actually Means — and Why It Beats "1 Bit"

A common point of confusion: does "1.58-bit" mean 1 bit? Not quite. A single trit encodes log₂(3) ≈ 1.585 bits of information. A network of 1000 such trits stores approximately 1585 bits of information capacity.

By contrast, a true 1-bit binary network can only distinguish between {−1, +1} — no zero, so no sparsity, no efficient skip mechanisms. The BitNet b1.58 authors showed that adding the third state (zero) substantially outperforms binary-only approaches without materially changing the storage cost. The extra ~0.58 bits per weight are a bargain for the accuracy and efficiency gains they enable.
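One way to see how close a practical encoding can get to log₂(3) bits per weight: since 3⁵ = 243 ≤ 256, five ternary weights fit in a single byte, which works out to 1.6 bits per weight. The sketch below shows that base-3 packing. It is purely illustrative; `pack_trits` and `unpack_trits` are hypothetical names, and this is not necessarily the storage layout bitnet.cpp uses.

```python
import math

BITS_PER_TRIT = math.log2(3)  # about 1.585 bits of information per ternary weight


def pack_trits(trits):
    """Pack five ternary weights from {-1, 0, +1} into one byte (base-3 encoding)."""
    assert len(trits) == 5
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)  # map -1, 0, +1 to base-3 digits 0, 1, 2
    return value  # always in 0..242, so it fits in a byte


def unpack_trits(byte):
    """Inverse of pack_trits: recover the five ternary weights."""
    trits = []
    for _ in range(5):
        byte, digit = divmod(byte, 3)
        trits.append(digit - 1)
    return trits


packed = pack_trits([1, 0, -1, -1, 1])
print(packed, unpack_trits(packed), 8 / 5, round(BITS_PER_TRIT, 3))
```

A binary weight would need 1 bit and a naive ternary encoding 2 bits; the packed form lands at 1.6 bits, within about 1% of the 1.585-bit bound.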

Getting Started: Run BitNet b1.58 on Your Machine Today

BitNet b1.58 works on CPU (x86 and ARM), with GPU kernels released in May 2025 and NPU support on the roadmap. Here's how to get started.

Install bitnet.cpp

The official inference framework is microsoft/BitNet, a C++/Python framework built on the llama.cpp foundation:

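# Clone the repo (including its submodules)
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

# Install Python dependencies
pip install -r requirements.txt

# Download the recommended model (2B, I2_S quantization) and build the kernels
huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s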

Run Inference

# Start the server
./build/bin/llama-server -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -c 2048 --threads 4

# Send a completion via the API (default: http://localhost:8080)
curl http://localhost:8080/completion -H "Content-Type: application/json" \
  -d '{"prompt": "Explain quantum computing in one paragraph.", "n_predict": 256}'

On a modern laptop CPU, you should see the 2B model respond at tens of tokens per second: the 29 ms decoding latency reported for it corresponds to roughly 34 tokens per second. On a 100B scaled model, still on a single CPU, you'll see ~5 t/s, which is readable in real time. That last point has profound implications.
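The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above: the helper names `build_completion_request` and `complete` are hypothetical, and reading the generated text from a `content` field is an assumption about the server's JSON response that you should check against your build.

```python
import json
import urllib.request


def build_completion_request(prompt, n_predict=256, base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for the server's /completion endpoint."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, n_predict=256, base_url="http://localhost:8080", timeout=120):
    """Send the request to a running server and return the generated text."""
    request = build_completion_request(prompt, n_predict, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # Assumption: the generated text is returned under the "content" key.
    return body.get("content", "")
```

With the server from the previous step running, `complete("Explain quantum computing in one paragraph.")` should return the generated paragraph as a string.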

Hugging Face Alternative: Fine-Tune Your Own

If you want to fine-tune existing models rather than train from scratch, the Hugging Face team published a pipeline in September 2024 that adapts an existing LLM to 1.58-bit precision without from-scratch retraining. With a recent transformers release installed, loading one of the resulting checkpoints looks like this:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "HF1BitLLM/Llama3-8B-1.58-100B-tokens",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

This is extensible, approachable, and immediately useful for anyone already working in the Hugging Face ecosystem.

Under the Hood: Why BitNet Is Fast — and Why It Matters

Extreme Memory Density

A 2-billion-parameter model at 1.58 bits per parameter weighs in at roughly 400 MB of model weights (2 × 10⁹ × 1.58 / 8 ≈ 395 MB), about a tenth of the 4 GB the same weights would occupy at FP16. This is not just a technical curiosity; it enables deployment scenarios that were simply impossible before:

Deployment target	Storage budget	BitNet b1.58 2B fits?
Rust + WebAssembly browser binary	Limited to a few MB	❌ Too large to bundle; would have to be fetched separately and fit in the WASM heap
Serverless function (Lambda, Cloud Run)	250 MB unzipped .zip package; 10 GB container image (Lambda)	✅ As a container image; too large for a .zip package
Raspberry Pi 4 microSD card	32 GB microSD	✅ Comfortably
Phone app download	App size budgets	✅ No concern
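The roughly 400 MB figure is easy to reproduce. The sketch below is back-of-the-envelope arithmetic only: `ternary_weight_bytes` and `megabytes` are hypothetical helpers, and the calculation counts weight storage alone, ignoring embeddings kept at higher precision, activations, the KV cache, and file-format overhead.

```python
import math


def ternary_weight_bytes(num_params, bits_per_weight=math.log2(3)):
    """Bytes needed to store num_params weights at the given bits per weight."""
    return num_params * bits_per_weight / 8


def megabytes(num_bytes):
    """Convert bytes to decimal megabytes."""
    return num_bytes / 1e6


two_b_ideal = megabytes(ternary_weight_bytes(2e9))       # information-theoretic ternary
two_b_2bit = megabytes(ternary_weight_bytes(2e9, 2.0))   # simple 2-bits-per-weight packing
two_b_fp16 = megabytes(ternary_weight_bytes(2e9, 16.0))  # FP16 baseline
print(round(two_b_ideal), round(two_b_2bit), round(two_b_fp16))
```

Even a simple 2-bit packing stays at 500 MB, an eighth of the FP16 baseline, so the conclusions in the table do not depend on hitting the exact 1.58-bit bound.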

The Energy Arithmetic

The most quietly powerful number in the BitNet paper is in a comparison table of energy consumption for matrix multiplication operations:

Precision	ADD Energy (pJ)	MUL Energy (pJ)
FP16	0.16	0.34
INT8	0.007	0.07

An INT8 addition consumes roughly 4.4% of the energy of an FP16 addition, and an INT8 multiplication about 20.6% of an FP16 multiplication. Because ternary weights let BitNet replace FP16 multiply-add operations with INT8 additions, the savings compound across billions of operations per inference pass. Measured end to end, the reported energy reduction is 71.9% to 82.2% on x86 and 55.4% to 70.0% on ARM.
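The ratios follow directly from the table. The sketch below redoes that arithmetic; the helper names are hypothetical, and the assumption that a ternary weight costs at most one INT8 addition per multiply-accumulate is a simplification made here for illustration. It covers arithmetic energy only, which is why it comes out more optimistic than the end-to-end figures.

```python
# Per-operation energy in picojoules, copied from the table above.
ENERGY_PJ = {
    "FP16": {"add": 0.16, "mul": 0.34},
    "INT8": {"add": 0.007, "mul": 0.07},
}


def int8_to_fp16_ratio(op):
    """Energy of an INT8 operation as a fraction of the same FP16 operation."""
    return ENERGY_PJ["INT8"][op] / ENERGY_PJ["FP16"][op]


def fp16_mac_energy():
    """One FP16 multiply-accumulate: one multiplication plus one addition."""
    return ENERGY_PJ["FP16"]["mul"] + ENERGY_PJ["FP16"]["add"]


def ternary_accumulate_energy():
    # Assumption: with a weight in {-1, 0, +1}, each multiply-accumulate
    # becomes at most one INT8 addition (or subtraction), with no multiply.
    return ENERGY_PJ["INT8"]["add"]


add_ratio = int8_to_fp16_ratio("add")
mul_ratio = int8_to_fp16_ratio("mul")
mac_ratio = ternary_accumulate_energy() / fp16_mac_energy()
print(f"{add_ratio:.1%} {mul_ratio:.1%} {mac_ratio:.1%}")
```

Memory traffic, activation quantization, and the remaining higher-precision operations are not modeled here, so treat the result as an upper bound on the arithmetic saving, not a prediction of whole-system energy use.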

This is not just a faster laptop experience. This is a fundamental shift in the cost-per-token physics of AI inference, with direct implications for the environmental footprint of running LLMs at scale.

How BitNet.cpp Achieves Speed

The BitNet.cpp inference engine is built on lookup-table (LUT) optimized kernels inspired by T-MAC, a Microsoft Research project that uses table lookups to run low-bit LLM inference efficiently on CPUs:

Parallel kernel implementations released in January 2026 added configurable tiling across hardware platforms, delivering a 1.15x–2.1x additional speedup
Live demo: an Azure-hosted demo (demo-bitnet-h0h8hcfqeqhrf5gf.canadacentral-01.azurewebsites.net) lets anyone try BitNet b1.58 without setup, and the repository also shows a BitNet b1.58 3B model running on an Apple M2
Memory-mapped weight storage: since weights are just lookups from a small set of {-1, 0, +1} values, the kernel avoids the memory wall that plagues traditional weight-storage designs
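The lookup-table idea can be illustrated with a toy matrix-vector product. The sketch below is a simplified picture of the technique, not the bitnet.cpp kernel: the group size of 2, the base-3 index layout, and the names `encode_row`, `build_tables`, and `lut_matvec` are all assumptions chosen for readability.

```python
from itertools import product

GROUP = 2  # assumed group size: two ternary weights share one table index


def encode_row(ternary_row):
    """Pack each pair of ternary weights into one base-3 index in 0..8."""
    assert len(ternary_row) % GROUP == 0
    return [
        (ternary_row[i] + 1) * 3 + (ternary_row[i + 1] + 1)
        for i in range(0, len(ternary_row), GROUP)
    ]


def build_tables(activations):
    """For each activation pair, precompute the partial sum for all 9 weight pairs."""
    tables = []
    for i in range(0, len(activations), GROUP):
        a0, a1 = activations[i], activations[i + 1]
        # product() enumerates (w0, w1) in the same order as the base-3 index.
        tables.append([w0 * a0 + w1 * a1 for w0, w1 in product((-1, 0, 1), repeat=2)])
    return tables


def lut_matvec(encoded_rows, activations):
    """Matrix-vector product where each weight pair costs one table lookup."""
    tables = build_tables(activations)  # built once, reused by every output row
    return [
        sum(table[idx] for table, idx in zip(tables, row))
        for row in encoded_rows
    ]


weights = [[1, 0, -1, 1], [0, -1, 1, 1]]
activations = [3, -2, 5, 7]  # stand-ins for INT8 activations
result = lut_matvec([encode_row(row) for row in weights], activations)
print(result)
```

The tables depend only on the activations, so their cost is amortized over every row of the weight matrix; the inner loop then consists of lookups and additions with no multiplications.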

The 2025 GPU inference kernel release expanded the story dramatically. While CPU inference remains the headline performance story, the GPU kernel enables BitNet b1.58 models to be deployed in contexts that require throughput rather than just raw latency.

Where BitNet b1.58 Stands Against Competitive Models

How does a 400 MB, ternary-quantized 2B model actually perform? The April 2025 technical report benchmarks BitNet b1.58 2B4T against the most competitive open-weight models in the 1–3B parameter range:

Benchmark	BitNet b1.58 2B4T	Qwen2.5-1.5B	SmolLM2-1.7B	Phi-3 Mini
MMLU	Competitive	~55.2	Lower	~60.1
ARC-Challenge	⭐ Top performer	Competitive	Lower	~75
GSM8K	⭐ Top performer	~55	Lower	Competitive
HellaSwag	Competitive	~75+	~75+	~80+
CommonsenseQA	⭐ Top performer	Competitive	Lower	Competitive

BitNet b1.58 2B4T leads on ARC-Challenge, GSM8K, and CommonsenseQA, benchmarks that stress reasoning and commonsense knowledge, and remains competitive on MMLU and HellaSwag.

Most strikingly, it does this while using a ~4.5× smaller memory footprint than the closest competitor. The inference latency comparison is equally striking: in the community benchmarks, it recorded 29 ms latency against a 50–200 ms range for competitors, most of which were in the 1B–2B parameter range. BitNet b1.58 was faster than every one of them.

Community discussions on Reddit's r/LocalLLaMA also confirmed these findings. One benchmarker specifically ran 1-bit models on ARM and x86 and found BitNet b1.58 to be the fastest 1-bit model across platforms — though they noted the field is still maturing.

Advanced: Production Deployment Patterns

Serverless on AWS Lambda

AWS employee Manu Mishra published a full tutorial running BitNet b1.58 on AWS Lambda as a container function. Key takeaways from that pattern:

# Lambda-specific environment: prevent threading conflicts
import os
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['OMP_THREAD_LIMIT'] = '1'
os.environ['GGML_OPENMP'] = 'OFF'
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'

The container image weighs in at approximately 1.1 GB including all dependencies. That is far too large for a .zip deployment package (limited to 250 MB unzipped) but well within Lambda's 10 GB container image limit. The image uses a two-stage Docker build to exclude all build artifacts from the final image. This sets a concrete pattern for serverless BitNet in any FaaS environment.

On-Device AI and Edge Inference

Because the CPU-only inference path requires no specialized hardware, BitNet b1.58 is a natural candidate for:

Smartphone on-device AI: The 400 MB model footprint is small enough to pre-bundle in apps
Raspberry Pi / SBC inference: Fully runs on ARM single-board computers without GPU requirements
Browser / WebAssembly: An interesting long-term possibility; BitNet.cpp is C++ and could be compiled to WASM
IoT controller AI: The ~0.028 J per-token energy budget and minuscule storage requirements make this the first credible LLM-class option for battery-powered edge devices

Comparison & Alternatives: Where BitNet Sits in the Quantization Ecosystem

Approach	Precision	Training Required	On-device	Key Trade-off
Full-precision (FP16)	16-bit	From scratch	Limited	Maximum accuracy, max cost
INT8 Post-training	8-bit	Post-quantize only	Yes	Small accuracy loss, moderate savings
GPTQ / AWQ	4-bit	Post-quantize only	Yes	Best post-train option; still limited
BitNet b1.58	1.58-bit	From scratch	✅ Yes	Native design; best efficiency frontier
Binary (-1,+1 only)	~1-bit	From scratch	✅ Yes	Worse accuracy; no sparsity benefits

The key distinction is native vs. post-training: BitNet b1.58 was designed and trained for ternary weights. Every post-training quantization technique imposes a loss of precision on a model that was never designed for it. BitNet b1.58 internalized the constraint from the first optimizer step.

Alternatives in the space:

Neural Magic's sparse models — aggressive weight pruning + quantization; extremely competitive performance, but the ecosystem is less mature and licensing can be restrictive
TinyLlama / SMS-1B — 1.1B models stripped of layers; good for hobby projects but not achieving comparable accuracy
DistilBERT-style distillation — knowledge distillation can push 1.58-bit gains further; research in this direction is ongoing

Conclusion & Next Steps

BitNet b1.58 is not just a neat paper result. It challenges the economic assumptions of what running an LLM costs — on hardware, on energy, and on time. A 2B, 400 MB model that you can run on a laptop without a GPU, that outperforms competitors several times its size on reasoning tasks, and that achieved this through a clean architectural modification (replacing Linear with BitLinear and training from scratch) is a genuinely landmark result.

Microsoft Research is not stopping at 2B. They have stated clear research directions ahead: larger 1-bit LLM variants, multilingual capabilities, multi-modal extensions, longer context windows, and — perhaps most exciting — dedicated hardware logic for ternary/trit computation, which could unlock a second-order efficiency leap beyond what current x86 and ARM processors can deliver.

What you can do today:

⭐ Star the repo: github.com/microsoft/BitNet
🚀 Run BitNet locally: Clone and build bitnet.cpp — it takes under 10 minutes on a modern machine
🐳 Try the live demo: demo-bitnet-h0h8hcfqeqhrf5gf.canadacentral-01.azurewebsites.net
📚 Read the paper: The Era of 1-bit LLMs and the BitNet b1.58 2B4T Technical Report
🔬 Fine-tune on Hugging Face: Grab the 1.58-bit Llama 3 8B model from HF1BitLLM and iterate

The era of 1-bit LLMs is not a fringe curiosity. It's here, it's open-source, and it may well be how most AI gets run in three to five years. The question is no longer whether 1-bit LLMs work — it's whether you'll be ready to build with them.
