GLM 4.5 Air
GLM 4.5 Air is a streamlined descendant of the GLM lineage, designed for scenarios where every millisecond and every cent counts. By pruning redundant pathways and adopting low-rank adaptation kernels, it delivers much of the expressive power of its larger siblings while running comfortably on a single high-end GPU or a modest CPU cluster. Typical latency is in the tens of milliseconds, which keeps mobile chat responsive and supports high-frequency retrieval tasks. Compression-aware training preserves knowledge despite the reduced footprint, keeping answers factual and coherent. Flexible quantization presets let developers trade accuracy for speed on the fly and optimize cost at scale.
Tools: Function Calling
Context Window: 128,000 tokens
Max Output Tokens: 96,000 tokens
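
The specs above map directly onto a request: the model accepts tool definitions and enforces the output-token cap. Below is a minimal sketch using an OpenAI-compatible client; the endpoint URL, model identifier, API key variable, and the `get_weather` tool are illustrative assumptions, not documented values.

```python
# Minimal sketch: function calling against an OpenAI-compatible endpoint.
# Base URL, model id, and API key variable are assumptions; substitute the
# values your provider documents.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example.com/v1",   # assumed OpenAI-compatible endpoint
    api_key=os.environ["GLM_API_KEY"],       # assumed environment variable
)

# Describe one callable tool; the model may answer with a tool call
# instead of plain text when the request matches the tool's purpose.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",            # hypothetical function
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="glm-4.5-air",   # assumed model identifier
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    max_tokens=1024,       # must stay within the 96,000-token output cap
)

message = response.choices[0].message
if message.tool_calls:
    # The model chose to call the tool; arguments arrive as a JSON string.
    print(message.tool_calls[0].function.name)
    print(message.tool_calls[0].function.arguments)
else:
    print(message.content)
```

In practice the prompt plus any retrieved context must fit within the 128,000-token context window, and `max_tokens` bounds only the generated portion of the response.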