Getting to know BF16, GGUF, and other LLM training jargon; Zhipu open-sources GLM 4.7
BF16 is better than FP8 in precision
That's generally true. **BF16 (Bfloat16)** keeps the same 8-bit exponent as FP32, giving it a much wider dynamic range than **FP8** (which has fewer exponent bits: 4 in E4M3, 5 in E5M2). BF16 also carries 7 mantissa bits against FP8's 2 or 3, so it suffers less from overflow/underflow and retains more precision for gradients and activations during training.
However, the trade-off is that **FP8** halves the memory footprint and can deliver up to roughly 2× the compute throughput on supported hardware (like NVIDIA H100 with Transformer Engine, or recent AMD/Intel accelerators). For inference and some large-scale training scenarios where pure throughput is the bottleneck, FP8 can be worth the precision cost, especially when combined with quantization-aware training or mixed-precision strategies.
So the practical answer depends on your constraint: **precision and training stability** → BF16; **throughput and memory efficiency** → FP8.
GLM 4.5
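To put numbers on the range-versus-precision point, here is a minimal sketch that derives the largest finite value and the spacing just above 1.0 from the bit layout of each format. The `FloatFormat` class and its field names are made up for this illustration, and the E4M3 handling assumes the common variant that has no infinities and reserves only one NaN pattern per sign.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatFormat:
    """Hypothetical helper describing a binary floating-point layout."""
    name: str
    exp_bits: int
    man_bits: int
    # True when the all-ones exponent is reserved for inf/NaN (IEEE style).
    # Assumption: FP8 E4M3 reuses that exponent for finite values and only
    # treats the all-ones mantissa there as NaN.
    reserves_top_exponent: bool = True

    @property
    def bias(self) -> int:
        return 2 ** (self.exp_bits - 1) - 1

    @property
    def max_finite(self) -> float:
        top = 2 ** self.exp_bits - 1
        if self.reserves_top_exponent:
            max_exp = (top - 1) - self.bias
            max_frac = 2.0 - 2.0 ** -self.man_bits
        else:
            max_exp = top - self.bias
            # Largest mantissa is all ones except the last bit.
            max_frac = 2.0 - 2.0 ** -(self.man_bits - 1)
        return max_frac * 2.0 ** max_exp

    @property
    def epsilon(self) -> float:
        """Gap between 1.0 and the next representable value."""
        return 2.0 ** -self.man_bits


FORMATS = [
    FloatFormat("BF16", exp_bits=8, man_bits=7),
    FloatFormat("FP8 E4M3", exp_bits=4, man_bits=3, reserves_top_exponent=False),
    FloatFormat("FP8 E5M2", exp_bits=5, man_bits=2),
]

for fmt in FORMATS:
    print(f"{fmt.name:9s} max={fmt.max_finite:.4g} epsilon={fmt.epsilon}")
```

BF16 tops out around 3.4e38 with a step of 2^-7 near 1.0, while the FP8 variants reach only 448 (E4M3) or 57344 (E5M2) with much coarser steps. That is why FP8 training usually relies on per-tensor scaling.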
Original poster
Posted on 17-6-2026 08:55 AM
From mobile
How much VRAM does GLM-4.7 355B-A32B require?
I need to search for the VRAM requirements for the GLM-4.7 355B-A32B model.
Based on the search results, here's what I found about **GLM-4.7 355B-A32B** VRAM requirements:
## GLM-4.7 355B-A32B VRAM Requirements
| Variant | Precision | VRAM Required | Notes |
|---------|-----------|---------------|-------|
| **GLM-4.7** | BF16 | **~710 GB** | Full precision, highest quality |
| **GLM-4.7-FP8** | FP8 | **~328 GB** | Near-lossless, ~2x memory savings |
| **GLM-4.7 (Q8 GGUF)** | INT8 | At least ~355 GB combined (355B × ~1 byte, weights only) | CPU offload with GPU assistance |
| **GLM-4.7 (Q4 GGUF)** | INT4 | ~40 GB GPU + RAM, ~205 GB combined | MoE offloading to RAM |
### Key Data Points:
- **BF16 full model**: The 355B parameter model at BF16 requires approximately **710 GB of VRAM** (355B × 2 bytes = 710 GB, plus overhead for KV cache and activations). This is a multi-node deployment scenario.
- **FP8 variant**: The official **GLM-4.7-FP8** model runs on **~82 GB per node** across 4 nodes (DGX Spark cluster), totaling ~328 GB. This is the practical production setup for the full model.
- **System RAM requirement**: The official GitHub notes that **server memory must exceed 1TB** to ensure normal model loading and operation.
- **GGUF quantized versions**: For consumer/local deployment, the 4-bit GGUF can fit on a **single 40GB GPU** with MoE layers offloaded to RAM, but requires ~205GB combined RAM+VRAM for good performance.
### Practical Deployment Options:
| Setup | Hardware | Context |
|-------|----------|---------|
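| **Production (FP8)** | 4× DGX Spark, or 5 or more H100 80GB (4× 80 GB = 320 GB falls short of ~328 GB) | Full 128K+ context |
| **Research (BF16)** | 10× A100/H100 80GB or more (8× 80 GB = 640 GB falls short of ~710 GB) | Maximum precision |
| **Local/Consumer** | 1× RTX 4090 24GB + ~180 GB or more RAM (to reach ~205 GB combined) | Q4 GGUF, limited context |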
**Bottom line**: If you want to run the **full GLM-4.7 355B-A32B at BF16**, you need roughly **710+ GB of VRAM**, which means at least nine 80 GB H100/A100 GPUs for the weights alone and in practice ten or more. The **FP8 variant** is the practical choice, requiring **~328 GB total** (~82 GB per node × 4 nodes).
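The arithmetic behind these figures is easy to check. The sketch below estimates weights-only memory from the 355B parameter count and a bits-per-weight figure, then counts how many 80 GB GPUs that would take. The function names are made up for this example. The 8.5 and 4.5 bit figures for Q8_0 and Q4_0 are my assumption based on the llama.cpp block layout (32 weights plus a 16-bit scale). KV cache and activations are ignored, so treat the results as lower bounds.

```python
import math

TOTAL_PARAMS = 355e9  # total parameter count taken from the model name

# Approximate storage cost per weight, in bits (assumptions, see lead-in).
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "FP8": 8.0,
    "Q8_0": 8.5,
    "Q4_0": 4.5,
}


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def min_gpus(total_gb: float, per_gpu_gb: float = 80.0) -> int:
    """Smallest GPU count whose combined memory covers total_gb."""
    return math.ceil(total_gb / per_gpu_gb)


for name, bits in BITS_PER_WEIGHT.items():
    gb = weight_gb(TOTAL_PARAMS, bits)
    print(f"{name:5s} ~{gb:6.1f} GB weights, at least {min_gpus(gb)} x 80 GB GPUs")
```

BF16 comes out at 710 GB (nine 80 GB GPUs minimum) and FP8 at 355 GB (five minimum). Note that 355 GB in decimal units is about 331 GiB, which may be where the ~328 GB figure for the FP8 variant comes from, though that is only a guess on my part.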
Original poster
Posted on 17-6-2026 08:58 AM
From mobile
Kimi K2.6 (AI can make mistakes)
What is GGUF?
**GGUF** (commonly expanded as "GPT-Generated Unified Format") is a binary file format for storing and distributing large language models, usually quantized, primarily used with **llama.cpp** and related inference engines.
## Key Points
| Aspect | Details |
|--------|---------|
| **Origin** | Introduced by the llama.cpp/GGML project led by Georgi Gerganov as a successor to the older GGML format |
| **Purpose** | Store model weights, usually quantized (Q4_K_M, Q5_K_M, Q6_K, Q8_0, etc.) but also F16/BF16, for efficient CPU/GPU inference |
| **Quantization** | Reduces model size by 2-4× relative to FP16 (e.g., a 7B model from ~13GB → ~4GB) |
| **Metadata** | Self-contained: includes tokenizer, vocab, and all tensors in one file |
| **Compatibility** | Works with llama.cpp, Ollama, LM Studio, koboldcpp, and many other tools |
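Since GGUF is a self-contained binary file, the first few bytes already say a lot. Here is a minimal sketch that parses the fixed-size header, assuming the version 2/3 layout as I understand it from the GGUF spec: a 4-byte `GGUF` magic, a little-endian uint32 version, then uint64 tensor and metadata key/value counts. The function name and the returned dict keys are made up for this example.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (kv count)


def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header from the first 24 bytes of a file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes for a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file (bad magic)")
    (version,) = struct.unpack_from("<I", data, 4)
    if version < 2:
        # Assumption: version 1 used 32-bit counts, which this sketch skips.
        raise ValueError("unsupported GGUF version")
    tensor_count, kv_count = struct.unpack_from("<QQ", data, 8)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for demonstration; the counts are arbitrary.
demo = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
print(parse_gguf_header(demo))
```

For a real file, reading the first 24 bytes with `open(path, "rb").read(24)` and passing them in should be enough. The metadata key/value pairs (tokenizer, architecture, and so on) follow the header and are not decoded here.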
## Why It Matters
- **Consumer hardware**: Run 70B+ models on a single GPU or even CPU
- **No dependencies**: Single `.gguf` file, no PyTorch/Transformers stack needed
- **Flexible offloading**: Mix GPU layers + CPU/RAM layers to fit your hardware
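The offloading point can be sketched as a simple budget calculation: how many layers fit in VRAM if every layer costs about the same. The helper below is hypothetical and assumes uniform layer sizes plus a fixed VRAM reserve for KV cache and buffers. The model size and layer count in the example are illustrative, not measured.

```python
def layers_that_fit(model_gb: float, n_layers: int,
                    vram_gb: float, reserve_gb: float = 2.0) -> int:
    """Estimate how many layers fit on the GPU.

    Assumes every layer has the same size and that reserve_gb of VRAM
    is held back for KV cache and scratch buffers.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Illustrative numbers: a ~40 GB quantized model with 80 layers on a 24 GB GPU.
gpu_layers = layers_that_fit(model_gb=40.0, n_layers=80, vram_gb=24.0)
print(f"put {gpu_layers} of 80 layers on the GPU, keep the rest in RAM")
```

In llama.cpp the resulting number would go into the `--n-gpu-layers` (`-ngl`) option. Real layers are not perfectly uniform, so it is worth starting a little below the estimate.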
## Quantization Types (Common)
| Type | Size vs FP16 | Use Case |
|------|-------------|----------|
| Q4_K_M | ~30% | Best speed/size tradeoff |
| Q5_K_M | ~35% | Better quality, still fast |
| Q6_K | ~41% | Near-lossless |
| Q8_0 | ~53% | Minimal quality loss |
| BF16/FP16 | 100% | Reference quality |
For **GLM-4.7 355B-A32B**, low-bit GGUF versions (such as Q4) are the only practical way to run it on consumer hardware, trading some precision for massive memory savings. A Q8 file still needs several hundred GB of combined memory.
Original poster
Posted on 17-6-2026 09:09 AM
From mobile
Is pairing CC (Claude Code) with a local model the best option???
Pairing Claude Code with Local Models
https://www.kdnuggets.com/pairing-claude-code-with-local-models