佳礼资讯网

Views: 70 | Replies: 3

Getting to know BF16, GGUF, and other LLM training jargon; Zhipu open-sources GLM-4.7

Posted on 17-6-2026 08:53 AM, from mobile
BF16 is better than FP8 in precision

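That's generally true. **BF16 (bfloat16)** uses 1 sign bit, 8 exponent bits, and 7 mantissa bits. The 8-bit exponent matches FP32, so BF16 covers roughly the same dynamic range as FP32, far wider than **FP8**, whose two common variants are E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits). The wider exponent makes BF16 less prone to overflow and underflow, and its 7 mantissa bits (versus 2 or 3) give finer resolution for gradients and activations during training.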
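To make the range and resolution comparison concrete, here is a minimal sketch that derives the largest finite value, the smallest normal value, and the spacing just above 1.0 from the bit counts alone. The `FloatFormat` class and its field names are illustrative, not part of any library. It assumes BF16 and E5M2 reserve the top exponent code for infinities and NaN, while E4M3 reuses that code for finite values and keeps only the all-ones mantissa as NaN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatFormat:
    """Illustrative description of a binary floating-point layout."""
    name: str
    exp_bits: int
    man_bits: int
    # True: the top exponent code is reserved for inf/NaN (IEEE style).
    # False: the top exponent code also holds finite values and only the
    # all-ones mantissa is NaN (the assumption made here for FP8 E4M3).
    ieee_like: bool = True

    @property
    def bias(self) -> int:
        return 2 ** (self.exp_bits - 1) - 1

    @property
    def max_finite(self) -> float:
        top_code = 2 ** self.exp_bits - (2 if self.ieee_like else 1)
        largest_fraction = 2.0 - 2.0 ** -self.man_bits
        if not self.ieee_like:
            # The all-ones mantissa is NaN, so step down one more increment.
            largest_fraction -= 2.0 ** -self.man_bits
        return largest_fraction * 2.0 ** (top_code - self.bias)

    @property
    def min_normal(self) -> float:
        return 2.0 ** (1 - self.bias)

    @property
    def epsilon(self) -> float:
        """Gap between 1.0 and the next representable value."""
        return 2.0 ** -self.man_bits


BF16 = FloatFormat("BF16", exp_bits=8, man_bits=7)
FP8_E5M2 = FloatFormat("FP8 E5M2", exp_bits=5, man_bits=2)
FP8_E4M3 = FloatFormat("FP8 E4M3", exp_bits=4, man_bits=3, ieee_like=False)

for fmt in (BF16, FP8_E5M2, FP8_E4M3):
    print(f"{fmt.name:9s} max={fmt.max_finite:.4g} "
          f"min_normal={fmt.min_normal:.4g} eps={fmt.epsilon}")
```

The output shows both effects at once: BF16 reaches about 3.4e38 while the FP8 variants top out at 57344 and 448, and the BF16 step size near 1.0 is 16 to 32 times finer.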

However, the trade-off is that **FP8** halves the bytes per value relative to BF16, which gives up to roughly 2× the effective memory bandwidth and compute throughput on supported hardware (like NVIDIA H100 with Transformer Engine, or recent AMD/Intel accelerators). For inference and some large-scale training scenarios where pure throughput is the bottleneck, FP8 can be worth the precision cost, especially when combined with quantization-aware training or mixed-precision strategies.

So the practical answer depends on your constraint: **precision and training stability** → BF16; **throughput and memory efficiency** → FP8.

GLM 4.5

Original poster | Posted on 17-6-2026 08:55 AM, from mobile
How much VRAM does GLM-4.7 355B-A32B require?

Based on the search results, here's what I found about **GLM-4.7 355B-A32B** VRAM requirements:

## GLM-4.7 355B-A32B VRAM Requirements

| Variant | Precision | Memory Required | Notes |
|---------|-----------|-----------------|-------|
| **GLM-4.7** | BF16 | **~710 GB** | Unquantized weights (355B × 2 bytes), highest quality |
| **GLM-4.7-FP8** | FP8 | **~328 GB** (reported) | Near-lossless, ~2x memory savings |
| **GLM-4.7 (Q8 GGUF)** | 8-bit | ~355-380 GB combined RAM+VRAM (weights-only estimate from 355B × 8 to 8.5 bits) | CPU offload with GPU assistance |
| **GLM-4.7 (Q4 GGUF)** | 4-bit | ~205 GB combined RAM+VRAM, ~40 GB of it on GPU | MoE offloading to RAM |

### Key Data Points:

- **BF16 full model**: The 355B parameter model at BF16 requires approximately **710 GB of VRAM** (355B × 2 bytes = 710 GB, plus overhead for KV cache and activations). This is a multi-node deployment scenario.

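- **FP8 variant**: The official **GLM-4.7-FP8** model reportedly runs at **~82 GB per node** across 4 nodes (DGX Spark cluster), totaling ~328 GB. For comparison, 355B parameters at 1 byte each come to about 355 GB (roughly 331 GiB) for the weights alone, so treat ~328 GB as a lower bound before KV cache and activations. This is the practical production setup for the full model.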

- **System RAM requirement**: The official GitHub notes that **server memory must exceed 1TB** to ensure normal model loading and operation.

- **GGUF quantized versions**: For consumer/local deployment, the 4-bit GGUF can run on a **single 40GB GPU** with the MoE layers offloaded to system RAM, but the whole setup still needs ~205GB of combined RAM+VRAM for good performance.
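The weight-only arithmetic behind these figures can be sketched in a few lines. This is a rough estimate, not a measurement: it ignores KV cache, activations, and runtime overhead, and the 4.5 bits-per-weight figure is an assumed value for a 4-bit format that stores one 16-bit scale per 32 weights.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Weights-only footprint in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def gb_to_gib(gb: float) -> float:
    """Convert decimal GB to binary GiB (1 GiB = 2**30 bytes)."""
    return gb * 1e9 / 2 ** 30


N_PARAMS = 355e9  # total parameter count from the model name (355B)

# Bits per weight for each storage format; the last entry is an assumption.
FORMATS = {
    "BF16": 16.0,
    "FP8": 8.0,
    "4-bit + per-block scale (assumed 4.5 bpw)": 4.5,
}

for label, bits in FORMATS.items():
    gb = weight_memory_gb(N_PARAMS, bits)
    print(f"{label:42s} {gb:7.1f} GB  ({gb_to_gib(gb):6.1f} GiB)")
```

The 4-bit estimate lands near 200 GB, which is consistent with the ~205 GB combined figure quoted above for the Q4 GGUF.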

### Practical Deployment Options:

| Setup | Hardware | Notes |
|-------|----------|-------|
| **Production (FP8)** | 8× H100 80GB (4× 80 GB = 320 GB is below ~328 GB) or 4× DGX Spark (128 GB unified memory each) | Full 128K+ context |
| **Research (BF16)** | 16× H100/A100 80GB, or at least 10× 80GB (8× 80 GB = 640 GB is below ~710 GB) | Maximum precision |
| **Local/Consumer** | 1× 40GB-class GPU + enough system RAM to reach ~205GB combined | Q4 GGUF, limited context |

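**Bottom line**: If you want to run the **full GLM-4.7 355B-A32B at BF16**, you need roughly **710+ GB of VRAM**. That means at least 10× 80GB GPUs (8× 80 GB = 640 GB is not enough), and often 16× H100/A100 80GB when the GPU count has to be a power of two for tensor parallelism. The **FP8 variant** is the practical choice, requiring **~328 GB total** (~82 GB per node × 4 nodes), which is still more than 4× 80GB GPUs (320 GB) can hold.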
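A quick way to sanity-check a GPU count against a memory figure is sketched below. The function names are hypothetical, and the 90% usable fraction is an assumption standing in for memory that the runtime, KV cache, and activations take away from the weights.

```python
import math


def min_gpus(required_gb: float, gpu_gb: float, usable_fraction: float = 0.9) -> int:
    """Smallest GPU count whose usable memory covers required_gb."""
    usable_per_gpu = gpu_gb * usable_fraction
    return math.ceil(required_gb / usable_per_gpu)


def next_power_of_two(n: int) -> int:
    """Round up to a power of two, a common constraint for tensor parallelism."""
    return 1 << (n - 1).bit_length()


for label, required_gb in (("BF16", 710.0), ("FP8", 328.0)):
    n = min_gpus(required_gb, gpu_gb=80.0)
    print(f"{label}: at least {n} x 80GB GPUs, "
          f"{next_power_of_two(n)} if a power of two is required")
```

Under these assumptions the BF16 weights need 10 GPUs (16 as a power of two) and the FP8 weights need 5 (8 as a power of two).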
Original poster | Posted on 17-6-2026 08:58 AM, from mobile

Kimi K 2.6: AI can make mistakes



What is GGUF?

**GGUF** (commonly expanded as GPT-Generated Unified Format) is a binary file format for storing and distributing large language models, usually quantized, and is primarily used with **llama.cpp** and related inference engines.

## Key Points

| Aspect | Details |
|--------|---------|
| **Origin** | Created by Georgi Gerganov (creator of llama.cpp) as a successor to the older GGML format |
| **Purpose** | Store model weights, usually quantized (Q4_K_M, Q5_K_M, Q6_K, Q8_0, etc.) but also unquantized F16/BF16/F32, for efficient CPU/GPU inference |
| **Quantization** | Reduces model size by roughly 2-4× relative to FP16 (e.g., a 7B model from ~13GB → ~4GB at 4-bit) |
| **Metadata** | Self-contained: includes hyperparameters, tokenizer, vocab, and all tensors in one file |
| **Compatibility** | Works with llama.cpp, Ollama, LM Studio, koboldcpp, and many other tools |
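Because the file is self-describing, the first few bytes already say how many tensors and metadata entries follow. The sketch below reads that fixed-size header with only the standard library. It assumes the version 2 and later layout (little-endian: 4-byte magic `GGUF`, 32-bit version, 64-bit tensor count, 64-bit metadata key/value count); `read_gguf_header` is a hypothetical helper, not a llama.cpp API.

```python
import struct

GGUF_MAGIC = b"GGUF"
# "<" = little-endian without padding: 4-byte magic, uint32, uint64, uint64.
HEADER_STRUCT = struct.Struct("<4sIQQ")


def read_gguf_header(path: str) -> dict:
    """Return version, tensor count, and metadata entry count of a GGUF file."""
    with open(path, "rb") as f:
        raw = f.read(HEADER_STRUCT.size)
    if len(raw) < HEADER_STRUCT.size:
        raise ValueError("file is too short to contain a GGUF header")
    magic, version, tensor_count, kv_count = HEADER_STRUCT.unpack(raw)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic was {magic!r})")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

Calling `read_gguf_header("model.gguf")` on a real file (the path is a placeholder) returns the three counts without loading any tensor data.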

## Why It Matters

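- **Consumer hardware**: Run 70B+ models on a single GPU or even on CPU alone, provided system RAM can hold whatever does not fit in VRAM
- **Few dependencies**: A single `.gguf` file plus a runtime such as llama.cpp; no PyTorch/Transformers stack needed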
- **Flexible offloading**: Mix GPU layers + CPU/RAM layers to fit your hardware
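The offloading trade-off comes down to how many layers fit in VRAM. Here is a rough sketch of that estimate; the helper name is hypothetical, and it assumes all layers are the same size and that a fixed reserve is held back for KV cache and scratch buffers. The result is the kind of number one would pass to llama.cpp's `--n-gpu-layers` (`-ngl`) option.

```python
def gpu_layers_that_fit(model_gb: float, n_layers: int,
                        vram_gb: float, reserve_gb: float = 4.0) -> int:
    """Estimate how many equally sized layers fit in VRAM after a reserve."""
    per_layer_gb = model_gb / n_layers
    budget_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(budget_gb // per_layer_gb))


# Illustrative numbers only: a ~40 GB quantized model with 80 layers
# on a 24 GB GPU, keeping 4 GB free for KV cache and buffers.
n_gpu = gpu_layers_that_fit(model_gb=40.0, n_layers=80, vram_gb=24.0)
print(f"offload {n_gpu} of 80 layers to the GPU, keep the rest in RAM")
```

Real models have unevenly sized layers (embeddings, MoE experts), so treat the result as a starting point and adjust after watching actual memory use.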

## Quantization Types (Common)

| Type | Approx. size vs FP16 | Use Case |
|------|----------------------|----------|
| Q4_K_M | ~30% | Best speed/size tradeoff |
| Q5_K_M | ~35% | Better quality, still fast |
| Q6_K | ~41% | Near-lossless |
| Q8_0 | ~53% | Minimal quality loss |
| BF16/FP16 | 100% | Reference quality |
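The sizes land slightly above the nominal bit width because each block of weights also stores a scale. A minimal pure-Python sketch of a Q8_0-style round trip illustrates this: blocks of 32 values, one scale per block, 8-bit signed integers. It follows the general scheme only; the real format stores the scale as a 16-bit float and packs bytes, which this sketch does not reproduce.

```python
BLOCK_SIZE = 32
# 32 int8 values plus one 16-bit scale per block.
BITS_PER_WEIGHT = (BLOCK_SIZE * 8 + 16) / BLOCK_SIZE  # 8.5


def quantize_block(values):
    """Map 32 floats to (scale, list of ints in [-127, 127])."""
    if len(values) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} values, got {len(values)}")
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0.0, [0] * BLOCK_SIZE
    scale = amax / 127.0
    return scale, [round(v / scale) for v in values]


def dequantize_block(scale, quants):
    """Recover approximate floats from a quantized block."""
    return [scale * q for q in quants]


block = [i / 10 - 1.6 for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(f"bits per weight: {BITS_PER_WEIGHT}, worst error: {worst:.5f}")
```

At 8.5 bits per weight the block is about 53% of its FP16 size, and the reconstruction error per value is bounded by half the block's scale.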

For **GLM-4.7 355B-A32B**, low-bit GGUF builds (Q4 and similar) are the only practical way to run it on consumer hardware, trading some precision for massive memory savings. A Q8 build is roughly as large as the FP8 model and still calls for server-class memory.
Original poster | Posted on 17-6-2026 09:09 AM, from mobile
Is pairing a local setup with Claude Code (CC) the best option???

Pairing Claude Code with Local Models
https://www.kdnuggets.com/pairing-claude-code-with-local-models
Copyright © 1996-2026 Cari Internet Sdn Bhd (483575-W)