Memory Footprint: 4-bit vs 16-bit
Here is the arithmetic that decides whether a model even fits on your hardware, and why "quantized" models are everywhere. A model's weights are just numbers, and each number takes some bytes depending on its precision. The footprint is simple:
bytes = params * bits_per_param / 8
# (8 bits per byte)Training and full-precision inference often use 16-bit floats (2 bytes each). To run a model on a laptop or phone you quantize it: store each weight in fewer bits. 4-bit quantization uses a quarter of the bytes, because 4 bits is a quarter of 16 bits. Same parameter count, one quarter of the memory.
Work a 7-billion-parameter model:
16-bit: 7e9 * 16 / 8 = 1.4e10 bytes ~= 13.04 GiB
4-bit: 7e9 * 4 / 8 = 3.5e9 bytes ~= 3.26 GiB (exactly 1/4)That is the difference between "needs a data-center GPU" and "runs on a MacBook." It is exactly why an on-device assistant (like the one powering this course's AI tutor) ships a 4-bit model: the quality cost is small, the memory and bandwidth win is huge. This is also why parameter-efficient fine-tuning (LoRA) is attractive, you only ever store and ship a few small adapter weights on top of a frozen, possibly quantized, base.
We report footprint in GiB (gibibytes), dividing bytes by 1024 ** 3. Keep it as an exact float, do not round, so the relationships below hold precisely.
Build two functions.
model_memory(params, bits_per_param)-> the footprint in GiB as a float:params * bits_per_param / 8 / (1024 ** 3).quantization_savings(params, from_bits, to_bits)->{"before_gb", "after_gb", "ratio"}, wherebefore_gbandafter_gbare the two footprints andratioisfrom_bits / to_bits(how many times smaller the quantized model is).
The math is real and gradable; the GPU training it enables is not something this sandbox can run. Press Run to print the footprint of a 7B model at a few precisions.
Write model_memory(params, bits_per_param) returning the footprint in GiB as a float (params * bits_per_param / 8 / (1024 ** 3)), and quantization_savings(params, from_bits, to_bits) returning {"before_gb", "after_gb", "ratio"} where the two footprints come from model_memory and ratio is from_bits / to_bits. Do not round; keep exact floats so 4-bit comes out as exactly one quarter of 16-bit.
This lesson is locked
Lessons open one at a time. Finish the previous lesson to unlock this one.