Syllabus Lesson 199 of 239 · Fine-Tuning, Conceptually
Fine-Tuning, Conceptually

Memory Footprint: 4-bit vs 16-bit

Here is the arithmetic that decides whether a model even fits on your hardware, and why "quantized" models are everywhere. A model's weights are just numbers, and each number takes some bytes depending on its precision. The footprint is simple:

bytes = params * bits_per_param / 8
# (8 bits per byte)

Training and full-precision inference often use 16-bit floats (2 bytes each). To run a model on a laptop or phone you quantize it: store each weight in fewer bits. 4-bit quantization uses a quarter of the bytes, because 4 bits is a quarter of 16 bits. Same parameter count, one quarter of the memory.

Work a 7-billion-parameter model:

16-bit: 7e9 * 16 / 8 = 1.4e10 bytes  ~= 13.04 GiB
 4-bit: 7e9 *  4 / 8 = 3.5e9  bytes  ~=  3.26 GiB   (exactly 1/4)

That is the difference between "needs a data-center GPU" and "runs on a MacBook." It is exactly why an on-device assistant (like the one powering this course's AI tutor) ships a 4-bit model: the quality cost is small, the memory and bandwidth win is huge. This is also why parameter-efficient fine-tuning (LoRA) is attractive, you only ever store and ship a few small adapter weights on top of a frozen, possibly quantized, base.

We report footprint in GiB (gibibytes), dividing bytes by 1024 ** 3. Keep it as an exact float, do not round, so the relationships below hold precisely.

Build two functions.

  • model_memory(params, bits_per_param) -> the footprint in GiB as a float: params * bits_per_param / 8 / (1024 ** 3).
  • quantization_savings(params, from_bits, to_bits) -> {"before_gb", "after_gb", "ratio"}, where before_gb and after_gb are the two footprints and ratio is from_bits / to_bits (how many times smaller the quantized model is).

The math is real and gradable; the GPU training it enables is not something this sandbox can run. Press Run to print the footprint of a 7B model at a few precisions.

Your turn

Write model_memory(params, bits_per_param) returning the footprint in GiB as a float (params * bits_per_param / 8 / (1024 ** 3)), and quantization_savings(params, from_bits, to_bits) returning {"before_gb", "after_gb", "ratio"} where the two footprints come from model_memory and ratio is from_bits / to_bits. Do not round; keep exact floats so 4-bit comes out as exactly one quarter of 16-bit.

Spotted a problem in this lesson? Report it

Code · runs in your browser
Output