Self-hosted DeepSeek on a lightweight, minimum install
Unsloth, an AI development team run by brothers Daniel and Michael Han, has successfully reduced the size of DeepSeek-R1 by approximately 80% using dynamic quantization techniques[1][2]. This significant reduction allows the model to run more efficiently on consumer hardware while maintaining much of its original performance.
## Key Achievements
- **Size Reduction**: The original DeepSeek-R1 model, which required 720GB of storage, has been compressed to just 131GB[1][2][6].
- **Performance Retention**: Despite the drastic size reduction, the compressed model maintains 80-90% of the original model's reasoning capabilities[4].
- **Efficiency Gain**: On two H100 GPUs, the compressed model reaches roughly 140 tokens per second of throughput, and about 14 tokens per second for single-user inference[1].
## Dynamic Quantization Technique
Unsloth's approach to compressing DeepSeek-R1 involves:
1. **Selective Quantization**: Different parts of the model are quantized at varying levels of precision[2].
2. **MoE Layer Focus**: The Mixture of Experts (MoE) layers, which account for about 88% of the total weights, are quantized to 1.58 bits[2][5].
3. **Precision Balance**: Critical layers like the attention mechanism and initial transformer blocks use higher precision (4-bit or 6-bit) to maintain model integrity[2][3].
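To make the idea concrete, here is a minimal sketch of what a layer-selective quantization policy could look like. This is an illustrative assumption, not Unsloth's actual code: the layer-name patterns, thresholds, and bit widths are hypothetical stand-ins for the scheme described above.

```python
# Illustrative sketch of a dynamic (layer-selective) quantization policy.
# The layer-name patterns and thresholds below are assumptions for
# illustration only, not Unsloth's implementation.

def choose_bits(layer_name: str, layer_index: int) -> float:
    """Assign a per-layer bit width, mirroring the idea that MoE expert
    weights tolerate aggressive quantization while attention and the
    first transformer blocks need higher precision."""
    # Keep the first few blocks at high precision to protect the
    # residual stream early in the network.
    if layer_index < 3:
        return 6.0
    # Attention projections: moderate precision.
    if any(p in layer_name for p in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return 4.0
    # MoE expert FFN weights (~88% of all parameters): quantize hardest.
    if ".experts." in layer_name:
        return 1.58
    # Everything else (router, norms, embeddings): leave near full precision.
    return 8.0

# Example: an expert weight deep in the network gets the 1.58-bit treatment.
print(choose_bits("model.layers.40.mlp.experts.7.down_proj", 40))  # -> 1.58
```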
## Available Versions
Unsloth has created four dynamically quantized versions of DeepSeek-R1[2]:
1. 1.58-bit version (131GB)
2. 1.73-bit version (158GB)
3. 2.22-bit version (183GB)
4. 2.51-bit version (212GB)
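To try one of these locally, the files can be pulled from Hugging Face. A minimal sketch, assuming the quants are published under the `unsloth/DeepSeek-R1-GGUF` repo with the `UD-IQ1_S` naming Unsloth used for the 1.58-bit variant (check the live listing before running):

```python
# Fetch only the 1.58-bit (131GB) split-GGUF files for use with llama.cpp.
# Repo ID and filename pattern follow Unsloth's release naming; verify
# them against the current Hugging Face repo before running.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",
    local_dir="DeepSeek-R1-GGUF",
    allow_patterns=["*UD-IQ1_S*"],  # swap the pattern to pick another variant
)
```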
## Practical Implications
- **Accessibility**: The compressed model can run on systems with as little as 80GB of combined VRAM and RAM[1][7].
- **Local Deployment**: Users can now run powerful AI models locally, reducing reliance on cloud services[1].
- **Cost-Efficiency**: The compression technique significantly reduces computational costs while maintaining strong performance[5].
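On the accessibility point, a rough way to reason about hardware is to estimate how many transformer layers fit in VRAM and offload the rest to system RAM. The heuristic below is a back-of-the-envelope assumption (it treats all layers as equal in size), not an official sizing tool:

```python
# Back-of-the-envelope estimate of GPU-resident layers when a quantized
# model spills into system RAM. Assumes (simplistically) that the file's
# weight bytes are spread evenly across the transformer layers.

def layers_to_offload(vram_gb: float, file_size_gb: float, n_layers: int = 61) -> int:
    """Estimate how many of DeepSeek-R1's 61 layers fit in VRAM."""
    return max(0, min(n_layers, int(vram_gb / file_size_gb * n_layers)))

# Example: a 24GB GPU with the 131GB 1.58-bit quant fits ~11 layers;
# the remaining layers run from RAM, which is slower but workable.
print(layers_to_offload(24, 131))  # -> 11
```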
This breakthrough in model compression makes advanced AI models markedly more accessible and efficient, paving the way for broader local deployment of powerful language models.