How to Run DeepSeek Locally: A Complete Guide
DeepSeek is an advanced AI model built for natural language processing (NLP) and reasoning tasks. If you want to run it locally on your own computer, this guide is for you.
In this article, we will explain in detail how to install, configure, and run DeepSeek, what system requirements are necessary, and what steps need to be followed.
1. Can DeepSeek be run on a local machine?
Yes! If you have strong hardware and the necessary software installed, you can run DeepSeek locally on your machine.
Why run DeepSeek on a local system?
- Fast processing: less time spent loading and transferring data
- Privacy: Your data will not have to be uploaded to an external server
- Customization: You can tune the model to your needs
- No internet required: Offline use possible once installed
2. System Requirements
A powerful machine is required to run large AI models like DeepSeek. The minimum requirements are as follows:
Hardware Requirements:
- GPU: NVIDIA RTX 3090 / A100 / H100 (minimum 24GB VRAM)
- RAM: Minimum 32GB (64GB+ preferred)
- Storage: 100GB+ SSD (NVMe SSD for better performance)
- Processor: AMD Ryzen 9 / Intel i9 (16 cores+ preferred)
Software Requirements:
- Operating System: Ubuntu 20.04 / Windows 11 / macOS (limited support for M1/M2)
- Python Version: Python 3.8 or later
- CUDA & cuDNN: (if you are using a GPU)
- AI Frameworks: PyTorch or TensorFlow (PyTorch is highly recommended)
- Dependencies: Hugging Face Transformers, DeepSeek MoE library, Torch, NumPy
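As a minimal setup sketch (assuming Linux or macOS with Python 3.8+ already installed; the DeepSeek-specific MoE tooling is distributed separately, so only the widely available packages are shown), the core dependencies can be installed like this:
# Create an isolated environment and install the core Python stack
python3 -m venv deepseek-env && source deepseek-env/bin/activate
pip install torch transformers numpy
# Optional: accelerate helps when loading large models across multiple GPUs
pip install accelerate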
DeepSeek Model Training and Fine-Tuning
Fine-tuning large MoE (Mixture of Experts) models like DeepSeek requires powerful hardware. If you want to fine-tune the model on a specific dataset, techniques like LoRA (Low-Rank Adaptation) and QLoRA enable training with far less memory; a tooling sketch follows the spec lists below.
Hardware Requirements (For Fine-Tuning)
Minimum Specs (for smaller models)
- GPU: NVIDIA RTX 3090 (24GB VRAM) or AMD Instinct MI250
- RAM: 64GB DDR4/DDR5
- Storage: 500GB SSD (NVMe preferred)
Recommended Specs (for larger models)
- GPU: NVIDIA A100 / H100 (80GB VRAM)
- RAM: 128GB+
- Storage: 1TB+ NVMe SSD
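As a rough sketch of the tooling side (assuming you go the Hugging Face route for LoRA/QLoRA; the actual training script and dataset are up to you and are not shown here):
# LoRA / QLoRA fine-tuning stack on top of PyTorch
pip install transformers datasets accelerate
# peft provides LoRA adapters; bitsandbytes provides the 4-bit quantization used by QLoRA
pip install peft bitsandbytes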
The Mac mini and DeepSeek are a combination made in heaven.
I utilized a Mac mini M4 Pro, which supports AI for text, images, and advanced reasoning. Forget about cloud subscriptions, latency, and transmitting data to third parties.
With 64GB of unified memory, a 20-core GPU, and an M4 Pro processor, this system is capable of handling some heavy AI jobs. Unfortunately, the bare terminal interface is limiting: no spell checker, no conversation history, and no UI customization.
This is where Docker and Open WebUI come in. They transform your basic terminal into a ChatGPT-like experience, complete with saved chats, an easy interface, and a variety of models at your disposal.
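As one possible setup, the sketch below starts Open WebUI in Docker and points it at a locally running Ollama server; the image name, port mapping, and host-gateway flag reflect Open WebUI's published quick-start at the time of writing, so double-check the project's README if anything has changed:
# Run Open WebUI in a container and let it reach the Ollama server on the host
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
# Then open http://localhost:3000 and pick your local DeepSeek model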
To clarify, we are not using the DeepSeek API. I’m running DeepSeek R1 models locally using llama.cpp (or Ollama), without sending anything to a remote endpoint.
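If you take the Ollama route, pulling and chatting with the R1 model used later in this article looks like this (smaller distilled tags exist in the Ollama library if 70B is too heavy for your machine):
# Assumes Ollama is already installed (e.g., the macOS app or 'brew install ollama')
ollama pull deepseek-r1:70b
ollama run deepseek-r1:70b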
Local AI performance variables
Below are the key performance knobs you can turn (in Ollama or llama.cpp) to push your Mac mini, or any machine, to the max.
The hardware (CPU cores, GPU VRAM, total RAM) sets your fixed ceiling; these variables control how that hardware is actually used.
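Before turning any knobs, it helps to confirm what that fixed ceiling actually is; a quick way to check core counts and RAM (macOS and Linux commands shown, use whichever applies):
# macOS: physical cores, logical cores, and RAM in bytes
sysctl -n hw.physicalcpu hw.logicalcpu hw.memsize
# Linux: logical cores and RAM
nproc && free -h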
Quick tips to actually push your computer past 20% usage
Max threads
- Set --threads (llama.cpp) or OLLAMA_THREADS to something near your logical core count (e.g., 28 if you have 14 physical cores; on bigger machines try 64–128).
High GPU layers
- If you’re using llama.cpp or Ollama with --ngl, push it up (e.g., 100–400 GPU layers for a 70B model).
- Watch out for VRAM limits if you set it too high.
Increase batch size
- In llama.cpp, --batch-size 256 or 512 can double or triple your throughput.
- If you see memory errors or slowdowns, dial it back.
Use nice priority
- Prefix your command, e.g. nice -n -20 ollama run deepseek-r1:70b, to hog CPU time (negative nice values require sudo/root).
- But your Mac might stutter if you do heavy tasks in the background.
Don’t overextend context
- Keep the context size (--ctx-size in llama.cpp) at its default unless you need longer chat memory.
- A bigger context means more memory overhead.
Avoid running multiple instances
- If your goal is to push one chat to 100% usage, don’t spin up multiple models.
- Instead, throw all resources at a single session with high threads and batch size (see the combined example below).
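Putting the knobs above together, a hedged llama.cpp example might look like the following; the GGUF file name is a placeholder, and flag spellings should be checked against your build's --help (older builds use ./main instead of ./llama-cli):
# Hypothetical run: quantized 70B GGUF, most layers offloaded to the GPU, larger batch size
./llama-cli -m ./deepseek-r1-70b-q4_k_m.gguf \
  --threads 14 --n-gpu-layers 99 --batch-size 512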
Conclusion
- LoRA with 4-bit quantization (QLoRA) is the most memory-efficient option for fine-tuning
- Use ONNX Runtime to speed up CPU inference and TensorRT for NVIDIA GPU inference
- DeepSpeed or FSDP are the go-to techniques for multi-GPU training
- Prefer Ray for cluster orchestration and NVIDIA NCCL for multi-node GPU communication