qwen35.cn  ·  Updated March 2026

Qwen3.5 Deployment

Open-Source LLM Local Deployment Guide

From 0.8B to 397B — 8 Qwen3.5 models analyzed in depth. Whether you have a gaming laptop, Mac, or GPU server, find the right version and the exact commands to run it locally.

8 model sizes · 201 languages · 4 deploy methods · 0.8B+ parameter range

 What Is Qwen3.5?

Qwen3.5 is a next-generation open-source large language model series released by Alibaba Cloud's Qwen Team between February and March 2026. It covers 8 model variants ranging from 0.8B to 397B parameters. Qwen3.5 adopts a groundbreaking hybrid architecture that combines Gated Delta Networks with sparse Mixture-of-Experts (MoE), delivering high-throughput inference at significantly lower cost than dense models. With support for 201 languages and dialects, native multimodal vision capabilities, and an Apache 2.0 license for free commercial use, Qwen3.5 is one of the most compelling open-source models available today for private local deployment.

Three Core Technical Advances

Compared to its predecessors, Qwen3.5 makes three qualitative leaps. First, early-fusion multimodal training brings its image-text understanding on par with dedicated vision-language models. Second, reinforcement learning scaled across million-agent environments dramatically improves complex reasoning and code generation. Third, near-100% multimodal training efficiency means the model matches the quality of a text-only model at a fraction of the training compute. The result is a model that punches above its weight class against both open- and closed-source competitors.


 Qwen3.5 Model Lineup: All 8 Versions Compared

Server-Grade Models (Enterprise Private Deployment)

Qwen3.5-397B-A17B (Flagship MoE): 397B total params, only 17B active per token. Outperforms same-tier closed-source models. Requires a multi-GPU A100/H100 cluster. [MoE arch · Multi-GPU · Enterprise]

Qwen3.5-122B-A10B (Mid MoE): 122B params, 10B active. Achievable on dual A100 cards. Excellent price-to-performance ratio for large-scale deployments. [MoE arch · Dual A100]

Qwen3.5-27B (Balanced): Dense 27B model. Runs on a single 80 GB GPU. The best balanced choice for enterprise private deployment in the Qwen3.5 lineup. [Dense · 1× 80 GB GPU]

Qwen3.5-35B-A3B (Lite MoE): 35B params, only 3B active. Ultra-low latency for high-throughput service workloads requiring fast response times. [MoE arch · Low latency]

Best for Personal / Local Deployment

Qwen3.5-9B — The Sweet Spot for Consumer GPUs

Qwen3.5-9B requires approximately 16 GB of VRAM and runs smoothly on RTX 3090/4090-class consumer cards. It offers the best balance of reasoning capability and memory footprint, making it the strongest all-around Qwen3.5 pick for anyone with a 16 GB GPU.

Qwen3.5-4B — Best Entry Point for 8 GB VRAM

The quantized Qwen3.5-4B needs only 3–5 GB of VRAM, making it viable on any modern gaming laptop. This is the most accessible Qwen3.5 version for users with 8 GB GPUs who want a locally-running AI assistant without cloud fees.

Qwen3.5-2B and Qwen3.5-0.8B — No GPU Required

These ultralightweight Qwen3.5 models are purpose-built for edge computing. They run on pure CPU — no discrete GPU needed — making them ideal for Raspberry Pi, embedded devices, or any low-power local AI deployment.


 How to Deploy Qwen3.5 Locally — Step-by-Step Guides

📦 Model file sizes (Q4 quantized): 0.8B ≈ 1.0 GB  ·  2B ≈ 2.7 GB  ·  4B ≈ 3.4 GB  ·  9B ≈ 6.6 GB  ·  27B ≈ 17 GB  ·  35B ≈ 24 GB  ·  122B ≈ 81 GB
1) Ollama — One-Command Deploy, Best for Beginners (Recommended)

Ollama bundles a complete Qwen3.5 model manager and inference engine into a single tool. After installation, one command pulls and runs any model size and automatically starts a local OpenAI-compatible API on port 11434. Zero configuration needed.

① Install Ollama

Windows

Download OllamaSetup.exe from ollama.com/download, double-click to install, then restart.

macOS
brew install ollama

Or download the Ollama.dmg from the official website.

Linux
curl -fsSL https://ollama.com/install.sh | sh

② Choose your Qwen3.5 version based on VRAM

# No discrete GPU / CPU-only — 0.8B (file size ~1.0 GB)
ollama run qwen3.5:0.8b

# 8 GB VRAM (RTX 3060, etc.) — 4B (file size ~3.4 GB)
ollama run qwen3.5:4b

# 16 GB VRAM (RTX 3090, etc.) — 9B (file size ~6.6 GB, recommended)
ollama run qwen3.5:9b

# 24 GB+ VRAM (RTX 4090, etc.) — 27B (file size ~17 GB)
ollama run qwen3.5:27b
The first run automatically downloads the model. If you're behind a firewall, configure an HTTP proxy via the HTTPS_PROXY environment variable before running Ollama.

③ Optional: Add Open-WebUI for a Chat Interface

# Requires Docker Desktop to be installed
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --name open-webui ghcr.io/open-webui/open-webui:main
# Open http://localhost:3000 in your browser

④ OpenAI-Compatible API Call

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5:9b",
    "messages": [{"role":"user","content":"Introduce yourself briefly."}],
    "temperature": 0.6,
    "top_k": 20,
    "top_p": 0.95,
    "max_tokens": 2048
  }'
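The same endpoint can be called from Python with nothing but the standard library. A minimal sketch, assuming Ollama is running on its default port; the model tag and sampling values mirror the curl example above, and the `build_payload` / `chat` helpers are illustrations of the request shape, not part of any official SDK:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_payload(model: str, prompt: str, **sampling) -> dict:
    """Assemble an OpenAI-style chat request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **sampling,
    }

def chat(model: str, prompt: str) -> str:
    """POST the request to the local Ollama server and return the reply text."""
    body = json.dumps(build_payload(model, prompt,
                                    temperature=0.6, top_p=0.95,
                                    max_tokens=2048)).encode()
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# Example (requires a running Ollama instance):
#   print(chat("qwen3.5:9b", "Introduce yourself briefly."))
```

Because the API is OpenAI-compatible, any OpenAI client library pointed at the same base URL works just as well.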
2) llama.cpp — Cross-Platform GGUF, CPU & GPU Hybrid (CPU-First)

llama.cpp is the most widely supported local Qwen3.5 inference engine, running on Windows, macOS, and Linux. It can run purely on CPU, and the -ngl flag offloads any number of layers to a GPU — enabling flexible CPU+GPU hybrid inference when your VRAM is tight.

① Install llama.cpp

# macOS — simplest method
brew install llama.cpp

# Linux / Windows — compile from source for best performance
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j 8
# Binaries will be in ./build/bin/ — add to PATH for convenience

② Download a Qwen3.5 GGUF model file

# From Hugging Face (international users)
pip install huggingface_hub
huggingface-cli download Qwen/Qwen3.5-4B-GGUF \
  qwen3.5-4b-q4_k_m.gguf --local-dir ./models

# From ModelScope (China mirror, faster)
pip install modelscope
modelscope download Qwen/Qwen3.5-4B-GGUF \
  qwen3.5-4b-q4_k_m.gguf --local_dir ./models

Quantization guide: Q4_K_M (balanced, recommended) · Q6_K (higher quality, roughly a third more VRAM than Q4_K_M) · Q2_K (extreme compression, noticeable quality loss). The Q4_K_M 4B file is about 2.5 GB; 9B is about 5.5 GB.
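These sizes follow a simple rule of thumb: file size ≈ parameter count × bits per weight ÷ 8, plus a little metadata overhead. A quick estimator, using approximate llama.cpp bits-per-weight figures (illustrative, not exact):

```python
# Approximate bits-per-weight for common llama.cpp quantization types.
BPW = {"Q2_K": 2.6, "Q4_K_M": 4.85, "Q6_K": 6.56, "F16": 16.0}

def gguf_size_gb(params_billion: float, quant: str) -> float:
    """Rough GGUF file size in decimal GB: params × bits-per-weight / 8."""
    return params_billion * 1e9 * BPW[quant] / 8 / 1e9

# 4B at Q4_K_M works out to roughly 2.4 GB, close to the ~2.5 GB figure above.
```

The same formula doubles as a minimum-VRAM estimate for full GPU offload, before adding KV-cache and activation overhead.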

③ Run the model

# Interactive chat — CPU only (-t = thread count, use physical cores)
llama-cli -m ./models/qwen3.5-4b-q4_k_m.gguf \
  --jinja --color -t 8 \
  --temp 0.6 --top-k 20 --top-p 0.95 \
  -c 40960 --no-context-shift

# Full GPU mode (-ngl 99 = all layers on GPU, -fa = FlashAttention)
llama-cli -m ./models/qwen3.5-4b-q4_k_m.gguf \
  --jinja --color -ngl 99 -fa \
  --temp 0.6 --top-k 20 --top-p 0.95 \
  -c 40960 --no-context-shift

# CPU+GPU hybrid (e.g., -ngl 20 = offload first 20 layers to GPU)
llama-cli -m ./models/qwen3.5-4b-q4_k_m.gguf \
  --jinja --color -ngl 20 \
  --temp 0.6 --top-k 20 --top-p 0.95 -c 40960

④ Start an HTTP server (OpenAI-compatible API)

llama-server -m ./models/qwen3.5-4b-q4_k_m.gguf \
  --jinja -ngl 99 -fa -c 40960 \
  --host 0.0.0.0 --port 8080
# Web UI:  http://localhost:8080
# API:     http://localhost:8080/v1/chat/completions
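llama-server can also stream its replies: add "stream": true to the request body and the response arrives as server-sent events (`data: {...}` lines, terminated by `data: [DONE]`). A minimal stdlib client sketch, assuming the server started above is listening on port 8080:

```python
import json
import urllib.request

def parse_sse_line(line: bytes):
    """Extract the text delta from one 'data: {...}' SSE line, if any."""
    line = line.strip()
    if not line.startswith(b"data: ") or line == b"data: [DONE]":
        return None
    chunk = json.loads(line[len(b"data: "):])
    return chunk["choices"][0].get("delta", {}).get("content")

def stream_chat(prompt: str, url="http://localhost:8080/v1/chat/completions"):
    """Yield response tokens as the server produces them."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            delta = parse_sse_line(line)
            if delta:
                yield delta

# Example (requires a running llama-server):
#   for token in stream_chat("Hello"):
#       print(token, end="", flush=True)
```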
3) vLLM — Production-Grade High-Throughput on NVIDIA GPUs (Production)

vLLM uses PagedAttention memory management and continuous batching to deliver several times the Qwen3.5 throughput of vanilla Transformers. It is the recommended choice for GPU servers that need to serve many concurrent users. Requires CUDA 11.8+.
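Continuous batching only pays off when requests actually arrive concurrently. A small client-side fan-out sketch; `fan_out` and `send` are illustrative names, and `send` stands in for whatever HTTP call you make against the vLLM server (it is not a vLLM API):

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(prompts, send, max_workers=8):
    """Submit prompts concurrently and return replies in input order.

    `send` is any callable taking a prompt string and returning the reply,
    e.g. an OpenAI-compatible HTTP request to the vLLM server.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))

# With a real `send`, vLLM batches the in-flight requests on the GPU
# automatically; no client-side batching logic is needed.
```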

① Install vLLM

pip install "vllm>=0.8.5"

② Start the Qwen3.5 inference server

# Single 16 GB GPU — deploy Qwen3.5-9B (32K context)
vllm serve Qwen/Qwen3.5-9B-Instruct \
  --port 8000 \
  --reasoning-parser qwen3 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9

# Tensor-parallel across 2 GPUs — deploy Qwen3.5-27B
vllm serve Qwen/Qwen3.5-27B-Instruct \
  --port 8000 \
  --tensor-parallel-size 2 \
  --reasoning-parser qwen3 \
  --max-model-len 32768

# 8× A100 cluster — flagship Qwen3.5-397B-A17B MoE
vllm serve Qwen/Qwen3.5-397B-A17B-Instruct \
  --port 8000 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --reasoning-parser qwen3

③ Test the API (OpenAI-compatible format)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.5-9B-Instruct",
    "messages": [{"role":"user","content":"Hello, introduce yourself."}],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "max_tokens": 2048
  }'
⚠️ OOM / Out of Memory: Try --max-model-len 8192 to reduce peak context, or --gpu-memory-utilization 0.8 to lower pre-allocated VRAM. You can also drop to a smaller Qwen3.5 variant (e.g., 9B → 4B).
4) MLX — Apple Silicon Unified Memory, Best for Mac (Mac Only)

For users of Apple M1 through M4 chips, MLX is the go-to local deployment solution. The framework is optimized for Apple Silicon's unified memory architecture (the CPU and GPU share one memory pool), so Qwen3.5 inference is typically faster and more power-efficient than llama.cpp on the same M-series hardware.

① Install mlx-lm

pip install mlx-lm

② Choose model based on unified memory size

# 8 GB unified RAM (M1 base) — 4B recommended
mlx_lm.chat --model mlx-community/Qwen3.5-4B-Instruct-4bit

# 16 GB unified RAM (M1 Pro / M2) — 9B recommended
mlx_lm.chat --model mlx-community/Qwen3.5-9B-Instruct-4bit

# 32 GB unified RAM (M1 Max / M3 Pro) — can run 27B
mlx_lm.chat --model mlx-community/Qwen3.5-27B-Instruct-4bit

# One-shot generation (no interactive chat)
mlx_lm.generate \
  --model mlx-community/Qwen3.5-4B-Instruct-4bit \
  --max-tokens 2048 \
  --prompt "Describe the architecture of Qwen3.5 in detail."

③ Start an OpenAI-compatible API server

mlx_lm.server \
  --model mlx-community/Qwen3.5-4B-Instruct-4bit \
  --port 8000

# Test the API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/Qwen3.5-4B-Instruct-4bit",
       "messages":[{"role":"user","content":"Hello"}],
       "temperature":0.6,"max_tokens":1024}'
💡 MLX model files are hosted in the mlx-community Hugging Face organization. First download will take a few minutes. To speed it up from inside China, set HF_ENDPOINT=https://hf-mirror.com before running.

 Hardware Guide: Which Qwen3.5 Should You Run?

VRAM is the primary constraint when choosing a Qwen3.5 version. Use this table to instantly find the right model and toolchain for your specific hardware setup.

Hardware | Recommended Qwen3.5 | Best Tool | Experience
CPU only (no GPU) | Qwen3.5-0.8B / 2B | llama.cpp GGUF | Usable ★★☆
8 GB VRAM (RTX 3060, etc.) | Qwen3.5-4B Q4 | Ollama / llama.cpp | Smooth ★★★
16 GB VRAM (RTX 3090, etc.) | Qwen3.5-9B | Ollama / vLLM | Excellent ★★★★
24 GB+ VRAM (RTX 4090, etc.) | Qwen3.5-27B Q4 | vLLM / SGLang | Outstanding ★★★★★
Apple Silicon (M1–M4) | Qwen3.5-4B / 9B | MLX / Ollama | Smooth ★★★★
Multi-GPU A100 cluster | Qwen3.5-27B / 397B | vLLM / SGLang | Flagship ★★★★★
💡 Quantization pick: Q4_K_M hits the best balance of VRAM savings vs. model quality and is the default recommendation for local Qwen3.5 runs. Use Q6_K if you have headroom and want higher accuracy, or Q2_K only as a last resort (noticeable quality drop).

 Qwen3.5 Local Deployment — Frequently Asked Questions

What should I do when Qwen3.5 runs out of VRAM?

First try a quantized Qwen3.5 GGUF: Q4_K_M reduces VRAM usage by roughly 60% versus FP16. If that still isn't enough, drop to a smaller parameter count (e.g., 9B → 4B), or in llama.cpp use -ngl to partially offload layers to GPU while keeping the rest in system RAM. With vLLM, add --max-model-len 8192 to cut peak KV-cache memory and --gpu-memory-utilization 0.8 to lower pre-allocated VRAM.
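To see why --max-model-len helps, note that the KV cache grows linearly with context length: 2 (keys and values) × layers × KV heads × head dim × tokens × bytes per element. A quick estimator; the 9B-class architecture numbers in the comment are hypothetical, since Qwen3.5's actual config is not given here:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, dtype_bytes=2):
    """FP16 KV-cache size in decimal GB for a single full-length sequence:
    2 (K and V) x layers x KV heads x head dim x tokens x bytes/element."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * dtype_bytes / 1e9

# Hypothetical 9B-class config: 36 layers, 8 KV heads, head dim 128.
# At 32K context the cache alone needs ~4.8 GB; capping at 8K drops it to ~1.2 GB.
```

This is why shrinking the context window frees several gigabytes even though the weights themselves are untouched.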

How do I speed up Qwen3.5 local inference?

For NVIDIA GPU servers, deploy Qwen3.5 with vLLM or SGLang; both support continuous batching and FlashAttention, delivering several times the throughput of raw Transformers. For Mac users, mlx-lm is the winner: it exploits Apple Silicon's unified memory natively and typically runs noticeably faster than llama.cpp on the same M-series hardware.

Can I fine-tune Qwen3.5?

Yes. All Qwen3.5 variants support supervised fine-tuning (SFT), DPO, and GRPO workflows. The recommended training frameworks are LLaMA-Factory, ms-swift, and Unsloth, all of which have native support for Qwen3.5 and can run LoRA or full fine-tuning depending on your GPU memory budget.
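As a concrete starting point, a LoRA SFT run in LLaMA-Factory is driven by a single YAML file. The sketch below is a hypothetical config: the model path, dataset name, and hyperparameters are placeholders to adapt, with field names following LLaMA-Factory's published config convention:

```yaml
# Hypothetical LLaMA-Factory LoRA SFT config (qwen3.5_lora_sft.yaml)
model_name_or_path: Qwen/Qwen3.5-9B-Instruct
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 16
lora_target: all
dataset: my_dataset            # placeholder: register your own dataset first
template: qwen
cutoff_len: 4096
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3
output_dir: saves/qwen3.5-9b-lora
```

Launch the run with `llamafactory-cli train qwen3.5_lora_sft.yaml`.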

Where can I download Qwen3.5 model files?

All Qwen3.5 weights are open-sourced on Hugging Face Hub (organization: Qwen) and ModelScope (for faster downloads inside China). For GGUF-quantized versions, search for repos ending in -GGUF. For MLX versions, search the mlx-community organization on Hugging Face. You can set VLLM_USE_MODELSCOPE=true or SGLANG_USE_MODELSCOPE=true to auto-route downloads through ModelScope.

Does Qwen3.5 support vision / images?

Yes. Qwen3.5 builds in multimodal image understanding via early-fusion training, achieving parity with dedicated vision-language models from previous generations. For vision support with llama.cpp, look for the vision-enabled GGUF variants. With MLX, use mlx-vlm instead of mlx-lm for image+text inference.