LLM Self-hosted Deployment Roadmap: From Zero to Production
Deploy Vision Language Models for about **$187/month** with our automation guide. Achieve **4000+ documents/hour processing** with minimal accuracy loss using advanced quantization (W8A8, FP8 KV cache) on consumer-grade hardware such as the RTX 4000 Ada. Our one-command deployment system includes automated RunPod integration, real-time monitoring, and production-grade performance benchmarks, turning complex VLM deployment into a 30-second, cost-effective operation. Real-world testing shows 1.92 requests/sec with 97-99% accuracy retention, making enterprise-grade document processing accessible to teams of any size.
Complete Guide to Production-Ready Qwen2.5-VL Deployment with Full Automation
This guide focuses on production-grade deployment with automated pod creation, monitoring, and scaling for single-node setups using RunPod's infrastructure.
Motivation
First of all, many thanks to the community, LLM researchers, and open-source model providers for their work and contributions to the field of LLMs.
My ambition is to use a real-world use case to provide a complete guide to production-ready deployment of your favorite LLMs, with a focus on automation and scalability.
I will focus on VLMs since that is closest to what I've been working on, but you can "easily" generalize the guide to any LLM.
For the sake of simplicity, I will also focus on deployment on a single node with a single GPU. Multi-node deployment is more complex and will be covered in a future guide. If any of you are interested, please let me know or reach out to me on LinkedIn.
Moreover, it's important to understand that each of the following sections is a very broad subject in its own right. I will try to make them as self-contained as possible so that this use-case exploration is self-sufficient.
See this more as a handbook: by no means exhaustive, but very useful for understanding the different aspects of the deployment.
Table of Contents
Production Overview & Requirements
Memory Calculation Framework
Understanding Quantization Strategies
GPU Selection & Theoretical Requirements
Inference Engine Comparison
Automated Pod Deployment
Production Monitoring & Throughput
Real-World Testing & Validation
Production Overview & Requirements
The main goal of this deployment is to provide a high-volume document processing service. The main requirements are:
Performance: 2000+ documents/hour sustained
Precision: 90%+ accuracy
Reliability: 99% uptime
Latency: < 20 seconds per document (not expected to be real-time)
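Before going further, it helps to translate these requirements into concrete deployment targets. The sketch below is only a rough conversion; the concurrency estimate assumes requests are processed in parallel batches.

```python
# Rough capacity math derived from the stated requirements (illustrative only).
target_docs_per_hour = 2000
max_latency_s = 20  # per-document latency budget

required_throughput = target_docs_per_hour / 3600            # ~0.56 documents/sec sustained
# Little's law: in-flight requests ~= arrival rate x time in system.
required_concurrency = required_throughput * max_latency_s   # ~11 concurrent requests

print(f"Required throughput : {required_throughput:.2f} docs/sec")
print(f"Required concurrency: {required_concurrency:.0f} in-flight requests (worst case)")
```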
Production-Grade Deployment Characteristics
For production deployments, we first need to define the requirements. For that, we will navigate between the basic expectations any system should meet and the requirements specific to LLMs.
For a first step, we will focus on a single-pod deployment.
Architecture of the Self-Deployment Strategy
Before diving into any technical choices, we need to understand the basics of LLM deployments. Quantization is a key factor in the deployment of LLMs. By now, I hope the notion of model size is familiar to everyone, but let's recap it quickly.
Quick breakdown :
Model Size: when we talk about the size of a model, we are talking about its number of parameters. Those parameters are the weights of the model learned on a dataset.
Model Precision: the precision of the model weights and activations. Today the main precisions are FP32, FP16, INT8, and INT4, meaning each parameter is stored with a different number of bits: FP32 = 32 bits, FP16 = 16 bits, INT8 = 8 bits, INT4 = 4 bits.
Model Memory: this is what lets us estimate the memory footprint of the model: Model Memory = Number of Parameters × Bytes per Parameter.
KV Cache: a cache of the attention keys and values computed for previous tokens, kept so they do not have to be recomputed at every new token prediction.
Activations: the outputs of the model's layers. During training, activations must be stored because they are needed to compute gradients during backpropagation. During inference, they are only needed transiently to produce the output.
Memory Calculation Framework
Understanding GPU memory requirements is essential for production planning and quantization decisions:
Memory Components Breakdown
1. Model Weights:
Base Memory = Parameters × Precision_Bytes
FP16: 2 bytes per parameter
INT8: 1 byte per parameter
INT4: 0.5 bytes per parameter
2. KV Cache (the key-value cache of the transformer's attention layers):
KV Cache Memory = 2 × Layers × KV_Heads × Head_Dim × Sequence_Length × Batch_Size × Precision_Bytes
3. Activations:
Note: Activations are generally not quantized during inference; they are kept in FP16. Activation memory is not a bottleneck for inference because, unlike training, we do not need to store them for backpropagation.
4. Framework Overhead:
Overhead = 10-20% of total memory (PyTorch, CUDA kernels, etc.)
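To make these components concrete, here is a small back-of-the-envelope script. The architecture numbers (28 layers, 4 KV heads, head dimension 128) are assumptions taken from the public Qwen2.5-VL-7B configuration, and the results roughly reproduce the figures quoted later in this guide.

```python
# Back-of-the-envelope memory estimate for Qwen2.5-VL-7B (illustrative).
# Assumed architecture values (from the public model config): 28 layers,
# 4 KV heads (GQA), head dimension 128. Adjust these for your own model.
params = 7.6e9
layers, kv_heads, head_dim = 28, 4, 128
GB = 1e9

def weight_memory_gb(bytes_per_param: float) -> float:
    return params * bytes_per_param / GB

def kv_cache_gb(seq_len: int, bytes_per_elem: float = 2, batch: int = 1) -> float:
    # Two tensors (K and V) per layer, per token, per KV head.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / GB

for name, b in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name} weights: {weight_memory_gb(b):5.1f} GB")

for ctx in (8_192, 32_768, 131_072):
    print(f"KV cache @ {ctx:>6} tokens: "
          f"{kv_cache_gb(ctx):.2f} GB (FP16) | {kv_cache_gb(ctx, 1):.2f} GB (FP8)")

# Remember to add ~10-20% framework overhead on top of weights + KV cache + activations.
```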
Critical Understanding: What Gets Quantized
Model Weights (What Gets Quantized):
FP16: 2 bytes per parameter
INT8: 1 byte per parameter
INT4: 0.5 bytes per parameter
KV Cache & Activations (Usually NOT Quantized):
Traditional methods (AWQ/GPTQ/GGUF): Weights are quantized, but KV cache and activations remain FP16 (2 bytes per element)
Advanced methods (W8A8): Both weights AND activations can be quantized to 8-bit
Reason: Numerical stability and accuracy preservation
Why This Matters:
Weight quantization saves memory on model storage
KV cache grows linearly with sequence length and remains unquantized (in traditional methods)
For long sequences, unquantized KV cache can dominate total memory usage
This is why quantization benefits diminish with very long contexts
Inference vs Training: No need to store intermediate activations for backprop
For 32K context: Model weights dominate memory usage (1.87GB KV cache vs 15.2GB FP16 weights)
INT4 quantization saves ~8.87GB, INT8 saves ~11.87GB from model weights
KV cache remains 1.87GB regardless of weight quantization (traditional methods)
W8A8 advantage: Also quantizes activations (0.7GB → 0.35GB)
Activations minimal: Only 0.7GB for inference vs 1.5GB+ for training
⚠️ Important Reality Check: For long document processing (>8K tokens), the KV cache becomes the bottleneck, not the model weights.
Memory Scaling Example (Qwen2.5-VL-7B AWQ):
Model weights (INT4): ~3.8GB
KV cache at 8K tokens: ~0.46GB
KV cache at 32K tokens: ~1.87GB
KV cache at 128K tokens: ~7.48GB
This understanding is crucial for choosing the right quantization strategy for your use case.
Understanding Quantization Strategies
Quantization is the process of reducing the precision of model weights to shrink the memory footprint and improve inference speed. Important: most modern quantization methods (AWQ, GPTQ, GGUF) only quantize model weights, while activations and the KV cache typically remain in higher precision (FP16) for numerical stability and accuracy. (This is in contrast to the rarer W8A8 methods, which quantize both weights and activations to 8 bits.)
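To make the principle concrete, here is a minimal sketch of symmetric, per-tensor INT8 weight quantization. Real methods such as AWQ and GPTQ use per-channel or per-group scales plus calibration data, so this only illustrates the basic idea.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    # Map the weight range symmetrically onto [-127, 127] with a single scale.
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Approximate reconstruction of the original FP32 weights.
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)

print(f"FP32 size: {w.nbytes / 1e6:.0f} MB -> INT8 size: {q.nbytes / 1e6:.0f} MB")
print(f"Mean absolute reconstruction error: {np.abs(w - dequantize(q, scale)).mean():.5f}")
```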
Quantization Impact Analysis
Understanding quantization is crucial for production deployment decisions. Here's a comprehensive breakdown:
1. GGUF (GPT-Generated Unified Format)
What is GGUF:
Successor to GGML format, optimized for llama.cpp
Designed for CPU and consumer GPU inference
Excellent for edge deployment and resource-constrained environments
GGUF Quantization Levels:
| Format | Effective Bits | Memory Usage | Quality Loss | Best Use Case |
|--------|----------------|--------------|--------------|---------------|
| Q2_K | ~2.6 bits | 25% of FP16 | High (15-20%) | Extreme memory constraints |
| Q3_K_S | ~3.5 bits | 35% of FP16 | Medium (10-15%) | Mobile/edge deployment |
| Q4_0 | 4.5 bits | 45% of FP16 | Low (5-8%) | Balanced production |
| Q4_K_M | ~4.8 bits | 48% of FP16 | Very Low (3-5%) | Recommended for production |
| Q5_K_S | ~5.5 bits | 55% of FP16 | Minimal (2-3%) | High-quality production |
| Q6_K | 6.6 bits | 66% of FP16 | Negligible (1-2%) | Near-original quality |
| Q8_0 | 8.5 bits | 85% of FP16 | Negligible (<1%) | Quality-critical applications |
GGUF Pros:
Excellent CPU performance
Wide hardware compatibility
Mature ecosystem (llama.cpp)
Good compression ratios
GGUF Cons:
Primarily CPU-focused
Limited GPU acceleration compared to other formats
Not optimal for high-throughput production
2. AWQ (Activation-aware Weight Quantization)
What is AWQ:
Quantizes only model weights to INT4 based on activation patterns
Activations and KV cache remain in FP16 for numerical stability
Optimized for GPU inference with minimal quality loss
AWQ Characteristics:
Weight Precision: INT4 (4-bit weights only)
Activations/KV Cache: FP16 (unquantized)
Memory Reduction: ~75% for model weights, but KV cache dominates for long sequences
Quality Retention: 95-98% of original performance
GPU Optimization: Designed for CUDA acceleration
AWQ Production Benefits:
Excellent GPU utilization
High inference speed
Good quality preservation
Wide model support
AWQ Limitations:
Requires CUDA-compatible GPUs
KV cache remains unquantized (memory bottleneck for long sequences)
Conversion process can be time-consuming
Memory scaling: For long contexts, unquantized KV cache can use more memory than quantized weights
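As a quick sanity check of an AWQ checkpoint combined with an FP8 KV cache, a minimal offline test using vLLM's Python API might look like the sketch below. It assumes a recent vLLM build with FP8 KV cache support; the same engine arguments become CLI flags when we launch the OpenAI-compatible server later in this guide.

```python
from vllm import LLM, SamplingParams

# Minimal offline smoke test (text-only) of an AWQ model with an FP8 KV cache.
# Assumes a recent vLLM build; adjust max_model_len to your context needs.
llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct-AWQ",
    quantization="awq",            # weight-only INT4 quantization
    kv_cache_dtype="fp8",          # quantize the KV cache as well
    max_model_len=8128,
    gpu_memory_utilization=0.95,
)

outputs = llm.generate(
    ["Explain in one sentence what a key-value cache does in a transformer."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```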
3. GPTQ (GPT Quantization)
What is GPTQ:
Post-training quantization using optimal brain surgeon approach
Quantizes only model weights, activations and KV cache remain FP16
Layer-wise quantization with configurable bit widths
🥈 Tier 2 (Solid Alternatives):
GPTQ INT4 + FP8 KV Cache: Good alternative with KV cache benefits
BitsAndBytes INT8 + FP8 KV Cache: Excellent for HuggingFace pipelines
GPTQ INT8: Conservative quantization approach
Use Case: Moderate throughput, HF ecosystem integration, concurrent batching
🥉 Tier 3 (Specialized Cases):
BitsAndBytes NF4: Good quality with HF integration
GGUF Q4_K_M: For CPU-heavy deployments
GGUF Q8_0: Maximum quality, higher memory usage
Use Case: Development, testing, quality-critical applications
🚀 New Tier 1+ (Cutting Edge Production):
AWQ INT4 + FP8 KV Cache: Enables RTX 3080/4070 for 32K contexts with batching
Memory: 7.7GB total for 32K context (vs 9.8GB without KV quantization)
Benefit: 2x more concurrent requests OR 2x longer contexts
Status: Production-ready in vLLM, experimental in others
⚠️ Important Reality Check (repeated because it matters): for long document processing (>8K tokens), the KV cache, not the model weights, becomes the bottleneck. This is exactly the problem the next section addresses.
KV Cache Quantization: The Next Frontier
Why KV Cache Quantization Matters: as we've seen, the KV cache starts to have a significant impact on memory usage for long contexts. KV cache quantization addresses this directly by reducing the precision of the cached key-value pairs.
FP8 KV Cache Implementation
Current State (2024-2025):
FP8 KV Cache: Available in vLLM and other inference engines
Memory Benefits: ~2x more space for KV cache allocation
Quality: Minimal degradation with proper calibration
Performance Impact:
Throughput: Approximately double the KV cache capacity
Enables: Either longer context lengths OR more concurrent batches
Latency: Currently no improvements (no fused dequantization yet)
Future: Fused operations will improve latency
Production Benefits:
✅ 2x more concurrent requests at the same memory
✅ 2x longer context lengths at the same memory
✅ Better GPU memory utilization
✅ Reduced memory bottlenecks
⚠️ No latency improvements yet
⚠️ Still experimental in some engines
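To see where the "2x more concurrent requests" figure comes from, here is a rough estimate based on the per-token KV size derived in the memory section (same assumed Qwen2.5-VL-7B configuration); the 12 GB KV budget is a hypothetical number for illustration.

```python
# How many 8K-token requests fit into a given KV cache budget? (illustrative)
layers, kv_heads, head_dim = 28, 4, 128   # assumed Qwen2.5-VL-7B config
ctx_tokens = 8_128
kv_budget_gb = 12.0                       # hypothetical VRAM left after weights + overhead

def kv_per_request_gb(bytes_per_elem: int) -> float:
    return 2 * layers * kv_heads * head_dim * ctx_tokens * bytes_per_elem / 1e9

for name, b in [("FP16 KV", 2), ("FP8 KV", 1)]:
    per_req = kv_per_request_gb(b)
    print(f"{name}: {per_req:.2f} GB per 8K request -> "
          f"{int(kv_budget_gb // per_req)} concurrent requests in {kv_budget_gb:.0f} GB")
```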
Memory Impact with KV Cache Quantization
Qwen2.5-VL-7B with FP8 KV Cache (32K context, INFERENCE ONLY):
| Configuration | Model Weights | KV Cache | Activations | Framework Overhead | Total Required |
|---------------|---------------|----------|-------------|--------------------|----------------|
| AWQ INT4 + FP16 KV | 3.8 GB | 1.87 GB | 0.7 GB | 1.5 GB | 8.07 GB |
| AWQ INT4 + FP8 KV | 3.8 GB | 0.935 GB | 0.7 GB | 1.3 GB | 6.385 GB |
| W8A8-INT + FP8 KV | 7.6 GB | 0.935 GB | 0.35 GB | 1.3 GB | 10.185 GB |
| W8A8-FP + FP8 KV | 7.6 GB | 0.935 GB | 0.35 GB | 1.3 GB | 10.185 GB |
| BnB INT8 + FP8 KV | 7.6 GB | 0.935 GB | 0.7 GB | 1.5 GB | 10.735 GB |
Key Insights:
FP8 KV cache halves the memory used by the cache component (1.87GB → 0.935GB)
Total memory savings: ~21% when combining AWQ + FP8 KV cache (8.07GB → 6.385GB)
Now that we understand the memory calculation framework and quantization strategies, let's apply this knowledge to select the optimal GPU configuration for production deployment.
GPU Selection Matrix
Production-Ready GPU Options
💰 Major Price Changes (May 2025):
RTX 4000 Ada: $0.26/hr (NEW - excellent budget option)
RTX 4090: $0.69/hr (unchanged)
RTX 6000 Ada: $0.77/hr (down from $1.20 - 36% reduction!)
H100 variants: $2.39-$6.39/hr (premium tier)
L40/L40S: $0.43-$0.86/hr (good mid-range options)
🎯 Production Sweet Spots (May 2025):
Ultra-Budget Champion:
Qwen2.5-VL-3B with AWQ + FP8 KV Cache on RTX 4000 Ada
Context: Up to 32K tokens
Cost: $0.26/hr
Memory: 3.6GB required, 20GB available
Use Case: Ultra-cost-sensitive production, development, testing
Production Budget Champion:
Qwen2.5-VL-7B with AWQ + FP8 KV Cache on RTX 4000 Ada
Context: Up to 32K tokens
Cost: $0.26/hr
Memory: 6.4GB required, 20GB available
Use Case: Cost-sensitive production deployments with excellent performance
Performance Leader:
Qwen2.5-VL-7B with W8A8 quantization + FP8 KV Cache on RTX 4000 Ada
Context: Up to 32K tokens with maximum throughput
Cost: $0.26/hr
Memory: 10.2GB required, 20GB available
Use Case: High-throughput production (1.6x speedup) at budget pricing
32B Model Champion:
Qwen2.5-VL-32B with AWQ + FP8 KV Cache on RTX 6000 Ada
Context: Up to 8K tokens
Cost: $0.77/hr
Memory: 20.3GB required, 48GB available
Use Case: Large model deployment with excellent memory efficiency
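To relate these hourly prices back to the 2000+ documents/hour requirement, here is a simple cost sketch. It assumes an always-on pod and sustained throughput; note that the RTX 4000 Ada comes out to roughly the $187/month figure quoted in the introduction.

```python
# Cost per processed document for a continuously running pod (illustrative).
gpu_hourly_prices = {
    "RTX 4000 Ada": 0.26,   # $/hr (May 2025 pricing from above)
    "RTX 4090":     0.69,
    "RTX 6000 Ada": 0.77,
}
docs_per_hour = 2000  # the stated performance requirement

for gpu, hourly in gpu_hourly_prices.items():
    cost_per_1k_docs = hourly / docs_per_hour * 1000
    monthly = hourly * 24 * 30
    print(f"{gpu:<13}: ${cost_per_1k_docs:.3f} per 1,000 docs | ~${monthly:,.0f}/month always-on")
```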
Inference Engine Comparison
Production Inference Engine Analysis
Choosing the right inference engine is critical for production performance:
1. vLLM (Recommended for Production)
What is vLLM:
High-performance inference engine for large language models
Optimized for throughput and memory efficiency
Built-in support for continuous batching and paged attention
vLLM Advantages:
PagedAttention: Efficient KV cache management
Continuous Batching: Process requests as they arrive
High Throughput: 2-24x faster than naive implementations
Memory Efficiency: Optimal GPU memory utilization
Production Features: Built-in API server, metrics, health checks
vLLM Performance Characteristics:
Latency: Optimized for throughput over latency
Concurrency: Excellent multi-request handling
Memory: Efficient KV cache management
Scalability: Handles varying request loads well
vLLM Production Benefits:
✅ Built-in OpenAI-compatible API
✅ Automatic request batching
✅ Memory optimization
✅ Production monitoring
✅ Easy deployment
✅ Active development and support
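Since monitoring is one of our requirements, it helps that the vLLM OpenAI-compatible server also exposes a health probe and Prometheus-style metrics. A minimal readiness check could look like the sketch below (endpoint paths assume a standard vLLM API server; metric names can vary across versions).

```python
import requests

# Minimal readiness/metrics probe for a vLLM OpenAI-compatible server.
# Adjust the base URL for your pod; endpoint paths assume a standard vLLM deployment.
BASE_URL = "http://localhost:8000"

health = requests.get(f"{BASE_URL}/health", timeout=5)
print("Health:", health.status_code)   # 200 once the engine is ready to serve

metrics = requests.get(f"{BASE_URL}/metrics", timeout=5)
# Prometheus-style gauges/counters, e.g. running vs. waiting requests, KV cache usage.
for line in metrics.text.splitlines():
    if line.startswith("vllm:num_requests"):
        print(line)
```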
Production Recommendation: vLLM
Why vLLM for Production:
Ease of Deployment: Simple Docker container setup
Production Features: Built-in monitoring, health checks, metrics
Performance: Excellent throughput for document processing
Compatibility: OpenAI-compatible API
Community: Active development and support
Reliability: Proven in production environments
vLLM Production Architecture:
Internet → Load Balancer → vLLM API Server → GPU → Response
(monitoring & logging hooked into the vLLM API server)
RunPod Deployment
Before automating the deployment, let's manually deploy a few models to get a feel for the process.
1. Create a new RunPod account/project
2. Choose pod specs
The RTX 4090 seems to be the best option at the moment for the models I want to deploy and benchmark,
but I want to try something stronger, so let's go with the RTX 6000 Ada (and maybe the RTX 5090 later).
3. Image Template
We could pick a base Ubuntu image (e.g. 22.04), but for the sake of simplicity we will go with a preconfigured vLLM image that serves the OpenAI-compatible API.
That will help us iterate more quickly. Essentially, it saves us from installing and configuring everything ourselves:
- CUDA, PyTorch, vLLM, and launching the server via
vllm.entrypoints.openai.api_server
Step 2 : Model & Run configs
Once template vLLM is chosen, it allows us to quickly iterate and change models
the VLLM Parameters we will be mainly playing with:
--model $Model
--enforce-eager # Disable CUDA graphs (faster startup, lower memory footprint)
--gpu-memory-utilization 0.95
--api-key test-key-1234
--max-model-len 8128
--dtype float16
--kv-cache-dtype fp8 # Quantize the KV cache to FP8
--calculate-kv-scales # Calculate the KV cache scale
Here are some models I would like to benchmark:
Qwen/Qwen2.5-VL-3B-Instruct-AWQ
Qwen/Qwen2.5-VL-3B-Instruct
RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8
Qwen/Qwen2.5-VL-7B-Instruct-AWQ
Qwen/Qwen2.5-VL-7B-Instruct
Qwen/Qwen2.5-VL-32B-Instruct-AWQ
reducto/RolmOCR
reducto/RolmOCR:Q4_K_M
Note: The FlashAttention backend cannot be used with an FP8 KV cache. Use the FlashInfer backend for better performance with FP8 KV cache by setting the environment variable VLLM_ATTENTION_BACKEND=FLASHINFER.
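Once the server is up, requests go through the standard OpenAI-compatible chat endpoint. A minimal document-extraction call might look like the sketch below; the pod URL, file name, and prompt are placeholders, and the API key matches the --api-key flag above.

```python
import base64
from openai import OpenAI

# Minimal client call against the vLLM OpenAI-compatible endpoint.
# The pod URL, image file, and prompt below are placeholders.
client = OpenAI(
    base_url="https://<your-pod-id>-8000.proxy.runpod.net/v1",
    api_key="test-key-1234",   # matches the --api-key flag passed to vLLM
)

with open("sample_document.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct-AWQ",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the key fields of this document as JSON."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=512,
    temperature=0.0,
)
print(response.choices[0].message.content)
```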
5. Real-World Benchmark Results: RTX 6000 Ada Performance Analysis
Testing Environment:
GPU: RTX 6000 Ada (48GB VRAM)
Context Length: 8,128 tokens
GPU Memory Utilization: 95%
vLLM Version: Latest with FP8 KV cache support
📊 Performance Summary
| Model | Quantization | Requests/Sec | Avg Latency (s) | Success Rate | Avg Completion Tokens |
|-------|--------------|--------------|-----------------|--------------|-----------------------|
| Qwen2.5-VL-3B | None | 1.67 | 11.57 | 100% | 375.0 |
| Qwen2.5-VL-3B | AWQ | 1.91 | 10.16 | 100%* | 375.0 |
| Qwen2.5-VL-7B | W8A8 | 1.79 | 11.15 | 100% | 485.6 |
| Qwen2.5-VL-7B | AWQ | 1.92 | 10.03 | 100% | 423.0 |
| Qwen2.5-VL-7B | None | 1.65 | 11.55 | 100% | 442.8 |
| Qwen2.5-VL-32B | AWQ | 0.38 | 23.66 | 100% | 473.8 |
| RolmOCR-8B | None | 1.88 | 10.21 | 100% | 371.8 |
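For reference, results in this shape can be reproduced with a simple concurrent load-test harness along the following lines (a hypothetical sketch, not the exact script used for the table above).

```python
import asyncio
import time
from openai import AsyncOpenAI

# Hypothetical load-test sketch: fire N concurrent requests, then report
# throughput (requests/sec) and mean latency. Placeholders as in the client example.
client = AsyncOpenAI(
    base_url="https://<your-pod-id>-8000.proxy.runpod.net/v1",
    api_key="test-key-1234",
)

async def one_request(prompt: str) -> float:
    start = time.perf_counter()
    await client.chat.completions.create(
        model="Qwen/Qwen2.5-VL-7B-Instruct-AWQ",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=400,
    )
    return time.perf_counter() - start

async def main(n_requests: int = 50) -> None:
    start = time.perf_counter()
    latencies = await asyncio.gather(
        *[one_request("Describe the layout of a typical invoice.") for _ in range(n_requests)]
    )
    elapsed = time.perf_counter() - start
    print(f"Throughput : {n_requests / elapsed:.2f} requests/sec")
    print(f"Avg latency: {sum(latencies) / len(latencies):.2f} s")

asyncio.run(main())
```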
📈 Detailed Latency Statistics
Latency Diagrams:
Precision Diagrams:
A pleasant surprise: RolmOCR seems to be a sweet spot, with great performance.
This use case has grown quite large, so a second part will cover how to automate and monitor all of this to reproduce these results.