PagedAttention dramatically changed the memory efficiency of LLM inference. The idea of solving memory fragmentation with virtual-memory concepts is brilliant.
• Cuts KV-cache waste to theoretically zero
• Maximizes batch size via dynamic block allocation
A landmark technique that reshaped the design philosophy of inference engines.
#vLLM #LLM
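For anyone new to the idea, here's a toy Python sketch of the paging scheme (illustrative only, not vLLM's actual code): the KV cache is carved into fixed-size blocks, each sequence keeps a "block table" mapping logical token positions to physical blocks, and waste is bounded by one partially filled block per sequence.

```python
# Toy illustration of PagedAttention-style KV-cache paging (not vLLM's real code).
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size is also 16)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))    # physical block pool
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical blocks

    def append_token(self, seq_id: int, num_tokens_so_far: int) -> None:
        """Allocate a new physical block only when the last one fills up."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:   # last block is full
            table.append(self.free_blocks.pop())  # grab any free block

    def free_sequence(self, seq_id: int) -> None:
        """On completion, all blocks return to the pool -- no fragmentation."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=1024)
for t in range(40):  # a 40-token sequence needs ceil(40/16) = 3 blocks
    cache.append_token(seq_id=0, num_tokens_so_far=t)
print(len(cache.block_tables[0]))  # -> 3
```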
vLLM’s new PagedAttention slashes latency, cranks up GPU inference, and lets you batch continuously for production LLM workloads. Curious how it beats the OpenAI API? Dive in! #vLLM #PagedAttention #GPUInference
🔗 aidailypost.com/news/vllm-bo...
Maintaining separate attention kernels for every GPU platform doesn't scale.
Hence, for the #vLLM #Triton #attention backend, we took a different approach: ~800 LoC Triton for NVIDIA and AMD GPUs, with SOTA performance on both.
📖 Deep dive: blog.vllm.ai/2026/03/04/v...
@pytorch.org #OpenSourceAI
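For context on why Triton makes a single-source backend possible: a Triton kernel is written once in Python and compiles for both NVIDIA and AMD GPUs. A minimal toy kernel below (nothing like a real attention kernel, just the portability idea):

```python
# Toy Triton kernel: vector add. One Python source, compiles on both vendors.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized chunk of the vectors.
    offsets = tl.program_id(axis=0) * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

n = 4096
# "cuda" is also the device label on AMD ROCm builds of PyTorch.
x = torch.randn(n, device="cuda")
y = torch.randn(n, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
assert torch.allclose(out, x + y)
```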
vLLM 0.17 Ships FlashAttention 4 and Live MoE Scaling
awesomeagents.ai/news/vllm-0-17-0-flashat...
#vLLM #Inference #OpenSource
Your own cloud-hosted LLM on 16 GB of VRAM, part 1: basic build, tools, and MCP. Hi Habr! Amid all the hype around neur...
#langchain #langgraph #python #vllm #qwen3 #localai #selectel #MCP #AI-agents #API-service
Doesn't work out of the box: getting the latest big LLMs running. Lately, open models of extremely large si...
#Kimi-K2.5 #DeepSeek-v3.2 #GLM-5 #Qwen3.5 #vllm #B200
🚀 Docker Model Runner brings vLLM to macOS on Apple Silicon
vLLM, the leading inference engine, now runs on macOS thanks to vllm-metal.
www.docker.com/blog/docker-model-runner...
#vLLM #AppleSilicon #MLOps #Docker #RoxsRoss
Complete guide to LLM hosting in 2026. Compare Ollama, vLLM, Docker Model Runner, LocalAI and cloud providers. Learn cost, performance, and infrastructure trade-offs:
www.glukhov.org/llm-hosting/
#AI #LLM #hosting #Self-Hosting #SelfHosting #ollama #vllm #infrastructure
Set up an open-source model with #Ollama or #vLLM, but unsure how to connect it to Claude Code?
Don't worry, we've got you covered 💪
Then run 'gpu llm run' from your terminal of choice, select whether to use #Ollama or #vLLM for inference, and pick the model you want.
Here we're opting for the #Z.ai model GLM-4.7 Flash.
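Once the model is up, you can sanity-check it from code. A minimal sketch, assuming vLLM's OpenAI-compatible server on its default port 8000 (the model name below is illustrative; query /v1/models for the real one):

```python
# Quick sanity check against a locally served model. Assumes vLLM's
# OpenAI-compatible server on its default port 8000; for Ollama, point
# base_url at http://localhost:11434/v1 instead.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

print([m.id for m in client.models.list()])  # confirm the served model name

resp = client.chat.completions.create(
    model="glm-4.7-flash",  # illustrative; use the name printed above
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```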
The Hidden Engineering Behind Fast AI: How LLM Inference Actually Works
techlife.blog/posts/llm-in...
#LLM #Inference #PagedAttention #vLLM #FlashAttention #SpeculativeDecoding #MachineLearning #GPUOptimization #KVCache
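One concrete piece of that hidden engineering is KV-cache sizing, which is why techniques like PagedAttention exist at all. A back-of-the-envelope calculation, with illustrative Llama-2-7B-style numbers:

```python
# Back-of-the-envelope KV-cache sizing (illustrative Llama-2-7B-like numbers).
# Per token, each layer stores one key and one value vector in fp16 (2 bytes).
num_layers   = 32
num_kv_heads = 32
head_dim     = 128
bytes_fp16   = 2

bytes_per_token = num_layers * num_kv_heads * head_dim * 2 * bytes_fp16  # K and V
print(bytes_per_token)                    # 524288 bytes = 512 KiB per token

context = 4096
print(bytes_per_token * context / 2**30)  # ~2 GiB of KV cache per sequence
```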
Where can you pre-prod #Dev test leading #OpenSource #LLM #AI models that aren't 'walled garden' & US-monitored #AmericanAI?
Synthetic.new has #PrivacyFirst runnable model choices like #KimiK2-Thinking, #MiniMax2.1, #Qwen3 ++. #vLLM support & usable as an #OpenAI-compatible endpoint in tools like #Roo #Cline ++
Remote #GPU network volumes shouldn't require a config file, a cloud console, and 20 minutes of your life.
With GPU CLI, adding a volume is as simple as yes or no.
#Ollama #vLLM #ComfyUI
Today kicks off @jfokus.se in Stockholm 🇸🇪 and we just delivered our workshop on building with open source AI models using:
⚡️ #vLLM to serve local LLMs as a local API endpoint
🦜 @langchain4j.dev for adding LLM capabilities in our Java application
Was a huge hit! Slides ⬇️
📉 "훈련은 밑 빠진 독에 물 붓기?"
시드 투자로만 2,182억 원 챙긴 '인퍼랙트'의 독설
오픈소스 추론 엔진의 끝판왕 'vLLM' 팀이 만든 인퍼랙트가 전장에 뛰어들었습니다. 이제 AI 산업의 승자는 '누가 더 큰 모델을 가졌느냐'가 아니라 '누가 더 효율적으로 추론하느냐'에서 갈릴 것입니다.
www.aipostkorea.com/news/article...
#인퍼랙트 #Inferact #vLLM #사이먼모 #AI인프라 #추론의경제학 #시드투자 #a16z #테크트렌드
Inferact raises $150M to commercialize vLLM, enhancing AI inference efficiency. Backed by Andreessen Horowitz & Lightspeed. #AI #Inference #TechFunding #vLLM #Inferact Link: thedailytechfeed.com/inferact-rai...
Andreessen Horowitz just pumped $150M into Inferact’s seed round, pushing its valuation to $800M. The startup’s open‑source vLLM engine could reshape AI model inference. Curious? Dive in. #Inferact #vLLM #SeedFunding
🔗 aidailypost.com/news/andrees...
Nice example of a production #vLLM setup on 𝗡𝗲𝗯𝗶𝘂𝘀 with terraform, managed K8s, inference, and observability all in one place.
This can serve as a reference stack builders can use without reinventing the basics 💡.
👨🏻💻 full code on our repo.
github.com/CloudThrill/vllm-production-stack-terraform
📢 𝗡𝗲𝘄 𝘁𝗲𝗿𝗿𝗮𝗳𝗼𝗿𝗺 #vLLM 𝗣𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝗦𝘁𝗮𝗰𝗸 𝗔𝗰𝗿𝗼𝘀𝘀 𝗖𝗹𝗼𝘂𝗱𝘀 🧑🏼🚀 | 𝗣𝗮𝗿𝘁 𝟰: 𝗡𝗲𝗯𝗶𝘂𝘀 𝗖𝗹𝗼𝘂𝗱 💚
🔎 𝗪𝗵𝗮𝘁 𝘆𝗼𝘂'𝗹𝗹 𝗱𝗲𝗽𝗹𝗼𝘆:
✅ Enterprise-grade GPU inference
✅ Secure vllm endpoints (LetsEncrypt)
✅ Full observability: Grafana + vLLM dashboards
✅ Lightning-fast deployment
👉 read the guide: tinyurl.com/Nebiusvllm
Opinion: another step forward for scalable agentic workloads in 2026
#huggingface #vllm #openai #llm #ai #artificial-intelligence #langchain #llama-index #sglang
@jfokus.se is BACK for its 20th year and I’m so happy to be hosting a workshop on open source models & how to scale them up on #Kubernetes! We’ll feature projects including #vLLM + @langchain4j.dev + @promptfoo.bsky.social and more for enterprise AI deployment, app dev, and testing 🔥
[Translation] How prompt caching works: PagedAttention and automatic prefix caching, plus practic...
#PromptCaching #prefill #decoding #inference #LLM #vLLM #PagedAttention #PrefixCaching #fragmentation
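In vLLM, the prefix-caching half of that story is a one-flag feature. A minimal sketch using the offline LLM API (model choice here is arbitrary):

```python
# Prefix caching in vLLM: identical prompt prefixes reuse cached KV blocks
# instead of recomputing the prefill.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", enable_prefix_caching=True)

system = "You are a support bot for ACME. Answer briefly.\n\n"  # shared prefix
params = SamplingParams(max_tokens=64)

# The second call hits the cached KV blocks for `system`, so its prefill
# only covers the new suffix.
out1 = llm.generate([system + "How do I reset my password?"], params)
print(out1[0].outputs[0].text)
out2 = llm.generate([system + "How do I delete my account?"], params)
print(out2[0].outputs[0].text)
```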
🏆Ranked #2 most-read in 2025 - #vLLM for Beginners (Key features)
2️⃣ Here’s the most exhaustive list of vLLM features you wish you knew. 👇
📖 cloudthrill.ca/what-is-vllm...
Learn what makes #vLLM the 𝗥𝗼𝗹𝗹𝘀 𝗥𝗼𝘆𝗰𝗲 of inference in production ✨. #AIForBeginners
The LLM inference landscape is exploding.
Should you use the data center standard #vLLM, local favorite #Ollama or the radical newcomer #ZML?
I applied the rigorous #QSOS method to compare these engines on features, performance and operational ease.
Link to full article in comment.
#TechAtWorldline
Meeting-LLM: DIY meeting transcription + AI analysis in a single window (T-One + GPT-OSS-20B). The internet has a huge numb...
#AI-in-Development-Season #GPT-OSS-20B #transcription #STT #T-One #vLLM #LLM #meetings
Docker Model Runner just got two big upgrades:
- Run vLLM on Windows with WSL2 + NVIDIA GPUs
- Now included in Universal Blue (Bluefin + Aurora)
Read more: https://bit.ly/3Y7WabG
Run LLMs with a single command: no setup, no GPU headaches.
#vLLM #UniversalBlue #Bluefin #Aurora
www.docker.com/blog/docker-... - setting up #vLLM on #Windows with #Docker Model Runner. Great tutorial, Dorin Geman.
📢 𝗡𝗲𝘄 𝘁𝗲𝗿𝗿𝗮𝗳𝗼𝗿𝗺 #vLLM 𝗣𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝗦𝘁𝗮𝗰𝗸 𝗔𝗰𝗿𝗼𝘀𝘀 𝗖𝗹𝗼𝘂𝗱𝘀
𝗣𝗮𝗿𝘁 𝟭: GCP 𝗚𝗞𝗘 🔵🔴🟢
🔎 𝗪𝗵𝗮𝘁 𝘆𝗼𝘂'𝗹𝗹 𝗱𝗲𝗽𝗹𝗼𝘆:
✅ Enterprise-grade infra
✅ Switch between CPU/GPU inference with a single flag
✅ Full observability: Grafana + vLLM dashboards
✅ OpenAI-compatible API
👉 read the guide: cloudthrill.ca/vllm-product...
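If you want to peek at the raw numbers feeding those Grafana dashboards, vLLM exposes Prometheus metrics at /metrics. A small sketch, assuming you've port-forwarded the service to localhost:8000 first:

```python
# Peek at vLLM's Prometheus metrics (the source behind the Grafana dashboards).
# Assumes the service is reachable locally, e.g. via
#   kubectl port-forward svc/<vllm-service> 8000:8000
import requests

metrics = requests.get("http://localhost:8000/metrics", timeout=10).text
for line in metrics.splitlines():
    # e.g. vllm:num_requests_running, vllm:num_requests_waiting, ...
    if line.startswith("vllm:num_requests"):
        print(line)
```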