Jun 18, 2026
JetFlow: New Method Breaks Speed Ceiling for Large Language Model Inference

JetFlow, a speculative decoding framework for Large Language Models, addresses scaling limitations by combining efficient one-forward drafting with branch-wise causal conditioning. It achieves up to 9.64x speedup on benchmark tasks, outperforms existing bidirectional-head and tree-based baselines, and shows additional latency gains when integrated with the vLLM serving framework.
Quick Facts
Who: HAO AI Lab researchers
What: Proposed JetFlow framework for speculative decoding
When: Submitted 16 June 2026
Where: H100 GPU systems
- Combines one-forward drafting efficiency with branch-wise causal conditioning
- Trains causal parallel draft head over fused hidden states
- Tested across math, coding, and chat benchmarks
- Integrated with vLLM for production serving
Researchers have introduced JetFlow, a framework designed to overcome fundamental scaling limitations in speculative decoding, a technique that accelerates Large Language Models (LLMs) by drafting and verifying multiple tokens in parallel. The work addresses a persistent challenge in the field: in theory a larger draft budget improves speed, but in practice the gains plateau as acceptance rates decline and drafting overhead grows.
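The plateau can be illustrated with a rough back-of-the-envelope model. The sketch below uses the standard simplified analysis of chain-style speculative decoding, not anything taken from the JetFlow paper: it assumes each drafted token is accepted independently with probability alpha, and that every drafted token costs a fixed fraction of one target-model forward pass. The names alpha, gamma, and draft_cost, and the values used in the loop, are illustrative assumptions.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumes a single draft chain of `gamma` tokens in which each token is
    accepted independently with probability `alpha` (a simplification).
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding.

    `draft_cost` is the assumed cost of drafting one token, measured in
    target-model forward passes; drafting is assumed to be sequential.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * draft_cost + 1.0)


if __name__ == "__main__":
    # Hypothetical settings chosen only to show the shape of the curve.
    for gamma in (1, 2, 4, 8, 16, 32):
        print(gamma, round(estimated_speedup(0.8, gamma, 0.05), 2))
```

Under these assumptions the estimate rises and then falls as gamma grows: the expected accepted length saturates near 1 / (1 - alpha), while sequential drafting cost keeps growing linearly with the budget.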
Speculative decoding has long faced a causality-efficiency dilemma. Traditional autoregressive drafters produce path-conditioned candidates, which suit tree-based decoding and yield higher acceptance, but their computational cost scales with tree depth. Conversely, bidirectional block-diffusion drafters generate all positions simultaneously, but because they are branch-agnostic they can produce token trees whose entries are individually plausible yet mutually inconsistent, wasting draft budget and reducing overall acceptance rates.
JetFlow resolves this tension by combining one-forward drafting efficiency with branch-wise causal conditioning. The framework trains a causal parallel draft head over fused hidden states from a frozen target model, producing candidate trees whose probability scores align with the target model's autoregressive factorization. This architectural choice enables the system to convert larger draft budgets into longer accepted token sequences and higher end-to-end speedup.
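The difference between branch-agnostic and path-conditioned scoring can be seen in a toy example. The sketch below is not JetFlow's draft head; it uses a hand-written two-token distribution (all probabilities are made up for illustration) to compare scoring a path by per-position marginals against scoring it by the autoregressive factorization p(first) * p(second | first).

```python
from itertools import product

# Hypothetical toy distribution over two-token continuations.
FIRST = {"New": 0.5, "Los": 0.5}
SECOND_GIVEN_FIRST = {
    "New": {"York": 0.9, "Angeles": 0.1},
    "Los": {"York": 0.1, "Angeles": 0.9},
}


def second_marginal() -> dict:
    """Position-2 probabilities with the first token summed out."""
    marginal = {}
    for first, p_first in FIRST.items():
        for second, p_second in SECOND_GIVEN_FIRST[first].items():
            marginal[second] = marginal.get(second, 0.0) + p_first * p_second
    return marginal


def causal_score(path: tuple) -> float:
    """Path score under the autoregressive factorization."""
    first, second = path
    return FIRST[first] * SECOND_GIVEN_FIRST[first][second]


def agnostic_score(path: tuple, marginal: dict) -> float:
    """Path score when each position is scored independently of the branch."""
    first, second = path
    return FIRST[first] * marginal[second]


PATHS = list(product(FIRST, ("York", "Angeles")))
MARGINAL = second_marginal()
CAUSAL = {path: causal_score(path) for path in PATHS}
AGNOSTIC = {path: agnostic_score(path, MARGINAL) for path in PATHS}

# With a budget of two paths, keep the two highest causal scores.
TOP2_CAUSAL = sorted(PATHS, key=CAUSAL.get, reverse=True)[:2]

if __name__ == "__main__":
    for path in PATHS:
        print(path, round(CAUSAL[path], 3), round(AGNOSTIC[path], 3))
```

In this toy setup the branch-agnostic scores cannot tell the four paths apart, so a fixed budget may be spent on inconsistent pairs such as ("New", "Angeles"), whereas the causal scores concentrate the budget on the two consistent paths.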
Performance testing on H100 GPUs demonstrates substantial improvements. The framework achieves up to 9.64x speedup on mathematical reasoning tasks (MATH-500) and 4.58x acceleration on open-ended conversational workloads. Comprehensive benchmarking across mathematics, coding, and chat tasks on both dense and Mixture-of-Experts Qwen3 models shows JetFlow consistently outperforming existing bidirectional-head and tree-based speculative decoding baselines. Additional latency improvements are demonstrated through integration with vLLM, a production serving framework, under realistic deployment conditions.
Why This Matters
JetFlow addresses a critical bottleneck in LLM deployment: speculative decoding speeds up inference in theory, but practical gains plateau as token acceptance rates decline and drafting overhead grows. By achieving up to 9.64x speedup on mathematical reasoning tasks and demonstrating latency improvements when integrated with vLLM, the work bears directly on the feasibility and cost-efficiency of deploying large language models at scale. Organizations running LLM inference services stand to benefit from reduced latency and computational overhead, which would translate into faster user-facing applications and lower operational costs.
Timeline & Sources
Jun 16, 2026 (Wire): JetFlow paper submitted to arXiv
Jun 18, 2026 (Wire): JetFlow preprint published on arXiv with code availability