JetFlow: New Method Breaks Speed Ceiling for Large Language Model Inference

JetFlow, a novel speculative decoding framework for Large Language Models, addresses scaling limitations by combining efficient one-forward drafting with causal branch conditioning. It reaches up to 9.64x speedup on the MATH-500 benchmark, outperforms existing bidirectional-head and tree-based baselines, and shows additional latency gains when integrated with the vLLM serving framework.

Quick Facts

Who: HAO AI Lab researchers
What: Proposed JetFlow framework for speculative decoding
When: Submitted 16 June 2026
Where: H100 GPU systems

Combines one-forward drafting efficiency with branch-wise causal conditioning
Trains causal parallel draft head over fused hidden states
Tested across math, coding, and chat benchmarks
Integrated with vLLM for production serving

Researchers have introduced JetFlow, a novel framework designed to overcome fundamental scaling limitations in speculative decoding, a technique that accelerates Large Language Models (LLMs) by drafting several candidate tokens cheaply and verifying them in parallel with the target model. The work addresses a persistent challenge in the field: increasing the draft budget should improve speed in theory, but practical gains plateau when acceptance rates decline and drafting overhead grows.
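As a rough illustration of the draft-and-verify loop that speculative decoding relies on, the sketch below shows the generic greedy variant of verification. It is not JetFlow's implementation: `verify_greedy` and `toy_target` are hypothetical names, and the toy target rule stands in for a real model. A real system would score every draft position in one batched target pass instead of calling the model once per position.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens emitted by one draft-and-verify step (greedy variant)."""
    context = list(prefix)
    emitted: List[int] = []
    for token in draft:
        # Assumption: one call per position for clarity; real systems batch this.
        expected = target_next(context)
        if token != expected:
            # First mismatch: keep the target's token and discard the rest of the draft.
            emitted.append(expected)
            return emitted
        emitted.append(token)
        context.append(token)
    # Every draft token was accepted, so the target contributes one bonus token.
    emitted.append(target_next(context))
    return emitted


def toy_target(context: Sequence[int]) -> int:
    """Hypothetical stand-in for a target model: next token is (last + 1) mod 10."""
    return (context[-1] + 1) % 10
```

With this toy target, `verify_greedy([1], [2, 3, 9, 5], toy_target)` returns `[2, 3, 4]`: two draft tokens are accepted, the third is replaced by the target's choice, and the fourth is discarded. The output always matches what the target alone would have produced, which is why a low acceptance rate wastes draft budget without changing the result.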

Speculative decoding has long faced a causality-efficiency dilemma. Traditional autoregressive drafters produce path-conditioned candidates, which suit tree-based decoding and achieve higher acceptance, but their computational cost scales with tree depth. Conversely, bidirectional block-diffusion drafters generate all positions simultaneously, but because they are branch-agnostic they can produce token trees whose nodes are individually plausible yet mutually inconsistent, wasting computational budget and reducing overall acceptance rates.
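The inconsistency problem can be made concrete with a toy example. The sketch below is an illustration under stated assumptions, not code from the paper: the two-token distribution `joint` and all function names are hypothetical. It builds one candidate tree from per-position marginals only (the branch-agnostic view) and another from the paths the joint distribution actually supports (the path-conditioned view).

```python
from itertools import product

# Hypothetical target distribution over two-token continuations.
joint = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}


def marginals(joint):
    """Per-position marginals, which is all a branch-agnostic drafter sees."""
    n_positions = len(next(iter(joint)))
    out = [dict() for _ in range(n_positions)]
    for path, prob in joint.items():
        for i, token in enumerate(path):
            out[i][token] = out[i].get(token, 0.0) + prob
    return out


def branch_agnostic_tree(joint):
    """Cross product of per-position candidates, ignoring the chosen parent."""
    per_position = [sorted(m) for m in marginals(joint)]
    return list(product(*per_position))


def path_conditioned_tree(joint):
    """Keep only the paths that the joint distribution supports."""
    return sorted(path for path, prob in joint.items() if prob > 0)


agnostic = branch_agnostic_tree(joint)
conditioned = path_conditioned_tree(joint)
# Paths whose tokens are each plausible but which the target would never emit.
wasted = [path for path in agnostic if path not in joint]
```

Here the branch-agnostic tree holds four paths, and two of them ("Los York" and "New Angeles") can never be accepted, so half of the draft budget is spent on candidates the verifier is bound to reject. The path-conditioned tree holds only the two valid paths.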

JetFlow resolves this tension by combining one-forward drafting efficiency with branch-wise causal conditioning. The framework trains a causal parallel draft head over fused hidden states from a frozen target model, producing candidate trees whose probability scores align with the target model's autoregressive factorization. This architectural choice enables the system to convert larger draft budgets into longer accepted token sequences and higher end-to-end speedup.
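Why drafting cost and accepted length both matter can be seen in a common back-of-envelope model of speculative decoding speedup. This is a simplification and not the paper's analysis: the function names and the example numbers are hypothetical, and all costs are expressed in units of one target forward pass.

```python
def estimated_speedup(
    tokens_per_step: float, draft_cost: float, verify_cost: float = 1.0
) -> float:
    """Throughput relative to plain decoding, which yields 1 token per target pass.

    tokens_per_step: average tokens emitted per draft-and-verify step.
    draft_cost, verify_cost: cost per step, in units of one target forward pass.
    """
    if tokens_per_step <= 0 or draft_cost < 0 or verify_cost <= 0:
        raise ValueError("tokens_per_step and verify_cost must be positive, draft_cost non-negative")
    return tokens_per_step / (draft_cost + verify_cost)


def autoregressive_draft_cost(depth: int, cost_per_pass: float) -> float:
    """An autoregressive drafter runs one draft pass per tree level."""
    return depth * cost_per_pass


def one_forward_draft_cost(cost_per_pass: float) -> float:
    """A one-forward drafter produces every level in a single draft pass."""
    return cost_per_pass


# Hypothetical numbers: depth-8 tree, each draft pass costs 5% of a target pass,
# and both drafters are assumed to yield 5 tokens per step.
ar_speedup = estimated_speedup(5.0, autoregressive_draft_cost(8, 0.05))
one_forward_speedup = estimated_speedup(5.0, one_forward_draft_cost(0.05))
```

Under these assumed numbers the one-forward drafter comes out ahead only because its drafting cost does not grow with depth. The model also shows the other half of the trade-off: if a cheaper drafter yields fewer accepted tokens per step, the numerator shrinks and the advantage can disappear.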

Performance testing on H100 GPUs demonstrates substantial improvements. The framework achieves up to 9.64x speedup on mathematical reasoning tasks (MATH-500) and 4.58x acceleration on open-ended conversational workloads. Comprehensive benchmarking across mathematics, coding, and chat tasks on both dense and Mixture-of-Experts Qwen3 models shows JetFlow consistently outperforming existing bidirectional-head and tree-based speculative decoding baselines. Additional latency improvements are demonstrated through integration with vLLM, a production serving framework, under realistic deployment conditions.

Why This Matters

JetFlow addresses a critical bottleneck in LLM deployment: speculative decoding speeds up inference in theory, but practical gains plateau as token acceptance rates decline and drafting overhead grows. With up to 9.64x speedup on mathematical reasoning tasks and a demonstrated integration with vLLM, the work bears directly on the feasibility and cost-efficiency of deploying large language models at scale. If the reported gains carry over to other workloads, organizations running LLM inference services could see lower latency and compute overhead, and with them faster user-facing applications and lower operating costs.

Timeline & Sources

Jun 16, 2026 (Wire): JetFlow paper submitted to arXiv

Jun 18, 2026 (Wire): JetFlow preprint published on arXiv with code availability

Sources

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting | arxiv_cs | Media | Jun 18, 2026