Numerically Stable Cholesky-QR Algorithm Addresses GPU Computing Limitations

Researchers introduced MRCQR, a GPU algorithm that fixes the numerical instability of Cholesky-QR factorization on ill-conditioned matrices by using mixed-precision randomized preconditioning. Reported speedups range from 1.4× (over rand-cholQR) to 13.5× (over cuSOLVER's geqrf), and the method maintains accuracy for condition numbers up to 10^16.

Quick Facts

Who

Researchers publishing on arXiv (Numerical Analysis)

What

Developed MRCQR (Mixed-Precision Randomized Cholesky-QR) algorithm

When

Submitted 16 June 2026

Where

GPU computing environment (NVIDIA H100)

Addresses numerical instability in Cholesky-QR for ill-conditioned matrices
Uses subsampled randomized trigonometric transform for preconditioning
Applies Cholesky-QR in double precision to preconditioned matrix
Tested on NVIDIA H100 GPU

Researchers have developed MRCQR (Mixed-Precision Randomized Cholesky-QR), a new GPU algorithm that overcomes the numerical stability limits of one of the fastest methods for computing the QR factorization of tall-and-skinny matrices. The traditional Cholesky-QR algorithm is efficient on GPUs because it relies on BLAS-3 operations, but it forms the Gram matrix A^T A, whose condition number is the square of that of A. In double precision, once the condition number of A exceeds approximately 10^8, the squared value reaches about 10^16, the reciprocal of the double-precision unit roundoff, and the Cholesky factorization breaks down.
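The breakdown of plain Cholesky-QR can be reproduced in a few lines. The following is a minimal NumPy sketch on the CPU, not the paper's GPU code; the helper names make_matrix, cholesky_qr, and orthogonality_error are hypothetical and chosen only for illustration. It builds a tall-and-skinny matrix with a prescribed condition number, factors it through the Gram matrix, and measures how far Q^T Q is from the identity.

```python
import numpy as np

def make_matrix(m, n, cond, seed=0):
    """Tall-and-skinny test matrix with 2-norm condition number `cond`."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))   # orthonormal columns
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthogonal matrix
    sigma = np.logspace(0, -np.log10(cond), n)         # singular values from 1 down to 1/cond
    return (U * sigma) @ V.T

def cholesky_qr(A):
    """Plain Cholesky-QR: R is the transposed Cholesky factor of A^T A."""
    G = A.T @ A                       # Gram matrix, cond(G) = cond(A)**2
    L = np.linalg.cholesky(G)         # G = L L^T; raises LinAlgError if G is not numerically SPD
    Q = np.linalg.solve(L, A.T).T     # Q = A R^{-1}, computed as (L^{-1} A^T)^T
    return Q, L.T

def orthogonality_error(A):
    """Frobenius norm of Q^T Q - I, or inf if the Cholesky step fails."""
    try:
        Q, _ = cholesky_qr(A)
    except np.linalg.LinAlgError:
        return np.inf
    return np.linalg.norm(Q.T @ Q - np.eye(Q.shape[1]))

if __name__ == "__main__":
    for cond in (1e2, 1e6, 1e10):
        A = make_matrix(500, 20, cond)
        print(f"cond(A) = {cond:.0e}  ->  ||Q^T Q - I|| = {orthogonality_error(A):.2e}")
```

In this sketch the orthogonality error grows roughly with the square of the condition number, and for a condition number of 10^10 the factorization either fails outright or returns a Q that is far from orthogonal.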

The MRCQR approach introduces a subsampled randomized trigonometric transform to construct a preconditioner that reduces the condition number of the preconditioned matrix to near unity with high probability. A key insight of the method is that the preconditioner itself requires significantly less numerical precision than the final result. Single-precision (FP32) arithmetic suffices for condition numbers up to 10^8, while half-precision (FP16) is adequate for condition numbers up to 10^4. After preconditioning, Cholesky-QR is applied in double precision to produce an orthogonal factor meeting double-precision accuracy standards for condition numbers as high as 10^16—well beyond the 10^8 limit of the previous CholQR2 algorithm.
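The overall shape of such a scheme can be sketched in NumPy. This is an illustration under stated assumptions, not the authors' implementation: it uses a Hartley transform as one example of a trigonometric transform, runs on the CPU, and only emulates low precision by rounding the input and the sketch to the chosen dtype. The function names, the oversample parameter, and the choice of four times as many sketch rows as columns are all hypothetical. The test matrices use modest condition numbers and do not attempt to reproduce the paper's thresholds.

```python
import numpy as np

def make_matrix(m, n, cond, seed=0):
    """Tall-and-skinny test matrix with 2-norm condition number `cond`."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return (U * np.logspace(0, -np.log10(cond), n)) @ V.T

def hartley_sketch(A, s, rng, dtype):
    """Subsampled randomized Hartley transform of A (m x n) down to s rows."""
    m = A.shape[0]
    A_low = A.astype(dtype).astype(np.float64)         # emulate low-precision input
    signs = rng.choice([-1.0, 1.0], size=m)            # random sign flips (diagonal D)
    F = np.fft.fft(signs[:, None] * A_low, axis=0)     # FFT down each column
    H = (F.real - F.imag) / np.sqrt(m)                 # orthonormal Hartley transform
    rows = rng.choice(m, size=s, replace=False)        # uniform row subsample
    return (np.sqrt(m / s) * H[rows]).astype(dtype)    # round the sketch to low precision

def preconditioned_cholesky_qr(A, sketch_dtype=np.float32, oversample=4, seed=0):
    """Low-precision randomized preconditioner followed by float64 Cholesky-QR."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    Y = hartley_sketch(A, oversample * n, rng, sketch_dtype)
    # R factor of the small sketch; LAPACK has no half precision, so use float32 here.
    R_s = np.linalg.qr(Y.astype(np.float32), mode="r").astype(np.float64)
    A_p = np.linalg.solve(R_s.T, A.T).T   # preconditioned matrix A R_s^{-1}, in float64
    G = A_p.T @ A_p                       # Gram matrix is now well conditioned
    L = np.linalg.cholesky(G)
    Q = np.linalg.solve(L, A_p.T).T       # Q = A_p L^{-T}
    R = L.T @ R_s                         # combined triangular factor, A = Q R
    return Q, R, A_p

if __name__ == "__main__":
    A = make_matrix(2048, 16, 1e4)
    Q, R, A_p = preconditioned_cholesky_qr(A, np.float32)
    print("cond(A_p)      :", np.linalg.cond(A_p))
    print("||Q^T Q - I||  :", np.linalg.norm(Q.T @ Q - np.eye(16)))
    print("||A - QR||/||A||:", np.linalg.norm(A - Q @ R) / np.linalg.norm(A))
```

The point of the design is that the preconditioner only has to bring the condition number down to a small constant, so errors in R_s from the low-precision sketch affect how well conditioned A_p is but not the accuracy of the final double-precision Cholesky-QR step.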

Experimental validation on NVIDIA H100 GPUs demonstrates significant performance advantages. The FP16 variant of MRCQR outperforms the previous rand-cholQR algorithm by 1.4 to 1.8 times across all tested matrix column counts and is 1.8 to 13.5 times faster than cuSOLVER's geqrf implementation. The FP16 sketch mode, used when condition numbers remain below 10^4, achieves twice the computational efficiency of FP64 (double-precision) calculations without sacrificing accuracy. This work addresses a fundamental challenge in numerical linear algebra on GPUs, enabling stable and efficient QR decomposition for ill-conditioned matrices that previously would have failed or required significantly slower algorithms.

Why This Matters

This work addresses a bottleneck in scientific computing and machine learning workloads that rely on QR factorization of ill-conditioned tall-and-skinny matrices. By keeping Cholesky-QR stable up to condition number 10^16 while retaining its GPU speed advantage, the method makes a fast BLAS-3 factorization usable on problems in optimization, statistics, and linear algebra where it previously failed or had to be replaced by significantly slower algorithms.

Timeline & Sources

Jun 16, 2026

MRCQR paper submitted to arXiv

Jun 18, 2026

MRCQR paper announced and published on arXiv

Sources

Numerically Stable Cholesky-QR on GPU via Mixed-Precision Randomized Preconditioning (arXiv, Jun 18, 2026)