Jun 18, 2026
Numerically Stable Cholesky-QR Algorithm Addresses GPU Computing Limitations

Researchers introduced MRCQR, a GPU algorithm that solves numerical instability problems in Cholesky-QR factorization for ill-conditioned matrices by using mixed-precision randomized preconditioning. The method achieves 1.4–13.5× speedup over existing algorithms while maintaining accuracy for condition numbers up to 10^16.
Quick Facts
Who
arXiv researchers in Numerical Analysis
What
Developed MRCQR (Mixed-Precision Randomized Cholesky-QR) algorithm
When
Submitted 16 June 2026
Where
GPU computing environment (NVIDIA H100)
- Addresses numerical instability in Cholesky-QR for ill-conditioned matrices
- Uses subsampled randomized trigonometric transform for preconditioning
- Applies Cholesky-QR in double precision to preconditioned matrix
- Tested on NVIDIA H100 GPU
Researchers have developed MRCQR (Mixed-Precision Randomized Cholesky-QR), a GPU algorithm that overcomes the numerical stability limits of one of the fastest methods for computing the QR factorization of tall-and-skinny matrices. The traditional Cholesky-QR algorithm is efficient on GPUs because it is built from BLAS-3 operations, but in double precision it becomes unstable once the matrix condition number exceeds approximately 10^8: forming the Gram matrix squares the condition number, and the subsequent Cholesky factorization can then break down or yield a factor that is far from orthogonal.
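The 10^8 threshold follows from a short calculation. The sketch below assumes IEEE double precision with unit roundoff u of about 1.1 × 10^-16; the symbols A, U, Σ, V, σ, κ_2, and u are notation introduced here for illustration and do not come from the paper.

```latex
\begin{gather*}
A = U \Sigma V^{\mathsf T}, \qquad
\kappa_2(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}, \\
A^{\mathsf T} A = V \Sigma^{2} V^{\mathsf T}
\;\Longrightarrow\;
\kappa_2\!\left(A^{\mathsf T} A\right)
  = \frac{\sigma_{\max}(A)^{2}}{\sigma_{\min}(A)^{2}}
  = \kappa_2(A)^{2}, \\
\kappa_2(A)^{2}\, u \lesssim 1
\;\Longleftrightarrow\;
\kappa_2(A) \lesssim u^{-1/2} \approx 10^{8}.
\end{gather*}
```

The last line is the usual rule of thumb: the Cholesky factorization of the Gram matrix is only reliable while the squared condition number stays below roughly 1/u.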
The MRCQR approach uses a subsampled randomized trigonometric transform to construct a preconditioner that, with high probability, reduces the condition number of the preconditioned matrix to near unity. A key insight of the method is that the preconditioner itself requires significantly less numerical precision than the final result. Single-precision (FP32) arithmetic suffices for condition numbers up to 10^8, while half-precision (FP16) is adequate for condition numbers up to 10^4. After preconditioning, Cholesky-QR is applied in double precision to produce an orthogonal factor that meets double-precision accuracy for condition numbers as high as 10^16, well beyond the 10^8 limit of the previous CholQR2 algorithm.
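The pipeline described above can be sketched in NumPy on the CPU. This is a minimal illustration under several assumptions, not the authors' GPU implementation:

- All function names (`make_test_matrix`, `cholqr`, `srtt_sketch`, `preconditioned_cholqr`, `orth_error`) are hypothetical.
- A discrete Hartley transform built from `np.fft.fft` stands in for whichever trigonometric transform the paper uses.
- The oversampling factor of 8 is an arbitrary choice.
- Low precision is only emulated, by rounding the input and the sketch to float32 (NumPy's QR does not accept float16).
- The preconditioner is assumed to be the R factor of a QR factorization of the sketch.

```python
import numpy as np


def make_test_matrix(m, n, cond, rng):
    """Random m-by-n matrix with 2-norm condition number `cond` (test helper)."""
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    sigma = np.logspace(0.0, -np.log10(cond), n)  # singular values 1 ... 1/cond
    return (U * sigma) @ V.T


def cholqr(A):
    """Plain Cholesky-QR: R is the Cholesky factor of the Gram matrix A^T A."""
    G = A.T @ A                      # forming G squares the condition number
    R = np.linalg.cholesky(G).T      # upper triangular, G = R^T R
    # Q = A R^{-1}; a general solve is used because NumPy has no triangular solver.
    Q = np.linalg.solve(R.T, A.T).T
    return Q, R


def srtt_sketch(A, s, rng):
    """Assumed sketch: s random rows of H D A, with D random signs and H a Hartley transform."""
    m = A.shape[0]
    signs = rng.choice([-1.0, 1.0], size=m)        # D: random sign flips
    F = np.fft.fft(signs[:, None] * A, axis=0)
    H = F.real - F.imag                            # discrete Hartley transform per column
    rows = rng.choice(m, size=s, replace=False)    # uniform row subsampling
    return H[rows] / np.sqrt(s)                    # rescale so norms are preserved on average


def preconditioned_cholqr(A, sketch_dtype=np.float32, oversample=8, rng=None):
    """Low-precision sketch -> preconditioner R_s -> Cholesky-QR in float64."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    s = min(m, oversample * n)
    # Emulate a low-precision sketch by rounding its input and output.
    S = srtt_sketch(A.astype(sketch_dtype), s, rng).astype(sketch_dtype)
    R_s = np.linalg.qr(S, mode="r").astype(np.float64)
    A_pre = np.linalg.solve(R_s.T, A.T).T          # A R_s^{-1}, computed in float64
    Q, R_c = cholqr(A_pre)                         # well conditioned, so Cholesky-QR is safe
    return Q, R_c @ R_s                            # A = Q (R_c R_s)


def orth_error(Q):
    """Spectral-norm deviation of Q^T Q from the identity."""
    return np.linalg.norm(Q.T @ Q - np.eye(Q.shape[1]), 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = make_test_matrix(4096, 16, 1e5, rng)
    Q_plain, _ = cholqr(A)
    Q_pre, R_pre = preconditioned_cholqr(A, rng=rng)
    print("plain CholQR orthogonality error:   ", orth_error(Q_plain))
    print("preconditioned orthogonality error: ", orth_error(Q_pre))
    print("relative residual ||QR - A||/||A||: ",
          np.linalg.norm(Q_pre @ R_pre - A) / np.linalg.norm(A))
```

With the test matrix at condition number 10^5, the plain variant typically loses many digits of orthogonality, while the preconditioned variant stays near double-precision roundoff. Exact figures vary with the random seed and the NumPy build.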
Experiments on an NVIDIA H100 GPU show the performance gains. The FP16 variant of MRCQR is 1.4 to 1.8 times faster than the earlier rand-cholQR algorithm across all tested matrix column counts, and 1.8 to 13.5 times faster than cuSOLVER's geqrf implementation. The FP16 sketch mode, used when the condition number stays below 10^4, is reported to achieve twice the computational efficiency of FP64 (double-precision) calculations without sacrificing accuracy. Together these results make stable, efficient QR decomposition available on GPUs for ill-conditioned matrices that previously failed or required significantly slower algorithms.
Why This Matters
The work targets a bottleneck in scientific computing and machine learning workloads that rely on QR factorization of ill-conditioned matrices. By keeping the factorization stable up to condition number 10^16 while retaining the reported GPU speedups, the method makes Cholesky-QR usable on problems in optimization, statistics, and linear algebra where it previously failed or had to be replaced by slower algorithms. Practitioners can run these computations faster without giving up accuracy, which matters for real-time analytics and large-scale simulations.
Timeline & Sources
Jun 16, 2026
Wire: MRCQR paper submitted to arXiv
Jun 18, 2026
Wire: MRCQR paper announced and published on arXiv