CUDA vs. UDA for Laniakea OS — and the Road to QPU APIs
- Erick Rosado
- Sep 20
- 5 min read

TL;DR
CUDA = vendor-specific (NVIDIA) GPU platform with best-in-class tooling and performance.
UDA (Unified Device Architecture) = vendor-neutral idea: one interface for many accelerators (CPU/GPU/FPGA/…QPU).
Laniakea OS supports both: CUDA for peak GPU speed and UDA-style abstraction for portability today and QPU integration tomorrow.
1) What is CUDA?
NVIDIA’s Compute Unified Device Architecture: a programming model (kernels, grids/blocks/threads), compiler toolchain (nvcc), driver/runtime, and libraries (cuBLAS, cuDNN, NCCL, Thrust…) that expose massive GPU parallelism for general-purpose compute.
Why teams pick it
Mature ecosystem and profilers (Nsight), highly optimized libs, broad cloud/on-prem availability.
Tight control over memory hierarchy (global/shared/constant) and occupancy.
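To make the kernel/grid/block model concrete, here is a minimal sketch that drives a raw CUDA kernel from Python via CuPy. It assumes an NVIDIA GPU and a CuPy install; the kernel and variable names are illustrative, not part of any Laniakea API.

```python
# Minimal CUDA-model sketch via CuPy: kernels, grids, blocks, threads.
# Assumes an NVIDIA GPU and `pip install cupy`.
import cupy as cp

# A CUDA C kernel: each thread scales one element of the input vector.
scale = cp.RawKernel(r'''
extern "C" __global__
void scale(const float* x, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) y[i] = a * x[i];
}
''', 'scale')

n = 1 << 20
x = cp.random.random(n).astype(cp.float32)
y = cp.empty_like(x)

threads = 256                           # threads per block
blocks = (n + threads - 1) // threads   # blocks per grid
scale((blocks,), (threads,), (x, y, cp.float32(2.0), cp.int32(n)))

assert cp.allclose(y, 2.0 * x)
```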
2) What is “UDA” (Unified Device Architecture)?
A concept rather than a single product: a unified API that can target diverse accelerators (CPU, NVIDIA/AMD GPU, FPGA, DSP—and eventually QPU). Think of it as the portability layer that keeps app code stable while the backend changes.
Concrete realizations of “UDA-like” portability include:
SYCL / oneAPI (Khronos/Intel): C++ single-source kernels, backends for CPU, Level-Zero, CUDA, HIP.
OpenCL: cross-vendor compute API (lower-level).
HIP/ROCm (AMD): CUDA-like model for AMD GPUs, with some CUDA translation.
ML framework runtimes that dispatch to multiple backends.
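What would a UDA-style contract look like in code? A minimal sketch follows; `UdaBackend`, `CpuBackend`, and `submit` are hypothetical names standing in for the idea, not any real API.

```python
# Sketch of the "UDA" idea: one stable interface, swappable backends.
# All names here are illustrative.
from abc import ABC, abstractmethod

class UdaBackend(ABC):
    """Stable contract the app codes against; backends change underneath."""

    @abstractmethod
    def capabilities(self) -> dict: ...

    @abstractmethod
    def submit(self, kind: str, payload: bytes) -> bytes: ...

class CpuBackend(UdaBackend):
    def capabilities(self) -> dict:
        return {"device": "cpu", "simd": True}

    def submit(self, kind: str, payload: bytes) -> bytes:
        # Run the work on the host; a CUDA/SYCL/QPU backend would
        # implement the same two methods and slot in transparently.
        return payload[::-1]  # stand-in for real compute

def run(backend: UdaBackend, kind: str, payload: bytes) -> bytes:
    return backend.submit(kind, payload)  # app code never names a vendor

print(run(CpuBackend(), "bit-stream", b"demo"))
```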
3) Why both matter to Laniakea OS
Performance now (CUDA): When NVIDIA GPUs are present, CUDA provides the fastest path.
Flexibility forever (UDA): A portability layer lets Laniakea schedule the same job on CPUs/GPUs/FPGAs—and future QPUs—without app rewrites.
Operations: A unified telemetry schema (utilization, memory, errors, energy) enables one scheduler to optimize placement across all devices.
4) Side-by-side comparison
| Axis | CUDA (NVIDIA) | UDA-style (e.g., SYCL/oneAPI/OpenCL/HIP wrappers) |
| --- | --- | --- |
| Vendor scope | NVIDIA only | Multi-vendor / multi-device |
| Performance | Peak on NVIDIA GPUs via tuned libs | Competitive; depends on backend vendor libraries |
| Tooling | Nsight, nvprof, CUPTI, rich ecosystem | Improving (VTune/Advisor, Codeplay tools, ROCm tools), more variance |
| Portability | Low (CUDA-specific code) | High (single source, multiple targets) |
| Maintenance | Duplicate paths for non-NVIDIA | One codebase; backend adapters |
| Best use | NVIDIA-heavy fleets, latency-critical paths | Heterogeneous fleets, long-term portability, QPU runway |
5) Laniakea OS: runtime & telemetry blueprint
Scheduler responsibilities
Discover devices (CPU/GPU/FPGA/QPU) → capabilities (SMs/CUs, memory, drivers, features).
Normalize telemetry (utilization, mem pressure, temp, power, errors, queue depth).
Match workload → backend: dense_linear_algebra → CUDA/cuBLAS, bit-exact streaming → FPGA, combinatorial search → QPU(hybrid)+CPU.
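A sketch of that matching step, under the assumption that discovered devices are reduced to a set of backend names and jobs carry a workload tag. The preference table and names are illustrative, not Laniakea's actual scheduler.

```python
# Sketch: map a workload tag to the best available backend.
# Preference order and names are illustrative.

PREFERENCES = {
    "dense_linear_algebra": ["cuda", "hip", "cpu"],
    "bit_exact_streaming":  ["fpga", "cpu"],
    "combinatorial_search": ["qpu_hybrid", "cpu"],
}

def match(workload: str, available: set[str]) -> str:
    """Pick the first preferred backend that is actually present."""
    for backend in PREFERENCES.get(workload, ["cpu"]):
        if backend in available:
            return backend
    return "cpu"  # host CPU is the universal fallback

print(match("dense_linear_algebra", {"cpu", "fpga"}))  # -> cpu
print(match("dense_linear_algebra", {"cpu", "cuda"}))  # -> cuda
```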
Unified telemetry schema (sample)
| Metric | CPU | GPU | FPGA | QPU (hybrid) | Notes |
| --- | --- | --- | --- | --- | --- |
| Utilization (%) | ✓ | ✓ | ✓ | ✓ | Normalized per backend |
| Mem used / total | ✓ | ✓ (HBM/VRAM) | ✓ | ✓ (shots cache) | |
| Temperature (°C) | ✓ | ✓ | ✓ | — | QPU uses dilution-fridge metrics instead |
| Power (W) | ✓ | ✓ | ✓ | — | QPU has fridge power; abstract separately |
| Error counters | ✓ | ✓ (ECC) | ✓ | ✓ (T1/T2 drift, readout err) | |
| Queue depth | ✓ | ✓ | ✓ | ✓ | Back-pressure signal |
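One way to carry this schema in code is a single normalized record type. A sketch only; the field names mirror the table above, not a shipped format.

```python
# Normalized telemetry record mirroring the table above (sketch).
from dataclasses import dataclass, field

@dataclass
class DeviceTelemetry:
    backend: str                # "cpu" | "gpu" | "fpga" | "qpu_hybrid"
    utilization_pct: float      # normalized per backend
    mem_used_bytes: int
    mem_total_bytes: int
    queue_depth: int            # back-pressure signal
    temperature_c: float | None = None  # None for QPU (fridge metrics instead)
    power_w: float | None = None        # None for QPU (fridge power separate)
    error_counters: dict = field(default_factory=dict)  # ECC, T1/T2 drift, ...

sample = DeviceTelemetry("gpu", 87.5, 30 << 30, 40 << 30, 12,
                         temperature_c=64.0, power_w=310.0,
                         error_counters={"ecc_corrected": 0})
print(sample)
```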
6) “Workload → Accelerator” quick guide
| Workload pattern | Best default | Portable fallback |
| --- | --- | --- |
| Dense BLAS (GEMM/conv) | CUDA (cuBLAS/cuDNN) | oneAPI MKL / ROCm rocBLAS / CPU MKL |
| Sparse / graph traversals | GPU (CUDA), or CPU if branchy | SYCL/OpenCL backends |
| Streaming bit-level pipelines | FPGA | CPU SIMD / GPU custom kernels |
| Combinatorial optimization / QAOA-like | Hybrid QPU + CPU/GPU | Classical heuristic meta-solvers |
Note: choose by measured performance; the pairings above are starting heuristics.
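A sketch of what "choose by measured performance" can look like: start from the default table, then re-rank backends as throughput observations arrive. `EmpiricalRouter` and its fields are hypothetical names.

```python
# Empirical routing sketch: defaults first, measurements win over time.
from collections import defaultdict

class EmpiricalRouter:
    def __init__(self, default: list[str]):
        self.default = default
        self.throughput = defaultdict(list)  # backend -> observed jobs/s

    def record(self, backend: str, jobs_per_s: float) -> None:
        self.throughput[backend].append(jobs_per_s)

    def pick(self, available: set[str]) -> str:
        measured = {b: sum(v) / len(v)
                    for b, v in self.throughput.items() if b in available}
        if measured:
            return max(measured, key=measured.get)  # best observed backend
        for b in self.default:                      # else fall back to table
            if b in available:
                return b
        return "cpu"

router = EmpiricalRouter(["cuda", "hip", "cpu"])
router.record("cuda", 8.0); router.record("hip", 6.5)
print(router.pick({"cpu", "hip"}))  # -> hip (best measured & available)
```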
7) The road to QPU APIs
| Layer | What it does | Today's analog | Laniakea plan |
| --- | --- | --- | --- |
| Circuit DSL | Express circuits/ansätze | OpenQASM, QIR | Ingest standard DSLs |
| IR & transpile | Map to hardware topology | MLIR-Quantum, QIR, t\|ket⟩ | — |
| Runtime | Submit circuits/jobs; manage shots | Cloud runtimes (gRPC/REST) | Unified "DeviceQueue" API |
| Telemetry | Shots/s, queue depth, error rates, calib drift | Provider dashboards | Normalize to scheduler schema |
| Hybrid control | Classical loop around quantum calls | Orchestration SDKs | Built-in hybrid tasks (async futures) |
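The runtime and hybrid-control rows suggest an async submit/await shape. Here is a sketch with a stand-in `DeviceQueue`; any real provider API, circuit format, and parameter-update rule would differ.

```python
# Hybrid "submit/await" loop sketch: classical optimizer around async
# quantum jobs. DeviceQueue's body is a stand-in, not a provider API.
import asyncio, random

class DeviceQueue:
    async def submit(self, circuit: str, shots: int) -> dict:
        await asyncio.sleep(0.01)  # stand-in for remote execution
        return {"expectation": random.uniform(-1, 1), "shots": shots}

async def hybrid_optimize(queue: DeviceQueue, steps: int = 5) -> float:
    theta, best = 0.0, float("inf")
    for _ in range(steps):
        # Classical loop around quantum calls: submit, await, update params.
        result = await queue.submit(f"ansatz(theta={theta:.3f})", shots=1024)
        energy = result["expectation"]
        if energy < best:
            best = energy
        theta += 0.1  # stand-in parameter update
    return best

print(asyncio.run(hybrid_optimize(DeviceQueue())))
```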
8) Example graph (copy-paste ASCII)
Relative throughput (illustrative) across backends.
Baseline CPU = 1.0; higher is better (not real benchmarks).
Vector Math (BLAS1)
CPU |████████████ (1.0)
CUDA |██████████████████████████████ (8.0)
ROCm/HIP |██████████████████████████ (6.5)
FPGA |███████████ (3.0)
QPU(h) |██ (0.2) [classical task; QPU not ideal]
Tensor Convolution
CPU |████████████ (1.0)
CUDA |██████████████████████████████████████ (10.0)
ROCm/HIP |███████████████████████████████ (7.5)
FPGA |██████████████ (4.0)
QPU(h) |██ (0.2)
Graph Traversal (irregular)
CPU |████████████ (1.0)
CUDA |████████████████████ (3.5)
ROCm/HIP |██████████████████ (3.0)
FPGA |███████████████ (3.8)
QPU(h) |███ (0.3)
Note: Values are illustrative to visualize trade-offs; always measure on your hardware.
9) Two tables you can paste into docs
A) Device inventory → backend handle
| Device type | Probe | Backend handle | Notes |
| --- | --- | --- | --- |
| CPU | /proc/cpuinfo, CPUID, sysfs | uda://cpu/0 | NUMA & SIMD flags |
| NVIDIA GPU | CUDA runtime cudaGetDeviceProperties | cuda://gpu/0 | SMs, HBM size |
| AMD GPU | ROCm SMI / HIP | hip://gpu/0 | CUs, HBM size |
| FPGA | OpenCL / vendor SMI | ocl://fpga/0 | Bitstream ID |
| QPU | Provider gRPC/REST | qpu://provider/systemA | Topology, calib time |
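A sketch of the probe column above. The `/proc/cpuinfo` check and the `nvidia-smi` query flags are real on Linux/NVIDIA systems; the handle format and everything else is an assumption for illustration.

```python
# Device discovery sketch: probe the host, emit backend handles.
import os, shutil, subprocess

def discover() -> dict[str, str]:
    devices = {}
    if os.path.exists("/proc/cpuinfo"):  # Linux CPU probe
        devices["uda://cpu/0"] = "cpu"
    if shutil.which("nvidia-smi"):       # NVIDIA GPU probe
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True)
        for i, name in enumerate(out.stdout.splitlines()):
            devices[f"cuda://gpu/{i}"] = name.strip()
    return devices

print(discover())
```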
B) Scheduler decision hints
| Hint | Source | Effect |
| --- | --- | --- |
| job.tags includes "dense-blas" | User/job metadata | Prefer CUDA/ROCm |
| power_cap < X | Energy policy | Prefer CPU/FPGA |
| qpu.calib_age > threshold | QPU telemetry | Delay quantum jobs |
| gpu.mem_free < model_mem | GPU telemetry | Spill or split batch |
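These hints can be read as predicates over job metadata and telemetry that reorder or filter the candidate backends. A sketch follows; all thresholds, key names, and structures are assumptions.

```python
# Decision-hint sketch: each hint filters or reorders candidates.

def apply_hints(job: dict, telemetry: dict, candidates: list[str]) -> list[str]:
    if "dense-blas" in job.get("tags", []):
        # Prefer GPU backends: stable sort moves them to the front.
        candidates.sort(key=lambda b: b not in ("cuda", "hip"))
    if job.get("power_cap_w", float("inf")) < 150:
        candidates = [b for b in candidates if b in ("cpu", "fpga")] or candidates
    if telemetry.get("qpu.calib_age_s", 0) > 3600:
        candidates = [b for b in candidates if b != "qpu_hybrid"]
    if telemetry.get("gpu.mem_free", 0) < job.get("model_mem", 0):
        candidates = [b for b in candidates if b != "cuda"]  # or split batch
    return candidates

job = {"tags": ["dense-blas"], "model_mem": 8 << 30}
tele = {"gpu.mem_free": 16 << 30, "qpu.calib_age_s": 7200}
print(apply_hints(job, tele, ["cpu", "cuda", "qpu_hybrid"]))  # -> ['cuda', 'cpu']
```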
10) Practical guidance for Laniakea OS
Support CUDA directly where present; keep UDA-style path for everything else.
Unify telemetry early; scheduling wins come from shared metrics.
Abstract jobs (not devices): express intent (dense, sparse, bit-level, hybrid-quantum); see the sketch after this list.
Prepare for QPUs via a clean async “submit/await” interface and hybrid loops.
Continuously benchmark; keep routing decisions empirical, not dogmatic.
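To make "abstract jobs, not devices" concrete, here is a sketch of an intent-first job spec. All field names and the example URI are hypothetical.

```python
# Intent-first job spec sketch: the scheduler, not the app, binds
# intent to a concrete backend handle.
from dataclasses import dataclass, field

@dataclass
class JobSpec:
    intent: str                 # "dense" | "sparse" | "bit-level" | "hybrid-quantum"
    payload_uri: str
    tags: list[str] = field(default_factory=list)
    size_hint: str = ""         # e.g. "N=8192, K=1024"

job = JobSpec(intent="dense", payload_uri="s3://bucket/gemm.bin",
              tags=["dense-blas"], size_hint="N=8192, K=1024")
# scheduler.place(job) would resolve this to cuda://gpu/0, hip://gpu/0, ...
print(job)
```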
11) One-page glossary
CUDA: NVIDIA platform for general GPU compute.
UDA (concept): one API for many accelerators.
SYCL/oneAPI: portable C++ single-source model targeting multiple backends.
HIP/ROCm: AMD’s CUDA-like stack.
QPU: Quantum Processing Unit (gate-model/annealer).
Hybrid: classical control loop + quantum subroutines.
Telemetry: normalized device metrics for scheduling.
12) Appendix: condensed reference tables

| Topic | CUDA (Compute Unified Device Architecture) | UDA (Unified Device Architecture, vendor-agnostic idea) |
| --- | --- | --- |
| Scope | NVIDIA GPUs only | Abstracts CPUs/GPUs/FPGAs/accelerators (potentially QPUs) |
| Programming model | SIMT, CUDA C/C++, kernels, grids/blocks/threads | Portable device model; maps to CUDA, HIP, OpenCL, SYCL, FPGA HDL, future QPU APIs |
| Tooling | nvcc, Nsight Compute/Systems, cuBLAS/cuDNN/cuFFT | Backend adapters; shared runtime; capability discovery |
| Strengths | Deep libraries + mature perf | Flexibility across vendors and form factors |
| Risks | Lock-in, single vendor | Lowest common denominator unless backends expose extensions |
Unified telemetry (minimal schema)

| Field | Type | Example |
| --- | --- | --- |
| job_id | string | wkld-2025-09-17-001 |
| backend | enum | `cpu \| gpu \| fpga \| qpu` |
| device_id | string | GPU0-AD102 |
| ts_start, ts_end | ISO 8601 | 2025-09-17T18:22:03Z |
| workload | string | tensor_convolution |
| size_hint | string | N=8192, K=1024 |
| throughput_x | float | 10.0 |
| latency_ms | float | 14.2 |
| energy_j | float | 5.8 |
| errors | array | [] |
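The same minimal schema as one concrete record, with values lifted from the Example column; `ts_end` is assumed, since the table only shows the timestamp format.

```python
# One concrete telemetry record for the minimal schema above.
import json

record = {
    "job_id": "wkld-2025-09-17-001",
    "backend": "gpu",
    "device_id": "GPU0-AD102",
    "ts_start": "2025-09-17T18:22:03Z",
    "ts_end": "2025-09-17T18:22:17Z",  # assumed; table shows only the format
    "workload": "tensor_convolution",
    "size_hint": "N=8192, K=1024",
    "throughput_x": 10.0,
    "latency_ms": 14.2,
    "energy_j": 5.8,
    "errors": [],
}
print(json.dumps(record, indent=2))
```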
Workload → accelerator quick guide

| Workload | Best | Also viable |
| --- | --- | --- |
| Dense linear algebra (BLAS/GEMM) | CUDA/ROCm | FPGA (pipelined), CPU (MKL/BLIS) |
| Convolutions | CUDA/ROCm (cuDNN/MIOpen) | FPGA (streaming) |
| Irregular graphs | FPGA ≈ CUDA/ROCm | CPU (NUMA-aware) |
| Low-latency control | CPU/FPGA | — |
| Combinatorial search | QPU-hybrid (future) | CPU/FPGA heuristics |
Device inventory → backend handle

| Device | Example handle |
| --- | --- |
| CPU | cpu:0 |
| NVIDIA | cuda:0 |
| AMD | rocm:1 |
| FPGA | fpga:xcvu9p-0 |
| QPU (remote) | qpu:rigetti:Ankaa-9Q |
Scheduler decision hints

| Signal | Direction | Use |
| --- | --- | --- |
| throughput_x | higher is better | primary for batch |
| latency_ms | lower is better | interactive jobs |
| energy_j | lower is better | mobile/edge |
| queue_depth | lower is better | avoid contention |
Road to QPU APIs (layers)

| Layer | Purpose |
| --- | --- |
| High-level SDK | circuits, annealing, hybrid loop |
| IR (QIR/OpenQASM) | portable quantum program |
| Runtime/Orchestrator | qubit map, transpile, error mitigation |
| Hardware driver | pulses, calibration, job submit |
CUDA (what it is)

| Aspect | Notes |
| --- | --- |
| NVIDIA parallel platform & API | Kernels on GPU SMs; libraries (cuBLAS/cuDNN) |
| Memory model | Unified Virtual Addressing, streams, events |
| Why it matters | Peak throughput for dense numeric workloads |