
CUDA vs. UDA for Laniakea OS — and the Road to QPU APIs

  • Writer: Erick Rosado
  • Sep 20
  • 5 min read

TL;DR

  • CUDA = vendor-specific (NVIDIA) GPU platform with best-in-class tooling and performance.

  • UDA (Unified Device Architecture) = vendor-neutral idea: one interface for many accelerators (CPU/GPU/FPGA/…QPU).

  • Laniakea OS supports both: CUDA for peak GPU speed, and UDA-style abstraction for portability today and QPU integration tomorrow.

1) What is CUDA?

NVIDIA’s Compute Unified Device Architecture: a programming model (kernels, grids/blocks/threads), compiler toolchain (nvcc), driver/runtime, and libraries (cuBLAS, cuDNN, NCCL, Thrust…) that expose massive GPU parallelism for general-purpose compute.

Why teams pick it

  • Mature ecosystem and profilers (Nsight), highly optimized libs, broad cloud/on-prem availability.

  • Tight control over memory hierarchy (global/shared/constant) and occupancy.
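To make the kernel/grid/block model concrete, here is a minimal vector-add sketch using Numba's CUDA bindings (one convenient way to drive CUDA from Python; it assumes numba and a CUDA-capable GPU, and the names are otherwise illustrative):

```python
# Minimal CUDA vector add via Numba (illustrative; assumes numba + a CUDA GPU).
import numpy as np
from numba import cuda

@cuda.jit
def vec_add(a, b, out):
    i = cuda.grid(1)          # this thread's global index across the whole grid
    if i < out.shape[0]:      # guard: the grid may be larger than the array
        out[i] = a[i] + b[i]

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.empty_like(a)

threads_per_block = 256
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
vec_add[blocks_per_grid, threads_per_block](a, b, out)  # Numba handles host<->device copies
assert np.allclose(out, a + b)
```

The same shape (a bounds guard plus an explicit grid/block launch) carries over directly to CUDA C++.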

2) What is “UDA” (Unified Device Architecture)?

A concept rather than a single product: a unified API that can target diverse accelerators (CPU, NVIDIA/AMD GPU, FPGA, DSP—and eventually QPU). Think of it as the portability layer that keeps app code stable while the backend changes.

Concrete realizations of “UDA-like” portability include:

  • SYCL / oneAPI (Khronos/Intel): C++ single-source kernels, backends for CPU, Level-Zero, CUDA, HIP.

  • OpenCL: cross-vendor compute API (lower-level).

  • HIP/ROCm (AMD): CUDA-like model for AMD GPUs, with some CUDA translation.

  • ML framework runtimes that dispatch to multiple backends.
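None of these is named "UDA", but they share the shape a UDA layer needs: stable application code in front, swappable backends behind. A hypothetical Python sketch of that shape, with entirely made-up names, might look like this:

```python
# Hypothetical UDA-style layer: one interface, swappable backends.
# None of these names come from a real library.
from typing import Any, Protocol

class Device(Protocol):
    name: str
    def submit(self, kernel: str, *args: Any) -> Any: ...

class CudaDevice:
    name = "cuda://gpu/0"
    def submit(self, kernel: str, *args: Any) -> Any:
        ...  # would dispatch to the CUDA runtime / cuBLAS et al.

class CpuDevice:
    name = "uda://cpu/0"
    def submit(self, kernel: str, *args: Any) -> Any:
        ...  # would dispatch to a thread pool / SIMD path

def run(device: Device, kernel: str, *args: Any) -> Any:
    # Application code only sees `Device`; the backend can change freely.
    return device.submit(kernel, *args)
```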

3) Why both matter to Laniakea OS

  • Performance now (CUDA): When NVIDIA GPUs are present, CUDA provides the fastest path.

  • Flexibility forever (UDA): A portability layer lets Laniakea schedule the same job on CPUs/GPUs/FPGAs—and future QPUs—without app rewrites.

  • Operations: A unified telemetry schema (utilization, memory, errors, energy) enables one scheduler to optimize placement across all devices.

4) Side-by-side comparison

| Axis | CUDA (NVIDIA) | UDA-style (e.g., SYCL/oneAPI/OpenCL/HIP wrappers) |
|---|---|---|
| Vendor scope | NVIDIA only | Multi-vendor / multi-device |
| Performance | Peak on NVIDIA GPUs via tuned libs | Competitive; depends on backend vendor libraries |
| Tooling | Nsight, nvprof, CUPTI, rich ecosystem | Improving (VTune/Advisor, Codeplay tools, ROCm tools), more variance |
| Portability | Low (CUDA-specific code) | High (single source, multiple targets) |
| Maintenance | Duplicate paths for non-NVIDIA | One codebase; backend adapters |
| Best use | NVIDIA-heavy fleets, latency-critical paths | Heterogeneous fleets, long-term portability, QPU runway |

5) Laniakea OS: runtime & telemetry blueprint

Scheduler responsibilities

  1. Discover devices (CPU/GPU/FPGA/QPU) → capabilities (SMs/CU, mem, drivers, features).

  2. Normalize telemetry (utilization, mem pressure, temp, power, errors, queue depth).

  3. Match workload → backend: dense_linear_algebra → CUDA/cuBLAS, bit-exact streaming → FPGA, combinatorial search → QPU (hybrid) + CPU (see the sketch below).
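A hedged sketch of step 3, with a static routing table keyed by the workload tags above (a real scheduler would also weigh live telemetry such as queue depth and free memory). The backend handles follow the scheme used later in this post:

```python
# Illustrative step-3 matcher: a static routing table keyed by workload tag,
# filtered by the devices actually discovered.
ROUTES = {
    "dense_linear_algebra": ["cuda://gpu/0", "hip://gpu/0", "uda://cpu/0"],
    "bit_exact_streaming":  ["ocl://fpga/0", "uda://cpu/0"],
    "combinatorial_search": ["qpu://provider/systemA", "uda://cpu/0"],
}

def match_backend(workload_tag: str, available: set[str]) -> str:
    for handle in ROUTES.get(workload_tag, ["uda://cpu/0"]):
        if handle in available:
            return handle
    raise RuntimeError(f"no backend available for {workload_tag}")

print(match_backend("dense_linear_algebra", {"uda://cpu/0", "cuda://gpu/0"}))
# -> cuda://gpu/0
```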

Unified telemetry schema (sample)

| Metric | CPU | GPU | FPGA | QPU (hybrid) | Notes |
|---|---|---|---|---|---|
| Utilization (%) | ✓ | ✓ | ✓ | ✓ | normalized per backend |
| Mem used / total | ✓ | ✓ (HBM/VRAM) | ✓ | ✓ (shots cache) | |
| Temperature (°C) | ✓ | ✓ | ✓ | — | QPU uses dilution-fridge metrics instead |
| Power (W) | ✓ | ✓ | ✓ | — | QPU has fridge power; abstract separately |
| Error counters | ✓ | ✓ (ECC) | ✓ | ✓ (T1/T2 drift, readout err) | |
| Queue depth | ✓ | ✓ | ✓ | ✓ | back-pressure signal |
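One way to realize this normalization, sketched with illustrative field and counter names: each backend adapter maps its native metrics onto the shared schema and reports None where a metric does not apply.

```python
# Illustrative normalization adapters: native counters in, shared schema out.
def normalize_gpu(raw: dict) -> dict:
    return {
        "utilization_pct": raw["gpu_util"],
        "mem_used": raw["vram_used"],
        "mem_total": raw["vram_total"],
        "temperature_c": raw["temp"],
        "power_w": raw["power"],
        "error_counters": {"ecc": raw.get("ecc_errors", 0)},
        "queue_depth": raw["pending_kernels"],
    }

def normalize_qpu(raw: dict) -> dict:
    return {
        "utilization_pct": raw["duty_cycle"],
        "mem_used": raw.get("shots_cache_used"),
        "mem_total": raw.get("shots_cache_total"),
        "temperature_c": None,  # dilution-fridge metrics tracked out of band
        "power_w": None,        # fridge power abstracted separately
        "error_counters": {"t1_t2_drift": raw["drift"],
                           "readout_err": raw["readout_err"]},
        "queue_depth": raw["queue_depth"],
    }
```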

6) “Workload → Accelerator” quick guide

| Workload pattern | Best default | Portable fallback |
|---|---|---|
| Dense BLAS (GEMM/conv) | CUDA (cuBLAS/cuDNN) | oneAPI MKL / ROCm rocBLAS / CPU MKL |
| Sparse / graph traversals | GPU (CUDA), or CPU if branchy | SYCL/OpenCL backends |
| Streaming bit-level pipelines | FPGA | CPU SIMD / GPU custom kernels |
| Combinatorial optimization / QAOA-like | Hybrid QPU + CPU/GPU | Classical heuristic meta-solvers |

Note: choose by measured performance; the rows above are starting heuristics only (see the sketch below).
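A minimal sketch of that empirical routing, assuming a hypothetical run_probe hook that executes a small representative slice of the job on a given backend:

```python
# Empirical routing sketch: time a small probe on each candidate, keep the winner.
import time

def pick_backend(candidates: list[str], run_probe) -> str:
    timings = {}
    for backend in candidates:
        t0 = time.perf_counter()
        run_probe(backend)       # hypothetical hook: runs a representative slice
        timings[backend] = time.perf_counter() - t0
    return min(timings, key=timings.get)
```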

7) The road to QPU APIs

| Layer | What it does | Today's analog | Laniakea plan |
|---|---|---|---|
| Circuit DSL | Express circuits/ansätze | OpenQASM, QIR | Ingest standard DSLs |
| IR & transpile | Map to hardware topology | MLIR-Quantum, QIR, t\|ket⟩ | |
| Runtime | Submit circuits/jobs; manage shots | Cloud runtimes (gRPC/REST) | Unified "DeviceQueue" API |
| Telemetry | Shots/s, queue depth, error rates, calib drift | Provider dashboards | Normalize to scheduler schema |
| Hybrid control | Classical loop around quantum calls | Orchestration SDKs | Built-in hybrid tasks (async futures) |
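To show how the hybrid-control layer could feel to application code, here is a runnable sketch of a classical loop around async quantum submissions. FakeDeviceQueue stands in for the unified "DeviceQueue" API named above; its submit signature and the update rule are assumptions, not a published interface.

```python
# Runnable hybrid-loop sketch; FakeDeviceQueue stands in for a real provider.
import asyncio

class FakeDeviceQueue:
    async def submit(self, circuit: str, params: list[float], shots: int) -> dict:
        await asyncio.sleep(0)                   # pretend to enqueue remotely
        return {"expectation": sum(params)}      # pretend measurement statistics

async def hybrid_loop(queue, circuit: str, params: list[float], steps: int = 5):
    for _ in range(steps):
        result = await queue.submit(circuit, params, shots=1000)    # quantum call
        params = [p - 0.1 * result["expectation"] for p in params]  # classical step
    return params

print(asyncio.run(hybrid_loop(FakeDeviceQueue(), "qaoa_ansatz", [0.3, 0.7])))
```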

8) Example graph (copy-paste ASCII)

Relative throughput (illustrative) across backends. Baseline CPU = 1.0; higher is better (not real benchmarks).

Vector Math (BLAS1)
CPU      |████████████ (1.0)
CUDA     |██████████████████████████████ (8.0)
ROCm/HIP |██████████████████████████ (6.5)
FPGA     |███████████ (3.0)
QPU(h)   |██ (0.2)  [classical task; QPU not ideal]

Tensor Convolution
CPU      |████████████ (1.0)
CUDA     |██████████████████████████████████████ (10.0)
ROCm/HIP |███████████████████████████████ (7.5)
FPGA     |██████████████ (4.0)
QPU(h)   |██ (0.2)

Graph Traversal (irregular)
CPU      |████████████ (1.0)
CUDA     |████████████████████ (3.5)
ROCm/HIP |██████████████████ (3.0)
FPGA     |███████████████ (3.8)
QPU(h)   |███ (0.3)
Note: Values are illustrative to visualize trade-offs; always measure on your hardware.

9) Two tables you can paste into docs

A) Device inventory → backend handle

| Device type | Probe | Backend handle | Notes |
|---|---|---|---|
| CPU | /proc/cpuinfo, CPUID, sysfs | uda://cpu/0 | NUMA & SIMD flags |
| NVIDIA GPU | CUDA runtime cudaGetDeviceProperties | cuda://gpu/0 | SMs, HBM size |
| AMD GPU | ROCm SMI / HIP | hip://gpu/0 | CUs, HBM size |
| FPGA | OpenCL / vendor SMI | ocl://fpga/0 | bitstream ID |
| QPU | Provider gRPC/REST | qpu://provider/systemA | topology, calib time |
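A sketch of the probe step in code: the nvidia-smi and rocm-smi commands are real, but the handle scheme and the shortcuts taken here (e.g., not enumerating AMD devices) are illustrative.

```python
# Illustrative discovery pass: real probes, made-up handle scheme.
import shutil
import subprocess

def discover() -> list[str]:
    handles = ["uda://cpu/0"]                       # a CPU is always present
    if shutil.which("nvidia-smi"):                  # NVIDIA driver installed?
        listing = subprocess.run(["nvidia-smi", "-L"],
                                 capture_output=True, text=True).stdout
        handles += [f"cuda://gpu/{i}" for i in range(len(listing.splitlines()))]
    if shutil.which("rocm-smi"):                    # ROCm stack installed?
        handles.append("hip://gpu/0")               # real code would enumerate devices
    return handles

print(discover())
```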

B) Scheduler decision hints

| Hint | Source | Effect |
|---|---|---|
| job.tags includes "dense-blas" | user/job meta | prefer CUDA/ROCm |
| power_cap < X | energy policy | prefer CPU/FPGA |
| qpu.calib_age > threshold | QPU telemetry | delay quantum jobs |
| gpu.mem_free < model_mem | GPU telemetry | spill or split batch |
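Expressed as code, each hint becomes a predicate over job metadata and telemetry, paired with an effect on the candidate list. A sketch with placeholder key names and thresholds:

```python
# Hint evaluation sketch; thresholds and key names are placeholders.
def apply_hints(job: dict, telemetry: dict, candidates: list[str]) -> list[str]:
    if "dense-blas" in job.get("tags", []):
        # prefer CUDA/ROCm: stable-sort GPU handles to the front
        candidates.sort(key=lambda h: 0 if h.startswith(("cuda", "hip")) else 1)
    if telemetry.get("qpu.calib_age", 0) > 3600:             # stale calibration
        candidates = [h for h in candidates if not h.startswith("qpu")]
    if telemetry.get("gpu.mem_free", float("inf")) < job.get("model_mem", 0):
        job["split_batch"] = True                            # spill or split batch
    return candidates
```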


10) Practical guidance for Laniakea OS

  • Support CUDA directly where present; keep UDA-style path for everything else.

  • Unify telemetry early; scheduling wins come from shared metrics.

  • Abstract jobs (not devices): express intent (dense, sparse, bit-level, hybrid-quantum).

  • Prepare for QPUs via a clean async “submit/await” interface and hybrid loops.

  • Continuously benchmark; keep routing decisions empirical, not dogmatic.

11) One-page glossary

  • CUDA: NVIDIA platform for general GPU compute.

  • UDA (concept): one API for many accelerators.

  • SYCL/oneAPI: portable C++ single-source model targeting multiple backends.

  • HIP/ROCm: AMD’s CUDA-like stack.

  • QPU: Quantum Processing Unit (gate-model/annealer).

  • Hybrid: classical control loop + quantum subroutines.

  • Telemetry: normalized device metrics for scheduling.


12) Appendix: quick-reference tables

CUDA vs. UDA at a glance

| Topic | CUDA (Compute Unified Device Architecture) | UDA (Unified Device Architecture, vendor-agnostic idea) |
|---|---|---|
| Scope | NVIDIA GPUs only | Abstracts CPUs/GPUs/FPGAs/accelerators (potentially QPUs) |
| Programming model | SIMT, CUDA C/C++, kernels, grids/blocks/threads | Portable device model; maps to CUDA, HIP, OpenCL, SYCL, FPGA HDL, future QPU APIs |
| Tooling | nvcc, Nsight Compute/Systems, cuBLAS/cuDNN/cuFFT | Backend adapters; shared runtime; capability discovery |
| Strengths | Deep libraries + mature perf | Flexibility across vendors and form factors |
| Risks | Lock-in, single vendor | Lowest-common-denominator unless backends expose extensions |

Unified Telemetry (minimal schema)

| Field | Type | Example |
|---|---|---|
| job_id | string | wkld-2025-09-17-001 |
| backend | enum | `cpu \| gpu \| fpga \| qpu` |
| device_id | string | GPU0-AD102 |
| ts_start, ts_end | ISO 8601 | 2025-09-17T18:22:03Z |
| workload | string | tensor_convolution |
| size_hint | string | N=8192, K=1024 |
| throughput_x | float | 10.0 |
| latency_ms | float | 14.2 |
| energy_j | float | 5.8 |
| errors | array | [] |
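Written as a record type, the schema gives producers and the scheduler one set of field names and types to agree on. This is a direct transcription of the table, not an existing Laniakea type:

```python
# Direct transcription of the minimal schema (illustrative, not an existing type).
from dataclasses import dataclass, field

@dataclass
class TelemetryRecord:
    job_id: str          # e.g. "wkld-2025-09-17-001"
    backend: str         # "cpu" | "gpu" | "fpga" | "qpu"
    device_id: str       # e.g. "GPU0-AD102"
    ts_start: str        # ISO 8601, e.g. "2025-09-17T18:22:03Z"
    ts_end: str
    workload: str        # e.g. "tensor_convolution"
    size_hint: str       # e.g. "N=8192, K=1024"
    throughput_x: float  # relative to CPU baseline = 1.0
    latency_ms: float
    energy_j: float
    errors: list = field(default_factory=list)
```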

Workload → Accelerator Quick Guide

| Workload | Best | Also viable |
|---|---|---|
| Dense linear algebra (BLAS/GEMM) | CUDA/ROCm | FPGA (pipelined), CPU (MKL/BLIS) |
| Convolutions | CUDA/ROCm (cuDNN/MIOpen) | FPGA (streaming) |
| Irregular graphs | FPGA ≈ CUDA/ROCm | CPU (NUMA-aware) |
| Low-latency control | CPU/FPGA | |
| Combinatorial search | QPU-hybrid (future) | CPU/FPGA heuristics |

Device Inventory → Backend Handle

| Device | Example handle |
|---|---|
| CPU | cpu:0 |
| NVIDIA | cuda:0 |
| AMD | rocm:1 |
| FPGA | fpga:xcvu9p-0 |
| QPU (remote) | qpu:rigetti:Ankaa-9Q |

Scheduler Decision Hints

| Signal | Direction | Use |
|---|---|---|
| throughput_x | higher is better | primary for batch |
| latency_ms | lower is better | interactive jobs |
| energy_j | lower is better | mobile/edge |
| queue_depth | lower is better | avoid contention |

Road to QPU APIs (layers)

| Layer | Purpose |
|---|---|
| High-level SDK | circuits, annealing, hybrid loop |
| IR (QIR/OpenQASM) | portable quantum program |
| Runtime/Orchestrator | qubit map, transpile, error mitigation |
| Hardware Driver | pulses, calibration, job submit |

| CUDA (what it is) | Notes |
|---|---|
| NVIDIA parallel platform & API | Kernels on GPU SMs, libraries (cuBLAS/cuDNN) |
| Memory model | Unified Virtual Addressing, streams, events |
| Why it matters | Peak throughput for dense numeric workloads |

