White uses constexpr functions and non-type template parameters to write a parallel operation once and run it on both the CPU and the GPU. Because the operation is resolved at compile time, the compiler selects the host or device version, and the CPU code needs no CUDA keywords.
Deep Dive
Lightning Talk: Crafting CUDA Compatible C++ Code - Jon White - CppCon 2025
attending any conference, it's incredibly important to be there. That's kind of the only way to really dedicate your time and be immersed in the whole thing and not distracted by other stuff going on, even if you do get distracted with interesting conversations in the hallway.
>> Hi, my name is Jon and I'm going to be talking about crafting CUDA compatible C++ code.
So basically the problem is I'm trying to write a parallel math library that needs to run on CPU and GPU, but I don't want to write anything twice and I don't want CUDA features in my C++ code, the CPU code. Just as a simple example of an operation you want to parallelize: single-precision a times x plus y (SAXPY). It's an embarrassingly parallel problem because every index is independent of all the other indices. If you want to parallelize this on a CPU, you get your vector of threads and then you distribute the work to each of the threads. Or if you want to parallelize it on a GPU, I've implemented a grid-stride loop in a CUDA kernel.
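The slides are not part of this transcript, so the following is only a minimal sketch of the two single-purpose versions described above. The names saxpy_cpu and saxpy_gpu, the default thread count, and the chunking scheme are assumptions for illustration, not the speaker's code. The CUDA kernel is guarded by __CUDACC__ so the block also builds as plain host C++.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Single-purpose CPU version (hypothetical name): the arithmetic a*x + y
// and the threading strategy are written together in one function.
void saxpy_cpu(float a, const float* x, float* y, std::size_t n,
               unsigned num_threads = 4) {
    if (num_threads == 0) num_threads = 1;
    std::vector<std::thread> workers;
    // Split [0, n) into contiguous chunks, one per thread.
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // no work left for the remaining threads
        workers.emplace_back([=] {
            for (std::size_t i = begin; i < end; ++i) y[i] = a * x[i] + y[i];
        });
    }
    for (auto& w : workers) w.join();
}

#ifdef __CUDACC__
// Single-purpose GPU version (hypothetical name): the same arithmetic,
// repeated inside a grid-stride loop. Only compiled when built with nvcc.
__global__ void saxpy_gpu(float a, const float* x, float* y, std::size_t n) {
    const std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
    for (std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        y[i] = a * x[i] + y[i];
    }
}
#endif
```

Note that the expression a * x[i] + y[i] appears twice, once per parallelization method, which is the duplication the talk sets out to remove.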
But the issue is those were both single-purpose functions: you had to write both the operation and the parallelization method every time. Ideally we want to separate concerns into the operation and the parallelization method.
So step one: obligatory constexpr all the things.
The reason this is important is that with NVCC, the NVIDIA compiler, you can pass the experimental relaxed constexpr flag, and that allows all of your constexpr functions to be used on both the host and the device. You write it once and you don't actually have to use any CUDA keywords.
So, if you can read that: up at the top we have the single-precision a times x plus y as a constexpr function, and that is being called from both of the parallelization methods, the one that's doing the CPU threading and the one that's doing the CUDA kernel.
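As a rough sketch of step one, the arithmetic can be pulled into one constexpr function that both parallelization methods call. The function names here are assumptions, and the kernel path assumes the file is built with nvcc and its --expt-relaxed-constexpr option, which is the flag the speaker refers to.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// The operation, written once, with no __host__ or __device__ annotations.
constexpr float saxpy(float a, float x, float y) { return a * x + y; }

// Checked by the compiler: 2 * 3 + 1 == 7.
static_assert(saxpy(2.0f, 3.0f, 1.0f) == 7.0f, "saxpy is usable at compile time");

// CPU parallelization method (hypothetical name). It is still tied to
// SAXPY, but the arithmetic now lives in saxpy().
void saxpy_threads(float a, const float* x, float* y, std::size_t n,
                   unsigned num_threads = 4) {
    if (num_threads == 0) num_threads = 1;
    std::vector<std::thread> workers;
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;
        workers.emplace_back([=] {
            for (std::size_t i = begin; i < end; ++i) y[i] = saxpy(a, x[i], y[i]);
        });
    }
    for (auto& w : workers) w.join();
}

#ifdef __CUDACC__
// GPU parallelization method (hypothetical name). Calling the unannotated
// constexpr function from device code is assumed to rely on
// nvcc --expt-relaxed-constexpr.
__global__ void saxpy_kernel(float a, const float* x, float* y, std::size_t n) {
    const std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
    for (std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        y[i] = saxpy(a, x[i], y[i]);
    }
}
#endif
```

The operation is now written once, but each parallelization method still names saxpy directly, so a new operation would still need a new pair of functions.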
So step two: basically you want to follow the example of the STL and pass in an operation instead of having to call it from each of your parallelization methods. We're going to pass the operation as a runtime parameter, and the parallelization method is going to be parameterized on the type of the operation.
So yeah, now we're able to pass in the SAXPY op to both of our parallelization methods.
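Step two might look roughly like the sketch below, where the operation arrives as an ordinary function argument and the parallelization method is a template over its type. The names parallel_threads and parallel_kernel are hypothetical, and the assumption that the operation is passed as a plain function (so Op deduces to a function pointer type) is inferred from the talk, not confirmed by it.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

constexpr float saxpy(float a, float x, float y) { return a * x + y; }

// Generic CPU parallelization method (hypothetical name). The operation is
// a runtime argument; only its type is a template parameter.
template <typename Op>
void parallel_threads(Op op, float a, const float* x, float* y, std::size_t n,
                      unsigned num_threads = 4) {
    if (num_threads == 0) num_threads = 1;
    std::vector<std::thread> workers;
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;
        workers.emplace_back([=] {
            for (std::size_t i = begin; i < end; ++i) y[i] = op(a, x[i], y[i]);
        });
    }
    for (auto& w : workers) w.join();
}

#ifdef __CUDACC__
// Generic GPU parallelization method (hypothetical name). When op is a
// function pointer, its value is whatever address the host code passed in
// at launch time.
template <typename Op>
__global__ void parallel_kernel(Op op, float a, const float* x, float* y,
                                std::size_t n) {
    const std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
    for (std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        y[i] = op(a, x[i], y[i]);
    }
}
#endif
```

A host call such as parallel_threads(saxpy, 2.0f, x, y, n) behaves as expected in this sketch. The kernel variant is the part the talk goes on to examine.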
And so now we have separation of concerns and everything is only written once, except that this doesn't work. So, anyone know what's wrong with this?
So the problem is that the operation kernel is being passed at runtime, and so it doesn't actually correctly resolve as the host or device version of it.
So the CUDA version is actually getting the host version of the function.
And so the CUDA kernel is going to silently fail when you try to call it.
So, actual final step: make the operation a non-type template parameter. That's going to force resolution at compile time, and so the host code gets the host version and the CUDA code gets the CUDA version. So this is what that looks like. It's still just a constexpr function defining the operation, but now we're passing in the operation as the template parameter, not as one of the runtime parameters. And so this works now.
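A minimal sketch of this final step, under the same assumptions as before (hypothetical names, function-pointer operation, nvcc with --expt-relaxed-constexpr for the kernel path), moves the operation from the argument list into the template parameter list. The alias TernaryOp is introduced here only for readability; in C++17 and later, template <auto Op> would be another way to spell it.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

constexpr float saxpy(float a, float x, float y) { return a * x + y; }

// Function pointer type of the operation (alias name is an assumption).
using TernaryOp = float (*)(float, float, float);

// The operation is a non-type template parameter, so each instantiation
// names a specific function at compile time.
template <TernaryOp Op>
void parallel_threads(float a, const float* x, float* y, std::size_t n,
                      unsigned num_threads = 4) {
    if (num_threads == 0) num_threads = 1;
    std::vector<std::thread> workers;
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;
        workers.emplace_back([=] {
            for (std::size_t i = begin; i < end; ++i) y[i] = Op(a, x[i], y[i]);
        });
    }
    for (auto& w : workers) w.join();
}

#ifdef __CUDACC__
// Same idea on the GPU: Op is fixed when the kernel template is
// instantiated, so no function address crosses the host/device boundary
// as a launch argument.
template <TernaryOp Op>
__global__ void parallel_kernel(float a, const float* x, float* y, std::size_t n) {
    const std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
    for (std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        y[i] = Op(a, x[i], y[i]);
    }
}
#endif
```

On the host, a call would look like parallel_threads<saxpy>(2.0f, x, y, n). Under nvcc, a launch might be written as parallel_kernel<saxpy><<<blocks, threads>>>(2.0f, d_x, d_y, n), where blocks, threads, d_x, and d_y are placeholders for a launch configuration and device pointers that the transcript does not specify.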
We did it. We now have a way of executing parallel operations that work on either the CPU or the GPU without having to write anything twice and without using CUDA in code that's meant for the CPU.
Thank you. [applause]