
Lightning Talk: Crafting CUDA Compatible C++ Code - Jon White - CppCon 2025
Added: 2026-05-23

CppCon | Original Release: 2026-05-22

White combines constexpr functions, NVCC's relaxed-constexpr mode, and a non-type template parameter so that a parallel operation is written once and runs on both CPU and GPU. Because the operation is resolved at compile time, host code gets the host version and device code gets the device version, with no CUDA keywords in the CPU code.

[00:00:00]Attending any conference, it's incredibly important to be there. That's kind of the only way to really dedicate your time and be immersed in the whole thing, and not be distracted by other stuff going on, even if you do get distracted by interesting conversations in the hallway. [music]

[00:00:30]>> Hi, my name is Jon and I'm going to be talking about crafting CUDA-compatible C++ code.

[00:00:36]So basically the problem is I'm trying to write a parallel math library that needs to run on CPU and GPU, but I don't want to write anything twice and I don't want CUDA features in my C++ code, my CPU code. Just as a simple example of an operation you want to parallelize: single-precision a x plus y (SAXPY). It's an embarrassingly parallel problem because every index is independent of all the other indices. So if you want to parallelize this on a CPU, you get your vector of threads and then you distribute the work to each of the threads. Or if you want to parallelize it on a GPU, I've implemented a grid-stride loop in a CUDA kernel.
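The slides are not part of this transcript, so the following is only a minimal sketch of what the two single-purpose versions might look like. The names saxpy_cpu_threads and saxpy_kernel, the chunking scheme, and the default thread count are all assumptions made for illustration. The CUDA kernel is guarded so the file still builds with an ordinary C++ compiler.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Single-purpose CPU version (hypothetical name): the operation a * x[i] + y[i]
// and the threading strategy are tangled together in one function.
void saxpy_cpu_threads(float a, const std::vector<float>& x,
                       std::vector<float>& y, unsigned num_threads = 4) {
    const std::size_t n = std::min(x.size(), y.size());
    if (num_threads == 0) num_threads = 1;
    // Each thread gets one contiguous chunk of indices.
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // no work left for this or later threads
        workers.emplace_back([=, &x, &y] {
            for (std::size_t i = begin; i < end; ++i) y[i] = a * x[i] + y[i];
        });
    }
    for (auto& w : workers) w.join();
}

#ifdef __CUDACC__
// Single-purpose GPU version (hypothetical name), compiled only by nvcc.
// Grid-stride loop: each thread starts at its global index and advances by
// the total number of threads in the grid.
__global__ void saxpy_kernel(float a, const float* x, float* y, std::size_t n) {
    const std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
    for (std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        y[i] = a * x[i] + y[i];
    }
}
#endif
```

Note that the expression a * x[i] + y[i] appears twice, once per parallelization method, which is the duplication the talk sets out to remove.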

[00:01:23]But the issue is that those were both single-purpose functions: you had to write both the operation and the parallelization method every time. Ideally we want to separate concerns into the operation and the parallelization method.

[00:01:42]So, step one: obligatory constexpr all the things.

[00:01:47]The reason this is important is that with NVCC, the NVIDIA compiler, you can pass the --expt-relaxed-constexpr flag, and that allows all of your constexpr functions to be used on both the host and the device. You write it once, and you don't actually have to use any CUDA keywords.

[00:02:13]So, if you can read that: up at the top we have the single-precision a x plus y as a constexpr function, and that is being called from both of the parallelization methods, the one that's doing the CPU threading and the one that's doing the CUDA kernel.
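As a rough sketch of step one (not the speaker's slide), the operation can be pulled into a constexpr function that both parallelization methods call by name. The names saxpy_op, saxpy_cpu, and saxpy_gpu and the two-thread split are assumptions for illustration. The device side relies on building with nvcc and --expt-relaxed-constexpr, as described in the talk.

```cpp
#include <cstddef>
#include <thread>

// Written once, with no CUDA keywords. With nvcc's relaxed-constexpr mode,
// a constexpr function like this can be called from host and device code.
constexpr float saxpy_op(float a, float x, float y) { return a * x + y; }

// Sanity check evaluated at compile time by any C++11 compiler.
static_assert(saxpy_op(2.0f, 3.0f, 1.0f) == 7.0f, "saxpy_op(2, 3, 1) is 7");

// Hypothetical CPU driver: two threads, each handling half of the range.
void saxpy_cpu(float a, const float* x, float* y, std::size_t n) {
    const std::size_t mid = n / 2;
    auto work = [=](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) y[i] = saxpy_op(a, x[i], y[i]);
    };
    std::thread first_half(work, std::size_t{0}, mid);
    work(mid, n);  // the calling thread handles the second half
    first_half.join();
}

#ifdef __CUDACC__
// Hypothetical GPU driver, compiled only by nvcc: same operation, same name.
__global__ void saxpy_gpu(float a, const float* x, float* y, std::size_t n) {
    const std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
    for (std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        y[i] = saxpy_op(a, x[i], y[i]);
    }
}
#endif
```

The operation is now written once, but each driver still hard-codes which operation it runs.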

[00:02:38]Step two: basically you want to follow the example of the STL and pass in an operation instead of having to call it from each of your parallelization methods. So we're going to pass the operation as a runtime parameter, and the parallelization method is going to be parameterized on the type of the operation.

[00:03:04]So yeah, now we're able to pass in the SAXPY op to both of our parallelization methods.

[00:03:14]And so now we have separation of concerns and everything is only written once, except that this doesn't work. So, anyone know what's wrong with this?

[00:03:29]So the problem is that the operation is being passed at runtime, and so it doesn't actually correctly resolve as the host or device version of it.

[00:03:40]So the CUDA version is actually getting the host version of the function.

[00:03:45]And so the CUDA kernel is going to silently fail when you try to call it.
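A sketch of step two and its pitfall, under the assumption that the operation is passed by name, as one would pass a function to an STL algorithm. The names cpu_parallel and gpu_parallel are hypothetical. On the CPU this pattern is fine; the comment on the guarded kernel marks where, per the talk, the GPU path goes wrong.

```cpp
#include <cstddef>
#include <thread>

constexpr float saxpy_op(float a, float x, float y) { return a * x + y; }

// Parameterized on the operation's type; the operation itself arrives as a
// runtime argument. Passing saxpy_op by name deduces Op as a function pointer.
template <typename Op>
void cpu_parallel(Op op, float a, const float* x, float* y, std::size_t n) {
    const std::size_t mid = n / 2;
    auto work = [=](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) y[i] = op(a, x[i], y[i]);
    };
    std::thread first_half(work, std::size_t{0}, mid);
    work(mid, n);  // the calling thread handles the second half
    first_half.join();
}

#ifdef __CUDACC__
// Same shape on the GPU, compiled only by nvcc. If the host launches this
// with saxpy_op as the argument, the pointer value is taken in host code, so
// the kernel receives the address of the host version of the function.
// Calling through that address on the device is not valid.
template <typename Op>
__global__ void gpu_parallel(Op op, float a, const float* x, float* y, std::size_t n) {
    const std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
    for (std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        y[i] = op(a, x[i], y[i]);
    }
}
#endif
```

The template parameter here only fixes the type of the operation (a pointer to a function with this signature), not which function it points to, so the compiler cannot pick the device version for the kernel.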

[00:03:50]So, actual final step: make the operation a non-type template parameter. That's going to force resolution at compile time, and so the host code gets the host version and the CUDA code gets the CUDA version. So this is what that looks like. Still just a constexpr function defining the operation, but now we're passing in the operation as the template parameter, not as one of the runtime parameters. And so this works now.
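A minimal sketch of the final step, assuming the operation has a fixed three-float signature. The alias ElementOp and the names cpu_parallel and gpu_parallel are hypothetical, and the speaker's actual declaration may differ (for example, C++17 allows template <auto Op> instead of a named pointer type).

```cpp
#include <cstddef>
#include <thread>

constexpr float saxpy_op(float a, float x, float y) { return a * x + y; }

// Hypothetical alias for the signature every element-wise operation shares.
using ElementOp = float (*)(float, float, float);

// The operation is a non-type template parameter, so which function gets
// called is fixed at compile time rather than carried in a runtime pointer.
template <ElementOp Op>
void cpu_parallel(float a, const float* x, float* y, std::size_t n) {
    const std::size_t mid = n / 2;
    auto work = [=](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) y[i] = Op(a, x[i], y[i]);
    };
    std::thread first_half(work, std::size_t{0}, mid);
    work(mid, n);  // the calling thread handles the second half
    first_half.join();
}

#ifdef __CUDACC__
// Compiled only by nvcc (with --expt-relaxed-constexpr). Op is named inside
// device code, so the compiler resolves the call there instead of relying on
// an address taken on the host.
template <ElementOp Op>
__global__ void gpu_parallel(float a, const float* x, float* y, std::size_t n) {
    const std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
    for (std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        y[i] = Op(a, x[i], y[i]);
    }
}
#endif
```

On the host the call site becomes cpu_parallel<saxpy_op>(a, x, y, n); the kernel would be launched the same way, with saxpy_op in the angle brackets rather than in the argument list.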

[00:04:27]We did it. We now have a way of executing parallel operations that work on either the CPU or the GPU without having to write anything twice and without using CUDA in code that's meant for the CPU.

[00:04:41]Thank you. [applause]

#shorts #code #gpu #CUDA #c++ code
