CUDA series: Part 4 — CUDA Stream
Continuing from the previous part, which covered CUDA memory layout and the coalescing technique, let's move on to the feature that allows us to run several kernels asynchronously.
What is CUDA Stream
A CUDA stream is an ordered queue of operations. Operations in the same stream execute in issue order. Operations in different streams may run concurrently if the hardware allows it (for example, a copy engine is available) and there are no dependencies between the streams.
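As a minimal sketch of the idea (kernelA, kernelB, their arguments d_a and d_b, and the launch configuration grid and block are placeholders, not code from this post), creating two streams and issuing work into them looks like this:
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
kernelA<<<grid, block, 0, stream1>>>(d_a);  // issued into stream1
kernelB<<<grid, block, 0, stream2>>>(d_b);  // issued into stream2; may overlap with kernelA
cudaDeviceSynchronize();                    // wait for all work on the device to finish
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
The fourth launch parameter in <<<grid, block, 0, stream1>>> selects the stream; if you omit it, the work goes to the default stream.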
Note: In CUDA, stream 0, the default stream, has special synchronization behavior unless you change it. Its default setting is the “Legacy default stream”; the other setting is the “Per-thread default stream”. You can read more here and here. Feel free to skip this note for now and come back to it after finishing the whole post.
Synchronous and Asynchronous Operations
Synchronous operations block the CPU until the operation completes; the next line of code executes only after the current operation finishes. An example is cudaMemcpy(...): the copy to the GPU must complete before the CPU can continue with the remaining work.
Asynchronous operations do not block the CPU. They return control to the CPU immediately, so the operation continues on the GPU while the CPU does other work. An example is a kernel launch: after the call, the kernel runs on the GPU while the CPU continues with the remaining work.
Let's see an example in pseudocode.
cudaMemcpy(...);
myCpuFunction(b);
increment<<<1,N>>>(...);
myCpuFunction(c);
cudaMemcpy(...);
To optimize this code, you would need streams and asynchronous memory operations so that increment can run on the GPU while myCpuFunction(c) runs on the CPU at the same time.
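For instance, here is a hedged sketch of how the pseudocode above could be restructured (stream1, the buffers d_a and h_a, and bytes are placeholder names; the async copies also require pinned host memory, as explained in the next section):
cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream1); // returns immediately
myCpuFunction(b);                                                  // overlaps with the copy
increment<<<1, N, 0, stream1>>>(d_a);                              // queued behind the copy in stream1
myCpuFunction(c);                                                  // overlaps with the kernel
cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, stream1); // queued behind the kernel
cudaStreamSynchronize(stream1);                                    // wait for the GPU before using h_a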
CUDA streams enable asynchronous operations, and therefore better resource utilization: you can run work on the CPU and on multiple GPU streams simultaneously, which is called heterogeneous computing.
Before going to a full code example, let’s discuss cudaMemcpy(...)
first.
Asynchronous Memory Copying
cudaMemcpy() is synchronous by itself: it returns control to the CPU only after the copy completes. In other words, it blocks the CPU until the copy is done.
To make a memory copy asynchronous, use cudaMemcpyAsync(...) instead. This function returns control to the CPU while the copy is still under way. However, there are two requirements for using it.
- It only works on page-locked (pinned) host memory. If you allocate host memory as pageable memory, cudaMemcpyAsync(...) falls back to synchronous behavior. You can read more about Pageable and Pinned Host Memory in my previous blog, but in brief, pinned memory is required because pageable memory might be swapped to disk or moved by the OS, which makes safe asynchronous access impossible.
- A stream must be provided, for example cudaMemcpyAsync(..., stream1) (see the sketch after this list).
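Putting the two requirements together, a small sketch might look like this (h_buf, d_buf, and bytes are placeholder names):
float *h_buf, *d_buf;
cudaMallocHost((void**)&h_buf, bytes);   // pinned (page-locked) host allocation
cudaMalloc((void**)&d_buf, bytes);
cudaStream_t stream1;
cudaStreamCreate(&stream1);
cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream1); // returns to the CPU right away
// ... the CPU can do other work here while the copy is in flight ...
cudaStreamSynchronize(stream1);          // wait until the copy has finished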
Example of Asynchronous Program
Let's go through a few examples to make sure you understand how streams work before continuing. How long does each code snippet take, assuming that:
- Each operation takes exactly 1 unit of time.
- All operations overlap with perfect concurrency.
- Our GPU has an unlimited number of Host-to-Device and Device-to-Host copy engines.
You may wonder why there are so many idealized assumptions; they simply make the concept easiest to grasp first.
First Snippet
cudaMemcpyAsync(..., stream1);
kernel<<<..., stream2>>>(...);
cudaMemcpyAsync(..., stream3);
This snippet takes 1 unit since all operations run simultaneously.
=========
|memcpy| <- Stream 1
=========
|kernel| <- Stream 2
=========
|memcpy| <- Stream 3
=========
<-1 Unit->
Second Snippet
cudaMemcpyAsync(..., stream1);
kernel<<<..., stream2>>>(...);
cudaStreamSynchronize(stream2);
cudaMemcpyAsync(..., stream3);
cudaStreamSynchronize(stream) waits only for the operations in that stream to complete; the CPU blocks until everything in stream has finished. So this snippet takes 2 units: cudaMemcpyAsync(..., stream1); and kernel<<<..., stream2>>>(...); run simultaneously, and cudaMemcpyAsync(..., stream3); runs only after the first two operations finish.
=========
|memcpy| <- Stream 1
=========
|kernel||<- Barrier from cudaStreamSynchronize(stream2); <- Stream 2
===================
|memcpy| <- Stream 3
=========
<-1 Unit-><-1 Unit-> = 2 units
Third Snippet
cudaMemcpyAsync(..., stream1);
kernel<<<..., stream1>>>(...);
cudaMemcpyAsync(..., stream2);
This snippet takes 2 units. cudaMemcpyAsync(..., stream1); and kernel<<<..., stream1>>>(...); are in the same stream, so they run sequentially. Meanwhile, cudaMemcpyAsync(..., stream2); starts at the same time as cudaMemcpyAsync(..., stream1);.
==================
|memcpy||kernel| <- Stream 1
=========
|memcpy| <- Stream 2
=========
<-1 Unit-><-1 Unit->
Fourth Snippet
cudaMemcpyAsync(..., stream1);
kernel<<<..., stream2>>>(...);
cudaMemcpyAsync(..., stream3);
cudaDeviceSynchronize();
This snippet takes 1 unit. cudaDeviceSynchronize() waits for ALL operations on the entire GPU device to complete: it blocks until every stream, every kernel, and every memory transfer is done. Since the three operations above run simultaneously and each takes 1 unit, the synchronization adds no extra time.
=========
|memcpy|| <- Stream 1
=========
|kernel|| <- Stream 2
=========
|memcpy|| <- Stream 3
=========
<-1 Unit->
Fifth Snippet
cudaMemcpyAsync(..., stream1);
kernel <<<..., stream2>>>(...);
cudaStreamSynchronize(stream6);
cudaMemcpyAsync(..., stream3);
This snippet takes 1 unit. cudaStreamSynchronize(stream6); blocks the CPU until all operations in stream6 finish; however, stream6 has nothing queued, so the call returns immediately and all remaining commands can run simultaneously.
=========
|memcpy| <- Stream 1
=========
|kernel| <- Stream 2
=========
|memcpy| <- Stream 3
=========
<-1 Unit->
Now, let's work through a full example of writing a heterogeneous computing program.
The Example of Heterogeneous Computing Program
Unlike the previous parts, this one provides the full code at the end of the section. We are going to develop the program from the bottom up together.
First, what do we want from this heterogeneous computing program? We want one piece of work running on the CPU and another running on the GPU, and we want the two to run simultaneously.
In a synchronous approach, the CPU waits until the GPU finishes its work before starting the CPU work; in an asynchronous approach, the CPU and GPU can work at the same time.
To achieve this, is there any restriction on the CPU and GPU work? The CPU work must not need data from the GPU work. In other words, the GPU and CPU must run independent tasks.
Designing Workload
So, let’s design work based on this requirement.
- GPU work: process 262144 (2^18, or 1 << 18) numbers, with intensive math repeated 1 million times per number.
__global__ void gpu_intensive_work(float *data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
float result = data[idx];
for (int i = 0; i < 1000000; ++i) {
result = sinf(result) * cosf(result) + sqrtf(fabsf(result));
}
data[idx] = result;
}
}
- CPU work: Calculate a sum using 100 million sin/cos operations.
float cpu_intensive_work() {
float sum = 0;
for (int i = 0; i < 100000000; ++i) {
sum += sinf(i * 0.001f) * cosf(i * 0.001f);
}
return sum;
}
Both are independent, intensive pieces of work that take a significant amount of time.
Then we can start writing the main function.
int main() {
const int n = 1 << 18;
const int bytes = n * sizeof(float);
// Allocate memory
float *h_data, *d_data;
cudaMallocHost((void**)&h_data, bytes); // Use Pinned Host Memory Allocation to allocate data on the host!
cudaMalloc((void**)&d_data, bytes);
// Initialize data
for (int i = 0; i < n; ++i) {
h_data[i] = i * 0.01f;
}
// ... To be done later
}
Synchronous Version
Let's take a small detour first. What should we do if we want a SYNCHRONOUS program? Understanding this will help you write the asynchronous one. So, trust me! Do this first!
- Use cudaMemcpy to copy data from the host to the device.
- Run the gpu_intensive_work kernel.
- Run cudaDeviceSynchronize(); to wait until the gpu_intensive_work kernel finishes. Reminder: a kernel launch immediately returns control to the CPU, so we need cudaDeviceSynchronize(); if we want the GPU to finish its kernel work before the CPU continues.
- Run cudaMemcpy to copy data back from the device to the host.
- Run cpu_intensive_work();
We will also time this. Our expectation is that GPU working time + CPU working time = Overall working time, because the CPU waits until the GPU finishes before doing its own work.
For CPU timing, you can use std::chrono::high_resolution_clock::now(). It provides wall-clock time: the actual time that passes on a clock, like timing with a stopwatch, i.e. the “real world” duration. So, to time the whole run, you can do this.
auto total_start_sync = std::chrono::high_resolution_clock::now();
// Do step 1 to 4
auto cpu_start_sync = std::chrono::high_resolution_clock::now();
// Do step 5
auto cpu_end_sync = std::chrono::high_resolution_clock::now();
auto total_end_sync = std::chrono::high_resolution_clock::now();
total_end_sync - total_start_sync gives the overall working time, while cpu_end_sync - cpu_start_sync gives the CPU working time. You can probably see now why it behaves like a stopwatch.
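For example, to turn those differences into milliseconds, a small sketch using the variables above:
auto cpu_ms = std::chrono::duration_cast<std::chrono::milliseconds>(cpu_end_sync - cpu_start_sync).count();
auto total_ms = std::chrono::duration_cast<std::chrono::milliseconds>(total_end_sync - total_start_sync).count();
std::cout << "CPU time: " << cpu_ms << " ms, total time: " << total_ms << " ms\n";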
HOWEVER, you CANNOT use std::chrono::high_resolution_clock::now(); for GPU timing, since it only measures time on the CPU side and asynchronous GPU work may not have finished when the clock is read. You must use CUDA events instead. I will not cover events in detail here, but this is how to use them for GPU timing.
cudaEvent_t gpu_start, gpu_end;
cudaEventCreate(&gpu_start);
cudaEventCreate(&gpu_end);
cudaEventRecord(gpu_start);
// GPU work from step 1 to 4
cudaEventRecord(gpu_end);
cudaEventSynchronize(gpu_end); // make sure gpu_end has actually completed before reading the time
Then cudaEventElapsedTime(&gpu_time_sync, gpu_start, gpu_end); stores the GPU working time (in milliseconds) in gpu_time_sync.
This is the host code to do a synchronous program.
int main(){
// ... What you have already done ...
cudaEvent_t gpu_start, gpu_end;
cudaEventCreate(&gpu_start);
cudaEventCreate(&gpu_end);
std::cout << "=== SYNCHRONOUS VERSION ===\n";
auto total_start_sync = std::chrono::high_resolution_clock::now();
// Measure GPU time
cudaEventRecord(gpu_start);
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
gpu_intensive_work<<<(n+255)/256, 256>>>(d_data, n);
cudaDeviceSynchronize(); // CPU waits here doing NOTHING
cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
cudaEventRecord(gpu_end);
// Measure CPU time
auto cpu_start_time = std::chrono::high_resolution_clock::now();
float cpu_result_sync = cpu_intensive_work();
auto cpu_end_time = std::chrono::high_resolution_clock::now();
// Calculate times
auto total_end_sync = std::chrono::high_resolution_clock::now();
float gpu_time_sync;
cudaEventElapsedTime(&gpu_time_sync, gpu_start, gpu_end);
auto cpu_time_sync = std::chrono::duration_cast<std::chrono::milliseconds>(cpu_end_time - cpu_start_time);
auto total_time_sync = std::chrono::duration_cast<std::chrono::milliseconds>(total_end_sync - total_start_sync);
std::cout << "GPU time: " << gpu_time_sync << " ms\n";
std::cout << "CPU time: " << cpu_time_sync.count() << " ms\n";
std::cout << "Total time: " << total_time_sync.count() << " ms\n";
std::cout << "Note: GPU + CPU = " << gpu_time_sync + cpu_time_sync.count() << " ms (sequential)\n\n";
// ... To be done later
}
This is the result.
=== SYNCHRONOUS VERSION ===
GPU time: 2607.38 ms
CPU time: 2240 ms
Total time: 4848 ms
Note: GPU + CPU = 4847.38 ms (sequential)
It makes sense: GPU working time + CPU working time = Overall working time (well, almost equal, with a small difference from measurement overhead).
Asynchronous Version
You have now seen the synchronous result, so let's make our program asynchronous. You will also be able to tell whether your asynchronous version is correct: if you still get a result similar to the synchronous version, your asynchronous version is wrong. This is why we went through the synchronous version first.
- Use cudaMemcpyAsync to copy data from the host to the device.
- Run the gpu_intensive_work kernel.
- Run cudaMemcpyAsync to copy data back from the device to the host.
- Run cpu_intensive_work();
- Run cudaStreamSynchronize(stream); to make sure that both the CPU and the GPU finish their work before we stop the overall timer. If this feels confusing, remember that this version runs work on the GPU and the CPU simultaneously; even so, both must finish before the program ends. We do not know which one finishes earlier, so we wait here. If the CPU finishes earlier, this call waits for the GPU; if the GPU finishes earlier, the CPU passes through this call without waiting.
Please note that when we have only one stream, cudaStreamSynchronize(stream) is equivalent to cudaDeviceSynchronize().
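With more than one stream, the two calls are no longer equivalent. A hedged sketch (streamA, streamB, the kernels, and their arguments are placeholders):
kernelA<<<grid, block, 0, streamA>>>(d_a);
kernelB<<<grid, block, 0, streamB>>>(d_b);
cudaStreamSynchronize(streamA);   // waits only for kernelA; kernelB may still be running
cudaDeviceSynchronize();          // waits for everything on the device, including kernelB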
This time, our expectation is that Overall working time = max(GPU working time, CPU working time), because the GPU and CPU run their work at the same time; the overall time is therefore determined by whichever takes longer.
So, this is our asynchronous host code with timing.
int main(){
// ... What you have already done ...
// Reset data
for (int i = 0; i < n; ++i) {
h_data[i] = i * 0.01f;
}
// Create stream and events
cudaStream_t stream;
cudaStreamCreate(&stream);
std::cout << "=== ASYNCHRONOUS VERSION (with stream) ===\n";
auto total_start_async = std::chrono::high_resolution_clock::now();
// Start GPU work (async)
cudaEventRecord(gpu_start, stream);
cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
gpu_intensive_work<<<(n+255)/256, 256, 0, stream>>>(d_data, n);
cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream);
cudaEventRecord(gpu_end, stream);
// Start CPU work immediately (overlapped!)
auto cpu_start_async = std::chrono::high_resolution_clock::now();
float cpu_result_async = cpu_intensive_work();
auto cpu_end_async = std::chrono::high_resolution_clock::now();
// Wait for GPU to finish. This is the point that both GPU and CPU must finish their work.
// cudaStreamSynchronize(stream);
cudaDeviceSynchronize(); // Okay too since we have only one stream.
// Calculate times
auto total_end_async = std::chrono::high_resolution_clock::now();
float gpu_time_async;
cudaEventElapsedTime(&gpu_time_async, gpu_start, gpu_end);
auto cpu_time_async = std::chrono::duration_cast<std::chrono::milliseconds>(cpu_end_async - cpu_start_async);
auto total_time_async = std::chrono::duration_cast<std::chrono::milliseconds>(total_end_async - total_start_async);
std::cout << "GPU time: " << gpu_time_async << " ms\n";
std::cout << "CPU time: " << cpu_time_async.count() << " ms\n";
std::cout << "Total time: " << total_time_async.count() << " ms\n";
std::cout << "Note: Work overlapped! Total ≈ max(GPU, CPU)\n";
// Cleanup
cudaEventDestroy(gpu_start);
cudaEventDestroy(gpu_end);
cudaStreamDestroy(stream);
cudaFreeHost(h_data);
cudaFree(d_data);
return 0;
}
This is the result.
=== ASYNCHRONOUS VERSION (with stream) ===
GPU time: 2663.03 ms
CPU time: 2223 ms
Total time: 2663 ms
Note: Work overlapped! Total ≈ max(GPU, CPU)
Yes! The result makes sense. Both start working at the same time, but the GPU takes longer. So when the CPU reaches cudaStreamSynchronize(stream);, it waits a few hundred more milliseconds for the GPU to finish before the overall timer stops.
As promised, here is the full code. In fact, you can simply combine all the previous fragments.
#include <cuda_runtime.h>
#include <iostream>
#include <chrono>
#include <cmath>
// Same GPU and CPU functions as before...
__global__ void gpu_intensive_work(float *data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
float result = data[idx];
for (int i = 0; i < 1000000; ++i) {
result = sinf(result) * cosf(result) + sqrtf(fabsf(result));
}
data[idx] = result;
}
}
float cpu_intensive_work() {
float sum = 0;
for (int i = 0; i < 100000000; ++i) {
sum += sinf(i * 0.001f) * cosf(i * 0.001f);
}
return sum;
}
int main() {
const int n = 1 << 18;
const int bytes = n * sizeof(float);
// Allocate memory
float *h_data, *d_data;
cudaMallocHost((void**)&h_data, bytes);
cudaMalloc((void**)&d_data, bytes);
// Initialize data
for (int i = 0; i < n; ++i) {
h_data[i] = i * 0.01f;
}
// Create stream and events
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaEvent_t gpu_start, gpu_end;
cudaEventCreate(&gpu_start);
cudaEventCreate(&gpu_end);
std::cout << "=== SYNCHRONOUS VERSION ===\n";
auto total_start_sync = std::chrono::high_resolution_clock::now();
// Measure GPU time
cudaEventRecord(gpu_start);
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
gpu_intensive_work<<<(n+255)/256, 256>>>(d_data, n);
cudaDeviceSynchronize(); // CPU waits here doing NOTHING
cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
cudaEventRecord(gpu_end);
// Measure CPU time
auto cpu_start_time = std::chrono::high_resolution_clock::now();
float cpu_result_sync = cpu_intensive_work();
auto cpu_end_time = std::chrono::high_resolution_clock::now();
// Calculate times
auto total_end_sync = std::chrono::high_resolution_clock::now();
float gpu_time_sync;
cudaEventElapsedTime(&gpu_time_sync, gpu_start, gpu_end);
auto cpu_time_sync = std::chrono::duration_cast<std::chrono::milliseconds>(cpu_end_time - cpu_start_time);
auto total_time_sync = std::chrono::duration_cast<std::chrono::milliseconds>(total_end_sync - total_start_sync);
std::cout << "GPU time: " << gpu_time_sync << " ms\n";
std::cout << "CPU time: " << cpu_time_sync.count() << " ms\n";
std::cout << "Total time: " << total_time_sync.count() << " ms\n";
std::cout << "Note: GPU + CPU = " << gpu_time_sync + cpu_time_sync.count() << " ms (sequential)\n\n";
// Reset data
for (int i = 0; i < n; ++i) {
h_data[i] = i * 0.01f;
}
std::cout << "=== ASYNCHRONOUS VERSION (with stream) ===\n";
auto total_start_async = std::chrono::high_resolution_clock::now();
// Start GPU work (async)
cudaEventRecord(gpu_start, stream);
cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
gpu_intensive_work<<<(n+255)/256, 256, 0, stream>>>(d_data, n);
cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream);
cudaEventRecord(gpu_end, stream);
// Start CPU work immediately (overlapped!)
auto cpu_start_async = std::chrono::high_resolution_clock::now();
float cpu_result_async = cpu_intensive_work();
auto cpu_end_async = std::chrono::high_resolution_clock::now();
// Wait for GPU to finish. This is the point that both GPU and CPU must finish their work.
// cudaStreamSynchronize(stream);
cudaDeviceSynchronize(); // Okay too since we have only one stream.
// Calculate times
auto total_end_async = std::chrono::high_resolution_clock::now();
float gpu_time_async;
cudaEventElapsedTime(&gpu_time_async, gpu_start, gpu_end);
auto cpu_time_async = std::chrono::duration_cast<std::chrono::milliseconds>(cpu_end_async - cpu_start_async);
auto total_time_async = std::chrono::duration_cast<std::chrono::milliseconds>(total_end_async - total_start_async);
std::cout << "GPU time: " << gpu_time_async << " ms\n";
std::cout << "CPU time: " << cpu_time_async.count() << " ms\n";
std::cout << "Total time: " << total_time_async.count() << " ms\n";
std::cout << "Note: Work overlapped! Total ≈ max(GPU, CPU)\n";
// Cleanup
cudaEventDestroy(gpu_start);
cudaEventDestroy(gpu_end);
cudaStreamDestroy(stream);
cudaFreeHost(h_data);
cudaFree(d_data);
return 0;
}
To run this code,
nvcc -o basic_async basic_async.cu
./basic_async
This is the end of this part. You now have an idea of how to use CUDA streams to run kernels asynchronously, and how to implement a heterogeneous computing program that processes independent work on both the GPU and the CPU simultaneously.
If you enjoy this blog, you can support me with a cup of coffee. Thank you for reading this far, and see you next time!
This blog is inspired by Lectures 8 and 9 of the Low-Level Parallel Programming course offered by Uppsala University.