Sharing GPU Memory with IPC¶
Umpire provides a DeviceIpcAllocator strategy that allows GPU memory to be
shared between processes on the same node using CUDA or HIP IPC handles. This
strategy uses the shared memory infrastructure in Umpire to coordinate the
sharing of GPU memory between processes.
How It Works¶
The DeviceIpcAllocator works by:
Having one process (the “leader”, determined by the “scope”-local MPI rank 0) physically allocate GPU memory
The leader process gets an IPC handle for that memory and stores it in CPU shared memory
Other processes retrieve the IPC handle from CPU shared memory and import the GPU memory
MPI barriers are used to synchronize between processes and ensure safe access
This allows multiple processes to share GPU memory efficiently, which is useful for multi-process applications that need to operate on the same data.
Using DeviceIpcAllocator¶
A DeviceIpcAllocator can be created using the Umpire ResourceMangager. It
doesn’t require any additional arguments apart from a name. By default, device
memory will be allocated using the “DEVICE” resource. Optionally, you can
provide a device allocator, as well as a “scope” argument that determines who
will share each GPU allocation.
Here’s an example:
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-25, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include <iostream>
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
#include "umpire/Umpire.hpp"
#include "umpire/strategy/DeviceIpcAllocator.hpp"
#include "umpire/util/MemoryResourceTraits.hpp"
#if defined(UMPIRE_ENABLE_MPI)
#include <mpi.h>
#endif
#if defined(UMPIRE_ENABLE_CUDA)
#include <cuda_runtime_api.h>
#elif defined(UMPIRE_ENABLE_HIP)
#include <hip/hip_runtime_api.h>
#endif
// Example of using the DeviceIpcAllocator
int main(int argc, char** argv)
{
#if defined(UMPIRE_ENABLE_MPI)
MPI_Init(&argc, &argv);
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
#else
int rank = 0;
#endif
{
auto& rm = umpire::ResourceManager::getInstance();
// Create the IPC allocator with node scope and 1MB shared memory
auto ipc_allocator = rm.makeAllocator<umpire::strategy::DeviceIpcAllocator>("device_ipc_allocator");
auto host_allocator = rm.getAllocator("HOST");
// Allocate device memory - only rank 0 will physically allocate
// All other ranks will import via IPC
const size_t size = 1024 * sizeof(float);
float* data = static_cast<float*>(ipc_allocator.allocate(size));
std::cout << "Rank " << rank << ": Got device memory at " << data << std::endl;
auto scope_comm = umpire::get_communicator_for_allocator(ipc_allocator, MPI_COMM_WORLD);
int scope_rank;
MPI_Comm_rank(scope_comm, &scope_rank);
// Initialize data from rank 0
if (scope_rank == 0) {
// Initialize on host (simplified example)
float* host_data = static_cast<float*>(host_allocator.allocate(size));
for (int i = 0; i < 1024; i++) {
host_data[i] = static_cast<float>(rank);
}
// Copy to device
rm.copy(data, host_data, size);
host_allocator.deallocate(host_data);
}
#if defined(UMPIRE_ENABLE_MPI)
MPI_Barrier(scope_comm);
#endif
#if defined(UMPIRE_ENABLE_CUDA)
cudaDeviceSynchronize();
#elif defined(UMPIRE_ENABLE_HIP)
hipDeviceSynchronize();
#endif
// All ranks can now access the data
// Verify by copying a portion back to host
float* value = static_cast<float*>(host_allocator.allocate(sizeof(float)));
rm.copy(value, data + 1, sizeof(float));
std::cout << "Rank " << rank << ": second value is " << *value << std::endl;
// Deallocate memory on all ranks
ipc_allocator.deallocate(data);
host_allocator.deallocate(value);
}
#if defined(UMPIRE_ENABLE_MPI)
MPI_Finalize();
#endif
return 0;
}
The scope argument can either be “socket” or “node”. The “socket” scope means that all processes using the same socket (as identified by the PCI address of the GPU) will share the GPU memory, while the “node” scope means that all processes on the same node (as determined by MPI_COMM_TYPE_SHARED) will share the GPU memory.
Limitations¶
Requires UMPIRE_ENABLE_MPI to be enabled
Requires UMPIRE_ENABLE_IPC_SHARED_MEMORY to be enabled