Umpire¶
Umpire is a resource management library that allows the discovery, provision, and management of memory on next-generation hardware architectures with NUMA memory hierarchies.
- Take a look at our Getting Started guide for all you need to get up and running with Umpire.
- If you are looking for developer documentation on a particular function, check out the code documentation.
- Want to contribute? Take a look at our developer and contribution guides.
Any questions? File an issue on GitHub, or email umpire-dev@llnl.gov
Getting Started¶
This page provides information on how to quickly get up and running with Umpire.
Installation¶
Umpire is hosted on GitHub here. To clone the repo into your local working space, type:
$ git clone --recursive https://github.com/LLNL/Umpire.git
The --recursive argument is required to ensure that the BLT submodule is also checked out. BLT is the build system we use for Umpire.
Building Umpire¶
Umpire uses CMake and BLT to handle builds. Make sure that you have a modern compiler loaded, and the configuration is as simple as:
$ mkdir build && cd build
$ cmake ../
By default, Umpire will only support host memory. Additional backends for device support can be enabled using the options detailed in Advanced Configuration. CMake will provide output about which compiler is being used and the values of other options. Once CMake has completed, Umpire can be built with Make:
$ make
For more advanced configuration options, see Advanced Configuration.
Installing Umpire¶
To install Umpire, just run:
$ make install
Umpire installs files to the lib, include and bin directories of the CMAKE_INSTALL_PREFIX. Additionally, Umpire installs a CMake configuration file that can help you use Umpire in other projects. By setting umpire_DIR to point to the root of your Umpire installation, you can call find_package(umpire) inside your CMake project and Umpire will be automatically detected and available for use.
Basic Usage¶
Let’s take a quick tour through Umpire’s most important features. A complete listing you can compile is included at the bottom of the page. First, let’s grab an Allocator and allocate some memory. This is the interface through which you will want to access data:
auto& rm = umpire::ResourceManager::getInstance();
umpire::Allocator allocator = rm.getAllocator("HOST");
float* my_data = static_cast<float*>(allocator.allocate(100*sizeof(float)));
This code grabs the default allocator for the host memory, and uses it to allocate an array of 100 floats. We can ask for different Allocators to allocate memory in different places. Let’s ask for a device allocator:
umpire::Allocator device_allocator = rm.getAllocator("DEVICE");
float* my_data_device = static_cast<float*>(device_allocator.allocate(100*sizeof(float)));
This code gets the default device allocator, and uses it to allocate an array of 100 floats. Remember, since this is a device pointer, there is no guarantee you will be able to access it on the host. Luckily, Umpire’s ResourceManager can copy data from one pointer to another transparently. Let’s copy the data from our first pointer to the DEVICE-allocated pointer:
rm.copy(my_data_device, my_data);
To free any memory allocated, you can use the deallocate function of the Allocator, or the ResourceManager. Asking the ResourceManager to deallocate memory is slower, but useful if you don’t know how or where an allocation was made:
allocator.deallocate(my_data); // deallocate using Allocator
rm.deallocate(my_data_device); // deallocate using ResourceManager
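Putting these snippets together, a minimal sketch of a complete program looks like the following. This is not the upstream listing itself; the DEVICE portion is guarded on UMPIRE_ENABLE_DEVICE (matching the tutorial listings) so the sketch also builds in host-only configurations:
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
int main(int, char**)
{
  auto& rm = umpire::ResourceManager::getInstance();
  // Grab the default host allocator and allocate 100 floats.
  umpire::Allocator allocator = rm.getAllocator("HOST");
  float* my_data = static_cast<float*>(allocator.allocate(100 * sizeof(float)));
#if defined(UMPIRE_ENABLE_DEVICE)
  // Allocate the same amount on the device and copy the host data over.
  umpire::Allocator device_allocator = rm.getAllocator("DEVICE");
  float* my_data_device = static_cast<float*>(device_allocator.allocate(100 * sizeof(float)));
  rm.copy(my_data_device, my_data);
  rm.deallocate(my_data_device); // deallocate using the ResourceManager
#endif
  allocator.deallocate(my_data); // deallocate using the Allocator
  return 0;
}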
Umpire Tutorial¶
This section is a tutorial introduction to Umpire. We start with the most basic memory allocation, and move through topics like allocating on different resources, using allocation strategies to change how memory is allocated, using operations to move and modify data, and how to use Umpire introspection capability to find out information about Allocators and allocations.
These examples are all built as part of Umpire, and you can find the files in the examples directory at the root of the Umpire repository. Feel free to play around and modify these examples to experiment with all of Umpire’s functionality.
The following tutorial examples assume a working knowledge of C++ and a general understanding of how memory is laid out in modern heterogeneous computers. The main thing to remember is that in many systems, memory on other execution devices (like GPUs) might not be directly accessible from the CPU. If you try to access this memory, your program will crash! Luckily, Umpire makes it easy to move data around, and check where it is, as you will see in the following sections.
Allocators¶
The fundamental concept for accessing memory through Umpire is the umpire::Allocator. An umpire::Allocator is a C++ object that can be used to allocate and deallocate memory, as well as query a pointer to get some extra information about it.
All umpire::Allocators are created and managed by Umpire’s umpire::ResourceManager. To get an Allocator, you need to ask for one:
umpire::Allocator allocator = rm.getAllocator("HOST");
You can also use an existing allocator to build a new allocator from it:
auto addon_allocator = rm.getAllocator(allocator.getName());
This new allocator will also be built with the same memory resource. More information on memory resources is provided in the next section. Additionally, once you have an umpire::Allocator you can use it to allocate and deallocate memory:
double* data = static_cast<double*>(allocator.allocate(SIZE * sizeof(double)));
allocator.deallocate(data);
In the next section, we will see how to allocate memory using different resources.
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
int main(int, char**)
{
auto& rm = umpire::ResourceManager::getInstance();
// _sphinx_tag_tut_get_allocator_start
umpire::Allocator allocator = rm.getAllocator("HOST");
// _sphinx_tag_tut_get_allocator_end
constexpr std::size_t SIZE = 1024;
// _sphinx_tag_tut_allocate_start
double* data = static_cast<double*>(allocator.allocate(SIZE * sizeof(double)));
// _sphinx_tag_tut_allocate_end
std::cout << "Allocated " << (SIZE * sizeof(double)) << " bytes using the " << allocator.getName() << " allocator."
<< std::endl;
//_sphinx_tag_tut_getAllocator_start
auto addon_allocator = rm.getAllocator(allocator.getName());
//_sphinx_tag_tut_getAllocator_end
std::cout << "Created an add-on allocator of size " << addon_allocator.getCurrentSize() << " using the "
<< allocator.getName() << " allocator." << std::endl;
// _sphinx_tag_tut_deallocate_start
allocator.deallocate(data);
// _sphinx_tag_tut_deallocate_end
std::cout << "...Memory deallocated." << std::endl;
return 0;
}
Resources¶
Each computer system will have a number of distinct places in which the system will allow you to allocate memory. In Umpire’s world, these are memory resources. A memory resource can correspond to a hardware resource, but can also be used to identify memory with a particular characteristic, like “pinned” memory in a GPU system.
When you configure Umpire, it will create umpire::resource::MemoryResources according to what is available on the system you are building for. For each resource (defined by MemoryResourceTraits::resource_type), Umpire will create a default umpire::Allocator that you can use. In the previous example, we were actually using an umpire::Allocator created for the memory resource corresponding to the CPU memory.
The easiest way to identify resources is by name. The “HOST” resource is always available. We also have resources that represent global GPU memory (“DEVICE”), constant GPU memory (“DEVICE_CONST”), unified memory that can be accessed by the CPU or GPU (“UM”), host memory that can be accessed by the GPU (“PINNED”), and mmapped file memory (“FILE”). If an incorrect name is used or if the allocator was not set up correctly, the “UNKNOWN” resource name is returned.
Umpire will create an umpire::Allocator for each of these resources, and you can get them using the same umpire::ResourceManager::getAllocator() call you saw in the previous example:
umpire::Allocator allocator = rm.getAllocator(resource);
Note that every allocator supports the same calls, no matter which resource it is for, so we can run the same code for all the resources available in the system.
While using Umpire memory resources, it may be useful to query the memory resource currently associated with a particular allocator. For example, if we wanted to double-check that our allocator is using the device resource, we can assert that MemoryResourceTraits::resource_type::device is equal to the return value of allocator.getAllocationStrategy()->getTraits().resource. The test code provided in memory_resource_traits_tests.cpp shows a complete example of how to query this information.
Note
In order to test some memory resources, you may need to configure your Umpire build to use a particular platform (a member of the umpire::Allocator, defined by Platform.hpp) that has access to that resource. See the Developer’s Guide for more information.
Next, we will see an example of how to move data between resources using operations.
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
void allocate_and_deallocate(const std::string& resource)
{
auto& rm = umpire::ResourceManager::getInstance();
// _sphinx_tag_tut_get_allocator_start
umpire::Allocator allocator = rm.getAllocator(resource);
// _sphinx_tag_tut_get_allocator_end
constexpr std::size_t SIZE = 1024;
double* data = static_cast<double*>(allocator.allocate(SIZE * sizeof(double)));
std::cout << "Allocated " << (SIZE * sizeof(double)) << " bytes using the " << allocator.getName() << " allocator...";
allocator.deallocate(data);
std::cout << " deallocated." << std::endl;
}
int main(int, char**)
{
allocate_and_deallocate("HOST");
#if defined(UMPIRE_ENABLE_DEVICE)
allocate_and_deallocate("DEVICE");
#endif
#if defined(UMPIRE_ENABLE_UM)
allocate_and_deallocate("UM");
#endif
#if defined(UMPIRE_ENABLE_PINNED)
allocate_and_deallocate("PINNED");
#endif
return 0;
}
Operations¶
Moving and modifying data in a heterogeneous memory system can be annoying. You have to keep track of the source and destination, and often use vendor-specific APIs to perform the modifications. In Umpire, all data modification and movement is wrapped up in a concept we call operations. Full documentation for all of these is available here. The full code listing for each example is included at the bottom of the page.
Copy¶
Let’s start by looking at how we copy data around. The umpire::ResourceManager provides an interface to copy that handles figuring out where the source and destination pointers were allocated, and selects the correct implementation to copy the data:
rm.copy(dest_data, source_data);
In this example, the destination data can be allocated using any valid Allocator.
Move¶
If you want to move data to a new Allocator and deallocate the old copy, Umpire provides a umpire::ResourceManager::move() operation:
double* dest_data = static_cast<double*>(rm.move(source_data, dest_allocator));
The move operation combines an allocation, a copy, and a deallocate into one function call, allowing you to move data without having to have the destination data allocated. As always, this operation will work with any valid destination Allocator.
Memset¶
Setting a whole block of memory to a value (like 0) is a common operation, that
most people know as a memset. Umpire provides a
umpire::ResourceManager::memset()
implementation that can be applied to
any allocation, regardless of where it came from:
rm.memset(data, 0);
Reallocate¶
Reallocating CPU memory is easy, there is a function designed specifically to
do it: realloc
. When the original allocation was made in a different memory
however, you can be out of luck. Umpire provides a
umpire::ResourceManager::reallocate()
operation:
data = static_cast<double*>(rm.reallocate(data, REALLOCATED_SIZE));
This method returns a pointer to the reallocated data. Like all operations, this can be used regardless of the Allocator used for the source data.
Listings¶
Copy Example Listing
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
void copy_data(double* source_data, std::size_t size, const std::string& destination)
{
auto& rm = umpire::ResourceManager::getInstance();
auto dest_allocator = rm.getAllocator(destination);
double* dest_data = static_cast<double*>(dest_allocator.allocate(size * sizeof(double)));
// _sphinx_tag_tut_copy_start
rm.copy(dest_data, source_data);
// _sphinx_tag_tut_copy_end
std::cout << "Copied source data (" << source_data << ") to destination " << destination << " (" << dest_data << ")"
<< std::endl;
dest_allocator.deallocate(dest_data);
}
int main(int, char**)
{
constexpr std::size_t SIZE = 1024;
auto& rm = umpire::ResourceManager::getInstance();
auto allocator = rm.getAllocator("HOST");
double* data = static_cast<double*>(allocator.allocate(SIZE * sizeof(double)));
std::cout << "Allocated " << (SIZE * sizeof(double)) << " bytes using the " << allocator.getName() << " allocator."
<< std::endl;
std::cout << "Filling with 0.0...";
for (std::size_t i = 0; i < SIZE; i++) {
data[i] = 0.0;
}
std::cout << "done." << std::endl;
copy_data(data, SIZE, "HOST");
#if defined(UMPIRE_ENABLE_DEVICE)
copy_data(data, SIZE, "DEVICE");
#endif
#if defined(UMPIRE_ENABLE_UM)
copy_data(data, SIZE, "UM");
#endif
#if defined(UMPIRE_ENABLE_PINNED)
copy_data(data, SIZE, "PINNED");
#endif
allocator.deallocate(data);
return 0;
}
Move Example Listing
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
double* move_data(double* source_data, const std::string& destination)
{
auto& rm = umpire::ResourceManager::getInstance();
auto dest_allocator = rm.getAllocator(destination);
std::cout << "Moved source data (" << source_data << ") to destination ";
// _sphinx_tag_tut_move_start
double* dest_data = static_cast<double*>(rm.move(source_data, dest_allocator));
// _sphinx_tag_tut_move_end
std::cout << destination << " (" << dest_data << ")" << std::endl;
return dest_data;
}
int main(int, char**)
{
constexpr std::size_t SIZE = 1024;
auto& rm = umpire::ResourceManager::getInstance();
auto allocator = rm.getAllocator("HOST");
double* data = static_cast<double*>(allocator.allocate(SIZE * sizeof(double)));
std::cout << "Allocated " << (SIZE * sizeof(double)) << " bytes using the " << allocator.getName() << " allocator."
<< std::endl;
std::cout << "Filling with 0.0...";
for (std::size_t i = 0; i < SIZE; i++) {
data[i] = 0.0;
}
std::cout << "done." << std::endl;
data = move_data(data, "HOST");
#if defined(UMPIRE_ENABLE_DEVICE)
data = move_data(data, "DEVICE");
#endif
#if defined(UMPIRE_ENABLE_UM)
data = move_data(data, "UM");
#endif
#if defined(UMPIRE_ENABLE_PINNED)
data = move_data(data, "PINNED");
#endif
rm.deallocate(data);
return 0;
}
Memset Example Listing
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
int main(int, char**)
{
constexpr std::size_t SIZE = 1024;
auto& rm = umpire::ResourceManager::getInstance();
const std::string destinations[] = {
"HOST"
#if defined(UMPIRE_ENABLE_DEVICE)
,
"DEVICE"
#endif
#if defined(UMPIRE_ENABLE_UM)
,
"UM"
#endif
#if defined(UMPIRE_ENABLE_PINNED)
,
"PINNED"
#endif
};
for (auto& destination : destinations) {
auto allocator = rm.getAllocator(destination);
double* data = static_cast<double*>(allocator.allocate(SIZE * sizeof(double)));
std::cout << "Allocated " << (SIZE * sizeof(double)) << " bytes using the " << allocator.getName() << " allocator."
<< std::endl;
// _sphinx_tag_tut_memset_start
rm.memset(data, 0);
// _sphinx_tag_tut_memset_end
std::cout << "Set data from " << destination << " (" << data << ") to 0." << std::endl;
allocator.deallocate(data);
}
return 0;
}
Reallocate Example Listing
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
int main(int, char**)
{
constexpr std::size_t SIZE = 1024;
constexpr std::size_t REALLOCATED_SIZE = 256;
auto& rm = umpire::ResourceManager::getInstance();
const std::string destinations[] = {
"HOST"
#if defined(UMPIRE_ENABLE_DEVICE)
,
"DEVICE"
#endif
#if defined(UMPIRE_ENABLE_UM)
,
"UM"
#endif
#if defined(UMPIRE_ENABLE_PINNED)
,
"PINNED"
#endif
};
for (auto& destination : destinations) {
auto allocator = rm.getAllocator(destination);
double* data = static_cast<double*>(allocator.allocate(SIZE * sizeof(double)));
std::cout << "Allocated " << (SIZE * sizeof(double)) << " bytes using the " << allocator.getName() << " allocator."
<< std::endl;
std::cout << "Reallocating data (" << data << ") to size " << REALLOCATED_SIZE << "...";
// _sphinx_tag_tut_realloc_start
data = static_cast<double*>(rm.reallocate(data, REALLOCATED_SIZE));
// _sphinx_tag_tut_realloc_end
std::cout << "done. Reallocated data (" << data << ")" << std::endl;
allocator.deallocate(data);
}
return 0;
}
Dynamic Pools¶
Frequently allocating and deallocating memory can be quite costly, especially when you are making large allocations or allocating on different memory resources. To mitigate this, Umpire provides allocation strategies that can be used to customize how data is obtained from the system.
In this example, we will look at the umpire::strategy::DynamicPoolList strategy. This is a simple pooling algorithm that can fulfill requests for allocations of any size. To create a new Allocator using the umpire::strategy::DynamicPoolList strategy:
auto pooled_allocator = rm.makeAllocator<umpire::strategy::DynamicPoolList>(resource + "_pool", allocator);
We have to provide a new name for the Allocator, as well as the underlying Allocator we wish to use to grab memory.
Additionally, in the previous section on Allocators, we mentioned that you could build a new allocator off of an existing one using the getAllocator function. Here is another example of this, but using a strategy:
umpire::Allocator addon_allocator = rm.makeAllocator<umpire::strategy::SizeLimiter>(
resource + "_addon_pool", rm.getAllocator(pooled_allocator.getName()), 2098);
The purpose of this example is to show that the getAllocator function can be used for more than just getting an initial allocator. Here, addon_allocator is a size-limited allocator built on top of the dynamic pool, capped at 2098 bytes; note that it is created with the same memory resource as pooled_allocator. Another good use case for the getAllocator function is grabbing each available allocator in a loop and querying some property, as sketched below.
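For instance, a minimal sketch of that loop might look like the following. It assumes umpire::ResourceManager::getAllocatorNames() is available to list the registered allocator names; getName() and getCurrentSize() appear in the earlier examples:
#include <iostream>
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
int main(int, char**)
{
  auto& rm = umpire::ResourceManager::getInstance();
  // Walk every registered allocator and report how much memory it currently holds.
  for (const auto& name : rm.getAllocatorNames()) {
    auto alloc = rm.getAllocator(name);
    std::cout << alloc.getName() << " currently holds " << alloc.getCurrentSize() << " bytes." << std::endl;
  }
  return 0;
}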
Once you have an Allocator, you can allocate and deallocate memory as before, without needing to worry about the underlying algorithm used for the allocations:
double* data = static_cast<double*>(pooled_allocator.allocate(SIZE * sizeof(double)));
pooled_allocator.deallocate(data);
Don’t forget, these strategies can be created on top of any valid Allocator:
allocate_and_deallocate_pool("HOST");
#if defined(UMPIRE_ENABLE_DEVICE)
allocate_and_deallocate_pool("DEVICE");
#endif
#if defined(UMPIRE_ENABLE_UM)
allocate_and_deallocate_pool("UM");
#endif
#if defined(UMPIRE_ENABLE_PINNED)
allocate_and_deallocate_pool("PINNED");
#endif
Most Umpire users will make allocations that use the GPU via the umpire::strategy::DynamicPoolList, to help mitigate the cost of allocating memory on these devices.
You can tune the way that umpire::strategy::DynamicPoolList allocates memory using two parameters: the initial size and the minimum size. The initial size controls how large the first underlying allocation will be, regardless of the requested size. The minimum size controls the minimum size of any future underlying allocations. These two parameters can be passed when constructing a pool:
auto pooled_allocator = rm.makeAllocator<umpire::strategy::DynamicPoolList>(resource + "_pool", allocator,
initial_size, /* default = 512Mb*/
min_block_size /* default = 1Mb */);
Depending on where you are allocating data, you might want to use different sizes. It’s easy to construct multiple pools with different configurations:
allocate_and_deallocate_pool("HOST", 65536, 512);
#if defined(UMPIRE_ENABLE_DEVICE)
allocate_and_deallocate_pool("DEVICE", (1024 * 1024 * 1024), (1024 * 1024));
#endif
#if defined(UMPIRE_ENABLE_UM)
allocate_and_deallocate_pool("UM", (1024 * 64), 1024);
#endif
#if defined(UMPIRE_ENABLE_PINNED)
allocate_and_deallocate_pool("PINNED", (1024 * 16), 1024);
#endif
There are lots of different strategies that you can use; we will look at some of them in this tutorial. A complete list of strategies can be found here.
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/DynamicPoolList.hpp"
void allocate_and_deallocate_pool(const std::string& resource)
{
auto& rm = umpire::ResourceManager::getInstance();
auto allocator = rm.getAllocator(resource);
// _sphinx_tag_tut_makepool_start
auto pooled_allocator = rm.makeAllocator<umpire::strategy::DynamicPoolList>(resource + "_pool", allocator);
// _sphinx_tag_tut_makepool_end
constexpr std::size_t SIZE = 1024;
// _sphinx_tag_tut_allocate_start
double* data = static_cast<double*>(pooled_allocator.allocate(SIZE * sizeof(double)));
// _sphinx_tag_tut_allocate_end
std::cout << "Allocated " << (SIZE * sizeof(double)) << " bytes using the " << pooled_allocator.getName()
<< " allocator...";
// _sphinx_tag_tut_deallocate_start
pooled_allocator.deallocate(data);
// _sphinx_tag_tut_deallocate_end
std::cout << " deallocated." << std::endl;
}
int main(int, char**)
{
// _sphinx_tag_tut_anyallocator_start
allocate_and_deallocate_pool("HOST");
#if defined(UMPIRE_ENABLE_DEVICE)
allocate_and_deallocate_pool("DEVICE");
#endif
#if defined(UMPIRE_ENABLE_UM)
allocate_and_deallocate_pool("UM");
#endif
#if defined(UMPIRE_ENABLE_PINNED)
allocate_and_deallocate_pool("PINNED");
#endif
// _sphinx_tag_tut_anyallocator_end
return 0;
}
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/DynamicPoolList.hpp"
void allocate_and_deallocate_pool(const std::string& resource, std::size_t initial_size, std::size_t min_block_size)
{
constexpr std::size_t SIZE = 1024;
auto& rm = umpire::ResourceManager::getInstance();
auto allocator = rm.getAllocator(resource);
// _sphinx_tag_tut_allocator_tuning_start
auto pooled_allocator = rm.makeAllocator<umpire::strategy::DynamicPoolList>(resource + "_pool", allocator,
initial_size, /* default = 512Mb*/
min_block_size /* default = 1Mb */);
// _sphinx_tag_tut_allocator_tuning_end
double* data = static_cast<double*>(pooled_allocator.allocate(SIZE * sizeof(double)));
std::cout << "Allocated " << (SIZE * sizeof(double)) << " bytes using the " << pooled_allocator.getName()
<< " allocator...";
pooled_allocator.deallocate(data);
std::cout << " deallocated." << std::endl;
}
int main(int, char**)
{
// _sphinx_tag_tut_device_sized_pool_start
allocate_and_deallocate_pool("HOST", 65536, 512);
#if defined(UMPIRE_ENABLE_DEVICE)
allocate_and_deallocate_pool("DEVICE", (1024 * 1024 * 1024), (1024 * 1024));
#endif
#if defined(UMPIRE_ENABLE_UM)
allocate_and_deallocate_pool("UM", (1024 * 64), 1024);
#endif
#if defined(UMPIRE_ENABLE_PINNED)
allocate_and_deallocate_pool("PINNED", (1024 * 16), 1024);
#endif
// _sphinx_tag_tut_device_sized_pool_end
return 0;
}
Introspection¶
When writing code to run on computers with a complex memory hierarchy, one of the most difficult things can be keeping track of where each pointer has been allocated. Umpire’s introspection capability keeps track of this information, as well as other useful bits and pieces you might want to know.
The umpire::ResourceManager can be used to find the allocator associated with an address:
auto found_allocator = rm.getAllocator(data);
Once you have this, it’s easy to query things like the name of the Allocator or find out the associated umpire::Platform, which can help you decide where to operate on this data:
std::cout << "According to the ResourceManager, the Allocator used is " << found_allocator.getName()
<< ", which has the Platform " << static_cast<int>(found_allocator.getPlatform()) << std::endl;
You can also find out how big the allocation is, in case you forgot:
std::cout << "The size of the allocation is << " << found_allocator.getSize(data) << std::endl;
Remember that these functions will work on any allocation made using an Allocator or umpire::TypedAllocator.
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
int main(int, char**)
{
constexpr std::size_t SIZE = 1024;
auto& rm = umpire::ResourceManager::getInstance();
const std::string destinations[] = {
"HOST"
#if defined(UMPIRE_ENABLE_DEVICE)
,
"DEVICE"
#endif
#if defined(UMPIRE_ENABLE_UM)
,
"UM"
#endif
#if defined(UMPIRE_ENABLE_PINNED)
,
"PINNED"
#endif
};
for (auto& destination : destinations) {
auto allocator = rm.getAllocator(destination);
double* data = static_cast<double*>(allocator.allocate(SIZE * sizeof(double)));
std::cout << "Allocated " << (SIZE * sizeof(double)) << " bytes using the " << allocator.getName() << " allocator."
<< std::endl;
// _sphinx_tag_tut_getallocator_start
auto found_allocator = rm.getAllocator(data);
// _sphinx_tag_tut_getallocator_end
// _sphinx_tag_tut_getinfo_start
std::cout << "According to the ResourceManager, the Allocator used is " << found_allocator.getName()
<< ", which has the Platform " << static_cast<int>(found_allocator.getPlatform()) << std::endl;
// _sphinx_tag_tut_getinfo_end
// _sphinx_tag_tut_getsize_start
std::cout << "The size of the allocation is << " << found_allocator.getSize(data) << std::endl;
// _sphinx_tag_tut_getsize_end
allocator.deallocate(data);
}
return 0;
}
Typed Allocators¶
Sometimes, you might want to construct an allocator that allocates objects of a specific type. Umpire provides a umpire::TypedAllocator for this purpose. It can also be used with STL objects like std::vector.
A umpire::TypedAllocator is constructed from any existing Allocator, and provides the same interface as the normal umpire::Allocator. However, when you call allocate, the argument is the number of objects you want to allocate, not the total number of bytes:
umpire::TypedAllocator<double> double_allocator{alloc};
double* my_doubles = double_allocator.allocate(1024);
double_allocator.deallocate(my_doubles, 1024);
To use this allocator with an STL object like a vector, you need to pass the type as a template parameter for the vector, and also pass the allocator to the vector when you construct it:
std::vector<double, umpire::TypedAllocator<double>> my_vector{double_allocator};
One thing to remember is that whatever allocator you use with an STL object, it must be compatible with the inner workings of that object. For example, if you try to use a “DEVICE”-based allocator it will fail, since the vector will try to construct each element. The CPU cannot access DEVICE memory in most systems, thus causing a segfault. Be careful!
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
#include "umpire/TypedAllocator.hpp"
int main(int, char**)
{
auto& rm = umpire::ResourceManager::getInstance();
auto alloc = rm.getAllocator("HOST");
// _sphinx_tag_tut_typed_alloc_start
umpire::TypedAllocator<double> double_allocator{alloc};
double* my_doubles = double_allocator.allocate(1024);
double_allocator.deallocate(my_doubles, 1024);
// _sphinx_tag_tut_typed_alloc_end
// _sphinx_tag_tut_vector_alloc_start
std::vector<double, umpire::TypedAllocator<double>> my_vector{double_allocator};
// _sphinx_tag_tut_vector_alloc_end
my_vector.resize(100);
return 0;
}
Replay¶
Umpire provides a lightweight replay capability that can be used to investigate performance of particular allocation patterns and reproduce bugs.
Input Example¶
A log can be captured and stored as a JSON file, then used as input to the replay application (available under the bin directory). The replay program will read the replay log, and recreate the events that occurred as part of the run that generated the log.
The file tut_replay.cpp makes a umpire::strategy::QuickPool:
auto allocator = rm.getAllocator("HOST");
auto pool = rm.makeAllocator<umpire::strategy::QuickPool>("pool", allocator);
This allocator is used to perform some randomly sized allocations, which are later freed:
std::generate(allocations.begin(), allocations.end(), [&]() { return pool.allocate(random_number()); });
for (auto& ptr : allocations)
pool.deallocate(ptr);
Running the Example¶
Running this program:
UMPIRE_REPLAY="On" ./bin/examples/tutorial/tut_replay > tut_replay_log.json
will write Umpire replay events to the file tut_replay_log.json. You can see that this file contains JSON-formatted lines.
Replaying the session¶
Loading this file with the replay program will replay this sequence of umpire::Allocator creation, allocations, and deallocations:
./bin/replay -i ../tutorial/examples/tut_replay_log.json
We also have a tutorial for the C interface to Umpire. Complete example listings are available, and will be compiled if you have configured Umpire with -DUMPIRE_ENABLE_C=On.
The C tutorial assumes an understanding of C, and it would be useful to have some knowledge of C++ to understand how the C API maps to the native C++ classes that Umpire provides.
C API: Allocators¶
The fundamental concept for accessing memory through Umpire is an umpire::Allocator. In C, this means using the type umpire_allocator. There are corresponding functions that take an umpire_allocator and let you allocate and deallocate memory.
As with the native C++ interface, all allocators are accessed via the umpire::ResourceManager. In the C API, there is a corresponding umpire_resourcemanager type. To get an umpire_allocator:
umpire_resourcemanager rm;
umpire_resourcemanager_get_instance(&rm);
umpire_allocator allocator;
umpire_resourcemanager_get_allocator_by_name(&rm, "HOST", &allocator);
Once you have an umpire_allocator, you can use it to allocate and deallocate memory:
double* data = (double*) umpire_allocator_allocate(&allocator, SIZE*sizeof(double));
printf("Allocated %lu bytes using the %s allocator...", (SIZE*sizeof(double)), umpire_allocator_get_name(&allocator));
umpire_allocator_deallocate(&allocator, data);
In the next section, we will see how to allocate memory in different places.
C API: Resources¶
Each computer system will have a number of distinct places in which the system will allow you to allocate memory. In Umpire’s world, these are memory resources. A memory resource can correspond to a hardware resource, but can also be used to identify memory with a particular characteristic, like “pinned” memory in a GPU system.
When you configure Umpire, it will create umpire::resource::MemoryResources according to what is available on the system you are building for. For each resource, Umpire will create a default umpire_allocator that you can use. In the previous example, we were actually using an umpire_allocator created for the memory resource corresponding to the CPU memory.
The easiest way to identify resources is by name. The “HOST” resource is always available. In a system configured with NVIDIA GPUs, we also have resources that represent global GPU memory (“DEVICE”), unified memory that can be accessed by the CPU or GPU (“UM”), and host memory that can be accessed by the GPU (“PINNED”).
Umpire will create an umpire_allocator for each of these resources, and you can get them using the same umpire_resourcemanager_get_allocator_by_name call you saw in the previous example. Since every allocator supports the same calls, no matter which resource it is for, we can run the same code for all the resources available in the system, passing any valid resource name to this function.
In the next example, we will learn how to move data between resources using operations.
C API: Pools¶
Frequently allocating and deallocating memory can be quite costly, especially when you are making large allocations or allocating on different memory resources. To mitigate this, Umpire provides allocation strategies that can be used to customize how data is obtained from the system.
In this example, we will look at creating a pool that can fulfill requests for allocations of any size. To create a new umpire_allocator using the pooling algorithm, we have to provide a new name for the allocator, as well as the underlying umpire_allocator we wish to use to grab memory, plus two arguments: the size of the initial block that is allocated, and the minimum size of any future blocks.
Once you have the allocator, you can allocate and deallocate memory as before, without needing to worry about the underlying algorithm used for the allocations. This pool can be created with any valid underlying umpire_allocator.
Finally, we have a tutorial for Umpire’s FORTRAN API. These examples will be compiled when configuring with -DENABLE_FORTRAN=On. The FORTRAN tutorial assumes an understanding of FORTRAN. Familiarity with FORTRAN’s ISO C bindings can be useful for understanding why the interface looks the way it does.
FORTRAN API: Allocators¶
The fundamental concept for accessing memory through Umpire is an umpire::Allocator. In FORTRAN, this means using the type UmpireAllocator. This type provides an allocate_pointer function to allocate raw memory, and a generic allocate procedure that takes an array pointer and an array of dimensions and will allocate the correct amount of memory.
As with the native C++ interface, all allocators are accessed via the umpire::ResourceManager. In the FORTRAN API, there is a corresponding UmpireResourceManager type, from which you can get an UmpireAllocator. In this example we fetch the allocator by id; using id 0 will always give you a host allocator. Once you have an UmpireAllocator, you can use it to allocate and deallocate memory. In this case, we allocate a one-dimensional array using the generic allocate function.
Advanced Configuration¶
In addition to the normal options provided by CMake, Umpire uses some additional configuration arguments to control optional features and behavior. Each argument is a boolean option, and can be turned on or off:
-DENABLE_CUDA=Off
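For example, a configure line that enables CUDA support and the C API might look like the following; the option names are taken from the list below, and the exact set you pass depends on your system:
$ cmake -DENABLE_CUDA=On -DUMPIRE_ENABLE_C=On ../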
The configuration options are explained in more detail below:
- ENABLE_CUDA: This option enables support for NVIDIA GPUs using the CUDA programming model. If Umpire is built without CUDA or HIP support, then only the HOST allocator is available for use.
- ENABLE_HIP: This option enables support for AMD GPUs using the ROCm stack and HIP programming model. If Umpire is built without CUDA or HIP support, then only the HOST allocator is available for use.
- UMPIRE_ENABLE_NUMA: This option enables support for NUMA. The umpire::strategy::NumaPolicy is available when built with this option, which may be used to locate the allocation to a specific node.
- UMPIRE_ENABLE_FILE_RESOURCE: This option will allow the build to make all File Memory Allocation files. If Umpire is built without FILE, CUDA or HIP support, then only the HOST allocator is available for use.
- ENABLE_TESTS: This option controls whether or not test executables will be built.
- ENABLE_BENCHMARKS: This option will build the benchmark programs used to test performance.
- UMPIRE_ENABLE_LOGGING: This option enables usage of logging services for Umpire.
- UMPIRE_ENABLE_SLIC: This option enables usage of logging services provided by SLIC.
- UMPIRE_ENABLE_BACKTRACE: This option enables collection of backtrace information for each allocation.
- UMPIRE_ENABLE_BACKTRACE_SYMBOLS: This option enables symbol information to be provided with backtraces. This requires -ldl to be specified for programs using Umpire.
- UMPIRE_ENABLE_TOOLS: Enable development tools for Umpire (replay, etc.).
- ENABLE_DOCS: Build user documentation (with Sphinx) and code documentation (with Doxygen).
- UMPIRE_ENABLE_C: Build the C API, allowing access to Umpire Allocators and the ResourceManager through a C interface.
- ENABLE_FORTRAN: Build the Fortran API.
- UMPIRE_ENABLE_PERFORMANCE_TESTS: Build and run performance tests.
- UMPIRE_ENABLE_IPC_SHARED_MEMORY: This option enables support for interprocess shared memory. Currently, this feature only exists for HOST memory.
- UMPIRE_ENABLE_ASAN: This option enables address sanitization checks within Umpire by compilers that support options like -fsanitize=address.
Umpire Cookbook¶
This section provides a set of recipes that show you how to accomplish specific tasks using Umpire. The main focus is things that can be done by composing different parts of Umpire to achieve a particular use case.
Examples include being able to grow and shrink a pool, constructing Allocators that have introspection disabled for improved performance, and applying CUDA “memory advise” to all the allocations in a particular pool.
Growing and Shrinking a Pool¶
When sharing a pool between different parts of your application, or even between coordinating libraries in the same application, you might want to grow and shrink a pool on demand. By limiting the size of a pool that uses device memory, you leave more space on the GPU for “unified memory” to move data there.
The basic idea is to create a pool that allocates a block of your minimum size, and then allocate a single word from this pool to ensure the initial block is never freed:
auto pooled_allocator =
rm.makeAllocator<umpire::strategy::QuickPool>("GPU_POOL", allocator, 4ul * 1024ul * 1024ul * 1024ul + 1);
To increase the pool size you can preallocate a large chunk and then immediately free it. The pool will retain this memory for use by later allocations:
void* grow = pooled_allocator.allocate(8ul * 1024ul * 1024ul * 1024ul);
pooled_allocator.deallocate(grow);
std::cout << "Pool has allocated " << pooled_allocator.getActualSize() << " bytes of memory. "
<< pooled_allocator.getCurrentSize() << " bytes are used" << std::endl;
Assuming that there are no allocations left in the larger “chunk” of the pool, you can shrink the pool back down to the initial size by calling umpire::Allocator::release():
pooled_allocator.release();
std::cout << "Pool has allocated " << pooled_allocator.getActualSize() << " bytes of memory. "
<< pooled_allocator.getCurrentSize() << " bytes are used" << std::endl;
The complete example is included below:
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include <iostream>
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/QuickPool.hpp"
#include "umpire/util/Macros.hpp"
int main(int, char**)
{
auto& rm = umpire::ResourceManager::getInstance();
auto allocator = rm.getAllocator("DEVICE");
//
// Create a 4 Gb pool and reserve one word (to maintain alignment)
//
// _sphinx_tag_tut_create_pool_start
auto pooled_allocator =
rm.makeAllocator<umpire::strategy::QuickPool>("GPU_POOL", allocator, 4ul * 1024ul * 1024ul * 1024ul + 1);
// _sphinx_tag_tut_create_pool_end
void* hold = pooled_allocator.allocate(64);
UMPIRE_USE_VAR(hold);
std::cout << "Pool has allocated " << pooled_allocator.getActualSize() << " bytes of memory. "
<< pooled_allocator.getCurrentSize() << " bytes are used" << std::endl;
//
// Grow pool to ~12Gb by grabbing an 8Gb chunk
//
// _sphinx_tag_tut_grow_pool_start
void* grow = pooled_allocator.allocate(8ul * 1024ul * 1024ul * 1024ul);
pooled_allocator.deallocate(grow);
std::cout << "Pool has allocated " << pooled_allocator.getActualSize() << " bytes of memory. "
<< pooled_allocator.getCurrentSize() << " bytes are used" << std::endl;
// _sphinx_tag_tut_grow_pool_end
//
// Shrink pool back to ~4Gb
//
// _sphinx_tag_tut_shrink_pool_back_start
pooled_allocator.release();
std::cout << "Pool has allocated " << pooled_allocator.getActualSize() << " bytes of memory. "
<< pooled_allocator.getCurrentSize() << " bytes are used" << std::endl;
// _sphinx_tag_tut_shrink_pool_back_end
return 0;
}
Disable Introspection¶
If you know that you won’t be using any of Umpire’s introspection capabilities for allocations that come from a particular umpire::Allocator, you can turn off the introspection and avoid the overhead of tracking the associated metadata.
Warning
Disabling introspection means that allocations from this Allocator cannot be used for operations, or size and location queries.
In this recipe, we look at disabling introspection for a pool. To turn off introspection, you pass a boolean as the second template parameter to the umpire::ResourceManager::makeAllocator() method:
auto pooled_allocator = rm.makeAllocator<umpire::strategy::QuickPool, false>("NO_INTROSPECTION_POOL", allocator);
Remember that disabling introspection will stop tracking the size of allocations made from the pool, so the umpire::Allocator::getCurrentSize() method will return 0:
std::cout << "Pool has allocated " << pooled_allocator.getActualSize() << " bytes of memory. "
<< pooled_allocator.getCurrentSize() << " bytes are used" << std::endl;
The complete example is included below:
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include <iostream>
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/QuickPool.hpp"
#include "umpire/util/Macros.hpp"
int main(int, char**)
{
auto& rm = umpire::ResourceManager::getInstance();
auto allocator = rm.getAllocator("HOST");
//
// Create a pool with introspection disabled (can improve performance)
//
// _sphinx_tag_tut_nointro_start
auto pooled_allocator = rm.makeAllocator<umpire::strategy::QuickPool, false>("NO_INTROSPECTION_POOL", allocator);
// _sphinx_tag_tut_nointro_end
void* data = pooled_allocator.allocate(1024);
// _sphinx_tag_tut_getsize_start
std::cout << "Pool has allocated " << pooled_allocator.getActualSize() << " bytes of memory. "
<< pooled_allocator.getCurrentSize() << " bytes are used" << std::endl;
// _sphinx_tag_tut_getsize_end
pooled_allocator.deallocate(data);
return 0;
}
Apply Memory Advice to a Pool¶
When using unified memory on systems with CUDA GPUs, various types of memory advice can be applied to modify how the CUDA runtime moves this memory around between the CPU and GPU. One type of advice that can be applied is “preferred location”, and you can specify where you want the preferred location of the memory to be. This can be useful for ensuring that the memory is kept on the GPU.
By creating a pool on top of an umpire::strategy::AllocationAdvisor, you can amortize the cost of applying memory advice:
//
// Create an allocator that applies "PREFERRED_LOCATION" advice to set the
// GPU as the preferred location.
//
auto preferred_location_allocator = rm.makeAllocator<umpire::strategy::AllocationAdvisor>(
"preferred_location_device", allocator, "PREFERRED_LOCATION");
//
// Create a pool using the preferred_location_allocator. This makes all
// allocations in the pool have the same preferred location, the GPU.
//
auto pooled_allocator = rm.makeAllocator<umpire::strategy::QuickPool>("GPU_POOL", preferred_location_allocator);
The complete example is included below:
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include <iostream>
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/AllocationAdvisor.hpp"
#include "umpire/strategy/QuickPool.hpp"
#include "umpire/util/Macros.hpp"
int main(int, char**)
{
auto& rm = umpire::ResourceManager::getInstance();
auto allocator = rm.getAllocator("UM");
// _sphinx_tag_tut_pool_advice_start
//
// Create an allocator that applies "PREFERRED_LOCATION" advice to set the
// GPU as the preferred location.
//
auto preferred_location_allocator = rm.makeAllocator<umpire::strategy::AllocationAdvisor>(
"preferred_location_device", allocator, "PREFERRED_LOCATION");
//
// Create a pool using the preferred_location_allocator. This makes all
// allocations in the pool have the same preferred location, the GPU.
//
auto pooled_allocator = rm.makeAllocator<umpire::strategy::QuickPool>("GPU_POOL", preferred_location_allocator);
// _sphinx_tag_tut_pool_advice_end
UMPIRE_USE_VAR(pooled_allocator);
return 0;
}
Apply Memory Advice with a Specific Device ID¶
When using unified memory on systems with CUDA GPUs, various types of memory advice can be applied to modify how the CUDA runtime moves this memory around between the CPU and GPU. When applying memory advice, a device ID can be used to specify which device the advice relates to. One type of advice that can be applied is “preferred location”, and you can specify where you want the preferred location of the memory to be. This can be useful for ensuring that the memory is kept on the GPU.
By passing a specific device id when constructing an umpire::strategy::AllocationAdvisor, you can ensure that the advice will be applied with respect to that device:
//
// Create an allocator that applies "PREFERRED_LOCATION" advice to set a
// specific GPU device as the preferred location.
//
// In this case, device #2.
//
const int device_id = 2;
try {
auto preferred_location_allocator = rm.makeAllocator<umpire::strategy::AllocationAdvisor>(
"preferred_location_device_2", allocator, "PREFERRED_LOCATION", device_id);
The complete example is included below:
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include <iostream>
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/AllocationAdvisor.hpp"
#include "umpire/util/Exception.hpp"
int main(int, char**)
{
auto& rm = umpire::ResourceManager::getInstance();
auto allocator = rm.getAllocator("UM");
// _sphinx_tag_tut_device_advice_start
//
// Create an allocator that applies "PREFERRED_LOCATION" advice to set a
// specific GPU device as the preferred location.
//
// In this case, device #2.
//
const int device_id = 2;
try {
auto preferred_location_allocator = rm.makeAllocator<umpire::strategy::AllocationAdvisor>(
"preferred_location_device_2", allocator, "PREFERRED_LOCATION", device_id);
// _sphinx_tag_tut_device_advice_end
void* data = preferred_location_allocator.allocate(1024);
preferred_location_allocator.deallocate(data);
} catch (umpire::util::Exception& e) {
std::cout << "Couldn't create Allocator with device_id = " << device_id << std::endl;
std::cout << e.message() << std::endl;
}
return 0;
}
Moving Host Data to Managed Memory¶
When using a system with NVIDIA GPUs, you may realize that some host data should be moved to unified memory in order to make it accessible by the GPU. You can do this with the umpire::ResourceManager::move() operation:
double* um_data = static_cast<double*>(rm.move(host_data, um_allocator));
The move operation will copy the data from host memory to unified memory, allocated using the provided um_allocator. The original allocation in host memory will be deallocated. The complete example is included below:
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
int main(int, char**)
{
constexpr std::size_t SIZE = 1024;
auto& rm = umpire::ResourceManager::getInstance();
auto allocator = rm.getAllocator("HOST");
//
// Allocate host data
//
double* host_data = static_cast<double*>(allocator.allocate(SIZE * sizeof(double)));
//
// Move data to unified memory
//
auto um_allocator = rm.getAllocator("UM");
// _sphinx_tag_tut_move_host_to_managed_start
double* um_data = static_cast<double*>(rm.move(host_data, um_allocator));
// _sphinx_tag_tut_move_host_to_managed_end
//
// Deallocate um_data, host_data is already deallocated by move operation.
//
rm.deallocate(um_data);
return 0;
}
Improving DynamicPoolList Performance with a Coalesce Heuristic¶
As needed, the umpire::strategy::DynamicPoolList will continue to allocate blocks to satisfy allocation requests that cannot be satisfied by blocks currently in the pool it is managing. Under certain application-specific memory allocation patterns, fragmentation within the blocks or allocations for sizes greater than the size of the largest available block can cause the pool to grow too large. For example, a problematic allocation pattern is when an application makes several allocations of incrementing size where each allocation is larger than the previous block size allocated.
The umpire::strategy::DynamicPoolList::coalesce() method may be used to cause the umpire::strategy::DynamicPoolList to coalesce the releasable blocks into a single larger block. This is accomplished by tallying the size of all blocks without allocations against them, releasing those blocks back to the memory resource, and creating a new block of the previously tallied size.
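If you want to trigger coalescing yourself rather than relying on the heuristic described next, a minimal sketch looks like the following; it assumes a DynamicPoolList-backed allocator like the one built later in this recipe, and reaches the strategy through umpire::util::unwrap_allocator as in the complete example below:
#include "umpire/Allocator.hpp"
#include "umpire/strategy/DynamicPoolList.hpp"
#include "umpire/util/wrap_allocator.hpp"
void coalesce_pool(umpire::Allocator pooled_allocator)
{
  // Reach the underlying DynamicPoolList and merge its releasable blocks into one.
  auto dynamic_pool = umpire::util::unwrap_allocator<umpire::strategy::DynamicPoolList>(pooled_allocator);
  dynamic_pool->coalesce();
}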
Applications may offer a heuristic function to the umpire::strategy::DynamicPoolList during instantiation that will return true whenever a pool reaches a specific threshold of releasable bytes (represented by completely free blocks) to the total size of the pool. The DynamicPoolList will call this heuristic function just before it returns from its umpire::strategy::DynamicPoolList::deallocate() method, and when the function returns true, the DynamicPoolList will call the umpire::strategy::DynamicPoolList::coalesce() method.
The default heuristic of 100 will cause the DynamicPoolList to automatically coalesce when all of the bytes in the pool are releasable and there is more than one block in the pool.
A heuristic of 0 will cause the DynamicPoolList to never automatically coalesce.
Creation of the heuristic function is accomplished by:
//
// Create a heuristic function that will return true to the DynamicPoolList
// object when the threshold of releasable size to total size is 75%.
//
auto heuristic_function = umpire::strategy::DynamicPoolList::percent_releasable(75);
The heuristic function is then provided as a parameter when the object is instantiated:
//
// Create a pool with an initial block size of 1 Kb and 1 Kb block size for
// all subsequent allocations and with our previously created heuristic
// function.
//
auto pooled_allocator = rm.makeAllocator<umpire::strategy::DynamicPoolList>("HOST_POOL", allocator, 1024ul, 1024ul,
16, heuristic_function);
The complete example is included below:
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/DynamicPoolList.hpp"
#include "umpire/util/Macros.hpp"
#include "umpire/util/wrap_allocator.hpp"
int main(int, char**)
{
auto& rm = umpire::ResourceManager::getInstance();
auto allocator = rm.getAllocator("HOST");
// _sphinx_tag_tut_creat_heuristic_fun_start
//
// Create a heuristic function that will return true to the DynamicPoolList
// object when the threshold of releasable size to total size is 75%.
//
auto heuristic_function = umpire::strategy::DynamicPoolList::percent_releasable(75);
// _sphinx_tag_tut_creat_heuristic_fun_end
// _sphinx_tag_tut_use_heuristic_fun_start
//
// Create a pool with an initial block size of 1 Kb and 1 Kb block size for
// all subsequent allocations and with our previously created heuristic
// function.
//
auto pooled_allocator = rm.makeAllocator<umpire::strategy::DynamicPoolList>("HOST_POOL", allocator, 1024ul, 1024ul,
16, heuristic_function);
// _sphinx_tag_tut_use_heuristic_fun_end
//
// Obtain a pointer to our specific DynamicPoolList instance in order to see the
// DynamicPoolList-specific statistics
//
auto dynamic_pool = umpire::util::unwrap_allocator<umpire::strategy::DynamicPoolList>(pooled_allocator);
void* a[4];
for (int i = 0; i < 4; ++i)
a[i] = pooled_allocator.allocate(1024);
for (int i = 0; i < 4; ++i) {
pooled_allocator.deallocate(a[i]);
std::cout << "Pool has " << pooled_allocator.getActualSize() << " bytes of memory. "
<< pooled_allocator.getCurrentSize() << " bytes are used. " << dynamic_pool->getBlocksInPool()
<< " blocks are in the pool. " << dynamic_pool->getReleasableSize() << " bytes are releaseable. "
<< std::endl;
}
return 0;
}
Move Allocations Between NUMA Nodes¶
When using NUMA (cache-coherent non-uniform memory access) systems, there are different latencies to different parts of memory. From an application perspective, the memory looks the same, yet especially for high-performance computing it is advantageous to have finer control. malloc() attempts to allocate memory close to your node, but it can make no guarantees. Therefore, Linux provides both a process-level interface for setting NUMA policies with the system utility numactl, and a fine-grained interface with libnuma. These interfaces work on ranges of memory in multiples of the page size, the granularity at which the operating system manages virtual memory.
A page range may be bound to a NUMA node using the umpire::strategy::NumaPolicy. It can therefore also be moved between NUMA nodes using umpire::ResourceManager::move() with a different allocator. The power of using such an abstraction is that the NUMA node can be associated with a device, in which case the memory is moved to, for example, GPU memory.
In this recipe we create an allocation bound to a NUMA node, and move it to another NUMA node.
The complete example is included below:
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include <iostream>
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/NumaPolicy.hpp"
#include "umpire/util/Macros.hpp"
#include "umpire/util/numa.hpp"
#if defined(UMPIRE_ENABLE_CUDA)
#include <cuda_runtime_api.h>
#endif
int main(int, char**)
{
auto& rm = umpire::ResourceManager::getInstance();
const std::size_t alloc_size = 5 * umpire::get_page_size();
// Get a list of the host NUMA nodes (e.g. one per socket)
auto host_nodes = umpire::numa::get_host_nodes();
if (host_nodes.size() < 1) {
UMPIRE_ERROR("No NUMA nodes detected");
}
// Create an allocator on the first NUMA node
auto host_src_alloc =
rm.makeAllocator<umpire::strategy::NumaPolicy>("host_numa_src_alloc", rm.getAllocator("HOST"), host_nodes[0]);
// Create an allocation on that node
void* src_ptr = host_src_alloc.allocate(alloc_size);
if (host_nodes.size() > 1) {
// Create an allocator on another host NUMA node.
auto host_dst_alloc =
rm.makeAllocator<umpire::strategy::NumaPolicy>("host_numa_dst_alloc", rm.getAllocator("HOST"), host_nodes[1]);
// Move the memory
void* dst_ptr = rm.move(src_ptr, host_dst_alloc);
// The pointer shouldn't change even though the memory location changes
if (dst_ptr != src_ptr) {
UMPIRE_ERROR("Pointers should match");
}
// Touch it
rm.memset(dst_ptr, 0);
// Verify NUMA node
if (umpire::numa::get_location(dst_ptr) != host_nodes[1]) {
UMPIRE_ERROR("Move was unsuccessful");
}
}
#if defined(UMPIRE_ENABLE_DEVICE)
// Get a list of the device nodes
auto device_nodes = umpire::numa::get_device_nodes();
if (device_nodes.size() > 0) {
// Create an allocator on the first device NUMA node. Note that
// this still requires using the "HOST" allocator. The allocations
// are moved after the address space is reserved.
auto device_alloc = rm.makeAllocator<umpire::strategy::NumaPolicy>("device_numa_src_alloc", rm.getAllocator("HOST"),
device_nodes[0]);
// Move the memory
void* dst_ptr = rm.move(src_ptr, device_alloc);
// The pointer shouldn't change even though the memory location changes
if (dst_ptr != src_ptr) {
UMPIRE_ERROR("Pointers should match");
}
// Touch it -- this currently uses the host memset operation (thus, copying
// the memory back)
rm.memset(dst_ptr, 0);
// Verify NUMA node
if (umpire::numa::get_location(dst_ptr) != device_nodes[0]) {
UMPIRE_ERROR("Move was unsuccessful");
}
}
#endif
// Clean up by deallocating from the original allocator, since the
// allocation record is still associated with that allocator
host_src_alloc.deallocate(src_ptr);
return 0;
}
Determining the Largest Block of Available Memory in Pool¶
The umpire::strategy::QuickPool strategy provides a umpire::strategy::QuickPool::getLargestAvailableBlock() method that may be used to determine the size of the largest block currently available for allocation within the pool. To call this method, you must get a pointer to the underlying umpire::strategy::AllocationStrategy from the umpire::Allocator:
auto pool = rm.makeAllocator<umpire::strategy::QuickPool>("pool", rm.getAllocator("HOST"));
auto quick_pool = umpire::util::unwrap_allocator<umpire::strategy::QuickPool>(pool);
Once you have the pointer to the appropriate strategy, you can call the function:
std::cout << "Largest available block in pool is " << quick_pool->getLargestAvailableBlock() << " bytes in size"
<< std::endl;
The complete example is included below:
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include <iostream>
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/QuickPool.hpp"
#include "umpire/util/Exception.hpp"
#include "umpire/util/wrap_allocator.hpp"
int main(int, char**)
{
auto& rm = umpire::ResourceManager::getInstance();
// _sphinx_tag_tut_unwrap_start
auto pool = rm.makeAllocator<umpire::strategy::QuickPool>("pool", rm.getAllocator("HOST"));
auto quick_pool = umpire::util::unwrap_allocator<umpire::strategy::QuickPool>(pool);
// _sphinx_tag_tut_unwrap_end
if (quick_pool == nullptr) {
UMPIRE_ERROR(pool.getName() << " is not a QuickPool");
}
auto ptr = pool.allocate(1024);
// _sphinx_tag_tut_get_info_start
std::cout << "Largest available block in pool is " << quick_pool->getLargestAvailableBlock() << " bytes in size"
<< std::endl;
// _sphinx_tag_tut_get_info_end
pool.deallocate(ptr);
return 0;
}
Coalescing Pool Memory¶
The umpire::strategy::QuickPool strategy provides a umpire::strategy::QuickPool::coalesce() method that can be used to release unused memory and allocate a single large block that will be able to satisfy allocations up to the previously observed high-watermark. To call this method, you must get a pointer to the underlying umpire::strategy::AllocationStrategy from the umpire::Allocator:
auto quick_pool = umpire::util::unwrap_allocator<umpire::strategy::QuickPool>(pool);
Once you have the pointer to the appropriate strategy, you can call the function:
quick_pool->coalesce();
The complete example is included below:
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include <iostream>
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/QuickPool.hpp"
#include "umpire/util/Exception.hpp"
#include "umpire/util/wrap_allocator.hpp"
int main(int, char**)
{
auto& rm = umpire::ResourceManager::getInstance();
auto pool = rm.makeAllocator<umpire::strategy::QuickPool>("pool", rm.getAllocator("HOST"));
// _sphinx_tag_tut_unwrap_strategy_start
auto quick_pool = umpire::util::unwrap_allocator<umpire::strategy::QuickPool>(pool);
// _sphinx_tag_tut_unwrap_strategy_end
if (quick_pool) {
// _sphinx_tag_tut_call_coalesce_start
quick_pool->coalesce();
// _sphinx_tag_tut_call_coalesce_end
} else {
UMPIRE_ERROR(pool.getName() << " is not a QuickPool, cannot coalesce!");
}
return 0;
}
Building a Pinned Memory Pool in FORTRAN¶
In this recipe, we show you how to build a pool in pinned memory using Umpire’s FORTRAN API. These kinds of pools can be useful for allocating buffers to be used in communication routines in various scientific applications.
Building the pool takes two steps: 1) getting a base “PINNED” allocator, and 2) creating the pool:
rm = rm%get_instance()
base_allocator = rm%get_allocator_by_name("PINNED")
pinned_pool = rm%make_allocator_quick_pool("PINNED_POOL", &
base_allocator, &
512_8*1024_8, &
1024_8)
The complete example is included below:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
! project contributors. See the COPYRIGHT file for details.
!
! SPDX-License-Identifier: (MIT)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
program umpire_f_pinned_pool
use umpire_mod
implicit none
logical ok
integer(C_INT), pointer, dimension(:) :: array(:)
type(UmpireAllocator) base_allocator
type(UmpireAllocator) pinned_pool
type(UmpireResourceManager) rm
! _sphinx_tag_tut_pinned_fortran_start
rm = rm%get_instance()
base_allocator = rm%get_allocator_by_name("PINNED")
pinned_pool = rm%make_allocator_quick_pool("PINNED_POOL", &
base_allocator, &
512_8*1024_8, &
1024_8)
! _sphinx_tag_tut_pinned_fortran_end
call pinned_pool%allocate(array, [10])
end program umpire_f_pinned_pool
Visualizing Allocators¶
The Python script plot_allocations.py is included with Umpire to plot allocations. The script takes a series of three arguments per allocator: an output file containing allocation records, a color, and an alpha (transparency) value between 0.0 and 1.0. It can plot the records of a single allocator (three arguments), or overlay multiple allocators by passing 3n arguments after the script name. In this cookbook we use this feature to visualize a pooled allocator.
The cookbook generates two files, allocator.log and pooled_allocator.log, that contain the allocation records from the underlying allocator and the pool. These can then be plotted using a command similar to the following:
tools/plot_allocations allocator.log gray 0.2 pooled_allocator.log purple 0.8
The script uses Python and Matplotlib to generate the following image:
[Figure: allocation records for the underlying host allocator (gray) and the pooled allocator (purple), plotted by plot_allocations.py]
The complete example is included below:
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include <fstream>
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
#include "umpire/Umpire.hpp"
#include "umpire/strategy/QuickPool.hpp"
int main(int, char**)
{
auto& rm = umpire::ResourceManager::getInstance();
auto allocator = rm.getAllocator("HOST");
auto pooled_allocator = rm.makeAllocator<umpire::strategy::QuickPool>("HOST_POOL", allocator, 1024 * 16);
void* a[4];
for (int i = 0; i < 4; ++i)
a[i] = pooled_allocator.allocate(1024);
// Create fragmentation
pooled_allocator.deallocate(a[2]);
a[2] = pooled_allocator.allocate(1024 * 2);
// Output the records from the underlying host allocator
{
std::ofstream out("allocator.log");
umpire::print_allocator_records(allocator, out);
out.close();
}
// Output the records from the pooled allocator
{
std::ofstream out("pooled_allocator.log");
umpire::print_allocator_records(pooled_allocator, out);
out.close();
}
for (int i = 0; i < 4; ++i)
pooled_allocator.deallocate(a[i]);
// Visualize this using the python script. Example usage:
// tools/analysis/plot_allocations allocator.log gray 0.2 pooled_allocator.log
// purple 0.8
return 0;
}
Mixed Pool Creation and Algorithm Basics¶
This recipe shows how to create a default mixed pool, and one that might be tailored to a specific application's needs. Mixed pools use an array of umpire::strategy::FixedPool allocators for small allocations, because these have simpler bookkeeping and are very fast, and an umpire::strategy::QuickPool for larger allocations. The class umpire::strategy::MixedPool uses a generic set of umpire::strategy::FixedPool bins from 256 bytes to 4MB in power-of-2 increments, while umpire::strategy::MixedPoolImpl has template arguments that select the first fixed pool size, the power-of-2 increment, and the last fixed pool size.
The complete example is included below:
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include <iostream>
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/MixedPool.hpp"
int main(int, char **)
{
auto &rm = umpire::ResourceManager::getInstance();
auto allocator = rm.getAllocator("HOST");
/*
* Create a default mixed pool.
*/
auto default_mixed_allocator = rm.makeAllocator<umpire::strategy::MixedPool>("default_mixed_pool", allocator);
UMPIRE_USE_VAR(default_mixed_allocator);
/*
* Create a mixed pool using fixed pool bins of size 2^8 = 256 Bytes
* to 2^14 = 16 kB in increments of 5x, where each individual fixed
* pool is kept under 4MB in size to begin.
*/
auto custom_mixed_allocator =
rm.makeAllocator<umpire::strategy::MixedPool>("custom_mixed_pool", allocator, 256, 16 * 1024, 4 * 1024 * 1024, 5);
/*
* Although this calls for only 4*4=16 bytes, this allocation will
* come from the smallest fixed pool, thus ptr will actually be the
* first address in a range of 256 bytes.
*/
void *ptr1 = custom_mixed_allocator.allocate(4 * sizeof(int));
/*
* This is beyond the range of the fixed pools, and therefore is
* allocated from a dynamic pool. The range of address space
* reserved will be exactly what was requested by the allocate()
* method.
*/
void *ptr2 = custom_mixed_allocator.allocate(1 << 18);
/*
* Clean up
*/
custom_mixed_allocator.deallocate(ptr1);
custom_mixed_allocator.deallocate(ptr2);
return 0;
}
Thread Safe Allocator¶
If you want thread-safe access to allocations that come from a particular umpire::Allocator, you can create an umpire::strategy::ThreadSafeAllocator object that will synchronize access to it.
In this recipe, we look at creating an umpire::strategy::ThreadSafeAllocator for an umpire::strategy::QuickPool object:
auto& rm = umpire::ResourceManager::getInstance();
auto pool = rm.makeAllocator<umpire::strategy::QuickPool>("pool", rm.getAllocator("HOST"));
auto thread_safe_pool = rm.makeAllocator<umpire::strategy::ThreadSafeAllocator>("thread_safe_pool", pool);
The complete example is included below:
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/QuickPool.hpp"
#include "umpire/strategy/ThreadSafeAllocator.hpp"
int main(int, char**)
{
// _sphinx_tag_tut_thread_safe_start
auto& rm = umpire::ResourceManager::getInstance();
auto pool = rm.makeAllocator<umpire::strategy::QuickPool>("pool", rm.getAllocator("HOST"));
auto thread_safe_pool = rm.makeAllocator<umpire::strategy::ThreadSafeAllocator>("thread_safe_pool", pool);
// _sphinx_tag_tut_thread_safe_end
auto allocation = thread_safe_pool.allocate(256);
thread_safe_pool.deallocate(allocation);
return 0;
}
Using File System Allocator (FILE)¶
Umpire supports the use of file-based memory allocation. When UMPIRE_ENABLE_FILE_RESOURCE is enabled, the environment variable UMPIRE_MEMORY_FILE_DIR can be used to control where memory is allocated from:
- UMPIRE_MEMORY_FILE_DIR (default: ./): directory in which to create and allocate file-based allocations
Requesting the allocation takes two steps: 1) getting a “FILE” allocator, 2) requesting the amount of memory to allocate.
auto& rm = umpire::ResourceManager::getInstance();
umpire::Allocator alloc = rm.getAllocator("FILE");
std::size_t* A = (std::size_t*)alloc.allocate(sizeof(size_t));
To deallocate:
alloc.deallocate(A);
The complete example is included below:
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
int main(int, char** argv)
{
// _sphinx_tag_tut_file_allocate_start
auto& rm = umpire::ResourceManager::getInstance();
umpire::Allocator alloc = rm.getAllocator("FILE");
std::size_t* A = (std::size_t*)alloc.allocate(sizeof(size_t));
// _sphinx_tag_tut_file_allocate_end
// _sphinx_tag_tut_file_deallocate_start
alloc.deallocate(A);
// _sphinx_tag_tut_file_deallocate_end
return 0;
}
Using Burst Buffers On Lassen¶
On Lassen: 1) download the latest version of Umpire, and 2) request a private node to build on:
$ git clone --recursive https://github.com/LLNL/Umpire.git
$ lalloc 1 -stage storage=64
Note that -stage storage=64 is needed in order to work with the Burst Buffers. 3) Additionally, the UMPIRE_MEMORY_FILE_DIR environment variable needs to be set to $BBPATH:
$ export UMPIRE_MEMORY_FILE_DIR=$BBPATH/
Running File Resource Benchmarks¶
Continue building Umpire on one node, setting -DUMPIRE_ENABLE_FILE_RESOURCE=On:
$ mkdir build && cd build
$ lrun -n 1 cmake -DUMPIRE_ENABLE_FILE_RESOURCE=On -DENABLE_OPENMP=On ../ && make
To run the built-in benchmarks in Umpire from the build directory, run:
$ lrun -n 1 --threads=** ./bin/file_resource_benchmarks ##
** is a placeholder for the number of threads to run the benchmark with, and ## is the number of array elements to pass through the benchmark; this number can range from 1 to 100,000,000,000.
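For example (the thread count and element count below are illustrative), a run with 4 threads over 100,000 array elements would be:
$ lrun -n 1 --threads=4 ./bin/file_resource_benchmarks 100000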
Results should appear like:
Array Size: 1 Memory Size: 8e-06 MB
Total Arrays: 3 Total Memory Size: 2.4e-05 MB
HOST
Initialization: 0.0247461 MB/sec
Initialization Time: 0.000969849 sec
---------------------------------------
Copy: 0.890918 MB/sec
Copy Time: 1.7959e-05 sec
---------------------------------------
Scale: 0.928074 MB/sec
Scale Time: 1.724e-05 sec
---------------------------------------
Add: 1.321 MB/sec
Add Time: 1.8168e-05 sec
---------------------------------------
Triad: 1.24102 MB/sec
Triad Time: 1.9339e-05 sec
---------------------------------------
Total Time: 0.00104323 sec
FILE
Initialization: 0.210659 MB/sec
Initialization Time: 0.000113928 sec
---------------------------------------
Copy: 0.84091 MB/sec
Copy Time: 1.9027e-05 sec
---------------------------------------
Scale: 0.938086 MB/sec
Scale Time: 1.7056e-05 sec
---------------------------------------
Add: 1.28542 MB/sec
Add Time: 1.8671e-05 sec
---------------------------------------
Triad: 1.54689 MB/sec
Triad Time: 1.5515e-05 sec
---------------------------------------
Total Time: 0.000184726 sec
This benchmark runs similarly to the STREAM benchmark and can also benchmark additional allocators, such as UM for CUDA and DEVICE for HIP.
Getting the Strategy Name¶
Since every Allocator is represented by the same type after it’s been created, it can be difficult to determine exactly what kind of strategy the allocator is using. The name of the strategy can be accessed using the Allocator::getStrategyName() method:
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include <iostream>
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/QuickPool.hpp"
int main(int, char**)
{
auto& rm = umpire::ResourceManager::getInstance();
auto allocator = rm.getAllocator("HOST");
// _sphinx_tag_tut_strategy_name_start
//
auto pool = rm.makeAllocator<umpire::strategy::QuickPool>("POOL", allocator);
std::cout << pool.getStrategyName() << std::endl;
// _sphinx_tag_tut_strategy_name_end
UMPIRE_USE_VAR(pool);
return 0;
}
Features¶
Allocators¶
Allocators are the fundamental object used to allocate and deallocate memory using Umpire.
Allocator Accessibility¶
The Umpire library provides a variety of umpire::resource::MemoryResources which can be used to create umpire::Allocators, depending on what's available on your system. The resources are explained in more detail on the C API: Resources page.
Additionally, the platforms that Umpire supports are defined by the CAMP library. This means that there is also a selection of platforms with which an allocator can be associated. For example, an Allocator created with the pinned memory resource can be used with the host, cuda, hip, or sycl platforms.
Because of these options, it can be difficult to trace not only which memory resource an allocator was created with, but also which allocators can be accessed by which platforms. Umpire has the memory resource trait, resource_type, to provide the ability to query which memory resource is associated with a particular allocator (see example here). Additionally, Umpire has a function, is_accessible(Platform p, Allocator a), that determines whether a particular allocator is accessible by a particular platform (see example here). The allocator_accessibility.cpp test checks which platforms are available and confirms that all memory resources which should be accessible from that platform can actually be accessed and used.
For example, if an umpire::Allocator, alloc, is created with the host memory resource and we want to know whether it should be accessible from the omp_target CAMP platform, then we can use the is_accessible(Platform::omp_target, alloc) function and find that it should be accessible. The allocator_access.cpp file demonstrates this functionality for the host platform specifically.
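As a short sketch of the query described above (assuming a build where the omp_target platform is available; using the HOST allocator here is just an illustration):
#include <iostream>
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
#include "umpire/Umpire.hpp"
int main()
{
  auto& rm = umpire::ResourceManager::getInstance();
  umpire::Allocator alloc = rm.getAllocator("HOST");
  // Ask whether this allocator should be usable from the omp_target platform.
  if (umpire::is_accessible(umpire::Platform::omp_target, alloc)) {
    std::cout << alloc.getName() << " is accessible from omp_target" << std::endl;
  }
  return 0;
}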
Allocator Inaccessibility Configuration¶
For allocators that are deemed inaccessible, it may be useful to double check or confirm that the allocator can in fact NOT access memory on that given platform. In this case, the CMake flag UMPIRE_ENABLE_INACCESSIBILITY_TESTS will need to be turned on.
Build and Run Configuration¶
To build and run these files, either use uberenv or the appropriate cmake flags for the
desired platform and then run ctest -T test -R allocator_accessibility_tests --output-on-failure
for the test code and ./bin/alloc_access
for the example code.
Note
The Developer’s Guide shows how to configure Umpire with uberenv to build with different CAMP platforms.
Below, the allocator_access.cpp
code is shown to demonstrate how this functionality can be
used during development.
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include <iostream>
#include <string>
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
#include "umpire/Umpire.hpp"
bool is_accessible_from_host(umpire::Allocator a)
{
if (umpire::is_accessible(umpire::Platform::host, a)) {
std::cout << "The allocator, " << a.getName() << ", is accessible." << std::endl;
return true;
} else {
std::cout << "The allocator, " << a.getName() << ", is _not_ accessible." << std::endl << std::endl;
return false;
}
}
/////////////////////////////////////////////////////////////////////////////////////////
// Depending on how Umpire has been set up, several different allocators could be accessible
// from the host CAMP platform. This example will create a list of all currently available
// allocators and then determine whether each can be accessed from the host platform.
//(To test other platforms, see allocator accessibility test.)
////////////////////////////////////////////////////////////////////////////////////////
int main()
{
auto& rm = umpire::ResourceManager::getInstance();
std::vector<std::string> allNames = rm.getResourceNames();
std::vector<umpire::Allocator> alloc;
///////////////////////////////////////////////////
// Create an allocator for each available type
//////////////////////////////////////////////////
std::cout << "Available allocators: ";
for (auto a : allNames) {
if (a.find("::") == std::string::npos) {
alloc.push_back(rm.getAllocator(a));
std::cout << a << " ";
}
}
std::cout << std::endl;
///////////////////////////////////////////////////
// Test accessibility
///////////////////////////////////////////////////
std::cout << "Testing the available allocators for accessibility from the CAMP host platform:" << std::endl;
const int size = 100;
for (auto a : alloc) {
if (is_accessible_from_host(a)) {
int* data = static_cast<int*>(a.allocate(size * sizeof(int)));
for (int i = 0; i < size; i++) {
data[i] = i * i;
}
UMPIRE_ASSERT(data[size - 1] == (size - 1) * (size - 1) && "Inequality found in array that should be accessible");
}
}
return 0;
}
Backtrace¶
The Umpire library may be configured to provide programs that use it with backtrace information as part of the description strings of exceptions thrown by Umpire.
Umpire may also be configured to collect and provide backtrace information for each memory allocation Umpire performs.
Build Configuration¶
Backtrace is enabled in Umpire builds with the following (see the example after this list):
- cmake … -DUMPIRE_ENABLE_BACKTRACE=On … to enable backtrace capability in Umpire.
- cmake … -DUMPIRE_ENABLE_BACKTRACE=On -DUMPIRE_ENABLE_BACKTRACE_SYMBOLS=On … to enable Umpire to display symbol information with backtraces. Note: using programs will need to add the -rdynamic and -ldl linker flags in order to properly link with this configuration of the Umpire library.
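For example, a configure line that enables both options, together with an illustrative link line for a program using Umpire (the toolchain and library paths here are assumptions):
$ cmake -DUMPIRE_ENABLE_BACKTRACE=On -DUMPIRE_ENABLE_BACKTRACE_SYMBOLS=On ../
$ g++ my_app.o -o my_app -rdynamic -ldl -lumpire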
Runtime Configuration¶
For versions of the Umpire library that are backtrace enabled (via the flags above), the user may expect the following.
Backtrace information will always be provided in the description strings of exceptions thrown by Umpire.
Setting the environment variable UMPIRE_BACKTRACE=On will cause Umpire to record backtrace information for each memory allocation it provides.
Setting the environment variable UMPIRE_LOG_LEVEL=Error will cause Umpire to log backtrace information for each of the leaked Umpire allocations found during application exit.
A programmatic interface is also available via the umpire::print_allocator_records free function.
An example of checking and displaying the information logged above may be found here:
//////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2016-21, Lawrence Livermore National Security, LLC and Umpire
// project contributors. See the COPYRIGHT file for details.
//
// SPDX-License-Identifier: (MIT)
//////////////////////////////////////////////////////////////////////////////
#include <iostream>
#include <sstream>
#include "umpire/ResourceManager.hpp"
#include "umpire/Umpire.hpp"
#include "umpire/strategy/QuickPool.hpp"
int main(int, char**)
{
auto& rm = umpire::ResourceManager::getInstance();
auto allocator = rm.getAllocator("HOST");
auto pool_allocator = rm.makeAllocator<umpire::strategy::QuickPool>("host_quick_pool", allocator);
allocator.allocate(16);
allocator.allocate(32);
allocator.allocate(64);
pool_allocator.allocate(128);
pool_allocator.allocate(256);
pool_allocator.allocate(512);
std::stringstream ss;
umpire::print_allocator_records(allocator, ss);
umpire::print_allocator_records(pool_allocator, ss);
// Example #1 of 3 - Leaked allocations
//
// If Umpire compiled with -DUMPIRE_ENABLE_BACKTRACE=On, then backtrace
// information will be printed for each of the allocations made above.
//
// Otherwise, if Umpire was not compiled with -DUMPIRE_ENABLE_BACKTRACE=On,
// then only the addresses and size information for each allocation will be
// printed.
//
if (!ss.str().empty())
std::cout << ss.str();
// Example #2 of 3 - Umpire error exceptions
//
// When umpire throws an exception, a backtrace to the offending call will
// be provided in the exception string.
//
void* bad_ptr = (void*)0xBADBADBAD;
try {
allocator.deallocate(bad_ptr); // Will cause a throw from umpire
} catch (const std::exception& exc) {
//
// exc.what() string will also contain a backtrace
//
std::cout << "Exception thrown from Umpire:" << std::endl << exc.what();
}
// Example #3 of 3 - Leak detection
//
// When the program terminates, Umpire's resource manager will be
// deconstructed. During deconstruction, Umpire will log the size and
// address, of each leaked allocation in each allocator.
//
// If Umpire was compiled with -DUMPIRE_ENABLE_BACKTRACE=On, backtrace
// information will also be logged for each leaked allocation in each
// allocator.
//
// To enable (and see) the umpire logs, set the environment variable
// UMPIRE_LOG_LEVEL=Error.
//
return 0;
}
File I/O¶
Umpire provides support for writing files containing log and replay data, rather than directing this output to stdout. When logging or replay are enabled, the following environment variables can be used to determine where the output is written:
- UMPIRE_OUTPUT_DIR (default: .): directory to write log and replay files
- UMPIRE_OUTPUT_BASENAME (default: umpire): basename of logging and replay files
The values of these variables are used to construct unique filenames for
output. The extension .log
is used for logging output, and .replay
for
replay output. The filenames additionally contain three integers, one
corresponding to the rank of the process, one corresponding to the process ID,
and one that is used to make multiple files with the same basename and rank
unique. This ensures that multiple runs with the same IO configuration do not
overwrite files.
The format of the filenames is:
<UMPIRE_OUTPUT_BASENAME>.<RANK>.<PID>.<UID>.<log|replay>
If Umpire is compiled without MPI support, then rank will always be 0.
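For example (the rank, PID, and UID in the resulting filename are purely illustrative):
$ export UMPIRE_OUTPUT_DIR=/tmp/umpire-output
$ export UMPIRE_OUTPUT_BASENAME=my_run
$ UMPIRE_REPLAY=On ./my_umpire_using_program
could produce a replay file such as /tmp/umpire-output/my_run.0.12345.0.replay.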
Logging and Replay of Umpire Events¶
Logging¶
When debugging memory operation problems, it is sometimes helpful to enable Umpire's logging facility. Logging is enabled in default builds unless -DUMPIRE_ENABLE_LOGGING=Off has been specified, in which case it is disabled.
If Umpire logging is enabled, it may be controlled by setting the UMPIRE_LOG_LEVEL environment variable to Error, Warning, Info, or Debug. The Debug value is the most verbose.
When UMPIRE_LOG_LEVEL has been set, events will be logged to the standard output.
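For example, to see the most verbose logging from a program that uses Umpire (the program name here is illustrative):
$ UMPIRE_LOG_LEVEL=Debug ./my_umpire_using_program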
Replay¶
Umpire provides a lightweight replay capability that can be used to investigate the performance of particular allocation patterns and to reproduce bugs. By running an executable that uses Umpire with the environment variable UMPIRE_REPLAY set to On, Umpire will emit information for the following Umpire events:
- version: umpire::get_major_version(), umpire::get_minor_version(), and umpire::get_patch_version()
- makeMemoryResource: umpire::resource::MemoryResourceRegistry::makeMemoryResource()
- makeAllocator: umpire::ResourceManager::makeAllocator()
- allocate: umpire::Allocator::allocate()
- deallocate: umpire::Allocator::deallocate()
Running with Replay¶
To enable Umpire replay, one may execute a program as follows:
UMPIRE_REPLAY="On" ./my_umpire_using_program > replay_log.json
This will write Umpire replay events to the file replay_log.json, which will contain the following kinds of information:
Interpreting Results - Version Event¶
The first event captured is the version event which shows the version information as follows:
{ "kind":"replay", "uid":27494, "timestamp":1558388052211435757, "event": "version", "payload": { "major":0, "minor":3, "patch":3 } }
Each line contains the following set of common elements:
- kind: always set to replay
- uid: the MPI rank of the process generating the event for MPI programs, or the PID for non-MPI programs
- timestamp: the time when the event occurred
- event: one of version, makeMemoryResource, makeAllocator, allocate, or deallocate
- payload: optional, varies by event type
- result: optional, varies by event type
As can be seen, the major, minor, and patch version numbers are captured within the payload for this event.
makeMemoryResource Event¶
Next you will see events for the creation of the default memory resources provided by Umpire with the makeMemoryResource event:
{ "kind":"replay", "uid":27494, "timestamp":1558388052211477678, "event": "makeMemoryResource", "payload": { "name": "HOST" }, "result": "0x101626b0" }
{ "kind":"replay", "uid":27494, "timestamp":1558388052471684134, "event": "makeMemoryResource", "payload": { "name": "DEVICE" }, "result": "0x101d79a0" }
{ "kind":"replay", "uid":27494, "timestamp":1558388052471698804, "event": "makeMemoryResource", "payload": { "name": "PINNED" }, "result": "0x101d7a50" }
{ "kind":"replay", "uid":27494, "timestamp":1558388052472972935, "event": "makeMemoryResource", "payload": { "name": "UM" }, "result": "0x101d7b00" }
{ "kind":"replay", "uid":27494, "timestamp":1558388052595814979, "event": "makeMemoryResource", "payload": { "name": "DEVICE_CONST" }, "result": "0x101d7bb0" }
The payload shows that a memory resource was created for HOST, DEVICE, PINNED, UM, and DEVICE_CONST respectively. Note that this could also be done with the FILE memory resource. The result is a reference to the object that was created within Umpire for that resource.
makeAllocator Event¶
The makeAllocator event occurs whenever a new allocator instance is created. Each call to makeAllocator will generate a pair of JSON lines: the first line shows the intent of the call, and the second line shows both the intent and the result. This is because the makeAllocator call can fail, and keeping both the intent and the result allows us to reproduce this failure later. The following lines show the creation of an umpire::Allocator:
{ "kind":"replay", "uid":27494, "timestamp":1558388052595864262, "event": "makeAllocator", "payload": { "type":"umpire::strategy::QuickPool", "with_introspection":true, "allocator_name":"pool", "args": [ "HOST" ] } }
{ "kind":"replay", "uid":27494, "timestamp":1558388052595903505, "event": "makeAllocator", "payload": { "type":"umpire::strategy::QuickPool", "with_introspection":true, "allocator_name":"pool", "args": [ "HOST" ] }, "result": { "allocator_ref":"0x108a8730" } }
The payload shows how the allocator was constructed. The result shows the reference to the allocated object.
allocate Event¶
Like the makeAllocator event, the allocate event is captured as an intention/result pair so that an error may be replayed in the event that there is an allocation failure.
{ "kind":"replay", "uid":27494, "timestamp":1558388052595911583, "event": "allocate", "payload": { "allocator_ref": "0x108a8730", "size": 0 } }
{ "kind":"replay", "uid":27494, "timestamp":1558388052595934822, "event": "allocate", "payload": { "allocator_ref": "0x108a8730", "size": 0 }, "result": { "memory_ptr": "0x200040000010" } }
The payload shows the object reference of the allocator and the size of the allocation request. The result shows the pointer to the memory allocated.
deallocate Event¶
{ "kind":"replay", "uid":27494, "timestamp":1558388052596358577, "event": "deallocate", "payload": { "allocator_ref": "0x108a8730", "memory_ptr": "0x200040000010" } }
The payload shows the reference to the allocator object and the pointer to the allocated memory that is to be freed.
Replaying the session¶
Loading this file with the replay program will replay this sequence of umpire::Allocator creation, allocations, and deallocations:
./bin/replay -i replay_log.json
Operations¶
Operations provide an abstract interface for modifying and moving data between umpire::Allocators.
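As a minimal sketch of two operations already used in the recipes above, memset and move (the pool name is an illustrative assumption, and error handling is omitted):
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/QuickPool.hpp"
int main()
{
  auto& rm = umpire::ResourceManager::getInstance();
  auto host_alloc = rm.getAllocator("HOST");
  auto pool = rm.makeAllocator<umpire::strategy::QuickPool>("ops_demo_pool", host_alloc);
  void* data = host_alloc.allocate(1024);
  // Set every byte of the allocation to zero.
  rm.memset(data, 0);
  // Relocate the allocation so that it is owned by the pool allocator.
  data = rm.move(data, pool);
  // The ResourceManager tracks which allocator now owns the data.
  rm.deallocate(data);
  return 0;
}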
Provided Operations¶
Strategies¶
Strategies are used in Umpire to allow custom algorithms to be applied when allocating memory. These strategies can do anything, from providing different pooling methods that speed up allocations to applying different operations to every allocation. Strategies can be composed to combine their functionality, allowing flexible and reusable implementations of different components, as the sketch below illustrates.
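For instance, the Thread Safe Allocator recipe above is a composition: a pool strategy is built on the HOST resource, and a thread-safe strategy is then built on top of the pool. A minimal sketch (the allocator names are illustrative):
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/QuickPool.hpp"
#include "umpire/strategy/ThreadSafeAllocator.hpp"
int main()
{
  auto& rm = umpire::ResourceManager::getInstance();
  // First strategy: a pool on top of the HOST memory resource.
  auto pool = rm.makeAllocator<umpire::strategy::QuickPool>("composed_pool", rm.getAllocator("HOST"));
  // Second strategy: thread-safe access, composed on top of the pool.
  auto safe_pool = rm.makeAllocator<umpire::strategy::ThreadSafeAllocator>("composed_thread_safe_pool", pool);
  void* ptr = safe_pool.allocate(512);
  safe_pool.deallocate(ptr);
  return 0;
}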
Provided Strategies¶
Contribution Guide¶
This document is intended for developers who want to add new features or bugfixes to Umpire. It assumes you have some familiarity with git and GitHub. It will discuss what a good pull request (PR) looks like, and the tests that your PR must pass before it can be merged into Umpire.
Forking Umpire¶
If you aren't an Umpire developer at LLNL, then you won't have permission to push new branches to the repository. First, you should create a fork. This will create a copy of the Umpire repository that you own, and will ensure you can push your changes up to GitHub and create pull requests.
Developing a New Feature¶
New features should be based on the develop
branch. When you want to create
a new feature, first ensure you have an up-to-date copy of the develop
branch:
$ git checkout develop
$ git pull origin develop
You can now create a new branch to develop your feature on:
$ git checkout -b feature/<name-of-feature>
Proceed to develop your feature on this branch, and add tests that will exercise your new code. If you are creating new methods or classes, please add Doxygen documentation.
Once your feature is complete and your tests are passing, you can push your branch to GitHub and create a PR.
Developing a Bug Fix¶
First, check if the change you want to make has been fixed in develop
. If
so, we suggest you either start using the develop
branch, or temporarily
apply the fix to whichever version of Umpire you are using.
If the bug is still unfixed, first make sure you have an up-to-date copy of the develop branch:
$ git checkout develop
$ git pull origin develop
Then create a new branch for your bugfix:
$ git checkout -b bugfix/<name-of-bug>
First, add a test that reproduces the bug you have found. Then develop your bugfix as normal, and run make test to check that your changes actually fix the bug.
Once you are finished, you can push your branch to GitHub, then create a PR.
Creating a Pull Request¶
You can create a new PR here.
Ensure that your PR base is the develop
branch of Umpire.
Add a descriptive title explaining the bug you fixed or the feature you have added, and put a longer description of the changes you have made in the comment box.
Once your PR has been created, it will be run through our automated tests and also be reviewed by Umpire team members. Providing the branch passes both the tests and reviews, it will be merged into Umpire.
Tests¶
Umpire uses Bamboo and Gitlab for continuous integration tests. Our tests are automatically run against every new pull request, and passing all tests is a requirement for merging your PR. If you are developing a bugfix or a new feature, please add a test that checks the correctness of your new code. Umpire is used on a wide variety of systems with a number of configurations, and adding new tests helps ensure that all features work as expected across these environments.
Umpire’s tests are all in the test
directory and are split up by component.
Developer Guide¶
This section provides documentation on Umpire’s Continuous Integration, how to configure Umpire using Uberenv, and the process for analyzing Umpire applications using HPCToolKit.
Continuous Integration¶
Gitlab CI¶
Umpire uses continuous integration to ensure that changes added to the repository are well integrated and tested for compatibility with the rest of the existing code base. Our CI tests include a variety of vetted configurations that run on different LC machines.
Umpire shares its Gitlab CI workflow with other projects. The documentation is therefore shared.
Uberenv¶
Umpire shares its Uberenv workflow with other projects. The documentation is therefore shared.
This page provides some Umpire-specific examples to illustrate the workflow described in the documentation.
Before you start¶
First of all, it is worth noting that Umpire has no dependencies other than CMake, which is usually installed externally.
That does not make the workflow useless: Uberenv will drive Spack, which will generate a host-config file with the toolchain (including CUDA if activated) and the options or variants pre-configured.
Machine specific configuration¶
$ ls -c1 scripts/uberenv/spack_configs
blueos_3_ppc64le_ib
darwin
toss_3_x86_64_ib
blueos_3_ppc64le_ib_p9
config.yaml
Umpire has been configured for toss_3_x86_64_ib
and other systems.
Vetted specs¶
$ ls -c1 .gitlab/*jobs.yml
.gitlab/lassen-jobs.yml
.gitlab/ruby-jobs.yml
CI contains jobs for ruby.
$ git grep -h "SPEC" .gitlab/ruby-jobs.yml | grep "gcc"
SPEC: "%gcc@4.9.3"
SPEC: "%gcc@6.1.0"
SPEC: "%gcc@7.1.0"
SPEC: "%gcc@7.3.0"
SPEC: "%gcc@8.1.0"
We now have a list of the specs vetted on ruby
/toss_3_x86_64_ib
.
Note
In practice, one should also check whether a job is allowed to fail, or has been deactivated.
MacOS case¶
In Umpire, the Spack configuration for MacOS contains the default compilers depending on the OS version (compilers.yaml), and a commented section to illustrate how to add CMake as an external package. You may install CMake with homebrew, for example.
Using Uberenv to generate the host-config file¶
We have seen that we can safely use gcc@8.1.0 on ruby. Let us ask for the default configuration first, and then for a configuration that produces static libs, has OpenMP support, and runs the benchmarks:
$ python scripts/uberenv/uberenv.py --spec="%gcc@8.1.0"
$ python scripts/uberenv/uberenv.py --spec="%gcc@8.1.0~shared+openmp tests=benchmarks"
Each will generate a CMake cache file, e.g.:
hc-ruby-toss_3_x86_64_ib-gcc@8.1.0-fjcjwd6ec3uen5rh6msdqujydsj74ubf.cmake
Using host-config files to build Umpire¶
$ mkdir build && cd build
$ cmake -C <path_to>/<host-config>.cmake ..
$ cmake --build . -j
$ ctest --output-on-failure -T test
It is also possible to use this configuration with the CI script outside of CI:
$ HOST_CONFIG=<path_to>/<host-config>.cmake scripts/gitlab/build_and_test.sh
Using Uberenv to configure and run Leak Sanitizer¶
During development, it may be beneficial to regularly check for memory leaks. This will help avoid the possibility of having many memory leaks showing up all at once during the CI tests later on. The Leak Sanitizer can easily be configured from the root directory with:
$ srun -ppdebug -N1 --exclusive python scripts/uberenv/uberenv.py --spec="%clang@9.0.0 cxxflags=-fsanitize=address"
$ cd build
$ cmake -C <path_to>/hc-ruby-toss_3_x86_64_ib-clang@9.0.0.cmake ..
$ cmake --build . -j
$ ASAN_OPTIONS=detect_leaks=1 make test
Note
The host config file (i.e., hc-ruby-...cmake
) can be reused in order to rebuild with the same configuration if needed.
This will configure a build with Clang 9.0.0 and the Leak Sanitizer. If there is a leak in one of the tests, it can be useful to gather more information about what happened and more details about where it happened. One way to do this is to run:
$ ASAN_OPTIONS=detect_leaks=1 ctest -T test --output-on-failure
Additionally, the Leak Sanitizer can be run on one specific test (in this example, the “replay” tests) with:
$ ASAN_OPTIONS=detect_leaks=1 ctest -T test -R replay --output-on-failure
Depending on the output given when running the test with the Leak Sanitizer, it may be useful to use addr2line -e <./path_to/executable> <address_of_leak>
to see the exact line the output is referring to.
HPCToolKit¶
This page describes the process and series of steps for analyzing Umpire-specific applications with HPCToolKit.
Using HPCToolKit¶
LLNL’s documentation for using HPCToolKit for general analysis is a great starting resource and can be found here. The HPCToolKit manual can be found here.
The LC machines have hpctoolkit
installed as a module which can be
loaded with module load hpctoolkit
. The rest of this page will describe
the steps for specific analysis examples with Umpire.
Getting Started¶
Below is the basic (Umpire-specific) set up to load, build with, and run with HPCToolKit:
$ ssh lassen
$ module load hpctoolkit
$ cmake -DENABLE_CUDA=On -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-O3 -g" -DCMAKE_C_FLAGS="-O3 -g" ../
$ make -j
$ hpcrun -e CPUTIME ./bin/executable
$ hpcstruct ./bin/executable
$ hpcprof -S executable.hpcstruct hpctoolkit-executable-measurements-<job_id>/
Note
The HPCToolKit manual recommends building with a fully optimized version
of the executable (hence, the added flags in the cmake
command).
Note
The hpcrun
command can measure certain “events”. Those events are added with a -e
argument and include things like CPUTIME
, gpu=nvidia
, gpu=nvidia,pc
, IO
,
MEMLEAK
, etc. A full list of the possible events can be found by running hpcrun -L
.
After running the hpcrun
command, a hpctoolkit-executable-measurements-<job_id>/
folder
will be generated. After running the hpcstruct
command, a executable.hpcstruct
file will
be generated. These two generated items will then become input used with hpcprof
, as shown
above. If the file that you are analyzing is large or is using a lot of resources, then
hpcprof-mpi
could be better to use. The hpcprof-mpi
command looks the same otherwise. The
result of either hpcprof
command is a generated “database” folder.
At this point, we need the HPCViewer program to view the resulting database. The easiest way
to engage with the HPCViewer is to do it locally. Therefore, we can tar up the generated
database folder and use scp
to send it to a local machine. For example:
$ tar -czvf database.tar hpctoolkit-executable-database-<job_id>/
$ scp username@lassen.llnl.gov:path/to/database.tar .
$ tar -xzvf database.tar
From here, you can open the HPCViewer and select the untarred database folder we just sent over to be viewed. More information on how to use HPCViewer will be provided in the next section.
Otherwise, using HPCViewer from the command line can be tricky since we need X11 forwarding. In order to have X11 forwarding available, we have to ssh into the LC machine and compute node a little differently:
$ ssh -YC lassen
$ bsub -XF -W 60 -nnodes 1 -Is /bin/bash
Note
If that doesn’t work, you can also try ssh’ing into the LC machine with ssh -X -S none lassen
.
From here we can run the same steps as before (listed at the top of this section). When we
have generated the database folder, we will just call the hpcviewer program with
hpcviewer hpctoolkit-executable-database-<job_id>/
. LLNL’s documentation for HPCToolKit
also provides an example command to use the hpctraceviewer
tool.
Using HPCViewer¶
Once you have your own version of HPCViewer locally, it is very easy to launch and open up the
database folder generated earlier. You can do this with just ./hpcviewer
and selecting the
right database folder.
For our use cases, we mostly used the “Hot Child” feature, but that is by no means the most valuable or most important feature that HPCViewer offers. To learn more about what HPCViewer can do, the instruction manual is here.
Running with Hatchet¶
Hatchet is a tool that can better analyze performance metrics given from a variety of tools, including HPCToolKit. Using Hatchet to analyze the output from HPCToolKit can help visualize the performance of different parts of the same program.
To use Hatchet, we create a HPCToolKit analysis, just as before, but this time there is a
specialized hpcprof-mpi
command needed when generating the database folder. Below is an
example:
$ module load hpctoolkit
$ cmake -DENABLE_CUDA=On -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-O3 -g" -DCMAKE_C_FLAGS="-O3 -g" ../
$ make -j
$ hpcrun -e CPUTIME ./bin/executable
$ hpcstruct ./bin/executable
$ hpcprof-mpi --metric-db yes -S executable.hpcstruct hpctoolkit-executable-measurements-<job_id>/
The flag, --metric-db yes
, is an optional argument to hpcprof-mpi
that allows Hatchet to better interpret information given
from HPCToolKit. Without it, it will be very hard to get Hatchet to understand the HPCToolKit
output.
We’ve now generated a HPCToolKit database folder which Hatchet can read. Now we need to launch Hatchet and get started with some analysis. Below is a Python3 interpreter mode example:
$ python3 #start the python interpreter
$ import hatchet as ht #import hatchet
$ dirname = "hpctoolkit-executable-database-<job_id>" #set var to hpctoolkit database
$ gf = ht.GraphFrame.from_hpctoolkit(dirname) #set up the graphframe for hatchet that uses database
$ print(gf.tree(depth=3)) #This is to check briefly that I recognize my tree by checking the root node + a couple sub-nodes
$ print(len(gf.graph)) #I can also verify the tree by checking the length of the graphframe
$ print(gf.dataframe.shape) #I can also print out the 'shape' of the tree (depth x column_metrics)
$ print(list(gf.dataframe.columns)) #I can print out all the column_metrics (e.g. "time", "nid", etc.)
$ print(gf.dataframe.index.names) #I can also print the node names (may be kind of confusing unless you know what you're looking for)
$ query1 = [{"name": "119:same_order\(umpire::Allocator\)"}, "*"] #Set up a query method to filter for the "same_order" sub tree
$ filtered_gf = gf.filter(query1) #apply the query method as a filter on the original tree
$ print(len(filtered_gf.graph)) #verifying that I now have a subtree (length will be smaller)
$ print(filtered_gf.tree(metric_column="time (inc)")) #printing the new filtered subtree by inclusive time metric
$ print(filtered_gf.tree()) #printing the whole filtered tree as is
$ query2 = [{"name": "120:reverse_order\(umpire::Allocator\)"}, "*"] #Set up a query method to filter for the "reverse_order" sub tree
$ filtered_gf_rev = gf.filter(query2) #apply the query method as a filter on the original tree
$ print(len(filtered_gf_rev.graph)) #verifying that I now have a subtree (length will be smaller)
$ print(filtered_gf_rev.tree(metric_column = "time (inc)")) #printing the new filtered subtree by inclusive time metric
$ filtered_gf.drop_index_levels() #As-is, the tree will include info for ranks - if that isn't needed, this function drops that info
$ filtered_gf.dataframe #this provides a spreadsheet of the data that is populating the graphframe (what the tree shows)
$ filtered_gf.dataframe.iloc[0] #gives the first entry of the spreadsheet, here that is the root node of the filtered tree
$ filtered_gf.dataframe.iloc[0,0] #gives the first part of the first entry of the spreadsheet (here, it's the inclusive time)
$ gf3 = filtered_gf - filtered_gf_rev #Stores the diff between two (comparable) trees in gf3
$ print(gf3.tree()) #prints the diff tree
$ gf3.dataframe #outputs the spreadsheet of data that populates the diff tree
This example was set up to analyze the performance of the no-op_stress_test.cpp benchmark file from the Umpire repo. It compares the performance of one part of the program (i.e., the part that measures the performance of deallocations done in the same order as they were allocated) against another part of the same program (i.e., the part that measures the performance of deallocations done in the reverse order from which they were allocated).
In Hatchet, these two parts show up as subtrees within the entire call path tree of my example program. Therefore, I can compare one subtree to another in terms of performance (in my case, I compared in terms of inclusive time).
Analyzing results¶
After opening up a database folder in HPCViewer or analyzing the call paths in Hatchet, we can compare the performance (or whatever measurement we are looking at) of different parts of the program against other parts and try to find performance trends. In Umpire's use case, we plan to use Hatchet as part of our CI to find out whether integrating a new commit into our repository degrades performance (for example, increases inclusive time) by a certain threshold or more. If so, our CI test will fail. Our process looks something like:
- Grab the example program’s database from the develop branch
- Grab the example program’s database from a different branch I want to compare against
- Create a graphframe for each database
- Create a filtered graphframe for each that focuses on the specific part of the program I want to measure against
- Compare the inclusive time for each filtered graphframe (or whatever metric I want to analyze)
- If the metric (e.g., inclusive time) of the new branch's filtered graphframe exceeds that of develop's by more than the threshold, then fail the test!