NUMA System

Overview

NUMA (Non-Uniform Memory Access) is a computer memory design used in multiprocessor systems where the memory access time depends on the memory location relative to the processor. The DAO NUMA system provides utilities to optimize performance on NUMA-based architectures by managing processor affinity and memory allocation.

On NUMA systems each processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors). The DAO NUMA utilities help applications take advantage of this by:

Pinning threads to specific CPU cores
Allocating memory on the same NUMA node as the processing thread
Providing mapping between cores and NUMA nodes

Key Features

CPU core affinity management
NUMA-aware memory allocation
Core-to-node mapping
Cross-platform support (full functionality on Linux, compatibility layer on other platforms)

Namespace and Functions

The NUMA functions are contained within the Dao::Numa namespace and are grouped below by purpose.

Core Affinity Functions

void SetProcAffinity(int core);
int GetProcAffinity();

These functions control and report which CPU core the calling thread runs on:

SetProcAffinity: Pin the current thread to a specific CPU core
GetProcAffinity: Get the current CPU core the thread is running on
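
For readers curious what pinning involves on Linux, here is a minimal sketch built on the glibc calls sched_setaffinity, sched_getaffinity and sched_getcpu. It is not the DAO implementation. The helper names PinToCore, FirstAllowedCore and CurrentCore are hypothetical and exist only for this illustration.

```cpp
#include <sched.h>  // sched_setaffinity, sched_getaffinity, sched_getcpu (Linux, glibc)

// Hypothetical helper: pin the calling thread to a single core.
// Returns true on success, false if the core is invalid or not permitted.
bool PinToCore(int core)
{
    if (core < 0 || core >= CPU_SETSIZE) {
        return false;
    }
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    // A pid of 0 means "the calling thread"
    return sched_setaffinity(0, sizeof(mask), &mask) == 0;
}

// Hypothetical helper: lowest-numbered core the calling thread may run on,
// or -1 if the affinity mask cannot be read.
int FirstAllowedCore()
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return -1;
    }
    for (int core = 0; core < CPU_SETSIZE; ++core) {
        if (CPU_ISSET(core, &mask)) {
            return core;
        }
    }
    return -1;
}

// Hypothetical helper: core the calling thread is executing on right now.
int CurrentCore()
{
    return sched_getcpu();
}
```

In containers or under a restricted cpuset, not every core number is permitted, which is why the sketch reports failure instead of assuming that pinning always succeeds.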

NUMA Node Mapping

int Core2Node(int core);
int Node2FirstCore(int node);

These functions provide mapping between CPU cores and NUMA nodes:

Core2Node: Get the NUMA node that a specific core belongs to
Node2FirstCore: Get the first CPU core on a specific NUMA node
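
On Linux, one place this mapping can be read from is sysfs, where each node directory has a cpulist file such as /sys/devices/system/node/node0/cpulist containing text like "0-3,8-11". The sketch below shows how such a list could be parsed and searched. It is an illustration under that assumption, not DAO's implementation. ParseCpuList and FindNodeOfCore are hypothetical names, and the lookup assumes nodes are numbered contiguously from 0.

```cpp
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Parse a Linux cpulist string such as "0-3,8-11" into individual core ids.
std::vector<int> ParseCpuList(const std::string& list)
{
    std::vector<int> cores;
    std::stringstream stream(list);
    std::string item;
    while (std::getline(stream, item, ',')) {
        // Skip empty or whitespace-only items (for example a bare newline)
        if (item.find_first_of("0123456789") == std::string::npos) {
            continue;
        }
        const std::size_t dash = item.find('-');
        const int first = std::stoi(item.substr(0, dash));
        const int last = (dash == std::string::npos)
                             ? first
                             : std::stoi(item.substr(dash + 1));
        for (int core = first; core <= last; ++core) {
            cores.push_back(core);
        }
    }
    return cores;
}

// Return the node whose cpulist contains `core`, or -1 if none is found.
// Assumption: node directories are numbered 0, 1, 2, ... without gaps.
int FindNodeOfCore(int core)
{
    for (int node = 0;; ++node) {
        std::ifstream in("/sys/devices/system/node/node" +
                         std::to_string(node) + "/cpulist");
        if (!in) {
            return -1;  // no more node directories
        }
        std::string line;
        std::getline(in, line);
        for (int candidate : ParseCpuList(line)) {
            if (candidate == core) {
                return node;
            }
        }
    }
}
```

The first entry of a node's parsed list corresponds to what Node2FirstCore reports in spirit: the lowest-numbered core attached to that node.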

NUMA Memory Allocation

void* AllocOnNode(size_t size, int node);

template <class T>
T* AllocOnNode(size_t nElements, int node, T fill);

void Free(void* start, size_t size);

template<class T>
void FreeT(T* start, size_t nElements);

These functions manage NUMA-aware memory allocation:

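AllocOnNode: Allocate size bytes of memory on a specific NUMA node
AllocOnNode<T>: Allocate an array of nElements values of type T on a specific node and set every element to fill
Free: Free memory allocated with AllocOnNode; pass the same size that was requested
FreeT<T>: Free typed memory allocated with AllocOnNode<T>; pass the same element count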
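
The relationship between the raw and typed functions can be sketched as follows. This is an assumption about how such a pair is commonly layered, not DAO's source. RawAlloc and RawFree are hypothetical stand-ins backed by std::malloc so that the example runs anywhere (they ignore the node argument), and AllocFilled and FreeTyped are hypothetical names. On Linux, libnuma's numa_alloc_onnode(size, node) and numa_free(start, size) have the same shape, which is one reason a size is passed when freeing.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <type_traits>

// Stand-in for a node-aware allocator. It ignores `node` so the sketch is
// portable; a NUMA backend would place the pages on that node.
void* RawAlloc(std::size_t size, int /*node*/)
{
    return std::malloc(size);
}

// Stand-in for the matching release call. A NUMA backend needs `size`
// to know how large the mapping is; malloc does not.
void RawFree(void* start, std::size_t /*size*/)
{
    std::free(start);
}

// Typed layer: convert an element count to bytes, then fill every element.
template <class T>
T* AllocFilled(std::size_t nElements, int node, T fill)
{
    static_assert(std::is_trivially_destructible<T>::value,
                  "this sketch only handles trivially destructible types");
    void* raw = RawAlloc(nElements * sizeof(T), node);
    if (raw == nullptr) {
        return nullptr;
    }
    T* first = static_cast<T*>(raw);
    std::uninitialized_fill_n(first, nElements, fill);
    return first;
}

// Typed release: convert the element count back to the byte size.
template <class T>
void FreeTyped(T* start, std::size_t nElements)
{
    RawFree(start, nElements * sizeof(T));
}
```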

Utility Functions

int GetMaxCores();
size_t GetMaxNode();

These functions provide system information:

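GetMaxCores: Get the maximum number of CPU cores on the system
GetMaxNode: Get the highest NUMA node number on the system; because this is a node number rather than a count, a loop over all nodes runs from 0 through GetMaxNode() inclusive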
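
As a rough illustration of where such figures can come from on Linux, the sketch below uses sysconf(_SC_NPROCESSORS_CONF) for the core count and the sysfs file /sys/devices/system/node/online (which holds a list such as "0-1") for the highest node number. This is not how DAO is known to obtain them. ConfiguredCores, HighestIdInList and MaxNodeFromSysfs are hypothetical names, and falling back to node 0 when the file is missing is an assumption.

```cpp
#include <unistd.h>  // sysconf

#include <cstddef>
#include <fstream>
#include <string>

// Number of processors configured on the system.
int ConfiguredCores()
{
    return static_cast<int>(sysconf(_SC_NPROCESSORS_CONF));
}

// Largest number appearing in a list such as "0-1" or "0,2-3".
// Returns -1 if the text contains no digits.
int HighestIdInList(const std::string& list)
{
    int highest = -1;
    std::size_t i = 0;
    while (i < list.size()) {
        if (list[i] < '0' || list[i] > '9') {
            ++i;  // separator, dash or whitespace
            continue;
        }
        int value = 0;
        while (i < list.size() && list[i] >= '0' && list[i] <= '9') {
            value = value * 10 + (list[i] - '0');
            ++i;
        }
        if (value > highest) {
            highest = value;
        }
    }
    return highest;
}

// Highest online node number, or 0 when sysfs does not expose node information.
int MaxNodeFromSysfs()
{
    std::ifstream in("/sys/devices/system/node/online");
    std::string line;
    if (!in || !std::getline(in, line)) {
        return 0;
    }
    const int highest = HighestIdInList(line);
    return highest < 0 ? 0 : highest;
}
```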

Platform Support

The NUMA implementation varies by platform:

Linux: Full NUMA support using the numa.h and numaif.h headers
macOS/Windows: The same API is available for compatibility, but with limited functionality

Usage Patterns

Basic CPU Affinity

#include <daoNuma.hpp>

// Pin current thread to core 2
Dao::Numa::SetProcAffinity(2);

// Get current core
int currentCore = Dao::Numa::GetProcAffinity();

NUMA-Aware Memory Allocation

// Get the current core
int core = Dao::Numa::GetProcAffinity();

// Get the NUMA node for this core
int node = Dao::Numa::Core2Node(core);

// Allocate memory on the same node as the current thread
float* data = static_cast<float*>(Dao::Numa::AllocOnNode(1024 * sizeof(float), node));

// Use the memory
for (int i = 0; i < 1024; i++) {
    data[i] = i * 0.1f;
}

// Free the memory
Dao::Numa::Free(data, 1024 * sizeof(float));

Typed Memory Allocation with Fill Value

// Allocate an array of 1024 floats on node 0, initialized to 0.0f
float* data = Dao::Numa::AllocOnNode<float>(1024, 0, 0.0f);

// Free the memory
Dao::Numa::FreeT<float>(data, 1024);
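
Because the element count has to be repeated when freeing, it can be convenient to wrap the pair in a small owning class. The following is one possible sketch, not part of DAO. NodeBuffer is a hypothetical name, and the DemoNuma namespace holds malloc-based stand-ins that mirror the documented AllocOnNode<T> and FreeT<T> signatures so the example compiles without the DAO headers. Real code would call Dao::Numa instead.

```cpp
#include <cstddef>
#include <cstdlib>
#include <utility>

// Stand-ins with the documented signatures. They ignore `node` and use malloc.
namespace DemoNuma {
template <class T>
T* AllocOnNode(std::size_t nElements, int /*node*/, T fill)
{
    T* p = static_cast<T*>(std::malloc(nElements * sizeof(T)));
    for (std::size_t i = 0; p != nullptr && i < nElements; ++i) {
        p[i] = fill;
    }
    return p;
}

template <class T>
void FreeT(T* start, std::size_t /*nElements*/)
{
    std::free(start);
}
}  // namespace DemoNuma

// Hypothetical RAII owner: it remembers the element count so the release
// call always matches the allocation. Intended for simple types like float.
template <class T>
class NodeBuffer
{
public:
    NodeBuffer(std::size_t nElements, int node, T fill)
    : m_data(DemoNuma::AllocOnNode<T>(nElements, node, fill)), m_size(nElements)
    {
    }

    ~NodeBuffer()
    {
        if (m_data != nullptr) {
            DemoNuma::FreeT<T>(m_data, m_size);
        }
    }

    // Exactly one owner per allocation: copying is disabled, moving transfers
    NodeBuffer(const NodeBuffer&) = delete;
    NodeBuffer& operator=(const NodeBuffer&) = delete;
    NodeBuffer(NodeBuffer&& other) noexcept
    : m_data(std::exchange(other.m_data, nullptr)),
      m_size(std::exchange(other.m_size, 0))
    {
    }

    T* data() { return m_data; }
    std::size_t size() const { return m_size; }
    T& operator[](std::size_t i) { return m_data[i]; }

private:
    T* m_data;
    std::size_t m_size;
};
```

With such a wrapper the earlier example reduces to declaring NodeBuffer<float> data(1024, 0, 0.0f); and the memory is released when data goes out of scope.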

Integration with Thread System

The NUMA system is designed to work with the DAO Thread system:

#include <daoThread.hpp>
#include <daoNuma.hpp>

class ProcessingThread : public Dao::Thread
{
public:
    ProcessingThread(Dao::Log::Logger& logger, int core)
    : Thread("Processor", logger, core)
    {
        // Get NUMA node for this thread
        m_node = Dao::Numa::Core2Node(core);

        // Allocate memory on this NUMA node
        m_data = Dao::Numa::AllocOnNode<float>(1024, m_node, 0.0f);
    }

    ~ProcessingThread()
    {
        // Free NUMA memory
        Dao::Numa::FreeT<float>(m_data, 1024);
    }

protected:
    void RestartableThread() override
    {
        // Process data
        processData();
    }

private:
    // Placeholder for the application's own work on m_data
    void processData() {}

    int m_node;
    float* m_data;
};

Best Practices

Thread Placement: Place threads performing related work on the same NUMA node
Memory Allocation: Allocate memory on the same node as the thread that will use it most
Data Sharing: Minimize data sharing between threads on different NUMA nodes
Memory Access Patterns: Be aware of memory access patterns that may cross NUMA node boundaries
Core Affinity: Use core affinity to ensure threads stay on their assigned cores
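
One way to act on the data sharing advice is to give each node a contiguous, non-overlapping slice of the work so that a thread only touches memory it allocated on its own node. The helper below is a small sketch of that partitioning. SliceForNode is a hypothetical name, and nodeCount would typically be GetMaxNode() + 1.

```cpp
#include <cstddef>
#include <utility>

// Half-open range [first, second) of element indices owned by `node` when
// `total` elements are split as evenly as possible across `nodeCount` nodes.
// Precondition: nodeCount > 0 and node < nodeCount.
std::pair<std::size_t, std::size_t> SliceForNode(std::size_t total,
                                                 std::size_t nodeCount,
                                                 std::size_t node)
{
    const std::size_t base = total / nodeCount;
    const std::size_t extra = total % nodeCount;  // first `extra` nodes get one more
    const std::size_t begin = node * base + (node < extra ? node : extra);
    const std::size_t length = base + (node < extra ? 1 : 0);
    return {begin, begin + length};
}
```

Each thread would then allocate only its own slice length on its own node and process that slice, so that cross-node traffic is limited to whatever results are combined at the end.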

Performance Considerations

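Memory Bandwidth: Each NUMA node has its own memory bandwidth, so spreading data across nodes can raise the total bandwidth available to the application
Cache Coherency: Cache coherency operations across NUMA nodes can be expensive, especially when threads on different nodes write to the same cache line
Access Latency: Remote memory access has higher latency than local memory access
Allocation Overhead: NUMA-aware allocation has slightly higher overhead than standard allocation, so prefer allocating buffers once up front over allocating repeatedly in hot paths
Core Mapping: Core to node mapping is system-specific, so query it with Core2Node and Node2FirstCore instead of hard-coding it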
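
The cache coherency cost is easiest to trigger by accident when per-thread counters or flags sit next to each other in memory. A common mitigation, sketched below, is to pad each thread's slot to a full cache line. PaddedCounter and kCacheLine are hypothetical names, and the 64-byte line size is an assumption that holds on most current x86-64 systems but should be checked for the target hardware.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines
constexpr std::size_t kCacheLine = 64;

// One counter per thread, aligned and padded so that two counters never
// share a cache line.
struct alignas(kCacheLine) PaddedCounter
{
    std::uint64_t value = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter occupies exactly one cache line");
static_assert(alignof(PaddedCounter) == kCacheLine,
              "each counter starts on a cache-line boundary");
```

An array of PaddedCounter indexed by thread or node number lets each thread update its own slot without invalidating lines held by threads on other nodes.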

Example: Data Processing on Multiple NUMA Nodes

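#include <vector>

#include <daoNuma.hpp>
#include <daoThreadTable.hpp>

// ProcessingThread is the class defined in "Integration with Thread System" above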

class NumaProcessor
{
public:
    NumaProcessor()
    {
        // Determine system topology
        m_maxNodes = Dao::Numa::GetMaxNode();
        m_maxCores = Dao::Numa::GetMaxCores();

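        // Create one thread per node, pinned to that node's first core.
        // GetMaxNode() returns the highest node number, so the bound is inclusive.
        for (size_t node = 0; node <= m_maxNodes; node++) {
            int core = Dao::Numa::Node2FirstCore(static_cast<int>(node));
            // ProcessingThread derives its NUMA node from the core it is given
            ProcessingThread* thread = new ProcessingThread(m_logger, core);
            m_threadTable.Add(thread);
            m_threads.push_back(thread);
        }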
    }

    void run()
    {
        // Start all threads
        m_threadTable.Spawn();
        m_threadTable.Start();

        // Wait for completion
        m_threadTable.Join();
    }

private:
    size_t m_maxNodes;
    int m_maxCores;
    Dao::Log::Logger m_logger;
    Dao::ThreadTable m_threadTable;
    std::vector<ProcessingThread*> m_threads;
};