The NUMA Explorer

This is an outgrowth of my attempts to study the support for NUMA architecture provided by Windows.

What is NUMA?

NUMA is an acronym invented at CMU in the late 1970s to describe a set of computer architectures exemplified by the CM* project at CMU.  CM* was a multiprocessor that consisted of 128 PDP-11/03 microcomputers connected by a very elaborate memory interconnect system, such that every processor had access to all memory, but the cost of getting to memory differed depending on where the memory was located.  Because the memory access time varied, the acronym NUMA, Non-Uniform Memory Access, was applied.

When you scale up, you often face problems that are entirely due to the scaling.  Consider the following multi-processor architecture:

This works fine when a small number of CPUs, such as two or four, are contending for the memory bus.  But when you scale this up to 64 processors, in the worst case each CPU gets only 1/64 of the available bus bandwidth, which will not result in acceptable system performance.  Furthermore, all I/O ties up the bus during a transaction, so a 64-processor system with 64GB of memory is completely throttled by the memory bus bandwidth.  While the L1 and L2 caches can improve performance significantly, it doesn't take a high cache-miss ratio to cause bus saturation.

In a NUMA architecture, each processor has local memory which it can access directly.  To access memory attached to another processor, however, it has to cross an interprocessor bus, which adds cost.  In a typical configuration as shown below, the cost of CPU1 accessing MEM2 is 3 times the cost of CPU1 accessing MEM1.  A two-hop crossing, such as CPU1 accessing MEM4, costs 9 times CPU1 accessing MEM1.  In higher-order processor architectures, more complex interconnects with different cost factors would be expected.  In addition, devices can be on local buses, so that a device transferring data to the local memory of the processor it is connected to does not interfere with devices on other processors transferring data to their own local memory.
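The cost factors quoted above (1 for local memory, 3 for one hop, 9 for two hops) can be turned into a rough estimate of average access cost.  The sketch below assumes the cost multiplies by 3 for each hop, which fits the two figures given but is only an illustration; the function names are hypothetical, and real hardware will have its own factors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Relative cost of one memory access that crosses `hops` interprocessor
// links. Assumption: each hop multiplies the cost by 3 (0 hops -> 1,
// 1 hop -> 3, 2 hops -> 9), matching the figures quoted in the text.
double relativeAccessCost(unsigned hops)
{
    double cost = 1.0;
    for (unsigned i = 0; i < hops; ++i)
        cost *= 3.0;
    return cost;
}

// Average relative cost when fractionAtHops[h] of all accesses go to
// memory that is h hops away. The fractions are expected to sum to 1.
double averageAccessCost(const std::vector<double>& fractionAtHops)
{
    double total = 0.0;
    for (std::size_t h = 0; h < fractionAtHops.size(); ++h) {
        assert(fractionAtHops[h] >= 0.0 && fractionAtHops[h] <= 1.0);
        total += fractionAtHops[h] * relativeAccessCost(static_cast<unsigned>(h));
    }
    return total;
}
```

For example, averageAccessCost({0.5, 0.5}) gives 2.0: if half of a thread's accesses land on a node one hop away, memory access costs on average twice what it would if everything were local.  This is why it matters where the pages of an allocation actually end up.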

Vista has a number of NUMA-related API calls.  The program discussed in this article currently explores a number of these, and as the program evolves, more features will be added.

Windows support for NUMA

Windows support for NUMA began with Windows Vista and Windows Server 2008.  A new suite of APIs was introduced to handle NUMA support.

The NUMA Explorer

Download the source, VS2008
Download an MSI file to install it

The NUMA Explorer is shown below.  The APIs illustrated are shown, as well as their results.

BOOL GetNumaHighestNodeNumber(PULONG HighestNodeNumber) gives the highest node number.  Note that this is not the node count; it is the upper end of the range of node numbers.  So the value 1 means that nodes 0 and 1 exist.

void GetSystemInfo(LPSYSTEM_INFO lpSystemInfo) fills in a SYSTEM_INFO structure with a number of interesting parameters, but all that is shown here is the dwNumberOfProcessors value.

BOOL GetNumaProcessorNode(UCHAR Processor, PUCHAR NodeNumber) gives the node number which is associated with the processor indicated.  It is called in a loop, for(UCHAR i = 0; i < sysinfo.dwNumberOfProcessors; i++), to iterate across all the processors.
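The per-processor loop can also be used to group processors by node.  The following is a minimal sketch, not the code from the NUMA Explorer: groupProcessorsByNode is a hypothetical helper, and nodeOf stands in for a small wrapper around GetNumaProcessorNode so that the logic can be shown without any Windows headers.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>

// Hypothetical helper: build one processor mask per node. `nodeOf` is an
// assumed wrapper around GetNumaProcessorNode; it returns the node number
// for a processor index, or -1 if the query failed.
std::map<int, std::uint64_t> groupProcessorsByNode(
    unsigned processorCount,
    const std::function<int(unsigned)>& nodeOf)
{
    assert(processorCount <= 64);   // a 64-bit mask holds at most 64 processors
    std::map<int, std::uint64_t> masks;
    for (unsigned cpu = 0; cpu < processorCount; ++cpu) {
        int node = nodeOf(cpu);
        if (node < 0)
            continue;               // skip processors whose node is unknown
        masks[node] |= (std::uint64_t{1} << cpu);   // bit n stands for processor n
    }
    return masks;
}
```

With the mapping from the sample XML output (processors 0 and 1 on node 0, processors 2 and 3 on node 1), this yields the masks 0x3 and 0xC, which agrees with what GetNumaNodeProcessorMask reports for the same machine.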

BOOL GetNumaNodeProcessorMask(UCHAR Node, PULONGLONG ProcessorMask) gives the mask of processors associated with each node.  For example, on the left, we see that node 0 is associated with processors 0 and 1 (0x...3) and node 1 is associated with processors 2 and 3 (0x...C).
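Reading a processor mask is just a matter of testing bits.  As a small sketch (processorsInMask is a hypothetical name, and the optional processorCount parameter is my own addition so that the scan can stop at dwNumberOfProcessors):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Return the zero-based processor numbers whose bits are set in `mask`.
// Bit n of the mask corresponds to processor n. Only the low
// `processorCount` bits are examined.
std::vector<unsigned> processorsInMask(std::uint64_t mask,
                                       unsigned processorCount = 64)
{
    assert(processorCount <= 64);   // the mask is 64 bits wide
    std::vector<unsigned> processors;
    for (unsigned bit = 0; bit < processorCount; ++bit) {
        if (mask & (std::uint64_t{1} << bit))
            processors.push_back(bit);
    }
    return processors;
}
```

Applied to the two masks above, processorsInMask(0x3) returns processors 0 and 1, and processorsInMask(0xC) returns processors 2 and 3.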

BOOL GetNumaAvailableMemoryNode(UCHAR Node, PULONGLONG AvailableBytes) gives the amount of available memory for the node, in bytes.  Note that this value is dynamic and will change from time to time.

BOOL GetLogicalProcessorInformation(PSYSTEM_LOGICAL_PROCESSOR_INFORMATION Buffer, PDWORD ReturnLength) produces a set of information blocks which reveal the structure of the system.  The information is decoded and displayed in the ListBox below.

Architecture

The architecture drawing is rather simplistic; it does not, for example, show the relationship of nodes if there are more than two nodes in the system (see the 4-CPU drawing shown earlier).  This is because the API that is supposed to show node proximity, GetNumaProximityNode, has no usable documentation.

Memory Allocation

To use the VirtualAllocExNuma call, type the number of pages into the "# Pages for NUMA test" edit control.  Then click Refresh to perform an allocation.

The allocations are shown in the ListBox.  The columns are the thread ID and time (this is an instance of my Logging ListBox Control), followed by the range of virtual memory and the pages.  Note that although the pages were all successfully allocated, they did not all get allocated on the preferred node, as shown in the allocation map.

The allocation map shows the amount of memory allocated in each of the nodes on behalf of that CPU.  The background color (e.g., light red) indicates the preferred node while the dark color (e.g., dark red) indicates the actual allocation.

For example, CPU 0 allocated all 4096 pages in its preferred node.  Similarly, CPU 1 allocated all 4096 pages in Node 0.  CPU 2, however, would have preferred to have all its pages in Node 1, but in fact most of the pages were allocated in Node 0.  Similarly, CPU 3 did not get its desired allocation, although it got more pages in Node 1 than CPU 2 did.

The dropdown below the system map shows the actual allocations.  These values are obtained by using QueryWorkingSetEx, but instead of listing every page, the display reduces the data to contiguous runs of pages which are in the same node.  This is shown in the picture below.
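The reduction to runs can be sketched as follows.  This is an illustration under stated assumptions rather than the program's actual code: the per-page node numbers are assumed to have been extracted already (for example, from the results of QueryWorkingSetEx), and NodeRun and coalesceRuns are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One contiguous run of pages that all reside on the same node.
struct NodeRun {
    std::uint64_t low;   // address of the first byte in the run
    std::uint64_t high;  // address of the last byte in the run
    int node;            // node that holds every page in the run
};

// Collapse a per-page list of node numbers into runs. pageNodes[i] is the
// node reported for the page at base + i * pageSize.
std::vector<NodeRun> coalesceRuns(std::uint64_t base,
                                  std::uint64_t pageSize,
                                  const std::vector<int>& pageNodes)
{
    assert(pageSize > 0);
    std::vector<NodeRun> runs;
    for (std::size_t i = 0; i < pageNodes.size(); ++i) {
        std::uint64_t pageLow = base + i * pageSize;
        std::uint64_t pageHigh = pageLow + pageSize - 1;
        if (!runs.empty() && runs.back().node == pageNodes[i]) {
            runs.back().high = pageHigh;   // same node: extend the current run
        } else {
            runs.push_back(NodeRun{pageLow, pageHigh, pageNodes[i]});   // node changed: start a new run
        }
    }
    return runs;
}
```

Four pages starting at 0x052EE000 with nodes 1, 0, 0, 1 collapse into three runs: one page on node 1, two pages on node 0, and one page on node 1.  Each run carries a Low address, a High address, and a node number, like the entries in the allocation list.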

Note that CPU 2 has a very fractured allocation.  While the virtual addresses are contiguous, the physical addresses are split between different nodes.

Ideal allocation is when the block allocated is shown in the dark color corresponding to the background color.

There is an XML representation that can be seen by clicking on the XML display dropdown.  Using the File > Write XML file menu item, you can write the output to a file for later analysis.

[Images: the allocation dropdown and the XML dropdown]

A sample of the XML output for the above images is shown below.

<NUMA:Data>
      <NUMA:GetNumaHighestNodeNumber Value = "1"/>
      <NUMA:SYSTEM_INFO dwActiveProcessorMask = "15" dwAllocationGranularity = "65536" dwNumberOfProcessors = "4" dwPageSize = "4096"
                        dwProcessorType = "586" lpMaximumApplicationAddress = "0x7FFEFFFF" lpMinimumApplicationAddress = "0x00010000"
                        wProcessorArchitecture = "0" wProcessorLevel = "15" wProcessorRevision = "8450"/>
      <NUMA:GetNumaProcessorNode>
            <NUMA:ProcessorNode Node = "0" Processor = "0"/>
            <NUMA:ProcessorNode Node = "0" Processor = "1"/>
            <NUMA:ProcessorNode Node = "1" Processor = "2"/>
            <NUMA:ProcessorNode Node = "1" Processor = "3"/>
      </NUMA:GetNumaProcessorNode>
      <NUMA:GetNumaProcessorMask>
            <NUMA:ProcessorMask Node = "0" ProcessorMask = "0x3"/>
            <NUMA:ProcessorMask Node = "1" ProcessorMask = "0xc"/>
      </NUMA:GetNumaProcessorMask>
      <NUMA:GetNumaAvailableMemoryNode>
            <NUMA:NumaAvailableMemory KBytes = "47152" Node = "0"/>
            <NUMA:NumaAvailableMemory KBytes = "152" Node = "1"/>
      </NUMA:GetNumaAvailableMemoryNode>
      <NUMA:GetLogicalProcessorInformation>
            <NUMA:RelationProcessorCore Flags = "0x00" ProcessorMask = "0x00000001"/>
            <NUMA:RelationCache Associativity = "2" Level = "1" LineSize = "64" ProcessorMask = "0x00000001" Type = "2" TypeName = "CacheData"/>
            <NUMA:RelationCache Associativity = "2" Level = "1" LineSize = "64" ProcessorMask = "0x00000001" Type = "1" TypeName = "CacheInstruction"/>
            <NUMA:RelationCache Associativity = "16" Level = "2" LineSize = "64" ProcessorMask = "0x00000001" Type = "0" TypeName = "CacheUnified"/>
            <NUMA:RelationProcessorPackage ProcessorMask = "0x00000003"/>
            <NUMA:RelationProcessorCore Flags = "0x00" ProcessorMask = "0x00000002"/>
            <NUMA:RelationCache Associativity = "2" Level = "1" LineSize = "64" ProcessorMask = "0x00000002" Type = "2" TypeName = "CacheData"/>
            <NUMA:RelationCache Associativity = "2" Level = "1" LineSize = "64" ProcessorMask = "0x00000002" Type = "1" TypeName = "CacheInstruction"/>
            <NUMA:RelationCache Associativity = "16" Level = "2" LineSize = "64" ProcessorMask = "0x00000002" Type = "0" TypeName = "CacheUnified"/>
            <NUMA:RelationProcessorCore Flags = "0x00" ProcessorMask = "0x00000004"/>
            <NUMA:RelationCache Associativity = "2" Level = "1" LineSize = "64" ProcessorMask = "0x00000004" Type = "2" TypeName = "CacheData"/>
            <NUMA:RelationCache Associativity = "2" Level = "1" LineSize = "64" ProcessorMask = "0x00000004" Type = "1" TypeName = "CacheInstruction"/>
            <NUMA:RelationCache Associativity = "16" Level = "2" LineSize = "64" ProcessorMask = "0x00000004" Type = "0" TypeName = "CacheUnified"/>
            <NUMA:RelationProcessorPackage ProcessorMask = "0x0000000C"/>
            <NUMA:RelationProcessorCore Flags = "0x00" ProcessorMask = "0x00000008"/>
            <NUMA:RelationCache Associativity = "2" Level = "1" LineSize = "64" ProcessorMask = "0x00000008" Type = "2" TypeName = "CacheData"/>
            <NUMA:RelationCache Associativity = "2" Level = "1" LineSize = "64" ProcessorMask = "0x00000008" Type = "1" TypeName = "CacheInstruction"/>
            <NUMA:RelationCache Associativity = "16" Level = "2" LineSize = "64" ProcessorMask = "0x00000008" Type = "0" TypeName = "CacheUnified"/>
            <NUMA:RelationNumaNode Node = "0" ProcessorMask = "0x00000003"/>
            <NUMA:RelationNumaNode Node = "1" ProcessorMask = "0x0000000C"/>
      </NUMA:GetLogicalProcessorInformation>
      <NUMA:VirtualAllocExNuma>
            <NUMA:Allocation CPU = "0" High = "0x034DFFFF" Low = "0x024E0000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "1" High = "0x044DFFFF" Low = "0x034E0000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x04505FFF" Low = "0x044E0000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x052EDFFF" Low = "0x04506000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x052EEFFF" Low = "0x052EE000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x052F0FFF" Low = "0x052EF000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x052F5FFF" Low = "0x052F1000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x052F7FFF" Low = "0x052F6000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x052FDFFF" Low = "0x052F8000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x052FEFFF" Low = "0x052FE000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x052FFFFF" Low = "0x052FF000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x05301FFF" Low = "0x05300000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x05303FFF" Low = "0x05302000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x05304FFF" Low = "0x05304000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x05306FFF" Low = "0x05305000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x05353FFF" Low = "0x05307000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x0543AFFF" Low = "0x05354000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x0543CFFF" Low = "0x0543B000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x0545DFFF" Low = "0x0543D000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x05461FFF" Low = "0x0545E000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x05463FFF" Low = "0x05462000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x05464FFF" Low = "0x05464000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x05471FFF" Low = "0x05465000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x05472FFF" Low = "0x05472000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x0547BFFF" Low = "0x05473000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x0547CFFF" Low = "0x0547C000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x0548BFFF" Low = "0x0547D000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x05493FFF" Low = "0x0548C000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x05498FFF" Low = "0x05494000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x05499FFF" Low = "0x05499000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "2" High = "0x054DFFFF" Low = "0x0549A000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x058D8FFF" Low = "0x054E0000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x058E1FFF" Low = "0x058D9000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x058E3FFF" Low = "0x058E2000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x058E5FFF" Low = "0x058E4000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x058EAFFF" Low = "0x058E6000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x058ECFFF" Low = "0x058EB000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x058EDFFF" Low = "0x058ED000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x058FAFFF" Low = "0x058EE000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x058FEFFF" Low = "0x058FB000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05903FFF" Low = "0x058FF000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05904FFF" Low = "0x05904000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05907FFF" Low = "0x05905000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05908FFF" Low = "0x05908000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05909FFF" Low = "0x05909000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x0590AFFF" Low = "0x0590A000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x0590CFFF" Low = "0x0590B000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x0590EFFF" Low = "0x0590D000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05911FFF" Low = "0x0590F000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05912FFF" Low = "0x05912000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05913FFF" Low = "0x05913000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05914FFF" Low = "0x05914000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x0591EFFF" Low = "0x05915000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05920FFF" Low = "0x0591F000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05A7BFFF" Low = "0x05921000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05DC8FFF" Low = "0x05A7C000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05DCCFFF" Low = "0x05DC9000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05DCDFFF" Low = "0x05DCD000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05DCFFFF" Low = "0x05DCE000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05DD3FFF" Low = "0x05DD0000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05DDCFFF" Low = "0x05DD4000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05E44FFF" Low = "0x05DDD000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05E4FFFF" Low = "0x05E45000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05E6AFFF" Low = "0x05E50000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05E6CFFF" Low = "0x05E6B000" Node = "0" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x05E71FFF" Low = "0x05E6D000" Node = "1" Valid = "true"/>
            <NUMA:Allocation CPU = "3" High = "0x064DFFFF" Low = "0x05E72000" Node = "0" Valid = "true"/>
      </NUMA:VirtualAllocExNuma>
</NUMA:Data>
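The page counts behind the allocation map can be recovered from the Low and High attributes of the Allocation entries above.  The sketch below is one way to do that arithmetic; AllocationRange, pagesInRange and pagesPerNode are hypothetical names, and the page size is taken from dwPageSize in the same XML.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// One Allocation entry: an inclusive address range and the node it is on.
struct AllocationRange {
    std::uint64_t low;
    std::uint64_t high;
    int node;
};

// Number of pages covered by the inclusive range [low, high].
std::uint64_t pagesInRange(std::uint64_t low, std::uint64_t high,
                           std::uint64_t pageSize)
{
    assert(pageSize > 0 && high >= low);
    return (high - low + 1) / pageSize;
}

// Total pages per node across a set of ranges.
std::map<int, std::uint64_t> pagesPerNode(
    const std::vector<AllocationRange>& ranges, std::uint64_t pageSize)
{
    std::map<int, std::uint64_t> totals;
    for (const AllocationRange& r : ranges)
        totals[r.node] += pagesInRange(r.low, r.high, pageSize);
    return totals;
}
```

For the CPU 0 entry, pagesInRange(0x024E0000, 0x034DFFFF, 4096) is 4096, which is the full request sitting in a single run on node 0.  The first three CPU 2 entries come to 38 pages on node 1, 3560 pages on node 0, and 1 page on node 1, which shows how quickly that allocation drifts away from its preferred node.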

An example of output from a non-NUMA system, an 8-processor multiprocessor, is shown below:

The views expressed in these essays are those of the author, and in no way represent, nor are they endorsed by, Microsoft.

Send mail to newcomer@flounder.com with questions or comments about this web site.
Copyright © 2009, Joseph M. Newcomer, All Rights Reserved.
Last modified: August 09, 2011