Heterogeneous Threads
Building Heterogeneous MPSoCs Using HThreads
Heterogeneous Systems
Heterogeneous computing systems are composed of non-homogeneous components, namely different types of processing units, which may include custom hardware, DSP units, and CPUs with differing ISAs. Traditionally, these types of systems are built as one-off engineering projects in which each component executes an independent application that implicitly interacts with the other system components. This ad-hoc design and integration process was not deliberate; rather, it arose because no standard framework has been established for integrating heterogeneous computational components. The goal of this work is to use the abstractions found in operating systems to allow a single application description to span the various computational units of a heterogeneous system-on-chip (SoC).
Specialization of individual processing units has increased as the demand for lower-power, higher-performance devices has grown. General purpose parallel processors have proven to be less effective than grouping together several specialized processors that are built to target specific domains (DSP, graphics, SIMD, etc.). As specialization continues, systems will become increasingly heterogeneous, and the need for standard system integration methods will grow.
Coordination and Synchronization
Synchronization in multiprocessor systems is often accomplished by using special architecture-specific atomic instructions. These instructions often come in the form of load-linked/store-conditional, compare-and-swap, fetch-and-add, etc. Unfortunately, there is no standard atomic instruction type, and some architectures may not have any atomic instructions [1]. This can make synchronization and coordination in a heterogeneous processing system difficult and cumbersome. Many systems depend on remote-procedure call (RPC) mechanisms that make use of interrupt/exception processing routines. While flexible, these mechanisms involve considerable overhead and jitter, two very undesirable traits, especially in embedded real-time systems.
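To make the role of atomic instructions concrete, the following is a minimal sketch of a lock built on compare-and-swap using C11 atomics. It is not HThreads code; the names cas_lock_t, cas_trylock, cas_lock, and cas_unlock are hypothetical and chosen only for illustration. A lock like this works only if every participating processor supports a compatible atomic primitive on the same memory, which is exactly what a heterogeneous system cannot assume.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for illustration: 0 = free, 1 = held. */
typedef struct {
    atomic_int locked;
} cas_lock_t;

static void cas_lock_init(cas_lock_t *l)
{
    atomic_init(&l->locked, 0);
}

/* Try once: atomically change 0 -> 1. Returns true if the lock was taken. */
static bool cas_trylock(cas_lock_t *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&l->locked, &expected, 1);
}

/* Spin until the compare-and-swap succeeds. */
static void cas_lock(cas_lock_t *l)
{
    while (!cas_trylock(l)) {
        /* busy-wait */
    }
}

static void cas_unlock(cas_lock_t *l)
{
    assert(atomic_load(&l->locked) == 1);  /* must be held when released */
    atomic_store(&l->locked, 0);
}
```

On an ISA without compare-and-swap, the same lock would have to be rewritten around a different primitive, or around none at all, which is the portability problem described above.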
More recently, systems such as the Cell B.E. (IBM) and EXOCHI (Intel) have shown that heterogeneous programs can communicate and interact without the need for a virtual machine environment. However, heterogeneous programs are not able to use a uniform set of OS APIs in these systems. In fact, the abilities of a thread depend heavily on what type of processor it is spawned on in such systems. Complex programming rules that depend solely on a thread's location in a heterogeneous system present a steep obstacle for programmers.
Non-Uniform Access to OS Services
The DaCs specification is the first attempt at establishing a uniform parallel programming model for heterogeneous systems. A DaCs implementation exists for hybrid x86-Cell architectures: systems containing x86, PPU, and SPU processors. The model consists of traditional process management, synchronization, and communication primitives that extend over the heterogeneous resources. The APIs mimic those of previous standards, but they differ in their behavior and underlying semantics. The heterogeneous primitives interact with the DaCs run-time (daemons) in order to propagate information between the heterogeneous processors.
The current DaCs implementation for x86-Cell systems provides an example of a complex, non-uniform programming model. Below is a quote from page 12 of the "Data Communication and Synchronization for Hybrid-x86 Programmer's Guide and API Reference" that illustrates the non-uniformity (http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/ADFBD392E0ED2D4C00257353006B2744):
"All three (x86_64, PPU and SPU) can work together with remote memory and mutexes. When these are created on the PPU they are created in a way that allow them to be used with both the x86_64 HE and the SPU AE. In this way processes on all three can be synchronized using a mutex or can share remote memory. Note, however, that this must be initiated on the PPU (which is shared between the two). It is not possible for the x86_64 HE to create the mutex and share it with the PPU AE and then for the PPU (as an HE) to share it with its SPU AEs. This does not work as a mutex can only be shared by the DE that created it - the x86_64 DE cannot see the SPU DE."
This non-uniformity is due to the host-accelerator model of DaCs. This structure uses RPC mechanisms to call host code from an accelerator; however, not all hosts are "reachable" from each accelerator. In the case of x86-Cell systems, the PPE is the only common resource in the two DaCs implementations. The non-uniformity can be seen in other systems as well. For instance, both the Celling SHIM and CellVM systems make use of a commodity OS that is run on the host (PPE) processor. This forces all OS-level interactions originating from the SPEs to transfer control to the PPE (via interrupts, RPC, etc.), perform OS processing (which often requires state migration), and then return the result back to the SPE. An operating system framework that allows direct access to services from heterogeneous components eliminates the need to interrupt other application processors in order to invoke OS calls. This eliminates a great source of non-uniformity in heterogeneous systems, making them more efficient and easier to program.
Thesis
The purpose of this work is to develop new architectures/methodologies for providing direct access to a set of uniform OS services in heterogeneous environments. This includes decoupling OS services from specific processors, thus making the services themselves ISA-neutral. The goal is to use a HW/SW co-design process to separate OS policies from the traditional mechanisms that prevent OS services from being fully utilized in heterogeneous settings. The structure will allow OS-level processing to take place on custom accelerators as well as spare CPU cores in order to make the OS heterogeneous-friendly, distributed, and scalable. Overall, the goal is to provide a single, uniform programming model for heterogeneous multi-processor systems, allowing traditional APIs to be used uniformly by all processors. This technique promises an increase in programmer productivity for heterogeneous systems, facilitates migrating legacy codes to such platforms, and can lead to increased adoption of heterogeneous multi-core architectures.
HThreads
In HThreads, all operating system services are available to any component that is able to "master" the system bus. Each high-level operating system call can be transformed into a series of loads and stores on the system bus, in which each bus transaction interacts with the HThreads OS IP cores. Each individual transaction is atomic, in that each OS call remains uninterrupted from start to finish. This is accomplished by delaying the bus acknowledgment (ACK) until the given HThreads OS IP core finishes processing.
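The idea of turning an OS call into a bus transaction can be sketched as follows. This is only an illustration under stated assumptions: the address layout, the base address MUTEX_CORE_BASE, the opcodes, and every function name below are hypothetical and do not reflect the real HThreads register map. The function simulated_bus_load stands in for the hardware core; on a real system the call would be a single volatile load from the encoded address, and the core would withhold the bus acknowledgment until it had finished processing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding (NOT the real HThreads register map):
 *   address = base | opcode << 16 | thread ID << 8 | mutex ID        */
#define MUTEX_CORE_BASE 0x75000000u
#define OP_TRYLOCK      1u
#define OP_UNLOCK       2u
#define NUM_MUTEXES     256u
#define NO_OWNER        0xFFu   /* thread ID 0xFF is reserved as "unowned" */

/* State that would live inside the hardware core. */
static uint8_t mutex_owner[NUM_MUTEXES];

static void mutex_core_reset(void)
{
    for (uint32_t i = 0; i < NUM_MUTEXES; i++)
        mutex_owner[i] = NO_OWNER;
}

/* Pack the whole OS request into the address of one bus transaction. */
static uint32_t encode_command(uint32_t op, uint32_t tid, uint32_t mid)
{
    assert(tid < NO_OWNER && mid < NUM_MUTEXES);
    return MUTEX_CORE_BASE | (op << 16) | (tid << 8) | mid;
}

/* Software stand-in for the core: decode the address, update state,
 * and return a status word as the load data (0 = success, 1 = failure). */
static uint32_t simulated_bus_load(uint32_t addr)
{
    uint32_t op  = (addr >> 16) & 0xFFu;
    uint8_t  tid = (uint8_t)((addr >> 8) & 0xFFu);
    uint32_t mid = addr & 0xFFu;

    if (op == OP_TRYLOCK && mutex_owner[mid] == NO_OWNER) {
        mutex_owner[mid] = tid;
        return 0;
    }
    if (op == OP_UNLOCK && mutex_owner[mid] == tid) {
        mutex_owner[mid] = NO_OWNER;
        return 0;
    }
    return 1;
}

static uint32_t os_mutex_trylock(uint32_t tid, uint32_t mid)
{
    return simulated_bus_load(encode_command(OP_TRYLOCK, tid, mid));
}

static uint32_t os_mutex_unlock(uint32_t tid, uint32_t mid)
{
    return simulated_bus_load(encode_command(OP_UNLOCK, tid, mid));
}
```

Because the caller only needs to issue a load, the same request can come from any bus master regardless of its instruction set, and no atomic instruction is required on the caller's side.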
Hardware Threads
The notion of a hardware thread is defined within the HThreads system as "an independent, uninterruptible thread that has pre-allocated, reserved resources on which to execute". The notion of heterogeneous software threads is built upon the notion of hardware threads in hthreads. Heterogeneous threads are, upon creation, bound to a single heterogeneous core for the duration of the thread's lifetime. Mapping this new thread construct onto an existing concept within hthreads allows heterogeneous threads to make use of all of the resources of the hthreads operating system kernel without requiring modification of the internals of the OS or its supporting hardware architecture.
Heterogeneous Support
The HThreads system was designed for a highly heterogeneous environment consisting of traditional software threads and custom hardware threads. The underlying OS support mechanisms are directly accessible to any component that can master the system bus. An architecture of this type can thus be used to supply a common set of OS services to CPUs with different ISAs. Although each CPU has a unique instruction set, all can access a common set of OS services using basic memory-mapped I/O commands. Thus, HThreads can function as an integration layer for heterogeneous MPSoCs. Additionally, this layer uses traditional POSIX-compatible OS calls, which are not only familiar to programmers but also allow HThreads applications to remain portable to other architectures.
HThreads + Embedded Executables
An architecture for heterogeneous SoCs can be formed by pairing embedded heterogeneous executables with the HThreads OS IP cores. This type of architecture holds promise to allow uniform access to system-level APIs from each type of heterogeneous processing resource without the need for the heavyweight RPC mechanisms found in the Cell (PPE-Assist, and PPE callbacks) and EXOCHI (ATR,CEH).
Heterogeneous Executables
An executable file for a heterogeneous system must contain either:
- Completely architecture-independent code, often referred to as byte code.
  - Requires interpretation or just-in-time compilation.
- Architecture-dependent code.
  - Does not require interpretation, but may require cross-architecture linking.
Heterogeneous systems have been built using virtual machine environments; however, there can be a serious cost in terms of interpretation overhead. On the other hand, very few heterogeneous systems have been built using a pure heterogeneous executable in which the different architecture-dependent code sections interact. For instance, FAT binaries are a popular executable format used by Apple that contain multiple architecture-dependent sections. However, upon launching a FAT binary, one such section is selected, and the rest of the sections are ignored throughout the lifetime of the program. There are no inter-architectural interactions in such systems.
Fat Binaries
A fat binary [2] is a single binary executable image that contains multiple native executables for different architectures. Each executable is distinct, and there is no cross-linking between executables within a single fat binary. A single decision is made when loading a fat binary for execution: which architecture's native binary is to be used. After this initial decision is made, all program execution is held within the context of a single ISA's binary. The main purpose of such binaries is to allow easier distribution: a single file that can be executed on different architectures. The main difference between fat binaries and heterogeneous system executables is that there are inter-ISA interactions in heterogeneous executables. That is, binaries from different ISAs are used simultaneously when executing the application, and the programs can interact via system calls, shared data, etc.
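The single load-time decision made for a fat binary can be sketched as below. The types arch_t and slice_t and the function select_slice are hypothetical and deliberately simplified; they are not Apple's actual on-disk format. The point of the sketch is that exactly one slice is chosen and the others are never consulted again, whereas a heterogeneous executable keeps several slices live at once.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified description of one per-architecture slice. */
typedef enum { ARCH_PPC, ARCH_MICROBLAZE, ARCH_X86 } arch_t;

typedef struct {
    arch_t arch;    /* ISA this slice was compiled for */
    size_t offset;  /* byte offset of the slice within the file */
    size_t size;    /* length of the slice in bytes */
} slice_t;

/* Return the index of the slice matching the host ISA, or -1 if none.
 * A fat-binary loader makes this choice once; every other slice is
 * then ignored for the lifetime of the program. */
static int select_slice(const slice_t *slices, size_t count, arch_t host)
{
    assert(slices != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (slices[i].arch == host)
            return (int)i;
    }
    return -1;
}
```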
Embedding Process
Building an embedded executable for a heterogeneous processor, in this case a MicroBlaze, can be done by building a thread that links against the heterogeneous MicroBlaze HAL library. This build system and library can be found in the GIT repository located at:
git://hthreads.csce.uark.edu/hthread_hal.git
The build system contained within this repository is set up to mirror the traditional HThreads build system, and user applications can be built by simply typing 'jam' from within the top-level of the repository. An embeddable version of the executable can be created by using the embedmb.py script found in the ./src/software/scripts directory of the repository. The proper command line usage is:
./src/software/scripts/embedmb.py <handle_name> <thread_name> <executable_name> <embeddable_output_name>
This script will examine the executable (ELF) file specified by <executable_name>, looking for a thread (function) named <thread_name>. If found, the script will produce an embeddable C file, named <embeddable_output_name>, that can be linked against. This file contains a byte array that holds the embedded MicroBlaze executable as well as a handle specified by <handle_name> that can be used as a function pointer for creating heterogeneous threads. The embedding process is very similar to the process used by IBM's embedspu scripts; however, the process described above only uses commodity GNU compilers for the PowerPC and MicroBlaze processors. The IBM SPE-compiler is designed to use a special object format, the CBE Embedded SPE Object Format (CESOF), for heterogeneous linking. CESOF contains special symbolic information for inter-architectural program interactions such as the effective-address reference (EAR) and built-in thread handles. An explanation of how the embedding script works can be found in the document "Howto embed.pdf".
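As a rough illustration of what the embeddable C file provides, the sketch below declares a byte array and a handle offset and resolves them into a thread handle. The symbol names intermediate and mbox_handle_offset match the extern declarations used by the mailbox example on this page; the array contents, the offset value 0x8, and the helper resolve_handle are placeholders assumed for illustration, since the exact output of embedmb.py is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bytes standing in for the embedded MicroBlaze image.
 * A real embeddable file would hold the contents of the ELF executable. */
unsigned char intermediate[] = {
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};

/* Assumed value: offset of the thread's first instruction in the image. */
unsigned int mbox_handle_offset = 0x8;

/* Hypothetical helper: turn (image base, offset) into an address that
 * can be passed where a thread start function is expected. */
static uintptr_t resolve_handle(const unsigned char *image,
                                unsigned int image_size,
                                unsigned int offset)
{
    assert(offset < image_size);  /* entry point must lie inside the image */
    return (uintptr_t)image + offset;
}
```

The handle is therefore just the run-time address of the embedded image plus a link-time offset, which is why it can play the role of a function pointer for a processor with a different ISA.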
Productivity
Programmers know that it takes time to not only develop code, but to compile/synthesize it as well. Custom hardware accelerators force a user to re-synthesize a system every time a change is made to the accelerator's code. On the other hand, heterogeneous systems built solely from processors do not require re-synthesis, and in fact, code changes should only require re-compilation. The speed of compilation vs. synthesis allows for more turns per day [Nelson, ICFPT].
In a heterogeneous hthreads system, re-synthesis requires approximately 94 minutes, while re-compilation of heterogeneous accelerator code requires only approximately 20 seconds (a ~280x speedup). These timing results were gathered on a 2.3 GHz Intel Core 2 Duo with 4 GB of RAM. Quicker development times come at the cost of reduced specialization; however, this trade-off may be justified in many cases by the need for fast time-to-market.
Compilation/synthesis time does not even include the additional development effort required to build and maintain custom hardware, nor does it consider the extra effort required to debug hardware via simulation and on-chip testing. Even if the best HLL-to-gates tool is used, it will not make a dent in the time required for re-synthesis. Heterogeneous processing systems promise faster development times than building custom hardware, while approximating the performance of custom machines much better than general purpose processing systems have.
Performance Results
System-Level Timing
The following results were gathered using an on-chip timer on a Xilinx ML507 development board running a heterogeneous HThreads system at 125 MHz. The results record system call latencies for native (PowerPC-based) and heterogeneous (MicroBlaze-based) threads. The calls measured include hthread_create, hthread_mutex_lock, and hthread_mutex_unlock. All measurements are based on application runs of 100,000 events.
Thread creation time is measured by taking the difference of two time stamps: one taken by the parent thread before calling hthread_create, and one taken by the child thread as soon as it begins executing. The results of creating native and heterogeneous threads in hthreads are:
- hthread_create for PowerPC-based thread (Native)
  - Average: 5,106 cycles
  - Standard Deviation: 17 cycles
- hthread_create for MB-based thread (Heterogeneous)
  - Average: 1,520 cycles
  - Standard Deviation: 3 cycles
Mutex lock time is measured by taking the difference of two time stamps: one taken before the hthread_mutex_lock call, and one taken directly after the hthread_mutex_lock call:
- hthread_mutex_lock for PowerPC-based thread (Native)
  - Average: 1,504 cycles
  - Standard Deviation: 13 cycles
- hthread_mutex_lock for MB-based thread (Heterogeneous)
  - Average: 330 cycles
  - Standard Deviation: 3 cycles
Mutex unlock time is measured by taking the difference of two time stamps: one taken before the hthread_mutex_unlock call, and one taken directly after the hthread_mutex_unlock call:
- hthread_mutex_unlock for PowerPC-based thread (Native)
  - Average: 1,490 cycles
  - Standard Deviation: 15 cycles
- hthread_mutex_unlock for MB-based thread (Heterogeneous)
  - Average: 469 cycles
  - Standard Deviation: 30 cycles
Timing differences between native and heterogeneous system calls are primarily due to system-call handling structures. All native system calls perform a trap and context switch into a kernel, while heterogeneous executables are directly linked with system call handlers. The software trap and context switching introduce additional overhead and jitter not found in the heterogeneous calls. Overall, the overhead and jitter associated with such calls are extremely low, making this a suitable platform for predictable real-time systems.
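The cycle counts above can be put into wall-clock terms with a small conversion, since the system runs at 125 MHz. The helpers cycles_to_us and latency_ratio below are hypothetical names introduced only for this worked example; the input figures are the averages reported above.

```c
#include <assert.h>

/* System clock reported for the ML507 measurements. */
#define CLOCK_MHZ 125.0

/* Convert a cycle count to microseconds: cycles / (cycles per microsecond). */
static double cycles_to_us(unsigned long cycles, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return (double)cycles / clock_mhz;
}

/* How many times longer the native call takes than the heterogeneous one. */
static double latency_ratio(unsigned long native_cycles,
                            unsigned long hetero_cycles)
{
    assert(hetero_cycles > 0);
    return (double)native_cycles / (double)hetero_cycles;
}
```

For hthread_create, cycles_to_us(5106, CLOCK_MHZ) gives about 40.8 microseconds for the native thread and cycles_to_us(1520, CLOCK_MHZ) gives about 12.2 microseconds for the heterogeneous thread, a ratio of roughly 3.4.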
Mini-Benchmarks
Below is a table of performance results gathered on a variety of mini-benchmarks. The mini-benchmarks are all executed on a Xilinx ML507 development board using an HThreads base system (PPC440) and 3 "heterogeneous" MicroBlaze processors. The system is uniformly clocked at 125 MHz, and all timing results were gathered using an on-chip timer attached to the main PLB bus.
Code Examples
Mail Box Example
This example highlights the ability of heterogeneous threads to directly use all available system call APIs. It uses mutexes and condition variables to construct a higher-level mailbox API, which is used to send data between native and heterogeneous processors in FIFO fashion. While conceptually simple, this example tests many different system-level factors, including:
- The ability to launch heterogeneous threads from a single executable on CPUs of different ISAs.
- The ability for heterogeneous threads to share in-memory data structures.
- The ability for heterogeneous threads to use the same set of OS services. In this case, the threads running on processors of different ISAs are able to share condition variables and mutexes.
- The ability for a single OS scheduler to handle sets of heterogeneous threads.
Example - Using an Embedded Thread
The code below makes use of an embedded executable that implements the "mbox_thread" functionality on both types of processors in a heterogeneous system. Note the #ifdef code, which is used to conditionally create either a native (PPC) thread or a heterogeneous (MB) thread using the embedded executable. A traditional function pointer can be used when creating a native thread. However, as function pointers are not first-class objects, a thread handle is used to create the heterogeneous thread. This handle is analogous to a function pointer in that it still points to the first instruction of the thread start function, but it points to the thread contained in the embedded MicroBlaze executable.
This example is very similar to the IBM Cell mailbox example found at http://www.ibm.com/developerworks/power/library/pa-tacklecell2/index.html?S_TACT=105AGX16&S_CMP=EDU. The major difference is that the code below uses the same standard POSIX-compatible APIs within the threads running on both the PowerPC and the MicroBlaze, while the Cell example requires the use of special SPE-specific APIs.
/************************************************************************************
* Copyright (c) 2006, University of Kansas - Hybridthreads Group
* and/or
* Copyright (c) 2008, University of Arkansas - Hybridthreads Group
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* * Redistributions of source code must retain the above copyright notice,
* this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
* * Neither the name of the University of Kansas nor the name of the
* Hybridthreads Group nor the names of its contributors may be used to
* endorse or promote products derived from this software without specific
* prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
* ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
* ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
************************************************************************************/
#include <errno.h>   // ENOMEM
#include <stdio.h>
#include <stdlib.h>
#include <hthread.h>
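// A bounded FIFO of pointer-sized values protected by a mutex
typedef struct
{
    int size;                // Capacity of the mailbox
    int head;                // Index of the next value to read
    int tail;                // Index of the next free slot
    int num;                 // Number of values currently stored
    void **mailbox;          // Storage for the values
    hthread_mutex_t mutex;
    hthread_cond_t notempty;
    hthread_cond_t notfull;
} mailbox_t;

// Argument passed to each worker thread
typedef struct targ{
    mailbox_t mb_start;
    mailbox_t mb_done;
    int num_elements;
} sortarg_t;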
int mailbox_write( mailbox_t *mailbox, void* value )
{
    // Lock the mailbox mutex
    hthread_mutex_lock( &mailbox->mutex );

    // Wait until there is space in the mailbox
    while( mailbox->num >= mailbox->size )
    {
        hthread_cond_wait( &mailbox->notfull, &mailbox->mutex );
    }

    // Store the value in the mailbox
    mailbox->mailbox[ mailbox->tail ] = value;
    mailbox->tail = (mailbox->tail + 1) % mailbox->size;
    mailbox->num++;

    // Unlock the mailbox mutex
    hthread_mutex_unlock( &mailbox->mutex );

    // Signal that the mailbox is not empty any longer
    hthread_cond_signal( &mailbox->notempty );

    // Return successfully
    return 0;
}
void* mailbox_read( mailbox_t *mailbox )
{
    void* value;

    // Lock the mailbox mutex
    hthread_mutex_lock( &mailbox->mutex );

    // Wait until there is data in the mailbox
    while( mailbox->num <= 0 )
    {
        hthread_cond_wait( &mailbox->notempty, &mailbox->mutex );
    }

    // Get the value out of the mailbox
    value = mailbox->mailbox[ mailbox->head ];
    mailbox->head = (mailbox->head + 1) % mailbox->size;
    mailbox->num--;

    // Unlock the mailbox mutex
    hthread_mutex_unlock( &mailbox->mutex );

    // Signal that the mailbox is not full any longer
    hthread_cond_signal( &mailbox->notfull );

    // Return the read value
    return value;
}
// Uses mutex number mutexnum and condition variable numbers
// condnum (notempty) and condnum + 1 (notfull)
int mailbox_init_no_globals(int mutexnum, int condnum, mailbox_t *mailbox, int size )
{
    hthread_mutexattr_t attr;
    hthread_condattr_t cattr;

    // Allocate the mailbox memory (one pointer-sized slot per entry)
    mailbox->mailbox = (void**)malloc( sizeof(void*)*size );
    if( mailbox->mailbox == NULL ) return ENOMEM;

    // Setup the mailbox structure
    mailbox->size = size;
    mailbox->head = 0;
    mailbox->tail = 0;
    mailbox->num = 0;

    // Setup the mailbox mutex attributes
    hthread_mutexattr_init( &attr );
    hthread_mutexattr_setnum( &attr, mutexnum );
    hthread_mutexattr_settype( &attr, HTHREAD_MUTEX_RECURSIVE_NP );

    // Setup the mailbox mutex
    hthread_mutex_init( &mailbox->mutex, &attr );
    hthread_mutexattr_destroy( &attr );

    // Setup the mailbox condition variables
    hthread_condattr_init( &cattr );
    hthread_condattr_setnum( &cattr, condnum++ );
    hthread_cond_init( &mailbox->notempty, &cattr );
    hthread_condattr_setnum( &cattr, condnum++ );
    hthread_cond_init( &mailbox->notfull, &cattr );

    // Return successfully
    return 0;
}
void* mbox_thread( void *data )
{
    sortarg_t * my_arg;
    void *ptr;

    // Grab argument
    my_arg = (sortarg_t *)data;

    // Grab TID
    hthread_t tid = hthread_self();

    while ( 1 ) {
        // Read from mbox
        ptr = (void*)mailbox_read( &my_arg->mb_start );

        // Tag the value with a marker (0x08) and this thread's ID
        int res = (0x08 << 16) + ((int)ptr << 4) + tid;

        // Write result to mbox
        mailbox_write( &my_arg->mb_done, (void*)res);
    }
    return NULL;
}
#define CHUNK_SIZE (10)
#define NUM_CHUNKS (5)
#define TOTAL_SIZE (NUM_CHUNKS*CHUNK_SIZE)
#define NUM_THREADS (2)
#define USE_HW_THREAD
// The base addresses of the hardware thread we are creating
#define HWTI_BASEADDR0 (0xB0000000)
#define HWTI_BASEADDR1 (0xB0000100)
unsigned int base_array[NUM_THREADS] = {HWTI_BASEADDR0, HWTI_BASEADDR1};
int main()
{
    sortarg_t arg;
    int mutexnum = 0;
    int condnum = 0;
    int i;
    hthread_t tid[NUM_THREADS];
    hthread_attr_t attr[NUM_THREADS];

    // Handle for the embedded MicroBlaze thread: the address of the
    // embedded image plus the offset of the thread's entry point
    extern unsigned char intermediate[];
    extern unsigned int mbox_handle_offset;
    unsigned int mbox_handle = (mbox_handle_offset) + (unsigned int)(&intermediate);

    // Initialize thread argument and mailboxes. Each mailbox consumes one
    // mutex number and two condition variable numbers, so advance the
    // counters accordingly to keep the two mailboxes from sharing any.
    arg.num_elements = CHUNK_SIZE;
    mailbox_init_no_globals( mutexnum, condnum, &arg.mb_start, NUM_CHUNKS );
    mutexnum += 1;
    condnum += 2;
    mailbox_init_no_globals( mutexnum, condnum, &arg.mb_done, NUM_CHUNKS );

    // Create threads
    for (i = 0; i < NUM_THREADS; i++)
    {
        // Initialize attributes
        hthread_attr_init( &attr[i] );
        hthread_attr_sethardware( &attr[i], (void*)base_array[i] );

        // Spawn thread
#ifdef USE_HW_THREAD
        hthread_create( &tid[i], &attr[i], (void*)mbox_handle, (void*)&arg );
#else
        hthread_create( &tid[i], NULL, mbox_thread, (void*)&arg );
#endif
    }

    // Initialize count array. The index is the low 4 bits of a result,
    // so 16 slots are needed.
    int counts[16];
    for (i = 0; i < 16; i++)
    {
        counts[i] = 0;
    }

    // Write mbox values
    for (i = 0; i < NUM_CHUNKS; i++)
    {
        mailbox_write( &arg.mb_start, (void*)i );
    }

    // Read mbox values
    int ret;
    int index;
    for (i = 0; i < NUM_CHUNKS; i++)
    {
        ret = (int)mailbox_read( &arg.mb_done );
        printf("Ret value(%d) = 0x%08x\n",i,ret);
        index = ret & 0xf;
        counts[index]++;
    }

    // Report how many results each thread produced
    for (i = 0; i < 16; i++)
    {
        if (counts[i] != 0)
            printf("Count for TID %d = %d\n",i,counts[i]);
    }
    return 0;
}
References
- IFIP-RSP'08 - "Multi-CPU/FPGA Platform Based Heterogeneous Multiprocessor Prototyping: New Challenges for Embedded Software Designers"
- ICFPT'08 Workshop on Design Productivity:
  - Brent Nelson (BYU) - http://www.et.byu.edu/~nelson/FPT2008PreWorkshop/nelson_productivity.pdf
- IBM Cell - CBE Embedded SPE Object Format (CESOF) - http://www.embedded.com/columns/technicalinsights/188101999?_requestid=166869
- IBM Cell, PPE Assist - http://www.ibm.com/developerworks/blogs/page/powerarchitecture?entry=ibomb_secure_sdk30_15&S_TACT=105AGX16&S_CMP=HP
- IBM Cell libspe (PPE callbacks) - http://www.ibm.com/developerworks/library/pa-libspe2/
  - Detailed Documentation
    - LIBSPE Documentation - http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/1DFEF31B3211112587257242007883F3
    - LIBSPE Migration Guide - http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/4818FA471C13A1BC872572AB006C4139
- IBM Cell mailboxes - http://www.ibm.com/developerworks/power/library/pa-tacklecell2/index.html?S_TACT=105AGX16&S_CMP=EDU
- Intel EXOCHI - http://portal.acm.org/citation.cfm?id=1250753
- CellVM: A Homogeneous Virtual Machine Runtime System for a Heterogeneous Single-Chip Multiprocessor
- Celling SHIM: compiling deterministic concurrency to a heterogeneous multicore
- Compiler Architectures for Heterogeneous Systems
Heterogeneous System Notes
Presentations
- Proposal Slides Presentation (PPT)
Copyright
Copyright by Jason Agron, if you have any questions feel free to email me. My contact info can be found on the "People" page.

