Heterogeneous Notes
From Hybridthreads Wiki
This page contains notes on what is required to create a uniform heterogeneous operating system. Many of the comments are tied to hthreads implementations, but in general, these notes could be applied to other operating systems as well.
Thread Names
In C + PThreads, a thread's "name" is an absolute function pointer, and a function pointer cannot cross heterogeneous (ISA) boundaries. Currently, this limitation is circumvented using compile-time tools that extract all heterogeneous symbols in the form of thread handles. A facility for looking up ISA-specific thread start functions using "names" would greatly help in OS design for heterogeneous systems. It would allow programmers to use a single high-level name, paired with a set of attributes, to refer to a set of heterogeneous versions of a thread. Without a naming system, a programmer has to work manually with multiple thread names (function pointers), one per ISA.
Static Thread Lookup Table
This table would be formed at compile-time using sets of architecture-specific tools (binutils, nm, etc.). Position-independent code is needed in these situations to keep from having to relocate code (all branches/jumps must be relative). The table must contain:
- Symbol names and addresses
- (MAYBE) Symbol extents (function start, end, etc.)
The contents of this table must be made available to the OS at run-time. It could be part of one of the OS IP cores, or it could be a stand-alone core. A BRAM implementation would probably work best, and could be initialized using DATA2MEM, or by software at boot up. One of the easiest ways to implement this is to force a user to define a file with thread names, and then use this to generate the thread lookup table (matrix) before running the application. This allows for error checks to be made to make sure the desired threads are in the lookup table in the proper places.
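The table described above can be sketched in C as a small matrix with one row per thread name and one column per architecture. This is only an illustration: the `ARCH_PPC`/`ARCH_MB` enumerators, the `NO_IMPL` marker, the thread names, and the addresses are all placeholders assumed for the example, and the lookup function borrows the name `func_translate` from the interface suggested later in these notes. A real table would be generated from the output of the architecture-specific tools.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical CPU types; PPC and MB mirror the prototype system. */
enum arch { ARCH_PPC = 0, ARCH_MB = 1, NUM_ARCHS };

/* Assumed marker meaning "no version of this thread for this ISA". */
#define NO_IMPL 0u

/* One row per high-level thread name, one column per architecture. */
struct thread_entry {
    const char *name;              /* high-level thread name       */
    uint32_t    start[NUM_ARCHS];  /* 32-bit start address per ISA */
};

/* Placeholder contents; real addresses would come from nm or similar. */
static const struct thread_entry thread_table[] = {
    { "worker_thread", { 0x00010000u, 0x80002000u } },
    { "filter_thread", { 0x00010400u, NO_IMPL     } },
};

#define TABLE_SIZE (sizeof thread_table / sizeof thread_table[0])

/* Return the ISA-specific start address for a name, or NO_IMPL when the
 * name is unknown or has no version for the requested architecture. */
uint32_t func_translate(const char *name, enum arch a)
{
    if ((unsigned)a >= NUM_ARCHS)
        return NO_IMPL;
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        if (strcmp(thread_table[i].name, name) == 0)
            return thread_table[i].start[a];
    }
    return NO_IMPL;
}
```

Because the table is built before the application runs, a missing entry (a `NO_IMPL` cell in this sketch) can be reported as a build-time error rather than discovered at run-time.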
Dynamic Thread Lookup Table
A dynamic version of the table could be implemented, but it requires each heterogeneous processor to register its threads at boot up, and it can introduce run-time errors if a thread is looked up before it has been registered. This is probably dangerous and could lead to serious security problems.
Thread Creation
thread_create(what, arg, attributes)
- what -- high-level thread name that can be used as an index into the thread lookup table
- arg -- standard thread argument to be passed to the newly created thread
- attributes -- thread attributes that allow the OS to make a decision as to where the thread should be run
- The high-level name (what) and attributes should allow the OS to decide on which type of resource the thread will run on
- This means that the OS is the entity which chooses the ISA-specific thread start function (using the lookup table)
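How the OS might turn attributes into a choice of resource type can be sketched as follows. The notes do not define the contents of the attributes, so the `thread_attr` fields here (a mask of permitted CPU types and an optional preference) are purely hypothetical, as are the `CPU_PPC`/`CPU_MB` enumerators and the `impl_mask` argument, which stands for "the lookup table has a start function for this type".

```c
#include <stdint.h>

/* Hypothetical CPU types for illustration. */
enum cpu_type { CPU_PPC = 0, CPU_MB = 1, NUM_CPU_TYPES };

/* Hypothetical attribute block; the notes leave its contents open. */
struct thread_attr {
    uint32_t allowed_mask;  /* bit t set => CPU type t is acceptable */
    int      preferred;     /* preferred CPU type, or -1 for none    */
};

/* impl_mask: bit t set => a start function exists for CPU type t.
 * Returns the chosen CPU type, or -1 if no permitted type has one. */
int choose_cpu_type(const struct thread_attr *attr, uint32_t impl_mask)
{
    uint32_t usable = attr->allowed_mask & impl_mask;

    /* Honor the preference when it is both permitted and implemented. */
    if (attr->preferred >= 0 && attr->preferred < NUM_CPU_TYPES &&
        (usable & (UINT32_C(1) << attr->preferred)))
        return attr->preferred;

    /* Otherwise fall back to the first usable type. */
    for (int t = 0; t < NUM_CPU_TYPES; t++) {
        if (usable & (UINT32_C(1) << t))
            return t;
    }
    return -1;
}
```

The chosen type would then be used as the column index into the thread lookup table to obtain the ISA-specific start function.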
A context needs to be created for each thread. This is problematic, however, because the CPU that is creating the thread may not know how to create a "proper" context for the target CPU on which the thread will run. Additional OS-level (in core) state is needed to allow context creation to be delayed until the thread is being invoked on the target CPU. This can be enabled by adding state to the TM/SCHED that includes the thread start function address (32-bit) and a flag that signifies whether or not the context for this thread is valid (1 = valid, 0 = invalid, needs initialization). One small implementation change is also needed: currently the TM uses the current_thread_id_reg as the calling thread, but this may not be correct in systems with hardware and heterogeneous threads. The calling thread may need to be passed as an argument from the caller.
Current steps to create a thread:
- CREATE_THREAD
- TM allocates a new thread ID
- Context Initialization (CPU-specific)
- ADD_THREAD
- Thread is added to the R2RQ
- Time passes, and the CPU told to run the created thread implicitly uses the in-memory context
New steps, for a heterogeneous system:
- CREATE_THREAD
- TM allocates a new thread ID
- SETUP_THREAD_INFO
- New call to TM/SCHED to set thread start function (post translation) and to set context invalid bit
- 32 bits for start function
- 1 bit for context flag
- (MAYBE) 32 bits for thread return (exit) value?
- This would allow for thread contexts to be freed upon exit, as the return value would be stored in the OS IP core itself.
- Currently exit values are stored in a global array and in HWTIs, so making a single, heterogeneous-friendly location would make things more uniform
- ??-bits for etc?
- ADD_THREAD
- Thread is added to the R2RQ
- Time passes, and the CPU told to run the created thread checks the new TM/SCHED status on the thread
- If the context is valid, then switch to it
- If the context is invalid, then create and initialize it, and then switch to it
This new process postpones context creation until it is needed. Additionally, it frees the creating thread from the responsibility of creating a new thread's context. This may also help with heterogeneous malloc/free calls, as a thread's context is now created (and maybe freed) by the type of CPU on which it is meant to run. The additional TM/SCHED state mentioned above allows OS thread info to be stored in a fixed, architecture-independent place (inside of the OS IP cores).
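The delayed context creation in the new steps can be sketched in software as follows. This is a simplified model, not the hthreads implementation: the `thread_info` array stands in for the extra TM/SCHED state, `struct context` is a two-register stand-in for a CPU-specific context, and the function names `setup_thread_info` and `prepare_to_run` are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_THREADS 8

/* Per-thread state assumed to live in the TM/SCHED core. */
struct thread_info {
    uint32_t start_fn;       /* translated start function address */
    bool     context_valid;  /* true = valid, false = needs init  */
};

/* Simplified stand-in for a CPU-specific saved context. */
struct context {
    uint32_t pc;
    uint32_t sp;
};

static struct thread_info tm_info[MAX_THREADS];
static struct context     contexts[MAX_THREADS];
static uint32_t           stack_tops[MAX_THREADS];

/* SETUP_THREAD_INFO: record the start function, mark context invalid. */
void setup_thread_info(unsigned tid, uint32_t start_fn)
{
    tm_info[tid].start_fn = start_fn;
    tm_info[tid].context_valid = false;
}

/* Runs on the target CPU when it is told to run tid. The context is
 * built here, by the CPU type that will actually use it. */
struct context *prepare_to_run(unsigned tid)
{
    if (!tm_info[tid].context_valid) {
        contexts[tid].pc = tm_info[tid].start_fn;
        contexts[tid].sp = stack_tops[tid];
        tm_info[tid].context_valid = true;
    }
    return &contexts[tid];
}
```

The creating CPU only ever calls `setup_thread_info`, so it never needs to know the layout of the target CPU's context.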
Uniform Thread Creation
To enable uniform thread creation in the PPC+MB prototype system, the following changes have been made (not in /trunk):
- Statically allocated thread stacks (eliminates the need for heterogeneous malloc/free of thread structures)
- Globally known locations of the few global kernel data structures:
- Array of thread stacks.
- Array of TCB blocks.
- Array of thread context blocks (pointed to by TCB blocks).
- Function pointer for _bootstrap_thread function.
- Needed to properly wrap up a user's thread in a scope that is guaranteed to call exit.
- A copy of the _arch_setup_thread function
- Needed to properly initialize a thread's context as it is created.
- --> Modification to allow late context initialization requires lower-level assembly routines to be re-written.
Currently, these changes are localized to only a few files:
- /src/software/system/setup.c
- /src/software/system/syscall.c
- /include/hthread/hwti/*
However, these changes must be kept coherent by updating the code on both the hthreads PPC kernel side and the MB HAL side. Overall, these changes push the addresses of pertinent kernel data structures into the V-HWTI at run-time. This allows heterogeneous processors to locate and use the few global kernel data structures in the hthreads system, and therefore to create threads. The current limitation on the "current_thread_register" being used as the parent ID for create/join operations is still problematic, but it can be fixed by passing in the calling thread's TID when performing create/join operations. This change is localized to the thread manager, and is a straightforward modification of the hardware core.
Thread Join
thread_join(who, returnVal)
- who - is a thread identifier (that is indirectly tied to an executing, joinable thread)
- returnVal - is the return value (exit value) of the thread.
Currently, when joining, the calling thread ID is implicit as it should be known by the thread manager. In a heterogeneous system, the thread manager will still need to keep track of the currently running threads on each CPU in order to correctly perform joins, as well as preemption services. The returnVal is currently captured by a bootstrap routine, and is explicitly passed to hthread_exit by the OS. The OS then places the return value into the global kernel-level "threads" array, for later use by subsequent join calls. Heterogeneous access to this array may be difficult to implement, so placing exit values into an OS-controlled IP core may be a good solution. This could either be part of the thread manager, or it could be part of another independent IP core or BRAM with a fixed address.
Join Cleanup Semantics
A joinable thread's state cannot be cleaned up upon exit for one simple reason.
Any thread that is to join on this thread needs to check whether it has exited, so the thread's state should only be cleaned up upon a join. Otherwise, the thread ID could be recycled prematurely, and a parent could incorrectly join on the wrong thread.
Joinable literally means that the thread could be joined on by the parent at any time. Thus, a joinable thread's state cannot be recycled until a join happens.
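The rule above can be illustrated with a small state machine in which an exited joinable thread keeps its ID until it is joined. The slot array, the state names, and the functions `thread_alloc`, `thread_exit`, and `thread_join` are hypothetical names for this sketch; in particular, a real join would block until the target exits, whereas this version simply returns -1.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_THREADS 4

enum tstate { T_FREE, T_RUNNING, T_EXITED };

struct tslot {
    enum tstate state;
    bool        joinable;
    uint32_t    exit_value;
};

static struct tslot slots[MAX_THREADS];

/* Allocate the lowest free thread ID, or return -1 if none is free. */
int thread_alloc(bool joinable)
{
    for (int tid = 0; tid < MAX_THREADS; tid++) {
        if (slots[tid].state == T_FREE) {
            slots[tid].state = T_RUNNING;
            slots[tid].joinable = joinable;
            return tid;
        }
    }
    return -1;
}

/* A joinable thread keeps its slot (and ID) after exit; a detached
 * thread's slot is released immediately. */
void thread_exit(int tid, uint32_t value)
{
    if (slots[tid].joinable) {
        slots[tid].exit_value = value;
        slots[tid].state = T_EXITED;
    } else {
        slots[tid].state = T_FREE;
    }
}

/* Collect the exit value and only then recycle the ID.
 * Returns 0 on success, -1 if the target cannot be joined yet. */
int thread_join(int tid, uint32_t *value)
{
    if (tid < 0 || tid >= MAX_THREADS)
        return -1;
    if (!slots[tid].joinable || slots[tid].state != T_EXITED)
        return -1;
    *value = slots[tid].exit_value;
    slots[tid].state = T_FREE;
    return 0;
}
```

With this ordering, a new thread can never be handed the ID of an exited thread that its parent has not yet joined.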
Thread Manager Internals
- Current CPU registers need to be replaced with a table, possibly in BRAM, to support scalable numbers of CPUs.
- BRAM should be dual-ported to allow TM and SCHED to access it.
- Each command should use encoded CPU ID to lookup which thread is currently running.
- However, what should be done with HW threads?? These were not really supported in the old SMP system as they didn't have a valid CPU ID (used in create/join)
Thread Scheduler Internals
Data Structures
- O(1) Ready-to-Run Queue Structure
- Array of priority masks (size = NUM_CPU_TYPES, each element is a mask for a particular CPU type)
- Array of scheduling decisions (size = NUM_CPU_TYPES, each element is a decision for a particular CPU type)
- Array of idle threads (size = NUM_CPUS, each element is an idle thread for a particular CPU)
- Array of current threads (size = NUM_CPUS, each element is the thread currently running on a particular CPU)
- CPU type queue:
- A partitioned linked-list in which each entry has an associated type, and a head/tail pointer.
- Pointers index a doubly-linked list of CPUs associated with this type.
- Allows for CPUs to be dynamically added to a CPU type (fault tolerance, dynamic scaling, etc.)
CPU Queue
- 2 Block RAMs:
- Type Queue, indexed by CPU type.
- Contents are mask, head pointer, and tail pointer
- CPU Queue, indexed by CPU id.
- Contents are type, current thread ID (and current ID valid flag), next pointer, and previous pointer
Behavior
With this structure, the scheduler's base operations remain O(1): for a particular CPU type, calculating a scheduling decision can be done in constant time. Updating the scheduling decisions for every type of processor is iterative, with a fixed and probably small number of iterations. This involves applying each CPU type's priority mask, calculating a scheduling decision, and repeating until each CPU type has an updated scheduling decision. The number of iterations in this loop should be small, as systems will probably not have very wide levels of heterogeneity (as a guess, 4 types at most).
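The per-type update loop can be sketched as below. The notes do not spell out what a priority mask contains, so this example assumes one interpretation: a global 32-bit `ready` bitmap has bit p set when the queue for priority level p is non-empty, each CPU type's mask selects the levels that type may serve, and a lower bit index means a higher priority. The names `update_decisions`, `highest_priority`, and `NO_DECISION` are hypothetical.

```c
#include <stdint.h>

#define NUM_CPU_TYPES 2
#define NO_DECISION   (-1)

static uint32_t priority_mask[NUM_CPU_TYPES];  /* one mask per CPU type     */
static int      decision[NUM_CPU_TYPES];       /* one decision per CPU type */

/* Index of the lowest set bit (assumed to be the highest priority).
 * The loop bound is fixed at 32, so the cost is constant; hardware
 * would typically use a priority encoder instead. */
static int highest_priority(uint32_t bits)
{
    for (int p = 0; p < 32; p++) {
        if (bits & (UINT32_C(1) << p))
            return p;
    }
    return NO_DECISION;
}

/* One pass over the CPU types: apply the mask, compute a decision. */
void update_decisions(uint32_t ready)
{
    for (int t = 0; t < NUM_CPU_TYPES; t++)
        decision[t] = highest_priority(ready & priority_mask[t]);
}
```

The total work is NUM_CPU_TYPES constant-time steps, which matches the claim that the update is iterative but bounded by the (small) degree of heterogeneity.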
The process of preemption requires that all currently running threads for a given CPU type are compared (in terms of priority) to the next scheduling decision. This process is also iterative, and requires traversing the new CPU type queue. While list traversal is not ideal in terms of performance, it does allow for a scheduler to handle varying degrees of heterogeneity without requiring re-synthesis. This allows the OS to handle system-level heterogeneity -- or systems with different numbers and mixes of cores.
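A minimal software model of the two queues and the preemption traversal is sketched below. The field layout follows the Type Queue and CPU Queue contents listed earlier; everything else is assumed for illustration, including the function names, the `NIL` marker, the `thread_prio` array indexed by thread ID, the convention that a smaller number means a higher priority, and the policy of choosing an idle CPU first and otherwise the CPU running the lowest-priority thread.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_CPU_TYPES 2
#define NUM_CPUS      4
#define NIL           (-1)

struct type_entry { uint32_t mask; int head; int tail; };
struct cpu_entry  { int type; int current_tid; bool current_valid;
                    int next; int prev; };

static struct type_entry type_queue[NUM_CPU_TYPES]; /* indexed by CPU type */
static struct cpu_entry  cpu_queue[NUM_CPUS];       /* indexed by CPU id   */

void queues_init(void)
{
    for (int t = 0; t < NUM_CPU_TYPES; t++)
        type_queue[t] = (struct type_entry){ 0, NIL, NIL };
    for (int c = 0; c < NUM_CPUS; c++)
        cpu_queue[c] = (struct cpu_entry){ NIL, 0, false, NIL, NIL };
}

/* Append a CPU to the doubly-linked list for its type; this is what
 * lets CPUs be added to a type without re-synthesis. */
void cpu_attach(int cpu, int type)
{
    struct type_entry *tq = &type_queue[type];
    cpu_queue[cpu].type = type;
    cpu_queue[cpu].next = NIL;
    cpu_queue[cpu].prev = tq->tail;
    if (tq->tail != NIL)
        cpu_queue[tq->tail].next = cpu;
    else
        tq->head = cpu;
    tq->tail = cpu;
}

/* Walk the CPUs of one type and pick a CPU for a thread of priority
 * new_prio. Returns an idle CPU if one exists, else the CPU running
 * the lowest-priority thread that new_prio outranks, else NIL. */
int find_cpu_to_preempt(int type, int new_prio, const int *thread_prio)
{
    int victim = NIL;
    int worst = new_prio;
    for (int c = type_queue[type].head; c != NIL; c = cpu_queue[c].next) {
        if (!cpu_queue[c].current_valid)
            return c;
        int p = thread_prio[cpu_queue[c].current_tid];
        if (p > worst) {
            worst = p;
            victim = c;
        }
    }
    return victim;
}
```

The traversal cost grows with the number of CPUs of the given type, which is the performance trade-off the text accepts in exchange for not fixing the CPU mix at synthesis time.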
Walk Through
Notes
- SetupThread initializes architecture independent section of TCB and sets new flag
- Context switches are required to check new flag:
- If new --> setup architecture-dependent (local) context info
- If !new --> proceed, use context as it is
- This allows for arch. dependent contexts to be created by the processor that they are intended to run on
- This may be difficult though!
- Puts a conditional branch inside of the context switcher AND there may be problems using a stack, i.e. in a context switcher, you cannot use the current thread or next thread's stack.
- May need a special kernel stack in order to do this safely as setting up an initial context will undoubtedly use the stack for something (function call, temporaries, etc.)
- Function pointers should be "pre-translated" before being handed to the OS
- Prevents the OS from having to decipher function pointers
- Possible interface - keys (strings or integers):
- fcn_ptr_real = func_translate("my_fcn",ARCH_XX);
- Where "my_fcn" could be a string, or a #define of an integer (for better performance)
- Upon exit, a thread places its return value in the architecture-independent section of its TCB
- This data could be placed within the thread manager, or another core if desired.
- This would give the info a fixed-address, known by all heterogeneous kernels at compile-time.
