FSMLang Case Studies

From Hybridthreads Wiki

Jump to: navigation, search

Contents


ISQRT - Integer Square Root

Embedded programming is the hallmark of efficiency through approximation. Often times, shortcuts are taken that involve pre-computed lookup-tables, low-level bit arithmetic instead of multiplication/division, and data type approximations. One such example, popularized by Embedded.com in an article on the fast approximation of square roots using integers. The original article, written by Jack Crenshaw, as well as a Xilinx white paper describe the need for a fast square root algorithm.

The focus of the article demonstrates that many algorithms are iterative, and take variable numbers of instructions, and thus time to converge. Many times, programmers can deal with less precise/accurate results, but cannot deal with non-deterministic execution times. Nobody in their right mind would argue that the square root of 45 is 6 instead of 6.70820393, but many programs do not require the extra precision afforded by the use of floating point arithmetic. Overall, an integer-based approximation to square root can be found using a deterministic iterative method that does not require the use of division; in fact, it only uses bit-shifting, addition, and subtraction. This type of implementation suits real-time systems as its WCET can be predicted much more accurately. Additionally, it does not require expensive division operations, making it a good fit for low-power, embedded processors.

FSMLang Implementation

The FSMLang implementation of the isqrt (integer square root) function will make use of channels (FSLs) for input and output. The program uses a bi-directional channel, named "comm_chan" to communicate with the calling component. This allows the resulting isqrt core to be used in a pipelined fashion, as channels have built in FIFO buffering mechanisms. The body of the isqrt program is based off of the example shown in this Xilinx white paper. The initial FSMLang implementation of isqrt can be found here:

-- *******************************
-- FSL-based integer square root
-- *******************************
-- http://www.fpgajournal.com/whitepapers_2008/q3_dsp_xilinx2.htm
-- *******************************
CS: current_state;
NS: next_state;
 
GENERICS:
DATA_WIDTH, integer, 32;
 
PORTS:
 
CONNECTIONS:
 
MEMS:
 
CHANNELS:
comm_chan, DATA_WIDTH;
 
SIGS:
value, std_logic_vector(0 to DATA_WIDTH-1);
lc, std_logic_vector(0 to DATA_WIDTH-1);
a, std_logic_vector(0 to DATA_WIDTH-1);
b, std_logic_vector(0 to DATA_WIDTH-1);
c, std_logic_vector(0 to DATA_WIDTH-1);
 
INITIAL: reset;
 
TRANS:
 
reset -> grabInput
 
grabInput -> sqrt_loop where
{
    -- Grab a value from the input channel
    value'      <= #comm_chan;
    -- Initialize lc (loop counter) and other variables
    lc'         <= ALL_ZEROS; 
    a'          <= ALL_ZEROS;
    b'          <= ALL_ZEROS;
    c'          <= ALL_ZEROS;
}
 
sqrt_loop | ( lc < 16) -> sqrt_body1 where
{
    -- Increment counter
    lc' <= lc + 1;
 
    -- Computation (0)
    c'      <= ACCUM_SHIFT2(c);
    a'      <= ACCUM_SHIFT1(a);
}
sqrt_loop -> grabInput where
{
    -- Write back result
    #comm_chan  <= a;
}
 
sqrt_body1 ->  sqrt_body2 where
{
    -- Computation (1)
    c'      <= c + UPPER_BITS(value);
    value'  <= ACCUM_SHIFT2(value);
    b'      <= ACCUM_SHIFT1(a) or conv_std_logic_vector(1,DATA_WIDTH);
}
 
sqrt_body2 | (c >= b)-> sqrt_loop where
{
    -- Computation (2a)
    c'  <= c - b;
    a'  <= a + 1;
}
sqrt_body2 -> sqrt_loop where
{
    -- Computation (2b)
    -- NULL, just loop back
}
 
VHDL:
 
function UPPER_BITS(i : std_logic_vector(0 to DATA_WIDTH-1)) return std_logic_vector is
begin
    return i(0 to 1);  -- (C) i >> 30;
end function UPPER_BITS;
 
function ACCUM_SHIFT2(i : std_logic_vector(0 to DATA_WIDTH-1)) return std_logic_vector is
begin
    return i(2 to DATA_WIDTH-1) & "00"; -- (C) i << 2;
end function ACCUM_SHIFT2;
 
function ACCUM_SHIFT1(i : std_logic_vector(0 to DATA_WIDTH-1)) return std_logic_vector is
begin
    return i(1 to DATA_WIDTH-1) & "0"; -- (C) i << 1;
end function ACCUM_SHIFT1;

Experimental Results

The isqrt core is packaged as a Xilinx EDK pcore using the built-in MPD file generator and the gen_pcore script found in the FSMLang compiler repository. An EDK-based MicroBlaze system is built on an ML507 development board to exercise the isqrt core and compare it's performance to that of a software-emulation of isqrt as well as the traditional math.h version of square root (which utilizes floating point arithmetic). All timing results are gathered using an on-chip timer running at the main board clock frequency of 125 MHz. The hardware-based core produced by the FSMLang compiler is capable of calculating 32-bit isqrt in under 100 clock cycles, which is quite fast, compared to the 100s of clock cycles required by the software implementation on the MicroBlaze. The core is also an order of magnitude faster than a hand-optimized assembly isqrt implementation on a PIC microcontroller [PIC - isqrt].

Timing Results

Test Case Latency (Clock Cycles) Speedup (Vs. isqrt in C) Reasoning
isqrt (C) 717 1x Base Case
isqrt (SW) 2494 0.28x Interp. overhead
isqrt (SW-arb) 37908 0.019x Interp. overhead + boxing/unboxing
isqrt (HW) 64 11x ILP exploitation
isqrt (HW++) 32 22x ILP + fast kernel
isqrt (HW++U) 24 30x ILP + unrolled kernel

Notes:

  • ISQRT in SW suffers dramatic slowdown due to interpretation overhead (giant "case", branching).
    • ISQRT in SW using the arbitrary-bit width compiler suffers severe slowdown due to the boxing/unboxing operations used to implement arbitrary bit-width arithmetic for shifting and bit-slicing.
  • ISQRT in HW achieves speedup by exploiting instruction-level parallelism.
  • (++) is an optimized version written in FSMLang.
    • It performs each loop iteration in isqrt in a single clock cycle.
    • The code for the optimized version can be found in isqrt_optimized.des.
  • (++U) is an optimized and unrolled version written in FSMLang.
    • It performs two loop iterations in a single clock cycle.
    • The code for the unrolled version can be found in isqrt_unrolled.des.

Synthesis Report (XST)

Device utilization summary:
---------------------------
Selected Device : 5vfx70tff1136-1 

Slice Logic Utilization: 
 Number of Slice Registers:             164  out of  44800     0%  
 Number of Slice LUTs:                  344  out of  44800     0%  
    Number used as Logic:               344  out of  44800     0%  

Slice Logic Distribution: 
 Number of LUT Flip Flop pairs used:    345
   Number with an unused Flip Flop:     181  out of    345    52%  
   Number with an unused LUT:             1  out of    345     0%  
   Number of fully used LUT-FF pairs:   163  out of    345    47%  
   Number of unique control sets:         2

Timing Summary:
---------------
Speed Grade: -1

   Minimum period: 3.966ns (Maximum Frequency: 252.124MHz)
   Minimum input arrival time before clock: 1.528ns
   Maximum output required time after clock: 3.296ns
   Maximum combinational path delay: 0.468ns

Conclusion

Overall, the FSMLanguage implementation of ISQRT shows that the language can be used to effectively describe a computation destined for implementation in hardware. The resulting ISQRT implementations are uniformly accessible to software and hardware components (through a channel abstraction), are naturally pipelined (as channels are FIFOs), and are completely deterministic. The resulting hardware accelerator is also quite small in terms of area, as it requires far less than 1% of the resources of a medium-sized FPGA.

Condition Variables - HThreads Condition Variable Manager

HThreads is a microkernel-based operating system in which the core OS servers are specialized hardware cores. Currently, each core is implemented within the fabric of an FPGA, communicating with one another via a message-passing protocol implemented over IBM's PLB CoreConnect bus. The condition variables have been re-implemented in FSMLanguage for 2 major reasons:

  1. Update the code base to make it easier to port the condition variables to new bus interfaces.
  2. Update the code in terms of coding style to make it easier to read and maintain.

Past implementations of the condition variables did not make use of modern HDL coding styles including generics, inference templates, and functions/procedures. The new implementation will update the coding style of the core by using the built-in FSMLang primitives for memories, channels, functions, and generics.

FSMLang Implementation

The condition variable core is implemented as a single FSM with a set of external ports, a single channel interface, and an internal memory used for holding all condition variable state. The table is split into 4 major regions:

  1. Queue length.
    1. Indexed by condition variable ID.
    2. Current length of the queue for a given CV.
  2. Next owner.
    1. Indexed by condition variable ID.
    2. The next scheduled owner of the CV (head pointer).
  3. Link pointers.
    1. Indexed by thread ID.
    2. Linked-list pointers forming the queue for a given CV (next pointer).
  4. Last requester.
    1. Indexed by condition variable ID.
    2. The last thread that requested to wait on a CV (tail pointer).

All four table sections are used to implement a singly-linked list of blocked threads for each condition variable. Threads are added to the blocked queue via "wait" (enq) commands, and threads are removed from the queue using "signal" (deq) and "broadcast" (deq-all) commands. The condition variable core is exercised directly by a set of POSIX-compatible condition variable APIs. Heterogeneous (native software, heterogeneous software, and hardware) threads are able to share the services provided by the condition variable core.

Experimental Results

The FSMLang version of the condition variable core was written in under 300 lines of code. This is still quite smallcompared to the 1,000+ lines of hand-written VHDL that comprised the original implementation of the condition variable core. Below are some source lines of code (SLoC) metrics:

Source Lines Of Code (SLoC) Analysis

Test Case Slave LoC Post-Elab LoC
Original 1102 (VHDL) 1102 (VHDL)
FSMLang 286 (FSMLang) 634 (VHDL)

Conclusion

FSMLang provides a drastic reduction in the amount of code needed to implement the condition variable manager. The code is much more concise, compact, and easier to maintain. Performance and FPGA area of the core are on par with the original implementation.

Special PIC - HThreads Interrupt Scheduler

HThreads contains a CPU-bypass interrupt scheduler (CBIS) internally referred to as the "Special PIC". The core serves as a transformer: turning interrupt events into messages destined for the HW-based scheduler. This translation mechanism places interrupt events completely under scheduler control, and allows a programmer to use priorities to control the impact of interrupt events. The original CBIS was implemented by-hand in VHDL using an OPB bus interface. Recently, we have had a Master's student with little to no HW design experience re-implement the core CBIS logic in FSMLang, which will be later integrated into a more modern PLBv46 bus interface.

FSMLang Implementation

The Master's student was able to get the core logic of the CBIS up and running in simulation in under a week, with only a little bit of help in implementing the interrupt handling logic. Within 2 weeks this logic was fully tested on an FPGA development board.

A more experienced graduate student constructed a PLBv46 bus interface for the CBIS logic, and used this as a wrapper for the CBIS core logic developed by the Master's student. The Master's student's logic worked perfectly on board first time -- very impressive for someone that has no formal experience with digital hardware design. This is related to one of the design strengths of FSMLang; as it was built for describing FSMs in such a way that leads to correct synthesis results. The core described by the Master's student makes use of both internal memories and channels, which if done by-hand in VHDL, would inevitably lead to at least some timing/synchronization errors during development.

IDEA Encryption/Decryption

IDEA, or International Data Encryption Algorithm, encryption/decryption is a block-based cipher that utilizes 8.5 rounds of modular-arithmetic. The cipher is symmetric in that the same transformations can be used for both encryption and decryption.

FSMLang Implementation

An FSMLang implementation of IDEA was created and tested in under 24 hours. The implementation was tested against a "golden" C model using 2 different methods: compilation to C that was executed on a desktop processor, and compilation to VHDL and tested in simulation. Both compilations yielded correct results and the software implementation gave a clock cycle estimate of 90 clock cycles per block, which was verified in the VHDL model under ModelSim. The core utilizes individual registers to hold each key round, which leads to better performance, but at the cost of increased resource usage.

Experimental Results

The IDEA encryption core has been integrated into a heterogeneous multiprocessor system-on-chip (H-MPSoC) consisting of the following components on a Xilinx Virtex-5 FXT FPGA:

  • 125 MHz Clock Domain:
    • (1) IBM PowerPC 440 CPU
    • (1) Multi-Port Memory Controller
  • 62.5 MHz Clock Domain:
    • (6) MicroBlaze CPUs
    • (1) HThreads OS implementation (scheduler, thread manager, synchronization manager, and condition variable manager)

The IDEA encryption algorithm has been parallelized using hthreads to spawn threads to perform truly concurrent encryption operations. The threads are free to execute on either the PowerPC, MicroBlaze, or on a MicroBlaze augmented with an IDEA accelerator core. All synchronization and inter-core communication is handled by the hthreads operating system. Timing results of the parallelized IDEA implemetation is obtained through the use of free-running on-chip counters (click on graph to see larger version).

IDEA Encryption - Speedup Results
IDEA Encryption - Speedup Results

The heterogeneous nature of the MPSoC allows for a performance-space exploration by changing the mapping of threads to processors. Although the PowerPC is clocked twice as fast as the MicroBlaze cores, the use of parallel MicroBlaze cores enables more work to be accomplished in a smaller amount of time. If the MicroBlazes augmented with IDEA accelerators are utilized, then the affect is even more dramatic. Although the MicroBlaze sub-system is clocked at half of the rate of the PowerPC, the augmented cores are able to achieve a maximum of 9x speedup over PowerPC-based threads.

Personal tools