Algorithm Acceleration

Hello everyone,

I am Kadir, a physics master’s student. In my master’s thesis I am working on a heterogeneous computing system: a CPU plus an FPGA connected over PCIe 4.0, specifically an AMD Xilinx Alveo U280. I need to find the most efficient way to work with the FPGA and exploit the full potential of the device using High-Level Synthesis (HLS) with C++.
I am trying to accelerate a track reconstruction prototype algorithm. So far I have familiarised myself a bit with the tools and environment (Vitis Analyzer, Vivado, etc.), and I mostly work from the terminal.
I am following the steps SW-EMU (software emulation) → HW-EMU (hardware emulation) → running on hardware. Using hardware emulation, I have managed to emulate sending data and the kernel from the host (CPU) to the device (FPGA), performing the calculations on the device, and collecting the results back on the host. To get better performance I used the AXI interface and streamed data between functions (READ, COMPUTE, WRITE) inside the kernel, executing them in parallel.
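The READ → COMPUTE → WRITE pattern described above can be sketched as follows. This is a minimal illustration, not the actual kernel: `hls::stream` is replaced by a tiny FIFO stand-in so the snippet compiles with a plain C++ compiler, the HLS pragmas are shown as comments, and the doubling in `compute` is a placeholder for the real track reconstruction math.

```cpp
#include <cstddef>
#include <queue>

// Stand-in for hls::stream<T> so this compiles with g++.
// In Vitis HLS you would #include "hls_stream.h" and use hls::stream<T>.
template <typename T>
struct Stream {
    std::queue<T> q;
    void write(const T& v) { q.push(v); }
    T read() { T v = q.front(); q.pop(); return v; }
};

static void read_in(const float* in, Stream<float>& s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)   // #pragma HLS PIPELINE II=1
        s.write(in[i]);
}

static void compute(Stream<float>& in, Stream<float>& out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)   // #pragma HLS PIPELINE II=1
        out.write(in.read() * 2.0f);      // placeholder computation
}

static void write_out(Stream<float>& s, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)   // #pragma HLS PIPELINE II=1
        out[i] = s.read();
}

// Top-level kernel: with #pragma HLS DATAFLOW the three stages run
// concurrently, connected by FIFOs, instead of one after another.
void krnl(const float* in, float* out, std::size_t n) {
    // #pragma HLS INTERFACE m_axi port=in  bundle=gmem0
    // #pragma HLS INTERFACE m_axi port=out bundle=gmem1
    // #pragma HLS DATAFLOW
    Stream<float> s1, s2;
    read_in(in, s1, n);
    compute(s1, s2, n);
    write_out(s2, out, n);
}
```

In software the stages run sequentially; in hardware, DATAFLOW turns the FIFOs into task-level pipelining so a new input can be read while earlier ones are still being computed and written.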

What is the fastest and most efficient way to transfer data for continuous dataflow between CPU and FPGA? Does using structs with multiple data points (such as `struct position { float x, y, z; }`) make performance worse, or should we use only one-dimensional arrays to get the best performance? What else needs to be considered (memory, port bit width, etc.)?

From the host point of view, overlapping compute (kernel runs) with PCIe memory transfers is the best you can aim for to get continuous dataflow.
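A common way to get that overlap is host-side double buffering: while the kernel processes one buffer, the host transfers the next batch into the other. The sketch below shows only the control flow; `transfer_to_device` and `run_kernel` are hypothetical stubs standing in for the real XRT calls (`xrt::bo::sync` and launching/waiting on an `xrt::run`).

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical stand-ins for the real H2D transfer and kernel launch.
static void transfer_to_device(std::vector<float>& buf) { (void)buf; }
static int  run_kernel(std::vector<float>& buf) { return (int)buf.size(); }

// Ping-pong between two buffers: the transfer for batch b overlaps
// with the kernel run still in flight for batch b-1.
int process_batches(int num_batches, std::size_t batch_size) {
    std::vector<float> buf[2] = {std::vector<float>(batch_size),
                                 std::vector<float>(batch_size)};
    int total = 0;
    std::future<int> pending;                 // kernel run in flight
    for (int b = 0; b < num_batches; ++b) {
        int cur = b % 2;                      // alternate buffers
        transfer_to_device(buf[cur]);         // overlaps previous run
        if (pending.valid()) total += pending.get();  // finish batch b-1
        pending = std::async(std::launch::async, run_kernel,
                             std::ref(buf[cur]));
    }
    if (pending.valid()) total += pending.get();
    return total;
}
```

With XRT the `std::async`/`get` pair corresponds to starting a run and calling `wait()` on it one iteration later; the two buffers prevent the transfer from overwriting data the kernel is still reading.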

But on the device side, the way you load your buffers from memory through AXI can have a big impact on kernel performance. Most (all?) Alveo devices have a 512-bit wide memory interface, so if you can organize your accesses as bursts of 512-bit reads/writes, that is probably how you should define the buffers' data structure. But this is very dependent on the algorithm you are implementing and its data access patterns.
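Concretely, 512 bits is 64 bytes, i.e. 16 floats per memory word. This is also why a bare 3-float struct can hurt: 12 bytes does not divide 64, so consecutive records straddle word boundaries. A small sketch of the arithmetic (the struct names are just illustrative):

```cpp
#include <cstddef>

// One 512-bit AXI word = 64 bytes = 16 floats.
constexpr std::size_t AXI_BYTES       = 512 / 8;
constexpr std::size_t FLOATS_PER_WORD = AXI_BYTES / sizeof(float);

// 12 bytes: 64 % 12 != 0, so consecutive structs straddle
// 512-bit word boundaries, which hurts burst inference.
struct Position { float x, y, z; };
static_assert(sizeof(Position) == 12, "three packed floats");

// Padding to 16 bytes gives exactly 4 structs per 512-bit word,
// keeping accesses aligned (at the cost of 25% wasted bandwidth).
struct PositionPadded { float x, y, z, pad; };
static_assert(sizeof(PositionPadded) == 16, "power-of-two size");
static_assert(AXI_BYTES % sizeof(PositionPadded) == 0, "4 per word");
```

The alternative is a structure-of-arrays layout (separate `x[]`, `y[]`, `z[]` buffers), which wastes no bandwidth and bursts naturally, but may cost you three AXI ports or extra index bookkeeping; which wins depends on the access pattern, as noted above.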

Yes, the maximum bit width is given as 512 bits. But there is also the Port Bit Width Widening property, which confused me a bit. When we send an array of numbers, the port bit width is automatically adjusted to the data type, but when a struct with multiple variables is sent, it is adjusted to 512 bits for float. With the float data type the port bit width shows 512 bits, but for double it shows 1024 bits. What is the advantage of port bit width widening? Is it also possible to widen the port bit width to 1024 for float numbers? Using a pre-synthesis Tcl script and pragmas I tried a 1024-bit width for float, but it still shows 512 in the summary file.
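For reference, automatic widening in Vitis HLS is controlled per m_axi port with the `max_widen_bitwidth` interface option (or globally via the `syn.interface.m_axi_max_widen_bitwidth` config). Widening lets HLS coalesce several narrow element accesses into one wide bus access, so fewer beats are needed per burst. A fragment showing the pragma placement (the kernel name and bundles are illustrative; whether widening is actually applied also depends on the access pattern, alignment, and burst analysis, which may explain why a requested 1024 still reports as 512):

```cpp
// Vitis HLS: allow the m_axi data bus to be widened up to 512 bits.
// max_widen_bitwidth takes a power of two; 0 disables widening.
extern "C" void krnl(const float* in, float* out, int n) {
#pragma HLS INTERFACE m_axi port=in  offset=slave bundle=gmem0 max_widen_bitwidth=512
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem1 max_widen_bitwidth=512
    // ... kernel body ...
}
```

The advantage is bandwidth: with a 512-bit port, one pipelined loop iteration can move 16 floats instead of 1, provided the loop accesses memory contiguously so the tool can infer the wide bursts.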