High-speed concentration of data for interfaces with a wide data word

In DAQ systems we often face the problem of fitting data from multiple narrow input channels into a single wide output channel without leaving “holes” in the output data stream.
An example may be a detector readout where the front-end ASIC delivers data as 32-bit words, and we want to transmit them (after zero suppression) to the computer’s memory via an integrated PCI Express block that accepts data with a width of 256 or 512 bits. Our team has developed a solution (described in DOI:10.3390/electronics12061437 and available as open source on GitLab) that concentrates 6 to 12 input channels with a width of 32 bits into a 256-bit output record.
Later on, its scalability was improved so that 16, 32, or even more channels may be concentrated. The new solution is described in DOI:10.3390/electronics13010081, and its source is available under a dual GPL/BSD license on GitLab. I hope it may be useful in various DAQ scenarios.
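
To illustrate the intended behavior, here is a minimal behavioral sketch in Python (illustration only, not the actual VHDL implementation; all names are made up). It models the “no holes” requirement: the valid 32-bit words delivered in each cycle are packed back-to-back into 256-bit output records, and a partially filled record is carried over to the next cycle.

```python
# Behavioral sketch (illustration only): pack 32-bit words from several
# channels into 256-bit output records with no gaps between words.
OUT_WORDS = 256 // 32  # 32-bit words per 256-bit output record

def concentrate(cycles):
    """cycles: iterable of lists; each list holds the valid 32-bit words
    delivered by the input channels in one clock cycle (zero-suppressed,
    so the number of words per cycle varies). Yields full 256-bit records
    as tuples of eight 32-bit words."""
    pending = []          # partially filled output record, carried over
    for words in cycles:
        pending.extend(words)
        while len(pending) >= OUT_WORDS:
            yield tuple(pending[:OUT_WORDS])
            pending = pending[OUT_WORDS:]
    if pending:           # flush the tail; padding is a simplification here
        yield tuple(pending + [0] * (OUT_WORDS - len(pending)))

# Example: three cycles with 3, 5 and 2 valid words give one full record
# plus a padded tail.
for rec in concentrate([[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]):
    print(rec)
```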

Dear Wojciech,

Thank you for sharing this amazing solution, which addresses a problem that is rarely discussed but will likely be useful in the near future. I have a couple of comments and suggestions to discuss.

  1. In Section 1.1 of your paper (10.3390/electronics12061437), it is desirable to keep the timestamps of the multiple input streams in time order (monotonically increasing), so the choice of arbitration logic is critical. The paper proposes a round-robin scheme, but a “smallest timestamp” arbitration strategy would also be suitable. Assuming the input data streams support back pressure with a ready/valid handshake, the arbiter can grant the data word from the input stream with the smallest timestamp, so the output stream keeps the time order, provided each input stream is already sorted (e.g. by your heap sorter, “Dual port memory based heapsort implementation for FPGA”; by the way, I have followed up on that heap sorter and expanded the idea into a NoC-based sorting network using LEs instead of BRAM. We can discuss it in another thread.) A behavioral sketch of such an arbiter is given after this list.

  2. Building a non-blocking fabric with a Benes network is an elegant solution. However, the concern about resource usage cannot be neglected. Do you have an estimate of the resource usage with Xilinx 6-input LUTs compared to that with Intel 8-input LUTs?

  3. Regarding the delay caused by a low-data-rate lane when the mixed-width FIFO is introduced: we also encountered this issue in our triggerless DAQ system. Our solution is to use a CAM (content-addressable memory) to pack the hits with a calculated delay relative to the current system timestamp. With the CAM, we can use the timestamp as the sort key, so the hits are addressable by their timestamp. The working principle is the following: for example, if the current system time is cycle 1000, we are safe to pack hits with a timestamp of cycle 500, because they are already buffered in the system. We then form a packet in each lane simultaneously for the hits with timestamps in the 400–500 cycle range and merge the packets from all lanes once the end-of-packet has been received in every lane. A lane with a low hit rate can still generate an empty packet. As the system timestamp advances, we pack the packets containing the hits of cycles 500–600, 600–700, and so on. (A behavioral sketch of this windowed packing is also given after this list.)
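
To make point 1 concrete, here is a minimal behavioral sketch (Python, illustration only, not an RTL design) of a smallest-timestamp arbiter merging input streams that are assumed to be individually sorted; the ready/valid handshake is abstracted away as plain iterators, and all names are made up.

```python
import heapq

def timestamp_arbiter(streams):
    """Merge several per-channel hit streams into one output stream.
    Each stream yields (timestamp, data) tuples and is assumed to be
    already sorted by timestamp (e.g. by a per-channel heap sorter).
    The arbiter always grants the channel offering the smallest
    timestamp, so the merged output stays in time order."""
    iters = [iter(s) for s in streams]
    heads = []  # (timestamp, channel, data) of the word offered by each channel
    for ch, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heads.append((first[0], ch, first[1]))
    heapq.heapify(heads)
    while heads:
        ts, ch, data = heapq.heappop(heads)   # grant the smallest timestamp
        yield ch, ts, data
        nxt = next(iters[ch], None)           # that channel offers its next word
        if nxt is not None:
            heapq.heappush(heads, (nxt[0], ch, nxt[1]))

# Example: two sorted channels are merged into one time-ordered stream.
ch0 = [(2, 'a0'), (5, 'a1'), (9, 'a2')]
ch1 = [(1, 'b0'), (6, 'b1')]
print(list(timestamp_arbiter([ch0, ch1])))
```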
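
And a minimal behavioral sketch of the windowed packing from point 3 (Python, illustration only; the CAM is modeled as a per-window lookup over buffered hits, and the window length and safety margin are made-up parameters):

```python
WINDOW = 100        # width of one packing window, in clock cycles (illustrative)
SAFETY_MARGIN = 500 # hits older than now - SAFETY_MARGIN are assumed buffered

def pack_windows(hits_per_lane, now):
    """hits_per_lane: {lane: [(timestamp, data), ...]} of buffered hits.
    Models the CAM lookup by timestamp: for every window that is already
    'safe' (older than now - SAFETY_MARGIN), each lane emits one packet,
    possibly empty, and the packets of all lanes are merged."""
    safe_limit = now - SAFETY_MARGIN
    merged = []
    for start in range(0, safe_limit, WINDOW):
        end = start + WINDOW
        window_packets = {}
        for lane, hits in hits_per_lane.items():
            # a low-rate lane simply contributes an empty packet here
            window_packets[lane] = [h for h in hits if start <= h[0] < end]
        merged.append(((start, end), window_packets))
    return merged

# Example: lane 1 has few hits but still produces (empty) packets.
hits = {0: [(120, 'x'), (480, 'y')], 1: [(450, 'z')]}
for (start, end), packets in pack_windows(hits, now=1000):
    print(f"window {start}-{end}: {packets}")
```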

Best,

Yifeng

Dear Yifeng,

I’m sorry for the late response. I couldn’t reply earlier because of the holidays.

  1. In Section 1.1 of your paper (10.3390/electronics12061437), it is desirable to keep the timestamps of the multiple input streams in time order (monotonically increasing), so the choice of arbitration logic is critical. The paper proposes a round-robin scheme, but a “smallest timestamp” arbitration strategy would also be suitable. Assuming the input data streams support back pressure with a ready/valid handshake, the arbiter can grant the data word from the input stream with the smallest timestamp, so the output stream keeps the time order, provided each input stream is already sorted (e.g. by your heap sorter, “Dual port memory based heapsort implementation for FPGA”; by the way, I have followed up on that heap sorter and expanded the idea into a NoC-based sorting network using LEs instead of BRAM. We can discuss it in another thread.)

In our readout chain (DOI: 10.1088/1748-0221/12/02/C02061), we use GBT links with FEC in downlinks but without FEC (wide frame mode) in uplinks to fully utilize the bandwidth.
The FEE ASIC (SMX) communication protocol (DOI: 10.1016/j.nima.2016.08.005) does not provide CRC protection of the hit data for the same reason.
Therefore, there is a non-negligible risk of receiving data with an uncorrectable, incorrect timestamp. Such data obstructs the heap sorter’s operation. We tried a solution based on a bin sorter (DOI: 10.15120/GSI-2021-00421, page 149), but it was also unsatisfactory (it handled beam intensity fluctuations poorly).
The SMX may also introduce significant reordering of hit data due to its internal buffering.
So finally, another solution that handles the above issues in software is being considered (planned publication: [2]). A significant part of it is the described concentrator, aimed at transparent transmission of the data streams from individual E-Links without affecting the time order of the hit data. The detection of possible timestamp errors and the sorting of the data may then be done in software without sacrificing the scarce FPGA resources.
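
Just to illustrate that idea, here is a simplistic Python sketch (not the planned implementation; the threshold and the error heuristic are made up): the host software first flags hits whose timestamps jump implausibly within one E-Link stream, then merges and sorts the remaining hits.

```python
MAX_JUMP = 10_000  # illustrative threshold for an implausible timestamp jump

def flag_timestamp_errors(hits):
    """hits: [(timestamp, data), ...] from one E-Link, nominally time-ordered
    up to the reordering introduced by the front-end buffering.
    Returns (clean, suspicious), where a hit is suspicious if its timestamp
    differs from the previously accepted one by more than MAX_JUMP.
    This is only a crude heuristic for the sketch."""
    clean, suspicious = [], []
    last_ts = None
    for ts, data in hits:
        if last_ts is not None and abs(ts - last_ts) > MAX_JUMP:
            suspicious.append((ts, data))   # possibly a corrupted timestamp
            continue
        clean.append((ts, data))
        last_ts = ts
    return clean, suspicious

def software_sort(per_elink_hits):
    """Merge the cleaned per-E-Link streams and time-sort them in software."""
    merged = []
    for hits in per_elink_hits:
        clean, _ = flag_timestamp_errors(hits)
        merged.extend(clean)
    return sorted(merged, key=lambda h: h[0])
```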

I’m glad to know that the heap sorter has been further developed. BTW, later on I also attempted to reimplement it with HLS (DOI: 10.1117/12.2502093).

  2. Building a non-blocking fabric with a Benes network is an elegant solution. However, the concern about resource usage cannot be neglected. Do you have an estimate of the resource usage with Xilinx 6-input LUTs compared to that with Intel 8-input LUTs?

Unfortunately, we have tested that solution only in AMD/Xilinx FPGAs. Careful optimization for different FPGA architectures may be a good topic for a student research project. Thanks for the suggestion.
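
As a rough, architecture-independent starting point, one can at least count the 2×2 switching elements of the standard Benes topology; the LUT-level mapping (where the 6-input vs. 8-input LUT difference actually appears) comes on top of that. A quick back-of-envelope sketch:

```python
import math

def benes_switch_count(n):
    """Number of 2x2 switching elements in a Benes network for n = 2**k
    inputs: (2*log2(n) - 1) stages, each with n/2 elements."""
    k = int(math.log2(n))
    assert 2 ** k == n, "n must be a power of two"
    stages = 2 * k - 1
    return stages * (n // 2)

# Rough scaling only: each element multiplexes the full data word, so the
# LUT cost per element depends on the word width and the target LUT size.
for n in (8, 16, 32, 64):
    print(n, "inputs ->", benes_switch_count(n), "2x2 elements")
```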

  3. Regarding the delay caused by a low-data-rate lane when the mixed-width FIFO is introduced: we also encountered this issue in our triggerless DAQ system. Our solution is to use a CAM (content-addressable memory) to pack the hits with a calculated delay relative to the current system timestamp. With the CAM, we can use the timestamp as the sort key, so the hits are addressable by their timestamp. The working principle is the following: for example, if the current system time is cycle 1000, we are safe to pack hits with a timestamp of cycle 500, because they are already buffered in the system. We then form a packet in each lane simultaneously for the hits with timestamps in the 400–500 cycle range and merge the packets from all lanes once the end-of-packet has been received in every lane. A lane with a low hit rate can still generate an empty packet. As the system timestamp advances, we pack the packets containing the hits of cycles 500–600, 600–700, and so on.

That solution somewhat resembles our bin-sorter-based approach, where the data were routed to a particular bin based on their timestamp. However, we abandoned that approach for the reasons explained in the answer to point 1.

Thanks for your remarks,
Best regards,
Wojtek