Serial RapidIO Corner Turning DMA Engine
CoSine’s Serial RapidIO corner-turning DMA engine supports both local and remote striding for matrix transposition.  In many applications, DSP compute nodes can process data more efficiently in rows than in columns.  One likely reason is cache-line organization: because the elements of a row are stored at consecutive addresses, each cache-line fill brings in a larger portion of the data that will be needed next.

When activated, the DMA engine becomes the Serial RapidIO bus master.  Features include linked-list data chaining (scatter/gather), interrupt on error, and interrupt on completion of a data transfer.  The DMA engine may transmit a series of read or write requests without waiting for a response.  It may also begin the transfer for the next descriptor in the chain while still waiting for responses to requests from previous descriptors.

Striding allows a single DMA descriptor to describe transfers involving non-consecutive segments of memory.

Local Striding when transmitting data is illustrated by the black arrows in Figure 1.  A matrix is stored in the CoSine's local memory in row order (left side of the figure).  That is, all of the data within a row are stored consecutively, and the data for one row immediately follow the data for the previous row.  Some columns are to be transmitted to the remote memory, as indicated by the Segment Size.  Other columns are to be skipped, as indicated by the Local Stride.  Presumably those other columns will be transmitted to other remote systems for parallel processing.  In the remote memory (right side of the figure), data for one segment immediately follow the data for the previous segment.  Since a stream of data is sent to the remote system in order of consecutive RapidIO addresses, the data can be transmitted without regard to segment boundaries.  It is possible for one RapidIO packet to begin with data for one segment and end with data for a different one.
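The local-striding read described above can be modeled in a few lines of Python.  This is only an illustrative sketch, not the engine's actual programming interface: the function name gather_strided and its parameters are hypothetical, memory is modeled as a list of elements rather than bytes, and the stride is assumed to be the number of elements skipped between the end of one segment and the start of the next (consistent with a stride of zero meaning that nothing is skipped).

```python
def gather_strided(memory, base, segment_size, stride, segment_count):
    """Collect segment_count segments into one consecutive stream.

    Assumption: `stride` is the gap (in elements) skipped after each segment.
    """
    stream = []
    addr = base
    for _ in range(segment_count):
        stream.extend(memory[addr:addr + segment_size])
        addr += segment_size + stride  # skip the columns that are not sent
    return stream


# A 3-row by 4-column matrix stored in row order: 0..11
local_memory = list(range(12))

# Send the first two columns of every row and skip the last two.
stream = gather_strided(local_memory, base=0, segment_size=2,
                        stride=2, segment_count=3)
print(stream)  # [0, 1, 4, 5, 8, 9]
```

The resulting stream carries no segment boundaries, which is why a packet could begin in one segment and end in another.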

Figure 1: Local Striding

The matrix on the remote system is still in row order, but the rows are shorter.  If each segment, and therefore each new row, is a single element, the matrix is effectively in column order.  The remote system can now efficiently process the column of data without having to perform any transposition of the matrix.  Because the remote CPUs no longer spend cycles on the transpose, the same task can be accomplished with fewer processors.
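The single-element case can be checked with a small model.  The sketch below is an assumption-laden illustration rather than the engine's behavior in detail: read_segments is a hypothetical helper, addresses count elements instead of bytes, and the stride is taken to be the gap between segments.  With a segment size of one element and a stride of one row minus one element, each transfer delivers one column as a consecutive run.

```python
def read_segments(memory, base, segment_size, stride, segment_count):
    """Return the consecutive stream produced by a strided read.

    Assumption: `stride` is the number of elements skipped between segments.
    """
    starts = [base + i * (segment_size + stride) for i in range(segment_count)]
    return [memory[s + k] for s in starts for k in range(segment_size)]


ROWS, COLS = 3, 4
matrix = list(range(ROWS * COLS))  # row order: row r occupies r*COLS .. r*COLS+3

# One transfer per column: segment of 1 element, skipping the rest of the row.
columns = [
    read_segments(matrix, base=c, segment_size=1,
                  stride=COLS - 1, segment_count=ROWS)
    for c in range(COLS)
]
print(columns)
```

Each entry of columns is what one remote node would receive; taken together they form the transposed matrix without any node performing a transpose.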

Local Striding when receiving data is exactly the same as in Figure 1 except the arrows showing the direction of transfer are reversed (red arrowheads).  A stream of data is read in consecutive memory order on the remote system, and CoSine writes it with strides across local memory.

Remote Striding when transmitting is illustrated by the blue arrows in Figure 2.  A matrix is stored in the CoSine's local memory in row order (now on the right side of the figure).  All columns are to be transmitted to the remote memory, as indicated by the Segment Size and by the Local Stride being zero (not shown).  In the remote memory (now on the left side of the figure), the segments are not in consecutive memory locations.  CoSine automatically performs the operations needed to distribute the data at the remote stride.
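The remote-striding write can be modeled the same way.  In this sketch, scatter_strided is a hypothetical name, memory is a list of elements rather than bytes, and the stride is assumed to be the gap left between the end of one segment and the start of the next in the destination.

```python
def scatter_strided(stream, memory, base, segment_size, stride):
    """Write a consecutive stream into memory as strided segments.

    Assumption: `stride` is the number of destination elements left
    untouched between consecutive segments.
    """
    addr = base
    for offset in range(0, len(stream), segment_size):
        segment = stream[offset:offset + segment_size]
        memory[addr:addr + len(segment)] = segment
        addr += segment_size + stride  # leave a gap in the remote memory
    return memory


local_stream = [1, 2, 3, 4, 5, 6]   # read consecutively (Local Stride of zero)
remote_memory = [0] * 10            # zeros mark locations not written

scatter_strided(local_stream, remote_memory, base=0, segment_size=2, stride=2)
print(remote_memory)  # [1, 2, 0, 0, 3, 4, 0, 0, 5, 6]
```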

Figure 2: Remote Striding

Remote Striding when receiving data is exactly the same as in Figure 2 except the arrows showing the direction of transfer are reversed (red arrowheads).

Local and Remote Striding can both be specified for a single transfer.  The Segment Size is the same for both local and remote matrices, but the strides may be different.
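A combined transfer might be described as in the following sketch.  The StrideDescriptor fields and the transfer function are hypothetical and do not reflect the actual descriptor layout; addresses and strides count elements rather than bytes, and each stride is assumed to be the gap between segments.  The point is only that one Segment Size is shared while the two sides advance by different strides.

```python
from dataclasses import dataclass


@dataclass
class StrideDescriptor:
    """Hypothetical descriptor fields, for illustration only."""
    local_base: int
    remote_base: int
    segment_size: int    # shared by the local and remote sides
    segment_count: int
    local_stride: int    # elements skipped between local segments
    remote_stride: int   # elements skipped between remote segments


def transfer(desc, local, remote):
    """Copy strided local segments into strided remote locations."""
    src, dst = desc.local_base, desc.remote_base
    for _ in range(desc.segment_count):
        remote[dst:dst + desc.segment_size] = local[src:src + desc.segment_size]
        src += desc.segment_size + desc.local_stride
        dst += desc.segment_size + desc.remote_stride


local = list(range(12))   # 3 x 4 matrix in row order
remote = [-1] * 9         # -1 marks locations not written

desc = StrideDescriptor(local_base=1, remote_base=0, segment_size=2,
                        segment_count=3, local_stride=2, remote_stride=1)
transfer(desc, local, remote)
print(remote)  # [1, 2, -1, 5, 6, -1, 9, 10, -1]
```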