|
Serial RapidIO
Corner Turning DMA Engine
CoSine’s Serial RapidIO corner-turning DMA engine supports both
local and remote striding for matrix transposition. In many
applications, DSP compute nodes can process data more efficiently in
rows than in columns. One likely reason is cache-line organization:
because the elements of a row are stored at consecutive addresses,
fetching a row brings a larger share of the relevant data into cache.
When activated, the DMA engine becomes the Serial RapidIO bus master.
Features include linked-list data chaining (scatter/gather), interrupt
on error, and interrupt on completion of data transfer. The DMA engine
may transmit a series of read or write requests without waiting for
a response. It may also begin the transfer of the next descriptor
in the chain while waiting for responses from previous descriptors.
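Linked-list chaining can be pictured with a small model. The sketch below is illustrative only: the `Descriptor` fields and the `walk_chain` helper are hypothetical names chosen for this example and do not reflect CoSine's actual descriptor layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Descriptor:
    """Hypothetical DMA descriptor: one transfer plus a link to the next."""
    local_addr: int
    remote_addr: int
    length: int
    next: Optional["Descriptor"] = None


def walk_chain(head):
    """Follow the links from `head` and list the transfers in chain order."""
    transfers = []
    d = head
    while d is not None:
        transfers.append((d.local_addr, d.remote_addr, d.length))
        d = d.next
    return transfers


# Two chained descriptors: the engine would process `head`, then `tail`.
tail = Descriptor(local_addr=0x2000, remote_addr=0x9000, length=256)
head = Descriptor(local_addr=0x1000, remote_addr=0x8000, length=512, next=tail)
chain = walk_chain(head)
```

The model walks the chain strictly in order. It does not attempt to show the overlap described above, where requests for the next descriptor are issued while responses for earlier ones are still outstanding.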
Striding allows a
single DMA descriptor to describe transfers involving
non-consecutive segments of memory.
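The addressing this implies can be sketched in a few lines. This is a minimal illustration, not CoSine's register interface: the function name and parameters are hypothetical, and units are matrix elements rather than bytes. The stride is assumed to be the number of elements skipped between the end of one segment and the start of the next, so a stride of zero means the segments are consecutive.

```python
def segment_offsets(base, segment_size, stride, count):
    """Start offset of each segment described by one strided descriptor.

    `stride` is the gap skipped after each segment (an assumption drawn
    from the text: a stride of zero means consecutive segments).
    """
    return [base + i * (segment_size + stride) for i in range(count)]


# Four 2-element segments, skipping 6 elements after each one.
strided = segment_offsets(base=0, segment_size=2, stride=6, count=4)
# A stride of zero degenerates to one consecutive block.
contiguous = segment_offsets(base=0, segment_size=4, stride=0, count=3)
```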
Local Striding
when transmitting data is illustrated by the black arrows in Figure
1. A matrix is stored in the CoSine's local memory in row order
(left side of the figure). That is, all of the data within a row
are stored consecutively, and the data for one row immediately
follow the data for the previous row. Some columns are to be
transmitted to the remote memory, as indicated by the Segment Size.
Other columns are to be skipped, as indicated by the Local Stride.
Presumably those other columns will be transmitted to other remote
systems for parallel processing. In the remote memory (right side
of the figure), data for one segment immediately follow the data for
the previous segment. Because the data stream is written to consecutive
RapidIO addresses on the remote system, the data can be transmitted
without regard to segment boundaries: a single RapidIO packet may begin
with data from one segment and end with data from another.

Figure 1: Local Striding
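The transmit case in Figure 1 can be modeled as a gather from local memory into one consecutive stream. The sketch below makes several assumptions: memory is a flat Python list of elements, the Local Stride is the number of elements skipped after each segment, and the packet payload of four elements is an arbitrary value chosen to show a packet spanning two segments. The function name is hypothetical.

```python
def gather_strided(local, start, segment_size, local_stride, num_segments):
    """Collect strided segments from local memory into one consecutive stream."""
    stream = []
    pos = start
    for _ in range(num_segments):
        stream.extend(local[pos:pos + segment_size])
        pos += segment_size + local_stride  # skip the columns not being sent
    return stream


rows, cols = 3, 8
local = list(range(rows * cols))  # 3 x 8 matrix stored in row order

# Send columns 2..4 of every row; the other 5 columns are skipped.
stream = gather_strided(local, start=2, segment_size=3,
                        local_stride=5, num_segments=rows)

# Packetize without regard to segment boundaries (payload size is assumed).
payload = 4
packets = [stream[i:i + payload] for i in range(0, len(stream), payload)]
```

Here `stream` is `[2, 3, 4, 10, 11, 12, 18, 19, 20]`, and the first packet, `[2, 3, 4, 10]`, carries the whole first segment plus the first element of the second.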
The matrix on the
remote system is still in row order, but the rows are shorter. If
each new row is a single element, the matrix is effectively in
column order. The remote system can then process each column
efficiently without transposing the matrix itself. Eliminating the
transpose reduces CPU utilization, so fewer processors are needed to
accomplish the same task.
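The single-element case can be checked with a small example. This is a sketch under the same assumptions as the text suggests: a Segment Size of one element and a Local Stride equal to the row length minus one, with each column going to a different remote node. The names `column_stream` and `node_buffers` are hypothetical.

```python
rows, cols = 3, 4
# Element value encodes its position: row * 10 + column.
matrix = [[r * 10 + c for c in range(cols)] for r in range(rows)]
flat = [x for row in matrix for x in row]  # row-order local memory


def column_stream(flat, col, rows, cols):
    """Stream one column using 1-element segments and a stride of cols - 1."""
    segment_size = 1
    local_stride = cols - 1
    return [flat[col + i * (segment_size + local_stride)] for i in range(rows)]


# One transfer per column, each destined for a different remote node.
node_buffers = [column_stream(flat, c, rows, cols) for c in range(cols)]
```

Each entry of `node_buffers` holds one column in consecutive order (for example, `node_buffers[1]` is `[1, 11, 21]`), so taken together the buffers form the transpose of `matrix` without any processor performing the transpose.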
Local Striding
when receiving data is exactly the same as in Figure 1 except the
arrows showing the direction of transfer are reversed (red
arrowheads). A stream of data is read in consecutive memory order
on the remote system, and CoSine writes it with strides across local
memory.
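The receive direction is the mirror image: a consecutive stream is scattered into local memory. The sketch below assumes a flat list for local memory and a Local Stride counted as elements skipped after each segment; `scatter_strided` is a hypothetical name.

```python
def scatter_strided(local, stream, start, segment_size, local_stride):
    """Write a consecutive stream into local memory as strided segments."""
    pos = start
    for i in range(0, len(stream), segment_size):
        segment = stream[i:i + segment_size]
        local[pos:pos + len(segment)] = segment
        pos += segment_size + local_stride  # leave the skipped columns untouched
    return local


local = [0] * 12  # 3 x 4 matrix in row order, initially zero
scatter_strided(local, [1, 2, 3, 4, 5, 6],
                start=1, segment_size=2, local_stride=2)
```

After the call, `local` is `[0, 1, 2, 0, 0, 3, 4, 0, 0, 5, 6, 0]`: the middle two columns of each row are filled and the outer columns are left as they were.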
Remote Striding
when transmitting is illustrated by the blue arrows in Figure 2. A
matrix is stored in the CoSine's local memory in row order (now on
the right side of the figure). All columns are to be transmitted to
the remote memory, as indicated by the Segment Size and the fact
that the Local Stride is zero (not shown). In the remote memory
(now on the left side of the figure), the segments are not in
consecutive memory locations. CoSine automatically performs the
operations needed to distribute the data according to the Remote
Stride.

Figure 2: Remote Striding
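One way to picture the transmit case in Figure 2 is as a list of write requests, one per segment, each aimed at a strided remote address. The source does not describe how CoSine forms its requests, so this is only a model: the function name, the element-based addresses, and the one-request-per-segment split are all assumptions.

```python
def remote_write_requests(local, segment_size, remote_base, remote_stride):
    """Pair each consecutive local segment with its strided remote address."""
    requests = []
    addr = remote_base
    for i in range(0, len(local), segment_size):
        requests.append((addr, local[i:i + segment_size]))
        addr += segment_size + remote_stride  # gap left in remote memory
    return requests


# Local data is consecutive (Local Stride of zero); remote segments are
# spaced 6 elements apart.
requests = remote_write_requests(list(range(6)), segment_size=2,
                                 remote_base=100, remote_stride=6)
```

The result is `[(100, [0, 1]), (108, [2, 3]), (116, [4, 5])]`. Unlike the local-striding case, the remote addresses are not consecutive, so in this model no request spans a segment boundary.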
Remote Striding
when receiving data is exactly the same as in Figure 2 except the
arrows showing the direction of transfer are reversed (red
arrowheads).
Local and Remote
Striding can both be specified for a single transfer. The Segment
Size is the same for both local and remote matrices, but the strides
may be different.
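A combined transfer can be modeled as a copy with one shared Segment Size and two independent strides. This is a sketch with hypothetical names; both memories are flat lists, and each stride is taken as the number of elements skipped after a segment.

```python
def strided_copy(src, dst, segment_size, num_segments,
                 src_start, src_stride, dst_start, dst_stride):
    """Copy segments of one size using separate source and destination strides."""
    s, d = src_start, dst_start
    for _ in range(num_segments):
        dst[d:d + segment_size] = src[s:s + segment_size]
        s += segment_size + src_stride
        d += segment_size + dst_stride
    return dst


src = list(range(12))  # 2 x 6 local matrix in row order
dst = [None] * 8       # 2 x 4 remote matrix in row order

# Take the first 2 columns of each 6-wide local row and place them in the
# middle 2 columns of each 4-wide remote row.
strided_copy(src, dst, segment_size=2, num_segments=2,
             src_start=0, src_stride=4, dst_start=1, dst_stride=2)
```

After the call, `dst` is `[None, 0, 1, None, None, 6, 7, None]`.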
|