|Type||Case study of optimization of data conversion in gr-usrp module|
|State||closed / negative result|
The USRP is the standard hardware used to receive and transmit signals with the
GNU Radio project. It comes with
its own driver, which accesses the hardware via libusb. GNU Radio contains a wrapper, the
submodule gr-usrp, that integrates the USRP with GNU Radio. It uses the USRP driver
and converts the data it reads to the format used by GNU Radio.
The USRP delivers its data as 16-bit short integers, in groups of 2 shorts per complex sample. The gr-usrp module converts these shorts, for example, to complex values that consist of 2 floats. It can also convert 8-bit data from the USRP to floats, or do no conversion at all, i.e. pass the shorts through unchanged. My main focus has been the conversion of complex short integer samples to complex float samples.
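To make the data layout concrete, here is a minimal sketch of a generic, one-sample-at-a-time conversion. The function name `shorts_to_complex` is hypothetical and this is not the routine from gr-usrp, which may do additional work such as byte-order handling.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not the gr-usrp routine: converts interleaved
// 16-bit I/Q pairs into complex floats, one sample at a time.
void shorts_to_complex(const std::int16_t* in, std::size_t n_shorts,
                       std::complex<float>* out)
{
    assert(n_shorts % 2 == 0);  // two shorts (I and Q) per complex sample
    for (std::size_t i = 0; i < n_shorts / 2; ++i) {
        out[i] = std::complex<float>(static_cast<float>(in[2 * i]),
                                     static_cast<float>(in[2 * i + 1]));
    }
}
```

A loop of this shape has no alignment or block-size requirements beyond an even number of shorts.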
Starting from the disassembly of the existing conversion routine, compiled with GCC at -O3, I worked towards an implementation that uses built-in intrinsics.
After a look at the Intel optimization manual, I got the idea to make use of four specific instructions available with SSE2.
The idea is to first unpack the packed 16-bit integers into the upper 16 bits of 32-bit integers, then shift each 32-bit value right by 16 so that the unpacked value ends up in the lower 16 bits. Using an arithmetic shift here preserves the sign. Finally, the 32-bit integers are converted to floats.
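The unpack, shift and convert sequence can be sketched with SSE2 intrinsics as follows. This is an illustration, not the code from the patch: the function name `shorts_to_floats_sse2` is hypothetical, the choice of `_mm_unpacklo_epi16`, `_mm_unpackhi_epi16`, `_mm_srai_epi32` and `_mm_cvtepi32_ps` is my assumption about which instructions are meant, and unaligned loads and stores are used here only to keep the sketch free of alignment requirements.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not the code from the patch: converts n_shorts signed
// 16-bit integers to floats, eight at a time (i.e. four complex samples).
void shorts_to_floats_sse2(const std::int16_t* in, std::size_t n_shorts,
                           float* out)
{
    assert(n_shorts % 8 == 0);  // one 128-bit register holds eight shorts
    const __m128i zero = _mm_setzero_si128();
    for (std::size_t i = 0; i < n_shorts; i += 8) {
        // Unaligned load, chosen for simplicity in this sketch.
        const __m128i v =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + i));
        // Interleave zeros with the data so that each short occupies the
        // upper 16 bits of a 32-bit lane.
        __m128i lo = _mm_unpacklo_epi16(zero, v);  // shorts 0..3
        __m128i hi = _mm_unpackhi_epi16(zero, v);  // shorts 4..7
        // The arithmetic right shift moves each value into the lower 16 bits
        // and replicates its sign bit into the upper 16 bits.
        lo = _mm_srai_epi32(lo, 16);
        hi = _mm_srai_epi32(hi, 16);
        // Convert the sign-extended 32-bit integers to floats and store them.
        _mm_storeu_ps(out + i, _mm_cvtepi32_ps(lo));
        _mm_storeu_ps(out + i + 4, _mm_cvtepi32_ps(hi));
    }
}
```

Because the I and Q parts are interleaved in both the input and the output, the eight floats written per iteration correspond to four consecutive complex samples.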
The inner conversion loop produces 4 complex samples per iteration. All buffers now need to be aligned to a 16-byte boundary, and every access is a multiple of 16 bytes in size.
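The two constraints can be expressed as a small check. The helper names `is_aligned16`, `fits_sse2_path` and `check_sse2_preconditions` are hypothetical and not part of gr-usrp; the sketch only shows what a caller would have to guarantee before entering an aligned SSE2 loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p lies on a 16-byte boundary.
inline bool is_aligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: true if both buffers are 16-byte aligned and the
// input size in bytes is a multiple of 16 (eight shorts, four samples).
inline bool fits_sse2_path(const void* in, const void* out,
                           std::size_t n_input_bytes)
{
    return is_aligned16(in) && is_aligned16(out) && n_input_bytes % 16 == 0;
}

// Debug-build guard a caller could place in front of an aligned SSE2 loop.
inline void check_sse2_preconditions(const void* in, const void* out,
                                     std::size_t n_input_bytes)
{
    assert(fits_sse2_path(in, out, n_input_bytes));
}
```

In C++11 and later, a buffer declared with `alignas(16)` satisfies the alignment part of the check.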
The speedup turns out to be negligible: in the measurement below it is under 1 percent. It also varies, so that for some runs there is no gain at all. Fortunately, the optimized code never performs worse than the standard implementation. Nevertheless, the drawbacks are heavy. The optimized code needs its buffers aligned to 16 bytes (which requires a change to the gr-usrp module) and can only handle data sizes that are a multiple of 16 bytes. While this does not seem to be a constraint in the current usage of the USRP, it may become one in the future. The standard implementation does not have these constraints.
Generic: input: 6.4e+07 cpu: 0.232 items/sec: 2.759e+08
SSE2 intrinsic: input: 6.4e+07 cpu: 0.230 items/sec: 2.783e+08
I am not the author of the gr-usrp module. Relevant changes have been marked in the header of each file and there is a note added to the changelog.
|Modified gr-usrp module||TAR-Archive 2008-03-08 70kb|
|Patch||TAR-Archive 2008-03-08 4.5kb|