In recent times, artificial intelligence (AI) and machine learning (ML) have become hot topics, enabling useful applications such as assistive and autonomous driving. Intelligent accessories in the home are now mainstream, employing adaptive audio and acoustic beamforming.
This series of articles introduces what’s on the bench at SEGGER Labs…and coming soon.
SEGGER emVDSP
SEGGER’s emVDSP is a signal processing and vector library that targets multiple architectures. emVDSP presents a regular API across all data types for all targets. Where algorithms can be accelerated, they take advantage of underlying hardware features. That’s the “V” part of the name: emVDSP uses vector instructions to run multiple operations in parallel and deliver blistering performance.
emVDSP currently supports the following architectures:
- Cortex-M with DSP and SIMD instructions (v7EM)
- Cortex-A with NEON (Advanced SIMD) instructions (v7A, v8A)
- Cortex-M with Helium instructions (v8.1M+MVE)
- Older Arm cores with the DSP E extension (v5TE)
- RISC-V with the Packed SIMD P extension (RV32P, RV64P)
- RISC-V with the Vector extension (RV32V, RV64V)
- Intel IA32/AMD64 with MMX and Advanced Vector Extensions (AVX, AVX2, and AVX-512)
- Portable C code for use on any processor
The library contains a range of general-purpose algorithms that are well-tuned for typical digital signal processors and conventional processors.
Why construct emVDSP?
The answer is simple: to provide a quality DSP and vector library whose API is regular and doesn’t lock you in. Regular means that if an algorithm is available for one data type, it should most likely be available for all supported types (only where it makes sense, of course!). This is in contrast to other DSP libraries that offer algorithms only for the operations and data types supported by the underlying hardware. Because emVDSP can run on conventional processors, all emVDSP functions are offered across all architectures. Sure, things might run a little slower without hardware-level support, but not so much as to be unusable.
As there is no standardized API for DSP work, changing architectures usually means porting to a different signal processing API, and that porting effort creates inertia. Using a vendor-neutral API such as emVDSP provides agility and independence, as you’re able to switch processors without rewriting existing software.
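To make the idea of a regular API concrete, here is a minimal sketch in portable C of one operation (Abs) offered with the same naming pattern and argument order for every data type. All names (Demo_AbsQ7, DEMO_DEFINE_ABS, and so on) are hypothetical and are not the emVDSP API. The saturating behavior for the most negative fixed-point value is also an assumption made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout: this is NOT the emVDSP API. */

/* Generates a saturating fixed-point Abs for one element type.
   Assumption: the most negative value has no positive counterpart,
   so it is clamped to the type's maximum. */
#define DEMO_DEFINE_ABS(NAME, TYPE, TYPE_MIN, TYPE_MAX)                    \
  void NAME(TYPE *pDst, const TYPE *pSrc, size_t n) {                      \
    assert(pDst != NULL && pSrc != NULL);                                  \
    for (size_t i = 0; i < n; ++i) {                                       \
      TYPE x = pSrc[i];                                                    \
      pDst[i] = (x == TYPE_MIN) ? (TYPE)TYPE_MAX : (TYPE)(x < 0 ? -x : x); \
    }                                                                      \
  }

/* One definition per fixed-point type, all with the same signature shape. */
DEMO_DEFINE_ABS(Demo_AbsQ7,  int8_t,  INT8_MIN,  INT8_MAX)
DEMO_DEFINE_ABS(Demo_AbsQ15, int16_t, INT16_MIN, INT16_MAX)
DEMO_DEFINE_ABS(Demo_AbsQ31, int32_t, INT32_MIN, INT32_MAX)

/* The floating-point variant keeps the same argument order and naming. */
void Demo_AbsF32(float *pDst, const float *pSrc, size_t n) {
  assert(pDst != NULL && pSrc != NULL);
  for (size_t i = 0; i < n; ++i) {
    pDst[i] = (pSrc[i] < 0.0f) ? -pSrc[i] : pSrc[i];
  }
}
```

Because every variant takes (destination, source, count), switching a buffer from Q15 to F32 changes only the function suffix and the element type, which is the kind of regularity the paragraph above describes.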
Configuring emVDSP
Configuration of the library is controlled by a single file that parameterizes the C-level algorithms to use particular features of an architecture for best performance.
Retargeting the library to a new architecture starts with the portable C code and a minimal configuration file to deliver working code on the intended target. This is known as driving a spike through the library: the library works, but may not be efficient. Development continues by widening the spike, tailoring the configuration file to extract the best from the architecture.
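The following is a rough sketch of how a configuration setting might parameterize a portable C algorithm. The macro DEMO_UNROLL and the function Demo_AddF32 are hypothetical stand-ins. The benchmark output in this article lists settings named VDSP_DEFAULT_UNROLL and VDSP_DEFAULT_PIPELINE, but how emVDSP consumes them is not described here, so nothing below should be read as the library's actual mechanism.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical configuration knob. In a real port, a per-target
   configuration file would define it; the default here is arbitrary. */
#ifndef DEMO_UNROLL
#define DEMO_UNROLL 2
#endif

/* Element-wise add written in portable C. The same source compiles for
   any target; only the unroll factor chosen by the configuration differs. */
void Demo_AddF32(float *pDst, const float *pA, const float *pB, size_t n) {
  assert(pDst != NULL && pA != NULL && pB != NULL);
  size_t i = 0;
#if DEMO_UNROLL >= 4
  /* Four elements per iteration. */
  for (; i + 4 <= n; i += 4) {
    pDst[i + 0] = pA[i + 0] + pB[i + 0];
    pDst[i + 1] = pA[i + 1] + pB[i + 1];
    pDst[i + 2] = pA[i + 2] + pB[i + 2];
    pDst[i + 3] = pA[i + 3] + pB[i + 3];
  }
#elif DEMO_UNROLL >= 2
  /* Two elements per iteration. */
  for (; i + 2 <= n; i += 2) {
    pDst[i + 0] = pA[i + 0] + pB[i + 0];
    pDst[i + 1] = pA[i + 1] + pB[i + 1];
  }
#endif
  /* Remainder loop; with DEMO_UNROLL < 2 this is the whole algorithm. */
  for (; i < n; ++i) {
    pDst[i] = pA[i] + pB[i];
  }
}
```

In this sketch, the first working port would correspond to building with DEMO_UNROLL set to 1, and widening the spike would mean raising the factor (or swapping in target-specific code) while re-running tests and benchmarks at each step.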
At each stage it’s possible to run the emVDSP test suite and benchmarks to ensure correct operation and measure performance gains.
Preliminary results
Although emVDSP is unreleased, preliminary results comparing it against CMSIS-DSP and the Intel Integrated Performance Primitives are good.
Below is the benchmark of a selection of emVDSP functions against corresponding CMSIS-DSP functions on a Cortex-A9. The “Rel.SD%” column is the relative standard deviation as a percentage, a measure of how repeatable the timing of the benchmark is. The low relative standard deviations indicate that the cycle timings are highly repeatable.
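For readers who want to reproduce the two derived columns, here is a small sketch of how they could be computed from raw cycle counts. The function names are hypothetical, the use of the population standard deviation (dividing by n) is an assumption, and reading Rel.Perf as CMSIS-DSP cycles divided by emVDSP cycles is inferred from the numbers rather than stated by the benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Newton's method square root so the sketch needs no math library.
   Adequate for cycle-count magnitudes; not a general-purpose sqrt. */
double Demo_Sqrt(double x) {
  if (x <= 0.0) {
    return 0.0;
  }
  double r = (x > 1.0) ? x : 1.0;
  for (int i = 0; i < 100; ++i) {
    double next = 0.5 * (r + x / r);
    if (next == r) {
      break;  /* Converged to machine precision. */
    }
    r = next;
  }
  return r;
}

/* Relative standard deviation in percent: 100 * sigma / mean.
   Assumption: population standard deviation (divide by n). */
double Demo_RelSdPercent(const double *pCycles, size_t n) {
  assert(pCycles != NULL && n > 0);
  double mean = 0.0;
  for (size_t i = 0; i < n; ++i) {
    mean += pCycles[i];
  }
  mean /= (double)n;
  double var = 0.0;
  for (size_t i = 0; i < n; ++i) {
    double d = pCycles[i] - mean;
    var += d * d;
  }
  var /= (double)n;
  return (mean > 0.0) ? 100.0 * Demo_Sqrt(var) / mean : 0.0;
}

/* Relative performance: baseline cycles divided by candidate cycles.
   A value above 1 means the candidate is faster than the baseline. */
double Demo_RelPerf(double baselineCycles, double candidateCycles) {
  assert(candidateCycles > 0.0);
  return baselineCycles / candidateCycles;
}
```

As a check against the benchmark output, the Abs, Q15 row reports 8232 cycles for CMSIS-DSP and 2333 for emVDSP, and 8232 / 2333 is roughly 3.53, in line with the reported Rel.Perf.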
As you will see, emVDSP matches or outperforms CMSIS-DSP on every function shown, using the standard distribution without tuning. In fact, each function can be individually tuned in emVDSP, whereas CMSIS-DSP only offers coarse-grained optimization by unrolling as a project-wide option.
SEGGER Vector-DSP Library Benchmark
Copyright (c) 2019-2021 SEGGER Microcontroller GmbH

Target:   Cortex-A
Compiler: SEGGER cc 11.4.4
Config:   VDSP_DEFAULT_UNROLL   = 2
Config:   VDSP_DEFAULT_PIPELINE = 2

                     SEGGER VDSP         CMSIS-DSP
                     ------------------  ----------------------------
Function             Cycles     Rel.SD%  Cycles     Rel.SD%  Rel.Perf
-------------------  ------------------  ------------------  --------
Abs, Q7                2334        0.14   32112        0.01    13.75x
Abs, Q15               2333        0.09    8232        0.01     3.53x
Abs, Q31               2336        0.17    2333        0.14     1.00x
Abs, F32               2593        0.14    2844        0.08     1.10x
-------------------  ------------------  ------------------  --------
Neg, Q7                2335        0.15   37930        0.00    16.24x
Neg, Q15               2334        0.12   36393        0.01    15.59x
Neg, Q31               2334        0.13    2745        0.16     1.18x
Neg, F32               2590        0.14    5151        0.03     1.99x
-------------------  ------------------  ------------------  --------
MinReduce, Q7          1008        0.31   22839        0.02    22.65x
MinReduce, Q15          984        0.41   10809        0.03    10.98x
MinReduce, Q31          972        0.37    3482        0.69     3.58x
MinReduce, F32         1149        0.20    5433        0.36     4.73x
-------------------  ------------------  ------------------  --------
MaxReduce, Q7          1008        0.33   22842        0.01    22.66x
MaxReduce, Q15          980        0.31   10807        0.03    11.02x
MaxReduce, Q31          971        0.34    3454        0.10     3.56x
MaxReduce, F32         1143        0.95    5436        0.30     4.76x
-------------------  ------------------  ------------------  --------
Add, Q7                3230        0.13   53292        0.01    16.50x
Add, Q15               3231        0.11   53805        0.00    16.65x
Add, Q31               3230        0.08    3624        0.07     1.12x
Add, F32               3296        0.13    3605        0.06     1.09x
-------------------  ------------------  ------------------  --------
Add, Scalar, Q7        2532        0.10   36909        0.01    14.57x
Add, Scalar, Q15       2527        0.16   36394        0.00    14.40x
Add, Scalar, Q31       2527        0.17    3107        0.09     1.23x
Add, Scalar, F32       2783        0.13    7191        0.03     2.58x
-------------------  ------------------  ------------------  --------
Sub, Q7                3424        0.11   53294        0.01    15.56x
Sub, Q15               3422        0.07   53807        0.01    15.72x
Sub, Q31               3429        0.08    3623        0.06     1.06x
-------------------  ------------------  ------------------  --------
Mul, Q7                6420        0.07   42033        0.01     6.55x
Mul, Q15               3358        0.12   55341        0.00    16.48x
Mul, Q31               3741        0.10    6960        0.03     1.86x
Mul, F32               3488        0.12    3606        0.08     1.03x
-------------------  ------------------  ------------------  --------
Mul, Scalar, Q7        4759        0.07   38965        0.01     8.19x
Mul, Scalar, Q15       3100        0.14   37425        0.01    12.07x
Mul, Scalar, Q31       2848        0.11   11575        0.02     4.06x
Mul, Scalar, F32       2909        0.15    4385        0.06     1.51x
-------------------  ------------------  ------------------  --------
Mean, Q7               3671        0.09   22594        0.01     6.15x
Mean, Q15              1658        0.23   19050        0.02    11.49x
Mean, Q31              2116        0.17    3618        0.67     1.71x
Mean, F32              1178        0.54    5149        0.12     4.37x
-------------------  ------------------  ------------------  --------
STOP
Conclusion
emVDSP has excellent performance on Cortex devices. Comparisons against libraries for Intel x86 (both 32-bit and 64-bit) and for RISC-V (both Packed SIMD and Vector extensions) give equally good results.
Stay tuned for more articles on emVDSP’s features, its portable API, and the tools we use to tune it.
Interested?
If you’re interested in learning more about emVDSP, you can contact us at info@segger.com.