Innovative GSPS Signal Processing Solution with MATLAB Simulink for FPGA SoC - MATLAB & Simulink
Video length is 17:11


Overview

Join us for a detailed webinar that addresses the challenges and solutions associated with processing incoming high-speed data for pivotal applications such as 5G New Radio (NR), radar, and signal intelligence. The advancements in analog-to-digital converters (ADCs) have been instrumental in the development of new DSP algorithms that can sustain the demanding performance requirements of these sophisticated applications. Our discussion will focus on effective strategies for modeling, exploring, and simulating hardware architecture options, as well as methods for generating synthesizable VHDL and Verilog code.

Highlights

In this webinar, you will learn how to:

  • Apply Model-Based Design methodology to model and simulate DSP algorithms for effectively targeting FPGA/ASIC platforms.
  • Utilize a Digital Down Conversion (DDC) example to examine and enable efficient sample- and frame-based processing.
  • Analyze and enhance hardware design with a focus on latency, throughput, and resource usage.
  • Generate readable, synthesizable VHDL and Verilog code for FPGA deployment.

About the Presenter 

Nadereh Rooein, Principal Application Engineer at MathWorks 

Nadereh has extensive experience in ASIC/FPGA design and verification. She previously worked at Ericsson on 3G technology and at Teradyne on automatic test equipment for large ASICs. At MathWorks, she has over 18 years of experience helping customers across various industries adopt our HDL workflow for designing, implementing, and verifying ASICs/FPGAs/SoCs. She holds a Master of Electronic Engineering from Chalmers University of Technology, Gothenburg, Sweden.

Recorded: 8 May 2024

Hi, everyone, and welcome to the innovative gigasamples-per-second digital signal processing solution for FPGA and SoC. My name is Nadereh Rooein. My background is ASIC/FPGA design and verification. Having joined MathWorks more than 18 years ago, my role involves helping engineers accelerate and optimize their algorithm designs using advanced IPs for FPGA and SoC deployment. In this presentation, I will introduce you to the MathWorks solution that significantly eases the implementation of gigasample-per-second, or super sample rate, digital signal processing algorithms using DSP HDL Toolbox.

We will start with an introduction to implementation of gigasample per second digital signal processing. Following that, I will show you a high-throughput digital down-converter example. You will witness how the DSP blocks effortlessly adapt to different architectures, ensuring optimal results for varying input data sizes. In our final section, I will concentrate on Simulink simulation with hardware latency.

We live in an era of high-bandwidth, gigasample-per-second data processing, where technologies such as 5G, radar, software-defined radio, and medical devices require more bandwidth to process and transfer data, thereby demanding higher throughput. In recent years, there have been significant advances in analog-to-digital converters, which have enabled the delivery of data at gigasample-per-second rates from the RF interface.

Until recently, RF and IF operated in the analog domain while baseband was digital, because ADCs and DACs were unable to sample RF frequencies above a few hundred megahertz. However, the latest generation of ADCs and DACs can now sample high-bandwidth RF frequencies in the gigahertz range, allowing both RF and baseband to exist in the digital domain.

These advancements bring numerous challenges: scalar processing demands gigahertz clock speeds, yet clock frequencies are inherently limited in FPGA and ASIC technologies. Furthermore, pursuing higher clock frequencies is undesirable due to the significant increase in power consumption. So the question is: how do you design for gigasample-per-second throughput without requiring a gigahertz clock?

What MathWorks offers is frame processing. Instead of processing sample by sample, as a scalar, the new DSP blocks take multiple samples and process them at a lower clock rate. Rather than processing a single sample at two gigahertz, for example, frame processing enables us to handle four samples simultaneously at 500 megahertz. This method is not only more power efficient but also optimizes processing capability. However, the frame processing approach requires redesigning the hardware architecture to accommodate it.
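As a rough illustration of the idea (plain Python, not MathWorks code), here is a running-sum filter processed one sample per "clock cycle" versus four samples per cycle. The frame-based version produces identical results in a quarter of the cycles:

```python
def scalar_running_sum(samples):
    acc, out, cycles = 0, [], 0
    for s in samples:                         # one sample per clock cycle
        acc += s
        out.append(acc)
        cycles += 1
    return out, cycles

def frame_running_sum(samples, frame_size=4):
    acc, out, cycles = 0, [], 0
    for i in range(0, len(samples), frame_size):
        for s in samples[i:i + frame_size]:   # unrolled logic within one cycle
            acc += s
            out.append(acc)
        cycles += 1                           # one cycle consumed a whole frame
    return out, cycles

data = list(range(16))
y_scalar, c_scalar = scalar_running_sum(data)
y_frame, c_frame = frame_running_sum(data)
assert y_scalar == y_frame        # identical outputs
assert c_scalar == 4 * c_frame    # in a quarter of the clock cycles
```

In hardware, the inner loop becomes replicated combinational logic, which is why frame processing trades area for clock rate.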

Redesigning the algorithm for frame processing has its own problems. For some algorithms, such as CIC, biquad, or any IIR filter, there is no known frame-based version. Adapting algorithms like the FFT is challenging and time-consuming, requiring designers to repeatedly verify the design with each adjustment for different input frame sizes. Consequently, optimizing for area, speed, and throughput is a difficult and expensive process.

Another example is the FIR filter. The FIR scalar architecture is totally different from the frame-based one; using a polyphase filter is one technique to support frame processing. Transitioning from a scalar to a frame-based architecture therefore involves considerable changes to the filter's design.
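A hedged sketch of the polyphase idea in plain Python: a polyphase decimator accepts an M-sample frame per step and produces one output, matching "filter, then downsample by M" exactly. The taps and frame size below are arbitrary illustrative values:

```python
def fir(x, h):
    # direct-form FIR: y[n] = sum_k h[k] * x[n-k]
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def polyphase_decimate(x, h, M):
    phases = [h[p::M] for p in range(M)]   # phase p holds h[p], h[p+M], ...
    y = []
    for m in range(len(x) // M):           # one output per M-sample frame
        acc = 0
        for p, e in enumerate(phases):
            for q, coeff in enumerate(e):
                idx = m * M - M * q - p    # sample feeding phase p, tap q
                if idx >= 0:
                    acc += coeff * x[idx]
        y.append(acc)
    return y

x, h, M = list(range(1, 17)), [1, 2, 3, 2, 1], 4
assert polyphase_decimate(x, h, M) == fir(x, h)[::M]
```

Each phase is a short sub-filter fed by one sample of the incoming frame, which is what makes the structure map naturally onto parallel hardware.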

Using frame processing may not solve all the issues, especially when you design a high-rate filter. The reason is that the number of multipliers is proportional to the sampling frequency of the input signal. In addition, running the filter in frame mode increases the number of multipliers by a factor of the frame size.

In the super-sample DDC, I'm going to show how we solve the resource problem in a high-throughput filter chain. DSP HDL Toolbox implements DSP algorithms such as FIR, FFT, and IIR filters, automatically selecting the proper hardware architecture. For the FFT, you can easily change the architecture and frame size to explore latency, area, speed, and throughput. In this video, we first choose burst mode for a minimum-resource FFT, but due to its high latency, we switch to the streaming radix-2 architecture, which provides low latency. The throughput is still not satisfactory, so by increasing the vector size to 8, we achieve 3.75 gigasamples per second.
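The throughput arithmetic behind that number is simply samples per cycle times clock frequency. As a small check (the 468.75 MHz clock is inferred from 3.75 GSPS at vector size 8, not stated in the talk):

```python
def throughput_gsps(vector_size, clock_mhz):
    # samples per cycle (vector/frame size) x clock rate, in GSPS
    return vector_size * clock_mhz / 1000.0

assert throughput_gsps(1, 468.75) < 0.5      # scalar mode: under 0.5 GSPS
assert throughput_gsps(8, 468.75) == 3.75    # vector size 8: 3.75 GSPS
```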

So let's look at a frame-based digital down-converter example that showcases how easily we can work through different input sizes and throughputs, and how easily we can navigate the area, speed, and throughput trade-offs. DDCs are widely used in digital communication receivers to convert radio frequency or intermediate frequency signals to baseband. The DDC operation shifts the signal to a lower frequency and reduces its sampling rate to facilitate subsequent processing stages. The DDC consists of an NCO, a CIC decimator, a CIC compensation filter, and two halfband filters.
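Conceptually, the DDC front end can be sketched in a few lines of plain Python (illustrative values, not the example's actual specifications): an NCO generates a complex exponential that mixes the input down to baseband, and a boxcar average stands in for the CIC/halfband decimation chain:

```python
import cmath
import math

def ddc(x, fs, f_shift, decim):
    # NCO + mixer: multiply by exp(-j*2*pi*f_shift*n/fs) to shift
    # the band of interest down to 0 Hz
    mixed = [s * cmath.exp(-2j * math.pi * f_shift * n / fs)
             for n, s in enumerate(x)]
    # crude decimation stand-in: average each block of `decim` samples
    return [sum(mixed[i:i + decim]) / decim
            for i in range(0, len(mixed), decim)]

fs, f_if = 1000.0, 100.0
x = [math.cos(2 * math.pi * f_if * n / fs) for n in range(1000)]
y = ddc(x, fs, f_if, decim=10)
# the IF tone lands at DC: the baseband output is (nearly) a constant 0.5
assert abs(sum(y) / len(y) - 0.5) < 1e-6
```

A real DDC replaces the boxcar with the CIC, compensation, and halfband stages, which give far better image rejection.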

In this example, we first design the DDC using the specifications and filters in MATLAB. This is our reference model, and it is in double precision. Then we create a model, change it to fixed point, and compare it with the double-precision version to make sure they have equivalent functionality.

From the fixed-point version, we generate HDL, and in this example, we define a parameter for the frame size. As you see here, we are trying a frame size of 4 for the frame-based DDC. If you look at the HDL DDC, that parameter is incorporated throughout the filter chain, so by changing one parameter, we are able to control the throughput of the design.

When we parameterize the model, we incorporate the frame size parameter in all the source blocks, such as the input stimulus and the NCO, to generate proper input data. The minimum number of cycles between valid inputs is used in all filter blocks to specify the spacing between valid samples. This spacing is used to optimize hardware resources. The frame size affects the spacing parameter, so it is very important to specify it correctly in the DDC example.
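To see why the spacing grows along the chain, consider how decimation widens the gap between valid samples (a plain-Python illustration; the decimation factors below are assumed for illustration, not the example's actual values):

```python
def propagate_spacing(input_spacing, decim_factors):
    """Spacing (cycles between valid samples) seen at each stage's input."""
    spacings = [input_spacing]
    for d in decim_factors:
        # a decimate-by-d stage emits one valid sample per d valid inputs,
        # so the gap between valid samples grows by a factor of d
        spacings.append(spacings[-1] * d)
    return spacings

# valid input every cycle, then a CIC /4 followed by two halfband /2 stages:
assert propagate_spacing(1, [4, 2, 2]) == [1, 4, 8, 16]
```

Those idle cycles downstream are exactly what the later filters exploit to share hardware.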

After code generation, I go through synthesis and place and route. We were expecting to see four times more resources for the frame-processing DDC versus the scalar-processing DDC, but what you see here is that we have actually used only about twice the resources for frame processing. And while we increase the frame size by a factor of 4, the clock frequency drops by just 1%.

In return, we have achieved around 1.6 gigasamples per second of throughput, which is four times higher than the scalar mode. Let's look at how that was possible. In the frame-based DDC example, we use the frame-based CIC decimator to reduce the sampling rate, so the following filters require far fewer multipliers and therefore fewer DSP resources.

The frame-based CIC decimator provides a simple way to bring down the sampling rate significantly without using a lot of resources or any multipliers. DSP HDL Toolbox offers a unique solution for frame-based CIC decimators and interpolators that makes super-sample-rate DDCs and DUCs possible.
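The multiplier-free property comes from the CIC structure itself: integrators at the input rate, a downsampler, then comb (difference) stages. A minimal plain-Python sketch (not the toolbox implementation) that uses only adders and delays:

```python
def cic_decimate(x, R, N):
    """N-stage CIC decimator by R: integrators, downsample, combs."""
    accs = [0] * N
    integrated = []
    for s in x:
        v = s
        for i in range(N):
            accs[i] += v            # integrator stage: running sum
            v = accs[i]
        integrated.append(v)
    down = integrated[R - 1::R]     # keep every R-th sample
    delays = [0] * N
    out = []
    for s in down:
        v = s
        for i in range(N):
            v, delays[i] = v - delays[i], v   # comb stage: difference
        out.append(v)
    return out

# DC gain of an N-stage, decimate-by-R CIC is R**N: here 4**3 = 64
assert cic_decimate([1] * 80, R=4, N=3)[-1] == 64
```

Note there is not a single multiplication in the data path, which is why CIC is the natural first stage of a high-rate decimation chain.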

Second, DSP HDL Toolbox uses the spacing (if you remember, the minimum number of cycles between valid inputs) to share resources. In the table, you see the shared multipliers in each filter of the chain. Another secret sauce is the automatic architecture selection mechanism that DSP HDL Toolbox uses.
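As a rough back-of-the-envelope illustration (not the toolbox's exact sharing rule), idle cycles between valid samples let one physical multiplier be time-shared across several taps, and symmetric coefficients halve the tap count first:

```python
import math

def shared_multipliers(num_taps, spacing, symmetric=False):
    """Estimate of physical multipliers after time-sharing."""
    taps = math.ceil(num_taps / 2) if symmetric else num_taps
    return math.ceil(taps / spacing)

assert shared_multipliers(32, spacing=1) == 32   # fully parallel
assert shared_multipliers(32, spacing=4) == 8    # 4 idle cycles: 4x sharing
assert shared_multipliers(32, spacing=4, symmetric=True) == 4
```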

This flowchart shows which filter architecture results from your parameter settings for a FIR decimator. It also shows the number of multipliers used by the filter implementation. The filter architecture is automatically selected during code generation based on the input frame size, the decimation factor, the spacing, and the number of filter coefficients. Another example is the discrete FIR filter, which supports eight different architectures. The direct-form systolic architecture provides a fully parallel implementation that makes efficient use of Intel and Xilinx DSP blocks (I should say, Intel and AMD DSP blocks).
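The flavor of such a flowchart can be captured as a simple decision function. This is a hypothetical stand-in, the real selection rules in DSP HDL Toolbox also weigh the decimation factor and coefficient count:

```python
def select_fir_architecture(frame_size, spacing):
    """Toy stand-in for an automatic-architecture-selection flowchart."""
    if frame_size > 1:
        # multiple samples per cycle demand a parallel, frame-based datapath
        return "frame-based parallel"
    if spacing > 1:
        # idle cycles between valid samples allow multiplier sharing
        return "partly serial (shared multipliers)"
    return "direct-form systolic (fully parallel)"

assert select_fir_architecture(4, 1) == "frame-based parallel"
assert select_fir_architecture(1, 4) == "partly serial (shared multipliers)"
assert select_fir_architecture(1, 1) == "direct-form systolic (fully parallel)"
```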

So with the advanced, hardware-ready blocks of DSP HDL Toolbox, engineers can easily scale their designs from kilosamples per second to megasamples per second, up to gigasamples per second of DSP processing. As you see here, you can use the same model to work through the throughput trade-off. You don't need to change the design; we change the architecture of the filters for you automatically.

What we have learned so far is that the DSP HDL Toolbox blocks are highly optimized: they support frame processing; facilitate architecture, throughput, and system-level exploration; and support HDL code generation and deployment to FPGAs and ASICs.

Now I want to talk about another remarkable characteristic of the DSP HDL Toolbox blocks: simulation with hardware latency. Knowing the latency of each module and block is vital for system-level simulation. Therefore, in recent years, we have spent time facilitating this task by implementing new and advanced DSP blocks that simulate the hardware latency, helping users see the latency based on the block settings. This is especially useful in applications where timing and synchronization are critical.

It's worth mentioning that the Wireless HDL Toolbox library also simulates with hardware latency. You may notice some earlier blocks that do not indicate the latency, but underneath they are indeed simulating with hardware latency to provide a cycle-accurate simulation. Vision HDL Toolbox also provides IP blocks that use control signals, including valid in and valid out.

Fixed-Point Designer provides HDL-optimized implementations of math operations, linear system solvers, and matrix factorizations, which are challenging to design for hardware with limited multipliers. Similar to DSP HDL Toolbox, these provide many different architectures that offer different latencies. There is a table to help designers choose a proper architecture based on latency and throughput criteria.

To provide a cycle-accurate simulation of the generated code, the blocks model the architectural latency, including pipeline registers and resource sharing. You can find algorithm and implementation details in the Help documentation.

The latency between valid input data and the corresponding valid output data depends on block parameters such as the block architecture, the input frame size, the spacing between validIn samples, and the filter or FFT length. For example, for the CIC decimator block, the latency depends on the input size, the gain correction, the decimation factor, and the number of sections. Here, you see the latency increase for a CIC decimator filter with two different settings. When you design a system, you can easily obtain this information before generating code and selecting your hardware.

I want to highlight that the DSP HDL Toolbox IPs provide hardware-ready blocks and subsystems for developing signal processing applications such as wireless, radar, audio, and sensor processing. You can attend one of the DSP for FPGA training classes and contact the sales department to try a DSP HDL Toolbox license. DSP for FPGA is a three-day training course that reviews fundamental DSP implementations for FPGAs and ASICs. Thank you very much for attending the webinar.
