Overcoming Jitter in High-Speed Communications: Techniques and Testing with MATLAB

    Overview

    This webinar offers an in-depth exploration of jitter tolerance, a critical factor in the performance and reliability of high-speed communication systems. Attendees will be guided through a comprehensive workflow that encompasses the theoretical aspects of jitter and its sources, modeling and simulation techniques for analyzing jitter effects, and practical strategies for testing and improving jitter tolerance.

    While jitter tolerance (JTOL) testing is commonly performed on a scope in a lab, participants will gain hands-on insights into the process of identifying potential jitter issues, simulating their impact on system performance, and applying effective mitigation strategies. By the end of the session, attendees will have a solid foundation in the principles of jitter tolerance and be equipped with the knowledge to enhance the reliability and efficiency of their communication systems for PCIe, USB, Ethernet, etc.

    Highlights

    • Understanding Jitter and its Impact: Gain a solid understanding of what jitter is, its sources, and how it affects high-speed communication systems.
    • Modeling and Simulation for Jitter Analysis: Learn how to model and simulate communication systems to analyze the impact of jitter on system performance.
    • Practical Jitter Tolerance Testing: Discover strategies for testing jitter tolerance and ensuring your system meets the necessary performance criteria.
    • Mitigation Techniques: Explore practical techniques and strategies to minimize jitter and enhance system reliability.
    • Real-World Applications: See how these concepts apply to real-world scenarios, enhancing your ability to tackle jitter-related challenges in your projects.

    About the Presenter

    Andy Zambell is a Sr. Product Marketing Engineer for SerDes and Signal Integrity applications at MathWorks. Prior to joining MathWorks, Andy was a Signal Integrity Engineer at FCI USA LLC and Amphenol for a decade, where he specialized in new product development and customer support of high-speed backplane connectors. He was also involved in the development of industry standards such as SAS, IEEE 802.3, and OIF CEI. He received a B.S. in Physics from Lebanon Valley College and an M.E. in Electrical Engineering from Penn State University.

    Recorded: 16 Oct 2024

    Hello, my name is Andy Zambell. I've been a product marketing manager at MathWorks for almost four years. Prior to joining MathWorks, I was a signal integrity engineer for about 10 years, where I developed high-speed connectors and participated in several standards committees like OIF and IEEE 802.3. Today I'll be presenting to you about jitter and jitter tolerance testing.

    I'm going to start off by talking about the SerDes and Signal Integrity solutions we have here at MathWorks, just in case you weren't aware of what we have to offer. Then I'm going to go over what jitter is and the different types, and finally go over jitter tolerance testing in MATLAB. I'll wrap things up with a few next steps for you to try as well.

    The SerDes and Signal Integrity workflow from MathWorks focuses on the design and verification of wireline communications and high-speed I/O. So we're talking about systems composed of a transmitter and a receiver, connected through a channel made up of things like cables, connectors, and PCBs, including lossy transmission lines and vias. With this suite of tools, you can design, simulate, and analyze a complete end-to-end high-speed serial or parallel channel.

    So first, SerDes Toolbox can help you address challenges by providing a powerful set of simulation and modeling tools to design and test chip-to-chip links that meet your required SerDes specifications for high-speed reliable data transmission. This includes industry-standard IBIS-AMI modeling and generation, which stems from MathWorks' leadership in the IBIS standards committee.

    Furthermore, SerDes Toolbox supports a wide range of standard communication protocols like Ethernet, USB, PCIe, DDR, and more, thus allowing you to easily integrate chip designs with other products and technologies in the industry. Overall, SerDes Toolbox can serve as a valuable tool for you to create and optimize your designs.

    Signal Integrity Toolbox provides the framework for analyzing different types of high-speed serial and parallel links, again including Ethernet, USB, PCIe, DDR, and more. With it, you can perform pre-layout analysis to simulate and analyze the behavior of signals with a variety of impairments, including the effects of things like loss, reflection, and crosstalk. With Signal Integrity Toolbox you can perform a range of simulations and analyses on your design to optimize the signal performance and minimize signal integrity issues.

    RF PCB Toolbox provides functions and apps for designing, analyzing, and visualizing high-speed and RF PCBs. You can design components with parameterized or arbitrary geometry, including distributed passive structures such as traces, bends, and vias. Adding RF PCB Toolbox to Signal Integrity Toolbox allows for post-layout verification of a single or multiple PCBs from a variety of PCB CAD tools.

    And finally, Mixed-Signal Blockset provides models of components, impairments, analysis tools, and test benches for designing and verifying mixed-signal ICs. You can model PLLs, data converters, and other systems at different levels of abstraction.

    These models can be used to simulate mixed-signal components together with complex DSP algorithms and control logic. You can customize models to include impairments such as noise, non-linearity, jitter, and quantization effects. System-level simulations using Simulink let you debug and identify design flaws without simulating the IC at the transistor level. You can also import databases of circuit-level simulation results to analyze, identify trends in, and visualize mixed-signal data.

    So jumping into jitter. What is jitter, really? Well, according to Wikipedia, jitter is, quote, "the deviation from true periodicity of a presumably periodic signal, often in relationship to a reference clock signal."

    Basically, in the case shown here, it's when the signal is supposed to switch from a 0 to a 1, or vice versa, and it does so either a little late or a little early. There are also potential implications when it comes to a clock that I'll show in a minute. But you can think of jitter as a shift in time.

    Here's an animation showing the difference between a simulation with no jitter added and another with some extra jitter. You can see that on the left, the signal is passing the zero threshold anywhere between about 98 picoseconds and 102 picoseconds. On the right, however, the signal crosses the zero threshold between approximately 93 picoseconds and 107 picoseconds. This extra jitter, depending on how it affects the clock, could cause some errors.
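
    To make that concrete, here's a minimal sketch in plain MATLAB (not SerDes Toolbox; the 100 picosecond unit interval matches the example, but the jitter amounts are assumed for illustration) that adds sinusoidal and random jitter to the edge times of an ideal clock and measures how far the crossings spread:

        ui     = 100e-12;             % unit interval, 100 ps as in the example
        nEdges = 2000;                % number of edges to generate
        tIdeal = (1:nEdges)*ui;       % ideal, perfectly periodic edge times

        sjAmp  = 0.05*ui;             % sinusoidal jitter amplitude (assumed)
        sjFreq = 2e6;                 % sinusoidal jitter frequency (assumed)
        rjRms  = 0.01*ui;             % random jitter RMS (assumed)
        tJittered = tIdeal + sjAmp*sin(2*pi*sjFreq*tIdeal) + rjRms*randn(1,nEdges);

        tie = tJittered - tIdeal;     % time interval error of each edge
        fprintf('Edge spread: %.1f ps peak-to-peak, %.2f ps RMS\n', ...
            (max(tie)-min(tie))*1e12, std(tie)*1e12);
        histogram(tie*1e12); xlabel('Time interval error (ps)'); ylabel('Count');

    Increasing either jitter term widens the spread of crossing times, just like the wider 93 to 107 picosecond spread on the right of the animation.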

    How exactly could this cause some errors, though? To understand, let's take a look at this picture of a bathtub curve and a clock PDF. As the picture shows, jitter can affect the bathtub curve by bringing its walls in, making it smaller and narrower. This is that shift in the timing of the signal, either a little early or a little late, that I mentioned earlier.

    It can also affect the clock probability density function, or clock PDF, by making it wider for the same reason: by shifting the clock earlier or later in time, it causes the PDF shown to increase in width and cover more time.

    Since the data bathtub curve and the clock PDF are both used in the bit error rate, or BER, calculation, changing either of them changes the BER. And since jitter makes the bathtub curve and/or the clock PDF worse, it inherently makes the BER worse as well.

    You can see this if you imagine the vertical parts of the green bathtub curve getting closer together while, at the same time, the blue curve gets wider. This causes more overlap between the green and blue curves, thus causing the red area, the BER, to get larger. So that's how jitter can affect bathtub curves, clock PDFs, and ultimately the BER.
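
    As a rough illustration of that overlap, here's a small plain-MATLAB sketch that assumes Gaussian jitter on both the data edges and the clock (the jitter values are made up): the BER is approximated as the overlap integral of the data bathtub curve and the clock PDF, so a wider clock PDF gives a larger BER.

        ui = 100e-12;                        % 100 ps unit interval
        t  = linspace(0, ui, 2001);          % sampling position across the UI
        Q  = @(x) 0.5*erfc(x/sqrt(2));       % Gaussian tail probability

        rjData  = 5e-12;                             % RMS jitter of the data edges (assumed)
        bathtub = Q(t/rjData) + Q((ui - t)/rjData);  % error probability vs. sampling time

        for rjClk = [5e-12 8e-12]            % two clock jitter values to compare (assumed)
            clkPdf = exp(-(t - ui/2).^2/(2*rjClk^2)) / (rjClk*sqrt(2*pi));
            ber    = trapz(t, clkPdf .* bathtub);    % overlap of clock PDF and bathtub
            fprintf('Clock RMS jitter %.0f ps -> BER ~ %.1e\n', rjClk*1e12, ber);
        end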

    There are several different types of jitter, and they can be grouped in two ways, which I'll get to in a minute. First, jitter can be divided up as either deterministic or random. For me, the easiest way to think about these two is that deterministic jitter is predictable, while random jitter is unpredictable. I'm not going to spend any time on random jitter, but it can come from things like thermal noise and other random fluctuations.

    Deterministic jitter can be broken down by its sources. Bounded uncorrelated jitter is jitter that is independent of the data pattern; an example of it is crosstalk. Periodic jitter is how much the clock deviates from where it should be, and data-dependent jitter comes from the data pattern itself being transmitted.

    Data-dependent jitter can be broken down into duty cycle distortion and intersymbol interference, and a type of periodic jitter is sinusoidal jitter, which is used in the jitter tolerance testing we're about to do.

    One way to group jitter is by whether it's bounded or unbounded. Bounded jitter has upper and lower amplitude limits, while unbounded jitter theoretically does not. Another way to group jitter is by whether it's correlated or uncorrelated. Correlated jitter is connected to the data pattern itself, while uncorrelated jitter is not; it comes from something unrelated to the data pattern, like a clock, crosstalk, or thermal noise.

    If you're interested in jitter decomposition, I invite you to check out the new jitter function that just came out in the R2024b release. With it, you can give the function a waveform, and based on the symbol time, the sample interval, or even a reference waveform, you can measure various types of jitter. There's more that can be done with this new function, so you should really take a look at it. Also, this new jitter function is based on the IEEE 2414 standard for jitter and phase noise, which means it's not based on some proprietary algorithm but on an actual industry standard.
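
    As a conceptual sketch only (this is plain MATLAB, not the new jitter function; check its documentation for the real syntax, and the waveform and jitter values below are made up), measuring jitter comes down to comparing a waveform's threshold crossings against an ideal reference derived from the symbol time:

        symTime = 100e-12;                     % symbol time (assumed 100 ps)
        dt      = 1e-12;                       % sample interval (assumed 1 ps)
        t       = 0:dt:2000*symTime;
        jit     = 3e-12*sin(2*pi*1e6*t) + 1e-12*sin(2*pi*7e6*t);  % injected periodic jitter
        wave    = sin(pi*(t + jit)/symTime);   % toy clock-like signal with jitter

        % Rising zero crossings, located by linear interpolation between samples.
        k      = find(wave(1:end-1) < 0 & wave(2:end) >= 0);
        tCross = t(k) - wave(k).*dt./(wave(k+1) - wave(k));

        % Time interval error: deviation of each crossing from the nearest ideal edge.
        tie = tCross - round(tCross/(2*symTime))*(2*symTime);
        fprintf('Estimated RMS jitter: %.2f ps\n', std(tie)*1e12);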

    Next, I want to go over the inner workings of a clock and data recovery, or CDR, circuit by looking at a phase-locked loop, or PLL, like the diagram we have here. The phase frequency detector, or PFD, compares the phase and frequency between two signals and produces output signals that differ in duty cycle. The difference in duty cycle is proportional to the phase difference between the two input signals, so sometimes higher and sometimes lower.

    In frequency synthesis circuits, such as this PLL here, the PFD block compares the phase and frequency between the reference signal and a signal generated by the Voltage Controlled Oscillator or VCO. And that determines the phase error.

    Next, the signal goes into the charge pump block, which produces an output current that is proportional to the difference in the duty cycles between the signals coming out of the PFD block. In a PLL, the charge pump converts this phase error from the PFD block into a current and then sends it off to the loop filter.

    The signal from the charge pump passes through the loop filter, which is a low-pass filter, and delivers the control voltage to the VCO to generate a frequency. The VCO produces an output square-wave signal whose frequency is controlled by the input voltage coming from the loop filter. Then the divider divides the frequency coming from the VCO, and this lower frequency is compared to the reference input at the PFD block, rinse and repeat.

    So having said all that, the PLL is really just a negative feedback system that keeps the phase of a signal in check. If the phase starts to drift one way or the other, the circuit will just bring it back in line as long as the phases of the two signals are not drastically different. If they are, the circuit probably won't be able to compensate.
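
    Here's a rough phase-domain sketch of that feedback loop in plain MATLAB (a hand-rolled behavioral model with assumed gains and frequencies, not a Mixed-Signal Blockset model): the PFD output is the phase error, a proportional-plus-integral path stands in for the charge pump and loop filter, and the VCO frequency moves until the divided output tracks the reference.

        fRef  = 100e6;               % reference frequency (assumed)
        N     = 10;                  % feedback divider ratio (assumed)
        fVco0 = 0.95e9;              % VCO free-running frequency, deliberately off target
        Kvco  = 50e6;                % VCO gain in Hz/V (assumed)
        Kp    = 0.28;  Ki = 1.3e-4;  % loop-filter proportional/integral gains (assumed)
        dt    = 1/(100*fRef);        % simulation time step

        phiRef = 0; phiDiv = 0; integ = 0; fVco = fVco0;
        for k = 1:2e5
            phiRef = phiRef + 2*pi*fRef*dt;        % reference phase accumulates
            phiDiv = phiDiv + 2*pi*(fVco/N)*dt;    % divided VCO phase accumulates
            phErr  = phiRef - phiDiv;              % PFD: phase/frequency error
            integ  = integ + Ki*phErr;             % integral path of the loop filter
            vCtrl  = Kp*phErr + integ;             % charge pump + loop filter output
            fVco   = fVco0 + Kvco*vCtrl;           % VCO frequency set by control voltage
        end
        fprintf('VCO settled at %.4f GHz (target %.1f GHz)\n', fVco/1e9, N*fRef/1e9);

    Because of the integral path, this loop pulls the VCO from its free-running 0.95 GHz all the way to the 1 GHz target with no residual phase error.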

    So that was looking at things in a block diagram. But what about the signals themselves? There are different types of phase detectors; for example, an Alexander or bang-bang phase detector samples the received waveform at the edge and middle of each symbol. The edge sample e(n) and the data samples d(n-1) and d(n), as shown here, are processed to determine whether the edge sample, and thus the clock phase, is early or late.

    Driving the VCO directly from the phase detector output results in excessive clock jitter. This is where the loop filter comes in. To reduce the jitter, the phase detector output is low-pass filtered by accumulating it in a vote. When the accumulated vote exceeds a specified count threshold, the phase of the VCO is incremented or decremented. So let's take a look at what that looks like.
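
    Here's a minimal behavioral sketch of that voting scheme in plain MATLAB (a toy model with an assumed true sampling phase and assumed random jitter, not the SerDes Toolbox CDR block): each symbol produces a noisy early/late vote, and the phase only steps by a fixed amount once the accumulated vote reaches the count threshold.

        nSym      = 5000;
        threshold = 8;             % vote count threshold (as in the example)
        stepSize  = 1/128;         % phase step in UI (as in the example)
        truePhase = 0.3;           % optimal sampling phase in UI (assumed)
        phase     = 0;             % CDR's current phase estimate in UI
        vote      = 0;
        history   = zeros(1, nSym);

        for n = 1:nSym
            % Bang-bang decision: is the (jittered) edge early or late
            % relative to the current phase? Individual votes are noisy.
            noisyEdge = truePhase + 0.02*randn;        % assumed random jitter
            vote = vote + sign(noisyEdge - phase);     % +1 = late, -1 = early
            if abs(vote) >= threshold                  % low-pass filtering by voting
                phase = phase + stepSize*sign(vote);   % bump the recovered clock phase
                vote  = 0;                             % reset the accumulator
            end
            history(n) = phase;
        end
        plot(history); xlabel('Symbol'); ylabel('Recovered clock phase (UI)');

    Raising the threshold stretches the dithering period, and enlarging the step size enlarges the dithering magnitude, which is exactly what the next few slides show.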

    The baseline behavior is shown with an eye diagram and the resulting clock PDF. As you can see, the clock PDF is very near the center of the eye. I'll change my pointer to a laser pointer so you can see this clock PDF shown here. The clock phase settles between 0.570 and 0.578 symbol time, as indicated by the red line in the graph on the top right, this red line going horizontally right here.

    The dithering, or indecision, between the two values is a consequence of the non-linear bang-bang phase detector and is a source of CDR hunting jitter. To reduce the magnitude of the dithering, you can reduce the phase step size. Also, to reduce the period of the dithering, you can reduce the vote count threshold.

    The output of the phase detector is accumulated in the early-late vote count. And if or when the count exceeds the vote count threshold that was set, shown with the red dashed line in the graph in the bottom right, the phase is incremented or decremented.

    So this graph is just a zoomed-in version of the one on the previous slide, and it shows just the first 350 symbols of the early-late count shown in blue and the threshold shown as the dashed red line. Internal to the CDR, the vote is incremented or decremented, checked against the threshold, and then reset if necessary. The external vote value shown in the figure doesn't touch the threshold in this case, but you can tell when the vote was reset to zero, which is this black horizontal line. So you can see as the vote count gets high, it gets reset back down to zero. And you can see that happens a few times again over here as well.

    So what happens if you change something? Here the channel loss is decreased from 4 dB down to 2 dB, which shows the clock converging to a different phase. The clock phase now adapts to around 0.35 symbol time, whereas before it was converging to about 0.57 symbol time. And you can see this in the clock PDF as well as in this graph here, with the red horizontal line converging around 0.35 symbol time.

    And like I mentioned earlier, increasing the vote count threshold from 8 to 16 results in a larger dithering period, shown in this figure right here. You can see that the period in this graph is greater than in the one prior. And increasing the phase step size from 1/128 to 1/64 increases the dithering magnitude here. You can see that the magnitude is larger than the one on the left and the one on the previous slide as well.

    So that was a quick tutorial on how a CDR operates. Now we'll look at how jitter can affect things. Jitter tolerance testing, or JTOL, plays a vital role in assessing and validating high-speed communications. This process evaluates the performance of a system's CDR by varying the amplitude and frequency of injected sinusoidal jitter. The CDR can effectively track and adjust for jitter when the magnitude and frequency fall within the CDR's bandwidth.

    However, as the jitter's amplitude and frequency rise beyond this range, the CDR's tracking ability diminishes. So there are two nice things about jitter tolerance testing: one, demonstrating adherence to industry standards, which helps ensure that a system will be able to operate properly; and two, ensuring that the CDR design possesses the required capabilities, such as bandwidth, for instance.

    The figure on the left depicts a typical JTOL representation found in industry standards. It includes a corner frequency, f_c, that specifies the minimum bandwidth of the system. A system is compliant if it maintains its required performance within the pass region. So for system designers, a JTOL test is crucial for determining the clock recovery bandwidth profile. Clock recovery units handle the phase and frequency of incoming signals.

    A successful 20-point JTOL test is shown by the blue curve in the graph on the right, with a CDR bandwidth of about 4 MHz shown by the green arrow. So designers must balance a CDR's bandwidth. If it's too low, the system can't track frequency changes, but if it's too high, it misrepresents noise as data.

    In the graph on the right, the green dots show successful simulations, while red dots indicate inadequate eye widths. So as the JTOL algorithm increases the jitter magnitude and frequency, it stresses the system until performance drops, marking the pass/fail boundary and thus defining the system's JTOL response.

    In addition to assessing the bandwidth, a JTOL test can reveal other CDR characteristics to the designer. In this graph, JTOL responses are shown across various second-order gains. With a second-order gain of 2e-11, shown in blue with squares, the JTOL response initially dips before stabilizing at higher frequencies. This indicates that the higher 2e-11 gain makes the system overly sensitive to the jitter, degrading performance by reacting to noise and amplifying jitter.

    In contrast, the dip is absent with a second-order gain of 2e-13, the yellow plot with the triangles. This means that the system with a 2e-13 gain exhibits lower bandwidth but more stable behavior, which shows that a higher CDR bandwidth isn't always better. This is something a designer needs to balance.

    How does the JTOL algorithm work in this example? Well, for each sinusoidal jitter frequency, the algorithm starts with an upward search, meaning that it slowly increases the jitter magnitude until it finds a case that fails. When this happens, the algorithm switches to a binary-type search where it brackets the jitter magnitude.

    It starts close to the last passing magnitude and ends close to the first failing magnitude and sweeps jitter magnitudes between those two. And finally, when it's converged on a magnitude, it moves on to the next frequency. And I'll explain what I mean by converged in a few minutes.
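
    In code form, the search at a single frequency might look something like this plain-MATLAB sketch. The logic is a paraphrase of the description above, and runLink is a hypothetical stand-in for the full time-domain link simulation, faked here with a toy eye-width model:

        eyeTarget = 40e-12;  eyeTol = 2e-12;   % 40 ps eye-width target, +/- 2 ps
        tolPct    = 0.01;                      % 1% convergence criterion
        % Hypothetical stand-in for the real simulation: returns eye width (s)
        % as a function of sinusoidal jitter magnitude (UI), with a little spread.
        runLink = @(sjMag) (60 - 180*sjMag + 0.5*randn)*1e-12;

        sjMag = 0.02;                          % starting jitter magnitude in UI (assumed)
        while runLink(sjMag) > eyeTarget + eyeTol
            lastPass = sjMag;                  % remember the last passing magnitude
            sjMag    = 1.5*sjMag;              % upward search: crank the magnitude up
        end
        lo = lastPass;  hi = sjMag;            % bracket: last pass and first fail

        while (hi - lo)/lo > tolPct            % binary-type search inside the bracket
            mid = (lo + hi)/2;
            ew  = runLink(mid);
            if abs(ew - eyeTarget) <= eyeTol
                lo = mid; hi = mid;            % converged: eye width within tolerance
            elseif ew > eyeTarget
                lo = mid;                      % still passing, raise the lower edge
            else
                hi = mid;                      % failing, lower the upper edge
            end
        end
        fprintf('JTOL boundary at this frequency: %.4f UI\n', (lo + hi)/2);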

    So these are the simulation parameters you give at the beginning of the simulation. You give it a start and stop frequency, which in this case is half a megahertz all the way up to 40 megahertz; the total number of logarithmically spaced frequency points, which in this case is just 20; and the number of simulations to run in parallel at one time, which is determined by how many simulations you're able to run in parallel. I set it to 11 because I have a 12-core CPU and I didn't want to totally bog down my computer.

    The maximum number of simulation batches in this case is 12, so what this means is how many batches of 11 simulations do I want to run? So in this case, it could be up to 12 times 11, which is 132 total simulations per frequency. So basically it's how many simulations do you want to run before you give up.

    Next is the maximum jitter magnitude. So how high do you want to crank up the magnitude until you call it? Next is the eye-width target as a percentage of UI. So in my example, I set the UI to 100 picoseconds, which makes the math nice and easy. So 40% of 100 is just 40 picoseconds, which is nice.

    Next is the tolerance of the eye width. So in other words, your plus or minus of the eye width target. So in my case, it's going to be 40 picoseconds plus or minus 2 picoseconds.

    And finally is the tolerance percentage. This comes into play if one simulation has a jitter magnitude of, say, x and its eye width is above the target, outside of the tolerance range, and another simulation with a magnitude of, say, y, is below the eye-width target, so it's failing. If the difference between x and y is less than this tolerance percentage, in this case 1%, the algorithm considers the search complete. And I'll show you what I mean by that in a minute.
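
    Collected in one place, the setup just described might look like the struct below. The field names are purely illustrative; they are not the variable names used in the shipped example.

        jtolCfg = struct( ...
            'StartFrequency',     0.5e6, ...   % 0.5 MHz
            'StopFrequency',      40e6,  ...   % 40 MHz
            'NumFrequencyPoints', 20,    ...   % logarithmically spaced points
            'NumParallelSims',    11,    ...   % one short of the 12 CPU cores
            'MaxSimBatches',      12,    ...   % up to 12 x 11 = 132 runs per frequency
            'MaxJitterMagnitude', 100,   ...   % ceiling on sinusoidal jitter, in UI
            'EyeWidthTargetPct',  0.40,  ...   % 40% of the 100 ps UI = 40 ps
            'EyeWidthTolerance',  2e-12, ...   % +/- 2 ps around the 40 ps target
            'TolerancePct',       0.01);       % 1% convergence criterion
        freqs = logspace(log10(jtolCfg.StartFrequency), ...
                         log10(jtolCfg.StopFrequency), jtolCfg.NumFrequencyPoints);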

    So this is what the simulation looks like, only sped up several times. I can't show you the simulation live for two reasons. One is because it takes up too much of my CPU, and I won't be able to do anything else while it's running. The second reason is that it takes too long to simulate. One of the drawbacks to simulating jitter tolerance testing is that it's slower than doing it in hardware. Hardware can process and collect the data much faster than in simulation.

    One nice thing is that simulations can be run in parallel, which helps speed things up. Simulations can be repeated exactly over and over, even in the presence of random noise, meaning the exact same simulation setup can be distributed in parallel to accelerate the search. So anyway, that's what it looks like. So let's actually just take a look at the results.

    This graph shows the sinusoidal jitter frequency on the x-axis and the magnitude on the y-axis. There are 20 frequency points ranging from half a megahertz to 40 megahertz, which again came from our setup. And the algorithm sweeps various magnitudes until it converges.

    You can also see that as the frequency increases, the pass/fail boundary decreases until it gets to a certain frequency and then starts increasing again. This is where you can find the CDR bandwidth, because beyond this value the CDR is basically just tracking noise.

    If we look at the second-to-last column of points, which corresponds to the second-to-last frequency, you can see the jitter magnitude being swept in this graph and the resulting eye width for each jitter magnitude. And if you look at the target lines, you'll see a bunch of points all clustered together there.

    If we zoom in on that cluster, you can see the algorithm searching for the magnitude closest to our target of 40%, or 40 picoseconds in this case. And it found it to be 0.170696 UI of jitter. This is what the algorithm does: for each frequency, it sweeps the magnitude until it finds a value within tolerance and then moves on to the next frequency.

    As these plots are being generated, there's also some text showing up in the command window. If we look at the second-to-last frequency, which corresponds to the plots we were just looking at on the previous slide, we can see three sets of simulation batches, which result in 33 runs.

    So in this instance, in this first set of brackets, it's doing a search between these two sets of jitter magnitudes, and it's looking for something close to our target of 40 picoseconds, or 40%. In this case, it didn't find anything, so it moves up to the next bracket. And if you look closely, the last number in the first bracket and the first number of the second bracket are within 1% of each other. So it just moves up 1% and then sweeps another set of values in this range.

    And then based off of the third set, you can deduce that it found something in between here, and now it's just honing in on a value. So it went in between 0.164 and 0.174, and it found a value close to 0.170. In total it ran 33 simulations, or three batches of 11 simulations. If you also look, you can see that sometimes it was successful in just two batches, and it can be successful in just one as well if you're lucky.

    There are four possible results you can get from this simulation. One says success. Another, for the first two frequency points, says sharp drop within tolerance or quasi-stable behavior detected; I'll get to these two in a second and show you what they look like. The last one is ceiling found, which means you hit the maximum of your sinusoidal jitter magnitude, which we set to 100 UI in this case. That never happened in this simulation, but if I were to lower the maximum magnitude, I could show it being hit.

    So what does the result look like when it says it's a sharp drop? This is what it looks like here. It ran one batch of simulations, which all just happened to be above the tolerance here, and it didn't find anything. Then it moved on to a second batch, and it just so happened that everything in the second batch failed.

    Because of the way it's designed, this last magnitude was 10, and then in the second batch it just moved up 1% to 10.1 and they all failed. This is where that tolerance percentage of 1% that we set up comes into play.

    Since it only moved up 1% from 10 to 10.1 and went from passing to failing, the algorithm calls it. It just interpolates between those two points, and it determined the magnitude to be 10.0289 UI in this case. If we were to decrease that tolerance percentage to, say, 0.1%, it would have run a third simulation batch to try to find something in between these two values.
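
    The interpolation step itself is just a one-liner. With made-up eye widths at the passing and failing magnitudes it looks like this (the real run's 10.0289 UI came from its own measured eye widths):

        magFail = 10.1;  eyeFail = 37.0e-12;   % eye width at the failing point (assumed)
        magPass = 10.0;  eyePass = 41.5e-12;   % eye width at the passing point (assumed)
        eyeTarget = 40e-12;                    % 40 ps target from the setup
        magBoundary = interp1([eyeFail eyePass], [magFail magPass], eyeTarget);
        fprintf('Interpolated JTOL magnitude: %.4f UI\n', magBoundary);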

    And the other result, the quasi-stable result, is a situation where there exist some failing cases and some passing cases, but the failing cases have a lower jitter magnitude than the passing cases, which is counterintuitive. You would think that if you increase the magnitude it should fail, and if you decrease the magnitude it should pass.

    This situation can happen when the adaptation loops are impacted differently by the jitter magnitude, such that some cases converge faster than others. Regardless of the specific cause of this behavior, the system is not performing well and the JTOL search can be suspended. But the algorithm itself did something similar to what it did in the previous example, where it interpolates between these two points to find an answer.

    And that brings us to the end. We went over what jitter is and how it can affect things, especially in a CDR, and then we went over the JTOL example and its results. Again, I apologize for not going into great detail with the example; it's just that the simulation takes so long, and I wouldn't be able to show it live anyway because it takes a lot of CPU power.

    So some suggestions for you as next steps are to try this example out for yourself. And when you get familiar with it, you can try changing some of the CDR settings and see what happens. You can try some of the things listed here, changing the first and second-order gains and things like that.

    Then you can move on to adding more equalization, maybe a CTLE or DFE. You can change the modulation: everything here was NRZ, but you can change it to PAM3, PAM4, PAM8, PAM16, whatever you want. Or you can play with the channel parameters, like the loss, things like that.

    And when you get a good handle on things, you can look up an industry standard and try to apply it to this example.

    I also invite you again to look at the new jitter function that just came out, and to check out the SerDes Toolbox product page to find out more information. Also, feel free to reach out to me for more information about anything you saw here today or about our SerDes and Signal Integrity workflows here at MathWorks. Thanks for taking the time to hang out with me today. I really appreciate it.