Alias-suppressed waveforms in hardware: DPW, PolyBLEP, other ideas?

DSP, Plugin and Host development discussion.

Post

I'm trying to implement a ramp generator in hardware using differentiated polynomial waveforms (DPW) or PolyBLEP (discussed before). (Anyone have a link to Välimäki's original paper on PolyBLEP? I've found references to it, but no actual paper.)

I'd like to share some of the stuff I've worked out along the way. Particularly, generating only ramps with DPW; scaling factor considerations for DPW; and division for PolyBLEP, which basically has me stopped cold. I don't have comparisons for DPW vs. PolyBLEP; at some point I'll work out a Python implementation using (likely) FxpMath, and then translate that approach into hardware using Amaranth.

I'm using offset ramps (ramp(p) - ramp(p+offset)) to generate pulse waves, although I'm not 100% sure DPW will work starting from a phase offset. Generating triangle waves and tilted triangle waves is easy from there: DPW differences an integrated trivial waveform, for example a fourth-order integral is differenced three times to get back the first-order function. A triangle is the integral of a square wave, so if you perform only two of the three differences when generating a square wave, you get the corresponding triangle wave. The peak of the triangle will correspond to the falling edge of the pulse, e.g. a 20% duty cycle pulse wave, integrated, will have a peak of +1 at 1/5 of its period, and then run back to -1 at the end/beginning of its period.

One question I haven't figured out about DPW: what happens if I have y=0 and move my phase forward 50% or 25%? With continuous functions this works, but moves the waveform down (DC offset); however, continuous functions behave differently from discretized ones, and don't bandlimit the signal. If it still works when discretized, then this is fine; I just need 2 ramp generators to produce a single ramp, pulse, and triangle oscillator.

I'll also need to find out if this works with vibrato. Changing the frequency essentially means changing the step size, which I'm not sure will work the way I want... :?

Välimäki et al. recommend combining second- and fourth-order DPW generators (DPW2 and DPW4) to deal with the large scaling factor. They diagram it like this:
[attached image: Screenshot 2023-12-21 122918.png]
According to Välimäki et al.:
A practical implementation structure for DPW sawtooth synthesis is presented in Fig. 16, where low-frequency tones are produced by a second-order DPW method and high frequencies are produced using a fourth-order method. The crossover frequency may be about 500 Hz. Using a combination of two DPW algorithms it is possible to avoid a very large scale factor at low fundamental frequencies but still maintain a good audio quality at high fundamental frequencies.
Their frequency-dependent scaling factors are pi/(4 sin(pi/P)) for DPW2 and pi^3/(24 [2 sin(pi/P)]^3) for DPW4. At a frequency of 27.5Hz (A0) and a 48kHz sample rate, P = 48000/27.5, these come to 437 and 27,696,514, rounded up. log2(437) ≈ 8.8 and log2(27,696,514) ≈ 24.7; therefore, your options are either to use a crossover centered at 500Hz (from 400Hz to 600Hz) or to add 25 bits to the DPW4 sample width. You need an extra 9 bits to maintain precision with DPW2, and an extra 14 bits to maintain precision with DPW4 down to 400Hz, so skipping the crossover only adds 11 more bits to the DPW4 sample precision. The sample can be cut down to its native depth (e.g. 24-bit integer or single-precision float) after scaling, so very few registers and large-width adds and multiplies are necessary.
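
Here's a quick Python sketch (mine, untested in hardware, just using the scale-factor formula as quoted above) that reproduces those numbers and the bit counts:

```python
import math

def dpw_scale(order, f0, fs):
    """DPW scale factor c_N = pi^(N-1) / (N! * [2*sin(pi/P)]^(N-1)), with P = fs/f0."""
    P = fs / f0
    s = 2.0 * math.sin(math.pi / P)
    return math.pi ** (order - 1) / (math.factorial(order) * s ** (order - 1))

fs = 48000.0
for f0 in (27.5, 400.0, 500.0):
    c2, c4 = dpw_scale(2, f0, fs), dpw_scale(4, f0, fs)
    print(f"f0={f0:6.1f}Hz  c2={c2:10.1f} ({math.log2(c2):4.1f} bits)"
          f"  c4={c4:14.1f} ({math.log2(c4):4.1f} bits)")
```

At 27.5Hz it prints roughly 436.5 and 2.77e7, which is where the 9-bit and 25-bit figures above come from.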

Based on all of that, it's cheaper to just use wider data types. For each waveform generated, DPW4 needs 1 multiply and 1 multiply-add (a special case because the polynomial is x^4 - 2x^2, so with x2 = x^2 you compute a = x2*x2 - (x2<<1), whereas a non-power-of-2 coefficient would need an extra multiply), plus 3 additions (differencing) and 1 multiply (scaling). That's 3 multipliers, 3 adders, and 3 registers each made 11 bits wider. DPW2 requires 1 multiply, 1 scaling multiply, and 1 differencing addition, all carrying 9 extra bits of precision on top of the target sample size, e.g. 33 bits wide instead of 24. The savings would be minimal, and then you need an extra register and two additional multipliers (well, a multiplier and a fused multiply-add) to combine the two, plus control circuitry to determine how much to multiply by.
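
To pin down that operation count, here's a bare-bones floating-point DPW4 saw (a sketch of my understanding of the algorithm, not the fixed-point hardware version):

```python
import math

class DPW4Saw:
    """Fourth-order DPW sawtooth sketch: square, polynomial, three differences, scale."""
    def __init__(self, f0, fs):
        self.dp = f0 / fs                              # phase increment
        s = 2.0 * math.sin(math.pi * f0 / fs)
        self.scale = math.pi ** 3 / (24.0 * s ** 3)    # DPW4 scale factor
        self.phase = 0.0
        self.z = [0.0, 0.0, 0.0]                       # differentiator states (z^-1)

    def tick(self):
        x = 2.0 * self.phase - 1.0                     # trivial saw in [-1, 1)
        self.phase = (self.phase + self.dp) % 1.0
        x2 = x * x
        y = x2 * x2 - 2.0 * x2                         # the x^4 - 2x^2 polynomial
        for i in range(3):                             # three first differences
            y, self.z[i] = y - self.z[i], y
        return y * self.scale                          # output settles after 3 samples
```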

I've partially diagrammed their suggested DPW2+DPW4 system below. Each of the differencers (with z^-1 above them) is subtracting the input from z^-1, not adding.
[attached image: Screenshot 2023-12-21 141140.png]
The +, ×, and z^-1 along the top implement DPW2. What I didn't diagram were the additional calculations: the DPW2 scale factor (another multiply), determining the mix (a subtraction and multiply, with conditional tests), and completing the mix (two more multiplies and a subtraction). The alternative to all of that is, again, to make the registers and operators along the bottom row 11 bits wider, which I think is the better approach.

All of that is fascinating, but I haven't compared it to PolyBLEP.

My issue with PolyBLEP is I can replace dt with (f0 * 1/fs) and hard-code 1/fs to avoid dividing by the sample rate; but whereas I need to divide by fs/f0 to compute the scaling factor in DPW (thus multiply by f0*1/fs), I need to divide by f0/fs (thus multiply by fs*1/f0) to compute PolyBLEP.

Division is hard.

There's no really fast way to approximate 1/x to a low error.

I'm starting to think it's more viable to just generate 20 sines per waveform and use a sum of sines. For the 3 channels in an AY-3 enhanced with 3 waveform types, vibrato, and chorus effect, that'd be only 240 sine generators. That or maybe modal synthesis?

Post

For PolyBLEPs I'd actually recommend my old tutorial viewtopic.php?t=398553 though BLEPs (poly or otherwise) fundamentally require division to solve the linear equation that gives you the sub-sample offset for crossing a phase threshold. Realistically you might also want some control flow.
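
For reference, the usual 2-sample polyBLEP saw looks something like this (a quick Python sketch of the textbook form, not the code from the tutorial); the t /= dt is exactly the division in question:

```python
def polyblep(t, dt):
    """2-sample polynomial BLEP residual; t is phase in [0,1), dt = f0/fs."""
    if t < dt:                       # just after the wrap
        t /= dt                      # the unavoidable division
        return t + t - t * t - 1.0
    elif t > 1.0 - dt:               # just before the wrap
        t = (t - 1.0) / dt
        return t * t + t + t + 1.0
    return 0.0

def saw_polyblep(phase, dt):
    """Naive saw in [-1,1) minus the BLEP residual at the discontinuity."""
    return (2.0 * phase - 1.0) - polyblep(phase, dt)
```
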

It's been quite a long time since I tried DPW, since it's sort of in the category of "slightly better than naive" algorithms (usually I'm more into higher-quality and arguably more expensive algorithms). Basic 2-sample PolyBLEPs aren't necessarily great either, and the polynomials for higher orders get a bit expensive to compute, so on general-purpose systems it quickly makes more sense to just tabulate. On modern desktop systems it's mostly a non-issue to use BLEPs of 16-32 taps (or even more), which (especially if you throw in a factor of 2 oversampling) is basically "perfect" (whereas the threshold of "good enough" in some academic papers leaves a lot to be desired).

That said, I'd imagine just about everything about the BLEP algorithm, especially with longer tabulated BLEPs, is better suited to a general-purpose computer or at least a DSP processor than straight hardware, because much of the efficiency comes from the fact that the work is only done when there's actually a transition that needs a BLEP inserted. This also kinda means some amount of buffering makes sense to amortise the cost.

If you're planning on FPGA (not that I'm an expert on those; zero practical experience, but perhaps I have some idea of how they work), perhaps it'd make sense to keep things simple and just bump the sampling rate by a moderate factor? Oversampling alone is not great for anti-aliasing (unless you can manage something like a factor of 1024 or whatever, which might be a bit much; you gain an extra 6dB for saw/square every time you double the factor), but it generally improves what you get out of any other aliasing mitigation strategy, so you might be able to get away with a lower-order DPW, which then simplifies your numerical precision issues.

Jumping phase with DPW means your finite difference comes out as a spike, and you probably don't want that. I have no idea if it works in practice, but the first thing that comes to mind working in fixed point would be just forcing continuity (i.e. only changing the slope when the phase jumps) and then relying on modulo arithmetic in the differencing? The absolute level of integrated waveforms doesn't matter after all, just how the derivatives (or rather finite differences) vary.

No idea if any of my thoughts are useful, perhaps there's someone around with actual FPGA experience that can give more meaningful advice... but most of us mostly do software I guess.

Post

jrmoserbaltimore wrote: Thu Dec 21, 2023 8:21 pm Division is hard.

There's no really fast way to approximate 1/x to a low error.
It's not straightforward, but there are ways to compute divisions (or estimates of them) within reasonable time.

One of them is doing a lookup and then a Newton–Raphson iteration on it (and variations of that; look it up on the internet).
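
A rough sketch of that in Python (my numbers are arbitrary; the table size and iteration count are the tuning knobs):

```python
def reciprocal(x, table_bits=6, iterations=2):
    """Approximate 1/x for x in [1, 2) via a small table seed plus Newton-Raphson.
    Each iteration roughly doubles the number of correct bits."""
    assert 1.0 <= x < 2.0
    # Seed: midpoint reciprocal of the sub-interval selected by the top bits of x.
    idx = int((x - 1.0) * (1 << table_bits))
    r = 1.0 / (1.0 + (idx + 0.5) / (1 << table_bits))   # would be a small ROM in hardware
    for _ in range(iterations):
        r = r * (2.0 - x * r)                            # Newton-Raphson refinement
    return r
```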

And you can get away with a rough approximation for it. Probably around 8 or 10 bits will be enough; polyBLEP doesn't remove enough aliasing for the extra jitter from an inaccurate division to matter. I may be wrong though, but that's the intuition.

Exact implementation will depend almost entirely on the hardware. Some hardware can do fast reciprocal estimates even without a dedicated divider. Or your hardware can't do floating point at all and needs fixed point. It's 100% hardware-dependent, and the performance implications are completely determined by what kind of CPU it is and what it has.

Post

2DaT wrote: Fri Dec 22, 2023 3:42 am And you can get away with rough approximation for it. Probably around 8 or 10 bits will be enough, polyBLEP doesn't remove enough aliasing that extra jitter that comes from inaccurate division will matter. I may be wrong though, but that's the intuition.
One can think of every additional bit adding a factor of 2 "virtual oversampling" with PolyBLEP as the downsampling kernel, so every additional bit basically gives an extra octave for the harmonics to decay before they alias at this virtual sampling rate. For a saw spectrum, we have 8*6dB=48dB or 10*6dB=60dB down from the level at Nyquist, and if we're doing PolyBLEPs one would hope that the fundamental is a few more octaves down... so your estimates are probably sensible.

If we compute more bits that are just inaccurate, then I think it depends on how the errors correlate with the waveform frequency. The best case is that they are uncorrelated, in which case the jitter essentially acts as temporal dither and spreads the aliasing out into a slightly more even noise floor. I think in the worst case (perfect correlation) we basically gain nothing compared to just computing the bits we can get accurately.

Post

jrmoserbaltimore wrote: Thu Dec 21, 2023 8:21 pm ...

My issue with PolyBLEP is I can replace dt with (f0 * 1/fs) and hard-code 1/fs to avoid dividing by the sample rate; but whereas I need to divide by fs/f0 to compute the scaling factor in DPW (thus multiply by f0*1/fs), I need to divide by f0/fs (thus multiply by fs*1/f0) to compute PolyBLEP.

Division is hard.

There's no really fast way to approximate 1/x to a low error.
...
If you have only common fixed sample rates, 1/fs can be matched exactly at those rates by a degree-3 polynomial. You need a separate polynomial for the 44100- and 48000-based rates; a degree-7 polynomial covers both bases exactly. Though, would one divide really be heavier on the CPU than a degree-3 polynomial? Example for the 44.1kHz base: https://www.desmos.com/calculator/qugidvvrqv (the coefficients are tiny numbers, but if you use an index (1, 2, ...) instead of the sample-rate values they stay at around the 10^-7 level).

With a hard-coded base 1/fs, say for 44100, you get the next common rate in the same base by just multiplying the current 1/fs value by 0.5 (and the previous one by multiplying the current 1/fs by 2).

Post

juha_p wrote: Fri Dec 22, 2023 7:00 am If you have common fixed sample rates only, 1/fs can be approximated exactly by using degree 3 polynomial.
Hold on... if we have a known fixed sampling rate, then 1/fs is a constant. If you have a finite set of known sampling rates to choose from, then 1/fs is an array of constants. There's no need to approximate anything here.

If we're working in integers rather than floating point, then things are a tiny bit more complicated, because we obviously can't simply compute 1/fs and store the result; but it turns out that since we're typically using wrap-around arithmetic, we can still replace division by a constant with a multiplication and perhaps some bitshifts and an add to fix rounding errors (see https://libdivide.com/ for example).
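
A minimal integer sketch of the idea (not libdivide itself; constants picked for a 48kHz divisor and 32-bit dividends):

```python
FS = 48000
SHIFT = 48                              # wide enough for 32-bit dividends
MAGIC = (1 << SHIFT) // FS + 1          # rounded-up fixed-point reciprocal of 48000

def div_by_fs(n):
    """n // 48000 for 0 <= n < 2**32, using only a multiply and a shift."""
    return (n * MAGIC) >> SHIFT

# spot checks
assert all(div_by_fs(n) == n // FS
           for n in (0, 1, 47999, 48000, 123456789, 2**32 - 1))
```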

So really the only situation where you actually need to divide anything is when the divisor is variable. Unfortunately that's the case with BLEPs.

Post

Bwaaagh, y'all discussing "hardware" but it's computations and thus "software" after all :tantrum:

Post

2DaT wrote: Fri Dec 22, 2023 3:42 am It's not straightforward, but there are ways to compute divisions (or estimates of them) within reasonable time.

One of them is doing a lookup and doing a newton raphson iteration on it (and variations of it, look it up on the internet).
Yes, that's SRT division. Last time I needed division, it was impractical to use a lookup table (I needed single-cycle 38-bit division), so I spent a couple of hours inventing a method of generating the correct quotient digit. It's a 60-year-old problem, but it's trivial to solve once you realize there isn't a lookup table but rather a set of lookup tables, and that each one tells you something actionable about how to carry out the calculation—you can start multiplying and adding things before you know what numbers you're multiplying and adding. Those tables are easier to define with a shallow boolean function.

It still takes 24 cycles to do a 24-bit division versus 3 to do a 38-bit multiplication.
mystran wrote: Fri Dec 22, 2023 5:48 am One can think of every additional bit adding a factor of 2 "virtual oversampling" with PolyBLEP as the downsampling kernel, so every additional bit basically gives an extra octave for the harmonics to decay before they alias at this virtual sampling rate. For a saw spectrum, we have 8*6dB=48dB or 10*6dB=60dB down from the level at Nyquist and if we're doing PolyBLEPs one would hope that the fundamental is a few more octaves down... so your estimates are probably sensible.
That's an interesting way to think about it. So basically estimating t to a lower precision than the sample size introduces minimal additional distortion, and the amount of distortion introduced is cut in half by increasing the bit width of t by 1?
2DaT wrote: Fri Dec 22, 2023 3:42 am Exact implementation will depend on hardware almost entirely. Some hardware can do fast reciprocal estimations even without dedicated divider. Or your hardware can't do floating point at all and it needs to be fixed-point. It's 100% hardware based and performance implications are completely based on what kind of CPU it is and what does it have.
I have complete control over that. CPUs are far too slow and power-hungry to use as a tone generator for my purposes. One of my long-term goals is something like PianoTeq (which is using modal synthesis) but with 88 polyphony and able to run on a battery for several hours. It's not hard to get that working in under a tenth of a watt, at least not any harder than it is to get it working in the first place (modal synthesis itself is pretty trivial, but setting the right coefficients on the right number of resonators and giving it the right excitation impulse to get it to produce anything remotely like the sound you want is the kind of thing you'd have an engineering team spend years researching).
mystran wrote: Fri Dec 22, 2023 12:54 am It's been quite a long time since I tried DPW, since it's sort of in the category of "slightly better than naive" algorithm (usually I'm more into higher quality and arguably more expensive algorithms)
Maybe try using the averaged differentiator suggested in Oscillator and Filter Algorithms for Virtual Analog Synthesis in the Computer Music Journal.

"A remaining problem is that at high frequencies the level of aliased components is close to that of the harmonics. This may lead in some cases to beating. A solution to alleviate this is to replace the differentiator with its averaged version HD(z) = (1 – z^–2)/2 = (1 – z^–1) (1 + z^–1)/2, which does not spoil the simplicity of the algorithm. This filter is a combination of a two-point average filter and the first difference filter. Figure 5a compares the magnitude responses of the first-difference filter (1 – z^–1)/2 and the ideal differentiator, and also shows the magnitude response of the two-point average filter. It can be seen that the response of the simple differentiator is very close to the ideal at low frequencies. The error exceeds 1 dB above 11 kHz. Figure 5b gives the magnitude responses of the ideal differentiator and the proposed averaged differentiator, which is an FIR comb filter. The proposed filter is very similar to the ideal differentiator at low frequencies with less than a 1-dB deviation below 5 kHz. At high frequencies, much attenuation is achieved, which is necessary to suppress the potential beating artifacts."

I'm not sure exactly what this means (I read a lot of research but I'm still new at this). The differencer I used is just y-z^-1. He didn't give a diagram showing how the math works, probably because most people reading this stuff would understand it immediately.
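
If I'm reading it right, it just means replacing the first difference y[n] - y[n-1] with (y[n] - y[n-2])/2, i.e. the difference cascaded with a two-point average. A sketch of both (my reading of the quoted passage, could be wrong):

```python
class FirstDiff:
    """Plain differencer: y[n] = x[n] - x[n-1] (the y - z^-1 above)."""
    def __init__(self):
        self.z1 = 0.0
    def tick(self, x):
        y, self.z1 = x - self.z1, x
        return y

class AveragedDiff:
    """Averaged differentiator HD(z) = (1 - z^-2)/2: difference over two samples, halved."""
    def __init__(self):
        self.z1 = 0.0
        self.z2 = 0.0
    def tick(self, x):
        y = 0.5 * (x - self.z2)
        self.z1, self.z2 = x, self.z1
        return y
```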

Below (from the aforementioned paper) are the samples and spectrum for the naive waveform, its square, and the first-differenced DPW, at 44100Hz sample rate and 2793.8Hz fundamental:
[attached image: Screenshot 2023-12-28 230506.png]
Below, same waveform, but using HD(z) averaged differencer:
[attached image: Screenshot 2023-12-28 230526.png]
Välimäki has suggested in his more recent work that oversampling DPW2 by a factor of 2x (that is, second-order DPW, 96kHz instead of 48kHz sampling rate) works better, but you'll need to do the DPW calculations and scaling with something like 9 extra bits of precision. I'd rather use DPW4 instead and toss in an extra 20 bits of precision up to the final scaling.

If you're all that curious, you can try comparing DPW2 and DPW4 with and without the averaged differencer at 48kHz and 96kHz, but that's 8 different generating modes and so potentially more effort than you've got curiosity.
mystran wrote: Fri Dec 22, 2023 12:54 am Jumping phase with DPW means your finite difference comes out as a spike and you probably don't want that. I have no idea if it works in practice, but the first thing that comes to mind working in fixed point would be just forcing continuity (ie. only changing slope when phase jumps) and then relying on modulo arithmetics with the differencing? The absolute level of integrated waveforms doesn't matter after all, just how the derivatives (or rather finite differences) vary.
Same paper suggests:

"Figure 7a uses two sawtooth waveform generators with a phase shift to generate a pulse waveform. This two-oscillator method allows smooth pulse width modulation, since the phase shift can be continuously controlled by offsetting the counter in one of the oscillators. The computational cost of this method is simply twice that of a single sawtooth generaton."

i.e. to modulate the duty cycle of a pulse wave, subtract one saw from another and modulate the phase of one of them. He goes on to suggest that a comb-filter approach creates "zipper noise" and that another approach using delay lines is more computationally intensive than just using DPW to compute two saws. I assume by "smooth pulse-width modulation" he means while adjusting the pulse width, because…wouldn't it always be smooth if everything was unchanging?

I haven't tried it though.
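
Structurally it'd be something like this (naive saws here just to show the wiring; in the real thing each would be a DPW or PolyBLEP saw):

```python
class PulseFromTwoSaws:
    """Pulse wave from two phase-offset sawtooth oscillators (the Fig. 7a idea).
    Naive saws for brevity; swap in band-limited saws for the real version."""
    def __init__(self, f0, fs, duty=0.5):
        self.dp = f0 / fs
        self.p1 = 0.0
        self.p2 = duty % 1.0                 # phase offset sets the pulse width

    def set_duty(self, duty):
        # PWM: slide the second oscillator's phase relative to the first
        self.p2 = (self.p1 + duty) % 1.0

    def tick(self):
        saw1 = 2.0 * self.p1 - 1.0
        saw2 = 2.0 * self.p2 - 1.0
        self.p1 = (self.p1 + self.dp) % 1.0
        self.p2 = (self.p2 + self.dp) % 1.0
        return saw1 - saw2                   # rectangular pulse, width = duty
```
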
mystran wrote: Fri Dec 22, 2023 12:54 am No idea if any of my thoughts are useful, perhaps there's someone around with actual FPGA experience that can give more meaningful advice... but most of us mostly do software I guess.
The big considerations for FPGA are 1) limited, non-dynamically-allocated RAM; and 2) if I want to do a bunch of stuff, I do that stuff one operation at a time in a pipeline. That second point means that if you loop on a=a+1, a=a*n, a=a*m, I don't "loop": while a=a*n is happening for one piece of data, I'm also computing b=b+1 for the next, since that part of the circuit is now available. It's more analogous to an assembly line than a pipeline.

It wasn't actually that hard to get started on. If you're any good with Python it's pretty easy to set up Amaranth and run things through its simulator. You'll have to open the VCD it produces in WaveTrace. I suspect it's less useful to most people here though, since everyone's probably writing VST plug-ins or something instead of designing a rack-mount synthesizer or a digital keyboard or whatnot.

That reminds me, I should install Spyder and WaveTrace on this machine (and figure out how to manage virtual environments with Spyder; it's easy with Thonny).

Post

I had planned to use differentiated polynomial waveforms for the oscillators of one of our products in the past. I wasted two weeks of work on it. While it might look like a good idea in theory, it turned out that it does not work well in practice for a synthesizer. If you change the pitch of the oscillator, serious crackles occur. It is very difficult to get rid of them. In practice, you can use this method only for oscillators with a static pitch.

My recommendation if you do not want to use precalculated wavetables:
One idea that works well as a compromise is to linearly interpolate only the sample where the sawtooth/square wave jumps, using the fractional part of the offset. This way you can efficiently get rid of the low-frequency components of the aliasing. Basically it suppresses the jitter of the waveform that is audible as aliasing by replacing the 'waveform jump' with a linear ramp or a crossfade.
You can see this as a very rough approximation of a BLIT. Unlike BLIT it does not suffer from ringing. However, you do get a slight damping in the very high frequencies. But it works very well at a 96kHz sample rate. It is extremely CPU efficient.
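
Roughly like this in pseudocode-style Python (just a sketch of the idea, not our actual code):

```python
class SawWithInterpolatedEdge:
    """Naive sawtooth where the sample containing the wrap is replaced by a value
    linearly interpolated across the jump (a one-sample ramp instead of a hard edge)."""
    def __init__(self, f0, fs):
        self.dp = f0 / fs
        self.phase = 0.0

    def tick(self):
        self.phase += self.dp
        if self.phase < 1.0:
            return 2.0 * self.phase - 1.0              # ordinary saw sample
        self.phase -= 1.0
        frac = self.phase / self.dp                    # fraction of the step after the jump
        top = 2.0 * (self.phase + 1.0) - 1.0           # pre-jump branch, extrapolated
        bottom = 2.0 * self.phase - 1.0                # post-jump branch
        return frac * bottom + (1.0 - frac) * top      # crossfade across the jump
```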

Post

jrmoserbaltimore wrote: Fri Dec 29, 2023 4:13 am
mystran wrote: Fri Dec 22, 2023 5:48 am One can think of every additional bit adding a factor of 2 "virtual oversampling" with PolyBLEP as the downsampling kernel, so every additional bit basically gives an extra octave for the harmonics to decay before they alias at this virtual sampling rate. For a saw spectrum, we have 8*6dB=48dB or 10*6dB=60dB down from the level at Nyquist and if we're doing PolyBLEPs one would hope that the fundamental is a few more octaves down... so your estimates are probably sensible.
That's an interesting way to think about it. So basically estimating t to a lower precision than the sample size introduces minimal additional distortion, and the amount of distortion introduced is cut in half by increasing the bit width of t by 1?
Not "distortion" in the ordinary sense, but rather aliasing.

Ideally we want to construct the desired waveform in continuous-time and then filter this waveform with an anti-aliasing filter before sampling.. and BLEP-synthesis is essentially a way to reorder computations in such a way that this continuous-time filtering becomes feasible for certain classes of waveforms.

Specifically, if the waveform is piecewise polynomial (of some finite degree) and the chosen AA-kernel is an identity operator with respect to polynomials up to the degree required, then there will be filter ripple only where the pieces join, and the exact ripple depends only on the discontinuity in the signal and its derivatives... and because polynomials (of finite degree) have a finite number of non-zero derivatives, we can essentially just take the filter kernel and precompute integrated versions for the step and each of the potentially discontinuous derivatives, then compute the actual time offset and magnitudes when generating the waveform and insert these precomputed "bleps" with appropriate scaling.

Now, the "gotcha" here is that we're pretending to do this in continuous-time, but in fact when we're working in finite numerical precision, our time isn't truly continuous. So if we round the time-offsets at which we insert the BLEPs to say 8-bits, then essentially we're sampling the continuous-time waveform without filtering at a sampling rate of 256x the base rate (which causes aliasing which we can no longer filter) and then we're doing the BLEP-filtering with the result before resampling to the final rate.

So ideally you want enough bits that your "first round of sampling" (theoretically) happens at a high enough rate that aliasing at that point is comparable to the attenuation you actually get from your chosen BLEP kernel. With a floating-point implementation, we mostly have enough precision here anyway that you don't really even need to think about it.

In the case of PolyBLEPs, we choose the filter kernel itself to be some piecewise polynomial construct, but this has no impact on any of the theory whatsoever, it just allows us to compute short kernels on the fly without having to store them in tables.

Now, the question with hardware though... is whether we could just brute-force oversample and filter instead. If we can brute-force oversample enough that aliasing is negligible (for whatever definition of negligible) and then resample this down to the desired rate with decent anti-aliasing (eg. some CIC passes followed by a polyphase FIR perhaps?) then trying to be fancy might just be a waste of time. Old 8-bit "home computers" and game consoles often had sound chips that would just generate pulse waves at some reasonably high rate (eg. whatever the clock runs at, since generally these are just counters with adjustable reset points) and then pass them directly to an analog filter of some sort for low-pass filtering... and they usually don't sound too horrible with regards to aliasing, because the (effective) sampling rate is high enough.
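
As a toy example of the brute-force route (assumptions mine: two boxcar passes standing in for a proper CIC + FIR chain):

```python
import numpy as np

def naive_pulse(f0, fs, n, duty=0.5):
    phase = (np.arange(n) * (f0 / fs)) % 1.0
    return np.where(phase < duty, 1.0, -1.0)

def oversampled_pulse(f0, fs, n, m=16, duty=0.5):
    """Generate a naive pulse at fs*m, smooth it with two cascaded length-m
    moving averages (a crude CIC-ish lowpass), then keep every m-th sample."""
    hi = naive_pulse(f0, fs * m, n * m, duty)
    for _ in range(2):
        hi = np.convolve(hi, np.ones(m) / m, mode="same")
    return hi[::m]
```
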
mystran wrote: Fri Dec 22, 2023 12:54 am It's been quite a long time since I tried DPW, since it's sort of in the category of "slightly better than naive" algorithm (usually I'm more into higher quality and arguably more expensive algorithms)
Maybe try using the averaged differentiator suggested in Oscillator and Filter Algorithms for Virtual Analog Synthesis in the Computer Music Journal.
Well, like I said earlier, I mostly work on desktop and the quality targets I'm usually aiming at are a lot higher than what is being discussed here. On desktop, where SIMD makes FIR taps relatively cheap, FIR BLEPs are really the gold standard: they let you push aliasing (at least for "slowly" modulated waveforms) essentially down to the numerical noise floor, and you can have a filter response so steep that it causes issues with non-linear filters (so it can actually make sense to run at 2x just so that you can use a less steep transition and still have "zero" aliasing).

Does this mean that cheaper, less "perfect" methods are useless? Not really... but they are somewhat outside the realm of what I'm usually interested in, because I'm usually looking for the kind of quality where you sweep the fundamental above the Nyquist and all you hear is faint broadband noise... so when I see a paper that demonstrates aliasing rejection with a spectral plot with -60dB dynamic range, I've already usually lost interest, because it doesn't help me improve the quality of what I already have.

You can certainly make a case for focusing more on performance, but I have chosen to mostly focus on quality with my work. If I ever found myself in a situation where I was trying to get the best possible quality out of very limited hardware that is simply not up to the job when it comes to "transparent" quality, then I might take a look... but on a modern desktop, algorithms like these are (in my very humble opinion.. and I emphasize this is an opinion that others might not share.. and that's fine) mostly just a waste of time. :)

In fact.. if you can tolerate a bit of latency, even for a rackmount it might make sense to just go with a DSP processor or perhaps even some generic chip like an ARM. I'm pretty sure I could do a couple of "transparent" voices using BLEPs on my old RPi for example (even though RPi as a board is definitely not ideal for audio work)... could probably try that some day.

Doing FPGA can certainly be fun though... and fun is a perfectly good reason to do things. :)

Post

Tone2 Synthesizers wrote: Fri Dec 29, 2023 8:10 am If you change the pitch of the oscillator serious crackles occur. It is very difficult to get rid of them. In practise, you can use this method only for oscillators with a static pitch.
Well that's out then.
Tone2 Synthesizers wrote: Fri Dec 29, 2023 8:10 am My recommendation if you do not want to use precalculated wavetables:
One idea that works well as a compromise is lineary interpolate only the sample where the sawtooth/squarewave jumps with the decimal place of the offset.
That's not hard to do, it's a single register to track whether the waveshape was positive on the last tick and interpolate if it switched.

With the insane scaling factors needing a division in DPW, I was looking at linearization of the function, which worked well but requires a table of f(x) and its derivative f'(x) because I need the slope of the tangent line (accessing 2 different values in ROM on the same clock tick is hard, so no linear interpolation unless I stored pairs of f(x) and f(x+1) instead). It worked out if you kept a table of 64 entries for each binary partition (and you can extrapolate from 4096 up through above the highest note on the piano within 0.15% precision) but yeah, ultimately it needed division or a table.

Precalculated wavetables might be easier at this point, except I'm not sure how to change frequency other than by just stepping by the phase increment and interpolating between the two nearest points. Also, if the waveform has like 10 harmonics and I generate 4186Hz, don't I get 8000Hz and 16000Hz and 32000Hz and 64000Hz etc. in the output?
Tone2 Synthesizers wrote: Fri Dec 29, 2023 8:10 am However you have a slight damping in the very high frequencies. But it works very well in 96khz samplerate.
Don't they all?

mystran wrote: Fri Dec 29, 2023 3:13 pm
Now, the question with hardware though... is whether we could just brute-force oversample and filter instead. If we can brute-force oversample enough that aliasing is negligible (for whatever definition of negligible) and then resample this down to the desired rate with decent anti-aliasing (eg. some CIC-passes followed by a polyphase FIR perhaps?) then trying to be fancy might be just waste of time. Old 8-bit "home computers" and game consoles often had sound chips that we would just generate pulse waves at some reasonably high rate (eg. whatever the clock runs at, since generally these are just counters with adjustable reset points) and then pass them directly to an analog filter of some sort for low-pass filtering.. and they don't usually don't sound too horrible with regards to aliasing, because the (effective) sampling rate is high enough.
True, the AY-3-8930 native sample rate is like 2MHz.

Increasing the number of samples generated per second means increasing the clock rate, since I already run at the lowest clock rate needed to generate however many samples I need. That's not a big deal at my target clock rates; it'd be a big deal around 150MHz or so, but I can run at something like 10MHz. A combined chip might require a higher clock rate anyway, e.g. FM synthesis at the level of something like Yamaha's FM-X (512 operators) requires something like 166MHz (and can do 1024 operators at that rate).

I just use second-order Butterworth filters: a Chamberlin SVF for a variable filter, or I guess you could use a biquad if you've got a fixed filter, like one in line to remove aliasing. Decimation from 96kHz to 48kHz is just throwing out every other sample, so no need for FIR.
mystran wrote: Fri Dec 29, 2023 3:13 pmIn fact.. if you can tolerate a bit of latency, even for a rackmount it might make sense to just go with a DSP processor or perhaps even some generic chip like an ARM.
My actual latency is a few microseconds from key on 8)

The actual final latency will probably be more like keyboard key press → Pi Pico polling → code on the Pi Pico to figure out wtf to do with it → tell the chip what to do → 20µs → DAC → output. There is basically no useful way to get 20µs of actual latency (you can hook the keyboard directly to the chip and add polling circuitry to handle key-on internally, but then you can't run any fancy code to decide to do fancy things like interpret MIDI signals, so it'd just be showing off). Nobody can tell the difference below a few thousand µs anyway.

It's more about power for me.

The Pi Pico runs near half a watt at full power; the Pi 4B+ idles at 5W and reaches above 6W at full power. I had a power analysis for a 166MHz design on an Artix-7 come out at 0.15W. An 18650 battery cell holds about 10Wh of energy, so running a Pi 4B+ just idling works out to 2 hours per cell; the Pi Pico, 20 hours per cell; and the FPGA, at least 66 hours per cell. Pico + FPGA: 15 hours if the Pico is going hard, maxed out 100% of the time, more like 30 hours in sane conditions (and you can underclock the Pico to reduce dynamic power, possibly getting as far as 50-ish hours).

If you build a portable thing like a keytar or keyboard, you also need to drive LEDs, possibly a speaker, and generally feel up the buttons and keyboard with periodic electrical pulses, all of which adds power usage. Three 7-segment LED displays at 10mA per segment (yes, that's 10mA times 21 segments, or 210mA) at 3.3V is already 0.693W; you can drive a 3.3V 128x64 OLED display (0.96 inch) on 10mA-20mA and a 1.3 inch one at 31mA-39mA, so between 0.033W and 0.13W, bringing the time on one battery to 18-23 hours. Add a 5W RMS speaker and you have 1.8 hours of play time per 18650 cell, versus 2 hours for the speaker alone and under 1 hour for a Pi 4B+ and a 5W speaker.

So 6 batteries at 12 hours or 6 hours of play time.

This obviously makes no real difference because who needs a portable keyboard to survive on battery for 6 hours before plugging in to recharge? It would make much more of a difference if you dropped the speakers and used a wireless transmitter to a plugged-in amp for stage performance, since then you're talking about the thing lasting literally all day on and playing.

All of this has one caveat.

Some of the things I've seen done—notably a digital waveguide model of a Hohner D6 clavinet—reached a blistering 10 voices of polyphony with all other software gutted, so the synthesizer was running, consuming 99% of the CPU, all by itself.

128-voice polyphony is hard. 88-voice polyphony is kind of necessary if you're modeling a piano (when you lift the damper, all the strings resonate softly), and while the top strings only need 5 resonators when doing modal synthesis (PianoTeq), bass strings need a hundred or so each. Several thousand resonators is trivial; I'd have to push the clock rate to like 30MHz, but it's not difficult to implement something like that cheaply and easily (making it sound like an instrument requires a lot more than having a resonator circuit; you're talking about tens of thousands of coefficients to work out and selecting an excitation signal). Modern CPUs can keep up, they just consume dozens of watts of power to do so.

Post

mystran wrote: Fri Dec 29, 2023 3:13 pm In fact.. if you can tolerate a bit of latency, even for a rackmount it might make sense to just go with a DSP processor or perhaps even some generic chip like an ARM. I'm pretty sure I could do a couple of "transparent" voices using BLEPs on my old RPi for example (even though RPi as a board is definitely not ideal for audio work)... could probably try that some day.
Aren't the recent(ish) Korg digital synths running on RPi boards? The modwave, wavestate and opsix. So that's wavetable, sampling and FM fairly comprehensively catered for.

If you attach a spectrum analyser to the native version of modwave you can see the shortcut they took with wavetables. The spectrum truncates slightly above ~15kHz depending on which note in an octave is played. Meaning, they're using mip-map wavetables but aren't oversampling+filtering.

So I guess there are calculated tradeoffs to get reasonable performance.

Post

JustinJ wrote: Sat Dec 30, 2023 12:42 pm If you attach a spectrum analyser to the native version of modwave you can see the shortcut they took with wavetables. The spectrum truncates slightly above ~15kHz depending on which note in an octave is played. Meaning, they're using mip-map wavetables but aren't oversampling+filtering.

So I guess there are calculated tradeoffs to get reasonable performance.
Some popular soft-synths do this too.. but like it's always trade-offs, perhaps they wanted more polyphony or whatever. Wavetables are also kinda different in the sense that there's not necessarily that much computation, but potentially a whole lot of (somewhat incoherent) memory access if you're doing bilinear or trilinear interpolation (over samples, over mips, over different waveforms) and since there's a lot more data this won't fit very well in caches, so the memory subsystem plays a bigger role... and if you bump the sampling rate, the cost goes up more or less proportionally... and because a bulk of it is memory gathers it's hard to SIMD anything efficiently.

Post

JustinJ wrote: Sat Dec 30, 2023 12:42 pm If you attach a spectrum analyser to the native version of modwave you can see the shortcut they took with wavetables. The spectrum truncates slightly above ~15kHz depending on which note in an octave is played.
You know, I hadn't thought of just taking a full spectrum sample from 27.5Hz A0, filtering it, and then playing it back at a higher rate. I think I'd need FIR for that though.

For reference, the way I usually pipeline involves a lot of not accessing the same memory repeatedly. I have multiple 2KiB RAMs with separate address and data buses, so I can do things like have 512 IIR filters stored as 512 32-bit values in, for example, 4 BRAMs (see the Chamberlin SVF: the two multipliers set the corner frequency, so I need 2 variables for those and 2 for the z^-1 delay elements). The sequence for that is:
  • Read both z^-1. Forward the incoming sample 'x' to the next stage.
  • Receive z^-1 values on data buses. FMA m=z^-1[0]×q+x. Forward both z^-1.
  • Add m=m+z^-1[2]. Read 'f'. Forward 'm' and both z^-1. 'm' is now the highpass filter output as well.
  • Receive 'f'. FMA m=m*f+1 (this is slightly cheaper than forwarding m+1 and then multiplying, since a multiply always has a stage where you have 3n-1 numbers to add for integer 'n', and multipliers are a bunch of stages of adding groups of 3 numbers to get 2 in parallel, so you turn a group of 2 into a group of 3 at some level and that's the only difference between multiply and FMA). Forward 'm' and both z^-1.
  • Add m=m+z^-1[0]. Forward z^-1[1] and 'm'. 'm' is now your bandpass output.
  • Store 'm' into z^-1[0] (you can write to one address in RAM and read from another simultaneously—dual-port RAM—and the first step, reading z^-1, has happened 4 more times since this all started and is happening a fifth time during this step). Multiply n=z^-1[0]×f (this is not the new z^-1[0], it's the old one). Forward 'n' and z^-1[1].
  • Add n=n+z^-1[1]. Store in z^-1[1]. This is your lowpass output; summed to the highpass output, it becomes your band reject output.
This involves 1 read from each of the registers for each filter, and 1 write to each z^-1 register for each filter. Values I need early and late in the process are transferred forward through each stage of the pipeline until needed.
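
For anyone who'd rather see the arithmetic than the pipeline stages, the plain (non-pipelined) textbook Chamberlin SVF update is below; it's the standard form, not a line-for-line match for my register layout above:

```python
import math

class ChamberlinSVF:
    """Textbook Chamberlin state-variable filter, one update per sample.
    f = 2*sin(pi*fc/fs), q = 1/Q; stable for fc up to roughly fs/6."""
    def __init__(self, fc, fs, Q=0.707):
        self.f = 2.0 * math.sin(math.pi * fc / fs)
        self.q = 1.0 / Q
        self.low = 0.0
        self.band = 0.0

    def tick(self, x):
        low = self.low + self.f * self.band
        high = x - low - self.q * self.band
        band = self.band + self.f * high
        self.low, self.band = low, band
        return low, band, high, high + low   # LP, BP, HP, notch
```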

I also do this exactly once for each sample generated; I can't do it a variable number of times, meaning I can't e.g. raise a sampled waveform 2 octaves and filter its harmonics by running the next 4 samples through the IIR filter and decimating. If I need to loop, I need to stall the pipeline, which is doable but drastically limits throughput (e.g. if the longest iterating operation is 4 cycles long, then I have 1/4 the throughput). To run multiple filters on the same channel I literally just put multiple filters in the pipeline with separate RAMs.

That's why it's so fast: if the chip is nothing but a digital filter, then the above process—requiring 7 clock ticks to complete—can filter 512 separate audio streams in 512+7 clock cycles; and since it doesn't need to wait for each batch to finish, it can effectively filter 512 separate audio streams in 512 cycles, meaning you can filter 512 audio channels at a 96kHz sample rate in real time with a clock speed of 49.152MHz and a pipeline latency of about 0.14µs (7 clock ticks). It takes a minimum of 55µs for sound to get from your ear to your brain.

Assuming a single-clock multiplier, a perfect cache with no misses, and non-superscalar execution, the above doesn't really take more clock cycles per iteration on a CPU if you have FMA—I have no trees of operations doing 'n' operations in log2(n) clock cycles, although that is a thing I do for e.g. high-order polynomials (I'll do them in sequence instead if I need to save hardware, because of Horner's method—you should be using Horner's method when writing software). So 7×512 plus the loop instruction—let's assume it's a single decrement-and-loop-if-not-zero—comes to 4096 clock ticks per sample period, or a 393MHz CPU needed under absolutely ideal conditions (without SIMD). You're probably not using a microcontroller to filter 512 separate audio channels, though…does a mixer board even exist with 512 audio channels?

In any case that creates a constraint that makes it enormously expensive for me to do anything involving looping during a single sample generation. If you can produce an audio sample from a wavetable without using backwards branches…that would work.
JustinJ wrote: Sat Dec 30, 2023 12:42 pm Meaning, they're using mip-map wavetables but aren't oversampling+filtering.
That's one way to avoid filtering, yeah, which I guess reduces the problem to just calculating a phase offset and skipping to the nearest sample that much further ahead. The wavetable might need to be double-wide, with each entry being sample n and sample n+1, if I want to do linear interpolation; although dual-port ROM can do two reads simultaneously instead of a read and a write, so I could read both samples in the same clock tick without contention (I cannot read 3 without duplicating the ROM).
mystran wrote: Sat Dec 30, 2023 3:06 pm Some popular soft-synths do this too.. but like it's always trade-offs, perhaps they wanted more polyphony or whatever. Wavetables are also kinda different in the sense that there's not necessarily that much computation, but potentially a whole lot of (somewhat incoherent) memory access if you're doing bilinear or trilinear interpolation (over samples, over mips, over different waveforms) and since there's a lot more data this won't fit very well in caches, so the memory subsystem plays a bigger role... and if you bump the sampling rate, the cost goes up more or less proportionally... and because a bulk of it is memory gathers it's hard to SIMD anything efficiently.
bilinear/trilinear vs linear o_o

Hilarity ensues if you just store the prior output and average it with the next sample in the table. That just gets you a moving average filter which would probably be very muddy.

Actually…how would mipmapped wave table with moving average work, and how would it work with frequency slides as you cross over to the next frequency group? Do you just use half as many samples per octave and only use the upper bits of the phase (relative to the highest '1' bit) to address the table? Like, for 440 to 880 you'd use a 128-sample waveform, for 880 to 1760 you'd use a 64-sample waveform, etc.? You'd run the corresponding addressing lines to all the ROMs, but select the data lines based on a priority encoder selecting using the top 1 bit, not a difficult thing to design in hardware.

Post

jrmoserbaltimore wrote: Sat Dec 30, 2023 9:33 pm Hilarity ensues if you just store the prior output and average it with the next sample in the table. That just gets you a moving average filter which would probably be very muddy.

Actually…how would mipmapped wave table with moving average work, and how would it work with frequency slides as you cross over to the next frequency group? Do you just use half as many samples per octave and only use the upper bits of the phase (relative to the highest '1' bit) to address the table? Like, for 440 to 880 you'd use a 128-sample waveform, for 880 to 1760 you'd use a 64-sample waveform, etc.? You'd run the corresponding addressing lines to all the ROMs, but select the data lines based on a priority encoder selecting using the top 1 bit, not a difficult thing to design in hardware.
Not "moving average" but rather linear interpolation. When you want to read the table at fractional position, you fetch the two closest samples with linear weighting (ie. essentially treating the table as piece-wise linear, "connect the dots" in a sense).

Linear interpolation introduces noise, and this tends to get more obvious at higher fundamentals (in the extreme case, 2 samples is theoretically enough to represent one sine at the fundamental, but obviously you'll get a triangle when reading the table with just linear interpolation.. not what we want), so what you'd normally do with a wavetable synth is just keep all the mips the same size. For example we might have 2048 samples (or 4096.. but somewhere in that ballpark typically) in each table, which is 1024 harmonics at the base level, but then the rest of it is progressively filtered (well, we can just zero FFT bins since this is periodic) to fewer and fewer harmonics (eg. usually something like 1 or 2 tables per octave), with the same number of samples, such that the stored waveform is essentially oversampled, bringing the interpolation noise down. When you're on a system where memory is cheap, this tends to be more efficient than trying to do better interpolation, since at the lowest fundamentals the noise isn't nearly as obvious.
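
A minimal sketch of that layout (sizes and one-table-per-octave are arbitrary choices of mine): every mip keeps the same length, only the number of harmonics drops, and reads are two fetches with linear weighting.

```python
import numpy as np

TABLE_LEN = 2048
N_MIPS = 10                                    # e.g. one table per octave

def build_saw_mips():
    """Band-limited saw tables: mip k keeps roughly TABLE_LEN / 2**(k+1) harmonics."""
    t = np.arange(TABLE_LEN) / TABLE_LEN
    mips = []
    for k in range(N_MIPS):
        n_harm = max(1, TABLE_LEN // (2 ** (k + 1)))
        table = sum(np.sin(2 * np.pi * h * t) / h for h in range(1, n_harm + 1))
        mips.append(table * (2.0 / np.pi))
    return mips

def read(mips, mip_index, phase):
    """Linearly interpolated read at fractional phase in [0, 1)."""
    pos = phase * TABLE_LEN
    i = int(pos)
    frac = pos - i
    table = mips[mip_index]
    return (1.0 - frac) * table[i] + frac * table[(i + 1) % TABLE_LEN]
```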

ps. bilinear or trilinear comes into play when you're doing interpolation in multiple dimensions: you could interpolate the fractional sample offsets (one dimension), then interpolate between different numbers of harmonics (eg. when sliding; that's the 2nd dimension, or "bilinear" if we use linear interpolation again), and then if we do scanning of different waveforms on top of that it's the 3rd dimension (hence "trilinear" if we use linear for this as well). Each dimension doubles the number of memory fetches though... so bilinear for example takes 4 and trilinear 8 fetches. Sometimes you might do cubic (usually 4 samples) somewhere, but if we did tricubic that'd be 4^3=64 memory fetches (and a whole bunch of madds), so often you instead try to store the data in a way that allows for cheaper interpolation.
