KVR Audio

Nowhk · Post by **Nowhk** » Tue Oct 02, 2018 10:25 am

This is basically my code for processing envelopes within my plug-in.
I've isolated it to test its performance in my DAW:

	// buffer
	int remainingSamples = nFrames;
	while (remainingSamples > 0) {
		int blockSize = remainingSamples;
		if (blockSize > PLUG_MAX_PROCESS_BLOCK) {
			blockSize = PLUG_MAX_PROCESS_BLOCK;
		}

		// voices (buffer is 32: 16 simultaneous + 16 free slots for cutting off the previous ones)
		for (int voiceIndex = 0; voiceIndex < PLUG_VOICES_BUFFER_SIZE; voiceIndex++) {
			Voice &voice = pVoiceManager->mVoices[voiceIndex];
			if (!voice.mIsPlaying) { continue; }

			// samples
			int remainingVoiceSamples = blockSize;
			while (remainingVoiceSamples > 0) {
				for (int envelopeIndex = 0; envelopeIndex < mNumEnvelopes; envelopeIndex++) {
					Envelope &envelope = *pEnvelope[envelopeIndex];
					EnvelopeVoiceData &envelopeVoiceData = envelope.mEnvelopeVoicesData[voiceIndex];

					// skip disabled envelopes
					if (!envelope.mIsEnabled) { continue; }

					// process block
					if (envelopeVoiceData.mBlockStep >= gBlockSize) {
						// calculate new envelope values for this block; it's processed every 100 samples and isn't a heavy operation, so it seems I can ignore the core of my code here
					}

					// update output value
					double value = envelopeVoiceData.mBlockStartAmp + (envelopeVoiceData.mBlockStep * envelopeVoiceData.mBlockDeltaAmp);
					envelope.mValue[voiceIndex] = ((1 + envelope.mIsBipolar) / 2.0 * value + (1 - envelope.mIsBipolar) / 2.0) * envelope.mAmount;

					// next phase
					envelopeVoiceData.mBlockStep += envelope.mRate;
					envelopeVoiceData.mStep += envelope.mRate;
				}

				voice.mSampleIndex++;
				remainingVoiceSamples--;
			}
		}

		remainingSamples -= blockSize;
	}

16 voices are playing simultaneously. The CPU hit in the DAW is 6-7%.
From what I can see, it isn't doing very heavy calculations (some sums and divisions); also, I iterate over voices first, then over the sample buffer for each voice (as I've learned over the years, that's better for the cache).

Are there parts I can optimize?
Could branch prediction be a problem here?
Or is binding references to the envelope and its voice data expensive?

Any tips or suggestions?
Thanks masters!!!

JCJR · Post by **JCJR** » Tue Oct 02, 2018 8:38 pm

Well a simple thing, not all divisions can be avoided but it probably makes sense to avoid all avoidable divisions, especially in tight loops. Many divisions can be re-phrased as multiplications. Divs are slower than muls.

In your case of " xx / 2.0 " of course it is trivial to use instead " xx * 0.5 "

OTOH that " xx / 2.0" is such a simple obvious candidate for optimization that perhaps any optimizing compiler might automatically fix it for you nowadays, dunno. I don't keep up with compilers anymore. On the other hand, if you never use divs where a mul would do the same job, it probably isn't likely that a compiler would ever decide to "optimize" your code by substituting a slow div for the fast mul you wrote in the source code.
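
To illustrate the point with this particular divisor: 2.0 is a power of two, so " xx / 2.0 " and " xx * 0.5 " both round the same exact value and should agree bit for bit. That is not true in general (for example " xx / 3.0 " versus " xx * (1.0 / 3.0) "). A minimal sketch follows; the helper names are made up for this example and are not from the plug-in.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helpers, not part of the original plug-in code.
inline double halveWithDiv(double x) { return x / 2.0; }
inline double halveWithMul(double x) { return x * 0.5; }

// Compares two doubles by their bit patterns rather than by value.
inline bool sameBits(double a, double b) {
    std::uint64_t ua = 0, ub = 0;
    std::memcpy(&ua, &a, sizeof ua);
    std::memcpy(&ub, &b, sizeof ub);
    return ua == ub;
}

// Checks n evenly spaced inputs in [-1, 1] and reports whether the
// division and the multiplication always produce identical bits.
inline bool halvingAgrees(int n) {
    assert(n > 1);
    for (int i = 0; i < n; ++i) {
        const double x = -1.0 + 2.0 * i / (n - 1);
        if (!sameBits(halveWithDiv(x), halveWithMul(x))) {
            return false;
        }
    }
    return true;
}
```

Calling halvingAgrees(1000) is one way to convince yourself the rewrite is safe for this case before touching the inner loop.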

2DaT · Post by **2DaT** » Tue Oct 02, 2018 11:58 pm

Some thoughts.
Obvious:
1. Make sure you get all optimizations from the compiler.
2. Denormals off.
3. Check the assembly.
Non-obvious:

if (!envelope.mIsEnabled) { continue; }

This branch is highly likely to mispredict.

Envelope &envelope = *pEnvelope[envelopeIndex];
EnvelopeVoiceData &envelopeVoiceData = envelope.mEnvelopeVoicesData[voiceIndex];
envelope.mValue[voiceIndex] =...

Potential aliasing problems, since everything is accessed through pointers/references.
If the compiler has to reread every internal envelope value for every sample, that would be awful.
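
A small sketch of why the compiler may be forced to reread: if the output pointer is allowed to point at one of the fields being read, a store through it changes what the next iteration sees, so the loads cannot simply be hoisted out of the loop. The struct and function names below are hypothetical stand-ins, not the plug-in's real types.

```cpp
#include <cassert>

// Hypothetical stand-in for the per-voice envelope state.
struct RampState {
    double startAmp;
    double step;
    double deltaAmp;
};

// Reads every field through the reference on each iteration. Because `out`
// may legally point at a field of `s`, each store can change later loads.
inline void rampThroughMemory(RampState& s, double rate, double* out, int n) {
    assert(out != nullptr && n >= 0);
    for (int i = 0; i < n; ++i) {
        *out = s.startAmp + s.step * s.deltaAmp;
        s.step += rate;
    }
}

// Copies the fields into locals first, so stores through `out` can no
// longer affect the values used inside the loop.
inline void rampWithLocals(RampState& s, double rate, double* out, int n) {
    assert(out != nullptr && n >= 0);
    const double start = s.startAmp;
    const double delta = s.deltaAmp;
    double step = s.step;
    for (int i = 0; i < n; ++i) {
        *out = start + step * delta;
        step += rate;
    }
    s.step = step;  // write the running state back once
}
```

With a separate output variable both functions give the same result. If the output pointer is aimed at s.deltaAmp, they diverge, which is exactly the case the compiler has to allow for in the first version.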

while (remainingVoiceSamples > 0) {
	for (int envelopeIndex = 0; envelopeIndex < mNumEnvelopes; envelopeIndex++) {

Reordering these loops could help with data locality and branch mispredicts.

Big Tick · Post by **Big Tick** » Wed Oct 03, 2018 2:53 am

I would manage a list of the currently 'active' envelopes and get rid of the dynamic tests for voice.mIsPlaying and envelope.mIsEnabled entirely.
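
One possible shape for such a list, as a sketch only: a dense array of active indices with swap-and-pop removal, so the audio loop just walks the array and never tests a flag. The class name ActiveList and its methods are invented for this example; capacity is reserved up front so that nothing allocates while audio is running.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kNoSlot = -1;  // marks an index that is not currently active

// Hypothetical helper: keeps the indices of active voices (or envelopes)
// packed together, with O(1) activate and deactivate.
class ActiveList {
public:
    explicit ActiveList(std::size_t capacity) : mSlotOf(capacity, kNoSlot) {
        mActive.reserve(capacity);  // no allocation later on
    }

    void activate(int index) {
        assert(index >= 0 && static_cast<std::size_t>(index) < mSlotOf.size());
        if (mSlotOf[index] != kNoSlot) { return; }  // already active
        mSlotOf[index] = static_cast<int>(mActive.size());
        mActive.push_back(index);
    }

    void deactivate(int index) {
        assert(index >= 0 && static_cast<std::size_t>(index) < mSlotOf.size());
        const int slot = mSlotOf[index];
        if (slot == kNoSlot) { return; }  // was not active
        const int last = mActive.back();
        mActive[slot] = last;  // move the last entry into the freed slot
        mSlotOf[last] = slot;
        mActive.pop_back();
        mSlotOf[index] = kNoSlot;
    }

    // The processing loop iterates over this and skips all flag checks.
    const std::vector<int>& active() const { return mActive; }

private:
    std::vector<int> mActive;  // packed indices of active items
    std::vector<int> mSlotOf;  // position of each index in mActive, or kNoSlot
};
```

Note on and note off (or enabling an envelope in the UI) would call activate/deactivate, and the per-block code would loop over active() only. The order of the active entries is not preserved by swap-and-pop, which should not matter for independent voices.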

PurpleSunray · Post by **PurpleSunray** » Wed Oct 03, 2018 9:55 am

As already said, try to restructure the code a bit.
For example:

envelope.mValue[voiceIndex] = ((1 + envelope.mIsBipolar) / 2.0  * value + (1 - envelope.mIsBipolar) / 2.0) * envelope.mAmount;

Most of the values on this line change only once per envelope, but you calculate them once per sample.
Try something like this (not tested):

// 
// Process voices

for (int voiceIndex = 0; voiceIndex < PLUG_VOICES_BUFFER_SIZE; voiceIndex++)
{
	Voice &voice = pVoiceManager->mVoices[voiceIndex];

	if (!voice.mIsPlaying) {
		continue;
	}

	//
	// Process envelopes

	for (int envelopeIndex = 0; envelopeIndex < mNumEnvelopes; envelopeIndex++)
	{
		Envelope &envelope = *pEnvelope[envelopeIndex];
		EnvelopeVoiceData &envelopeVoiceData = envelope.mEnvelopeVoicesData[voiceIndex];

		if (!envelope.mIsEnabled) {
			continue;
		}

		//
		// Pre-calc the env values to be used below

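		// both factors carry mAmount, since (a * value + b) * mAmount == (a * mAmount) * value + (b * mAmount)
		double bp0 = ((1 + envelope.mIsBipolar) / 2.0) * envelope.mAmount;
		double bp1 = ((1 - envelope.mIsBipolar) / 2.0) * envelope.mAmount;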

		// Process the samples

		for (int i = 0; i < nFrames; i++)
		{
			// update output value
			double value = envelopeVoiceData.mBlockStartAmp + (envelopeVoiceData.mBlockStep * envelopeVoiceData.mBlockDeltaAmp);
			envelope.mValue[voiceIndex] = bp0 * value + bp1;

			// next phase
			envelopeVoiceData.mBlockStep += envelope.mRate;
			envelopeVoiceData.mStep += envelope.mRate;

			// ?? what does this do?
			voice.mSampleIndex++;
		}
	}
}
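
As a quick check of the hoisting idea: the original per-sample expression has the form (a * value + b) * amount, which can be folded into scale * value + offset with scale = a * amount and offset = b * amount, both computed once per envelope. The sketch below uses invented free functions and a made-up ShapeCoeffs struct rather than the real Envelope class.

```cpp
#include <cassert>

// The expression as written in the original inner loop.
// isBipolar is 0 or 1, mirroring envelope.mIsBipolar.
inline double shapePerSample(double value, int isBipolar, double amount) {
    return ((1 + isBipolar) / 2.0 * value + (1 - isBipolar) / 2.0) * amount;
}

// Hypothetical container for the two per-envelope coefficients.
struct ShapeCoeffs {
    double scale;
    double offset;
};

// Computed once per envelope, outside the sample loop.
inline ShapeCoeffs makeShapeCoeffs(int isBipolar, double amount) {
    assert(isBipolar == 0 || isBipolar == 1);
    return { (1 + isBipolar) * 0.5 * amount, (1 - isBipolar) * 0.5 * amount };
}

// What remains per sample: one multiply and one add.
inline double shapeHoisted(double value, const ShapeCoeffs& c) {
    return c.scale * value + c.offset;
}
```

The two forms are algebraically equal, though in floating point the results may differ in the last bit for arbitrary inputs because the operations are grouped differently.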

vortico · Post by **vortico** » Fri Oct 05, 2018 6:29 am

What compile flags and optimizations are you using?

Chris Walton · Post by **Chris Walton** » Fri Oct 05, 2018 6:52 pm

As vortico said, make sure it's compiled in Release mode with the appropriate compiler flags.

Also, the best way to go about it is to profile and see where the hot paths are. Then either optimize those, or look if you can reduce how often that hot path is processed algorithmically. Rinse and repeat.

Nowhk · Post by **Nowhk** » Sat Oct 06, 2018 8:59 am

Here are the settings I have (the defaults from the IPlug project settings in VS):

so I think I'm OK with these. Also note that the number of envelopes I'm iterating over is 10.
I'll refactor my code, trying the above suggestions, and let you know (in the next few weeks; right now I'm too busy with work).

Thanks

Nowhk · Post by **Nowhk** » Wed Oct 10, 2018 12:50 pm

Basically, I've tried all of your suggestions:

while (remainingSamples > 0) {
	int blockSize = remainingSamples;
	if (blockSize > PLUG_MAX_PROCESS_BLOCK) {
		blockSize = PLUG_MAX_PROCESS_BLOCK;
	}

	// voices
	for (int voiceIndex = 0; voiceIndex < 16; voiceIndex++) {
		Voice &voice = pVoiceManager->mVoices[voiceIndex];

		pEnvelopesManager->ProcessBlock(voice.mIndex, blockSize);
	}

	remainingSamples -= blockSize;
}

void EnvelopesManager::ProcessBlock(int voiceIndex, int remainingVoiceSamples) {
	for (int envelopeIndex = 0; envelopeIndex < 10; envelopeIndex++) {
		Envelope &envelope = *pEnvelope[envelopeIndex];
		EnvelopeVoiceData &envelopeVoiceData = envelope.mEnvelopeVoicesData[voiceIndex];

		double bp0 = (1 + envelope.mIsBipolar) * 0.5;
		double bp1 = (1 - envelope.mIsBipolar) * 0.5;

		// process block
		for (int sample = 0; sample < remainingVoiceSamples; sample++) {
			// update output value
			double value = envelopeVoiceData.mBlockStartAmp + (envelopeVoiceData.mBlockStep * envelopeVoiceData.mBlockDeltaAmp);
			envelope.mValue[voiceIndex] = (bp0 * value + bp1);

			// next phase
			envelopeVoiceData.mBlockStep += envelope.mRate;
		}
	}
}

But I'm still at 5%.
That seems like a lot for only 16 voices, 10 envelopes, and simply calculating a linear value for each sample... doesn't it?

Chris Walton · Post by **Chris Walton** » Wed Oct 10, 2018 1:26 pm

Again, I can't stress enough that you will want to use Visual Studio's profiler to figure out where the hot paths are. Analyze -> Performance Profiler... -> Performance Wizard, there choose the sampling path and see what comes up.

noizebox · Post by **noizebox** » Wed Oct 10, 2018 1:52 pm

If I were developing this under Linux, I would run it under Valgrind with Callgrind and Cachegrind to profile which function calls take up the majority of the execution time and to get an idea of how the cache is performing: whether there are a lot of cache misses, false sharing, etc. I'm sure someone can recommend similar tools for Windows; I haven't done much Windows development myself in recent years.

If you can extract the code and its dependencies into a self-contained file, you could paste that into https://gcc.godbolt.org/ and look at the resulting assembly for different compilers and compiler settings, to see if there is something obvious there.

PurpleSunray · Post by **PurpleSunray** » Wed Oct 10, 2018 3:13 pm

That seems like a lot for only 16 voices, 10 envelopes, and simply calculating a linear value for each sample... doesn't it?

Nah, loops inside loops inside loops are always dangerous, because they very easily multiply into an insane number of runs of the inner loop.
You do 10x16x48000(?) runs per second; that's about 7.7 million envelope calculations (all the code in the innermost loop), or about 60 MB of doubles to read and 60 MB to store (at least). 5% CPU seems OK to me, considering what you're doing.
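
The arithmetic behind those figures, as a tiny sketch. The 48000 is the guessed sample rate from the estimate above, and the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Number of inner-loop iterations per second of audio.
inline std::int64_t innerLoopRunsPerSecond(int envelopes, int voices, int sampleRate) {
    assert(envelopes >= 0 && voices >= 0 && sampleRate >= 0);
    return static_cast<std::int64_t>(envelopes) * voices * sampleRate;
}

// Bytes touched if each iteration reads (or writes) one double.
inline std::int64_t bytesAsDoubles(std::int64_t count) {
    assert(count >= 0);
    return count * static_cast<std::int64_t>(sizeof(double));
}
```

With 10 envelopes, 16 voices and 48000 samples per second this gives 7,680,000 iterations, and with 8-byte doubles 61,440,000 bytes per second for a single double per iteration.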

Try the profiler, but IMHO it won't help you much.
That code is pretty simple and straightforward; the 'problem' is just that you have a lot of stuff to calculate.
So to make it faster you need to find a way to do fewer calculations/loop runs, or simplify the math of the calculation even further.
If none of this is possible, you can try hand-writing the SSE code yourself.
That piece of code is perfect for packed (SIMD) instructions, because they will decrease the loop count.
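
A rough sketch of what a packed version could look like with SSE2, processing two doubles per iteration. It rests on assumptions that are not in the posted code: the output is written to a per-sample buffer (the posted loop overwrites a single slot), and the RampParams struct and both function names are invented. On targets without SSE2 the packed function falls back to the scalar loop.

```cpp
#include <cassert>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define RAMP_SKETCH_SSE2 1
#else
#define RAMP_SKETCH_SSE2 0
#endif

// Hypothetical bundle of the per-envelope values used in the inner loop.
struct RampParams {
    double start, step0, delta, rate, bp0, bp1;
};

// Reference version: one sample per iteration.
inline void rampScalar(const RampParams& p, double* out, int n) {
    assert(out != nullptr && n >= 0);
    double step = p.step0;
    for (int i = 0; i < n; ++i) {
        out[i] = p.bp0 * (p.start + step * p.delta) + p.bp1;
        step += p.rate;
    }
}

// Packed version: two samples per iteration where SSE2 is available.
inline void rampPacked(const RampParams& p, double* out, int n) {
    assert(out != nullptr && n >= 0);
    int i = 0;
    double step = p.step0;
#if RAMP_SKETCH_SSE2
    const __m128d vStart = _mm_set1_pd(p.start);
    const __m128d vDelta = _mm_set1_pd(p.delta);
    const __m128d vBp0 = _mm_set1_pd(p.bp0);
    const __m128d vBp1 = _mm_set1_pd(p.bp1);
    const __m128d vInc = _mm_set1_pd(2.0 * p.rate);
    // _mm_set_pd takes (high, low): low lane is sample i, high lane is i + 1.
    __m128d vStep = _mm_set_pd(p.step0 + p.rate, p.step0);
    for (; i + 1 < n; i += 2) {
        __m128d v = _mm_add_pd(vStart, _mm_mul_pd(vStep, vDelta));
        v = _mm_add_pd(_mm_mul_pd(vBp0, v), vBp1);
        _mm_storeu_pd(out + i, v);  // unaligned store of two doubles
        vStep = _mm_add_pd(vStep, vInc);
    }
    step = p.step0 + i * p.rate;  // resync the scalar step for the tail
#endif
    for (; i < n; ++i) {  // odd leftover sample, or everything without SSE2
        out[i] = p.bp0 * (p.start + step * p.delta) + p.bp1;
        step += p.rate;
    }
}
```

Because the packed loop advances the step by 2 * rate per iteration, its rounding can differ slightly from the scalar loop for arbitrary rates, so compare with a tolerance when validating against the original. An optimizing compiler may already vectorize the scalar loop, so it is worth checking the assembly before committing to intrinsics.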

2DaT · Post by **2DaT** » Thu Oct 11, 2018 2:38 pm

I would try a 64-bit build.
Also, it may be worth making a local copy of every object value, though it's hard to guess without seeing the assembly.

void EnvelopesManager::ProcessBlock(int voiceIndex, int remainingVoiceSamples) {
	for (int envelopeIndex = 0; envelopeIndex < 10; envelopeIndex++) {
		Envelope &envelope = *pEnvelope[envelopeIndex];
		EnvelopeVoiceData &envelopeVoiceData = envelope.mEnvelopeVoicesData[voiceIndex];

		double bp0 = (1 + envelope.mIsBipolar) * 0.5;
		double bp1 = (1 - envelope.mIsBipolar) * 0.5;

		// process block
		double blockStart = envelopeVoiceData.mBlockStartAmp;
		double blockStep = envelopeVoiceData.mBlockStep;
		double blockDelta = envelopeVoiceData.mBlockDeltaAmp;
		double rate = envelope.mRate;
		double * __restrict values = envelope.mValue; // renamed so it doesn't clash with the local 'value' below
		for (int sample = 0; sample < remainingVoiceSamples; sample++) {
			// update output value
			double value = blockStart + (blockStep  * blockDelta);
			values[voiceIndex] = (bp0 * value + bp1);

			// next phase
			blockStep  += rate;
		}
		envelopeVoiceData.mBlockStep = blockStep;
	}
}

Nowhk · Post by **Nowhk** » Fri Oct 12, 2018 12:35 pm

2DaT wrote: ↑Thu Oct 11, 2018 2:38 pm Also, maybe worth it to make a local copy of every object value.

Interesting! Using local copies (and storing them back after the loop) brings it down from 5% to 3%.

Not sure why this happens: since the "envelopeVoiceData.variable" fields are only read, shouldn't using a local copy or "envelopeVoiceData.variable" be the same?

In the end, they are just addresses to read (except blockStep and value, which I also write to). Very interesting...

BertKoor · Post by **BertKoor** » Fri Oct 12, 2018 12:52 pm

The difference is stack vs heap, and ability of the CPU to cache it.
Never assume your compiler optimises everything.

Any tips for optimize this code?