KVR Audio

PurpleSunray · Post by **PurpleSunray** » Fri Oct 12, 2018 12:53 pm

Quickly wrote down some SSE code for that and did a runtime measure. turns out that is about twice as fast.. but not sure if the code is correct and if I meassured correctly, was just a very quick test

Code: Select all


#include <emmintrin.h>

void ProcessBlock_new(int voiceIndex, int remainingVoiceSamples) 
{
	for (int envelopeIndex = 0; envelopeIndex < 10; envelopeIndex++)
	{
		Envelope &envelope = *pEnvelope[envelopeIndex];
		EnvelopeVoiceData &envelopeVoiceData = envelope.mEnvelopeVoicesData[voiceIndex];

		// load values to MMX reigsters

		const double h = 0.5;
		__m128d half = _mm_load1_pd(&h);

		__m128d rate = _mm_load1_pd(&envelope.mRate);

		__m128d blockstart = _mm_load1_pd(&envelopeVoiceData.mBlockStartAmp);
		__m128d blockdelta = _mm_load1_pd(&envelopeVoiceData.mBlockDeltaAmp);
		__m128d blockstep = _mm_load1_pd(&envelopeVoiceData.mBlockStep);

		// run the loop

		if (envelope.mIsBipolar)
		{
			for (int sample = 0; sample < remainingVoiceSamples; sample++)
			{
				//  value = envelopeVoiceData.mBlockStartAmp + (envelopeVoiceData.mBlockStep * envelopeVoiceData.mBlockDeltaAmp);
				__m128d value = _mm_add_pd(blockstart, _mm_mul_pd(blockstep, blockdelta));

				// value = (0.5 * value + 0.5);
				value = _mm_add_pd(_mm_mul_pd(half, value), half);

				// envelope.mValue[voiceIndex] = value;
				_mm_storel_pd(&envelope.mValue[voiceIndex], value);

				// blockstep += rate;
				blockstep = _mm_add_pd(blockstep, rate);
			}

			// envelopeVoiceData.mBlockStep = blockstep;
			_mm_storel_pd(&envelopeVoiceData.mBlockStep, blockstep);
		}
		else
		{
			for (int sample = 0; sample < remainingVoiceSamples; sample++)
			{
				//  value = envelopeVoiceData.mBlockStartAmp + (envelopeVoiceData.mBlockStep * envelopeVoiceData.mBlockDeltaAmp);
				_mm_storel_pd(&envelope.mValue[voiceIndex], _mm_add_pd(blockstart, _mm_mul_pd(blockstep, blockdelta)));

				// blockstep += rate;
				blockstep = _mm_add_pd(blockstep, rate);
			}

			// envelopeVoiceData.mBlockStep = blockstep;
			_mm_storel_pd(&envelopeVoiceData.mBlockStep, blockstep);
		}
	}
}

vs

Code: Select all


void ProcessBlock_old(int voiceIndex, int remainingVoiceSamples) {
	for (int envelopeIndex = 0; envelopeIndex < 10; envelopeIndex++) {
		Envelope &envelope = *pEnvelope[envelopeIndex];
		EnvelopeVoiceData &envelopeVoiceData = envelope.mEnvelopeVoicesData[voiceIndex];

		double bp0 = (1 + envelope.mIsBipolar) * 0.5;
		double bp1 = (1 - envelope.mIsBipolar) * 0.5;

		// process block
		for (int sample = 0; sample < remainingVoiceSamples; sample++) {
			// update output value
			double value = envelopeVoiceData.mBlockStartAmp + (envelopeVoiceData.mBlockStep * envelopeVoiceData.mBlockDeltaAmp);
			envelope.mValue[voiceIndex] = (bp0 * value + bp1);

			// next phase
			envelopeVoiceData.mBlockStep += envelope.mRate;
		}
	}
}

result:

Code: Select all

Start...
NEW: 100000 runs in 1031ms
OLD: 100000 runs in 2188ms

might be worth a try coding it in assembly (or intrinsics ). I'm not using any packing on the code above, so I'm pretty sure it could way faster if you spend some more time on it to actually use SIMD

PurpleSunray · Post by **PurpleSunray** » Fri Oct 12, 2018 1:11 pm

Interesting! Placing local copy (and re-store them outside the loop) switch from 5% to 3%.

Oh.. actually this also might be the reason why my code is faster

_mm_storel_pd has a cost.. so no good idea to write back mBlockStep in each loop run.

Code: Select all

envelopeVoiceData.mBlockStep += envelope.mRate;

mBlockStep resides on RAM (on the envelopeVoiceData struct).
So every time to you read or write it, it causes RAM access (the _mm_storel_pd )
If you make mBlockStep a local, there is no RAM traffic - instead mBlockStep will be stored on a CPU register ... most likley (the compiler will decide, unless you code it on assmebly and manage registers on your own while the loop is running

)

Nowhk · Post by **Nowhk** » Fri Oct 12, 2018 2:10 pm

I see, clear: thanks! Now, a further question.

What if instead of having a 1-dimension mValue array:

Code: Select all

  double mValue[PLUG_VOICES_BUFFER_SIZE];

I have a 2d array? So I can fill on sample block iteration, using it later for audio processing:

Code: Select all

  double mValue[PLUG_VOICES_BUFFER_SIZE][PLUG_MAX_PROCESS_BLOCK];

How would you access faster to it within the loop, optimizing the code?
Tried to make a "local" array and than copy in the end with:

Code: Select all

  values[PLUG_MAX_PROCESS_BLOCK]
  
  for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
    values[sampleIndex] = (bp0 * value + bp1)
  }
 
  std::memcpy(envelope.mValue[voiceIndex], values, PLUG_MAX_PROCESS_BLOCK)

But I don't really like it. It doesn't seem so performant. Any way to "fill" faster the local array values[PLUG_MAX_PROCESS_BLOCK] and in the end re-assign it to mValue[voiceIndex] position?

If I use pointer to mValue[PLUG_MAX_PROCESS_BLOCK], I'll write go/back to RAM, I guess...

IN SHORT: since I'm iterating all samples (block), doing this:

Code: Select all

  value[voiceIndex] = (bp0 * value + bp1);

do nothing hehe

I need to "store" the calculated value for the following process. Best optimization (local) way?

PurpleSunray · Post by **PurpleSunray** » Fri Oct 12, 2018 3:06 pm

You need to arrange the data so that you can easily process a bunch of it at once (if you'r after SIMD code).

Example:
Inside the loop there is a multiply+add.
The SSE instructions for mul would be https://software.intel.com/sites/landin ... d=115,3886
If you read creafully, you notice that it does a mulitplication on a 128bit register, packed with 2 64bit doubles.
If you use the float operations, you can pack 4 floats instead of 2 doubles.
Now your data should be arranged in a way so that it can easily be processed with this operations. You will most likely end up with some kind of interleaved layout, such as:
value[SampleIndex0Voice0], value[SampleIndex1Voice0], value[SampleIndex0Voice1], value[SampleIndex1Voice0]
will allow you load first two samples of voice 1 (_mm_load_pd - pack to 128bit register), than calucate it via SIMD than store it back - then next voice, or next 2 samples, or ... go figure out on your own

PurpleSunray · Post by **PurpleSunray** » Fri Oct 12, 2018 3:14 pm

If I use pointer to mValue[PLUG_MAX_PROCESS_BLOCK], I'll write go/back to RAM, I guess...

Yes, I have't understood this line from beginning

You are wirting on same value over and over agian. If there is no operator overloading other magix involved this is completely usless. The sample-by-sample increment loop becomes obselete then too, because: value = envelope.mRate * sampleIndex * blockDelta + blockStart; .. or something like that

Just calucate when you do

Code: Select all

if (envelopeVoiceData.mBlockStep >= gBlockSize) {
						// calculate new envelope values for this block. its processed every 100 samples, not so heavy as operation, so it seems I can ignore the core of my code here
					}

instead of running that increment-one-by-one loop.

Nowhk · Post by **Nowhk** » Mon Oct 15, 2018 2:45 pm

PurpleSunray wrote: ↑Fri Oct 12, 2018 3:06 pm If you read creafully, you notice that it does a mulitplication on a 128bit register, packed with 2 64bit doubles.
If you use the float operations, you can pack 4 floats instead of 2 doubles.

Not sure what you are suggesting here: convert local var from "double" to "float"? Speed instead of precision?

PurpleSunray wrote: ↑Fri Oct 12, 2018 3:06 pm Now your data should be arranged in a way so that it can easily be processed with this operations. You will most likely end up with some kind of interleaved layout, such as:
value[SampleIndex0Voice0], value[SampleIndex1Voice0], value[SampleIndex0Voice1], value[SampleIndex1Voice0]

I think its already in this way. Here's where I am:

Code: Select all

double mValue[PLUG_VOICES_BUFFER_SIZE][PLUG_MAX_PROCESS_BLOCK];

// ---

for (int envelopeIndex = 0; envelopeIndex < mNumEnvelopes; envelopeIndex++) {
	Envelope &envelope = *pEnvelope[envelopeIndex];
	EnvelopeVoiceData &envelopeVoiceData = envelope.mEnvelopeVoicesData[voiceIndex];

	// skip disabled envelopes
	if (!envelope.mIsEnabled) { continue; }

	// local copy
	double blockStep = envelopeVoiceData.mBlockStep;
	double blockStartAmp = envelopeVoiceData.mBlockStartAmp;
	double blockDeltaAmp = envelopeVoiceData.mBlockDeltaAmp;
	double values[PLUG_MAX_PROCESS_BLOCK];

	bool isBipolar = envelope.mIsBipolar;
	double amount = envelope.mAmount;
	double rate = envelope.mRate;

	double bp0 = ((1 + isBipolar) * 0.5) * amount;
	double bp1 = ((1 - isBipolar) * 0.5) * amount;

	// process block
	for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
		if (blockStep >= gBlockSize) {
			// here I'll update blockStartAmp, blockDeltaAmp and fmod blockStep, every 100 samples. but I'm ignoring this part right now
		}
	
		// update output value
		double value = blockStartAmp + (blockStep * blockDeltaAmp);
		values[sampleIndex] = (bp0 * value + bp1);

		// next phase
		blockStep += rate;
	}

	envelopeVoiceData.mBlockStep = blockStep;
	envelopeVoiceData.mBlockStartAmp = blockStartAmp;
	envelopeVoiceData.mBlockDeltaAmp = blockDeltaAmp;
	
	std::memcpy(envelope.mValue[voiceIndex], values, PLUG_MAX_PROCESS_BLOCK);
}

First voice, than each envelope, than sample-by-sample to the same voice. So it should already be:

Code: Select all

value[Voice0Envelope0Sample0]
value[Voice0Envelope0Sample1]
value[Voice0Envelope0Sample2]
...
value[Voice0Envelope0SampleBlockSize]
value[Voice0Envelope1Sample0]
value[Voice0Envelope1Sample1]
value[Voice0Envelope1Sample2]
...
value[Voice0Envelope1SampleBlockSize]
value[Voice0Envelope16Sample0]
value[Voice0Envelope16Sample1]
value[Voice0Envelope16Sample2]
...
value[Voice0Envelope16SampleBlockSize]
...
value[Voice16Envelope0Sample0]
value[Voice16Envelope0Sample1]
value[Voice16Envelope0Sample2]
...
value[Voice16Envelope0SampleBlockSize]
...
value[Voice16Envelope16Sample0]
value[Voice16Envelope16Sample1]
value[Voice16Envelope16Sample2]
...
value[Voice16Envelope16SampleBlockSize]

PurpleSunray · Post by **PurpleSunray** » Tue Oct 16, 2018 9:30 am

Question, is there any way you can get rid of this "if (blockStep >= gBlockSize)" ?
Can gBlockSize be blockSize? Or vice versa?

The "problem" right now is that blockStep is changed every loop run, feeds into next again and might even be changed depending on an if condition. Kind of worst case to optimize.
It is hard to vectorize and/or predict such code, because one value depends on next but only if... So most compilers only do loop-unroll or nothing at all.
If you can get rid of the if (blockStep >= gBlockSize)", this would be:

Code: Select all

for (int envelopeIndex = 0; envelopeIndex < mNumEnvelopes; envelopeIndex++) {
		Envelope &envelope = *pEnvelope[envelopeIndex];
		EnvelopeVoiceData &envelopeVoiceData = envelope.mEnvelopeVoicesData[voiceIndex];

		// skip disabled envelopes
		if (!envelope.mIsEnabled) { continue; }

		// here I'll update blockStartAmp, blockDeltaAmp and fmod blockStep, every blockSize. but I'm ignoring this part right now

		// local copy
		double blockStep = envelopeVoiceData.mBlockStep;
		double blockStartAmp = envelopeVoiceData.mBlockStartAmp;
		double blockDeltaAmp = envelopeVoiceData.mBlockDeltaAmp;
		double values[PLUG_MAX_PROCESS_BLOCK];

		bool isBipolar = envelope.mIsBipolar;
		double amount = envelope.mAmount;
		double rate = envelope.mRate;

		double bp0 = ((1 + isBipolar) * 0.5) * amount;
		double bp1 = ((1 - isBipolar) * 0.5) * amount;

		// process block
		for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) 
		{
			// update output value
			double value = blockStartAmp + ((blockStep + rate * sampleIndex) * blockDeltaAmp);
			values[sampleIndex] = (bp0 * value + bp1);
		}

		envelopeVoiceData.mBlockStep = blockStep + rate * blockSize;
		envelopeVoiceData.mBlockStartAmp = blockStartAmp;
		envelopeVoiceData.mBlockDeltaAmp = blockDeltaAmp;

		std::memcpy(envelope.mValue[voiceIndex], values, PLUG_MAX_PROCESS_BLOCK);
	}

Now this "process block" loop parallelises to hell. No value depends anything than a set of input values. SIMD code is easy to write now (process a bunch of 2 samples at once), compilers will be way better at optimizing this further and CPUs can pack and predict until they melt down.

btw, your memcpy is wrong, you need a sizeof double somewhere

PurpleSunray · Post by **PurpleSunray** » Wed Oct 17, 2018 1:50 pm

If you want to even further optimize: here is an example of an SSE2 code (the SIMD stuff I was talking about), with lots of comments so that you get an idea of what's going on. Measured against the C++ code above and your lastest code.

Code: Select all

SSE intrinsics: 100000 runs in 817ms
C++ removed if: 100000 runs in 1219ms
C++ with if: 100000 runs in 1781ms

Code: Select all

	for (int envelopeIndex = 0; envelopeIndex < mNumEnvelopes; envelopeIndex++) {
		Envelope &envelope = *pEnvelope[envelopeIndex];
		EnvelopeVoiceData &envelopeVoiceData = envelope.mEnvelopeVoicesData[voiceIndex];

		// skip disabled envelopes
		if (!envelope.mIsEnabled) { continue; }

		// here I'll update blockStartAmp, blockDeltaAmp and fmod blockStep, every blockSize. but I'm ignoring this part right now

		// fill a 128bit register with 2 double, set to ((1 + envelope.mIsBipolar) * 0.5) * envelope.mAmount
		__m128d bp0 = _mm_set1_pd(((1 + envelope.mIsBipolar) * 0.5) * envelope.mAmount);

		// fill a 128bit register with 2 double, set to ((1 - envelope.mIsBipolar) * 0.5) * envelope.mAmount
		__m128d bp1 = _mm_set1_pd(((1 - envelope.mIsBipolar) * 0.5) * envelope.mAmount);

		// fill a 128bit register with 2 double, set to envelopeVoiceData.mBlockStep
		__m128d blockStep = _mm_set1_pd(envelopeVoiceData.mBlockStep);

		// fill a 128bit register with 2 double, set to envelopeVoiceData.mBlockStartAmp
		__m128d blockStartAmp = _mm_set1_pd(envelopeVoiceData.mBlockStartAmp);

		// fill a 128bit register with 2 double, set to envelopeVoiceData.mBlockDeltaAmp
		__m128d blockDeltaAmp = _mm_set1_pd(envelopeVoiceData.mBlockDeltaAmp);

		// fill a 128bit register with 2 double, set to envelopeVoiceData.mRate
		__m128d rate = _mm_set1_pd(envelope.mRate);

		// fill a 128bit register with 4 int32, set to 0
		__m128i sampleIndex = _mm_set1_epi32(0);

		// fill a 128bit register with 4 int32, set to 3, 2, 1 and 0
		__m128i sampleIndexInc = _mm_set_epi32(3, 2, 1, 0);

		// run the loop - increase in step of 4 since we process 4 samples at once.
		for (int i = 0; i < blockSize; i += 4)
		{
		        // set all int32 in sampleIndex to loop run counter
			sampleIndex = _mm_set1_epi32(i);
			
			// add 3, 2, 1 and 0 to the 4 int32 in sampleIndex
			sampleIndex = _mm_add_epi32(sampleIndex, sampleIndexInc);

			// convert lower two int32 to doubles
			__m128d sampleIndexDouble = _mm_cvtepi32_pd(sampleIndex);

			// calculate (two times, on 2 doubles packed to the 128bit registers):
			//     mValue[voiceIndex][i] = bp1 + (bp0 * (blockStartAmp + ((blockStep + rate * sampleIndex) * blockDeltaAmp)))
			_mm_store_pd(&(mValue[voiceIndex][i]),
				_mm_add_pd(bp1,
					_mm_mul_pd(bp0,
						_mm_add_pd(blockStartAmp,
							_mm_mul_pd(blockDeltaAmp,
								_mm_add_pd(blockStep,
									_mm_mul_pd(rate, sampleIndexDouble)))))));

			// shift sampleIndex 64bit to right (so we get upper 2 int32 for _mm_cvtepi32_pd)
			sampleIndex = _mm_srli_si128(sampleIndex, 64);

			// convert lower two int32 to double
			sampleIndexDouble = _mm_cvtepi32_pd(sampleIndex);
			
			// calculate again
			_mm_store_pd(&(mValue[voiceIndex][i + 2]),
				_mm_add_pd(bp1,
					_mm_mul_pd(bp0,
						_mm_add_pd(blockStartAmp,
							_mm_mul_pd(blockDeltaAmp,
								_mm_add_pd(blockStep,
									_mm_mul_pd(rate, sampleIndexDouble)))))));
		}
	}

Nowhk · Post by **Nowhk** » Wed Oct 17, 2018 3:32 pm

I've made this version (which don't use SIMD stuff as you wrote):

Code: Select all

for (int envelopeIndex = 0; envelopeIndex < mNumEnvelopes; envelopeIndex++) {
	Envelope &envelope = *pEnvelope[envelopeIndex];
	EnvelopeVoiceData &envelopeVoiceData = envelope.mEnvelopeVoicesData[voiceIndex];

	// skip disabled envelopes
	if (!envelope.mIsEnabled) { continue; }

	// local copy
	EnvelopeVoiceStage stage = envelopeVoiceData.mStage;
	double blockStep = envelopeVoiceData.mBlockStep;
	double blockStepOffset = envelopeVoiceData.mBlockStepOffset;
	double blockStartAmp = envelopeVoiceData.mBlockStartAmp;
	double blockEndAmp = envelopeVoiceData.mBlockEndAmp;
	double blockDeltaAmp = envelopeVoiceData.mBlockDeltaAmp;
	double step = envelopeVoiceData.mStep;

	bool isBipolar = envelope.mIsBipolar;
	unsigned int numPoints = envelope.mNumPoints;
	unsigned int lengthInSamples = envelope.mLengthInSamples;
	unsigned int loopPointIndex = envelope.mLoopPointIndex;
	unsigned int loopLengthInSamples = envelope.mLoopLengthInSamples;
	double amount = envelope.mAmount;
	double rate = envelope.mRate;
	//double * __restrict  pValue = envelope.mValue[voiceIndex];
	double values[PLUG_MAX_PROCESS_BLOCK];
	EnvelopeLoopType loopType = envelope.mLoopType;

	// pre-computed data
	double bp0 = ((1 + isBipolar) * 0.5) * amount;
	double bp1 = ((1 - isBipolar) * 0.5) * amount;
	double value = (bp0 * blockStartAmp) + (bp0 * blockStep * blockDeltaAmp) + bp1;
	double deltaValue = bp0 * rate * blockDeltaAmp;

	// process block
	for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
		// update output value
		values[sampleIndex] = value;
		value += deltaValue;

		// next phase
		blockStep += rate;
	}

	// revert local copy
	envelopeVoiceData.mStage = stage;
	envelopeVoiceData.mBlockStep = blockStep;
	envelopeVoiceData.mBlockStepOffset = blockStepOffset;
	envelopeVoiceData.mBlockStartAmp = blockStartAmp;
	envelopeVoiceData.mBlockEndAmp = blockEndAmp;
	envelopeVoiceData.mBlockDeltaAmp = blockDeltaAmp;
	envelopeVoiceData.mStep = step;

	std::memcpy(envelope.mValue[voiceIndex], values, sizeof envelope.mValue[voiceIndex]);
}

It got the same performance as your: 3%. Tried other further semplifications (such switching from double to float), but can't go under 3%.

Probably that's the limit? Not sure, I still feel 3% is too much for such calculations

2DaT · Post by **2DaT** » Wed Oct 17, 2018 4:15 pm

PurpleSunray wrote: ↑Wed Oct 17, 2018 1:50 pm If you want to even further optimize: here is an example of an SSE2 code (the SIMD stuff I was talking about), with lots of comments so that you get an idea of what's going on. Measured against the C++ code above and your lastest code.
Code: Select all
		__m128i sampleIndexInc = _mm_set_epi32(3, 2, 1, 0);
		sampleIndex = _mm_set1_epi32(i);
		sampleIndex = _mm_add_epi32(sampleIndex, sampleIndexInc);
		__m128d sampleIndexDouble = _mm_cvtepi32_pd(sampleIndex);
		sampleIndex = _mm_srli_si128(sampleIndex, 64);
		sampleIndexDouble = _mm_cvtepi32_pd(sampleIndex);

Why would you do all of this integer stuff? Your can increment in a SIMD friendly way.

Code: Select all

__m128 i1 = _mm_set_pd(1.0,0.0);
__m128 i2 = _mm_set_pd(3.0,2.0);
__m128 increment = _mm_set1_pd(4.0);
i1 = _mm_add_pd(i1,increment);
i2 = _mm_add_pd(i2,increment);

It has a drawback of a loop carried dependency chain, but it's not a big deal for this loop.

2DaT · Post by **2DaT** » Wed Oct 17, 2018 4:38 pm

Nowhk wrote: ↑Wed Oct 17, 2018 3:32 pm I've made this version (which don't use SIMD stuff as you wrote):

Code: Select all

for (int envelopeIndex = 0; envelopeIndex < mNumEnvelopes; envelopeIndex++) {
	Envelope &envelope = *pEnvelope[envelopeIndex];
	EnvelopeVoiceData &envelopeVoiceData = envelope.mEnvelopeVoicesData[voiceIndex];

	// skip disabled envelopes
	if (!envelope.mIsEnabled) { continue; }

	// local copy
	EnvelopeVoiceStage stage = envelopeVoiceData.mStage;
	double blockStep = envelopeVoiceData.mBlockStep;
	double blockStepOffset = envelopeVoiceData.mBlockStepOffset;
	double blockStartAmp = envelopeVoiceData.mBlockStartAmp;
	double blockEndAmp = envelopeVoiceData.mBlockEndAmp;
	double blockDeltaAmp = envelopeVoiceData.mBlockDeltaAmp;
	double step = envelopeVoiceData.mStep;

	bool isBipolar = envelope.mIsBipolar;
	unsigned int numPoints = envelope.mNumPoints;
	unsigned int lengthInSamples = envelope.mLengthInSamples;
	unsigned int loopPointIndex = envelope.mLoopPointIndex;
	unsigned int loopLengthInSamples = envelope.mLoopLengthInSamples;
	double amount = envelope.mAmount;
	double rate = envelope.mRate;
	//double * __restrict  pValue = envelope.mValue[voiceIndex];
	double values[PLUG_MAX_PROCESS_BLOCK];
	EnvelopeLoopType loopType = envelope.mLoopType;

	// pre-computed data
	double bp0 = ((1 + isBipolar) * 0.5) * amount;
	double bp1 = ((1 - isBipolar) * 0.5) * amount;
	double value = (bp0 * blockStartAmp) + (bp0 * blockStep * blockDeltaAmp) + bp1;
	double deltaValue = bp0 * rate * blockDeltaAmp;

	// process block
	for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
		// update output value
		values[sampleIndex] = value;
		value += deltaValue;

		// next phase
		blockStep += rate;
	}

	// revert local copy
	envelopeVoiceData.mStage = stage;
	envelopeVoiceData.mBlockStep = blockStep;
	envelopeVoiceData.mBlockStepOffset = blockStepOffset;
	envelopeVoiceData.mBlockStartAmp = blockStartAmp;
	envelopeVoiceData.mBlockEndAmp = blockEndAmp;
	envelopeVoiceData.mBlockDeltaAmp = blockDeltaAmp;
	envelopeVoiceData.mStep = step;

	std::memcpy(envelope.mValue[voiceIndex], values, sizeof envelope.mValue[voiceIndex]);
}

It got the same performance as your: 3%. Tried other further semplifications (such switching from double to float), but can't go under 3%.

Probably that's the limit? Not sure, I still feel 3% is too much for such calculations

Hmm. std::memcpy brings no benefit here, write your array directly. Also, you don't need to write all that data to different locations. Write to a local array and utilize it for your voice computations.

You've got around 100% performance boost from nothing and you're not happy ?

Nowhk · Post by **Nowhk** » Wed Oct 17, 2018 4:54 pm

2DaT wrote: ↑Wed Oct 17, 2018 4:38 pm You've got around 100% performance boost from nothing and you're not happy ?

Haha not yet, I feel I can do better

That't the first chain of processing: the lower I go, the better is later!

2DaT wrote: ↑Wed Oct 17, 2018 4:38 pm Hmm. std::memcpy brings no benefit here, write your array directly.

Uhm, with a pointer you mean? Wouldn't this means that I have to write/read forward/back to RAM? That array (i.e. mValue) is in memory: every access to it (read and write) isn't a ram switch?
I tried with a pointer to that mValue instead of fill a local array and in the end memcopy mValue (as I do right now): it goes from 3% to 4%. It seems slower...

2DaT wrote: ↑Wed Oct 17, 2018 4:38 pm Also, you don't need to write all that data to different locations. Write to a local array and utilize it for your voice computations.

Not sure again what you mean here: do you mean create a "local copy" of EnvelopeVoiceData and access to its members instead of instance each single variable of its members?

2DaT · Post by **2DaT** » Wed Oct 17, 2018 7:19 pm

Nowhk wrote: ↑Wed Oct 17, 2018 4:54 pm
2DaT wrote: ↑Wed Oct 17, 2018 4:38 pm Also, you don't need to write all that data to different locations. Write to a local array and utilize it for your voice computations.
Not sure again what you mean here: do you mean create a "local copy" of EnvelopeVoiceData and access to its members instead of instance each single variable of its members?

Code: Select all

for(auto& voice : voices)
{
	double envelope_values[PLUG_MAX_PROCESS_BLOCK*ENVELOPE_COUNT];

	for (int envelopeIndex = 0; envelopeIndex < mNumEnvelopes; envelopeIndex++) {
		//Process envelopes here
		//no need to memcpy anything
	}

	//Process voice using envelope_values
	
	
}

You want your block size small enough for it's data to fit in cache, but large enough to minimize loop overhead.

Nowhk · Post by **Nowhk** » Wed Oct 17, 2018 7:48 pm

2DaT wrote: ↑Wed Oct 17, 2018 7:19 pm
Nowhk wrote: ↑Wed Oct 17, 2018 4:54 pm
2DaT wrote: ↑Wed Oct 17, 2018 4:38 pm Also, you don't need to write all that data to different locations. Write to a local array and utilize it for your voice computations.
Not sure again what you mean here: do you mean create a "local copy" of EnvelopeVoiceData and access to its members instead of instance each single variable of its members?
Code: Select all
for(auto& voice : voices)
{
	double envelope_values[PLUG_MAX_PROCESS_BLOCK*ENVELOPE_COUNT];

	for (int envelopeIndex = 0; envelopeIndex < mNumEnvelopes; envelopeIndex++) {
		//Process envelopes here
		//no need to memcpy anything
	}

	//Process voice using envelope_values
	
	
}
You want your block size small enough for it's data to fit in cache, but large enough to minimize loop overhead.

Uhm?

Isn't making "local copy of envelope data" and process them within the loop that increase the performance? Than, once I've processed "local copy of envelope data" I'll need to use them outside the loop (i.e. in after the sampleBlock processed for each envelope). That's why I copy back to objects in "RAM" the calculated local values.

That's actually how all its splitted (I don't have a unique function that process all DSP; its a pain to it):

Code: Select all

// main process call
for (int voiceIndex = 0; voiceIndex < PLUG_VOICES_BUFFER_SIZE; voiceIndex++) {
	pEnvelopes->ProcessBlock(blockSize, voiceIndex);
	pOscs->ProcessBlock(blockSize, voiceIndex);
}

void Envelopes::ProcessBlock(int blockSize, int voiceIndex) {
	// local copy of pEnvelopes->mValue[numVoices][sampleBuffer]

	for (int envelopeIndex = 0; envelopeIndex < mNumEnvelopes; envelopeIndex++) {
		...
		for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
			... process envelope data
		}
		...
		
		// now revert local copy to mValue for each envelope, so I can use it later
	}
}

void Oscs::ProcessBlock(int blockSize, int voiceIndex) {
	for (int oscIndex = 0; oscIndex < mNumOscs; oscIndex++) {
		...
		for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
			// example
			double value = pEnvelopes->envelopes[gainIndex]->mValue[voiceIndex][sampleIndex] * oscGain[oscIndex][sampleIndex];
			...
		}
		...
	}
}

PurpleSunray · Post by **PurpleSunray** » Wed Oct 17, 2018 8:20 pm

2DaT wrote: ↑Wed Oct 17, 2018 4:15 pm
PurpleSunray wrote: ↑Wed Oct 17, 2018 1:50 pm Why would you do all of this integer stuff? Your can increment in a SIMD friendly way.
Code: Select all
__m128 i1 = _mm_set_pd(1.0,0.0);
__m128 i2 = _mm_set_pd(3.0,2.0);
__m128 increment = _mm_set1_pd(4.0);
i1 = _mm_add_pd(i1,increment);
i2 = _mm_add_pd(i2,increment);
It has a drawback of a loop carried dependency chain, but it's not a big deal for this loop.

Because of 1 add_epi32 vs 2 add_pd. There is no point calculating doubles if you only need int, is there?

Any tips for optimize this code?