Any tips for optimize this code?
-
- KVRian
- 853 posts since 13 Mar, 2012
@Nowhk
The local copy of your envelope data is again in RAM (on the stack). So the memcpy copies from RAM to RAM - pretty useless.
You gain performance from a "local copy" if the value fits into a CPU register. Then the compiler might decide to load it from RAM once and keep it in the register during the loop runs.
~~ ॐ http://soundcloud.com/mfr ॐ ~~
- KVRian
- Topic Starter
- 878 posts since 2 Oct, 2013
How can I ensure the data I need (for that loop) stays in CPU registers during the loop runs, then?PurpleSunray wrote: ↑Wed Oct 17, 2018 8:33 pm @Nowhk
The local copy of your envelope data is again in RAM (on the stack). So the memcpy copies from RAM to RAM - pretty useless.
You gain performance from a "local copy" if the value fits into a CPU register. Then the compiler might decide to load it from RAM once and keep it in the register during the loop runs.
-
- KVRian
- 631 posts since 21 Jun, 2013
Your code also contains these instructions:PurpleSunray wrote: ↑Wed Oct 17, 2018 8:20 pm Because of 1 add_epi32 vs 2 add_pd. There is no point calculating doubles if you only need ints, is there?2DaT wrote: ↑Wed Oct 17, 2018 4:15 pmPurpleSunray wrote: ↑Wed Oct 17, 2018 1:50 pm Why would you do all of this integer stuff? You can increment in a SIMD-friendly way. It has the drawback of a loop-carried dependency chain, but that's not a big deal for this loop.Code: Select all
__m128d i1 = _mm_set_pd(1.0, 0.0);
__m128d i2 = _mm_set_pd(3.0, 2.0);
__m128d increment = _mm_set1_pd(4.0);

i1 = _mm_add_pd(i1, increment);
i2 = _mm_add_pd(i2, increment);
_mm_set1_epi32 which is a cross-domain mov and shuffle.
_mm_cvtepi32_pd * 2, which is roughly the equivalent of an add, but sometimes worse.
_mm_srli_si128 - a shift or shuffle.
-
- KVRian
- 853 posts since 13 Mar, 2012
@2Dat OK
@Nowhk you can't.
The compiler will decide. All you can do is make it easy for it. You can do so by knowing your CPU (your CPU will have 16 XMM registers to play with, most likely).
You did so already, by storing some values in a local double instead of reading them from the struct. That way the compiler can emit code that reads the value from RAM once and keeps it in an XMM register. When you read from the struct directly, the compiler can't keep it in a register, because you asked to read it from the struct.
~~ ॐ http://soundcloud.com/mfr ॐ ~~
- KVRian
- Topic Starter
- 878 posts since 2 Oct, 2013
Exactly: I put most data into locals, so the compiler will use registers instead of accessing RAM.PurpleSunray wrote: ↑Wed Oct 17, 2018 9:11 pm You did so already, by storing some values on a local double instead of reading it from the struct. So the compiler can build code to read it once from RAM and keep it on a XMM register. When you read from struct directly, compiler can't keep it on register, because you want to read it from sturct
That's why I memcpy "later": I'm filling a local array to avoid RAM access. But once I've filled it (at the end of the for loop) I need to copy it back to the array in RAM, otherwise I'd lose the results.
Why do you say the local copy of my envelope data is again in RAM (on the stack)?
I'm missing this point, since (as you said) I'm "storing some values on a local double instead of reading it from the struct".
- KVRAF
- 15294 posts since 8 Mar, 2005 from Utrecht, Holland
We are the KVR collective. Resistance is futile. You will be assimilated.
My MusicCalc is served over https!!
-
- KVRAF
- 4007 posts since 8 Jan, 2005 from Hamilton, New Zealand
While I can't comment on the local/SIMD situation (it seems like you guys have that covered), I did some code review of my own on the original and found some redundancies. Also, performance-wise, I don't recommend splitting code out into its own function unless two or more call sites use it.
blockSize isn't needed, and since remainingSamples isn't unsigned you can safely subtract the macro from it and it will still terminate the loop. In my experience for loops often compile to better code than while loops, depending on the compiler. Also, just basic stuff - it may help the compiler optimize when the code's intent is clearer. Or not. Compilers are super-crazy nowadays. Speaking of which, I recommend compiling with nuwen MinGW if you can, not VC++. CodeLite is an excellent IDE which picks up MinGW easily. Much faster code output - often up to 20% in my benchmarks (with -march=native). It won't auto-SIMD until -O3, but sometimes that's not faster anyway.
Code: Select all
// buffer
for (int remainingSamples = nFrames; remainingSamples > 0; remainingSamples -= PLUG_MAX_PROCESS_BLOCK)
{
    // voices (buffer is 32: 16 simultaneous + 16 free slots for the cutoff of previous ones)
    for (int voiceIndex = 0; voiceIndex != PLUG_VOICES_BUFFER_SIZE; ++voiceIndex)
    {
        if (pVoiceManager->mVoices[voiceIndex].mIsPlaying)
        {
            // process at most PLUG_MAX_PROCESS_BLOCK samples per outer iteration
            for (int remainingVoiceSamples = (remainingSamples < PLUG_MAX_PROCESS_BLOCK) ? remainingSamples : PLUG_MAX_PROCESS_BLOCK; remainingVoiceSamples != 0; --remainingVoiceSamples)
            {
                for (int envelopeIndex = 0; envelopeIndex != mNumEnvelopes; ++envelopeIndex)
                {
                    Envelope &envelope = *pEnvelope[envelopeIndex];
                    EnvelopeVoiceData &envelopeVoiceData = envelope.mEnvelopeVoicesData[voiceIndex];
                    if (envelope.mIsEnabled)
                    {
                        // process block
                        if (envelopeVoiceData.mBlockStep >= gBlockSize)
                        {
                            // calculate new envelope values for this block; it runs every ~100 samples,
                            // so it's not a heavy operation and the core of the code can be ignored here
                        }
                        // update output value
                        const double value = envelopeVoiceData.mBlockStartAmp + (envelopeVoiceData.mBlockStep * envelopeVoiceData.mBlockDeltaAmp);
                        envelope.mValue[voiceIndex] = ((1 + envelope.mIsBipolar) / 2.0 * value + (1 - envelope.mIsBipolar) / 2.0) * envelope.mAmount;
                        // next phase
                        envelopeVoiceData.mBlockStep += envelope.mRate;
                        envelopeVoiceData.mStep += envelope.mRate;
                    }
                }
                ++pVoiceManager->mVoices[voiceIndex].mSampleIndex;
            }
        }
    }
}
Hope it helps.
I make music: progressive-acoustic | electronica/game-soundtrack work | progressive alt-metal
Win 10/11 Simplifier | Also, Specialized C++ containers
-
- KVRian
- 853 posts since 13 Mar, 2012
Your envelope data is not stored in a CPU register only.Nowhk wrote: ↑Thu Oct 18, 2018 7:36 amExactly I place most data on local register, so it will use them instead of accessing the RAM.PurpleSunray wrote: ↑Wed Oct 17, 2018 9:11 pm You did so already, by storing some values on a local double instead of reading it from the struct. So the compiler can build code to read it once from RAM and keep it on a XMM register. When you read from struct directly, compiler can't keep it on register, because you want to read it from sturct
That's why "later" I memcopy: I'm filling a local array, avoid cpu to access RAM. But once I've filled it (at the end of the for) I need to copy back to the array in ram, else I'll lost the calculations
Why you said the local copy of your envelope data again is on RAM (on stack)?
I miss this point, since (as you already said) I've "storing some values on a local double instead of reading it from the struct"
An SSE2 CPU has 16 XMM registers, but your envelope data is an array of 64 doubles.
That doesn't fit into the register file, so it is allocated on the stack.
An example with some dummy code to show the difference.
Code: Select all
for (int i = 0; i < n; i++) {
envelope.mValue[i] = envelope.data0 + envelope.data1;
}
Code: Select all
for (int i = 0; i < n; i++) {
mov XMM0, envelope.data0; // Load envelope.data0 from RAM to register
mov XMM1, envelope.data1; // Load envelope.data1 from RAM to register
add XMM0, XMM1, XMM2; // Add XMM0 to XMM1 and store on XMM2
mov envelope.mValue[i], XMM2; // Store the result to RAM
}
Code: Select all
double d0 = envelope.data0;
double d1 = envelope.data1;
for (int i = 0; i < n; i++) {
envelope.mValue[i] = d0 + d1;
}
Code: Select all
mov XMM0, envelope.data0; // Load envelope.data0 from RAM to register
mov XMM1, envelope.data1; // Load envelope.data1 from RAM to register
for (int i = 0; i < n; i++) {
add XMM0, XMM1, XMM2; // Add XMM0 to XMM1 and store on XMM2
mov envelope.mValue[i], XMM2; // Store the result to RAM
}
Now you try to do that:
Code: Select all
double d0 = envelope.data0;
double d1 = envelope.data1;
double values[64];
for (int i = 0; i < n; i++) {
values[i] = d0 + d1;
}
Code: Select all
mov XMM0, envelope.data0; // Load envelope.data0 from RAM to register
mov XMM1, envelope.data1; // Load envelope.data1 from RAM to register
sub esp, 512; // grow the stack by the size of double[64] and remember the address (that is values)
for (int i = 0; i < n; i++) {
add XMM0, XMM1, XMM2; // Add XMM0 to XMM1 and store on XMM2
mov values[i], XMM2; // Store the result to RAM (on stack, as allocated above).
}
In general - as soon as you need a pointer to a value, it cannot live in a register only.
CPU registers do not have memory addresses. So the fact that you are passing a double* to memcpy already tells you that the array cannot live in a CPU register only; it needs a representation in RAM, because you need its address.
~~ ॐ http://soundcloud.com/mfr ॐ ~~
-
- KVRian
- 853 posts since 13 Mar, 2012
And I still don't get this:
Why all these increments?
Code: Select all
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
// update output value
values[sampleIndex] = value;
value += deltaValue;
// next phase
blockStep += rate;
}
Why not this?
Code: Select all
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
values[sampleIndex] = value + deltaValue * sampleIndex;
}
blockStep += rate * blockSize;
This "value += deltaValue;" makes it hard to vectorize. A dumb compiler won't do, while the second verson is easy to vectorize, even for a dumb compiler.
Long story:
The register on FPU can be packet with multiple values. You can do an operation on 2 doubles (SSE) or 4 doubles (AVX) in parallel. But.. "value += deltaValue" means, "calucate one after the other".. you need the result of the previous operation to do next. Hard to do calcuate two values in parallel if one depends on the previous.
On the other hand, calucalting "= value + deltaValue * sampleIndex;" in parallel is incredible easy. Just load sampleIndex register with [index][index+1] and calcuate two values for the cost of one (more or less.. )
Try changing it - MSVC seems to run a dumb compiler. Got a boost of about 50% by replacing += with *
~~ ॐ http://soundcloud.com/mfr ॐ ~~
-
- KVRAF
- 7412 posts since 17 Feb, 2005
Extraneous variables are not a concern when using SSA optimization (in GCC this is enabled to varying degrees at all -O levels).
However, the typing of the variables might matter. For example, if two different conditional comparisons use an operand based on a single value, BUT that value is typecast to a second type (which effectively changes the underlying "machine mode") for one of the two comparisons, the compiler might generate a second value when only one is necessary.
I say might because I don't know.
So this code here has 3 valid comparisons of literal constants against an incrementer that is typecast in 2 of its 3 uses. Notice that one cast is to a narrower type and the other to a wider type, but the static limit of the loop also fits in the range of the smallest type used with the incrementer.
Code: Select all
for (int i = 0; ; i++) // 'i' should be assigned a runtime input, NOT a compile-time constant.
{
if ( i == 10 ) { printf("10\n"); }
if ( (char)i == 'C' ) { printf("C\n"); }
if ( (long long int)i == 255ll ) { printf("break"); break; }
}
Anyway, back on topic: I have observed that loop incrementers used for pointer dereferencing work best with a type the same size as the platform's pointer size. Use size_t to index your arrays and pointers, but be aware that too many of them cost more in memory.
-
- KVRAF
- 4007 posts since 8 Jan, 2005 from Hamilton, New Zealand
It depends on how they're used. In this case it did matter, so I restructured the code.camsr wrote: ↑Thu Oct 18, 2018 1:55 pm Extraneous variables are not a concern when using SSA optimization (in GCC this is enabled to varying degrees at all -O levels).
I make music: progressive-acoustic | electronica/game-soundtrack work | progressive alt-metal
Win 10/11 Simplifier | Also, Specialized C++ containers
- KVRian
- Topic Starter
- 878 posts since 2 Oct, 2013
Got it. I always thought it would be "inserted" in chunks: if it doesn't all fit, it would place part of it, process it, then swap in another part. It seems I was wrong.PurpleSunray wrote: ↑Thu Oct 18, 2018 9:57 am double values[64] will not go into a register, because it doesn't fit. Instead it allocates on stack
That said, in your opinion is it better to access an array on the stack "directly", or through a pointer?
I mean, this way:
Code: Select all
double mValue[PLUG_VOICES_BUFFER_SIZE][PLUG_MAX_PROCESS_BLOCK];
...
double *pValue = envelope.mValue[voiceIndex];
*pValue++ = value;
or this way:
Code: Select all
double mValue[PLUG_VOICES_BUFFER_SIZE][PLUG_MAX_PROCESS_BLOCK];
...
envelope.mValue[voiceIndex][sampleIndex] = value;
You're right again: this really speeds up the process! Thanks!PurpleSunray wrote: ↑Thu Oct 18, 2018 11:14 am Try changing it - MSVC seems to run a dumb compiler. Got a boost of about 50% by replacing += with *
-
- KVRian
- 853 posts since 13 Mar, 2012
It shouldn't make much of a difference, but if you are down to counting CPU cycles:
Code: Select all
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
    values[voiceIndex][sampleIndex] = value;
}
For writing to values, the compiler needs to load the values base address into a register, then add voiceIndex * PLUG_MAX_PROCESS_BLOCK * sizeof(double), then add sampleIndex * sizeof(double), and then it has the right address to write to.
The compiler might optimize away the first add, since it is the same on every loop run, so you get:
Code: Select all
double* values = envelope.mValue[voiceIndex];
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
    values[sampleIndex] = value;
}
Now it is down to loading an address and adding sampleIndex.
Still it is not optimal, because there is still sampleIndex.
What we actually do is increase sampleIndex, then add sampleIndex to the values address.
So these are again 2 adds. You can reduce it to 1 add in total by removing sampleIndex and increasing the pointer directly (this is not always possible though - if you need sampleIndex inside the loop you cannot remove it).
Code: Select all
double* start = envelope.mValue[voiceIndex];
double* end = start + blockSize;
for (double* ptr = start; ptr < end; ptr++) {
    *ptr = value;
}
~~ ॐ http://soundcloud.com/mfr ॐ ~~
-
- KVRian
- 853 posts since 13 Mar, 2012
Nope, you were right.Nowhk wrote: ↑Fri Oct 19, 2018 7:10 amGot it. I've always thought that it will be "inserted" in chunks, so if it doesn't fit all, it will place "part of it, elaborate, than swap with another part". It seems I was wrong.PurpleSunray wrote: ↑Thu Oct 18, 2018 9:57 am double values[64] will not go into a register, because it doesn't fit. Instead it allocates on stack
And to further confuse you: it actually works in chunks now. They are just way smaller than you were thinking: 2 doubles if you turned on SSE, or 4 if you turned on AVX.
A single instruction can load or store (or add or mul) 2 doubles at once. That chunk processing is the speed gain you saw after you removed the += (vectorized code vs one-by-one code).
Last edited by PurpleSunray on Fri Oct 19, 2018 8:15 am, edited 1 time in total.
~~ ॐ http://soundcloud.com/mfr ॐ ~~
- KVRian
- Topic Starter
- 878 posts since 2 Oct, 2013
Which I can't, since I use sampleIndex internally. Something else I've noticed makes a difference: changing this:PurpleSunray wrote: ↑Fri Oct 19, 2018 7:32 am So this are again 2 adds. You can reduce it to 1 add in total by removing sampleIndex and increaseing the pointer directly (this is not always possible thou - if you need sampleIndex inside the loop you cannot remove it).
Code: Select all
double mValue[PLUG_VOICES_BUFFER_SIZE][PLUG_MAX_PROCESS_BLOCK];
to this:
Code: Select all
double mValue[PLUG_VOICES_BUFFER_SIZE * PLUG_MAX_PROCESS_BLOCK];