Delphi-PRAXiS - View single post - Delphi Floyd-Steinberg Dithering

**Kas Ob.**

Quote from Amateurprofi:

SSE, MMX
Unfortunately, pixels are 3 bytes, not 4 bytes, in pf24bit bitmaps.
Furthermore, loading the R, G, B values from memory into SSE/MMX registers and storing them from SSE/MMX registers back into memory is (in my opinion) slower than my code.
Feel free to prove me wrong.

Does it matter what value is in the fourth byte? No, it doesn't. You can load four bytes and perform the same operation on all of them; the cycle count is the same, and you simply drop the last value. If you are worried about reading past the end of the buffer, run the loop over all pixels except the last one and handle that one separately.
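As a rough illustration of the idea above, here is a minimal plain Pascal sketch (no SSE/MMX) that reads a 3-byte pixel with one 4-byte load and masks off the fourth byte, while the last pixel is read byte by byte so nothing past the buffer is touched. `ReadPixel24` and the sample row are made up for this example and are not from the code discussed in the thread.

```delphi
program Load4Bytes;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper, not from the thread: returns pixel Index of a packed
// 3-bytes-per-pixel row as $00RRGGBB (pf24bit stores B, G, R in memory).
function ReadPixel24(Row: PByte; Index, PixelCount: Integer): Cardinal;
var
  P: PByte;
begin
  P := Row;
  Inc(P, Index * 3);
  if Index < PixelCount - 1 then
    // One 4-byte load; the top byte belongs to the next pixel and is dropped.
    Result := PCardinal(P)^ and $00FFFFFF
  else
  begin
    // Last pixel: three single-byte loads, so nothing past the buffer is read.
    Result := P^;
    Inc(P);
    Result := Result or (Cardinal(P^) shl 8);
    Inc(P);
    Result := Result or (Cardinal(P^) shl 16);
  end;
end;

const
  Row: array[0..5] of Byte = (1, 2, 3, 4, 5, 6); // two pixels, made-up values
begin
  Writeln(IntToHex(ReadPixel24(@Row[0], 0, 2), 6)); // 030201
  Writeln(IntToHex(ReadPixel24(@Row[0], 1, 2), 6)); // 060504
end.
```

A SIMD version would do the same thing with wider loads; the point here is only that the extra byte costs nothing and can be ignored, as long as the final load stays inside the buffer.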

Anyway, I might find the mood and the time to do it in MMX and SSE, why not!

Quote from Amateurprofi:

CMOV
I am fully aware of that instruction.
However:
1) I would need 2 additional registers for the 0 and the 255 (CMOV with immediate values is not supported); alternatively, I could push a 0 and a 255 onto the stack, i.e. CMOV from memory.
2) Both CMOV from registers and CMOV from memory are significantly slower than my code.

1139 CMOV from registers
1123 CMOV from memory
842 my code
Times are in ms.
Maybe my code contains errors (I did not spend too much time on it).

Let's do the clamping right and then discuss it. Here is a version of that clamping, which should be very close to what we would do in SIMD (MMX/SSE):

Code:

    procedure TestCMovShort;
    asm
            push    0             // create a temp at [esp]
            mov     edi, 0
            mov     esi, 255
            mov     ecx, Count
    @1:     mov     edx, -255
    @2:     xor     eax, eax      // eax is the destination, preset to the lowest value
            cmp     edx, esi      // compare v (edx) with the high value
            cmovg   eax, esi      // signed: if v > 255, take the highest value (esi)
            cmovbe  eax, edx      // unsigned: if 0 <= v <= 255, take v; a negative v leaves eax = 0
            mov     [esp], al
    @N:     add     edx, 1
            cmp     edx, 255
            jle     @2            // signed compare, because edx starts at -255
            sub     ecx, 1
            jne     @1
    @End:   pop     eax
    end;

Four instructions and that is it: one CMP and two CMOVs are enough here. So why are they slower? In fact they are not slower; rather, your code is faster. I am not sure I can explain it clearly enough, but modern CPUs play tricks, and two of them play a big role in making your branching code faster than CMOV: branch prediction (BP) and out-of-order execution (OoOE). The OoOE window gets bigger with each CPU generation.

So what your measurement shows is basically your branching code gaining speed, not CMOV being slow. The CMOV sequence above introduces a data hazard, because three of the four consecutive instructions depend on eax. This hinders OoOE, so the sequence does not get that boost.

Now back to our case of three consecutive clamps: this puts both BP and OoOE under stress and tests their size and speed. The bigger the code gets, the more they will fail to provide the speed boost, and that is where CMOV will win.
I hope that was clear.
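To make the logic of the short CMOV sequence easier to follow, here is a minimal sketch in plain Pascal that mirrors its decision order (signed "greater" first, then unsigned "below or equal") and checks it against an ordinary branching clamp. The function names and the test range of -255 to 510 are assumptions for this illustration. The compiler is not expected to emit CMOV for this; it only shows why one CMP with two conditional moves covers all three cases.

```delphi
program ClampLogic;
{$APPTYPE CONSOLE}
{$R-,Q-} // the Cardinal cast below relies on reinterpreting the bits

// Reference clamp with ordinary branches.
function ClampBranch(V: Integer): Integer;
begin
  if V < 0 then
    Result := 0
  else if V > 255 then
    Result := 255
  else
    Result := V;
end;

// Same decision order as the xor / cmp / cmovg / cmovbe sequence.
function ClampCMovStyle(V: Integer): Integer;
begin
  Result := 0;                 // xor eax, eax
  if V > 255 then              // cmovg: signed "greater"
    Result := 255;
  if Cardinal(V) <= 255 then   // cmovbe: unsigned "below or equal"
    Result := V;               // a negative V looks huge as unsigned, so 0 stays
end;

var
  V: Integer;
  Same: Boolean;
begin
  Same := True;
  for V := -255 to 510 do
    if ClampBranch(V) <> ClampCMovStyle(V) then
      Same := False;
  Writeln(Same); // TRUE
end.
```

Note that each step in `ClampCMovStyle` writes to the same result, which is the dependency on eax mentioned above.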

I think CMOV (the short version) in the original code should perform better across a wider range of CPUs, especially ones that are a few years old. In larger and longer loops CMOV will do better. Also, on a loaded OS with heavy multitasking, less branching means less cache shuffling and hence more speed; with branching and multitasking, the L1 cache in particular will be thrashed continuously, and BP will not boost anything.

PS: My CPU is an i7-2600K, and it is old. TestMov comes in at ~890 and TestCMovShort at ~930, both tested with the code above, not with the original code with its three clamps.
