Floyd-Steinberg Dithering

**Amateurprofi**

Hi for all,
Please bare a little with me, i mean really don't take my calling the Delphi Compiler as an offense toward anyone, it is just i know it, i used to be and still earning my bread from optimizing old and new code, i know the compiler very intimately to call it a fucking centuries old brick.

I will explain just this piece of code, and if you want to expand on it, but first and foremost please, at least consider your knowledge of the code generated by Delphi is wrong(outdated or naively trusting) or unoptimized (inefficient) and will go from there to proof a contradiction, how about that ?.. it is my best method as mathematician by education.

so many agreed on div 16 is always shr 4 or to be more concise should be sar 4, i agree and i know that the compiler wrongly do it for unsigned integers, but this is not the case, as i was talking about that specific case at hand, may be it is my mistake may to not wrote an essay for each line i wrote.

So here a proof that the compiler doesn't use sar 4 or shr 4 for div 16, the proof is just look at x64 version of it !!

try this function in the above optimized version

markieren

Code:

			PROCEDURE SetPixel(XOffset,YOffset,Factor:NativeInt);

var AP:TPBGR;

begin

   // XOffset=Horizontaler Offset in Pixel

   // YOffset=Vertikaler Offset in Bytes

   AP:=P;

   Inc(AP,XOffset);

   Inc(NativeInt(AP),YOffset);

   with AP^, Delta do begin

      Blue:=EnsureRange(Blue+B*Factor shr 4,0,255);

      Green:=EnsureRange(Green+G*Factor shr 4,0,255);

      Red:=EnsureRange(Red+R*Factor shr 4,0,255);

   end;

end;

My speed shows that it is faster by 18% in Win32 and 29% on Win64 !! do it please, it is not slower by 200%, and these values as positive so no problem here.
also if you look at the generated assembly code, then this is it
Anhang 56355

Also try this

markieren

Code:

			  procedure SetPixel(XOffset, YOffset, Factor: NativeInt);

  var

    AP: TPBGR;

    v: NativeInt;

  begin

    AP := P;

    Inc(AP, XOffset);

    Inc(NativeInt(AP), YOffset);

    with AP^, Delta do

    begin

      v := Blue + B * Factor shr 4;

      if v < 0 then

        Blue := 0

      else if v > 255 then

        Blue := 255

      else

        Blue := v;

      v := Green + G * Factor shr 4;

      if v < 0 then

        Green := 0

      else if v > 255 then

        Green := 255

      else

        Green := v;

      v := Red + R * Factor shr 4;

      if v < 0 then

        Red := 0

      else if v > 255 then

        Red := 255

      else

        Red := v;

    end;

  end;

The above function makes the whole process around double the speed,

for both platform .

and again not saying that i know it all, i do mistakes, but not in this case, would love to be proven wrong, but with factual code done right not with you assumptions based on something you didn't see for sure.

My test is attached here Anhang 56359 and hope it is working unlike the above attached project as it is empty.

As for "with" the compiler might fail to generate nice assembly and will revert to shuffle the data and access them continuously on the stack introducing unneeded memory access, this happen with complex loops also, with "with" it in many case will resolve the pointer and reused it from a register, alas it seems no gain in the above mentioned function, but once the function have few more local variables and it will go 90s turbo pascal mode specially in x64 platforms, i don't have the mood to sit and tweak such case for you now, but the effect is there.

ps re-reading before posting this, i sound retarded and offended, and i am sorry for that, i don't mean to offend anyone and never meant to, just had very bad experience from an neighbor forum and trying to be triggered by personal sentences.

markieren

Delphi-Quellcode:

			    with AP^, Delta do

    begin

      v := Blue + B * Factor shr 4;

Please be aware of that the values in Delta (R,G,B) may be negative.
Assume AP.Blue=200 and Delta.B=-10 and Factor=7.
Then V will evaluated as
V := Blue + ((B * Factor) shr 4);
V := 200 + ((-10 * 7) shr 4);
V := 200 + (-70 shr 4);
V := 200 + 268435451;
V := 268435651;
Korrekt is
V := Blue + ((B * Factor) div 16);
V := 200 + ((-10 * 7) div 16);
V := 200 + (-70 div 16);
V := 200 + -4;
V := 196;

**Kas Ob.**

[QUOTE=himitsu;1528459]

Zitat von skybibo:

Please be aware of that the values in Delta (R,G,B) may be negative.
Assume AP.Blue=200 and Delta.B=-10 and Factor=7.
Then V will evaluated as
V := Blue + ((B * Factor) shr 4);
V := 200 + ((-10 * 7) shr 4);
V := 200 + (-70 shr 4);
V := 200 + 268435451;
V := 268435651;
Korrekt is
V := Blue + ((B * Factor) div 16);
V := 200 + ((-10 * 7) div 16);
V := 200 + (-70 div 16);
V := 200 + -4;
V := 196;

Thank you, and i was wrong, it is a mistake, as i was after the speed not the functionality per se, and that is wrong, i was aware of of negative values, but was after proving a different thing.

So here fixed a function that will generate the right image, (well i hope so this time

)

markieren

Code:

			  procedure SetPixel(XOffset, YOffset: NativeInt; Factor: NativeUInt);

  var

    AP: TPBGR;

    v: Int16;

  begin

    AP := P;

    Inc(AP, XOffset);

    Inc(NativeInt(AP), YOffset);

    with AP^ do

    begin

      v := Int16(Blue) + Int16(Delta.B) * Uint16(Factor) shr 4;

      if v < 0 then

        Blue := 0

      else if v > 255 then

        Blue := 255

      else

        Blue := v;

      v := Int16(Green) + Int16(Delta.G) * Uint16(Factor) shr 4;

      if v < 0 then

        Green := 0

      else if v > 255 then

        Green := 255

      else

        Green := v;

      v := Int16(Red) + Int16(Delta.R) * Uint16(Factor) shr 4;

      if v < 0 then

        Red := 0

      else if v > 255 then

        Red := 255

      else

        Red := v;

    end;

  end;

and that shows another inefficiency in the compiler as we can't control the size of the operation and it will insist on using what ever get in its little brain, so i had to use (U)Int16 to force the compiler to use iMul, losing speed in the process.
This can be fixed but will need change the structure of the data in overall, but this still faster though we went from Native(U)Int instructions to much slower 16 bit instructions.

**Mavarik**

Im Thread!

https://cs.wellesley.edu/~pmetaxas/pdcs99.pdf

**Amateurprofi**

Ich habe mir das Thema "Floyd Steinberg" noch einmal zu Gemüte geführt und einige Assembler-Prozeduren zusammengefrickelt, die das Dithering deutlich beschleunigen.
Laufzeiten für Bitmaps mit 2.4MPixel bzw. 13.3MPixel im 32Bit- und 64Bit-Modus

markieren

Code:

			Pas:   149   209   /   857   1165 ms

Pas42:   529   531   /   3048   3027 ms

Pas48:   498   534   /   2866   3021 ms

Asm1:   53,5   54.8   /   342   343 ms

Asm2:   54.1   54.8   /   309   321 ms 

Asm3:   29.3   22.4   /   181   143 ms

Um die Testprozeduren zu nutzen, müssen in der Prozedur "TestFloydSteinberg" im
Array "Filenames:Array[TTestMode] of String"
die Dateinamen entsprechend angepasst werden.

Region "PurePascal"
Den Prozeduren FloydSteinberg, FloydSteinberg42 und FloydSteinberg48,werden als Parameter die Bitmap übergeben.
FloydSteinberg42 und FloydSteinberg48 dithern entsprechend den, in der von Mavarik in #23 geposteten PDF

https://cs.wellesley.edu/~pmetaxas/pdcs99.pdf auf Seite 574 in Fig. 9 gezeigten Schemata.

Region "PureAssembler"
Die Assembler-Prozeduren FloydSteinbergAsm1, FloydSteinbergAsm2 und FloydSteinbergAsm3 werden direkt aufgerufen, als Parameter wird die Bitmap übergeben.

Region "Pascal/Assembler"
Die Prozeduren FloydSteinbergAsm1, FloydSteinbergAsm2 und FloydSteinbergAsm3 werden von der Prozedur FloydSteinbergPasAsm aufgerufen wobei folgende Parameter übergeben werden:
PP: Pointer auf das erste Pixel der Bitmap
LO: Offset zur jeweils nächsten Zeile in Bytes
W: Breite der Bitmap - 1
H: Höhe der Bitmap -1

In den Anhängen sind das Projekt "FloydSteinberg", die verwendeten Testbilder als JPEGs und die Protokolle der Umwandlungen.
Die in Schwarz/Weiß umgewandelten Bilder konnten aus mir nicht bekannten Gründen nicht hochgeladen werden.

**Kas Ob.**

You are really hardworking on this !

Are you familiar with CMOV instruction ?!!
Read about it from here

https://www.felixcloutier.com/x86/cmovcc
Also search and research example for it and its usage, and trust me you will feel real good after that and your above code will go brrrrrrr faster.

eg. That part of clamping the slow one replacing values with

markieren

Code:

			      if v < 0 then

        Blue := 0

      else if v > 255 then

        Blue := 255

      else

        Blue := v;

This could be two CMP and two CMOV, with 0 branching/jumping, will boost the speed nicely, by relieving branch prediction (removing the jmps) letting out-of-order-execution kick in unhindered.
instead of

markieren

Code:

			               jle      @BlueZero            // Ist <= 0

               cmp      eax,255

               jbe      @BlueSet             // Ist <= 255

               mov      byte[edx],255        // Blauanteil = 255 setzen

               jmp      @Green               // Gr&#1100;nanteil errechnen

@BlueZero:     xor      eax,eax              // Blau = 0

@BlueSet:      mov      [edx],al             // Blauanteil speichern

Also you could ditch all that and try SSE or MMX , both are supported on CPUs for almost 3 decades (MMX) and no need to check for CPU compatibility for it, MMX will perform the all these operation on one pixel (4 colors) in parallel, the speed should be around 4 times than simple plain linear assembly.

**Amateurprofi**

Zitat von Kas Ob.:

You are really hardworking on this !

Are you familiar with CMOV instruction ?!!
Read about it from here

https://www.felixcloutier.com/x86/cmovcc
Also search and research example for it and its usage, and trust me you will feel real good after that and your above code will go brrrrrrr faster.

eg. That part of clamping the slow one replacing values with

markieren

Code:

			      if v < 0 then

        Blue := 0

      else if v > 255 then

        Blue := 255

      else

        Blue := v;

This could be two CMP and two CMOV, with 0 branching/jumping, will boost the speed nicely, by relieving branch prediction (removing the jmps) letting out-of-order-execution kick in unhindered.
instead of

markieren

Code:

			               jle      @BlueZero            // Ist <= 0

               cmp      eax,255

               jbe      @BlueSet             // Ist <= 255

               mov      byte[edx],255        // Blauanteil = 255 setzen

               jmp      @Green               // Gr&#1100;nanteil errechnen

@BlueZero:     xor      eax,eax              // Blau = 0

@BlueSet:      mov      [edx],al             // Blauanteil speichern

Also you could ditch all that and try SSE or MMX , both are supported on CPUs for almost 3 decades (MMX) and no need to check for CPU compatibility for it, MMX will perform the all these operation on one pixel (4 colors) in parallel, the speed should be around 4 times than simple plain linear assembly.

SSE, MMX
Unfortunately Pixels are 3 Bytes, not 4 Bytes in pf24bit-Bitmaps.
Furthermore loading the r,g,b Values from memory into SSE/MMX registers and storing from SSE/MMX registers into memory is (my opinion) slower than my code.
Feel free to prove me wrong.
SSE, MMX
Unfortunately Pixels are 3 Bytes, not 4 Bytes in pf24bit-Bitmaps.
Furthermore loading the r,g,b Values from memory into SSE/MMX registers and storing from SSE/MMX registers into memory is (my opinion) slower than my code.
Feel free to prove me wrong.

CMOV
I am fully aware of that instruction.
However:
1) Would need 2 additional registers for the 0 and 255 (CMOV with #Values is not supported), alternatively I could push a 0 and a 255 on the Stack i.e. CMOV from memory.
2) Both CMOV from registers and CMOV from memory are significantly slower than my code.

1139 CMOV from registers
1123 CMOV from memory
842 my code
Times are ms.
May be, my codes contain errors (did not spend too much time).

PS:
I use the "Intel® 64 and IA-32 Architectures Software Developer’s Manual" to get informations about instructions.
(See Attachments)

zusammenfalten · markieren

Delphi-Quellcode:

			const Count=1000000000;

PROCEDURE TestCMov1;

const S:String='  ';

asm

      push     edi

      push     esi

      push     0

      mov      edi,0

      mov      esi,255

      mov      ecx,Count

@1:   mov      edx,-255

@2:   mov      eax,edx

      cmp      edx,0

      cmovl    eax,edi

      cmp      edx,255

      cmova    eax,esi

      mov      [esp],al

      add      edx,1

      cmp      edx,255

      jbe      @2

      sub      ecx,1

      jne      @1

@End: pop      ecx

      pop      esi

      pop      edi

end;

zusammenfalten · markieren

Delphi-Quellcode:

			PROCEDURE TestCMov2;

const S:String='  ';

asm

      push     0

      push     255

      push     0

      mov      edi,0

      mov      esi,255

      mov      ecx,Count

@1:   mov      edx,-255

@2:   mov      eax,edx

      cmp      edx,0

      cmovl    eax,[esp+8]

      cmp      edx,255

      cmova    eax,[esp+4]

      mov      [esp],al

      add      edx,1

      cmp      edx,255

      jbe      @2

      sub      ecx,1

      jne      @1

@End: add      esp,12

end;

zusammenfalten · markieren

Delphi-Quellcode:

			PROCEDURE TestMov;

const S:String='  ';

asm

      push     0

      mov      edi,0

      mov      esi,255

      mov      ecx,Count

@1:   mov      edx,-255

@2:   mov      eax,edx

      cmp      eax,0

      jle      @Z

      cmp      eax,255

      jbe      @S

      mov      byte[esp],255

      jmp      @N

@Z:   xor      eax,eax

@S:   mov      [esp],al

@N:   add      edx,1

      cmp      edx,255

      jbe      @2

      sub      ecx,1

      jne      @1

@End: pop      ecx

end;

zusammenfalten · markieren

Delphi-Quellcode:

			PROCEDURE Test;

var T0,T1,T2,T3:Cardinal;

begin

   T0:=GetTickCount;

   TestCMov1;

   T1:=GetTickCount;

   TestCMov2;

   T2:=GetTickCount;

   TestMov;

   T3:=GetTickCount;

   Dec(T3,T2);

   Dec(T2,T1);

   Dec(T1,T0);

   ShowMessage(Format('%D CMOV from registers'#13'%D CMOV from memory'#13+

                      '%D my code',[T1,T2,T3]));

end;

**Kas Ob.**

Zitat von Amateurprofi:

SSE, MMX
Unfortunately Pixels are 3 Bytes, not 4 Bytes in pf24bit-Bitmaps.
Furthermore loading the r,g,b Values from memory into SSE/MMX registers and storing from SSE/MMX registers into memory is (my opinion) slower than my code.
Feel free to prove me wrong.
SSE, MMX
Unfortunately Pixels are 3 Bytes, not 4 Bytes in pf24bit-Bitmaps.
Furthermore loading the r,g,b Values from memory into SSE/MMX registers and storing from SSE/MMX registers into memory is (my opinion) slower than my code.
Feel free to prove me wrong.

Does it matter what value in the fourth byte !?, no it doesn't, you can load four bytes and perform the same operation on them the cycles are the same and just drop the last value, if there is a fear of overflowing a buffer, perform the loop on all unless there except the last byte of them all.

Anyway i might find the mood and time to do it in MMX and SSE, why not !

Zitat von Amateurprofi:

CMOV
I am fully aware of that instruction.
However:
1) Would need 2 additional registers for the 0 and 255 (CMOV with #Values is not supported), alternatively I could push a 0 and a 255 on the Stack i.e. CMOV from memory.
2) Both CMOV from registers and CMOV from memory are significantly slower than my code.

1139 CMOV from registers
1123 CMOV from memory
842 my code
Times are ms.
May be, my codes contain errors (did not spend too much time).

Let do clamping right then discuss it, so here a version of that clamping which we should perform very similar or close to in SIMD (MMX/SSE)

markieren

Code:

			procedure TestCMovShort;

asm

        push    0           // creating temp at [esp]

        mov     edi, 0

        mov     esi, 255

        mov     ecx, Count

@1:     mov     edx,  - 255

@2:     xor     eax, eax      // eax is destination value filled with the lowest value

        cmp     edx, esi      // comapare v (edx) with high value

        cmovg   eax, esi      // if bigger then take the highest esi

        cmovbe  eax, edx      // if below or equal we fill value v (edx)

        mov     [esp], al

@N:     add     edx, 1

        cmp     edx, 255

        jbe     @2

        sub     ecx, 1

        jne     @1

@End:   pop     eax

end;

4 instructions and that is it, one CMP and two CMOV were enough here, why are they slower ? , they are in fact not slower but your code is faster, i am not sure if i can explain it enough to be clear, see, modern CPU do tricks, two of them play big role here in being your branching code faster than CMOV, branch prediction (BP) and Out-of-Order-Execution, OoOE window is getting bigger each CPU generation, the measurement in your code is gaining speed instead of being faster than CMOV basically, CMOV in the above are introducing data hazard by depending on eax in 3 of four consequence instructions, this will hinder OoOE, hence not gaining speed or lets say not gaining the boost, now returning to our case of 3 consequence clamping, this will put both BP and OoOE under stress and test them for size and speed, the bigger the code then they will start to fail to provide the boost in speed, here will CMOV will win.
I hope that was clear

I think CMOV (the short one) in the original code should perform better at wider CPU range specially the ones with few years back, larger and longer loop cmov will be better, also on stressed OS with many multitasking less branching means less cache shuffling, hence speed, with branching and multitasking the L1 cache specially will thrashed continuously, and BP will not boost anything.

PS my CPU is i2600k and it is old , TestMov result is ~890 and TestCMovShort is ~930, tested on the same code above not in the original with 3 clamping.

**Stevie**

Der Benchmark ist Blödsinn, denn in TestMov wird der erste jle immer genommen, also ist der Branchpredictor ziemlich happy.
Branchy Code, bei dem ein conditional branch immer genommen wird oder nie genommen wird, ist für einen solchen Test Unfug.

Generell gilt: Conditional branches sind ok, wenn sie sehr predictable sind - d.h. es wird meist der eine Branch und selten der andere genommen. Sie funktionieren auch, wenn es immer abwechselnd ist, die Branchpredictors auf modernen CPUs sind ziemlich schlau. Schlimm wird es allerdings, wenn sie nicht vorhersehbar sind - selbst wenn es im Schnitt 50/50 ist aber ob der Sprung genommen wird oder nicht, ist z.B nicht immer abwechselnd, dann wirds schlimm und man ist mit einem cmov besser aufgehoben.

Übrigens schreibt man dec ecx und nicht sub ecx, 1 - dec ist 1 byte instruction, sub benötigt 3 byte

FWIW:

https://codereview.stackexchange.com...he-range-0-255

**Kas Ob.**

Zitat von Stevie:

Der Benchmark ist Blödsinn, denn in TestMov wird der erste jle immer genommen, also ist der Branchpredictor ziemlich happy.
Branchy Code, bei dem ein conditional branch immer genommen wird oder nie genommen wird, ist für einen solchen Test Unfug.

Generell gilt: Conditional branches sind ok, wenn sie sehr predictable sind - d.h. es wird meist der eine Branch und selten der andere genommen. Sie funktionieren auch, wenn es immer abwechselnd ist, die Branchpredictors auf modernen CPUs sind ziemlich schlau. Schlimm wird es allerdings, wenn sie nicht vorhersehbar sind - selbst wenn es im Schnitt 50/50 ist aber ob der Sprung genommen wird oder nicht, ist z.B nicht immer abwechselnd, dann wirds schlimm und man ist mit einem cmov besser aufgehoben.

Übrigens schreibt man dec ecx und nicht sub ecx, 1 - dec ist 1 byte instruction, sub benötigt 3 byte

FWIW:

https://codereview.stackexchange.com...he-range-0-255

Thank you very much, i have no idea what i was thinking missing all that.

here revised version of that benchmark with some semi-real data simulation over 1kb repeated million time.

markieren

Code:

			uses

  System.SysUtils,

  Windows;

const

  Count = 1000000;

  DATA_LEN = 1024;

var

  Data: array[0..DATA_LEN - 1] of Integer;

  CData: array[0..DATA_LEN - 1] of Byte;

procedure TestMov;

asm

        mov     esi, 255

        mov     ecx, Count

@1:     mov     edi, 0

@2:     mov     edx, dword[Data + edi * 4]

        mov     eax, edx

        cmp     eax, 0

        jle     @Z

        cmp     eax, 255

        jbe     @S

        mov     Byte[CData + edi], 255

        jmp     @N

@Z:     xor     eax, eax

@S:     mov     Byte[CData + edi], al

@N:     inc     edi

        cmp     edi, DATA_LEN

        jle     @2

        dec     ecx

        jne     @1

end;

procedure TestCMovShort;

asm

        mov     esi, 255          // High Limit

        mov     ecx, Count

@1:     mov     edi, 0

@2:     mov     edx, dword[Data + edi * 4]

        XOR     eax, eax     // eax is destination value filled with the lowest value

        cmp     edx, esi     // comapare v (edx) with high value

        cmovg   eax, esi     // if bigger then take the highest esi

        cmovbe  eax, edx     // if below or equal we fill value v (edx)

        mov     Byte[CData + edi], al

        inc     edi

        cmp     edi, DATA_LEN

        jl      @2

        dec     ecx

        jne     @1

end;

procedure Test;

var

  T: Cardinal;

  i: Integer;

begin

  Randomize;

  for i := Low(Data) to High(Data) do

    Data[i] := Random(256 * 3) - 256;

  T := GetTickCount;

  TestMov;

  T := GetTickCount - T;

  Writeln('TestMov ' + IntToStr(T));

  T := GetTickCount;

  TestCMovShort;

  T := GetTickCount - T;

  Writeln('TestCMovShort ' + IntToStr(T));

end;

begin

  Test;

  Readln;

end.

and the result

markieren

Code:

			TestMov 1703

TestCMovShort 969

CMOV is almost twice faster.

**Stevie**

In den

Kommentaren verweist Peter Cordes auf einen Trick, der ermöglicht, direkt in eax zu laden und benötigt nicht das wiederholte nullen.

Ich war auch mal so frei, die non-volatilen Register korrekt zu sichern:

markieren

Code:

			procedure TestCMov_Better;

asm

        push   esi

        push   edi

        xor    edx, edx

        mov    esi, 255          // High Limit

        mov    ecx, Count

@1:     mov    edi, 0

@2:     mov    eax, dword[Data + edi * 4]

        cmp    eax, esi    // compare v (eax) with high value

        cmovae eax, edx    // if negative (unsigned check) or 255 we fill 0 (edx)

        cmovge eax, esi    // if greater (signed check) or 255 we fill 255 (esi)

        mov    Byte[CData + edi], al

        inc    edi

        cmp    edi, DATA_LEN

        jl     @2

        dec    ecx

        jne    @1

        pop  edi

        pop  esi

end;

Wenn nicht genügend Register zur Verfügung sind, können wir auch das hier (nur die Stelle mit dem Vergleich) machen - ist ein kleines bisschen langsamer aber immernoch schneller als alles andere:

markieren

Code:

			        xor    edx, edx

        cmp    eax, 255    // compare v (eax) with high value

        cmovae eax, edx    // if negative or 255 we fill 0 (edx)

        mov    edx, 255

        cmovge eax, edx    // if greater or 255 we fill 255 (edx)

Floyd-Steinberg Dithering

AW: Floyd-Steinberg Dithering

AW: Floyd-Steinberg Dithering

AW: Floyd-Steinberg Dithering

AW: Floyd-Steinberg Dithering

AW: Floyd-Steinberg Dithering

AW: Floyd-Steinberg Dithering

AW: Floyd-Steinberg Dithering

AW: Floyd-Steinberg Dithering

AW: Floyd-Steinberg Dithering

AW: Floyd-Steinberg Dithering

Forumregeln

Mavarik Registriert seit: 9. Feb 2006 Ort: Stolberg (Rhld) 4.154 Beiträge Delphi 10.3 Rio	#23 AW: Floyd-Steinberg Dithering 23. Okt 2023, 21:54 Im Thread! https://cs.wellesley.edu/~pmetaxas/pdcs99.pdf Frank Lauter Embarcadero MVP •FDK das Firemonkey-Development-Kit • Mein Blog • Schulungen für FMX •Mein YouTube Channel • Wiki unserer Delphifrühstück Usergroup
	Zitat