![]() |
AW: Floyd-Steinberg Dithering
Zitat:
Delphi-Quellcode:
Please be aware of that the values in Delta (R,G,B) may be negative.
with AP^, Delta do
begin v := Blue + B * Factor shr 4; Assume AP.Blue=200 and Delta.B=-10 and Factor=7. Then V will evaluated as V := Blue + ((B * Factor) shr 4); V := 200 + ((-10 * 7) shr 4); V := 200 + (-70 shr 4); V := 200 + 268435451; V := 268435651; Korrekt is V := Blue + ((B * Factor) div 16); V := 200 + ((-10 * 7) div 16); V := 200 + (-70 div 16); V := 200 + -4; V := 196; |
AW: Floyd-Steinberg Dithering
[QUOTE=himitsu;1528459]
Zitat:
So here fixed a function that will generate the right image, (well i hope so this time :oops: )
Code:
and that shows another inefficiency in the compiler as we can't control the size of the operation and it will insist on using what ever get in its little brain, so i had to use (U)Int16 to force the compiler to use iMul, losing speed in the process.
procedure SetPixel(XOffset, YOffset: NativeInt; Factor: NativeUInt);
var AP: TPBGR; v: Int16; begin AP := P; Inc(AP, XOffset); Inc(NativeInt(AP), YOffset); with AP^ do begin v := Int16(Blue) + Int16(Delta.B) * Uint16(Factor) shr 4; if v < 0 then Blue := 0 else if v > 255 then Blue := 255 else Blue := v; v := Int16(Green) + Int16(Delta.G) * Uint16(Factor) shr 4; if v < 0 then Green := 0 else if v > 255 then Green := 255 else Green := v; v := Int16(Red) + Int16(Delta.R) * Uint16(Factor) shr 4; if v < 0 then Red := 0 else if v > 255 then Red := 255 else Red := v; end; end; This can be fixed but will need change the structure of the data in overall, but this still faster though we went from Native(U)Int instructions to much slower 16 bit instructions. |
AW: Floyd-Steinberg Dithering
|
AW: Floyd-Steinberg Dithering
Liste der Anhänge anzeigen (Anzahl: 4)
Ich habe mir das Thema "Floyd Steinberg" noch einmal zu Gemüte geführt und einige Assembler-Prozeduren zusammengefrickelt, die das Dithering deutlich beschleunigen.
Laufzeiten für Bitmaps mit 2.4MPixel bzw. 13.3MPixel im 32Bit- und 64Bit-Modus
Code:
Um die Testprozeduren zu nutzen, müssen in der Prozedur "TestFloydSteinberg" im
Pas: 149 209 / 857 1165 ms
Pas42: 529 531 / 3048 3027 ms Pas48: 498 534 / 2866 3021 ms Asm1: 53,5 54.8 / 342 343 ms Asm2: 54.1 54.8 / 309 321 ms Asm3: 29.3 22.4 / 181 143 ms Array "Filenames:Array[TTestMode] of String" die Dateinamen entsprechend angepasst werden. Region "PurePascal" Den Prozeduren FloydSteinberg, FloydSteinberg42 und FloydSteinberg48,werden als Parameter die Bitmap übergeben. FloydSteinberg42 und FloydSteinberg48 dithern entsprechend den, in der von Mavarik in #23 geposteten PDF ![]() Region "PureAssembler" Die Assembler-Prozeduren FloydSteinbergAsm1, FloydSteinbergAsm2 und FloydSteinbergAsm3 werden direkt aufgerufen, als Parameter wird die Bitmap übergeben. Region "Pascal/Assembler" Die Prozeduren FloydSteinbergAsm1, FloydSteinbergAsm2 und FloydSteinbergAsm3 werden von der Prozedur FloydSteinbergPasAsm aufgerufen wobei folgende Parameter übergeben werden: PP: Pointer auf das erste Pixel der Bitmap LO: Offset zur jeweils nächsten Zeile in Bytes W: Breite der Bitmap - 1 H: Höhe der Bitmap -1 In den Anhängen sind das Projekt "FloydSteinberg", die verwendeten Testbilder als JPEGs und die Protokolle der Umwandlungen. Die in Schwarz/Weiß umgewandelten Bilder konnten aus mir nicht bekannten Gründen nicht hochgeladen werden. |
AW: Floyd-Steinberg Dithering
You are really hardworking on this !
Are you familiar with CMOV instruction ?!! Read about it from here ![]() Also search and research example for it and its usage, and trust me you will feel real good after that and your above code will go brrrrrrr faster. eg. That part of clamping the slow one replacing values with
Code:
This could be two CMP and two CMOV, with 0 branching/jumping, will boost the speed nicely, by relieving branch prediction (removing the jmps) letting out-of-order-execution kick in unhindered.
if v < 0 then
Blue := 0 else if v > 255 then Blue := 255 else Blue := v; instead of
Code:
Also you could ditch all that and try SSE or MMX , both are supported on CPUs for almost 3 decades (MMX) and no need to check for CPU compatibility for it, MMX will perform the all these operation on one pixel (4 colors) in parallel, the speed should be around 4 times than simple plain linear assembly.
jle @BlueZero // Ist <= 0
cmp eax,255 jbe @BlueSet // Ist <= 255 mov byte[edx],255 // Blauanteil = 255 setzen jmp @Green // Grьnanteil errechnen @BlueZero: xor eax,eax // Blau = 0 @BlueSet: mov [edx],al // Blauanteil speichern |
AW: Floyd-Steinberg Dithering
Liste der Anhänge anzeigen (Anzahl: 4)
Zitat:
Unfortunately Pixels are 3 Bytes, not 4 Bytes in pf24bit-Bitmaps. Furthermore loading the r,g,b Values from memory into SSE/MMX registers and storing from SSE/MMX registers into memory is (my opinion) slower than my code. Feel free to prove me wrong. SSE, MMX Unfortunately Pixels are 3 Bytes, not 4 Bytes in pf24bit-Bitmaps. Furthermore loading the r,g,b Values from memory into SSE/MMX registers and storing from SSE/MMX registers into memory is (my opinion) slower than my code. Feel free to prove me wrong. CMOV I am fully aware of that instruction. However: 1) Would need 2 additional registers for the 0 and 255 (CMOV with #Values is not supported), alternatively I could push a 0 and a 255 on the Stack i.e. CMOV from memory. 2) Both CMOV from registers and CMOV from memory are significantly slower than my code. 1139 CMOV from registers 1123 CMOV from memory 842 my code Times are ms. May be, my codes contain errors (did not spend too much time). PS: I use the "Intel® 64 and IA-32 Architectures Software Developer’s Manual" to get informations about instructions. (See Attachments)
Delphi-Quellcode:
const Count=1000000000;
PROCEDURE TestCMov1; const S:String=' '; asm push edi push esi push 0 mov edi,0 mov esi,255 mov ecx,Count @1: mov edx,-255 @2: mov eax,edx cmp edx,0 cmovl eax,edi cmp edx,255 cmova eax,esi mov [esp],al add edx,1 cmp edx,255 jbe @2 sub ecx,1 jne @1 @End: pop ecx pop esi pop edi end;
Delphi-Quellcode:
PROCEDURE TestCMov2;
const S:String=' '; asm push 0 push 255 push 0 mov edi,0 mov esi,255 mov ecx,Count @1: mov edx,-255 @2: mov eax,edx cmp edx,0 cmovl eax,[esp+8] cmp edx,255 cmova eax,[esp+4] mov [esp],al add edx,1 cmp edx,255 jbe @2 sub ecx,1 jne @1 @End: add esp,12 end;
Delphi-Quellcode:
PROCEDURE TestMov;
const S:String=' '; asm push 0 mov edi,0 mov esi,255 mov ecx,Count @1: mov edx,-255 @2: mov eax,edx cmp eax,0 jle @Z cmp eax,255 jbe @S mov byte[esp],255 jmp @N @Z: xor eax,eax @S: mov [esp],al @N: add edx,1 cmp edx,255 jbe @2 sub ecx,1 jne @1 @End: pop ecx end;
Delphi-Quellcode:
PROCEDURE Test;
var T0,T1,T2,T3:Cardinal; begin T0:=GetTickCount; TestCMov1; T1:=GetTickCount; TestCMov2; T2:=GetTickCount; TestMov; T3:=GetTickCount; Dec(T3,T2); Dec(T2,T1); Dec(T1,T0); ShowMessage(Format('%D CMOV from registers'#13'%D CMOV from memory'#13+ '%D my code',[T1,T2,T3])); end; |
AW: Floyd-Steinberg Dithering
Zitat:
Anyway i might find the mood and time to do it in MMX and SSE, why not ! Zitat:
Code:
4 instructions and that is it, one CMP and two CMOV were enough here, why are they slower ? , they are in fact not slower but your code is faster, i am not sure if i can explain it enough to be clear, see, modern CPU do tricks, two of them play big role here in being your branching code faster than CMOV, branch prediction (BP) and Out-of-Order-Execution, OoOE window is getting bigger each CPU generation, the measurement in your code is gaining speed instead of being faster than CMOV basically, CMOV in the above are introducing data hazard by depending on eax in 3 of four consequence instructions, this will hinder OoOE, hence not gaining speed or lets say not gaining the boost, now returning to our case of 3 consequence clamping, this will put both BP and OoOE under stress and test them for size and speed, the bigger the code then they will start to fail to provide the boost in speed, here will CMOV will win.
procedure TestCMovShort;
asm push 0 // creating temp at [esp] mov edi, 0 mov esi, 255 mov ecx, Count @1: mov edx, - 255 @2: xor eax, eax // eax is destination value filled with the lowest value cmp edx, esi // comapare v (edx) with high value cmovg eax, esi // if bigger then take the highest esi cmovbe eax, edx // if below or equal we fill value v (edx) mov [esp], al @N: add edx, 1 cmp edx, 255 jbe @2 sub ecx, 1 jne @1 @End: pop eax end; I hope that was clear :oops: I think CMOV (the short one) in the original code should perform better at wider CPU range specially the ones with few years back, larger and longer loop cmov will be better, also on stressed OS with many multitasking less branching means less cache shuffling, hence speed, with branching and multitasking the L1 cache specially will thrashed continuously, and BP will not boost anything. PS my CPU is i2600k and it is old , TestMov result is ~890 and TestCMovShort is ~930, tested on the same code above not in the original with 3 clamping. |
AW: Floyd-Steinberg Dithering
Der Benchmark ist Blödsinn, denn in TestMov wird der erste jle immer genommen, also ist der Branchpredictor ziemlich happy.
Branchy Code, bei dem ein conditional branch immer genommen wird oder nie genommen wird, ist für einen solchen Test Unfug. Generell gilt: Conditional branches sind ok, wenn sie sehr predictable sind - d.h. es wird meist der eine Branch und selten der andere genommen. Sie funktionieren auch, wenn es immer abwechselnd ist, die Branchpredictors auf modernen CPUs sind ziemlich schlau. Schlimm wird es allerdings, wenn sie nicht vorhersehbar sind - selbst wenn es im Schnitt 50/50 ist aber ob der Sprung genommen wird oder nicht, ist z.B nicht immer abwechselnd, dann wirds schlimm und man ist mit einem cmov besser aufgehoben. Übrigens schreibt man
Delphi-Quellcode:
und nicht
dec ecx
Delphi-Quellcode:
- dec ist 1 byte instruction, sub benötigt 3 byte
sub ecx, 1
FWIW: ![]() |
AW: Floyd-Steinberg Dithering
Zitat:
here revised version of that benchmark with some semi-real data simulation over 1kb repeated million time.
Code:
and the result
uses
System.SysUtils, Windows; const Count = 1000000; DATA_LEN = 1024; var Data: array[0..DATA_LEN - 1] of Integer; CData: array[0..DATA_LEN - 1] of Byte; procedure TestMov; asm mov esi, 255 mov ecx, Count @1: mov edi, 0 @2: mov edx, dword[Data + edi * 4] mov eax, edx cmp eax, 0 jle @Z cmp eax, 255 jbe @S mov Byte[CData + edi], 255 jmp @N @Z: xor eax, eax @S: mov Byte[CData + edi], al @N: inc edi cmp edi, DATA_LEN jle @2 dec ecx jne @1 end; procedure TestCMovShort; asm mov esi, 255 // High Limit mov ecx, Count @1: mov edi, 0 @2: mov edx, dword[Data + edi * 4] XOR eax, eax // eax is destination value filled with the lowest value cmp edx, esi // comapare v (edx) with high value cmovg eax, esi // if bigger then take the highest esi cmovbe eax, edx // if below or equal we fill value v (edx) mov Byte[CData + edi], al inc edi cmp edi, DATA_LEN jl @2 dec ecx jne @1 end; procedure Test; var T: Cardinal; i: Integer; begin Randomize; for i := Low(Data) to High(Data) do Data[i] := Random(256 * 3) - 256; T := GetTickCount; TestMov; T := GetTickCount - T; Writeln('TestMov ' + IntToStr(T)); T := GetTickCount; TestCMovShort; T := GetTickCount - T; Writeln('TestCMovShort ' + IntToStr(T)); end; begin Test; Readln; end.
Code:
CMOV is almost twice faster.
TestMov 1703
TestCMovShort 969 |
AW: Floyd-Steinberg Dithering
In den
![]() Ich war auch mal so frei, die non-volatilen Register korrekt zu sichern:
Code:
Wenn nicht genügend Register zur Verfügung sind, können wir auch das hier (nur die Stelle mit dem Vergleich) machen - ist ein kleines bisschen langsamer aber immernoch schneller als alles andere:
procedure TestCMov_Better;
asm push esi push edi xor edx, edx mov esi, 255 // High Limit mov ecx, Count @1: mov edi, 0 @2: mov eax, dword[Data + edi * 4] cmp eax, esi // compare v (eax) with high value cmovae eax, edx // if negative (unsigned check) or 255 we fill 0 (edx) cmovge eax, esi // if greater (signed check) or 255 we fill 255 (esi) mov Byte[CData + edi], al inc edi cmp edi, DATA_LEN jl @2 dec ecx jne @1 pop edi pop esi end;
Code:
xor edx, edx
cmp eax, 255 // compare v (eax) with high value cmovae eax, edx // if negative or 255 we fill 0 (edx) mov edx, 255 cmovge eax, edx // if greater or 255 we fill 255 (edx) |
Alle Zeitangaben in WEZ +1. Es ist jetzt 02:16 Uhr. |
Powered by vBulletin® Copyright ©2000 - 2025, Jelsoft Enterprises Ltd.
LinkBacks Enabled by vBSEO © 2011, Crawlability, Inc.
Delphi-PRAXiS (c) 2002 - 2023 by Daniel R. Wolf, 2024 by Thomas Breitkreuz