Rawhed's Tutorial #5: MMX Coding

[Introduction]
[MMX Registers]
[MMX Instructions]
[Saturated vs Wraparound]
[MMX Instructions - EMMS]
[MMX Instructions - Moving 32/64bits]
[MMX Instructions - Addition & Subtraction]
[MMX Instructions - Shifting]
[MMX Instructions - Logical Instructions]
[MMX Instructions - Multiply]
[MMX Instructions - Comparing]
[MMX Instructions - Packing/Unpacking]
[Little Tips]
[Some Implementation Ideas - Vector Rotation]
[Some Implementation Ideas - ARGB pixels]
[Some Implementation Ideas - Byte Shifting]
[Some Implementation Ideas - Crossfading]
[Some Implementation Ideas - Blurring]
[Some Implementation Ideas - Complex Multiplications]
[Closing Words]

---==[Introduction]==---------------------------------------------------------
When Intel released MMX I thought that it sucked! I thought
this until I tried it out a few months ago, and it's actually VERY cool. I
think the reason no one took it seriously was Intel's marketing. I mean look
at the P3 - it's supposed to enhance your internet experience. "What crap!"
everyone thinks. hehe. The P3 has very cool SSE instructions which are basically
Intel's reply to AMD's 3DNow! technology - but more about that in another tutorial.
I've seen a few demos starting to use MMX, which is very cool. My current
demo engine checks if the machine has MMX/3DNow! and if so, uses the appropriate
functions optimised using those instructions. I think I've started off badly
with all my rambling, so let me give you a bit of background about MMX.
Basically MMX is a set of instructions for the Pentium range of machines
and is Intel's first big change to their x86 instruction set since the 386 (1985).
There are 57 new instructions in all. The instructions are very good for
multimedia type processing - things like audio, video, imagery etc. MMX
also comes with 8 new 64bit registers (well, sort of) and uses SIMD
(Single Instruction Multiple Data), which basically means that one instruction
can handle multiple pieces of data in parallel. No one is totally sure what MMX
stands for because Intel have never said, however it seems most people
agree it's MultiMedia eXtensions.
Most machines these days support MMX. I'm not sure if modern-day compilers
optimise using MMX, or if big commercial programs use MMX much - but they
should! It's high time everyone accepts MMX as standard. Even Cyrix
and AMD machines support it. I hope to see more MMX demo stuff too :)
Maybe it will help demos have more particles/polygons/whatever than
ever before.
Anyways, enough gibberish - on with the tut.
---==[MMX Registers]==--------------------------------------------------------
There are 8 new MMX registers. They are called MM0, MM1, MM2...MM7.
They are actually not really new, because physically on the chip
they are not there - instead they use the floating point
registers. As you know the registers in the FP unit are 80bits wide.
When an MMX instruction writes to a register, the sign bit (bit 79) and
the exponent part of that FP register are filled with 1's, and the
remaining 64bits of the FP register is where the MMX register lies.
So basically the MMX registers are aliased onto the floating
point registers. What this means is that while using MMX you can't
use FP instructions. You have to execute EMMS when you have finished a block
of MMX code, before any FP code runs.
MMX instructions work with the 64bit registers in various ways - depending
on the instruction. There are 4 new ways the instructions can look at the
64bit data:
Packed Bytes - 64bits divided into 8 bytes.
Packed Words - 64bits divided into 4 words.
Packed Dwords - 64bits divided into 2 dwords.
Quadword - 64bits undivided.
So if you are using an instruction that works on packed bytes,
it will perform 8 operations - one on each byte. Each byte will
be treated as an independent entity and will not touch any of
the other bytes. The same goes for the packed words and dwords.
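If it helps to see the four views outside of assembler, here is a rough Python sketch that splits one 64bit value into lanes. It is only a model for playing around with: the helper name split_lanes and the sample value are made up for this illustration.

```python
# Rough model of how one 64-bit MMX register can be viewed as packed lanes.
# split_lanes is a hypothetical helper name used only for this sketch.

def split_lanes(value, bits):
    """Return the lanes of a 64-bit value, lowest lane first."""
    mask = (1 << bits) - 1
    return [(value >> shift) & mask for shift in range(0, 64, bits)]

reg = 0x0011223344556677  # arbitrary sample contents
for name, bits in (("bytes", 8), ("words", 16), ("dwords", 32), ("qword", 64)):
    print(name, [hex(lane) for lane in split_lanes(reg, bits)])
```

The same 64 bits come back as 8, 4, 2 or 1 values depending on the lane size, which is all the "packed" datatypes really are.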
---==[MMX Instructions]==-----------------------------------------------------
MMX instructions are all pretty much formatted the same way:
instruction dest,source
The mmx instructions often have suffixes which describe
a)signed/unsigned operation
b)saturated/wraparound operation
c)whether the instruction works on packed bytes,words,dwords or qword.
For example the instruction padd can be used as:
paddusw MM2,mem1 (add unsigned, saturated operation using packed words)
paddb MM2,mem1 (add using wraparound, on packed bytes)
The destination is always an MMX register here, and the unsigned saturated form only exists for bytes and words.
Some MMX instructions only work on certain datatypes, so I've indicated this when I describe each instruction.
---==[Saturated vs Wraparound]==----------------------------------------------
You will see that some of the instructions support something called
"saturation". This is a very cool new thing in MMX that stops
wraparounds (overflows) from happening when you exceed the datarange
limits. For example:
mov al,250
add al,10 ;al is now equal to 4 (260-256). This is wraparound/overflow.
mov eax,250
mov ebx,10
movd MM0,eax
movd MM1,ebx
paddusb MM0,MM1 ;unsigned saturated add on packed bytes
movd eax,MM0 ;eax = 255. This is saturation.
Similarly if we'd been dealing with unsigned 16bit words then they would saturate
at 65535 and zero. If it's signed saturation, then the clipping values will be the
signed limits of that datatype, eg for bytes: -128..127. (That is why the example
uses paddusb and not paddsb: as a signed byte 250 is really -6, so paddsb would give 4.)
In the above example of saturation, the paddusb is actually
doing 8 additions and saturations, all at the same time!
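The three behaviours are easy to mix up, so here is a small Python sketch of what happens to a single data element in each case. The function names are made up for this illustration, and the bits parameter is just the element size (8 for bytes, 16 for words).

```python
# Scalar model of the three kinds of addition on one data element.
# The function names are hypothetical; "bits" is the element size.

def add_wraparound(a, b, bits=8):
    """Keep only the low bits, like a normal x86 ADD or PADDB."""
    return (a + b) & ((1 << bits) - 1)

def add_unsigned_saturated(a, b, bits=8):
    """Clamp to the unsigned maximum, like PADDUSB/PADDUSW."""
    return min(a + b, (1 << bits) - 1)

def add_signed_saturated(a, b, bits=8):
    """Clamp to the signed range, like PADDSB/PADDSW (a and b are signed)."""
    low = -(1 << (bits - 1))
    high = (1 << (bits - 1)) - 1
    return max(low, min(high, a + b))

print(add_wraparound(250, 10))          # 4
print(add_unsigned_saturated(250, 10))  # 255
print(add_signed_saturated(100, 100))   # 127
```

Note that the signed version only makes sense if you think of the byte as -128..127, so a byte holding 250 counts as -6 there.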
---==[MMX Instructions - EMMS]==----------------------------------------------
Since the MMX registers are using the same space that the FPU uses, they
can't be used simultaneously. EMMS must always be executed after a block
of MMX code, otherwise when FP code is executed after it, stack overflows
and wrong answers will arise as it'll be using residual MMX data. The only
problem is that EMMS can be very slow (up to around 50 cycles, depending on
the processor). AMD fixed this problem with their FEMMS instruction, which does
the same job except it's a lot faster (around 5 cycles), although it leaves the
contents of the registers undefined afterwards.
---==[MMX Instructions - Moving 32/64bits]==----------------------------------
These instructions are very important because they are how you move data
into/around/from the MMX registers. There are 2 MMX data moving instructions,
MOVD and MOVQ. As I'm sure you've already worked out, MOVD moves 32bits of data
and MOVQ moves 64bits. Here are their operands:
MOVD dest,src
MOVD MMXreg,x86reg/Mem
MOVD x86reg/Mem,MMXreg
MOVQ MMXreg/Mem,MMXreg/Mem
However you cannot have both the destination and source operands as memory addresses!
This is a major bummer because otherwise nice fast 64bit memory copies could be done.
MOVD can be used to load data into an MMX register from the normal x86 registers. So if
you want to move the 32bit value of EAX into MM3 you would use MOVD MM3,eax. This
fills the lower 32bits of MM3 with the value of eax and fills the upper 32bits with zeros.
This is what is often used to get data into the MMX registers for them to play with.
After your MMX routine is done and you want to get the result you can use a MOVD eax,MM3,
which copies the lower 32bits of MM3 into eax.
MOVD always has an MMX register on one side and an x86 register or memory address on the
other, so to copy one MMX register to another you use MOVQ (which copies all 64bits).
MOVQ can't access the normal 32bit registers so you have to use MOVD to load/unload x86 register
data. MOVQ is used to load/unload data to/from memory. For example if you have an ARGB
memory buffer you can use MOVQ MM0,mem1 to load 2 pixels (64bits) into register MM0. To put the
data back once it's been through your MMX routine just use something like MOVQ mem2,MM0.
---==[MMX Instructions - Addition & Subtraction]==----------------------------
PADD and PSUB are the base mmx addition and subtraction instructions.
Applying suffixes to them allows you to specify whether you are
wanting it to be a signed/unsigned and wraparound/saturated instruction.
These instructions can accept MMX registers or memory addresses as
source operands, but only MMX registers as destination operands:
PADDx dest,src
PADDx MMXreg,MMXreg/Mem
PSUBx MMXreg,MMXreg/Mem
Here are the add/sub instructions and on what datatypes they work:
PADD (packed wraparound add) - byte - word - dword
PADDS (packed signed saturated add) - byte - word
PADDUS (packed unsigned saturated add) - byte - word
PSUB (packed wraparound sub) - byte - word - dword
PSUBS (packed signed saturated sub) - byte - word
PSUBUS (packed unsigned saturated sub) - byte - word
Now that you know the datatypes that they work on you can just
add the (b, w or d) suffix to the instructions eg:
PADDB, PADDW, PADDD - each one for a different data type
PADDSB, PADDSW - each one for a different data type
PADDUSB, PADDUSW - each one for a different data type
So PADDB works on packed bytes and PADDW works on packed words - but how?
This is the beauty of MMX - it does things in parallel. A PADDB will
do 8 additions. Here is how:
MM0 - |008|000|005|000|255|000|001|045| 8 bytes(64bits)
MM1 - |000|057|005|000|005|000|001|002| 8 bytes(64bits)
PADDB MM0,MM1
result(mm1 unchanged):
MM0 - |008|057|010|000|004|000|002|047| 8 bytes(64bits)
This will add each of the 8 pairs of bytes and put the resulting 8 bytes into the
destination operand (MM0). PADDB is a wraparound instruction of course, so
255+5=4.
If we were using the PADDUSB instruction it would have worked the same, except
for the 255+5, which saturates at 255. (PADDSB would not do this, because as a
signed byte 255 is really -1, and -1+5=4 is well within range.)
MM0 - |008|000|005|000|255|000|001|045| 8 bytes(64bits)
MM1 - |000|057|005|000|005|000|001|002| 8 bytes(64bits)
PADDUSB MM0,MM1
result(mm1 unchanged):
MM0 - |008|057|010|000|255|000|002|047| 8 bytes(64bits)
One more example now, except using packed words and unsigned saturated subtraction:
MM0 - |001234|000010|000005|008516| 4 words(64bits)
MM1 - |000001|000020|000001|009343| 4 words(64bits)
PSUBUSW MM0,MM1
result(mm1 unchanged):
MM0 - |001233|000000|000004|000000| 4 words(64bits)
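To double check lane tables like the ones above, here is a rough Python model that applies an operation to every lane independently. It models a wraparound byte add (PADDB), an unsigned saturated byte add (PADDUSB) and an unsigned saturated word subtract (PSUBUSW). The helper names from_lanes and packed_op are made up for this sketch, and lanes are listed highest first, the same way the tables are written.

```python
# Model of packed add/subtract: the same operation runs on every lane,
# and no lane can carry or borrow into its neighbour.
# from_lanes and packed_op are hypothetical helper names.

def from_lanes(values, bits):
    """Build a 64-bit value from lanes listed highest first, as in the tables."""
    result = 0
    for value in values:
        result = (result << bits) | (value & ((1 << bits) - 1))
    return result

def packed_op(dst, src, bits, op):
    """Apply op to each pair of lanes independently and repack the results."""
    mask = (1 << bits) - 1
    result = 0
    for shift in range(0, 64, bits):
        a = (dst >> shift) & mask
        b = (src >> shift) & mask
        result |= (op(a, b) & mask) << shift
    return result

def paddb(dst, src):
    return packed_op(dst, src, 8, lambda a, b: a + b)            # wraps at 256

def paddusb(dst, src):
    return packed_op(dst, src, 8, lambda a, b: min(a + b, 255))  # clamps at 255

def psubusw(dst, src):
    return packed_op(dst, src, 16, lambda a, b: max(a - b, 0))   # clamps at 0

mm0 = from_lanes([8, 0, 5, 0, 255, 0, 1, 45], 8)
mm1 = from_lanes([0, 57, 5, 0, 5, 0, 1, 2], 8)
print(hex(paddb(mm0, mm1)))
print(hex(paddusb(mm0, mm1)))
```

The only lane where the two byte adds disagree is the 255+5 one: wraparound gives 4 and unsigned saturation gives 255.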
---==[MMX Instructions - Shifting]==-----------------------------------------
These instructions are very similar to the old x86 SHL and SHR instructions,
only they are very cool because they work on the different packed formats
so they can shift multiple values in one instruction. Here are the base
shifting instructions and what datatypes they act on:
PSLL (Packed Shift Left Logical) - word - dword - qword
PSRA (Packed Shift Right Arithmetic) - word - dword
PSRL (Packed Shift Right Logical) - word - dword - qword
So once again (just like the padd & psub instructions) just add the suffixes
to the base instruction name to get the instruction name that works on a
certain datatype. Eg:
PSLLW - does a left logical shift on the packed word datatype
PSRAD - does a right arithmetic shift on the packed dword datatype
These shifting instructions are all formatted the same way as the SHL and SHR
instructions:
instruction dest, shiftamount
PSLLW MMXreg, MMXreg/Mem/Immed
eg: PSLLW MM1, 3
The PSLLx and PSRLx instructions are all basically the same. They shift
the bits to the left/right and fill the low/high order bits with zeros.
Here is an example of the PSRLW instruction:
MM4(64bits, 4words):
|0000100001001100|0000000000011111|0000011000001100|1111110000000000|
PSRLW MM4,5 (packed logical shift to the right by 5)
MM4(64bits, 4words):
|0000000001000010|0000000000000000|0000000000110000|0000011111100000|
As you can see from this example zeros fill the high order bits and the
low order bits that shift right too far are killed. The PSLLx instruction
works just like this except that it shifts to the left and the low order
bits are filled with zeros.
Both of those sets of instructions are called "logical" while the PSRAx
instructions are "arithmetic". This is basically calling them unsigned
and signed instructions. The arithmetic instruction takes into account
whether the data is positive/negative. PSRAx shifts data to the right.
If the data element is positive then it fills the high order bits of
the destination with zeros. If the data element is negative then it fills
the high order bits of the destination with ones. Remember that
a data element is negative if its highest bit is set (1). Here is an
example of how it works:
MM4(64bits, 4words):
|0000100001001100|0000000000011111|1000011000001100|1111110000000000|
PSRAW MM4,5 (packed arithmetic shift to the right by 5)
MM4(64bits, 4words):
|0000000001000010|0000000000000000|1111110000110000|1111111111100000|
It's a GREAT pity that the shifting instructions don't work on the packed
byte datatype, otherwise we could shift 8 bytes at a time and if using ARGB
data this would be invaluable! Oh well.. We can get around this by doing
the hack mentioned later on in this tut, in the section called "some
implementation ideas".
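Here is a rough Python model of the logical and arithmetic right shifts on packed words, handy for checking bit patterns by hand. The function names simply mirror the instruction names, from_words is a made-up helper, and the sketch assumes a shift count of 0..15.

```python
# Model of PSRLW (logical) and PSRAW (arithmetic) on 4 packed words.
# Assumes 0 <= count <= 15. from_words is a hypothetical helper.

def from_words(words):
    """Build a 64-bit value from 4 words listed highest first."""
    result = 0
    for word in words:
        result = (result << 16) | (word & 0xFFFF)
    return result

def psrlw(value, count):
    result = 0
    for shift in range(0, 64, 16):
        word = (value >> shift) & 0xFFFF
        result |= (word >> count) << shift      # zeros come in from the top
    return result

def psraw(value, count):
    result = 0
    for shift in range(0, 64, 16):
        word = (value >> shift) & 0xFFFF
        if word & 0x8000:
            word -= 0x10000                     # reinterpret as a negative number
        result |= ((word >> count) & 0xFFFF) << shift  # sign bit is copied in
    return result

mm4 = from_words([0b0000100001001100, 0b0000000000011111,
                  0b1000011000001100, 0b1111110000000000])
for name, op in (("PSRLW", psrlw), ("PSRAW", psraw)):
    shifted = op(mm4, 5)
    print(name, "|".join(format((shifted >> s) & 0xFFFF, "016b") for s in (48, 32, 16, 0)))
```

The two results only differ in the words whose top bit was set, which is exactly the signed/unsigned difference described above.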
---==[MMX Instructions - Logical Instructions]==------------------------------
These are your MMX equivalents of the bitwise instructions like AND, OR and XOR.
They only work on 64bits (qword) so the instruction is formatted:
instruction dest,src
instruction MMXreg,MMXreg/Mem
There are 4 MMX bitwise instructions: pand, pandn, por and pxor.
PAND works just like normal ANDing except that it's being applied to 64bits.
As a refresher:
0 AND 0 = 0
1 AND 1 = 1
1 AND 0 = 0
0 AND 1 = 0
PANDN (AND NOT) first inverts the bits of the destination, then ANDs the result with the source.
In the table below the inverted destination bit is shown in brackets:
0(1) ANDN 0 = 0
1(0) ANDN 1 = 0
1(0) ANDN 0 = 0
0(1) ANDN 1 = 1
POR:
0 OR 0 = 0
1 OR 1 = 1
1 OR 0 = 1
0 OR 1 = 1
PXOR(exclusive OR):
0 XOR 0 = 0
1 XOR 1 = 0
1 XOR 0 = 1
0 XOR 1 = 1
I often PXOR MM7,MM7 to make my MM7 register==0. This is very useful when
doing packing/unpacking as you will see later.
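PANDN is the one that trips people up, because it is the destination that gets inverted and not the source. A minimal Python sketch of the four operations on 64bit values, with the function names simply mirroring the instruction names:

```python
# Model of the four MMX bitwise instructions on 64-bit values.
MASK64 = (1 << 64) - 1

def pand(dst, src):
    return dst & src

def pandn(dst, src):
    return (~dst & MASK64) & src   # NOT is applied to the destination only

def por(dst, src):
    return dst | src

def pxor(dst, src):
    return dst ^ src

print(bin(pandn(0b1100, 0b1010)))  # 0b10
print(pxor(0x1234, 0x1234))        # 0
```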
---==[MMX Instructions - Multiply]==------------------------------------------
The 3 MMX multiplication instructions all operate on signed 16bit words. Each
multiplication produces a 32bit result, and the instructions differ in which
part of it they keep. The 3 instructions are:
PMADD (Packed Multiply Add) - word-->dword
PMULH (Packed Multiply High) - word
PMULL (Packed Multiply Low) - word
All 3 work as:
instruction dest,src
instruction MMXreg,MMXreg/Mem
PMADDWD multiplies each of the 4 words in the source operand with the corresponding
word in the destination operand - producing 4 dwords. The lower two
dwords are added together and stored as 1 dword in the lower 32bits of the
destination register. The same is done for the higher 2 dwords, except
that they are stored in the highest 32bits of the destination register.
You can see why the suffix of the instruction is "WD", because it takes input
of words but the output is in packed dwords. This instruction could be very
useful for a variety of things. Complex number multiplication can benefit
from this instruction immensely as it requires 4 multiplications and two
additions. Also imagine how easily it could do 2 lots of (x*x)+(y*y) in parallel!
PMULHW multiplies each of the 4 words in the source operand with the corresponding
word in the destination operand. This again produces 4 32bit numbers,
so it discards the lower 16bits of each result and stores the higher 16bits
in the corresponding word of the destination operand.
PMULLW does the same as PMULHW except it discards the higher 16bits and
stores the lower 16bits of the multiplication result.
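A rough Python model of the three multiplies, treating every word as a signed 16bit number. The helper names signed16 and word_lanes are made up for this sketch; the other functions mirror the instruction names.

```python
# Model of PMULLW, PMULHW and PMADDWD on 4 packed signed words.
# signed16 and word_lanes are hypothetical helper names.

def signed16(word):
    return word - 0x10000 if word & 0x8000 else word

def word_lanes(value):
    """Signed words of a 64-bit value, lowest lane first."""
    return [signed16((value >> shift) & 0xFFFF) for shift in range(0, 64, 16)]

def pmullw(dst, src):
    result = 0
    for i, (a, b) in enumerate(zip(word_lanes(dst), word_lanes(src))):
        result |= ((a * b) & 0xFFFF) << (16 * i)          # keep the low 16 bits
    return result

def pmulhw(dst, src):
    result = 0
    for i, (a, b) in enumerate(zip(word_lanes(dst), word_lanes(src))):
        result |= (((a * b) >> 16) & 0xFFFF) << (16 * i)  # keep the high 16 bits
    return result

def pmaddwd(dst, src):
    a, b = word_lanes(dst), word_lanes(src)
    low = (a[0] * b[0] + a[1] * b[1]) & 0xFFFFFFFF   # lower pair of products
    high = (a[2] * b[2] + a[3] * b[3]) & 0xFFFFFFFF  # upper pair of products
    return (high << 32) | low

# Two lots of (x*x)+(y*y) at once: (5,12) in the high half, (3,4) in the low half.
q = (5 << 48) | (12 << 32) | (3 << 16) | 4
r = pmaddwd(q, q)
print(r >> 32, r & 0xFFFFFFFF)  # 169 25
```

PMULHW and PMULLW together give you the full 32bit product of each lane if you need it, one half per instruction.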
---==[MMX Instructions - Comparing]==------------------------------------------
Yes, there are even some new additions to the CMP family! :)
They are quite weird, let me introduce them:
PCMPEQ (Packed Compare for Equality) - byte - word - dword
PCMPGT (Packed Compare for Greater Than) - byte - word - dword
From that you can work out all the actual instructions:
PCMPEQB, PCMPEQW, PCMPEQD
PCMPGTB, PCMPGTW, PCMPGTD
PCMPEQx compares the data elements (whatever their size) in the source
operand to those in the destination operand. If they are equal, 1's
are written to that part of the destination operand, if not then
0's are written. So you end up with a destination operand comprised of
zero and FF(PCMPEQB)/FFFF(PCMPEQW)/FFFFFFFF(PCMPEQD) data elements.
PCMPGTx does the same as PCMPEQx, except that if the data in the destination
data element is greater than the data in the source data element (compared as
signed numbers), 1's are written to the destination data element, otherwise
0's are written.
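Here is a rough Python model of the byte versions of the compares. The helper names signed8 and packed_compare are made up for this sketch. Note that the greater-than compare works on signed values, so a byte of FF counts as -1.

```python
# Model of PCMPEQB and PCMPGTB: each byte lane becomes all 1s or all 0s.
# signed8 and packed_compare are hypothetical helper names.

def signed8(byte):
    return byte - 0x100 if byte & 0x80 else byte

def packed_compare(dst, src, predicate):
    result = 0
    for shift in range(0, 64, 8):
        a = signed8((dst >> shift) & 0xFF)
        b = signed8((src >> shift) & 0xFF)
        if predicate(a, b):
            result |= 0xFF << shift
    return result

def pcmpeqb(dst, src):
    return packed_compare(dst, src, lambda a, b: a == b)

def pcmpgtb(dst, src):
    return packed_compare(dst, src, lambda a, b: a > b)  # signed dest > source

print(hex(pcmpeqb(0x11223344, 0x11003300)))
print(hex(pcmpgtb(0x8001, 0x7FFF)))
```

The all-1s/all-0s results are meant to be used as masks with PAND/PANDN/POR, since there are no flags to branch on per lane.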
---==[MMX Instructions - Packing/Unpacking]==--------------------------------
I've left a very important set of instructions till last. I'm not sure why..
perhaps it's all about saving the best till last. These aren't a magical
set of instructions which will make your coding amazingly fast. They are
very important because they allow you to control the format of data going
into the MMX registers so that you can use its parallelism. They are also
pretty cool because some of them perform saturation too. Often your data
won't be in the nice format needed for parallel number crunching, and these
instructions can convert it into this format. You then use your cool
MMX function on it and pack the numbers back into their original format.
PACKSS (Pack Signed Saturated) - byte<--word - word<--dword
PACKUS (Pack Unsigned Saturated) - byte<--word
PUNPCKH (Unpack High Data) - byte-->word - word-->dword - dword-->qword
PUNPCKL (Unpack Low Data) - byte-->word - word-->dword - dword-->qword
This whole "-->" thing might seem confusing, but it's the same as it is for the
PMADDWD instruction. Look at PACKSS and PUNPCKH, it means that the
instructions are:
PACKSSWB, PACKSSDW
PUNPCKHBW, PUNPCKHWD, PUNPCKHDQ
The packing instructions take larger data elements and convert them
to smaller data elements(eg word<--dword). The unpacking instructions
take smaller data elements and convert them to larger data elements.
PACKUSWB: First off a saturation check is performed on the (signed) words
in the source and destination operands. If the word is negative it makes
it 0, and if the word is greater than 255 (the maximum value of a byte) it
clips it to 255. Now in the source and destination operands you have
16bit words with values between 0 and 255. All it does now is collect
these 8 bytes and put them into the destination operand. First the
4 words of the destination are put into the lower 4 bytes of the destination
operand, and then the 4 words of the source are put into the upper
4 bytes of the destination operand. Here is how:
MM0 - |008000|000005|000230|001045| 4 words(64bits)
MM1 - |-00005|024525|002345|000112| 4 words(64bits)
PACKUSWB MM0,MM1 (Pack Unsigned Saturated Words to Bytes)
result of saturation:
(done in processor so MM1 doesn't actually change)
MM0 - |000255|000005|000230|000255| 4 words(64bits)
MM1 - |000000|000255|000255|000112| 4 words(64bits)
final result:
MM0 - |000|255|255|112|255|005|230|255| 8 bytes(64bits)
PACKSSx does the same as PACKUSWB except that because it's signed it
saturates if the number is bigger than 127 or less than
-128. Also it can convert from dword to word (saturating at 32767 and -32768).
PUNPCKHx only works on the higher 32bits of the destination and source
operands. It takes data elements from each and interleaves them into
the destination operand. Here is an example (in hex for convenience):
MM0 |AF|45|0E|8A|12|67|FF|00| 8 bytes(64bits)
MM1 |11|91|AB|5C|93|B8|0F|09| 8 bytes(64bits)
PUNPCKHBW MM0,MM1(Unpack High Data from Bytes to Words)
result:
MM0 |11AF|9145|AB0E|5C8A| 4 words(64bits)
PUNPCKLx does the same except that it takes data from the lower 32bits of the
source and destination operand. eg:
MM0 |AF|45|0E|8A|12|67|FF|00| 8 bytes(64bits)
MM1 |11|91|AB|5C|93|B8|0F|09| 8 bytes(64bits)
PUNPCKLBW MM0,MM1(Unpack Low Data from Bytes to Words)
result:
MM0 |9312|B867|0FFF|0900| 4 words(64bits)
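The byte ordering of these instructions is easy to get backwards, so here is a rough Python model of PACKUSWB, PUNPCKLBW and PUNPCKHBW that can be run against the hex examples above. The helper names byte_lanes, word_lanes and join_bytes are made up for this sketch, and lanes are handled lowest first internally.

```python
# Model of PACKUSWB, PUNPCKLBW and PUNPCKHBW.
# byte_lanes, word_lanes and join_bytes are hypothetical helper names.

def byte_lanes(value):
    return [(value >> shift) & 0xFF for shift in range(0, 64, 8)]

def word_lanes(value):
    return [(value >> shift) & 0xFFFF for shift in range(0, 64, 16)]

def join_bytes(lanes):
    result = 0
    for index, lane in enumerate(lanes):
        result |= (lane & 0xFF) << (8 * index)
    return result

def packuswb(dst, src):
    def saturate(word):
        signed = word - 0x10000 if word & 0x8000 else word
        return max(0, min(255, signed))
    # destination words fill the low 4 bytes, source words the high 4 bytes
    return join_bytes([saturate(w) for w in word_lanes(dst) + word_lanes(src)])

def punpcklbw(dst, src):
    pairs = zip(byte_lanes(dst)[:4], byte_lanes(src)[:4])
    return join_bytes([b for pair in pairs for b in pair])

def punpckhbw(dst, src):
    pairs = zip(byte_lanes(dst)[4:], byte_lanes(src)[4:])
    return join_bytes([b for pair in pairs for b in pair])

mm0 = 0xAF450E8A1267FF00
mm1 = 0x1191AB5C93B80F09
print(format(punpckhbw(mm0, mm1), "016X"))  # 11AF9145AB0E5C8A
print(format(punpcklbw(mm0, mm1), "016X"))  # 9312B8670FFF0900
```

In each resulting word the destination byte ends up in the low half and the source byte in the high half, which is why unpacking against a zeroed register turns bytes into zero-extended words.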
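One might wonder why on earth you need such WEIRD instructions? Well let
me give you an example. Let's say we want to average 2 ARGB
pixels from different memory locations. If we put the 2 pixels
into different MMX registers and added them as packed bytes, each channel
sum could exceed 255, so we would get wraparound (or a clipped sum if we
used saturation) and the average would be wrong. Unpacking the bytes to
words first gives every channel enough room for the full sum. Look:
pxor MM7,MM7 ;MM7=0, used as the zero operand for unpacking/packing
movd MM0,[edi] ;load pixel 1
movd MM1,[esi] ;load pixel 2
punpcklbw MM0,MM7 ;expand the 4 bytes of pixel 1 into 4 words
punpcklbw MM1,MM7 ;expand the 4 bytes of pixel 2 into 4 words
paddusw MM0,MM1 ;add the channels (max 510, fits easily in a word)
psrlw MM0,1 ;/2
packuswb MM0,MM7 ;convert the 4 words back into 4 bytes
movd [esi],MM0 ;store the averaged pixel over pixel 2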
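To see why the detour through words matters, here is a small Python sketch that averages two ARGB pixels channel by channel, next to a version where the sum is forced to stay inside a byte. Both function names are made up for this illustration.

```python
# Averaging two 32-bit ARGB pixels, with and without word-sized headroom.
# Both function names are hypothetical.

def average_argb(pixel1, pixel2):
    """Average channel by channel, with room for the full 9-bit sum."""
    result = 0
    for shift in (0, 8, 16, 24):
        a = (pixel1 >> shift) & 0xFF      # like unpacking a byte into a word
        b = (pixel2 >> shift) & 0xFF
        total = a + b                     # up to 510, which fits in a word
        result |= (total >> 1) << shift   # divide by 2, back to a byte
    return result

def average_argb_bytes_only(pixel1, pixel2):
    """What you get if the sum wraps around inside a byte first."""
    result = 0
    for shift in (0, 8, 16, 24):
        total = (((pixel1 >> shift) & 0xFF) + ((pixel2 >> shift) & 0xFF)) & 0xFF
        result |= (total >> 1) << shift
    return result

print(hex(average_argb(0xFF80FF00, 0xFF8001FF)))             # 0xff80807f
print(hex(average_argb_bytes_only(0xFF80FF00, 0xFF8001FF)))  # 0x7f00007f
```

The second result is what the lost carry bit costs you: any channel whose sum passes 255 comes out far too dark.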
---==[Little Tips]==---------------------------------------------------------
These are little tips/tricks, a few I've made myself and most I've collected:
Making an MMX register=0
PXOR MM0, MM0
Filling all 64bits of an MMX register with 1s.
PCMPEQB MM1, MM1
Compute the absolute difference of 2 unsigned numbers.
(shown for packed bytes; use PSUBUSW instead for packed words)
Input: MM0: source operand
MM1: source operand
Output: MM0: The absolute difference of the unsigned operands
MOVQ MM2, MM0 ; make a copy of MM0
PSUBUSB MM0, MM1 ; compute difference one way
PSUBUSB MM1, MM2 ; compute difference the other way
POR MM0, MM1 ; OR them together
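The absolute difference trick works because one of the two saturated subtractions always clamps to zero, so ORing them leaves just the positive one. A rough Python model for packed bytes (abs_diff_bytes is a made-up name, psubusb mirrors the instruction):

```python
# Model of the absolute-difference trick on 8 packed unsigned bytes.
# abs_diff_bytes is a hypothetical name for the whole 4-instruction sequence.

def psubusb(dst, src):
    result = 0
    for shift in range(0, 64, 8):
        a = (dst >> shift) & 0xFF
        b = (src >> shift) & 0xFF
        result |= max(a - b, 0) << shift   # unsigned saturation clamps at 0
    return result

def abs_diff_bytes(x, y):
    return psubusb(x, y) | psubusb(y, x)   # one side is always 0 per lane

print(hex(abs_diff_bytes(0x10FF0005, 0x20010005)))  # 0x10fe0000
```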
---==[Some Implementation Ideas - Vector Rotation]==---------------------------
LSD/Meltdown actually gave me this idea after implementing it in his 3D engine.
Whether you're using a 12 or 9 multiplication rotation formula, you can execute
those multiplications in parallel - making it a lot faster.
---==[Some Implementation Ideas - ARGB pixels]==--------------------------------
The Alpha-Red-Green-Blue pixel format is perfect for fast manipulation with MMX.
Doing a 64bit read you can load 2 of these pixels into each register. From there
on you are free to use MMX's parallelism to the max. You can now process 2
32bit pixels (8 colour channels) per instruction - adding them, subtracting them - and
all with or without automatic saturation.
Here is an example of a 320x200x32bpp loop which additively copies a buffer
onto another buffer, and saturates each channel at 255:
;ASM 32bpp MMX adding
mov edi,[dest]
mov esi,[src]
mov ecx,32000 ;320x200 pixels, 2 pixels (8 bytes) per loop
@MMX_layeraddloop:
movq MM0,[edi] ;Move QUAD(64bits) from dest
movq MM1,[esi] ;Move QUAD(64bits) from src
paddusb MM0,MM1 ;Saturated Add
movq [edi],MM0 ;Move QUAD(64bits) back to dest
add esi,8
add edi,8
dec ecx
jnz @MMX_layeraddloop
EMMS ;Must always do this after a bout of
;MMX instructions
It's fast.
---==[Some Implementation Ideas - Byte Shifting]==------------------------------
One of the weird things with MMX shifting is that it doesn't do shifting for the
byte data element. This would be very handy for things like ARGB pixel manipulation.
What this forces you to do is load the 2 pixels in 2 registers, unpack them to words,
manipulate them, then repack them. Very long process. I've got a method which is
a bit of a hack (as usual), but is faster.
Look at this simple example below of shifting:
source data: |pppaaa|hhhrrr| (1 word)
word shift 2: |00pppa|aahhhr| (1 word)
byte shift 2: |00pppa|00hhhr| (1 word)
You can see that the only difference between word and byte shifting is that there
are zeros in the byte shift where the overflows occur from the word shift. This
can easily be fixed by masking those bits off. So depending on the amount
we shift by, and the direction of the shift, a different mask will have to be used:
shr1mask = 0111111101111111011111110111111101111111011111110111111101111111b
shr2mask = 0011111100111111001111110011111100111111001111110011111100111111b
shr3mask = 0001111100011111000111110001111100011111000111110001111100011111b
shl1mask = 1111111011111110111111101111111011111110111111101111111011111110b
shl2mask = 1111110011111100111111001111110011111100111111001111110011111100b
shl3mask = 1111100011111000111110001111100011111000111110001111100011111000b
In most tight loops the shift amount and direction is fixed, so you can use this method,
however where the shift isn't constant it won't be so good. Here is an example of
loading 2 32bit pixels and byte shifting it using this method:
MOVQ MM7,[shr3mask] ;loads 64bit mask
MOVQ MM0,[edi] ;loads 2 32bit pixels
PSRLW MM0,3 ;shifts word elements 3 to the right.
PAND MM0,MM7 ;mask off irrelevant bits
MOVQ [edi],MM0 ;put modified pixels back.
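If you want to convince yourself that the shift-then-mask hack really matches a true per-byte shift, here is a rough Python check. The names shr_mask, shift_bytes_right and shift_bytes_right_reference are made up for this sketch; psrlw mirrors the instruction and assumes a count of 0..15.

```python
# Byte shifting via a word shift plus a mask, compared against a real
# per-byte shift. All helper names except psrlw are hypothetical.

def psrlw(value, count):
    result = 0
    for shift in range(0, 64, 16):
        result |= (((value >> shift) & 0xFFFF) >> count) << shift
    return result

def shr_mask(count):
    """Same pattern as shr1mask..shr3mask: every byte is 0xFF >> count."""
    return int.from_bytes(bytes([0xFF >> count] * 8), "little")

def shift_bytes_right(value, count):
    return psrlw(value, count) & shr_mask(count)   # PSRLW then PAND

def shift_bytes_right_reference(value, count):
    result = 0
    for shift in range(0, 64, 8):
        result |= (((value >> shift) & 0xFF) >> count) << shift
    return result

pixels = 0xFF80FF00123456FF  # arbitrary sample: 2 ARGB pixels
print(hex(shift_bytes_right(pixels, 3)))
print(hex(shift_bytes_right_reference(pixels, 3)))
```

The mask only has to kill the bits that leaked down from the high byte of each word into the low byte, so one PAND is enough.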
---==[Some Implementation Ideas - Crossfading]==----------------------------
On the sademoscene mailing list Jacques posted an interesting challenge. He
wanted to find a fast implementation of the alpha blend function - basically
a crossfader. The function's formula is:
a=ARGB pixel1
b=ARGB pixel2
alpha=(0..1 value of the percentage of each image to blend)
finalpixel=[alpha*(a-b)]+b
You can see that if alpha==0 then 100% of image b will be shown and 0 percent of
image a will be shown. Also if alpha==1, 100% of image a will be shown and 0
percent of image b will be shown. Now how to speed up this very useful algorithm
using MMX? First of all, let's rewrite the algorithm to remove the floating
point alpha value:
alpha=alpha*256; //scale it up by 256 and treat it as an integer.
so now:
finalpixel=b+([alpha*(a-b)]>>8);
There are 4 main parts to the formula:
1 : (a-b)
2 : *alpha
3 : >>8
4 : +b
Try to make your own implementation of it using MMX before looking how I did it.
Here is how I did it:
;assume alpha is a value 0..255
;assume edi points to buffer1
;assume esi points to buffer2
;assume edx points to destination buffer
pxor MM6,MM6 ;make MM6==0 (it must stay zero for the unpacking/packing)
mov eax,[alpha]
mov ebx,eax
shl ebx,16
add eax,ebx ;eax=alpha in both 16bit halves
movd MM7,eax
punpckldq MM7,MM7 ;MM7=alpha in all 4 words
;//////////inner LOOP/////////////
movd MM0,[esi] ;pixel a
movd MM1,[edi] ;pixel b
punpcklbw MM0,MM6 ;byte-->word(pixel a)
punpcklbw MM1,MM6 ;byte-->word(pixel b)
psubw MM0,MM1 ;a-b
pmullw MM0,MM7 ;(a-b)*alpha, low 16bits kept
PSRLW MM0,8 ;shifts word elements 8 to the right.
paddb MM0,MM1 ;add (b) to result, wrapping within each byte
packuswb MM0,MM6 ;convert back into byte form
movd [edx],MM0 ;store the blended pixel
;//////////inner LOOP/////////////
It's a bit weird, I know :) If you know a better way - and I've no doubt there is
one, please let me know.
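The odd-looking part of the routine is that it multiplies a possibly negative (a-b), keeps only the low 16bits, does a logical shift and then a wraparound byte add. Here is a rough Python model of one colour channel going through those steps, next to the plain formula, so you can check they agree. Both function names are made up, and alpha is assumed to be an integer 0..255.

```python
# One colour channel of the crossfade: plain formula vs the MMX-style steps.
# Function names are hypothetical; a and b are 0..255, alpha is 0..255.

def crossfade_reference(a, b, alpha):
    return b + ((alpha * (a - b)) >> 8)   # Python's >> floors negative values

def crossfade_mmx_steps(a, b, alpha):
    diff = (a - b) & 0xFFFF               # psubw: 16-bit wraparound subtract
    product = (diff * alpha) & 0xFFFF     # pmullw: low 16 bits of the product
    shifted = product >> 8                # psrlw 8: logical shift
    return (shifted + b) & 0xFF           # paddb: wraparound add in the low byte

print(crossfade_reference(200, 100, 128), crossfade_mmx_steps(200, 100, 128))  # 150 150
print(crossfade_reference(0, 255, 255), crossfade_mmx_steps(0, 255, 255))      # 0 0
```

It works because the true answer always lands between the two input values, so it fits in 0..255 and the bits thrown away by the wraparounds never mattered.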
---==[Some Implementation Ideas - Blurring]==---------------------------------
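Here is my MMX blurring routine:
_MMX_blur_:
push edi
mov edi,[destaddr]
mov ecx,256000 ;320x200x4
sub ecx,2564 ;leave out the first and last rows
add edi,1284 ;start at row 1, pixel 1
pxor MM7,MM7 ;=0
movd MM0,[edi-4] ;left neighbour of the first pixel
@blur_more:
movd MM1,[edi+4] ;right neighbour
movd MM2,[edi-1280] ;pixel above (1280 bytes per row)
movd MM3,[edi+1280] ;pixel below
punpcklbw MM0,MM7 ;byte-->word
punpcklbw MM1,MM7
punpcklbw MM2,MM7
punpcklbw MM3,MM7
paddusw MM0,MM1
paddusw MM0,MM2
paddusw MM0,MM3 ;sum of the 4 neighbours
psrlw MM0,2 ;/4
packuswb MM0,MM7 ;word-->byte
movd eax,MM0
stosd ;store the pixel, MM0 is now the left neighbour of the next one
sub ecx,4
jnz near @blur_more
EMMS
pop edi
ret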
I think it can be optimised a lot, especially since it doesn't operate on pixels
in parallel.
---==[Some Implementation Ideas - Complex Multiplications]==-------------------
MMX can be VERY useful for doing complex multiplications - which is
useful in things like fractals. Now I'm no expert on imaginary number
planes or anything like that so this is straight out of an Intel document:
Let the input data be Dr and Di where
Dr = real component of the data
Di = imaginary component of the data
Format the constant complex coefficients in memory as four 16-bit
values [Cr -Ci Ci Cr]. Remember to load the values into the MMX technology
register using a MOVQ instruction.
Input: MM0 : a complex number Dr, Di
MM1 : constant complex coefficient in the form[Cr-Ci Ci Cr]
Output: MM0 : two 32-bit dwords containing [ Pr Pi ]
The real component of the complex product is Pr = Dr*Cr - Di*Ci, and the
imaginary component of the complex product is Pi = Dr*Ci + Di*Cr
PUNPCKLDQ MM0,MM0 ; This makes [Dr Di Dr Di]
PMADDWD MM0, MM1 ; and you're done, the result is
; [(Dr*Cr-Di*Ci)(Dr*Ci+Di*Cr)]
Note that the output is a packed dword. If needed, a pack instruction can
be used to convert the result to 16-bit (thereby matching the format of
the input).
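Here is a rough Python model of that two-instruction sequence, handy for checking the operand layout before writing the real thing. It assumes Dr sits in the upper word and Di in the lower word of the low 32bits of MM0. The helper names pack_words, signed16, signed32 and complex_multiply are made up for this sketch.

```python
# Model of the PUNPCKLDQ + PMADDWD complex multiply.
# pack_words, signed16, signed32 and complex_multiply are hypothetical names.

def signed16(word):
    return word - 0x10000 if word & 0x8000 else word

def signed32(dword):
    return dword - 0x100000000 if dword & 0x80000000 else dword

def pack_words(words):
    """Build a 64-bit value from 4 signed words listed highest first."""
    result = 0
    for word in words:
        result = (result << 16) | (word & 0xFFFF)
    return result

def punpckldq(dst, src):
    return ((src & 0xFFFFFFFF) << 32) | (dst & 0xFFFFFFFF)

def pmaddwd(dst, src):
    a = [signed16((dst >> s) & 0xFFFF) for s in range(0, 64, 16)]
    b = [signed16((src >> s) & 0xFFFF) for s in range(0, 64, 16)]
    low = (a[0] * b[0] + a[1] * b[1]) & 0xFFFFFFFF
    high = (a[2] * b[2] + a[3] * b[3]) & 0xFFFFFFFF
    return (high << 32) | low

def complex_multiply(dr, di, cr, ci):
    mm0 = pack_words([0, 0, dr, di])      # data in the low 32 bits
    mm1 = pack_words([cr, -ci, ci, cr])   # coefficient as [Cr -Ci Ci Cr]
    mm0 = punpckldq(mm0, mm0)             # [Dr Di Dr Di]
    mm0 = pmaddwd(mm0, mm1)               # [Pr Pi]
    return signed32(mm0 >> 32), signed32(mm0 & 0xFFFFFFFF)

print(complex_multiply(3, 4, 1, 2))  # (-5, 10)
```

That matches (3+4i)*(1+2i) = -5+10i. Keep in mind the inputs have to fit in signed 16bit words for this to hold.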
---==[Closing Words]==--------------------------------------------------------
Well I really hope that people start using MMX more - because it REALLY is very cool.
If you'd like to comment to me about anything in this document, please don't hesitate!
I think next time I'll look into 3DNow! which is AMD's new
set of floating point instructions. Just to give you a taste - 3DNow!'s
registers are also MM0-MM7, except that the 64bits is divided into 2 single
precision floating point numbers. What this means is that you can do
floating point functions in parallel - much like MMX. There are also a few
new, much needed MMX instructions which are included in AMD's 3DNow!, and in the
new PIII range, which are just an extension to the integer MMX instructions
mentioned in this doc.
Greets: Demoscene, All the people at Optimise'99, Everyone at #programming,
ColdBlood, Cyberphreak, Deadpoet, LSD, Maverick, Neuron, NiMH, Saurax, Viper,
and everybody that I know!
-Rawhed/Sensory Overload
-Mailto:sfeist@netactive.co.za
-HTTP://www.surf.to/demos/
-Andrew Griffiths
-South Africa
-01-12-1999