Back in the days when a single misplaced semi-colon or careless ASM
instruction could mean wiping out days of work, the Zen timer was a cool
thing to have. It was a tool written by Michael Abrash that let you time just
how long your ASM loops were taking. If you knew a few optimising rules,
you could take the results and figure out what was taking the most time in
your loops.
These days under multi-tasking OSs, the Zen timer is no longer practical. But
Intel have come up with something that's even handier than the Zen
timer. It's a once-undocumented instruction, RDTSC, that works on the Pentium or better
(Pentium MMX, Pentium Pro, Pentium II or Pentium III). It reads the chip's
internal time stamp counter, which counts clock cycles, and writes the value to
a register pair. In short, it's a really, really handy way of measuring *exactly*
how many clock cycles your routine is taking. It doesn't replace global
optimisation and profiling of course - those are things that must happen first,
but when you have an ASM routine that has to run faster, then RDTSC is
handy for doing exact cycle counts.
ReaDTimeStampCounter
RDTSC reads the value of the Pentium's internal cycle counter and writes the
64-bit result into EDX:EAX. Most of the time 2^32 clock cycles is more than
enough to measure your code so we can ignore EDX and just compare two
different time values in EAX. If you need to time something that runs for
longer than 2^32 cycles, it's easy to modify the code to keep EDX as well.
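As a rough illustration in C of what that modification involves, the two helpers below show how the EDX and EAX halves combine into one 64-bit count, and why comparing EAX-only readings still works if the low word rolls over once between them. The function names are made up for this sketch and are not part of rdtsc.zip.

```c
#include <stdint.h>

/* Combine the two halves RDTSC leaves in EDX:EAX into one 64-bit count.
   Hypothetical helper for illustration only. */
uint64_t tsc_combine(uint32_t edx, uint32_t eax)
{
    return ((uint64_t)edx << 32) | eax;
}

/* Elapsed cycles from two EAX-only readings. Unsigned subtraction wraps
   modulo 2^32, so the result is correct as long as fewer than 2^32
   cycles passed, even if EAX rolled over in between. */
uint32_t tsc_elapsed32(uint32_t start_eax, uint32_t end_eax)
{
    return end_eax - start_eax;
}
```

In assembly the 64-bit version would mean saving EDX alongside EAX at the start and doing a sub/sbb pair at the end.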
Here's the basic code for timing a loop at the assembly level. It uses NASM
(which understands RDTSC as an op code) and is C-callable from Linux (or
probably DOS running DJGPP as well). Check out rdtsc.zip which should be
provided with this diskmag and should have full working source in both C
and ASM for a Linux code timer.
Anyway here it is:
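    BITS 32

    SECTION .data
    time    dd 0                    ; low 32 bits of the start time stamp

    SECTION .text

    proc TestRDTSC
    %define count [ebp+8]           ; first C argument: number of iterations

            mov     ecx, count      ; count already expands to [ebp+8]
            xor     eax, eax        ; return 0 if count is zero
            test    ecx, ecx
            jz      .bailout
            rdtsc                   ; EDX:EAX = time stamp counter
            mov     [time], eax
    .timerloop:
            ; insert your test code in here (it must preserve ECX)
            dec     ecx
            jnz     .timerloop
            rdtsc
            sub     eax, [time]     ; EAX = elapsed cycles, the return value
    .bailout:
    endproc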
(BTW my procs and endprocs are helper macros which do all the grunt work
of pushing ebp and mov'ing ebp, esp and so on, but that is all in the source).
This function, when called from C, takes one argument (how many times you
want to run your code snippet) and returns an unsigned int: how many
cycles the snippet took, multiplied by the number of times it ran.
So:
unsigned int cycle_count = TestRDTSC(1000);
will run your test routine 1000 times and put the total cycle count in cycle_count.
I tried this out under Linux to see how good it was. It's very accurate indeed.
I had a resampling loop for a mixer which was running in about 12 clock
ticks per iteration of the loop - reasonably quick I thought. After brushing up
on the instruction pairing rules for the Pentium, I re-arranged some
instructions, ran the timer and found it was down to 3 clock ticks - a fourfold
speed improvement. So what, you might say. But this routine is called fifty
times per second and the loop can execute on data sizes up to 8k. Four times
the speed should be noticeable - and it was. Combined with a couple of other
speedups, the mixer went from 25% CPU usage down to about 8%.
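To put rough numbers on why a per-iteration saving adds up, here is a back-of-envelope sketch in C. It assumes "8k" means 8192 loop iterations per call and that every call hits that worst case; the function name is invented for illustration. It says nothing about the CPU percentage by itself, since that also depends on the clock speed of the machine.

```c
/* Cycles spent in one loop per second:
   calls per second * iterations per call * cycles per iteration.
   Hypothetical helper; the inputs used below are worst-case assumptions. */
unsigned long loop_cycles_per_second(unsigned long calls_per_sec,
                                     unsigned long iterations,
                                     unsigned long cycles_per_iter)
{
    return calls_per_sec * iterations * cycles_per_iter;
}
```

Under those assumptions, `loop_cycles_per_second(50, 8192, 12)` comes to 4,915,200 cycles per second and `loop_cycles_per_second(50, 8192, 3)` to 1,228,800, a saving of roughly 3.7 million cycles every second for this one loop.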
Another good example was a code snippet from a text on Pentium
optimisation. It's a complex floating point calculation which takes twelve
clock cycles and illustrates some overlapping properties of the FPU. I never
knew about these overlapping properties so I cut and pasted the code into the
timer and ran it 1000 times (not manually of course :). Bingo - the result was
13000 and something. That's 13 ticks per pass: 12 for the code plus one cycle
to count the loop. The "and something" is a little fixed overhead (plus the
fact that the OS might have interrupted my program occasionally to do other
things), so the average was near as dammit 12 clock ticks for the code I was testing.
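One way to take the OS interruptions out of the picture is to run the timer several times and keep the smallest total, since an interrupted run can only come out slower. The sketch below does that and then strips the loop cost; the helper names are hypothetical, and the one-cycle loop overhead is an assumption that should be checked by timing an empty loop.

```c
#include <stddef.h>

/* Smallest total among several runs (assumes runs >= 1). An interrupted
   run can only be slower, so the minimum is the least disturbed one. */
unsigned int min_total(const unsigned int *totals, size_t runs)
{
    unsigned int best = totals[0];
    for (size_t i = 1; i < runs; i++)
        if (totals[i] < best)
            best = totals[i];
    return best;
}

/* Cycles per iteration for the snippet alone (assumes count >= 1).
   Integer division drops the small fixed overhead; loop_overhead is the
   assumed per-iteration cost of the loop counter itself. */
unsigned int snippet_cycles(unsigned int total, unsigned int count,
                            unsigned int loop_overhead)
{
    unsigned int per_iter = total / count;
    return per_iter > loop_overhead ? per_iter - loop_overhead : 0;
}
```

With made-up totals of 13250, 13012 and 19800 for three runs of 1000 iterations, `min_total` picks 13012 and `snippet_cycles(13012, 1000, 1)` gives 12.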
Conclusion: if you have ASM code and you think it could be performing
better, then try the following steps:
1. look up all the cycle counts for the instructions you're using
2. try and see how they will pair in the U and V pipe of the Pentium
3. add up how long you think the loop should take
4. using RDTSC, time exactly how long it takes
5. if the results aren't what you expected, then try and see what's taking the
time.
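Steps 4 and 5 can also be sketched from the C side. This is not the code from rdtsc.zip: it assumes GCC or Clang on x86, where the `<x86intrin.h>` header provides the `__rdtsc()` intrinsic, and both function names are made up for illustration. On any other target the reader function is just a stub that returns 0.

```c
#include <stdint.h>
#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>            /* GCC/Clang header providing __rdtsc() */
#endif

/* Step 4: read the full 64-bit time stamp counter.
   Assumption: GCC or Clang on x86; elsewhere this stub returns 0. */
uint64_t read_cycles(void)
{
#if defined(__i386__) || defined(__x86_64__)
    return __rdtsc();
#else
    return 0;
#endif
}

/* Step 5: measured cycles per iteration minus the hand-counted estimate
   (assumes count >= 1). A positive result means the loop is slower than
   predicted, so look for stalls or instructions that failed to pair. */
double cycles_over_estimate(uint64_t measured_total, unsigned int count,
                            double predicted_per_iter)
{
    return (double)measured_total / count - predicted_per_iter;
}
```

Typical use would be to call `read_cycles()` before and after the loop, subtract the two readings, and pass the difference to `cycles_over_estimate` along with the iteration count and the figure from step 3. A compiler may reorder or optimise code around the two readings, so for tight loops the assembly version is still the more trustworthy timer.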
The cool thing is an exact timer eliminates guesswork from assembly
optimisation completely. There's no need to be a guru or have a gut feel
about your code - see what's slow, find out why it's slow and make it go
faster :)
Have fun.