Grr... people need to get out of the 80's already...
Okay... there's a really nifty, quite wonderful image-scaling library for low-colour, low-res pixel artwork. Specifically, it's meant for re-rasterizing older video-game artwork to a higher resolution while keeping angles and lines and what-not intact. It was originally named HQ3x, though it now supports 2x and 4x rescaling as well.
Here's where it gets stupid. First off, it requires being handed a 16bpp image, in 565 format. Why? Because it converts to YUV colour-space for comparisons, so it can be separately sensitive to colour versus brightness changes when determining borders between areas on sprites; that's how it knows where to put the hard edges and where to blur. This is a good thing. So what's the bad?
It uses a massively pre-calculated lookup table for this. And another one to convert the 16bpp graphics to 32bpp before actually blending! So it wastes 512k, HALF A MEG, of very randomly accessed memory. That means you can assume the majority of accesses to both arrays will be cache misses, each one stalling the CPU until the memory is read, limiting the code to the speed of the memory bus.
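For illustration, here's a minimal C sketch of doing that conversion and the split brightness/colour test on the fly instead of through huge tables. The function names, coefficients, and thresholds are my own placeholders, not necessarily the library's actual values:

    #include <stdint.h>
    #include <stdlib.h>

    /* Expand a 565 pixel and compute an integer YUV approximation directly.
       The coefficients are the usual BT.601-style ones, used here purely for
       illustration; the library's own constants may differ. */
    static uint32_t rgb565_to_yuv(uint16_t c)
    {
        int r = (c >> 11) & 0x1F;
        int g = (c >> 5)  & 0x3F;
        int b =  c        & 0x1F;

        r = (r << 3) | (r >> 2);   /* expand to 8 bits per channel */
        g = (g << 2) | (g >> 4);
        b = (b << 3) | (b >> 2);

        int y = ( 299 * r + 587 * g + 114 * b) / 1000;
        int u = (-169 * r - 331 * g + 500 * b) / 1000 + 128;
        int v = ( 500 * r - 419 * g -  81 * b) / 1000 + 128;

        return ((uint32_t)y << 16) | ((uint32_t)u << 8) | (uint32_t)v;
    }

    /* The edge test then compares brightness and colour with separate
       thresholds.  The cut-offs below are placeholder values. */
    static int yuv_different(uint32_t a, uint32_t b)
    {
        int dy = (int)((a >> 16) & 0xFF) - (int)((b >> 16) & 0xFF);
        int du = (int)((a >>  8) & 0xFF) - (int)((b >>  8) & 0xFF);
        int dv = (int)( a        & 0xFF) - (int)( b        & 0xFF);
        return abs(dy) > 48 || abs(du) > 7 || abs(dv) > 6;
    }

A handful of shifts, multiplies, and adds like this stays entirely in registers, which is exactly the trade the rest of this rant is about.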
Modern CPUs run as much as 10x faster than the memory bus they're attached to, and only have around 32-64k of full-speed cache. While a lookup table used to be an automatic win, on modern CPUs it's a HUGE loss, specifically because it's often faster to execute as many as 100 clock cycles of actual work than to wait out one cache miss. That's around 100-150 instructions on those same CPUs, because they can run multiple bits of non-interacting code at once.
Yes, this is a complete reversal of earlier days. CPUs used to take 100 or more cycles to perform a divide. Now they take on the order of 5-9, in some cases even less, partly due to better design, partly due to fancier mathematical tricks. But gods... to see code specifically targeted at higher-end CPUs because it's CPU-intensive... and to realize it's set up in a way that specifically cripples modern CPUs... it explains why good results are being returned on everything from a P2 to a modern P4/Athlon with this code. It's purely limited by how fast system RAM can be accessed.
Another bit of boneheadedness: it has a nicely-designed 256-entry 'function lookup table', based on a slightly sub-optimal pattern-calculation block I won't pick apart just yet. Over half of those sub-functions perform 2-4 further colour-comparisons each, raising the total to 10-12 per pixel (of which there are only 4 unique comparisons possible) and introducing severe jump-prediction issues. Simply expanding the function-lookup table to 4096 entries allows those other 4 pattern-comparisons to be included in the single jump-table, letting them be merged into the other 8 and simplified down to those 4 TOTAL per pixel, AND removing jump-prediction entirely from the inner-loop code.
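Here's a rough sketch of that 4096-entry idea, with hypothetical names (different(), render[], scale_pixel()) standing in for the library's actual pieces, and my own guess at which four neighbour-pair tests get folded in:

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical stand-ins for the library's own pieces: different() is the
       YUV-threshold edge test, render_fn one of the per-pattern blend routines,
       and render[] the proposed 4096-entry dispatch table. */
    bool different(uint32_t a, uint32_t b);
    typedef void (*render_fn)(uint32_t *dst, const uint32_t *w);
    extern render_fn render[4096];

    /* Build one 12-bit index per pixel: the eight centre-vs-neighbour bits of
       the classic 256-case pattern, plus the four neighbour-vs-neighbour tests
       that the 256-entry version re-evaluates inside over half of its cases. */
    void scale_pixel(uint32_t *dst, const uint32_t w[9])  /* w[4] is the centre */
    {
        unsigned pattern = 0, bit = 1;

        for (int k = 0; k < 9; ++k) {
            if (k == 4)
                continue;                      /* skip the centre pixel */
            if (different(w[4], w[k]))
                pattern |= bit;
            bit <<= 1;
        }

        if (different(w[1], w[5])) pattern |= 0x100;   /* top    vs right  */
        if (different(w[5], w[7])) pattern |= 0x200;   /* right  vs bottom */
        if (different(w[7], w[3])) pattern |= 0x400;   /* bottom vs left   */
        if (different(w[3], w[1])) pattern |= 0x800;   /* left   vs top    */

        render[pattern](dst, w);   /* one table dispatch, no per-case branches */
    }

Every comparison gets done exactly once, and the per-case conditional jumps disappear into the single indirect call.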
Now what's with the ranting on jump prediction? Older CPUs simply assumed jumps would either pass or fail, and you programmed accordingly to the clock-count charts. But most of the time it's complex to reverse a conditional jump without noticeable re-arranging of the surrounding code, which can break the instruction pairing that allows 2 or more instructions to be processed at once. This, again, breaks severely with earlier CPUs. Previously, you always assumed jumps would be taken. Now you don't need to make that assumption; you simply need to make the same jump location go the same way the majority of the time. The problem is... once again, due to the layout of this function, the conditional jumps are essentially perfectly random, resulting in ruined jump prediction and more CPU stalls, on the order of 25ish cycles for Athlons to as many as another 75ish for P4s. That's, again, quite a few instructions.
So, anything else this library does that's bad? Not really. It does a very obtuse per-pixel-per-situation macro definition block that doesn't really fix anything, but that's simply a coding-style complaint, and actually an understandable one. The macros are good; they just could be broken up into 2 parts instead of being kept as one contiguous unit, and simplified if they were broken up.
But gah... so far I'm managing to reduce the memory load from 513k to 16k, AND reduce the code-size and sub-function calls by over half, not counting the colour-blurring functions, which I'm rebuilding to handle full 32bpp input using bit-shifting instead of the bit-masked two-stage multiplication combinations that the compiler can't optimize. And the result will use the same 'block' of code with a single switch between three 4k-entry function-pointer tables to select between 2x, 3x, and 4x scaling, instead of entirely separate functions with fully duplicated internal logic.
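As a sketch of the kind of shift-based blending I mean, working directly on 32bpp pixels (an illustration of the technique, not the finished replacement code):

    #include <stdint.h>

    /* Exact per-channel average of two ARGB8888 pixels using only shifts,
       masks, and adds: (a + b) / 2 per channel.  The 0x7F7F7F7F mask keeps a
       shifted-down bit from bleeding into the neighbouring channel. */
    static uint32_t avg32(uint32_t a, uint32_t b)
    {
        return (a & b) + (((a ^ b) >> 1) & 0x7F7F7F7Fu);
    }

    /* A 3:1 weighted blend built from two averages; per channel this works
       out to (3*a + b) >> 2, with truncation, and never needs a multiply. */
    static uint32_t blend_3_1(uint32_t a, uint32_t b)
    {
        return avg32(a, avg32(a, b));
    }

Other power-of-two weightings fall out of the same averaging trick, and all four channels get processed in one register with no unpacking step.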
Would this be considered re-writing procedural code into an object-oriented format, or am I misunderstanding what object-oriented means? =^.^=

no subject
Heh... and I usually avoid printf entirely.
Then again, I'm an inverted size-queen for code size. I tend to shy away from the overly general libc functions towards the more controlled ones, unless I specifically need the functionality of the more complex ones, like displaying floating-point values.
no subject
ow ow ow...
;P
Sorry about that.
no subject
*grins*
*pads off*
Thanks, hon. I'm still poking at things to try to figure it out.
Re: Thanks, hon. I'm still poking at things to try to figure it out.
*licks happily*
no subject
Actually...
That's all that has been 'standard' over all these years. The simpler and more straightforward the code, the faster the CPUs can make it run. If anything, they've made optimization easier, because you can simply write actual code instead of trying to target anything in particular.
Smaller code with fewer branches runs faster. That's always been true. For a short while, though, most of the more esoteric math was very slow, while memory suddenly surged ahead in speed for a couple of years, resulting in a short-lived period where everything, even simple multiplication, was faster to do by looking up the answer in a lookup table. Long-term, simpler, less-branchy code has always been an improvement.
On the 8086, everything was slow, but memory especially so. On the 186 and 286, this was still mostly true. The 386 and 486, however, turned this on its ear. The 586 reversed that trend, and the 686/P2/P3/P4/Athlon/K6 and beyond all went back to the way the 8086 and so forth were, in terms of the memory-to-CPU speed ratio.
Unfortunately, too many people are still stuck in the very short-lived 'speedy memory' days, where it was 1-2 clock cycles for a cache miss. A multiply alone could take 10 clock cycles too, which only made things more silly. The 386 and 486 were the 'glory days' of optimization, where all sorts of insane, silly tricks were needed to make things run Really Fast.
Surprisingly, much of the older 80286-and-earlier code still runs blazingly fast on my Athlon, in many cases faster than the 386- and 486-era 'demo scene' code that was so impressive for how much it could accomplish using the tricks and shortcuts of the day.
no subject
Believe it or not...
Some people drink coffee.
Some people drink alcohol.
Some people exercise.
Some people meditate.
I code.