This beats the previous 3-LUT version and even beats the SSE4.1 version on my system.
Test code:
---
#include <SDL3/SDL.h>

int main(int argc, char *argv[])
{
    /* Source is RGB565, destination is ARGB8888, so the blit exercises the 565 -> 32-bit path. */
    SDL_Surface *orig = SDL_LoadPNG("testyuv.png");
    SDL_Surface *surf16 = SDL_ConvertSurface(orig, SDL_PIXELFORMAT_RGB565);
    SDL_Surface *surf32 = SDL_ConvertSurface(surf16, SDL_PIXELFORMAT_ARGB8888);

    /* Time 100000 full-surface blits. */
    Uint64 then = SDL_GetTicks();
    for (int i = 0; i < 100000; ++i) {
        SDL_BlitSurface(surf16, NULL, surf32, NULL);
    }
    Uint64 now = SDL_GetTicks();

    SDL_Log("Blit took %d ms\n", (int)(now - then));
    return 0;
}
---
Results on my system:
BlitNtoN:               Blit took 34522 ms
Blit_RGB565_32 (3 LUT): Blit took  9316 ms
Blit_RGB565_32 (1 LUT): Blit took  5268 ms
Blit_RGB565_32_SSE41:   Blit took  6399 ms
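
For reference, here is a rough sketch of what a single-LUT RGB565 to ARGB8888 conversion can look like. This is only one possible reading of "1 LUT" and is not the code in this change: it assumes a 65536-entry table indexed by the whole 16-bit pixel, and the names rgb565_to_argb8888_lut, init_rgb565_lut and convert_row_rgb565_to_argb8888 are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table: one ARGB8888 value per possible RGB565 pixel (256 KiB). */
static uint32_t rgb565_to_argb8888_lut[65536];

/* Fill the table once. Each channel is widened by bit replication,
 * so a full 5-bit or 6-bit channel maps to 0xFF. */
static void init_rgb565_lut(void)
{
    for (uint32_t p = 0; p < 65536; ++p) {
        uint32_t r5 = (p >> 11) & 0x1F;
        uint32_t g6 = (p >> 5) & 0x3F;
        uint32_t b5 = p & 0x1F;
        uint32_t r8 = (r5 << 3) | (r5 >> 2);
        uint32_t g8 = (g6 << 2) | (g6 >> 4);
        uint32_t b8 = (b5 << 3) | (b5 >> 2);
        /* Alpha is forced to opaque since RGB565 carries no alpha. */
        rgb565_to_argb8888_lut[p] = 0xFF000000u | (r8 << 16) | (g8 << 8) | b8;
    }
}

/* Convert one row: a single table lookup per pixel, no per-channel math. */
static void convert_row_rgb565_to_argb8888(const uint16_t *src, uint32_t *dst, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        dst[i] = rgb565_to_argb8888_lut[src[i]];
    }
}
```

With this layout the inner loop does one load from the table per pixel instead of combining several partial lookups. The tradeoff is the table size, so how it compares to other variants will depend on cache behavior on a given machine.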