[rlsw] ESP32 optimizations (#5827)

* [rlsw] Add sw_rcp helper using Xtensa recip0.s for hot-path divisions

Adds a `sw_rcp(x)` inline reciprocal that on Xtensa (ESP32 / ESP32-S3 LX6/LX7) emits a `recip0.s` seed plus two Newton-Raphson refinement steps -- 1-ULP accurate in ~7 instructions, all in FPU registers. On every other target it expands to plain `1.0f/x`, so generated code is byte-identical to before for non-Xtensa builds.

Replaces the hot-path `1.0f/x` calls that were previously compiling to the `__divsf3` software helper on Xtensa:

- perspective divide (1/w) in triangle clip-and-project (PCT and PC paths)
- line and point clip-and-project NDC conversion
- triangle span setup: dxRcp, blockLenRcp, wRcpA, wRcpB
- triangle scanline setup: h02Rcp, h01Rcp, h12Rcp
- axis-aligned quad: wRcp, hRcp
- line rasterizer: stepRcp

Other `1.0f/x` uses (matrix translate/normalize, texture init `tx`/`ty`, sw_matrix_rotate inverse-length) are not on the per-pixel hot path and are left untouched.

Measured on ESP32-S3 @ 240 MHz, R5G6B5 240x240, textured 3D model: contributes to a ~10-15% rasterization speedup.

Made-with: Cursor

* [rlsw] Use ESP-DSP for 4x4 matrix multiply and per-vertex MVP transform

Adds an opt-in ESP-DSP code path for ESP32 / ESP32-S3 builds. ESP-DSP is ESP-IDF's official optimized math library and ships hand-vectorized kernels that beat the scalar implementations on Xtensa.

Two integration points:

1. `sw_matrix_mul_rst` -> `dspm_mult_4x4x4_f32` for any 4x4*4x4 multiply (used for MVP build, gluLookAt, push/multiply, etc.). rlsw stores matrices column-major and ESP-DSP reads row-major; the comment on the call site explains why the flat-buffer call still produces the correct column-major product (transpose-of-transposes equivalence).

2. `sw_immediate_push_vertex` -> `dspm_mult_4x4x1_f32` for the per-vertex clip-space transform. Because ESP-DSP expects a row-major matrix in this case, a row-major copy `matMVP_rm[16]` is maintained alongside `matMVP` and refreshed once per `isDirtyMVP` rebuild in `sw_immediate_begin`. Cost is 16 scalar copies per matrix update, amortized over thousands of vertices per frame.

Detection is **opt-in** via `SW_USE_ESP_DSP` so existing ESP-IDF projects that don't depend on the `esp-dsp` component keep building unchanged. A user enables it from CMakeLists.txt (or anywhere before including rlgl.h):

    target_compile_definitions(${COMPONENT_LIB} PRIVATE SW_USE_ESP_DSP=1)

and adds the dependency to `idf_component.yml`:

    espressif/esp-dsp: "^1.4.0"

Measured on ESP32-S3 @ 240 MHz, R5G6B5 240x240, textured 3D model: contributes meaningfully to the overall frame-time improvement (combined with sw_rcp).

Made-with: Cursor

* [rlsw] Add SW_TEXTURE_REPEAT_POT_FAST opt-in for POT bitmask wrap

Adds an opt-in compile-time flag that replaces the SW_REPEAT wrap chain with a bitmask (`x & (size-1)`) for power-of-two textures. NPOT textures keep using the original `sw_fract` / signed-modulo paths via a runtime `(size & (size-1)) == 0` check, so SW_REPEAT remains correct for them.

Affects two samplers:

- `sw_texture_sample_nearest`: drops the `floorf` + multiply + cast for POT textures in REPEAT mode (saves a software call on Xtensa).
- `sw_texture_sample_linear`: replaces the `(x % w + w) % w` two-step modulo (a software divide on Xtensa) with a single bitwise AND for POT textures in REPEAT mode. Two's-complement int wrap covers negative coordinates correctly.

Off by default: for POT textures sampled with negative UVs, bitmask wrap can differ from `sw_fract` wrap by one texel at the boundary. That is imperceptible at typical resolutions but technically a behavior change, so existing users get bit-for-bit identical output. Opt in if you control your asset UVs and want the speedup:

    #define SW_TEXTURE_REPEAT_POT_FAST

This addresses the long-standing TODO comment "If the textures are POT, avoid the division for SW_REPEAT" in `sw_texture_sample_linear`.

Made-with: Cursor
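Two claims in the message above can be checked on a host machine. What follows is a minimal sketch, not code from rlsw or ESP-DSP. It covers the one-texel difference between the `sw_fract` wrap and the bitmask wrap for negative UVs in the nearest sampler, and the reason a row-major multiply fed column-major flat arrays writes the same buffer as the column-major product R*L. All function names are hypothetical: `mul_row_major` only stands in for `dspm_mult_4x4x4_f32`, assuming that kernel computes a plain row-major `C = A*B`, and `floor_small` stands in for `floorf` so the sketch needs no libm.

```c
/* Stand-in for floorf(), valid for values that fit in an int. */
static float floor_small(float x)
{
    int i = (int)x;                          /* truncates toward zero */
    return (float)((x < (float)i) ? i - 1 : i);
}

/* Default SW_REPEAT wrap in the nearest sampler: take the fractional part
   (same formula as sw_fract), then scale to a texel index. */
static int wrap_fract(float u, int size)
{
    float f = u - floor_small(u);
    return (int)(f*size);
}

/* Opt-in fast wrap for power-of-two sizes: scale, truncate toward zero,
   then mask. Negative values rely on two's-complement ints. */
static int wrap_mask(float u, int size)
{
    return (int)(u*size) & (size - 1);
}

/* Hypothetical stand-in for dspm_mult_4x4x4_f32: C = A*B, row-major,
   element (row i, col j) stored at index 4*i + j. */
static void mul_row_major(const float *A, const float *B, float *C)
{
    for (int i = 0; i < 4; i++)
    {
        for (int j = 0; j < 4; j++)
        {
            float s = 0.0f;
            for (int k = 0; k < 4; k++) s += A[4*i + k]*B[4*k + j];
            C[4*i + j] = s;
        }
    }
}

/* Column-major product D = R*L, element (row r, col c) stored at 4*c + r. */
static void mul_col_major_rl(const float *L, const float *R, float *D)
{
    for (int c = 0; c < 4; c++)
    {
        for (int r = 0; r < 4; r++)
        {
            float s = 0.0f;
            for (int k = 0; k < 4; k++) s += R[4*k + r]*L[4*c + k];
            D[4*c + r] = s;
        }
    }
}

/* Returns 1 when both routines write identical flat arrays. */
static int flat_products_match(void)
{
    float L[16], R[16], viaRow[16], viaCol[16];
    for (int i = 0; i < 16; i++)
    {
        L[i] = (float)(i + 1);               /* arbitrary small integers */
        R[i] = (float)((i*3)%7 - 2);
    }
    mul_row_major(L, R, viaRow);
    mul_col_major_rl(L, R, viaCol);
    for (int i = 0; i < 16; i++) if (viaRow[i] != viaCol[i]) return 0;
    return 1;
}
```

With a texture size of 8, `u = -0.30` lands on texel 5 through the fractional wrap and on texel 6 through the mask, because the cast truncates toward zero where `sw_fract` floors. Texel-aligned values such as `u = -0.25` agree, as do non-negative UVs. `flat_products_match()` returning 1 is the transpose-of-transposes equivalence restated on flat buffers.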
2026-06-28 05:50:31 +00:00 · 2026-05-06 12:38:52 +02:00
parent cf9f27db54
commit 7207c03c72
1 changed file with 136 additions and 21 deletions
--- a/src/external/rlsw.h
+++ b/src/external/rlsw.h
@@ -164,6 +164,19 @@
    #endif
 #endif

+// Fast power-of-two texture wrap (SW_REPEAT mode only)
+// When defined, textures whose width/height are powers of two use a bitmask
+// wrap (`x & (size-1)`) instead of `floorf`-based fractional wrap or the
+// signed `%` chain in the linear sampler. Saves a software divide on Xtensa
+// and a few instructions everywhere. NPOT textures keep using the original
+// path via a runtime `(size & (size-1)) == 0` check, so SW_REPEAT remains
+// correct for them. The only observable behavior change is for POT textures
+// sampled with negative UV coordinates: bitmask wrap (two's complement) can
+// differ from `sw_fract` by one texel. Off by default to keep bit-for-bit
+// behavior; opt in if you control your asset UVs.
+//
+// #define SW_TEXTURE_REPEAT_POT_FAST
+
 //----------------------------------------------------------------------------------
 // OpenGL Compatibility Types
 //----------------------------------------------------------------------------------
@@ -844,6 +857,17 @@ SWAPI void swGetFramebufferAttachmentParameteriv(SWattachment attachment, SWatta
    #endif
 #endif

+// ESP-DSP acceleration: ESP-IDF ships an optimized math library that includes
+// `dspm_mult_4x4x4_f32` (4x4 matrix multiply) and `dspm_mult_4x4x1_f32`
+// (matrix * vector). These are S3-tuned hand-vectorized kernels that beat the
+// scalar versions for both throughput and code-size. Detection is opt-in to
+// keep the dependency optional: define SW_USE_ESP_DSP from your build system
+// and add the `espressif/esp-dsp` dependency to your `idf_component.yml`.
+#if defined(ESP_PLATFORM) && defined(SW_USE_ESP_DSP)
+    #define SW_HAS_ESP_DSP
+    #include "dspm_mult.h"
+#endif
+
 #ifdef __cplusplus
    #define SW_CURLY_INIT(name) name
 #else
@@ -1038,6 +1062,9 @@ typedef struct {
    SWmatrix currentMatrixMode;                                 // Current matrix mode (e.g., sw_MODELVIEW, sw_PROJECTION)
    sw_matrix_t *currentMatrix;                                 // Pointer to the currently used matrix according to the mode
    sw_matrix_t matMVP;                                         // Model view projection matrix, calculated and used internally
+#ifdef SW_HAS_ESP_DSP
+    float matMVP_rm[16];                                        // Row-major MVP, kept in sync for esp-dsp dspm_mult_4x4x1_f32 vertex transform
+#endif
    bool isDirtyMVP;                                            // Indicates if the MVP matrix should be rebuilt

    sw_handle_t boundFramebufferId;                             // Framebuffer currently bound
@@ -1141,6 +1168,14 @@ static inline void sw_matrix_id(sw_matrix_t dst)

 static inline void sw_matrix_mul_rst(float *SW_RESTRICT dst, const float *SW_RESTRICT left, const float *SW_RESTRICT right)
 {
+#ifdef SW_HAS_ESP_DSP
+    // dspm_mult_4x4x4_f32 treats its operands as row-major. rlsw stores matrices
+    // column-major, so passing them flat is equivalent to passing transposes:
+    // dspm_mult(L^T, R^T) computes (L^T)*(R^T) = (R*L)^T, written back into a
+    // flat array gives the same bit pattern as the column-major product (R*L)
+    // -- exactly the semantic the scalar fallback below has.
+    dspm_mult_4x4x4_f32(left, right, dst);
+#else
    float l00 = left[0],  l01 = left[1],  l02 = left[2],  l03 = left[3];
    float l10 = left[4],  l11 = left[5],  l12 = left[6],  l13 = left[7];
    float l20 = left[8],  l21 = left[9],  l22 = left[10], l23 = left[11];
@@ -1165,6 +1200,7 @@ static inline void sw_matrix_mul_rst(float *SW_RESTRICT dst, const float *SW_RES
    dst[7]  = l10*right[3] + l11*right[7] + l12*right[11] + l13*right[15];
    dst[11] = l20*right[3] + l21*right[7] + l22*right[11] + l23*right[15];
    dst[15] = l30*right[3] + l31*right[7] + l32*right[11] + l33*right[15];
+#endif
 }

 static inline void sw_matrix_mul(sw_matrix_t dst, const sw_matrix_t left, const sw_matrix_t right)
@@ -1210,6 +1246,33 @@ static inline float sw_fract(float x)
    return (x - floorf(x));
 }

+// Fast reciprocal: 1-ULP accurate in ~7 instructions on Xtensa using the
+// hardware `recip0.s` seed + two Newton-Raphson refinement steps. All work
+// stays in FPU registers — no `__divsf3` software call. Hot-path divisions
+// in the rasterizer (span/triangle setup, perspective divide, etc.) call
+// this. On non-Xtensa targets it transparently expands to `1.0f / x`, so
+// generated code is identical to before.
+#if defined(__XTENSA__)
+__attribute__((always_inline))
+static inline float sw_rcp(float x)
+{
+    float result, temp;
+    __asm__(
+        "recip0.s %0, %2\n"
+        "const.s  %1, 1\n"
+        "msub.s   %1, %2, %0\n"
+        "madd.s   %0, %0, %1\n"
+        "const.s  %1, 1\n"
+        "msub.s   %1, %2, %0\n"
+        "maddn.s  %0, %0, %1\n"
+        : "=&f"(result), "=&f"(temp) : "f"(x)
+    );
+    return result;
+}
+#else
+static inline float sw_rcp(float x) { return 1.0f/x; }
+#endif
+
 static inline uint8_t sw_luminance8(const uint8_t *color)
 {
    return (uint8_t)((color[0]*77 + color[1]*150 + color[2]*29) >> 8);
@@ -2406,11 +2469,31 @@ static inline void sw_texture_free(sw_texture_t *texture)

 static inline void sw_texture_sample_nearest(float *SW_RESTRICT color, const sw_texture_t *SW_RESTRICT tex, float u, float v)
 {
-    u = (tex->sWrap == SW_REPEAT)? sw_fract(u) : sw_saturate(u);
-    v = (tex->tWrap == SW_REPEAT)? sw_fract(v) : sw_saturate(v);
+    int x, y;

-    int x = u*tex->width;
-    int y = v*tex->height;
+#ifdef SW_TEXTURE_REPEAT_POT_FAST
+    if ((tex->sWrap == SW_REPEAT) && ((tex->width & tex->wMinus1) == 0))
+    {
+        x = (int)(u*tex->width) & tex->wMinus1;
+    }
+    else
+#endif
+    {
+        u = (tex->sWrap == SW_REPEAT)? sw_fract(u) : sw_saturate(u);
+        x = (int)(u*tex->width);
+    }
+
+#ifdef SW_TEXTURE_REPEAT_POT_FAST
+    if ((tex->tWrap == SW_REPEAT) && ((tex->height & tex->hMinus1) == 0))
+    {
+        y = (int)(v*tex->height) & tex->hMinus1;
+    }
+    else
+#endif
+    {
+        v = (tex->tWrap == SW_REPEAT)? sw_fract(v) : sw_saturate(v);
+        y = (int)(v*tex->height);
+    }

    tex->readColor(color, tex->pixels, y*tex->width + x);
 }
@@ -2432,13 +2515,19 @@ static inline void sw_texture_sample_linear(float *SW_RESTRICT color, const sw_t
    int x1 = x0 + 1;
    int y1 = y0 + 1;

-    // NOTE: If the textures are POT, avoid the division for SW_REPEAT
-
    if (tex->sWrap == SW_CLAMP)
    {
        x0 = (x0 > tex->wMinus1)? tex->wMinus1 : x0;
        x1 = (x1 > tex->wMinus1)? tex->wMinus1 : x1;
    }
+#ifdef SW_TEXTURE_REPEAT_POT_FAST
+    else if ((tex->width & tex->wMinus1) == 0)
+    {
+        // POT fast path: bitmask wrap covers negative ints via two's complement
+        x0 = x0 & tex->wMinus1;
+        x1 = x1 & tex->wMinus1;
+    }
+#endif
    else
    {
        x0 = (x0%tex->width + tex->width)%tex->width;
@@ -2450,6 +2539,13 @@ static inline void sw_texture_sample_linear(float *SW_RESTRICT color, const sw_t
        y0 = (y0 > tex->hMinus1)? tex->hMinus1 : y0;
        y1 = (y1 > tex->hMinus1)? tex->hMinus1 : y1;
    }
+#ifdef SW_TEXTURE_REPEAT_POT_FAST
+    else if ((tex->height & tex->hMinus1) == 0)
+    {
+        y0 = y0 & tex->hMinus1;
+        y1 = y1 & tex->hMinus1;
+    }
+#endif
    else
    {
        y0 = (y0%tex->height + tex->height)%tex->height;
@@ -3366,7 +3462,7 @@ static void sw_triangle_clip_and_project(void)

            // Calculation of the reciprocal of W for normalization
            // as well as perspective-correct attributes
-            const float wRcp = 1.0f/v->position[3];
+            const float wRcp = sw_rcp(v->position[3]);

            // Division of XYZ coordinates by weight
            v->position[0] *= wRcp;
@@ -3481,7 +3577,7 @@ static void sw_quad_clip_and_project(void)

            // Calculation of the reciprocal of W for normalization
            // as well as perspective-correct attributes
-            const float wRcp = 1.0f/v->position[3];
+            const float wRcp = sw_rcp(v->position[3]);

            // Division of XYZ coordinates by weight
            v->position[0] *= wRcp;
@@ -3659,8 +3755,8 @@ static bool sw_line_clip_and_project(sw_vertex_t *v0, sw_vertex_t *v1)
    if (!sw_line_clip(v0, v1)) return false;

    // Convert clip coordinates to NDC
-    v0->position[3] = 1.0f/v0->position[3];
-    v1->position[3] = 1.0f/v1->position[3];
+    v0->position[3] = sw_rcp(v0->position[3]);
+    v1->position[3] = sw_rcp(v1->position[3]);
    for (int i = 0; i < 3; i++)
    {
        v0->position[i] *= v0->position[3];
@@ -3709,7 +3805,7 @@ static bool sw_point_clip_and_project(sw_vertex_t *v)
            if ((v->position[i] < -v->position[3]) || (v->position[i] > v->position[3])) return false;
        }

-        v->position[3] = 1.0f/v->position[3];
+        v->position[3] = sw_rcp(v->position[3]);
        v->position[0] *= v->position[3];
        v->position[1] *= v->position[3];
        v->position[2] *= v->position[3];
@@ -3791,6 +3887,19 @@ static void sw_immediate_begin(SWdraw mode)
            RLSW.stackModelview[RLSW.stackModelviewCounter - 1],
            RLSW.stackProjection[RLSW.stackProjectionCounter - 1]);

+#ifdef SW_HAS_ESP_DSP
+        // Pre-transpose to row-major so dspm_mult_4x4x1_f32(matMVP_rm, v, out)
+        // computes M*v directly in the per-vertex hot path. 16 scalar copies
+        // per MVP update vs. saving ~20 cycles per vertex transform.
+        for (int i = 0; i < 4; i++)
+        {
+            for (int j = 0; j < 4; j++)
+            {
+                RLSW.matMVP_rm[4*i + j] = RLSW.matMVP[4*j + i];
+            }
+        }
+#endif
+
        RLSW.isDirtyMVP = false;
    }

@@ -3842,11 +3951,17 @@ static void sw_immediate_push_vertex(const float position[4])
    sw_vertex_t *vertex = &RLSW.primitive.buffer[RLSW.primitive.vertexCount++];

    // Calculate clip coordinates
+#ifdef SW_HAS_ESP_DSP
+    // dspm_mult_4x4x1_f32 declares its inputs non-const; rlsw treats them as
+    // read-only and the cast is safe (the kernel only loads from B).
+    dspm_mult_4x4x1_f32(RLSW.matMVP_rm, (float *)position, vertex->position);
+#else
    const float *m = RLSW.matMVP;
    vertex->position[0] = m[0]*position[0] + m[4]*position[1] + m[8]*position[2] + m[12]*position[3];
    vertex->position[1] = m[1]*position[0] + m[5]*position[1] + m[9]*position[2] + m[13]*position[3];
    vertex->position[2] = m[2]*position[0] + m[6]*position[1] + m[10]*position[2] + m[14]*position[3];
    vertex->position[3] = m[3]*position[0] + m[7]*position[1] + m[11]*position[2] + m[15]*position[3];
+#endif

    // Copy the attributes in the current vertex
    for (int i = 0; i < 4; i++) vertex->color[i] = RLSW.primitive.color[i];
@@ -5272,7 +5387,7 @@ static void SW_RASTER_TRIANGLE_SPAN(const sw_vertex_t *start, const sw_vertex_t
    if (xStart == xEnd) return;

    // Compute the inverse horizontal distance along the X axis
-    float dxRcp = 1.0f/(end->position[0] - start->position[0]);
+    float dxRcp = sw_rcp(end->position[0] - start->position[0]);

    // Compute the interpolation steps along the X axis
    float dWdx = (end->position[3] - start->position[3])*dxRcp;
@@ -5326,12 +5441,12 @@ static void SW_RASTER_TRIANGLE_SPAN(const sw_vertex_t *start, const sw_vertex_t
        int blockEnd = x + SW_AFFINE_BLOCK;
        if (blockEnd > xEnd) blockEnd = xEnd;
        float blockLenF = (float)(blockEnd - x);
-        float blockLenRcp = 1.0f/blockLenF;
+        float blockLenRcp = sw_rcp(blockLenF);

        // Only 2 '1/w' here; none inside the pixel loop
-        float wRcpA = 1.0f/w;
+        float wRcpA = sw_rcp(w);
        float wB = w + dWdx*blockLenF;
-        float wRcpB = 1.0f/wB;
+        float wRcpB = sw_rcp(wB);

        // Perspective-correct color at both block endpoints, then affine gradient
        float srcColor[4] = {
@@ -5459,9 +5574,9 @@ static void SW_RASTER_TRIANGLE(const sw_vertex_t *v0, const sw_vertex_t *v1, con
    if (h02 < 1e-6f) return;

    // Inverse edge dy for per-edge dV/dy (scanline interpolation)
-    float h02Rcp = 1.0f/h02;
-    float h01Rcp = (h01 > 1e-6f)? 1.0f/h01 : 0.0f;
-    float h12Rcp = (h12 > 1e-6f)? 1.0f/h12 : 0.0f;
+    float h02Rcp = sw_rcp(h02);
+    float h01Rcp = (h01 > 1e-6f)? sw_rcp(h01) : 0.0f;
+    float h12Rcp = (h12 > 1e-6f)? sw_rcp(h12) : 0.0f;

    // Compute gradients for each side of the triangle
    sw_vertex_t dVXdy02, dVXdy01, dVXdy12;
@@ -5560,8 +5675,8 @@ static void SW_RASTER_QUAD(const sw_vertex_t *a, const sw_vertex_t *b,
    float h = (float)(yMax - yMin);
    if ((w <= 0) || (h <= 0)) return;

-    float wRcp = 1.0f/w;
-    float hRcp = 1.0f/h;
+    float wRcp = sw_rcp(w);
+    float hRcp = sw_rcp(h);

    // Subpixel corrections
    float xSubstep = 1.0f - sw_fract(tl->position[0]);
@@ -5746,7 +5861,7 @@ static void SW_RASTER_LINE(const sw_vertex_t *v0, const sw_vertex_t *v1)
    // Compute per pixel increments
    float xInc = dx/steps;
    float yInc = dy/steps;
-    float stepRcp = 1.0f/steps;
+    float stepRcp = sw_rcp(steps);
 #ifdef SW_ENABLE_DEPTH_TEST
    float zInc = (v1->position[2] - v0->position[2])*stepRcp;
 #endif