* fix warnings: goto label not used outside of SW_ENABLE_DEPTH_TEST
* comment out x coordinates that aren't used in SW_RASTER_TRIANGLE
* silence warnings: unused DrmModeConnector functions in rcore_drm.c when using GRAPHICS_API_OPENGL_SOFTWARE
* [rlsw] Add sw_rcp helper using Xtensa recip0.s for hot-path divisions
Adds a `sw_rcp(x)` inline reciprocal that on Xtensa (ESP32 / ESP32-S3
LX6/LX7) emits a `recip0.s` seed plus two Newton-Raphson refinement
steps -- 1-ULP accurate in ~7 instructions, all in FPU registers.
On every other target it expands to plain `1.0f/x`, so generated code
is byte-identical to before for non-Xtensa builds.
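For illustration, the seed-plus-refinement idea can be sketched in portable C. This is not the code in the commit: `rcp_seed_sketch` and `sw_rcp_sketch` are hypothetical names, and the bit-trick seed is only a stand-in for the `recip0.s` instruction. With this coarser seed, two Newton-Raphson steps do not reach the 1-ULP accuracy quoted above, and the sketch assumes positive, normal inputs.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the hardware seed: a bit-trick estimate of 1/x,
   good to within a few percent for positive, normal floats.
   On Xtensa the commit uses the recip0.s instruction instead. */
static inline float rcp_seed_sketch(float x)
{
    uint32_t bits;
    float y;
    memcpy(&bits, &x, sizeof bits);
    bits = 0x7EF311C7u - bits;      /* negate the exponent, roughly fix the mantissa */
    memcpy(&y, &bits, sizeof y);
    return y;
}

/* Seed plus two Newton-Raphson refinement steps for f(y) = 1/y - x.
   Each step roughly squares the relative error of the estimate. */
static inline float sw_rcp_sketch(float x)
{
    float y = rcp_seed_sketch(x);
    y = y*(2.0f - x*y);
    y = y*(2.0f - x*y);
    return y;
}
```

A call site such as `float wRcp = sw_rcp_sketch(w);` then replaces `1.0f/w` without emitting a division.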
Replaces the hot-path `1.0f/x` calls that were previously compiling to
the `__divsf3` software helper on Xtensa:
- perspective divide (1/w) in triangle clip-and-project (PCT and PC paths)
- line and point clip-and-project NDC conversion
- triangle span setup: dxRcp, blockLenRcp, wRcpA, wRcpB
- triangle scanline setup: h02Rcp, h01Rcp, h12Rcp
- axis-aligned quad: wRcp, hRcp
- line rasterizer: stepRcp
Other `1.0f/x` uses (matrix translate/normalize, texture init `tx`/`ty`,
sw_matrix_rotate inverse-length) are not on the per-pixel hot path and
are left untouched.
Measured on ESP32-S3 @ 240 MHz, R5G6B5 240x240, textured 3D model:
contributes to a ~10-15% rasterization speedup.
Made-with: Cursor
* [rlsw] Use ESP-DSP for 4x4 matrix multiply and per-vertex MVP transform
Adds an opt-in ESP-DSP code path for ESP32 / ESP32-S3 builds. ESP-DSP is
ESP-IDF's official optimized math library and ships hand-vectorized
kernels that beat the scalar implementations on Xtensa.
Two integration points:
1. `sw_matrix_mul_rst` -> `dspm_mult_4x4x4_f32` for any 4x4*4x4 multiply
(used for MVP build, gluLookAt, push/multiply, etc.). rlsw stores
matrices column-major and ESP-DSP reads row-major; the comment on the
call site explains why the flat-buffer call still produces the
correct column-major product (transpose-of-transposes equivalence).
2. `sw_immediate_push_vertex` -> `dspm_mult_4x4x1_f32` for the per-vertex
clip-space transform. Because ESP-DSP expects a row-major matrix in
this case, a row-major copy `matMVP_rm[16]` is maintained alongside
`matMVP` and refreshed once per `isDirtyMVP` rebuild in
`sw_immediate_begin`. Cost is 16 scalar copies per matrix update,
amortized over thousands of vertices per frame.
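The transpose equivalence in point 1 can be checked with a small portable sketch. It does not call ESP-DSP; `mul4x4_row_major` is a hypothetical stand-in for a row-major kernel. Whether rlsw swaps the operands or relies on its own argument convention is not stated here, so the swap below is an assumption made for the demonstration.

```c
/* Stand-in for a row-major 4x4 kernel: out = a*b, element (r,c) at [r*4 + c].
   This is a plain scalar loop, not the ESP-DSP implementation. */
static void mul4x4_row_major(const float *a, const float *b, float *out)
{
    for (int r = 0; r < 4; r++) {
        for (int c = 0; c < 4; c++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++) sum += a[r*4 + k]*b[k*4 + c];
            out[r*4 + c] = sum;
        }
    }
}

/* Reference column-major multiply: out = a*b, element (r,c) at [c*4 + r]. */
static void mul4x4_col_major(const float *a, const float *b, float *out)
{
    for (int c = 0; c < 4; c++) {
        for (int r = 0; r < 4; r++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++) sum += a[k*4 + r]*b[c*4 + k];
            out[c*4 + r] = sum;
        }
    }
}

/* A column-major buffer read as row-major is the transpose of the matrix.
   The row-major kernel called with (b, a) computes B^T * A^T = (A*B)^T,
   and (A*B)^T laid out row-major is exactly A*B laid out column-major. */
static void mul4x4_col_major_via_row_kernel(const float *a, const float *b, float *out)
{
    mul4x4_row_major(b, a, out);
}
```

The same reasoning explains point 2: a matrix-vector product has no second matrix to swap with, so a genuinely row-major copy of the matrix is needed.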
Detection is **opt-in** via `SW_USE_ESP_DSP` so existing ESP-IDF projects
that don't depend on the `esp-dsp` component keep building unchanged.
A user enables it from CMakeLists.txt (or anywhere before including
rlgl.h):
    target_compile_definitions(${COMPONENT_LIB} PRIVATE SW_USE_ESP_DSP=1)
and adds the dependency to `idf_component.yml`:
    espressif/esp-dsp: "^1.4.0"
Measured on ESP32-S3 @ 240 MHz, R5G6B5 240x240, textured 3D model:
contributes meaningfully to the overall frame-time improvement
(combined with sw_rcp).
Made-with: Cursor
* [rlsw] Add SW_TEXTURE_REPEAT_POT_FAST opt-in for POT bitmask wrap
Adds an opt-in compile-time flag that replaces the SW_REPEAT wrap chain
with a bitmask (`x & (size-1)`) for power-of-two textures. NPOT textures
keep using the original `sw_fract` / signed-modulo paths via a runtime
`(size & (size-1)) == 0` check, so SW_REPEAT remains correct for them.
Affects two samplers:
- `sw_texture_sample_nearest`: drops the `floorf` + multiply + cast for
POT textures in REPEAT mode (saves a software call on Xtensa).
- `sw_texture_sample_linear`: replaces the `(x % w + w) % w` two-step
modulo (a software divide on Xtensa) with a single bitwise AND for
POT textures in REPEAT mode. Two's-complement int wrap covers
negative coordinates correctly.
Off by default: for POT textures sampled with negative UVs, bitmask wrap
can differ from `sw_fract` wrap by one texel at the boundary. That is
imperceptible at typical resolutions but technically a behavior change,
so existing users get bit-for-bit identical output. Opt in if you
control your asset UVs and want the speedup:
    #define SW_TEXTURE_REPEAT_POT_FAST
This addresses the long-standing TODO comment "If the textures are POT,
avoid the division for SW_REPEAT" in `sw_texture_sample_linear`.
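A minimal sketch of the integer wrap in the linear sampler, assuming two's-complement `int` and hypothetical helper names (`wrap_repeat_mod`, `wrap_repeat_pot`, `wrap_repeat`); it covers only the integer path, not the `sw_fract` path of the nearest sampler:

```c
#include <assert.h>

/* Original-style wrap: two modulo operations, correct for any positive size. */
static inline int wrap_repeat_mod(int x, int size)
{
    return (x % size + size) % size;
}

/* Runtime power-of-two check, as described in the commit message. */
static inline int is_pot(int size)
{
    return size > 0 && (size & (size - 1)) == 0;
}

/* Fast wrap: valid only for power-of-two sizes. With two's-complement
   integers the mask also maps negative coordinates into [0, size). */
static inline int wrap_repeat_pot(int x, int size)
{
    assert(is_pot(size));
    return x & (size - 1);
}

/* Dispatch: POT sizes take the mask, NPOT sizes keep the modulo path. */
static inline int wrap_repeat(int x, int size)
{
    return is_pot(size) ? wrap_repeat_pot(x, size) : wrap_repeat_mod(x, size);
}
```

For integer texel coordinates the two paths agree, including for negative `x`; the one-texel difference mentioned above comes from the float-to-texel conversion in the `sw_fract` path, which this sketch does not model.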
Made-with: Cursor
* auto-generate all combinations of blending factors
This adds a macro system that generates one function for each possible combination of blending factors: 11*11, hence 121 functions.
Blending now needs only one indirection and function call instead of the previous two (assuming the first call was inlined).
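The generation scheme can be sketched with nested X-macros. This is an illustration only: it uses 3 factors (9 functions) instead of 11, applies one scalar factor per side, and every name in it (`color_t`, `FACTOR_*`, `blend_table`, ...) is hypothetical rather than taken from rlsw.

```c
typedef struct { float r, g, b, a; } color_t;

/* One expression per blend factor (hypothetical, scalar factors only). */
#define FACTOR_ZERO(src, dst)       0.0f
#define FACTOR_ONE(src, dst)        1.0f
#define FACTOR_SRC_ALPHA(src, dst)  ((src).a)

/* Two lists are needed because a macro cannot expand inside itself. */
#define SRC_FACTORS(X)     X(ZERO) X(ONE) X(SRC_ALPHA)
#define DST_FACTORS(X, S)  X(S, ZERO) X(S, ONE) X(S, SRC_ALPHA)

/* Generates one specialized function per (src, dst) factor pair. */
#define DEFINE_BLEND(S, D)                                        \
    static color_t blend_##S##_##D(color_t src, color_t dst)      \
    {                                                             \
        float fs = FACTOR_##S(src, dst);                          \
        float fd = FACTOR_##D(src, dst);                          \
        color_t out = { src.r*fs + dst.r*fd, src.g*fs + dst.g*fd, \
                        src.b*fs + dst.b*fd, src.a*fs + dst.a*fd }; \
        return out;                                               \
    }
#define DEFINE_BLEND_ROW(S) DST_FACTORS(DEFINE_BLEND, S)
SRC_FACTORS(DEFINE_BLEND_ROW)

/* Dispatch table indexed by [src factor][dst factor]. */
typedef color_t (*blend_fn)(color_t src, color_t dst);
enum { F_ZERO, F_ONE, F_SRC_ALPHA, F_COUNT };

#define TABLE_ENTRY(S, D) [F_##S][F_##D] = blend_##S##_##D,
#define TABLE_ROW(S)      DST_FACTORS(TABLE_ENTRY, S)
static const blend_fn blend_table[F_COUNT][F_COUNT] = { SRC_FACTORS(TABLE_ROW) };
```

A single lookup such as `blend_table[F_SRC_ALPHA][F_ONE](src, dst)` then reaches fully specialized code with one indirect call.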
* rename dispatch tables for consistency
* change blend funcs validity check
Simplifies the validation of blend functions.
As a side effect, `SW_SRC_ALPHA_SATURATE` may now be accepted as a dst factor; this is tolerated.
* disables blending when it requires alpha and there is none
* review immediate rendering functions and attribute layout
* prevent state changes during immediate record
* reduce the number of ops per vertex push + review primitive struct
* simplified draw functions
* review `sw_vertex_t`
removes `float screen[2]`; each step stores the transformed coordinates in `float coord[4]`.
This also simplifies vertex interpolation during triangle rasterization.
* reduces unnecessary interpolation costs during triangle rasterization + cleanup
* extends the SIMD color conversion to more cases
* affine interpolation per blocks
* long side check for each triangle line
My mistake in a previous commit
* style tweaks
* select the read function on texture load
This removes the per-pixel switch; it's slightly more efficient on my hardware, probably because the switch was poorly predicted.
It should remain profitable elsewhere, or at worst perform the same.
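A rough sketch of the idea, with hypothetical types and names (`texture_sketch_t`, `read_fn_t`, `texture_load_sketch`) and only two formats: the format switch runs once at load time and stores a function pointer that the per-pixel code calls directly.

```c
#include <stdint.h>

typedef void (*read_fn_t)(const void *pixels, uint32_t index, float out[4]);

typedef enum { FMT_R8G8B8A8, FMT_R5G6B5 } format_sketch_t;

typedef struct {
    const void *pixels;
    read_fn_t read;     /* chosen once, when the texture is loaded */
} texture_sketch_t;

static void read_r8g8b8a8(const void *pixels, uint32_t index, float out[4])
{
    const uint8_t *p = (const uint8_t *)pixels + 4u*index;
    for (int i = 0; i < 4; i++) out[i] = (float)p[i]/255.0f;
}

static void read_r5g6b5(const void *pixels, uint32_t index, float out[4])
{
    uint16_t p = ((const uint16_t *)pixels)[index];
    out[0] = (float)((p >> 11) & 0x1F)/31.0f;
    out[1] = (float)((p >> 5) & 0x3F)/63.0f;
    out[2] = (float)(p & 0x1F)/31.0f;
    out[3] = 1.0f;      /* format has no alpha channel */
}

/* The only place the format is switched on; returns 0 for unknown formats. */
static int texture_load_sketch(texture_sketch_t *tex, const void *pixels,
                               format_sketch_t format)
{
    tex->pixels = pixels;
    switch (format) {
    case FMT_R8G8B8A8: tex->read = read_r8g8b8a8; return 1;
    case FMT_R5G6B5:   tex->read = read_r5g6b5;   return 1;
    }
    tex->read = 0;
    return 0;
}
```

The sampler then calls `tex->read(tex->pixels, index, color)` for every pixel, trading a switch for an indirect call.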
* use optional LUT for uint8_t -> float conversion
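A minimal sketch of such a lookup table, assuming normalization to [0, 1] and using hypothetical names (`u8_to_float_lut`, `u8_to_float_lut_init`, `u8_to_float`); how rlsw builds or gates its table is not shown here.

```c
#include <stdint.h>

/* 256 precomputed values (1 KiB), so the per-pixel conversion is one load. */
static float u8_to_float_lut[256];

/* Must be called once before u8_to_float() is used. */
static void u8_to_float_lut_init(void)
{
    for (int i = 0; i < 256; i++) u8_to_float_lut[i] = (float)i/255.0f;
}

static inline float u8_to_float(uint8_t v)
{
    return u8_to_float_lut[v];
}
```

Whether the table beats the arithmetic version depends on the target's cache and FPU, which is presumably why it is optional.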
* makes the post-clipping vertex count and the clipping epsilon internal + a little cleanup
* moves color conversion to math part
* prevents sampling if it's a depth texture that is bound
* review texture formats
Added support for `R3G3B2`, `R5G6B5`, `R4G4B4A4` and `R5G5B5A1`
Added depth formats
* use of textures for the framebuffer
- Framebuffers can now use all texture types that are already available.
- The 24-bit depth format has been removed as it is no longer needed.
- Framebuffer formats are still defined at compile time.
- The allocated texture size is now preserved, which avoids frequent reallocations when resizing framebuffers and will allow the use of `glTexSubImage2D`.
* review framebuffer blit/copy
This greatly simplifies the framebuffer blit/copy logic while now supporting all pixel formats. It is slightly slower in debug builds, but this path is mainly kept for compatibility anyway. The `copy_fast` version is still used for the "normal" cases when presenting to the screen.
* review pixel get/set
fewer ops for certain formats + fixes
* fix depth write
* texture read/write cleanup + tweaks
I made the pointer parameters `restrict` for reading/writing textures, which resulted in a slight improvement.
I also reviewed the `static inline` qualifiers, which could potentially bias the compiler; no measurable difference, but it's cleaner.
* style tweaks
* review uint8_t <-> float conversion
* added a reusable object pool system
will allow management of both textures and framebuffers
added support for `glTexSubImage2D`
added handling of 'GL_OUT_OF_MEMORY' errors
removed the default internal texture (unused)
* added FBO API + refactored rasterizer dispatch logic
* fix ndc projection + review presentation
and rename rlsw's resize/copy/blit
* add `glRenderbufferStorage` binding
+ tweaks and fixes
* fix quad sorting + simplify quad rasterization part
* fix line shaking issue
* support of `GL_DRAW_FRAMEBUFFER_BINDING`
* update rlgl - support of rlsw's framebuffers
* fix pixel origin in line rasterization
my bad, an oversight in my previous fix.
This offset should have been moved here rather than per pixel during truncation.
* style tweaks
* fix VLA issue with MSVC - fill depth / fill color
Redesigned to support disabling features on compilation with `-DSUPPORT_FEATURE=0`
REMOVED: `SUPPORT_DEFAULT_FONT`, always supported
REMOVED: `SUPPORT_IMAGE_MANIPULATION`, always supported
REMOVED: `SUPPORT_TEXT_MANIPULATION`, always supported
REDESIGNED: `SUPPORT_FONT_ATLAS_WHITE_REC` to `FONT_ATLAS_CORNER_REC_SIZE`
REVIEWED: Config values (other than 0-1) are already defined on respective modules
Other config tweaks here and there
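One way such value-based flags can work is sketched below, under the assumption that the code tests the flag's value with `#if` rather than its presence with `#if defined`. `SUPPORT_FEATURE` is the placeholder name used above and `feature_is_compiled_in` is hypothetical.

```c
/* Default to enabled when the build system says nothing. */
#ifndef SUPPORT_FEATURE
    #define SUPPORT_FEATURE 1
#endif

/* Testing the value (not mere definition) lets -DSUPPORT_FEATURE=0
   disable the feature; with #if defined(...) it would stay enabled. */
static int feature_is_compiled_in(void)
{
#if SUPPORT_FEATURE
    return 1;
#else
    return 0;
#endif
}
```

Building with `-DSUPPORT_FEATURE=0` compiles the `#else` branch, while omitting the flag keeps the feature on.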
* win32 clipboard: fix for BI_ALPHABITFIELDS narrow support
* Define BI_ALPHABITFIELDS even if wingdi headers are already included
since BI_ALPHABITFIELDS is not always defined there