Source of CPU waiting on GPU during post processing?

This topic contains 8 replies, has 2 voices, and was last updated by  peted 4 years, 3 months ago.

    #31318

    peted
    Member

    Hi, I have a PC/console engine ported to mobile (currently targeting iOS devices, so 5-series SGX), and have run into some ES driver behavior that we are unable to explain or trace to a root cause. I know that you may not be able to provide any specific advice about iOS, but any help would be deeply appreciated!

    The problem is: when we enable a certain post-processing pass (bloom), our CPU usage spikes and GPU utilization drops, due to what looks like an operation stalling the CPU on the GPU. The symptom is visible both in the handy CPU/GPU time and utilization monitors in Xcode and in the “OpenGL ES Driver Monitor”. When we enable this pass, the driver monitor shows “hardware wait time” jump from 0 to a large value (tens of millions; the docs don’t give the units). In Xcode, the CPU time jumps about 10-20ms (approximately our frame time), while the GPU time jumps 3-4ms, and GPU utilization drops from 99% to a noticeably lower value.

    The passes in question are simple, each renders into FBOs with texture attachments, and the next pass uses the resulting texture. The first one starts by using the texture the scene was rendered into. Each pass is reduced in size by half, and no FBOs are re-used throughout postprocessing. Some example shaders are below.

    We’re not doing anything that would historically cause the CPU to wait on the GPU, like glReadPixels, any queries, mapping a buffer without UNSYNCHRONIZED set, or generally modifying any resources in use by the GPU. This behavior doesn’t change when we modify the complexity of the shaders, or change the amount of geometry drawn. We’ve also experimented with fewer passes, and while the time does go down, the stall is still apparent (the CPU still seems to wait for the GPU to finish the main rendering passes before starting the postprocessing).

    We’d really love for someone to see if anything in here would be causing the driver to wait on the GPU, or could provide any advice about what might be going wrong. I can fill in any missing details about how our frame is rendered if that might help.

    Thanks much!

    Example data:

    Here’s a GL trace for two typical blits in the post process chain:

    #1115 glPushGroupMarkerEXT(0, "ImageBloomBlurMobile")
    #1116 glBindFramebuffer(GL_FRAMEBUFFER, 7)
    #1117 glViewport(0, 0, 800, 608)
    #1118 glDisable(GL_SCISSOR_TEST)
    #1119 glBindBuffer(GL_ARRAY_BUFFER, 22)
    #1120 glVertexAttribPointer(0, 2, GL_FLOAT, 0, 8, nullptr)
    #1121 glClearColor(1.0000000, 1.0000000, 0.0000000, 1.0000000)
    #1122 glClear(GL_COLOR_BUFFER_BIT)
    #1123 glUseProgram(125)
    #1124 glUniform4fv(vs_uniforms_vec4[0], 5, )
    #1125 glActiveTexture(GL_TEXTURE0)
    #1126 glBindTexture(GL_TEXTURE_2D, 17)
    #1127 glDrawArrays(GL_TRIANGLES, 0, 3)
    #1128 glPopGroupMarkerEXT()
    #1129 glPushGroupMarkerEXT(0, "ImageBloomBlurWideMobile")
    #1130 glBindFramebuffer(GL_FRAMEBUFFER, 9)
    #1131 glViewport(0, 0, 416, 320)
    #1132 glDisable(GL_SCISSOR_TEST)
    #1133 glBindBuffer(GL_ARRAY_BUFFER, 22)
    #1134 glVertexAttribPointer(0, 2, GL_FLOAT, 0, 8, nullptr)
    #1135 glClearColor(1.0000000, 1.0000000, 0.0000000, 1.0000000)
    #1136 glClear(GL_COLOR_BUFFER_BIT)
    #1137 glUseProgram(142)
    #1138 glUniform4fv(vs_uniforms_vec4[0], 5, )
    #1139 glActiveTexture(GL_TEXTURE0)
    #1140 glBindTexture(GL_TEXTURE_2D, 18)
    #1141 glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE)
    #1142 glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE)
    #1143 glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR)
    #1144 glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR)
    #1145 glDrawArrays(GL_TRIANGLES, 0, 3)
    #1146 glPopGroupMarkerEXT()

    And here are some example fragment shaders; the vertex shaders are what you’d expect (sorry they are weird looking, they are all automatically generated by reading in our compiled HLSL shaders).

    #if GL_ES
    precision lowp float;
    #endif

    const vec4 ps_c0 = vec4(0.166667, 0.0, 0.0, 0.0);
    uniform sampler2D g_samSourceA;
    varying vec4 v_texcoord0;
    varying vec4 v_texcoord1;
    varying vec4 v_texcoord2;
    varying vec4 v_texcoord3;
    varying vec4 v_texcoord4;
    varying vec4 v_texcoord5;

    void main()
    {
    vec4 t0_ps;
    vec4 t1_ps;
    t0_ps = texture2D(g_samSourceA, v_texcoord0.xy);
    t1_ps = texture2D(g_samSourceA, v_texcoord1.xy);
    t0_ps = t0_ps + t1_ps;
    t1_ps = texture2D(g_samSourceA, v_texcoord2.xy);
    t0_ps = t0_ps + t1_ps;
    t1_ps = texture2D(g_samSourceA, v_texcoord3.xy);
    t0_ps = t0_ps + t1_ps;
    t1_ps = texture2D(g_samSourceA, v_texcoord4.xy);
    t0_ps = t0_ps + t1_ps;
    t1_ps = texture2D(g_samSourceA, v_texcoord5.xy);
    t0_ps = t0_ps + t1_ps;
    gl_FragData[0] = t0_ps * ps_c0.xxxx;
    }

    #if GL_ES
    precision lowp float;
    #endif

    uniform vec4 ps_uniforms_vec4[3];
    const vec4 ps_c0 = vec4(0.0, 0.0, 0.0, 0.0);
    #define g_fBloomCutoff ps_uniforms_vec4[0]
    #define g_fBloomStrength ps_uniforms_vec4[1]
    #define g_vBloomColor ps_uniforms_vec4[2]
    uniform sampler2D g_samSourceA;
    varying vec4 v_texcoord0;

    void main()
    {
    vec4 t0_ps;
    vec4 t1_ps;
    t0_ps = texture2D(g_samSourceA, v_texcoord0.xy);
    t0_ps.xyz = (t0_ps.xyz * g_vBloomColor.xyz) + g_fBloomCutoff.xxx;
    gl_FragData[0].xyz = t0_ps.xyz * g_fBloomStrength.xxx;
    gl_FragData[0].w = ps_c0.x;
    }
    #37539

    Joe Davis
    Member

    Hi,

    The one thing that’s jumping out at me is the calls to glTexParameteri(). You should set texture wrap and filter modes on texture creation and avoid changing them. Resetting these modes each frame could explain the CPU overhead you’re seeing.

    Thanks,
    Joe

    #37540

    peted
    Member

    Hi Joe, thanks for the hint. I went and fixed our state caching system to be a little more capable and got rid of those parameter changes (luckily they were redundant; it was always setting them to the same thing). Unfortunately I’m not seeing any difference: I still see what looks like the CPU stalling on the GPU when I enable these passes. Still good to get rid of those extra commands though 🙂
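
    For readers following along, the kind of redundant-state filtering described above can be sketched roughly as below. This is only an illustration, not the engine's actual code: `TexParamCache`, `cached_tex_parameteri`, the `TexParam` enum, and the counting stub `driver_tex_parameteri` (standing in for glTexParameteri so the sketch runs without a GL context) are all hypothetical names.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical stand-in for glTexParameteri: it only counts how many
       calls would actually reach the driver. */
    static int g_driver_calls = 0;

    static void driver_tex_parameteri(int pname, int value) {
        (void)pname;
        (void)value;
        g_driver_calls++;
    }

    /* Hypothetical indices for the four parameters seen in the trace. */
    typedef enum {
        PARAM_WRAP_S,
        PARAM_WRAP_T,
        PARAM_MAG_FILTER,
        PARAM_MIN_FILTER,
        PARAM_COUNT
    } TexParam;

    /* One cache per texture object: remembers the last value sent. */
    typedef struct {
        bool known[PARAM_COUNT];
        int  value[PARAM_COUNT];
    } TexParamCache;

    /* Forwards the call only when the value differs from the cached one.
       Returns true if the driver was actually called. */
    static bool cached_tex_parameteri(TexParamCache *cache, TexParam pname, int value) {
        assert(cache != NULL);
        assert((int)pname >= 0 && (int)pname < (int)PARAM_COUNT);
        if (cache->known[pname] && cache->value[pname] == value) {
            return false; /* redundant: skip the driver call */
        }
        driver_tex_parameteri((int)pname, value);
        cache->known[pname] = true;
        cache->value[pname] = value;
        return true;
    }
    ```

    With a zero-initialized cache per texture, setting the same four parameters every frame reaches the driver only on the first frame. As noted in the post, removing these calls did not make the stall go away in this case.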

    #37541

    Joe Davis
    Member

    The passes in question are simple, each renders into FBOs with texture attachments, and the next pass uses the resulting texture. The first one starts by using the texture the scene was rendered into. Each pass is reduced in size by half, and no FBOs are re-used throughout postprocessing.

    Are the FBO attachments unique, for example a depth attachment is not bound to more than one FBO?
    We recommend developers use unique attachments for each FBO. This enables the driver to take the optimal path.

    If your FBOs have depth and/or stencil attachments, you should also use glClear for these.

    glUniform4fv(vs_uniforms_vec4[0], 5, )

    I would be surprised if it was related, but I’ve spotted that you’ve passed a count value of 5 to glUniform4fv. The driver should ignore the count value being too large. However, this value should be 4 as you’re passing in a 4 component vector.

    #37542

    peted
    Member

    Hi Joe,

    Thanks again for the closer look and analysis. Unfortunately all my FBOs use unique textures (per all the recommendations), and these blits are pure image processing and do not use depth attachments (hence no clear of GL_DEPTH_BUFFER_BIT). FWIW, the engine tries to be pretty scrupulous about using glDiscardFramebufferEXT on attachments whenever possible to avoid the tile contents being copied to memory (but in this case there is nothing unused).

    In case it is relevant, all the textures attached to my FBOs are specified using glTexStorage2DEXT, except for those that are renderbuffer attachments, in which case they use glRenderbufferStorage. All the buffers in question are 8-bit RGBA (GL_RGBA8_OES), but I just tried 565 buffers and the behavior is identical.

    The glUniform4fv thing I think is a red herring / confusion with glUniform4f. That 5 (the “count” parameter) indicates it is setting a 5-element array of vec4s. It is not specifying 5 floats (in this case, the array is 20 floats). The intent is to specify data for a shader uniform like this:

    uniform vec4 vs_uniforms_vec4[5];

    peted
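
    The count arithmetic in this post can be checked with a small sketch. It assumes nothing beyond the documented meaning of glUniform4fv's `count` argument (number of vec4 elements, not floats); `Vec4ArrayUniform` and `fake_uniform4fv` are hypothetical CPU-side stand-ins, not GL calls, so the example runs without a context.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    enum { VEC4_COMPONENTS = 4, VS_UNIFORMS_VEC4_LEN = 5 };

    /* Hypothetical CPU-side stand-in for the storage behind
       `uniform vec4 vs_uniforms_vec4[5];` */
    typedef struct {
        float data[VS_UNIFORMS_VEC4_LEN * VEC4_COMPONENTS];
    } Vec4ArrayUniform;

    /* Mimics the argument meaning of glUniform4fv(location, count, value):
       `count` is the number of vec4 elements, so count * 4 floats are read
       from `value`. Returns the number of floats consumed. */
    static size_t fake_uniform4fv(Vec4ArrayUniform *u, size_t count, const float *value) {
        assert(u != NULL && value != NULL);
        assert(count <= (size_t)VS_UNIFORMS_VEC4_LEN);
        size_t floats = count * VEC4_COMPONENTS;
        memcpy(u->data, value, floats * sizeof(float));
        return floats;
    }
    ```

    Calling `fake_uniform4fv(&u, 5, src)` consumes 20 floats, matching the 5-element vec4 array in the trace, while a count of 1 would update a single vec4 (4 floats).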

    #37543

    Joe Davis
    Member

    Hi Peted,

    The glUniform4fv thing I think is a red herring / confusion with glUniform4f

    My bad. I misinterpreted the man pages.

    Ahhh, ok. I can’t see any reason why the glTexStorage2DEXT() extension would introduce a CPU overhead, as its purpose is to remove draw-time checks. You should discuss the issue with Apple though to see if there are any quirks to their implementation of this extension.

    Unfortunately, I don’t think there is any more I can do to help you investigate this issue without a minimal reproduction test for Linux, Android or Windows.

    Thanks,
    Joe

    #37544

    peted
    Member

    Thanks for all your help.

    #37545

    Joe Davis
    Member

    You’re welcome 🙂

    If you can create a reproduction example application for any of the operating systems I’ve mentioned, let me know.

    Thanks,
    Joe

    #37546

    peted
    Member

    I’m fairly certain it’s a problem specific to iOS. It doesn’t seem like this problem is expected when using PVR hardware.

    I will see what I can do for Android; that may be possible, but in the meantime the release of this game will be crippled on this hardware, which makes us sad. I’ve previously tried using your Windows ES2 library, but it started returning odd errors from functions (I posted to the forums). I think it has some bugs that get tickled when you context switch a bunch. I ended up using Google’s ANGLE; the same code sequence worked on that (and on iOS).
