Plan for tomorrow:

  1. Try to download the dependency of https://github.com/NCCA/Sponza
  2. If that doesn't work, try to load the textures
  3. Change from single float to short for peripheral pixels
  4. Finish homework for 726 in Python


Low-level optimization of KFR

  1. Optimization of log(||(x - x0, y - y0)||)
  2. Optimization of log function
  3. Optimization of fast atan
  4. Make the shader more complex to extend the rendering time to greater than 16ms

I will talk about each step in detail.

  • Optimization of log(||(x - x0, y - y0)||)
    • There is rendering time reduction
      • Original 52.96ms
      • 1/2 buffer: 15.39ms ->15.57ms
      • 1/4 buffer: 4.20ms -> 4.10ms
  • Optimization of log function
    • The fast-log contains at least 5 branches (possibly 5 additions and 5 shifts for a 32-bit calculation)
    • Nvidia's log algorithm is not available online, but on AMD GPUs the cost of log, exp, sin, and cos is about 4x that of add/sub. We can assume Nvidia does no worse than AMD.
      • Reference1: http://www.iquilezles.org/www/articles/palettes/palettes.htm (Iq talking about sin, cos in GLSL)
        • Popular wisdom (especially between old-school coders) is that trigonometric functions are expensive and that therefore it is important to avoid them (by means of LUTs or linear/triangular approximations). Often popular wisdom is wrong – despite the above still holds true in some especial cases (a CPU heavy inner loop) it does not in general: for example, in the GPU, computing a cosine is way, way faster than any attempt to approximate it. So, lets take advantage of this and go with the straight cosine expression.
      • Analysis of AMD GPU: https://seblagarde.wordpress.com/tag/gpu-performance/
        • Full rate (FR): mul, mad, add, sub, and, or, bit shift… Quarter rate (QR): transcendental instructions like rcp, sqrt, rsqrt, cos, sin, log, exp…
      • Discussion about instruction cost:
        • 1/x, sin(x), cos(x), log2(x), exp2(x), 1/sqrt(x) – 0 or close to 0, as long as they are limited to 1/9 of all total ops (can go up to 1/5 for Maxwell).
  • Optimization of fast atan (I have only tried the diamond angle so far; I will try CORDIC later.)
    • Simple comparison of atan2 and diamond angle.
    • A test of shadertoy: https://www.shadertoy.com/view/lllyR4
  • Make the shader more complex to extend the rendering time to greater than 16ms
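The fast-log idea above can be sketched as a hypothetical CPU-side model in Python (a GPU version would use floatBitsToUint-style bit casts instead of struct; the function names are mine, and the ~0.086 error bound is a property of the linear mantissa approximation, not a measured GPU figure):

```python
import struct

def fast_log2(x):
    """Approximate log2 for positive finite floats, branch-free: the
    IEEE-754 exponent field gives the integer part, and the raw mantissa
    is a linear approximation of the fractional part (max error ~0.086)."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    exponent = (bits >> 23) - 127
    mantissa = (bits & 0x7FFFFF) / float(1 << 23)   # in [0, 1)
    return exponent + mantissa                      # ~ exponent + log2(1 + mantissa)

def log_distance(x, y, x0, y0):
    """log ||(x - x0, y - y0)|| computed as 0.5 * ln(2) * log2(dx^2 + dy^2),
    which avoids the sqrt entirely."""
    dx, dy = x - x0, y - y0
    return 0.5 * 0.6931471805599453 * fast_log2(dx * dx + dy * dy)
```

When the squared distance is an exact power of two the result is exact; elsewhere the mantissa interpolation keeps the error below about 0.086 in log2 units.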
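The diamond angle mentioned above, as a minimal sketch (the function name is mine; the mapping to [0, 4) is the standard diamond-angle formulation):

```python
def diamond_angle(y, x):
    """Diamond-angle approximation of atan2(y, x): one division and a few
    compares instead of a transcendental. Returns a value in [0, 4) that
    increases monotonically with the true angle (undefined at the origin);
    multiply by pi/2 for an approximate angle in radians."""
    if y >= 0:
        return y / (x + y) if x >= 0 else 1.0 - x / (-x + y)
    else:
        return 2.0 - y / (-x - y) if x < 0 else 3.0 + x / (x - y)
```

The result is only piecewise-linearly related to the true angle, which is why it suits uses that need ordering or binning of angles rather than exact radians.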



8:30am – 11:30am

  • meet with Var
    • need to figure out the advantage of our algorithm
  • Try to update VS15 to get the DirectX SDK

3:00 pm – 6:00 pm

  • Variance sampling TAA
  • Write paper


  • Read the push-pull paper
  • Read the European log-polar paper
  • Think about elliptical log-polar


  • https://leetcode.com/problems/integer-break/description/


Decouple shading rate & visibility rate from pixels: this allows space for anti-aliasing and coarse pixel shading.

Texel Shading (shading rate reduction):

We show performance improvements in three ways. First, we show some improvement for the “small triangle problem”. Second, we reuse shading results from previous frames. Third, we enable dynamic spatial shading rate choices, for further speedups.

Visibility:  updating visibility at the full frame rate.

Shading rate: dynamically varying the spatial shading rate by simply biasing the mipmap level choice, texel shading and temporal shading reuse

Some reasons for increased shading cost:

  • The first is the mapping from pixels to texels
  • The second source of shading increase is in the caching system.

Process: deferred decoupled shading

Rasterization records texel accesses as shading work rather than running a shader per pixel. Shading is performed by a separate compute stage, which stores the results in a texture. A final stage collects the shaded data from the texture.
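The stages above can be sketched as a toy Python model (pixel_to_texel and shade are placeholder callables standing in for the real mapping and material shader, not the paper's actual kernels):

```python
def texel_shading(pixels, pixel_to_texel, shade):
    # Pass 1 (rasterization): record which texels are needed instead of
    # shading per pixel; duplicate requests collapse into one.
    requested = {pixel_to_texel(p) for p in pixels}
    # Pass 2 (compute): shade each unique texel once, storing results
    # in a texture (modeled here as a dict).
    texture = {t: shade(t) for t in requested}
    # Pass 3 (resolve): each pixel just fetches its shaded texel.
    return {p: texture[pixel_to_texel(p)] for p in pixels}
```

The point of the decoupling shows up when many pixels map to one texel: the expensive shade runs once per texel, not once per pixel.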


Object Space Lighting:

Inspired by REYES (Renders Everything You Ever Saw)

Overall process

  1. All objects in the game are submitted for shading and rasterization, and queued for processing.
  2. During the submission step, the estimated projected area of each object is calculated; each object thus requests a certain amount of shading.
  3. During shading, the system allocates texture space for all objects that require shading. If the total request is more than the available shading space, all objects' shading rates are progressively scaled down until the request fits.
  4. Material shading occurs, processing each material layer for each object. Results are accumulated into the master shading texture(s).
  5. MIPs are calculated on the master shading texture as appropriate.
  6. Rasterization step: each object references the shading produced in the previous step. There is no requirement of a 1:1 correspondence, but that flexibility is rarely used.
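The progressive scaling described above might look like this (a toy sketch; the uniform halving step and integer truncation are my assumptions, not the engine's actual policy):

```python
def allocate_shading_space(requests, budget):
    """Uniformly scale down all objects' shading requests (in texels)
    until the total fits the available shading space."""
    scale = 1.0
    while sum(int(r * scale) for r in requests) > budget:
        scale *= 0.5  # assumed halving step; a real system may scale more gradually
    return [int(r * scale) for r in requests]
```

Because every object is scaled by the same factor, relative shading quality between objects is preserved while the total stays within budget.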



Our architecture is also the first to support pixel shading at multiple different rates, unrestricted by the tessellation or visibility sampling rates.

automatic shading reuse between triangles in tessellated primitives

  1. we decouple pixel shading from screen space
  2. it allows lazy shading and reuse simultaneously at multiple different frequencies

enables a wider use of tessellation and fine geometry, even at very limited power budgets



  • Read the three papers and decide whether they are related to FR (not directly, but there are many techniques which could be used to accelerate rendering)
    • [He, Extending, 2014]: The idea is essentially the same as that of coarse pixel shading, but this is an approach for forward shading.
    • [Yee, Spatiotemporal, 2001]: Accelerate global illumination computation for dynamic environments. Use human visual system property.
    • [Liktor, Decoupled Deferred, 2012]:
      • compact geometry buffer: stores shading samples independently from the visibility
    • [Clarberg, AMFS, 2014]: a powerful hardware architecture for pixel shading, which enables flexible control of shading rates and automatic shading reuse between triangles in tessellated primitives.
    • [Foveated Real-Time Ray Tracing for Virtual Reality Headset]
      • foveated sampling
    • [Combining Eye Tracking with Optimizations for Lens Astigmatism in modern wide-angle HMDs]
      • Foveated sampling: taking the minimum values of two sampling maps (lens astigmatism & current eye gaze) in the foveated region.
    • [Perception-driven Accelerated Rendering, 2016]
      • survey
    • [A Retina-Based Perceptually Lossless Limit and a Gaussian Foveation Scheme With Loss Control, 2014]:
      • not related to foveated rendering
    • [User, Metric, and Computational Evaluation of Foveated Rendering Methods]
    • [H. Tong and R. Fisher. Progress Report on an Eye-Slaved Area-of-Interest Visual Display. Defense Technical Information Center, 1984.]
    • [Proceedings of the 1990 Symposium on Interactive 3D Graphics]
  • Multisampling & Supersampling
    • Multisampling: take extra samples only at primitive edges (coverage is supersampled; shading runs once per pixel)
    • Supersampling: render the whole frame at a higher sample rate and downsample it
  • Forward rendering & deferred rendering
  • User, Metric, and Computational Evaluation of Foveated Rendering Methods
    • Compare 4 foveated rendering methods
      • Lower resolution for foveated view
      • Screen-Space Ambient Occlusion instead of global ambient occlusion
      • Terrain Tessellation
      • Foveated Real-time Ray-Casting
    • Provide foveated image metric
      • HDR-VDP: compare two images and get visibility and quality
      • Only spatial artifacts are considered! Temporal artifacts are not considered!
        • Find a saliency map during free viewing.

Report 08/07/2017

As we discussed last week, to make our paper better, we should:

  1. Make our algorithm better than others'.
  2. Do user study.
  3. Compare with work of others.

To make our algorithm better than others', what I did last week is:

  1. Interpolated the rendered scene so there is no impression of large pixels.
  2. Even after interpolation, there are still jaggies in the frame. To reduce them:
    1. I built a contrast map and did contrast-weighted blur, since jaggies appear where contrast is large.
    2. I applied bilateral filtering. The jaggies are not reduced, but it smooths the image very well.
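The bilateral filtering mentioned above, as a naive pure-Python sketch for grayscale values in [0, 1] (the kernel radius and both sigma values are assumptions, not the parameters I actually used):

```python
import math

def bilateral_filter(img, radius=2, sigma_s=1.5, sigma_r=0.1):
    """Smooth while preserving edges: each neighbor is weighted by both
    spatial distance (sigma_s) and intensity difference (sigma_r), so
    pixels across a strong edge contribute almost nothing."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = wsum = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        ws = math.exp(-(dx * dx + dy * dy) / (2 * sigma_s ** 2))
                        wr = math.exp(-((img[ny][nx] - img[y][x]) ** 2) / (2 * sigma_r ** 2))
                        acc += ws * wr * img[ny][nx]
                        wsum += ws * wr
            out[y][x] = acc / wsum
    return out
```

This matches the observation above: flat regions get smoothed, but a step edge survives almost untouched because the range weight collapses across it.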


To compare with others' work, I need to reimplement it.

I have already implemented Microsoft's method, without blur.


Built a summary for the

Next week:

Implement code of nvidia paper.

A similar approach: https://github.com/GameTechDev/DeferredCoarsePixelShading

Microsoft code:


Write the summary paper.



Deferred shading is a screen-space shading technique. It is called deferred because no shading is actually performed in the first pass of the vertex and pixel shaders; instead, shading is "deferred" to a second pass.

Decoupled shading: shades fragments rather than micropolygon vertices, and only shades them after precise computation of visibility.

Visibility: how densely we sample the primitives for coverage.

Shading rate: how densely we evaluate shading.
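The two deferred passes described above, as a toy Python model (the fragment tuple layout and the single-channel directional-light shading are assumptions for illustration):

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def deferred_shading(fragments, light_dirs):
    # Pass 1: rasterize into a G-buffer, keeping only the nearest
    # fragment per pixel; no lighting is evaluated here.
    gbuffer = {}
    for pixel, depth, albedo, normal in fragments:
        if pixel not in gbuffer or depth < gbuffer[pixel][0]:
            gbuffer[pixel] = (depth, albedo, normal)
    # Pass 2: shading is "deferred" to one screen-space pass, so each
    # visible pixel is lit exactly once regardless of overdraw.
    return {pixel: albedo * sum(max(0.0, dot(n, l)) for l in light_dirs)
            for pixel, (depth, albedo, n) in gbuffer.items()}
```

The benefit is that lighting cost scales with visible pixels and lights, not with how many overlapping fragments the rasterizer produced.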