**Real-Time Interpolation:**

A Powerful and Unique Tool

** – Overview and Upcoming Improvements – **

A Powerful and Unique Tool

**Introduction:**

What is Interpolation?

For Safecast’s data, it creates a visualization with smooth, continuous coverage that can allow for quickly and intuitively seeing overall trends, as well as approximate predicted values for locations without measurements.

Common Pitfalls of Interpolation

One of Safecast’s primary goals is to present data as wholly and accurately as possible. Thus, both actual measurement point and interpolation layers are provided.

With only an interpolation layer alone, it is often unclear what actual data was measured and what is being predicted. This can lead to errors in assumption of safety or risk.

Why Performance Matters

*week*of compute time to generate. (For the webmap, it takes several hours.)

Nothing was found to produce interpolations quickly enough, or in a format the app could digest without significant conversion overhead. Even so, interpolations would be a significantly larger download than the points.

Something different was needed. As the app had all the data necessary to perform the interpolation, it kept coming down to performance. But no known implementation was fast enough to complete on a mobile device in a reasonable timeframe.

Was it possible to go a step beyond existing implementations, and make the performance fast enough, even real-time? Fortunately, it was.

And that fast, real-time interpolation solved many problems for the frequently updated Safecast dataset.

The performance also gave the user more power and control, turning a process taking at least several hours into something interactive.

The Present and the Future

**Contents:**

- Image: Measurements vs. Interpolation
- Image: Enabling IDW in the Safecast app for OS X
- Introduction
- What is Interpolation?
- Common Pitfalls of Interpolation
- Why Performance Matters
- The Present and the Future

- Contents
- Current Interpolation Methodology
- First Real-Time Spatial Interpolation
- Hybrid Interpolation: IDW and Lanczos
- Neighborhood Selection: Not Doing the Same Work Twice

- New Improvements: Next Version
- Register Optimizations
- Single-Pass Computation
- Tile Decomposition
- Database Connection Pool
- Common Code and Readability / Maintenance
- Dynamic Search Neighborhood Improvements
- Export to PNG Performance Improvements

- Future Improvements: Beyond the Next Version
- Conclusion
- Images: Performance vs. Quality
- Source Code: Core Numerics
- Single Pass Source
- Two-Pass Source

Current Interpolation Methodology:

Current Interpolation Methodology:

First Real-Time Spatial Interpolation Implementation

Real-time performance is a tradeoff of resolution and artifacts for rendering performance. While implemented in hand-optimized SIMD intrinsics and multithreaded, the primary factor enabling such high performance is controlling the problem set size and keeping it small enough to fit into fast L1 or L2 cache.

To that end, the default settings for both platforms involve input sample decimation and hybrid interpolation with Lanczos resampling.

Hybrid Interpolation: IDW and Lanczos

*(Why can’t only Lanczos be used? Because the control points must be arranged on a regular grid.)*

If the hybrid interpolation mode is used, IDW is first used to predict control points arranged in a grid, and Lanczos then interpolates between them. For the default settings in OS X, there is little visual difference from pure IDW interpolation without decimation, yet it renders approximately 60x faster.

*The overall goal is to interpolate the missing points in a 256×256 pixel tile. At the most basic level of the hybrid interpolation, IDW is used to predict raster cells (pixels) in every other row and column if they are not already present, which is 25% of the cells. (16384 of 65536 total) These predicted cells are essentially a 128×128 pixel tile. Lanczos resampling (@2x) is then performed on that to yield the final 256×256 pixel tile, with some added complications for avoiding edge artifacts.*

Neighborhood Selection: Not Doing the Same Work Twice

One of the largest performance issues in many IDW implementations is the search radius used in initial sample selection. For naive implementations, this involves calculating the distance to each point, which ironically is nearly the same computational effort as including that point in the calculation, thus doubling the work.

More performant implementations use a binary spatial index. Our implementation falls into this category, leveraging the bitmap indexing already present with a dynamic search radius on a blocked (tile) basis. This allows for <1 ms neighborhood selection, versus up to several hours on a single multicore CPU system for an unindexed naive implementation with a large dataset.

Algorithms and Numerics

Finally, in our implementation the core computation code is a two-pass algorithm, which is multithreaded and SIMD vectorized.

In the first pass, the inverse distance for each point in the neighborhood is calculated, and the sum of these is accumulated. The second pass computes the dot product of the multiplicative inverse of this sum with the point values (μSv/h).

Conversely, no other implementation we are aware of is SIMD vectorized, and publicly available source code uses a number of slow operations in each inner loop calculation (sqrt(), pow(), fp divide). Many implementations are not multithreaded.

New Improvements: Next Version:

New Improvements: Next Version:

Register Optimization: +10% to +20%

In compiler and code evaluation, it was found that not all the CPU’s SIMD registers were in use, resulting in needless transfer of data from RAM to the registers, leading to stalling during the computation. A number of implementations were tested, and it was found that predicting two values which shared the same neighborhood had a 10% – 20% performance advantage.

Single-Pass Computation: +200% to +1000%

It was also found found that the computation itself could be reduced to a single step, by accumulating the dot product of the inverted distance and the point value (μSv/h), performing a single final scalar calculation of this newly accumulated dot product sum divided by the currently calculated sum of the inverted distances. This was extremely significant for performance, resulting in a 200% – 1000% improvement depending on problem size on both ARM and Intel architectures. Again, the primary mechanism for this is minimizing memory transfers to and from RAM.

Tile Decomposition

Data is stored by the app in raster tiles, but must be decomposed to a series of x/y/z points to perform the interpolation. Currently, this is done by using Apple’s vDSP_vcmprs function, part of the excellent Accelerate framework. A hand-optimized version of this function was found to be approximately 200% faster on both ARM and Intel architectures by again minimizing memory transfers, significantly reducing stalling.

Database Connection Pool

Due to a fixed database connection pool size in the current app, the clipping features compile a SQL statement in each select which adds some stalling per tile being predicted, instead of reusing a connection and compiled statement from the pool. In the upcoming version of the app, the entire database provider application layer was rewritten, and now supports a dynamic connection pool size.

Common Code and Readability / Maintenance

To avoid code duplication and prepare for the open source release of the Safecast app, special attention was given to testing that even performance-critical SIMD numerics functions could be safely written using inlined function calls for common code, which is important as each IDW power p=x has its own function to avoid inner-loop branching which would be very detrimental to performance. These changes allow updates to made in a single location in code and makes it easier to understand, document, and maintain.

Dynamic Search Neighborhood Improvements

Currently, the dynamic search neighborhood used to find other points outside of the tile being rendered is only implemented during the rendering process itself. This was found to conflict with the spatial indexing present before that code was executed, leading to blank tiles which were not interpolated at high zoom levels in locations with no nearby points, such that the static neighborhood in the spatial index layer made a “no go” decision. This has been corrected in the next version, and now the spatial index uses the same dynamic neighborhood selection as the rendering code.

Export to PNG Performance Improvements

When exporting an interpolated layer using clipping features as PNG tiles from the OS X app, the current release treats this in the same manner as an interpolation without clipping features — it must brute-force test each tile in the extent of the dataset being interpolated to determine if rendering should be attempted, as tiles not in the original dataset will likely be rendered. For higher zoom levels with the Safecast dataset which has an almost global extent this causes significant pre-processing delays. This has been optimized such that instead, the more detailed spatial index for the clipping feature will be used instead of brute-force testing within the extent. In testing on a development machine, this exported an interpolated PNG tileset for zoom levels 0-13 for the Safecast dataset in 10 minutes. Full IDW interpolation without decimation and a larger minimum search neighborhood took approximately 4 hours to export for zoom levels 0-15, a preview of which is temporarily available on the Safecast tilemap.

##

Future Improvements: Beyond the Next Version:

Future Improvements: Beyond the Next Version:

At the level of raw performance being used in these functions, CPU cache becomes a very important factor. Significant performance breakpoints were measured in testing depending upon when the problem size fell outside of L1, L2 and L3 cache.

So, if a problem set size falls just outside of a cache boundary and thus the performance drops, is it possible to trade some extra computations for making the same problem set fit into cache? The answer was yes.

Currently, point data is held in memory as 32-bit floating point values. But it is also possible to represent some, or all, of this data as either 16-bit integer or 16-bit floating point values, with the tradeoff being a small upfront data conversion overhead, and a more significant conversion overhead of a few ops in the core numerics inner loop code. For small problem sizes (eg, L1 -> L2 cache spillover), this is not overall an advantage, as L2 cache is fast enough that extra conversion ops just slow things down.

However, for much larger problem sets this provided an approximately 200% performance advantage in preliminary testing; it was found that the narrower 16-bit data types could allow for a problem that fell out of cache entirely to fit into L3 cache, or a problem that fell out of L2 cache to fit in it. (no L1->L2 cache advantage was found) Fortunately, the problem set size is known a priori.

The challenge in exploiting this is that CPU cache sizes vary, and the available instructions to convert between data types using SIMD intrinsics vary tremendously by architecture; for example, only newer Intel architectures can convert 16-bit floating point to 32-bit floating point in a single op.

Thus, the key to using these improvements in the future will be architecture-specific code paths and evaluation of which method will be most performant for the given architecture and problem set size.

Conclusion:

Conclusion:

While often overlooked, high performance critically allows for new possibilities and ways of interacting with data, and continued development and improvements will expand upon this, allowing for improvements in precision and resolution while improving battery life on mobile devices.

Finally, as mobile devices continue to improve in performance, the capabilities of tools such as this which already offer equivalent or superior performance to a static server-rendered map will share the same potential for growth. And new possibilities will emerge. But only if the performance can be controlled. As seen with the OS X client compared to other GIS applications, all of this can only be realized by creating highly optimized algorithms with robust scaling. Without controlling performance and scaling, the end result is something slower than a static map, or a monopolization of resources by a single inefficient link in the data visualization chain.

Performance vs. Quality:

Performance vs. Quality:

In our implementation, render quality and precision can be selectively degraded for performance while retaining sub-millisecond neighborhood selection.

The final image shows the control: full, classic IDW interpolation, with no decimation of input samples.

Other images use our hybrid IDW/Lanczos interpolation, with decimation enabled, showing a progression of improved resolution targets.

The performance difference is approximately 600x, 600x, 600x, and 60x versus full IDW rendering, respectively.

Note the default OS X quality closely approaches the quality of the full IDW render while being almost two orders of magnitude faster. Variations in Lanczos and IDW interpolated surface slopes and the use of sample decimation account for the differences.

Source Code: Core Numerics:

Source Code: Core Numerics:

Caveat: For readability, these are simplified versions and are heavily commented. They use Apple’s simd.h functions for clarity instead of the architecture-specific intrinsics the actual implementation does (but are equivalent on a per-operation basis), and do not optimize register use as outlined in the “new improvements” section above. If you’d like full in-development sources, email nick@safecast.org.

Single Pass Source:

Single Pass Source:

// ======================== // pTest_ssIDW_p2_AppleSIMD // ======================== // // Performs IDW spatial interpolation in single pass. // // Replaces 2nd step which first normalizes the inverse distances, and then // computes the dot product with the z-values. // // Instead, this calculates the dot product within the first step entirely. // // // Once all blocks are done executing, the predicted cell value at (x,y) is // equal to SUM(zdSum)/SUM(wdSum). // // x, y: The coordinates of the point to be predicted. // xs, ys: The coordinates of all known points in the search neighborhood. // zs: The values at the corresponding coordinates in "xs" and "ys". // wdSum: Output parameter. The sum of the weighted, inverse distances. // zdSum: Output parameter. The sum of each weighted distance multiplied // by the corresponding element in "zs". // n: The number of elements in the vectors "xs", "ys", and "zs". // // // -- Note: This is a sample version intended for readability, and requires -- // -- Apple's simd.h included in the iOS 8 / OS X 10.10 frameworks. -- // -- It is equivalent in terms of ops except for the reciprocal -- // -- estimate step, which compiles oddly (slowly). -- // -- -- // -- Please contact nick@safecast.org for the actual source, or see -- // -- the Github repository for the Safecast app. (coming soon!) -- // void pTest_ssIDW_p2_AppleSIMD(const float x, const float y, const float* xs, const float* ys, const float* zs, float* wdSum, float* zdSum, const size_t n) { const vector_float4 sx = x; const vector_float4 sy = y; const vector_float4* vsrcx = (const vector_float4*)xs; const vector_float4* vsrcy = (const vector_float4*)ys; const vector_float4* vsrcz = (const vector_float4*)zs; vector_float4 vs = 0.0F; vector_float4 vi = 0.0F; vector_float4 vx; vector_float4 vy; // nb: vector_float4 is an alias for a short 128-bit SIMD vector type which // contains 4x 32-bit floating point values. This is either float32x4_t // or __m128 depending upon ARM or Intel platforms, respectively. // // SIMD allows the same instruction to be executed on multiple values // simultaneously, significantly improving computational speed. for (size_t i = 0; i < n; i += 4) { vx = *vsrcx - sx; vy = *vsrcy - sy; // nb: The distance is calculated using Pythagoreas' Theorem; however, // because the result of that is being raised to a power of two, // the square and square root cancel each other out, and can // be omitted with a significant computational savings. // The square root only needs to be calculated for { 0.0 <= p < 2.0 } vx = vx * vx // ^2, idw p=2 + vy * vy; // nb: this is significantly slower than vrecpsq_f32 or _mm_rcp_ps! vy = vector_fast_recip(vx); vs += vy; vi += vy * *vsrcz; vsrcx++; vsrcy++; vsrcz++; }//for // nb: The sums have been accumulated into two short vectors with four // values in each. These must be first combined into two sums, and then // finally a single sum for each. (not done in the loop for performance) // eg, vs = [ 1, 2, 3, 4 ] // and vi = [ 5, 6, 7, 8 ] // will become: vs = [ 3, 7 ] and vi = [ 11, 15 ] here. vector_float2 s2 = (vs.lo + vs.hi); vector_float2 i2 = (vi.lo + vi.hi); // nb: Finally, calculate the output scalar values. // Continuing the example above: // "wdSum" = 3 + 7 = 10 // "zdSum" = 11 + 15 = 26 // // Why isn't the zdSum / wdSum division done here? // These *must* be returned individually. This function may not be the // only block executing, due to either multithreading or a non-divisible- // by-four problem set size. Mathematically the addition for all blocks // into a single wdSum and zdSum must be done before the division. *wdSum = s2[0] + s2[1]; *zdSum = i2[0] + i2[1]; }//pTest_ssIDW_p2_AppleSIMD

Two-Pass Source:

Two-Pass Source:

// ====================== // pTest_IDW_p2_AppleSIMD // ====================== // // Performs the 1st pass in a two-pass IDW interpolation. // // x, y: The coordinates of the point to be predicted. // xs, ys: The coordinates of all known points in the neighborhood. // zs: The values at the corresponding coordinates in xs and ys. // dest: Ouput vector that the weighted inverse distances for "xs" and "ys" // relative to the point (x,y) will be written to. // n: The number of elements in the vectors "xs", "ys", and "zs". // // Returns the sum of all elements that were written to vector "dest". // float pTest_IDW_p2_AppleSIMD(const float x, const float y, const float* xs, const float* ys, float* dest, const size_t n) { const vector_float4 sx = x; const vector_float4 sy = y; const vector_float4* vsrcx = (const vector_float4*)xs; const vector_float4* vsrcy = (const vector_float4*)ys; vector_float4* vd = (vector_float4*)dest; vector_float4 vs = 0.0F; vector_float4 vx; vector_float4 vy; for (size_t i = 0; i < n; i += 4) { vx = *vsrcx - sx; vy = *vsrcy - sy; vx = vx * vx // ^2 + vy * vy; vy = vector_fast_recip(vx); vs = vs + vy; *vd = vy; vsrcx++; vsrcy++; vd++; }//for vector_float2 v2 = (vs.lo + vs.hi); return v2[0] + v2[1]; }//pTest_IDW_p2_AppleSIMD // ========================== // pTest_IDW_SumWDs_AppleSIMD // ========================== // // Performs the 2nd pass in a two-pass IDW interpolation. // // idwBuffer: The output vector "dest" from the first pass; contains the // inverted, weighted distances for each original point in xs // and ys (which are not used here). // idwBufferSum: The return value from the first pass; this is the sum of // all elements in "idwBuffer". // zs: The values at the corresponding weighted distances in // "idwBuffer". // n: The number of elements in the vectors "zs" and "idwBuffer". // // Returns: the predicted cell value for the original x,y coordinate used in // the first pass. // float pTest_IDW_SumWDs_AppleSIMD(const float* zs, const float* idwBuffer, const float idwBufferSum, const size_t n) { const vector_float4 si = 1.0F / idwBufferSum; vector_float4 vi = 0.0F; const vector_float4* vsrcz = (const vector_float4*)zs; const vector_float4* vsrcd = (const vector_float4*)idwBuffer; for (size_t i = 0; i < n; i += 4) { vi += (*vsrcd * si) // normalize distance * *vsrcz; // multiply by z-value vsrcd++; vsrcz++; }//for vector_float2 i2 = vi.hi + vi.lo; return i2[0] + i2[1]; }//pTest_IDW_SumWDs_AppleSIMD