Numerical Methods
Lagrangian Advection
In the numerical computation of the FTLE, we first compute the flow map \(\varphi_{t_n}^{t_{n+1}}(x_n)\), which maps the initial seed point \(x_n\) at time \(t_n\) to time \(t_{n+1}\).
To obtain this map, one must numerically integrate the underlying dynamical system, which is described by the ODE:
where \(\sigma = \pm1\) selects forward or backward integration. During the time integration process, the algorithm frequently queries the flow velocity vector \(\mathbf{u}(\mathbf{x},t)\) at specific locations and moments with very high precision requirements. However, since the data grid is inherently spatially discretized, high-order interpolation methods are required to keep numerical stability and obtain a physically meaningful flow map. Furthermore, when querying the velocity, special wall treatment must be applied at the boundaries to avoid value discontinuities and to represent certain real physical conditions.
Time Integration
Explicit Euler Method
The first-order explicit Euler scheme Euler advances the position by sampling the velocity at the beginning of the time step:
This method incurs a global error of order \(O(\Delta t)\) and requires only one velocity evaluation per step, therefore has high computational speed.
Runge-Kutta Method
Proposed by Carl Runge and Martin Kutta around 1900, Runge-Kutta methods constitute a widely used family of algorithms for the numerical integration of ODEs.
In an explicit \(s\)-stage Runge-Kutta scheme for this initial-value problem, the solution is advanced over a time step \(\Delta t\) as follows. First, compute the intermediate stage vectors:
and then update the solution:
Here, the boldface stage variables \(\mathbf{k}_i\) represent intermediate slope estimates.
Second-Order Runge-Kutta (RK2, Heun’s)
Heun’s RK2 method attains second-order accuracy by combining predictor and corrector slopes:
This scheme yields a global error of order \(O(\Delta t^2)\) with two velocity evaluations per step.
Classical Fourth-Order Runge-Kutta (RK4)
The classical RK4 method achieves fourth-order accuracy via four slope evaluations at intermediate points:
This yields a global error of order \(O(\Delta t^4)\) with four velocity evaluations per step.
Sixth-Order Runge-Kutta (RK6)
The seven-stage scheme ERK6(7) uses non-uniform weights to attain global \(O(\Delta t^6)\) accuracy, originating from [Butcher1964].
As for the coefficients for RK6 are more complex to write into equations, the Butcher table is given as follows.
\(c_i\) |
\(a_{i1}\) |
\(a_{i2}\) |
\(a_{i3}\) |
\(a_{i4}\) |
\(a_{i5}\) |
\(a_{i6}\) |
\(a_{i7}\) |
|---|---|---|---|---|---|---|---|
\(0\) |
|||||||
\((5\mp\sqrt{5})/10\) |
\((5\mp\sqrt{5})/10\) |
||||||
\((5\pm\sqrt{5})/10\) |
\(\mp\sqrt{5}/10\) |
\((5\pm2\sqrt{5})/10\) |
|||||
\((5\mp\sqrt{5})/10\) |
\((-15\pm7\sqrt{5})/20\) |
\((-1\pm\sqrt{5})/4\) |
\((15\mp7\sqrt{5})/10\) |
||||
\((5\pm\sqrt{5})/10\) |
\((5\mp\sqrt{5})/60\) |
\(0\) |
\(1/6\) |
\((15\pm7\sqrt{5})/60\) |
|||
\((5\mp\sqrt{5})/10\) |
\((5\pm\sqrt{5})/60\) |
\(0\) |
\((9\mp5\sqrt{5})/12\) |
\(1/6\) |
\((-5\pm3\sqrt{5})/10\) |
||
\(1\) |
\(1/6\) |
\(0\) |
\((-55\pm25\sqrt{5})/12\) |
\((-25\mp7\sqrt{5})/12\) |
\(5\mp2\sqrt{5}\) |
\((5\pm\sqrt{5})/2\) |
|
\(b_i\) |
\(1/12\) |
\(0\) |
\(0\) |
\(0\) |
\(5/12\) |
\(5/12\) |
\(1/12\) |
In our computation, the up symbol side is applied, in other words, ± represents +, taking \(\lambda=+\sqrt{5}\). With 15 digis are kept, the explicit Butcher table for RK6 used by the author is shown in the following table.
\(c_i\) |
\(a_{i1}\) |
\(a_{i2}\) |
\(a_{i3}\) |
\(a_{i4}\) |
\(a_{i5}\) |
\(a_{i6}\) |
\(a_{i7}\) |
|---|---|---|---|---|---|---|---|
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0.276393202250021 |
0.276393202250021 |
0 |
0 |
0 |
0 |
0 |
0 |
0.723606797749979 |
-0.223606797749979 |
0.947213595499958 |
0 |
0 |
0 |
0 |
0 |
0.276393202250021 |
0.0326237921249264 |
0.309016994374947 |
-0.0652475842498529 |
0 |
0 |
0 |
0 |
0.723606797749979 |
0.0460655337083368 |
0 |
0.166666666666667 |
0.510874597374975 |
0 |
0 |
0 |
0.276393202250021 |
0.12060113295833 |
0 |
-0.181694990624912 |
0.166666666666667 |
0.170820393249937 |
0 |
0 |
1 |
0.166666666666667 |
0 |
0.0751416197912285 |
-3.38770632020821 |
0.52786404500042 |
3.61803398874989 |
0 |
\(\mathbf{b_i}\) |
0.0833333333333333 |
0 |
0 |
0 |
0.416666666666667 |
0.416666666666667 |
0.0833333333333333 |
Velocity Interpolation
Trilinear
The trilinear interpolation method is the fastest among all the methods given in Streamcenter+, which is a low-order implement.
The continuous velocity field is reconstructed by trilinear interpolation of the component maps u, v, w that live on the eight vertices of a Cartesian cell:
Tricubic Catmull-Rom
The Tricubic Catmull-Rom interpolation tricubic used here is a separable three-dimensional cubic spline based on the one-dimensional Catmul-Rom spline (parameter a=-0.5) for velocity fields, which ensures \(C^{1}\) continuity.
The process is given as follows.
The one-dimensional Catmull-Rom interpolation reconstructs a \(C^{1}\)-continuous approximation of velocity at an arbitrary location \(x = i + t\), where \(t \in [0,1)\) and \(i = \lfloor x\rfloor\) on a uniform grid with \(\Delta x = 1\). A four-point stencil is used:
Define coefficients for \(t \in [0,1)\):
The one-dimensional interpolant is:
For three-dimensional interpolation, let the target velocity location be
At each \(z = k + \Delta\) (where \(\Delta \in \{-1,0,1,2\}\)), perform bicubic interpolation (first in x, then in y). For each \(y = j + \ell\) (where \(\ell \in \{-1,0,1,2\}\)), compute:
Then combine along y:
Finally, interpolate along the third direction z:
Here \(\mathrm{CR}_{1}(⋯; t)\) denotes the one-dimensional Catmull-Rom spline defined above.
Tricubic by F. Lekien I.P.
Marked as tricubicFL, this variation of tricubic interpolator is still under development by the author.
Hermite
The hermite mode in the current native solver is a separable three-dimensional cubic Hermite
interpolator.
It reconstructs the interval between the two central samples by using the endpoint values
\(p_1\), \(p_2\) together with central-difference slopes inferred from the four-point
stencil \(\{p_0,p_1,p_2,p_3\}\):
For \(t \in [0,1)\), the one-dimensional Hermite interpolant is
with basis functions
The three-dimensional implementation applies this reconstruction successively along x, y,
and z on a \(4\times4\times4\) stencil, in the same separable manner as the Catmull-Rom
tricubic mode.
Because the Hermite kernel reuses the same scalar sampling routine as the other interpolators,
all wall treatments described in Wall Treatment are applied consistently at every stencil access.
Compared with trilinear, hermite is substantially more expensive due to its wider stencil,
but it generally produces smoother velocity traces during advection.
WENO
The weighted essentially non-oscillatory WENO used here is a fifth-order WENO reconstruction (WENO-5). It is suggested to be used in research with intermittent capture need, e.g., high-speed flows and shock capture.
It shows relatively poor performance in general cases, and comsuming more wall time.
The method originates from [Jiang1996] and expanded to three-dimensional computation, and [Shu2009] gave a review on the WEMO method.
The process is given as follows.
The WENO-5 method reconstructs a non-oscillatory, fifth-order-accurate approximation of a function value at an arbitrary location \(x = x_{i+1/2} + t\,\Delta x\), where \(t \in [0,1)\) and \(x_{i+1/2} = x_i + \tfrac{1}{2}\,\Delta x\) on a uniform grid with \(\Delta x = 1\).
A five-point stencil f_{i-2}, f_{i-1}, f_i, f_{i+1}, f_{i+2} is used.
Define three overlapping three-point stencils:
On each stencil \(S_{\ell}\) (\(\ell = 0,1,2\)), construct a quadratic polynomial
that interpolates the three values in that stencil at \(x = x_{i+1/2} + t\,\Delta x\).
The coefficients are chosen so that each \(p_{\ell}(t)\) matches \(f\) at the three stencil points.
For \(S_{0} = \{f_{i-2}, f_{i-1}, f_{i}\}\):
For \(S_{1} = \{f_{i-1}, f_{i}, f_{i+1}\}\):
For \(S_{2} = \{f_{i}, f_{i+1}, f_{i+2}\}\):
Once \(p_{0}(t)\), \(p_{1}(t)\), and \(p_{2}(t)\) are defined, compute the Jiang–Shu smoothness indicators \(\beta_{\ell}\) for each stencil:
Fixed linear weights are \((d_{0}, d_{1}, d_{2}) = (0.1,\,0.6,\,0.3)\). Introduce \(\varepsilon = 10^{-6}\) and define unnormalized weights:
Normalize to obtain nonlinear weights \(\omega_{\ell}\):
Finally, reconstruct at \(x = x_{i+1/2} + t\,\Delta x\) by combining:
Wall Treatment
The wall-treatment option is applied at the scalar-sampling level before any interpolation stencil
is assembled.
Therefore the same boundary behavior is honored by trilinear, hermite, tricubic, and
WENO alike.
For each axis with valid index range \(0,\dots,N-1\), the current native solver supports the following rules:
Clamp
Out-of-range indices are projected to the nearest valid boundary index:
This is useful when the outermost grid values should be extended as constant boundary samples.
Dirichlet
Whenever any stencil index lies outside the valid domain, the queried scalar value is taken as zero:
This matches a homogeneous Dirichlet boundary condition for the sampled velocity component.
Reflect
Indices are mirrored back into the domain until they become valid. The implementation uses
and repeats this operation if necessary. This corresponds to even reflection at the wall.
Periodic
Indices are wrapped modulo the grid size:
For periodic runs, the same wrap-around logic is also reused in the FTLE assembly kernel when the flow-map gradients are differentiated.
FTLE Computation
After particle advection, the solver stores the final particle locations \(\mathbf{X}=\varphi_{t_0}^{t_0+T}(\mathbf{x}_0)\) on the seed lattice. The FTLE field is then assembled from the deformation gradient \(\mathbf{F}=\partial \mathbf{X}/\partial \mathbf{x}_0\) and the right Cauchy-Green tensor
With \(\lambda_{\max}\) denoting the largest eigenvalue of \(\mathbf{C}\), the solver evaluates
In the native CUDA mainline this stage is executed by a dedicated FTLE kernel after the forward
and optional backward branches finish advection.
The resulting scalar arrays are written as FTLE_forward and, if requested, FTLE_backward
in the output .vts files.
Gradient Discretization
In the current native CUDA solver, the deformation gradient is reconstructed directly from the final-position lattice by second-order centered differences. Let \(\mathbf{X}=(X,Y,Z)^{\mathsf{T}}\) be the advected particle position stored at each seed point. For interior nodes in a fully three-dimensional case,
and the same centered form is applied to \(Y\) and \(Z\). These nine derivatives form the Jacobian matrix \(\mathbf{F}\).
When wall_treatment = periodic, the centered-difference stencil is extended to every lattice
point by wrapped neighbors,
with analogous formulas in y and z.
For non-periodic runs, the current mainline kernel differentiates only interior points. Boundary FTLE values are presently written as zero, which avoids one-sided stencils inside the GPU kernel.
For quasi-two-dimensional inputs with \(n_z \le 2\), only the x and y derivatives are
assembled and the remaining z-coupling terms are set to zero in the subsequent
Cauchy-Green tensor.
Although the parameter interface still accepts labels such as O2, O4, O6, and
Spectral for compatibility with legacy workflows, the current native CUDA FTLE kernel
implements the centered O2 discretization described above.
Eigenvalue Solver
Once the symmetric Cauchy-Green tensor
is assembled, the native solver evaluates only its largest eigenvalue, since that is the only quantity required by the FTLE definition.
The current CUDA kernel uses a closed-form cubic solver, exposed in code as EigMaxSym3.
Define
If \(p_1=0\), the matrix is diagonal and the eigenvalues are simply \(c_{00}\), \(c_{11}\), and \(c_{22}\). Otherwise, set
and define the normalized matrix \(\mathbf{B}=(\mathbf{C}-q\mathbf{I})/p\). Its scaled determinant
is clamped to \([-1,1]\), after which
The solver then takes
This algebraic route avoids iterative sweeps inside the FTLE kernel and is a natural fit for the
symmetric positive-semidefinite Cauchy-Green tensor.
The parameter interface still recognizes Eigen_method = jacobi for compatibility, but the
current native CUDA FTLE kernel always dispatches the closed-form path above.
Windowing for Dynamic LCS
Dynamic mode applies a sliding temporal window over the input frame range. For a global range \([T_s,T_e]\), window size \(\Delta T_w\), and window step \(\Delta T_s\), the native solver constructs
for
so long as the last admissible start time satisfies \(T_e-\Delta T_w \ge T_s\).
Each window is loaded and solved independently. Within a window, the current mainline solver chooses an anchor frame near the midpoint of the local frame index range. If the window is sufficiently long, this anchor is clamped to the interior so that both forward and backward FTLE can be formed without sitting directly on the edge.
The advection branches are then split as follows:
Accordingly, the physical integration intervals used in the FTLE normalization are
in local frame coordinates.
Every valid window is saved to a numbered file such as FTLE3D_NbCU_00001.vts.
After all windows finish, the solver writes a .pvd collection whose reference time is taken
from the anchored frame reconstructed in the original global time range.
If a window is too short to provide an interior anchor, it is skipped.
Computational Density and Comparison
The native solver combines one advection scheme with one interpolation scheme, so the dominant cost on the advection side is well approximated by
where \(N_{\mathrm{vel}}\) is the number of velocity evaluations per integration step and \(N_{\mathrm{stencil}}\) is the scalar stencil width used by the chosen interpolator.
For the current mainline code path, the time-integration stage counts are:
Method |
Velocity evaluations per step |
|---|---|
|
1 |
|
2 |
|
4 |
|
6 |
The interpolation kernels use the following one-component stencil widths:
Method |
3D stencil per component |
Remarks |
|---|---|---|
|
\(2\times2\times2\) |
Lowest cost, piecewise linear |
|
\(4\times4\times4\) |
Smooth cubic Hermite reconstruction |
|
\(4\times4\times4\) |
Catmull-Rom style cubic spline |
|
\(5\times5\times5\) |
Highest stencil density, nonlinear |
Hence, a coarse code-level cost ranking for advection is
with the exact wall time also depending on particle count, GPU occupancy, memory bandwidth, whether backward FTLE is enabled, and whether dynamic windows are used.
In practice, RK4 with tricubic or hermite is a balanced choice for general research
workflows, while WENO is more appropriate when suppressing oscillatory reconstruction is more
important than raw speed.
General Tips
As for your reference, and configured as defaults, the Berkeley LCS Tutorials used RK4 for advection.
The velocity fields were interpolated with tricubic-FL by them, originating from [Lekien2005], which has higher performance by solving a 64×64 linear system using the function values, gradients, and mixed partial derivatives at its eight corners, which is in future development plan for Streamcenter+ with high priority.
Although not detailed, grad_order=2 was employed by them from the equation, supposing the mesh is sufficiently refined.
Please always notice that, although providing much better numerical precision and looks cool in papers, high-order methods could be resource-consuming, even several hundred times.