Condition-Aware Fourier Neural Operators

A data-driven approach to spectral mode selection in Fourier Neural Operators using condition number analysis and ridge regression.

PUBLISHED: Jan 26, 2026

This project was originally for my Math 221 class, Advanced Matrix Computations. Afterwards, I expanded on this work and presented at Yale’s YURC 2026 conference. To read a full version of my report, see here.

A brief background on FNOs

Fourier Neural Operators (FNOs) are neural operators that map input functions to solution functions of PDEs. Their core building block replaces Galerkin/finite-difference matrices with a trainable spectral convolution computed via FFTs.

Most translation-invariant PDE solution operators can be represented as integral operators of the form

(\mathcal{K} u)(x) = \int_{\Omega} K(x-y)u(y) \, dy

where the kernel K(x-y) encodes Green's-function-like behavior. By the convolution theorem, this operator is diagonalized in the frequency domain:

\widehat{\mathcal{K} u}(k) = \hat{K}(k) \hat{u}(k)

The Fourier modes k evolve independently through multiplication by the complex gain \hat{K}(k). Knowing this, we can construct the Fourier Neural Operator layer

v_{\ell+1}(x) = \sigma \left( \mathcal{F}^{-1} ( M_\ell(k) \mathcal{F}[v_\ell](k) ) + W_\ell v_\ell(x) + b_\ell \right)

where:

- \sigma is a pointwise nonlinearity,
- M_\ell(k) are learned complex spectral multipliers applied per retained mode,
- W_\ell is a local (pointwise) linear map and b_\ell is a bias.

Each FNO layer thus applies a global Fourier convolution, followed by a local linear map and a pointwise nonlinearity.
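To make the layer definition concrete, here is a minimal PyTorch sketch of a single 1D FNO layer with hard truncation to `n_modes` frequencies; the class name, sizes, and initialization are illustrative assumptions, not the reference implementation:

```python
import torch
import torch.nn as nn

class FNOLayer1d(nn.Module):
    """One FNO layer: spectral convolution + local linear path + nonlinearity."""
    def __init__(self, width: int, n_modes: int):
        super().__init__()
        self.n_modes = n_modes
        # Learned complex multipliers M_l(k), one (width x width) matrix per kept mode.
        scale = 1.0 / width
        self.weights = nn.Parameter(
            scale * torch.randn(width, width, n_modes, dtype=torch.cfloat))
        self.w = nn.Conv1d(width, width, 1)  # local linear map W_l (plus bias b_l)

    def forward(self, v):                    # v: (batch, width, n_grid)
        v_hat = torch.fft.rfft(v)            # to the frequency domain
        out_hat = torch.zeros_like(v_hat)
        # Hard truncation: only the first n_modes frequencies are mixed.
        out_hat[:, :, :self.n_modes] = torch.einsum(
            "bik,iok->bok", v_hat[:, :, :self.n_modes], self.weights)
        spectral = torch.fft.irfft(out_hat, n=v.size(-1))
        return torch.relu(spectral + self.w(v))

layer = FNOLayer1d(width=8, n_modes=12)
v = torch.randn(4, 8, 64)
out = layer(v)
```

The `einsum` mixes channels independently per retained mode, which is exactly the role of M_\ell(k) in the layer formula above.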

Application Significance

FNOs learn solution operators for families of PDEs, replacing repeated numerical solves with fast inference. In practice, FNOs are used in climate/ocean modeling, flow forecasting, and similar applications. By cutting simulation time dramatically, they also make inner-loop tasks such as Bayesian calibration and uncertainty-aware exploration feasible in realistic wall-clock time.

Conventional solvers require the explicit form of the PDE and trade off speed against accuracy and resolution, whereas FNOs learn from data and are both resolution- and mesh-invariant.

It is apparent why we prefer FNOs to conventional PDE solvers; in traditional methods such as finite element and finite difference methods, we discretize space into fine meshes, which often makes the solvers slow and dependent on discretization. FNOs circumvent this problem entirely — once trained, they work on any mesh and learn the continuous function instead of discretized vectors.

Evaluating PDE Solvers

Classical PDE solvers. ML-based solvers rarely beat mature HPC solvers on both accuracy and stability. GPU finite-volume codes, AMR systems, and dedicated tools such as Firedrake, Dedalus, and JAX-FEM are extremely fast.

Neural PDE solvers. Physics-Informed Transformers (PIT) and Operator Transformers handle arbitrary meshes, long-range interactions, and shock-like structures better than FNOs. These are the dominant ML-based PDE-solving approaches today.

Graph Neural Networks. MeshGraphNets, GNN-PDE, Graph Operator Networks directly operate on meshes and are great for fluid simulations, elasticity, and FEM discretizations.

While the best general ML PDE solvers are now transformer and GNN-based, Fourier Neural Operators are still advantageous for atmospheric surrogates (FourCastNet), real-time inference, climate simulations, and medium-range forecasting.

PDE Examples and FNO Solutions

1. Elliptic PDEs (e.g., Darcy flow)

-\nabla \cdot (a(x) \nabla u(x)) = f(x), \quad x \in \Omega, \quad u|_{\partial \Omega} = 0

The diffusion coefficient a(x) is the input field. The solution operator is \mathcal{G}(a)(x) = u(x).

2. Parabolic PDEs (e.g., heat equation)

\partial_t u - \nu \Delta u = f(x, t), \quad u(x, 0) = u_0(x)

The FNO learns the time-evolution operator \mathcal{G}(u_0) = u(\cdot, T), which maps the initial data to the state at time T. For comparison, Crank-Nicolson is an implicit finite-difference scheme for the heat equation:

\frac{u^{n+1}_i-u^n_i}{\Delta t} = \frac{a}{2(\Delta x)^2}\left((u^{n+1}_{i+1}-2u^{n+1}_i+u^{n+1}_{i-1})+(u^{n}_{i+1}-2u^{n}_i+u^{n}_{i-1}) \right)

3. Hyperbolic PDEs (e.g., Burgers' equation)

\partial_t u + u \partial_x u = \nu \partial_{xx} u, \quad u(x, 0) = u_0(x)

The FNO learns to time-evolve the initial condition, u_0(x) \mapsto u(x, T). The learned operator can also be a stepwise update operator.

4. Incompressible Navier-Stokes

\partial_t \mathbf{v} + (\mathbf{v} \cdot \nabla) \mathbf{v} = -\nabla p + \nu \Delta \mathbf{v} + \mathbf{f}, \qquad \nabla \cdot \mathbf{v} = 0

The FNO learns \mathcal{G}: \mathbf{v}_0 \mapsto \mathbf{v}(\cdot, T).
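As a point of comparison for the learned time-evolution operator, the Crank-Nicolson update from the heat-equation example above can be sketched in a few lines of NumPy; the grid size, diffusion coefficient a = 0.1, and step counts are illustrative (and the dense solve stands in for a tridiagonal one, for brevity):

```python
import numpy as np

def crank_nicolson_step(u, a, dx, dt):
    """One Crank-Nicolson step for u_t = a u_xx with homogeneous Dirichlet BCs."""
    n = len(u)
    r = a * dt / (2 * dx**2)
    # Second-difference matrix (Dirichlet boundaries are implicit in the truncation).
    L = (np.diag(-2.0 * np.ones(n))
         + np.diag(np.ones(n - 1), 1)
         + np.diag(np.ones(n - 1), -1))
    A = np.eye(n) - r * L          # implicit half-step
    B = np.eye(n) + r * L          # explicit half-step
    return np.linalg.solve(A, B @ u)

x = np.linspace(0, 1, 65)[1:-1]    # interior grid points
u = np.sin(np.pi * x)              # exact solution decays like exp(-a*pi^2*t)
for _ in range(100):               # advance to t = 1
    u = crank_nicolson_step(u, a=0.1, dx=x[1] - x[0], dt=0.01)
```

After 100 steps the peak amplitude closely matches the analytic decay factor e^{-a\pi^2 t}, illustrating the accuracy an FNO surrogate is measured against.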

Adaptive Spectral Truncation

Now we move to the proposed data-driven spectral truncation for the FNO, solving partial differential equations on a bounded domain D.

Analysis Conventions

We are provided with a dataset of N pairs \{(v^{(s)}, u^{(s)})\}_{s=1}^N, where:

- v^{(s)} : D \to \mathbb{R}^{d_{in}} is an input function (e.g., a coefficient field or initial condition),
- u^{(s)} : D \to \mathbb{R}^{d_{out}} is the corresponding output (solution) function.

Here, d_{in} and d_{out} denote the number of physical variables (channels) in the input and output, respectively (e.g., pressure, velocity components).

We assume the domain is discretized such that the Discrete Fourier Transform (DFT) is applicable. Let \mathcal{F} denote the Fourier transform. We denote the corresponding frequency modes of the input and output functions as:

\hat{v}(k) = \mathcal{F}[v](k) \in \mathbb{C}^{d_{in}}, \quad \hat{u}(k) = \mathcal{F}[u](k) \in \mathbb{C}^{d_{out}}

where k is the frequency wavenumber.

Motivation

Many smooth PDE solutions exhibit rapid spectral decay, where higher frequency modes contain significantly less energy:

E(k) = \mathbb{E}\left[\|\hat{u}(k)\|^2\right] \sim |k|^{-p}

for some decay rate p > 0 that varies with the regularity of the solution. In standard FNOs, this motivates a hard spectral truncation in which only the first K modes are retained [1]:

M_\ell(k) = 0 \quad \text{for } |k| > K

However, a fixed K may discard informative high-frequency features or retain noise. We therefore take a data-driven approach, framing the spectral weight learning as a per-mode linear regression problem:

\hat{u}(k) \approx C_k \hat{v}(k)

We fit the weight matrix C_k \in \mathbb{C}^{d_{out} \times d_{in}} by least squares across the training dataset of size N. This allows us to analyze how well each mode is conditioned before training the full network. We formulate the Ridge Regression objective as:

\min_{C_k} \|\mathbf{X}_k C_k^H - \mathbf{Y}_k\|_F^2 + \lambda_k \|C_k\|_F^2 \tag{1}

For a specific mode k, we construct the data matrices:

\mathbf{X}_k = \begin{bmatrix} \hat{v}^{(1)}(k)^H \\ \vdots \\ \hat{v}^{(N)}(k)^H \end{bmatrix} \in \mathbb{C}^{N \times d_{in}}, \quad \mathbf{Y}_k = \begin{bmatrix} \hat{u}^{(1)}(k)^H \\ \vdots \\ \hat{u}^{(N)}(k)^H \end{bmatrix} \in \mathbb{C}^{N \times d_{out}}

where H indicates the Hermitian (conjugate) transpose. The optimal weights C_k satisfy:

C_k^H = \mathbf{X}_k^\dagger \mathbf{Y}_k

where \dagger denotes the Moore-Penrose pseudoinverse. To evaluate the stability of each mode, we compute the Singular Value Decomposition (SVD) of the input matrix \mathbf{X}_k:

\mathbf{X}_k = U_k \Sigma_k V_k^H, \quad \Sigma_k = \text{diag}(\sigma_{k,1}, \dots, \sigma_{k,r})

The condition number \kappa_k = \sigma_{k,\max} / \sigma_{k,\min} quantifies the numerical stability of learning this frequency mode: a large \kappa_k indicates that the least-squares fit for mode k is highly sensitive to noise in the data.
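The per-mode conditioning analysis can be sketched as follows, with synthetic Gaussian inputs standing in for a real PDE dataset (all sizes illustrative). Note that the rows of \mathbf{X}_k are conjugated coefficients, but conjugation does not change singular values, so \kappa_k is unaffected:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_grid, d_in = 200, 64, 3                 # illustrative dataset sizes

# Synthetic input fields v^(s): shape (N, d_in, n_grid) -> per-mode coefficients.
v = rng.standard_normal((N, d_in, n_grid))
v_hat = np.fft.rfft(v, axis=-1)              # (N, d_in, n_modes)

def mode_condition_numbers(v_hat):
    """kappa_k = sigma_max / sigma_min of the data matrix X_k for each mode k."""
    kappas = []
    for k in range(v_hat.shape[-1]):
        X_k = v_hat[:, :, k]                 # (N, d_in): one row per sample
        s = np.linalg.svd(X_k, compute_uv=False)
        kappas.append(s.max() / s.min())
    return np.array(kappas)

kappa = mode_condition_numbers(v_hat)
```

Each mode's SVD is independent of the others, which is what makes this pre-computation trivially parallelizable.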

Adaptive Truncation Rule

We define a selection criterion \mathcal{S} \subset \mathcal{K} to keep only "well-conditioned, energy-dominant" modes. For each retained mode, we apply Ridge Regression (Tikhonov regularization) to dampen noise:

C_k^H = (\mathbf{X}_k^H \mathbf{X}_k + \lambda_k I)^{-1} \mathbf{X}_k^H \mathbf{Y}_k

The regularization parameter \lambda_k is effectively a noise filter: a high \lambda_k smooths the weights of numerically unstable modes, while a low \lambda_k fits numerically stable modes tightly.

To select \mathcal{S}, we define a per-mode utility score:

\text{score}(k) = \frac{\text{Energy}(k)}{\text{Cond}(k)} = \frac{\|\mathbf{Y}_k\|_F^2}{\kappa(\mathbf{X}_k)}

We retain the top K modes sorted by this score, or, more exhaustively, use an energy criterion: choose the smallest set of frequency modes \mathcal{S} such that

\sum_{k \in \mathcal{S}} \text{Energy}(k) \ge \eta \sum_{\text{all } k} \text{Energy}(k)

where \eta (e.g., 0.95) is a target energy-preservation ratio. This constructs a condition-aware spectral geometry, adapting the truncation boundary to the specific physics of the PDE dataset.
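One way to combine the score-based ranking with the energy criterion is to accumulate modes in score order until the energy threshold \eta is met. This greedy variant is a sketch, not necessarily the exact selection rule used in the experiments; the energies and condition numbers below are invented for illustration:

```python
import numpy as np

def select_modes(energy, kappa, eta=0.95):
    """Take modes in decreasing score = energy/kappa order
    until their summed energy reaches a fraction eta of the total."""
    score = energy / kappa
    order = np.argsort(-score)               # best-scoring modes first
    cum = np.cumsum(energy[order])
    n_keep = int(np.searchsorted(cum, eta * energy.sum())) + 1
    return np.sort(order[:n_keep])

energy = np.array([8.0, 4.0, 2.0, 1.0, 0.5])   # ||Y_k||_F^2, illustrative
kappa  = np.array([1.5, 2.0, 3.0, 50.0, 4.0])  # condition numbers, illustrative
S = select_modes(energy, kappa, eta=0.9)
```

Note how mode 3, despite carrying more energy than mode 4, is ranked last because its large condition number makes it unstable to fit.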

Discussion of Computational Concerns

While the SVD computation for every mode k scales as O(N d_{in}^2), this truncation process is strictly a pre-computation step, so its cost is amortized over the training of the neural operator. Furthermore, since modes are independent in the frequency domain, the construction of \mathbf{X}_k and the subsequent scoring are easily parallelized.

This is borne out in the empirical results below, where training times for the condition-aware and standard FNOs differed minimally: the pre-computation increased overall training time by less than 0.5%.

Empirical Results

The following empirical results were obtained over 50 training epochs on 80 differently seeded datasets. The energy criterion \eta was set (arbitrarily) at 0.95, and all operations were performed in PyTorch.

| Dataset | Base Mean | Base Std | CA Mean | CA Std | Imp Mean (%) | Imp Std | Win Rate (%) |
|---------|-----------|----------|---------|--------|--------------|---------|--------------|
| heat    | 0.0167    | 0.0019   | 0.0146  | 0.0025 | 12.3030      | 14.5382 | 90.0         |
| poisson | 0.0006    | 0.0002   | 0.0006  | 0.0002 | -6.4614      | 48.8811 | 35.0         |
| wave    | 0.7554    | 0.1024   | 2.7337  | 0.1394 | 8.0432       | 11.4529 | 80.0         |

Table: Performance comparison on different datasets.
| Dataset | Standard Time (s) | CA Time (s) |
|---------|-------------------|-------------|
| Poisson | 12.37             | 12.39       |
| Heat    | 11.97             | 12.02       |
| Wave    | 11.91             | 12.00       |

Table: Time comparison between Standard and Condition-Aware models.

Characteristic Training Curves

Below are characteristic training results taken from 80 seeded datasets.

Heat dataset results.

Poisson dataset results.

Wave dataset results.

Empirical Discussion

Heat dataset. The Condition-Aware FNO achieves a lower and more stable final test loss, along with a much smoother descent. We observe jumps in the standard FNO training curve around epochs 20–30, indicating numerical instability as the ill-conditioned modes “bounce around” the narrow valleys of the loss landscape.

The smoother descent observed in the condition-aware scheme indicates a more "spherical," isotropic loss landscape, with gradients pointing more directly toward the minimum. Additionally, with noisy modes filtered out, the FNO no longer fits to noise and converges faster.

Poisson dataset. The Poisson dataset demonstrates little to no improvement in test/train loss as well as similar training curves. Since the Poisson dataset is an elliptic PDE with solutions exhibiting rapid spectral decay, the low-frequency modes contain nearly all signal information, which renders the adaptive truncation nearly meaningless.

Additionally, the "noise" in the low modes of the smooth Poisson dataset is negligible: here, regularization does not reduce variance but instead introduces bias, oversmoothing the weights of valid signal modes and causing underfitting. This explains the condition-aware scheme's slightly worse performance.

Wave dataset. In wave (hyperbolic) datasets, high-frequency features persist: hyperbolic dynamics do not smooth out initial conditions the way parabolic equations do. Significant energy therefore remains at high frequencies, which explains the standard FNO's poor performance: its fixed hard truncation discards physically significant high-frequency modes. The condition-aware FNO adapts the truncation boundary, retaining the energy-dominant high-frequency modes that matter in wave dynamics.

Overall, we observe a significant and consistent performance improvement on nonsmooth datasets, where high-frequency (and potentially numerically unstable) modes carry significant signal energy.

Gradient Descent Convergence

Hessian Spectrum and Conditioning

We recall the objective for learning the spectral weights C_k for a specific mode k with Tikhonov regularization (Ridge Regression), as defined in Eq. (1):

\mathcal{L}(C_k) = \|X_k C_k^H - Y_k\|_F^2 + \lambda_k \|C_k\|_F^2

The gradient of this objective with respect to the conjugate of the weights C_k is:

\nabla_{C_k} \mathcal{L} = (X_k^H X_k + \lambda_k I)C_k^H - X_k^H Y_k

The Hessian matrix H_k, which defines the curvature of the loss landscape, is the following positive definite matrix:

H_k = X_k^H X_k + \lambda_k I

Let the SVD of the data matrix X_k be U \Sigma V^H, with singular values \sigma_i for i = 1, \dots, d_{in}. The eigenvalues of the unregularized covariance matrix X_k^H X_k are \mu_i = \sigma_i^2; thus, the eigenvalues of the regularized Hessian H_k are:

\nu_i = \sigma_i^2 + \lambda_k

Convergence Rate of Gradient Descent

For a convex quadratic problem, gradient descent with the optimal step size converges linearly, at a rate determined by the spectral condition number \kappa of the Hessian. The error at step t+1 is bounded by:

\|\epsilon_{t+1}\| \le \left( \frac{\kappa - 1}{\kappa + 1} \right) \|\epsilon_t\|

We compare the condition numbers of the standard formulation (\kappa_{\text{std}}) and the Condition-Aware regularized formulation (\kappa_{\text{reg}}).

Standard FNO: for the unregularized case (\lambda_k = 0):

\kappa_{\text{std}} = \frac{\max_i(\sigma_i^2)}{\min_i(\sigma_i^2)} = \left( \frac{\sigma_{\max}}{\sigma_{\min}} \right)^2

High-frequency modes often exhibit rapid spectral decay, leading to \sigma_{\min} \to 0. This drives \kappa_{\text{std}} \to \infty, producing an "elongated" loss landscape in which gradient descent converges arbitrarily slowly or oscillates.

Condition-Aware FNO: for the proposed method with \lambda_k > 0:

\kappa_{\text{reg}} = \frac{\sigma_{\max}^2 + \lambda_k}{\sigma_{\min}^2 + \lambda_k}
Theorem 1 (Conditioning Improvement).

For any \lambda_k > 0 and \sigma_{\max} > \sigma_{\min} \ge 0, the condition number of the regularized problem is strictly less than that of the unregularized problem: \kappa_{\text{reg}} < \kappa_{\text{std}}.

Proof. Let a = \sigma_{\max}^2 and b = \sigma_{\min}^2, with a > b \ge 0. We want to show \frac{a + \lambda_k}{b + \lambda_k} < \frac{a}{b}:

b(a + \lambda_k) < a(b + \lambda_k) \iff ab + b\lambda_k < ab + a\lambda_k \iff b\lambda_k < a\lambda_k

Since \lambda_k > 0, dividing by \lambda_k gives b < a, which holds by assumption. \square

Conclusion

Since \kappa_{\text{reg}} < \kappa_{\text{std}}, the convergence factor satisfies

\frac{\kappa_{\text{reg}} - 1}{\kappa_{\text{reg}} + 1} < \frac{\kappa_{\text{std}} - 1}{\kappa_{\text{std}} + 1}

This inequality corroborates the empirical results, showing that the Condition-Aware FNO operates on a better-conditioned optimization landscape. The regularization \lambda_k effectively "spheres" the loss geometry, removing the narrow valleys of ill-conditioned high-frequency modes and enabling faster convergence.
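Theorem 1 and the resulting convergence-factor inequality can be checked numerically; the singular values and \lambda_k below are illustrative:

```python
sigma_max, sigma_min, lam = 10.0, 0.01, 1e-2   # illustrative spectrum and ridge strength

# Condition numbers of the unregularized and regularized Hessians.
kappa_std = (sigma_max / sigma_min) ** 2
kappa_reg = (sigma_max**2 + lam) / (sigma_min**2 + lam)

# Gradient-descent contraction factor (kappa - 1) / (kappa + 1).
rate = lambda k: (k - 1) / (k + 1)

print(kappa_std, kappa_reg)        # regularization shrinks kappa by orders of magnitude
print(rate(kappa_std), rate(kappa_reg))
```

Here \kappa_{\text{std}} = 10^6 while \kappa_{\text{reg}} \approx 10^4, so the per-step contraction factor moves measurably away from 1.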

Potential Methodological Extensions

Some ideas for extending this work and addressing its methodological weaknesses.

The condition-aware truncation above establishes a principled pre-training framework for mode selection, but several natural limitations motivate further development. First, a fixed \lambda_k applied uniformly across modes introduces unnecessary bias on well-conditioned modes, directly explaining the degraded Poisson performance. Second, the truncation set \mathcal{S} is computed once from the raw input-output pairs and never updated as the network's internal representations evolve. Third, the spectral geometry of a deep FNO is not uniform across layers, yet a single global \mathcal{S} is applied at every depth. Finally, the formulation is restricted to one-dimensional domains, whereas most physically relevant FNO benchmarks operate on 2D or 3D grids.

This section formalizes four extensions that address each of these limitations in turn.

Learnable Regularization via Bilevel Optimization

The regularization parameter \lambda_k in Eq. (1) currently plays a dual role: it stabilizes ill-conditioned modes and simultaneously smooths the weights of well-conditioned ones. For smooth PDE solutions such as the Poisson dataset, the low-frequency modes are inherently stable (\sigma_{k,\min} is large), so any positive \lambda_k introduces avoidable bias. Conversely, for high-frequency modes in the wave dataset, a large \lambda_k is genuinely necessary. A mode-adaptive, learned \lambda_k would collapse to near zero for stable modes while remaining large for unstable ones.

We treat each \lambda_k \ge 0 as a hyperparameter optimized on a held-out validation set. Since the inner optimization problem (ridge regression) admits a closed-form solution, we can differentiate through it analytically without approximation. Formally, the bilevel problem is:

\min_{\boldsymbol{\alpha}} \; \mathcal{L}_{\mathrm{val}}\!\left(C_k^*(\boldsymbol{\alpha})\right) \quad \text{subject to} \quad C_k^*(\boldsymbol{\alpha}) = \operatorname*{arg\,min}_{C_k} \mathcal{L}_{\mathrm{train}}(C_k,\, e^{\alpha_k}) \tag{2}

where we reparameterize \lambda_k = e^{\alpha_k} to enforce positivity. The inner problem has the closed-form solution

(C_k^*)^H = \bigl(\mathbf{X}_k^H \mathbf{X}_k + e^{\alpha_k} I\bigr)^{-1} \mathbf{X}_k^H \mathbf{Y}_k^{\mathrm{tr}}

and the validation loss is

\mathcal{L}_{\mathrm{val}}(C_k^*) = \bigl\|\mathbf{X}_k^{\mathrm{val}} (C_k^*)^H - \mathbf{Y}_k^{\mathrm{val}}\bigr\|_F^2

Analytic Gradient. Let A_k = \mathbf{X}_k^H \mathbf{X}_k + e^{\alpha_k} I. By the chain rule and the matrix-inverse derivative identity \frac{d}{d\alpha_k} A_k^{-1} = -A_k^{-1} \frac{dA_k}{d\alpha_k} A_k^{-1}:

\frac{\partial \mathcal{L}_{\mathrm{val}}}{\partial \alpha_k} = \operatorname{tr}\!\left[ \frac{\partial \mathcal{L}_{\mathrm{val}}}{\partial (C_k^*)^H} \cdot \frac{\partial (C_k^*)^H}{\partial \alpha_k} \right] \tag{3}

where

\frac{\partial (C_k^*)^H}{\partial \alpha_k} = -e^{\alpha_k} A_k^{-2} \mathbf{X}_k^H \mathbf{Y}_k^{\mathrm{tr}}

All quantities in Eq. (3) are computable in closed form via the pre-computed SVD \mathbf{X}_k = U_k \Sigma_k V_k^H: the inverse A_k^{-1} equals V_k \operatorname{diag}(\sigma_{k,i}^2 + e^{\alpha_k})^{-1} V_k^H, reducing the cost to O(d_{in}^2) per mode after the initial SVD.

Optimization Procedure. We solve the bilevel problem with the following alternating procedure, which runs entirely before FNO training begins:

Algorithm 1: Learnable Regularization Pre-computation
Require: Training data {(X_k^tr, Y_k^tr)}_k, validation data {(X_k^val, Y_k^val)}_k,
         outer step size β, iterations T
Initialize α_k ← 0 for all k (i.e., λ_k = 1)
for t = 1, ..., T do
    for each mode k in parallel do
        Compute C_k*(α_k) via closed-form solution
        Compute ∇_{α_k} L_val via Eq. (3)
        α_k ← α_k − β ∇_{α_k} L_val
    end for
end for
return {λ_k = e^{α_k}}_k

This procedure is still a pre-computation and does not alter the FNO architecture or training loop. Its cost per iteration is O(K \cdot d_{in}^2), where K = |\mathcal{K}| is the number of candidate modes, and T \sim 50–100 outer iterations typically suffice.
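Algorithm 1 can be sketched in PyTorch by differentiating through the closed-form ridge solution with autograd, which computes the same gradient as the hand-derived Eq. (3). The synthetic single-mode data (d_in = 2, d_out = 1), the large initial \lambda, and the Adam settings are all illustrative assumptions:

```python
import torch

torch.manual_seed(0)
N, d_in, d_out = 64, 2, 1

# Synthetic per-mode data: rows play the role of \hat v^{(s)}(k)^H for one fixed mode k.
X_tr = torch.randn(N, d_in, dtype=torch.cfloat)
C_true = torch.randn(d_in, d_out, dtype=torch.cfloat)
Y_tr = X_tr @ C_true + 0.1 * torch.randn(N, d_out, dtype=torch.cfloat)
X_val = torch.randn(N, d_in, dtype=torch.cfloat)
Y_val = X_val @ C_true + 0.1 * torch.randn(N, d_out, dtype=torch.cfloat)

alpha = torch.tensor(3.0, requires_grad=True)   # lambda = exp(alpha) starts large (~20)
opt = torch.optim.Adam([alpha], lr=0.1)
losses = []
for _ in range(100):
    lam = torch.complex(alpha.exp(), torch.zeros(()))        # positive ridge strength
    A = X_tr.conj().T @ X_tr + lam * torch.eye(d_in, dtype=torch.cfloat)
    C_H = torch.linalg.solve(A, X_tr.conj().T @ Y_tr)        # closed-form (C_k^*)^H
    val_loss = (X_val @ C_H - Y_val).abs().pow(2).sum()      # L_val, Eq. above
    opt.zero_grad()
    val_loss.backward()                                      # gradient through the solve
    opt.step()
    losses.append(float(val_loss))
```

Because this mode is well-conditioned, the outer loop drives \lambda down from its deliberately oversized starting value, matching the behavior predicted by Proposition 1 below.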

Proposition 1.

Suppose \sigma_{k,i}^2 \gg e^{\alpha_k} for all singular values of \mathbf{X}_k. Then the gradient \nabla_{\alpha_k} \mathcal{L}_{\mathrm{val}} \to 0, and the learnable \lambda_k converges to a neighborhood of zero for well-conditioned modes.

Proof. When \sigma_{k,i}^2 \gg e^{\alpha_k} for all i, the matrix A_k^{-1} \approx V_k \operatorname{diag}(\sigma_{k,i}^{-2}) V_k^H = (\mathbf{X}_k^H \mathbf{X}_k)^{-1}, the ordinary least-squares inverse. Hence C_k^* \to C_k^{\mathrm{OLS}}, and \mathcal{L}_{\mathrm{val}} is insensitive to further changes in \alpha_k, driving \nabla_{\alpha_k} \mathcal{L}_{\mathrm{val}} \to 0. \square

Per-Layer Spectral Truncation

The current framework scores modes using the raw input-output data pairs (v, u) and applies the resulting set \mathcal{S} uniformly at every FNO layer \ell = 1, \dots, L. However, the feature field v_\ell(x) undergoes substantial transformation with depth: early layers tend to retain input-like spectral content, while later layers increasingly encode solution-like representations. A single global \mathcal{S} cannot simultaneously reflect the spectral geometry of all layers.

We define a separate truncation set \mathcal{S}^\ell \subseteq \mathcal{K} for each layer \ell, constructed from the layer's actual intermediate representations. Because these representations depend on the trained network weights, we adopt a two-pass bootstrap procedure.

Definition 4.2 (Layer-\ell Data Matrices). Given a trained FNO with feature fields \{v_\ell^{(s)}\}_{s=1}^N extracted at layer \ell on the training set, define

\mathbf{X}_k^\ell = \begin{bmatrix} \hat{v}_\ell^{(1)}(k)^H \\ \vdots \\ \hat{v}_\ell^{(N)}(k)^H \end{bmatrix} \in \mathbb{C}^{N \times w_\ell}, \quad \mathbf{Y}_k^\ell = \begin{bmatrix} \hat{v}_{\ell+1}^{(1)}(k)^H \\ \vdots \\ \hat{v}_{\ell+1}^{(N)}(k)^H \end{bmatrix} \in \mathbb{C}^{N \times w_{\ell+1}}

where w_\ell denotes the channel width at layer \ell.

The per-layer utility score is computed identically:

\mathrm{score}^\ell(k) = \frac{\|\mathbf{Y}_k^\ell\|_F^2}{\kappa(\mathbf{X}_k^\ell)}
Algorithm 2: Bootstrap Per-Layer Truncation
Require: Training data, energy threshold η, bootstrap epochs T_0
Train a baseline CA-FNO with global S for T_0 epochs
for each layer ℓ = 1, ..., L do
    Forward-pass training set; extract {v_ℓ^(s)} and {v_{ℓ+1}^(s)}
    Compute X_k^ℓ, Y_k^ℓ, SVD, and score^ℓ(k) for all k
    Construct S^ℓ via energy criterion
end for
Retrain FNO from scratch using {S^ℓ}_{ℓ=1}^L

The bootstrap run need not be trained to full convergence; T_0 \approx 10–20% of the total training epochs is sufficient to obtain representative intermediate representations for scoring.

Proposition 2.

For a parabolic PDE with spectral decay exponent p > 0, if the FNO layers approximate the time-evolution semigroup, then the expected per-layer energy satisfies \mathrm{Energy}^{\ell+1}(k) \le \mathrm{Energy}^\ell(k) for |k| > k_0, for some cutoff k_0, implying \mathcal{S}^{\ell+1} \subseteq \mathcal{S}^\ell in expectation.

Conversely, for a hyperbolic PDE (e.g., wave equation), energy is conserved across frequencies and the inclusion may not hold, motivating empirical verification via Algorithm 2.

Dynamic Spectral Truncation

Both the global and per-layer truncation schemes commit to a fixed mode set before or at the beginning of training. As training proceeds, the network's predictions improve, and the residual between predictions and targets, not the original output u, characterizes what remains to be learned. Modes that appeared noise-dominated relative to the raw output may carry structured residual signal at later epochs. Dynamic truncation re-evaluates \mathcal{S} periodically against the current residuals.

Let \hat{u}_\theta^{(s)}(x) denote the network's prediction at training epoch t, and define the residual at each training sample as

r^{(s)}(x) = u^{(s)}(x) - \hat{u}_\theta^{(s)}(x)

The residual Fourier coefficients are \hat{r}^{(s)}(k) = \mathcal{F}[r^{(s)}](k). We construct residual target matrices

\mathbf{R}_k^{(t)} = \begin{bmatrix} \hat{r}^{(1)}(k)^H \\ \vdots \\ \hat{r}^{(N)}(k)^H \end{bmatrix} \in \mathbb{C}^{N \times d_{out}}

and re-score each mode using the residual energy:

\mathrm{score}_t(k) = \frac{\|\mathbf{R}_k^{(t)}\|_F^2}{\kappa(\mathbf{X}_k)} \tag{4}

where \mathbf{X}_k (derived from the inputs v^{(s)}) is fixed across re-evaluations since the inputs do not change; only \mathbf{R}_k^{(t)} is updated.

Let \mathcal{T} = \{t_1, t_2, \dots, t_m\} be a sequence of re-evaluation epochs. At each t_i \in \mathcal{T}, we recompute \mathbf{R}_k^{(t_i)} and update \mathcal{S}. Modes are added to or dropped from the active set accordingly, and the FNO's spectral multiplier matrices M_\ell(k) for newly excluded modes are zeroed out.

A geometric schedule t_i = \lfloor t_{\max} \cdot \rho^{m-i} \rfloor for \rho \in (0,1) concentrates re-evaluations toward the end of training, where residual structure is most informative. In practice, m = 3–5 re-evaluations suffice.
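The residual re-scoring of Eq. (4) can be sketched as follows, with synthetic fields standing in for real network predictions; the zero predictor and the 90%-accurate predictor are illustrative stand-ins for early- and late-epoch networks:

```python
import numpy as np

def residual_scores(u_true, u_pred, kappa):
    """Eq.-(4)-style re-scoring: residual spectral energy over fixed input conditioning.
    u_true, u_pred: (N, n_grid) real fields; kappa: (n_modes,) condition numbers."""
    r_hat = np.fft.rfft(u_true - u_pred, axis=-1)      # residual Fourier coefficients
    energy = (np.abs(r_hat) ** 2).sum(axis=0)          # ||R_k^(t)||_F^2 per mode
    return energy / kappa

N, n_grid = 32, 64
rng = np.random.default_rng(1)
u_true = rng.standard_normal((N, n_grid))
kappa = np.ones(n_grid // 2 + 1)                       # fixed across re-evaluations

s_early = residual_scores(u_true, np.zeros_like(u_true), kappa)  # untrained network
s_late = residual_scores(u_true, 0.9 * u_true, kappa)            # partially learned
```

As the predictions improve, the residual energy (and hence every mode's score) shrinks, which is the mechanism behind the monotone decrease claimed in Proposition 3.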

| Re-evaluations m | Heat Test Loss | Wave Test Loss |
|------------------|----------------|----------------|
| 0 (static)       | baseline       | baseline       |
| 1                | -\delta_1      | -\delta_1'     |
| 3                | -\delta_3      | -\delta_3'     |
| 5                | -\delta_5      | -\delta_5'     |

Table: Proposed ablation of dynamic re-evaluation frequency on the heat and wave datasets. \delta_m denotes the improvement in test loss over the static baseline. Values to be populated empirically.
Proposition 3 (Monotone Residual Energy).

Let \hat{u}_\theta be a consistent estimator improving under gradient descent. Then for modes k \in \mathcal{S}^{(0)} with high signal-to-noise ratio, \|\mathbf{R}_k^{(t)}\|_F^2 decreases monotonically in expectation as training progresses.

This implies that dynamic truncation will tend to drop modes over time for parabolic PDEs as the network learns to explain their contribution, progressively shrinking \mathcal{S}^{(t)}. For hyperbolic PDEs, dynamic re-evaluation allows previously excluded high-frequency modes to be activated if the residual reveals unlearned structure.

Re-scoring requires recomputing \mathbf{R}_k^{(t)} for all modes k, which costs one forward pass over the training set plus O(K \cdot N \cdot d_{out}) to compute the Frobenius norms. Since \mathbf{X}_k is fixed, no additional SVD computations are required.

Extension to Two-Dimensional Domains

The foregoing development assumes a one-dimensional spatial domain, yielding scalar wavenumbers k \in \mathbb{Z}. In practice, the most widely studied FNO benchmarks, two-dimensional Darcy flow and the vorticity formulation of Navier-Stokes, operate on two-dimensional grids of resolution N_1 \times N_2. Extending to 2D introduces both additional notation and a nontrivial geometric question: whereas the standard FNO in 2D uses a rectangular truncation in frequency space, the condition-aware criterion may select an irregular subset \mathcal{S} \subseteq \mathbb{Z}^2, revealing the anisotropic spectral structure of the underlying PDE.

Let the domain be D = [0,1]^2, discretized at resolution N_1 \times N_2. Modes are indexed by 2D wavenumber pairs \mathbf{k} = (k_1, k_2) \in \mathcal{K}^{2D} \subseteq \mathbb{Z}^2. The 2D DFTs of the input and output functions are

\hat{v}(\mathbf{k}) = \mathcal{F}_{2D}[v](\mathbf{k}) \in \mathbb{C}^{d_{in}}, \quad \hat{u}(\mathbf{k}) = \mathcal{F}_{2D}[u](\mathbf{k}) \in \mathbb{C}^{d_{out}}

The data matrices are constructed per 2D mode identically to the 1D case, and the utility score and energy criterion extend without modification:

\mathrm{score}(\mathbf{k}) = \frac{\|\mathbf{Y}_\mathbf{k}\|_F^2}{\kappa(\mathbf{X}_\mathbf{k})}, \quad \mathcal{S}^{2D} = \operatorname*{arg\,min}_{S \subseteq \mathcal{K}^{2D}} |S| \quad \text{s.t.} \quad \sum_{\mathbf{k} \in S} \mathrm{Energy}(\mathbf{k}) \ge \eta \sum_{\text{all } \mathbf{k}} \mathrm{Energy}(\mathbf{k}) \tag{5}

Anisotropic Spectral Geometry. Unlike the 1D case, the selected set \mathcal{S}^{2D} need not be a square or rectangular region in \mathbb{Z}^2. The shape of \mathcal{S}^{2D} is itself physically informative.

This geometric structure directly encodes the anisotropy of the physical system in the FNO architecture and cannot be recovered by any rectangular truncation.

The total number of modes to score grows as O(N_1 N_2) rather than O(N_1), but since each mode's SVD and scoring are independent, the computation parallelizes exactly as in the 1D case. The SVD cost per mode remains O(N d_{in}^2), so the total pre-computation cost is O(N_1 N_2 \cdot N \cdot d_{in}^2), which for typical FNO resolutions (N_1 = N_2 = 64, N = 1000, d_{in} = 1) is on the order of 4 \times 10^6 operations.

In the isotropic case, Eq. (5) recovers a radial spectral truncation \mathcal{S}^{2D} \approx \{\mathbf{k} : |\mathbf{k}| \le K^*\} for some data-driven K^*, providing a theoretical connection to classical spectral methods in which cutoff wavenumbers are chosen based on the Kolmogorov microscale or solution regularity.
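A sketch of the 2D energy criterion on synthetic anisotropic spectra shows how an irregular, anisotropic mode set emerges; the decay profile, grid sizes, and sample count are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
N, N1, N2 = 100, 32, 32

# Anisotropic synthetic spectrum: energy decays more slowly along k1 than k2.
k1 = np.fft.fftfreq(N1)[:, None] * N1
k2 = np.fft.fftfreq(N2)[None, :] * N2
decay = 1.0 / (1.0 + np.abs(k1) + 4.0 * np.abs(k2))

u_hat = decay * (rng.standard_normal((N, N1, N2))
                 + 1j * rng.standard_normal((N, N1, N2)))
energy = (np.abs(u_hat) ** 2).sum(axis=0)      # Energy(k) per 2D mode

# Greedy energy criterion (Eq. (5)) over the flattened 2D mode grid.
flat = energy.ravel()
order = np.argsort(-flat)
cum = np.cumsum(flat[order])
n_keep = int(np.searchsorted(cum, 0.95 * flat.sum())) + 1
mask = np.zeros(N1 * N2, dtype=bool)
mask[order[:n_keep]] = True
S_2d = mask.reshape(N1, N2)                    # irregular, anisotropic mode set
```

The selected region extends farther along the k_1 axis than the k_2 axis, mirroring the anisotropy built into the synthetic spectrum; no rectangular truncation reproduces this shape with the same mode budget.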

Remark 1.

The extension to three-dimensional domains is notational: modes become triples (k_1, k_2, k_3) \in \mathbb{Z}^3 and the data matrices scale accordingly. No structural changes to the algorithm are required.

References

[1] Zongyi Li, Nikola B. Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew M. Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. CoRR, abs/2010.08895, 2020.