Differentiability

Every component along the Mesh Assembler SparseMatrix Condenser Solve pipeline is either a torch.nn.Module or a custom torch.autograd.Function, and the linear solve at the end has an analytic adjoint backward. As a result a loss computed on the FEM solution can be back-propagated all the way to anything that touched the pipeline: a coefficient at every node, a Dirichlet value, or a neural network that parameterises the material coefficient field — without writing a single line of sensitivity code by hand.

This chapter explains how the gradient flow works, demonstrates it with two worked examples (parameter identification and density-based topology optimization), and lists the cost, correctness, and the few things that aren’t differentiable yet.

How it works

Three pieces of TensorMesh are wired into the autograd graph:

  • Mesh extends torch.nn.Module. Its points and per-node fields are tensors that can carry requires_grad=True. Moving the mesh with mesh.to(device) and saving with state_dict work as you’d expect.

  • ElementAssembler and NodeAssembler are themselves nn.Module s. Any tensor that flows into a forward(...) integrand becomes a graph input to the assembled matrix or vector.

  • tensormesh.sparse.SparseMatrix.solve() is inherited unchanged from torch_sla.SparseTensor (see Sparse Solvers), so the adjoint lives in torch-sla rather than in TensorMesh. The forward solve produces \(u\) from \(A\,u = b\). For a scalar loss \(L(u)\), reverse-mode autograd hands the solve the upstream gradient \(\partial L/\partial u\); rather than differentiate through the solver’s internals, the solve is wrapped in a torch.autograd.Function whose custom backward solves a single adjoint system for the adjoint variable \(\boldsymbol{\lambda}\),

    \[A^{T}\, \boldsymbol{\lambda} \;=\; \frac{\partial L}{\partial u},\]

    and then assembles the input gradients in closed form — for each non-zero matrix entry \(\partial L / \partial A_{ij} = -\lambda_i\, u_j\), and for the right-hand side \(\partial L / \partial b = \boldsymbol{\lambda}\).

A call to loss = criterion(K.solve(b), target).sum() followed by loss.backward() therefore gives correct gradients for every leaf tensor that fed into K, b, or the loss — matrix entries, source terms, prescribed boundary values, and any upstream parameter (NN weights, design variables, …).

Adjoint cost

Back-propagation through one linear solve costs one additional sparse solve (with the transposed system). For SPD problems \(A^T = A\), so the backward uses the same factorisation or preconditioner pattern as the forward. The total per-iteration cost in a gradient-based optimiser is therefore roughly 2 × forward_solve_cost regardless of how many degrees of freedom you have or how complex the assembly logic is.

Correctness check: autograd vs finite differences

Before trusting autograd on a serious inverse problem, it is worth spot-checking it against central finite differences. The script that produces the figures on this page contains a small check on a 35-node mesh; running it prints

[grad check] vs finite differences...
    node    autograd          fd                  |diff|
      32    -8.297026e-03   -8.297026e-03   8.05e-12
      15    -5.268205e-04   -5.268205e-04   2.31e-12
      37    -4.118132e-03   -4.118132e-03   1.04e-12

i.e. roughly 11 digits of agreement, which is what a float64 adjoint should give. The minor residual is the FD truncation error, not an autograd bug.

What’s differentiable, what isn’t

Differentiable through ``solve`` today:

  • Matrix entries (edata) – gradients land on the upstream tensor whose values fed each non-zero.

  • The right-hand side b.

  • Any tensor in point_data / element_data / scalar_data passed to an assembler – gradients flow through the integrand.

  • Dirichlet values (via the condensed RHS contribution).

Backend caveats:

  • The torch-sla backend="pytorch" and backend="auto" paths are the safest defaults for autograd. They route the forward through a pure-PyTorch iterative solver and use the analytic adjoint for backward.

  • The SciPy / Eigen / cuDSS / CuPy backends are wrapped by the same custom torch.autograd.Function, so gradients still flow – but the forward computation lives in NumPy / CuPy / C++ and detaches from the graph for that span. This is correct for the linear solve but means you cannot “see into” the solver from autograd.

  • The legacy fallback paths in tensormesh.sparse (used when torch-sla is not installed) implement the same custom autograd.Function with adjoint backward, so gradients work there too.

When in doubt for a research workflow involving autodiff, install torch-sla and stick with backend="auto".

Worked example 1: coefficient-field identification

Suppose we observe a “ground truth” Poisson solution and want to recover the unknown coefficient field \(\kappa(x)\) at every mesh node by gradient descent:

\[-\nabla \cdot \big(\kappa(x)\, \nabla u\big) \;=\; f \quad \text{in } \Omega, \qquad u = 0 \text{ on } \partial\Omega.\]

The forward map is “FEM-solve given \(\kappa\)”; the loss is the L² distance between the FEM solution and the observation; autograd supplies the gradient and Adam updates \(\kappa\). We parametrise \(\kappa = 1 + \tanh\theta\) so \(\kappa\) stays in \((0, 2)\) – strictly positive, hence the FEM matrix is SPD on every iteration – with \(\theta\) unconstrained.

import torch
from tensormesh import Mesh, ElementAssembler, NodeAssembler, Condenser

device = "cuda" if torch.cuda.is_available() else "cpu"
mesh   = Mesh.gen_rectangle(chara_length=0.04).to(device)
cond   = Condenser(mesh.boundary_mask)

class WeightedLaplace(ElementAssembler):
    def forward(self, gradu, gradv, kappa):
        return kappa * (gradu @ gradv)

class Source(NodeAssembler):
    def forward(self, v, f):
        return v * f

def fem_solve(kappa, f_vals):
    K = WeightedLaplace.from_mesh(mesh)(point_data={"kappa": kappa}).double()
    b = Source.from_mesh(mesh)(point_data={"f": f_vals}).double()
    K_, b_ = cond(K, b)
    return cond.recover(K_.solve(b_))

# Synthetic ground-truth data.
pts   = mesh.points
x, y  = pts[:, 0], pts[:, 1]
kappa_true = 1.0 + 0.6 * torch.sin(2 * torch.pi * x) * torch.cos(2 * torch.pi * y)
f_vals     = torch.ones(mesh.n_points, dtype=torch.float64, device=device)
with torch.no_grad():
    u_obs = fem_solve(kappa_true, f_vals)

# Recover kappa from u_obs by Adam.
theta = torch.zeros(mesh.n_points, dtype=torch.float64,
                    device=device, requires_grad=True)
optim = torch.optim.Adam([theta], lr=3e-2)
for step in range(5000):
    optim.zero_grad()
    kappa = 1.0 + torch.tanh(theta)
    u     = fem_solve(kappa, f_vals)
    loss  = ((u - u_obs) ** 2).sum()
    loss.backward()
    optim.step()

Over 5000 iterations the data loss drops by more than seven orders of magnitude and the max-norm observation error bottoms out near \(6 \times 10^{-5}\):

../_images/param_id_loss.png

Fig. 16 Adam optimisation history over 5000 steps: the data loss \(\|u_\theta - u_{\rm obs}\|_2^2\) falls from \(10^{-2}\) to \(\sim 3 \times 10^{-10}\) (the periodic spikes are Adam overshooting, recovered within a few steps), and the relative max-norm error in \(u\) reaches \(\sim 6 \times 10^{-5}\).

The recovered coefficient field:

../_images/param_id_fields.png

Fig. 17 Ground truth \(\kappa(x)\) (left), the field recovered after 5000 Adam steps (middle), and the absolute error (right). The four-lobe checkerboard is now recovered across the whole domain; only a faint residual remains at the centre (absolute error peaking near \(0.08\) there, under \(0.01\) over the rest of the domain), where the solution gradient is small.

The only visible residual is a faint blob at the centre. There the constant source makes \(\nabla u\) vanish, so the flux \(\kappa\,\nabla u\) carries almost no information about \(\kappa\) and that region is recovered slowest – a smoothness penalty on \(\theta\) removes it, left as an exercise.

The whole gradient computation goes through the WeightedLaplace assembler and the linear solve via the adjoint backward – you never write a sensitivity equation by hand.

Worked example 2: thermal-compliance topology optimization

For density-based topology optimization (compliance minimisation under a volume constraint), TensorMesh ships a dedicated OCOptimizer that implements the classical Optimality Criteria update. Here we apply it to a thermal problem on \([0,1]^2\): a Gaussian heat source in the centre, cold sinks at the boundary, density \(\rho \in [0,1]\) per element, SIMP conductivity \(\kappa(\rho) = \kappa_{\min} + (1 - \kappa_{\min})\,\rho^{p}\), and a hard volume cap \(\bar{\rho} \leq V = 0.4\).

import torch
from tensormesh import Mesh, ElementAssembler, NodeAssembler, Condenser
from tensormesh.optimizer import OCOptimizer

mesh = Mesh.gen_rectangle(chara_length=0.025).to(device)
cond = Condenser(mesh.boundary_mask)
elements = mesh.cells["triangle"]
n_elem   = elements.shape[0]

class SIMPLaplace(ElementAssembler):
    def forward(self, gradu, gradv, kappa):
        return kappa * (gradu @ gradv)

class Source(NodeAssembler):
    def forward(self, v, f):
        return v * f

# Localised heat source.
x, y    = mesh.points[:, 0], mesh.points[:, 1]
sigma   = 0.08
f_field = torch.exp(-((x - 0.5) ** 2 + (y - 0.5) ** 2) / (2 * sigma ** 2))

V, p_simp, kmin = 0.4, 3.0, 1e-3
rho = torch.full((n_elem,), V, dtype=torch.float64,
                 device=device, requires_grad=True)
optim = OCOptimizer(rho, vf=V, move_limit=0.15)

def fem_solve(rho):
    kappa = kmin + (1.0 - kmin) * rho ** p_simp
    K = SIMPLaplace.from_mesh(mesh)(element_data={"kappa": kappa}).double()
    b = Source.from_mesh(mesh)(point_data={"f": f_field}).double()
    K_, b_ = cond(K, b)
    return cond.recover(K_.solve(b_)), b

for step in range(60):
    optim.zero_grad()
    u, b = fem_solve(rho)
    compliance = (b * u).sum()             # J = b^T u
    compliance.backward()                  # autograd populates rho.grad
    optim.step()                           # OC update + bisection

Compliance drops from \(5.9 \times 10^{-3}\) (uniform grey plate) to \(6.9 \times 10^{-4}\) – nearly an order of magnitude – while the volume constraint \(\bar{\rho} = 0.4\) is held exactly throughout. The optimal layout is a “thermal cross” connecting the hot source in the centre to the cold sinks at the four mid-edges:

../_images/topology_thermal.png

Fig. 18 Density evolution over 60 OC steps. Iteration 0 is the uniform \(\rho = V = 0.4\) start; by iteration 5 the cross shape has emerged; iteration 59 is the converged design. Compliance drops by \(\sim 8.6\times\); the volume fraction is held at \(0.4\) exactly.

The OC step uses the gradient just computed by autograd — tensormesh.optimizer.OCOptimizer.step() reads rho.grad, finds the Lagrange multiplier by bisection, and applies the density update with a move limit. No finite differences, no hand-coded adjoint.

Wiring a neural network in

Three patterns cover the common ML/PDE couplings:

  • NN predicts a coefficient field. The NN ingests coordinates (or features derived from them) and outputs a per-node value — feed the output to an assembler via point_data={"kappa": nn_out}. Gradients flow from the FEM loss back through the NN.

  • NN predicts boundary values. Pass the output as dirichlet_value to a freshly-built Condenser. The condensation contribution to the RHS keeps the gradients connected.

  • NN predicts per-element stiffness modifiers. Stack the output into a per-element tensor and pass it via element_data={"alpha": nn_out}; let the assembler’s forward use it inside the integrand.

In all three the NN is a regular nn.Module and the FEM pipeline is a regular sequence of nn.Module calls; standard torch.optim optimisers work without any special hooks.

Note

Wiring a network in needs no TensorMesh-specific neural-network layer – a plain torch.nn.Module is enough, as above. Today tensormesh.nn ships only the container utilities BufferDict and BufferList (used internally to register tensors as module buffers), not learnable layers. Higher-level neural-operator and learnable-PDE building blocks are planned for a future release; until then, build the network side with ordinary torch.nn layers.

Reproducing the figures

The two examples above, together with the FD gradient check, are generated by a single script:

python docs/scripts/differentiability_figures.py

It uses only the public API. The 5000-step identification loop now dominates the runtime – a few minutes either way (~3 min on a GPU) – after which it writes param_id_loss.png, param_id_fields.png, and topology_thermal.png to docs/source/_static/user_guide/differentiability/.

What’s next

  • Sparse Solvers – the autograd-aware solver behind SparseMatrix.solve.

  • Batched Workflows – back-propagate through a batched pipeline for ML training-data generation.

  • Time Integration – transient adjoints work too, with one adjoint solve per time step.

  • API Reference – the solve autograd function and the OCOptimizer reference.