Differentiability
=================

Every component along the
``Mesh → Assembler → SparseMatrix → Condenser → Solve`` pipeline
is either a :class:`torch.nn.Module` or a custom
:class:`torch.autograd.Function`, and the linear solve at the end
has an **analytic adjoint backward**. As a result a loss computed
on the FEM solution can be back-propagated all the way to *anything*
that touched the pipeline: a coefficient at every node, a Dirichlet
value, or a neural network that parameterises the material
coefficient field --- without writing a single line of sensitivity
code by hand.

This chapter explains how the gradient flow works, demonstrates it
with two worked examples (parameter identification and
density-based topology optimization), and lists the cost,
correctness, and the few things that aren't differentiable yet.


How it works
------------

Three pieces of TensorMesh are wired into the autograd graph:

* :class:`~tensormesh.Mesh` extends :class:`torch.nn.Module`. Its
  ``points`` and per-node fields are tensors that can carry
  ``requires_grad=True``. Moving the mesh with ``mesh.to(device)``
  and saving with ``state_dict`` work as you'd expect.
* :class:`~tensormesh.ElementAssembler` and
  :class:`~tensormesh.NodeAssembler` are themselves ``nn.Module`` s.
  Any tensor that flows into a ``forward(...)`` integrand becomes
  a graph input to the assembled matrix or vector.
* :meth:`tensormesh.sparse.SparseMatrix.solve` is inherited unchanged
  from ``torch_sla.SparseTensor`` (see :doc:`linear_solvers`), so the
  adjoint lives in ``torch-sla`` rather than in TensorMesh. The forward
  solve produces :math:`u` from :math:`A\,u = b`. For a scalar loss
  :math:`L(u)`, reverse-mode autograd hands the solve the upstream
  gradient :math:`\partial L/\partial u`; rather than differentiate
  through the solver's internals, the solve is wrapped in a
  :class:`torch.autograd.Function` whose custom ``backward`` solves a
  single **adjoint** system for the adjoint variable
  :math:`\boldsymbol{\lambda}`,

  .. math::

     A^{T}\, \boldsymbol{\lambda} \;=\; \frac{\partial L}{\partial u},

  and then assembles the input gradients in closed form — for each
  non-zero matrix entry
  :math:`\partial L / \partial A_{ij} = -\lambda_i\, u_j`, and for the
  right-hand side :math:`\partial L / \partial b = \boldsymbol{\lambda}`.

A call to ``loss = criterion(K.solve(b), target).sum()`` followed
by ``loss.backward()`` therefore gives correct gradients for every
leaf tensor that fed into ``K``, ``b``, or the loss --- matrix
entries, source terms, prescribed boundary values, and any upstream
parameter (NN weights, design variables, …).


Adjoint cost
------------

Back-propagation through one linear solve costs **one additional
sparse solve** (with the transposed system). For SPD problems
:math:`A^T = A`, so the backward uses the same factorisation or
preconditioner pattern as the forward. The total per-iteration
cost in a gradient-based optimiser is therefore roughly
``2 × forward_solve_cost`` regardless of how many degrees of
freedom you have or how complex the assembly logic is.


Correctness check: autograd vs finite differences
-------------------------------------------------

Before trusting autograd on a serious inverse problem, it is worth
spot-checking it against central finite differences. The script
that produces the figures on this page contains a small check on
a 35-node mesh; running it prints

.. code-block:: text

   [grad check] vs finite differences...
       node    autograd          fd                  |diff|
         32    -8.297026e-03   -8.297026e-03   8.05e-12
         15    -5.268205e-04   -5.268205e-04   2.31e-12
         37    -4.118132e-03   -4.118132e-03   1.04e-12

i.e. roughly 11 digits of agreement, which is what a float64
adjoint should give. The minor residual is the FD truncation
error, not an autograd bug.


What's differentiable, what isn't
---------------------------------

**Differentiable through ``solve`` today:**

* Matrix entries (``edata``) -- gradients land on the upstream
  tensor whose values fed each non-zero.
* The right-hand side ``b``.
* Any tensor in ``point_data`` / ``element_data`` / ``scalar_data``
  passed to an assembler -- gradients flow through the integrand.
* Dirichlet values (via the condensed RHS contribution).

**Backend caveats:**

* The ``torch-sla`` ``backend="pytorch"`` and ``backend="auto"``
  paths are the safest defaults for autograd. They route the
  forward through a pure-PyTorch iterative solver and use the
  analytic adjoint for backward.
* The SciPy / Eigen / cuDSS / CuPy backends are wrapped by the same
  custom :class:`torch.autograd.Function`, so gradients still flow
  -- but the forward computation lives in NumPy / CuPy / C++ and
  detaches from the graph for that span. This is correct for the
  linear solve but means you cannot "see into" the solver from
  autograd.
* The legacy fallback paths in :mod:`tensormesh.sparse` (used when
  ``torch-sla`` is not installed) implement the same custom
  ``autograd.Function`` with adjoint backward, so gradients work
  there too.

When in doubt for a research workflow involving autodiff, install
``torch-sla`` and stick with ``backend="auto"``.


Worked example 1: coefficient-field identification
--------------------------------------------------

Suppose we observe a "ground truth" Poisson solution and want to
recover the unknown coefficient field :math:`\kappa(x)` at every
mesh node by gradient descent:

.. math::

   -\nabla \cdot \big(\kappa(x)\, \nabla u\big) \;=\; f \quad \text{in } \Omega,
   \qquad u = 0 \text{ on } \partial\Omega.

The forward map is "FEM-solve given :math:`\kappa`"; the loss is the
L² distance between the FEM solution and the observation; autograd
supplies the gradient and Adam updates :math:`\kappa`. We parametrise
:math:`\kappa = 1 + \tanh\theta` so :math:`\kappa` stays in
:math:`(0, 2)` -- strictly positive, hence the FEM matrix is SPD on
every iteration -- with :math:`\theta` unconstrained.

.. code-block:: python

   import torch
   from tensormesh import Mesh, ElementAssembler, NodeAssembler, Condenser

   device = "cuda" if torch.cuda.is_available() else "cpu"
   mesh   = Mesh.gen_rectangle(chara_length=0.04).to(device)
   cond   = Condenser(mesh.boundary_mask)

   class WeightedLaplace(ElementAssembler):
       def forward(self, gradu, gradv, kappa):
           return kappa * (gradu @ gradv)

   class Source(NodeAssembler):
       def forward(self, v, f):
           return v * f

   def fem_solve(kappa, f_vals):
       K = WeightedLaplace.from_mesh(mesh)(point_data={"kappa": kappa}).double()
       b = Source.from_mesh(mesh)(point_data={"f": f_vals}).double()
       K_, b_ = cond(K, b)
       return cond.recover(K_.solve(b_))

   # Synthetic ground-truth data.
   pts   = mesh.points
   x, y  = pts[:, 0], pts[:, 1]
   kappa_true = 1.0 + 0.6 * torch.sin(2 * torch.pi * x) * torch.cos(2 * torch.pi * y)
   f_vals     = torch.ones(mesh.n_points, dtype=torch.float64, device=device)
   with torch.no_grad():
       u_obs = fem_solve(kappa_true, f_vals)

   # Recover kappa from u_obs by Adam.
   theta = torch.zeros(mesh.n_points, dtype=torch.float64,
                       device=device, requires_grad=True)
   optim = torch.optim.Adam([theta], lr=3e-2)
   for step in range(5000):
       optim.zero_grad()
       kappa = 1.0 + torch.tanh(theta)
       u     = fem_solve(kappa, f_vals)
       loss  = ((u - u_obs) ** 2).sum()
       loss.backward()
       optim.step()

Over 5000 iterations the data loss drops by more than seven orders of
magnitude and the max-norm observation error bottoms out near
:math:`6 \times 10^{-5}`:

.. figure:: ../_static/user_guide/differentiability/param_id_loss.png
   :align: center
   :width: 75%

   Adam optimisation history over 5000 steps: the data loss
   :math:`\|u_\theta - u_{\rm obs}\|_2^2` falls from :math:`10^{-2}` to
   :math:`\sim 3 \times 10^{-10}` (the periodic spikes are Adam
   overshooting, recovered within a few steps), and the relative
   max-norm error in :math:`u` reaches :math:`\sim 6 \times 10^{-5}`.

The recovered coefficient field:

.. figure:: ../_static/user_guide/differentiability/param_id_fields.png
   :align: center
   :width: 100%

   Ground truth :math:`\kappa(x)` (left), the field recovered after
   5000 Adam steps (middle), and the absolute error (right). The
   four-lobe checkerboard is now recovered across the whole domain;
   only a faint residual remains at the centre (absolute error peaking
   near :math:`0.08` there, under :math:`0.01` over the rest of the
   domain), where the solution gradient is small.

The only visible residual is a faint blob at the centre. There the
constant source makes :math:`\nabla u` vanish, so the flux
:math:`\kappa\,\nabla u` carries almost no information about
:math:`\kappa` and that region is recovered slowest -- a smoothness
penalty on :math:`\theta` removes it, left as an exercise.

The whole gradient computation goes through the ``WeightedLaplace``
assembler and the linear solve via the adjoint backward -- you
never write a sensitivity equation by hand.


Worked example 2: thermal-compliance topology optimization
----------------------------------------------------------

For density-based topology optimization (compliance minimisation
under a volume constraint), TensorMesh ships a dedicated
:class:`~tensormesh.optimizer.OCOptimizer` that implements the
classical Optimality Criteria update. Here we apply it to a
thermal problem on :math:`[0,1]^2`: a Gaussian heat source in the
centre, cold sinks at the boundary, density :math:`\rho \in [0,1]`
per element, SIMP conductivity
:math:`\kappa(\rho) = \kappa_{\min} + (1 - \kappa_{\min})\,\rho^{p}`,
and a hard volume cap :math:`\bar{\rho} \leq V = 0.4`.

.. code-block:: python

   import torch
   from tensormesh import Mesh, ElementAssembler, NodeAssembler, Condenser
   from tensormesh.optimizer import OCOptimizer

   mesh = Mesh.gen_rectangle(chara_length=0.025).to(device)
   cond = Condenser(mesh.boundary_mask)
   elements = mesh.cells["triangle"]
   n_elem   = elements.shape[0]

   class SIMPLaplace(ElementAssembler):
       def forward(self, gradu, gradv, kappa):
           return kappa * (gradu @ gradv)

   class Source(NodeAssembler):
       def forward(self, v, f):
           return v * f

   # Localised heat source.
   x, y    = mesh.points[:, 0], mesh.points[:, 1]
   sigma   = 0.08
   f_field = torch.exp(-((x - 0.5) ** 2 + (y - 0.5) ** 2) / (2 * sigma ** 2))

   V, p_simp, kmin = 0.4, 3.0, 1e-3
   rho = torch.full((n_elem,), V, dtype=torch.float64,
                    device=device, requires_grad=True)
   optim = OCOptimizer(rho, vf=V, move_limit=0.15)

   def fem_solve(rho):
       kappa = kmin + (1.0 - kmin) * rho ** p_simp
       K = SIMPLaplace.from_mesh(mesh)(element_data={"kappa": kappa}).double()
       b = Source.from_mesh(mesh)(point_data={"f": f_field}).double()
       K_, b_ = cond(K, b)
       return cond.recover(K_.solve(b_)), b

   for step in range(60):
       optim.zero_grad()
       u, b = fem_solve(rho)
       compliance = (b * u).sum()             # J = b^T u
       compliance.backward()                  # autograd populates rho.grad
       optim.step()                           # OC update + bisection

Compliance drops from :math:`5.9 \times 10^{-3}` (uniform grey
plate) to :math:`6.9 \times 10^{-4}` -- nearly an order of
magnitude -- while the volume constraint
:math:`\bar{\rho} = 0.4` is held exactly throughout. The optimal
layout is a "thermal cross" connecting the hot source in the
centre to the cold sinks at the four mid-edges:

.. figure:: ../_static/user_guide/differentiability/topology_thermal.png
   :align: center
   :width: 100%

   Density evolution over 60 OC steps. Iteration 0 is the uniform
   :math:`\rho = V = 0.4` start; by iteration 5 the cross shape
   has emerged; iteration 59 is the converged design. Compliance
   drops by :math:`\sim 8.6\times`; the volume fraction is held
   at :math:`0.4` exactly.

The OC step uses the gradient just computed by autograd ---
:meth:`tensormesh.optimizer.OCOptimizer.step` reads ``rho.grad``,
finds the Lagrange multiplier by bisection, and applies the
density update with a move limit. No finite differences, no
hand-coded adjoint.


Wiring a neural network in
--------------------------

Three patterns cover the common ML/PDE couplings:

* **NN predicts a coefficient field.** The NN ingests coordinates
  (or features derived from them) and outputs a per-node value ---
  feed the output to an assembler via
  ``point_data={"kappa": nn_out}``. Gradients flow from the FEM
  loss back through the NN.
* **NN predicts boundary values.** Pass the output as
  ``dirichlet_value`` to a freshly-built
  :class:`~tensormesh.Condenser`. The condensation contribution to
  the RHS keeps the gradients connected.
* **NN predicts per-element stiffness modifiers.** Stack the
  output into a per-element tensor and pass it via
  ``element_data={"alpha": nn_out}``; let the assembler's
  ``forward`` use it inside the integrand.

In all three the NN is a regular ``nn.Module`` and the FEM
pipeline is a regular sequence of ``nn.Module`` calls; standard
``torch.optim`` optimisers work without any special hooks.

.. note::

   Wiring a network in needs **no** TensorMesh-specific neural-network
   layer -- a plain :class:`torch.nn.Module` is enough, as above. Today
   :mod:`tensormesh.nn` ships only the container utilities ``BufferDict``
   and ``BufferList`` (used internally to register tensors as module
   buffers), not learnable layers. Higher-level neural-operator and
   learnable-PDE building blocks are planned for a future release; until
   then, build the network side with ordinary ``torch.nn`` layers.


Reproducing the figures
-----------------------

The two examples above, together with the FD gradient check, are
generated by a single script:

.. code-block:: bash

   python docs/scripts/differentiability_figures.py

It uses only the public API. The 5000-step identification loop now
dominates the runtime -- a few minutes either way (~3 min on a GPU) --
after which it writes ``param_id_loss.png``, ``param_id_fields.png``,
and ``topology_thermal.png`` to
``docs/source/_static/user_guide/differentiability/``.


What's next
-----------

* :doc:`linear_solvers` -- the autograd-aware solver behind
  ``SparseMatrix.solve``.
* :doc:`batched_workflows` -- back-propagate through a batched
  pipeline for ML training-data generation.
* :doc:`time_integration` -- transient adjoints work too, with one
  adjoint solve per time step.
* :doc:`../api/index` -- the ``solve`` autograd function and the
  ``OCOptimizer`` reference.