Batched Workflows
=================

"Batched" means three different things in TensorMesh, and they
operate at different layers of the pipeline. Picking the right one
for your problem matters more than turning every knob at once. This
chapter walks through the three axes, then gives the canonical
"same mesh, many sources" recipe with end-to-end numbers measured
on the development workstation.


Three axes of batching
----------------------

.. list-table::
   :header-rows: 1
   :widths: 28 28 44

   * - Axis
     - Where it lives
     - When to reach for it
   * - **Memory chunking**
     - ``Assembler(batch_size=N)``
     - One assembly that doesn't fit in GPU memory.
   * - **Batched right-hand sides**
     - ``K.solve(B)`` with ``B`` of shape ``[n_dof, n_batch]``
     - Many loads, one mesh, one stiffness matrix.
   * - **Multi-problem datasets**
     - :mod:`tensormesh.dataset`
     - ML training data: many ``(K_i, b_i)`` pairs.

What is **not** built in: vmap-style assembly that vectorises ``K``
construction across many parameter values. If you need that, loop
in Python or wrap your assembler call in ``torch.vmap`` manually --
see :ref:`batched-not-built-in` below.


Axis 1 -- Memory chunking with ``batch_size``
---------------------------------------------

A single :class:`~tensormesh.ElementAssembler` call holds the
per-element, per-quadrature-point integrand tensor in memory. For
fine meshes or high-order elements that tensor can be too large.
The ``batch_size`` argument splits the quadrature dimension into
chunks, accumulating the contribution one chunk at a time:

.. code-block:: python

   K = LaplaceElementAssembler.from_mesh(mesh)(batch_size=4)

The result is **bit-identical** (to machine precision) to the
un-chunked call -- ``batch_size`` is purely a memory knob. Default
is ``-1`` (no chunking, fastest if memory allows).

The figure below assembles the same operator on a quadratic-triangle
mesh with about :math:`1.9\times10^{5}` nodes on the GPU, at quadrature
order 5 (:math:`n_q = 7` points per element). Shrinking ``batch_size``
from the full integrand down to one quadrature point at a time drops
peak memory from **810 MB to 271 MB** -- a 3x reduction -- at
essentially no extra time:

.. figure:: ../_static/user_guide/batched_workflows/memory_chunking.png
   :align: center
   :width: 90%

   Assembly time (blue) and peak GPU memory (red) versus the
   ``batch_size`` chunk knob on a quadratic triangle mesh
   (185 973 nodes, :math:`n_q = 7`). Peak memory falls from 810 MB
   (un-chunked) to 271 MB (one quadrature point per chunk); every
   setting produces a stiffness matrix that differs from the
   un-chunked reference by less than :math:`2 \times 10^{-15}` in
   max-norm.

Two things temper the knob -- and explain why a careless benchmark can
look flat:

* **It only bites while** ``batch_size`` :math:`< n_q`. Since
  ``n_batch = ceil(n_q / batch_size)``, any ``batch_size`` at or above
  the quadrature-point count collapses to a single chunk -- that is, no
  chunking at all. At the *default* quadrature order an order-2 triangle
  has only :math:`n_q = 3` points, so there is very little room to
  chunk; raising the quadrature order (or using high-order / vector
  elements) grows the transient integrand and gives the knob more to
  work on -- which is why the figure above uses order 5.
* **There is a floor it cannot touch.** The assembled matrix values,
  their ``int64`` row/column index buffers, and the per-element
  accumulator ``[n_elem, n_basis, n_basis]`` are all proportional to the
  element count and stay resident regardless of ``batch_size``. Chunking
  only pays off when the transient integrand
  ``[n_elem, n_q, n_basis, n_basis]`` is the dominant allocation;
  scaling up the *mesh* alone does **not** help, because the floor grows
  with it.

A typical recipe: start with the default; if you OOM, halve
``batch_size`` (down to ``1``) until the assembly fits.

This axis has nothing to do with batching across *problems*. The
result is one matrix, not a batch of matrices.

.. note::

   ``batch_size`` chunks one assembly on a single device. When the
   stiffness matrix is too large for one GPU even chunked, TensorMesh
   also has a **multi-GPU distributed assembly** path -- currently a
   beta / research-grade feature; see
   :doc:`../example_gallery/distributed` for the building blocks. A
   future release will align distributed assembly with the distributed
   linear solver in ``torch-sla``, closing the loop on truly
   large-scale DOF counts end-to-end.


Axis 2 -- Batched right-hand sides
----------------------------------

When you have one stiffness matrix ``K`` and many load vectors
``b_i``, stack the loads into a single ``[n_dof, n_batch]`` tensor
and hand the whole block to ``K.solve`` in one call:

.. code-block:: python

   B = torch.stack([b_1, b_2, ..., b_64], dim=1)   # [n_dof, 64]
   X = K.solve(B)                                  # [n_dof, 64]

``SparseMatrix.solve`` delegates to ``torch-sla``, which selects the
backend and method from the **matrix properties and the device** --
*not* from the shape of the RHS (a 1D and a 2D RHS take the same path).
For the condensed Poisson operator that means a **direct
factorisation**: SuperLU (``scipy``, ``method="lu"``) on the CPU, and
cuDSS on the GPU (``method="cholesky"`` for an SPD matrix, ``"ldlt"``
for a symmetric-indefinite one). With a direct method, passing a 2D RHS
lets the backend **factor ``K`` once and back-substitute every
column** -- which is exactly the win you want here. The speedup over a
Python loop of independent solves is dramatic on both CPU and GPU, and
grows with ``n_batch`` until the back-substitution starts to dominate:

.. figure:: ../_static/user_guide/batched_workflows/batched_rhs_scaling.png
   :align: center
   :width: 80%

   Per-problem cost on a 3014-node Poisson mesh: a Python loop
   that calls :meth:`tensormesh.sparse.SparseMatrix.solve` ``n_batch`` times
   (dashed) versus a single 2D-RHS call (solid). On the GPU the
   speedup reaches ``225x`` at ``n_batch = 256``.

The lesson reads cleanly off the figure. The loop is exactly
``n_batch`` independent solves, so each tick on the x-axis
doubles its wall time. The 2D-RHS call factors ``K`` once and
amortises that cost across every column; on the GPU it stays
within :math:`\sim 15\%` of the single-RHS time up to
``n_batch = 256``.

.. note::

   The factor-once amortisation is a property of the **direct**
   backends. If an *iterative* method runs instead -- a very large GPU
   problem routes to native-PyTorch CG, or you pass ``method="cg"`` /
   ``"bicgstab"`` explicitly -- ``torch-sla`` loops over the columns of
   a 2D RHS internally, so there is no factorisation to reuse and the
   only gain over a Python loop is reduced per-call dispatch overhead.
   ``torch-sla`` does not yet cache iterative-solver state (a
   preconditioner or Krylov basis reused across right-hand sides); a
   caching iterative solver is planned for a future release. See
   :doc:`linear_solvers` for how the backend and method are chosen.

Throughput is the same data the other way around -- problems
solved per second:

.. figure:: ../_static/user_guide/batched_workflows/throughput.png
   :align: center
   :width: 80%

   Throughput in problems per second. The CPU path saturates the
   SuperLU back-substitution around :math:`5 \times 10^{3}` problems
   per second; the GPU keeps climbing to nearly
   :math:`3 \times 10^{4}` problems per second at
   ``n_batch = 1024``.

Condensation works the same way. :meth:`~tensormesh.Condenser.condense_rhs`
accepts ``[n_dof, ...]`` shapes -- the leading DOF axis is sliced
and the trailing axes pass through:

.. code-block:: python

   condenser = Condenser(mesh.boundary_mask)
   K_, _    = condenser(K, torch.zeros(mesh.n_points))
   B_inner = condenser.condense_rhs(B)             # [n_inner, 64]
   X_inner = K_.solve(B_inner)                     # [n_inner, 64]
   X_full  = condenser.recover(X_inner)            # [n_dof, 64]

This is the workhorse pattern for ML training-data generation and
for parametric studies where every problem shares the geometry.


Axis 3 -- Multi-problem datasets
--------------------------------

For ML workloads where you want many ``(input, output)`` pairs,
:mod:`tensormesh.dataset` bundles two things:

* :class:`~tensormesh.MeshGen` -- programmatic Gmsh-backed mesh
  generation from CSG primitives (rectangles, circles, holes, …).
* Multi-frequency equation classes that emit batched source terms
  with known analytical solutions, for supervised training:

  * :class:`~tensormesh.dataset.PoissonMultiFrequency` (and ``…3D``)
  * :class:`~tensormesh.dataset.HeatMultiFrequency`
  * :class:`~tensormesh.dataset.WaveMultiFrequency`
  * ``LinearElasticityMultiFrequency`` (and ``…3D``)

Each ``…MultiFrequency`` class samples random Fourier coefficients
for the source term and -- where applicable -- returns the
analytical solution at the same nodes, which is convenient for both
training and validation.

The standard "same mesh, varying source" recipe combines this
generator with the batched-RHS pattern from Axis 2:

.. code-block:: python

   import torch
   from tensormesh import Mesh, LaplaceElementAssembler, MassElementAssembler, Condenser
   from tensormesh.dataset import PoissonMultiFrequency

   device = "cuda" if torch.cuda.is_available() else "cpu"

   mesh      = Mesh.gen_rectangle(chara_length=0.02).to(device)
   K         = LaplaceElementAssembler.from_mesh(mesh)().double()
   M         = MassElementAssembler.from_mesh(mesh)().double()
   condenser = Condenser(mesh.boundary_mask)
   K_, _     = condenser(K, torch.zeros(mesh.n_points, device=device))

   # 1024 random Poisson source terms, all evaluated at the same nodes.
   eq = PoissonMultiFrequency(a=torch.rand(1024, 8, 8, device=device) * 2 - 1)
   f  = eq.source_term(mesh.points, domain="rectangle").double()  # [1024, n_points]
   b  = M @ f.T                                                   # [n_points, 1024]
   b_ = condenser.condense_rhs(b)                                 # [n_inner, 1024]
   u_ = K_.solve(b_)                                              # [n_inner, 1024]
   u  = condenser.recover(u_)                                     # [n_points, 1024]

On a single RTX PRO 6000 the loop above produces 1024 Poisson
solutions in roughly 34 ms (see the throughput figure above) --
about :math:`30\,\mu\mathrm{s}` per problem, dominated by the
cuDSS triangular back-substitution. The fully benchmarked driver, with timing
across many ``n_batch`` values, lives in
``examples/dataset/poisson/poisson_dataset.py``.

For *varying mesh* workflows, regenerate ``Mesh`` per problem and
pay the assembly cost each time -- there is no shortcut for that
case today.


.. _batched-not-built-in:

What's not built in
-------------------

There is **no** built-in vmap-style assembler that vectorises
``K`` construction across many parameter values (e.g. one ``K``
per realisation of a stochastic coefficient field). For now two
patterns work in user code:

* **Python loop** -- simplest, fine when assembly is cheap relative
  to solve.
* **``torch.vmap`` around the assembler call** -- works when the
  varying parameter enters via ``point_data`` or ``element_data``
  and the result is uniform in shape. Note that ``vmap``'s
  interaction with sparse output is fragile; test on small
  problems first.

A first-class batched-assembly API may show up in a future release;
the dataset workflow above covers the most common case (same mesh,
many sources) without it.


Reproducing the numbers
-----------------------

All three figures on this page come from a single script:

.. code-block:: bash

   python docs/scripts/batched_workflows_figures.py

It uses only the public API and runs on CPU alone (the GPU panels
are skipped on a CPU-only machine). On a CUDA host it writes
``batched_rhs_scaling.png``, ``throughput.png``, and
``memory_chunking.png`` into
``docs/source/_static/user_guide/batched_workflows/``.


What's next
-----------

* :doc:`linear_solvers` -- the backend / method auto-selection and the
  ``backend`` knobs that affect batched solves.
* :doc:`differentiability` -- backprop through a batched pipeline.
* :doc:`time_integration` -- transient problems where the inner
  solve at every time step is itself batched-RHS-shaped.