Distributed FEM =============== Three scripts in ``examples/distributed/`` cover the building blocks for multi-GPU FEM in TensorMesh: graph coloring on the element adjacency graph, spectral domain decomposition, and a single-GPU-vs-multi-GPU assembly benchmark on a structured tetrahedral cube. .. caution:: The distributed layer is still under active development. The building blocks below (coloring, partitioning, multi-GPU assembly) work and the benchmarks are real, but the public API is not yet stable and the :doc:`../user_guide/index` deliberately does not document it as a first-class workflow. The next step is to couple it with ``torch-sla``'s distributed linear solver, so that assembly *and* solve both run across multiple GPUs — the goal being to tackle very large-scale industrial problems that do not fit on a single device. Until then, treat the examples here as recipes to learn from, not as a stable interface. Element-graph coloring — ``graph_coloring.py`` ---------------------------------------------- Race-free atomic-add assembly on GPU needs a coloring of the element-adjacency graph: elements that touch a common DOF must not share a color, so each color class can be scattered into the global matrix in parallel. ``mesh.color()`` does this in one call, running a parallel Welsh-Powell-style heuristic on the GPU when available: .. code-block:: python :caption: examples/distributed/graph_coloring.py (essence) from tensormesh.dataset.mesh import gen_rectangle mesh = gen_rectangle(chara_length=1.0/50, element_type="tri") mesh.to(device) colors = mesh.color() # one int per element n_colors = colors.max().item() + 1 print(f"Used {n_colors} colors.") For a regular triangulation of the unit square the algorithm typically finds 6–8 colors — close to the optimal chromatic number for a 2D triangle adjacency graph. The visualization paints each element by its color class, which gives the familiar striped / banded pattern. .. figure:: /_static/distributed/graph_coloring_result.png :alt: Triangular element coloring with 7 color classes :width: 75% :align: center ``graph_coloring.py`` output: a triangular mesh of the unit square painted by element color class. The Welsh-Powell heuristic finds 7 colors here; elements sharing a color never share a node, so each color class can be assembled in parallel without write conflicts. Spectral domain decomposition — ``graph_partition.py`` ------------------------------------------------------ For multi-GPU work the mesh has to be split into ``n_parts`` balanced subdomains with small interfaces. ``partition_mesh`` with ``method="spectral"`` computes the Fiedler vector of the graph Laplacian and recursively bisects: .. code-block:: python :caption: examples/distributed/graph_partition.py (essence) from tensormesh.mesh.partition import partition_mesh submeshes = partition_mesh(mesh, n_parts=4, method="spectral") The returned list of :class:`~tensormesh.Mesh` objects each carries **ghost nodes** — the boundary nodes that belong to two subdomains get duplicated, with their original global IDs stored in ``submesh.point_data["orig_nid"]``. The script's exploded-view visualization separates the four subdomains in space and circles the shared interface nodes — the geometric picture you usually have to draw on a whiteboard to explain a domain decomposition. .. figure:: /_static/distributed/graph_partition_exploded.png :alt: 4-way domain decomposition with ghost nodes (exploded view) :width: 90% :align: center ``graph_partition.py`` output: a 4-way spectral partition of the unit square in exploded view. Each subdomain is rendered in its own color; the highlighted dots along each subdomain's shared boundary are the **ghost nodes** — duplicated copies of the interface vertices that each rank owns locally, with the original global IDs preserved in ``submesh.point_data["orig_nid"]``. Multi-GPU assembly benchmark — ``benchmark_assembly.py`` -------------------------------------------------------- The headline distributed-FEM benchmark. A structured tetrahedral cube of size :math:`(n+1)^3` points and :math:`5 n^3` tets is assembled (i) on a single GPU, (ii) across all visible GPUs via ``torch.multiprocessing``. The driver sweeps :math:`n` from a configurable starting size until either path runs out of memory. The interesting machinery lives in :class:`~tensormesh.distributed.DistributedMesh`, which wraps the spectral partition + per-subdomain ghost-node bookkeeping into a single object: .. code-block:: python :caption: examples/distributed/benchmark_assembly.py (essence) import torch.multiprocessing as mp from tensormesh.distributed import DistributedMesh def _mp_assemble_worker(rank, submesh, device_id, q_order, return_dict): device = torch.device(f"cuda:{device_id}") torch.cuda.set_device(device) submesh.to(device) asm = tm.LaplaceElementAssembler.from_mesh( submesh, quadrature_order=q_order ) K_local = asm() # …translate K_local.row/col through orig_nid into global indices… dmesh = DistributedMesh(mesh, num_partitions=num_partitions) manager = mp.Manager(); return_dict = manager.dict() procs = [ mp.Process(target=_mp_assemble_worker, args=(i, dmesh.submeshes[i], i, q_order, return_dict)) for i in range(num_partitions) ] for p in procs: p.start() for p in procs: p.join() A few details that matter: * **One process per GPU.** ``torch.multiprocessing`` with ``set_start_method("spawn")`` gives each process its own Python interpreter, sidestepping the GIL. Threading would not help on CPU-bound assembly. * **Ghost-node bookkeeping.** Each worker assembles a *local* ``SparseMatrix`` whose row/col are local DOF indices; the driver translates them through ``orig_nid`` into global indices before stitching the partial matrices together. The user-visible cost is a single GPU→CPU copy per worker. * **Per-card memory.** The benchmark's "mem save" column reports ``single_gpu_peak / max_per_card_peak`` — typically close to ``num_partitions`` for the assembly step itself, since each worker only stores its own subdomain. A representative slice of the output table: .. code-block:: text n | Points | Tets | 1-GPU time | 1-GPU mem | N-GPU wall N-GPU max_t N-GPU mem/card | Speedup Mem save ---------------------------------------------------------------------------------------------------------------- 30 | 29,791 | 135,000 | 0.42 s | 0.5 GB | 0.18 s 0.11 s 0.16 GB | 2.3x 3.1x 50 | 132,651 | 625,000 | 1.90 s | 2.3 GB | 0.55 s 0.34 s 0.62 GB | 3.5x 3.7x … … … … … … … … … … CLI: .. code-block:: bash python benchmark_assembly.py # auto-detect GPUs python benchmark_assembly.py --partitions 4 --start 30 --max 200 Running the examples -------------------- .. code-block:: bash cd examples/distributed python graph_coloring.py # writes graph_coloring_result.png python graph_partition.py # writes graph_partition_exploded.png python benchmark_assembly.py # prints benchmark table All three are lightweight and run on a single workstation; only ``benchmark_assembly.py`` benefits from multiple GPUs. What's next ----------- * :doc:`../user_guide/concepts` — the module map mentions :mod:`tensormesh.distributed` as a research path. * :doc:`../user_guide/batched_workflows` — single-GPU batching patterns that often suffice before reaching for multi-GPU. * :doc:`basics` — ``plot_mesh.py`` has the adjacency-graph visualizations these algorithms operate on.