force-directed at scale: barnes-hut and multilevel

2026-05-28

part 1 ended with a warning. the naive FR inner loop iterates over every pair of nodes to compute repulsion: O(n²) per iteration. at 100 nodes that's 10,000 pair evaluations per step. at 1,000 nodes it's 1,000,000. at 10,000 nodes it's 100,000,000, multiplied by however many iterations the simulation needs to converge (typically 100 to 300). ten billion operations for one layout. on modern hardware, that takes minutes. for interactive use, it's a wall.

this post covers the two tricks that made force-directed practical above that threshold. the first is the Barnes-Hut approximation: build a quadtree over the node positions each iteration and skip computing forces from clusters that are far enough away to treat as a single point. that reduces repulsion from O(n²) to O(n log n). the second is multilevel coarsening: instead of simulating a large graph directly, compress it into a small one, solve the small problem, then gradually expand the solution back up. that reduces the effective problem size so dramatically that even O(n log n) per iteration stops being the bottleneck.

the quadtree: spatial subdivision

the core observation behind Barnes-Hut is this: if you're computing the repulsive force on node A from a group of nodes that are all far away, the individual positions of those nodes don't matter much. what matters is the aggregate: how many nodes are there, and where is their center of mass. replace the entire group with a single "super-particle" sitting at the center of mass with the combined weight, and the force calculation gives roughly the same answer. the farther the group, the better the approximation.

to make this systematic, you need a data structure that organizes nodes by spatial proximity and stores aggregate information at each level. that's the quadtree.

a quadtree recursively subdivides 2D space into quadrants. start with a bounding box that contains all nodes. if a cell contains more than one node, split it into four equal quadrants (top-left, top-right, bottom-left, bottom-right). recurse on each quadrant. stop when each leaf cell contains at most one node. every internal cell stores the total number of nodes it contains and the position of their center of mass.

here is a small example. six nodes (labeled 1 through 6) in a 100x100 bounding box:

(0,0)                            (100,0)
 +--------------------------------+
 |         2                      |
 |                                |
 |   1                  5    6    |
 |                                |
 |                                |
 |        3                       |
 |                    4           |
 |                                |
 +--------------------------------+
(0,100)                          (100,100)

the root cell contains all 6 nodes. split it into four quadrants:

 +---------------+---------------+
 |               |               |
 |  TL: 1, 2, 3  |  TR: 5, 6     |
 |               |               |
 +---------------+---------------+
 |               |               |
 |  BL: (empty)  |  BR: 4        |
 |               |               |
 +---------------+---------------+

TL has three nodes so it subdivides again. TR has two nodes so it subdivides. BR has one node and becomes a leaf. BL is empty:

TL quadrant (0..50, 0..50) splits into:
  TL-TL: node 2
  TL-TR: (empty)
  TL-BL: node 1
  TL-BR: node 3

TR quadrant (50..100, 0..50) splits into:
  TR-TL: (empty)
  TR-TR: node 6
  TR-BL: node 5
  TR-BR: (empty)

the final tree has all six nodes in leaf cells. each internal cell stores aggregate mass and center-of-mass position:

  • root: mass 6, center at the mean of all six positions
  • TL: mass 3, center at mean of nodes 1, 2, 3
  • TR: mass 2, center at mean of nodes 5, 6

those aggregates are what Barnes-Hut uses.

the barnes-hut walk

computing repulsion on a single node proceeds as a depth-first walk over the quadtree. at each cell, you decide: approximate the whole cell as one super-particle, or descend into its children.

the decision uses the s/d ratio: s is the cell's width, d is the distance from the cell's center of mass to the query node. if s/d is small, the cell is far away relative to its size, and the approximation is good. if s/d is large, you're too close to the cell for the aggregate to be accurate, so you descend.

the threshold is the parameter theta (written θ). if s/d < θ, approximate. otherwise, descend.

in pseudocode:

function barnesHutForce(node, cell) {
  // base cases
  if (cell is empty) return { fx: 0, fy: 0 };
  if (cell is a leaf containing only node) return { fx: 0, fy: 0 };

  const dx = cell.centerOfMass.x - node.x;
  const dy = cell.centerOfMass.y - node.y;
  const d = Math.sqrt(dx * dx + dy * dy) || 0.001;
  const s = cell.width;

  if (s / d < theta || cell is a leaf) {
    // approximate: treat cell as a single particle
    const force = repulsionMagnitude(cell.totalMass, d);
    return {
      fx: -(dx / d) * force,  // repulsion pushes away from cell
      fy: -(dy / d) * force,
    };
  }

  // descend: sum forces from each child
  let fx = 0, fy = 0;
  for (const child of cell.children) {
    const f = barnesHutForce(node, child);
    fx += f.fx;
    fy += f.fy;
  }
  return { fx, fy };
}

the key branch: if (s / d < theta || cell is a leaf). when s/d is below the threshold, you compute one force and return, skipping all the nodes inside the cell. that's the approximation doing work. when you're close, you recurse into children until you reach leaves, where you compute exact forces against individual nodes.

to run the full force computation for one iteration: for each node, call barnesHutForce(node, root). the result is the total repulsive force on that node from all other nodes (approximately).

the s/d criterion in numbers

take node A at position (10, 10) and a distant cluster of 20 nodes packed into a cell from (70, 70) to (90, 90). the cell center of mass is around (80, 80). s = 20 (cell width). d = sqrt((80-10)^2 + (80-10)^2) = sqrt(70^2 + 70^2) ≈ 99.

A (10,10)                         cluster cell (70..90, 70..90)
  *                                  +--------+
                                     |  · · · |
                                     |  · · · |  s = 20
                                     |  · · · |
                                     +--------+
                                     ^
                              center of mass (80,80)

  d = distance(A, center) ≈ 99
  s = cell width = 20
  s/d = 20/99 ≈ 0.20

s/d ≈ 0.20. with θ = 0.5, the criterion s/d < θ holds: 0.20 < 0.5. the entire cluster is approximated as one super-particle at (80, 80) with mass 20. instead of 20 individual repulsion calculations, you do one. the relative error from the approximation is roughly proportional to (s/d)^2, which here is about 4%. for a layout algorithm running 200 iterations, that error is swamped by the temperature schedule and the non-convexity of the energy landscape.

now put A at (65, 65), right at the edge of that same cluster. d = sqrt(15^2 + 15^2) ≈ 21. s/d = 20/21 ≈ 0.95. with θ = 0.5, the criterion fails. the algorithm descends into the cluster's children to get better resolution.

the theta knob: accuracy versus speed

θ controls the trade-off. smaller θ means the approximation is used only for cells that are very far away relative to their size, which is more conservative and more accurate. larger θ means more cells get approximated, which is faster but less accurate.

at θ = 0, the criterion s/d < 0 never holds (since s/d >= 0 always). every cell is always descended into all the way to the leaf level. that is exact O(n²) force computation; Barnes-Hut degenerates to brute force.

at θ = infinity, every cell satisfies the criterion immediately. the algorithm approximates the entire graph as one super-particle on the very first cell check. the force on every node is computed from a single aggregate. this is O(n) but useless: every node experiences the same force from the same super-particle, and the layout collapses.

the practical sweet spot is θ ≈ 0.5. d3-force uses theta2 = 0.81 (that's θ = 0.9 squared, since the code checks w * w / theta2 < l to avoid the square root). at these values, typical graphs see a 10x to 100x speedup over brute force with error that is invisible in the final layout.

you can tune θ per-run: use a higher θ early in the simulation when the layout is far from convergence (big moves anyway; approximation error doesn't matter), and reduce θ in later iterations when precision in force directions starts to matter for fine placement. d3-force doesn't do this automatically, but it's straightforward to add.

complexity

quadtree construction: inserting n nodes into an initially empty quadtree is O(n log n) in the average case. each insertion traverses the tree from root to leaf; the expected depth is O(log n) for a uniformly distributed point set.

force computation with Barnes-Hut: for each of the n nodes, the walk touches O(log n) cells on average. total: O(n log n) per iteration.

the total cost per iteration is O(n log n) for construction plus O(n log n) for the force walk. both terms are the same order, so the combined cost is O(n log n).

the quadtree is rebuilt every iteration. that's correct: node positions change each iteration, so the tree from the previous iteration is stale. the O(n log n) rebuild cost is paid on every step.

worst case. if all n nodes are clustered in a tiny area with one node far away, the cell containing the cluster is never approximated from the perspective of the distant node (because d is large, so s/d is small, so the criterion fires; but the distant node is far from the cluster so the criterion fires in its favor, not the cluster's favor). actually the nasty case is the reverse: many nodes packed densely means many cells subdivide deeply, so the depth of the quadtree grows beyond O(log n). in the extreme, n coincident nodes produce a degenerate chain of single-child cells n levels deep. d3-quadtree handles coincident nodes via linked list chaining at the leaf rather than infinite subdivision, but it's a good reminder that O(n log n) is an average-case statement, not a worst-case guarantee.

failure modes

theta too large. at θ = 1.5 or higher, even moderately close cells get approximated. the force on nodes near cluster boundaries becomes directionally wrong: the algorithm thinks the cluster's repulsion is centered at its center of mass, but some nodes in the cluster are much closer on one side. the layout develops visible artifacts: nodes that should be pushed into a cluster boundary instead slide along it or clip through.

very dense regions. a sub-region with 10,000 nodes packed into a small area still requires deep tree descent for nodes inside that region, because the local cells are small and s/d is large. Barnes-Hut only saves work for nodes far from a cluster. within a dense cluster, you pay close to O(n²) in the local neighborhood. this is correct behavior, not a bug, but it means Barnes-Hut's speedup is most dramatic for sparse graphs and less dramatic for graphs with highly uneven density.

deep quadtrees from nearly-coincident nodes. two nodes at positions (50.000, 50.000) and (50.001, 50.000) require repeated subdivision until the cells are narrow enough to separate them. at double precision, that's up to 52 levels of subdivision. the tree is correct but expensive to build and walk. d3-quadtree mitigates this with the coincident-node chaining, but nearly-coincident nodes still produce deep local subtrees.

reading the source: d3-quadtree

d3-quadtree at commit d01b12a0 is the cleanest available implementation of the data structure. the insertion function in src/add.js shows the mechanics directly.

function add(tree, x, y, d) {
  if (isNaN(x) || isNaN(y)) return tree; // ignore invalid points

  var parent,
      node = tree._root,
      leaf = {data: d},
      x0 = tree._x0,
      y0 = tree._y0,
      x1 = tree._x1,
      y1 = tree._y1,
      xm, ym, xp, yp, right, bottom, i, j;

  // If the tree is empty, initialize the root as a leaf.
  if (!node) return tree._root = leaf, tree;

  // Find the existing leaf for the new point, or add it.
  while (node.length) {
    if (right = x >= (xm = (x0 + x1) / 2)) x0 = xm; else x1 = xm;
    if (bottom = y >= (ym = (y0 + y1) / 2)) y0 = ym; else y1 = ym;
    if (parent = node, !(node = node[i = bottom << 1 | right])) return parent[i] = leaf, tree;
  }

  // Is the new point exactly coincident with the existing point?
  xp = +tree._x.call(null, node.data);
  yp = +tree._y.call(null, node.data);
  if (x === xp && y === yp) return leaf.next = node, parent ? parent[i] = leaf : tree._root = leaf, tree;

  // Otherwise, split the leaf node until the old and new point are separated.
  do {
    parent = parent ? parent[i] = new Array(4) : tree._root = new Array(4);
    if (right = x >= (xm = (x0 + x1) / 2)) x0 = xm; else x1 = xm;
    if (bottom = y >= (ym = (y0 + y1) / 2)) y0 = ym; else y1 = ym;
  } while ((i = bottom << 1 | right) === (j = (yp >= ym) << 1 | (xp >= xm)));
  return parent[j] = node, parent[i] = leaf, tree;
}

three things this code shows that the pseudocode doesn't. first, an internal node is a 4-element array indexed by bottom << 1 | right. the quadrant encoding is: 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right. second, the while (node.length) loop descends the existing tree to find where the new point belongs; node.length is 4 for internal nodes and undefined for leaf objects. third, when a new point arrives at a leaf, the code doesn't immediately split: it first checks if the two points are exactly coincident (handling that via linked list chaining with leaf.next). only non-coincident collisions trigger the do...while split loop.

the visit.js function drives the depth-first walk that Barnes-Hut uses:

export default function(callback) {
  var quads = [], q, node = this._root, child, x0, y0, x1, y1;
  if (node) quads.push(new Quad(node, this._x0, this._y0, this._x1, this._y1));
  while (q = quads.pop()) {
    if (!callback(node = q.node, x0 = q.x0, y0 = q.y0, x1 = q.x1, y1 = q.y1) && node.length) {
      var xm = (x0 + x1) / 2, ym = (y0 + y1) / 2;
      if (child = node[3]) quads.push(new Quad(child, xm, ym, x1, y1));
      if (child = node[2]) quads.push(new Quad(child, x0, ym, xm, y1));
      if (child = node[1]) quads.push(new Quad(child, xm, y0, x1, ym));
      if (child = node[0]) quads.push(new Quad(child, x0, y0, xm, ym));
    }
  }
  return this;
}

the callback return value is the pruning control. if the callback returns true, the children of that cell are not pushed onto the stack. that's the approximation branch: when s/d < θ, return true, and visit skips descending. when s/d >= θ, return false, and visit pushes the four children. the d3-force charge force passes exactly this callback in the apply function of src/manyBody.js.

function apply(quad, x1, _, x2) {
  if (!quad.value) return true;

  var x = quad.x - node.x,
      y = quad.y - node.y,
      w = x2 - x1,
      l = x * x + y * y;

  // Apply the Barnes-Hut approximation if possible.
  if (w * w / theta2 < l) {
    if (l < distanceMax2) {
      if (x === 0) x = jiggle(random), l += x * x;
      if (y === 0) y = jiggle(random), l += y * y;
      if (l < distanceMin2) l = Math.sqrt(distanceMin2 * l);
      node.vx += x * quad.value * alpha / l;
      node.vy += y * quad.value * alpha / l;
    }
    return true;  // prune: don't descend into this cell's children
  }

  // Otherwise, process points directly.
  else if (quad.length || l >= distanceMax2) return;

  if (quad.data !== node || quad.next) {
    if (x === 0) x = jiggle(random), l += x * x;
    if (y === 0) y = jiggle(random), l += y * y;
    if (l < distanceMin2) l = Math.sqrt(distanceMin2 * l);
  }

  do if (quad.data !== node) {
    w = strengths[quad.data.index] * alpha / l;
    node.vx += x * w;
    node.vy += y * w;
  } while (quad = quad.next);
}

the Barnes-Hut criterion: w * w / theta2 < l. w is the cell width (x2 - x1), l is the squared distance from the cell's center of mass to the query node. theta2 is θ². the check is w² / θ² < d², which is equivalent to w/d < θ, i.e., s/d < θ. the squared form avoids a square root on the hot path.

accumulate (lines 29-53, not shown) is the post-order pass that computes quad.value (total charge) and quad.x, quad.y (center of mass) for each internal node. d3 runs visitAfter(accumulate) once before the per-node visit(apply) calls.

coarsening: solve the small problem first

Barnes-Hut gets you to O(n log n) per iteration. for 100,000 nodes that's still 1.7 million operations per step times 200 iterations: 340 million total. feasible in seconds on modern hardware, but barely. for 1 million nodes, the numbers no longer cooperate.

multilevel coarsening approaches the problem from a different angle. instead of making each iteration faster, reduce the number of nodes. compress the graph into a smaller approximation, run a complete layout on the small graph, then expand the solution back to the full graph.

the algorithm has three phases: coarsening, initial layout, and refinement.

coarsening. repeatedly merge pairs of nodes until the graph is small enough to lay out cheaply (typically a few hundred nodes). each merge combines two nodes into a "super-node" with a position equal to the weighted average of its children's positions and a weight equal to their combined weight. edges between super-nodes are inherited: if node A was connected to node C, and A was merged into super-node AB, then AB is connected to C.

here is a 10-node graph going through two levels of coarsening:

level 0 (original, 10 nodes):
  1 - 2 - 3 - 4 - 5
  |           |
  6 - 7   8 - 9 - 10

level 1 (5 super-nodes, after matching 1-2, 3-4, 5-9, 6-7, 8-10):
  [12] - [34] - [59]
          |      |
         [67]  [810]

level 2 (2 super-nodes, after matching [12]-[34] and [59]-[810]):
  [1234] - [59810]
               |
             [67]

at level 2, the graph has 3 super-nodes (actually 3 in this case because [67] didn't get merged). a force-directed layout on 3 nodes converges in under 10 iterations. each super-node's position encodes the approximate center of its children.

which pairs to merge. two common strategies.

Walshaw (2003) uses maximal heavy-edge matching: for each node, find its highest-weight edge, pair up those endpoints, and mark them as merged. edges with no unmatched partner stay as singletons at this level. "heavy" usually means highest edge weight; for unweighted graphs, edges are chosen to minimize the inter-cluster connectivity of the coarsened graph (heuristically, merge nodes that are tightly connected).

Hu (2005) uses a maximal independent vertex set (MIVS): select a maximal set of nodes such that no two selected nodes are adjacent. every non-selected node is merged with one of its selected neighbors. the MIVS approach tends to produce more balanced super-nodes because every node either is in the MIVS or is directly adjacent to an MIVS node, guaranteeing at most two levels of hierarchy in each merge.

initial layout. at the coarsest level, run force-directed layout on the compressed graph. with O(1) to O(hundreds) nodes, this is cheap and converges fast. the resulting positions are the seeds for refinement.

refinement (uncoarsening). step back up the coarsening ladder, one level at a time. at each level, expand the super-nodes back into their children: each child inherits the super-node's position plus a small random perturbation. run a small number of force-directed iterations (5 to 20) to refine the layout at this scale. then expand again and refine. continue until you reach the original graph.

the total cost is a sum of work at each level. the coarsest level has n/2^k nodes (where k is the number of coarsening steps). work at each level is O(m_i log n_i) where m_i and n_i are the edge and node counts at level i. summing across levels, the total work is O(m log n) for the overall algorithm, which is the same asymptotic complexity as a single Barnes-Hut iteration on the full graph, but you also saved all the iterations that would have been needed on the full graph.

why multilevel works: the multigrid analogy

the trick works because force-directed at the fine level is dominated by short-range forces. look at the energy landscape of a large force-directed graph: the long-range structure (which cluster is on which side) is determined by the large-scale topology. the short-range structure (how nodes within a cluster arrange themselves) is determined by local edge connections.

force-directed on the full graph has to do both at once. but long-range structure changes slowly: a cluster of 1,000 nodes that needs to move from one side of the drawing to the other will take many iterations because each node's velocity is clamped and force computation is dominated by local neighbors. coarsening collapses that cluster into a few super-nodes and lets it move in one step of the coarse-level layout.

this is the multigrid strategy from numerical linear algebra. when solving an elliptic PDE on a grid, fine-grid iterations are efficient at reducing high-frequency error (small-scale oscillations) but slow at reducing low-frequency error (large-scale error modes). coarse-grid solvers handle low-frequency error cheaply because the coarse representation has few degrees of freedom. the full multigrid cycle alternates: fix low-frequency error on coarse grids, refine high-frequency error on fine grids.

in graph layout: long-range placement (which cluster sits where) is the low-frequency component. coarse levels handle it cheaply. local arrangement within a cluster is the high-frequency component. fine-level refinement handles it efficiently. the result is that you get a good global structure from the coarse layout, and correct local detail from the refinement passes, without paying for the global convergence cost at the fine level.

in practice, sfdp (Graphviz's fast force-directed engine) and OGDF's FM³ both implement this strategy and routinely handle graphs with 100,000 to 1,000,000 nodes in seconds. d3-force does not implement multilevel coarsening: it is Barnes-Hut only. for browser-based graphs above about 10,000 nodes, d3-force is not the right tool. sfdp or FM³ output positions as data and d3 renders them, which is a more practical split.

Barnes, J., and Hut, P. (1986). "A hierarchical O(N log N) force-calculation algorithm." Nature 324, pp. 446-449. DOI: 10.1038/324446a0. four pages. the original paper is about astrophysical n-body simulation, not graph layout. the connection to graph layout is that repulsive force between nodes is mathematically identical to gravitational force between particles (both are inverse-distance laws). everything in this post's Barnes-Hut section is a direct adaptation of the 1986 algorithm. worth reading to understand the original context and to see how compact the idea is.

Walshaw, C. (2003). "A Multilevel Algorithm for Force-Directed Graph Drawing." Journal of Graph Algorithms and Applications 7(3), pp. 253-285. DOI: 10.7155/jgaa.00070. the paper that adapted multilevel methods to graph layout specifically. introduces the heavy-edge matching coarsening strategy. the refinement schedule and the interaction between Barnes-Hut (used at every level) and multilevel coarsening are described in detail. the complexity analysis is careful.

Hu, Y. (2005). "Efficient and High Quality Force-Directed Graph Drawing." The Mathematica Journal 10(1), pp. 37-71. PDF: https://www2.cs.arizona.edu/~kobourov/efficient_high-quality_force-directed_graph_drawing.pdf. the most readable multilevel reference. uses maximal independent vertex set for coarsening instead of edge matching. the paper includes an analysis of why MIVS produces better-balanced coarsening and a comparison with sfdp. the PDF is freely available. if you read one paper from this list, make it this one.

canonical implementations: sfdp is part of Graphviz (C, LGPL, hosted on GitLab at gitlab.com/graphviz/graphviz). FM³ is part of OGDF (C++, GPL, at github.com/ogdf/ogdf). both are production-quality implementations of Barnes-Hut plus multilevel coarsening. reading sfdp's source alongside Hu 2005 is the fastest path to a working implementation.

next: part 3 covers the Sugiyama framework for layered DAG layouts. where force-directed is iterative and emergent, Sugiyama is a four-phase pipeline: cycle removal, layer assignment, crossing reduction, coordinate assignment. each phase has a known algorithm with known complexity and failure modes.