Kamiński, B.; Prałat, P.; Théberge, F. · 2021

Artificial Benchmark for Community Detection (ABCD)

Method

The Importance of Degree Sequences in Synthetic Graph Models

The degree sequence, the list consisting of the number of neighbours each node has, is arguably the most critical structural property bridging the gap between purely theoretical random graphs and real-world complex networks. In foundational models such as the Erdős-Rényi model [4, 3, 5], edges are formed with a uniform probability, resulting in a Poisson degree distribution where most nodes have roughly the same number of neighbours. However, real-world networks, from social networks and the internet to biological protein interactions, rarely look like this. Instead, they are typically characterized by heavy-tailed or "scale-free" degree distributions, often following a power law where the number of nodes of degree k is proportional to k^-γ for some constant γ that is typically between 2 and 3.

This mathematical phenomenon was first famously observed in the late 19th century by Italian economist Vilfredo Pareto, who noted that a tiny fraction of the population owned the vast majority of the wealth, a concept now widely known as the Pareto Principle or the 80/20 rule. When translated to network topology, this means the vast majority of nodes have very few connections, while a handful of vital "hub" nodes possess thousands. Because these hubs fundamentally dictate how the network behaves, serving as critical junctures for information flow or creating severe vulnerabilities to targeted attacks, synthetic models that fail to incorporate accurate, Pareto-like degree sequences will drastically misrepresent real-world dynamics.

Each member of the ABCD Family, starting from the original ABCD model [1], by default generates a power-law degree sequence with exponent γ, minimum degree δ, and maximum degree Δ. However, more advanced users can inject any feasible degree sequence of interest.

Pareto Principle.

Pareto Principle.


The Configuration Model

The Configuration Model [2] is arguably the most fundamental algorithm for generating a random graph with a specified degree sequence. It acts as the perfect "null model" because it strips away all complex network structures and leaves only the degree sequence intact. It will serve as a corner stone for the original ABCD model as well as other members of its family.

Here is exactly how the algorithm works:

  • Input. You begin with a specific degree sequence (k_1, k_2, ..., k_n), where k_i represents the degree of node i. Before generating the graph, there is one strict mathematical requirement: the sum of all degrees in the sequence must be an even number. Indeed, because every edge must have exactly two ends, an odd sum would leave one connection dangling. In fact, one can prove that Σ(i=1..n) k_i = 2m, where m is the total number of edges in the graph. This obvious necessary condition is not sufficient but, in practice (for sparse and large graphs), ensuring it holds is often enough to make sure the algorithm successfully generates the graph.
  • Assigning "Stubs". Imagine each node is a central hub with a specific number of open ports or "stubs" sticking out of it. Node i is assigned exactly k_i stubs. At this stage, you essentially have a bucket containing 2m unconnected stubs, each labelled with the node it belongs to.
  • Random Wiring. The algorithm proceeds by uniform random matching. In each step, reach into the bucket and randomly select two available stubs, tie them together to form an undirected edge between their respective nodes. Repeat this process until the bucket is empty and all stubs are paired.

Because the wiring process is entirely blind and random, the classic Configuration Model will inevitably create a few graphical anomalies:

  • Self-loops: two stubs from the same node are randomly picked and tied together,
  • Multi-edges: two nodes that are already connected end up having another pair of their stubs matched together, creating duplicate edges.

There are a few common fixes. However, we will not discuss them here as our ABCD model is more complex and will require slightly more complicated treatment anyway. We will come back to this problem soon.

Configuration Model in Action.

Configuration Model in Action.


The Role of Community Size Distributions

Just as the degree sequence is rarely uniform, the clustering of these nodes into distinct groups, known as community structure, also defies simple averages. In real-world networks, nodes tend to tightly knit themselves into dense modules or communities, such as distinct friend circles in a social network or specific functional pathways in a cellular system. Crucially, empirical research reveals that the distribution of these community sizes often mirrors the Pareto-like power law observed in degree sequences, that is, the number of communities of size s is proportional to s^-β for some constant β that is typically between 1 and 2. This means that a network will usually consist of a vast multitude of tiny, highly specialized communities alongside a very small number of massive, sprawling "mega-communities."

A high-quality synthetic model must reproduce this sequence of community sizes because failing to do so fundamentally distorts the macro-architecture of the network. If a synthetic model forces communities to be roughly uniform in size, a common limitation in early iterations of the popular Stochastic Block Model [6], it creates an artificial homogenization that fails to capture real-world dynamics. The presence of giant communities alters macroscopic processes like viral diffusion, rumour spreading, or consensus formation, acting as massive reservoirs that can trap or wildly accelerate information. Conversely, the "long tail" of microscopic communities often represents isolated, specialized niches that are highly resilient to global network shocks but vulnerable to local disruptions. By matching the sequence of community sizes, a synthetic model accurately replicates both the local, specialized interactions and the massive, overarching structures that govern the global behavior of the real-world system.

Similar to the degree sequence, from the very beginning, synthetic random graph models from the ABCD Family are able to generate power-law sequences of community sizes with exponent β, minimum value s, and maximum one S. As for the degree sequence, more advanced users can inject any feasible sequence instead.

Pareto Principle for Community Sizes.

Pareto Principle for Community Sizes.


Assigning Degrees to Nodes

At this stage of the process, we have generated two independent sequences: a degree sequence (k_1, k_2, ..., k_n) and a sequence of community sizes (s_1, s_2, ..., s_l). As with the Configuration Model, we can visualize the n nodes as hubs equipped with k_i open stubs. Separately, we have a collection of n available elements partitioned into l distinct communities, where community j contains exactly s_j elements. Our objective is to establish a one-to-one matching between the n nodes and the n community elements.

Ultimately, nodes will connect primarily to neighbours within their own community, a desired internal fraction governed by a mixing parameter xi (which we will discuss shortly). Because of this homophily constraint, we must ensure that high-degree nodes are not assigned to communities too small to absorb their intended internal edges. To balance this structural requirement with the need for maximum randomness, we employ the following assignment strategy. We first sort the nodes in decreasing order of their degrees. Processing them one by one, starting with the highest degree, we evaluate all communities large enough to validly support the current node i (with degree k_i). We then match node i uniformly at random to an available, unmatched element within one of those valid communities.

Assigning Degrees to Nodes.

Assigning Degrees to Nodes.


Creating Graphs

We now introduce the mixing parameter ξ ∈ [0,1], which serves as the primary mechanism in the ABCD model for controlling the level of noise; a larger value of ξ yields a noisier, more random synthetic network with weaker communities. Our goal is to retain an average fraction of 1-ξ edges strictly within their respective communities. To achieve this, each node i independently partitions its k_i stubs into community stubs and background stubs. Because the expected number of background stubs, ξ k_i, is rarely an exact integer, we apply randomized rounding to determine the precise integer count while strictly preserving the expected value. Finally, a parity check ensures that the total number of community stubs within any single community is even, a strict prerequisite for valid edge formation.

Once the stubs are successfully partitioned, the graph generation proceeds in two stages. First, for each community j, we take its s_j constituent nodes along with their designated community stubs and apply the Configuration Model to generate an internal community graph, G_j. Independently, all background stubs from across the entire network are pooled together and wired using the Configuration Model to form a global background graph, G_0. The final synthetic network is then constructed by superimposing all the community graphs (G_j, j ∈ {1, 2, ..., ℓ}) onto the background graph (G_0). This union produces a comprehensive network possessing both the pre-specified community structure and the exact desired level of noise.

Creating Graphs.

Creating Graphs.


Final Touch — Rewiring

Because the classic Configuration Model wires stubs together entirely at random, without looking at which nodes they belong to, it is "blind" to the graph's emerging topology. As we already mentioned above, this purely random matching inevitably leads to two types of anomalies that violate the rules of simple graphs: self-loops and multi-edges. A standard way to deal with this problem is to fix the loops and multi-edges by "rewiring" them with other edges. For example, if you have an unwanted self-loop uu and a normal edge xy, you break them both and rewire them to form ux and uy (provided this does not create new anomalies). This highly efficient solution preserves the exact degree sequence and is often the preferred method in complex network generation models.

In the context of the ABCD model, the predefined community structure introduces a critical constraint: a naive rewiring process could create more background edges and so more noisy network. To preserve the correct fraction of background edges, we employ a phased rewiring strategy. First, we resolve all self-loops and multi-edges, exclusively within each community graph G_j. This step guarantees that internal edges remain strictly localized within their respective communities.

Only after all community graphs have been rendered as valid, simple graphs do we proceed to rewire problematic edges within the global background graph G_0. Crucially, during this final phase, the algorithm must account for the fact that multi-edges can arise not just from within G_0 itself, but from the superposition of the background graph overlapping with the newly simplified community graphs.

Rewiring in the Configuration Model.

Rewiring in the Configuration Model.


Asymptotic Properties of the Model

Theoretical analysis of random graph models, particularly concerning their asymptotic properties as the number of nodes n → ∞, provides the mathematical foundation necessary to trust a synthetic generator's structural integrity. While computational simulations offer valuable empirical insights, they are inherently limited by finite network sizes and specific parameter choices. On the other hand, rigorous asymptotic results establish universal guarantees, proving that a model fundamentally possesses the desired topological features (such as exact degree distributions, clustering behaviour, or modularity) rather than just approximating them by chance in small samples. Furthermore, theoretical frameworks allow researchers to mathematically pinpoint phase transitions, revealing the precise parameter thresholds at which global network dynamics suddenly and dramatically shift.

Historically, proving these asymptotic limits has been a necessary milestone for validating the most influential network models. For instance, in foundational models such as the Erdős-Rényi model, theoretical analysis was required to prove the sudden emergence of a giant connected component and to establish that the edges, formed with uniform probability, result in a Poisson degree distribution. In the realm of network growth, models like preferential attachment rely on rigorous asymptotic proofs to guarantee that their "rich get richer" dynamic mechanisms genuinely converge to the heavy-tailed, power-law degree distributions. Similarly, for the Stochastic Block Model, a classic approach to generating community structure, asymptotic results were crucial for discovering the strict detectability threshold, which mathematically defines the exact level of structural noise at which communities become completely invisible to any clustering algorithm. Establishing similar rigorous theoretical results for the ABCD model ensures it stands alongside these foundational models as a mathematically sound and reliable benchmark.

The first paper we want to mention [8] analyzes the modularity function, arguably the most critical network property in the context of community detection. Modularity is frequently used to quantify the presence of community structure and serves as the primary quality function for many popular algorithms, including the widely used Louvain algorithm.

By design, the ABCD model ensures that a 1-ξ fraction of the edges are community edges, inherently belonging to the ground-truth partition 𝒞. The theoretical analysis demonstrates that only a negligible fraction of background edges align with this partition, meaning the modularity of the ground-truth partition 𝒞 is asymptotically equal to 1-ξ.

Analyzing the maximum modularity presents a much more complex challenge, yielding a dichotomy based on the noise level. When the structural noise is sufficiently high, the maximum modularity is asymptotically strictly greater than the modularity of the ground-truth partition 𝒞. This occurs because random fluctuations in noisy graphs inadvertently create spurious communities that outscore the actual underlying structure. This introduces a fundamental theoretical barrier: in this high-noise regime, no algorithm optimizing modularity function can successfully recover the ground truth. Conversely, in graphs with low noise, the ground-truth partition remains asymptotically optimal. While this strong structural signal does not guarantee that an algorithmic heuristic will successfully converge to it, these combined results clearly define the operational "sweet spot" for the noise parameter ξ, hinting where community detection algorithms can be meaningfully benchmarked.

The second paper we want to mention [7] investigates the properties of self-similarity and scale invariance within the ABCD model. While scale invariance in complex networks is traditionally associated with the scale-free nature of global node degrees, it also extends to the distributions of community sizes, distances, and network density. The paper demonstrates that the ABCD model inherently captures this self-similar behavior: each ground-truth community internally inherits a heavy-tailed, power-law degree distribution that reflects the degree distribution of the entire graph.

This structural self-similarity is not only a desirable feature for replicating real-world networks, but also yields important theoretical and practical implications. Analytically, it allows for the precise computation of the expected volume (the sum of internal degrees) for any given community. Furthermore, it enables a rigorous quantification of the self-loops and multi-edges naturally generated during the ABCD construction process.

Understanding the prevalence of these non-simple edges is critical for two distinct reasons. Computationally, resolving self-loops and multi-edges to produce a valid simple graph is a time-consuming bottleneck in the generation algorithm. Statistically, because the ABCD framework relies on multiple localized instantiations of the configuration model, the initial volume of self-loops and multi-edges directly dictates the extent of rewiring required. A higher frequency of these artifacts necessitates more aggressive rewiring, which subsequently "skews" the final ensemble of generated graphs, pushing the probability space further away from a purely uniform distribution.

Theoretical Results.

Theoretical Results.


References

References

  1. Kamiński, B., Prałat, P., and Théberge, F. (2021). Artificial Benchmark for Community Detection (ABCD) - Fast Random Graph Model with Community Structure. Network Science, 9(2), 153-178.
  2. Bollobás, B. (1980). A probabilistic proof of an asymptotic formula for the number of labelled regular graphs. European Journal of Combinatorics, 1(4), 311-316.
  3. Bollobás, B. (2011). Random graphs. In Modern graph theory (pp. 215-252). Springer.
  4. Janson, S., Łuczak, T., and Ruciński, A. (2011). Random graphs. John Wiley & Sons.
  5. Frieze, A., and Karoński, M. (2015). Introduction to random graphs. Cambridge University Press.
  6. Holland, P. W., Laskey, K. B., and Leinhardt, S. (1983). Stochastic blockmodels: First steps. Social Networks, 5(2), 109-137.
  7. Barrett, J., Kamiński, B., Prałat, P., and Théberge, F. (2025). Self-similarity of Communities of the ABCD Model. Theoretical Computer Science, 1026, 115012.
  8. Kamiński, B., Pankratz, B., Prałat, P., and Théberge, F. (2022). Modularity of the ABCD Random Graph Model with Community Structure. Journal of Complex Networks, 10(6), cnac050.
  9. Lancichinetti, A., Fortunato, S., and Radicchi, F. (2008). Benchmark graphs for testing community detection algorithms. Physical Review E, 78(4), 046110.

Reference: Paper link