# Optimising a Graph Clustering Algorithm

**7 May 2024**

Recently, I was tasked to get the highest performance possible on the miniVite microbenchmark. MiniVite is a slimmed-down version of Vite which implements the Louvain method for community detection and graph clustering.

The Louvain method is composed of two phases and miniVite only implements the first phase of the algorithm. This phase aggregates vertices into communities by locally optimising the modularity metric which measures how well a graph is clustered into communities. The algorithm works by incremental improvement, at each step it moves a vertex into the community that leads to the largest increase in the modularity metric.

I was working on Arm hardware so the first course of action was to use the armclang compiler and the Arm Performance Library implementation of BLAS which already gave us a speed up.

The miniVite code contains some `TODO`

comments so these
were the first targets for tinkering with the code. The first of which
concerned using TBB’s `concurrent_vector`

and the second of which concerned parallelising a for loop wih OpenMP.

Algorithmically, we found two enhancements on a different branch which we were able to incorporate into our local build. The first enhancement involved implementing an on-stack hashmap for small graph segments: this is in contrast to the dynamically allocated hashmap that is used by default.

The second enhancement involved calculating the increase in modularity differently. For every vertex, the Louvain method sums values for each of the vertex’s neighbours and aggregates these values based on each of the neighbouring vertices’ current community.

Naively, we can implement this by summing values in a hashmap where the key is a hash for the neighbouring vertex’s community. However, for small graph segments we can sort the adjacency array by community so that vertices in the same community become adjacent in contiguous blocks. We can now find the community that maximises the increase in the modularity metric by performing a linear scan across the array which eliminates the random access from the previous implementation which is much better for the pre-fetcher.

Putting everything together, I achieved about a 10x performance boost for the selected problem instances. We had the best results from this challenge and the entire competition was so much fun.