ALGORITHMIC GUARANTEES FOR HIERARCHICAL DATA GROUPING: INSIGHTS FROM AVERAGE LINKAGE, BISECTING K-MEANS, AND LOCAL SEARCH HEURISTICS
DOI: https://doi.org/10.55640/ijidml-v02i02-02

Keywords: Hierarchical clustering, Average linkage, Bisecting k-means, Local search heuristics

Abstract
Hierarchical data grouping plays a central role in diverse applications spanning bioinformatics, text mining, image segmentation, and customer behavior analysis. While a multitude of clustering algorithms have been proposed, including agglomerative techniques, divisive strategies, and heuristic optimizations, understanding their algorithmic guarantees and comparative performance remains an ongoing research challenge. This study provides a rigorous examination of the theoretical and empirical properties of three prominent approaches: average linkage clustering, bisecting k-means, and local search heuristics. We analyze their approximation bounds, convergence behaviors, and computational complexities under various objective functions, with particular emphasis on minimizing within-cluster variance and optimizing inter-cluster separation. Through formal proofs and experimental evaluation on benchmark datasets, we demonstrate that average linkage exhibits robust consistency and deterministic outcomes, though at the cost of higher computational overhead. In contrast, bisecting k-means provides scalable performance and favorable partitioning quality in high-dimensional settings, benefiting from recursive binary splitting. Local search heuristics offer flexible trade-offs between accuracy and efficiency, leveraging iterative refinement to escape suboptimal configurations. The findings underscore the importance of algorithm selection tailored to data characteristics and clustering objectives. This work contributes to a deeper understanding of the algorithmic guarantees associated with hierarchical data grouping and offers practical guidance for researchers and practitioners seeking principled, reliable clustering solutions.
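To make the recursive binary splitting mentioned above concrete, the sketch below gives a minimal, illustrative implementation of bisecting k-means in Python with NumPy. It is not the paper's implementation: the names two_means and bisecting_kmeans are invented for this example, and the 2-means step runs a few plain Lloyd iterations from a random start, so each split only approximately minimizes the within-cluster variance Σ_k Σ_{x∈C_k} ‖x − μ_k‖² on the subset being divided.

import numpy as np

def two_means(X, n_iter=20, seed=0):
    # A handful of plain Lloyd iterations with k = 2 (illustrative, untuned).
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign every point to the nearer of the two centers.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in (0, 1):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels == 0  # boolean mask for the "left" side of the split

def bisecting_kmeans(X, indices=None):
    # Recursive binary splitting: the hierarchy is returned as nested tuples,
    # with leaves holding the indices of the original points.
    if indices is None:
        indices = np.arange(len(X))
    if len(indices) <= 1:
        return indices.tolist()
    mask = two_means(X[indices])
    left, right = indices[mask], indices[~mask]
    if len(left) == 0 or len(right) == 0:
        return indices.tolist()  # degenerate split: stop recursing
    return (bisecting_kmeans(X, left), bisecting_kmeans(X, right))

# Toy example: two tight groups plus an outlier in the plane.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [9.0, 0.1]])
print(bisecting_kmeans(X))

Returning the tree as nested tuples keeps the example dependency-free. A production variant would typically choose which cluster to split next, for instance the one with the largest sum of squared errors, rather than recursing on both sides unconditionally.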