Robust k-means++
A. Deshpande, P. Kacham, R. Pratap
Published by the Association for Uncertainty in Artificial Intelligence (AUAI)
2020
Pages: 819–828
Abstract
A good seeding or initialization of cluster centers for the k-means method is important from both theoretical and practical standpoints. The k-means objective is inherently non-robust and sensitive to outliers. A popular seeding method such as k-means++ [3], which is more likely to pick outliers in the worst case, may compound this drawback and thereby affect the quality of clustering on noisy data. For any 0 < δ ≤ 1, we show that using a mixture of D² sampling [3] and uniform sampling, we can pick O(k/δ) candidate centers with the following guarantee: they contain some k centers that give an O(1)-approximation to the optimal robust k-means solution while discarding at most δn more points than the outliers discarded by the optimal solution. That is, if the optimal solution discards its farthest βn points as outliers, our solution discards at most (β + δ)n points as outliers. The constant factor in our O(1)-approximation does not depend on δ. This improves over previous results for k-means with outliers based on LP relaxation and rounding [7] and local search [17]. The O(k/δ)-sized candidate set can be found in time O(ndk). Our robust k-means++ is also easily amenable to scalable, faster, parallel implementations of k-means++ [5]. Our empirical results compare this robust variant of k-means++ with the usual k-means++, uniform random seeding, threshold k-means++ [6], and local search on real-world and synthetic data.
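To make the abstract's core idea concrete, here is a minimal Python sketch of candidate selection via a mixture of D² sampling and uniform sampling, as the abstract describes. The function name `robust_kmeans_pp_candidates`, the 50/50 mixing weight, and the exact candidate count ⌈k/δ⌉ are illustrative assumptions, not the paper's precise algorithm or constants.

```python
import numpy as np

def robust_kmeans_pp_candidates(X, k, delta, seed=None):
    """Pick O(k/delta) candidate centers from data X (n x d).

    Sketch only: each round draws a center either uniformly at
    random or by D^2 sampling (probability proportional to squared
    distance to the nearest center chosen so far).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = int(np.ceil(k / delta))  # O(k/delta) candidates (assumed constant 1)
    idx = [rng.integers(n)]      # first center: uniformly random point
    # Squared distance of every point to its nearest chosen center.
    d2 = np.sum((X - X[idx[0]]) ** 2, axis=1)
    for _ in range(m - 1):
        total = d2.sum()
        if total == 0.0 or rng.random() < 0.5:
            # Uniform component of the mixture (also the fallback
            # when all points coincide with chosen centers).
            j = rng.integers(n)
        else:
            # D^2 component: sample with probability proportional
            # to squared distance, as in k-means++ seeding.
            j = rng.choice(n, p=d2 / total)
        idx.append(j)
        d2 = np.minimum(d2, np.sum((X - X[j]) ** 2, axis=1))
    return X[idx]
```

The per-round distance update is a single pass over the data, which is consistent with the O(ndk)-type running time the abstract claims; the guarantee in the paper concerns the existence of a good k-subset inside this candidate set, not the candidates themselves.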
About the journal
Journal: Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence, UAI 2020
Publisher: Association for Uncertainty in Artificial Intelligence (AUAI)