Variance reduction in feature hashing using MLE and control variate method
B.D. Verma, , M. Thakur
Published in Springer
2022
Volume: 111
Issue: 7
Pages: 2631-2662
Abstract
The feature hashing algorithm introduced by Weinberger et al. (2009) is a popular dimensionality reduction algorithm that compresses high-dimensional data points into low-dimensional data points whose pairwise inner products closely approximate those of the originals. This algorithm has been used in many fundamental machine learning applications, such as model compression (Chen et al. 2015), spam classification (Weinberger et al. 2009), compressing text classifiers (Joulin et al. 2016), and large-scale image classification (Mensink et al. 2012). However, a limitation of this approach is that the variance of its estimator for the inner product tends to be large for small values of the reduced dimension, making the estimate less reliable. We address this challenge and suggest two simple and practical solutions in this work. Our approach relies on the control variate (CV) method and the maximum likelihood estimator (MLE), which are popular variance reduction techniques in statistics. We show that these methods lead to significant variance reduction in inner product similarity estimation. We give theoretical bounds on this reduction and complement them with extensive experiments on synthetic and real-world datasets. Given the simplicity and effectiveness of our approach, we hope that it can be adapted in practice. © 2022, The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature.
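To make the baseline concrete, the following is a minimal sketch of the signed feature hashing estimator that the abstract refers to. It assumes NumPy and replaces a real hash function with explicitly stored random bucket and sign tables; the function names (`make_hash`, `hash_features`, `estimate_inner_product`) are hypothetical and not taken from the paper. It illustrates only the plain estimator whose variance grows as the reduced dimension `k` shrinks, not the CV or MLE corrections proposed in the paper.

```python
import numpy as np


def make_hash(d, k, rng):
    """Draw a bucket in [0, k) and a sign in {-1, +1} for each of the d input coordinates.

    Assumption: stored random tables stand in for the hash functions.
    """
    buckets = rng.integers(0, k, size=d)
    signs = rng.choice(np.array([-1.0, 1.0]), size=d)
    return buckets, signs


def hash_features(x, buckets, signs, k):
    """Compress x into k dimensions by adding each signed coordinate to its bucket."""
    out = np.zeros(k)
    # np.add.at accumulates correctly when several coordinates share a bucket.
    np.add.at(out, buckets, signs * x)
    return out


def estimate_inner_product(x, y, k, rng):
    """Estimate <x, y> from the two sketches, using the same hash for both vectors."""
    buckets, signs = make_hash(x.shape[0], k, rng)
    hx = hash_features(x, buckets, signs, k)
    hy = hash_features(y, buckets, signs, k)
    return float(hx @ hy)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)
    y = x + 0.5 * rng.normal(size=1000)
    for k in (16, 64, 256):
        estimates = [estimate_inner_product(x, y, k, rng) for _ in range(500)]
        # The spread of the estimates should shrink as k grows.
        print(k, float(x @ y), float(np.mean(estimates)), float(np.std(estimates)))
```

Running the script prints, for each `k`, the true inner product next to the empirical mean and standard deviation of the estimates, which gives a rough picture of the small-`k` variance problem the paper targets.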
About the journal
Journal: Machine Learning
Publisher: Springer
ISSN: 0885-6125