Efficient binary embedding of categorical data using BinSketch
B.D. Verma, , D. Bera
Published in Springer
2022
Volume: 36
Issue: 2
Pages: 537 - 565
Abstract
In this work, we present a dimensionality reduction algorithm, also known as sketching, for categorical datasets. Our proposed sketching algorithm Cabin constructs low-dimensional binary sketches from high-dimensional categorical vectors, and our distance estimation algorithm Cham computes a close approximation of the Hamming distance between any two original vectors using only their sketches. The minimum sketch dimension that Cham requires for a good estimate depends, in theory, only on the sparsity of the data points, which makes it useful for many real-life scenarios involving sparse datasets. We present a rigorous theoretical analysis of our approach and supplement it with extensive experiments on several high-dimensional real-world datasets, including one with over a million dimensions. We show that the Cabin and Cham duo is a fast and accurate approach for tasks such as RMSE, all-pair similarity, and clustering when compared to working with the full dataset and to other dimensionality reduction techniques. © 2021, The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature.
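To make the idea of estimating Hamming distance from binary sketches concrete, here is a minimal Python sketch. It is not the paper's Cabin or Cham: it assumes sparse binary vectors (given as sets of nonzero indices), skips the categorical-to-binary step entirely, and uses a generic OR-aggregation sketch with a standard occupancy-based count estimate. All function names and parameters (`make_bucket_map`, `sketch`, `estimate_count`, `estimate_hamming`, `sketch_dim`) are hypothetical and chosen only for illustration.

```python
import math
import random


def make_bucket_map(dim, sketch_dim, seed=0):
    """Assign each original dimension to a random sketch position (assumed scheme)."""
    rng = random.Random(seed)
    return [rng.randrange(sketch_dim) for _ in range(dim)]


def sketch(nonzero_indices, bucket_map):
    """OR-aggregate: a sketch bit is set if any nonzero index maps to it.

    The sketch is returned as the set of positions whose bit is 1.
    """
    return {bucket_map[i] for i in nonzero_indices}


def estimate_count(bits_set, sketch_dim):
    """Estimate how many nonzeros produced `bits_set` set bits.

    Inverts the occupancy formula E[bits_set] = N * (1 - (1 - 1/N)**k).
    """
    if bits_set >= sketch_dim:
        return float("inf")  # sketch is saturated; no finite estimate
    return math.log(1 - bits_set / sketch_dim) / math.log(1 - 1 / sketch_dim)


def estimate_hamming(sketch_a, sketch_b, sketch_dim):
    """Estimate Hamming distance of two binary vectors from their sketches.

    Uses Hamming(a, b) = 2*|a OR b| - |a| - |b|, and the fact that the
    OR of two sketches equals the sketch of the OR of the two vectors.
    """
    n_a = estimate_count(len(sketch_a), sketch_dim)
    n_b = estimate_count(len(sketch_b), sketch_dim)
    n_union = estimate_count(len(sketch_a | sketch_b), sketch_dim)
    return 2 * n_union - n_a - n_b


# Small demonstration on two sparse vectors in a 100,000-dimensional space.
DIM, SKETCH_DIM = 100_000, 2_000
bucket_map = make_bucket_map(DIM, SKETCH_DIM, seed=42)
indices = random.Random(7).sample(range(DIM), 70)
a = set(indices[:50])    # 50 nonzeros
b = set(indices[20:])    # 50 nonzeros, 30 shared with a
true_hamming = len(a ^ b)
approx = estimate_hamming(sketch(a, bucket_map), sketch(b, bucket_map), SKETCH_DIM)
print(f"true Hamming distance: {true_hamming}, estimate from sketches: {approx:.1f}")
```

In this toy setup the sketch dimension is chosen relative to the number of nonzeros rather than the original dimension, which mirrors the abstract's point that the required sketch size depends on sparsity. The actual construction and guarantees are those given in the paper, not the ones assumed here.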
About the journal
Journal: Data Mining and Knowledge Discovery
Publisher: Springer
ISSN: 1384-5810