An improved method of locality-sensitive hashing for scalable instance matching

Aydar, Mehmet; Ayvaz, Serkan

doi:10.1007/s10115-018-1199-5

An improved method of locality-sensitive hashing for scalable instance matching

Aydar M., Ayvaz S.

KNOWLEDGE AND INFORMATION SYSTEMS, cilt.58, sa.2, ss.275-294, 2019 (SCI-Expanded, Scopus)

Yayın Türü: Makale / Tam Makale
Cilt numarası: 58 Sayı: 2
Basım Tarihi: 2019
Doi Numarası: 10.1007/s10115-018-1199-5
Dergi Adı: KNOWLEDGE AND INFORMATION SYSTEMS
Derginin Tarandığı İndeksler: Science Citation Index Expanded (SCI-EXPANDED), Scopus
Sayfa Sayıları: ss.275-294
Anahtar Kelimeler: Scalability, Locality-sensitive hashing, Instance Matching, Instance Similarity, Candidate Pairs Generation
Yıldız Teknik Üniversitesi Adresli: Hayır

Özet

In this study, we propose a scalable approach for automatically identifying similar candidate instance pairs in very large datasets. Efficient candidate pair generation is an essential to many computational problems involving calculation of instance similarities. Calculating similarities of instances with a large number of properties and efficiently matching a large number of similar instances in a scalable way are two significant bottlenecks of candidate instance pair generation. In our approach, we utilize locality-sensitive hashing (LSH) technique to greatly improve the scalability of candidate instance pair generation. Based on the candidate similarity threshold, our algorithm automatically discovers the optimum number of hash functions in each band in LSH. Moreover, we evaluated the scalability of our approach and its effectiveness in instance matching task using real-world very large datasets.