Hacking Book | Free Online Hacking Learning



Posted by forbes at 2020-02-27

30 Jan 2014

Simhash is an algorithm used by Google to deal with massive text de duplication. Google products, you know. One of the most powerful points of simhash is to convert a document into a 64 bit byte, which is temporarily called a character word. Then, to judge the repetition, you only need to judge whether the distance of their character words is < n (generally, n is 3 according to experience), and then you can judge whether the two files are similar.


The generation of simhash values is illustrated as follows

It will take about three minutes to understand the figure and realize the simhash algorithm. It's very simple. Google products, simple and practical.

The algorithm process is as follows:

(feature, weight) feature_weight_pairs ,其中 fwn = ( , hash_weight_pairs feature_weight_pairs (hash,weight) bits_count = 6 hash_weight_pairs +weight -weight bits_count [13, 108, -22, -5, -32, 55] [13,108,-22,-5,-32,55] -> 110001

So far, how to move from a doc to a simhash value has been explained. But there's another important part,

"Calculation of Hamming distance of simhash value"

The Hamming distance between binary string a and binary string B is the number of 1 in binary after a XOR B.

A xor B

For example:

A = 100111; B = 101010; hamming_distance(A, B) = count_1(A xor B) = count_1(001101) = 3;

After we calculate the simhash value of all docs, we need to calculate whether doc A and DOC B are similar

Whether the Hamming distance between a and B is less than or equal to N, which is generally taken as 3 according to experience,

Simhash is essentially a locally sensitive hash, which is different from MD5 and the like. Because of its local sensitivity, we can use Hamming distance to measure the similarity of simhash value.

"Efficient calculation of number of 1 in binary sequence"

/* src/Simhasher.hpp */ bool isEqual(uint64_t lhs, uint64_t rhs, unsigned short n = 3) { unsigned short cnt = 0; lhs ^= rhs; while(lhs && cnt <= n) { lhs &= lhs - 1; cnt++; } if(cnt <= n) { return true; } return false; }

If calculated by the above function, the time complexity is O (n); the default value of N here is 3. This shows that it is quite efficient.

"Implementation of O (1) algorithm to calculate the number of 1 in binary sequence"

Thank you @ schatwang for your comments:

Thank you for the simhash library. It will be very convenient. There are various implementations of O (1) for finding the number of 1 in binary. Please refer to this place: http://stackoverflow.com/a/14682688

Project implemented by simhash

Mainly for Chinese documents, that is to say, before simhash for this project, word segmentation and keyword extraction are also carried out.

Compare other algorithms

"Baidu's de duplication algorithm"

Baidu's de recalculation method is the simplest, which is to directly find out the longest n sentences of this article and do a hash signature. N is generally taken as 3. It is said that the accuracy and recall rate can reach more than 80%.

"Shine algorithm"

The principle of shingle is a little complicated, not detailed. I think the algorithm is too academic, not friendly to engineering implementation, too slow, and basically unable to deal with massive data.

Other algorithms

See the discussion on Weibo

Reference resources

Reprint please indicate the source: the principle and implementation of simhash algorithm