30 Jan 2014

Simhash is an algorithm used by Google to deal with massive text de duplication. Google products, you know. One of the most powerful points of simhash is to convert a document into a 64 bit byte, which is temporarily called a character word. Then, to judge the repetition, you only need to judge whether the distance of their character words is < n (generally, n is 3 according to experience), and then you can judge whether the two files are similar.

## principle

The generation of simhash values is illustrated as follows

It will take about three minutes to understand the figure and realize the simhash algorithm. It's very simple. Google products, simple and practical.

The algorithm process is as follows:

- Doc is used to extract keywords (including segmentation and calculation of weights), and N pairs (keywords, weights) are extracted, that is, features, weights in the graph. Record as feature "weight" pairs = [fw1, FW2 FWN], where FWN = (feature_n, weight_n `).

`(feature, weight)`

`feature_weight_pairs`

`，其中 fwn = (`

`,`

- Hash ﹣ weight ﹣ pairs = [(hash (feature), weight) for feature, weight in feature ﹣ weight ﹣ pairs] generate the (hash, weight) in the graph. At this time, assume that the bits ﹣ count = 6 generated by hash (as shown in the graph);

`hash_weight_pairs`

`feature_weight_pairs`

`(hash,weight)`

`bits_count = 6`

- Then add up the bits of the hash ﹣ weight ﹣ pairs vertically. If the bit is 1, + weight, if it is 0, - weight, and finally generate bits ﹣ count numbers, as shown in the figure [13, 108, - 22, - 5, - 32, 55]. The value generated here is related to the algorithm used in the hash function.

`hash_weight_pairs`

`+weight`

`-weight`

`bits_count`

`[13, 108, -22, -5, -32, 55]`

- [13108, - 22, - 5, - 32,55] - > 110001 this is very simple, positive 1 negative 0.

`[13,108,-22,-5,-32,55] -> 110001`

So far, how to move from a doc to a simhash value has been explained. But there's another important part,

"Calculation of Hamming distance of simhash value"

The Hamming distance between binary string a and binary string B is the number of 1 in binary after a XOR B.

`A xor B`

For example:

```
A = 100111;
B = 101010;
hamming_distance(A, B) = count_1(A xor B) = count_1(001101) = 3;
```

After we calculate the simhash value of all docs, we need to calculate whether doc A and DOC B are similar

Whether the Hamming distance between a and B is less than or equal to N, which is generally taken as 3 according to experience,

Simhash is essentially a locally sensitive hash, which is different from MD5 and the like. Because of its local sensitivity, we can use Hamming distance to measure the similarity of simhash value.

"Efficient calculation of number of 1 in binary sequence"

```
/* src/Simhasher.hpp */
bool isEqual(uint64_t lhs, uint64_t rhs, unsigned short n = 3)
{
unsigned short cnt = 0;
lhs ^= rhs;
while(lhs && cnt <= n)
{
lhs &= lhs - 1;
cnt++;
}
if(cnt <= n)
{
return true;
}
return false;
}
```

If calculated by the above function, the time complexity is O (n); the default value of N here is 3. This shows that it is quite efficient.

"Implementation of O (1) algorithm to calculate the number of 1 in binary sequence"

Thank you @ schatwang for your comments:

Thank you for the simhash library. It will be very convenient. There are various implementations of O (1) for finding the number of 1 in binary. Please refer to this place: http://stackoverflow.com/a/14682688

## Project implemented by simhash

- C + + version simhash
- Golang version gosimhash

Mainly for Chinese documents, that is to say, before simhash for this project, word segmentation and keyword extraction are also carried out.

## Compare other algorithms

"Baidu's de duplication algorithm"

Baidu's de recalculation method is the simplest, which is to directly find out the longest n sentences of this article and do a hash signature. N is generally taken as 3. It is said that the accuracy and recall rate can reach more than 80%.

"Shine algorithm"

The principle of shingle is a little complicated, not detailed. I think the algorithm is too academic, not friendly to engineering implementation, too slow, and basically unable to deal with massive data.

Other algorithms

See the discussion on Weibo

## Reference resources

- Similarity estimation techniques from rounding algorithms
- Simhash and Google's web page de duplication
- Simhash and Hamming distance for similarity calculation of massive data

Reprint please indicate the source: the principle and implementation of simhash algorithm