Distributed Similar Web Page Detection Based on Spark

Posted by agaran at 2020-03-03


As the number of web pages grows rapidly, so does the number of near-duplicate (mirror) pages. Near-duplicate pages seriously degrade search engine results. Removing them from the collected pages improves the efficiency of the collection and indexing systems, and users no longer see many duplicate pages in their query results. Removing mirror pages requires a near-duplicate page detection algorithm.

In this experiment, distributed similar web page detection is implemented on Spark using a similarity join, and 5000 web pages are tested to find all pairs of similar pages. The 5000 pages are electronic resumes from a recruitment website. Because only the resume content matters, the HTML tags, page styles, and code outside that content must be removed first. Similarity detection is then carried out on what remains.

In a big data environment the data volume is huge, and a similarity join performed in the traditional way is very inefficient. The experiment therefore uses a similarity join query as the underlying algorithm and, because the algorithm is distributed, Spark as the distributed computing framework.

A similarity join query finds pairs of similar data objects. It has a wide range of applications, such as similar web page detection, entity resolution, data cleaning, and similar image retrieval. In similar web page detection, similarity joins and related techniques help web search engines perform focused crawling, improve the quality and diversity of search results, and identify spam. In entity resolution, a similarity join can find similar customers in an enterprise database and match product quotations. In data cleaning, it helps provide consistent and accurate data when integrating different data sources. In similar image retrieval, it is used to retrieve similar images, which makes it possible to trace the source of an image, find a higher-resolution version, and so on. Similarity join algorithms currently exist for strings, sets, and vectors.

Description of selected platform

The experiment runs on the Spark platform. The programming language is Scala, and the operating system is Ubuntu.

Data set description

The original data set is about 5000 resume pages, roughly in the following format:

After word segmentation and the removal of HTML tags, duplicate words, and stop words, the output looks as follows. It serves as the input to the Spark document similarity detection.

Design logic

Data cleaning and Chinese word segmentation

Traverse all HTML files in the directory and remove all tags, styles, and code from each file. Segment the filtered text with a Chinese word segmenter, filtering out duplicate words and stop words at the same time. What remains are the keywords of the resume page, which are written to another file as the input to similarity detection.
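The cleaning step can be sketched as follows. The original experiment was written in Scala and its code is not shown here, so this is a plain-Python illustration using only the standard library. The names `TextExtractor` and `extract_keywords` are hypothetical, and whitespace splitting stands in for a real Chinese word segmenter, which would be passed in through the assumed `segment` parameter.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_keywords(html, stop_words, segment=str.split):
    """Strip markup, segment the text, drop stop words and duplicates.

    `segment` is a stand-in for a Chinese word segmenter (an assumption);
    the default simply splits on whitespace. Order of first occurrence
    is preserved.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)

    seen = set()
    keywords = []
    for word in segment(text):
        if word and word not in stop_words and word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords
```

In a real run, each HTML file in the directory would be passed through `extract_keywords` and the resulting keyword list written to a separate file, one file per resume.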

Proximity detection

Read all the segmented resume documents from the directory and generate (file name, content) pairs for subsequent processing. Build an inverted index: for each item (keyword) in a record, insert the record's ID (rid) into that item's entry in the index. Generate candidates by pairing the rids within each index entry, and count how many times each candidate pair occurs, which is the number of items the two records share. This produces the final candidate set. Finally, filter the candidate set using the Jaccard coefficient threshold: a pair is reported as a result only if the similarity computed from its shared-item count meets the threshold.
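The steps above can be sketched on a single machine as follows. This is a plain-Python illustration, not the Scala/Spark code used in the experiment; the function name `similarity_join` and the sample documents are hypothetical. The Jaccard coefficient of two keyword sets A and B is |A ∩ B| / |A ∪ B|, and the union size is obtained from the shared-item count as |A| + |B| - |A ∩ B|.

```python
from collections import defaultdict
from itertools import combinations


def similarity_join(docs, threshold):
    """Return (rid_a, rid_b, jaccard) for every pair meeting the threshold.

    docs maps a record ID (for example a file name) to its keywords.
    """
    sets = {rid: set(tokens) for rid, tokens in docs.items()}

    # Inverted index: item -> list of rids containing it.
    # Iterating rids in sorted order keeps every posting list sorted,
    # so each pair is always produced in the same (a, b) order.
    index = defaultdict(list)
    for rid in sorted(sets):
        for token in sets[rid]:
            index[token].append(rid)

    # Candidate generation: pair up rids that share an item and count
    # how many items each pair has in common.
    overlap = defaultdict(int)
    for rids in index.values():
        for a, b in combinations(rids, 2):
            overlap[(a, b)] += 1

    # Verification: compute Jaccard from the overlap count and filter.
    results = []
    for (a, b), common in overlap.items():
        union = len(sets[a]) + len(sets[b]) - common
        score = common / union
        if score >= threshold:
            results.append((a, b, score))
    return sorted(results)


if __name__ == "__main__":
    sample = {
        "r1": ["java", "spark", "scala"],
        "r2": ["java", "spark", "python"],
        "r3": ["sales", "marketing"],
    }
    for a, b, score in similarity_join(sample, 0.5):
        print(a, b, score)
```

Pairs that share no item never appear in the index together, so they are never compared. That is the point of the inverted index: only records with at least one keyword in common become candidates.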

Deployment and running

Spark officially provides three cluster deployment schemes: Standalone, Mesos, and YARN. Standalone mode is sufficient for an experimental environment, so this experiment uses it. If a YARN or Mesos environment already exists, migrating to that resource scheduling framework is also straightforward.

Note

To keep permission problems from affecting the experiment, everything in this example is run as root. In a production environment, for the sake of security, you should start and run Spark as a separate user. Unless otherwise specified, every operation must be performed once on each machine. You can use Xshell or Ansible to simplify this.

Environment preparation

Install the system

I used VirtualBox to create three Ubuntu servers, each a clean installation. The cluster consists of one master and two slaves. Each machine has two network cards: one in NAT mode, used to reach the external network for updating and installing software packages, and one in host-only mode, used for direct communication between the host computer and the virtual machines. As shown in the figure below.

Configure hosts

The IP addresses of the host only network card are allocated in sequence. Here, the IP addresses of the host only network cards of the three virtual machines are, and respectively.

Edit the hosts file on each host and add the following three lines at the end of the file.

After configuration, ping the hosts from one another to check that they can reach each other.

Configure passwordless SSH login

Install openssh server

Generate private and public keys on all machines

So that the machines can access each other, send the id_rsa.pub from each machine to the master node. The public key can be transferred with scp.

On the master, add all public keys to the authorized_keys file used for authentication

Distribute the public key file authorized_keys to each slave

Verify passwordless SSH communication on each machine

Install JDK

Install OpenJDK 8 directly through the package manager

Use the following commands to configure the environment variables and make them take effect:

Verify that Java is installed successfully

Install Spark

Download and install Spark using the following commands:

Configure Spark

Enter the configuration directory /usr/local/spark/conf to modify the configuration files.


Configure the IP address or hostname of each slave node in the slaves file


Configure spark-env.sh


Start Spark

If everything started without problems, you can open Spark's web management page at http://master:8080

Install Spark with a script

To make deploying Spark easier, I wrote a simple shell script that builds a 3-node Spark cluster in a few steps. The script address is:


The script was only tested under Ubuntu 16.04, but it should also work on other versions of Ubuntu and on Debian.

First, use VirtualBox to install an Ubuntu server, and execute the following commands in it:

After the restart is complete, clone this virtual machine into two more virtual machines. Remember to modify SPARK_LOCAL_IP in each clone's configuration file /usr/local/spark/conf/spark-env.sh, and then start the Spark cluster.

Description of experimental results

After running the code, you can see the generated DAG, which is Spark's computation path and consists of three stages. Finally, all document pairs whose similarity exceeds the threshold are output, together with their similarity coefficients.


The error java.io.IOException: No space left on device occurred during the run. A Google search showed that Spark uses the /tmp directory to store intermediate results by default, and here /tmp is a tmpfs directory, that is, an in-memory file system whose default size is only half of the physical memory.

To solve this problem, set the SPARK_LOCAL_DIRS environment variable to a directory with enough space. I changed it to the Spark installation directory here.
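Before choosing a directory for SPARK_LOCAL_DIRS, it can help to compare the free space of the candidates. The following is a minimal Python sketch using only the standard library; the helper names `free_gib` and `pick_scratch_dir` are hypothetical, and the required size is something you have to estimate for your own data.

```python
import shutil


def free_gib(path):
    """Free space on the file system containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def pick_scratch_dir(candidates, required_gib):
    """Return the first candidate with at least `required_gib` free, else None."""
    for path in candidates:
        if free_gib(path) >= required_gib:
            return path
    return None


if __name__ == "__main__":
    # /tmp and /usr/local/spark are the two directories mentioned in the text;
    # 10 GiB is an arbitrary example requirement.
    for path in ("/tmp", "/usr/local/spark"):
        try:
            print(path, round(free_gib(path), 1), "GiB free")
        except FileNotFoundError:
            print(path, "does not exist")
```

If /tmp is a tmpfs mount, the number printed for it reflects the memory-backed file system rather than the disk, which makes the cause of the error easy to confirm.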