

A full data analysis walkthrough, using text mining of 40,000+ articles from Huxiu (虎嗅网) as an example

Posted by graebner at 2020-02-27
* This article is reposted from 运营喵是怎样炼成的. Author: @苏格兰折耳喵

Analyzing external data, rather than only the usual internal data (user data, sales data, traffic data, and so on), can often bring unexpected insights to product, operations, and marketing work, and opens a window for data-driven business growth.

In this article, the author walks through the whole process, from data collection and data cleaning to data analysis and data visualization, to show how powerful external data analysis can be.

The outline of this article is as follows:

1 analysis background

1.1 analysis rationale - why analyze Huxiu

In an Internet era of data explosion and uneven information quality, we are constantly immersed in the "information flood" of social media and are inevitably swept along by it. In other words, information on social media has a significant impact on everyone in the real world. Social media is a window through which we indirectly perceive both the objective world and the subjective world, and it affects us all the time. For more on social media, see "Practical Notes | How to use social listening to 'extract' valuable information from social media?", from which the following is excerpted:

Based on the above two situations, we can draw the conclusion that through social media, we can observe the real world:

In view of this, the author, as an Internet practitioner, would like to analyze the current state of the Internet industry. The first step is to find media with significant influence in the industry. The previous analysis covered "Everyone Is a Product Manager" (see "Practical Notes | A qualified 'growth hacker' should also pay attention to external data analysis!"); this time, the author turns to Huxiu.

Founded in May 2012, Huxiu is a new media platform that gathers high-quality, innovative information and people. The platform focuses on original, in-depth, sharp, high-quality business information, analyzed and discussed from the perspective of innovation and entrepreneurship. At its core, Huxiu follows the integration of the Internet with traditional industries, the ups and downs of a range of star companies (both public companies and startups), and the forces and trends behind industry waves.

Therefore, analyzing the content published on the platform has practical value for studying the development and current state of the Internet industry.

1.2 purpose of this paper

The author has four main goals in this project:

(1) Analyze Huxiu's content operations, mainly through descriptive analysis of the number of articles published, bookmarked, and commented on;

(2) Use text analysis to explore, in an interesting way, some people, companies, and niche segments of the Internet industry;

(3) Show the practical value of text mining in the field of data analysis;

(4) Visualize messy structured and unstructured data to show the beauty of data.

1.3 analysis methods - analysis tools and types

In this paper, the data analysis tools used by the author are as follows:

Using the above tools, the author carries out two types of analysis. The first is traditional, descriptive statistical analysis of numerical data, such as the distribution of reading and bookmark counts over time. The second is the centerpiece of this article: in-depth text mining, including keyword extraction, LDA topic modeling of article content, word vector / related-word analysis, the DTM model, the ATM model, lexical dispersion plots, and word clustering.

2 data collection and text preprocessing

2.1 data collection

The author collected articles from Huxiu's homepage (not every article on the site, but homepage content is selected by the editors and is highly representative). The data span May 2012 to November 2017, 41,121 articles in total. The collected fields are article title, publication time, bookmark count, comment count, body text, author name, author bio, and the author's article count. The author then manually derives four features: two time features (hour of day and day of week) and two content-length features (title length and body length). The final data is shown in the figure below:
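The four hand-made features described above can be derived with a few lines of standard-library Python. This is a minimal sketch, not the author's code: the function name `extract_features`, the timestamp format, and the sample record are assumptions for illustration, and lengths are counted in characters, which is one plausible reading of "number of words" for Chinese text.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def extract_features(title, body, published_at):
    """Derive time and length features from one article record.

    `published_at` is assumed to look like "2017-11-02 06:30".
    """
    ts = datetime.strptime(published_at, "%Y-%m-%d %H:%M")
    return {
        "weekday": WEEKDAYS[ts.weekday()],  # weekday() returns 0 for Monday
        "hour": ts.hour,                    # hour of day, 0-23
        "title_length": len(title),         # characters in the title
        "body_length": len(body),           # characters in the body text
    }

# Hypothetical record, for illustration only.
features = extract_features("示例标题", "这是一段示例正文。", "2017-11-02 06:30")
print(features)
```

Keeping the weekday as a label and the hour as an integer makes the later day-by-hour cross analysis a simple counting exercise.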

2.2 data preprocessing

There is a golden rule in data analysis and data mining: "garbage in, garbage out". Good data preprocessing is essential for obtaining good analysis results. In this article, preprocessing mainly means cleaning the text data. The steps are as follows:

(1) Word segmentation

Word segmentation is the most critical step in text mining, and it directly affects all subsequent analysis. The author uses Jieba to segment the text. Jieba offers three segmentation modes: precise mode, full mode, and search engine mode:

·Precise mode: tries to cut the sentence as accurately as possible, which makes it suitable for text analysis;

·Full mode: scans out every string in the sentence that can form a word, which is very fast but cannot resolve ambiguity;

·Search engine mode: building on precise mode, splits long words again to improve recall, which makes it suitable for search engine segmentation.

Taking "Sina micro public opinion focuses on the scenario application of social big data" as an example, the results of three word segmentation modes are as follows:

[full mode]: Sina / micro public opinion / Sina micro public opinion / focus / focus / socialize / big data / socialize big data / of / scenario / Application

[precise mode]: Sina micro public opinion / focus / focus / socialized big data / of / scenario / Application

[search engine mode]: Sina / micro public opinion / Sina micro public opinion / focus on / socialize / big data / socialize big data / of / scenario / application

To avoid ambiguity and obtain segments that match expectations, the author uses precise mode.

(2) Stop word removal

Three types of stop words are removed:

(3) Remove high-frequency words and rare words, and compute bigrams

High-frequency and rare words are removed for the topic models used later (LDA, ATM), mainly to exclude words that do little to distinguish topics. The effect is similar to stop word removal.
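One common way to implement this filtering is by document frequency: drop tokens that appear in too few documents or in too large a share of them. The sketch below assumes that approach. The function name `filter_extremes` and both thresholds are illustrative choices, not values reported by the author.

```python
from collections import Counter

def filter_extremes(docs, min_docs=2, max_doc_ratio=0.5):
    """Drop tokens that occur in too few or too many documents.

    `docs` is a list of token lists (already segmented).
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    n_docs = len(docs)
    keep = {
        tok for tok, df in doc_freq.items()
        if df >= min_docs and df / n_docs <= max_doc_ratio
    }
    return [[tok for tok in tokens if tok in keep] for tokens in docs]

# Toy corpus: "user" is in every document, "movie" and "rare" in only one.
toy_docs = [
    ["user", "platform", "ai"],
    ["user", "platform", "finance"],
    ["user", "ai", "movie"],
    ["user", "finance", "rare"],
]
filtered = filter_extremes(toy_docs)
```

In the toy corpus, "user" is removed for being everywhere and "movie" and "rare" for being too scarce, which leaves only the words that help tell documents apart.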

Bigram computation automatically detects new words in the text based on co-occurrence between words: if two words frequently appear next to each other, they can be merged into a new word. For example, if "data" and "product manager" often appear together across different paragraphs, then "data_product manager" is the new word formed from the two, with an underscore between them.
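The bigram idea can be sketched with plain adjacency counts. This is a simplified stand-in, assuming a raw count threshold: phrase-detection libraries usually score pairs with a statistical formula rather than a bare count, and the source does not say which tool was used. The names `find_bigrams`, `merge_bigrams`, and `min_count` are hypothetical.

```python
from collections import Counter

def find_bigrams(docs, min_count=2):
    """Return adjacent token pairs seen at least `min_count` times."""
    pair_counts = Counter()
    for tokens in docs:
        pair_counts.update(zip(tokens, tokens[1:]))  # adjacent pairs only
    return {pair for pair, n in pair_counts.items() if n >= min_count}

def merge_bigrams(tokens, bigrams):
    """Rewrite a token list, joining detected pairs with an underscore."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Toy corpus, for illustration only.
toy_docs = [
    ["data", "product manager", "hiring"],
    ["senior", "data", "product manager"],
    ["data", "analysis"],
]
bigrams = find_bigrams(toy_docs)
merged_first = merge_bigrams(toy_docs[0], bigrams)
```

Only the pair that recurs across documents is merged; "data" followed by "analysis" appears once and is left alone.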

3 Descriptive analysis

In this part, the author runs descriptive statistics on the numerical data. This is a more conventional kind of data analysis, but it can still reveal some patterns and give a first overview.

3.1 trends in the number of posts, comments and bookmarks

As the figure below shows, from May 2012 to November 2017 the quarterly number of homepage posts fluctuated little, hovering around an average of 1,800. After 2016, the number of posts increased noticeably.

In addition, the two ends of the series (the second quarter of 2012 and the fourth quarter of 2017) are incomplete quarters, so their post counts are low.

The figure below shows how bookmark and comment counts changed over this period. Comment counts stayed flat, with little fluctuation, but bookmark counts kept climbing and peaked in the second quarter of 2017. To some extent, bookmark counts reflect how substantive and valuable articles are: readers only save articles they consider valuable enough to read again. This suggests that either the quality of Huxiu's articles is steadily improving or its readership is growing.

3.2 analysis of publishing time patterns

The author extracts "day of week" and "time of day" from the time dimension; these are the hand-made features mentioned in section 2.1. Cross-tabulating article counts by day of week and time of day gives the following figure:

The figure above is a heat map. Colors run from warm to cold as values go from large to small. There is a clearly highlighted region in the middle: the rectangle bounded by "6:00-19:00" and "Monday to Friday". In other words, articles are mainly published during the daytime on working days. Within that region, 6:00 to 7:00 from Monday to Friday is the peak publishing time, which shows that Huxiu's content operators tend to publish early in the morning on working days. This fits its target audience of TMT practitioners, entrepreneurs, and investors, many of whom read in the morning and like to browse Huxiu on the subway or bus. There is another peak at 9:00-11:00, which gets content ready ahead of readers' lunch-break reading, and another at 17:00-18:00, ahead of readers' after-work reading.
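The numbers behind such a heat map are a cross-tabulation of article counts by weekday and hour. A minimal sketch follows, assuming publication times are strings like "2017-11-02 06:30"; the function name `weekday_hour_counts` and the sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime

def weekday_hour_counts(timestamps):
    """Count articles per (weekday, hour) cell; weekday 0 is Monday."""
    counts = Counter()
    for text in timestamps:
        ts = datetime.strptime(text, "%Y-%m-%d %H:%M")
        counts[(ts.weekday(), ts.hour)] += 1
    return counts

# Hypothetical publication times: two on a Thursday morning, one on a Saturday.
sample = ["2017-11-02 06:30", "2017-11-02 06:45", "2017-11-04 10:00"]
cells = weekday_hour_counts(sample)
```

Each key of `cells` is one cell of the heat map, so any plotting library can color it by its count.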

3.3 correlation analysis

The author has always been curious whether the number of comments and bookmarks an article receives is statistically related to the length of its title and body. To explore this, the author draws two charts that reflect the relationships among these variables.
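The author explores this question visually. As a numeric complement, one could compute Pearson's correlation coefficient between any two of these variables. The sketch below is an illustration only: the source reports no correlation coefficients, and the function name `pearson` and the sample numbers are assumptions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up numbers: body length versus comment count for four articles.
body_lengths = [1000, 2000, 3000, 4000]
comment_counts = [2, 4, 6, 8]
r = pearson(body_lengths, comment_counts)
```

A value near 1 or -1 indicates a strong linear relationship and a value near 0 indicates none. Pearson's r only captures linear trends, so a peak at a particular title or body length can be missed by r while still being visible in a chart.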

First, the author plots a bubble chart of title length, body length, and comment count (the round bubbles are replaced by six-pointed stars, but in essence it is still a bubble chart).

In the figure above, the horizontal axis is the body length and the vertical axis is the title length. Comment count is encoded by the size and color of the stars: the warmer the color and the larger the star, the larger the value. Most articles with many comments fall in the region bounded by 6,000 words in the body and 20 words in the title. Most business articles on Huxiu are original and in-depth. A medium-to-long body leaves room to explain the context behind an event clearly, while the title must attract attention and draw readers in. Only a suitable title length and body length can achieve both.

Next, the author draws a 3D surface plot of bookmark count, comment count, title length, and body length. The x-axis and y-axis are title length and body length respectively, and the z-axis is the surface formed by bookmark and comment counts. Rotating this 3D surface plot reveals the relationships among these four variables.

Note that colors in the figure above are read the same way as in the previous figures: warm to cold means large to small. Rotating the plot to view cross sections along each dimension shows that bookmark and comment counts form steep peaks, like the cliffs of Mount Hua, where the body is within 5,000 words and the title is about 15 words. This is where bookmark and comment counts are largest.

3.4 City mention analysis

Here, the author builds a lexicon of China's first- to fifth-tier cities, extracts city names from the preprocessed text, and draws a geographic map of mention frequency. This gives an indirect view of Internet development in each city: city mentions are generally tied to information about the Internet industry, products, and jobs, so to some extent they reflect the state of a city's Internet industry.

The results in the figure above match common sense. First-tier cities such as Beijing, Shenzhen, Guangzhou, and Hangzhou are mentioned most often and are key cities for the Internet industry. Notably, a large area of the Yangtze River Delta shows high heat values, which directly indicates that these cities are mentioned frequently across Huxiu's articles. (The Yangtze River Delta city cluster includes Shanghai; Nanjing, Wuxi, Changzhou, Suzhou, Nantong, Yancheng, Yangzhou, Zhenjiang, and Taizhou in Jiangsu Province; Hangzhou, Ningbo, Jiaxing, Huzhou, Shaoxing, Jinhua, Zhoushan, and Taizhou in Zhejiang Province; and Hefei, Wuhu, Ma'anshan, Tongling, Anqing, Chuzhou, Chizhou, and Xuancheng in Anhui Province.) Combined with national policy and regional factors, the facts reflected in the map can be understood as follows:

Next, the author extracts co-occurrence relationships between cities in the text, that is, how often two cities appear together. To some extent this reflects economic, cultural, policy, and other ties between cities: the higher the co-occurrence frequency, the closer the connection. The extracted results are shown in the following table:
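Counting co-occurrence comes down to enumerating city pairs within each text unit. The sketch below assumes the unit is one article and that each pair is counted at most once per article; the source does not state either choice. The function name `city_cooccurrence` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def city_cooccurrence(docs, cities):
    """Count, for each pair of cities, the documents mentioning both."""
    pair_counts = Counter()
    for tokens in docs:
        present = sorted(set(tokens) & cities)  # sorted so pairs have one order
        pair_counts.update(combinations(present, 2))
    return pair_counts

# Toy lexicon and corpus, for illustration only.
city_lexicon = {"Beijing", "Shanghai", "Shenzhen"}
toy_docs = [
    ["Beijing", "startup", "Shanghai"],
    ["Shanghai", "Beijing", "Beijing"],
    ["Shenzhen", "hardware"],
]
pairs = city_cooccurrence(toy_docs, city_lexicon)
```

Sorting the cities before pairing keeps ("Beijing", "Shanghai") and ("Shanghai", "Beijing") from being counted as two different edges.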

Plotting the above results gives the following dynamic flow chart:

Because most articles on Huxiu involve entrepreneurship, policy, and business, co-occurrence between cities reflects relationships in resources, people, or industries. The dynamic chart mainly shows mutual flows among Beijing, Shanghai, Guangzhou, Shenzhen, and Hangzhou (the hub nodes of the network) and one-way flows from these first-tier cities to central and western cities. Unsurprisingly, China's three most developed urban agglomerations and several emerging ones are the regions with the largest flows and the densest intersections.

The analysis above is descriptive analysis of numerical data. Next, the author turns to more in-depth text mining.

4 text mining

Data mining identifies valid, novel, potentially useful, and ultimately understandable patterns in structured databases, while text mining (also called text data mining or knowledge discovery in text databases) is a semi-automatic process of extracting patterns, that is, useful information or knowledge, from large amounts of unstructured data.

The text mining in this article mainly covers high-frequency word statistics / keyword extraction / keyword clouds, article title clustering, article content clustering, LDA topic modeling of article content, word vector / related-word analysis, the ATM model, lexical dispersion plots, and word clustering.

4.1 keyword extraction

For keyword extraction, the author does not use raw word frequency, whose logic is simply that the more often a word appears in an article, the more important it is. That logic favors words that are common across the whole corpus. Instead, the author uses TF-IDF (term frequency-inverse document frequency) keyword extraction:

When extracting the key information of a text, TF-IDF keyword extraction is therefore preferable to word frequency statistics, because it surfaces keywords that are especially significant for that particular text.
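The contrast between raw frequency and TF-IDF can be made concrete with a small from-scratch implementation. This is a sketch of one textbook variant (relative term frequency times the natural log of inverse document frequency, without smoothing), not the implementation the author ran. The function name `tfidf_keywords` and the toy corpus are assumptions.

```python
from collections import Counter
from math import log

def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the tokens of one document by TF-IDF against the corpus."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # documents containing each token
    tokens = docs[doc_index]
    term_freq = Counter(tokens)
    scores = {
        tok: (count / len(tokens)) * log(n_docs / doc_freq[tok])
        for tok, count in term_freq.items()
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda tok: (-scores[tok], tok))[:top_k]

# Toy corpus: "user" is the most frequent word in the first document,
# but it appears in every document, so its IDF is zero.
toy_docs = [
    ["user", "user", "platform", "drone"],
    ["user", "platform", "finance"],
    ["user", "movie"],
]
top_two = tfidf_keywords(toy_docs, 0, top_k=2)
```

Here "user" would win on raw frequency, yet it drops to the bottom under TF-IDF because it carries no information about what makes the first document different.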

Below are the top 100 keywords extracted with Jieba from the preprocessed corpus of nearly 400 MB.

From a macro perspective, three types of keywords can be clearly identified:

From a micro perspective, "users" ranks first, in line with Internet practitioners' mantras of "the user is king", "users first", and "user-centered". "Platform" and "enterprise" follow.

The author selects the top 500 keywords to draw a keyword cloud. Huxiu takes its name from the famous line "In me the tiger sniffs the rose" by the British poet Siegfried Sassoon, so the word cloud was meant to use a tiger sniffing a rose as its background. Since no suitable picture could be found, the tiger's close relative, the cat, stands in. The word cloud is as follows:

4.2 LDA topic model analysis

The keyword categories above are rough and manually defined, which inevitably introduces bias and limits coverage. The LDA topic model is therefore used to discover latent topics in the corpus. For the principles of the LDA topic model, see Part 4 of "Practical Notes | Using big data text mining to understand the current state and trends of the 'bike sharing' industry".

The author sets the number of topics to 10. After several hours of computation, the following results are obtained:

The preprocessed corpus is evidently fairly clean. From the topic words under each topic, it is easy to tell the topics of these 10 clusters apart. Three of the clusters are mixed (each contains two topics), but this does not affect the subsequent analysis. The topic classification is shown in the table below:

Below is the share of each topic across the 40,000+ articles. The articles on Huxiu's homepage most often cover the moves of Internet giants, followed by the rising film and television entertainment sector. Apart from the few reports on driverless cars, coverage of the other topics is fairly balanced.

Third, the change in the number of articles on each topic over time:

In the figure above, the topic "giant strategy" has maintained a high level of homepage posts throughout, followed by "artificial intelligence", which had a small peak on Huxiu's homepage in the first quarter of 2013. Notably, "Internet finance" had a large number of reports in the third quarter of 2014, which suggests that Internet finance was booming at that stage. Major events in the Internet finance industry in this period include Xiaomi's investment in Jimubox, marking its entry into Internet finance (September 10), JD's consumer finance strategy (September 24), and the establishment of Ant Financial (October 16). In addition, 2014 as a whole was called the "first year of crowdfunding", P2P lending entered a shake-out, and the central bank issued a series of orders to regulate Internet finance directly. These events and policies were enough to trigger heated discussion in the industry, producing the sudden rise in volume in this period.

4.3 sentiment analysis & LDA topic model cross analysis

Building on the LDA results above, the author uses Sina micro public opinion's sentiment analysis model (which has six emotion classes: joy, anger, sadness, surprise, fear, and neutral) to analyze the titles of these articles and obtain an emotion label for each. The results are as follows:

Cross-tabulating topic against emotion gives the following figure:

The figure above shows that titles under every topic are predominantly neutral, reflecting the objective, neutral stance of the authors and the site. However, in an era when clickbait is rampant and readers have a taste for the sensational, overly neutral titles are also unremarkable and struggle to trigger reading. As the saying goes, "give the brand a personality, put emotion into marketing": an author who can stir readers' emotions is a true master. So after neutral, the second most common emotion is anger, with feuds and grudges that ignite readers' emotions. The third is sadness, which readily evokes sympathy and resonance.

4.4 ATM model

In this part, the author wants to understand the writing topics of the various authors on Huxiu: which kinds of articles certain star authors like to write (such as "industry insight", "viral marketing", "new media operations", and so on), and which authors write on similar topics.

The author therefore uses the ATM model. Note that ATM here does not mean a cash machine; it stands for author-topic model:

First, the author removes authors who have published only one article, and then extracts topics from the remaining text. Because some texts were removed, the topics do not match the earlier division exactly. Based on the topic words under each topic, the author labels the 10 topics as: "industry news", "smartphone", "entrepreneurship & investment and financing", "Internet finance", "new media & marketing", "film and television entertainment", "artificial intelligence", "social media", "investment & financing & M&A", and "e-commerce retail".

Next, the author will analyze the writing themes of some interested authors and their related authors.

First, Luo Yonghao, founder of Smartisan Technology. The author has always found him an unusual character and, having seen articles under his byline on Huxiu, wants to see what he has written there:

Judging from the topics and probability distribution of Lao Luo's writing, he tends to write about entrepreneurship, financing, smartphones, and new media marketing, which matches the public's perception. Lao Luo, who is good at playing the sentiment card, likes to talk about entrepreneurship and his views on phones, and with his distinctive personality and sharp language he often speaks up for his Smartisan brand.

Using the document IDs, the author found the articles he published:

Judging by the titles alone, the ATM model is quite smart: the writing topics it inferred match Lao Luo's articles.

Next are the Huxiu authors whose writing topics are similar to Lao Luo's and who have published more than three articles:

Next is Huxiu's own account, with more than 10,000 articles on the homepage. Its writing topics are "industry news", "smartphone", and "new media & marketing":

Besides some individual self-media writers, authors with similar writing topics also include media outlets such as global.com, fortune.cn, and Bloomberg BusinessWeek. From the previous analysis, it can be inferred that they also publish heavily on the above three topics.

From the 10,189 articles, the author randomly selected several titles by document ID for a rough check, and then drew these titles into a word cloud in the shape of a unicorn.

Judging from the titles and the keyword cloud above, the predicted topics are reasonable.

Now consider two other outlets the author is interested in: Hundun University and 21st Century Business Herald.

As the two figures above show, Hundun University focuses on "entrepreneurship & investment and financing" and "new media & marketing", in line with its aim of giving entrepreneurs startup-related skills. 21st Century Business Herald prefers "investment & financing & M&A", "industry news", and "smartphone", which fits the outlet's reporting style: analyzing the international situation, examining China's economy, observing industry trends and guiding sound development, reflecting the world economic landscape and its changes, and tracking the dynamics and development of China's business sector.

4.5 lexical dispersion plot

Next, the author wants to know where certain words appear, and how often, across the 40,000+ homepage articles published on Huxiu between May 2012 and November 2017. A lexical dispersion plot serves this purpose: it shows the distribution of words through a text.

The author first arranges the texts in chronological order and then draws the lexical dispersion plot after word segmentation, so the direction in which the cumulative word count grows matches the forward direction of time. In the figure, the vertical axis lists the words and the horizontal axis is the cumulative word offset in the text. Each blue vertical line marks one mention of the word, and its position on the horizontal axis gives its location; blank space means no mention. The density and starting position of the blue lines indicate how often, and from when, a word is mentioned.
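The data behind a lexical dispersion plot are just the token offsets at which each target word occurs in the chronologically ordered text. A minimal sketch follows, with a hypothetical function name `dispersion_offsets` and toy tokens.

```python
def dispersion_offsets(tokens, targets):
    """Map each target word to the token positions where it occurs."""
    offsets = {word: [] for word in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

# Toy token stream, assumed to be already sorted by publication time.
stream = ["o2o", "smartphone", "o2o", "shared bicycle"]
offsets = dispersion_offsets(stream, ["o2o", "shared bicycle", "3d printing"])
```

Plotting one row per word, with a vertical tick at each offset, reproduces the kind of figure described above. A word that only appears late in the stream, like "shared bicycle" in this toy example, shows ticks only at the right-hand end.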

From the above keywords and subject words, the author selects 14 words for analysis, and the results are as follows:

As the figure above shows, "smartphone", "mobile payment", "O2O", and "cloud computing" have stayed popular over the past six years, mentioned so often that their rows are almost solid. By contrast, "Internet education", "3D printing", and "live streaming" receive little coverage on Huxiu and are mentioned only sporadically throughout.

Notably, mentions of "shared bicycle" increased sharply in the later period, an explosive rise consistent with the sudden emergence of bike sharing.

4.6 word vector / correlation analysis - what are we talking about when we talk about XX

Word vectors based on neural networks can be learned without supervision from large amounts of unlabeled plain text. These vectors capture semantic relationships between words, just as, in the real world, a word can be known by the company it keeps.

In principle, word embedding as used in word2vec means mapping a high-dimensional space, with one dimension per word, into a continuous vector space of much lower dimension, so that each word or phrase becomes a vector of real numbers. Turning each word into a vector makes computation convenient. For example, "find the synonyms of word A" becomes "find the vectors closest to word A's vector by cosine distance".
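The nearest-neighbor lookup by cosine similarity can be sketched as follows. The three-dimensional vectors are made up for illustration; real word2vec vectors typically have a hundred or more dimensions and come from a trained model. The function names `cosine` and `most_similar` are hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, top_n=2):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Made-up toy vectors, not output from a trained model.
toy_vectors = {
    "baidu": [0.9, 0.1, 0.0],
    "search": [0.8, 0.2, 0.1],
    "orange": [0.0, 0.1, 0.9],
}
neighbors = most_similar("baidu", toy_vectors, top_n=1)
```

Because cosine similarity depends only on the angle between vectors, words used in similar contexts rank as neighbors regardless of how frequent each word is.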

Next, the author uses word2vec to find words related to some words of interest, in order to interpret them in Huxiu's particular context.

The author analyzes the related words of "Baidu", "AI", "Chu Shijian", and "Luo Zhenyu".

The words returned are all related to Baidu: they are either Baidu's products and companies or Baidu's CEO and executives. "Search" shows up repeatedly in various forms; it is the business on which Baidu was built.

The words related to "AI" also give a good picture of the subfields of artificial intelligence and several currently popular application scenarios.

Like Chu Shijian, the related figures (Niu Gensheng, Hu Xueyan, Lu Guanqiu, Wang Yongqing, and Zong Qinghou) are also famous business leaders. "The old man", "Chu Lao", and "Orange King" are honorifics used for him. Interestingly, Mr. Chu is also associated with the heroic spirit of certain political figures (Chairman Mao and Chairman Jiang), and his life story carries the open-minded optimism of sayings such as "what is lost at sunrise may be regained at sunset" and "start over and put the old mountains and rivers in order"!

Then there is Luo Zhenyu, a veteran media figure and communication expert, many of whose views overturn conventional thinking. Figures similar to "Luo Pang" include Shen Yin (founder and planner of the Internet reality show "strange hero" and Luo Zhenyu's business partner), Wu Xiaobo (founder of the Wu Xiaobo Channel and its community), Papi Jiang (a well-known comedic Internet celebrity), Ma Dong (host of "Qi Pa Shuo"), Li Xiang (who runs "Li Xiang's internal business reference" on the app), Ji Shisan (founder of guokr.com), Li Xiaolai (a well-known advocate of financial freedom), and Wu Bofan (publisher of the 21st Century Business Review, whose works include "the theory of relativity between winter and Wu" and "Bofan's daily knowledge").

4.7 word clustering and word classification of brands of top 100 Internet companies

In 2016, the total Internet business revenue of the top 100 Internet companies reached 1.07 trillion yuan, passing the trillion mark for the first time, up 46.8% year on year and driving information consumption growth of 8.73%. The data show that the leading role of top companies in the Internet sector is increasingly evident. Studying them helps us better understand the overall development and future direction of China's Internet industry.

Here, the author uses the 2016 list of the top 100 Internet companies, shown below:

For the brands in the top 100 list, the author uses the word vector model trained above to cluster and classify the brand names.

4.7.1 word clustering

k-means clustering on word2vec vectors takes the semantic relationships between words fully into account: words separated by a small cosine angle are grouped into clusters. The following figure shows the high-dimensional word vectors compressed into 2-dimensional space:

The author groups all the words in the word vector model into 300 clusters to see how brands cluster under this setting. The results, after tidying, are as follows:

Some of the resulting clusters are easy to understand. For example, Tuniu and Lvmama are both travel sites, and Renrendai, Lufax, and PPDai are all Internet finance platforms. These words tend to appear in an "industry context": they are clustered by similar meaning and belong to the same industry. But most clusters are not industry-based and arise from other contexts. Consider the following two passages:

Although the brands in bold above belong to different industries, they all appear in the context of the "demographic dividend of the mobile Internet", so in this context they can be grouped together.

The clusters above may therefore result from words appearing in a variety of contexts. Digging deeper may turn up some interesting clues. For reasons of space, that is left to curious readers.
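The k-means clustering on word vectors described in this subsection can be sketched in pure Python. This is a simplified illustration under stated assumptions: vectors are scaled to unit length so that the largest dot product corresponds to the smallest cosine angle, initialization naively takes the first k points, and the two-dimensional vectors and brand labels are made up. It assumes k is no larger than the number of words and does not handle a cluster whose mean is the zero vector. It is not the author's 300-cluster setup.

```python
from math import sqrt

def normalize(vec):
    """Scale a vector to unit length."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def kmeans_cosine(vectors, k, iterations=10):
    """Cluster unit-normalized vectors; returns {word: cluster index}."""
    words = list(vectors)
    points = [normalize(vectors[w]) for w in words]
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: the largest dot product with a unit centroid
        # means the smallest cosine angle.
        for i, p in enumerate(points):
            labels[i] = max(
                range(k),
                key=lambda c: sum(a * b for a, b in zip(p, centroids[c])),
            )
        # Update step: mean of the members, scaled back to unit length.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return dict(zip(words, labels))

# Made-up 2-dimensional vectors for four hypothetical brand names.
toy_vectors = {
    "travel_a": [1.0, 0.1],
    "loan_a": [0.1, 1.0],
    "travel_b": [0.9, 0.2],
    "loan_b": [0.2, 0.9],
}
clusters = kmeans_cosine(toy_vectors, k=2)
```

In practice a library implementation with better initialization would be the usual choice, since first-k initialization can give poor clusters on real data.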

4.7.2 word classification

Here, the author still uses the word vector from the previous training to do text classification based on CNN (convolutional neural networks) for prediction. The specific principle of CNN is too complex. I will not elaborate here. Interested partners can refer to the following reference materials.

Because text classification follows the text clustering above Cluster) belongs to different tasks in machine learning. The former is supervised learning (all training data are labeled), and the latter is unsupervised learning (data is not labeled). Therefore, before the formal task of text classification, the author first uses labeled corpus training model, and then predicts the subsequent unknown text.

Here, the author divides Internet companies into 17 categories by market segment. Each category has only a small labeled corpus for training, that is, just a few words. Yes, you read that right: because the previously trained word vector model already carries a great deal of semantic information, a classification model can be trained with very little labeled data.

The author then tests the model on words that did not appear in the training corpus. The output is a set of category labels with their probabilities; the category with the highest probability is the brand's most likely market segment. The results are as follows:

The results above match our basic understanding, and in this small-scale test the accuracy is acceptable. Finally, let's try a harder case: an Internet company the author had never heard of.

A Google search shows that Waze is an Israeli technology company that makes crowdsourced navigation maps, and that Google bought it a while ago for $1 billion. Although its product is not backed by powerful satellite imagery like Google Maps, it can give users real-time information about traffic conditions, traffic accidents, speed-check zones and so on (the map screen gives a feel for this). "Crowdsourcing" and "real-time information" correspond to "sharing economy" and "instant messaging" respectively, which fits the meaning of the predicted labels, so the model can predict a company's business attributes to a certain extent.
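The point of this section, that pre-trained word vectors let a classifier learn from only a few labeled words per category, can be shown with a much simpler stand-in than the CNN the author used. The sketch below is a nearest-centroid classifier with a softmax over cosine similarities, not a CNN; the category names, seed words, vectors and the `temperature` parameter are all hypothetical.

```python
import math


def unit(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def train_centroids(labelled, vectors):
    """Average the unit vectors of the few labelled seed words in each category."""
    centroids = {}
    for category, words in labelled.items():
        members = [unit(vectors[w]) for w in words]
        mean = [sum(col) / len(members) for col in zip(*members)]
        centroids[category] = unit(mean)
    return centroids


def predict(word_vector, centroids, temperature=0.1):
    """Return {category: probability} from a softmax over cosine similarities."""
    v = unit(word_vector)
    sims = {
        category: sum(a * b for a, b in zip(v, centroid))
        for category, centroid in centroids.items()
    }
    top = max(sims.values())  # subtract the maximum for numerical stability
    exps = {c: math.exp((s - top) / temperature) for c, s in sims.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


# Hypothetical pre-trained vectors and a tiny labelled seed set.
vectors = {
    "travel_a": [0.9, 0.1, 0.0],
    "travel_b": [0.8, 0.2, 0.1],
    "loan_a": [0.1, 0.9, 0.0],
    "loan_b": [0.0, 0.8, 0.2],
}
labelled = {"travel": ["travel_a", "travel_b"], "finance": ["loan_a", "loan_b"]}

centroids = train_centroids(labelled, vectors)
# A word that was not in the seed set, represented by its (made-up) vector.
print(predict([0.1, 0.85, 0.1], centroids))
```

As in the article, the output is a probability per category, and the highest value marks the most likely segment. A CNN can learn richer patterns than this centroid comparison, but both approaches lean on the semantic information already stored in the word vectors.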

4.8 co-occurrence analysis of top 100 Internet companies

The clustering and classification of the top 100 Internet companies above work like a "black box" whose internal mechanism is hard for us to understand. Next, the author performs a brand co-occurrence analysis based on graph theory and examines the relationships among the top 100 brands from a network perspective.

The co-occurrence relationships among the top 100 corporate brands are extracted to form the following social network diagram:

In the figure above, each node represents a brand, the thickness of a line represents the strength of the link between two brands, and nodes of the same color belong to the same category (under certain conditions). The size of a node and its label indicates the brand's influence in the network, measured by betweenness centrality. The academic description is that "the interaction between two non-adjacent members depends on other members in the network, especially those on the path between the two members, who have some control and restriction over the interaction between the two non-adjacent members". In other words, greater influence means that the brand has more cooperation opportunities and resources, and is more widely involved in the Internet field.

First, look at the 10 most influential brands: Tencent, WeChat, Baidu, QQ, Alibaba, Taobao, Jingdong, Xiaomi, NetEase and Sina Weibo. The Tencent family holds three of the top 10 places (Tencent, WeChat and QQ), which shows its strength.

Then look at the six clusters distinguished by color:

Most of these clusters are easy to understand. The light green cluster (Leju, fangtianxia) is real estate, the bright yellow cluster (Renren loan, auction loan) is Internet P2P finance, and the yellow-orange cluster (car home, e-Car network, e-pai) consists of Internet automobile brands.

The dark green cluster deserves attention: Xiaomi, duokan, MIUI and Tianyi reading, with Xiaomi at the center. MIUI is Xiaomi's product, duokan (reading) was acquired by Xiaomi, and Tianyi reading was once reading software bundled by Xiaomi. Snail game, however, differs from the others. One article is titled "Snail releases mobile strategy, Shihai: not a second Xiaomi", and Snail is Xiaomi's opponent in the field of mobile games.

In addition, in the light blue cluster (Tencent, WeChat, Baidu, QQ, NetEase, Sohu, etc.) and the magenta cluster (Alibaba, Taobao, Jingdong, Sina Weibo, Tmall, etc.), the relationships between brands are more complex. Within these two clusters the relationships include parent company and subsidiary, sibling brands, cross-industry cooperation, direct competition, cross-industry competition, and financing and mergers, sometimes several of these at once.
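The two building blocks of this section, counting brand co-occurrence and scoring nodes by betweenness centrality, can be sketched as follows. This is an illustration under assumptions, not the author's pipeline: the brand names and documents are placeholders, co-occurrence is counted per document with naive substring matching, and betweenness is computed on the unweighted graph with Brandes' algorithm.

```python
from collections import defaultdict, deque
from itertools import combinations


def cooccurrence_edges(documents, brands):
    """Count, for every brand pair, how many documents mention both brands."""
    weights = defaultdict(int)
    for text in documents:
        # Naive substring matching; a real pipeline would use word segmentation.
        present = sorted(b for b in brands if b in text)
        for a, b in combinations(present, 2):
            weights[(a, b)] += 1
    return dict(weights)


def betweenness(edges):
    """Unweighted betweenness centrality (Brandes' algorithm), undirected graph."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    graph = dict(graph)
    score = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        preds = {node: [] for node in graph}
        paths = {node: 0 for node in graph}
        dist = {node: -1 for node in graph}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Accumulate dependencies, starting from the nodes farthest away.
        delta = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected pair was visited from both of its endpoints.
    return {node: value / 2 for node, value in score.items()}


# Placeholder corpus: BrandB co-occurs with both of the other brands.
documents = ["BrandA partners with BrandB", "BrandB competes with BrandC"]
edges = cooccurrence_edges(documents, ["BrandA", "BrandB", "BrandC"])
print(edges)
print(betweenness(edges))
```

In this toy corpus BrandB lies on the only path between BrandA and BrandC, which is exactly the "intermediary" role that the node sizes in the network diagram are meant to convey. The co-occurrence counts returned by `cooccurrence_edges` would correspond to line thickness.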


The text mining part of this article draws on artificial intelligence (AI): keyword extraction, the LDA topic model and the ATM model belong to machine learning, while sentiment analysis, word vectors, word clustering and word classification involve deep learning. All of them are real applications of AI in data analysis.

In addition, this article is a practical exploration of data analysis methods, not a data analysis report. It focuses on sparking ideas and teaching people to fish; drawing specific conclusions is not its purpose, so the findings are scattered across the individual sections. Readers who insist on a conclusion at the end of an article are asked to go easy on it.

reference material:

1. Data source: Huxiu (tiger olfactory net) homepage, May 2012 - November 2017

2. "In data operations and data analysis, text analysis is far more important than numerical analysis! (Part 1)"

3. Scottish Fold Meow, "Why is text analysis more important than numerical analysis in operations? A practical case, five-point analysis (Part 2)"

4. Scottish Fold Meow, "How to use social listening to 'extract' valuable information from social media?"

5. Scottish Fold Meow, "As a qualified 'growth hacker', you also have to pay attention to the analysis of external data!"

6. Scottish Fold Meow, "Taking 'The Rise of the Qin Empire' as an example to talk about big data public opinion analysis and text mining"

7. "[Dry goods] Using big data text mining to gain insight into the current situation and trends of the 'shared bicycle' industry"

8. Word2vec Wikipedia entry, https://en.wikipedia.org/wiki/word2vec

9. "List of top 100 Chinese Internet enterprises issued by the Ministry of industry and information technology in 2016", http://tech.163.com/16/0712/18/brptfd6e00097u7r.html

10. Zong Chengqing, "natural language understanding: (06) lexical analysis and part of speech tagging", Chinese Academy of Sciences

11. "Understanding Convolutional Neural Networks for NLP", http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp

12. Yoon Kim, Convolutional Neural Networks for Sentence Classification

13. Hoffman, Blei, Bach. 2010. Online Learning for Latent Dirichlet Allocation

14. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

15. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.