Hacking Book | Free Online Hacking Learning


Zhihu Bay Area Machine Learning Sharing Meeting

Posted by herskovits at 2020-02-29

On May 19, 2017, we held our first machine learning sharing meeting in the San Francisco Bay Area. Li Dahai, head of Zhihu's big data team, shared Zhihu's thinking on machine learning and its application scenarios for the first time, followed by an on-site Q&A. The transcript follows:

Hello everyone, I'm Li Dahai. I'm a partner at Zhihu and the head of Zhihu's big data team. One of my important responsibilities at Zhihu is to promote the application and adoption of machine learning technology. It's a great pleasure to be in the Bay Area today to exchange ideas with you.

Today's sharing will start from two questions. The first concerns the current state of our machine learning applications: what we have done so far, and what we still need to do. The second looks to the future: what possibilities we hope machine learning can open up for Zhihu, and what revolutionary new products we might create.

First, to answer the question of how Zhihu uses machine learning, we need a brief introduction to Zhihu itself. This year is the sixth year since Zhihu's founding. Six years ago, when Zhihu first launched, it was a closed, invitation-only community. In that early period the number of users was small, and most discussion focused on the Internet and entrepreneurship; it looked like a niche website. What about today, six years later? Let's take a look.

In Zhihu's topic tag cloud, we can see that discussion has become very diverse: from the Internet to psychology, from movies to literature, from professional fields such as astronomy, data analysis, and deep learning, to everyday topics such as health, sports, travel, and photography. In terms of volume, there are now 15 million questions, 55 million answers, and a considerable number of column articles, spanning 250,000 topics. Today, Zhihu has become a very broad knowledge-based social platform.

Having looked at the content, let's look at the users. Over the past six years the user base has also grown rapidly and become more diverse. Zhihu hosts not only users who are already celebrities in real life, such as Li Kaifu, Ma Boyong, and Jia Yangqing, but also high-quality content producers who were once unknown and built their reputation through hard work on Zhihu, as well as institutional accounts such as @Qiongyou Jinnang (a travel guide), @China Science Expo, and @Microsoft Research Asia. To date we have 69 million registered users, over 20 million unique devices accessing and logging in to Zhihu every day, and tens of billions of page views every month.

These figures give an intuitive sense of Zhihu's scale. Zhihu has become the largest Chinese-language knowledge-based social platform in the world, and it is still growing rapidly. In 2016 we didn't spend a penny on promotion, yet registered users, DAU, and other indicators roughly doubled. Why can we sustain such rapid growth at this volume?

We have run user surveys to understand why people use Zhihu. Some users say they come to Zhihu to read evaluations of trending news and to see different perspectives on the same event collide. Others come for first-hand experience to help them make decisions: for example, how to budget a home renovation, or how to grow in the third year of one's career. Still others come mainly to share their professional knowledge. We have a real estate lawyer named Xu Bin who often answers legal questions about home buying on Zhihu. Because of the professionalism he showed in his answers, the Zhihu Live he hosted was also very popular: a Live session on how to rent a house without getting cheated has sold tickets to 8,000 attendees so far.

As you can see, users come to Zhihu to do one of two things: produce content or consume content. Together these two activities form Zhihu's ecological closed loop. What does that mean? More and better content gives Zhihu its stickiness and appeal as a platform, while the additional users it attracts in turn generate demand for more diverse content. At the same time, content producers also benefit: through sharing and exchanging knowledge they improve themselves and expand their influence. It is precisely because both the production and consumption needs of users are met within this closed loop that Zhihu has achieved its rapid growth and today's tens of millions of users and vast body of content.

So, to summarize "what is Zhihu?": Zhihu is a platform and a knowledge network connecting a large number of users. Our core goal is to keep the closed loop of content production and consumption running smoothly and to provide a serious and friendly environment for discussion.

Achieving this goal is relatively easy when the community is small. Simple product and operations strategies can keep content production and distribution efficient, and manual moderation can maintain a good community atmosphere. But at today's scale, operational pressure is growing just as fast: tens of millions of active users, hundreds of thousands of new pieces of content, and millions of interactions every day. Relying on human effort alone, community operations would become less and less efficient and increasingly infeasible.

So what do we need machine learning for? The short answer: to model users and content more precisely, in order to improve the efficiency of content production and distribution. Concretely this involves six application areas: user profiling, content analysis, ranking, recommendation, commercialization, and community management. Let me go through the work we have done in each of these six scenarios.

The first is user profiling. Accurate and effective user profiles are the foundation of all personalization strategies. We have built an initial user profiling system that mines important user attributes, such as: a user's activity level and recent login locations; a PeopleRank score measuring a user's influence; a producer's authority in a given field and a consumer's interest in a given field; and so on. These user tags are already used in tasks such as personalized ranking, recommendation, and question routing, with good results. Next we will mine further attributes. For example, we plan to run community analysis over users and locate the key nodes in the information diffusion network, the so-called KOLs. We also want the labels in user profiles to become more fine-grained and more predictive: if a user has recently been interested in the topic "health during pregnancy", we can reasonably speculate that in a few months they may become interested in topics related to "parenting". We also hope to reconstruct a user's persona and personal background from what they share. In short, we hope our accumulated work on user profiling lets Zhihu understand users more fully, portray them better, and better meet their needs.

Next, content analysis. Zhihu generates a large amount of new content every day, and it is heterogeneous: questions, answers, articles, and comments, with video coming later. This content needs to be labeled as soon as it is produced, so we built a unified content analysis pipeline: whenever a piece of content changes, it immediately enters the pipeline for processing, and the analysis results are synchronized in real time to search, recommendation, community, and other downstream services. Currently the average latency from a piece of content entering the pipeline to leaving it is about 10 seconds, which meets our business requirements well. Within this pipeline we perform basic analysis on text, images, audio, and so on: text classification, named entity recognition, pornographic and low-quality image detection, audio noise reduction, etc. We will gradually add more components. One important recent project is to characterize content quality along different dimensions, including timeliness, professionalism, seriousness, and accuracy. We also plan to use deeper semantic analysis to automatically generate content summaries, so that in information-dense scenarios such as the feed, users can make a preliminary judgment and filter content more efficiently without opening each card.
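The pipeline described above can be sketched as a chain of analyzers applied whenever a piece of content changes. The component names below (`classify_topic`, `detect_low_quality`) are hypothetical stand-ins for the real trained models, shown only to illustrate the structure:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ContentItem:
    content_id: str
    text: str
    labels: Dict[str, object] = field(default_factory=dict)


# Hypothetical analyzers; real components would wrap trained models.
def classify_topic(item: ContentItem) -> None:
    item.labels["topic"] = "machine-learning" if "model" in item.text else "general"


def detect_low_quality(item: ContentItem) -> None:
    item.labels["low_quality"] = len(item.text) < 20


class ContentPipeline:
    """Run every registered analyzer whenever a content item changes."""

    def __init__(self) -> None:
        self.analyzers: List[Callable[[ContentItem], None]] = []

    def register(self, analyzer: Callable[[ContentItem], None]) -> None:
        self.analyzers.append(analyzer)

    def process(self, item: ContentItem) -> ContentItem:
        for analyzer in self.analyzers:
            analyzer(item)
        return item


pipeline = ContentPipeline()
pipeline.register(classify_topic)
pipeline.register(detect_low_quality)
item = pipeline.process(ContentItem("a1", "A long answer about how the model is trained."))
```

In a production system each analyzer would be an independent service consuming a change stream, but the register/process shape is the same.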

User profiling and content analysis are foundational work and are largely invisible to users. Let me now introduce the higher-level business scenarios. First, ranking, a crucial part of content distribution: whether ranking is done well directly determines whether we can push the right content to a user at the right time. We mainly use pointwise and pairwise learning-to-rank algorithms. There are three typical ranking scenarios at Zhihu:

The first is the homepage feed. The feed is the first thing a user sees on entering Zhihu, and what users see in the feed largely shapes what they think Zhihu is. So the quality of the feed directly affects core metrics such as user retention, time spent, and browsing depth.

The second is the ranking of search results. We currently use learning to rank to solve the problem of blending different types of content;

The third scenario is ranking the different answers to the same question. Popular questions attract many answers, sometimes thousands, so how to rank them is a very important topic. Besides users' voting signals, we also consider textual features of the content itself, such as format, quality, and the relevance of an answer to its question. We also take into account the domain authority of the author and of the voting users, to ensure that professional answers are not buried.
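As a minimal illustration of the pairwise approach mentioned earlier, the sketch below trains a weight vector so that answers with higher relevance labels score above those with lower labels (a RankNet-style logistic loss on feature differences). The features and labels are toy values, not Zhihu's actual signals:

```python
import numpy as np


def pairwise_logistic_train(X, y, epochs=200, lr=0.1):
    """Learn w so that score(x_i) > score(x_j) whenever y_i > y_j."""
    n, d = X.shape
    w = np.zeros(d)
    pairs = [(i, j) for i in range(n) for j in range(n) if y[i] > y[j]]
    for _ in range(epochs):
        for i, j in pairs:
            diff = X[i] - X[j]
            p = 1.0 / (1.0 + np.exp(-w @ diff))  # P(i ranked above j)
            w += lr * (1.0 - p) * diff           # gradient step on the pairwise log-likelihood
    return w


# Toy features per answer: [vote score, author authority].
X = np.array([[0.9, 0.8],
              [0.5, 0.4],
              [0.1, 0.2]])
y = np.array([2, 1, 0])  # higher label = more relevant

w = pairwise_logistic_train(X, y)
scores = X @ w
order = np.argsort(-scores)  # indices from best to worst
```

Pointwise methods would instead regress each answer's label directly; the pairwise loss only cares about relative order, which matches the ranking objective more closely.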

Applying learning to rank in these scenarios has produced good results. Take the homepage as an example. Our original homepage ranking algorithm was an EdgeRank-like strategy; after two years of optimization its metrics had plateaued and were hard to improve further. After the switch, within three months the click-through rate increased by 40%, time spent increased by 20%, and retention improved slightly.

Of course, that's not enough. We are exploring further optimizations of learning to rank. Possible directions include:

One is multi-objective Pareto optimization. As you know, many business scenarios require optimizing several objectives at once; if you look at only one metric, it is easy to fall into a local optimum. Take homepage feed ranking: Zhihu's homepage must consider many metrics, including click-through rate, dwell time, browsing depth, the proportion of users who click "not interested", the proportion of users who comment and interact, and so on. At one point we shipped a version of the algorithm optimized mainly for click-through rate. CTR did rise considerably, but evaluation revealed that users' homepages filled up with low-quality content: clever-sounding quips, quarrels, clickbait, and the like. Such content grabs attention and gets clicked more, but offers users little value, and overexposing it damages the atmosphere and tone of the community. We therefore hope to bring ideas from economics, such as Pareto optimization, into the learning-to-rank setting to improve all metrics in concert.
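One simple building block in this direction is computing the Pareto front over candidate ranking models, keeping only those that are not dominated on every metric at once. The candidate models and their metric values below are invented for illustration:

```python
def pareto_front(points):
    """Return indices of points not dominated on any objective (all objectives maximized)."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            all(q[k] >= p[k] for k in range(len(p))) and
            any(q[k] > p[k] for k in range(len(p)))
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical offline metrics per candidate model: (CTR, dwell time, content quality).
candidates = [
    (0.12, 35.0, 0.60),  # clickbait-heavy: high CTR, low quality
    (0.10, 40.0, 0.80),
    (0.08, 38.0, 0.75),  # dominated by the second candidate on all three metrics
    (0.11, 30.0, 0.90),
]
front = pareto_front(candidates)
```

Candidates on the front represent genuinely different trade-offs; the CTR-only optimum is kept, but so are models that trade a little CTR for quality, which is exactly the choice the business then gets to make explicitly.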

The second direction is making good use of real-time features, so that the model reflects not only a user's stable preferences but also their current state, adjusting promptly. For example, a user may be a Barcelona fan who likes reading all kinds of discussion about the club, but if Barça lost yesterday's match he may be upset and not want to see any related content. Ideally, our feed service should detect this change as quickly as possible and adjust accordingly.
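One common way to realize this, and this is my own assumption rather than a description of Zhihu's implementation, is to blend a stable long-term interest score with a short-term signal that decays exponentially, so a burst of negative feedback fades away on its own:

```python
import math


class InterestProfile:
    """Blend a stable long-term preference with a decaying real-time signal."""

    def __init__(self, half_life_s=3600.0):
        self.decay = math.log(2) / half_life_s
        self.long_term = {}   # topic -> stable score from the user profile
        self.short_term = {}  # topic -> (score, last_update_timestamp)

    def observe(self, topic, signal, ts):
        score, last = self.short_term.get(topic, (0.0, ts))
        score *= math.exp(-self.decay * (ts - last))  # decay old evidence first
        self.short_term[topic] = (score + signal, ts)

    def score(self, topic, ts, alpha=0.5):
        lt = self.long_term.get(topic, 0.0)
        st, last = self.short_term.get(topic, (0.0, ts))
        st *= math.exp(-self.decay * (ts - last))
        return (1 - alpha) * lt + alpha * st


p = InterestProfile(half_life_s=3600)
p.long_term["barcelona"] = 1.0
p.observe("barcelona", -2.0, ts=0.0)      # strong negative real-time signal after a lost match
now_score = p.score("barcelona", ts=0.0)  # temporarily suppressed
later = p.score("barcelona", ts=36000.0)  # ten half-lives later, back near the stable preference
```

The half-life and blending weight `alpha` are hypothetical tuning knobs; the point is that the suppression is temporary without any explicit reset.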

The next business scenario is recommendation. Zhihu's recommendations fall into two types: recommending related content for a given piece of content, and recommending content a user may be interested in. Our recommendation work has been ongoing and business-driven. At the beginning of this year we launched a unified recommendation engine project: building on open-source systems such as PredictionIO and Elasticsearch, we built a unified recommendation framework for Zhihu and migrated most existing scenarios onto it. Two main tasks remain. The first is to implement a complete recommendation algorithm library supporting various algorithms, including the explore & exploit family, collaborative filtering, content-relevance recommendation, and wide & deep frameworks, to make small-traffic experiments more efficient. The second is to combine recommendation with ranking: injecting recommendation results into entry-level scenarios such as the homepage and search, ranked together with the original content, to explore and expand users' interests, help them quickly find other high-quality content, and help them discover a bigger world.
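As a concrete illustration of the collaborative filtering component in the algorithm library mentioned above, here is a minimal item-based variant over a hypothetical upvote log; a real system would of course use far richer signals and approximate nearest-neighbor search:

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse {user: weight} vectors."""
    common = set(a) & set(b)
    num = sum(a[u] * b[u] for u in common)
    den = (math.sqrt(sum(v * v for v in a.values())) *
           math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0


def related_items(interactions, target, k=2):
    """Item-based CF: items whose upvoter sets overlap most with the target's."""
    vectors = defaultdict(dict)  # item -> {user: weight}
    for user, item in interactions:
        vectors[item][user] = 1.0
    sims = [(other, cosine(vectors[target], vec))
            for other, vec in vectors.items() if other != target]
    sims.sort(key=lambda x: -x[1])
    return [item for item, s in sims[:k] if s > 0]


# Hypothetical upvote log: (user, answer_id).
log = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"),
       ("u3", "a"), ("u3", "c"), ("u4", "d")]
recs = related_items(log, "a")
```

Item-based CF fits the "related content" use case directly: the similarity lists can be precomputed offline and served as a simple lookup.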

Ranking and recommendation are relatively well-defined user scenarios whose main purpose is to match users with content efficiently. Commercialization, by comparison, is more complicated. As you know, the difference between commercial products and user products is that commercialization introduces a new role, the advertiser, requiring a balance among the interests of users, the platform, and advertisers. Zhihu's commercial exploration began last year. So far we have built a prototype business system implementing functions such as traffic forecasting, ad targeting, CTR prediction, and intelligent packaging. As our business scales up this year, we need more effective tools, armed with machine learning. For example:

Intelligent ad sales tools, to help the team better identify potential advertisers;

Ad quality prediction and review tools. Zhihu places great weight on user experience, and we are very cautious in our commercial exploration: we want users to be disturbed by ads as little as possible, and we hope ads can even bring value to users. Over the past year our commercial operations team has put great effort into assuring the quality of ad creatives, so users have generally accepted and understood the ads on the site, and ad conversion has been good. But once advertising scales up, this level of quality assurance cannot be achieved by manpower alone; we need machine learning mechanisms to assist human review of ad creative quality;

An intelligent advertiser platform, to help advertisers configure delivery plans and targeting schemes more optimally.

Of course, as commercialization progresses, we believe more machine learning challenges await us.

Finally, the last scenario: community management. A good community needs a good atmosphere for discussion. We use machine learning in much of the work that improves the efficiency of our community management colleagues, mainly including:

Spammer identification: there are many types of spammers, including crawler accounts, follower-farming accounts, and marketing accounts;

Automatic detection and handling of low-quality and policy-violating content, such as pornographic images and verbal abuse.
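Spammer detection of the kind listed above is often bootstrapped with simple behavioral features before heavier models are trained. The feature names, thresholds, and weights below are invented for illustration, not Zhihu's actual rules:

```python
def spam_features(user):
    """Hypothetical behavioral indicators for the spammer types in the list above."""
    return [
        user["follows_per_day"] > 100,                       # follower-farming pattern
        user["answers"] > 0 and user["link_ratio"] > 0.8,    # marketing account pattern
        user["requests_per_min"] > 60,                       # crawler-like access pattern
    ]


def spam_score(user, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of triggered indicators; a crude stand-in for a trained classifier."""
    return sum(w for w, hit in zip(weights, spam_features(user)) if hit)


normal = {"follows_per_day": 5, "answers": 12, "link_ratio": 0.1, "requests_per_min": 2}
bot = {"follows_per_day": 500, "answers": 3, "link_ratio": 0.95, "requests_per_min": 120}
flagged = spam_score(bot) >= 0.5
```

In practice such hand-written rules mainly serve to label an initial training set, after which a proper classifier over the same features replaces them.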

If we compare knowledge to a city, community management is infrastructure work, like building roads and dredging sewers. Only when a city's infrastructure is in place can it attract more people to settle; and as the city grows, this infrastructure work inevitably becomes more onerous. So going forward we plan to invest more effort in helping the community management team work efficiently:

On the one hand, further improving the accuracy and coverage of our low-quality content recognition, for example by adding the ability to identify marketing advertorials and online rumors, to minimize their damage to the community.

On the other hand, we hope to introduce intelligent customer-service bots to handle matters such as user reports and user feedback more efficiently. This technology is already well applied on some e-commerce sites. We hope it can reduce the community management team's workload, improve user satisfaction, and get users a response as quickly as possible.

The above briefly covers Zhihu's current use of machine learning, focused on six scenarios: user profiling, content analysis, ranking, recommendation, commercialization, and community management. In the final analysis, machine learning is being applied to content production and distribution to improve efficiency. We believe that as our work deepens and machine learning itself advances, we can keep doing better. Beyond this incremental improvement, what else do we want machine learning to do for us? In a word: we hope Zhihu will not only distribute content, but also understand it.

As you can see, in the closed loop of content production and consumption, Zhihu as a platform mostly plays the role of an information router, promoting the production of content and then delivering it to different users. If this content is precious ore, Zhihu today is more like a distribution center for ore; how much value gets extracted depends on the users themselves. In the future, we hope Zhihu can refine this ore and further improve the efficiency with which users acquire knowledge. For example, users have told us they like reading different opinions on the same piece of news. Imagine how valuable it would be if Zhihu could categorize, summarize, and aggregate these opinions, making it easier for users to grasp the whole picture and saving them from flipping through answers one by one.

This work involves two parts: extracting knowledge and viewpoints from the unstructured content users produce and turning them into part of a knowledge base; and turning the knowledge base into user-friendly products.

First, extracting structured information from unstructured content. The industry has existing practice and research here, such as knowledge graph technology. We hope to mine all kinds of knowledge and viewpoints from Zhihu's massive content, store them, and further index and use them. What makes building Zhihu's knowledge base hard? Mainly, the content and its forms are complex. Besides attribute and relation information that can be structured directly, it includes other forms of knowledge and opinion, such as discussion of scientific theorems or evaluations of events. Zhihu's content is also open-domain: discussion is not limited to specific fields but spans an open, ever-expanding set of domains; moreover, users' discussions often present very novel perspectives.

Of course, compared with other companies, Zhihu has advantages in structuring knowledge and viewpoints. Zhihu's content quality is relatively high, and users' interaction behavior together with the content forms an information-rich network. This gives our data a very high signal-to-noise ratio, which greatly facilitates information extraction.

Suppose we have magically completed the previous step and built a strong knowledge base: how do we turn it into products? Here too we hope machine learning can help. One idea is to build an intelligent question-answering product, using natural language generation to deliver information from the knowledge base in a more natural, understandable way. Going further, could we turn Zhihu into a smart brain, able to hold more natural conversations with users and deliver information in richer ways? As we all know, this is relatively easy if the domain is restricted, but very hard for a general knowledge base. It is a cutting-edge research direction; many institutions at home and abroad are doing similar research, and we hope Zhihu can build its own accumulation in this area.

That is the main content of today's sharing, covering two questions: the current state of machine learning applications at Zhihu, and our outlook for machine learning at Zhihu. Of course, realizing all this will require more machine learning experts to join us and increase the product value of Zhihu. That is what we came to the Bay Area to preach.

So next, a small advertisement. Yes, we are hiring: whether you are an algorithm guru or a fresh machine learning graduate, whether you are in China or abroad, no matter how far apart we are, we are eager for talent. At the very least, let's talk first.

(Resumes welcome; recruitment link: Zhihu online application system - Recruitment details)

By the way, one more small advertisement. This month we are running the "Zhihu Kanshan Cup Machine Learning Challenge", which officially starts on May 15 and ends on August 15. The training data provided by Zhihu consists of binding relationships between questions and topic tags, and participants' goal is to build the best automatic tagging model. We have provided 3 million questions and 2,000 tags, each tag corresponding to a topic on Zhihu. You are welcome to participate.
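For participants, a trivial baseline for this question-tagging task might score each tag by unigram overlap with the titles it was attached to in training. This is purely an illustrative sketch with made-up examples, not the competition's reference solution:

```python
from collections import Counter, defaultdict


def train_tag_model(examples):
    """Per-tag unigram counts: a naive one-vs-rest baseline for question tagging."""
    tag_words = defaultdict(Counter)
    for title, tags in examples:
        for tag in tags:
            tag_words[tag].update(title.lower().split())
    return tag_words


def predict_tags(model, title, top_k=2):
    """Rank tags by total count of the title's words in each tag's vocabulary."""
    words = title.lower().split()
    scores = {tag: sum(counts[w] for w in words) for tag, counts in model.items()}
    ranked = sorted(scores.items(), key=lambda x: -x[1])
    return [tag for tag, s in ranked[:top_k] if s > 0]


train = [
    ("how to train a neural network", ["machine-learning"]),
    ("best camera lens for travel", ["photography", "travel"]),
    ("does gradient descent always converge", ["machine-learning"]),
]
model = train_tag_model(train)
tags = predict_tags(model, "why does my neural network not converge")
```

A competitive entry would replace the unigram counts with learned text representations, but even this baseline gives a sanity-check score to beat.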

Thank you!

Q&A Session

Q: Among your open positions there are some related to computer vision. What exactly will they focus on?

A: Zhihu will soon support video, GIFs, and other formats, so that people can use richer media to convey ideas. With these formats comes the need to understand them, so on the CV side we need image and video understanding. In the content analysis pipeline we will add metadata for videos and images parsed offline. In its simple form this metadata is tags; in more complex forms it may involve the semantic content of particular frames. This work has only just started, and the specifics are still being planned.

Q: For the various models' effects, how do you run online experiments, and how do you evaluate their results? Do you simply sample some data at random, or is there something special?

A: Online experiments are essential. In addition, before launch we use offline metrics such as NDCG or AUC for a pre-launch estimate. At present the small-traffic experiment work is mainly done by the machine learning team itself, while validating the reliability of the results may require cooperation with the data analysis team. On your second question, my understanding is that you are asking how we do A/B testing in online experiments, right? Our current practice is fairly simple: we divide users into experimental and control groups, first ensuring the assignment is stable for any given user, then checking before launch that the metrics are stable between the groups, and after launch observing the reliability of the experimental results.
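The stable per-user assignment described in this answer is typically achieved by hashing the user ID together with an experiment name, so the same user always lands in the same group while different experiments split users independently. A minimal sketch (the experiment name and split percentage are hypothetical):

```python
import hashlib


def bucket(user_id, experiment, n_buckets=100):
    """Deterministically map a user to a bucket; stable across sessions and servers."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets


def assign(user_id, experiment, treatment_pct=10):
    """Put treatment_pct percent of buckets into the treatment group."""
    return "treatment" if bucket(user_id, experiment) < treatment_pct else "control"


# The same user always gets the same group for a given experiment.
g1 = assign("user_42", "feed_rank_v2")
g2 = assign("user_42", "feed_rank_v2")
groups = {assign(f"user_{i}", "feed_rank_v2") for i in range(1000)}
```

Because the hash is salted with the experiment name, reusing the same users for a second experiment re-shuffles them, avoiding correlated cohorts across experiments.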

Q: What is the working atmosphere of Zhihu?

A: Zhihu is a technology-driven company. Our team is young, energetic, and embraces hacker culture. There is a pirate flag in the office, and Paul Graham's essay collection "Hackers and Painters" is required reading for every crew member. At the same time, Zhihu is a company that values craftsmanship: we hope every pirate completes not just a "job" but a work of their own. Zhihu focuses on personal development potential and room to grow. We believe the world holds rich possibilities to explore, and everyone has their own unique soul and direction. Welcome aboard, and let's explore a bigger world together.