Operation and maintenance event management for tens of millions of users

Posted by patinella at 2020-03-18

This article is adapted from the GOPS 2017 Shanghai talk "The road of event operation and maintenance for tens of millions of Internet securities users".

About the author

Yuan Yougao, an IT professional, joined Ping An Securities in 2013 and took part in and witnessed the establishment and growth of the production event group.

The theme of this article is "the road of event operation and maintenance for tens of millions of Internet securities users".

1、 Background

This article is organized into the following parts. Let's start with the background.

1.1 Internet transformation

First, a little company history. Ping An Securities began its securities business in 1991. With policy liberalization in 2013 and 2014 and a favorable market, the business went through a major transformation. Drawing on the comprehensive financial strengths of Ping An Group, we set up a plan to acquire customers at massive scale.

The traditional R&D team could not keep up with business growth at the time, so we urgently built an Internet R&D team. As the figure shows, from 1991 to the end of 2014 our user base was only a little over 1.5 million, but with the growth of the business in the following years it reached 10 million by the end of 2016.

What problems did this growing number of users bring to our R&D team as a whole? What happened to us?

1.2 Impact of the surge

Let's look at the impact of the user explosion. In 2014, as business volume trended upward, the number of events reported by users began to rise slowly, and complaints from the business departments increased.

By April 2015, operations had become the scapegoat. When an event came in, we did little analysis and simply forwarded it to R&D. R&D, in turn, had to keep developing other requirements while being constantly interrupted. They began to complain that there were too many requirements and no time to investigate production problems, so production events were not handled well.

Our situation at the time was far from ideal, and the user experience declined sharply. How could we get out of this awkward position? The Ping An Securities incident handling group was set up under these circumstances. Before establishing it, we studied how other companies worked, such as Ctrip and Ele.me, as well as Ping An Group's telephone customer service center.

1.3 Establishment of the incident handling group

Let's first look at how they work, starting with 95511. It takes a purely business-oriented approach to helping users: staff only need to look up some simple information, and their main job is to collect and distribute problems. Most problems can only be resolved with the help of the R&D team.

Ctrip and Ele.me are different. Their approach is technology-led and involves querying server logs; only a small share of the more complex problems flows on to the R&D team. Because the impact of event handling on our whole R&D team kept growing, we finally decided to set up an event handling group led by technology and supplemented by business knowledge. Its members come from development and testing, so the work requires a certain level of technical skill.

We started with the account opening business of Ping An Securities. As we kept summarizing what we learned about this system, very few of its problems flowed on to the R&D team. Other systems were by then facing the same awkward situation, so we took them on one after another.

So far, there are 8 production event handling teams in Shanghai and Shenzhen. We now cover 10 core systems, including the account system of Ping An Securities, and our main work is to analyze and track events and provide solutions.

1.4 Event handling process

Let's look at how the event handling group handles an event. The first step is accepting the event. When users have problems, they go to their channel manager or wealth manager, whom we collectively call account managers. The account manager makes a simple analysis and then reports the problem to the event handling group. After receiving it, we carry out a series of analyses: we query the ELK log system, the app's tracking (instrumentation) data, and the database to reach a judgment.

When we find a compatibility problem, or cannot find the user's log at all, we hand the event to a tester to reproduce the production problem. If we run into something we cannot solve, we hand it to the R&D team for assistance. If we find problems with historical data, we prepare a script for our O&M colleagues to review and execute.

If we find a product configuration problem, we contact the product manager and operations colleagues to correct the product information. Finally, after an event is handled, we record it in the knowledge base with a description of how the problem occurred and the steps to solve it. Our goal is that colleagues who meet the same problem can solve it easily by consulting the knowledge base.
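The routing described above can be sketched roughly in Python. This is only an illustration: the `Findings` fields, the `Handler` names and the order in which the checks are applied are assumptions made for the example, not the team's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Handler(Enum):
    TESTER = "tester"                 # reproduces the problem in the test environment
    RND = "R&D team"                  # assists when the group cannot solve it
    OPS = "O&M"                       # reviews and executes a data-fix script
    PRODUCT = "product/operations"    # corrects product configuration
    EVENT_GROUP = "event handling group"


@dataclass
class Findings:
    """Hypothetical summary of what the analysis step turned up."""
    compatibility_issue: bool = False
    user_log_found: bool = True
    historical_data_issue: bool = False
    product_config_issue: bool = False
    solvable_by_group: bool = True


def route(findings: Findings) -> Handler:
    """Pick who handles the event next (check order is an assumption)."""
    if findings.compatibility_issue or not findings.user_log_found:
        return Handler.TESTER
    if findings.historical_data_issue:
        return Handler.OPS
    if findings.product_config_issue:
        return Handler.PRODUCT
    if not findings.solvable_by_group:
        return Handler.RND
    return Handler.EVENT_GROUP
```

For example, `route(Findings(user_log_found=False))` sends the event to a tester, while an event with no special findings stays with the event handling group.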

We also decide whether to add an event to our monitoring platform according to how often it occurs and how severe it is. The topics below cover the difficulties we encountered while handling events and how we addressed them with a technology-based approach.

2、 Reporting channel

Let's take a look at the reporting channel.

2.1 Original reporting channels

When account managers encounter problems, they have to report the events to the event handling group through a system. Ping An's event management follows established specifications: there is an ITSM system that has been in use for more than ten years and is connected to the group's OA system. So far it serves more than 10 professional subsidiaries, with more than 800 task channels and an annual volume of several million operations.

But one channel broke our original reporting method. As mentioned above, our user base reached 10 million by the end of 2016, and average daily account openings went from 200 to 20,000. Account openings increased 100-fold, trading volume rose sharply, and the number of events multiplied. As a result, account managers found submitting to ITSM cumbersome, incidents were not handled in a timely way, and complaints increased.

Ping An is a comprehensive financial group with insurance, banking and securities businesses. In 2015 and 2016, many of our users were brought in by insurance agents. When those agents encountered problems, they too had to submit them to ITSM. But insurance agents spend most of their time at customers' sites rather than in the office, so submitting was difficult for them.

2.2 WeChat channel

To solve the problem of the blocked reporting channel, we urgently set up a WeChat channel. There are now two large groups in the WeChat channel, each with 500 account managers.

2.3 Problems with WeChat groups

The idea was to let account managers feed problems back to the event handling group via WeChat. But after running this for a while, we found that the WeChat channel had many problems.

First, the cost of reporting is low, so the quality is poor. A report is often a single sentence such as "stocks cannot be traded". That is hard for the event handling group to act on, because we need the user's information to query the user's logs before we can judge, so the back-and-forth communication is costly.

Second, with so many people in a WeChat group, messages interleave and scroll past quickly, so we miss a relatively high share of reports.

Third, problems tracked in a WeChat group are hard to maintain and analyze.

Fourth, we do not have the manpower to reply to every message in the WeChat groups one by one, so our efficiency is also low.

2.4 MSS, a mobile feedback system

At the same time, we realized that insurance agents needed a mobile feedback system, since many of them are at customers' sites rather than in the office. So we built MSS, a mobile feedback system that lets them submit problems and get feedback on a self-service basis.

MSS is not particularly complex: it is a simple H5 front end with a back end. Reporting is convenient. You only need to enter the user's information, a description of the user's problem, and a screenshot of the problem to report it to the event handling group.
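To make the contrast with a one-line WeChat message concrete, here is a minimal sketch of what a structured report might look like. The `MssReport` class, its field names and the `status` value are hypothetical; the talk only says that user information, a problem description and a screenshot are required.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MssReport:
    user_info: str                      # hypothetical: identifies the affected user
    description: str                    # what went wrong, in the reporter's words
    screenshots: List[str] = field(default_factory=list)  # e.g. file names
    status: str = "submitted"           # assumed status value for progress queries


def missing_fields(report: MssReport) -> List[str]:
    """Return the names of required fields that are empty."""
    missing = []
    if not report.user_info.strip():
        missing.append("user_info")
    if not report.description.strip():
        missing.append("description")
    if not report.screenshots:
        missing.append("screenshots")
    return missing
```

A report containing only the sentence "stocks cannot be traded" would be rejected with `["user_info", "screenshots"]`, which is exactly the information the event handling group otherwise has to chase by hand.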

After receiving a problem, we handle it in the MSS back end. Reporters can then check the progress of the problems they submitted in the front end. After MSS had been running for some time, we ran a number of surveys. Many account managers said that after we answered a question they still wanted to ask follow-up questions.

So we added an interactive chat function, shown in the second picture. If they have questions about an event, they can send offline messages, and we reply to each one in the MSS back end until they understand.

In this way we broke our dependence on WeChat, though the process was painful because we had to communicate with the account managers again and again. To make event handling more timely, we also introduced a WeChat enterprise account. When a problem is reported, it notifies the event handling group right away that a new event has come in and needs to be handled immediately.

When we handle the event in the MSS back end, a message is also pushed to the reporter so they know a solution is available and can help the user resolve the problem promptly. The mobile feedback system thus solved the problems of disordered messages, blocked channels and inefficient handling.

2.5 Implementation effect

Let's look at the trend chart since MSS was introduced. At first, most problems were reported to our event handling team through the WeChat channel. Through continuous improvement and promotion of the system, 90% of problems are now submitted through MSS.

3、 Data construction

Let's take a look at the third part, data construction.

3.1 Reproducing data

As mentioned, when we find compatibility problems or cannot find the user's log at all, we hand the event to our test colleagues to reproduce the production problem. However, we often found it hard for them to construct test data in the test environment. The original approach was that we and the testers accumulated a large number of SQL scripts; whenever we needed an account, we ran a series of operations against the database to get the account we wanted. This approach had four problems.

The first is poor accuracy. There are a large number of SQL scripts and stored procedures with strong dependencies among them, so it is easy to make mistakes.

The second is high maintenance cost. When developers change a business scenario, the test data we construct may no longer be correct, so the cost of communicating with them is high.

The third is that there are many data sources.

The fourth is that switching back and forth between databases is time-consuming and laborious.

3.2 Functional characteristics

So what kind of platform do we need? It must construct test data in the test environment accurately and quickly. It must construct both regular and complex data to support multiple scenarios and businesses. And it must offer a visual interface that is easy to operate, instead of a large pile of SQL scripts.

3.3 UAT data construction platform

So we proposed building a platform, and built a data construction platform together with the testers: they provided the business logic and we did the coding. The UAT data construction platform does not have many functions; we only need regular data construction, complex data construction, and reports.

Regular data construction, shown below, is easy: you only enter the user's name and ID number, and the platform constructs the account information we want. The interface is relatively simple. The interface for complex data construction is not shown here; it has many more elements, such as whether to add available and withdrawable funds and whether to set a risk assessment level.

Now for the implementation. We originally planned to have the code call our accumulated scripts to produce the account we wanted, but because the R&D team frequently changes business scenarios, maintaining that code would have meant a great deal of work. In the end, we construct regular data through the business logic of the interfaces, which takes 35 interface calls. You can imagine how complex the back end of data construction is.

3.4 Data construction system

The data construction platform works across three systems:

The first is the account system, where the user data is set up.

The second is the third-party depository system, where the account is bound to third-party and multi-party depository.

The third is the trading system, where some agreements need to be added to the account.
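Constructing data by calling business interfaces in sequence, rather than running SQL scripts, can be sketched as a small pipeline. Everything here is a simplified stand-in: the three step functions, their return values and the result format are hypothetical, and the real platform chains far more interface calls than this.

```python
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]
Step = Tuple[str, Callable[[Context], Context]]


# Hypothetical stand-ins for calls to the three systems' interfaces.
def open_account(ctx: Context) -> Context:
    return {**ctx, "account_id": "TEST-0001"}


def bind_depository(ctx: Context) -> Context:
    return {**ctx, "depository_bound": True}


def add_agreements(ctx: Context) -> Context:
    return {**ctx, "agreements": ["trading"]}


def construct(name: str, id_number: str, steps: List[Step]) -> Dict[str, Any]:
    """Run each step in order; stop and report the step that failed."""
    ctx: Context = {"name": name, "id_number": id_number}
    for label, step in steps:
        try:
            ctx = step(ctx)
        except Exception as exc:  # e.g. the test environment is unavailable
            return {"status": "failed", "failed_step": label,
                    "error": str(exc), "data": ctx}
    return {"status": "success", "data": ctx}


REGULAR_STEPS: List[Step] = [
    ("account system", open_account),
    ("depository system", bind_depository),
    ("trading system", add_agreements),
]
```

Calling `construct("Test User", "000000", REGULAR_STEPS)` returns a `success` result with the accumulated account data. Because each step goes through the same interface the business code uses, a change in a business scenario is picked up without rewriting scripts, which is the motivation given above.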

3.5 Usage of the data construction platform

Let's look at usage. We started building the data construction platform in June this year. Once it was built, we first gave it to the event handling team and the testers for a trial, and asked them to feed back change requests so we could improve the system. So far, 80-90 pieces of test data are constructed on the platform every day, not only for reproducing production events; most of the data is test accounts needed for testing releases.

Next is the distribution of results, divided into success, failure, and in progress. The 10% failure rate is caused by the environment being unavailable while a release is being deployed.

Finally, there are per-person usage statistics. We have also measured efficiency: compared with constructing a piece of test data manually in the traditional way, even with a very rich set of scripts and good familiarity with the database table structure, constructing it through the platform is about 8 times more efficient.

4、 Service Center

4.1 Data analysis

Let's look at the fourth part, our service center, starting with data analysis, which is a data visualization platform. We need to analyze how the various systems are running. Compiling data trends by hand every week or every month is difficult, so we wondered whether we could accumulate the daily events and analyze them regularly.

So we brought the scattered data together so that the data as a whole has some logic to it. We collect the ITSM and MSS data described above, put it into a data warehouse, and display it through the data visualization platform. Today I will mainly go through the weekly report and the difficulties it reveals.

4.2 Event analysis

Above is the weekly report of the event handling group, with the trend lines for event resolution rate and timely resolution rate removed. We group all kinds of problems into three categories: data, program, and consultation. The consultation trend reflects questions caused by product interaction, by new business rules, or by changes to existing business rules.

The chart shows the data trend over the past two months. On average, the consultation category accounts for more than 80% of the total in each period. Such a large volume is a headache for our event handling team, because these consultation events have a great impact on our workload.
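The per-period share of each category can be computed with a few lines of Python. This is a minimal sketch assuming each event is recorded as a `(period, category)` pair; the record format and the `share_by_period` name are illustrative, not taken from the reporting platform.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

CATEGORIES = ("data", "program", "consultation")


def share_by_period(events: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Return, for each period, the fraction of events in each category."""
    per_period: Dict[str, Counter] = defaultdict(Counter)
    for period, category in events:
        per_period[period][category] += 1

    shares: Dict[str, Dict[str, float]] = {}
    for period, counts in per_period.items():
        total = sum(counts.values())  # at least 1 for any period present
        shares[period] = {c: counts[c] / total for c in CATEGORIES}
    return shares
```

With ten events in a week, eight of them consultations, `share_by_period` reports a consultation share of 0.8 for that week, the kind of figure that prompted the knowledge-sharing efforts described next.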

We then discussed with our operations colleagues how to reduce the share of consultation events. First, product interaction needs to be designed in a more user-friendly way. Second, we documented the knowledge base we had summarized and sent it to the account managers so that they would know some simple solutions, but the effect was not particularly good.

At this point I had a bold idea: could we bring these colleagues and account managers together and share our knowledge through interactive communication?

4.3 Live broadcast platform

In the end, we chose Ping An Group's Zhiniao ("bird of knowledge") live broadcast platform for knowledge sharing. Acting as the speakers, we proactively walk the account managers through difficult and miscellaneous problems so they know how to deal with these common issues. The practice is simple: every Tuesday at eight in the evening we do a live knowledge broadcast. Each session lasts an hour, with 30 minutes from the main speaker, 10 minutes on the topic, and 15 minutes of interactive Q&A.

Our live broadcasts are divided into four phases. The first is event process sharing, which explains how to report an event correctly to the event handling group and what priority levels the group uses. The remaining phases go through the issues one by one so that account managers know how to help users solve the problems they report.

4.4 Live broadcast data

Let's look at the live broadcast data, which consists of the number of live viewers and the total number of viewers. The number of live viewers counts those who are online during the broadcast; the total also includes replay views. After each broadcast we send out messages so that colleagues who missed it can watch the replay.

We also analyze events regularly, weekly or monthly. If we find problems that account managers are struggling with, we cover them in a live knowledge broadcast. The practice is easy and the cost is low, yet the effect is good; you just have to dare to try.

4.5 QA Service Center

Each week we summarize the hot issues into the knowledge base and use it to train the account managers and share knowledge with them.

During the interactive part of our live broadcasts, many account managers said they would like to look up our solutions in a system rather than in the documents we send them.

Each of our systems does have its own help center, but these are hard for account managers to search. So we built a QA Service Center, a help center specifically for account managers.

The help center gives the service desk a self-service option. It mainly holds business processes and known problems, along with the hot issues of each system.

4.6 QA service center display

Now let's look at the QA service center, which consists of a home page, a list page, and a content page. The home page contains a module for each system, and solutions can be searched by keyword.

Below that are the hot issues. We summarize the issues we have dealt with most often recently and put them in the QA service center.

Next is the content page. Many of the flow charts are produced by the event handling group, but we have the product manager confirm them before publishing. Unless something special comes up, we update the content once a week, after discussing the hot issues at the event handling group's weekly meeting.
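A keyword lookup over per-system knowledge entries, as on the home page described above, might look like the following sketch. The `Entry` fields and the simple substring matching are assumptions for illustration; the talk does not describe how the search is actually implemented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    system: str      # which system's module the entry belongs to
    question: str    # the known problem or business process
    solution: str    # the steps to resolve it


def search(entries: List[Entry], keyword: str,
           system: Optional[str] = None) -> List[Entry]:
    """Case-insensitive substring match, optionally limited to one system."""
    kw = keyword.lower()
    return [
        e for e in entries
        if (system is None or e.system == system)
        and (kw in e.question.lower() or kw in e.solution.lower())
    ]
```

An account manager searching for "password" within the account system module would get only the entries whose question or solution mentions it, instead of paging through a document.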

5、 Monitoring

Let's take a look at the fifth part, monitoring.

5.1 From passive to active

In the past, even when a problem affected a large number of users, we only learned where the problem was after users reported it, and only then set about solving it. There was no early warning. Ideally, by the time users want to report a problem, we already know about it, are already working on it, or have already solved it. We felt this gap keenly.

So we need to turn the passive mode into an active one: raise alerts and handle problems in batches to improve the user experience.

Ping An Securities already has plenty of monitoring, large and small, covering CPU, traffic, disk space and other major metrics. These matter less to the event handling group, because we need to build our own monitoring based on our own rules.

Here are some rules we summarized while handling the account opening business. We mainly monitor account opening volume, account opening details, and the distribution of account opening channels.

5.2 Monitoring implementation

As mentioned above, while handling events we decide whether a monitoring requirement should be added based on the number of events and their severity level. We regularly observe the trend of events and then add monitoring accordingly.
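Such a decision rule could be written down as a simple predicate. The thresholds and the numeric severity scale below are purely hypothetical; the talk gives no concrete values.

```python
def should_add_monitoring(event_count: int, max_severity: int,
                          count_threshold: int = 10,
                          severity_threshold: int = 3) -> bool:
    """Suggest monitoring when a problem is frequent or severe.

    Assumes a higher severity number means a more serious event;
    both thresholds are illustrative defaults, not real settings.
    """
    return event_count >= count_threshold or max_severity >= severity_threshold
```

Under these assumed defaults, a problem seen 12 times at low severity and a problem seen once at severity 3 would both be proposed for monitoring.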

As an example, take the monitoring of status inconsistencies between our account system and trading system. This project has not been online for long. We built it mainly because the user complaints caused by this problem were very troublesome.

First, the background. Every day, the trading system and the account system each receive clearing files, run clearing, and update the status of shareholder cards in the database. But we often found that the status of some users was inconsistent between the two systems, so we set up monitoring.
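The core of such a check is a comparison of the two systems' records after clearing. A minimal sketch, assuming each system's data has already been loaded into a dictionary mapping a shareholder card ID to its status (the loading step, the ID format and the status values are all hypothetical):

```python
from typing import Dict, List, Optional, Tuple

Mismatch = Tuple[str, Optional[str], Optional[str]]


def find_mismatches(account_status: Dict[str, str],
                    trading_status: Dict[str, str]) -> List[Mismatch]:
    """List cards whose status differs, or that exist in only one system."""
    mismatches: List[Mismatch] = []
    for card_id in sorted(set(account_status) | set(trading_status)):
        a = account_status.get(card_id)   # None if missing from account system
        t = trading_status.get(card_id)   # None if missing from trading system
        if a != t:
            mismatches.append((card_id, a, t))
    return mismatches
```

Running this after each day's clearing yields the list of abnormal records, which can then feed a count, a details page and an alert.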

5.3 Monitoring effect

Now let's look at the results of our monitoring. First is the distribution chart of normal and abnormal data, followed by the number of abnormal records. You can click through to the list details page to see the specific user data.

The third part is how we alert. So far we mainly alert by email, and we resolve the problems after receiving the email. Finally, the indicators: since monitoring went live, we have recently received no complaints from users about this. However, the root cause still needs to be fixed by R&D, who have to check their own code.
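An email alert of this kind can be assembled with Python's standard library. The sketch below only builds the message; the subject format, the addresses and the `(card_id, account_status, trading_status)` tuple layout are assumptions for illustration.

```python
from email.message import EmailMessage
from typing import List, Optional, Tuple

Mismatch = Tuple[str, Optional[str], Optional[str]]


def build_alert(mismatches: List[Mismatch], sender: str,
                recipients: List[str]) -> EmailMessage:
    """Build an alert email listing each inconsistent record."""
    msg = EmailMessage()
    msg["Subject"] = f"[monitor] {len(mismatches)} status mismatches found"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    lines = [f"{card}: account={a}, trading={t}" for card, a, t in mismatches]
    msg.set_content("\n".join(lines))
    return msg
```

The resulting message could be sent with `smtplib.SMTP(host).send_message(msg)`, where `host` is whatever mail relay is available in the environment.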

So by turning passive handling into an active process and handling problems for users in batches, we can improve the user experience.

6、 Summary

Let's take a look at the summary, which is divided into the following three parts.

First, we need to pay attention to every link in the event handling chain. Where there is a problem, we address it with a technology-based approach.

Second, try to improve efficiency through tools and platforms.

Third, we need to innovate constantly in methods and give the best support to front-line colleagues.
