Rewrite the Technological Landscape

Chapter 40 Search Engine Algorithms

It was close to one o'clock in the afternoon when Meng Qian arrived in Pudong, Shanghai. It was the first time he had come to Shanghai since his rebirth, though in his previous life it had been a place he visited often.

As China's financial center, Shanghai was the calling card the country showed to the world.

In that life, however, Meng Qian had first come to Shanghai in 2007; he had never seen the Shanghai of 2000.

In the Pudong of this era, high-rises were beginning to go up alongside large factories and shantytowns. Driving through, one could see demolition and redevelopment underway everywhere.

"Mr. Zhang plans to put the branch in Pudong?" After arriving at the destination, Meng Qian relied on his memory to compare. If he read correctly, this should be Zhangjiang High-tech Park.

Of Pudong's four key development zones, the financial center Lujiazui and the technology center Zhangjiang were probably the two best known to the outside world.

In the Zhangjiang of 2000, the leading industries were integrated circuits, software, and biomedicine.

Zhang Shuxin nodded in affirmation, "The places with the most potential for development in the south are undoubtedly Shenzhen and Shanghai Pudong, and the Zhangjiang Hi-Tech Park is a treasure trove of technology incubation."

In those days, when people discussed the development potential of southern cities, especially in science and technology, no one thought of Hangzhou.

When they arrived at the space Zhang Shuxin had rented, five men were waiting there, two of them obviously foreigners.

Zhang Shuxin introduced them one by one. Of the two foreigners, one came from IBM and the other from Google; evidently they had either already been poached or were about to be. Both had previously worked on search engine project teams.

Of the three Chinese men, one was Ying Haiwei's own technical director, and the other two had come back from Silicon Valley: a Stanford graduate who had worked at Intel, and a Harvard graduate who had worked at Oracle. All of them were genuine talents.

After brief greetings, everyone sat down in the meeting room, and then it was time for Meng Qian to perform. Today he would demonstrate his core search engine technology.

A search engine draws on web crawling, retrieval and ranking, web page processing, big data processing, natural language processing, and more. Of course, in 2000, big data and natural language processing did not yet exist in anything like the form later generations would know.

But to put it simply, the core is actually one thing: the algorithm.

Because every technology is inseparable from algorithms.

"I don't know much about the accomplishments and understanding of everyone here in terms of search engines. I can only continue at my own pace. If anyone has any questions, they can interrupt me at any time." Meng Qian walked to the blackboard and got straight to the point.

"Before I show my core technology, let's take a look at the three mainstream algorithms, Baidu's hyperlink analysis, Google's PageRank algorithm and IBM's HITS algorithm.

Almost everyone considers Baidu's hyperlink analysis the most backward of the three, but some things deserve to be looked at from more than one angle: to a certain extent, hyperlink analysis is the very foundation on which search engines developed.

Some voices claim Google actually plagiarized Baidu's hyperlink algorithm; after all, Robin Li's patent does predate Google's. We won't speculate on whether that's true, but the claim signals something important: whichever company you look at, the underlying algorithmic basis is the same.

Crawl web page information, rank those pages by some mechanism, and when a user enters keywords, match the ranked pages against the keywords.

So where does Baidu lose out? The key is that Baidu's basic ranking method is too simple: among all the results for a given search, the more hyperlinks from other pages point to a page, the higher its value.
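
A minimal Python sketch of that raw in-link counting; the toy link graph here is hypothetical, not anything from the chapter:

```python
# Rank pages purely by how many other pages link to them.
from collections import Counter

# Hypothetical toy link graph: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

in_links = Counter(target for targets in links.values() for target in targets)
ranking = sorted(in_links, key=in_links.get, reverse=True)
print(ranking)  # 'C' ranks first with three in-links
```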

In contrast, Google's PageRank adds two important things. The first is to interpret a link from page A to page B as a vote cast by A for B, with Google evaluating the ranks of both A and B to derive new ranks.

That is, every page carries a PR value, and your page's PR value in turn becomes a reference for the PR values of the pages you link to.

The second is to compute each page's PR repeatedly. Give every page an arbitrary starting PR value, and after enough rounds of calculation the values stabilize; that is, they converge.
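
A minimal sketch of that iterative PR computation, assuming the standard 0.85 damping factor and a hypothetical three-page graph:

```python
# Iteratively compute PR values until they converge.
links = {  # hypothetical toy graph: page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
damping = 0.85
pr = {page: 1.0 / len(links) for page in links}  # arbitrary starting values

for _ in range(50):  # repeated calculation drives the values toward convergence
    pr = {
        page: (1 - damping) / len(links)
        + damping * sum(pr[src] / len(outs)
                        for src, outs in links.items() if page in outs)
        for page in links
    }

print(pr)  # stable PR values: each page's vote is weighted by its own PR
```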

As for HITS, its theoretical basis is unchanged. Its biggest feature, or rather its biggest change, is the recognition that PageRank's even distribution of weight across a page's links does not match how links actually behave.

So the HITS algorithm introduces another type of page, the hub page: a page that provides a collection of links pointing to authoritative pages.

Search results from HITS are therefore more authoritative than those of the other two, but the algorithm greatly increases the computational burden. Right?"

Meng Qian glanced at the man from IBM, who froze for a moment and then nodded uncertainly.
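
A minimal sketch of the hub/authority iteration HITS performs, again over a hypothetical toy graph; real HITS runs per query on a query-specific subgraph, which is part of the extra computational cost just mentioned:

```python
# HITS: hubs point at good authorities; authorities are pointed at by good hubs.
links = {  # hypothetical toy graph: page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["B", "C"],  # D is a pure hub: it only collects links to others
}
hub = {p: 1.0 for p in links}
auth = {p: 1.0 for p in links}

for _ in range(20):
    # Authority score: sum of the hub scores of the pages linking in.
    auth = {p: sum(hub[s] for s, outs in links.items() if p in outs) for p in links}
    # Hub score: sum of the authority scores of the pages linked out to.
    hub = {p: sum(auth[t] for t in links[p]) for p in links}
    # Normalize so the repeated summing does not blow up.
    a = sum(v * v for v in auth.values()) ** 0.5
    h = sum(v * v for v in hub.values()) ** 0.5
    auth = {p: v / a for p, v in auth.items()}
    hub = {p: v / h for p, v in hub.items()}

print(auth)
print(hub)
```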

"Now, to summarize briefly: the algorithmic basis of search engines is hyperlink analysis, and an algorithm's strengths and weaknesses lie in how much reference value its results offer, in how effectively it lets users obtain the information they need.

Of course, if you could directly understand a user's need and fetch exactly the content he most wants, that would be the ideal search engine, but everyone knows that's impossible.

So what the quality of a search engine really decides is whether, for the same keyword, comparatively more people obtain the content they want.

If ten users search on Google and five find what they want, while six would find it with our engine, then under the current technical conditions of this field, we are the better one.

On the basis of that understanding, what I want to introduce next is my own search engine algorithm: the dynamic rule hyperlink analysis algorithm.

The dynamic rule hyperlink analysis algorithm has the following changes.

First, as we just said, a good search engine is the one whose results for the same keyword better meet users' needs. So when a user searches for something, the results he most likely wants to see are those most vertically related to it.

For example, when a user searches for cars, whether he wants to buy one or simply learn about them, professional car-related pages will help him more.

So in my algorithm, I first score the vertical rate of the links pointing to a site. Suppose ten websites link to site A: the outcome should differ depending on whether all ten are automobile sites or none of them are.

There is also a small psychological factor here: peers rarely link to one another. So a site that still attracts many links from vertical sites in its own field must be more professional, more reliable, than one linked to only by a jumble of unrelated sites.
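
A minimal sketch of this vertical-rate scoring; the topic labels and the same-topic weight are hypothetical stand-ins for whatever classifier and weighting the real algorithm would use:

```python
# Score inbound links, counting links from same-topic ("vertical") sites higher.
def vertical_score(target_topic, linking_site_topics, same_topic_weight=2.0):
    return sum(
        same_topic_weight if topic == target_topic else 1.0
        for topic in linking_site_topics
    )

# Ten automobile sites linking to an automobile page...
print(vertical_score("cars", ["cars"] * 10))  # 20.0
# ...beats ten links from a jumble of unrelated sites.
print(vertical_score("cars", ["food", "news", "games", "travel"] * 2
                     + ["blog", "forum"]))    # 10.0
```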

Second, establish a popularity ranking mechanism for the keyword database. The existing search engine companies all rank web pages; I rank keywords as well, and ranking keywords is very simple: just look at search volume.

For example, if cars draws the most searches today, it might score 10 points, and the algorithm will then allocate more resources to car-related information, grabbing more high-quality pages.

This brings four benefits: faster information feedback, more timely coverage of hot topics, savings in computing resources, and focus on the ultimate goal, letting more of our users obtain useful information.
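
A minimal sketch of this second change: rank keywords by search volume and let hot keywords claim a larger share of the crawl budget. The volumes and budget below are hypothetical:

```python
# Allocate crawl resources in proportion to keyword search volume.
search_volume = {"cars": 50_000, "insurance": 20_000, "recipes": 5_000}
crawl_budget = 1_000  # total pages the crawler may fetch this cycle

total = sum(search_volume.values())
allocation = {
    keyword: round(crawl_budget * volume / total)
    for keyword, volume in search_volume.items()
}
print(allocation)  # hot keywords like 'cars' get the most crawl slots
```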

Third, the user feedback mechanism: track users' clicks and browsing behavior.

Sticking with the car example: if 100 users search for cars and 80 of them click page A, page A's rating rises. If users stay on page A longer, its rating rises further, and if more users go on to act on page A, following its links and so on, the rating rises again.

In other words, user feedback points are added into the overall page rating system.
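
A minimal sketch of this third change, folding clicks, dwell time, and follow-up actions into a page's rating; the three weights are hypothetical:

```python
# Turn tracked user behavior into feedback points for a result page.
def feedback_score(clicks, searches, avg_dwell_seconds, follow_up_actions,
                   w_click=10.0, w_dwell=0.1, w_action=0.5):
    click_rate = clicks / searches if searches else 0.0
    return (w_click * click_rate
            + w_dwell * avg_dwell_seconds
            + w_action * follow_up_actions)

# 80 of 100 searchers clicked page A, stayed 90 seconds on average,
# and 30 went on to follow links from the page.
print(feedback_score(clicks=80, searches=100, avg_dwell_seconds=90.0,
                     follow_up_actions=30))  # 32.0
```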

Fourth, the rule algorithm, which looks for high-probability behaviors within the mass of user behavior and hands those patterns over to human staff. For example: 60% of users who search for cars will search for insurance next.

We cannot predict such patterns ourselves, but algorithms can mine them from the data, and the returned results let the human analysis team score certain pages. That is manual scoring.
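
A minimal sketch of this fourth change: mine query logs for high-probability follow-up searches and flag them for the human analysis team. The session data is hypothetical:

```python
# Count how often one query follows another across user sessions.
from collections import Counter

sessions = [  # hypothetical query logs, one list per user session
    ["cars", "insurance"],
    ["cars", "insurance", "loans"],
    ["cars", "reviews"],
    ["recipes", "cooking"],
    ["cars", "insurance"],
]

follows, starts = Counter(), Counter()
for queries in sessions:
    for first, second in zip(queries, queries[1:]):
        follows[(first, second)] += 1
        starts[first] += 1

for (first, second), count in follows.items():
    probability = count / starts[first]
    if probability >= 0.6:  # high-probability behavior worth flagging
        print(f"{probability:.0%} of '{first}' searches lead to '{second}'")
```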

Combining the four points above: under my algorithm, every web page likewise carries a score, which I call the precision score.

The factors that affect the precision score include the page's own score, its vertical link score, its user feedback score, its manual score, the influence of its external links, and so on."
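
A minimal sketch of how the listed factors might combine into the precision score; the chapter names the factors but not the formula, so the weighted sum and its weights below are pure assumption:

```python
# Combine the named factors into one "precision score" (weights are assumed).
def precision_score(self_score, vertical_link_score, feedback_score,
                    manual_score, external_link_score):
    weights = (0.3, 0.25, 0.25, 0.1, 0.1)  # hypothetical weighting
    factors = (self_score, vertical_link_score, feedback_score,
               manual_score, external_link_score)
    return sum(w * f for w, f in zip(weights, factors))

print(precision_score(7.0, 8.5, 6.0, 9.0, 5.0))  # 7.125
```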

Afterwards, Meng Qian briefly walked through the algorithmic logic and derivation formulas of each branch.

However, when Meng Qian reached the last item, the rule algorithm, Jeff from IBM suddenly stood up and exclaimed, "OH MY GOD! Artificial Intelligence?!"

Meng Qian turned his head and glanced at the other party, frowning.

Jeff paused, thinking Meng Qian hadn't understood, and said it again in a strange accent, "Damn!!!"

And with Jeff's interruption, the eyes of the other four technicians, who had been immersed in Meng Qian's presentation, also changed visibly.

