How do search engines work?
Answering this question in full would be very, very broad. To keep it answerable, I’m going to assume you have some basic familiarity with the web itself (pages, links, and browsers) and give a high-level picture of what a search engine actually does, and why it’s an interesting problem.
At their core, search engines do something simple to describe: they receive a list of keywords (or “search terms”) and try to find the web pages that best match them. For this purpose, a web page is just a document containing words, identified by its address, a “URL” (Uniform Resource Locator). The hard part is scale: the more pages there are, and the larger each page is, the harder it becomes to find every page containing those words quickly. That is why search engines build an index of the keywords that appear on each page, and answer queries from the index instead of re-reading the pages themselves.
Let’s take a single web page. If you were a search engine, how would you go about “learning” this page, so you could find all of the words in it? The simplest way is to just download the whole page: you send an HTTP request for the page’s URL and get back a document containing the page’s HTML. That is a fair amount of data, but fetching one page is not much of a problem for a search engine.
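Below is a minimal sketch of that download step in Python, using only the standard library; the URL is a placeholder for illustration, not a page any real engine is indexing.

```python
# A minimal sketch of the "download the page" step, standard library only.
# The URL is a hypothetical placeholder.
from urllib.request import urlopen

url = "https://example.com/"  # hypothetical page to index
with urlopen(url) as response:
    html = response.read().decode("utf-8", errors="replace")

print(html[:200])  # the raw HTML the search engine will have to parse
```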
A naive approach would be to simply download the page, pull out every word it contains, and add those words to the index. Even the URL can be broken into keywords: “www.example.com” would be indexed under “www”, the site’s name, and “com”, which also means the engine has to know that “www” is just part of a web address and that “com” is a top-level domain rather than a meaningful keyword.
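Here is a toy version of that “extract the words and add them to the index” step. The tag stripping, tokenizer, and sample page are all simplified for illustration; a real indexer is far more careful.

```python
# Toy indexing: strip the markup, split the text into words, and record
# which URL each word appeared on. Everything here is simplified.
import re

def tokenize(text):
    # Lowercase and keep only runs of letters/digits as "keywords".
    return re.findall(r"[a-z0-9]+", text.lower())

index = {}  # word -> set of URLs that contain it

def add_page(url, html):
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag removal
    for word in set(tokenize(text)) | set(tokenize(url)):
        index.setdefault(word, set()).add(url)

add_page("https://www.example.com/bananas",
         "<html><body>Bananas are a fruit.</body></html>")
print(index["bananas"])  # {'https://www.example.com/bananas'}
print(index["www"])      # the URL itself was tokenized into www / example / com
```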
To discover the pages to index in the first place, search engines use a program called a “spider” (or “crawler”). A spider never interacts with you as a visitor; it runs on the search engine’s side. It downloads a page, records the page’s actual content, extracts the links to other web pages, and then follows those links to find more pages to download, over and over. This is what “crawling” means.
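A bare-bones sketch of that crawl loop is below, assuming a hypothetical seed URL. Real crawlers add politeness delays, robots.txt handling, retry logic, and much smarter queueing; this only shows the download, extract, follow cycle.

```python
# A minimal crawler ("spider") sketch: download a page, pull out its
# links, and queue them to visit next. The seed URL is a placeholder.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    queue, seen = [seed], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to download
        parser = LinkCollector()
        parser.feed(html)
        # Resolve relative links against the page they were found on.
        queue.extend(urljoin(url, link) for link in parser.links)
    return seen

print(crawl("https://example.com/"))
```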
The whole crawling process is very time-consuming. Crawling an entire website can easily take days or weeks depending on the number of pages, and the result is never perfectly complete or up to date. Search engines therefore keep the results of previous crawls and use them to decide which pages to revisit and how often, which keeps the overall crawl time down.
The payoff of all of this is speed at query time. Because the keywords are already in an index, a search engine can “look up” your search terms in anywhere from a few milliseconds to tens or even hundreds of milliseconds, roughly the time it takes your browser to fetch a page. It can’t just return every page where the words happen to appear, though; there’s a whole ranking process to get the right ones.
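To make the speed point concrete, here is a sketch of a query against a tiny in-memory inverted index: the engine just intersects the precomputed lists of pages for each word, without re-reading any page. The index contents are invented for the example.

```python
# Why the lookup is fast: intersect precomputed per-word page lists
# instead of re-reading documents at query time. Data is made up.
index = {
    "banana": {"pageA", "pageC"},
    "bread":  {"pageB", "pageC"},
    "recipe": {"pageC", "pageD"},
}

def lookup(query_words):
    results = None
    for word in query_words:
        pages = index.get(word, set())
        results = pages if results is None else results & pages
    return results or set()

print(lookup(["banana", "bread"]))  # {'pageC'}, found without reading any page
```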
Some search engines look at the words in context and do a more “sophisticated” match. For example, if you search for “banana”, the engine may also match pages that only contain “bananas”, because it knows the two forms mean the same thing. It also takes into account things like how often the words occur, where they occur (the title versus the middle of a sentence), and how the search terms relate to the rest of the page’s text. Humans use a lot of different words for the same meaning, and search engines have to be able to recognize that.
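A toy ranking pass in that spirit: crudely normalize plural forms so “banana” also matches “bananas”, then score pages by how often the query words occur. The sample documents and the one-line “singularizer” are purely illustrative; real engines use far richer signals.

```python
# Toy scoring: normalize a trailing "s", then count query-word occurrences.
from collections import Counter

docs = {
    "pageA": "banana bread needs three ripe bananas",
    "pageB": "a banana a day",
}

def normalize(word):
    return word[:-1] if word.endswith("s") else word  # crude singularizer

def score(query, text):
    counts = Counter(normalize(w) for w in text.lower().split())
    return sum(counts[normalize(q)] for q in query.lower().split())

for page, text in docs.items():
    print(page, score("bananas", text))
# pageA scores 2 (banana + bananas), pageB scores 1
```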
Some search engines also allow extra “tags” to be attached to a page: additional keywords that describe it even if they don’t appear in its text (advanced search forms sometimes call these “keywords” as well). For example, a page tagged “puppy” might also be found under “dog”, even though “puppy” is the primary keyword. Multiple tags can be combined to help narrow down the results of a search.
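Here is a sketch of how such tags might be folded into the index, with an invented table of related words so the “puppy” page also shows up for “dog”:

```python
# Index page-level tags alongside related words so a search for "dog"
# can surface a page tagged "puppy". The related-word table is invented.
related = {"puppy": ["dog"], "automobile": ["car"]}

tag_index = {}  # word -> set of URLs

def index_tags(url, tags):
    for tag in tags:
        for word in [tag] + related.get(tag, []):
            tag_index.setdefault(word, set()).add(url)

index_tags("https://example.com/puppies", ["puppy", "training"])
print(tag_index["dog"])  # the puppy page shows up for "dog" as well
```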
So, in sum:
1. The search engine receives “keywords” (or “tags”) to look for.
2. To know which pages exist, it “spiders” them: it connects to a website, downloads a page, and works out the page’s format and structure, its URL, and whether it is an HTML document.
3. It extracts the keywords from the page: the words in the body text and in the title, “meta” tags, and the “alt” labels on images and other media, plus any extra tags the page carries, paying attention to the order in which the words appear, since that is how humans actually write.
4. It extracts the links in the page and analyzes them, both the outgoing links to other pages and, where possible, the other pages that link back to this one, to learn whether they point at relevant pages. This whole download-and-follow process is what “crawling” refers to.
5. Finally, the search engine estimates which pages have the right keywords and which do not, and ranks them, so the “hot” pages, the ones with the most or the highest-ranked keywords, come out on top.
All of this takes time. For every site, the search engine has to connect to the website, download the pages, crawl them, extract and analyze the links, follow those links, and finally estimate which pages carry the right keywords, and it has to repeat this for a huge number of pages across a huge number of websites.
On top of this, there is a lot of processing to do. If you use a web browser like Firefox, the little progress bar in the status area mostly reflects the browser fetching and rendering the page’s HTML, the code the web page is written in. A search engine doesn’t render anything on a screen, but it still has to parse all of that HTML to get at the actual words, and many people don’t realize how much HTML goes into a web page: the markup is often several times larger than the visible text.
The search engine also needs to store all of the data it has gathered in a “search index”, with a way to write it out and a way to retrieve it quickly. This is usually done with a database. A database is simply a collection of records, organized into tables, where each record is typically identified by a “primary key”. The search index is essentially a lookup from keywords into those records: the database holds the page data, and the index points into it.
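One simple way to make that persistent, sketched with SQLite from Python’s standard library: a small table mapping each keyword to a page URL, with a database index on the keyword column so lookups stay fast. The schema and rows are illustrative only.

```python
# Persisting a tiny inverted index in SQLite; schema and data are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")  # would live on disk in a real engine
conn.execute("CREATE TABLE postings (word TEXT, url TEXT)")
conn.execute("CREATE INDEX idx_word ON postings (word)")

postings = [("banana", "https://example.com/a"),
            ("banana", "https://example.com/b"),
            ("bread",  "https://example.com/b")]
conn.executemany("INSERT INTO postings VALUES (?, ?)", postings)

rows = conn.execute("SELECT url FROM postings WHERE word = ?", ("banana",))
print([url for (url,) in rows])
```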
Now, as you can see, this takes a lot of processing. The biggest problem is that websites are constantly changing. Pages get redesigned or reorganized, the content of a site can change dramatically, “meta” tags change or disappear entirely, the structure of a site changes, and links are added and removed. Since crawling a site takes a lot of time in the first place, the index is always at least slightly out of date.
Because of all of these things, search engines need to keep improving. One technique is what’s often called “predictive text”: looking at what people typically search for and using that to guess what you mean. Search engines have also learned that the keywords (or tags) a person types aren’t always the words the pages actually use, so they will often replace or expand the query words with closely related ones, to be sure they find the right page.
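A sketch of that replace-or-expand step, with an invented table of related words; a real engine would learn such relationships from query logs and page text.

```python
# Expand the user's query words into a slightly larger set of related
# words before looking them up. The related-word table is invented.
related = {"cheap": ["inexpensive", "budget"], "fix": ["repair"]}

def expand(query):
    words = query.lower().split()
    expanded = set(words)
    for w in words:
        expanded.update(related.get(w, []))
    return expanded

print(expand("cheap bike fix"))
# {'cheap', 'inexpensive', 'budget', 'bike', 'fix', 'repair'}
```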
Many other things have been learned over the years about how to crawl, index, and rank pages well, but that basic loop of crawling, indexing, and looking things up is still the heart of how search engines work.