Mario Fischer is publisher and editor-in-chief of Website Boosting and has been fascinated by the optimization possibilities of the web since its earliest days. He advises well-known companies of all sizes and across all industries and teaches in the newly founded e-commerce degree program at the University of Applied Sciences in Würzburg, Germany.
We usually use them several times a day, are happy about good hits and don't waste a thought on how it is even possible to find ten or more matching results out of several million documents within fractions of a second – and to do so several thousand times per second, simultaneously. Search engines: how do these gigantic data collections actually work, and how do they manage to take dull prose queries à la "how tall is justin beaver??" and realize that nobody here wants to know anything about the average size of a rare dam-building rodent?
Figure 1: Schematic, highly simplified structure of a search engine
How to build a search engine, or rather which individual systems are necessary for it, is known in principle. The secret lies in the efficiency and thus in the countless details of the algorithms that collect, process and store data and retrieve it on demand, while at the same time filtering out manipulated documents. These algorithms are usually the result of many thousands of person-years of research and development by companies such as Google, Bing, Yahoo!, Yandex, Baidu and others, and it is all too understandable that the details are kept secret. The real processes inside search engines are by nature extremely complex, and describing them would go far beyond the scope of this article. In many places, things are therefore greatly simplified and presented only in principle. The basic principles of information retrieval (IR) are, however, essentially the same everywhere, and for a basic understanding of how a search engine works this is sufficient. This knowledge is, of course, not only useful for SEO.
In simplified terms, a search engine essentially consists of three large processing modules:
- The crawler (also called bot or robot), which fetches documents on command from the scheduler, pre-categorizes them and passes them on for storage in the repository.
- The indexer, which fetches the unstructured data from the repository, prepares it and passes it to the search index for storage.
- The query processor (also called query engine), which takes in the search words via the user interface, parses them and retrieves the results from the word index.
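To make this division of labor more tangible, here is a minimal, purely illustrative Python sketch of the three-part pipeline. All class names, data structures and the toy "index" are invented for this example and do not correspond to any real search engine code.

```python
# Illustrative sketch of the three cooperating modules of a search engine.
# All names and data structures are invented for this example.

class Crawler:
    def fetch(self, url: str) -> str:
        """Fetch the raw HTML of a URL (stubbed here with a placeholder)."""
        return f"<html><body>Content of {url}</body></html>"

class Indexer:
    def build(self, repository: dict) -> dict:
        """Turn raw documents into an inverted index: word -> list of URLs."""
        index = {}
        for url, html in repository.items():
            for word in html.lower().split():
                index.setdefault(word, []).append(url)
        return index

class QueryProcessor:
    def search(self, index: dict, query: str) -> list:
        """Look up each query word in the index and collect matching URLs."""
        hits = []
        for word in query.lower().split():
            hits.extend(index.get(word, []))
        return hits

# The scheduler decides which URLs are crawled and when (here: just a list).
schedule = ["https://example.com/a", "https://example.com/b"]
crawler, indexer, qp = Crawler(), Indexer(), QueryProcessor()

repository = {url: crawler.fetch(url) for url in schedule}  # crawl -> repository
index = indexer.build(repository)                           # repository -> index
print(qp.search(index, "content"))                          # query -> results
```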
Crawler and scheduler
The crawler is responsible for obtaining information from the web. The addresses to be visited are assigned to it by the scheduler, which determines when a web address (URL) is checked again for changes. Depending on whether a document has changed since the last version, the visit intervals are shortened or, if appropriate, extended. This ensures that, for example, an imprint page, whose content usually changes rarely, does not have to be crawled as often as, say, a page with a blog post and growing comments. The challenge is to determine these intervals as accurately as possible, so that on the one hand no valuable resources are wasted, but on the other hand the index stays sufficiently up to date. Problems are caused by pages with duplicate content or documents that, from the crawler's technical point of view, contain little or no text (e.g. Flash or Ajax programming). If a content management or shop system generates dynamic web addresses, the crawler never finishes its work: it constantly finds "new" URLs which in reality always lead to the same content. Such unfavorable constellations are called a "spider trap" – even if they are usually not set up intentionally. Basically, the scheduler tries to have every new address found via links called up, thus constantly growing the index.
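The adaptive revisit logic described above could look roughly like the following sketch. The concrete factors – halving or doubling the interval, the hourly bounds – are pure assumptions for illustration, not values any search engine is known to use.

```python
# Hypothetical adaptive crawl scheduling: shorten the revisit interval when a
# page has changed since the last crawl, lengthen it when it has not.
MIN_INTERVAL_H = 1        # assumed lower bound: crawl at most hourly
MAX_INTERVAL_H = 24 * 30  # assumed upper bound: crawl at least monthly

def next_interval(current_hours: float, changed: bool) -> float:
    if changed:
        # content changed -> come back sooner next time
        return max(MIN_INTERVAL_H, current_hours / 2)
    # content unchanged -> waste fewer resources, come back later
    return min(MAX_INTERVAL_H, current_hours * 2)

interval = 24.0  # start by revisiting daily
for changed in [True, True, False, False, False]:
    interval = next_interval(interval, changed)
    print(f"changed={changed} -> next crawl in {interval:.0f} h")
```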
"Number games"
About 1.4 million new web pages (individual pages) go online for the first time every second. Per second! For a layperson it is hard to imagine how such a mass of new information can be absorbed and processed practically in real time. In addition, the crawlers must also regularly refresh the already existing data stock of usually several billion pages. If Google, with its 60 billion indexed pages, were to update even half of them just once a year, that alone would mean roughly another 57 million additional pages to crawl. Experts know that after announcing a new blog page via "ping", for example, the page is already available and findable in the index after a few seconds – a performance that is hard to imagine even for experts. At this point it also becomes clear why you cannot simply build a new search engine, as Jacques Chirac and Gerhard Schröder tried to do with "Quaero" – in a rather grandiose and, in terms of expertise, rather naive manner. In the end, not much more was left of the project than many sunk tax millions from EU funding pots.
Store server
The store server coordinates the handling of the crawled pages and forwards them for further processing. Basically, it should be noted that search engines work document-oriented. The fact that different documents can be reached under one domain is an indication that they belong together, but this is not necessarily always the case. If, for example, the store server receives an error code from the web server, this is noted and the scheduler attempts to crawl the page again at a later time. If pages are permanently unavailable or have been moved (announced by the server via a 301 redirect, if one has been set up), the active entries are removed. Duplicate content is also recognized here, via the computation of so-called hash codes, and marked as such. Blacklists, i.e. lists of undesirable web pages or sites, are also managed here, as are terms that have to be censored in certain countries for legal reasons.
If the "acceptance check" carried out according to the stored rules is positive, the store server passes the page on to the repository and the document index.
Document index
Here, a unique ID (DocID) is generated for each URL, and important key figures are stored, such as the IP address, the time of the last update, the document status, the document type and much more. Again, for reasons of space and performance, no plain-text information is stored; instead, all values are converted into simple digits. For example, instead of the status code 200, a 1 could simply be entered at the corresponding position, a 2 for a 404 code, and so on. A further conversion into hexadecimal code saves yet more storage space.
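The space-saving idea of replacing plain-text values with small numeric codes can be sketched as follows; the concrete code tables and the record layout are invented purely for illustration.

```python
# Invented code tables: plain-text status information is replaced by small
# numbers (and could be packed further, e.g. as hexadecimal / binary fields).
STATUS_CODES = {200: 1, 404: 2, 301: 3}   # hypothetical mapping
DOC_TYPES = {"html": 1, "pdf": 2}

def make_docid_record(docid: int, ip: str, status: int, doctype: str, ts: int) -> tuple:
    """Store only compact numeric values instead of verbose plain text."""
    return (docid, ip, STATUS_CODES.get(status, 0), DOC_TYPES.get(doctype, 0), ts)

record = make_docid_record(4711, "93.184.216.34", 200, "html", 1690000000)
print(record)          # (4711, '93.184.216.34', 1, 1, 1690000000)
print(hex(record[0]))  # further compression, e.g. hexadecimal: 0x1267
```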
Repository
In this database the documents delivered by the store server are stored as copies. For this purpose, each document is assigned a unique ID in addition to its URL, and each version is timestamped. If a new version of a document is delivered, the entry is updated and the changes relative to the previous version are archived. It is not known how many versions of a web page are kept and for how long, but it can be assumed with a high degree of certainty that there are upper limits and special algorithms that limit the storage explosion caused by frequently changing documents or, in such cases, store only key figures instead of actually keeping every version of a page.
Indexer/parser
The indexer is the heart of a search engine's information processing. It accesses the repository and calculates a list of relevant search words from the stored source code of a document. These are then stored in the search index. At this point, most search engines also draw on lexicons of all kinds. This makes it easier, for example, to check spelling, link synonyms and detect homonyms (identical words with different meanings, such as "golf" as a car or as a sport). Further lexicons hold common first names, place and country names and other special entries. This makes it possible to determine that "helmut" is most likely a male first name and thus part of a person's name, or that "munich" is a place name. The reality of the data model is of course much more complex than can be presented here. Leading search engines also work with so-called triple stores and thus store relationships between entities (objects). Via these components a search engine "learns". If it recognizes that a "store" usually contains "products", that a "product" has a "value", i.e. a price, and that a "store" is always linked to a real address, then the semantic relationships between words can be understood better. This is how search engines like Google or Bing know that the Eiffel Tower is in Paris, who built it and when, how tall it is – or who the current partner of a celebrity is. This information helps enormously to classify the parsed terms better and, using the ontologies built from it, ultimately to deliver even more relevant search results.
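The principle of a triple store – storing relationships between entities as subject/predicate/object triples – can be shown in a few lines. The triples below are illustrative examples, not an excerpt from a real knowledge graph.

```python
# A tiny triple store: facts are stored as (subject, predicate, object).
triples = [
    ("Eiffel Tower", "located_in", "Paris"),
    ("Eiffel Tower", "height_m", "330"),
    ("Eiffel Tower", "built_by", "Gustave Eiffel"),
]

def query(subject: str, predicate: str) -> list:
    """Answer simple questions such as 'where is the Eiffel Tower?'."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(query("Eiffel Tower", "located_in"))  # ['Paris']
print(query("Eiffel Tower", "height_m"))    # ['330']
```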
Normalization
The first step of the indexer is to "normalize", i.e. homogenize, the information found so that it can be reasonably stored in databases in a uniform format. From a machine's point of view, documents on the web consist of relatively unstructured data. In addition to readable text, an HTML document also contains instructions used for formatting and programming purposes. Readable text, in turn, appears unstructured in navigation areas, as anchor text, in enumerations, in body copy and so on. During normalization, the indexer cuts off all non-text information and thus extracts the pure content.
Figure 2: Normalization extracts readable text
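A drastically simplified version of this extraction step can be written with Python's standard library HTML parser; production systems naturally handle far more edge cases (scripts, styles, boilerplate navigation, encodings and so on).

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Strip tags, scripts and styles and keep only the readable text."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], False
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True
    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

extractor = TextExtractor()
extractor.feed("<html><h1>Western boots</h1><script>var x=1;</script>"
               "<p>Buy great boots here.</p></html>")
print(" ".join(extractor.parts))  # "Western boots Buy great boots here."
```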
Tokenizing
In the next step, the text is broken down into so-called tokens (token = element) to make individual words identifiable. This is not as trivial as it may sound: spaces and punctuation marks are not always word separators, as the following examples show.
- We met at 7.15 p.m.
- We met at 7.15 p.m. in St. Augustin.
- "Why," asked Dr. O'Hara-Meyer from New York, "is e-commerce that *complicated*??"
Are "251.23" two numbers or is it a decimal point, which is common in the american language, i.e. our comma, which puts the two numbers in an important context?? Does it make sense to treat the words "new" and "york" separately?? Thus, tokenizing is about finding semantically related words or phrases. identify elements. Since the error rate would be too high with simple rules ("a period ends a sentence"), modern search engines use sog. neural networks to improve recognition rates. For ease of comparison, all words are converted to lower case.
Language identification
The language in which an HTML document is written can be declared by the site operator in the "language" meta tag. However, since comparatively few webmasters do this relative to the mass of pages, and since incorrect declarations are not unlikely due to insufficiently adapted templates, search engines cannot rely on this self-disclosure. Usually it is sufficient to check a set of unique tokens against dictionaries to achieve relatively reliable recognition of the language actually used. This is not only important for filtering according to the searcher's language, but also in order to carry out the following steps correctly – because they sometimes vary considerably from language to language.
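Checking tokens against small per-language word lists is enough for a rough guess, as this sketch shows. The mini-dictionaries are of course far too small for real use and only demonstrate the principle.

```python
# Tiny, purely illustrative "dictionaries" of very frequent function words.
LANG_WORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "und", "ist", "von", "nicht"},
    "fr": {"le", "la", "et", "est", "de", "ne"},
}

def guess_language(text: str) -> str:
    tokens = set(text.lower().split())
    # count how many known function words of each language occur in the text
    scores = {lang: len(tokens & words) for lang, words in LANG_WORDS.items()}
    return max(scores, key=scores.get)

print(guess_language("Der Turm ist nicht in Berlin"))  # de
print(guess_language("The tower is not in Berlin"))    # en
```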
Word stemming
Linguistics speaks of the lemmatization of lexemes (the basic linguistic form). The aim is to trace a word back to its base form, for example "praised" back to the verb "praise" or the noun "praise". This reduction has several advantages. On the one hand, the number of words that have to be stored in the (inverted) search index is reduced. On the other hand, the common base form can also be used to identify more web pages on a topic as potentially relevant for a search. If one website says "justin bieber's age is 14" and another says "justin bieber is 14 years old", both could provide an answer to a corresponding search query. If a search engine only dumbly matched the word "old" from the query "how old is justin bieber?" against the occurrence of the words "justin bieber" and "old" on web pages, the results would be limited to exact term matches. Google has therefore been using word stemming since 2004. Yahoo!, Bing and other search engines use these methods as well, although, as in most cases, the exact extent is unknown. The art of stemming is to balance the system between merely reducing plural forms to the singular on the one hand and rampant exaggeration with a corresponding increase in errors on the other. Of course, the information about the actual word form used is not lost in stemming; it can be attached to the lexeme, i.e. the base form, when it is stored.
In stemming, for example, inflected or derived forms such as "found" or the German "auffinden" (to locate) would be assigned to the base form "find" and stored under it with a code number:
wordID: 4712 /find/ found (1); finds (2); findable (3); finding (4); auffinden (5); …
#Fundort; #Fundstück; #Fundsache …
According to this (completely fictitious) scheme, instead of storing memory-intensive individual words, one only has to store, for example, 4712-5, which can be converted back to the database entry if needed. 4712 identifies the row and the inflection number 5 the form "auffinden". Since it is known that all entries with the same word ID (here 4712) go back to the same word root, comparisons can be carried out much better and, above all, much faster. One of the best-known and most frequently used stemming algorithms is certainly Porter's (for more information see einfach.st/port1).
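The following sketch shows the basic idea with a hand-rolled mini-stemmer and the fictitious wordID scheme from the example above; all suffix rules and numbers are invented, and the real Porter stemmer is considerably more sophisticated.

```python
# Extremely simplified suffix stripping (the real Porter stemmer has many
# more rules), combined with the fictitious wordID/inflection scheme.
SUFFIXES = ["ings", "ing", "able", "ed", "s"]

def crude_stem(word: str) -> str:
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Fictitious inverted mapping: base form -> (wordID, {inflection: number})
lexicon = {"find": (4712, {"found": 1, "finds": 2, "findable": 3, "finding": 4})}

for token in ["finding", "finds", "findable"]:
    base = crude_stem(token)
    word_id, forms = lexicon[base]
    print(f"{token} -> stored as {word_id}-{forms[token]}")
```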
Determining stop words
Not all terms are relevant for searches. The more often a word occurs in texts in general, the less suitable it is as a precise search term – a search for "this" or "that" makes little sense. Such words therefore play only a minor role in identifying a topic and in assessing the importance of a web page for a fitting keyword. Search engines therefore sort out stop words such as "and", "a", "that" etc. in a separate step before further processing. This reduces the number of words for further analysis and later storage even more.
Figure 3: Google even compares whether the results should be determined with or without stop words (source: Google, US patent US7409383)
The following 100 words – translated here from the German originals, which is why some entries such as "the" or "one" appear several times – represent almost half (47.1%) of the words used in German texts:
the, and, in, to, the, not, of, they, is, the, itself, with, the, that, he, it, a, I, on, so, one, also, as, at, after, as, in, for, one, but, out, by, if, only, was, still, become, at, has, we, what, will, be, one, which, are, or, to, about, have, one, me, about, him, these, one, you, us, there, to, can, yet, before, this, me, him, you, had, his, more, at, because, now, among, very, self, already, here, to, have, their, then, them, his, all, again, my, time, against, of, quite, individual, where, must, without, one, can, be
It is immediately clear that such words are not suitable as search terms in documents, because they have little meaning of their own (except perhaps "time") and occur several times in practically every text – and thus far too often. They are typical candidates for stop words. A list of English stop words can be found, for example, at van Rijsbergen under einfach.st/esw1.
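Once such a list exists, the removal itself is trivial, as this sketch shows; the short list here is only a stand-in for the full lists used in practice.

```python
# A deliberately tiny stand-in for a real stop word list.
STOP_WORDS = {"this", "is", "a", "and", "here", "you", "can", "the", "of"}

def remove_stop_words(text: str) -> list:
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("This is a great website and here you can buy great western boots"))
# ['great', 'website', 'buy', 'great', 'western', 'boots']
```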
Search word extraction and relevance determination
After eliminating all interfering words and words less suitable for describing a document, the core algorithms can subject the remaining words to a relevance assessment for that document. One can imagine that the words are assigned scores according to various criteria – the ones below are of course only examples, since such details are among the great secrets of the search engine operators. The following evaluation scheme would not be completely unrealistic: a document contains, among other terms, the word "western boots". It occurs in the title of the document, right at the beginning; for this, the word could be assigned a first 5 points. In addition, it appears in a text marked with H1 (H stands for headline) (3 points) and in three of six H2 headlines (2 points). In the body text, the word is used once in bold (1 point) and once in a bulleted list (1 point). It also appears more often at the beginning of text sections than at the end (1.5 points). An arguably important component is relevance measurement via WDF*IDF factors (see the cover article in issue 18 for more details). For this purpose, the frequency of the word "western boots", for example, compared to all other words in the document is determined, compressed by a logarithm, and related to the word's ability to be a good, selective search term (the IDF value). Incidentally, nouns play a more important role than verbs and adjectives in determining relevance.
The latter is about how often the word "western boots" appears in all other documents in the index. If it occurs in very many documents, it is rather unsuitable for precisely determining the relevance of the document at hand. An example: according to a Google query, the word "red" appears in about 241 million documents, "western boots" in only 366,000. "Food suction ring", on the other hand, appears on only 74 websites worldwide. That a word becomes increasingly suitable as a relevant term for a document or web page as its frequency across the web decreases seems quite logical. WDF*IDF now relates the relative frequency within a document to the frequency in all other documents. It is safe to assume that, given the current state of the art, synonyms and strongly related words are also included in this calculation, so that the algorithms do not simply count words.
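A common textbook formulation of the two factors is WDF(i) = log2(freq(i) + 1) / log2(L), with L being the number of words in the document, and IDF(t) = log(ND / ft), with ND being the number of documents in the index and ft the number of documents containing the term. Whether the search engines use exactly these formulas is not public, so the following sketch should be read purely as an illustration of the principle, with assumed numbers.

```python
import math

def wdf(term_freq: int, doc_length: int) -> float:
    """Within-document frequency, dampened by the logarithm."""
    return math.log2(term_freq + 1) / math.log2(doc_length)

def idf(total_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency: rare terms get a higher weight."""
    return math.log10(total_docs / docs_with_term)

# Illustrative numbers based on the example in the text (index size and
# term frequency within the document are assumptions).
total_docs = 60_000_000_000
for term, tf, docs_with_term in [("red", 4, 241_000_000),
                                 ("western boots", 4, 366_000)]:
    score = wdf(tf, doc_length=200) * idf(total_docs, docs_with_term)
    print(f"{term}: WDF*IDF = {score:.3f}")
```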
An example of how the textual "normality" of a text can be recognized is Zipf's law, which is often used in so-called corpus linguistics. It is based on the fact that certain words occur more frequently in a language than others and that this can be described by a mathematical formula. Using so-called n-grams, the probability that a given word is followed by a certain other word can be calculated. Incidentally, there is an Ngram Viewer for Google Books (https://books.google.com/ngrams), which can be used to evaluate n-grams from books, filtered by time period and language.
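The idea behind word n-grams – estimating how likely one word is to follow another – can be shown with a tiny bigram model. The toy corpus is invented and far too small for meaningful probabilities.

```python
from collections import Counter, defaultdict

# Toy corpus (invented): count which word follows which.
corpus = "the tower is in paris the tower is tall the river is in paris".split()

follow_counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follow_counts[current_word][next_word] += 1

def p_next(word: str, candidate: str) -> float:
    """Probability that `candidate` follows `word` in the toy corpus."""
    counts = follow_counts[word]
    return counts[candidate] / sum(counts.values()) if counts else 0.0

print(p_next("tower", "is"))   # 1.0 -> "tower is" is a very likely bigram
print(p_next("is", "purple"))  # 0.0 -> never observed, statistically suspicious
```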
The frequency of letters in texts is also easy to describe mathematically and can therefore be tested. In German texts, while the E is by far the most common letter (> 17%), the most common initial letter is the D, followed by the S; the most common final letter is the N (21%). Such calculations become even more meaningful when not only the frequency of individual letters is examined, but also the occurrence of letter combinations. As can be seen in Figure 4, certain combinations of three letters (trigrams) occur very frequently. The apparently two-letter pairs "EN", "ER", "DE" and "IE" actually contain a space before or after them in the texts, which is counted as well.
Figure 4: Trigram analysis – how often do combinations of three letters occur? (sample: 20.1 million texts – source: Wikipedia, http://einfach.st/wp6)
If the way terms and letters are used deviates far from what can be expected on average, the text may be meaningless – for example because it was generated automatically by spammers. Such filters can also detect so-called automatically "spun" texts, which try to circumvent the duplicate content problem through term variations. In short, using modern linguistics and specially developed algorithms, a great deal of useful information for this assessment can be extracted from a text mechanically. Even classifying an author's level of education can be automated nowadays – and not by a simple process that merely counts spelling mistakes.
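Whether a text statistically "looks like" normal language can be checked, for example, by comparing its letter frequencies against expected values. The reference values below are rough approximations for German and serve only as an illustration of such a plausibility filter.

```python
from collections import Counter

# Rough reference frequencies of a few letters in German text (approximate).
EXPECTED = {"e": 0.174, "n": 0.098, "i": 0.076, "s": 0.073, "r": 0.070}

def frequency_deviation(text: str) -> float:
    """Sum of absolute deviations from the expected letter frequencies."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return sum(abs(counts[ch] / total - freq) for ch, freq in EXPECTED.items())

normal = "Eine normale deutsche Beispielseite mit ganz gewoehnlichem Text."
spammy = "xqzj kxq zzqx jqx zxq qqq xjz zqx xxqz jzx qzx"  # machine gibberish
print(f"normal text deviation: {frequency_deviation(normal):.2f}")
print(f"gibberish deviation:   {frequency_deviation(spammy):.2f}")
```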
Of course, most search engines use not only the text of a document to determine relevance, but also signals such as the number and quality of backlinks, the trust factor of a domain, visitor interest (see the article by André Alpar in this issue, p. X), dwell time, or even factors such as whether searchers quickly return after clicking on a search result and click on another result instead.
Dr. Jan Pedersen from Bing recently explained in a post that the quality of the content is very important for the ranking there:
"content quality is a primary factor in ranking" (http://einfach.St/bblog).
According to his explanations, ranking at Bing is generally a function of:
- topical relevance,
- context, and
- quality of the content.
Quality, in turn, is determined by authority (can the content be trusted?), utility (".. when considering the utility of the page, our models try to predict whether the content is sufficiently useful for the topic it is trying to address. Does the page provide ample supporting information? Is it at the appropriate level of depth for the intended audience? We prefer pages with relevant supporting multimedia content: instructional videos, images, graphs, etc. Another important criterion in evaluating utility is gauging the effort and level of expertise required to generate the content. Websites serving unique content are preferred to those recycling existing data or widely available materials ..") and presentation, i.e. how well the searched-for content is presented and can be found on the page. Search engines are thus no longer only able to compare texts and words; they are in the midst of trying to actually understand them – as far as computers can really "understand" anything in the usual sense of the word.
Since this article is not about ranking factors and their use, they are largely disregarded here. They are not necessary for a general understanding of how a search engine works and would significantly complicate the discussion.
The search index
Hitlist
Since searchers look for results by entering one or more words, a search engine must extract the important words from the texts and prepare them in such a way that they can be accessed with high performance. In the so-called hit list, all relevant words occurring in a document are therefore stored one after the other. From the sentence "this is a great website and here you can buy great western boots", after eliminating stop words, stemming, and recording the positions in the text, the following remains:
- great (at position 004)
- website (at position 005)
- great (at position 010)
- western boots (at position 011)
- buy (at position 012)
Besides the position, many other attributes are important in order to be able to assess the relevance of a word for the document. Was the word printed in bold or in a relatively larger font? Is it mentioned in the title, in an H1 headline, and so on? These characteristics can be assigned numeric values which, if they apply, are added up for the word. Figure 5 shows this as an example: if a word appears in the second heading (H2-1) and is highlighted in bold, the values 4 and 16 are added together. For this word, the hit list therefore stores the position of the word (011) as well as the value 20. The entry would thus be something like "011 | 20".
The value "20" can, of course, be broken down back into the factors "H2-1" and "bold/italic" using the corresponding table. In reality, of course, this catalog of features is much more complex and the storage is done by compressing algorithms. Through this process, in principle, every document on the web must be tagged with every non-stop word. At 60 billion documents z. B. With the index of google and the assumption that a page contains on average about 200 words, you get an impression of the enormous computing power that is necessary for this continuous process alone.