Google on Anchor Text Indexing and Crawl Rates

Initially filed in 2003, a granted Google patent sheds light on topics like anchor text indexing, crawling frequencies, and the treatment of redirects. It offers intriguing insights, though it’s worth noting that the patent may not reflect current practices.

Over time, Google’s methods could have evolved. It’s well-known that different web pages experience varying crawl rates and indexing speeds, and anchor text in hyperlinks can significantly affect a page’s ranking in search results. The patent explores how these elements interact, potentially revealing strategies once used to determine page rankings.

Why rely on anchor text indexing to determine relevance?

When users search, they expect a concise list of highly relevant pages, but older search engine systems only indexed the contents of the page itself. According to the authors of this anchor text indexing patent, valuable information about a page can be found in the hyperlinks pointing to it. This is especially useful when the target page has minimal or no text, such as in the case of images, videos, or software.

Often, the anchor text on linking pages provides the only descriptive information for these types of content. Additionally, anchor text indexing allows search engines to rank and categorize pages even before they are crawled, enhancing the efficiency and accuracy of search results.

Creating Link Logs and Anchor Maps

Creating link logs and anchor maps is a central aspect of the process outlined in the patent. The link log captures numerous link records during crawling and indexing, documenting each source URL and the target pages it links to.

An anchor map is also generated, which organizes information about the anchor text from hyperlinks. These anchor records are sorted based on the pages they point to and may include additional annotations.
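
To make the link log / anchor map relationship concrete, here is a minimal Python sketch. The class names and record fields are illustrative assumptions rather than the patent's actual data layout; the point is that the link log is keyed by the source page, while the anchor map inverts it and groups anchor text by the target page.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class LinkRecord:
    """One link-log entry: a source URL and the links found on it."""
    source_url: str
    # Each outgoing link pairs a target URL with its anchor text.
    outgoing_links: list[tuple[str, str]] = field(default_factory=list)

def build_anchor_map(link_log: list[LinkRecord]) -> dict[str, list[dict]]:
    """Invert the link log: group anchor records by the page they point to."""
    anchor_map: dict[str, list[dict]] = defaultdict(list)
    for record in link_log:
        for target_url, anchor_text in record.outgoing_links:
            anchor_map[target_url].append({
                "source": record.source_url,
                "anchor_text": anchor_text,  # could carry extra annotations
            })
    return dict(anchor_map)

# Example: two pages linking to the same image with descriptive anchor text.
link_log = [
    LinkRecord("https://example.com/a", [("https://example.com/everest.jpg", "Mount Everest photo")]),
    LinkRecord("https://example.com/b", [("https://example.com/everest.jpg", "picture of Everest")]),
]
print(build_anchor_map(link_log))
```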

To enhance relevance assessment, a page ranker may calculate a PageRank or another query-independent metric for specific pages. This ranking system can play a crucial role in determining the crawling frequency and priority for particular web pages, ensuring that more relevant pages are crawled and indexed promptly.

Different Layers and Different Crawling Rates

Base Layer

The base layer of the data structure described in this patent consists of a sequence of segments, each potentially covering over 200 million URLs, though this figure might have changed since the patent’s creation. Collectively, these segments represent a large portion of the addressable URLs on the web. Periodic crawling, possibly on a daily basis, is applied to one of these segments.

This system allows for a methodical and scalable approach to covering vast numbers of URLs, ensuring that even pages in the base layer receive regular updates, albeit at a slower rate compared to more prioritized layers in the structure.

Daily Crawl Layer: Frequent Crawling for High-Priority URLs

The patent outlines a multi-layered crawling system to ensure timely indexing of web content. In addition to the base layer, which handles segments of URLs with periodic crawling, there’s a daily crawl layer that covers over 50 million URLs. These URLs are crawled more frequently than those in the base layer and may include high-priority URLs identified during active crawling periods.

Optional Real-Time Layer: Immediate Indexing for Critical URLs

An optional real-time layer targets around five million URLs that are crawled numerous times throughout a day. Some URLs in this layer are even crawled every few minutes to keep the index updated with the most current information. Newly discovered URLs, which need immediate attention, are also placed in this layer for rapid crawling.

Unified Crawling System: Robots Serving All Layers at Different Rates

Despite the different crawling frequencies, the same robots handle all the layers, with a scheduling program determining the crawl rates based on how often a page’s content changes and the URL’s perceived importance. This layered approach helps search engines prioritize content freshness and relevance while ensuring broad coverage across the web.

URL Discovery in Anchor Text Indexing

The sources for discovering URLs in the data structure include several key methods:

  1. User Submissions: URLs directly submitted by users to the search engine system.
  2. Crawled Pages: URLs found through outgoing links on already crawled pages.
  3. Third-Party Submissions: URLs provided by third parties, such as content publishers, who submit links as they are published, updated, or changed—often through sources like RSS feeds.

This explains why it’s common to see newly published blog posts appearing in search results just a few hours after being posted, as they may be part of the real-time layer or submitted through an RSS feed for rapid indexing.

Processing of URLs and Content

Before URLs and their corresponding content are stored in the data structure, they are processed to maintain content uniformity and eliminate duplicate pages. This involves examining URL syntax and running a host duplicate detection program, which uses incoming URLs to identify identical hosts so the same host is not indexed more than once.
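
As a rough illustration of this kind of syntax processing, the sketch below normalizes URLs and buckets them by host so candidate duplicate hosts can be compared. The specific normalization rules are assumptions for illustration; the patent does not prescribe an exact procedure.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Apply simple, illustrative syntax normalization to a URL."""
    parts = urlsplit(url.strip())
    # hostname lowercases the host and drops any port; the fragment is discarded.
    host = parts.hostname.lower() if parts.hostname else ""
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

def group_by_host(urls: list[str]) -> dict[str, set[str]]:
    """Bucket normalized URLs by host so duplicate hosts can be compared."""
    buckets: dict[str, set[str]] = {}
    for url in urls:
        norm = normalize_url(url)
        host = urlsplit(norm).hostname or ""
        buckets.setdefault(host, set()).add(norm)
    return buckets

print(group_by_host([
    "HTTP://Example.COM/page#top",
    "http://example.com/page",
    "http://mirror.example.com/page",
]))
```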

Key Aspects of the Anchor Text Indexing Process

Instead of outlining the full process step-by-step, here are some essential terms and procedures involved in anchor text indexing. Some technical details that enhance speed and efficiency, such as partitioning and incremental data addition, have been omitted for brevity. For those interested, reviewing the full patent is recommended.

  1. Epochs: A predetermined time frame, like a day, in which various actions of the process are executed.
  2. Active Segment: During each epoch, a specific segment from the base layer is selected for crawling. The system rotates through segments in a round-robin fashion across several epochs, ensuring all segments are eventually crawled (a toy sketch of this rotation follows the list).
  3. Movement Between Daily Layer and Optional Real-Time Layer: URLs can shift between the daily and real-time layers, depending on historical data showing how frequently the content changes. This movement is also influenced by each URL’s PageRank or other ranking metrics determined by page rankers.
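
A toy sketch of the round-robin rotation through base-layer segments, one active segment per epoch; the segment count and the use of a day as the epoch are arbitrary choices here:

```python
from datetime import date

NUM_SEGMENTS = 12  # illustrative; the patent describes segments of over 200 million URLs

def active_segment_for_epoch(epoch_index: int, num_segments: int = NUM_SEGMENTS) -> int:
    """Round-robin: each epoch activates the next base-layer segment."""
    return epoch_index % num_segments

# If one epoch is one day, the epoch index can simply count days.
today_epoch = date.today().toordinal()
print("Active base-layer segment today:", active_segment_for_epoch(today_epoch))
```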

Determining What URLs Are Placed in Which Layers

The placement of URLs in different layers can be determined by calculating a daily score for each URL, which is given by the formula: daily score = [PageRank]² * URL change frequency. This score helps decide which URLs receive more frequent crawling.
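
A minimal sketch of that daily score and of how it might drive layer placement; the cutoff values are invented for illustration, since the patent gives the formula but not concrete thresholds:

```python
def daily_score(page_rank: float, change_frequency: float) -> float:
    """daily score = PageRank^2 * URL change frequency (changes per day)."""
    return (page_rank ** 2) * change_frequency

def choose_layer(score: float, daily_cutoff: float = 1.0, realtime_cutoff: float = 10.0) -> str:
    """Illustrative layer placement based on the daily score."""
    if score >= realtime_cutoff:
        return "real-time layer"
    if score >= daily_cutoff:
        return "daily crawl layer"
    return "base layer"

score = daily_score(page_rank=4.0, change_frequency=2.0)  # content changes twice a day
print(score, "->", choose_layer(score))
```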

URL Change Frequency Data: When a robot accesses a URL, content filters evaluate if the content has changed since it was last accessed. This information is stored in history logs and used by the URL scheduler to compute the change frequency and determine crawling rates.
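
The change-frequency signal could be estimated from the history log roughly as follows; the log format used here (crawl timestamps paired with a changed flag) is an assumption for illustration:

```python
from datetime import datetime

def estimate_change_frequency(history: list[tuple[datetime, bool]]) -> float:
    """Estimate changes per day from (crawl_time, content_changed) history entries."""
    if len(history) < 2:
        return 0.0
    history = sorted(history)
    changes = sum(1 for _, changed in history[1:] if changed)
    span_days = (history[-1][0] - history[0][0]).total_seconds() / 86400
    return changes / span_days if span_days > 0 else 0.0

history = [
    (datetime(2024, 1, 1), False),
    (datetime(2024, 1, 3), True),
    (datetime(2024, 1, 7), True),
]
print(round(estimate_change_frequency(history), 2), "changes per day")
```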

PageRank: A query-independent score, also known as a document score, is calculated for each URL by examining the number of URLs linking to it and the PageRank of those linking pages. Higher PageRank URLs may receive more frequent crawling. This PageRank data is managed by URL managers.

URL History Log: This log stores information on URLs, including those no longer part of the data structure, such as URLs that are defunct or excluded from crawling upon request by a site owner.

Placement into Base Layer: When a URL is placed in the base layer, the URL scheduler ensures a random or pseudo-random distribution across segments, ensuring an even allocation of URLs.

URL Layer Placement and Random Distribution: URLs are distributed into appropriate layers—base, daily, or real-time—using processing rules that aim for balanced segment allocation.

Handling Situations Where All URLs Cannot Be Crawled in an Epoch:

  1. Crawl Score Calculation: A crawl score is computed for each URL in different segments and layers, and only those with high scores are passed to URL managers.
  2. Optimum Crawl Frequency: The URL scheduler determines the optimal crawl frequency for each URL, and URL managers use this information to decide which URLs to crawl.

These methods can be applied individually or combined to prioritize URLs for crawling.

Factors Determining a Crawl Score: The crawl score helps decide whether a URL proceeds to the next stage during an epoch. It considers:

  1. Current Location of the URL (active segment, daily layer, or real-time layer).
  2. URL PageRank.
  3. URL Crawl History: This can be calculated as crawl score = [PageRank]² * (change frequency) * (time since last crawl); a sketch of this calculation follows the list.
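
Putting those factors together, a crawl-score calculation might look like the sketch below. The layer boost and the up-weighting of long-uncrawled URLs (described in the next section) are illustrative assumptions layered on top of the quoted formula:

```python
def crawl_score(page_rank: float,
                change_frequency: float,       # estimated changes per day
                days_since_last_crawl: float,
                layer: str = "base",
                min_refresh_days: float = 60.0) -> float:
    """crawl score = PageRank^2 * change frequency * time since last crawl."""
    score = (page_rank ** 2) * change_frequency * days_since_last_crawl
    # Illustrative layer boost: daily/real-time URLs compete ahead of base URLs.
    layer_boost = {"base": 1.0, "daily": 2.0, "real-time": 4.0}.get(layer, 1.0)
    score *= layer_boost
    # Up-weight URLs that have waited longer than the minimum refresh time.
    if days_since_last_crawl > min_refresh_days:
        score *= 10.0
    return score

print(crawl_score(page_rank=3.0, change_frequency=0.5, days_since_last_crawl=70, layer="base"))
```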

How the Crawling Score Might Be Affected

Up-weighting Crawl Scores: URLs that haven’t been crawled in a significant period may have their crawl score increased to ensure they are revisited within a set minimum refresh time, like two months.

Setting Crawl Frequency: The URL scheduler sets and adjusts the crawl frequency for URLs in the data structure, based on factors like a URL’s historical change frequency and PageRank. URLs in the daily and real-time layers are generally crawled at shorter intervals than those in the base layer, with intervals ranging from as often as every minute to as seldom as once every several months.

Dropping URLs: URLs are periodically removed from the data structure to make room for new entries. This decision is based on a “keep score,” which could be a URL’s PageRank. URLs with lower scores are removed first as new URLs are added to the system.
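
A rough sketch of keep-score eviction, assuming PageRank serves as the keep score; the fixed capacity and heap-based selection are illustrative choices:

```python
import heapq

def evict_lowest_keep_scores(urls: dict[str, float], capacity: int) -> dict[str, float]:
    """Keep only the `capacity` URLs with the highest keep score (here, PageRank)."""
    if len(urls) <= capacity:
        return dict(urls)
    # nlargest keeps the highest-scoring URLs; everything else is dropped.
    survivors = heapq.nlargest(capacity, urls.items(), key=lambda item: item[1])
    return dict(survivors)

urls = {"a.example/1": 7.2, "b.example/2": 0.4, "c.example/3": 3.1}
print(evict_lowest_keep_scores(urls, capacity=2))
```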

Crawl Interval: The crawl interval is the target frequency for crawling a URL, such as every two hours. The URL manager schedules crawling based on these intervals, influenced by criteria like URL characteristics and category.

Representative URL Categories: URLs are categorized into groups such as news URLs, international URLs, language-specific URLs, and file types (e.g., PDF, HTML, PowerPoint). This helps determine crawling priorities and policies.

URL Server Requests: Sometimes, the URL server will request specific types of URLs from URL managers. These requests can be based on preset policies, such as aiming for a mix like 80% foreign URLs and 20% news URLs.
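
One way such a mix policy could be honored when assembling a batch of URLs; the category labels, pool structure, and sampling approach are assumptions for illustration:

```python
import random

def request_batch(pools: dict[str, list[str]], mix: dict[str, float], batch_size: int) -> list[str]:
    """Draw a batch of URLs whose composition follows the requested category mix."""
    batch: list[str] = []
    for category, share in mix.items():
        wanted = int(batch_size * share)
        available = pools.get(category, [])
        batch.extend(random.sample(available, min(wanted, len(available))))
    return batch

pools = {
    "foreign": [f"https://intl.example/{i}" for i in range(100)],
    "news": [f"https://news.example/{i}" for i in range(100)],
}
print(len(request_batch(pools, {"foreign": 0.8, "news": 0.2}, batch_size=10)))
```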

Robots: Crawling robots are responsible for visiting and retrieving documents at specific URLs. They also recursively gather documents linked from each page. Robots are assigned tasks by the URL server and submit retrieved content to content filters for further processing.

Crawling Pages Following the URL Scheduler

Role of the URL Scheduler: The URL scheduler decides which pages to crawl, with the URL server directing robots accordingly; the server receives URLs from content filters, and the robots carry out the actual crawling. Unlike standard web browsers, robots do not automatically retrieve embedded content, such as images, nor do they follow permanent redirects.

Host Load Server: A host load server helps prevent overloading target servers by managing the rate at which robots access them.

Avoiding DNS Bottleneck Problems: To avoid DNS lookup delays, a dedicated local DNS database stores previously resolved IP addresses. This helps ensure domain names do not need repeated resolution every time a robot visits a URL.
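
A minimal in-memory version of such a local DNS cache; the patent describes a dedicated database, whereas this sketch simply memoizes lookups:

```python
import socket

class LocalDnsCache:
    """Cache resolved IP addresses so repeated robot visits skip DNS lookups."""
    def __init__(self) -> None:
        self._cache: dict[str, str] = {}

    def resolve(self, hostname: str) -> str:
        if hostname not in self._cache:
            # Only hit the resolver on a cache miss.
            self._cache[hostname] = socket.gethostbyname(hostname)
        return self._cache[hostname]

dns = LocalDnsCache()
print(dns.resolve("example.com"))   # resolver lookup
print(dns.resolve("example.com"))   # served from the local cache
```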

Handling Permanent and Temporary Redirects: Robots do not follow permanent redirects directly. Instead, they send both the original and redirected URLs to content filters, which add them to link logs. These logs are then managed by URL managers, who decide if and when a robot should crawl the redirected URLs. In contrast, robots do follow temporary redirects to collect page information.
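
The differing treatment of permanent and temporary redirects could be sketched as follows; mapping them to HTTP status codes 301 and 302 and the helper's return convention are assumptions for illustration:

```python
def handle_response(url: str, status: int, location: str | None, link_log: list[dict]) -> str | None:
    """Return the URL the robot should fetch next, if any, and record redirects."""
    if status == 301 and location:
        # Permanent redirect: do not follow; log both URLs for the URL managers.
        link_log.append({"source": url, "redirect_target": location, "permanent": True})
        return None
    if status == 302 and location:
        # Temporary redirect: follow it so the page content can still be collected.
        return location
    return None  # normal responses need no redirect handling

log: list[dict] = []
print(handle_response("http://a.example/old", 301, "http://a.example/new", log))
print(log)
print(handle_response("http://b.example/tmp", 302, "http://b.example/live", log))
```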

Content Filters: After robots retrieve page content, it is processed by content filters, which then send information to a DupServer to identify duplicate pages. The filters analyze:

  1. URL Fingerprint: A unique identifier for the URL.
  2. Content Fingerprint: A unique identifier for the content.
  3. PageRank: The rank assigned to the page.
  4. Redirect Indicator: Whether the page serves as a source for a temporary or permanent redirect.

Rankings of Duplicate Pages

This detailed processing system helps ensure efficient crawling and indexing while minimizing redundancy in the search engine’s database.

Handling Duplicates: When duplicates are detected, their page rankings are compared, and the “canonical” page is chosen for indexing. Non-canonical pages are not indexed but may have entries in the history log, and the content filter stops processing them. The DupServer also manages both temporary and permanent redirects that robots encounter during crawling.
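
Choosing a canonical page among detected duplicates might look like this minimal sketch, which groups pages by content fingerprint and keeps the highest-ranked copy in each group; the grouping key and tie-breaking are assumptions:

```python
from collections import defaultdict

def pick_canonicals(pages: list[dict]) -> list[dict]:
    """Group pages by content fingerprint and keep the highest-PageRank copy."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for page in pages:
        groups[page["content_fingerprint"]].append(page)
    # Within each duplicate group, the best-ranked page becomes the canonical one.
    return [max(group, key=lambda p: p["page_rank"]) for group in groups.values()]

pages = [
    {"url": "http://a.example/doc", "content_fingerprint": "f1", "page_rank": 5.0},
    {"url": "http://mirror.example/doc", "content_fingerprint": "f1", "page_rank": 2.0},
    {"url": "http://b.example/other", "content_fingerprint": "f2", "page_rank": 1.0},
]
print([p["url"] for p in pick_canonicals(pages)])
```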

Link Records and Surrounding Text: The link log contains one record per URL document, capturing the URL fingerprints of all links found in the document along with surrounding text. For example, a link to an image of Mount Everest might say, “to see a picture of Mount Everest, click here.” In this case, the anchor text is “click here,” but the surrounding text, “to see a picture of Mount Everest,” is also recorded. This expands the context of anchor text indexing to include nearby descriptive text.
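
A link-log record that carries both the anchor text and its surrounding text could be modeled as in the sketch below, reusing the Mount Everest example; the field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class LinkAnnotation:
    target_fingerprint: str   # fingerprint of the linked URL
    anchor_text: str          # text inside the <a> tag
    surrounding_text: str     # nearby descriptive text outside the tag

@dataclass
class LinkLogRecord:
    source_fingerprint: str
    annotations: list[LinkAnnotation]

record = LinkLogRecord(
    source_fingerprint="fp:source-page",
    annotations=[
        LinkAnnotation(
            target_fingerprint="fp:everest-image",
            anchor_text="click here",
            surrounding_text="to see a picture of Mount Everest",
        )
    ],
)
print(record.annotations[0].surrounding_text)
```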

RTlog for Matching PageRanks with Source URLs: An RTlog stores documents retrieved by robots, paired with the PageRank of their source URLs. For example, if a document is obtained from URL “XYZ,” it is stored with the PageRank assigned to “XYZ” as a pair in the RTlog. There are separate RTlogs for each layer: the active segment of the base layer, the daily layer, and the real-time layer. This structure ensures efficient tracking of documents and their relative importance for ranking purposes.

The Creation of Link Maps

Creation of Link Maps: The global state manager reads the link logs and uses this information to generate link maps and anchor maps. Unlike link logs, the records in the link map have their text removed. Page rankers use these link maps to adjust the PageRank of URLs within the data structure, and these rankings are preserved across different epochs.

Creation of Anchor Maps: Anchor maps are also created by the global state manager. These maps are used by indexers across different layers to simplify anchor text indexing and help in indexing URLs that lack textual content, providing better context for pages with minimal on-page text.

Text Passages in Link Logs: Each record in the link log includes a list of annotations, such as the text from anchor tags pointing to a target page. These annotations can include continuous blocks of text from the source document, termed text passages. Additionally, annotations might also contain text that is outside of the anchor tag but is within a predetermined distance of the anchor tag in the document. This distance is based on factors like the number of characters in the HTML code, placement of other anchor tags, or other predefined anchor text identification criteria.
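
One simple way to capture text within a fixed character distance of an anchor tag, in the spirit of the text passages described here; the 100-character window and regex-based parsing are crude simplifications for illustration:

```python
import re

def text_near_anchor(html: str, window: int = 100) -> list[dict]:
    """For each <a> tag, record its anchor text plus nearby text within `window` chars."""
    results = []
    for match in re.finditer(r"<a\b[^>]*>(.*?)</a>", html, flags=re.IGNORECASE | re.DOTALL):
        start, end = match.span()
        before = html[max(0, start - window):start]
        after = html[end:end + window]
        # Strip any markup from the surrounding context (crude, for illustration).
        context = re.sub(r"<[^>]+>", " ", before + " " + after).strip()
        results.append({"anchor_text": match.group(1), "nearby_text": context})
    return results

html = '<p>To see a picture of Mount Everest, <a href="/everest.jpg">click here</a>.</p>'
print(text_near_anchor(html))
```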

Other Annotations: Annotations may also include attributes associated with the text. For HTML text, these attributes can include formatting tags like:

  • Emphasized Text (<em>)
  • Citations (<cite>)
  • Variable Names (<var>)
  • Strongly Emphasized Text (<strong>)
  • Source Code (<code>)

Additional attributes may indicate the text position, number of characters, word count, and more. In some cases, annotations serve as delete entries, indicating that a link has been removed, as determined by the global state manager.

Anchor Text Indexing from Duplicates

Sometimes, anchor text pointing to duplicate pages is used to index the canonical version of the page. This can be especially useful if links to non-canonical pages have anchor text in a different language, enriching the context of the canonical page.

Conclusion

Google’s anchor text indexing and page crawling processes, as described in the patent, provide an intricate look into how search engines decide what content to prioritize and how URLs are organized and processed. The multi-layered crawling system, which includes base, daily, and real-time layers, ensures a balance between broad web coverage and the frequent updating of high-priority content. By calculating factors such as PageRank, change frequency, and crawl scores, Google’s approach aims to maximize the relevance and freshness of indexed pages.

The patent’s description of link maps, anchor maps, and annotations underscores the importance of link relationships and contextual information. These link structures not only contribute to determining a URL’s ranking but also help in indexing content that may lack on-page text. Handling duplicates and redirects efficiently ensures that only the most authoritative version of content is indexed, while annotations provide rich data to enhance the indexing process.

These advanced mechanisms reflect the dynamic nature of the web, where content changes frequently, and users demand up-to-date search results. Although the processes detailed in the patent may have evolved over time, the insights offer a valuable understanding of the principles behind search indexing, crawling prioritization, and managing the vast amount of information on the internet. By leveraging such sophisticated methods, search engines are better equipped to deliver the highly relevant and timely content that users expect.

The anchor text indexing patent is:

Anchor tag indexing in a web crawler system
Invented by Huican Zhu, Jeffrey Dean, Sanjay Ghemawat, Bwolen Po-Jen Yang, and Anurag Acharya
Assigned to Google
US Patent 7,308,643
Granted December 11, 2007
Filed July 3, 2003