Massive Google Search document leak exposes inner workings of ranking algorithm

Google Search algorithm

A collection of leaked Google documents provides a rare glimpse into the inner workings of Google Search, revealing key factors used to rank content.

Thousands of documents, seemingly sourced from Google’s internal Content API Warehouse, were released on March 13 on GitHub by an automated bot named yoshi-code-bot. These documents had been shared with Rand Fishkin, co-founder of SparkToro, earlier this month.

Understanding Google’s ranking algorithm through these leaked documents is invaluable for SEOs. In 2023, a Yandex Search ranking factors leak made headlines, becoming one of the year’s major stories.

This Google document leak is poised to be one of the most significant events in the history of SEO and Google Search.

What’s inside. Here’s what we know about the internal documents, based on analysis from Fishkin and Mike King of iPullRank:

  • Current: The documentation is accurate as of March.
  • Ranking features: The API documentation includes 2,596 modules with 14,014 attributes.
  • Weighting: The documents don’t specify the weighting of ranking features, only their existence.
  • Twiddlers: Re-ranking functions that adjust the information retrieval score or change document rankings.
  • Demotions: Content can be demoted for various reasons, such as:
    • A link doesn’t match the target site.
    • SERP signals indicate user dissatisfaction.
    • Product reviews.
    • Location.
    • Exact match domains.
    • Porn.
  • Change history: Google keeps a copy of every version of every indexed page, remembering all changes. However, it only uses the last 20 changes of a URL when analyzing links.
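The twiddler concept above can be sketched as a post-retrieval re-ranking pass. Everything in this sketch is hypothetical: the leak names the mechanism and some demotion reasons, not any code, and the signal names and multipliers here are invented for illustration.

```python
# Hypothetical sketch of a "twiddler": a re-ranking function that
# adjusts information-retrieval (IR) scores after initial ranking.
from dataclasses import dataclass, field

@dataclass
class Doc:
    url: str
    ir_score: float                      # initial retrieval score
    signals: dict = field(default_factory=dict)

def demotion_twiddler(docs):
    """Apply multiplicative demotions based on per-document signals."""
    for doc in docs:
        if doc.signals.get("serp_dissatisfaction"):
            doc.ir_score *= 0.5          # users bounced back to the SERP
        if doc.signals.get("anchor_mismatch"):
            doc.ir_score *= 0.8          # link text doesn't match the target
    return sorted(docs, key=lambda d: d.ir_score, reverse=True)

docs = [
    Doc("a.example", 1.0, {"serp_dissatisfaction": True}),
    Doc("b.example", 0.9),
]
reranked = demotion_twiddler(docs)
print([d.url for d in reranked])  # b.example now outranks a.example
```

The point of the sketch is that a twiddler changes ordering without touching the underlying index: the demoted page keeps its content and base score, but loses position in this result set.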

 

Links matter. According to the documents, link diversity and relevance remain key factors, and PageRank is still part of Google’s ranking features. Notably, the PageRank of a website’s homepage is considered for every document on that site.

  • This doesn’t prove that Google spokespeople have lied about links not being a “top 3 ranking factor” or their diminishing importance for ranking. Both statements can be true simultaneously. Additionally, we still don’t know the weighting of these features.
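As a refresher on what PageRank measures, here is a minimal power-iteration sketch. The toy graph and iteration count are illustrative only; Google’s production PageRank is far more elaborate than this textbook version.

```python
# Minimal PageRank via power iteration (textbook form, not Google's).
def pagerank(links, damping=0.85, iters=50):
    """links: {page: [outgoing pages]}. Returns a score per page."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            share = rank[p] / len(outs) if outs else 0.0
            for q in outs:
                new[q] += damping * share
        rank = new
    return rank

# Every page links to "home", so "home" accumulates the most rank.
graph = {"home": ["about", "blog"], "about": ["home"], "blog": ["home"]}
scores = pagerank(graph)
print(max(scores, key=scores.get))  # "home"
```

Because the homepage typically attracts the most external links, using its PageRank as a site-level input for every document (as the leak suggests) effectively lets strong sites lift their deeper pages.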

Successful clicks are crucial. To rank well, you need to create excellent content and user experiences, as highlighted in the documents. Google measures various types of clicks, including badClicks, goodClicks, lastLongestClicks, and unsquashedClicks.
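The leak documents only that these click attributes exist, not how they are combined. Purely as an illustration, a quality signal built from them might look like this; the formula and weights below are invented, and only the attribute names come from the documents.

```python
# Hedged sketch: combining click attributes named in the leak
# (goodClicks, badClicks, lastLongestClicks) into one signal.
# The formula and weights are invented for illustration.
def click_quality(good_clicks, bad_clicks, last_longest_clicks):
    total = good_clicks + bad_clicks
    if total == 0:
        return 0.0
    # Reward clicks that ended the search session (lastLongestClicks)
    # and penalize quick bounces back to the results page (badClicks).
    return (good_clicks + 2 * last_longest_clicks - bad_clicks) / total

print(click_quality(good_clicks=80, bad_clicks=20, last_longest_clicks=40))
```

The takeaway matches King’s advice below: a page whose clicks end the search session scores far better than one users bounce away from, regardless of raw click volume.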

Longer documents may be truncated, while shorter content is scored (0–512) for originality. Scores are also assigned to Your Money or Your Life (YMYL) content, such as health and news.

What does this mean? King explains:

“You need to drive more successful clicks using a broader set of queries and earn more link diversity to maintain your ranking. It makes sense because strong content will naturally do this. Focusing on driving qualified traffic to a better user experience signals to Google that your page deserves to rank.”


Evidence and statements from the U.S. vs. Google antitrust trial affirmed that Google incorporates clicks into its ranking algorithms, notably through its Navboost system, described as “one of the significant signals” for ranking. Explore further in our coverage.

Brand matters. Fishkin’s big takeaway? Brand matters more than anything else:

  • “If there was one universal piece of advice I had for marketers seeking to broadly improve their organic search rankings and traffic, it would be: ‘Build a notable, popular, well-recognized brand in your space, outside of Google search.’”

 

Entities are significant. Authorship remains relevant: Google stores author information linked to content and seeks to attribute documents to specific entities.

SiteAuthority: Google employs a metric known as “siteAuthority.” After the 2011 Panda update, Google publicly acknowledged that “low-quality content on part of a site can impact a site’s ranking as a whole.” Despite this, Google has denied the existence of a website authority score in subsequent years.

Chrome data: A module named ChromeInTotal indicates Google utilizes data from its Chrome browser for ranking purposes.

Whitelists: Modules like isElectionAuthority and isCovidLocalAuthority indicate Google whitelists specific domains related to elections and COVID. This practice is known as having “exception lists” to mitigate unintended impacts from algorithms on websites.

Small sites: Another feature, smallPersonalSite, is designed for small personal sites or blogs. There is speculation that Google could adjust rankings for such sites using a Twiddler, although the extent of this adjustment remains uncertain. Once again, the weighting of these features is unclear.

Other notable findings from Google’s internal documents include:

  • Freshness matters: Google considers dates in the byline (bylineDate), URL (syntacticDate), and on-page content (semanticDate) to assess content freshness.
  • Topic relevance: To determine if a document aligns with a website’s core topics, Google uses page and site embeddings, comparing page embeddings (siteRadius) to site embeddings (siteFocusScore).
  • Domain registration information: Google stores domain registration details (RegistrationInfo).
  • Page titles: The feature titlematchScore is believed to gauge how well a page title matches a query.
  • Text analysis: Google measures the average weighted font size of terms in documents (avgTermWeight) and anchor text.
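The topic-relevance item above can be sketched with cosine similarity: compare a page’s embedding to a site-level embedding built from all its pages. The vectors, the mean-pooling, and the use of cosine here are assumptions for illustration; the leak names siteFocusScore and siteRadius but not the actual embedding model or distance measure.

```python
# Illustrative sketch of comparing a page embedding to a site-level
# embedding, in the spirit of siteFocusScore/siteRadius.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def site_embedding(page_embeddings):
    """Site vector as the mean of its page vectors (an assumption)."""
    n = len(page_embeddings)
    return [sum(v[i] for v in page_embeddings) / n
            for i in range(len(page_embeddings[0]))]

pages = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]]   # on-topic pages
off_topic = [0.1, 0.9]
site = site_embedding(pages)
print(cosine(pages[0], site))    # close to 1.0: page matches site focus
print(cosine(off_topic, site))   # lower: page drifts from core topics
```

Under this reading, a page far from the site’s centroid (a large “radius”) would register as off-topic for that site, even if the page is fine in isolation.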

 

Update, May 29. Google provided a statement to Search Engine Land. Read Google’s response to the leak: the documentation lacks context.



Quick clarification. There is some dispute as to whether these documents were “leaked” or “discovered.” I’ve been told it’s likely the internal documents were accidentally included in a code review and pushed live from Google’s internal code base, where they were then discovered.

The source. Erfan Azimi, CEO and director of SEO for digital marketing agency EA Eagle Digital, posted a video, claiming responsibility for sharing the documents with Fishkin. Azimi is not employed by Google.
