Ahrefs Scraping Google: Data Accuracy, Interval, and More

Ahrefs is one of the biggest and best all-around marketing and SEO analytics platforms on the internet. Holding that position takes serious computational power and bandwidth, not to mention the ongoing development of the analytics themselves.
One problem that companies like Ahrefs face is that there’s only so much data they can access. Even with massive data centers constantly scraping both the internet at large and Google results specifically, they still pale in comparison to Google itself.
It would be one thing if Google worked with these companies, but they don’t. Google doesn’t even offer an API for pulling search results data at scale. That means Ahrefs needs to scrape Google carefully, in ways that don’t get them banned or blocked: understanding Google’s rate limits, rotating through a pool of IP addresses, and keeping their bots’ activity profile reasonable.
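Those two ingredients, rate limiting and IP rotation, can be sketched roughly like this. To be clear, this is an illustration of the general technique, not Ahrefs’ actual setup; the proxy addresses and interval are invented:

```python
import itertools
import time

# Hypothetical proxy pool; a real operation rotates across thousands of IPs.
PROXIES = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]

class PoliteScheduler:
    """Pace requests and rotate source IPs so traffic keeps a reasonable profile."""

    def __init__(self, proxies, min_interval=2.0):
        self._proxies = itertools.cycle(proxies)
        self._min_interval = min_interval  # seconds between requests
        self._last_request = 0.0

    def next_slot(self):
        """Sleep until the rate limit allows another request,
        then return the proxy to route it through."""
        wait = self._min_interval - (time.monotonic() - self._last_request)
        if wait > 0:
            time.sleep(wait)
        self._last_request = time.monotonic()
        return next(self._proxies)

scheduler = PoliteScheduler(PROXIES, min_interval=0.1)
for _ in range(3):
    proxy = scheduler.next_slot()
    # a real crawler would issue its HTTP request through `proxy` here
```

At Ahrefs’ scale, this kind of pacing would be distributed across thousands of machines rather than a single loop, but the idea is the same: never let any one IP look like a flood.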
Ahrefs has over 4,200 servers, 4 PB of RAM (yes, Petabytes with a P), over 500 PB of total storage, and one of the top 50 fastest supercomputers in the world. With that kind of power behind them, it’s amazing they exercise restraint.
All of this is in stark contrast to, for example, the AI bots from OpenAI, Amazon, and Meta, which hoover up everything they can uncritically.
All of this does bring up a few questions, though:
- How accurate is the data scraped and analyzed by Ahrefs?
- How often do they scrape?
- How up-to-date is their data?
All of these can be important to know if you’re going to rely on their data as part of your marketing.
How Ahrefs Gets Their Data
Since Google doesn’t have an API, Ahrefs basically has to get their data from some combination of scraping, analytics from sites they own, and data submitted to them from partners.
Now, the public has no idea which of these sources they actually use. I’m certainly not going to be the one to accuse them of breaking Google’s TOS, and I’m absolutely certain that if Google thought they were doing so, they’d do something about it.
Google can rate limit people for making a few too many searches in too short a time; if they saw bot-like behavior from a supercluster’s worth of server IPs, they’d pretty easily be able to recognize it.
Ahrefs does publish some of their data, though. For example, they have this page, which is a list of IP addresses they use. These IP addresses and IP ranges are primarily used by the Site audit bot; they publish this list so that if you, as a site owner, want to add them to a filtered list in your analytics so the bots don’t count as hits, or even block them from looking at your site at all, you can do so. Remember, Ahrefs doesn’t want to hide their operations.
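Beyond analytics filters, you can also control Ahrefs’ crawler through robots.txt. `AhrefsBot` is the user-agent token Ahrefs documents, and their bot respects the `Crawl-delay` directive; something along these lines works (the delay value here is just an example):

```
# Slow AhrefsBot down to roughly one request every 10 seconds:
User-agent: AhrefsBot
Crawl-delay: 10

# ...or block it from the site entirely by using this instead:
# User-agent: AhrefsBot
# Disallow: /
```

Since Ahrefs publishes their IP ranges and honors robots.txt, either approach is reliable; you don’t need to resort to firewall rules unless you want to.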
Ahrefs also has this to say about their data:
- For content indexing and backlink data, they use their own bots to scrape the web, just like Google and many other analytics platforms do. They’ve been building up their own index since 2013, both for their analytics index and for their Yep.com search engine index.
- For keyword data, including search volume, click-through rates, and other data, they use information from Google’s keyword planner, Google Trends, the Google Search Console, and a variety of unspecified third-party data sources. All data they gather is anonymized, and they request consent before gathering it.
These statements alone give me confidence that Ahrefs has several processes in place to make sure they’re gathering data in the most ethical way they can. It’s also very likely that they have some kind of partnership with Google; as long as they stay on the right side of the rules, Google will give them a bit more leeway.
That’s a supposition on my part, though; the public can’t confirm any of it because they haven’t published this data.
The actual numbers are pretty astonishing. Ahrefs is second only to Google in terms of crawler activity, ahead of Qualys, Bing, Amazon, Moz, and Semrush. They crawl five million pages per minute! Their content index discovers ten million new pages every day and updates metrics for 300 million pages every 24 hours. Overall, their content index includes 16.3 billion pages. All of these stats and more are publicly available on their Big Data page.
I do actually think it’s important to draw some attention to that second bullet point. Ahrefs harvests a lot of data directly, but they also get a lot of secondary data from people who link their own Search Console to the app, plus data from third-party sources. All of this helps broaden their coverage, but it also leaves a lot of gaps.
To fill those gaps, Ahrefs has to make a lot of assumptions. Some of those assumptions are small; if they know a keyword gets around 50 visits a month, and they know a site they have access to gets 30 visits a month from that keyword, they can easily assume that other sites ranking for that keyword are dividing up 20 visits.
They can also make assumptions based on the comparative popularity of different keywords and their search volume, or with different sites in their relative traffic numbers, and so on.
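The simple division example above, plus the idea of weighting by comparative popularity, can be sketched in a few lines. The site names and weights are invented; this is the shape of the estimate, not Ahrefs’ actual model:

```python
def estimate_remaining_traffic(monthly_volume, observed_visits, weights):
    """Split the unobserved share of a keyword's monthly traffic across
    the other ranking sites, proportionally to a popularity weight
    (which might come from click-through-rate curves by position)."""
    remaining = max(monthly_volume - observed_visits, 0)
    total = sum(weights.values())
    return {site: remaining * w / total for site, w in weights.items()}

# ~50 searches/month, 30 of which we can see landing on a site we track;
# split the remaining 20 between two other ranking sites (weights invented).
split = estimate_remaining_traffic(50, 30, {"site-a.example": 3, "site-b.example": 1})
# site-a.example is credited with 15.0 visits, site-b.example with 5.0
```

The real pipeline is far more sophisticated, but every approach of this kind shares the same property: the less first-party data anchoring the estimate, the more the final number is a weighted guess.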
Most of the time, though, these assumptions aren’t actually as simple as a little math equation. They use a lot of machine learning and data processing to make them. Note that I’m not saying AI here; most people who think of AI today think of generative AI, and generative AI isn’t reliable enough for the tasks Ahrefs is tackling. Ahrefs does use generative AI for some features, like their title/description/paragraph generators, but not in their analytics.
How Often Is Ahrefs Data Updated?
It’s difficult to be agile and respond to changing trends in your industry if the data sources you use don’t update often enough to see those trends happening. If you’re relying on Ahrefs to identify and react to changes, you need to know how rapidly Ahrefs updates to reflect those changes.
Fortunately, this is another bit of data Ahrefs publicly posts.
For their links database – the one you generally want to pay the most attention to – fresh data is integrated every 15-30 minutes. Since their scraping and crawling bots are always on duty, the data they send back can be folded into the database as it arrives.
They also say that every page/site/link in their index is refreshed at least once every 60 days. Slow, static, and low-priority sites may only be updated once in that time, while larger, more active, and higher-priority sites may be updated every single day.
Ahrefs assigns priority based on a few metrics. Sites with a higher DR, with more backlinks, more traffic, and more popularity take priority. Specific pages within a site can also be ranked higher or lower in priority based on things like their URL Rating, how often they change, and if they stand out in terms of traffic. Also, when Ahrefs notices a page gets new links, it bumps up the priority for that page for a while, assuming that it’s growing in popularity and its owners would want updated data.
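Ahrefs hasn’t published how this scoring actually works, but as a toy illustration of weighing those signals (the weights and the boost factor here are entirely invented), a priority function might look like:

```python
import math
from dataclasses import dataclass

@dataclass
class PageSignals:
    domain_rating: float       # site-level DR, 0-100
    backlinks: int             # links pointing at the page
    monthly_traffic: int       # estimated visits per month
    new_links_recently: bool   # fresh links bump priority for a while

def crawl_priority(s: PageSignals) -> float:
    """Toy score: bigger, busier pages get recrawled sooner.
    log1p keeps huge backlink/traffic counts from dominating the score."""
    score = (s.domain_rating
             + 10 * math.log1p(s.backlinks)
             + 5 * math.log1p(s.monthly_traffic))
    if s.new_links_recently:
        score *= 1.5  # temporary boost while the page is gaining links
    return score
```

Whatever the real formula is, the consequence is the same one described above: a high-DR page that just picked up new links gets refreshed daily, while a static page on a quiet site might only be touched once in the 60-day window.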
Is Ahrefs Accurate?
So, with all of this in mind, is Ahrefs accurate?
Yes and no. The truth is, even as the second-largest index in the world, they’re still smaller than Google, and there are gaps in their coverage. I can only speak for myself and my clients, but I find it to be somewhat hit or miss, depending on the keyword or the page. Some pages are very accurate, and their data matches Google’s data.
Other pages are not. I’ve had a few cases where I’ll look up the metrics for one of my pages for a keyword, and Ahrefs will say I’m ranked #98 or something. But, when I go to Google’s Search Console or even just search Google from an anonymized/logged-out account, my page shows up in the top three.
Essentially, this is the way I look at it: the larger and more popular your site is, the more accurate and reliable your metrics will be. Similarly, the more closely the keywords you’re researching match the ones Ahrefs actually tracks, the more reliable the data will be.
A lot of it comes down to a sanity check. If you look up a page in Ahrefs and they show it as unindexed or ranking very low, and that feels wrong, it probably is. On the other hand, if the data feels more or less correct, it probably is. When Ahrefs is wrong, they tend to be wrong by enough that the bad data point is easy to filter out. And if your site is small and unpopular enough that those low numbers are accurate, you’re probably not in a position where using Ahrefs will benefit you, at least not for the price you’re paying.
A big part of this comes down to keyword tracking. Ahrefs tracks a lot of keywords, but they don’t use the same kind of semantic indexing and natural language processing that Google does to make it all fuzzy. That’s why, on their Big Data page, they list that they have 110 billion keywords “ever seen,” but they only really track 28.7 billion “filtered keywords.”
What this means, in practice, is that a lot of keyword variations, long-tail keywords, and less-popular keywords are left out of their index. I’ve had content rank for a bunch of useful keywords, but since Ahrefs doesn’t track those keywords, all the traffic I get from them isn’t recorded, and Ahrefs says those pages have zero traffic.
Another thing to know – which isn’t a deal-breaker but is important to keep in mind – is that Ahrefs is a global company, and its data center is based in Singapore. They use IPs and proxies from 217 different countries, but the broad geographic data they give you might not be as nuanced as you might think. You might also end up in a situation where your pages seem oddly popular in countries you can’t sell to, so it’s not actually valuable traffic, even if it seems like it should be. This isn’t necessarily a flaw with Ahrefs – any analytics platform will have the same issues – but it’s a detail people often overlook.
Should You Use Ahrefs?
I do!
The thing about all of this is that there are only so many data sources available. You can get “first-party” data using Google Analytics or another analytics app plugged into your site directly. You can get data from Google’s Search Console. Since Google controls 90% of the search market, that’s your #1 data source.
When it comes to third-party tools, very few of them are able to leverage the kinds of immense resources that Ahrefs does. A few try – Moz, for example – but Ahrefs is the largest by far.
When you get further down the tiers, many of the third-party tools are themselves just taking Google’s data or using Ahrefs and white labeling it. So, even if you don’t want to use Ahrefs directly, you still might be.
I also find that, despite the few issues I’ve mentioned, Ahrefs is still broadly reliable and accurate. You don’t necessarily get fully comprehensive data, and it might not be the most up-to-date data, but it’s still better than what you’ll get from a lot of other sources.
My final verdict is that Ahrefs is definitely a worthwhile tool, as long as you know about some of the potential gaps or flaws going into it and you have the means to pay for it and make use of what they give you. If your site is too small or not very popular, it won’t be worth the cost.
Essentially, any of the issues Ahrefs has are issues everyone is going to have, and Ahrefs is large enough that they have the fewest of them. Meanwhile, they offer the most value you can get out of any one analytics platform, at least in my experience, and I’ve used a lot of them over the years.
For that matter, if you have any questions, feel free to ask in the comments! I’m not an Ahrefs representative, but I’ve been around the block and can help out to an extent.