How Can I Block My Sites from All SEO Tools and Scrapers?
There comes a time in every marketer’s life when they start to notice changes. Some of their metrics start to grow. The tone of their analytics changes. They start to grow hair in plac- err, never mind that one.
Websites, as they grow, get discovered. Sometimes, this is a good thing, like when Google indexes your site, feeds it into its discovery engine, and sends people your way. Other times, it’s a lot less pleasant, like when someone gets it into their head to scrape your entire site and copy it on their own domain with a single letter changed to try to phish your customers.
There’s an incredible amount of traffic on the web out there. And, depending on who you ask, as much as 50% of it might be bots. Bots from search engines, bots from botnets, bots both malicious and kind. It’s kind of a problem!
But is it a problem you want to try to solve?
That’s the question you will have to answer for yourself, sooner or later. Do you want to try to block bots, scrapers, and other non-person entities from visiting your site? Or do you just not care?
I wanted to talk a bit about it today, largely due to a few discussions I’ve seen on social media recently.
Why You Might Want to Block Crawlers and Scrapers
Before I get into the specific ways you can block bot traffic, including SEO tools and scrapers, I want to talk a bit about why you might want to block them.
There are a bunch of reasons!
They eat up server and site resources.
Bot traffic hitting your site is usually not too bad, at least not these days. A decade or two ago, it was a lot worse when bandwidth was at a premium, and a site being a bit too successful could crash a server.
These days, bot traffic is only a significant problem if you’re on a really, really, really cheap web host or you’re being deliberately attacked. Still, I know some webmasters who worry about resource usage, so that’s one reason.
They can screw up your analytics.
A slightly more important reason is that bot traffic can show up as people in your analytics. It makes a lot of your metrics look worse when these “people” show up, spend 0.06 seconds per page, and bounce with zero chance of converting. They don’t even get you ad revenue or click your affiliate links (or if they do, they don’t make a purchase.)
This is only kind of a problem, to be honest. Most analytics apps worth their salt do at least some filtering of bot traffic. All of the “good” bots also obey rules and identify themselves for easier filtering. But the malicious bots – which are probably around 65% of bot traffic – don’t. It can make your site look worse than it is.
They might be stealing your content.
This one is a concern, but it’s honestly a relatively small concern. Sometimes, a bot or scraper will show up on your site and just copy it. Copy the HTML, copy the content, copy the images, everything. These criminals set up their own versions of your site, usually just to spam a bunch of ads before they get delisted and start over.
I say this isn’t much of a concern for a few reasons. For one thing, it might not even be a bot doing it, and if it is, it might not be a bot you can reliably block. A lot of these people run their scrapers through proxies and only hit you once from any given IP, so blocking them doesn’t do much. And since the bot isn’t identifying itself – and is probably spoofing a user agent anyway – you can’t filter it that way, either.
They may be undercutting your business.
One of the worst ways that bot traffic can hurt you is if they’re undercutting your business. It’s obviously annoying if they’re stealing your content to repost on something like Medium with the affiliate links swapped out. But, for some brands, it’s even worse.
One of the examples I’ve seen people complain about is food blogs. Food blogs are in a tough spot; they have to publish the long-winded content that attracts search engines, but actual people don’t care about any of that and just want the recipe. So, there are scrapers and bots that pull recipes from those blogs and aggregate them on their own platforms. That cuts out the superfluous content… but it also cuts out the ads, affiliate links, store pages, and everything else. Those businesses can be hit pretty hard, and they have one of the most legitimate reasons to block bots.
You may not like or support a given business.
Another common discussion I’ve seen surrounding web scrapers and “SEO tools” is AI. It’s no secret that the major LLMs are trained on immense amounts of stolen content. The AI platforms that claim to be more ethical about it still might require an opt-out, and a lot of the shadier companies just don’t care.
So, whether you just want to hinder Semrush because they banned your account once, or you really don’t like AI and want to block them from scraping your site, if you don’t ethically or morally support a business, blocking them might be a good idea for you.
You don’t want to give your competitors insight into your site.
One of the biggest reasons – and the main reason I’m writing this post – is people who want to hide their sites from legitimate bots so that data about them isn’t readily available to their competitors.
For example, if you know your competitor uses Semrush and Ahrefs to find data about their competition and use it in their own marketing, you might want to block those two tools. That way, anyone who tries to scrape your keywords, dig into your site structure, or gather other data about you won’t be able to. At least not from those tools.
I’ll go into why this isn’t really a good idea at the end of this post, so stay tuned for that.
There are probably other reasons I haven’t thought up offhand, but that’s fine. I just wanted to run down some of the common thoughts, and whether or not they’re actually problems.
Is it Possible to Completely Protect Your Site from Crawlers?
No.
You can block some bots, some of the time. You can block some bots all of the time. Other bots you can’t block at all.
The more ethical the bot – think search engine crawlers, or tools like Ahrefs and Semrush – the more likely it is to obey your blocks.
The unethical bots, the ones hammering your login page, scraping your content, or whatever? They’re hiding behind proxies or Tor, they’re using botnets built from compromised servers or residential devices, or they’re rotating through one-and-done IPs and user agents. You can play whack-a-mole with them forever, and you’ll still have that traffic around.
You can certainly try, though, so here are the options you can use.
Option 1: Blocking Bots with Robots.txt
The first option you have is to block specific bots with directives in your robots.txt file.
For those who don’t know, robots.txt is just a simple text file you can put in the root directory of your server. Ethical bots will check it before they check anything else on your domain. If it says, “Hey, ignore X page” or “get outta here, bot!” then they will. The unethical bots don’t care, but the ethical ones do.
Each entry in robots.txt is two lines minimum. One is the user agent that identifies the bot. The other is what pages to disallow. So, for example, if you wanted to block both Semrush and Ahrefs, you would need code like:
User-agent: SemrushBot
Disallow: /

User-agent: SiteAuditBot
Disallow: /

User-agent: SemrushBot-BA
Disallow: /

User-agent: AhrefsBot
Disallow: /
Ahrefs only uses the one bot, but Semrush has several, and you need to issue commands to each of them individually. Ethical companies publish a list of their bots, so you can just add them if you like. Here’s Google’s, for example.
On the Disallow line, you can specify any URL path or fragment of a path. A lone / matches everything under domain.com/, which is the entire site. If you wanted to keep crawlers out of your blog, you would use /blog/. You can even block specific URLs if you want. It’s all pretty easy.
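For instance, here’s a minimal sketch – the bot name and paths are made up purely for illustration – that keeps one crawler out of a blog section and away from a single page:
# Hypothetical example: block one crawler from the blog and from one specific page
User-agent: ExampleBot
Disallow: /blog/
Disallow: /old-landing-page.html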
The downside to this method, obviously, is that only the ethical bots obey it. While you can disallow all the different bots (just use User-agent: *), there’s very little reason to do so. If you have a site you want to remain hidden, like a live test version of your main site, I guess you could do it. Otherwise, it’s only blocking the good kind of tools and traffic, and not really the bad ones at all.
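If you do go that route for something like a staging copy, the whole file can be this short (assuming you genuinely want every compliant crawler, search engines included, to stay away):
# Ask every compliant crawler to stay out of the entire site
User-agent: *
Disallow: /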
An alternative option is the .htaccess file. It’s an Apache configuration file that lets you filter traffic at the server level, before a request ever reaches your pages. It can block more bots, including some of the bad ones, but if you’re overzealous, it can block legitimate users, too. It’s also limited to web servers running Apache, so if you’re using different server software, it won’t work.
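As a rough sketch of what that looks like – assuming Apache with mod_rewrite enabled, and reusing the same bot names from the robots.txt example – you could return a 403 to any request whose user agent matches:
<IfModule mod_rewrite.c>
RewriteEngine On
# Serve 403 Forbidden to anything identifying itself as SemrushBot or AhrefsBot
RewriteCond %{HTTP_USER_AGENT} (SemrushBot|AhrefsBot) [NC]
RewriteRule .* - [F,L]
</IfModule>
The same caveat applies, though: a scraper that lies about its user agent sails right past a rule like this.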
Option 2: Cloudflare Automation Protection
The second option is to throw your hands up in the air and tell someone else to do it.
In this case, what I’ve found to be the best option is to use Cloudflare’s security features. Cloudflare does a lot of traffic filtering and, since it sits in front of millions of websites, it has a ton of data to use to suss out the bots that try to hide themselves. You can spoof IPs and use proxies, but bot behavior is still programmatic, so it’s easier to identify when you have the kind of visibility a service like Cloudflare does.
For the most part, what Cloudflare can do is block bots that give themselves away with obvious behavioral patterns, or that come from sources unlikely to be real users. Traffic coming from a data center is not likely to be a regular user, so blocking it is probably fine. Cloudflare can handle that for you.
You’ve also probably encountered what happens if Cloudflare is skeptical about you. They don’t necessarily block traffic out of the gate, but they’ll issue a challenge or a captcha. Real users just have to click a button, and they’ll be let through just fine, but bots won’t.
You can also manually choose specific bots to block in Cloudflare, which works sort of like a combination of .htaccess and robots.txt. It’s not necessarily any more effective than just robots.txt, but the option is there.
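If you go the manual route, those rules are written in Cloudflare’s filter expression language. As a sketch – double-check the field names against Cloudflare’s current documentation and your own dashboard – an expression along these lines, with the rule’s action set to Block, would catch the same two crawlers at the edge:
(http.user_agent contains "SemrushBot") or (http.user_agent contains "AhrefsBot")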
Personally, I think there are a bunch of good reasons to use Cloudflare, or something very much like it. Blocking scrapers, SEO tools, and other bots is a reasonable side feature.
Option 3: Opt Out of Each Crawler Individually
Another option is to go to the owner of the bot and ask to be put on a list of prohibited sites. Some brands with SEO tools and scrapers let you go to their site and opt out. Most don’t.
Semrush, for example, just tells you to use robots.txt.
Option 4: Don’t Worry About It
Honestly? I say don’t worry about it.
Pretty much every problem you’d hope to solve by blocking bots isn’t one you can actually solve that way.
Blocking bots from search engines just means people can’t find your site. Blocking SEO tools blocks those specific tools, but there are thousands of SEO tools out there. I mean, heck, take a look at me. Would you have thought to block Topicfinder? Probably not; I’m still a pretty small business. By the time you realize I’ve seen your site, the data is already in someone else’s hands.
Besides, it’s all public anyway. Putting up a roadblock to seeing what keywords you’re targeting just means someone has to dig through your site by hand and build their own spreadsheet. The data isn’t any harder to get; it’s just more tedious.
And malicious bots, of course, don’t obey any of the above options anyway. So, why bother?
Instead, I say make use of those tools yourself. If your competitors are using them, don’t let them get ahead of you; use them right back.