Tips for " .htaccess" offline browsers and 'bad bots

Started by Maxx1, May 01, 2012, 03:00:51 PM


Maxx1

offline browsers and 'bad bots'

Note: this will be added to your existing ".htaccess" file!

Offline browsers are pieces of software that download your web pages, following the links to your other pages and pulling down all the content and images. The purpose is innocent: the visitor can log off the Internet and browse the site without a connection. But the demand on the server and the bandwidth usage can be expensive.

'Bad bots', as they are often called, are programs that visit your web site to scrape content, look for security holes, or scan for email addresses. This is often how your email address ends up in spam databases: someone has set a bot to crawl the Internet collecting addresses. These programs and bots often ignore the rules set out in 'robots.txt' files.
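For context, the rules in a 'robots.txt' file look like the short sketch below (the path is only an example). Well-behaved crawlers honour these rules; the rippers and harvesters discussed here generally do not, which is why the .htaccess approach is needed.

User-agent: *
Disallow: /private/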

Below is a useful example of how to block some common bots and site rippers. Create a .htaccess file following the main instructions and guidance, and include the following text:

RewriteEngine On
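# Deny known site rippers and email harvesters by their User-Agent string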
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
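# Any request matching one of the conditions above is refused with 403 Forbidden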
RewriteRule ^.* - [F,L]
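A quick way to test that the block works, assuming curl is installed (substitute your own domain for example.com), is to request a page while spoofing one of the listed User-Agents; you should get a 403 back:

curl -I -A "HTTrack" https://example.com/
# Expected first response line: HTTP/1.1 403 Forbidden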


Note: I will try to come up with a useful basic .htaccess file. There are so many ways to go about this, as well as about making the site SEO friendly, so please bear with me!

Also note that these are for Apache servers.

regards,
Maxx
But Mama, That's Where all the fun is!

IchBin

To explain what this does: it checks the User-Agent of whatever is crawling your site. If it matches any of the patterns at the end of each line, the request is refused with a 403 Forbidden response (that is what the [F] flag does), so the bot cannot access any part of your site.

Personally I'd probably do a "Redirect 503" to them instead. This would tell them "service is unavailable". :)
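A minimal sketch of that idea with mod_rewrite (the HTTrack condition is just one entry borrowed from the list above, and recent Apache versions accept non-3xx codes in the R flag):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC]
RewriteRule ^ - [R=503,L]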

Maxx1

Yes, and redirects can be useful in various ways: you can redirect to a given HTML/SHTML page for warnings, to a splash or intro page, or to your forum's index.php. I use 404s to redirect to the main root when someone types the wrong URL, follows an old bookmark, or after you change directories.
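For example, in .htaccess form (the paths are only placeholders, not TinyPortal specifics):

# Send "page not found" visitors to the forum index
ErrorDocument 404 /index.php
# Point an old bookmark at a page's new home
Redirect 301 /oldpage.html /newpage.html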

I'm just posting general knowledge about .htaccess for those who do not know it or would like to learn more. I started with this one because some people reported apparent or possible flooding of their site in a previous post, but you are right.

I'm not sure whether this would work better on a subdomain, or for those who may be using a directory, but then again the redirect page would also work.

To be honest I'm working on a super .htaccess that will accomplish many things with one file!

regards,
Maxx
But Mama, That's Where all the fun is!

WillyP

Doesn't 503 imply a temporary outage, and encourage the client to try again? Wouldn't a 403 [Forbidden] be more appropriate, since you want to forbid bad users from using the site at all, to conserve bandwidth?

My understanding is that 5xx should only be used to report a server error; in this case the server should refuse the request, but it is not in error.
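For reference, the [F] flag in the ruleset above is shorthand for exactly that: it answers the request with 403 Forbidden, so these two lines behave the same way:

RewriteRule ^ - [F,L]
RewriteRule ^ - [R=403,L]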

IchBin

Your choice really. You can forbid them, or make them think the site is down or gone. :)