News: Welcome to the TinyPortal Support site.

Author Topic: Tips for " .htaccess" offline browsers and 'bad bots  (Read 7944 times)


Offline Maxx1

  • Sr. Member
  • Posts: 352
  • Learning To Fly
    • SMF Theme & Portal Testing
Tips for " .htaccess" offline browsers and 'bad bots
« on: May 01, 2012, 08:00:51 AM »
Offline browsers and 'bad bots'

Note: the rules below will be added to your existing ".htaccess" file!

Offline browsers are pieces of software which download your web page, follow the links to your other pages, and download all the content and images. The purpose is innocent enough: the visitor can log off the Internet and browse the site without a connection. But the demand on the server and the bandwidth usage can be expensive. 'Bad bots', as they are often called, are programs which visit your web site to scrape content, look for security holes, or scan for email addresses. This is often how your email address ends up in 'spam' databases: someone has set a 'bot' loose to crawl the Internet and collect addresses. These programs and 'bots' often ignore the rules set out in the site's 'robots.txt' file.
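As an aside, well-behaved crawlers honor a robots.txt file placed in the site root. A minimal sketch might look like the following (the paths are only examples, adjust them to your own site); the bad bots discussed here will simply ignore it, which is why the .htaccess approach below is needed:

Code: [Select]
User-agent: *
Disallow: /admin/
Disallow: /cgi-bin/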

Below is a useful example of how to block some common 'bots' and site rippers. Create a .htaccess file, following the main instructions and guidance, that includes the following text:

Code: [Select]
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]

Note: I will try to come up with a useful basic .htaccess file. There are so many ways to go about this, as well as making the site SEO friendly, so please bear with me!

Also note: these rules are for Apache servers.

regards,
Maxx
But Mama, That's Where all the fun is!

Offline IchBin™

  • Developer
  • Posts: 16228
    • My Website
Re: Tips for " .htaccess" offline browsers and 'bad bots
« Reply #1 on: May 01, 2012, 10:04:20 AM »
To explain what this does: it checks the user agent of the client that is crawling your site. If it matches any of the patterns at the end of each line, the [F] flag on the final RewriteRule returns a 403 Forbidden response, refusing them access to any part of your site.

Personally I'd probably do a "Redirect 503" to them instead. This would tell them "service is unavailable". :)
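For anyone wanting to try that, a rough sketch of sending a 503 from .htaccess with mod_rewrite (for a non-3xx status, the R= flag just sets the response code and stops rewriting; the HTTrack pattern is only an example taken from the list above):

Code: [Select]
RewriteEngine On
# Match one example bad bot (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC]
# A non-redirect status in R= sets the response code and stops rewriting
RewriteRule ^ - [R=503,L]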

Offline Maxx1

  • Sr. Member
  • Posts: 352
  • Learning To Fly
    • SMF Theme & Portal Testing
Re: Tips for " .htaccess" offline browsers and 'bad bots
« Reply #2 on: May 01, 2012, 02:34:03 PM »
Yes, and redirects can be useful in various ways. You can redirect to a given HTML/SHTML page for warnings, to a splash or intro page, or to your forum's index.php. I use 404s to redirect to the main root when someone types the wrong URL, has an old bookmark, or when you change directories.
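For example, sending visitors who hit a missing page back to the site root can be done with a single ErrorDocument line (the target path here is just an example; point it at your own index page):

Code: [Select]
# Send 404s to the forum index instead of the default error page
ErrorDocument 404 /index.php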

I'm just posting general knowledge about .htaccess for those who do not know or would like to learn more. I started with this one because some people reported apparent or possible flooding of their site in a previous post, but you are right.

Not sure if this would work better on a sub-domain, or for those who may be using a directory? But then again the redirect page would also work.

To be honest I'm working on a super .htaccess that will accomplish many things with one file!

regards,
Maxx
But Mama, That's Where all the fun is!

Offline WillyP

  • Support Team
  • Posts: 769
    • Planet Descent
Re: Tips for " .htaccess" offline browsers and 'bad bots
« Reply #3 on: May 05, 2012, 02:45:05 PM »
Doesn't 503 imply a temporary outage, and encourage the client to try again? Wouldn't a 403 [Forbidden] be more appropriate, since you want to forbid bad users from using the site at all, to conserve bandwidth?

My understanding is that 5xx should only be used to report a server error. In this case the server should refuse the request, but it's not in error.
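For what it's worth, the [F] flag in the rule set above already produces a 403, and a custom page can be attached to that status; a sketch, with a hypothetical page name:

Code: [Select]
# Serve a custom page along with the 403 Forbidden status
ErrorDocument 403 /forbidden.html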

Offline IchBin™

  • Developer
  • Posts: 16228
    • My Website
Re: Tips for " .htaccess" offline browsers and 'bad bots
« Reply #4 on: May 05, 2012, 06:58:24 PM »
Your choice really. You can forbid them, or make them think the site is down or gone. :)