A tool to prevent site being archived?

tonyjackson3
Newbie
Posts: 7
Registered: ‎09-07-2008

A tool to prevent site being archived?

Hi, my site was archived the other day - it has happened a few times in the past, so I no longer keep much there.  Still, on one recent day I got 768 hits from a single address in New Zealand.  That represents almost a hundred hits for each of the files hosted there...  And my bandwidth was way over the limit, so the site was archived and I got the maddening letter.
Obviously there is nothing I could have done to prevent this directly, as I was not the one accessing the site.  Still, it has made me wonder whether, if this sort of event is at all common, one might have a script of some sort which silently watches the log.  When a single user has made, say, ten hits on the site in one day, he or she would get a notice saying that users are limited to twelve hits per day; then, if there are two more hits, that user is refused access for the next 24 hours.
Might this be possible to achieve?
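(Purely to illustrate the idea: a very rough PHP sketch of such a log-watching limiter, assuming the host can actually run PHP and that it can write a small counts file - the file name and the limits below are invented, not real values.)
<?php
// Rough sketch, to be include()d at the top of each page: count today's hits
// per visiting IP and refuse access once a daily limit is exceeded.
// The counts file name and the limits are invented for illustration.
$limit      = 12;                                        // hits allowed per IP per day
$countsFile = __DIR__ . '/hits_' . date('Y-m-d') . '.json';
$ip         = $_SERVER['REMOTE_ADDR'];

$counts = file_exists($countsFile)
    ? (array) json_decode(file_get_contents($countsFile), true)
    : array();
$counts[$ip] = isset($counts[$ip]) ? $counts[$ip] + 1 : 1;
file_put_contents($countsFile, json_encode($counts), LOCK_EX);

if ($counts[$ip] > $limit) {
    header('HTTP/1.1 403 Forbidden');
    exit('Sorry - visitors are limited to ' . $limit . ' page requests per day.');
} elseif ($counts[$ip] == 10) {
    // polite warning once ten hits have been made, as suggested above
    echo '<p>Please note: visitors are limited to ' . $limit . ' page requests per day.</p>';
}
?>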
Best regards, T.
9 REPLIES
Gabe
Grafter
Posts: 767
Registered: ‎29-10-2008

Re: A tool to prevent site being archived?

That sounds automated. Not much point talking nicely to a bot. A spider trap might help - for which there are various recipes on the web.
Gabe
avatastic
Grafter
Posts: 1,136
Thanks: 2
Registered: ‎30-07-2007

Re: A tool to prevent site being archived?

You could add a robots.txt to the root of your website, and hope that the crawlers honour it.
http://www.robotstxt.org/
F9 member since 4 Sep 1999
F9 ADSL customer since 27 Aug 2004
DLM manages your line the same way DRM manages your rights.
Look at all the pretty graphs! (now with uptime logging!)
IanSn
Rising Star
Posts: 565
Thanks: 31
Registered: ‎25-09-2011

Re: A tool to prevent site being archived?

In my experience, naughty hits like this are from sources that either don't read the robots file or ignore it.
I usually block the IP.
I've never found a suitable way of preventing randomers, though I have various items regulated via .htaccess, etc.
Also, these days, I tend to keep a several-times-a-day watch on the raw stats log. Suspicious activity is usually obvious - single hits preceded by further scans - and you get to know which IP to block before they come back looking for PHP registration files... you see them arrive later for exactly that one file, only to get the 403. Ha!
PITA, but part of life.
funny you should mention NZ -- 202.175.135.* - DSLAK-NZ - Datacom Systems Ltd, just yesterday. Bonkers scan of everything...
tonyjackson3
Newbie
Posts: 7
Registered: ‎09-07-2008

Re: A tool to prevent site being archived?

Hi Gabe,
Sorry slow to reply, I didn't get notified of any replies, so after a few days I stopped looking!
Quote from: Gabe
That sounds automated. Not much point talking nicely to a bot. A spider trap might help - for which there are various recipes on the web.

I looked up 'spider trap' and it seems to be just the thing I need to avoid.  Or is it like 'bandwidth', a word which means totally different things in different worlds?  (I looked it up on Google, and it was described as an element on a page which tricks the Google spider into going around in circles.)
Best, T.
tonyjackson3
Newbie
Posts: 7
Registered: ‎09-07-2008

Re: A tool to prevent site being archived?

Hi Ian,
Quote from: IanSn
In my experience, naughty hits like this are from sources that either don't read the robots file or ignore it.
I usually block the IP.

How would I do that?
Quote
... these days, I tend to keep a several-times-a-day watch on the raw stats log. Suspicious activity is usually obvious - single hits preceded by further scans - and you get to know which IP to block before they come back looking for PHP registration files... you see them arrive later for exactly that one file, only to get the 403. Ha!
PITA, but part of life.

For my kind of site it is impossible to imagine checking its stats daily.  But for rare cases I guess I could (if I knew how) block a particular IP.  Could the '403' be configured to explain what is going on?  Otherwise a legitimate visitor who happened to be barred would simply be shut out, with no way of knowing how or why, or what to do about it.
I don't know what a PHP registration file is, so I don't understand what anyone would be seeking - afaik I don't host any?
Quote
funny you should mention NZ -- 202.175.135.* - DSLAK-NZ - Datacom Systems Ltd, just yesterday. Bonkers scan of everything...

So I should block these people, and suggest to others that they do likewise, or write to them, or what?
Do ISPs other than PlusNet deal with this more effectively?  I haven't heard of others dealing with this sort of grief.
Best, T.
tonyjackson3
Newbie
Posts: 7
Registered: ‎09-07-2008

Re: A tool to prevent site being archived?

Hi avatastic,
Quote from: avatastic
You could add a robots.txt to the root of your website, and hope that the crawlers honour it.
http://www.robotstxt.org/

What other effects would this have - e.g. would it effectively make the site invisible to Google?  The site is charitable, not commercial, but part of the point is being able to be found!
Best, T.
Gabe
Grafter
Posts: 767
Registered: ‎29-10-2008

Re: A tool to prevent site being archived?

You don't need to block Google. You can set different exclusions for different bots, e.g.
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow:

would tell all bots but Google to keep off the grass. But bad bots tend to ignore the notices.
You can also set a "Crawl-delay" (which Google ignores, but you can set it in Google Webmaster Tools), but that's more relevant to avoiding dips in response time on large sites than limiting daily bandwidth on small ones. Bots that respect robots.txt aren't usually the problem.
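For reference, the non-standard directive sits in robots.txt under the User-agent line it applies to and asks compliant bots (Bing's, for example) to wait roughly that many seconds between requests - the ten here is just an arbitrary example:
User-agent: *
Crawl-delay: 10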
You can use a bad-bot list, but it's easy to go OTT. An alternative or additional method is to set a spider trap. I'll give a brutal but effective version.
First you create a hidden link on your homepage to a folder that will act as your spider trap (there are various ways of doing that). You put it at the top and bottom of the page body, e.g.
<body>
<div style="display:none"><a href="/bath"><!-- bath time --></a></div>
...
<div style="display:none"><a href="/bath"><!-- bath time --></a></div>
</body>

Then you tell spiders to keep out of the bath by creating a robots.txt file and putting it in whichever folder corresponds to www.yoursite.com/robots.txt (that's where bots should look for it).
User-agent: *
Disallow: /bath/
Disallow: /bath
User-agent: Googlebot
Disallow: /bath/
Disallow: /bath

You say it with and without the trailing slash, because bots are thick, and you say it again to Googlebot, because it's been known to be hard of hearing, and it pays to be paranoid.
Then you put an index.php file in your /bath, such as
<!doctype html>
<html>
<head>
<title>Naughty Spider</title></head>
<body>
<p>Please respect the robots.txt</p>
<?php
// Record the visiting client's IP address
$addr = $_SERVER['REMOTE_ADDR'];
// Append a deny rule for that address to the site's .htaccess (one level up from /bath)
file_put_contents("../.htaccess", "deny from $addr\n", FILE_APPEND | LOCK_EX);
?>
</body>
</html>

If you already have a .htaccess file, make sure it ends on a new line.
Then check your .htaccess file occasionally, just to make sure you're only trapping rude spiders. The idea is to forbid anything that tries to scrape your whole site without asking and not the polite spider or ordinary user.
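If it's working, the .htaccess simply accumulates one line per trapped bot, something like this (the addresses here are invented for illustration):
deny from 203.0.113.47
deny from 198.51.100.22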
Gabe
IanSn
Rising Star
Posts: 565
Thanks: 31
Registered: ‎25-09-2011

Re: A tool to prevent site being archived?

@Gabe - Like the naughty spider trapper!
@Tony -
Definitely agree - use a robots file.
But having said that, the worst hits I've experienced didn't even look at the robots file.
Normally there's a Control Panel that would allow you to add an IP, or a range to block. This adds the IP to that .htaccess file mentioned above.
Or you can edit the .htaccess file 'manually', though carefully. But if you're not familiar, the Control Panel would be best.
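For example, lines like these in the .htaccess at the site root would block a single address, or (with a partial address) a whole range such as the NZ one mentioned earlier. The ErrorDocument line is optional and would send refused visitors to an explanatory page of your own - the page name here is made up:
deny from 203.0.113.99
deny from 202.175.135.
# optional: show blocked visitors a page explaining why they got the 403
ErrorDocument 403 /blocked.html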
Sorry, by 'PHP registration file' I meant the file that's used when (if) people register on the site. (PHP is a scripting language commonly used on the server side of websites, often for talking to databases.) 'Hackers' know what this registration file is called and will attempt a single hit on that file. If it's found, and if it's an automated 'bot', it can fill in the registration form in seconds. Very annoying.
Single hits on vulnerable files just get immediately blocked around here!
In my experience writing to the abuse dept. of the providers results in goose egg. But I'd still encourage people to do it.
edit - typo.  (dept. not debt. !!)
spraxyt
Resting Legend
Posts: 10,063
Thanks: 674
Fixes: 75
Registered: ‎06-04-2007

Re: A tool to prevent site being archived?

Might be worth mentioning that we can't do PHP with sites on Homepages. In this case the principles still apply, but raw access logs would have to be searched for transgressors (manually or using a CGI script elsewhere) and .htaccess updated manually.
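As a rough sketch of that kind of search: a small PHP script run on your own machine (not on Homepages) against a downloaded raw access log could tally hits per address. Apache common/combined log format is assumed, and the file name and threshold below are invented:
<?php
// Rough sketch: tally hits per IP in a downloaded raw access log and
// print any address over a threshold. File name and threshold are invented.
$logFile   = 'access.log';   // raw log downloaded from the hosting control panel
$threshold = 50;             // hits per IP worth investigating

$counts = array();
foreach (file($logFile) as $line) {
    // In Apache common/combined log format the client IP is the first field
    $ip = strtok($line, ' ');
    if ($ip !== false) {
        $counts[$ip] = isset($counts[$ip]) ? $counts[$ip] + 1 : 1;
    }
}

arsort($counts);
foreach ($counts as $ip => $hits) {
    if ($hits >= $threshold) {
        echo "$ip\t$hits hits\n";
    }
}
?>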
Any source exhibiting bot-like behaviour that doesn't check and respect robots.txt deserves to be blocked as a matter of course.
David