Re: [squid-users] disallow caching based on agent

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Wed, 27 Oct 2010 22:13:16 +0000

On Wed, 27 Oct 2010 13:39:59 -0400, alexus <alexus_at_gmail.com> wrote:
> On Tue, Oct 26, 2010 at 10:15 PM, Amos Jeffries <squid3_at_treenet.co.nz>
> wrote:
>> On Tue, 26 Oct 2010 16:34:52 -0400, alexus <alexus_at_gmail.com> wrote:
>>> On Mon, Oct 25, 2010 at 6:38 PM, Amos Jeffries <squid3_at_treenet.co.nz>
>>> wrote:
>>>> On Mon, 25 Oct 2010 12:38:49 -0400, alexus <alexus_at_gmail.com> wrote:
>>>>> is there a way to disallow serving of pages based on browser (agent)?
>>>>> I'm getting a lot of these:
>>>>>
>>>>> XX.XX.XX.XX - - [25/Oct/2010:16:37:44 +0000] "GET
>>>>> http://www.google.com/gwt/x? HTTP/1.1" 200 2232 "-"
>>>>> "SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1
>>>>> UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible;
>>>>> Googlebot-Mobile/2.1; +http://www.google.com/bot.html)"
>>>>> TCP_MISS:DIRECT
>>>>
>>>> Of course. You may...
>>>>
>>>> http://www.squid-cache.org/Doc/config/cache
>>>>
>>>> Although you need to be aware that preventing one object from being
>>>> cached operates by removing records of it after the transaction has
>>>> finished. The effect you can expect from doing this is that a visit
>>>> by GoogleBot will empty your cache of most content.
>>>>
>>>> Amos
>>>>
>>>
>>> I'm not sure what you mean by that. Somehow my Squid gets hit by
>>> different bots, and I was thinking of disallowing access to them so
>>> they don't hit me as hard... maybe it's a stupid way of dealing with
>>> things...
>>
>> Ah. Not caching will make the impact worse. One of the things Squid
>> offers is reduced web server impact from visitors. Squid is front-line
>> software.
>>
>> * Start with creating a robots.txt. The major bots will obey that and
>> you can restrict where they go and sometimes how often.
>>
>> * Allow caching of dynamic pages where possible, with squid-2.6 and
>> later (http://wiki.squid-cache.org/ConfigExamples/DynamicContent).
>> Squid will handle the bots and normal visitors faster if it has cached
>> content to serve out immediately instead of waiting.
>>
>> * Check your squid.conf for performance killers (regex, external
>> helpers) and reduce the number of requests reaching those ACL tests as
>> much as possible. Squid routinely handles thousands of concurrent
>> connections for ISPs, so a visit by several bots at once should not
>> really be any visible load.
>>
>>
>> Amos
>>
>
> I'm a little confused... what does robots.txt have to do with squid?

Very little. It will however make search bots reduce their impact.
Squid is just one tool among many for solving website traffic problems.
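
To tie this back to the original question, a minimal squid.conf sketch of
the suggestions earlier in the thread might look like the lines below. The
ACL name and the User-Agent pattern are only examples; adapt them to the
agents you actually see in access.log.

  # 'browser' ACLs are regular expressions matched against the User-Agent
  # request header. 'gbot' and the pattern here are placeholders.
  acl gbot browser Googlebot-Mobile

  # Do not cache replies for those requests (the 'cache' directive linked
  # above). Note the caveat already mentioned: matching objects get removed
  # from the cache, so normal visitors lose the benefit of cached copies.
  cache deny gbot

  # From the DynamicContent wiki page: with squid-2.6 and later, drop any
  # old "cache deny QUERY" rule and let a refresh_pattern handle dynamic
  # URLs so they can be cached when the server allows it.
  refresh_pattern -i (/cgi-bin/|\?) 0 0% 0

As noted above, though, denying caching for the bots increases rather than
decreases the load they cause.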

> where exactly should I place this robots.txt ?

As a publicly available file at the root of the website.
http://www.robotstxt.org/ has details about the format and how to use it
to control the web bots. The http:// link in the bot's User-Agent header
usually has specific details on what to add to robots.txt to control that
bot.
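
As an illustration only (the bot name, paths and delay below are
placeholders, not recommendations), a small robots.txt might look like:

  # Served from the top level of the site, e.g. http://www.example.com/robots.txt
  User-agent: Googlebot-Mobile
  Disallow: /search

  # Crawl-delay is an extension; some bots honour it and others ignore it.
  User-agent: *
  Crawl-delay: 30
  Disallow: /private/
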
The bad bots which don't obey it will usually be missing an http:// link
in their User-Agent, and Squid can be set to deny all access to those.
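
One way to do that, sketched here with made-up ACL names and User-Agent
patterns, is a 'browser' ACL combined with an http_access deny rule placed
before the usual allow rules:

  # Regular expressions matched against the User-Agent request header; the
  # patterns are examples only -- list the agents misbehaving in your logs.
  acl badbots browser -i badspider evilscraper
  http_access deny badbots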

Amos