A couple more things about the ACLs used in my test:

- all of them are allow ACLs (no deny rules, so no precedence to worry about), except for a deny-all at the bottom
- the ACL line that permits the test source to the test destination has zero overlap with the rest of the rules
- every rule has an IP-based restriction (even the ones with url_regex are source -> URL regex)
I moved the ACL that allows my test traffic from the bottom of the ruleset to
the top, and the resulting performance numbers went up as if the other ACLs
didn't exist. So it is very clear that 3.2 is evaluating every rule until it
finds a match.
I changed one of the url_regex rules to match just one line rather than a
file containing 307 lines to see if that made a difference, and it made no
significant difference. This indicates to me that it's not having to fully
evaluate every rule (it can skip doing the regex when the IP match fails).
I then changed all the ACL lines that used hostnames to use IP addresses
instead, and this also made no significant difference.
I then changed all subnet matches to single IP addresses (just nuked the /##
suffixes throughout the config file), and this also made no significant
difference.
So why are the address matches so expensive?

And as noted in the e-mail below, why do these checks not scale nicely
with the number of worker processes? If they did, the fact that one
3.2 process is about 1/3 the speed of a 3.0 process in checking the ACLs
wouldn't matter nearly as much, given how easy it is to get an 8+ core
system.
It seems to me that all accept/deny rules in a set should be able to be
combined into a tree to make searching them very fast.

So for example, if you have

accept 1
accept 2
deny 3
deny 4
accept 5

you would create three trees (one with accept 1 and accept 2, one with
deny 3 and deny 4, and one with accept 5) and then check each tree in turn
to see if you have a match.
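As a rough sketch of that grouping step (illustrative Python, not Squid's actual internals; the (action, rule_id) representation is made up):

```python
# Group consecutive http_access rules with the same action into batches;
# each batch could then be compiled into one searchable structure ("tree").
# The (action, rule_id) tuples are a hypothetical stand-in for real rules.
def group_rules(rules):
    batches = []
    for action, rule_id in rules:
        if batches and batches[-1][0] == action:
            batches[-1][1].append(rule_id)       # extend the current batch
        else:
            batches.append((action, [rule_id]))  # action changed: new batch
    return batches

rules = [("accept", 1), ("accept", 2),
         ("deny", 3), ("deny", 4),
         ("accept", 5)]
print(group_rules(rules))
# [('accept', [1, 2]), ('deny', [3, 4]), ('accept', [5])]
```

The point is just that consecutive same-action rules collapse into one batch, so five rules become three searchable units.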
The types of match could be done in order of increasing cost, so if you
have ACL entries of type port, src, dst, and url_regex, organize the tree
so that you check ports first, then src, then dst, and only if all of those
match do you need to do the regex. This would be very similar to the
shortcut logic used today within a single rule, where you bail out as soon
as one part doesn't match.
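For illustration, the cheap-to-expensive ordering within one check might look like this (a sketch assuming a simple dict-based rule/request representation, not Squid code):

```python
import re

# Evaluate one rule against a request, cheapest attribute first, so the
# regex only runs if port, src, and dst all matched (or were unspecified).
def rule_matches(rule, request):
    if rule.get("port") is not None and request["port"] != rule["port"]:
        return False
    if rule.get("src") is not None and request["src"] != rule["src"]:
        return False
    if rule.get("dst") is not None and request["dst"] != rule["dst"]:
        return False
    if rule.get("url_regex") is not None:
        return re.search(rule["url_regex"], request["url"]) is not None
    return True

rule = {"port": 80, "src": "10.0.0.5", "url_regex": r"\.example\.com/"}
req = {"port": 80, "src": "10.0.0.5", "dst": "192.0.2.1",
       "url": "http://www.example.com/index.html"}
print(rule_matches(rule, req))                  # True
print(rule_matches(rule, dict(req, port=443)))  # False, regex never runs
```

In the second call the port comparison fails immediately, so the expensive regex is never compiled against the URL at all.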
You could go with a complex tree structure, but since this only needs to
change at boot time, it seems to me that a simple sorted array that you can
do a binary search on would work for the port, src, and dst trees. The
url_regex tree is probably easiest to create initially as just a list of
regex strings to match, working down that list, but eventually it may be
best to build a parse tree so that you only have to walk down the string
once to see if you have a match.
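A minimal sketch of the sorted-array idea (Python's bisect standing in for whatever binary search Squid would actually implement in C++):

```python
import bisect

def build_index(values):
    # Sort once at startup/reconfigure; the ruleset can't change at runtime,
    # so the build cost is paid exactly once.
    return sorted(values)

def index_contains(index, value):
    # O(log n) membership test via binary search on the sorted array.
    i = bisect.bisect_left(index, value)
    return i < len(index) and index[i] == value

ports = build_index([8080, 13, 443, 80])
print(index_contains(ports, 443))  # True
print(index_contains(ports, 22))   # False
```

The same structure works for src and dst if addresses are stored in a comparable form (e.g. as integers); subnet matches would need interval lookups rather than exact ones, which is a refinement on top of this.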
You wouldn't quite get the full speedup, because you would actually have to
do two checks at each level: one for rules that specify something in the
current tree and one for rules that don't (one check for http_access lines
that specify a port number and one for those that don't, for example).
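That two-check wrinkle could look roughly like this (hypothetical rule-id sets, just to show the shape):

```python
# At each level of the tree you look up two buckets: the rules that specify
# this attribute (indexed by value) and the rules that leave it unspecified,
# which must stay candidates no matter what value the request carries.
def candidates_at_level(by_value, any_bucket, value):
    return by_value.get(value, set()) | any_bucket

by_port = {80: {"rule_a"}, 443: {"rule_b"}}  # rules that name a port
no_port = {"rule_c"}                         # rules with no port at all
print(sorted(candidates_at_level(by_port, no_port, 80)))
# ['rule_a', 'rule_c']
```

So each level costs one indexed lookup plus one constant-time union with the "unspecified" bucket, which keeps the overall lookup near O(log n) per tree.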
This sort of ACL structure would reduce a complex ruleset to roughly
O(log n) instead of the current O(n) (a really complex ruleset would be the
log n of each tree, added together).
There are cases where this sort of thing would be more expensive than the
current, simple approach, but those would be simple rulesets, which aren't
affected much by a few extra checks.
David Lang
On Fri, 8 Apr 2011, david_at_lang.hm wrote:
> I did some more testing with this, and it looks like the bottleneck is in the
> ACL checking.
>
> if I remove all the ACLs (except the one I actually use for testing), I am
> able to get 16,750 requests/sec with 3.2.0.5 on 8 workers, with them all only
> using ~30% cpu (I think this is the limit of the apache server I am hitting
> behind squid)
>
> I have the following ACLs defined
>
> port 13
> src 89
> dst 173
> url_regex 338
>
> used in 292 http_access rules
>
> so what has changed since 3.0 in terms of the ACL handling to slow it down so
> much? and why do multiple processes kill scale so badly when they should all
> be busy checking ACLs? (does each process lock the table of ACLs or somehow
> block other threads from doing checks?) This would seem like the problem
> space that would be ideal for multiple processes: each has its own copy of
> the ACL rules, gets a connection, and then does its own checking with no need
> to communicate with the other processes at all.
>
> now the performance numbers
>
> with the minimal ACLs
>
> 3.2.0.5 with 1 worker gets 3300 requests/sec
> 3.2.0.5 with 2 workers gets 8400 requests/sec
> 3.2.0.5 with 3 workers gets 10,800 requests/sec
> 3.2.0.5 with 4 workers gets 13,600 requests/sec
> 3.2.0.5 with 5 workers gets 15,700 requests/sec
> 3.2.0.5 with 6 workers gets 16,400 requests/sec
> 3.2.0.6 with 1 worker gets 4400 requests/sec
> 3.2.0.6 with 2 workers gets 8400 requests/sec
> 3.2.0.6 with 3 workers gets 11,300 requests/sec
> 3.2.0.6 with 4 workers gets 15,600 requests/sec
> 3.2.0.6 with 5 workers gets 15,800 requests/sec
> 3.2.0.6 with 6 workers gets 16,400 requests/sec
>
> David Lang
>
>
>
> On Fri, 8 Apr 2011, Amos Jeffries wrote:
>
>> Date: Fri, 08 Apr 2011 15:37:24 +1200
>> From: Amos Jeffries <squid3_at_treenet.co.nz>
>> To: squid-users_at_squid-cache.org
>> Subject: Re: [squid-users] squid 3.2.0.5 smp scaling issues
>>
>> On 08/04/11 14:32, david_at_lang.hm wrote:
>>> sorry for the delay. I got a chance to do some more testing (slightly
>>> different environment on the apache server, so these numbers are a
>>> little lower for the same versions than the last ones I posted)
>>>
>>> results when requesting short html page
>>>
>>>
>>> squid 3.0.STABLE12 4000 requests/sec
>>> squid 3.1.11 1500 requests/sec
>>> squid 3.1.12 1530 requests/sec
>>> squid 3.2.0.5 1 worker 1300 requests/sec
>>> squid 3.2.0.5 2 workers 2050 requests/sec
>>> squid 3.2.0.5 3 workers 2700 requests/sec
>>> squid 3.2.0.5 4 workers 2950 requests/sec
>>> squid 3.2.0.5 5 workers 2900 requests/sec
>>> squid 3.2.0.5 6 workers 2530 requests/sec
>>> squid 3.2.0.6 1 worker 1400 requests/sec
>>> squid 3.2.0.6 2 workers 2050 requests/sec
>>> squid 3.2.0.6 3 workers 2730 requests/sec
>>> squid 3.2.0.6 4 workers 2950 requests/sec
>>> squid 3.2.0.6 5 workers 2830 requests/sec
>>> squid 3.2.0.6 6 workers 2530 requests/sec
>>> squid 3.2.0.6 7 workers 2160 requests/sec instead of all processes being
>>> at 100% several were at 99%
>>> squid 3.2.0.6 8 workers 1950 requests/sec instead of all processes being
>>> at 100% some were as low as 92%
>>>
>>> so the new versions are really about the same
>>>
>>> moving to large requests cut these numbers by about 1/3, but the squid
>>> processes were not maxing out the CPU
>>>
>>> one issue I saw, I had to reduce the number of concurrent connections or
>>> I would have requests time out (3.2 vs earlier versions), on 3.2 I had
>>> to have -c on ab at ~100-150 where I could go significantly higher on
>>> 3.1 and 3.0
>>>
>>> David Lang
>>>
>>
>> Thank you.
>> So with small files 2% on 3.1 and ~7% on 3.2 with a single worker. But
>> under 1% on multiple 3.2 workers.
>> And overloading/flooding the I/O bandwidth on large files.
>>
>> NP: when overloading I/O one cannot compare to runs with different sizes.
>> Only with runs of the same traffic. Also only the CPU max load is reliable
>> there, since requests/sec bottlenecks behind the I/O.
>> So... your measure that CPU dropped is a good sign for large files.
>>
>> Amos
>>
>
Received on Sat Apr 09 2011 - 02:27:43 MDT