Re: Squid 2.x performance curiosities

From: Arjan de Vet <Arjan.deVet@dont-contact.us>
Date: Thu, 29 Apr 1999 01:17:28 +0200 (CEST)

In article <37239888.D7459FA@zeor.simegen.com> you write:

>Now, here we come to the curious bit. Correct me if I'm wrong, but the
>number of objects actually in the cache should not significantly impact
>the lookup time for a given object, because it's a hashed model, yes?

Tests I did a few weeks ago show exactly the same kind of behavior. Let
me give some results.

Test setup: Squid 1.1.20 on FreeBSD with a 25GB RAID array (with 64MB
cache on the RAID controller) using a simulated workload based on a
Squid logfile. During office hours the workload was generated by 10
clients; after office hours (17:00h) I increased the number of clients
to 400. The clients chose URLs randomly from the 1 million URLs in that
logfile (I found out about Polygraph only very recently). The test
started with an empty cache.

This table shows the time of day, the hit rate and the amount of data
in the cache.

time    hit rate   cache size
10:08      1%          56 MB
10:31      1%         103 MB
11:44      2%         240 MB
12:54      3%         362 MB
15:31      4%         563 MB
16:25      8%         649 MB
16:54     20%        1089 MB
(switch to 400 parallel connections)
17:05     25%        1266 MB
17:40     40%        2030 MB
18:04     38%        2126 MB
18:50     37%        2421 MB
19:15     41%        2700 MB

Looks pretty normal, doesn't it? Now let's study the behavior of some
disk-I/O related system calls. The following statistics were obtained
by ktrace'ing the Squid process for 10 seconds and analyzing the output
with a Perl script which separates disk from network read/write calls
and distinguishes whether an open call is for reading or writing a
file.
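
For reference, here is a rough Python sketch of the kind of bookkeeping
that script does (the real one was Perl). The kdump -R line format it
assumes, the way it decides whether an open is for writing, and the way
it tells disk reads/writes from network ones are all simplifications,
so treat it as an illustration rather than a drop-in tool.

#!/usr/bin/env python3
# Rough sketch of the post-processing described above.  It assumes
# 'kdump -R' output whose lines look roughly like
#
#     345 squid    0.000123 CALL  read(0x5,0x80a1000,0x1000)
#     345 squid    0.000456 RET   read 4096/0x1000
#
# and that the relative timestamp on a RET line approximates the time
# spent in that call.  Field layout differs between kdump versions, and
# fd reuse after close() is ignored for brevity.
import re
import sys
from collections import defaultdict

totals = defaultdict(lambda: [0.0, 0])   # class -> [seconds, number of calls]
disk_fds = set()                         # fds returned by open() (disk files)
pending = None                           # (raw name, classified name) of last CALL

def classify(name, args):
    """Map a raw CALL onto the classes used in the .stat files above."""
    try:
        if name == 'open':
            flags = int(args.split(',')[1].split('<')[0], 0)
            return 'open-write' if flags & 0x3 else 'open-read'   # O_WRONLY|O_RDWR
        if name in ('read', 'write'):
            fd = int(args.split(',')[0], 0)
            return name + '-disk' if fd in disk_fds else name
    except (IndexError, ValueError):
        pass
    return name

line_re = re.compile(r'^\s*\d+\s+\S+\s+([\d.]+)\s+(CALL|RET)\s+(\w+)\s*(.*)')

for line in sys.stdin:
    m = line_re.match(line)
    if not m:
        continue
    delta, kind, name, rest = float(m.group(1)), m.group(2), m.group(3), m.group(4)
    if kind == 'CALL':
        # the delta on a CALL line is (roughly) time since the previous RET,
        # i.e. time spent between system calls
        totals['MISC'][0] += delta
        totals['MISC'][1] += 1
        pending = (name, classify(name, rest.strip('() ')))
    elif kind == 'RET' and pending and pending[0] == name:
        totals[pending[1]][0] += delta
        totals[pending[1]][1] += 1
        ret = rest.split()
        if name == 'open' and ret and ret[0].isdigit():
            disk_fds.add(int(ret[0]))    # remember which fds refer to disk files
        pending = None

grand = sum(t for t, _ in totals.values()) or 1.0
for cls, (t, n) in sorted(totals.items()):
    print('%20s %f %5.2f%% %6d %f' % (cls, t, 100.0 * t / grand, n, t / max(n, 1)))
print('%20s %f' % ('Total time:', grand))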

> fgrep 'open-write' 0413*.stat
0413-0936.stat: open-write 0.057516 0.58% 55 0.001046
0413-1008.stat: open-write 0.029744 0.30% 23 0.001293
0413-1031.stat: open-write 0.001962 0.02% 5 0.000392
0413-1144.stat: open-write 0.016467 0.17% 33 0.000499
0413-1254.stat: open-write 0.007711 0.08% 10 0.000771
0413-1531.stat: open-write 0.038268 0.40% 30 0.001276
0413-1625.stat: open-write 0.044891 0.46% 50 0.000898
0413-1654.stat: open-write 0.104158 1.04% 161 0.000647
0413-1705.stat: open-write 0.277842 2.79% 311 0.000893
0413-1740.stat: open-write 0.131819 1.32% 158 0.000834
0413-1804.stat: open-write 0.545237 5.45% 255 0.002138
0413-1850.stat: open-write 0.687754 6.89% 273 0.002519
0413-1915.stat: open-write 0.245093 2.45% 278 0.000882

After the system call name you see: the total time (in seconds) spent
in that system call, the same figure expressed as a percentage, the
number of calls, and the average time (in seconds) per call.

Because open() calls for writing a file can be buffered, we stay around
1-2 ms/call on average. The same holds for write() on disk files:

> fgrep 'write-disk' 0413*.stat
0413-0936.stat: write-disk 0.066449 0.67% 292 0.000228
0413-1008.stat: write-disk 0.034090 0.35% 112 0.000304
0413-1031.stat: write-disk 0.006046 0.06% 23 0.000263
0413-1144.stat: write-disk 0.079426 0.80% 162 0.000490
0413-1254.stat: write-disk 0.011783 0.13% 19 0.000620
0413-1531.stat: write-disk 0.049329 0.51% 159 0.000310
0413-1625.stat: write-disk 0.071865 0.74% 206 0.000349
0413-1654.stat: write-disk 0.352437 3.52% 717 0.000492
0413-1705.stat: write-disk 0.676163 6.78% 983 0.000688
0413-1740.stat: write-disk 0.242255 2.42% 508 0.000477
0413-1804.stat: write-disk 0.361256 3.61% 551 0.000656
0413-1850.stat: write-disk 0.412117 4.13% 623 0.000662
0413-1915.stat: write-disk 0.235154 2.35% 460 0.000511

Let's have a look at the reads now.

> fgrep 'open-read' 0413*.stat
0413-0936.stat: open-read 0.000055 0.00% 1 0.000055
0413-1008.stat: open-read 0.000063 0.00% 1 0.000063
0413-1254.stat: open-read 0.000166 0.00% 2 0.000083
0413-1531.stat: open-read 0.001966 0.02% 6 0.000328
0413-1625.stat: open-read 0.001034 0.01% 3 0.000345
0413-1654.stat: open-read 0.067214 0.67% 70 0.000960
0413-1705.stat: open-read 0.163981 1.64% 154 0.001065
0413-1740.stat: open-read 0.347313 3.47% 280 0.001240
0413-1804.stat: open-read 0.446925 4.47% 152 0.002940
0413-1850.stat: open-read 0.181502 1.82% 154 0.001179
0413-1915.stat: open-read 0.816136 8.16% 115 0.007097

Here you see the times increase as the cache fills. The initial
sub-millisecond times are due to the file system cache, which can keep
all metadata (like directories) in memory (FreeBSD has a merged
VM/buffer cache and uses all free memory for it; there's no fixed
limit). After some time not all metadata can be kept in the file system
cache anymore, leading to real disk I/O for these open-read calls, and
then we see 6-7 milliseconds, although I have seen 12 milliseconds on a
very busy proxy with a larger Squid cache (8-12GB).

> fgrep 'read-disk' 0413*.stat
0413-0936.stat: read-disk 0.004221 0.04% 50 0.000084
0413-1008.stat: read-disk 0.001324 0.01% 16 0.000083
0413-1031.stat: read-disk 0.000374 0.00% 6 0.000062
0413-1144.stat: read-disk 0.002151 0.02% 27 0.000080
0413-1254.stat: read-disk 0.000185 0.00% 2 0.000092
0413-1531.stat: read-disk 0.003133 0.03% 32 0.000098
0413-1625.stat: read-disk 0.007439 0.08% 45 0.000165
0413-1654.stat: read-disk 0.536705 5.36% 286 0.001877
0413-1705.stat: read-disk 1.541172 15.46% 566 0.002723
0413-1740.stat: read-disk 2.966033 29.67% 538 0.005513
0413-1804.stat: read-disk 2.493700 24.92% 378 0.006597
0413-1850.stat: read-disk 1.859529 18.64% 369 0.005039
0413-1915.stat: read-disk 1.559156 15.59% 295 0.005285

The initial times below 100 microseconds show the file system cache at
work. After that, the amount of data in Squid's cache has become so big
that most read() calls on disk files lead to real disk I/O, and that's
when we see 5-6 milliseconds per call. As you can see, at some point
Squid is spending 30% of its time in read() calls on disk files; I have
seen 45% on a very busy cache. During that time Squid simply stalls,
waiting for disk heads to move, unless you can use async-io :-(.

This is a complete .stat file, by the way:

> cat 0413-1740.stat
                MISC 2.990658 29.92% 28020 0.000107
              accept 0.035056 0.35% 846 0.000041
               close 0.091841 0.92% 1112 0.000083
             connect 0.041467 0.41% 405 0.000102
               fcntl 0.126011 1.26% 3603 0.000035
               fstat 0.000028 0.00% 1 0.000028
             fstatfs 0.000038 0.00% 1 0.000038
       getdirentries 0.000293 0.00% 3 0.000098
           getrusage 0.000069 0.00% 1 0.000069
         getsockname 0.015422 0.15% 478 0.000032
        gettimeofday 0.091203 0.91% 2311 0.000039
               lseek 0.038744 0.39% 1140 0.000034
             madvise 0.411980 4.12% 6990 0.000059
           open-read 0.347313 3.47% 280 0.001240
          open-write 0.131819 1.32% 158 0.000834
                read 1.053009 10.53% 1332 0.000791
           read-disk 2.966033 29.67% 538 0.005513
              select 0.290544 2.91% 4961 0.000059
          setsockopt 0.007279 0.07% 204 0.000036
              socket 0.013868 0.14% 204 0.000068
                stat 0.000115 0.00% 1 0.000115
               write 1.101469 11.02% 2942 0.000374
          write-disk 0.242255 2.42% 508 0.000477
         Total time: 9.996514

(MISC is the time between system calls; read and write are reads and
writes on sockets.)

>Run test at 100/sec. Runs for some time. Symptoms repeat. I'm quietly
>watching 'vmstat 5' this whole time, and watching the disk buffers
>filling (490MB now..squid's cache_mem is set at 8MB just as a reference
>point..since I later want to examine speed differences for different
>sizes). Eyeballing the disks: Hardly working. updated is running at five

I've modified the FreeBSD systat command to show the percentage of time
that a disk had SCSI commands queued. During my test it remained far
below 100% because, in the non-async-io case, Squid performs all I/O
operations sequentially, with no parallelism at all. Multiple
independent disks won't help in such a situation either, I guess.

>Now:
>
>ASSUMPTION: Squid's lookup time for a given object does not grow
>significantly as the number of objects increases. (It'll grow, but not
>hugely)

Indeed.

>GUESS: The slowdown is the operating system looking through a
>half-gigabyte of cache-blocks to find objects, and actually managing to
>exceed the time it would take to fetch the object from the physical
>disk.

The average time per disk-related system call increases with the number
of objects on disk, because the chance that a file, or the metadata
needed to find that file, is in your file system cache decreases.
Unless your file system cache has the same size as your disks, of
course; in that case you in fact have a memory file system backed by
real disks :-).

Combined with doing all I/O in a strictly serial fashion (unless using
async-io), this increases the time needed to do one full select loop;
on a very busy cache I sometimes see only 2 completed select loops per
second. That slows down *all* connections, leading to longer transfer
times, leading to more parallel connections, leading to even slower
select loops, and so on. You see the problem, I think :-).

One of the other things I noticed during another test is that as soon
as unlinkd starts unlinking files (18-24 per second in my tests), the
amount of disk I/O increases significantly because of all the modified
metadata which needs to be written back to disk. The problem is, I
guess, that unlinkd gets completely random filenames on its input
channel (I haven't verified this). Expiring, and thus unlinking, files
per directory, or truncating them (Squid 2.2.STABLE2 does this), could
prevent a lot of this disk I/O. It's the same principle as in the
new_store patch I did a year ago, which does not store one new object
in a directory and then move on to the next directory, but fills a
directory first before going on to the next one, leading to less disk
I/O.
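
To make the difference concrete, here is a tiny Python toy (not Squid
code; the directory names and the 'capacity' knob are made up)
contrasting round-robin placement with the fill-one-directory-first
behaviour of the new_store patch:

# Toy illustration of the two directory-selection policies mentioned above.
# 'capacity' (objects per directory before moving on) is a made-up knob,
# not an actual Squid parameter.
from itertools import cycle

dirs = ['00/%02X' % i for i in range(4)]   # tiny stand-in for cache_dir L1/L2 dirs

def round_robin(n_objects):
    """Old behaviour: each new object goes to the next directory in turn,
    so consecutive creates touch many different directories (more metadata I/O)."""
    rr = cycle(dirs)
    return [next(rr) for _ in range(n_objects)]

def fill_first(n_objects, capacity=4):
    """new_store-style behaviour: keep writing into one directory until it is
    'full', then move on, so metadata updates stay clustered."""
    return [dirs[min(i // capacity, len(dirs) - 1)] for i in range(n_objects)]

print(round_robin(8))   # ['00/00', '00/01', '00/02', '00/03', '00/00', ...]
print(fill_first(8))    # ['00/00', '00/00', '00/00', '00/00', '00/01', ...]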

>I could be missing something here. Of course I could. I've hardly slept
>in the last couple weeks. Y'know the drill...it's crunch time here in
>the dev-group, and deliverables are due on site on the 8th. Whee.
>Sleep-time aside, I'm wondering how to ameliorate this consistant
>falloff in response. I saw it with squid 1.1.22 when I tested it
>(eventually falling to <30/sec, as the cache increased in size...but
>definitely not disk bound)

I get 60-65 requests/sec on a Squid cache and am quite sure it's the
disk I/O that limits it. So I'm curious what improvements SquidFS will
bring.

I've also been thinking recently about splitting that one big Squid
program into smaller cooperating processes, like many other Unix
servers do (INN, Diablo, Postfix, Apache, NNTPcache), with the disk I/O
bottleneck as the main reason to do so.

Some wild and completely unfinished ideas follow; they might even have
been discussed before (I don't remember):

- One small master process which just starts and watches the other
  processes;

- A store manager process; it reads the cache.swap file(s) on startup
  and writes them back to disk on shutdown; it's the only process which
  knows where all objects are stored on disk but should avoid doing any
  disk I/O itself except for startup/shutdown; talks to unlinkd for
  expiring old files;

  This kind of store manager process could be compared with 'actived'
  from INN, which the nnrpd processes use to get information (via UDP
  datagrams) from the active file instead of having that information in
  all processes, of which only a fraction will be used. NNTPcache uses
  a shared mmap()ed piece of memory to make information sharing among
  multiple forked child processes possible; it relies on a very good
  mmap implementation, which might not be the case on all Unices.

- client handling processes; a set of preforked daemons like Apache
  uses; they talk to servers and clients and read/write objects from/to
  disk; they communicate with the store manager via some form of
  interprocess communication (unix domain sockets etc.); some protocol
  will be necessary for communicating things like (see the rough sketch
  after this list):

  - given an MD5 key and possibly other data of an object, tell me
    where to find it on disk (a filename or some other identification)
    if it's in the cache, or tell me where to store it if it isn't.

  - some sort of 'release' command to tell the store manager I'm not
    serving a file from the cache to a client anymore or that I've
    finished writing a new object which should now be considered valid.

  - etc.

  Note: in this setup the store manager is likely to become a
  bottleneck. All its functions should only need memory access
  (hash-table lookups) if possible.

- Is a separate DNS manager still needed?
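
To make the store manager idea and its protocol a bit more concrete,
here is a very rough Python sketch. The socket path, the
LOOKUP/STORE/RELEASE verbs and the reply format are all invented for
this illustration; they are not an existing Squid protocol, and a real
store manager would of course have to persist and expire its map.

# Very rough sketch of the store manager idea: one process owns the
# key -> on-disk-filename map and answers lookups over a unix domain
# socket, so the client handling processes never touch the map
# themselves.  Everything here (socket path, verbs, reply format,
# swap file naming) is invented for illustration.
import hashlib
import os
import socket
import threading

SOCK_PATH = '/tmp/store-manager.sock'     # made-up path

class StoreManager:
    def __init__(self):
        self.map = {}                     # MD5 hex digest -> {'file', 'valid'}
        self.next_fileno = 0
        self.lock = threading.Lock()

    def handle(self, line):
        verb, _, key = line.strip().partition(' ')
        with self.lock:
            entry = self.map.get(key)
            if verb == 'LOOKUP':
                # only hand out objects that have been completely written
                return ('HIT ' + entry['file']) if entry and entry['valid'] else 'MISS'
            if verb == 'STORE':           # caller will write the object here
                fname = '%02X/%08X' % (self.next_fileno % 256, self.next_fileno)
                self.next_fileno += 1
                self.map[key] = {'file': fname, 'valid': False}
                return 'STORE ' + fname
            if verb == 'RELEASE' and entry:
                entry['valid'] = True     # finished writing: object is now valid
                return 'OK'
        return 'ERR'

def make_listener():
    if os.path.exists(SOCK_PATH):
        os.unlink(SOCK_PATH)
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(SOCK_PATH)
    srv.listen(16)
    return srv

def serve(mgr, srv):
    while True:
        conn, _ = srv.accept()
        with conn, conn.makefile('rw') as f:
            for request in f:             # one text command per line
                f.write(mgr.handle(request) + '\n')
                f.flush()

# What a client handling process would do for one request:
if __name__ == '__main__':
    srv = make_listener()
    threading.Thread(target=serve, args=(StoreManager(), srv), daemon=True).start()
    key = hashlib.md5(b'http://example.com/').hexdigest()
    c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    c.connect(SOCK_PATH)
    f = c.makefile('rw')
    for req in ('LOOKUP ' + key, 'STORE ' + key, 'RELEASE ' + key, 'LOOKUP ' + key):
        f.write(req + '\n'); f.flush()
        print(req, '->', f.readline().strip())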

Why should we want such a setup:

- no more big select() loop, non-blocking I/O and complicated
  commSetSelect calls with read and write handlers; all of that comes
  with a lot of overhead;

- only a limited number of file descriptors is needed per process (some
  OSes have or had problems with large numbers of descriptors);

- it avoids the use of mmap, kernel threads, async-io, etc., which are
  poorly implemented or not implemented at all on some systems; the
  preforked daemons Apache uses work very well on many platforms
  without all these fancy things;

- the client handling processes can do blocking operations without
  stopping other connections; the kernel will schedule one of the other
  processes as soon as a process blocks (see the toy pre-forked sketch
  after this list). I think the kernel can do this more efficiently
  than Squid can 'schedule' connections in its select() loop, where it
  needs to update state information per connection after each read or
  write.

- without needing kernel threads to support async-io, or native
  async-io, we can have parallel I/O requests because we have multiple
  processes doing disk I/O instead of one: this should scale quite
  nicely over multiple independent disks and/or multiple processors.
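
As a toy illustration of that pre-forked, blocking style (the port
number, worker count and canned reply are arbitrary; this is neither
Squid nor Apache code):

# Toy pre-forked server: a handful of worker processes all block in
# accept() on the same listening socket and then do plain blocking
# reads and writes; while one worker is stuck in disk or network I/O
# the kernel simply runs another one.
import os
import socket

NUM_WORKERS = 4

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(('127.0.0.1', 8888))
listener.listen(64)

def worker(listener):
    while True:
        conn, _ = listener.accept()        # blocking; the kernel hands the
        with conn:                         # connection to one of the workers
            conn.recv(4096)                # blocking read of the request
            # ... a real worker would now do blocking disk I/O for the object,
            # and possibly talk to the store manager process ...
            conn.sendall(b'HTTP/1.0 200 OK\r\n\r\nhello\r\n')

for _ in range(NUM_WORKERS):
    if os.fork() == 0:                     # child: become a worker, never return
        worker(listener)
        os._exit(0)

# the parent is the small 'master' process that just watches its children
for _ in range(NUM_WORKERS):
    os.wait()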

As I said before, these are just some wild and completely unfinished
ideas. I would not be surprised if somebody finds big problems with this
setup which would make it difficult to implement.

Arjan