On 15 Sep 2001, at 1:41, Henrik Nordstrom wrote:

> Andres Kroonmaa wrote:
>
> > - signals are expensive when most fds are active.
>
> well.. not entirely sure on this one. Linux RT signals are a very
> lightweight notification model, only the implementation sucks due to
> worthless notification storms.. (if you have already received a
> notification that there is data available for reading, there is
> absolutely no value in receiving yet another notification when there is
> more data before you have acted on the first.. and similar for writing)

I don't know the details. I imagine that queueing the signal to the proc
is very lightweight. It's the setting up and the dequeueing that burn the
benefits imho, mostly because they are done one FD at a time. I can't
imagine any benefit compared to devpoll, unless you need to handle IO
from a signal handler.

> > - syscalls are expensive if doing little work, mostly for similar
> >   reasons as threads.
>
> depends on platform and syscall, but generally true. The actual syscall
> overhead is however often overestimated in discussions like this. A
> typical syscall consists of
> * light context switch
> * argument verification
> * data copying
> * processing
>
> What you can optimize by aggregation is the light context switches. The
> rest will still be there.

Not sure what you mean by light context switches. Perhaps you make a
distinction between the CPU protection-mode change, the kernel doing its
queueing, and the kernel going through the scheduling stuff. In my view
the CPU protmode change alone burns a lot of CPU, and a typical syscall
in Squid is read() or write(). Both are cancellation points, meaning
some scheduling checks are done. The same goes for sigwait(), poll(),
etc. Perhaps only calls like fcntl() are lightweight in this sense.
Basically, any syscall that can possibly block is a heavy syscall.

But again, the CPU-specific overhead shouldn't be underestimated. A CPU
mode change causes cache flushes on almost all CPUs (prefetching,
pipelines, VM maps). This means you have a high miss rate right
afterwards. With CPU clock rates very high compared to RAM clock rates,
this translates into a lot of lost CPU cycles. You could have a syscall
that does a single memory write and returns, yet it ends up burning as
much CPU as a few hundred lines of code. The only difference is that the
CPU is stalled instead of running code.

All of this is felt only when the syscall rate is very high, when the
code leaves the process very often while doing only very little work at
a time in either kernel or userspace. We should stay longer in
userspace, preparing several sockets for IO, and then stay longer in the
kernel, handling the IO. Imagine we had to loop through all FDs in
poll_array and poll() each FD individually. This is where we are today
with IO queueing.

> > Eventually we'll strike a syscall rate that wastes most CPU cycles in
> > context-switches and cache-misses.
>
> The I/O syscall overhead should stay fairly linear with the I/O request
> rate I think. I don't see how context switches and cache misses can
> increase a lot only because the rate increases. It is still the same
> amount of code running in the same amount of execution units.

Under light loads the syscall overhead is small because the time between
successive syscalls is relatively large. The problem is that when the
request rate increases, the expected overhead per request stays the
same, but the timeframe between the switches shrinks, so the proportion
of CPU time burned in syscall overhead goes up. If the overhead is
around 1% at a load of 100 req/sec, it should be 10% at 1K req/sec. But
it is worse, possibly much worse, and very difficult to measure.
Eventually we lose the CPU resource to nothing. To make things worse, we
optimise our code, the goal of which is to further reduce the time
between successive syscalls. Saying "most" is exaggerated, though. ;)
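To make the batching point concrete, here is a rough sketch of the two
extremes (handle_ready() is a made-up placeholder, not Squid code): one
kernel crossing per FD versus one crossing for the whole set.

  #include <poll.h>

  extern void handle_ready(struct pollfd *pfd);  /* hypothetical handler */

  /* Worst case: one syscall per FD, so overhead grows with nfds. */
  void check_one_by_one(struct pollfd *fds, int nfds)
  {
      int i;
      for (i = 0; i < nfds; i++)
          if (poll(&fds[i], 1, 0) > 0)
              handle_ready(&fds[i]);
  }

  /* Batched: a single syscall for the whole set, then stay in
   * userspace while walking the results. */
  void check_batched(struct pollfd *fds, int nfds)
  {
      int i;
      if (poll(fds, nfds, 0) > 0)
          for (i = 0; i < nfds; i++)
              if (fds[i].revents)
                  handle_ready(&fds[i]);
  }

The per-request work is identical in both; only the number of
user/kernel crossings differs, and that is exactly the part that hurts
as the request rate climbs.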
> > Ideally, kernel should be given a list of sockets to work on, not
> > just in terms of readiness detection, but the actual IO itself. Just
> > like in devpoll, where the kernel updates the ready-fd list as events
> > occur, it should be made to actually do the IO as events occur. Squid
> > should provide a list of FDs, commands, timeout values and
> > bufferspace per FD and enqueue the list. This is like kernel-aio, but
> > not quite.
>
> Sounds very much like LIO. Main theoretical problem is what kind of
> notification mechanism to use for good latency.

Yes, LIO. The problem is that most current LIO implementations are done
in a library by issuing aio calls per FD, and as aio is typically
implemented with a thread per FD, this is unacceptable. The important
part is that a kernel-level syscall exists that takes a list of commands
and equally returns a list of results, not necessarily the same set as
the requests were.
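For reference, this is roughly what the library-level LIO interface
looks like today (POSIX lio_listio() from <aio.h>). The batching
semantics at submit time are right, but in the usual implementations
each aiocb still ends up on its own aio thread, and completions are
harvested one control block at a time. The submit_reads() wrapper is
made up for illustration.

  #include <aio.h>
  #include <stddef.h>
  #include <string.h>

  /* Enqueue one read per descriptor with a single call. The aiocb
   * array must stay valid until the requests have completed. */
  int submit_reads(struct aiocb *cbs, struct aiocb **list,
                   const int *fds, char **bufs, size_t bufsz, int n)
  {
      int i;
      for (i = 0; i < n; i++) {
          memset(&cbs[i], 0, sizeof(cbs[i]));
          cbs[i].aio_fildes = fds[i];
          cbs[i].aio_buf = bufs[i];
          cbs[i].aio_nbytes = bufsz;
          cbs[i].aio_lio_opcode = LIO_READ;
          list[i] = &cbs[i];
      }
      /* LIO_NOWAIT: one call submits the whole list, but results are
       * still picked up per-aiocb via aio_error()/aio_return(). */
      return lio_listio(LIO_NOWAIT, list, n, NULL);
  }

What is missing is the other half: a single call that sleeps and comes
back with the list of control blocks that completed, errored or timed
out.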
> > From the other end, sleep in a wakeup function which returns a list
> > of completed events, thus dequeueing events that either complete or
> > error. Then handle all the data in a manner coded into Squid, and
> > enqueue another list of work. Again, the point is in returning a list
> > of completed events, not 1 event at a time. Much like poll returns
> > possibly several ready FDs.
>
> I am not sure I get this part.. are you talking about I/O or only
> notifications?

Both, combined. Most of the time you poll just as a means to know when
IO won't block. If you can enqueue the IO to the kernel and read the
results when it either completes or times out, you don't really need
poll. You need to wake up on a list of IOCBs that have changed status:
done, error, or timeout. But an IOCB could also carry a command for
doing just a poll() on an FD.

> > I believe this is useful, because only kernel can really do work in
> > async manner - at packet arrival time. It could even skip buffering
> > packets in kernel space, but append data directly to userspace
> > buffers, or put data on the wire from userspace. Same for disk io.
>
> Apart from the direct userspace copy, this is already what modern
> kernels do on networking..

Oh, no. I'm talking about a different model of communicating with the
kernel. Sure the kernel works in an async manner, there is no other way
in fact. There just has to be more work delegated to the kernel, to be
done at packet arrival time, to reduce the useless jerking between
process and kernel. Just as with devpoll.

> > In regards to the eventio branch, the new network API, it seems it
> > allows implementing almost any io model behind the scenes. What seems
> > to stick is the FD-centric and one-action-at-a-time approach. Also it
> > seems that it could be made more general and expandable, possibly
> > covering also disk io. Also, some calls assume that they are
> > fulfilled immediately, no async nature, no callbacks (close, for
> > example). This makes it awkward to issue close while a read/write is
> > pending.
>
> Regarding the eventio close call: This does not close, it only signals
> EOF. You can enqueue N writes, then close to signal that you are done.
> And there is a callback, registered when the filehandle is created.
> Serialization is guaranteed.

OK, I should have realised that.. Btw, why is the close callback
registration separated from the call itself? To follow the existing code
style more closely?

> > One thing that bothers me a bit is that you can't proceed before the
> > FD is known. For disk io, for example, it would help if you could
> > schedule open/read/close in one shot. For that some kind of abstract
> > session ID could be used, I guess. Then such triplets could be
> > scheduled to the same worker thread, avoiding several context
> > switches.
>
> The eventio does not actually care about the Unix FD. The exact same
> API can be used just fine with asynchronous file opens or even
> aggregated lowlevel functions if you like (well.. aggregation of close
> may be a bit hard unless there is a pending I/O queue)

Hmm, I assumed that you cannot call read/write unless the filehandle has
been provided by the initial callback. Do you mean we can?

> > Also, how about some more general ioControlBlock struct, that defines
> > all the callback, cbdata, iotype, size, offset, etc... And is
> > possibly expandable in future.
>
> ???

Something like:

  struct IOCB {
      int errno;
      int command;          /* read, write, poll, accept, close... */
      COMMIOCB *callback;
      void *cbdata;
      IOBUF *buf;
      size_t max_size;
      off_t offset;
      COMMCLOSECB *handler;
      ... etc ...
  };

Dunno, maybe this packing is a job for the actual io model behind the
API. It just seems to me that it would be nice if we could pass an array
of such IOCBs to the API.
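Purely as a strawman (none of these names exist in Squid or eventio, and
the callback signature is just an assumption), passing IOCB lists both
ways could look something like this from the caller's side:

  #include <stddef.h>
  #include <sys/types.h>

  /* Trimmed-down, hypothetical variant of the IOCB above. */
  typedef struct iocb iocb_t;
  typedef void IOCB_HANDLER(iocb_t *cb, int error, void *cbdata);

  struct iocb {
      int command;              /* read, write, poll, accept, close... */
      int error;                /* filled in on completion */
      IOCB_HANDLER *callback;
      void *cbdata;
      void *buf;
      size_t max_size;
      off_t offset;
  };

  int comm_submit(iocb_t *list, int nitems);            /* enqueue a batch */
  int comm_getevents(iocb_t **done, int max, int msec); /* harvest a batch */

  static void
  service_once(void)
  {
      iocb_t *done[64];
      int i, n;

      /* One wakeup returns every IOCB that changed status (done,
       * error, timeout) since the last call, not one event at a time. */
      n = comm_getevents(done, 64, 1000);
      for (i = 0; i < n; i++)
          done[i]->callback(done[i], done[i]->error, done[i]->cbdata);

      /* ... build the next batch in userspace, then one comm_submit() ... */
  }

The broker behind comm_submit()/comm_getevents() is free to be
/dev/poll, LIO, kernel aio or a worker thread pool; the caller only ever
sees lists going in and lists coming out.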
> > Hmm, probably it would even be possible to generalise the api to such
> > an extent that you could schedule acl-checks, dns and redirect
> > lookups all via the same api. Might become useful if we want the main
> > squid thread to do nothing else but be a broker between
> > worker-threads. Not sure if that makes sense though, just a wild
> > thought.
>
> Define threads in this context.

Of course I'm into scaling here. pthreads. All DNS lookups could be done
by one worker thread, which might set up and handle multiple dnsservers,
whatever. ACL checks that eat CPU could also be done in separate
threads. All of this can only make sense if the message passing is very
lightweight and does not require a thread/context switch per action.
It's about pipelining. The same or a separate worker thread could handle
the redirectors and the message passing to/from them.

Suppose that we read from /dev/poll 10 results that notify us about 10
accepted sockets. We can now check all the ACLs in the main thread, or
we can queue all 10 to a separate thread, freeing the main thread. If we
also had 20 sockets ready for write, we could either handle them in the
main thread, or enqueue-append (gather) them to a special worker thread.
The difference is that in one case all the process and kernel overhead
consumes a single CPU, while in the other the load is shared, or the
work is postponed to a time when the main thread is idle in its wakeup.

> > Also, I think we should think about trying to be more threadsafe.
> > Having one compact IOCB helps here. Maybe even allowing to pass a
> > list of IOCB's.
>
> See previous discussions about threading. My view is that threading is
> good to a certain extent but locality should be kept strong. The same
> filehandle should not be touched by more than one thread (with the
> exception of accept()).

I agree. A thread is bad if it does little work, but good if it has a
decent amount of work. It makes sense only with more than one CPU.

> The main goal of threading is scalability on SMP.

In this context it's not about SMP so much. It's more about offloading
work from an overloaded CPU in a decently efficient manner. Better SMP
scaling is a welcome byproduct, but not the main goal as such. Pursuing
perfect SMP scaling would require redesigning too much.

> The same goal can be achieved by a multi-process design, which also
> scales on asymmetric architectures, but for this we need some form of
> low overhead shared object broker (mainly for the disk cache).

Yes. It just seems that it would be easier to start using threads for
limited tasks, at least in the meantime.

------------------------------------
Andres Kroonmaa
CTO, Microlink Online
Tel: 6501 731, Fax: 6501 725
Pärnu mnt. 158, Tallinn, 11317 Estonia