Re: [squid-users] Squid with PHP & Apache

From: Ghassan Gharabli <sounarose_at_googlemail.com>
Date: Fri, 29 Nov 2013 01:44:11 +0200

On Wed, Nov 27, 2013 at 1:28 PM, Amos Jeffries <squid3_at_treenet.co.nz> wrote:
> On 27/11/2013 5:30 p.m., Ghassan Gharabli wrote:
>> On Tue, Nov 26, 2013 at 5:30 AM, Amos Jeffries wrote:
>>> On 26/11/2013 10:13 a.m., Ghassan Gharabli wrote:
>>>> Hi,
>>>>
>>>> I have built a PHP script to cache HTTP/1.1 206 Partial Content, like
>>>> "WindowsUpdates", and to allow seeking through Youtube and many other websites.
>>>>
>>>
>>> Ah. So you have written your own HTTP caching proxy in PHP. Well done.
>>> Did you read RFC 2616 several times? Your script is expected to obey
>>> all the MUST conditions and clauses in there discussing "proxy" or "cache".
>>>
>>
>> Yes, I have read it and I will read it again, but the reason I am
>> building such a script is that internet access here in Lebanon is really
>> expensive and scarce.
>>
>> As you know, Youtube sends dynamic chunks for each video. For
>> example, if you watch a video on Youtube more than 10 times, then
>> Squid fills up the cache with more than 90 chunks per video; that is
>> why allowing seeking to any position of the video using my script
>> would save me the headache.
>>
>
> Youtube is a special case. They do not strictly use Range requests for
> the video seeking. If you are getting that, lucky you.
> They are also multiplexing videos via multiple URLs.
>

Hi Amos,

The Youtube application mostly uses Range requests on iPhone and
Android, but browsers only send the range argument if they have
Flash Player 11 installed. Youtube sends the full length if Flash
Player 10 is installed. That was my investigation regarding Youtube.
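
For what it's worth, here is roughly how my script answers a
single-range request from a cached file. This is a simplified sketch:
multi-range requests like bytes=x-y,x-y would need a
multipart/byteranges response, which it does not handle.

  <?php
  // Sketch: answer "Range: bytes=x-y" against an already-cached file.
  function serve_range($path, $rangeHeader) {
      $size = filesize($path);
      if (!preg_match('/bytes=(\d*)-(\d*)/', $rangeHeader, $m)
          || ($m[1] === '' && $m[2] === '')) {
          header('HTTP/1.1 416 Requested Range Not Satisfiable');
          header("Content-Range: bytes */$size");
          return;
      }
      // "bytes=-500" means the last 500 bytes; "bytes=0-" means to the end.
      $start = ($m[1] === '') ? $size - (int)$m[2] : (int)$m[1];
      $end   = ($m[1] === '' || $m[2] === '') ? $size - 1 : (int)$m[2];
      header('HTTP/1.1 206 Partial Content');
      header('Accept-Ranges: bytes');
      header("Content-Range: bytes $start-$end/$size");
      header('Content-Length: ' . ($end - $start + 1));
      $fp = fopen($path, 'rb');
      fseek($fp, $start);
      echo fread($fp, $end - $start + 1);
      fclose($fp);
  }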

>
>>>
>>> NOTE: the easy way to do this is to upgrade your Squid to the current
>>> series and use ACLs on the range_offset_limit directive. That way Squid
>>> will convert Range requests to normal fetch requests and cache the
>>> object before sending the requested pieces of it back to the client.
>>> http://www.squid-cache.org/Doc/config/range_offset_limit/
>>>
>>>
>>
>> I have successfully supported HTTP/206 when the object is cached. My
>> goal is to enable Range headers, as I can see that iPhones and Google
>> Chrome check whether the server sends an Accept-Ranges: bytes header,
>> and then request bytes=x-y, or multiple ranges like bytes=x-y,x-y.
>>
>
> Yes, that is how Range requests and responses work.
>
> What I meant was that Squid already contains a feature to selectively
> cause the entire object to be cached so it can generate the 206
> response for clients.
>
>
>>>> I am willing to move from PHP to C++ hopefully after a while.
>>>>
>>>> The script is almost finished, but I have several questions. I have no
>>>> idea if I should always grab the HTTP response headers and send them
>>>> back to the browsers.
>>>
>>> The response headers you get when receiving the object are metadata
>>> describing that object AND the transaction used to fetch it AND the
>>> network conditions/pathway used to fetch it. The cache's job is to store
>>> those along with the object itself and deliver only the relevant headers
>>> when delivering a HIT.
>>>
>>>>
>>>> 1) Does Squid still grab the "HTTP Response Headers" even if the
>>>> object is already in cache, or if Squid already has a cached copy of
>>>> the HTTP response headers? If Squid caches HTTP response headers, then
>>>> how do you deal with HTTP code 302 if the object is already cached? I am
>>>> asking this question because I have already seen most websites use the
>>>> same extensions, such as .FLV, together with a Location header.
>>>
>>> Yes. All proxies on the path are expected to relay the end-to-end
>>> headers, drop the hop-by-hop headers, and MUST update/generate the
>>> feature negotiation and state information headers to match its
>>> capabilities in each direction.
>>>
>>>
>>
>> By "Yes", do you mean grabbing the HTTP response headers even if
>> the object is already in cache, so that network latency is always
>> added in both the MISS and HIT cases?
>
> No. I mean the headers received along with the object need to be stored
> with it and sent on HITs.
> I see many people thinking they can just store the object by itself, the
> same way a web server stores it. But that way loses the vital header
> information.
>

Do you mean that within one call you store the object and the headers
in the same file, and then extract the header information when it is
requested? I am only contacting the website once to retrieve the
headers and object; I store the object as-is and then store the
headers in the same directory, next to the object.
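
In code, what I do looks roughly like this (a sketch; the ".headers"
file-name suffix is just my own convention):

  <?php
  // Sketch: store the body and its response headers side by side,
  // then replay the stored headers when serving a HIT.
  // (Hop-by-hop headers should be filtered out before storing.)
  function store_entry($cacheFile, array $headers, $body) {
      file_put_contents($cacheFile, $body);
      file_put_contents($cacheFile . '.headers', implode("\r\n", $headers));
  }

  function serve_hit($cacheFile) {
      foreach (file($cacheFile . '.headers', FILE_IGNORE_NEW_LINES) as $h) {
          header($h);   // resend the stored end-to-end headers
      }
      readfile($cacheFile);
  }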

I am also categorizing the cache as if the URL were:
http://www.example.com/1/2/3/filename.ext

For example, I generate directories based on the URL (after CDN
detection), such as:

Cache-Folder --> www.example.com --> 1 --> 2 --> 3 --> filename.ext

Do you agree that this is a good idea? (A sketch of what I mean
follows below.)
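
  <?php
  // Sketch: map a URL onto a directory tree under the cache root, e.g.
  // http://www.example.com/1/2/3/filename.ext
  //   -> c:/cache/www.example.com/1/2/3/filename.ext
  // The root "c:/cache" is just an example. NOTE: this ignores the
  // query string, so URLs that differ only by query would collide;
  // real code needs to encode it into the file name.
  function cache_path($url) {
      $p   = parse_url($url);
      $dir = 'c:/cache/' . $p['host'] . dirname($p['path']);
      if (!is_dir($dir)) {
          mkdir($dir, 0755, true);   // create intermediate directories
      }
      return $dir . '/' . basename($p['path']);
  }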

Do you use the swap.state file as an index, or as the file that records
where objects/headers are located in the cache?

>> I have tested Squid and I
>> have noticed that reading HIT objects from Squid takes about 0.x ms,
>> which I believe means objects are served offline until expiry occurs. Right?
>>
>> Until now I have been using $http_response_header, as it is the fastest
>> method by far, but I still have an issue with latency: for each request
>> the function takes about 0.30 s, which is really high even though my
>> network latency is 100~150 ms. That is why I thought that I could
>> grab the HTTP response headers the first time and store them, so if
>> the URI is called a second time, I can send the cached headers
>> instead of grabbing them again
>
> This is the way you MUST do it. To retain Last-Modified, Age, Date, ETag
> and other critical headers.
> Network latency reduction is just a useful side effect.
>

I have noticed that Squid doesn't contact the website at all until the
object stored in the cache expires; then Squid tries to revalidate and
checks whether the Last-Modified header shows newer data, in which case
we save the newer object/response headers. Right?

How useful can the ETag header be in Squid? I have also seen an ETag
that looked like an encoded path to the file's location. Is this true?
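
For the revalidation itself, I imagine something like this (a sketch;
$url and the $stored array of cached response headers are assumptions):

  <?php
  // Sketch: revalidate a cached object using its stored validators.
  $conditional = [];
  if (isset($stored['Last-Modified'])) {
      $conditional[] = 'If-Modified-Since: ' . $stored['Last-Modified'];
  }
  if (isset($stored['ETag'])) {
      $conditional[] = 'If-None-Match: ' . $stored['ETag'];
  }
  $ctx = stream_context_create(['http' => [
      'header'        => implode("\r\n", $conditional),
      'ignore_errors' => true,   // keep the response even on non-2xx codes
  ]]);
  $body = file_get_contents($url, false, $ctx);
  // PHP fills $http_response_header after the call above.
  if (strpos($http_response_header[0], '304') !== false) {
      // Not Modified: the cached copy is still fresh, serve it.
  } else {
      // 200: replace the cached object and headers with the new response.
  }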

>> , to eliminate
>> the network latency. But I still have an issue... How am I going to
>> know if the website sends HTTP/302 (because some websites send
>> HTTP/302 for the same requested file name), if I am not grabbing the
>> headers again in a HIT situation, just to improve the latency? The
>> second issue is saving the headers of a CDN.
>
>
> In HTTP the 302 response is an "object" to be cached the same as a 200
> when it contains Cache-Control, Expires, and/or Last-Modified headers
> sufficient to determine freshness/staleness.
> NOTE: it has no meaning for the Range transaction except perhaps to
> be a response without Ranges.
>
>
> Of course you can choose not to cache it. But be aware that Squid will
> try to.
>
>
>>>>
>>>> 2) Do you also use mime.conf to send the Content-Type to the browser
>>>> in the case of FTP/HTTP, or only FTP?
>>>
>>> Only FTP and Gopher *if* Squid is translating from the native FTP/Gopher
>>> connection to HTTP. HTTP and protocols relayed using HTTP message format
>>> are expected to supply the correct header.
>>>
>>>>
>>>> 3) Does Squid compare the length of the local cached copy with the
>>>> remote file if you already have the object file, or does it use
>>>> refresh_pattern?
>>>
>>> Content-Length is a declaration of how many payload bytes follow
>>> the response headers. It has no relation to the server's object except in
>>> the special case where the entire object is being delivered as payload
>>> without any encoding.
>>>
>>>
>>
>> I am only caching objects that have a "Content-Length" header with a
>> size greater than 0. I have noticed that there are some files, like
>> XML, CSS, and JS, which I believe I should save, but do you think I
>> must follow the If-Modified-Since header to see if there is a fresh copy?
>
> If you already have an object cached for the URL being requested with any
> If-* header then you need to revalidate it following the RFC 2616
> instructions for revalidation calculation. Or just MISS - but that makes
> caching a bit useless because If-* headers happen a lot in HTTP/1.1.
>
> NOTE: the revalidation calculation is done against the headers you have
> cached with the object. The results will determine whether a HIT or MISS
> can happen on it.
>

Yes, following your advice I now do the revalidation calculation
against the cached headers.

I don't know if C programming has the same issue as PHP. If the
internet connection is very slow (e.g. 128 Kbps with some load) while
PHP is downloading a file, then the download stops and the file is
saved as a corrupted download. Does Squid save such a corrupted file
after reaching a timeout, or during network slowness, thereby storing
only part of the remote file in the cache?
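
To protect against that on my side, I am thinking of something like
this (a sketch; the ".tmp" naming is my own convention, and $url and
$cacheFile are assumed):

  <?php
  // Sketch: download first, verify the byte count against Content-Length,
  // and only then publish into the cache, so an aborted transfer never
  // becomes a "valid" entry.
  $body = @file_get_contents($url);
  if ($body !== false) {
      $expected = null;
      foreach ($http_response_header as $h) {
          if (stripos($h, 'Content-Length:') === 0) {
              $expected = (int)trim(substr($h, 15));
          }
      }
      if ($expected !== null && strlen($body) === $expected) {
          file_put_contents($cacheFile . '.tmp', $body);
          rename($cacheFile . '.tmp', $cacheFile);  // publish atomically
      }
      // Otherwise: discard; do not cache a truncated object.
  }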

Another question:

How does Squid behave while downloading a file if 2 or 3 or more
clients request the same file? Would Squid know that the file is
already being downloaded? Does Squid send the stored bytes first and
then pass the requests directly to the remote website? I already
tried to read a file while it was being written by an external
handler, but it failed because the reading function consumed the
stored bytes and thought it had reached end of file.

Do you recommend ending the session or unlinking the file if a timeout
occurs while storing the file into the cache, to avoid keeping
corrupted files?
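
One idea I have is a lock file marking an in-progress download (a
sketch; the ".lock" convention and the fetch_from_origin() helper are
made up):

  <?php
  // Sketch: mark an in-progress download so a second client never reads
  // a half-written object; stale locks from dead downloads age out.
  $lock = $cacheFile . '.lock';
  if (file_exists($lock)) {
      if (time() - filemtime($lock) > 300) {
          unlink($lock);                  // stale lock: a download died here
          @unlink($cacheFile . '.tmp');   // remove its partial file too
      } else {
          // Another process is still fetching: pass this request straight
          // through to the origin instead of reading a partial file.
          return fetch_from_origin($url); // hypothetical helper
      }
  }
  touch($lock);
  // ... download to $cacheFile.tmp and rename on success ...
  unlink($lock);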

>
>>>> I am really confused about this issue, because I am always getting a
>>>> headers list from the internet and sending it back to the browser
>>>> (using PHP and Apache), even if the object is in cache.
>>>
>>> I am really confused about what you are describing here. You should only
>>> get a headers list from the upstream server if you have contacted one.
>>>
>>>
>>> You say the script is sending to the browser. This is not true at the
>>> HTTP transaction level. The script sends to Apache, Apache sends to
>>> whichever software requested from it.
>>>
>>> What is the order you chained the Browser, Apache and Squid ?
>>>
>>> Browser -> Squid -> Apache -> Script -> Origin server
>>> or,
>>> Browser -> Apache -> Script -> Squid -> Origin server
>>>
>>>
>>> Amos
>>
>> Squid configured as:
>> Browser -> Squid -> Apache -> Script -> Origin server
>>
>> url_rewrite_program c:/PHP/php.exe c:/squid/etc/redir.php
>> acl dont_pass url_regex ^http:\/\/192\.168\.10\.[0-9]\:312(6|7|8)\/.*?
>> acl denymethod method POST
>> acl denymethod method PUT
>> url_rewrite_access deny dont_pass
>> url_rewrite_access deny denymethod
>> url_rewrite_access allow all
>> url_rewrite_children 10
>> #url_rewrite_concurrency 99
>>
>> I hope I can enable url_rewrite_concurrency, but if I enable
>> concurrency then I must always echo back the ID, even if I am hitting
>> the cache; or maybe I don't understand the behavior described in the
>> url_rewrite manual regarding fgets(STDIN).
>
> Your helper MUST always return exactly one line of output for every line
> of input regardless of concurrency.
>
> Making that line of output contain the concurrency ID number instead of
> being empty is trivial and it allows you to return results out of order
> if you want or need to. For example in helpers using threads that take
> different lengths of time to complete.
>

I am going to try again later and see if url_rewrite_concurrency will work.
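
When I do, I expect the helper loop to look roughly like this (a sketch
of the concurrency protocol as I understand it from the documentation;
rewrite() is a hypothetical stand-in for my logic):

  <?php
  // Sketch: a concurrency-aware url_rewrite helper. With concurrency
  // enabled, Squid prepends a channel ID to each input line, and that
  // ID must be echoed back on the matching output line.
  while (($line = fgets(STDIN)) !== false) {
      $fields = explode(' ', trim($line));
      $id  = array_shift($fields);    // channel ID prepended by Squid
      $url = $fields[0];              // the URL; extra fields follow it
      $newUrl = rewrite($url);        // hypothetical rewrite logic
      // Echo the ID back followed by the result; an empty result after
      // the ID means "no rewrite".
      fwrite(STDOUT, $id . ' ' . $newUrl . "\n");
      fflush(STDOUT);                 // unbuffered output, or Squid stalls
  }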

Sorry for the extended questions. I really appreciate all the
information you have put together for me to keep in mind.

Thank you very much for your time.

>
> Amos