[squid-users] Youtube\dynamic content caching with squid 3.2 |DONE|

From: Eliezer Croitoru <eliezer_at_ngtech.co.il>
Date: Fri, 08 Jun 2012 15:05:00 +0300

"MSI"

:some history, you can skip if you want.

That is how it's done!

For a long time I was looking for a good way to cache dynamic objects
using Squid 3.2, but it always came down to using some other outside
software such as Apache, Nginx, or other options.

So I was looking for a better way than store_url_rewrite, because that
was kind of a "hack" around the whole problem of dynamic content.
I have found that Squid is very good at what it does... forward proxying!
One problem I encountered with programs outside of Squid is that you
always need to manage your cache size and objects yourself.
On top of that, I have seen how such servers redirect with a 302 code
and then serve the object from the same address.
I don't know what happens internally in these cache servers, but it works.

So instead of using some "web server" to act as a proxy via a PHP or
other trick, let the proxy hierarchy do the job for us!

The idea is to use "MSI", which means: MySQL + Squid (x2) + ICAP!

The problem with dynamic content is that a lot of it is dynamic, and
that includes the "headers".
My solution is more than just for YouTube... it is a solution for the
dynamic content caching problem in general!
(It can also be used for 206 Partial Content manipulation.)

Some history on how it was done before, and how it can be done now with
better options!

Squid caches HTTP objects based on a couple of things:
the main part is the object URL, which identifies the object in the cache;
the second level is the object's cache headers and structure;
third and last are forced refresh_pattern rules per HTTP object/URL.
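
As an illustration of that last level (standard squid.conf syntax; the
pattern and times here are just an example, not part of my setup):

# consider matching objects fresh for 1 to 30 days, ignoring some client cache headers
refresh_pattern -i \.zip$ 1440 90% 43200 ignore-reload ignore-no-store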

store_url_rewrite attacked the problem at the URL level:
it takes one object and refers to it as another.
The problem that came with that is that the refresh patterns were
referring to the original URLs and not to the real objects that are in
the cache.
Also, the logs are written per dynamic URL, and therefore you can't
really tell from the logs how much this method helps caching.
You can't purge cached objects, and you also can't verify whether an
object is cached on the server or clear an object from the cache using
the HTCP protocol.

Some people were using Apache/Nginx with a PHP script that fetches the
dynamic content and caches it on the web server's storage.
This makes the very fast proxy software crawl, and it does the work at
the interpretation level of PHP instead of in a very fast, compiled,
robust proxy server.

So these solutions are nice, but they must be maintained and monitored
manually for space, performance, and availability.

Another problem with these caches is that the servers are not really
caching the whole object but only reserving it.

After using a cache proxy hierarchy for quite some time with Squid 2.7
and store_url_rewrite, I took my idea of using ICAP and made the
impossible possible!

A review of what we will do:
we will take one proxy server with at least two Squid instances, one for
caching and the other with no cache at all (or a minimal one).
One of the instances (memory only) is bound only to the lo interface,
and the other intercepts/forwards requests.
We will also install on this proxy server a MySQL DB server and the ICAP
server that you desire/have.
I have used GreasySpoon: http://greasyspoon.sourceforge.net/
It is based on Java and really fast. For a basic setup we will use only
REQMOD (request modification); with a more advanced setup we can also
use response header manipulation to make objects "cache friendly".

This is the software part.

Now the idea:
we can use the ICAP server to rewrite requests transparently to the
client (and also to a server that is itself a client of our server).
So we set up two instances of the Squid proxy based on two different
conf files (this can be done with one compiled Squid).
The first one is the main cache, and on it we will send every request we
want to manipulate to the ICAP server, based on ACLs (it is very, very
important to plan them!).
On this instance we will configure the other instance, on the lo
interface, as a cache_peer that is a parent and is *NOT* marked
proxy-only (so the replies it fetches are cached locally).

We will select an internal domain such as "squid.internal" to use as the
object storage schema.

For this domain we will define a never_direct policy, and we will peer
all requests for this spoofed domain to the second instance, as sketched
below.
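
A minimal squid.conf sketch for this first (caching) instance; assume
the second instance listens on 127.0.0.1:3129 and GreasySpoon on ICAP
port 1344, and note that the ICAP service path, ports, and domain ACLs
are only illustrative:

# proxy1 - main caching instance (intercept/forward)
http_port 3128

# send the requests we want to manipulate to REQMOD (plan these ACLs carefully!)
icap_enable on
icap_service gs_req reqmod_precache bypass=on icap://127.0.0.1:1344/rewrite
acl rewrite_domains dstdomain .dl.sourceforge.net .youtube.com
adaptation_access gs_req allow rewrite_domains
adaptation_access gs_req deny all

# the second instance is a parent peer and NOT proxy-only, so replies are cached here
cache_peer 127.0.0.1 parent 3129 0 no-query no-digest name=innerpeer

# the spoofed internal domain never goes direct; it always goes through the peer
acl spoofed dstdomain .squid.internal
never_direct allow spoofed
cache_peer_access innerpeer allow spoofed
cache_peer_access innerpeer deny all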

On the second instance we will limit the REQMOD access to this spoofed
domain only.
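
A matching sketch for the second instance, bound to loopback only with
little or no cache of its own; again the ports and service path are
assumptions:

# proxy2 - inner instance, loopback only, minimal/no cache
http_port 127.0.0.1:3129
cache deny all

# only the spoofed domain is sent to the ICAP server, where the key is
# swapped back to the real origin URL
icap_enable on
icap_service gs_req reqmod_precache bypass=off icap://127.0.0.1:1344/unrewrite
acl spoofed dstdomain .squid.internal
adaptation_access gs_req allow spoofed
adaptation_access gs_req deny all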

now the fun begins!

It's time to combine MySQL (memory DB) + Squid + ICAP.

First we will analyze what we want to do.
An example is to cache all SourceForge CDN downloads as one object.
These are download links for the same file from two different mirrors:

http://dfn.dl.sourceforge.net/project/npp-compare/1.5.6/compare-1.5.6-unicode.zip

http://iweb.dl.sourceforge.net/project/npp-compare/1.5.6/compare-1.5.6-unicode.zip

You will notice that the only difference is in the lowest-level (mirror)
domain and all other parameters are the same, so to serve this object
from cache for both CDNs we only need one simple URL schema.

What we will do is a bit complicated to understand, and I hope the
following makes it simpler.

We know that the proxy servers do not reveal the request modification
that is done by the ICAP server, and this specific ICAP server
(GreasySpoon) has very powerful capabilities: external and custom libs,
classes, and programming languages.

We will create a database with a couple of fields for temporary data,
and if we want we can also build some statistics tables in the DB.

The purpose of the database is to store the destination URL together
with its matching key; it is managed by the key and not by the URL,
because the URL is dynamic.
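
A minimal sketch of such a pairing table (the field names and sizes are
only illustrative, not my exact schema):

-- key/URL pairing kept in a MySQL MEMORY table, with a timestamp for cleanup
CREATE TABLE url_pairs (
    obj_key  VARCHAR(255)  NOT NULL PRIMARY KEY,  -- the spoofed squid.internal key
    dest_url VARCHAR(2048) NOT NULL,              -- the original dynamic URL
    created  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
) ENGINE=MEMORY;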

We will do a double request manipulation on each request:
one on the intercept/forward proxy, and the second on the
cache_peer/second-instance proxy.

The flow is like this:

request from client ------------------------------------->proxy1

proxy1--------------------------------------------------->ICAP server
(a proxy1 ACL on the real domain sends the request to REQMOD on the ICAP server)

ICAP server (extracts the object's data from the URL, pairs it in the DB
with the original URL, then rewrites the request to the spoofed domain
with the key in the URI) --------------------------------->proxy1
example:

http://dfn.dl.sourceforge.net/project/npp-compare/1.5.6/compare-1.5.6-unicode.zip

becomes:
http://dl.df.squid.internal//project/npp-compare/1.5.6/compare-1.5.6-unicode.zip

and it is paired in the DB as a key together with the original URL and a
timestamp.

proxy1 requests the spoofed object, as a client ----------->proxy2

(a proxy1 ACL for the "squid.internal" dstdomain peers it to proxy2)

proxy2--------------------------------------------------------->ICAP server

(proxy2 has ACLs that allow only the spoofed domain ".squid.internal" to
be sent to REQMOD on the ICAP server, to prevent an endless loop)

ICAP server------------------------------------------------>proxy2
(the ICAP server rewrites the request back to the paired URL instead of
the key; this is because we want to fetch the real object, recursively,
into proxy1's cache)

At this point proxy1 thinks it is fetching the spoofed key, i.e.:
http://dl.df.squid.internal//project/npp-compare/1.5.6/compare-1.5.6-unicode.zip

but proxy2 is actually feeding it:
http://dfn.dl.sourceforge.net/project/npp-compare/1.5.6/compare-1.5.6-unicode.zip

proxy2------------>proxy1---------->client

So this specific state is logically like this:
the client thinks it is fetching the real file;
proxy1 fetches a spoofed file/URL from proxy2;
proxy2 fetches the real file/URL from the real server and feeds it to proxy1.
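
To make the two REQMOD passes concrete, here is a small sketch of the
rewrite logic in plain Java. It is only illustrative: the real
GreasySpoon scripts are written against its own scripting API and keep
the pairing in MySQL rather than in an in-memory map, and the hostname
pattern is just the SourceForge example from above.

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RewriteSketch {
    // pairing of spoofed key -> original URL (the real setup keeps this in MySQL)
    static final Map<String, String> pairs = new HashMap<>();

    // matches any SourceForge mirror, e.g. http://dfn.dl.sourceforge.net/project/...
    static final Pattern SF =
        Pattern.compile("^http://[a-z0-9]+\\.dl\\.sourceforge\\.net(/.*)$");

    // pass 1 (proxy1 -> ICAP): real URL -> spoofed squid.internal key
    static String toKey(String url) {
        Matcher m = SF.matcher(url);
        if (!m.matches()) return url;                 // not one of ours, leave it alone
        String key = "http://dl.df.squid.internal" + m.group(1);
        pairs.put(key, url);                          // pair the key with the original URL
        return key;
    }

    // pass 2 (proxy2 -> ICAP): spoofed key -> real URL, so the true object is fetched
    static String toUrl(String key) {
        return pairs.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        String real = "http://dfn.dl.sourceforge.net/project/npp-compare/1.5.6/compare-1.5.6-unicode.zip";
        String key = toKey(real);
        System.out.println(key);        // the spoofed key proxy1 will cache under
        System.out.println(toUrl(key)); // the original URL proxy2 will really fetch
    }
}

Any mirror subdomain maps to the same key, which is exactly why one
cached object can serve all of the CDN hosts.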

But the next time a client tries to get one of these objects:
http://dfn.dl.sourceforge.net/project/npp-compare/1.5.6/compare-1.5.6-unicode.zip

http://X.dl.sourceforge.net/project/npp-compare/1.5.6/compare-1.5.6-unicode.zip

http://yyy.dl.sourceforge.net/project/npp-compare/1.5.6/compare-1.5.6-unicode.zip

if proxy1 has the spoofed object
http://dl.df.squid.internal//project/npp-compare/1.5.6/compare-1.5.6-unicode.zip

in its cache, it will serve it from there; otherwise it will be fetched
from the Internet through proxy2.

*This is the main concept.*

I have a working setup for:
YouTube
ytimg
IMDB mp4/flv
SourceForge
some Facebook content
blip.tv
Vimeo
Dailymotion
Metacafe
AV updates
FileHippo
Linux distro repos (these need a change in the DB/key structure and
match rules)

If you have more sites or features that would be good to add, I will be
happy to try them.

(There is an access.log file with some nice data.)

Regards,
Eliezer

-- 
Eliezer Croitoru
https://www1.ngtech.co.il
IT consulting for Nonprofit organizations
eliezer <at> ngtech.co.il
