Plastic SCM proxy server explained

May 11, 2010

One of the new features we introduced with the 2.9 release is the Proxy Server. As you know, Plastic is all about flexibility, so it can behave as a DVCS or as a centralized system.

When you run Plastic in centralized mode, especially on wide area networks or across VPNs, you’ll be hit by network issues: latency, slowdowns, connection problems… You then have two options: use the distributed system to avoid being hit by the network (setting up a local server at your office that communicates with the central one, thus avoiding a huge number of roundtrips), or set up a proxy server to greatly reduce network traffic and improve performance.

Depending on your circumstances, preferences, network resources and so on, you can go for one or the other. At the end of the day, what we try to offer is a good set of options so you can choose.

How the proxy server works


The proxy server works in a pretty straightforward way: it simply caches revision data (file data, actually) and makes it available to clients so that they don’t have to query the central server. It greatly reduces network usage, since data transfers (rather than metadata) normally generate most of the daily traffic.

To use the proxy server, the clients need to be specifically configured (a detailed explanation comes later). Every time they need to request data they ask the proxy server, which makes the call on their behalf. The proxy handles concurrent requests for the same revision so that the data is retrieved only once (reducing data traffic), and it stores the data locally (in a pre-configured cache directory) before returning it to the client.
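The behavior described above (cache on disk, fetch each revision only once even under concurrent requests) can be sketched roughly as follows. This is only an illustration, not the actual Plastic SCM implementation: the `RevisionCache` class, the `fetch` callback standing in for the call to the central server, and the one-file-per-revision naming are all assumptions made for the example.

```python
import threading
from pathlib import Path
from typing import Callable


class RevisionCache:
    """Toy model of a caching proxy: each revision is fetched at most once."""

    def __init__(self, cache_dir: Path, fetch: Callable[[str, str, int], bytes]):
        self.cache_dir = cache_dir
        self.fetch = fetch  # hypothetical stand-in for the call to the central server
        self._lock = threading.Lock()
        self._in_flight = {}  # (server, repo, rev_id) -> Event set when the fetch ends

    def get(self, server: str, repo: str, rev_id: int) -> bytes:
        key = (server, repo, rev_id)
        # Assumed layout: <cache_dir>/<server>/<repository>/<revision id>
        path = self.cache_dir / server / repo / str(rev_id)
        while True:
            with self._lock:
                if path.exists():
                    return path.read_bytes()  # cache hit: no call to the server
                event = self._in_flight.get(key)
                leader = event is None
                if leader:
                    event = threading.Event()
                    self._in_flight[key] = event
            if not leader:
                # Another request is already downloading this revision:
                # wait for it, then re-check the cache.
                event.wait()
                continue
            try:
                data = self.fetch(server, repo, rev_id)
                path.parent.mkdir(parents=True, exist_ok=True)
                tmp = path.with_suffix(".tmp")
                tmp.write_bytes(data)
                tmp.replace(path)  # publish the cached file in one step
                return data
            finally:
                with self._lock:
                    del self._in_flight[key]
                event.set()
```

If eight clients ask for the same revision at the same moment, only the first request reaches `fetch`; the other seven wait and are then served from the cache directory.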

The proxy doesn’t need any configuration beyond its cache directory, since:
  • It doesn’t know about servers in advance; it just receives requests from the clients and connects to the specific servers on their behalf, using the same credentials the client does.
  • There’s no specific preload operation: to trigger a preload, simply run an “update forced” on an existing workspace, or a regular update on a new one (which forces the data to be downloaded).
  • Currently there’s no limit on the maximum cache size, but all data is stored in a single directory, so it’s straightforward to remove data if it grows too large.

    Data is stored by server and repository (a different directory for each server and then a directory for each repository).
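Since there is no cache size limit, it can be handy to see how much space each repository takes before removing anything. The following is a minimal sketch that assumes only what is stated above (one directory per server, then one directory per repository); the `cache_usage` helper is hypothetical and is not part of Plastic SCM, and it makes no assumption about how files are named inside a repository directory.

```python
from pathlib import Path


def cache_usage(cache_dir: Path) -> dict:
    """Return bytes used per (server, repository).

    Assumes the layout <cache_dir>/<server>/<repository>/... and simply
    sums the size of every file found below each repository directory.
    """
    usage = {}
    for server_dir in sorted(p for p in cache_dir.iterdir() if p.is_dir()):
        for repo_dir in sorted(p for p in server_dir.iterdir() if p.is_dir()):
            size = sum(f.stat().st_size for f in repo_dir.rglob("*") if f.is_file())
            usage[(server_dir.name, repo_dir.name)] = size
    return usage
```

Sorting the result by size shows which repository directory to remove first if the cache grows too large.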

    The following figure shows how the basic communication flow works and how data is arranged inside the proxy server data location.


    The next graphic explains how the individual calls requesting data for revisions are handled by the proxy server which will cache the received data after calling the repository server.


    And the same principle will apply when scenarios get more complicated and instead of a single server and repository there are several servers and repositories involved.


    What happens if the proxy server goes down?


    Currently the mechanism we’ve implemented is also pretty transparent: if the proxy server goes down (or you shut it down), the client will detect it (the network connection will fail) and will contact the real repository server directly, logging the event for diagnostic purposes. Once a client detects that the proxy server is down, it won’t use it again until the client itself is restarted.
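The fallback logic can be pictured with a small sketch. It is only a model of the behavior described above, not the real client code: `DataClient`, `fetch_via_proxy` and `fetch_direct` are hypothetical names, and a Python `ConnectionError` stands in for whatever network failure the real client detects.

```python
class DataClient:
    """Toy client: use the proxy until it fails once, then go direct."""

    def __init__(self, fetch_via_proxy, fetch_direct, log=print):
        self._via_proxy = fetch_via_proxy  # hypothetical call through the proxy
        self._direct = fetch_direct        # hypothetical call to the repository server
        self._log = log
        # Only reset by creating a new client, i.e. "restarting" it.
        self._proxy_down = False

    def get(self, rev_id):
        if not self._proxy_down:
            try:
                return self._via_proxy(rev_id)
            except ConnectionError as exc:
                # Remember the failure and log it for diagnostics.
                self._proxy_down = True
                self._log(f"proxy unreachable, falling back to server: {exc}")
        return self._direct(rev_id)
```

After the first failed call, every later `get` goes straight to the repository server, which matches the "until the client itself is restarted" rule.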

    Installing a proxy server


    Installing a proxy server is pretty straightforward on Windows, Linux and Mac OS X. You just have to get the installer and follow the steps. In fact, it will only ask you for a directory in which to store the cached data, and that’s all.
    The configuration will be saved in a plasticcached.conf file with a single entry for the directory mentioned above.


    Configuring a proxy server on the client side


    There’s only one simple change to make on the clients: run the configuration wizard (from the GUI preferences option or by running plastic --configure) and set the right proxy server.


    What a typical proxy server setup looks like


    The initial situation before you set up a proxy server will be something like the following.


    The network traffic (in red) is too high and clients are slowed down. To solve this you can set up a couple of proxy servers, one on each LAN.



    Now the data traffic will be local and performance will get much better.

    Performance benchmark


    OK, so far I’ve been saying that performance gets better on centralized setups when you introduce a proxy server, but I haven’t shared any data about how much better it actually gets.

    We run load tests on a cluster to check and improve Plastic SCM performance, and this time we focused on finding out how to reduce network traffic by using proxy servers.

    We used the following configuration: 4 different networks, each with its computers connected through a gigabit connection, and one central server connected to the different sub-networks through a 100 Mbps connection (which is the actual limiting factor). In total we used 71 concurrent clients.

    We used a very simple repository where a single copy consists of 25k files and about 3k directories, for a total of 300 MB.

    The test itself is very simple:
  • Every client creates a workspace and downloads the latest copy of the main branch (trunk in SVN jargon).
  • Then the client creates a branch, switches to it and modifies a total of 10 files on it.
  • The client repeats the previous step (create a branch, switch to it, modify files) 5 times.

    The following figure depicts the network layout and the machines at each lab (CPU, total bogomips of each node and RAM).



    Then we ran the test with and without proxy servers and compared the results.

    Server OS                       Time (min)   GB sent   GB received
    Linux 64 bits + proxy servers   10.73        2.10      0.26
    Linux 64 bits                   30.17        16.85     0.30



    As you can see, in this very simple example we can multiply overall performance by a factor of roughly 3 by introducing proxy servers. The actual number of proxy servers and their configuration will depend on your network topology: we tested with 4 proxy servers because we were using 4 networks.
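For reference, the ratios behind that statement can be worked out directly from the table. The short script below only redoes the arithmetic on the published numbers; the variable names are made up for the example.

```python
# Figures copied from the benchmark table (time in minutes, data in GB).
with_proxy = {"minutes": 10.73, "gb_sent": 2.10, "gb_recv": 0.26}
without_proxy = {"minutes": 30.17, "gb_sent": 16.85, "gb_recv": 0.30}

# How many times faster the whole test ran with proxy servers.
speedup = without_proxy["minutes"] / with_proxy["minutes"]

# How much less data appears in the "sent" column with proxy servers.
sent_ratio = without_proxy["gb_sent"] / with_proxy["gb_sent"]
sent_saved = 1 - with_proxy["gb_sent"] / without_proxy["gb_sent"]

print(f"speedup: {speedup:.2f}x")
print(f"data sent: {sent_ratio:.1f}x less ({sent_saved:.0%} saved)")
```

So the test finished about 2.8 times faster, and the amount of data sent dropped by a factor of about 8, which is where most of the gain on the 100 Mbps link comes from.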
