The GRR client protocol is depicted below.
Due to network realities such as NAT and firewalls, it is not possible to connect directly to the client, so GRR relies on the client connecting to the server in order to communicate with it.
The GRR client makes periodic POST requests to the server to both send replies and receive new instructions. Since POST requests are very short lived (most client polls carry no data), the client has to repeat the polls periodically.
There are two parameters which determine how the GRR client behaves: poll_max and poll_min. When requests are sent to the client, the client reduces its poll wait time to poll_min (default 0.2 seconds). When nothing happens, the client gradually increases its poll time up to poll_max (default 10 minutes).
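To make the back-off behaviour concrete, here is a small illustrative sketch in Go (GRR itself is written in Python, and the exact back-off curve is an assumption; only the poll_min/poll_max idea and their defaults are taken from the description above):

```go
// An illustrative sketch of the adaptive poll back-off described above.
package main

import (
	"fmt"
	"time"
)

const (
	pollMin = 200 * time.Millisecond // used while the server has work for us
	pollMax = 10 * time.Minute       // ceiling reached when the client is idle
)

// nextWait returns the delay before the next poll, given the current delay
// and whether the last poll returned any requests.
func nextWait(current time.Duration, gotWork bool) time.Duration {
	if gotWork {
		return pollMin // become responsive again immediately
	}
	next := current * 2 // back off gradually while idle (doubling is assumed)
	if next > pollMax {
		next = pollMax
	}
	return next
}

func main() {
	wait := pollMin
	for i := 0; i < 15; i++ {
		wait = nextWait(wait, false) // simulate an idle client
		fmt.Println("next poll in", wait)
	}
}
```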
Having long poll times means that any new flow launched on an idle client must wait for up to 10 minutes before the client polls again and the server can deliver the new requests. Unfortunately, reducing the poll_max setting actually increases server load, as the server needs to hit the database more often to serve each poll. This scheme essentially poses a trade-off: for a responsive client we need a low poll_max (i.e. more frequent polls), but this increases the load on the frontend, so it cannot be too low.
When GRR is normally deployed it produces two types of clients in the web interface: the debug client has poll_max set to 5 seconds, making testing easier because it is more responsive, while the non-debug version has poll_max set to 10 minutes. At Velocidex, one of our customers once accidentally deployed the debug version and the server was slammed with 5 second polls from several thousand clients! This rendered the server useless, returning HTTP 500 status codes for most client polls. The only way to recover was to push a new configuration to the clients and restart their GRR service in order to lower the poll frequency and regain control over the deployment.
The other problem with GRR's client communication protocol is that it tends to exhibit catastrophic failure under load. When the client makes an HTTP POST request, the server goes through the following steps in order:
In previous posts we have seen that GRR's overuse of queuing leads to extreme load on the database, so under load (e.g. when a large hunt is taking place) the above process may take some time, as the server must obtain a lock on the database row, write and read the messages, and compose its response.
What tends to happen under load is that the client times the request out if the server takes too long, or the server itself times the request out with an HTTP 500 code. The client, thinking its data did not get through, will try to POST the same data again (although this time it will wait longer before retrying).
This only makes things worse: the replies are probably already mostly queued, so the retry re-queues the same messages (they will be discarded by the worker anyway, but they are still queued), increasing database pressure and server load. This manifests as a critical meltdown of the frontends, which pretty soon serve mostly 500 errors (making things worse again).
This is why resource provisioning is so important with GRR: if the frontends are too slow to keep up, connections start to time out and load increases (rather than decreases), causing a catastrophic failure.
The main problem with a polling scheme is that the user experience is terrible: even if we reduce the poll wait time to a few seconds, users still have to wait to see the results of their actions, leaving an overall impression of a slow and sluggish system. For a responsive user interface we need client round trips of a second or less, and setting poll_max this low simply uses up too many resources. This is particularly noticeable in the VFS browser, since it takes so long to navigate to the desired directory and download files interactively.
Other endpoint monitoring systems use distributed pub/sub systems like RabbitMQ or the Firebase Realtime Database to inform clients of new requests. In those systems, the client makes a TCP connection to an endpoint and holds the connection open for long periods of time; the server can then push new requests to the client as soon as they are published. This seems like the way to go, but we did not want to introduce another dependency into Velociraptor (we really like it being a self-contained binary that works out of the box).
In particular, we wanted to solve the catastrophic failure we saw with GRR clients under load (described above). This means we need to make sure that clients do not send data faster than the server can process it. We definitely want to avoid the POST timing out with a 500 error and the client retrying the same POST, since this is the main cause of the catastrophic failures we experienced with GRR.
We can do this by keeping the client's connection open for as long as we need. To avoid timing it out, we send the HTTP status code immediately and then process the POST data, while periodically sending the client keepalive data using HTTP chunked transfer encoding.
To the client, and any proxies in the way, it simply looks like the POST request was received immediately, and the response body is downloaded slowly (there is always some pad data flowing so none of the TCP or HTTP timers are triggered since the connection is always active). This is illustrated in the diagram below.
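A rough sketch of this idea in Go might look like the following. The handler name, the processMessages placeholder and the pad interval are all invented for illustration; this is not Velociraptor's actual frontend code:

```go
package main

import (
	"io"
	"net/http"
	"time"
)

// writerHandler accepts client replies. It acknowledges the POST immediately
// and keeps the response trickling while the (possibly slow) processing of
// the uploaded messages happens in the background.
func writerHandler(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}

	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}

	// Send the 200 status right away: from the client's (and any proxy's)
	// point of view the POST has already succeeded, so it will not retry.
	w.WriteHeader(http.StatusOK)
	flusher.Flush()

	done := make(chan struct{})
	go func() {
		processMessages(body) // the slow part: queueing, database writes, etc.
		close(done)
	}()

	// Trickle pad data (via chunked transfer encoding) until processing is
	// finished, so neither the client nor any proxy times the request out.
	ticker := time.NewTicker(3 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			w.Write([]byte(" ")) // a pad byte keeps the connection active
			flusher.Flush()
		}
	}
}

// processMessages stands in for whatever the server does with the replies.
func processMessages(data []byte) { time.Sleep(10 * time.Second) }

func main() {
	http.HandleFunc("/writer", writerHandler)
	http.ListenAndServe(":8000", nil)
}
```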
This scheme has two main advantages:
We really wanted to make clients more responsive. In our IR work we were frankly sick of having to wait up to 10 minutes to access a client that we knew was online. To make the client more responsive we wanted to use the same technique to keep the client connection open for long periods of time, and then send instructions to the client as soon as the user issues a new flow.
In the GRR scheme, new requests are sent on the same connection on which client replies are received. This won't work if the client connection is held open for long periods of time, because while the client is blocked reading the server's response it cannot send any replies (the POST request was already sent).
To fix this we switched to two separate POST connections handled by two server handlers: a reader handler and a writer handler. The writer handler only receives messages from the client to the server (i.e. replies to client requests), while the reader handler blocks the client for a prolonged time and sends new client requests as soon as flows are launched.
This scheme allows a full-duplex, responsive communication protocol, with no polling overhead. This can be seen in the diagram below.
The client establishes the reader channel by sending an HTTP POST request to the reader handler. The server checks for any messages for the client and sees that none are pending. It then keeps the client's connection open as before, trickling pad data (using HTTP chunked transfer encoding) to keep the connection open for as long as possible.
When the user launches a new flow, the server can immediately forward the client's requests on the open channel, completing the POST operation. The client then processes the requests and sends the responses with a separate HTTP POST to the writer handler. In the meantime the client re-POSTs to the reader handler and blocks again, ready for the next request.
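To make the mechanics concrete, here is a rough Go sketch of such a reader handler. The endpoint name, the client_id parameter and the per-client in-memory queue are all invented for illustration; the point is only to show the select between pending messages and the keepalive pad:

```go
package main

import (
	"net/http"
	"sync"
	"time"
)

// Hypothetical per-client queues: launching a flow pushes the serialized
// requests onto the client's channel, which unblocks its open reader POST.
var (
	mu     sync.Mutex
	queues = map[string]chan []byte{}
)

func clientQueue(clientID string) chan []byte {
	mu.Lock()
	defer mu.Unlock()
	q, ok := queues[clientID]
	if !ok {
		q = make(chan []byte, 100)
		queues[clientID] = q
	}
	return q
}

// readerHandler holds the client's POST open, trickling pad data, and
// completes it as soon as there are requests to deliver to the client.
func readerHandler(w http.ResponseWriter, r *http.Request) {
	q := clientQueue(r.URL.Query().Get("client_id"))

	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusOK)
	flusher.Flush()

	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case msg := <-q:
			w.Write(msg) // deliver the new flow's requests immediately
			return       // the client will re-POST and block again
		case <-ticker.C:
			w.Write([]byte(" ")) // keepalive pad via chunked encoding
			flusher.Flush()
		case <-r.Context().Done():
			return // the client went away; stop holding resources
		}
	}
}

func main() {
	http.HandleFunc("/reader", readerHandler)
	// A corresponding /writer handler would receive the client's replies.
	http.ListenAndServe(":8000", nil)
}
```

A real implementation would frame the messages so the client can tell pad bytes from payload, persist queued requests rather than holding them in memory, and authenticate the client instead of trusting a query parameter.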
This scheme has the following advantages:
However, the scheme also has some disadvantages:
We believe these limitations can be easily managed. They are no different from the typical limitations of large-scale pub/sub systems (they too need to hold many TCP connections open). In our testing we have not seen a problem scaling to many thousands of connected clients with very low resource use.
Velociraptor now also has a pool client that allows spinning up several thousand clients at the same time. This helps with testing a deployment, making sure it can handle the increased open file limits and showing how large-scale hunts are handled.
The new responsive client communication protocol allows for near-instantaneous access to clients. It actually reduces the overall load on the system, because we no longer need frequent client polls just to check whether a new flow has been launched. The user experience is much better as users can interact with clients immediately.