parallelize

Plugin

| Arg | Description | Type |
|-----|-------------|------|
| query | The query will be run in parallel over batches. | StoredQuery (required) |
| client_id | The client id to extract | string |
| flow_id | A flow ID (client or server artifacts) | string |
| hunt_id | Retrieve sources from this hunt (combines all results from all clients) | string |
| artifact | The name of the artifact collection to fetch | string |
| source | An optional named source within the artifact | string |
| start_time | Start returning events from this date (for event sources) | int64 |
| end_time | Stop when events reach this time (for event sources) | int64 |
| notebook_id | The notebook to read from (should also include cell id) | string |
| notebook_cell_id | The notebook cell to read from (should also include notebook id) | string |
| notebook_cell_table | A notebook cell can have multiple tables. | int64 |
| workers | Number of workers to spawn. | int64 |
| batch | Number of rows in each batch. | int64 |

Description

Runs query on result batches in parallel.

Normally the source() plugin reads result sets from disk in series. This is fine when the result set is not too large, but when we need to filter a large number of rows it is better to use all cores by reading and filtering in parallel.

The parallelize() plugin is a parallel version of source() which breaks result sets into batches and applies a query over each batch in parallel. On a multi-core machine this can be a lot faster.

The query passed to parallelize() receives a special scope in which the source() plugin returns results from a small batch of the total. The size of this batch is controlled by the batch parameter, and the number of batches processed concurrently is controlled by the workers parameter.
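
As a rough sketch of how the batch and workers parameters fit together, the query below sets both explicitly. The values 8 and 10000 are illustrative assumptions only, not recommended settings, and HuntId and ArtifactName are assumed to be already defined in the calling scope (for example in a hunt notebook).

```vql
-- Sketch only: workers=8 and batch=10000 are arbitrary example values.
-- HuntId and ArtifactName are assumed to exist in the current scope.
SELECT * FROM parallelize(
    hunt_id=HuntId,
    artifact=ArtifactName,
    workers=8,       -- number of workers to spawn
    batch=10000,     -- number of rows handed to each run of the inner query
    query={
        -- source() here only sees the rows of the current batch
        SELECT * FROM source()
        WHERE FullPath =~ "XYZ"
    })
```

Because the inner query runs once per batch, it should be a filter that can be evaluated on each batch independently.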

This is especially useful when we need to filter rows from a hunt: each client’s result set will be filtered in parallel on a different core.

Example:

SELECT * FROM parallelize(hunt_id=HuntId, artifact=ArtifactName, query={
   SELECT * FROM source()
   WHERE FullPath =~ "XYZ"
})
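
The same pattern should apply to a single collection rather than a hunt, using the client_id and flow_id arguments from the table above. This is a sketch under the assumption that ClientId, FlowId and ArtifactName are already defined in the calling scope; those variable names are not part of the original example.

```vql
-- Sketch only: ClientId, FlowId and ArtifactName are assumed scope variables.
SELECT * FROM parallelize(
    client_id=ClientId,
    flow_id=FlowId,
    artifact=ArtifactName,
    query={
        -- Filter each batch of the collection's results independently
        SELECT * FROM source()
        WHERE FullPath =~ "XYZ"
    })
```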