parallelize (Plugin)
Arg | Description | Type |
---|---|---|
query | The query will be run in parallel over batches. | StoredQuery (required) |
client_id | The client id to extract | string |
flow_id | A flow ID (client or server artifacts) | string |
hunt_id | Retrieve sources from this hunt (combines all results from all clients) | string |
artifact | The name of the artifact collection to fetch | string |
source | An optional named source within the artifact | string |
start_time | Start returning events from this date (for event sources) | int64
end_time | Stop returning events at this time (for event sources) | int64
notebook_id | The notebook to read from (should also include cell id) | string |
notebook_cell_id | The notebook cell to read from (should also include notebook id) | string
notebook_cell_table | A notebook cell can have multiple tables | int64
workers | Number of workers to spawn | int64
batch | Number of rows in each batch | int64
Runs a query on result batches in parallel.
Normally the source() plugin reads result sets from disk in series. This is fine when the result set is not too large, but when we need to filter a large number of rows it is better to use all cores by reading and filtering in parallel.
The parallelize() plugin is a parallel version of source(): it breaks result sets into batches and applies a query over each batch in parallel. On a machine with multiple cores this can be significantly faster.
The query passed to parallelize() receives a special scope in which the source() plugin returns results from a small batch of the total. The size of each batch is controlled by the batch parameter, and the number of parallel workers by the workers parameter.
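For example, a minimal sketch of filtering a single flow's results in parallel (the artifact name, column names, and tuning values below are illustrative assumptions, not taken from the original example):
SELECT * FROM parallelize(
    client_id=ClientId, flow_id=FlowId,
    artifact="Windows.System.Pslist",   // assumed artifact name for illustration
    workers=8, batch=1000,              // assumed tuning values
    query={
        SELECT Name, Pid FROM source()
        WHERE Name =~ "chrome"
    })
Each worker runs the inner query over its own batch of rows delivered through source(), so larger batches mean fewer, bigger units of work per worker.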
This is especially useful when we need to filter rows from a hunt:
SELECT * FROM parallelize(hunt_id=HuntId, artifact=ArtifactName, query={
    SELECT * FROM source()
    WHERE FullPath =~ "XYZ"
})
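Here each worker receives a batch of the combined hunt results through source() and applies the WHERE clause to it, so the filtering work is spread across all available cores.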