Aggregations with ElasticSearch Query DSL
Source: vignettes/Aggregations-with-ElasticSearch-Query-DSL.Rmd
The Elasticsearch documentation on aggregations is extensive: Aggregations Docs. For the NTSPortal schema, see Structure of NTSPortal.
Aggregation: terms
The terms aggregation groups the documents returned by
the query into buckets according to a keyword or boolean field.
Aggregations can be nested, so each bucket can be successively split
into further sub-buckets.
The following request searches for two stations and splits the results
by station, polarity and blank status (true/false) using three nested
terms aggregations.
GET ntsp_msrawfiles/_search?size=0 // Size is 0, meaning no doc sources are returned
{
  "query": {
    "terms": { // terms query
      "station": [
        "donau_ul_m",
        "saale_wettin_m"
      ]
    }
  },
  "aggs": {
    "messstellen": { // just a variable name you give it (can be anything)
      "terms": { // terms aggregation
        "field": "station"
      },
      "aggs": {
        "polaritaet": {
          "terms": { // terms aggregation
            "field": "pol"
          },
          "aggs": {
            "methoden_blanks": {
              "terms": { // terms aggregation
                "field": "blank"
              }
            }
          }
        }
      }
    }
  }
}
Note: The terms query and the terms aggregation are two different
things. The first filters the results by multiple possible values (in a
keyword field); the second splits the filtered results (documents) into
buckets based on a keyword field. The example contains three terms
aggregations nested inside each other: each station bucket is split by
polarity, and each of those buckets is split again by the boolean field
blank.
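A nested result of this kind is a tree of buckets, which is often easier to handle as one row per innermost bucket. The following Python sketch assumes the response has already been parsed into a dict. The helper name `flatten_buckets` is hypothetical and the sample response is invented and heavily shortened; only the aggregation names are taken from the request above.

```python
def flatten_buckets(agg, names, prefix=()):
    """Yield one tuple (key_1, ..., key_n, doc_count) per innermost bucket.

    agg   -- the dict stored under an aggregation name in the response
    names -- names of the nested sub-aggregations, outermost first
    """
    for bucket in agg["buckets"]:
        # Boolean fields report 0/1 as key; prefer the readable string form.
        key = bucket.get("key_as_string", bucket["key"])
        if names:
            yield from flatten_buckets(bucket[names[0]], names[1:], prefix + (key,))
        else:
            yield prefix + (key, bucket["doc_count"])


# Hypothetical, shortened response; all counts are invented for illustration.
response = {
    "aggregations": {
        "messstellen": {
            "buckets": [
                {
                    "key": "donau_ul_m",
                    "doc_count": 3,
                    "polaritaet": {
                        "buckets": [
                            {
                                "key": "pos",
                                "doc_count": 3,
                                "methoden_blanks": {
                                    "buckets": [
                                        {"key": 0, "key_as_string": "false", "doc_count": 2},
                                        {"key": 1, "key_as_string": "true", "doc_count": 1},
                                    ]
                                },
                            }
                        ]
                    },
                }
            ]
        }
    }
}

rows = list(
    flatten_buckets(
        response["aggregations"]["messstellen"],
        ["polaritaet", "methoden_blanks"],
    )
)
for row in rows:
    print(row)
```

Each row then holds the station, the polarity, the blank status and the document count of that combination.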
Overview of all compounds
This example will also show for which polarities and in which matrices each compound is found.
GET ntsp25.1_dbas*/_search?size=0
{ // Since there is no query, all docs are used for the aggs
  "aggs": {
    "compounds": {
      "terms": {
        "field": "name", // split by compound name
        "size": 1000 // Default is 10, must be set higher for 'name' field
      },
      "aggs": {
        "pols": {
          "terms": {
            "field": "pol" // There are only 2 polarities
          },
          "aggs": {
            "matrices": {
              "terms": {
                "field": "matrix"
              }
            }
          }
        }
      }
    }
  }
}
Explanation: the index pattern ntsp25.1_dbas* selects all indices whose
names start with this character string.
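Since a terms aggregation returns only the top `size` buckets, it can be useful to check whether any documents fell outside the returned buckets. Elasticsearch reports this in the `sum_other_doc_count` field of a terms aggregation result. The Python sketch below assumes a parsed response; the helper name `terms_truncated` is hypothetical and the two sample fragments are invented.

```python
def terms_truncated(terms_agg):
    """Return True if documents fell outside the returned buckets."""
    return terms_agg.get("sum_other_doc_count", 0) > 0


# Hypothetical fragments of response["aggregations"]["compounds"]; numbers are invented.
complete = {"doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": []}
cut_off = {"doc_count_error_upper_bound": 0, "sum_other_doc_count": 1520, "buckets": []}

print(terms_truncated(complete))
print(terms_truncated(cut_off))
```

If the check returns True, the `size` of the terms aggregation was too small to list every value.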
Bucket aggregations and metric aggregations
In this example a term query is used to filter only for
data from suspended particulate matter (spm). The detections are then
binned by station name (terms bucket aggregation) and for
each station the latest file upload time is returned (max
metric aggregation).
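A request body along these lines could look like the following Python sketch, which builds the body as a dict and prints it as JSON for use under a `GET <index>/_search?size=0` line. The fields `matrix` and `station` appear in the other examples. The aggregation names, the `size` of 100 and especially the upload-time field name `upload_time` are assumptions and need to be checked against the NTSPortal schema.

```python
import json

# Hypothetical field name for the file upload time; check the NTSPortal schema.
UPLOAD_TIME_FIELD = "upload_time"

body = {
    # term query: keep only documents from suspended particulate matter
    "query": {"term": {"matrix": {"value": "spm"}}},
    "aggs": {
        # bucket aggregation: one bucket per station (size 100 is an arbitrary choice)
        "stations": {
            "terms": {"field": "station", "size": 100},
            "aggs": {
                # metric aggregation: latest upload time within each station bucket
                "latest_upload": {"max": {"field": UPLOAD_TIME_FIELD}}
            },
        }
    },
}

print(json.dumps(body, indent=2))
```

The bucket aggregation decides how documents are grouped, while the metric aggregation nested inside it computes a single value per group.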
Combining aggregations and a bool query
In this example the query selects documents from the stations donau_ul_m or saale_wettin_m that are not blanks. The matching documents are then split by station and polarity.
GET ntsp_msrawfiles/_search?size=0
{
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "station": [
              "donau_ul_m",
              "saale_wettin_m"
            ]
          }
        },
        {
          "term": {
            "blank": {
              "value": "false"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "kontrolle": {
      "terms": {
        "field": "station"
      },
      "aggs": {
        "polaritaet": {
          "terms": {
            "field": "pol"
          }
        }
      }
    }
  }
}
Aggregation: geotile_grid
In this example documents are selected by matching a pattern against the station name and then split by coordinates (geopoint).
GET ntsp_msrawfiles/_search?size=0
{
  "query": {
    "regexp": {
      "station": ".*NA"
    }
  },
  "aggs": {
    "stations": {
      "terms": {
        "field": "station"
      },
      "aggs": {
        "locations": {
          "geotile_grid": {
            "field": "loc",
            "precision": 21,
            "size": 10
          }
        }
      }
    }
  }
}
The precision can be set very high (21 here; geotile_grid accepts values from 0 to 29) because within a station all coordinates must be exactly the same, so each station bucket should still contain only a single tile.
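The keys of geotile_grid buckets have the form `zoom/x/y`, following the usual web map tile scheme. To turn such a key back into coordinates, the standard tile-to-longitude/latitude formula can be applied. The Python sketch below returns the north-west corner of the tile; the function name `geotile_to_latlon` and the example key are illustrative only and not taken from the NTSPortal data.

```python
import math


def geotile_to_latlon(key):
    """Return (lat, lon) in degrees of the north-west corner of a 'zoom/x/y' tile key."""
    zoom, x, y = (int(part) for part in key.split("/"))
    n = 2 ** zoom  # number of tiles per axis at this zoom level
    lon = x / n * 360.0 - 180.0
    # Inverse of the Web Mercator projection for the tile's upper edge
    lat = math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * y / n))))
    return lat, lon


# Illustrative key only
lat, lon = geotile_to_latlon("8/131/84")
print(round(lat, 4), round(lon, 4))
```

At precision 21 a tile is small enough that its corner is a close approximation of the station coordinate.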
Aggregation: cardinality
The cardinality aggregation returns the (approximate) number of distinct values in a field, here a keyword field. In this example we want to split the detections by compound but also want to know how many different compounds are found in the database. So two aggregations are set in parallel (not nested).
GET ntsp_dbas_upb/_search?size=0
{
  "aggs": { // There are two different aggregations listed here
    "num_different_comps": { // Aggregation 1 determines the number of different compounds
      "cardinality": {
        "field": "name",
        "precision_threshold": 1000 // counts below this threshold are expected to be close to exact; ca. 400 different compounds (the default in current Elasticsearch versions is 3000)
      }
    },
    "comps_buckets": { // Aggregation 2 splits docs into buckets by compound name
      "terms": {
        "field": "name",
        "size": 1000
      }
    }
  }
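}
Filtering buckets by their doc counts
In this example the query keeps only detections in positive polarity. The compound buckets are then filtered to compounds found 10 or more times in total, and within each compound only the days on which it was found are returned. A parallel cardinality aggregation (files) counts the distinct values of the filename field.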
GET ntsp_dbas_v231006_frame/_search?size=0
{
  "query": {
    "term": {
      "pol": {
        "value": "pos"
      }
    }
  },
  "aggs": {
    "comps": {
      "terms": {
        "field": "name",
        "size": 1000
      },
      "aggs": {
        "comps_selector": {
          "bucket_selector": {
            "buckets_path": {
              "numSamplesPerComp": "_count"
            },
            "script": "params.numSamplesPerComp >= 10"
          }
        },
        "samples": {
          "date_histogram": {
            "field": "start",
            "calendar_interval": "1d"
          },
          "aggs": {
            "days_selector": {
              "bucket_selector": {
                "buckets_path": {
                  "numSamplesPerDay": "_count"
                },
                "script": "params.numSamplesPerDay > 0"
              }
            }
          }
        }
      }
    },
    "files": {
      "cardinality": {
        "field": "filename"
      }
    }
  }
}
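The effect of the two bucket_selector steps can be pictured as a plain filter on the document count of each bucket. The Python sketch below mimics this on in-memory buckets. It is not how Elasticsearch implements the aggregation; the helper name `keep`, the compound names and all counts are invented for illustration.

```python
def keep(buckets, predicate):
    """Keep the buckets whose doc_count satisfies the predicate."""
    return [b for b in buckets if predicate(b["doc_count"])]


# Invented buckets shaped like the "comps" terms aggregation result
comps = [
    {
        "key": "compound_a",
        "doc_count": 12,
        "samples": {
            "buckets": [
                {"key_as_string": "2021-01-01T00:00:00.000Z", "doc_count": 12},
                {"key_as_string": "2021-01-02T00:00:00.000Z", "doc_count": 0},
            ]
        },
    },
    {
        "key": "compound_b",
        "doc_count": 3,
        "samples": {
            "buckets": [
                {"key_as_string": "2021-01-01T00:00:00.000Z", "doc_count": 3},
            ]
        },
    },
]

# Like comps_selector: params.numSamplesPerComp >= 10
selected = keep(comps, lambda n: n >= 10)

# Like days_selector: params.numSamplesPerDay > 0
for comp in selected:
    comp["samples"]["buckets"] = keep(comp["samples"]["buckets"], lambda n: n > 0)

for comp in selected:
    print(comp["key"], [d["key_as_string"] for d in comp["samples"]["buckets"]])
```

Doing the filtering inside the request, as in the example above, keeps the discarded buckets out of the response in the first place.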