gcp_bigquery_select

Executes a SELECT query against BigQuery and creates a message for each row received.

# Config fields, showing default values
input:
  label: ""
  gcp_bigquery_select:
    project: "" # No default (required)
    credentials_json: "" # No default (optional)
    table: bigquery-public-data.samples.shakespeare # No default (required)
    columns: [] # No default (required)
    where: type = ? and created_at > ? # No default (optional)
    auto_replay_nacks: true
    job_labels: {}
    priority: ""
    args_mapping: root = [ "article", now().ts_format("2006-01-02") ] # No default (optional)
    prefix: "" # No default (optional)
    suffix: "" # No default (optional)

Once the rows from the query are exhausted, this input shuts down, allowing the pipeline to gracefully terminate (or the next input in a sequence to execute).
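
As a minimal sketch of the sequence behaviour described above, the following configuration wraps this input in a sequence input so that a second input runs once the query rows are exhausted. The generate input, its mapping, and the status text are illustrative assumptions rather than part of this component:

input:
  sequence:
    inputs:
      # Runs first and shuts down once every row has been emitted.
      - gcp_bigquery_select:
          project: sample-project
          table: bigquery-public-data.samples.shakespeare
          columns:
            - word
            - word_count
          suffix: LIMIT 10
      # Hypothetical follow-up input: emits one marker message afterwards.
      - generate:
          count: 1
          mapping: 'root = {"status": "bigquery rows exhausted"}'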

Examples

  • Word counts

Here we query the public corpus of Shakespeare’s works to generate a stream of the 10 most frequent words that are three or more characters long:

input:
  gcp_bigquery_select:
    project: sample-project
    table: bigquery-public-data.samples.shakespeare
    columns:
      - word
      - sum(word_count) as total_count
    where: length(word) >= ?
    suffix: |
      GROUP BY word
      ORDER BY total_count DESC
      LIMIT 10
    args_mapping: |
      root = [ 3 ]

Fields

project

GCP project where the query job will execute.

Type: string

credentials_json

This field contains sensitive information. Review your cluster security before adding it to your configuration.

Type: string

Default: ""

table

Fully-qualified BigQuery table name to query.

Type: string

# Examples

table: bigquery-public-data.samples.shakespeare

columns

A list of columns to query.

Type: array

where

An optional WHERE clause to add to the query. Placeholder arguments are populated from the args_mapping field and must always be written as question marks (?).

Type: string

# Examples

where: type = ? and created_at > ?

where: user_id = ?

auto_replay_nacks

Whether messages that are rejected (nacked) at the output level should be automatically replayed indefinitely, eventually resulting in back pressure if the cause of the rejections is persistent. If set to false, these messages are deleted instead. Disabling auto replays can greatly improve the memory efficiency of high-throughput streams, because the original shape of the data can be discarded immediately upon consumption and mutation.

Type: bool

Default: true

job_labels

A map of labels (key-value pairs) to add to the query job.

Type: object

Default: {}
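
Because the field is an object, labels are written as key-value pairs. A minimal sketch, in which the label keys and values are hypothetical:

job_labels:
  # Hypothetical labels; choose keys and values that suit your project.
  team: analytics
  environment: staging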

priority

The priority with which to schedule the query.

Type: string

Default: ""

args_mapping

An optional Bloblang mapping that should evaluate to an array of values whose length matches the number of placeholder arguments in the where field.

Type: string

# Examples

args_mapping: root = [ "article", now().ts_format("2006-01-02") ]
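
To illustrate how the two fields pair up, here is a minimal sketch in which a single placeholder in where is matched by a single value in args_mapping. The table name, column names, and the value 42 are hypothetical:

input:
  gcp_bigquery_select:
    project: sample-project
    # Hypothetical table and columns, for illustration only.
    table: my-project.my_dataset.events
    columns:
      - user_id
      - created_at
    # One placeholder in where ...
    where: user_id = ?
    # ... so the mapping evaluates to an array holding exactly one value.
    args_mapping: |
      root = [ 42 ]

With several placeholders, the array should presumably list its values in the same order as the question marks appear in where, as the two-value example above suggests.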

prefix

An optional prefix to prepend to the select query (before SELECT).

Type: string

suffix

An optional suffix to append to the select query.

Type: string