airflow.providers.apache.druid.transfers.hive_to_druid¶

This module contains an operator to move data from Hive to Druid.

Attributes¶

`LOAD_CHECK_INTERVAL`
`DEFAULT_TARGET_PARTITION_SIZE`

Classes¶

HiveToDruidOperator

Moves data from Hive to Druid.

Module Contents¶

airflow.providers.apache.druid.transfers.hive_to_druid.LOAD_CHECK_INTERVAL = 5[source]¶

airflow.providers.apache.druid.transfers.hive_to_druid.DEFAULT_TARGET_PARTITION_SIZE = 5000000[source]¶

class airflow.providers.apache.druid.transfers.hive_to_druid.HiveToDruidOperator(*, sql, druid_datasource, ts_dim, metric_spec=None, hive_cli_conn_id='hive_cli_default', druid_ingest_conn_id='druid_ingest_default', metastore_conn_id='metastore_default', hadoop_dependency_coordinates=None, intervals=None, num_shards=-1, target_partition_size=-1, query_granularity='NONE', segment_granularity='DAY', hive_tblproperties=None, job_properties=None, **kwargs)[source]¶

Bases: airflow.providers.common.compat.sdk.BaseOperator

Moves data from Hive to Druid.

[del]Note that, for now, the data is loaded into memory before being pushed to Druid, so this operator should be used only for small amounts of data.[/del]

Parameters:

sql (str) – SQL query to execute against the Hive database. (templated)
druid_datasource (str) – the Druid datasource to ingest into
ts_dim (str) – the timestamp dimension
metric_spec (list[Any] | None) – the metrics you want to define for your data
hive_cli_conn_id (str) – the hive connection id
druid_ingest_conn_id (str) – the druid ingest connection id
metastore_conn_id (str) – the metastore connection id
hadoop_dependency_coordinates (list[str] | None) – list of coordinates to squeeze into the ingest JSON
intervals (list[Any] | None) – list of time intervals that define segments; this is passed as is to the JSON object. (templated)
num_shards (float) – Directly specify the number of shards to create.
target_partition_size (int) – Target number of rows to include in a partition.
query_granularity (str) – The minimum granularity to be able to query results at and the granularity of the data inside the segment. E.g. a value of “minute” will mean that data is aggregated at minutely granularity. That is, if there are collisions in the tuple (minute(timestamp), dimensions), then it will aggregate values together using the aggregators instead of storing individual rows. A granularity of ‘NONE’ means millisecond granularity.
segment_granularity (str) – The granularity to create time chunks at. Multiple segments can be created per time chunk. For example, with ‘DAY’ segmentGranularity, the events of the same day fall into the same time chunk which can be optionally further partitioned into multiple segments based on other configurations and input size.
hive_tblproperties (dict[Any, Any] | None) – additional properties for tblproperties in hive for the staging table
job_properties (dict[Any, Any] | None) – additional properties for job
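As an illustration of how the parameters above fit together, here is a minimal sketch of a task definition. The table, column, and datasource names (`sales`, `event_time`, `country`, `revenue`, `sales_daily`) and the task id are hypothetical, and the metric entries assume Druid's standard `count` and `doubleSum` aggregator shapes. The operator import is deferred into a function so the keyword arguments can be inspected without Airflow installed.

```python
from typing import Any

# Hypothetical keyword arguments; every value here is an assumption for illustration.
HIVE_TO_DRUID_KWARGS: dict[str, Any] = {
    # Query run in Hive; its result is what gets loaded into Druid.
    "sql": "SELECT event_time, country, revenue FROM sales WHERE ds = '{{ ds }}'",
    "druid_datasource": "sales_daily",
    "ts_dim": "event_time",
    # Assumed to follow Druid's aggregator format.
    "metric_spec": [
        {"type": "count", "name": "row_count"},
        {"type": "doubleSum", "name": "revenue", "fieldName": "revenue"},
    ],
    "query_granularity": "MINUTE",
    "segment_granularity": "DAY",
    "target_partition_size": 5_000_000,
}


def build_task():
    """Create the operator; requires the apache-druid provider to be installed."""
    from airflow.providers.apache.druid.transfers.hive_to_druid import (
        HiveToDruidOperator,
    )

    return HiveToDruidOperator(
        task_id="hive_to_druid_sales",  # hypothetical task id
        **HIVE_TO_DRUID_KWARGS,
    )
```

Connection ids are left at their defaults here; in a real deployment they would need to match connections configured in Airflow.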

template_fields: collections.abc.Sequence[str] = ('sql', 'intervals')[source]¶

template_ext: collections.abc.Sequence[str] = ('.sql',)[source]¶

template_fields_renderers[source]¶

sql[source]¶

druid_datasource[source]¶

ts_dim[source]¶

intervals = ['{{ ds }}/{{ logical_date.add_days(1) | ds }}'][source]¶
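To make the default interval concrete, the sketch below reproduces its intended shape with the standard library only. It assumes the template renders to the logical date and the following day, each formatted as `YYYY-MM-DD` (the format of `ds`); the helper name `default_interval` is hypothetical and is not part of the provider.

```python
from datetime import date, timedelta


def default_interval(logical_date: date) -> str:
    """Return '<day>/<next day>' in ISO date format for a one-day interval."""
    next_day = logical_date + timedelta(days=1)
    return f"{logical_date.isoformat()}/{next_day.isoformat()}"


print(default_interval(date(2024, 1, 31)))  # 2024-01-31/2024-02-01
```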

num_shards = -1[source]¶

target_partition_size = -1[source]¶

query_granularity = 'NONE'[source]¶

segment_granularity = 'DAY'[source]¶
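The interaction between the two granularities can be sketched in plain Python. This is a conceptual illustration only, not how Druid is implemented: rows are grouped into time chunks by `segment_granularity`, and within each chunk, rows that collide on (truncated timestamp, dimensions) are aggregated. A simple sum stands in for the configured aggregators, and the helper names `truncate` and `roll_up` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def truncate(ts: datetime, granularity: str) -> datetime:
    """Floor a timestamp to the start of its minute or day."""
    if granularity == "MINUTE":
        return ts.replace(second=0, microsecond=0)
    if granularity == "DAY":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported granularity in this sketch: {granularity}")


def roll_up(rows, query_granularity="MINUTE", segment_granularity="DAY"):
    """Group rows into time chunks and sum values that share a rollup key.

    rows: iterable of (timestamp, dimensions_tuple, metric_value).
    Returns {chunk_start: {(truncated_ts, dimensions): summed_value}}.
    """
    chunks = defaultdict(lambda: defaultdict(float))
    for ts, dims, value in rows:
        chunk_start = truncate(ts, segment_granularity)
        rollup_key = (truncate(ts, query_granularity), dims)
        chunks[chunk_start][rollup_key] += value
    return {chunk: dict(agg) for chunk, agg in chunks.items()}


rows = [
    (datetime(2024, 1, 1, 10, 0, 5), ("US",), 1.0),
    (datetime(2024, 1, 1, 10, 0, 40), ("US",), 2.0),  # same minute: rolled up
    (datetime(2024, 1, 1, 10, 1, 0), ("US",), 4.0),   # next minute: separate row
    (datetime(2024, 1, 2, 0, 0, 0), ("US",), 8.0),    # next day: separate chunk
]
result = roll_up(rows)
```

With these four rows, the first two collapse into a single row for 10:00, and the last row lands in a separate day chunk.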

metric_spec[source]¶

hive_cli_conn_id = 'hive_cli_default'[source]¶

hadoop_dependency_coordinates = None[source]¶

druid_ingest_conn_id = 'druid_ingest_default'[source]¶

metastore_conn_id = 'metastore_default'[source]¶

hive_tblproperties[source]¶

job_properties = None[source]¶

execute(context)[source]¶

Derive when creating an operator.

The main method to execute the task. The context is the same dictionary that is used when rendering Jinja templates.

Refer to get_template_context for more context.

construct_ingest_query(static_path, columns)[source]¶

Build an ingest query for an HDFS TSV load.

Parameters:

static_path (str) – The path on HDFS where the data is located
columns (list[str]) – List of all the columns that are available
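To give a rough idea of what such an ingest query might look like, here is a simplified sketch of a Druid Hadoop (`index_hadoop`) spec for a static TSV path. It is not the provider's implementation: the function name `sketch_ingest_spec` is hypothetical, the spec omits granularity and tuning sections, and it assumes that dimensions are the columns that are neither the timestamp column nor the input field of a metric.

```python
from typing import Any


def sketch_ingest_spec(
    static_path: str,
    columns: list[str],
    ts_dim: str,
    metric_spec: list[dict[str, Any]],
    datasource: str,
) -> dict[str, Any]:
    """Build a reduced index_hadoop spec for TSV data at a static HDFS path."""
    # Columns consumed by metrics (assumption: identified by "fieldName").
    metric_fields = {m["fieldName"] for m in metric_spec if "fieldName" in m}
    # Assumption: everything else, except the timestamp, is a dimension.
    dimensions = [c for c in columns if c != ts_dim and c not in metric_fields]
    return {
        "type": "index_hadoop",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "metricsSpec": metric_spec,
                "parser": {
                    "type": "string",
                    "parseSpec": {
                        "format": "tsv",
                        "columns": columns,
                        "timestampSpec": {"column": ts_dim, "format": "auto"},
                        "dimensionsSpec": {"dimensions": dimensions},
                    },
                },
            },
            "ioConfig": {
                "type": "hadoop",
                "inputSpec": {"type": "static", "paths": static_path},
            },
        },
    }
```

Passing the full column list alongside the derived dimensions mirrors the TSV format's need to know the order of every column in the file, including the timestamp and metric inputs.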