Skip to main content

Apache Hive setup

  • Maintained by: Cloudera
  • Authors: Cloudera
  • GitHub repo: cloudera/dbt-hive
  • PyPI package: dbt-hive
  • Slack channel: #db-hive
  • Supported dbt Core version: v1.1.0 and newer
  • dbt Cloud support: Not Supported
  • Minimum data platform version: n/a

Installing dbt-hive

Use pip to install the adapter. Before 1.8, installing the adapter would automatically install dbt-core and any additional dependencies. Beginning in 1.8, installing an adapter does not automatically install dbt-core. This is because adapters and dbt Core versions have been decoupled from each other so we no longer want to overwrite existing dbt-core installations. Use the following command for installation:

python -m pip install dbt-core dbt-hive

Configuring dbt-hive

For Hive-specific configuration, please refer to Hive configs.

Connection Methods

dbt-hive can connect to Apache Hive and Cloudera Data Platform clusters. The Impyla library is used to establish connections to Hive.

dbt-hive supports two transport mechanisms:

  • binary
  • HTTP(S)

The default mechanism is binary. To use HTTP transport, use the boolean option. For example, use_http_transport: true.

Authentication Methods

dbt-hive supports two authentication mechanisms:

  • insecure No authentication is used, only recommended for testing.
  • ldap Authentication via LDAP

Insecure

This method is only recommended if you have a local install of Hive and want to test out the dbt-hive adapter.

~/.dbt/profiles.yml
your_profile_name:
target: dev
outputs:
dev:
type: hive
host: localhost
port: PORT # default value: 10000
schema: SCHEMA_NAME

LDAP

LDAP allows you to authenticate with a username and password when Hive is configured with LDAP Auth. LDAP is supported over Binary & HTTP connection mechanisms.

This is the recommended authentication mechanism to use with Cloudera Data Platform (CDP).

~/.dbt/profiles.yml
your_profile_name:
target: dev
outputs:
dev:
type: hive
host: HOST_NAME
http_path: YOUR/HTTP/PATH # optional, http path to Hive default value: None
port: PORT # default value: 10000
auth_type: ldap
use_http_transport: BOOLEAN # default value: true
use_ssl: BOOLEAN # TLS should always be used with LDAP to ensure secure transmission of credentials, default value: true
username: USERNAME
password: PASSWORD
schema: SCHEMA_NAME

Note: When creating workload user in CDP, make sure the user has CREATE, SELECT, ALTER, INSERT, UPDATE, DROP, INDEX, READ and WRITE permissions. If you need the user to execute GRANT statements, you should also configure the appropriate GRANT permissions for them. When using Apache Ranger, permissions for allowing GRANT are typically set using "Delegate Admin" option. For more information, see grants and on-run-start & on-run-end.

Kerberos

The Kerberos authentication mechanism uses GSSAPI to share Kerberos credentials when Hive is configured with Kerberos Auth.

~/.dbt/profiles.yml
your_profile_name:
target: dev
outputs:
dev:
type: hive
host: HOSTNAME
port: PORT # default value: 10000
auth_type: GSSAPI
kerberos_service_name: KERBEROS_SERVICE_NAME # default value: None
use_http_transport: BOOLEAN # default value: true
use_ssl: BOOLEAN # TLS should always be used to ensure secure transmission of credentials, default value: true
schema: SCHEMA_NAME

Note: A typical setup of Cloudera Private Cloud will involve the following steps to setup Kerberos before one can execute dbt commands:

  • Get the correct realm config file for your installation (krb5.conf)
  • Set environment variable to point to the config file (export KRB5_CONFIG=/path/to/krb5.conf)
  • Set correct permissions for config file (sudo chmod 644 /path/to/krb5.conf)
  • Obtain keytab using kinit (kinit username@YOUR_REALM.YOUR_DOMAIN)
  • The keytab is valid for certain period after which you will need to run kinit again to renew validity of the keytab.
  • User will need CREATE, DROP, INSERT permissions on the schema provided in profiles.yml

Instrumentation

By default, the adapter will collect instrumentation events to help improve functionality and understand bugs. If you want to specifically switch this off, for instance, in a production environment, you can explicitly set the flag usage_tracking: false in your profiles.yml file.

Installation and Distribution

dbt's adapter for Apache Hive is managed in its own repository, dbt-hive. To use it, you must install the dbt-hive plugin.

Using pip

The following commands will install the latest version of dbt-hive as well as the requisite version of dbt-core and impyla driver used for connections.

python -m pip install dbt-hive

Supported Functionality

NameSupported
Materialization: TableYes
Materialization: ViewYes
Materialization: Incremental - AppendYes
Materialization: Incremental - Insert+OverwriteYes
Materialization: Incremental - MergeNo
Materialization: EphemeralNo
SeedsYes
TestsYes
SnapshotsNo
DocumentationYes
Authentication: LDAPYes
Authentication: KerberosYes
0