Apache Hive setup
- Maintained by: Cloudera
- Authors: Cloudera
- GitHub repo: cloudera/dbt-hive
- PyPI package:
dbt-hive
- Slack channel: #db-hive
- Supported dbt Core version: v1.1.0 and newer
- dbt Cloud support: Not Supported
- Minimum data platform version: n/a
Installing dbt-hive
Use pip
to install the adapter. Before 1.8, installing the adapter would automatically install dbt-core
and any additional dependencies. Beginning in 1.8, installing an adapter does not automatically install dbt-core
. This is because adapters and dbt Core versions have been decoupled from each other so we no longer want to overwrite existing dbt-core installations.
Use the following command for installation:
python -m pip install dbt-core dbt-hive
Configuring dbt-hive
For Hive-specific configuration, please refer to Hive configs.
Connection Methods
dbt-hive can connect to Apache Hive and Cloudera Data Platform clusters. The Impyla library is used to establish connections to Hive.
dbt-hive supports two transport mechanisms:
- binary
- HTTP(S)
The default mechanism is binary
. To use HTTP transport, use the boolean option. For example, use_http_transport: true
.
Authentication Methods
dbt-hive supports two authentication mechanisms:
Insecure
This method is only recommended if you have a local install of Hive and want to test out the dbt-hive adapter.
your_profile_name:
target: dev
outputs:
dev:
type: hive
host: localhost
port: PORT # default value: 10000
schema: SCHEMA_NAME
LDAP
LDAP allows you to authenticate with a username and password when Hive is configured with LDAP Auth. LDAP is supported over Binary & HTTP connection mechanisms.
This is the recommended authentication mechanism to use with Cloudera Data Platform (CDP).
your_profile_name:
target: dev
outputs:
dev:
type: hive
host: HOST_NAME
http_path: YOUR/HTTP/PATH # optional, http path to Hive default value: None
port: PORT # default value: 10000
auth_type: ldap
use_http_transport: BOOLEAN # default value: true
use_ssl: BOOLEAN # TLS should always be used with LDAP to ensure secure transmission of credentials, default value: true
username: USERNAME
password: PASSWORD
schema: SCHEMA_NAME
Note: When creating workload user in CDP, make sure the user has CREATE, SELECT, ALTER, INSERT, UPDATE, DROP, INDEX, READ and WRITE permissions. If you need the user to execute GRANT statements, you should also configure the appropriate GRANT permissions for them. When using Apache Ranger, permissions for allowing GRANT are typically set using "Delegate Admin" option. For more information, see grants
and on-run-start & on-run-end.
Kerberos
The Kerberos authentication mechanism uses GSSAPI to share Kerberos credentials when Hive is configured with Kerberos Auth.
your_profile_name:
target: dev
outputs:
dev:
type: hive
host: HOSTNAME
port: PORT # default value: 10000
auth_type: GSSAPI
kerberos_service_name: KERBEROS_SERVICE_NAME # default value: None
use_http_transport: BOOLEAN # default value: true
use_ssl: BOOLEAN # TLS should always be used to ensure secure transmission of credentials, default value: true
schema: SCHEMA_NAME
Note: A typical setup of Cloudera Private Cloud will involve the following steps to setup Kerberos before one can execute dbt commands:
- Get the correct realm config file for your installation (krb5.conf)
- Set environment variable to point to the config file (export KRB5_CONFIG=/path/to/krb5.conf)
- Set correct permissions for config file (sudo chmod 644 /path/to/krb5.conf)
- Obtain keytab using kinit (kinit username@YOUR_REALM.YOUR_DOMAIN)
- The keytab is valid for certain period after which you will need to run kinit again to renew validity of the keytab.
- User will need CREATE, DROP, INSERT permissions on the schema provided in profiles.yml
Instrumentation
By default, the adapter will collect instrumentation events to help improve functionality and understand bugs. If you want to specifically switch this off, for instance, in a production environment, you can explicitly set the flag usage_tracking: false
in your profiles.yml
file.
Installation and Distribution
dbt's adapter for Apache Hive is managed in its own repository, dbt-hive. To use it,
you must install the dbt-hive
plugin.
Using pip
The following commands will install the latest version of dbt-hive
as well as the requisite version of dbt-core
and impyla
driver used for connections.
python -m pip install dbt-hive
Supported Functionality
Name | Supported |
---|---|
Materialization: Table | Yes |
Materialization: View | Yes |
Materialization: Incremental - Append | Yes |
Materialization: Incremental - Insert+Overwrite | Yes |
Materialization: Incremental - Merge | No |
Materialization: Ephemeral | No |
Seeds | Yes |
Tests | Yes |
Snapshots | No |
Documentation | Yes |
Authentication: LDAP | Yes |
Authentication: Kerberos | Yes |