Installing the Databricks Native App
Prerequisites
Running Kumo as a Databricks native application requires the following:
- A dedicated service principal in your Databricks workspace for Kumo to use
- A dedicated all-purpose compute cluster with “Shared” access mode (version 14.3 LTS or later), with “Can Manage” permissions assigned to the Kumo service principal
- The cluster above must be sized for the amount of data you expect Kumo to process (e.g., auto-scaling to a maximum of eight 4-core, 16 GB instances for data volumes under 100 GB)
- A dedicated small Serverless SQL warehouse, with “Can manage” permissions assigned to the Kumo service principal
- Unity Catalog table access assigned to the Kumo service principal
- A dedicated Unity Catalog Volume that the Kumo service principal can read from and write to. All Kumo-generated data will be stored in this volume.
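As a rough illustration of the cluster prerequisite, the sketch below builds a request body in the shape accepted by the Databricks Clusters API. The cluster name, the `m5.xlarge` node type (an AWS instance with 4 vCPUs and 16 GiB), and the minimum worker count are assumptions chosen for the example and are not specified by Kumo; adjust them to your cloud and workload.

```python
import json

def kumo_cluster_spec(max_workers: int = 8) -> dict:
    """Build an illustrative request body for creating the dedicated cluster."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    return {
        "cluster_name": "kumo-dedicated",        # hypothetical name
        "spark_version": "14.3.x-scala2.12",     # runtime key for 14.3 LTS
        "node_type_id": "m5.xlarge",             # assumed AWS type: 4 vCPU, 16 GiB
        "autoscale": {"min_workers": 1, "max_workers": max_workers},
        "data_security_mode": "USER_ISOLATION",  # API value for "Shared" access mode
    }

if __name__ == "__main__":
    print(json.dumps(kumo_cluster_spec(), indent=2))
```

The function only assembles the payload; submitting it, and assigning “Can Manage” to the service principal, is left to whichever tool you already use to manage the workspace.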
After creating these Databricks resources, you will need to share the following information with Kumo in order to create the environment:
- Your Databricks workspace host URL
- The cluster ID for the dedicated, all-purpose compute cluster
- The warehouse ID for the dedicated, serverless SQL warehouse
- The name of the catalog and schema containing the tables to be accessed by Kumo
- The UC Volume path for the dedicated Unity Catalog Volume
- The Kumo service principal client ID and secret
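Before sending these values, it can help to collect them in one place and run a basic sanity check. The sketch below is one way to do that; the class name `KumoEnvironmentInfo`, its field names, and the checks are hypothetical and not a format that Kumo prescribes. The only format assumed is that UC Volume paths look like `/Volumes/<catalog>/<schema>/<volume>`.

```python
from dataclasses import dataclass, fields
from typing import List
from urllib.parse import urlparse

@dataclass(frozen=True)
class KumoEnvironmentInfo:
    """Hypothetical container for the values listed above."""
    workspace_host: str
    cluster_id: str
    warehouse_id: str
    catalog: str
    schema: str
    volume_path: str
    client_id: str
    client_secret: str

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means no issue was found."""
        issues = []
        # Every value must be non-empty.
        for f in fields(self):
            if not getattr(self, f.name).strip():
                issues.append(f"{f.name} is empty")
        # The host should be a full https URL.
        parsed = urlparse(self.workspace_host)
        if parsed.scheme != "https" or not parsed.netloc:
            issues.append("workspace_host should be a full https:// URL")
        # UC Volume paths start with /Volumes/ followed by three name parts.
        parts = self.volume_path.strip("/").split("/")
        if not self.volume_path.startswith("/Volumes/") or len(parts) < 4:
            issues.append("volume_path should look like /Volumes/<catalog>/<schema>/<volume>")
        return issues
```

For example, `KumoEnvironmentInfo(...).problems()` returns an empty list when every field is filled in, the host is an `https://` URL, and the volume path has the expected shape.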
Required Catalog Permissions
The following table illustrates the list of permissions you should grant to the Kumo service principal:
Specifically:
- USE_CATALOG, USE_SCHEMA, EXECUTE, SELECT, CREATE_FUNCTION are needed on the catalog-schema containing tables to be read by Kumo
- USE_CATALOG, USE_SCHEMA, EXECUTE, MODIFY, SELECT, CREATE_FUNCTION, CREATE_TABLE are needed on the catalog-schema in which Kumo writes the batch prediction tables
- USE_CATALOG, USE_SCHEMA, READ_VOLUME, WRITE_VOLUME are needed on the catalog-schema in which Kumo writes intermediate data into your UC Volume
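The three permission sets above can be turned into SQL `GRANT` statements with a small helper. This is a sketch under two assumptions: that the SQL spelling of each privilege is the listed name with underscores replaced by spaces (for example `USE CATALOG`), and that granting at the schema level is acceptable in your workspace because Unity Catalog privileges are inherited by the objects in the schema. The function name and the example catalog, schema, and principal values are hypothetical.

```python
READ_PRIVS = ["USE_CATALOG", "USE_SCHEMA", "EXECUTE", "SELECT", "CREATE_FUNCTION"]
WRITE_PRIVS = ["USE_CATALOG", "USE_SCHEMA", "EXECUTE", "MODIFY", "SELECT",
               "CREATE_FUNCTION", "CREATE_TABLE"]
VOLUME_PRIVS = ["USE_CATALOG", "USE_SCHEMA", "READ_VOLUME", "WRITE_VOLUME"]

def grant_statements(privileges, catalog, schema, principal):
    """Return GRANT statements: USE CATALOG on the catalog, the rest on the schema."""
    statements = []
    schema_privs = []
    for priv in privileges:
        sql_name = priv.replace("_", " ")  # assumed SQL spelling of the privilege
        if priv == "USE_CATALOG":
            statements.append(
                f"GRANT {sql_name} ON CATALOG `{catalog}` TO `{principal}`;"
            )
        else:
            schema_privs.append(sql_name)
    if schema_privs:
        statements.append(
            f"GRANT {', '.join(schema_privs)} ON SCHEMA `{catalog}`.`{schema}` "
            f"TO `{principal}`;"
        )
    return statements

if __name__ == "__main__":
    # Hypothetical names; substitute your own catalog, schema, and client ID.
    for stmt in grant_statements(VOLUME_PRIVS, "main", "kumo", "SERVICE_PRINCIPAL_ID"):
        print(stmt)
```

Review the generated statements before running them in the Databricks SQL editor, since the right securable level (catalog, schema, or individual object) depends on how your data is organized.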
Additional Steps
After creating the above Databricks resources, the following additional steps are needed to set up your Kumo native app for Databricks:
JAR file installation
The following JAR files must be installed as libraries on the all-purpose cluster:
- s3://kumo-pyspark-venv/databricks-jars/feature_proto_scala.jar
- s3://kumo-pyspark-venv/databricks-jars/lenses_sjs1_2.12-0.11.11.jar
- s3://kumo-pyspark-venv/databricks-jars/protobuf-java-3.19.4.jar
- s3://kumo-pyspark-venv/databricks-jars/scalapb-runtime_2.12-0.11.11.jar
- s3://kumo-pyspark-venv/databricks-jars/sst_source_0.0.1.jar
If specifying S3 paths is not possible (e.g., for Azure Databricks), these JAR files can also be downloaded and uploaded into the Unity Catalog Volume. The libraries can then be installed from the UC Volume path.
If you encounter errors like “Jars and Maven ... must be on allowlist,” follow the Databricks instructions for adding the library path (S3 or UC Volume path) to the shared compute allowlist.
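If you prefer to script the installation rather than use the cluster UI, the sketch below builds a request body in the shape used by the Databricks Libraries API install endpoint, and shows how the same file names map onto a UC Volume directory for the Azure case. The helper names and the example volume directory are hypothetical; the JAR paths are the ones listed above.

```python
KUMO_JARS = [
    "s3://kumo-pyspark-venv/databricks-jars/feature_proto_scala.jar",
    "s3://kumo-pyspark-venv/databricks-jars/lenses_sjs1_2.12-0.11.11.jar",
    "s3://kumo-pyspark-venv/databricks-jars/protobuf-java-3.19.4.jar",
    "s3://kumo-pyspark-venv/databricks-jars/scalapb-runtime_2.12-0.11.11.jar",
    "s3://kumo-pyspark-venv/databricks-jars/sst_source_0.0.1.jar",
]

def volume_jar_paths(volume_dir):
    """Map each JAR to the same file name under a UC Volume directory."""
    base = volume_dir.rstrip("/")
    return [f"{base}/{path.rsplit('/', 1)[1]}" for path in KUMO_JARS]

def library_install_request(cluster_id, jar_paths=None):
    """Build the body for a library install request on one cluster."""
    paths = KUMO_JARS if jar_paths is None else jar_paths
    return {"cluster_id": cluster_id, "libraries": [{"jar": p} for p in paths]}
```

Calling `library_install_request("<cluster-id>", volume_jar_paths("/Volumes/<catalog>/<schema>/<volume>/jars"))` produces the UC Volume variant, assuming you have already uploaded the five files to that directory.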
Grant Additional Permissions to Kumo Service Principal
Part of the compute pushed down by Kumo requires additional permissions to execute in your Databricks all-purpose cluster. Specifically, the spark_partition_id built-in function requires the following step:
Open Databricks SQL editor and execute the following query:
GRANT SELECT ON ANONYMOUS FUNCTION TO `SERVICE_PRINCIPAL_ID`;
SERVICE_PRINCIPAL_ID should be replaced with the client ID of the Kumo service principal created above.
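To avoid pasting a malformed ID into the query, a small helper can perform the substitution. This sketch assumes the service principal client ID is a UUID, which is how Databricks application IDs are normally formatted; if yours is not, drop the check. The function name is hypothetical.

```python
import uuid

def anonymous_function_grant(client_id: str) -> str:
    """Return the GRANT statement with the client ID substituted in."""
    # Assumption: the client ID is a UUID. uuid.UUID raises ValueError otherwise.
    uuid.UUID(client_id)
    return f"GRANT SELECT ON ANONYMOUS FUNCTION TO `{client_id}`;"
```

The returned string can be pasted into the Databricks SQL editor as described above.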
Databricks PrivateLink Requirements
If you have Databricks PrivateLink enabled for your workspace, add the following IP addresses to your allowlist to enable Kumo VPC communication with your Databricks workspace. This allows Kumo to communicate with your Databricks workspace to perform pushdown compute.
52.36.237.40
52.26.42.73
54.187.89.42
35.161.130.246
To enable UC Volume support, add the following four IP addresses to your allowlist. This enables Kumo GPU instances to access files in your UC volume when performing training and batch prediction.
34.209.103.243
44.241.45.152
54.213.250.99
44.226.59.243
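If your allowlist is managed as a Databricks workspace IP access list, the two groups of addresses above could be expressed as request bodies like the ones this sketch builds. That is an assumption about your setup: your allowlist may instead live in firewall or security group rules, in which case only the validated address lists are relevant. The labels and function name are hypothetical.

```python
import ipaddress

PUSHDOWN_IPS = ["52.36.237.40", "52.26.42.73", "54.187.89.42", "35.161.130.246"]
UC_VOLUME_IPS = ["34.209.103.243", "44.241.45.152", "54.213.250.99", "44.226.59.243"]

def ip_access_list_request(label, ips):
    """Validate each address and build an ALLOW-list body with /32 CIDR entries."""
    # ip_network raises ValueError for malformed addresses; a bare IPv4
    # address is normalized to a single-host /32 network.
    cidrs = [str(ipaddress.ip_network(ip)) for ip in ips]
    return {"label": label, "list_type": "ALLOW", "ip_addresses": cidrs}

if __name__ == "__main__":
    print(ip_access_list_request("kumo-pushdown", PUSHDOWN_IPS))
    print(ip_access_list_request("kumo-uc-volume", UC_VOLUME_IPS))
```

Keeping the two groups in separate lists makes it easy to enable pushdown compute first and add UC Volume access later.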
Data Sharing for Unity Catalog Volume
Kumo engineers helping to onboard your use cases may require temporary access to the data generated by Kumo in your UC Volume. The data stored in your UC Volume is intermediate data generated by Kumo and does not contain raw data from the tables shared with Kumo. You should therefore create a dedicated UC Volume for use by Kumo to make sharing data in these cases easier.
Databricks provides this sharing capability via Delta Sharing. When Kumo requests sharing of the data, you can follow these steps to temporarily grant access to the data in your UC Volume, and revoke the access afterwards.
- Enable Delta sharing for your metastore if not already enabled
- Add Kumo as a Delta sharing recipient. Kumo is a Databricks recipient, and our recipient ID is aws:us-west-2:b7ab6fd5-7ee2-4fee-853c-d4e3716c4c01
- Create a Delta sharing object for the UC Volume to be shared with Kumo.
- Kumo’s access to the UC Volume can be managed from Databricks.
More options are available in the Databricks documentation for things such as auditing Delta sharing.
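The grant-then-revoke flow described above can be sketched as a set of Databricks SQL statements. The helper below generates them; the share name `kumo_share`, the recipient name `kumo`, and the function name are hypothetical, and the statements assume Delta Sharing is already enabled on the metastore and that your workspace supports adding volumes to a share. The recipient ID is the one given above.

```python
KUMO_RECIPIENT_ID = "aws:us-west-2:b7ab6fd5-7ee2-4fee-853c-d4e3716c4c01"

def sharing_statements(volume, share="kumo_share", recipient="kumo"):
    """Return (grant, revoke) lists of SQL statements for sharing one UC Volume.

    `volume` is the three-part name <catalog>.<schema>.<volume>.
    """
    grant = [
        f"CREATE RECIPIENT IF NOT EXISTS {recipient} USING ID '{KUMO_RECIPIENT_ID}';",
        f"CREATE SHARE IF NOT EXISTS {share};",
        f"ALTER SHARE {share} ADD VOLUME {volume};",
        f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient};",
    ]
    # Run this once Kumo no longer needs access.
    revoke = [f"REVOKE SELECT ON SHARE {share} FROM RECIPIENT {recipient};"]
    return grant, revoke

if __name__ == "__main__":
    grant, revoke = sharing_statements("main.kumo.kumo_data")  # hypothetical volume
    print("\n".join(grant + revoke))
```

Separating the grant and revoke statements mirrors the temporary nature of the access: run the first list when Kumo requests the data and the second when onboarding support is finished.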
Support
If you need assistance, reach out to your Kumo customer support representative or email us at [email protected]