Description
Kumo generates embeddings and predictions by sampling and aggregating features from a node’s local neighborhood. The num_neighbors parameter determines how Kumo samples these local subgraphs by specifying how many neighbors are sampled for each node in each iteration.
By default, num_neighbors is determined using the run_mode argument, and a two-hop subgraph is sampled. Different sampling strategies can be specified and tuned via AutoML:
num_neighbors:
  - hop1:
      default: 16
    hop2:
      default: 8
This configuration means that Kumo samples a maximum of 16 nodes for each primary key/foreign key connection in the first hop, and a maximum of 8 nodes for each primary key/foreign key connection in the second hop. Kumo supports sampling depths of up to 6 hops. Deeper subgraphs increase the runtime and memory requirements of the model, so consider specifying a smaller batch size when sampling more hops.
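To illustrate why deeper subgraphs cost more, the following sketch computes a worst-case count of sampled neighbors around one node. It assumes, purely for simplicity, a single primary key/foreign key connection per hop; `max_sampled_nodes` is a hypothetical helper and not part of Kumo.

```python
def max_sampled_nodes(fanouts):
    """Worst-case number of sampled neighbors around one seed node.

    Assumes a single connection per hop (an illustrative simplification).
    `fanouts` lists the per-hop sampling limits, e.g. [16, 8].
    """
    total = 0
    frontier = 1  # start from the seed node itself
    for fanout in fanouts:
        frontier *= fanout  # each frontier node samples up to `fanout` neighbors
        total += frontier
    return total


print(max_sampled_nodes([16, 8]))       # 16 + 16*8 = 144
print(max_sampled_nodes([16, 16, 16]))  # 16 + 256 + 4096 = 4368
```

Under this assumption the bound grows multiplicatively with each extra hop, which is why a smaller batch size may be needed for deeper sampling.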
Within each hop, you can also customize the number of neighbors to sample for specific connections. For example,
num_neighbors:
  - hop1:
      default: 16
      USERS.USER_ID->TRANSACTIONS.USER_ID: 128
    hop2:
      default: 8
      TRANSACTIONS.STORE_ID->STORES.STORE_ID: 0
means that Kumo samples a maximum of 128 transactions per user in the first hop, and samples no stores from these transactions in the second hop. This gives fine-grained control over how much importance specific connections receive.
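The lookup rule described here (a connection-specific value takes precedence over the hop's default) can be sketched with a plain dictionary. This only mirrors the configuration for illustration; `neighbor_limit` is a hypothetical helper, not a Kumo function.

```python
# Plain-dict mirror of the configuration above (a single option).
option = {
    "hop1": {"default": 16, "USERS.USER_ID->TRANSACTIONS.USER_ID": 128},
    "hop2": {"default": 8, "TRANSACTIONS.STORE_ID->STORES.STORE_ID": 0},
}


def neighbor_limit(option, hop, connection):
    """Return the connection-specific limit if present, else the hop default."""
    hop_cfg = option[hop]
    return hop_cfg.get(connection, hop_cfg["default"])


print(neighbor_limit(option, "hop1", "USERS.USER_ID->TRANSACTIONS.USER_ID"))     # 128
print(neighbor_limit(option, "hop2", "TRANSACTIONS.STORE_ID->STORES.STORE_ID"))  # 0
print(neighbor_limit(option, "hop2", "USERS.USER_ID->TRANSACTIONS.USER_ID"))     # 8
```

The last call falls back to the hop2 default because that connection has no override in the second hop.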
The default entry of a hop can be set to at most 128 neighbors, and an entry for a specific connection to at most 512.
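A configuration can be checked against these limits before it is submitted. The sketch below is one possible check, assuming the limits apply as read here (128 for a hop's default entry, 512 for a connection-specific entry, 6 hops at most). `validate_option` and the constant names are hypothetical and not part of Kumo.

```python
MAX_HOPS = 6          # maximum sampling depth stated in this section
MAX_DEFAULT = 128     # assumed limit for a hop's `default` entry
MAX_CONNECTION = 512  # assumed limit for a connection-specific entry


def validate_option(option):
    """Return a list of problems found in one num_neighbors option (dict of hops)."""
    problems = []
    if len(option) > MAX_HOPS:
        problems.append(f"{len(option)} hops exceeds the maximum of {MAX_HOPS}")
    for hop, entries in option.items():
        for key, value in entries.items():
            if value == "inferred":
                continue  # left for Kumo to resolve, nothing to check here
            limit = MAX_DEFAULT if key == "default" else MAX_CONNECTION
            if not isinstance(value, int) or not 0 <= value <= limit:
                problems.append(f"{hop}.{key}={value!r} is outside 0..{limit}")
    return problems


print(validate_option({"hop1": {"default": 16, "USERS.USER_ID->TRANSACTIONS.USER_ID": 128}}))  # []
print(validate_option({"hop1": {"default": 256}}))
```

The second call reports one problem, since a default of 256 is above the assumed per-hop limit.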
For temporal queries, the default model planner will give special treatment to the {entity}->{target} connection, e.g., in queries such as:
PREDICT COUNT(TRANSACTIONS.*, 0, 7) FOR EACH USERS.USER_ID
The model planner will then set num_neighbors as follows:
num_neighbors:
  - hop1:
      default: 16
      USERS.USER_ID->TRANSACTIONS.USER_ID: inferred
    hop2:
      default: 16
Here, the inferred value tells Kumo to determine an optimal value based on edge degree statistics between the entity table and the target table. The inferred option is currently only supported for the entity/target connection in the first hop. You can confirm the inferred value in the final model plan once training finishes.
Default Values
The default value of num_neighbors depends on run_mode.
# run_mode: FAST
num_neighbors:
  - hop1:
      default: 12
    hop2:
      default: 12

# run_mode: NORMAL
num_neighbors:
  - hop1:
      default: 16
    hop2:
      default: 16

# run_mode: BEST
num_neighbors:
  - hop1:
      default: 24
    hop2:
      default: 24
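For scripting around these defaults, the table above can be captured in a small lookup. This is only a sketch that reproduces the listed values as plain Python data; `DEFAULT_NEIGHBORS` and `default_num_neighbors` are hypothetical names, not part of Kumo.

```python
# Per-hop defaults listed above, keyed by run_mode.
DEFAULT_NEIGHBORS = {"FAST": 12, "NORMAL": 16, "BEST": 24}


def default_num_neighbors(run_mode):
    """Build the two-hop default option for a run_mode as plain Python data."""
    n = DEFAULT_NEIGHBORS[run_mode.upper()]
    return [{"hop1": {"default": n}, "hop2": {"default": n}}]


print(default_num_neighbors("normal"))
# [{'hop1': {'default': 16}, 'hop2': {'default': 16}}]
```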