Table Types
Kumo makes accurate predictions by learning patterns in your historical data that can be projected into the future. This requires mitigating the common ML problem of data leakage, in which the model has access at training time to data that it will not have when making predictions on unseen data. Leakage causes the model to perform worse in production than it did during training.
To avoid learning the wrong patterns because of data leakage, Kumo must establish an accurate view of what information was actually known/true at a particular point in time. You must therefore provide Kumo with historical data that conforms to either the Dimension Table or the Fact Table format.
Dimension Tables
Each row in a Dimension Table must have a unique primary key (located in a designated primary key column), with all row values assumed constant over time, both historically and into the future.
Some common dimension tables include:
- A table of fixed demographic attributes of users (e.g., dietary requirements)
- A table of fixed business attributes (e.g., industry sector, founding date)
Considerations
- Data in each row is assumed to be constant over time. Do not include any values that actually change over time (e.g., "date_updated", "last_invoice_date", etc.), as they could impact your model's quality.
- Each row should represent a separate entity with a unique primary key. If multiple rows have the same primary key value, only one of the rows will be chosen and the rest will be dropped.
- You can have either zero or one time column—if a time column is used, it should only represent the entity's creation date.
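Because rows with a repeated primary key are dropped, it can help to check for duplicates before handing a table over. The following is a minimal sketch using pandas. The `users` table, its column names, and the `duplicated_keys` helper are hypothetical and not part of Kumo.

```python
import pandas as pd

# Hypothetical dimension table: one row per user, keyed by "user_id".
users = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "dietary_requirement": ["vegan", "none", "halal", "none"],
})

def duplicated_keys(table: pd.DataFrame, primary_key: str) -> pd.DataFrame:
    """Return every row whose primary key value appears more than once."""
    return table[table.duplicated(subset=primary_key, keep=False)]

conflicts = duplicated_keys(users, "user_id")
print(conflicts)

# Resolve duplicates explicitly instead of leaving the choice to chance.
# Keeping the first occurrence is only one possible policy (an assumption here).
users_clean = users.drop_duplicates(subset="user_id", keep="first")
```

Reviewing `conflicts` before deduplicating lets you decide which row is the correct one for each entity.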
Guidelines
- ❌ DO NOT use aggregate statistics (e.g., sum of views since the product was launched) that may only hold true at a particular point in time.
- ✅ DO use these in a fact table with dates for each observation, aggregated in a backward-looking fashion.
- ❌ DO NOT include columns with information only known at a certain time, and not consistently true/known across all timeframes (e.g., fields like "time of first order", "time of the first invoice", "time of latest membership renewal").
- ✅ DO extract these time-dependent columns into fact tables.
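To illustrate the guidelines above, the sketch below keeps raw view events in a fact-style table and computes a backward-looking count as of a chosen cutoff, rather than storing a running total on the dimension table. The `views` table, its columns, and the `views_as_of` helper are hypothetical. This shows the general idea only and does not describe how Kumo computes aggregates internally.

```python
import pandas as pd

# Hypothetical fact table: one row per product view, with its own timestamp.
views = pd.DataFrame({
    "product_id": [10, 10, 11, 10],
    "viewed_at": pd.to_datetime(
        ["2024-01-01", "2024-01-05", "2024-01-06", "2024-02-01"]
    ),
})

def views_as_of(events: pd.DataFrame, cutoff: pd.Timestamp) -> pd.Series:
    """Count views per product using only events known at or before the cutoff."""
    known = events[events["viewed_at"] <= cutoff]
    return known.groupby("product_id").size()

# The view on 2024-02-01 is excluded because it happened after the cutoff.
counts = views_as_of(views, pd.Timestamp("2024-01-10"))
print(counts)
```

The same events yield a different count for every cutoff, which is why a single "total views" column on a dimension table cannot be correct for all points in time.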
Whenever possible, tables should include a DATE/TIME/TIMESTAMP column that indicates the creation time of the entity represented in each row. This helps Kumo use the right historical data when recreating the past accurately to extrapolate predictions into the future. For example, a streaming video provider may have a table of subscribers with a column holding the dates each subscriber joined the platform.
Note: time columns are optional and should only be used if they provide useful temporal information about the entity. Only one time column is allowed.
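As a rough illustration of why a creation-time column is useful, the sketch below selects only the entities that already existed at a given historical cutoff. The `subscribers` table, its columns, and the `existing_as_of` helper are hypothetical. Kumo's own handling of the time column may differ.

```python
import pandas as pd

# Hypothetical dimension table with a single time column: the join date.
subscribers = pd.DataFrame({
    "subscriber_id": [1, 2, 3],
    "plan_region": ["EU", "US", "US"],
    "joined_at": pd.to_datetime(["2023-06-01", "2024-01-15", "2024-04-01"]),
})

def existing_as_of(table: pd.DataFrame, time_column: str, cutoff: str) -> pd.DataFrame:
    """Return only the entities created at or before the cutoff."""
    return table[table[time_column] <= pd.Timestamp(cutoff)]

# Subscriber 3 joined after the cutoff, so it is not part of this historical view.
known = existing_as_of(subscribers, "joined_at", "2024-02-01")
print(known)
```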
Fact Tables
Each row in a Fact Table must represent a fact (i.e., an event or observation) that became true/known at a specific point in time, indicated by an authoritative time column.
Some common fact tables include:
- A timestamped table of purchase transactions
- A timestamped table of user website views
Considerations
- Fact table rows are treated as logs. Once a row is created, entries in the columns should NOT be updated.
- Rows missing a time column value will be ignored.
- All rows are assumed to be unique and not duplicates. Kumo processes all rows.
- Kumo does not take into account significant (e.g., multi-day) gaps between an event's actual occurrence date and its ingestion date.
- Information in the row is assumed to be known at the time indicated in the time column. Do not include columns whose values were not yet known at that timestamp (e.g., an aggregate statistic generated over the next N days), as this may reduce your pQuery's accuracy when used in production.
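Two of the considerations above (rows without a time value are ignored, and duplicates are not removed) can be checked ahead of time. The following is a minimal sketch with pandas. The `purchases` table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical fact table of purchases. Row 1 repeats row 0 exactly,
# and row 2 has no timestamp.
purchases = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "amount": [9.99, 9.99, 20.0, 5.0],
    "purchased_at": pd.to_datetime(
        ["2024-03-01 10:00", "2024-03-01 10:00", None, "2024-03-02 08:30"]
    ),
})

# Rows with a missing time value would be ignored, so count them up front.
missing_time = int(purchases["purchased_at"].isna().sum())

# Exact duplicate rows are not removed for you, so count them as well.
exact_duplicates = int(purchases.duplicated().sum())

print(f"rows missing a timestamp: {missing_time}")
print(f"exact duplicate rows: {exact_duplicates}")

# One possible cleanup: drop duplicates and rows without a timestamp.
clean = purchases.drop_duplicates().dropna(subset=["purchased_at"])
```

Whether an exact duplicate is an error or two genuine events depends on your data, so the cleanup step is a judgment call rather than a fixed rule.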
Guidelines
- ❌ DO NOT update values in an existing fact table row. For example, you might have a fact table storing the subscription status of your customers, where updates are made to the entry if a subscription status changes.
- ✅ DO store information as "logs". In the example of a subscription status table, instead of updating an existing entry, you can create a new entry whenever a customer changes subscription status.
- ❌ DO NOT include values that are not known to be true at the timestamp given by the time column (e.g., "whether an order was eventually refunded/canceled", "whether a subscription was renewed").
- ✅ DO make sure the timestamp in the time column is always later than any other time/date values stored in the same row.
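The "logs" approach from the subscription example above can be sketched as follows: every status change is appended as a new row, and the status at any past moment is recovered from the rows known by then. The `status_log` table, its columns, and both helper functions are hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Hypothetical fact table: one row per subscription status change.
status_log = pd.DataFrame({
    "customer_id": [7, 7, 8],
    "status": ["trial", "paid", "trial"],
    "changed_at": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15"]),
})

def record_change(log: pd.DataFrame, customer_id: int, status: str, changed_at: str) -> pd.DataFrame:
    """Append a new row for a status change; existing rows are never modified."""
    new_row = pd.DataFrame({
        "customer_id": [customer_id],
        "status": [status],
        "changed_at": [pd.Timestamp(changed_at)],
    })
    return pd.concat([log, new_row], ignore_index=True)

def status_as_of(log: pd.DataFrame, cutoff: str) -> pd.Series:
    """Return each customer's most recent status known at or before the cutoff."""
    known = log[log["changed_at"] <= pd.Timestamp(cutoff)]
    latest = known.sort_values("changed_at").groupby("customer_id").tail(1)
    return latest.set_index("customer_id")["status"]

status_log = record_change(status_log, 7, "canceled", "2024-03-01")

january = status_as_of(status_log, "2024-01-20")
march = status_as_of(status_log, "2024-03-05")
print(january)
print(march)
```

Because earlier rows are never overwritten, the January view still shows customer 7 on a trial even after the later cancellation has been recorded.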
Raw Tables
Raw tables are tables that don't conform to either the fact table or the dimension table format.
Guidelines
- ❌ DO NOT use this table type if you later intend to use the table for Kumo graph creation, as raw tables can only be used to create Kumo views.
- ✅ DO use this table type if you need to add a table without any constraints.