PREDICT
Description
A pQuery starts with the PREDICT clause that defines a required specification of a “target” which represents the value you want to predict for an entity described in FOR EACH clause.
If you want to predict some aggregation of values over time (e.g. sum of purchase values over the next 30 days), then you must choose the proper aggregation formula and the column representing the value you want to aggregate. This column must come from a fact table that includes a foreign key column to the entity ID specified in FOR EACH clause.
Example 1
The syntax (and some examples) for this part of the query is as follows:
PREDICT <Aggregation_Function>(<target_table>.<column_name>, <start>, <end>)
PREDICT SUM(PURCHASES.amount, 0, 30)
PREDICT LIST_DISTINCT(PURCHASES.item_category, 0, 14)
You can use the same table as an entity table (the table used in FOR EACH clause) if you aim to estimate static values in the entity table.
Example 2
The syntax (and some examples) for this part of the query is as follows:
PREDICT <target_table>.<column_name>
PREDICT CUSTOMERS.industry
When creating queries that involve the aggregation of values over time, allowable aggregation functions are:
- Sum (numerical columns):
SUM
- Average (numerical columns):
AVG
- Maximum (numerical columns):
MAX
- Minimum (numerical columns):
MIN
- Count (all columns):
COUNT
- Cardinality/count of distinct values (all columns):
COUNT_DISTINCT
- List of all distinct values (all columns):
LIST_DISTINCT
- The latest/most recent value in a specified timeframe (all columns except foreign keys):
LAST
- The oldest/least recent value in a specified timeframe (all columns except foreign keys):
FIRST
Note for LIST_DISTINCT
for a smooth user experience we only recommend using it if the number of possible values to return is less than 1000.
Within the aggregation function inputs, the start
and end
parameters refer to the time period you want to aggregate facts across, as calculated in days. For example: 10
for start
and 30
for end
implies that you want to aggregate across all facts from 10 days later (excluding the 10th day) to 30 days later (including the 30th day). Both start
and end
should be non-negative integers and end
should be strictly greater than start
.
In general for the target formula, you should input a facts table that has historic fact data over a timeframe that is at least an order of magnitude more than the prediction horizon in your target formula. For example, if you are predicting over the next 30 days, especially to take into account seasonality, you'd probably be best off including at least 1 year of rows in your fact table.
Note it is also optionally possible to convert your prediction target into a conditional statement prediction problem (e.g. a prediction of whether a particular target expression is true or false) by adding a comparison operator.
Example 3
The syntax (and some examples) for this part of the query is as follows:
PREDICT <target_expression> <comparison_operator> <constant>
PREDICT SUM(PURCHASES.amount, 0, 30) > 10
PREDICT LAST(STATUS.status, 0, 90) = 'INACTIVE'
Here, allowable comparison operators include:
- Greater than (numeric, timestamp columns):
>
- Less than (numeric, timestamp columns):
<
- Equal to (all column types):
=
- Not equal to (all column types):
!=
- Greater than or equal to (numeric, timestamp columns):
>=
- Less than or equal to (numeric, timestamp columns):
<=
- Starts with a particular value (text columns):
STARTS WITH
- Ends with a particular value (text columns):
ENDS WITH
- Contains a particular value (text columns):
CONTAINS
Example 4
If the target is a logical value, you can combine multiple conditions with logical operations AND/OR/NOT:
PREDICT SUM(PURCHASES.amount, 0, 30) > 10 OR COUNT(VIEWS.*, 0, 30) > 20
PREDICT CUSTOMERS.industry = 'IT' AND CUSTOMERS.date_of_birth <= 1990-01-01
However, one should not mix static conditions and aggregations when defining the target. Temporal aggregations change their values depending on the anchor time that they are computed from, while static conditions are not affected by time. As such, they require different data-sampling strategies.
Updated 3 days ago