How should I scale/handle outliers in the data?
A common challenge for machine learning models is their susceptibility to outliers and novel data points. When a new entity enters the system, historical data about its behavior and interactions is typically scarce. Consequently, the model struggles to capture the entity's profile and preferences, and new inputs can introduce considerable noise. Because existing approaches do not fully capture the relationships between entities, a new data point tends to look like an outlier, which makes the system highly sensitive to changes in its input.
Several techniques exist to address outliers within conventional machine learning paradigms, each with its own drawbacks. For instance, simply removing outliers can improve model accuracy but risks discarding valuable information in some contexts. Another method is capping, where values beyond predefined thresholds are clamped to those thresholds, so the outlying records stay in the dataset while their impact is reduced.
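The difference between removal and capping can be sketched in a few lines of Python. This is a minimal illustration, not part of Kumo: the function names, the sample values, and the thresholds 0 and 100 are all hypothetical and chosen only to make the contrast visible.

```python
def remove_outliers(values, lower, upper):
    """Drop every value that falls outside [lower, upper]."""
    return [v for v in values if lower <= v <= upper]


def cap_outliers(values, lower, upper):
    """Clamp every value into [lower, upper]; the number of records is unchanged."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical transaction amounts; 5000 is the outlier.
amounts = [12, 15, 14, 13, 5000, 16]

removed = remove_outliers(amounts, lower=0, upper=100)
capped = cap_outliers(amounts, lower=0, upper=100)

print(removed)  # [12, 15, 14, 13, 16]
print(capped)   # [12, 15, 14, 13, 100, 16]
```

Removal loses the record entirely, while capping keeps it but replaces its value with the threshold. In practice the thresholds are often derived from the data (for example, from percentiles) rather than fixed by hand.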
In contrast, by leveraging graph machine learning, Kumo mitigates the outlier concern by capturing the relational properties present in the data. A graph represents a network of entities and connections, so each entity is seen in the context of the entities it is connected to. By encoding this structural information, graph models can make effective predictions even with limited data on a given entity, which helps reduce false positives and address label imbalance. Moreover, graph models require less feature engineering than traditional machine learning approaches, because they are designed to learn the structural properties and interactions within the network directly from raw data.
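To make the idea of "context from connected entities" concrete, here is a toy sketch. It is not Kumo's model or API: the graph, the feature vectors, and the `neighbor_mean` helper are hypothetical, and a real graph neural network learns how to aggregate neighbors rather than taking a fixed average. The sketch only shows how an entity with no history of its own can still get a usable representation from its neighbors.

```python
def neighbor_mean(node, edges, features):
    """Average the feature vectors of the nodes directly connected to `node`.

    Returns None when no neighbor has a feature vector.
    """
    # Treat edges as undirected: collect the other endpoint of every edge touching `node`.
    neighbors = [b for a, b in edges if a == node] + [a for a, b in edges if b == node]
    vectors = [features[n] for n in neighbors if n in features]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Hypothetical graph: a new user with no history is linked to two known items.
edges = [("new_user", "item_a"), ("new_user", "item_b"), ("old_user", "item_a")]
features = {"item_a": [1.0, 0.0], "item_b": [0.0, 1.0]}

profile = neighbor_mean("new_user", edges, features)
print(profile)  # [0.5, 0.5]
```

Even though `new_user` has no features of its own, its two connections are enough to place it between the items it interacted with, instead of leaving it as an isolated, outlier-like point.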