- Input: Datasets with numerical features (categorical features must first be encoded numerically).
- Output: Anomaly score for each data point, where higher scores indicate a higher likelihood of being an outlier.
- Strengths:
- Handles high-dimensional data effectively.
- Unsupervised: requires no labeled examples of what anomalies look like.
- Adaptable to streaming data: points can be inserted and removed incrementally (see the usage sketch after this list).
- Doesn't make strong assumptions about data distribution.
- Weaknesses:
- Less interpretable than simpler methods (e.g., threshold- or distance-based rules), so it is harder to explain why a given point was flagged.
- Hyperparameter tuning (number of trees, tree depth, etc.) is important for good performance.
- Might miss subtle anomalies close to the normal data distribution.
- Use Case: Detecting unusual patterns in network traffic that could signify cyberattacks or intrusions.
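To make the input/output contract and the streaming updates mentioned above concrete, here is a minimal usage sketch built on the open-source `rrcf` Python package (an implementation of Robust Random Cut Forest). The synthetic stream, tree count, and window size are illustrative assumptions, not part of RCF itself.

```python
import numpy as np
import rrcf

rng = np.random.default_rng(42)
# Synthetic "traffic" stream: mostly normal 2-D points plus one injected spike
stream = rng.normal(0, 1, (300, 2))
stream[150] = [10.0, 10.0]  # the anomaly we hope to flag

num_trees, tree_size = 40, 128  # illustrative hyperparameters
forest = [rrcf.RCTree() for _ in range(num_trees)]
scores = {}

for index, point in enumerate(stream):
    for tree in forest:
        # Keep each tree at a bounded size: forget the oldest point (FIFO)
        if len(tree.leaves) > tree_size:
            tree.forget_point(index - tree_size)
        # Incremental insert -- this is what makes RCF streaming-friendly
        tree.insert_point(point, index=index)
        # CoDisp is rrcf's anomaly score; average it across the forest
        scores[index] = scores.get(index, 0) + tree.codisp(index) / num_trees

print("most anomalous index:", max(scores, key=scores.get))  # expect 150
```

Because every point can be both inserted and forgotten in any order, the same forest can track a rolling window over an unbounded stream, which is the property the streaming bullet above refers to.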
How it Works (Simplified)
- Random Cuts: RCF partitions the feature space with randomly placed, axis-aligned cuts: a dimension is chosen with probability proportional to its value range, and a cut point is drawn uniformly within that range.
- Tree Building: Each tree is grown by applying these random cuts recursively, splitting the points on either side of each cut, until every point sits alone in its own leaf.
- Anomaly Scoring: Points that are consistently isolated by only a few cuts across many trees (i.e., that end up on short branches near the root) receive higher anomaly scores, because sparse, outlying points are easy to separate from the rest; see the sketch after this list.
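The following self-contained sketch walks through those three steps under two stated assumptions: the function names (`build_tree`, `anomaly_scores`) are hypothetical, and it scores points by average isolation depth in the style of Isolation Forest, whereas production RCF implementations use a displacement-based score (CoDisp). The cut-selection rule, however, follows RCF: dimensions are chosen in proportion to their range.

```python
import numpy as np

def build_tree(X, indices, depths, depth, max_depth, rng):
    """Recursively partition points with random cuts; record each point's isolation depth."""
    if len(indices) <= 1 or depth >= max_depth:
        for i in indices:
            depths[i].append(depth)
        return
    pts = X[indices]
    ranges = pts.max(axis=0) - pts.min(axis=0)
    if ranges.sum() == 0:  # all remaining points identical; nothing left to cut
        for i in indices:
            depths[i].append(depth)
        return
    # RCF's rule: pick a dimension with probability proportional to its range,
    # then a cut point uniformly within that range
    dim = rng.choice(len(ranges), p=ranges / ranges.sum())
    cut = rng.uniform(pts[:, dim].min(), pts[:, dim].max())
    build_tree(X, indices[X[indices, dim] <= cut], depths, depth + 1, max_depth, rng)
    build_tree(X, indices[X[indices, dim] > cut], depths, depth + 1, max_depth, rng)

def anomaly_scores(X, num_trees=50, max_depth=12, seed=0):
    """Average isolation depth over many trees; shallower depth -> higher score."""
    rng = np.random.default_rng(seed)
    depths = {i: [] for i in range(len(X))}
    for _ in range(num_trees):
        build_tree(X, np.arange(len(X)), depths, 0, max_depth, rng)
    avg_depth = np.array([np.mean(depths[i]) for i in range(len(X))])
    return max_depth - avg_depth  # points isolated early score highest

# Tiny demo: one far-away point should get the highest score
X = np.vstack([np.random.default_rng(1).normal(0, 1, (100, 2)), [[8.0, 8.0]]])
print("outlier index:", anomaly_scores(X).argmax())  # expected: 100 (the injected point)
```

The depth-based score preserves the key intuition: an outlying point has a lot of empty space around it, so a random cut separates it quickly, giving it a short path from the root and hence a high score.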