TimeSeriesOutliers
Identifies and visualizes outliers in time-series data using the z-score method.
Purpose
This test is designed to identify outliers in time-series data using the z-score method. It’s vital for ensuring data quality before modeling, as outliers can skew predictive models and significantly impact their overall performance.
Test Mechanism
The test requires a dataset with a datetime index. It checks whether a ‘zscore_threshold’ parameter has been supplied and identifies the columns with numeric data types. It then applies the z-score method to each numeric column and flags as an outlier every data point whose absolute z-score exceeds the threshold. Each outlier is recorded with its variable name, z-score, timestamp, and the threshold applied; these records are collected in a dictionary and converted to a DataFrame for convenient output. The test also produces a plot for each time series that shows the outliers in the context of the full series. The ‘zscore_threshold’ parameter sets the limit beyond which a data point is labeled an outlier. The default is 3, meaning that any data point more than 3 standard deviations from the mean is marked as an outlier.
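The mechanism described above can be sketched in a few lines of pandas. This is a minimal illustration, not the test's actual implementation: the function name `find_zscore_outliers`, the output column names, the sample data, and the use of the population standard deviation (`ddof=0`) are all assumptions made for the example, and plotting is omitted.

```python
import numpy as np
import pandas as pd


def find_zscore_outliers(df: pd.DataFrame, zscore_threshold: float = 3.0) -> pd.DataFrame:
    """Hypothetical helper: return one row per outlier found in the numeric columns."""
    if not isinstance(df.index, pd.DatetimeIndex):
        raise ValueError("The dataset must have a datetime index.")

    records = []
    # Only numeric columns are examined; other dtypes are skipped.
    for column in df.select_dtypes(include="number").columns:
        series = df[column].dropna()
        if series.empty:
            continue
        std = series.std(ddof=0)  # assumption: population standard deviation
        if std == 0:
            continue  # z-scores are undefined for a constant column
        z_scores = (series - series.mean()) / std
        mask = (z_scores.abs() > zscore_threshold).to_numpy()
        for timestamp, z in zip(series.index[mask], z_scores[mask]):
            records.append(
                {
                    "Variable": column,
                    "Date": timestamp,
                    "z-score": float(z),
                    "Threshold": zscore_threshold,
                }
            )

    return pd.DataFrame(records, columns=["Variable", "Date", "z-score", "Threshold"])


# Illustrative data: a smooth daily series with one injected spike.
index = pd.date_range("2024-01-01", periods=30, freq="D")
values = np.sin(np.linspace(0, 6, 30))
values[10] = 25.0  # the injected anomaly
data = pd.DataFrame({"sales": values, "region": "north"}, index=index)

outliers = find_zscore_outliers(data, zscore_threshold=3.0)
print(outliers)
```

In this sketch the non-numeric `region` column is ignored, and only the injected spike in `sales` is reported, together with its timestamp, z-score, and the threshold used.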
Signs of High Risk
- Many or substantial outliers are present within the dataset, indicating significant anomalies.
- Data points whose absolute z-scores exceed the set threshold.
- Potential impact on the performance of machine learning models if outliers are not properly addressed.
Strengths
- The z-score method is a popular, simple, and well-understood way to identify outliers in a dataset.
- Requires a datetime index, so every reported outlier is tied to a timestamp.
- Identifies outliers for each numeric feature individually.
- Provides a detailed report showing variables, dates, z-scores, and pass/fail results.
- Offers visual inspection for detected outliers through plots.
Limitations
- The test only identifies outliers in numeric columns, not in categorical variables.
- The utility and accuracy of z-scores can be limited if the data doesn’t follow a normal distribution.
- The method relies on a subjective z-score threshold for deciding what constitutes an outlier, which might not always be suitable depending on the dataset and use case.
- It does not address possible ways to handle identified outliers in the data.
- The requirement for a datetime index could limit its application.
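When the timestamps are stored in an ordinary column, the datetime-index requirement can usually be met with a short preparation step before running the test. The sketch below assumes a hypothetical `date` column holding ISO-formatted strings; the column names and values are illustrative only.

```python
import pandas as pd

# Hypothetical raw data with timestamps stored as strings in a "date" column.
raw = pd.DataFrame(
    {
        "date": ["2024-01-03", "2024-01-01", "2024-01-02"],
        "sales": [12.0, 10.0, 11.0],
    }
)

# Parse the strings, promote the column to the index, and sort chronologically.
prepared = (
    raw.assign(date=pd.to_datetime(raw["date"]))
    .set_index("date")
    .sort_index()
)

print(type(prepared.index).__name__)
```

The resulting `prepared` DataFrame has a `DatetimeIndex` and can be passed to a test that expects datetime indexing.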