KolmogorovSmirnov

Assesses whether each feature in the dataset aligns with a normal distribution using the Kolmogorov-Smirnov test.

Purpose

The Kolmogorov-Smirnov (KS) test evaluates the distribution of features in a dataset to determine their alignment with a normal distribution. This is important because many statistical methods and machine learning models assume normality in the data distribution.

Test Mechanism

This test calculates the KS statistic and corresponding p-value for each feature in the dataset by comparing the feature's empirical cumulative distribution function with the cumulative distribution function of a normal distribution. The KS statistic and p-value for each feature are then stored in a dictionary. The p-value threshold for rejecting the hypothesis of normality is not preset, providing flexibility for different applications.
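
The following is a minimal sketch of this mechanism, not the library's actual implementation: it assumes pandas and SciPy, and the function name ks_normality_check is hypothetical. Each numeric feature is compared against a normal distribution parameterized by the feature's own mean and standard deviation, and the results are collected in a dictionary.

```python
import pandas as pd
from scipy import stats

def ks_normality_check(df: pd.DataFrame) -> dict:
    """Return {feature: (ks_statistic, p_value)} for each numeric column."""
    results = {}
    for col in df.select_dtypes(include="number").columns:
        x = df[col].dropna()
        # Compare the feature's empirical CDF with a normal CDF fitted
        # to the feature's sample mean and standard deviation
        stat, p_value = stats.kstest(x, "norm", args=(x.mean(), x.std()))
        results[col] = (stat, p_value)
    return results
```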

Signs of High Risk

  • Elevated KS statistic for a feature combined with a low p-value, indicating a significant divergence from a normal distribution.
  • Features with notable deviations from normality that could cause problems if the model assumes normally distributed data (a thresholding sketch follows this list).
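
Because the rejection threshold is left to the user, one way to flag potentially high-risk features is to apply a chosen significance level to the per-feature p-values. In this sketch, the 0.05 level, the DataFrame df, and the ks_normality_check helper from the earlier example are all illustrative assumptions.

```python
# Illustrative thresholding: 0.05 is an assumed choice, not a preset value.
alpha = 0.05
results = ks_normality_check(df)  # hypothetical helper from the sketch above
flagged = {col: (stat, p) for col, (stat, p) in results.items() if p < alpha}
print(f"Features deviating from normality at alpha={alpha}: {sorted(flagged)}")
```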

Strengths

  • The KS test is sensitive to differences in the location and shape of empirical cumulative distribution functions.
  • It is non-parametric and adaptable to various datasets, as it does not assume any specific data distribution.
  • Provides detailed insights into the distribution of individual features.

Limitations

  • The KS statistic is most sensitive near the center of the distribution and comparatively insensitive in the tails, so tail deviations from normality may go undetected.
  • Less effective for multivariate distributions, as it is designed for univariate distributions.
  • Does not identify specific types of non-normality, such as skewness or kurtosis, which could impact model fitting.