Descriptive Statistics: Raw Data provides a comprehensive summary of the numerical and categorical variables in a dataset, covering each variable’s distribution, central tendency, and variability. The primary purpose of this test is to build an understanding of the dataset’s structure and characteristics, which is essential for interpreting model behavior and anticipating performance outcomes.
The test operates by applying established statistical functions to the dataset. For numerical variables, it uses a summary statistics approach, calculating the count of observations, mean (average value), standard deviation (a measure of spread or variability), minimum and maximum values, and key percentiles (25th, 50th, 75th, 90th, and 95th). These metrics collectively describe the central tendency, dispersion, and range of the data. The mean provides an average, while the median (50th percentile) offers a robust measure of central location, less sensitive to outliers. The standard deviation quantifies how much values deviate from the mean, with higher values indicating greater spread. Percentiles help identify the distribution of values across the dataset, highlighting skewness or concentration in certain ranges. For categorical variables, the test counts the total number of entries, the number of unique categories, the most frequent category (top value), its frequency, and the proportion this frequency represents relative to the total. This approach reveals the diversity and dominance of categories, which is crucial for understanding potential biases or imbalances. The typical range for these metrics is determined by the data itself, with counts ranging from zero to the dataset size, proportions from 0% to 100%, and numerical values spanning the observed data range. High dominance of a single category or significant differences between mean and median can indicate skewness or lack of diversity, which may impact model performance.
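As a rough illustration of these mechanics, the sketch below reproduces the described metrics with pandas, assuming the raw data is available as a DataFrame `df`; the `summarize` function name and the reliance on `describe` are illustrative choices, not the test’s actual implementation.

```python
import pandas as pd

def summarize(df: pd.DataFrame):
    # Numerical variables: count, mean, std, min, selected percentiles, and max.
    numerical = df.select_dtypes(include="number").describe(
        percentiles=[0.25, 0.50, 0.75, 0.90, 0.95]
    ).T

    # Categorical variables: count, number of unique categories, most frequent
    # category ("top"), and its frequency ("freq").
    categorical = df.select_dtypes(exclude="number").describe().T
    # Proportion of the most frequent category relative to the total count.
    categorical["top_pct"] = (
        100 * categorical["freq"].astype(float) / categorical["count"].astype(float)
    )
    return numerical, categorical
```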
The primary advantages of this test include its ability to quickly and effectively summarize large and complex datasets, making it easier to identify patterns, anomalies, and potential data quality issues. By providing both central tendency and dispersion measures for numerical variables, the test enables users to detect outliers, skewness, and unusual distributions that could affect model training and inference. For categorical variables, the test highlights the presence of dominant categories or limited diversity, which are important for assessing the risk of bias or overfitting. This comprehensive overview is particularly useful in the early stages of model development, data validation, and regulatory review, as it ensures that all relevant aspects of the data are considered before proceeding to more advanced analyses. The test’s versatility allows it to be applied across a wide range of domains and data types, supporting robust and transparent model documentation.
It should be noted that while this test provides valuable high-level insights, it does not capture relationships or dependencies between variables, nor does it detect subtle patterns or correlations that may be critical for model performance. The test is limited to univariate analysis, meaning it examines each variable independently without considering interactions. As a result, it cannot identify multicollinearity, confounding factors, or complex data structures. Additionally, the test may not detect rare but important categories in categorical variables if they are overshadowed by dominant classes. Interpretation challenges may arise if the data contains significant outliers or is heavily skewed, as these can distort summary statistics such as the mean and standard deviation. Signs of high risk include large discrepancies between mean and median, high standard deviation relative to the mean, or a single category accounting for a large proportion of the data. These characteristics may indicate potential issues with data representativeness or suitability for modeling, and should prompt further investigation using complementary statistical tests.
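These warning signs can be screened for mechanically. The sketch below assumes the two summary tables produced by the illustrative `summarize` function above; the thresholds are arbitrary placeholders chosen for demonstration, not prescribed cut-offs.

```python
def risk_flags(numerical, categorical,
               cv_threshold=1.0, skew_threshold=0.25, dominance_threshold=0.8):
    """Flag crude indicators of dispersion, skewness, and category dominance."""
    flags = []
    for var, row in numerical.iterrows():
        # High standard deviation relative to the mean (coefficient of variation).
        if row["mean"] != 0 and row["std"] / abs(row["mean"]) > cv_threshold:
            flags.append(f"{var}: std is large relative to the mean")
        # Large mean-median discrepancy, scaled by the standard deviation.
        if row["std"] > 0 and abs(row["mean"] - row["50%"]) / row["std"] > skew_threshold:
            flags.append(f"{var}: mean and median diverge, distribution may be skewed")
    for var, row in categorical.iterrows():
        # A single category accounting for a large share of the data.
        if row["freq"] / row["count"] > dominance_threshold:
            flags.append(f"{var}: dominated by a single category ({row['top']})")
    return flags
```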
This test shows the results in two tabular formats: one for numerical variables and one for categorical variables. The numerical variables table lists each variable alongside its count, mean, standard deviation, minimum, several percentiles (25th, 50th, 75th, 90th, 95th), and maximum values, providing a detailed snapshot of the distribution and spread for each feature. For example, the "CreditScore" variable has a mean of 650.16, a standard deviation of 96.85, and ranges from 350 to 850, with percentiles indicating the distribution across the population. The "Balance" variable shows a mean of 76,434.10 and a wide standard deviation of 62,612.25, with a minimum of 0 and a maximum of 250,898, suggesting a highly variable distribution. The categorical variables table presents each variable with its total count, number of unique values, the most frequent category, its frequency, and the percentage this represents. For instance, "Geography" has three unique values, with "France" being the most common at 50.12% of the data, while "Gender" is split between two categories, with "Male" comprising 54.95%. These tables allow for straightforward identification of central tendencies, variability, and category dominance, and can be read by examining each row for the variable of interest and interpreting the corresponding summary statistics. Notable observations include the presence of variables with high standard deviations, potential skewness in distributions, and categorical variables with dominant classes.
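Continuing the earlier sketch (and assuming the same illustrative `summarize` helper), a single variable’s row can be pulled out of either table directly:

```python
numerical, categorical = summarize(df)

# One row per variable of interest: count, mean, std, min, percentiles, max.
print(numerical.loc["Balance"])
# Count, unique categories, most frequent category and its frequency/share.
print(categorical.loc["Geography"])
```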
The test results reveal the following key insights:
- Numerical variables exhibit wide ranges and varying degrees of dispersion: Variables such as "CreditScore" and "Balance" display substantial spreads, with "Balance" showing a particularly high standard deviation (62,612.25) relative to its mean (76,434.10), indicating significant variability and potential outliers.
- Central tendency and skewness are evident in several variables: the "CreditScore" mean (650.16) is close to the median (652.0), suggesting a relatively symmetric distribution, while "Balance" has a median (97,264.0) notably higher than its mean (76,434.10), indicating left-skewness: a concentration of zero or low balances pulls the mean below the median rather than a long tail of higher balances.
- Categorical variables show limited diversity and dominance of specific categories: "Geography" is dominated by "France" (50.12%), and "Gender" by "Male" (54.95%), highlighting potential imbalances that could influence model outcomes if not addressed.
- Binary variables are reasonably balanced: "HasCrCard" and "IsActiveMember" are binary indicators with means of 0.70 and 0.52, respectively; since the mean of a 0/1 variable equals the proportion of ones, these correspond to roughly 70/30 and 52/48 splits, limiting the risk of extreme class imbalance.
- Percentile analysis reveals concentration and outlier presence: for "Age," the 95th percentile is 60, while the maximum is 92, suggesting a small number of older individuals that may act as outliers. Similarly, "Balance" and "EstimatedSalary" show large gaps between the 95th percentile and maximum values, further indicating the presence of extreme values (a simple check along these lines is sketched below).
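The percentile-versus-maximum comparison used in the last point can be expressed as a small helper. As before, this builds on the assumed numerical summary table, and the ratio is an arbitrary illustration rather than a recommended threshold.

```python
def tail_gap(numerical, factor=1.25):
    """Return variables whose maximum sits well beyond the 95th percentile."""
    gaps = {}
    for var, row in numerical.iterrows():
        p95, vmax = row["95%"], row["max"]
        if p95 > 0 and vmax > factor * p95:
            gaps[var] = {"p95": p95, "max": vmax}
    return gaps

# Example: with Age at p95 = 60 and max = 92, the gap would be flagged here.
```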
Based on these results, the dataset demonstrates a mix of well-behaved and highly variable features, with numerical variables such as "CreditScore" and "Age" showing relatively symmetric distributions and moderate dispersion, while "Balance" and "EstimatedSalary" exhibit significant variability and skewed distributions, as evidenced by high standard deviations and large gaps between upper percentiles and maximum values. The categorical variables are characterized by limited diversity, with a single category accounting for over half of the observations in both "Geography" and "Gender," which may introduce bias or reduce the model’s ability to generalize across less-represented groups. Binary variables are reasonably balanced, minimizing the risk of model bias due to class imbalance. The presence of outliers in variables like "Balance" and "Age" is apparent from the percentile and maximum value comparisons, suggesting that further investigation or preprocessing may be warranted to mitigate their impact. Overall, the descriptive statistics provide a clear and detailed overview of the dataset’s structure, highlighting areas of stability, variability, and potential risk that are critical for understanding model behavior and informing subsequent modeling decisions.