Metrics for Data Exploration
Two types of metrics are calculated during data exploration:
- Univariate metrics
- Multivariate metrics
Univariate analysis metrics
| Datatype | Metric | Description |
|---|---|---|
| Numeric | min | The minimum value in the data |
| | max | The maximum value in the data |
| | mean | Mean of the data |
| | median | Median of the data |
| | mode | Mode of the data |
| | std | The standard deviation of the data |
| | var | The variance of the data |
| | deciles | A decile is any of the nine values that divide the sorted data into ten equal parts |
| | outliers | Outliers present in the data |
| | quartiles | A quartile is a type of quantile that divides the data points into four more or less equal parts, or quarters |
| | | Probability Distribution Function gives the distribution of numeric data |
| | iqr | IQR (interquartile range) is a measure of statistical dispersion, equal to the difference between the 75th and 25th percentiles |
| | kurtosis | Kurtosis is a statistical measure of how heavily the tails of a distribution differ from the tails of a normal distribution |
| | skewness | Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean |
| | is_valid | Boolean indicating the validity of the attribute |
| Categorical | outliers | Outliers present in the data |
| | freq_count | Frequency count of each individual category |
| | mode | Mode of the data |
| Date | min_date | Minimum date in the data |
| | max_date | Maximum date in the data |
| | day_count | Count of number of days |
| | month_count | Count of number of months |
| | year_count | Count of number of years |
| | missing_dates | List of tuples of missing date ranges |
| Text | unigram | An n-gram of size 1 |
| | bigram | An n-gram of size 2 |
| | trigram | An n-gram of size 3 |
| String | freq_count | Frequency count of string data |
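
As an illustration of how several of the Numeric and Text metrics above can be computed, here is a minimal sketch that uses only the Python standard library. The function names `numeric_summary` and `ngram_counts` are hypothetical. The outlier rule (1.5 × IQR fences), the use of the sample standard deviation, and the moment-based skewness and excess kurtosis formulas are assumptions, since this page does not state which definitions are used.

```python
import statistics
from collections import Counter


def numeric_summary(values):
    """Compute a subset of the Numeric metrics for a list of numbers."""
    n = len(values)
    mean = statistics.mean(values)
    # Quartiles and deciles: cut points that split the sorted data into
    # 4 and 10 equal parts. method="inclusive" interpolates between the
    # observed minimum and maximum.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    iqr = q3 - q1  # 75th percentile minus 25th percentile
    # Assumption: outliers are values more than 1.5 * IQR beyond the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Assumption: moment-based skewness and excess kurtosis (normal gives 0).
    # Note: constant data has zero spread and would divide by zero here.
    pstd = statistics.pstdev(values)
    skewness = sum((v - mean) ** 3 for v in values) / n / pstd ** 3
    kurtosis = sum((v - mean) ** 4 for v in values) / n / pstd ** 4 - 3
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        # Assumption: sample (n - 1) standard deviation and variance.
        "std": statistics.stdev(values),
        "var": statistics.variance(values),
        "deciles": deciles,
        "quartiles": [q1, q2, q3],
        "iqr": iqr,
        "outliers": [v for v in values if v < low or v > high],
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


def ngram_counts(text, n):
    """Count n-grams of size n over whitespace-separated, lowercased tokens."""
    tokens = text.lower().split()
    # zip over n shifted copies of the token list yields consecutive n-tuples.
    return Counter(zip(*(tokens[i:] for i in range(n))))


summary = numeric_summary([1, 2, 2, 3, 4, 5, 6, 7, 100])
print(summary["quartiles"], summary["iqr"], summary["outliers"])
print(ngram_counts("the cat sat on the cat", 2).most_common(1))
```

With `n` set to 1, 2, or 3, `ngram_counts` corresponds to the unigram, bigram, and trigram rows respectively. Real tokenization is usually more involved than splitting on whitespace.
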
Multivariate analysis metrics
| Metric | Description |
|---|---|
| num_records | Number of records in the dataset |
| num_attributes | Number of attributes in the dataset |
| na_count | Number of NA fields |
| na_count_percentage | Percentage of NA fields |
| missing_count | Number of missing values |
| missing_count_percentage | Percentage of missing values |
| duplicate_count | Number of duplicates in the data |
| duplicate_count_percentage | Percentage of duplicates in the data |
| duplicate_rows_count | Number of duplicated rows |
| pearson | Pearson’s correlation between two attributes |
| chi_square | Chi-square measure of association between two attributes |
| spearman | Spearman’s correlation between two attributes |
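
The multivariate metrics above can be sketched in plain Python as follows. All function names are hypothetical and the definitions are assumptions made for illustration: `None` stands for an NA field, every repeat of an already-seen row counts as one duplicate, Spearman is computed as Pearson on average ranks, and `chi_square` returns the chi-square statistic of the contingency table rather than a p-value.

```python
import math
from collections import Counter


def dataset_counts(rows):
    """Record, attribute, NA and duplicate counts for a list of row tuples."""
    num_records = len(rows)
    num_attributes = len(rows[0]) if rows else 0
    total_fields = num_records * num_attributes
    # Assumption: None marks an NA field.
    na_count = sum(value is None for row in rows for value in row)
    # Assumption: each repeat of an already-seen row is one duplicate.
    duplicate_count = sum(c - 1 for c in Counter(rows).values())
    return {
        "num_records": num_records,
        "num_attributes": num_attributes,
        "na_count": na_count,
        "na_count_percentage": 100 * na_count / total_fields if total_fields else 0.0,
        "duplicate_count": duplicate_count,
        "duplicate_count_percentage": 100 * duplicate_count / num_records if num_records else 0.0,
    }


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


def chi_square(xs, ys):
    """Chi-square statistic for two categorical attributes."""
    n = len(xs)
    observed = Counter(zip(xs, ys))
    row_totals, col_totals = Counter(xs), Counter(ys)
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n  # count expected under independence
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat


rows = [(1, "a"), (2, None), (1, "a"), (3, "b")]
print(dataset_counts(rows))
print(spearman([1, 2, 3, 4, 5], [1, 8, 27, 64, 125]))
print(chi_square(["a", "a", "b", "b"], ["x", "x", "y", "y"]))
```

Pearson measures linear association, so it suits numeric attributes with a roughly linear relationship. Spearman only depends on the ordering of the values and therefore also captures monotonic but non-linear relationships. The chi-square statistic applies to pairs of categorical attributes.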