Metrics for Data Exploration

Two types of metrics are calculated during data exploration:

  1. Univariate metrics

  2. Multivariate metrics

Univariate analysis metrics

| Datatype | Metric | Description |
|----------|--------|-------------|
| Numeric | min | The minimum value in the data |
| | max | The maximum value in the data |
| | mean | The mean of the data |
| | median | The median of the data |
| | mode | The mode of the data |
| | std | The standard deviation of the data |
| | var | The variance of the data |
| | deciles | The nine values that divide the sorted data into ten equal parts |
| | outliers | Outliers present in the data |
| | quartiles | The three values that divide the sorted data into four roughly equal parts |
| | pdf | Probability distribution function, which gives the distribution of the numeric data |
| | iqr | Interquartile range (IQR), a measure of statistical dispersion equal to the difference between the 75th and 25th percentiles |
| | kurtosis | A measure of how heavily the tails of the distribution differ from the tails of a normal distribution |
| | skewness | A measure of the asymmetry of the distribution about its mean |
| | is_valid | Boolean indicating the validity of the attribute |
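The numeric metrics listed above can be illustrated with a minimal sketch that uses only the Python standard library. The function name `numeric_metrics`, the use of population (rather than sample) statistics, the "inclusive" quantile method, the 1.5 x IQR outlier rule, and the reporting of excess kurtosis are all assumptions made for illustration; the source does not say how each metric is actually computed. `pdf` and `is_valid` are left out because their definitions are not given.

```python
import statistics


def numeric_metrics(values):
    """Compute a subset of the numeric metrics for a list of numbers.

    Hypothetical helper for illustration; not the tool's own implementation.
    """
    data = sorted(values)
    n = len(data)
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)      # population standard deviation (assumption)
    var = statistics.pvariance(data)   # population variance (assumption)

    # quantiles() returns n - 1 cut points: 3 for quartiles, 9 for deciles.
    quartiles = statistics.quantiles(data, n=4, method="inclusive")
    deciles = statistics.quantiles(data, n=10, method="inclusive")
    q1, _, q3 = quartiles
    iqr = q3 - q1

    # Assumed outlier rule: values more than 1.5 * IQR outside the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < low or x > high]

    if std == 0:
        # Constant data: the standardized moments are undefined.
        skewness = kurtosis = None
    else:
        # Standardized third central moment.
        skewness = sum((x - mean) ** 3 for x in data) / (n * std ** 3)
        # Excess kurtosis: 0 for a normal distribution.
        kurtosis = sum((x - mean) ** 4 for x in data) / (n * std ** 4) - 3

    return {
        "min": data[0],
        "max": data[-1],
        "mean": mean,
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "std": std,
        "var": var,
        "deciles": deciles,
        "quartiles": quartiles,
        "iqr": iqr,
        "outliers": outliers,
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


metrics = numeric_metrics([1, 2, 2, 3, 4, 5, 6, 7, 8, 100])
print(metrics["quartiles"], metrics["iqr"], metrics["outliers"])
```

With this sample, the single large value is the only point flagged by the assumed outlier rule, and it also makes the skewness positive.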

| Datatype | Metric | Description |
|----------|--------|-------------|
| Categorical | outliers | Outliers present in the data |
| | freq_count | Frequency count of each individual category |
| | mode | Mode of the data |
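As a rough illustration of the categorical metrics, `freq_count` and `mode` can be derived from a single pass over the values. The function name `categorical_metrics` is hypothetical, and returning every tied category as the mode is an assumption. `outliers` is omitted because the source does not define what counts as an outlier for categorical data.

```python
from collections import Counter


def categorical_metrics(values):
    """Frequency count and mode(s) of a categorical attribute (illustrative sketch)."""
    counts = Counter(values)
    top = max(counts.values(), default=0)
    # Assumption: when several categories tie for the highest count, return all of them.
    modes = [category for category, k in counts.items() if k == top]
    return {"freq_count": dict(counts), "mode": modes}


result = categorical_metrics(["red", "blue", "red", "green", "red", "blue"])
print(result["freq_count"])
print(result["mode"])
```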

| Datatype | Metric | Description |
|----------|--------|-------------|
| Date | min_date | Minimum date in the data |
| | max_date | Maximum date in the data |
| | day_count | Number of days |
| | month_count | Number of months |
| | year_count | Number of years |
| | missing_dates | List of tuples of missing date ranges |
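The date metrics can be sketched as follows. Several points here are assumptions, since the source does not spell them out: the counts are taken to mean the number of distinct days, months, and years present in the data, the data is assumed to have daily granularity, and each missing range is represented as an inclusive `(start, end)` tuple. The function name `date_metrics` is hypothetical.

```python
from datetime import date, timedelta


def date_metrics(dates):
    """Date metrics for a non-empty list of datetime.date values (illustrative sketch)."""
    days = sorted(set(dates))
    one_day = timedelta(days=1)

    # A gap exists wherever two consecutive observed days are more than one day apart.
    missing = []
    for prev, nxt in zip(days, days[1:]):
        if (nxt - prev).days > 1:
            missing.append((prev + one_day, nxt - one_day))

    return {
        "min_date": days[0],
        "max_date": days[-1],
        "day_count": len(days),                                  # distinct days (assumption)
        "month_count": len({(d.year, d.month) for d in days}),   # distinct months (assumption)
        "year_count": len({d.year for d in days}),               # distinct years (assumption)
        "missing_dates": missing,
    }


sample = [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 5), date(2023, 2, 1)]
for start, end in date_metrics(sample)["missing_dates"]:
    print(start, "to", end)
```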

| Datatype | Metric | Description |
|----------|--------|-------------|
| Text | unigram | An n-gram of size 1 |
| | bigram | An n-gram of size 2 |
| | trigram | An n-gram of size 3 |
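To make the n-gram definitions concrete, the sketch below extracts unigrams, bigrams, and trigrams from a sentence. Lowercasing and splitting on whitespace is an assumed tokenization, and the names `ngrams` and `ngram_counts` are hypothetical; the source does not describe how text is tokenized or how the n-grams are reported.

```python
from collections import Counter


def ngrams(text, n):
    """Return the list of n-grams (as tuples of tokens) in the text."""
    tokens = text.lower().split()  # assumed tokenization: lowercase, split on whitespace
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n):
    """Count how often each n-gram occurs."""
    return Counter(ngrams(text, n))


sentence = "the quick brown fox"
print(ngrams(sentence, 1))  # unigrams
print(ngrams(sentence, 2))  # bigrams
print(ngrams(sentence, 3))  # trigrams
```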

| Datatype | Metric | Description |
|----------|--------|-------------|
| String | freq_count | Frequency count of string data |

Multivariate analysis metrics

| Metric | Description |
|--------|-------------|
| num_records | Number of records in the dataset |
| num_attributes | Number of attributes |
| na_count | Number of NA fields |
| na_count_percentage | Percentage of NA fields |
| missing_count | Number of missing values |
| missing_count_percentage | Percentage of missing values |
| duplicate_count | Number of duplicates in the data |
| duplicate_count_percentage | Percentage of duplicates in the data |
| duplicate_rows_count | Number of rows that have duplicates |
| pearson | Pearson’s correlation between two attributes |
| chi_square | Chi-square test of association between two attributes |
| spearman | Spearman’s rank correlation between two attributes |