Data Cleaning Library

The xpresso.ai Data Cleaning module modifies data, removes redundancies, and regularizes date formats as per user specifications.

This module is represented by the DataCleaning class.

transform method

This method applies the specified transformations to the dataset, cleaning it for further exploration and visualization.

It takes a StructuredDataset instance as a parameter, which must first be populated with data by a call to its import_dataset method (as described here). The Data Cleaning module also needs the EDA data types inferred by the understand method of the Explorer class.

Note: The transform method overwrites the data of the dataset object in place.
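Because transform mutates the dataset in place, keep an untouched copy if you need the raw data later. The in-place behavior is analogous to pandas' inplace=True; the snippet below is plain pandas for illustration only, not xpresso.ai API.

```python
import pandas as pd

# Toy frame with one duplicated row
df = pd.DataFrame({"id": [1, 1, 2], "title": ["a", "a", "b"]})

# Non-destructive: returns a new frame, the original is unchanged
deduped = df.drop_duplicates()
assert len(df) == 3 and len(deduped) == 2

# Destructive (what transform does to its dataset): the original is overwritten
df.drop_duplicates(inplace=True)
assert len(df) == 2
```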

The transform method accepts the following arguments for cleaning a dataset:

| Name | Type | Description | Mandatory? | Comments |
|------|------|-------------|------------|----------|
| deduplicate | boolean | Specifies whether de-duplication of the data is required (default is False) | No | Invokes the method responsible for data de-duplication |
| columns | list of strings | List of attribute names used to identify duplicate rows (default is all columns) | No | Not required if deduplicate is False |
| keep | boolean | If True, keeps one row from each set of duplicates; if False, drops all duplicate rows (default is True) | No | Not required if deduplicate is False |
| clean_dates | boolean | Specifies whether the format of all date-type attributes in the dataset should be cleaned (default is False) | No | Invokes the method responsible for cleaning the date formats |
| date_format | string | Specifies the acceptable date format as per the ISO standard (default is "YYYY-MM-DD") | No | Not required if clean_dates is False |
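For intuition, the parameters above map onto familiar pandas operations: deduplicate/columns/keep behave like drop_duplicates with subset and keep, and clean_dates/date_format behave like parsing with to_datetime and reformatting. The sketch below is a hypothetical pandas equivalent of these semantics, not xpresso.ai's actual implementation; the column names are made up for illustration.

```python
import pandas as pd

# Toy data with duplicate (TITLE, STATUS) pairs and dd/mm/YYYY dates
df = pd.DataFrame({
    "TITLE":  ["a", "a", "b", "b"],
    "STATUS": ["open", "open", "open", "closed"],
    "DATE":   ["01/02/2020", "01/02/2020", "15/03/2020", "20/04/2020"],
})

# deduplicate=True, columns=["TITLE", "STATUS"], keep=True:
# retain one row from each group of duplicates
kept_one = df.drop_duplicates(subset=["TITLE", "STATUS"], keep="first")
assert len(kept_one) == 3

# keep=False: drop every row that has a duplicate
kept_none = df.drop_duplicates(subset=["TITLE", "STATUS"], keep=False)
assert len(kept_none) == 2

# clean_dates=True, date_format="%d/%m/%Y":
# parse with the given format, then normalize to ISO YYYY-MM-DD
df["DATE"] = pd.to_datetime(df["DATE"], format="%d/%m/%Y").dt.strftime("%Y-%m-%d")
assert df["DATE"].tolist() == ["2020-02-01", "2020-02-01", "2020-03-15", "2020-04-20"]
```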

Sample code to clean data

from xpresso.ai.core.data.xdm.structured_dataset import StructuredDataset
from xpresso.ai.core.data.preparation.data_prepare import DataCleaning
# create object for StructuredDataset class
dataset = StructuredDataset()
# configuration JSON object as required by Data Connectivity module
config = { "type": "FS", "data_source": "Local", "path": "/home/abzooba/Downloaded_Files/irrs_data.csv" }
# populating the StructuredDataset object with data from local directory
dataset.import_dataset(config)
# create object for Explorer class, passing the populated dataset
# (Explorer is imported from the Data Exploration module; see its documentation)
explorer = Explorer(dataset)
# run understand method to classify attributes as EDA defined datatypes
explorer.understand()
# create object for DataCleaning class
cleaner = DataCleaning(dataset)
# method call to clean the data
cleaner.transform(clean_dates=True, date_format="%d/%m/%Y", deduplicate=True, keep=False, columns=None)
# To identify duplicate records with respect to certain attributes only,
# call transform with the columns parameter set to the list of attribute names.
# Example:
# cleaner.transform(clean_dates=True, date_format="%d/%m/%Y", deduplicate=True, keep=True, columns=["TITLE", "STATUS", "PRIORITY"])
# printing results for data cleaning module
dataset.show()