Data preprocessing is the process of identifying and fixing incorrect, incomplete or missing, and inappropriate data.
It is an important step in data mining.
It makes the data easier to analyze and leads to more accurate results.
Real-world data is raw: it is often noisy and inconsistent, with missing values and duplicate tuples.
Steps in data preprocessing
Data Collection
Data Cleaning
Data Integration
Data Transformation
Data Reduction
1. Data Collection
The process of gathering data so it can be analyzed to make decisions, e.g., gathering data from a retail store.
Imagine a retail company that wants to create a comprehensive customer profile by integrating data from various sources, including:
Online Store: Data from e-commerce transactions.
Physical Stores: Data from in-store purchases.
Customer Support: Interaction logs and feedback from customer support.
Social Media: Customer interactions and sentiment from social media platforms.
Types of Data:
Numerical data
1. Discrete - countable values. Ex: number of students
2. Continuous - measurable values on a scale, often with decimals. Ex: cost of a house
Categorical data
1. Nominal - categories that have no natural order or ranking; they are simply labels or names. Ex: Apples, Oranges
2. Ordinal - categories that have a meaningful order or ranking, but the differences between the categories are not necessarily equal. Ex: Movie ratings
3. Dichotomous - binary data; variables that have only two categories. Ex: true or false, yes or no.
Time Series data
1. A sequence of values collected at regular intervals of time
Ex: Stock prices, sales data over several years
Text data
1. Unstructured, free-form data. Ex: customer reviews, support-chat transcripts (a combined sketch of all these types follows)
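To make these types concrete, here is a minimal pandas sketch (the column names and values are hypothetical) showing how each type might appear in a DataFrame:

```python
import pandas as pd

# Hypothetical columns illustrating each data type discussed above
df = pd.DataFrame({
    "num_students": [25, 30, 28],                     # numerical, discrete
    "house_cost": [250000.50, 310500.75, 189999.99],  # numerical, continuous
    "fruit": ["Apple", "Orange", "Apple"],            # categorical, nominal
    "movie_rating": ["Poor", "Good", "Great"],        # categorical, ordinal
    "is_member": [True, False, True],                 # categorical, dichotomous
    "date": pd.to_datetime(["2024-07-01", "2024-07-02", "2024-07-03"]),  # time series
    "review": ["Fast shipping", "Great price", "Late delivery"],         # text
})
print(df.dtypes)
```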
2. Data Cleaning
Data cleansing and data scrubbing are other names for data cleaning: finding and fixing mistakes in data to improve its quality.
Let’s see an example with our retail store scenario
o Online Store: Handle missing values and correct errors in customer information.
o Physical Stores: Remove duplicate records and standardize data formats, e.g., replacing placeholders like N/A or "no answer" with NULL.
o Customer Support: Normalize text data and remove irrelevant interactions.
o Social Media: Clean text data and filter out non-customer interactions (see the sketch after this list).
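A minimal pandas sketch of these cleaning steps, assuming a hypothetical in-store purchase table (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical in-store purchase records with common quality problems
stores = pd.DataFrame({
    "customer_id": [101, 101, 102, 103],
    "age": ["34", "34", "N/A", "no answer"],
    "amount": [59.99, 59.99, 24.50, 13.00],
})

stores = stores.drop_duplicates()  # remove exact duplicate records
# Standardize placeholder strings such as N/A and "no answer" to NULL (NaN)
stores["age"] = stores["age"].replace(["N/A", "no answer"], np.nan)
stores["age"] = pd.to_numeric(stores["age"])  # correct the column's type
print(stores)
```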
Benefits of data cleaning:
Reliable Data: Accurate and useful data.
Better Analysis: Leads to better insights and decisions.
Efficiency: Easier data processing and analysis.
Finding Outliers
Finding outliers is a crucial part of the data cleaning process.
It is important in data cleaning because it improves data quality, enhances model accuracy, and helps you understand the data distribution. The steps in handling outliers are:
Identification
Visual methods – use box plots, scatter plots, and histograms to inspect the data visually.
Statistical methods – calculate the Z-score (standard score), z = (x − mean) / standard deviation. It standardizes data points, allowing comparison between different distributions; points whose |z| exceeds a chosen threshold (commonly 2 or 3) are flagged as outliers (see the sketch after this list).
Evaluation
Contextual analysis – determine if the outliers are genuine or due to data entry errors
Impact assessment – assess how outliers affect your analysis
Handling
Removal
Correction
Transformation
Imputation
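As referenced above, a minimal NumPy sketch of Z-score-based identification, using hypothetical daily sales figures:

```python
import numpy as np

# Hypothetical daily sales figures containing one suspicious value
sales = np.array([220.0, 235.0, 210.0, 240.0, 5000.0, 225.0])

# Z-score: z = (x - mean) / standard deviation
z = (sales - sales.mean()) / sales.std()

# Flag points more than 2 standard deviations from the mean
outliers = sales[np.abs(z) > 2]
print(outliers)  # only the 5000.0 entry is flagged
```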
Handling Missing Values
Removing training examples – completely delete the rows that have missing values.
Filling in missing values (manually) – if a customer's age is missing and you have access to the correct information, you enter it yourself.
Using a standard value to replace the missing value – if the product price is missing in some rows, you might fill it with $0 or the average price of a similar product.
Using a measure of central tendency (mean, median, mode)
Mean – fill with the average of all the values in the column.
Median – ex: if a house's square footage is missing, use the median square footage from the dataset.
Mode – if the favorite-color column has missing values, fill in the most frequent color in the dataset. (These options are illustrated in the sketch below.)
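A minimal pandas sketch of these options on a hypothetical dataset: dropping rows, then mean, median, and mode imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing entries
df = pd.DataFrame({
    "age": [34, np.nan, 29, 41],
    "sqft": [1200, 1500, np.nan, 1350],
    "favorite_color": ["Blue", "Red", None, "Blue"],
})

dropped = df.dropna()  # remove every row that has a missing value
df["age"] = df["age"].fillna(df["age"].mean())       # mean imputation
df["sqft"] = df["sqft"].fillna(df["sqft"].median())  # median imputation
# Mode imputation: fill with the most frequent category
df["favorite_color"] = df["favorite_color"].fillna(df["favorite_color"].mode()[0])
print(df)
```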
3. Data Integration
Combine and consolidate data from multiple data sources into a unified (clear, consistent) form. It helps make decisions better and faster by providing complete and accurate data.
Let's look at how a retail store such as Walmart or Target might integrate its data:
o Unique Customer Identifier: Use unique identifiers like email addresses or customer IDs to merge records.
o Merge Operations: Combine datasets using database joins (e.g., inner join, outer join) to create a unified customer profile.
o Conflict Resolution: It means fixing problems when there are conflicting pieces of information in your data.
Example
Problem: You have two addresses for the same person, and you're not sure which one is correct.
Solution: Decide which address to keep by choosing the most recent or most reliable one (see the sketch below).
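A minimal pandas sketch of the merge and conflict-resolution steps, assuming hypothetical online and in-store tables keyed by customer_id, each carrying an "updated" timestamp:

```python
import pandas as pd

# Hypothetical records from two sources for the same customers
online = pd.DataFrame({
    "customer_id": [1, 2],
    "address": ["12 Oak St", "9 Elm St"],
    "updated": pd.to_datetime(["2024-05-01", "2024-06-15"]),
})
in_store = pd.DataFrame({
    "customer_id": [1, 3],
    "address": ["45 Pine Ave", "3 Birch Rd"],
    "updated": pd.to_datetime(["2024-07-01", "2024-04-20"]),
})

# Outer join keeps customers that appear in either source
merged = online.merge(in_store, on="customer_id", how="outer",
                      suffixes=("_online", "_store"))

# Conflict resolution: keep the address with the most recent update
def pick_address(row):
    if pd.isna(row["updated_store"]):
        return row["address_online"]
    if pd.isna(row["updated_online"]) or row["updated_store"] > row["updated_online"]:
        return row["address_store"]
    return row["address_online"]

merged["address"] = merged.apply(pick_address, axis=1)
print(merged[["customer_id", "address"]])
```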
4. Data Transformation
o Normalize and standardize data formats (e.g., dates, addresses).
o Encode categorical variables (e.g., customer feedback categories).
Standardize Date Formats
Original: Mixed date formats (2024/07/01 and 2024-07-01)
Transformed: Uniform date format (2024-07-01)
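A minimal pandas sketch of this standardization (format="mixed" assumes pandas 2.0 or later):

```python
import pandas as pd

# Mixed date formats as they might arrive from different sources
dates = pd.Series(["2024/07/01", "2024-07-01"])

# Parse each value, then emit a uniform YYYY-MM-DD string
uniform = pd.to_datetime(dates, format="mixed").dt.strftime("%Y-%m-%d")
print(uniform.tolist())  # ['2024-07-01', '2024-07-01']
```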
Handling categorical data
Label encoding - Convert each category into a unique integer. Ex: red = 0, black = 1
One hot encoding - Convert each category into binary columns (0 or 1).
Binary encoding - Convert categories into binary digits and then split these into separate columns.
Frequency encoding - replace categories with their frequency in the dataset.
For a "Color" feature in a dataset, if
Red appears 50 times
Green appears 30 times
Blue appears 20 times then the value of
Red → 50
Green → 30
Blue → 20
Ordinal encoding – assign a rank or order to categories.
Example: education level categorized as elementary, middle school, and high school.
Then elementary is ranked as 1, middle school as 2, and high school as 3. (A combined sketch of these encodings follows.)
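A combined pandas sketch of these encodings, using hypothetical color and education-level data:

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Green", "Blue", "Red", "Red", "Green"]})

# Label encoding: map each category to a unique integer
df["color_label"] = df["color"].astype("category").cat.codes

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Frequency encoding: replace each category with its count in the data
df["color_freq"] = df["color"].map(df["color"].value_counts())

# Ordinal encoding: assign an explicit rank to ordered categories
edu = pd.Series(["elementary", "high school", "middle school"])
rank = {"elementary": 1, "middle school": 2, "high school": 3}
print(edu.map(rank).tolist())  # [1, 3, 2]
```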
5. Data Reduction:
· Reduce the volume of data while preserving its integrity and usefulness for analysis.
· A retail store collects large amounts of transaction data. To analyze trends and improve decision-making, the store needs to reduce the data volume.
It helps to reduce storage requirements and computational costs.
Computational Cost refers to the resources required to perform a computational task. This includes:
· Time: How long it takes to complete a task.
· Memory: How much memory (RAM) is used during the task.
· Processing Power: The amount of CPU or GPU power required.
Data reduction techniques:
· Sampling: Instead of analyzing every transaction, take a sample of 10% of transactions to analyze sales trends.
· Aggregation: Summarize data into broader categories, e.g., aggregating daily sales data into monthly totals.
· Dimensionality Reduction: Reduce the number of variables.
· Data Pruning: Remove less important or redundant data.
Example: Exclude data on discontinued products or low-volume items.
· Feature Selection: Select only the most relevant features for analysis.
Example: Focus on features like Sales Amount, Customer Age, and Purchase Frequency, while excluding less relevant ones (a short sketch of these techniques follows).
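A short pandas sketch of sampling, aggregation, and feature selection on a hypothetical transaction log:

```python
import pandas as pd

# Hypothetical transaction log: one row per sale
tx = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=120, freq="D"),
    "sales_amount": range(120),
    "cashier_note": ["n/a"] * 120,  # a less relevant column
})

# Sampling: analyze a 10% random sample instead of every transaction
sample = tx.sample(frac=0.10, random_state=42)

# Aggregation: roll daily sales up into monthly totals
monthly = tx.groupby(tx["date"].dt.to_period("M"))["sales_amount"].sum()

# Feature selection: keep only the columns relevant to the analysis
relevant = tx[["date", "sales_amount"]]
print(len(sample), monthly.head(2), sep="\n")
```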
Data Preprocessing Tools:
Pandas
NumPy
R
Python
Excel
Tableau
Power BI
Conclusion:
Proper data preprocessing leads to better model performance and more accurate insights.
It's a continuous process that requires domain knowledge and iterative refinement.