
Data Preprocessing

  • The process of identifying and fixing incorrect, incomplete or missing, and inappropriate data.

  • It is an important step in data mining.

  • It makes the data easier to analyze and leads to more accurate results.

  • Real-world data is raw: it contains noise, inconsistencies, missing values, and duplicate tuples.

         

 

Steps in data preprocessing

  • Data Collection

  • Data Cleaning                        

  • Data Integration

  • Data Transformation

  • Data Reduction

           

1.    Data Collection

                  The process of gathering data from relevant sources so it can be analyzed and used for decision-making, such as collecting data from a retail store.

Imagine a retail company that wants to create a comprehensive customer profile by integrating data from various sources, including:

  • Online Store: Data from e-commerce transactions.

  • Physical Stores: Data from in-store purchases.

  • Customer Support: Interaction logs and feedback from customer support.

  • Social Media: Customer interactions and sentiment from social media platforms.


Types of Data:

  • Numerical data

1.    Discrete - values that can only take whole numbers. Ex: number of students

2.    Continuous - values that can take any number within a range. Ex: cost of a house (a decimal value)

  • Categorical data

1.    Nominal - categories that have no natural order or ranking; they are simply labels or names. Ex: apples, oranges

2.    Ordinal - categories that have a meaningful order or ranking, but the differences between categories are not necessarily equal. Ex: movie ratings

3.    Dichotomous - binary data; variables that have only two categories. Ex: true or false, yes or no.

  • Time Series data

1.    A sequence of values collected at regular time intervals

Ex: Stock price, Sales data over certain years

  • Text   

2.    Data Cleaning

  • Data cleansing and data scrubbing both refer to data cleaning: finding and fixing mistakes in data to improve its quality.

Let’s see an example with our retail store scenario

o   Online Store: Handle missing values and correct errors in customer information.

o   Physical Stores: Remove duplicate records and standardize data formats (e.g., replace placeholder values such as "N/A" or "no answer" with NULL).

o   Customer Support: Normalize text data and remove irrelevant interactions.

o   Social Media: Clean text data and filter out non-customer interactions.
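The placeholder-standardization and duplicate-removal steps above can be sketched in pandas. This is a minimal example with made-up in-store records; the column names are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical in-store purchase records with inconsistent placeholders.
df = pd.DataFrame({
    "customer": ["Ana", "Ben", "Ben", "Cara"],
    "payment":  ["card", "N/A", "N/A", "no answer"],
})

# Standardize placeholders: map "N/A" and "no answer" to one missing marker.
df["payment"] = df["payment"].replace(["N/A", "no answer"], np.nan)

# Remove duplicate records.
df = df.drop_duplicates().reset_index(drop=True)
```

After standardizing the placeholders, the two "Ben" rows become identical, so `drop_duplicates` collapses them into one record.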

Benefits of data cleaning:

  • Reliable Data: Accurate and useful data.

  • Better Analysis: Leads to better insights and decisions.

  • Efficiency: Easier data processing and analysis.

Finding Outliers

Detecting outliers is a crucial part of the data cleaning process. It matters because it improves data quality, enhances model accuracy, and helps you understand the data distribution. The steps in handling outliers are:

Identification

  • Visual methods – use box plots, scatter plots, and histograms to spot extreme values

  • Statistical methods – calculate the Z-score (standard score), which standardizes data points and allows comparison across different distributions; points with a large absolute Z-score are flagged as outliers

Evaluation

  • Contextual analysis – determine whether the outliers are genuine or due to data entry errors

  • Impact assessment – assess how outliers affect your analysis

Handling

  • Removal

  • Correction

  • Transformation

  • Imputation
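The identification and removal steps above can be sketched with a Z-score cutoff. The sales figures here are made up, and the threshold of 2 standard deviations is an illustrative choice (3 is also common):

```python
import numpy as np

# Hypothetical daily sales figures with one obvious outlier.
sales = np.array([120.0, 130.0, 125.0, 128.0, 900.0, 122.0])

# Z-score: how many standard deviations each point lies from the mean.
z = (sales - sales.mean()) / sales.std()

# Identification: flag points more than 2 standard deviations from the mean.
outliers = sales[np.abs(z) > 2]

# Handling by removal: keep only the non-outlier points.
cleaned = sales[np.abs(z) <= 2]
```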

Handling Missing Values 

Removing training examples – delete the rows that contain missing values.

Filling in missing values (manually) – if a customer's age is missing and you know the correct value, enter it yourself.

Using a standard value to replace the missing value – if the product price is missing in some rows, fill it with $0 or the average price of a similar product.

Using central tendency(mean, median, mode)

Mean – the average of all the values.

Median – ex: if a house's square footage is missing, use the median square footage from the dataset.

Mode – ex: if the favorite-color column has missing values, fill in the most common color from the dataset.
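A small pandas sketch of median and mode imputation, using hypothetical housing data:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing sqft and one missing color.
df = pd.DataFrame({
    "sqft":  [1200.0, 1500.0, np.nan, 1400.0],
    "color": ["red", "blue", "red", np.nan],
})

# Median imputation for a numeric column.
df["sqft"] = df["sqft"].fillna(df["sqft"].median())

# Mode imputation for a categorical column (mode() can return ties,
# so take the first value).
df["color"] = df["color"].fillna(df["color"].mode()[0])
```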

3.    Data Integration

Combine and consolidate data from multiple sources into a unified (clear, consistent) form. This helps organizations make better and faster decisions by providing complete, accurate data.

Let’s look at our retail store example, such as Walmart or Target:

o   Unique Customer Identifier: Use unique identifiers like email addresses or customer IDs to merge records.

o   Merge Operations: Combine datasets using database joins (e.g., inner join, outer join) to create a unified customer profile.

o   Conflict Resolution: Fixing problems when different sources contain conflicting pieces of information.

Example

Problem: You have two addresses for the same person, and you're not sure which one is correct.

Solution: Decide which address to keep by choosing the most recent or most reliable one.
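The merge step can be sketched in pandas with a join on the unique customer identifier. The two tables and their IDs are hypothetical:

```python
import pandas as pd

# Hypothetical records from two sources, keyed by customer ID.
online = pd.DataFrame({
    "customer_id": [1, 2],
    "email": ["ana@example.com", "ben@example.com"],
})
stores = pd.DataFrame({
    "customer_id": [1, 3],
    "last_purchase": ["2024-06-01", "2024-07-15"],
})

# Outer join keeps customers that appear in either source;
# an inner join would keep only customers present in both.
profile = online.merge(stores, on="customer_id", how="outer")
```

Customer 1 appears in both sources and gets a complete row; customers 2 and 3 keep missing values in the columns their source did not provide.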

4.    Data Transformation

o   Normalize and standardize data formats (e.g., dates, addresses).

o   Encode categorical variables (e.g., customer feedback categories).

      Standardize Date Formats

  • Original: Mixed date formats (2024/07/01 and 2024-07-01)

  • Transformed: Uniform date format (2024-07-01)
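The date standardization above can be sketched by parsing each entry individually (so mixed formats are handled) and re-emitting one uniform format. The two values are the example dates from the list:

```python
import pandas as pd

# Mixed date formats from different sources.
dates = pd.Series(["2024/07/01", "2024-07-01"])

# Parse each string, then re-emit it in one uniform format.
uniform = dates.map(lambda s: pd.to_datetime(s).strftime("%Y-%m-%d"))
```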

          

Handling categorical data

Label encoding - Convert each category into a unique integer. Ex: red = 0, black = 1

One hot encoding - Convert each category into binary columns (0 or 1).

Binary encoding - Convert categories into binary digits and then split these into separate columns.

Frequency encoding - replace categories with their frequency in the dataset.

For a "Color" feature in a dataset, if

  • Red appears 50 times

  • Green appears 30 times

  • Blue appears 20 times then the value of

  • Red → 50

  • Green → 30

  • Blue → 20

Ordinal encoding – Assign a rank or order to categories.

Example: education level is categorized as elementary, middle school, and high school.

Then elementary is ranked as 1, middle school as 2, and high school as 3.
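The encoding schemes above can be sketched in pandas on a hypothetical "Color" feature (the counts match the frequency-encoding example, and the education ranks match the ordinal example):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "red", "blue", "red"]})

# Label encoding: map each category to a unique integer.
labels = {c: i for i, c in enumerate(sorted(df["color"].unique()))}
df["color_label"] = df["color"].map(labels)

# One-hot encoding: one binary column per category.
onehot = pd.get_dummies(df["color"], prefix="color")

# Frequency encoding: replace each category with its count in the dataset.
df["color_freq"] = df["color"].map(df["color"].value_counts())

# Ordinal encoding: assign an explicit rank to ordered categories.
edu_rank = {"elementary": 1, "middle school": 2, "high school": 3}
education = pd.Series(["high school", "elementary"]).map(edu_rank)
```

Label encoding imposes an arbitrary numeric order, so one-hot encoding is usually preferred for nominal data, while ordinal encoding is reserved for categories with a genuine ranking.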

5.    Data Reduction:

·      Reduce the volume of data while preserving its integrity and usefulness for analysis.

·      A retail store collects large amounts of transaction data. To analyze trends and improve decision-making, the store needs to reduce the data volume.

  •  It helps to reduce storage requirements and computational costs.

  • Computational Cost refers to the resources required to perform a computational task. This includes:

·       Time: How long it takes to complete a task.

·       Memory: How much memory (RAM) is used during the task.

·       Processing Power: The amount of CPU or GPU power required.

Data reduction techniques:

·      Sampling: Instead of analyzing every transaction, take a sample of 10% of transactions to analyze sales trends.

·      Aggregation: Summarize data into broader categories, such as aggregating daily sales data into monthly totals.

·      Dimensionality Reduction: Reduce the number of variables while retaining most of the information (e.g., with principal component analysis).

·      Data Pruning: Remove less important or redundant data.

 Example: Exclude data on discontinued products or low-volume items.

·      Feature Selection: Select only the most relevant features for analysis.

Example: Focus on features like Sales Amount, Customer Age, and Purchase Frequency, while excluding less relevant ones.
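Sampling and aggregation can be sketched in pandas on a hypothetical two-month run of daily transactions; the 10% fraction matches the sampling example above:

```python
import pandas as pd

# Hypothetical daily transaction data for two months.
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=60, freq="D"),
    "sales": range(60),
})

# Sampling: analyze a 10% random sample instead of every transaction.
sample = daily.sample(frac=0.10, random_state=42)

# Aggregation: roll daily sales up into monthly totals
# ("MS" bins by month, labelled at the month start).
monthly = daily.set_index("date")["sales"].resample("MS").sum()
```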

Data Preprocessing Tools:

  • Pandas

  • NumPy

  • R

  • Python

  • Excel 

  • Tableau

  • Power BI

Conclusion:

  • Proper data preprocessing leads to better model performance and more accurate insights.

  • It's a continuous process that requires domain knowledge and iterative refinement.
