Data Preparation - Cleansing, integrating, and transforming data

naveen

Moderator
Data Retrieval Phase and Modeling

  • Data from the retrieval phase is often a "diamond in the rough."
  • Sanitizing and preparing the data improves model performance and reduces the time spent correcting strange output.
  • Models often require data in a specific format, so data transformation is usually necessary.
  • Correcting data errors as early as possible is recommended.
  • Early correction is not always possible in realistic settings, so corrective actions in your own program may be necessary.
  • The figure below shows common actions taken during data cleansing, integration, and transformation.




1. Data Cleaning

Data cleansing is a subprocess of the data science process that focuses on removing errors in your data, so that the data becomes a true and consistent representation of the processes it originates from.

1.1. Data Entry Errors Overview

  • Data collection and entry are error-prone processes requiring human intervention.
  • Human errors can include typos or loss of concentration.
  • Data collected by machines is not error-free either: some errors come from human sloppiness, others from machine or hardware failure.
  • Examples of machine-related errors include transmission errors and bugs in the extract, transform, and load (ETL) phase.
  • For small data sets, every value can be checked by hand.
  • For larger data sets, errors can be detected by tabulating the data with counts.
  • For a variable that should take only two values, a frequency table shows whether those two values are truly the only ones present.
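
The frequency-table check described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the column `quality`, its allowed values "Good" and "Bad", and the helper names are all hypothetical.

```python
from collections import Counter

# Hypothetical column that should contain only "Good" or "Bad".
quality = ["Good", "Bad", "Good", "Godo", "Bad", "Good", "bad", "Good"]


def frequency_table(values):
    """Return (value, count) pairs, most frequent first."""
    return Counter(values).most_common()


def unexpected_values(values, allowed):
    """Return the values outside the allowed set, with their counts."""
    return {v: n for v, n in Counter(values).items() if v not in allowed}


# Print the table; rare values are candidates for data entry errors.
for value, count in frequency_table(quality):
    print(f"{value:<6}{count}")

print(unexpected_values(quality, {"Good", "Bad"}))
```

Values that appear only once or twice, such as a misspelling or a different capitalization, stand out immediately in such a table.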

1.2. Outliers in Data Analysis

  • Outliers are observations that seem distant from others or follow a different logic or generative process.
  • The easiest way to find outliers is a plot or a table with the minimum and maximum values.
  • For example, when a normal (Gaussian) distribution is expected, a plot that shows a few unusually high values points to possible outliers.
  • Outliers can significantly influence data modeling, so it's crucial to investigate them first.
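
As a rough sketch of the table-based check, the snippet below prints the minimum and maximum of a made-up sample and then flags values with the common 1.5 × IQR rule (Tukey's fences). The rule and the data are assumptions chosen for illustration; the text above does not prescribe a specific test.

```python
from statistics import quantiles

# Hypothetical measurements; one value was entered incorrectly.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 25.0]


def find_outliers(values, k=1.5):
    """Flag values more than k * IQR outside the first/third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# A quick min/max summary often reveals the problem already.
print("min:", min(data), "max:", max(data))
print("flagged:", find_outliers(data))
```

A flagged value is only a candidate: it still has to be investigated before it is corrected or removed.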

1.3. Dealing with Missing Values in Data Science

  • Missing values aren't always wrong but need separate handling.
  • They may indicate data collection errors or ETL process errors.
  • Common techniques used by data scientists are listed in table 2.4.
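
Two widely used options for missing values are omitting the affected observations and imputing a static value such as the mean. The sketch below shows both on a hypothetical `ages` column where `None` marks a missing value; it is an illustration under those assumptions and is not taken from the table mentioned above.

```python
from statistics import mean

# Hypothetical column; None marks a missing value.
ages = [34, None, 29, 41, None, 38]


def drop_missing(values):
    """Omit missing values (simple, but loses observations)."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """Replace missing values with the mean of the observed ones."""
    fill = mean(drop_missing(values))
    return [fill if v is None else v for v in values]


print(drop_missing(ages))
print(impute_mean(ages))
```

Omitting keeps only real measurements but shrinks the data set, while imputing keeps every row at the cost of introducing estimated values.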


2. Transforming Data for Data Modeling

  • Data cleansing and integration are crucial for data modeling.
  • After cleansing and integration, the data is transformed into a form suitable for modeling.
  • Relationships between input and output variables are not always linear; for a relationship such as y = a*e^(b*x), a log transform turns it into a linear one and simplifies the estimation problem.
  • Combining two variables into a new variable is another common transformation.
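
To illustrate why a log transform helps, the sketch below generates noise-free data from y = a*e^(b*x), takes the log of y so that log(y) = log(a) + b*x, and fits a straight line with ordinary least squares. The values of `a_true` and `b_true` and the helper `fit_line` are assumptions made for this example.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical exponential relationship y = a * exp(b * x).
a_true, b_true = 2.0, 0.5
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [a_true * math.exp(b_true * x) for x in xs]

# After the log transform the relationship is linear in x.
log_ys = [math.log(y) for y in ys]
intercept, slope = fit_line(xs, log_ys)

# Undo the transform to recover the original parameters.
a_hat, b_hat = math.exp(intercept), slope
print(a_hat, b_hat)
```

With real, noisy data the recovered parameters would only approximate the true ones, but the estimation remains a simple linear fit.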

Reducing Variables in Models

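  • Too many variables make a model difficult to handle.
  • Some techniques do not perform well when overloaded with input variables; techniques based on Euclidean distance, for example, perform well only up to about 10 variables.
  • Variables that add no new information to the model are candidates for removal.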
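
One simple way to spot variables that add little new information is to drop any variable that is almost perfectly correlated with one already kept. This is a swapped-in heuristic chosen for brevity, not the method of the original text; the column names, data, and the 0.95 threshold are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def drop_redundant(columns, threshold=0.95):
    """Keep a column only if it is not strongly correlated with a kept one."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold
               for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical data: height is recorded twice in different units.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [31.0, 22.0, 45.0, 27.0, 38.0],
}

print(list(drop_redundant(columns)))
```

Here `height_in` is dropped because it carries the same information as `height_cm`, while `age` is kept.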

Turning Variables into Dummies in Data Science

  • Variables can be transformed into dummy variables, which can only take two values: true(1) or false(0).
  • Dummy variables indicate the presence or absence of a categorical effect that may explain an observation.
  • The classes stored in one variable are split into separate columns, with 1 indicating that the class is present and 0 otherwise.
  • Example: turn a Weekdays variable into the columns Monday through Sunday; the Monday column holds 1 if the observation fell on a Monday and 0 otherwise.
  • This technique is popular in modeling and is popular with, but not exclusive to, economists.
  • The next step is to transform and integrate data into usable input for the modeling phase.
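
The weekday example from the dummy-variable list above can be sketched as follows. The helper `to_dummies` and the sample observations are hypothetical and only illustrate the idea.

```python
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def to_dummies(values, categories):
    """Turn each value into a row of 0/1 columns, one per category."""
    return [{c: int(v == c) for c in categories} for v in values]


# Hypothetical weekday of each observation.
observations = ["Monday", "Wednesday", "Monday", "Sunday"]
rows = to_dummies(observations, WEEKDAYS)

# Each row has exactly one column set to 1.
for day, row in zip(observations, rows):
    print(day, [row[c] for c in WEEKDAYS])
```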

3. Data Combination from Different Sources

  • Data sources include databases, Excel files, text documents, etc.
  • The focus here is on the data science process rather than on presenting a scenario for every type of data.
  • Other data sources like key-value stores and document stores will be discussed in later sections.

Different Ways of Combining Data

  1. Joining: enriches an observation from one table with information from another table.
  2. Appending or stacking: adds the observations of one table to those of another table.
  3. Either operation can produce a new physical table or a virtual table (a view).
  4. Views consume less disk space than physical tables because they do not store a copy of the data.
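
The combining operations above can be tried out with Python's built-in `sqlite3` module. In this sketch, appending is expressed as a view built with `UNION ALL`, and joining enriches each purchase with the client's region. All table names, column names, and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clients (client_id INTEGER, region TEXT);
    CREATE TABLE purchases_jan (client_id INTEGER, amount REAL);
    CREATE TABLE purchases_feb (client_id INTEGER, amount REAL);

    INSERT INTO clients VALUES (1, 'North'), (2, 'South');
    INSERT INTO purchases_jan VALUES (1, 20.0), (2, 35.0);
    INSERT INTO purchases_feb VALUES (1, 15.0);

    -- Appending (stacking) as a view: no second copy of the rows is stored.
    CREATE VIEW purchases_all AS
        SELECT client_id, amount FROM purchases_jan
        UNION ALL
        SELECT client_id, amount FROM purchases_feb;
""")

# Joining: enrich every purchase with information from the clients table.
joined = conn.execute("""
    SELECT p.client_id, c.region, p.amount
    FROM purchases_all AS p
    JOIN clients AS c ON c.client_id = p.client_id
    ORDER BY p.client_id, p.amount
""").fetchall()

for row in joined:
    print(row)
```

Because `purchases_all` is a view, it is recomputed whenever it is queried, which saves disk space at the cost of extra work per query.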
 