Data Science Steps: Retrieving Required Data
Data can be stored in many forms, ranging from simple text files to tables in a database. The objective now is acquiring all the data you need. This may be difficult, and even if you succeed, data is often like a diamond in the rough: it needs polishing to be of any use to you.
Data Science Project Overview
- Sometimes you need to design a data collection process yourself.
- More often, the company has already collected and stored the data.
- Data the company does not have can often be purchased from third parties.
- Don't hesitate to seek data outside your organization.
- More organizations are making high-quality data freely available for public and commercial use.
1. Start with data stored within the company
Assessing Data Relevance and Quality
- Assess the quality and relevance of available data within the company.
- Companies often have a data maintenance program, so much of the cleaning work may already be done.
- Data can be stored in official repositories such as databases, data marts, data warehouses, and data lakes.
- Databases are primarily for data storage, data warehouses are designed for reading and analyzing data, and data marts are subsets of a warehouse that serve specific business units.
- Data lakes contain data in its raw format, while the data in data warehouses and data marts is preprocessed.
- Data may still exist in Excel files on a domain expert's desktop.
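The storage forms above can be read with very little code. The following is a minimal sketch using only the Python standard library; the sample file contents, the `sales` table, and the helper names `rows_from_csv` and `rows_from_table` are hypothetical and only stand in for a real text file and a real company database.

```python
import csv
import io
import sqlite3


def rows_from_csv(text_stream):
    """Read delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(text_stream))


def rows_from_table(connection, table_name):
    """Read every row of a database table into a list of dicts."""
    # Table names cannot be bound as SQL parameters, so accept plain identifiers only.
    if not table_name.isidentifier():
        raise ValueError(f"unsafe table name: {table_name!r}")
    cursor = connection.execute(f"SELECT * FROM {table_name}")
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Hypothetical sample standing in for a text file on someone's desktop.
csv_text = io.StringIO("customer_id,region\n1,North\n2,South\n")
file_rows = rows_from_csv(csv_text)

# In-memory database standing in for an official repository (assumption).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
connection.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 120.5), (2, 80.0)])
table_rows = rows_from_table(connection, "sales")

print(file_rows)
print(table_rows)
```

Note that every value read from the text file arrives as a string, while the database rows keep their column types. That difference is one reason to check data types right at import.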
Data Management Challenges in Companies
- Data becomes scattered across many places as companies grow.
- Knowledge of the data gets dispersed as people change positions or leave the company.
- Documentation and metadata are not always prioritized.
- Finding lost data may require Sherlock Holmes-like skills.
Data Access Challenges
- Organizations often have policies that give each person access only to the data they need and nothing more.
- These policies create physical and digital barriers, known as "Chinese walls."
- These "walls" are mandatory and well-regulated for customer data in most countries.
- Accessing data can be time-consuming and influenced by company politics.
2. Don’t be afraid to shop around
Data Sharing and its Importance
- Companies like Nielsen and GfK specialize in collecting valuable information.
- Twitter, LinkedIn, and Facebook provide data so that you, in turn, can enrich their services and ecosystem.
- Governments and organizations share their data for free, covering a broad range of topics.
- This data is useful for enriching proprietary data and training data science skills at home.
- Table 2.1 shows a small selection from the growing number of open-data providers.
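Enriching proprietary data with external data usually comes down to joining the two on a shared key. The sketch below illustrates that idea in plain Python; the `enrich` helper, the `region` key, and all figures are hypothetical and not taken from any real provider.

```python
def enrich(records, lookup, key):
    """Left-join each record with extra fields from lookup, matched on key."""
    enriched = []
    for record in records:
        extra = lookup.get(record[key], {})
        # Internal fields win if a field name collides with the external data.
        enriched.append({**extra, **record})
    return enriched


# Hypothetical proprietary data.
sales = [
    {"region": "North", "revenue": 1200},
    {"region": "South", "revenue": 800},
    {"region": "West", "revenue": 500},
]

# Hypothetical open data, keyed by region.
open_population = {
    "North": {"population": 50000},
    "South": {"population": 20000},
}

enriched_sales = enrich(sales, open_population, key="region")
for row in enriched_sales:
    print(row)
```

A left join is used here so that no internal record is lost when the external source has no match: the `West` row simply keeps its original fields.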
3. Do data quality checks now to prevent problems later
Data Science Project Overview
- Expect data correction and cleansing to take a good portion of project time, sometimes up to 80%.
- Data retrieval is the first time you inspect the data in the data science process.
- Most errors at this stage are easy to spot, but carelessness here can cost many hours later solving data issues that could have been prevented during import.
- You investigate the data during the import, data preparation, and exploratory phases; the difference lies in the goal and depth of the investigation.
- During data retrieval, you check whether the data is equal to the data in the source document and whether the data types are right.
- Data preparation involves a more detailed check, aiming to eliminate typos and data entry errors.
- The exploratory phase focuses on learning from the data, examining statistical properties like distributions, correlations, and outliers.
- You will often iterate over these phases: for example, outliers discovered in the exploratory phase can point to data entry errors.
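The checks described above can be sketched in a few lines. This is an illustrative example, not a prescribed method: the helper names and sample rows are hypothetical, and the 1.5 × IQR rule is one common convention for flagging outliers that is assumed here rather than taken from the text. It needs Python 3.8 or later for `statistics.quantiles`.

```python
import statistics


def count_matches(source_row_count, imported_rows):
    """Retrieval check: did we import as many rows as the source reports?"""
    return source_row_count == len(imported_rows)


def type_errors(rows, expected_types):
    """Return (row_index, column, value) for every value that fails conversion."""
    errors = []
    for index, row in enumerate(rows):
        for column, convert in expected_types.items():
            try:
                convert(row[column])
            except (KeyError, TypeError, ValueError):
                errors.append((index, column, row.get(column)))
    return errors


def iqr_outliers(values, factor=1.5):
    """Exploratory check: values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - factor * spread, q3 + factor * spread
    return [value for value in values if value < low or value > high]


# Hypothetical imported rows; the last age has a letter "l" typed for a "1".
imported = [
    {"customer_id": "1", "age": "34"},
    {"customer_id": "2", "age": "41"},
    {"customer_id": "3", "age": "4l"},
]

print(count_matches(3, imported))
print(type_errors(imported, {"customer_id": int, "age": int}))

# Hypothetical cleaned ages; 350 is more likely a data entry error than a real age.
ages = [34, 41, 29, 38, 36, 350, 31, 40]
print(iqr_outliers(ages))
```

The first two functions belong to the retrieval phase, and the third to the exploratory phase. A flagged value such as 350 is a prompt to go back and inspect the source, which is the iteration between phases in practice.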