Four primary data types are commonly used in data science, each with distinct properties and processing techniques. Understanding these types helps you choose appropriate analysis, visualization, and modeling techniques.
Structured data
Structured data is usually arranged in a table with rows and columns. This format makes it simple to store, search, and analyze.
Characteristics
- Data is often numeric or categorical.
- It is typically stored in relational databases or spreadsheets.
Types
- Relational databases: Data in SQL tables, such as customer information with fields like ID, name, age, and purchase history.
- CSV files: Sales data stored in a CSV file with structured columns (e.g., date, product, quantity).
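The two forms above are closely related, since both store records as rows with named columns. As a minimal sketch, the following parses a small CSV of sales records, loads it into an in-memory SQLite table, and runs a SQL aggregation. The sample rows and the `sales` table name are illustrative assumptions, not data from a real system.

```python
import csv
import io
import sqlite3

# Hypothetical sales data in CSV form (columns: date, product, quantity).
CSV_TEXT = """date,product,quantity
2024-01-05,Widget,3
2024-01-05,Gadget,1
2024-01-06,Widget,2
"""

# Parse the CSV into dictionaries keyed by column name.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Load the rows into an in-memory relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, product TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales (date, product, quantity) VALUES (?, ?, ?)",
    [(r["date"], r["product"], int(r["quantity"])) for r in rows],
)

# Aggregate with SQL: total quantity sold per product.
totals = dict(
    conn.execute(
        "SELECT product, SUM(quantity) FROM sales GROUP BY product ORDER BY product"
    ).fetchall()
)
print(totals)
conn.close()
```

Because every row follows the same schema, the same query works unchanged whether the table holds three rows or three million.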
Common uses
- Structured data is ideal for traditional database queries, reporting, and statistical analysis. It’s also widely used in machine learning algorithms requiring clearly defined features.
Real-world usage
Structured data is foundational in industries where information needs to be easily stored, queried, and analyzed. Here are some examples:
- Finance and banking: Structured data is used in customer databases, transaction records, and financial reporting. Banks assess loan applications by analyzing credit scores, income, and account histories stored in structured formats.
- Retail and e-commerce: Product databases, sales records, and inventory lists are structured data that retailers use to track stock, analyze purchase trends, and personalize customer recommendations. Customer purchase histories stored in tables help e-commerce platforms generate targeted marketing.
- Healthcare: Patient records, billing information, and appointment schedules are stored as structured data. Hospitals use structured data in Electronic Health Records (EHRs) to manage patient information, diagnoses, and treatment plans. Structured data is crucial for regulatory reporting and clinical research.
Sample structured data
Here is a sample structured dataset with five records of house prices. It includes the numerical and categorical features typically used for predictive modeling in data science.
| Price | Square_Feet | Bedrooms | Year_Built | Location |
|---|---|---|---|---|
| 320,000 | 2000 | 3 | 1995 | Suburb |
| 450,000 | 2500 | 4 | 2010 | City Center |
| 280,000 | 1500 | 2 | 1980 | Outskirts |
| 375,000 | 2200 | 3 | 2005 | Suburb |
| 510,000 | 3000 | 4 | 2015 | City Center |
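To make the distinction between numerical and categorical features concrete, here is a small sketch that holds the five records above as a list of dictionaries, separates the two kinds of column by value type, and averages the price within each location. The variable names are illustrative choices, and only the standard library is used.

```python
from collections import defaultdict
from statistics import mean

# The five house records from the table above.
houses = [
    {"Price": 320_000, "Square_Feet": 2000, "Bedrooms": 3, "Year_Built": 1995, "Location": "Suburb"},
    {"Price": 450_000, "Square_Feet": 2500, "Bedrooms": 4, "Year_Built": 2010, "Location": "City Center"},
    {"Price": 280_000, "Square_Feet": 1500, "Bedrooms": 2, "Year_Built": 1980, "Location": "Outskirts"},
    {"Price": 375_000, "Square_Feet": 2200, "Bedrooms": 3, "Year_Built": 2005, "Location": "Suburb"},
    {"Price": 510_000, "Square_Feet": 3000, "Bedrooms": 4, "Year_Built": 2015, "Location": "City Center"},
]

# Numeric columns hold numbers; categorical columns hold string labels.
numeric_cols = [k for k, v in houses[0].items() if isinstance(v, (int, float))]
categorical_cols = [k for k, v in houses[0].items() if isinstance(v, str)]

# Group prices by the categorical Location column, then average each group.
prices_by_location = defaultdict(list)
for house in houses:
    prices_by_location[house["Location"]].append(house["Price"])
avg_price = {loc: mean(prices) for loc, prices in prices_by_location.items()}

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
print("average price by location:", avg_price)
```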
Applicable techniques
Due to its organized, tabular format, structured data is ideal for classical machine learning algorithms. These techniques focus on extracting insights, identifying patterns, and making predictions.
Machine learning algorithms:
- Linear regression: Used for predictive modeling in regression tasks, such as predicting house prices or stock returns.
- Logistic regression: Applied for binary classification tasks like churn prediction or fraud detection.
- Decision trees and random forests: Effective for both classification and regression tasks, with practical applications such as loan eligibility prediction or customer segmentation.
- Support vector machines (SVM): Often used in classification tasks with structured data, such as identifying high-risk patients based on medical records.
- K-nearest neighbors (KNN): Used for recommendation systems and simple classification tasks.
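As an illustration of the first item in this list, the sketch below fits a simple linear regression of price on square footage using the closed-form least-squares solution. It uses the five sample house records shown earlier, which is far too few for a meaningful model. The function names are hypothetical, and in practice a library such as scikit-learn would normally be used instead of hand-written code.

```python
# Square footage and price from the sample house-price table.
square_feet = [2000, 2500, 1500, 2200, 3000]
prices = [320_000, 450_000, 280_000, 375_000, 510_000]


def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_simple_linear_regression(square_feet, prices)


def predict_price(sqft):
    """Predict a price from square footage using the fitted line."""
    return slope * sqft + intercept


print(f"slope: {slope:.2f} per square foot, intercept: {intercept:.2f}")
print(f"predicted price for 2,400 sq ft: {predict_price(2400):.0f}")
```

The slope can be read as the estimated change in price for each additional square foot. This kind of interpretability is one reason linear models remain a common first choice for tabular data.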
Deep learning algorithms:
- Feedforward neural networks (FNNs): Can be applied to structured data for complex regression and classification tasks, especially when a large dataset is available.
Other techniques:
- Data visualization: Techniques like scatter plots, heatmaps, and histograms are essential in exploratory data analysis (EDA) for uncovering patterns in structured data.
- Statistical analysis: Hypothesis testing and correlation analysis help clarify the relationships between variables.
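To show what correlation analysis looks like in practice, the following sketch computes the Pearson correlation coefficient between price and two numeric features from the sample house-price table. The helper `pearson_r` is a hypothetical name written out by hand for clarity. With only five records, the resulting values are illustrative rather than statistically reliable.

```python
from math import sqrt

# Columns from the sample house-price table.
square_feet = [2000, 2500, 1500, 2200, 3000]
bedrooms = [3, 4, 2, 3, 4]
prices = [320_000, 450_000, 280_000, 375_000, 510_000]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalized covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r_sqft = pearson_r(square_feet, prices)
r_beds = pearson_r(bedrooms, prices)
print(f"Square_Feet vs Price: r = {r_sqft:.3f}")
print(f"Bedrooms vs Price:    r = {r_beds:.3f}")
```

A value near 1 indicates a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little linear relationship.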