A comprehensive guide to data types in data science - part 1

Data science commonly works with four primary data types, each with distinct properties and processing techniques. Understanding these types helps you choose appropriate analysis, visualization, and modeling techniques.


Structured data

Structured data is usually arranged in a table with rows and columns. This format makes the data simple to store, search, and analyze.

Characteristics

The data is often numeric or categorical.
It is typically organized in relational databases or spreadsheets.

Types

Relational databases: Data in SQL tables, such as customer information with fields like ID, name, age, and purchase history.
CSV files: Sales data stored in a CSV file with structured columns (e.g., date, product, quantity).
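
As a rough illustration of how the two forms relate, the sketch below reads a small CSV of sales data and loads it into an in-memory SQL table using only Python's standard library. The rows, the table name `sales`, and the product names are made up for this example; only the column names (date, product, quantity) come from the text above.

```python
import csv
import io
import sqlite3

# Hypothetical sales data; in practice this would come from a .csv file.
SALES_CSV = """date,product,quantity
2024-01-05,widget,3
2024-01-05,gadget,1
2024-01-06,widget,2
"""

# Parse the CSV text into dictionaries keyed by column name,
# converting quantity from text to an integer.
rows = [
    {"date": r["date"], "product": r["product"], "quantity": int(r["quantity"])}
    for r in csv.DictReader(io.StringIO(SALES_CSV))
]

# Load the same records into a relational table and query them with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, product TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (:date, :product, :quantity)", rows
)
totals = conn.execute(
    "SELECT product, SUM(quantity) FROM sales GROUP BY product ORDER BY product"
).fetchall()
conn.close()

print(totals)
```

Because every row has the same named, typed fields, the same records move between a CSV file and a SQL table without any restructuring.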

Common uses

Structured data is ideal for traditional database queries, reporting, and statistical analysis. It’s also widely used in machine learning algorithms requiring clearly defined features.

Real-world usage

Structured data is foundational in industries where information needs to be easily stored, queried, and analyzed. Here are some examples:

Finance and banking: Structured data is used in customer databases, transaction records, and financial reporting. Banks use structured data to assess loan applications by analyzing credit scores, income, and account histories stored in structured formats.
Retail and e-commerce: Retailers use structured data such as product databases, sales records, and inventory lists to track stock, analyze purchase trends, and personalize customer recommendations. Customer purchase histories stored in tables help e-commerce platforms generate targeted marketing.
Healthcare: Patient records, billing information, and appointment schedules are stored as structured data. Hospitals use structured data in Electronic Health Records (EHRs) to manage patient information, diagnoses, and treatment plans. Structured data is crucial for regulatory reporting and clinical research.

Sample structured data

Here is a small structured dataset of five house-price records. It includes numerical and categorical features of the kind typically used for predictive modeling in data science.

Price	Square_Feet	Bedrooms	Year_Built	Location
320,000	2000	3	1995	Suburb
450,000	2500	4	2010	City Center
280,000	1500	2	1980	Outskirts
375,000	2200	3	2005	Suburb
510,000	3000	4	2015	City Center
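
A minimal sketch of how this table could be read into Python follows, assuming the table is available as tab-separated text. The helper name `load_houses` is hypothetical. Numeric columns are converted to integers (the thousands separators in Price are stripped first), while Location stays a categorical string.

```python
import csv
import io
from collections import defaultdict

# The sample table above, written as tab-separated text.
RAW = (
    "Price\tSquare_Feet\tBedrooms\tYear_Built\tLocation\n"
    "320,000\t2000\t3\t1995\tSuburb\n"
    "450,000\t2500\t4\t2010\tCity Center\n"
    "280,000\t1500\t2\t1980\tOutskirts\n"
    "375,000\t2200\t3\t2005\tSuburb\n"
    "510,000\t3000\t4\t2015\tCity Center\n"
)


def load_houses(text):
    """Parse the tab-separated table into a list of typed records."""
    houses = []
    for rec in csv.DictReader(io.StringIO(text), delimiter="\t"):
        houses.append({
            # Numeric features: remove thousands separators, then convert.
            "Price": int(rec["Price"].replace(",", "")),
            "Square_Feet": int(rec["Square_Feet"]),
            "Bedrooms": int(rec["Bedrooms"]),
            "Year_Built": int(rec["Year_Built"]),
            # Categorical feature: kept as a string label.
            "Location": rec["Location"],
        })
    return houses


houses = load_houses(RAW)

# Group prices by the categorical column and average each group.
prices_by_location = defaultdict(list)
for house in houses:
    prices_by_location[house["Location"]].append(house["Price"])

mean_price_by_location = {
    location: sum(prices) / len(prices)
    for location, prices in prices_by_location.items()
}
print(mean_price_by_location)
```

Grouping a numeric column by a categorical one, as done here, is a typical first step when exploring structured data.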

Applicable techniques

Due to its organized, tabular format, structured data is ideal for classical machine learning algorithms. These techniques focus on extracting insights, identifying patterns, and making predictions.

Machine learning algorithms:

Linear regression: Used for predictive modeling in regression tasks, such as predicting house prices or stock returns.
Logistic regression: Applied for binary classification tasks like churn prediction or fraud detection.
Decision trees and random forests: Effective for classification and regression tasks, practical in applications like loan eligibility prediction or customer segmentation.
Support vector machines (SVM): Often used in classification tasks with structured data, such as identifying high-risk patients based on medical records.
K-nearest neighbors (KNN): Used for recommendation systems and simple classification tasks.
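
To make the first item concrete, here is a minimal sketch of simple linear regression fitted by ordinary least squares on the five sample house records, predicting Price from Square_Feet alone. It is written from scratch for illustration; the function names `fit_line` and `predict` are hypothetical, and a real project would more likely use a library and more than one feature.

```python
# Square_Feet and Price columns from the sample house-price table.
square_feet = [2000, 2500, 1500, 2200, 3000]
price = [320000, 450000, 280000, 375000, 510000]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predict y for a new x using the fitted line."""
    return intercept + slope * x


slope, intercept = fit_line(square_feet, price)
estimate = predict(slope, intercept, 2400)  # a hypothetical 2,400 sq ft house

print(f"price = {intercept:.0f} + {slope:.2f} * square_feet")
print(f"estimated price for 2400 sq ft: {estimate:.0f}")
```

With only five records the fitted line is an illustration of the mechanics, not a usable pricing model.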

Deep learning algorithms:

Feedforward neural networks (FNNs): Can be applied to structured data for complex regression and classification tasks, especially when the dataset is large.

Other techniques:

Data visualization: Techniques like scatter plots, heatmaps, and histograms are essential in exploratory data analysis (EDA) for uncovering patterns in structured data.
Statistical analysis: Hypothesis testing and correlation analysis help quantify the relationships between variables.
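
As one example of correlation analysis, the sketch below computes the Pearson correlation coefficient between Price and two numeric features from the sample house-price table. The function name `pearson` is hypothetical and the calculation is written out by hand for clarity.

```python
import math

# Columns from the sample house-price table.
price = [320000, 450000, 280000, 375000, 510000]
square_feet = [2000, 2500, 1500, 2200, 3000]
year_built = [1995, 2010, 1980, 2005, 2015]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r_size = pearson(square_feet, price)
r_year = pearson(year_built, price)

print(f"Square_Feet vs Price: r = {r_size:.3f}")
print(f"Year_Built vs Price:  r = {r_year:.3f}")
```

Values close to 1 indicate a strong positive linear relationship. With five records the coefficients are only indicative, and a hypothesis test would be needed before drawing conclusions.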

Innovaty Hub
