Semi-structured data
Semi-structured data does not conform to a strict structure but includes organizational elements like tags or markers that make it easier to parse.
Characteristics
- It is more flexible than structured data, yet more organized than fully unstructured data.
- Often stored in formats that support hierarchical or nested structures.
Types
- JSON and XML Files: Data exchanged in JSON or XML format, such as API responses, with nested fields for different data attributes.
- HTML: Webpage content with HTML tags that provide a structure for text, links, and images.
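As a small sketch of how keys and tags carry the structure in these formats, the following reads the same record from JSON and from XML using only Python's standard library. The record and its field names (`product`, `name`, `tags`) are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, written in both formats for illustration.
json_text = '{"product": {"name": "Lamp", "tags": ["home", "lighting"]}}'
xml_text = (
    "<product><name>Lamp</name>"
    "<tags><tag>home</tag><tag>lighting</tag></tags></product>"
)

# JSON: keys and nesting map directly onto dicts and lists.
record = json.loads(json_text)
json_name = record["product"]["name"]
json_tags = record["product"]["tags"]

# XML: element tags mark the structure; values are the text inside elements.
root = ET.fromstring(xml_text)  # root is the <product> element
xml_name = root.findtext("name")
xml_tags = [tag.text for tag in root.findall("./tags/tag")]

print(json_name, json_tags)
print(xml_name, xml_tags)
```

Neither format required a schema to be declared up front; the structure was recovered from the markers embedded in the data itself.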
Common uses
- Semi-structured data is widely used in data interchange between systems (like API responses) and web scraping. It is also valuable for applications that need structure and flexibility, like document databases (e.g., MongoDB).
Real-world usage
Semi-structured data is widely used in applications where data needs flexibility but still benefits from a level of organization.
- Web and mobile applications: JSON and XML files, often used to exchange data between web services and applications, represent semi-structured data. For example, when booking a flight, JSON data containing information on flight schedules, passenger details, and booking references is sent between the airline’s server and a travel booking app.
- E-commerce: Product catalogs in e-commerce platforms, such as those on Amazon, are semi-structured. Each product may have different attributes (e.g., color, size, material), so using JSON lets the data flexibly accommodate a variety of products without a rigid database schema.
- Email processing: Email data is semi-structured, with fields like “To,” “From,” and “Subject” alongside a free-text body. This structure allows organizations to categorize, filter, and analyze email data for tasks like spam detection, sentiment analysis, and customer service improvements.
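To illustrate the email case, here is a minimal sketch that separates the labeled fields from the free-text body using Python's standard `email` package. The message, addresses, and the `fields` dictionary are hypothetical.

```python
from email import message_from_string

# Hypothetical raw message; the addresses and text are made up.
raw = (
    "From: customer@example.com\n"
    "To: support@example.com\n"
    "Subject: Order arrived damaged\n"
    "\n"
    "The lamp I ordered arrived with a cracked base.\n"
)

msg = message_from_string(raw)

# Header fields are the structured part; the body is free text.
fields = {
    "from": msg["From"],
    "to": msg["To"],
    "subject": msg["Subject"],
    "body": msg.get_payload().strip(),
}
print(fields)
```

Once the fields are separated like this, the headers can feed filtering or routing rules, while the body can go to text analysis such as sentiment scoring.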
Sample semi-structured data
[
  {
    "listing_id": "001",
    "price": 320000,
    "square_feet": 2000,
    "bedrooms": 3,
    "year_built": 1995,
    "location": {
      "city": "Suburb",
      "state": "California"
    },
    "features": ["garage", "backyard", "fireplace"]
  },
  {
    "listing_id": "002",
    "price": 450000,
    "square_feet": 2500,
    "bedrooms": 4,
    "year_built": 2010,
    "location": {
      "city": "City Center",
      "state": "California"
    },
    "features": ["balcony", "gym access", "city view"]
  },
  {
    "listing_id": "003",
    "price": 280000,
    "square_feet": 1500,
    "bedrooms": 2,
    "year_built": 1980,
    "location": {
      "city": "Outskirts",
      "state": "Nevada"
    },
    "features": ["large lot", "quiet neighborhood"]
  }
]
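A sample like the one above is often flattened into table-like rows before analysis. The following is a minimal sketch of that step using two of the listings; the `flatten` helper and the derived columns `feature_count` and `has_garage` are illustrative choices, not a standard.

```python
import json

# Two listings from the sample, abbreviated onto fewer lines.
listings_json = """
[
  {"listing_id": "001", "price": 320000, "square_feet": 2000, "bedrooms": 3,
   "year_built": 1995, "location": {"city": "Suburb", "state": "California"},
   "features": ["garage", "backyard", "fireplace"]},
  {"listing_id": "003", "price": 280000, "square_feet": 1500, "bedrooms": 2,
   "year_built": 1980, "location": {"city": "Outskirts", "state": "Nevada"},
   "features": ["large lot", "quiet neighborhood"]}
]
"""

def flatten(listing):
    """Turn one nested listing into a flat dict usable as a table row."""
    # Copy the scalar fields as they are.
    row = {k: v for k, v in listing.items() if k not in ("location", "features")}
    # Lift the nested location object into top-level columns.
    location = listing.get("location", {})
    row["city"] = location.get("city")
    row["state"] = location.get("state")
    # Summarize the variable-length feature list as fixed columns.
    features = listing.get("features", [])
    row["feature_count"] = len(features)
    row["has_garage"] = "garage" in features
    return row

rows = [flatten(item) for item in json.loads(listings_json)]
for row in rows:
    print(row)
```

Using `.get` with defaults means a listing that lacks `location` or `features` still produces a row, which is the kind of variation semi-structured data tends to have.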
Applicable techniques
Semi-structured data, often found in JSON, XML, and HTML, requires techniques that can handle flexible data structures.
- Data Parsing and Extraction:
  - Regular Expressions and Parsing Libraries: These tools extract specific fields from semi-structured text. For instance, Python’s re module can pull simple patterns out of text, and the BeautifulSoup library can parse HTML.
  - ETL Tools: Extract-Transform-Load (ETL) tools like Apache NiFi and Alteryx streamline ingesting and transforming semi-structured data.
- Machine Learning and Deep Learning:
  - Tree-Based Models: Decision trees and random forests work well once semi-structured data has been converted into a structured, tabular format.
  - Natural Language Processing for Text Data: NLP methods like sentiment analysis and topic modeling can be applied if semi-structured data contains textual information (e.g., JSON with text reviews).
- Graph-Based Techniques:
  - Graph Neural Networks (GNNs): If semi-structured data represents relationships (e.g., a network of interconnected entities), GNNs can learn complex patterns and connections.
  - Knowledge Graphs: Used to model relationships in semi-structured data, such as connections between people, products, or events in e-commerce or social networks.
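As a rough sketch of the parsing and extraction step, the example below pulls links and prices out of a small HTML fragment. To stay dependency-free it uses the standard library's `html.parser` in place of BeautifulSoup; the fragment, the `LinkCollector` class, and the price pattern are all hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical page fragment; the URLs and text are made up.
page = """
<ul>
  <li><a href="/listings/001">3 bed house</a> <span class="price">$320,000</span></li>
  <li><a href="/listings/002">4 bed apartment</a> <span class="price">$450,000</span></li>
</ul>
"""

class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text pieces seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

collector = LinkCollector()
collector.feed(page)

# A regular expression is enough for a simple, regular pattern such as a price.
prices = [int(p.replace(",", "")) for p in re.findall(r"\$([\d,]+)", page)]

print(collector.links)
print(prices)
```

The parser handles the tag structure, while the regular expression is reserved for a flat pattern inside the text; relying on regular expressions alone for nested markup tends to be brittle.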