A comprehensive guide to data types in data science – part 2

Unstructured data

Unstructured data has no predefined schema or fixed format, so it does not fit neatly into rows and columns. That makes it harder to store, process, and analyze than structured data.

Characteristics

  • Data may be text or contain multimedia elements.
  • Typically found in documents, emails, images, and videos.

Types

  • Text documents: Social media posts, emails, and reports where the information is in free-text format.
  • Multimedia: Images, audio, and video files, such as customer service calls or product photos.

Common uses

  • Unstructured data is often analyzed for insights using natural language processing (NLP) for text, computer vision for images, and speech recognition for audio. This data type is important in sentiment analysis, image recognition, and voice assistants.

Real-world usage

Unstructured data is prevalent in fields where insights are drawn from text, images, and other non-tabular data formats.

  • Social media and marketing: Social media platforms collect vast amounts of unstructured data, such as posts, comments, and shared images. This data is analyzed for sentiment, customer feedback, and trends, helping brands understand customer perceptions and shape their marketing strategies. NLP techniques are often applied to analyze text on Twitter, Facebook, or Instagram.
  • Healthcare: Medical images (e.g., X-rays, MRI scans) and doctors’ notes are examples of unstructured data. Radiology departments use computer vision techniques to analyze image data for diagnostic purposes, such as identifying tumors. Additionally, NLP can extract key information from doctors’ notes to support patient care.
  • Legal and compliance: In the legal industry, unstructured data such as contracts, case files, and emails is processed and analyzed for case management, discovery, and compliance. AI-powered tools review these documents to identify clauses, terms, and potential risks, improving the efficiency of legal processes.

Sample unstructured data

Just bought a new house, and I’m in love with it! The neighborhood is amazing, and the view is incredible. #blessed #newhome
Does anyone know a reliable real estate agent in San Francisco? Looking to buy soon. Any recommendations? #househunting
The market prices are insane right now! I can’t believe how much a small apartment costs these days. #realestate #inflation
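
As a rough illustration of what "processing" such posts can mean, the sketch below turns each sample post into a small structured record using only Python's standard library. The field names (`hashtags`, `is_question`, `word_count`) and the helper functions are hypothetical choices made for this example, not a standard schema.

```python
import re
from collections import Counter

# The three sample posts shown above.
posts = [
    "Just bought a new house, and I’m in love with it! The neighborhood is amazing, and the view is incredible. #blessed #newhome",
    "Does anyone know a reliable real estate agent in San Francisco? Looking to buy soon. Any recommendations? #househunting",
    "The market prices are insane right now! I can’t believe how much a small apartment costs these days. #realestate #inflation",
]

# A hashtag is a '#' followed by one or more word characters.
HASHTAG = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the hashtags in a post, lowercased and without the leading '#'."""
    return [tag.lower() for tag in HASHTAG.findall(text)]


def to_record(text):
    """Turn one free-text post into a small structured record (illustrative fields)."""
    return {
        "hashtags": extract_hashtags(text),
        "is_question": "?" in text,          # crude heuristic, assumed for this sketch
        "word_count": len(text.split()),     # whitespace-separated tokens
    }


records = [to_record(p) for p in posts]

# Count how often each hashtag appears across all posts.
tag_counts = Counter(tag for r in records for tag in r["hashtags"])

print(records[0]["hashtags"])  # ['blessed', 'newhome']
```

Even this simple step produces fields that can be stored in a table and aggregated, which is usually the first goal when working with free text.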

Applicable techniques

Text, images, and audio are the most common unstructured data types, and each calls for its own methods: natural language processing (NLP) for text, computer vision for images, and speech recognition and audio processing for audio.

Text data (e.g., social media posts, emails):

  • Natural language processing (NLP):
    • Sentiment analysis: Determines the sentiment (for example positive, negative, or neutral) expressed in text such as customer reviews or social media posts, to gauge how the audience feels.
    • Text classification: Algorithms such as Naive Bayes and support vector machines (SVMs), as well as deep learning models, classify emails (spam vs. non-spam) or categorize news articles.
    • Topic modeling: Techniques such as Latent Dirichlet Allocation (LDA) that discover recurring topics or themes across a collection of documents.
    • Named entity recognition (NER): Identifies named entities such as people, places, and organizations in text so they can be extracted and analyzed.
  • Deep learning for text:
    • Recurrent neural networks (RNNs): Useful for sequential data like text. Variants such as LSTMs and GRUs cope better with long sequences and work well for applications like text generation or sentiment analysis.
    • Transformers (e.g., BERT, GPT): Used in NLP for text classification, question answering, and summarization tasks.
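
To make the text classification idea concrete, here is a minimal sketch of a multinomial Naive Bayes spam filter written from scratch with the standard library. The class name `NaiveBayes`, the whitespace tokenizer, and the six training messages are all invented for illustration. A real project would more likely rely on an established library and far more data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # word frequencies per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, doc_count in self.class_counts.items():
            # Start from the log prior, then add one log likelihood per token.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # ignore words never seen during training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training set, made up for this example.
train_texts = [
    "win a free prize now",
    "free money click now",
    "claim your free prize",
    "meeting agenda for monday",
    "please review the quarterly report",
    "lunch on monday with the team",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("free prize now"))               # spam
print(model.predict("review the report on monday"))  # ham
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the add-one smoothing keeps a word that never appeared in one class from forcing that class's probability to zero.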

Image data (e.g., medical images, product photos):

  • Computer vision:
    • Convolutional neural networks (CNNs): Widely used in image recognition tasks like facial recognition, object detection, and medical image classification.
    • Transfer learning: Fine-tuning pre-trained models like ResNet or VGG for specific tasks with smaller image datasets.
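
The core operation inside a CNN layer can be sketched in a few lines of plain Python. The example below slides a 3x3 filter over a tiny 5x5 "image" and applies a ReLU activation. The image values and the hand-picked vertical-edge kernel are assumptions made for illustration. In a trained network the kernel weights are learned from data, and frameworks implement this far more efficiently.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation.

    This is the operation that CNN layers conventionally call "convolution".
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel with the patch under it and sum the products.
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


def relu(feature_map):
    """Replace negative responses with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


# A 5x5 grayscale "image": dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]

# Hand-picked filter that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = relu(conv2d(image, kernel))
for row in feature_map:
    print(row)  # each row is [0, 3, 3]
```

The output is zero where the window sees only dark pixels and positive wherever it straddles the dark-to-bright boundary, which is how a convolutional filter localizes a visual pattern.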

Audio data (e.g., call recordings, voice commands):

  • Speech recognition and audio processing:
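    • MFCC features and WaveNet: Mel-frequency cepstral coefficients (MFCCs) are compact spectral features extracted from audio and commonly fed to speech-to-text models. WaveNet is a deep neural network that models raw audio waveforms directly; it was introduced for speech synthesis and has also been applied to speech recognition.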
    • Deep learning with RNNs and CNNs: Effective for audio classification, such as identifying emotions in customer service calls.
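
Audio features such as MFCCs start from a short-time spectrum. The sketch below covers only those first stages (take a frame, apply a Hamming window, compute the magnitude spectrum) on a synthetic tone, using a naive DFT from the standard library. A full MFCC pipeline would go on to apply a mel filterbank, a logarithm, and a discrete cosine transform, and would normally come from an audio library. The sample rate, frame size, and function names are assumptions chosen for this example.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)
FRAME_SIZE = 256    # samples per analysis frame (assumed)


def sine_wave(freq_hz, n_samples, sample_rate=SAMPLE_RATE):
    """Generate a synthetic pure tone standing in for recorded audio."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate) for n in range(n_samples)]


def hamming(n_samples):
    """Hamming window, used to reduce spectral leakage at the frame edges."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1)) for n in range(n_samples)]


def magnitude_spectrum(frame):
    """Naive O(N^2) DFT magnitudes for frequency bins 0..N/2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spectrum.append(abs(coeff))
    return spectrum


def dominant_frequency(frame, sample_rate=SAMPLE_RATE):
    """Return the frequency (Hz) of the strongest bin in the windowed frame."""
    window = hamming(len(frame))
    windowed = [x * w for x, w in zip(frame, window)]
    spectrum = magnitude_spectrum(windowed)
    peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
    return peak_bin * sample_rate / len(frame)


signal = sine_wave(1000, FRAME_SIZE)  # one frame of a 1000 Hz tone
peak_hz = dominant_frequency(signal)
print(peak_hz)  # 1000.0
```

Real recordings are split into many overlapping frames, and the per-frame spectra (or the MFCCs derived from them) form the sequence that RNN or CNN models consume.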