In the previous article, I introduced data science and its workflow. Now let's consider a real-world scenario in which we want to forecast home values from several variables. This section walks through every stage of the data science pipeline, with explanations and sample Python code.
Note: This tutorial simplifies the data science workflow to illustrate the basic steps from data collection to deployment. As a result, the reported metrics (MAE and R²) may not reflect real-world performance, and the model should not be relied on in practice. Production systems require thorough data preparation and model tuning.
Problem Definition
Consider a real estate firm that wishes to develop a model for estimating home values. Such a model will help the firm understand which factors drive prices, and it will also help buyers and sellers make well-informed decisions.
Steps and Python code:
1. Data collection
Process explanation:
The first step is to collect data. In our scenario, this means gathering records of prior home sales, including location, year built, square footage, number of bedrooms, and so forth. For demonstration, we will load the data from a CSV file.
import pandas as pd
# Load the data from a CSV file (replace 'house_prices.csv' with your actual file path)
data = pd.read_csv('house_prices.csv')
print("Data Sample:")
print(data.head())
Result explanation:
This code reads in the dataset and displays the first few rows, helping us understand the data structure and see the features we have.
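If you do not have a house_prices.csv file on hand, you can generate a small synthetic dataset with the same columns to follow along. This is only a sketch with made-up values; the column names are the ones used throughout this article.
import numpy as np
# Hypothetical stand-in data so the rest of the tutorial can run without a CSV
rng = np.random.default_rng(42)
n = 200
square_feet = rng.integers(600, 4000, size=n).astype(float)
data = pd.DataFrame({
    'square_feet': square_feet,
    'bedrooms': rng.integers(1, 6, size=n),
    'year_built': rng.integers(1950, 2023, size=n),
    # Price loosely tied to size so later plots and models show a trend
    'price': square_feet * 150 + rng.normal(0, 30000, size=n),
})
print(data.head())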
2. Data Cleaning
Process explanation:
Real-world data is often messy. In this step, we will handle missing values, remove duplicates, and ensure all columns are in the correct format. For example, if some houses have missing square footage values, we may need to fill those gaps or remove those records.
# Check for missing values
print("Missing Values:")
print(data.isnull().sum())
# Fill missing values in 'square_feet' column
data['square_feet'] = data['square_feet'].fillna(data['square_feet'].mean())
# Drop rows where 'price' is missing
data = data.dropna(subset=['price'])
# Remove duplicates
data = data.drop_duplicates()
print("Data after Cleaning:")
print(data.info())
Result explanation:
After this step, the data is clean and ready for analysis, with missing values handled and duplicates removed. This makes the dataset more reliable.
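The explanation above also mentions ensuring all columns are in the correct format, which the code does not show. Here is a minimal sketch, assuming some numeric columns may have been read in as strings:
# Coerce possibly string-typed columns to numbers; invalid entries become NaN
data['year_built'] = pd.to_numeric(data['year_built'], errors='coerce')
data['price'] = pd.to_numeric(data['price'], errors='coerce')
# Re-check dtypes after coercion
print(data.dtypes)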
3. Exploratory Data Analysis (EDA)
Process explanation:
EDA helps us understand patterns in the data. We will create simple visualizations to see how features like square footage or the number of bedrooms affect house prices. It can reveal trends and insights, such as whether larger houses tend to have higher prices.
import matplotlib.pyplot as plt
import seaborn as sns
# Visualize the relationship between house price and square footage
plt.figure(figsize=(10, 6))
sns.scatterplot(x='square_feet', y='price', data=data)
plt.title("House Price vs Square Footage")
plt.xlabel("Square Feet")
plt.ylabel("Price")
plt.show()
# Check correlations between numerical features only
correlation = data[['price', 'square_feet', 'bedrooms', 'year_built']].corr()
print("Correlation Matrix:")
print(correlation)
Result Explanation:
This analysis suggests that square footage has a positive correlation with house price. The correlation matrix quantifies how each numerical feature relates to price, guiding us toward the features worth focusing on.
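To read the same correlation matrix at a glance, a heatmap is a common companion plot; this sketch reuses the correlation variable computed above.
# Visualize the correlation matrix as an annotated heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Feature Correlations")
plt.show()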
4. Feature Engineering
Process explanation:
Feature engineering involves creating new features or transforming existing ones. For instance, we might create a new feature, “age of house”, by subtracting the year built from the current year. This could give us more insight into how a house’s age affects its price.
import datetime
# Create a new feature: age of house
current_year = datetime.datetime.now().year
data['age_of_house'] = current_year - data['year_built']
print("Data after Feature Engineering:")
print(data[['year_built', 'age_of_house']].head())
Result explanation:
By creating the “age of house” feature, we have added a potentially significant price predictor to our dataset. This feature can now be used in model training.
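The same idea extends to other derived features. As an illustration only, a hypothetical price-per-square-foot column can be handy for analysis, though it must not be used as a model input because it is computed from the target and would leak it:
# Price per square foot: useful for EDA, but excluded from training
# because it is derived from the target ('price') and would leak it
data['price_per_sqft'] = data['price'] / data['square_feet']
print(data[['price', 'square_feet', 'price_per_sqft']].head())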
5. Model Building (Machine learning model)
Process explanation:
Now, we train a machine learning model to predict house prices. For demonstration, we will use a Random Forest regressor with standardized features (scaling is not strictly required for tree-based models, but it is included here to mirror a typical pipeline). Features like the house’s age, square footage, and number of bedrooms will be used to train it.
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
# 'age_of_house' was created in the feature engineering step above;
# it is recreated here only so this block can run on its own
current_year = datetime.datetime.now().year
data['age_of_house'] = current_year - data['year_built']
# Define features (X) and target (y)
X = data[['square_feet', 'bedrooms', 'age_of_house']]
y = data['price']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train the Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)
# Feature Importance
print("\nFeature Importances:")
feature_importances = rf_model.feature_importances_
for feature, importance in zip(X.columns, feature_importances):
    print(f"Feature: {feature}, Importance: {importance}")
Result Explanation:
The model is trained, and the feature_importances_ attribute shows how much each feature contributes to the model’s predictions. Higher importance values indicate a stronger influence on the target.
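A single train/test split can give a noisy estimate. As a sketch of a more robust check, cross-validation with a scikit-learn Pipeline keeps the scaler inside each fold so no test data leaks into the scaling:
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
# Bundle scaling and the model so each fold fits the scaler on its own training part
pipeline = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=42))
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring='r2')
print("Cross-validated R² scores:", cv_scores)
print("Mean R²:", cv_scores.mean())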
6. Model Evaluation
Process explanation:
Once the model is trained, we need to evaluate its performance using metrics such as Mean Absolute Error (MAE) and R². This step tells us how well the model performs and whether it is reliable.
from sklearn.metrics import mean_absolute_error, r2_score
# Make predictions and evaluate the model
y_pred = rf_model.predict(X_test_scaled)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Absolute Error:", mae)
print("R² Score:", r2)
Result Explanation:
MAE gives us the average error in predicted house prices, and R² shows how much of the variance in price is explained by the model. A low MAE and high R² mean our model is performing well.
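To judge whether those numbers are actually good, it helps to compare against a naive baseline. A minimal sketch using scikit-learn’s DummyRegressor, which always predicts the mean training price:
from sklearn.dummy import DummyRegressor
# Baseline that always predicts the mean price seen in training
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train_scaled, y_train)
baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test_scaled))
print("Baseline MAE:", baseline_mae)
# Our model's MAE should be substantially lower than this baseline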
7. Deployment and Monitoring
Process explanation:
Once the model performs well, we can deploy it for real-time predictions. We’ll create a simple function that takes input features and outputs a predicted house price. In production, this function could be part of a web app where users enter details about a house and get a price estimate.
# Define a prediction function for new data
def predict_price(square_feet, bedrooms, year_built):
    # Derive the engineered feature the model was trained on
    age_of_house = current_year - year_built
    # Create a DataFrame with the same structure as the training data
    input_features = pd.DataFrame(
        [[square_feet, bedrooms, age_of_house]],
        columns=['square_feet', 'bedrooms', 'age_of_house']
    )
    # Scale the input with the scaler fitted on the training data
    input_features_scaled = scaler.transform(input_features)
    predicted_price = rf_model.predict(input_features_scaled)
    return predicted_price[0]
# Example prediction
example_prediction = predict_price(2000, 3, 2010)
print("\nPredicted House Price for a 2000 sq ft, 3-bedroom house built in 2010:", example_prediction)
Result Explanation:
The function predict_price allows us to input features and receive a predicted price. Monitoring would involve tracking the accuracy of predictions over time and updating the model as new data becomes available.
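As a rough sketch of the web-app idea mentioned above (Flask is an assumption here; any web framework would work), predict_price can be wrapped in a JSON endpoint:
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    # Expects JSON like {"square_feet": 2000, "bedrooms": 3, "year_built": 2010}
    payload = request.get_json()
    price = predict_price(payload['square_feet'], payload['bedrooms'], payload['year_built'])
    return jsonify({'predicted_price': float(price)})

# Run with app.run() for development; use a production WSGI server when deploying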
Summary
- Data Collection: We gathered a dataset of house prices and related features.
- Data Cleaning: We handled missing values and removed duplicates.
- EDA: We visualized relationships and calculated correlations.
- Feature Engineering: We created a new feature, “age of house”.
- Model Building: We trained a Random Forest model to predict prices.
- Model Evaluation: We assessed model performance using MAE and R².
- Deployment: We created a function for real-time predictions.
This workflow demonstrates each step in a data science project, from data gathering to deployment, with code to reflect each process.