Predicting future tour arrivals in Thailand

Forecasting tour arrivals helps planners allocate tourism resources, improve guest experiences, and anticipate seasonal demand. It also helps companies and tourism boards build data-driven plans for staffing, inventory management, marketing campaigns, and revenue management.

Our model focuses on predicting tour arrivals as accurately as possible from historical arrival data.

End-to-end process

Data Collection: Load the provided CSV data into a Python environment to inspect and prepare it for analysis.

Data Cleaning: Handle any missing, inconsistent, or outlier values. If relevant, add external factors such as weather data, public events, or holidays.

Exploratory Data Analysis (EDA): Analyze trends, seasonality, and patterns in the data.

Feature Engineering: Create features that could improve the predictive power, such as month indicators.

Model Building: Choose and apply a model. We will employ a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM), which can learn temporal dependencies and is thus appropriate for time-series data.

Model Evaluation: Analyze model performance with metrics such as Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and Mean Absolute Error (MAE).

Deployment and Monitoring: Implement the model for real-time predictions and set up monitoring to ensure continued accuracy.

Future tour arrivals in Thailand: predictive modeling with Python

Here is a detailed, step-by-step explanation for each code segment:

Importing Libraries

#Use the "pip install" command to install required libraries if there is a ModuleNotFoundError
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Input, LSTM, Dense, Dropout
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ReduceLROnPlateau
from keras_tuner import RandomSearch
import tensorflow as tf

Description:

Pandas, NumPy, and Matplotlib: These libraries manage and visualize data. Pandas loads and processes the data, NumPy performs numerical operations, and Matplotlib plots data trends.

train_test_split: Splits the data into training and testing sets.

Metrics: mean_absolute_error, mean_squared_error, and mean_absolute_percentage_error are evaluation metrics for assessing model accuracy.

MinMaxScaler: Scales data to the range 0 to 1, which generally helps LSTM training converge.

Sequential, Input, LSTM, Dense, Dropout:

  • Sequential defines the model as a linear stack of layers.
  • Input defines the shape and structure of the input data in a neural network model.
  • LSTM layers capture temporal dependencies.
  • Dense layers are fully connected.
  • Dropout adds regularization.

Adam: Optimizer that adapts learning rates to improve convergence speed.

EarlyStopping, ReduceLROnPlateau: Callbacks for training control; EarlyStopping halts training if validation loss stops improving, and ReduceLROnPlateau reduces the learning rate when a plateau is reached in validation loss.

RandomSearch: A hyperparameter tuning tool from Keras Tuner that tries randomly sampled hyperparameter combinations and keeps the best-performing one.

Data collection

# Step 1: Data Collection
# Load the data
data = pd.read_csv('TourArrivalThailandMonthly_Jan_2015_to_Sep_2024.csv')

Description:

Loads the tour arrival data from a CSV file into a dataframe, which will be used as input for the prediction model.

Data cleaning

# Step 2: Data Cleaning
# Inspect and clean the data
print(data.info())
data.dropna(inplace=True)  # Drop missing values
print(data.describe())  # Check for any outliers or anomalies

Description:

This step ensures data integrity by checking for missing values and removing them. It also summarizes the data distribution to identify any potential outliers.

EDA

# Step 3: Exploratory Data Analysis (EDA)
# Plotting the trend of tour arrivals
plt.figure(figsize=(12, 6))
plt.plot(data['Date'], data['Tour_Arrivals'], label='Tour Arrivals')
plt.title('Tour Arrivals Over Time')
plt.ylabel('Number of Arrivals')
plt.gca().axes.get_xaxis().set_visible(False)  # Hide x-axis labels
plt.legend()
plt.show()

Description:

Visualizes the trend of tour arrivals over time, providing insights into seasonality or patterns that could impact the model’s performance.

Feature engineering

# Step 4: Feature Engineering
# Creating month and year features for seasonality analysis
data['Date'] = pd.to_datetime(data['Date'], format='%Y-%m-%d', errors='coerce')
data['Month'] = data['Date'].dt.month
data['Year'] = data['Date'].dt.year
data.set_index('Date', inplace=True)

Description:

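Adds Month and Year features to capture seasonal patterns, which are crucial in time-series data like tourism arrivals. The LSTM built below is trained on the arrival counts alone, so these columns support the seasonality analysis rather than feeding the model directly.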
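
The month indicators mentioned in the process overview can also be expressed as one-hot columns. The following is a minimal sketch with made-up month numbers; the helper name month_indicators is hypothetical and not part of the original workflow.

```python
import numpy as np

def month_indicators(months):
    """Return an (n, 12) one-hot matrix for month numbers 1..12."""
    months = np.asarray(months, dtype=int)
    if months.min() < 1 or months.max() > 12:
        raise ValueError("month numbers must be in 1..12")
    # Row k of the identity matrix is the indicator vector for month k + 1
    return np.eye(12, dtype=int)[months - 1]

# Illustrative month numbers, e.g. as they might appear in data['Month']
indicators = month_indicators([1, 2, 12, 1])
print(indicators.shape)
```

Each row contains a single 1 in the column of its month, so such columns could be stacked alongside the arrival counts if the model were extended to take more than one feature.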

Data preprocessing

# Step 5: Data Preprocessing for Model Building
# Prepare data for LSTM model
arrivals = data['Tour_Arrivals'].values.reshape(-1, 1)

# Normalize the data
scaler = MinMaxScaler()
arrivals_scaled = scaler.fit_transform(arrivals)

Description:

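Reshapes the arrival counts into a single column, as the scaler expects, and scales them to the range 0 to 1, which stabilizes the training process. Because the scaler is fitted on the full series, it also sees the minimum and maximum of the test period; a stricter setup would fit it on the training portion only.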
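
To make the scaling concrete, the sketch below reproduces by hand what min-max scaling to the range 0 to 1 computes, x' = (x - min) / (max - min), and how it is inverted. The arrival figures are made up for illustration.

```python
import numpy as np

# Made-up monthly arrival counts, for illustration only
arrivals_demo = np.array([1200.0, 1500.0, 900.0, 2100.0]).reshape(-1, 1)

lo, hi = arrivals_demo.min(), arrivals_demo.max()

# Forward transform: the smallest value maps to 0 and the largest to 1
scaled_demo = (arrivals_demo - lo) / (hi - lo)

# Inverse transform: recover the original units from the scaled values
restored_demo = scaled_demo * (hi - lo) + lo

print(scaled_demo.ravel())
```

The inverse step is what later turns the model's scaled predictions back into arrival counts.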

Time-series dataset creation

# Creating a time-series dataset
def create_dataset(data, time_step=12):
    X, y = [], []
    # One sample per position where a full window and its target both exist
    for i in range(len(data) - time_step):
        X.append(data[i:(i + time_step), 0])
        y.append(data[i + time_step, 0])
    return np.array(X), np.array(y)

time_step = 12  # Using 12 months (1 year) to predict the next
X, y = create_dataset(arrivals_scaled, time_step)

Description:

Prepares data for the LSTM model by creating sequences of 12 months (input X) to predict the next month (output y).
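
A toy series makes the windowing easier to see. In this sketch the helper make_windows is a hypothetical stand-in that uses every available target, and a window of 3 replaces the 12 months used above.

```python
import numpy as np

def make_windows(series, time_step):
    """Slide a window of length time_step over a 1-D series."""
    X, y = [], []
    # One sample per position where a full window and its target both exist
    for i in range(len(series) - time_step):
        X.append(series[i:i + time_step])
        y.append(series[i + time_step])
    return np.array(X), np.array(y)

series = np.arange(6)  # toy series: 0, 1, 2, 3, 4, 5
X_demo, y_demo = make_windows(series, time_step=3)
print(X_demo)
print(y_demo)
```

Each row of X_demo holds three consecutive values and the matching entry of y_demo is the value that follows them.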

Data reshaping and splitting

# Reshape data to fit LSTM input requirements
X = X.reshape(X.shape[0], X.shape[1], 1)

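# Split into training and testing sets in time order (no shuffling),
# so the model is tested only on months later than those it trained on
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

Description:

Reshapes data to the 3D format that LSTM layers require (samples, timesteps, features) and splits it chronologically into training and testing sets. Shuffling is disabled because overlapping windows from neighboring months would otherwise land in both sets and leak information into the test set.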

Model building with tuning

# Step 6: Model Optimization using Keras Tuner
# Define the function to build the model for Keras Tuner
def build_model(hp):
    model = Sequential()
    
    # Add Input layer to specify input shape
    model.add(Input(shape=(X_train.shape[1], 1)))
    
    # First LSTM layer with tunable units and dropout rate
    model.add(LSTM(units=hp.Int('units', min_value=32, max_value=128, step=16), return_sequences=True))
    model.add(Dropout(hp.Choice('dropout_rate_1', values=[0.1, 0.2, 0.3])))

    # Second LSTM layer with tunable units and dropout rate
    model.add(LSTM(units=hp.Int('units_2', min_value=32, max_value=128, step=16), return_sequences=False))
    model.add(Dropout(hp.Choice('dropout_rate_2', values=[0.1, 0.2, 0.3])))
    
    # Output layer
    model.add(Dense(1))
    
    # Compile the model with a tunable learning rate
    model.compile(optimizer=Adam(learning_rate=hp.Choice('learning_rate', [1e-2, 1e-3, 1e-4])), 
                  loss='mean_squared_error')
    return model

Description:

Defines an LSTM model with tunable hyperparameters (e.g., LSTM units, dropout rate, learning rate). Keras Tuner uses this function to search for the best combination of hyperparameters.

Keras tuner setup

# Setting up the Keras Tuner for hyperparameter search
tuner = RandomSearch(
    build_model,
    objective='val_loss',
    max_trials=5,
    executions_per_trial=3,
    directory='keras_tuner_dir',
    project_name='lstm_tour_arrival_prediction'
)

Description:

Configures the RandomSearch tuner to minimize validation loss, searching through different hyperparameter combinations.
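
Conceptually, a random search draws a fixed number of hyperparameter combinations and keeps the one with the lowest validation loss. The sketch below illustrates the idea in plain Python; it is not Keras Tuner's implementation, and toy_val_loss is a hypothetical stand-in for training a model and reading its validation loss.

```python
import random

# Search space mirroring some of the choices in build_model (values only)
search_space = {
    "units": list(range(32, 129, 16)),
    "dropout_rate_1": [0.1, 0.2, 0.3],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

def toy_val_loss(config):
    """Hypothetical score; a real search would train and validate a model."""
    return (abs(config["units"] - 64) / 64
            + config["dropout_rate_1"]
            + config["learning_rate"])

def random_search(space, score_fn, max_trials, seed=0):
    rng = random.Random(seed)
    trials = []
    for _ in range(max_trials):
        # Draw each hyperparameter independently from its list of choices
        config = {name: rng.choice(values) for name, values in space.items()}
        trials.append((score_fn(config), config))
    # The best trial is the one with the lowest validation loss
    return min(trials, key=lambda t: t[0]), trials

(best_loss, best_config), trials = random_search(search_space, toy_val_loss, max_trials=5)
print(best_loss, best_config)
```

With max_trials=5, only five of the possible combinations are ever tried, which is why the result is a good configuration rather than a guaranteed optimum.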

# Running the tuner search for the best hyperparameters
tuner.search(X_train, y_train, epochs=50, validation_data=(X_test, y_test), 
             callbacks=[EarlyStopping(monitor='val_loss', patience=5)])

Description:

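Runs the hyperparameter search on the training data, using early stopping to cut short trials whose validation loss stops improving. Note that the test set serves as the validation data here, so the test metrics reported later are likely to be optimistic.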
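
One possible refinement is to hold out a separate validation slice in time order, so that the test set is not used while tuning. The sketch below assumes the windows are ordered from oldest to newest; the helper name and the split fractions are illustrative and not taken from the original code.

```python
import numpy as np

def chronological_split(X, y, val_frac=0.1, test_frac=0.2):
    """Split ordered samples into train / validation / test without shuffling."""
    n = len(X)
    n_test = int(round(n * test_frac))
    n_val = int(round(n * val_frac))
    n_train = n - n_val - n_test
    if n_train <= 0:
        raise ValueError("not enough samples for the requested fractions")
    train = (X[:n_train], y[:n_train])
    val = (X[n_train:n_train + n_val], y[n_train:n_train + n_val])
    test = (X[n_train + n_val:], y[n_train + n_val:])
    return train, val, test

# Toy stand-ins for the windowed arrays: 20 samples, 12 timesteps, 1 feature
X_toy = np.arange(20 * 12, dtype=float).reshape(20, 12, 1)
y_toy = np.arange(20, dtype=float)
(X_tr, y_tr), (X_val, y_val), (X_te, y_te) = chronological_split(X_toy, y_toy)
print(len(X_tr), len(X_val), len(X_te))
```

The validation pair could then be passed as validation_data during the search and the final training run, leaving the test pair for the final evaluation only.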

Best model retrieval

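# Retrieve the best model and the hyperparameters that produced it
best_model = tuner.get_best_models(num_models=1)[0]
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]

# Recompile with the tuned learning rate; optimizer='adam' would reset it to the default
best_model.compile(optimizer=Adam(learning_rate=best_hp.get('learning_rate')),
                   loss='mean_squared_error')

Description:

Selects the best model configuration from the tuning process based on validation loss and recompiles it with the tuned learning rate.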

Model training

# Step 7: Training the Optimized Model
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, min_lr=1e-5)

history = best_model.fit(X_train, y_train, epochs=100, validation_data=(X_test, y_test), 
                         batch_size=16, callbacks=[early_stop, reduce_lr])

Description:

Trains the best model configuration with early stopping and learning rate reduction callbacks, further optimizing the training process.
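
The patience setting can be illustrated with a short simulation. This is a simplified sketch of patience-based stopping, not Keras's implementation, and the validation losses are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Training ran to completion without exhausting patience
    return len(val_losses) - 1, best_epoch

losses = [1.00, 0.80, 0.70, 0.75, 0.72, 0.71, 0.74]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

Training stops after three consecutive epochs without a new best loss. With restore_best_weights=True, the weights kept are those from the best epoch rather than the last one.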

Model evaluation

# Step 8: Model Evaluation
train_predict = best_model.predict(X_train)
test_predict = best_model.predict(X_test)
# Convert scaled predictions and targets back to arrival counts
train_predict = scaler.inverse_transform(train_predict)
test_predict = scaler.inverse_transform(test_predict)
y_train_inv = scaler.inverse_transform(y_train.reshape(-1, 1))
y_test_inv = scaler.inverse_transform(y_test.reshape(-1, 1))

print("Optimized Train MAE:", mean_absolute_error(y_train_inv[:, 0], train_predict[:, 0]))
print("Optimized Test MAE:", mean_absolute_error(y_test_inv[:, 0], test_predict[:, 0]))
print("Optimized Test RMSE:", np.sqrt(mean_squared_error(y_test_inv[:, 0], test_predict[:, 0])))
# MAPE is returned as a fraction (0.05 means 5%), not as a percentage
print("Optimized Test MAPE:", mean_absolute_percentage_error(y_test_inv[:, 0], test_predict[:, 0]))

Description:

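Generates predictions for the training and testing sets, converts them back to the original scale, and reports MAE, RMSE, and MAPE to assess model accuracy.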
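
The three metrics can also be computed by hand, which shows what each one measures. The actual and predicted values below are made up for illustration, and the sketch assumes no actual value is zero, since MAPE divides by the actual values.

```python
import numpy as np

# Made-up actual and predicted arrival counts, for illustration only
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))                    # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))             # penalizes large errors more
mape = np.mean(np.abs(errors) / np.abs(y_true))  # a fraction, not a percent

print(mae, rmse, mape)
```

MAE and RMSE are in the same units as the arrivals, while MAPE is relative to the actual values and is therefore easier to compare across periods with very different volumes.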

Deployment and monitoring

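At this stage the optimized model is saved so that it is ready for deployment and can be monitored in a production setting. The fitted scaler should be stored with it, since it is needed to convert predictions back to arrival counts.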
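
In production, the model would typically be asked for months beyond the last observation. One common approach is recursive forecasting, where each prediction is fed back in as the newest input. The sketch below assumes a hypothetical predict_one callable standing in for the trained model, and uses a made-up scaled series.

```python
import numpy as np

def forecast_recursive(history, predict_one, steps, time_step=12):
    """Forecast `steps` values by feeding each prediction back into the window."""
    window = list(history[-time_step:])
    if len(window) < time_step:
        raise ValueError("history is shorter than time_step")
    forecasts = []
    for _ in range(steps):
        # Shape (1, time_step, 1) matches the LSTM input layout used in training
        x = np.array(window, dtype=float).reshape(1, time_step, 1)
        next_value = float(predict_one(x))
        forecasts.append(next_value)
        # Slide the window: drop the oldest value, append the new prediction
        window = window[1:] + [next_value]
    return forecasts

# Hypothetical stand-in for the trained model: predicts the window mean
def mean_predictor(x):
    return x.mean()

history_scaled = np.linspace(0.0, 1.0, 24)  # made-up scaled series
future = forecast_recursive(history_scaled, mean_predictor, steps=3)
print(future)
```

With the real model, predict_one could be something like lambda x: best_model.predict(x, verbose=0)[0, 0], and the resulting forecasts would be passed through scaler.inverse_transform to obtain arrival counts. Errors compound with each step, so forecasts further ahead are less reliable.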

This workflow builds an end-to-end solution for forecasting future tourist arrivals in Thailand, tuning parameters for optimal performance, and saving the model for potential deployment.