Feature Engineering for Flight Price Prediction
Flight prices depend on a mix of temporal, categorical, and route-specific factors — departure date, the airline, number of stops, departure time. The raw dataset captures most of this information, but in formats that models cannot consume directly. This notebook walks through the feature engineering pipeline: extracting structured signals from date strings, splitting timestamps into hour and minute components, encoding stop counts, and one-hot encoding categorical columns.
The pandas and numpy stack handles tabular operations and array transformations. Seaborn and matplotlib are loaded for quick visualization checks during exploration.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Reading the flight data from an Excel file gives access to 10,683 records with 11 columns covering airline, route, timings, duration, stops, and price. The structure is mostly categorical (only Price is numeric), which means most of the engineering work will involve encoding and extraction.
df = pd.read_excel('flight_price.xlsx')
df.head()
Date_of_Journey is a string in DD/MM/YYYY format. Splitting it into separate day, month, and year columns lets the model treat them as independent numerical features rather than a single opaque string. Month and day-of-month often correlate with demand fluctuations such as holiday seasons and weekend travel patterns.
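As a hedged alternative to string splitting, the same fields can be pulled out by parsing the column as a datetime. The sketch below runs on a small made-up frame rather than the real data; `toy` and the `Weekday` column are illustrative additions, not part of the notebook. A day-of-week feature is one possible way to expose weekend effects, which day-of-month alone does not encode.

```python
import pandas as pd

# Toy stand-in for the Date_of_Journey column (DD/MM/YYYY strings).
toy = pd.DataFrame({'Date_of_Journey': ['24/03/2019', '01/05/2019', '09/06/2019']})

# An explicit format avoids day/month ambiguity for dates like 01/05/2019.
parsed = pd.to_datetime(toy['Date_of_Journey'], format='%d/%m/%Y')

toy['Date'] = parsed.dt.day
toy['Month'] = parsed.dt.month
toy['Year'] = parsed.dt.year
toy['Weekday'] = parsed.dt.dayofweek  # Monday=0 ... Sunday=6 (hypothetical extra feature)

print(toy)
```

The notebook itself uses plain string splitting followed by integer casts: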
df['Date'] = df['Date_of_Journey'].str.split('/').str[0]
df['Month'] = df['Date_of_Journey'].str.split('/').str[1]
df['Year'] = df['Date_of_Journey'].str.split('/').str[2]
df['Date'] = df['Date'].astype(int)
df['Month'] = df['Month'].astype(int)
df['Year'] = df['Year'].astype(int)
df.drop('Date_of_Journey', axis=1, inplace=True)
Arrival_Time contains values like '01:10 22 Mar' where the date suffix is irrelevant for prediction; only the time matters. Stripping the date part and splitting the time string into hour and minute components converts a messy text field into clean integer features. The original column is dropped once the useful parts are extracted.
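Arrival and departure times go through the same hour/minute split, so the step can be sketched once as a small helper. `split_hhmm` is a hypothetical name introduced here for illustration, and the series holds toy values.

```python
import pandas as pd

def split_hhmm(series):
    """Return (hour, minute) integer Series from strings like '01:10' or '01:10 22 Mar'."""
    # Keep only the leading HH:MM token, discarding any trailing date suffix.
    time_part = series.str.split(' ').str[0]
    parts = time_part.str.split(':', expand=True)
    return parts[0].astype(int), parts[1].astype(int)

arrival = pd.Series(['01:10 22 Mar', '13:15', '23:30'])
hour, minute = split_hhmm(arrival)
print(hour.tolist(), minute.tolist())
```

The notebook applies the same steps inline, one column at a time: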
df['Arrival_Time'] = df['Arrival_Time'].apply(lambda x: x.split(' ')[0])
df['Arrival_hour'] = df['Arrival_Time'].str.split(':').str[0]
df['Arrival_min'] = df['Arrival_Time'].str.split(':').str[1]
df['Arrival_hour'] = df['Arrival_hour'].astype(int)
df['Arrival_min'] = df['Arrival_min'].astype(int)
df.drop('Arrival_Time', axis=1, inplace=True)
Departure time is stored as a single HH:MM string. Splitting it into separate hour and minute columns exposes time-of-day patterns: early morning and late night departures often carry different price baselines.
df['Departure_hour'] = df['Dep_Time'].str.split(':').str[0]
df['Departure_min'] = df['Dep_Time'].str.split(':').str[1]
df['Departure_hour'] = df['Departure_hour'].astype(int)
df['Departure_min'] = df['Departure_min'].astype(int)
df.drop('Dep_Time', axis=1, inplace=True)
Duration is stored as a human-readable string like '2h 50m'. Parsing this into total minutes gives a continuous numerical feature that models can use directly. The conversion handles varying formats: some entries have only hours ('19h') while others have both hours and minutes.
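A vectorized variant of this conversion can be sketched with a regular expression in which both the hour and minute groups are optional. This is an illustration on toy values, not the notebook's method; the minutes-only case ('5m') is included as an assumption about what the data might contain.

```python
import pandas as pd

durations = pd.Series(['2h 50m', '19h', '5m'])  # toy values

# Both groups are optional, so hours-only and minutes-only strings still match.
parts = durations.str.extract(r'(?:(\d+)h)?\s*(?:(\d+)m)?')

# Missing groups come back as NaN; treat them as zero.
parts = parts.apply(pd.to_numeric).fillna(0).astype(int)

duration_min = parts[0] * 60 + parts[1]
print(duration_min.tolist())
```

The notebook's own parser works row by row: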
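def parse_duration(d):
    # Convert strings such as '2h 50m', '19h', or '50m' to total minutes.
    hours, minutes = 0, 0
    if 'h' in d:
        hours_part, d = d.split('h', 1)
        hours = int(hours_part.strip())
    if 'm' in d:
        # Minutes-only strings are parsed too instead of silently becoming 0.
        minutes = int(d.replace('m', '').strip())
    return hours * 60 + minutes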
df['Duration_min'] = df['Duration'].apply(parse_duration)
df.drop('Duration', axis=1, inplace=True)
Total_Stops is a categorical column with values like 'non-stop', '1 stop', '2 stops'. A check shows a single null value and one '4 stops' entry. Encoding the column as an ordered integer preserves the ordinal relationship: more stops generally mean longer journeys, which tend to correlate with higher prices. The null is imputed with the mode value (1 stop).
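The check mentioned above can be sketched as follows. The frame is toy data built to mirror the situation described (one null, one '4 stops' row); the `unmapped` comparison is an added safeguard, since `Series.map` returns NaN for any label missing from the dictionary.

```python
import pandas as pd

toy = pd.DataFrame({'Total_Stops': ['non-stop', '1 stop', None, '1 stop', '2 stops', '4 stops']})
stops_map = {'non-stop': 0, '1 stop': 1, '2 stops': 2, '3 stops': 3, '4 stops': 4}

null_count = toy['Total_Stops'].isnull().sum()
counts = toy['Total_Stops'].value_counts()  # nulls are excluded by default
mode_value = toy['Total_Stops'].mode()[0]

# Labels present in the data but absent from the mapping would become NaN.
unmapped = set(toy['Total_Stops'].dropna()) - set(stops_map)

print(null_count, mode_value, unmapped)
print(counts)
```

With the labels confirmed, the notebook imputes and maps in one step: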
stops_map = {'non-stop': 0, '1 stop': 1, '2 stops': 2, '3 stops': 3, '4 stops': 4}
df['Total_Stops'] = df['Total_Stops'].fillna(df['Total_Stops'].mode()[0]).map(stops_map)
The Route column contains path strings like 'BLR → DEL', but it largely duplicates information already captured by Source, Destination, and Total_Stops. Dropping this column avoids a sparse text feature that would require heavy preprocessing for limited predictive gain.
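One way to sanity-check the redundancy claim before dropping the column is to recover the stop count from the route string. The sketch assumes every leg is separated by a '→' character, as in the example above; the routes and stop counts below are toy values.

```python
import pandas as pd

toy = pd.DataFrame({
    'Route': ['BLR → DEL', 'CCU → IXR → BBI → BLR', 'DEL → LKO → BOM'],
    'Total_Stops': [0, 2, 1],  # already integer-encoded
})

# Each arrow is one flight leg; intermediate stops = legs - 1.
stops_from_route = toy['Route'].str.count('→') - 1

print((stops_from_route == toy['Total_Stops']).all())
```

If the two agree on the real data, the column carries little beyond the identity of the intermediate airports, and it can be dropped: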
df.drop('Route', axis=1, inplace=True)
Airline, Source, and Destination are categorical features with 12, 5, and 6 unique values respectively. Additional_Info describes in-flight services with 10 unique tags. One-hot encoding transforms Airline, Source, and Destination into binary columns that regression and tree-based models can use directly; Additional_Info is left untouched at this stage.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[['Airline', 'Source', 'Destination']])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(), index=df.index)
df = pd.concat([df.drop(['Airline', 'Source', 'Destination'], axis=1), encoded_df], axis=1)
The resulting feature set is numeric apart from one column: Airline, Source, and Destination are one-hot encoded, the date, time, and duration fields are converted to integers, and the single missing Total_Stops value is imputed. Additional_Info is still in its original string form and could be encoded or grouped into broader categories as the next step.
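A minimal sketch of that next step, assuming a frequency threshold is an acceptable way to group tags: infrequent values are collapsed into an 'Other' bucket before one-hot encoding. The tag values, the `min_count` threshold, and the `Info` prefix are illustrative choices, not taken from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({'Additional_Info': [
    'No info', 'No info', 'No info', 'In-flight meal not included',
    'In-flight meal not included', 'Business class', '1 Long layover',
]})

min_count = 2  # hypothetical threshold; tune on the real data
counts = toy['Additional_Info'].value_counts()
rare = counts[counts < min_count].index

# Collapse infrequent tags into a single bucket, then one-hot encode.
grouped = toy['Additional_Info'].where(~toy['Additional_Info'].isin(rare), 'Other')
dummies = pd.get_dummies(grouped, prefix='Info', dtype=int)

print(dummies.sum())
```

Grouping first keeps the number of new columns small when most tags appear only a handful of times.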