Pandas in Action: Exploring & Cleaning Malaysia's Vehicle Landscape

Issue 29: From Raw Data to Revealing Insights

Aug 14, 2024

Welcome to our Pandas case study series, where we embark on data-driven adventures! Today, we're delving into Malaysia's vehicle registration data, using Pandas to uncover hidden trends and patterns. We'll not only explore but also take our first steps in cleaning the data for future analysis.

Dataset Overview

We're working with a dataset capturing a slice of Malaysia's vehicle registrations. While compact, it holds a treasure trove of information: It can be accessed with

Malaysian Cars Registration Dataset

date_reg: Date of vehicle registration
type: Type of vehicle
maker: Vehicle manufacturer
model: Specific model of the vehicle
colour: Color of the vehicle
fuel: Fuel type used by the vehicle
state: State where the vehicle is registered

Loading and Initial Exploration

Let's begin our journey by loading the dataset and taking a first look:

import pandas as pd

# Load the dataset 
df = pd.read_csv("vehicle_registration.csv")

# Display the first few rows to get a feel for the data
print(df.head().to_markdown(index=False, numalign="left", stralign="left"))

# Get information about the DataFrame (column names, data types, non-null counts)
print(df.info())

Unveiling Basic Insights

Now, let's use some fundamental Pandas functions to gain initial insights:

# Count the occurrences of each vehicle type
type_counts = df['type'].value_counts()
print("\nVehicle Type Counts:\n", type_counts.head().to_markdown(numalign="left", stralign="left"))

# Get the top 5 most popular car makers
maker_counts = df['maker'].value_counts()
print("\nTop 5 Car Makers:\n", maker_counts.head().to_markdown(numalign="left", stralign="left"))

# Count the number of registrations by state
state_counts = df.groupby('state').size()
print("\nRegistrations by State:\n", state_counts.head().to_markdown(numalign="left", stralign="left"))

Data Cleaning in Action

Let's address some common data cleaning tasks to prepare our dataset for deeper analysis:

# Check for missing values
print("\nMissing Values:\n", df.isnull().sum().to_markdown(numalign="left", stralign="left"))

# Convert 'date_reg' to datetime format (assuming the format is 'YYYY-MM-DD')
df['date_reg'] = pd.to_datetime(df['date_reg'], format='%Y-%m-%d')

# Standardize 'maker' and 'model' to lowercase
df['maker'] = df['maker'].astype(str).str.lower()
df['model'] = df['model'].astype(str).str.lower()

# If you decide to drop rows with missing values:
df.dropna(inplace=True)

Enhanced Exploration

With cleaner data, let's extract a bit more information:

# Find the most recent registration date
latest_registration = df['date_reg'].max()
print(f"\nLatest Registration Date: {latest_registration}")

# Count the number of unique car models
unique_models = df['model'].nunique()
print(f"\nNumber of Unique Car Models: {unique_models}")

What's Next?

In our upcoming newsletters, we'll build on this foundation. We'll delve into visualizations, explore relationships between variables, and analyze trends over time. Get ready to uncover fascinating stories hidden within Malaysia's vehicle registration data!

Poll

Call to Action

Don't wait! Start experimenting with your own datasets using these Pandas techniques. Share your discoveries and any questions you have in the comments below. Let's continue this data-driven journey together!

Happy cleaning and exploring!

Young Pythoneers

Discussion about this post