Pandas in Action: Exploring & Cleaning Malaysia's Vehicle Landscape
Issue 29: From Raw Data to Revealing Insights
Welcome to our Pandas case study series, where we embark on data-driven adventures! Today, we're delving into Malaysia's vehicle registration data, using Pandas to uncover hidden trends and patterns. We'll not only explore but also take our first steps in cleaning the data for future analysis.
Dataset Overview
We're working with a dataset capturing a slice of Malaysia's vehicle registrations. While compact, it holds a treasure trove of information: It can be accessed with
Malaysian Cars Registration Dataset
date_reg
: Date of vehicle registrationtype
: Type of vehiclemaker
: Vehicle manufacturermodel
: Specific model of the vehiclecolour
: Color of the vehiclefuel
: Fuel type used by the vehiclestate
: State where the vehicle is registered
Loading and Initial Exploration
Let's begin our journey by loading the dataset and taking a first look:
import pandas as pd
# Load the dataset
df = pd.read_csv("vehicle_registration.csv")
# Display the first few rows to get a feel for the data
print(df.head().to_markdown(index=False, numalign="left", stralign="left"))
# Get information about the DataFrame (column names, data types, non-null counts)
print(df.info())
Unveiling Basic Insights
Now, let's use some fundamental Pandas functions to gain initial insights:
# Count the occurrences of each vehicle type
type_counts = df['type'].value_counts()
print("\nVehicle Type Counts:\n", type_counts.head().to_markdown(numalign="left", stralign="left"))
# Get the top 5 most popular car makers
maker_counts = df['maker'].value_counts()
print("\nTop 5 Car Makers:\n", maker_counts.head().to_markdown(numalign="left", stralign="left"))
# Count the number of registrations by state
state_counts = df.groupby('state').size()
print("\nRegistrations by State:\n", state_counts.head().to_markdown(numalign="left", stralign="left"))
Data Cleaning in Action
Let's address some common data cleaning tasks to prepare our dataset for deeper analysis:
# Check for missing values
print("\nMissing Values:\n", df.isnull().sum().to_markdown(numalign="left", stralign="left"))
# Convert 'date_reg' to datetime format (assuming the format is 'YYYY-MM-DD')
df['date_reg'] = pd.to_datetime(df['date_reg'], format='%Y-%m-%d')
# Standardize 'maker' and 'model' to lowercase
df['maker'] = df['maker'].astype(str).str.lower()
df['model'] = df['model'].astype(str).str.lower()
# If you decide to drop rows with missing values:
df.dropna(inplace=True)
Enhanced Exploration
With cleaner data, let's extract a bit more information:
# Find the most recent registration date
latest_registration = df['date_reg'].max()
print(f"\nLatest Registration Date: {latest_registration}")
# Count the number of unique car models
unique_models = df['model'].nunique()
print(f"\nNumber of Unique Car Models: {unique_models}")
What's Next?
In our upcoming newsletters, we'll build on this foundation. We'll delve into visualizations, explore relationships between variables, and analyze trends over time. Get ready to uncover fascinating stories hidden within Malaysia's vehicle registration data!
Poll
Call to Action
Don't wait! Start experimenting with your own datasets using these Pandas techniques. Share your discoveries and any questions you have in the comments below. Let's continue this data-driven journey together!
Happy cleaning and exploring!