Beginner's Guide to Data Exploration with Python
Introduction
I recently stumbled upon a potential job opportunity through a mate. While I might not tick all the boxes, I do have relevant experience in several areas the employer is after. So, I took a punt and chucked my CV into the mix. However, if I do land an interview, I’ll need to brush up on my previous experience and sharpen my skills a bit.
In the world of data, it’s often about storytelling. You’re either crafting something that stands on its own - a dashboard that a passer-by can glance at and grasp - or you’re weaving a narrative. Building a narrative requires more than just tidying up data and slapping together a dashboard. You need to understand the business, the customers, and definitely the purpose of the dashboard or presentation you’re creating.
In this case, I’ve never worked in this potential employer’s industry. So how do I get interview-ready when I don’t know the customer or the business?
My advice? Find industry data or something similar. Kaggle is a cracking place to start. If you can’t find data related to the industry you’re aiming for, no worries. Just pick a topic you’re genuinely interested in. Your natural enthusiasm can help you shine in an interview. Remember, unless you’re going for a contractor role (where the game’s played a bit differently), building a connection is the most important part of the interview. Your technical prowess and wall full of certificates mean nowt if the interviewers don’t fancy working with you. Conversely, a candidate with a weaker technical skillset but the right soft skills and attitude can be coached into the perfect fit. It’s easier to train someone in technical stuff than in soft skills. So if there’s even a smidgen of a chance to let your enthusiasm shine, let it glow!
In my haste to get interview-ready, I decided to build a dashboard in an industry-adjacent field. And since I’ve been run off my feet lately and short on blog post ideas, I thought I’d share bits of the journey with you.
Exploratory Data Analysis (EDA): Playing Detective with Data
Exploratory Data Analysis, or EDA for short, is like being a detective with your data. It’s the crucial first step in any data analysis project, where you roll up your sleeves and get to know your dataset intimately. Reading about it is one thing, but getting your hands on a massive dataset from Kaggle or similar can really drive home just how vast and complex data can be. Personally, I’m looking at car data, but I’d recommend sports datasets if you’re stuck for ideas. The sheer quantity and variety of data collected in many sports is fascinating and will challenge you to find the best way of telling a story.
Imagine you’ve just received a massive box of jigsaw puzzle pieces. Before you start trying to fit them together, you’d probably want to tip them out, sort them by colour, find the edge pieces, and get a general sense of what you’re dealing with. That’s essentially what EDA is all about.
During EDA, you’ll be asking questions like:
- What sort of data do I have?
- Are there any patterns or trends lurking in there?
- Are there any odd bits that don’t quite fit?
- What might this data be trying to tell me?
It’s a process of summarising, visualising, and really getting to grips with your data before you start any heavy-duty analysis or modelling. It’s about understanding the story your data is trying to tell, and maybe uncovering a few plot twists along the way!
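To make that a bit more concrete, here's a minimal sketch of how some of those questions translate into pandas one-liners. (The file path is a placeholder - swap in whatever dataset you're exploring.)
import pandas as pd
# Load whatever dataset you're exploring (path is a placeholder)
df = pd.read_csv('your_data.csv')
# What sort of data do I have?
print(df.dtypes)
# Any patterns or trends in the summary statistics?
print(df.describe())
# Any odd bits that don't quite fit - missing values or duplicate rows?
print(df.isnull().sum())
print(df.duplicated().sum())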
Why Python is Brilliant for EDA
Now, you might be wondering why we’re using Python for this data detective work. Well, Python is a bit like the Swiss Army knife of programming languages – it’s versatile, powerful, and has tools for just about everything. Also, I like typing things in my IDE, though I might do a follow-up post on how to do the same things in Excel or similar. No promises, mind.
Here’s why Python is brilliant for EDA:
- It’s beginner-friendly: Python reads almost like plain English, which makes it easier to learn and understand, especially if you’re new to programming.
- Powerful libraries: Python has a treasure trove of libraries specifically designed for data analysis. Pandas, for instance, is like a supercharged spreadsheet program, while Matplotlib and Seaborn let you create stunning visualisations with just a few lines of code (there’s a tiny example at the end of this section).
- Flexibility: Whether you’re working with a small CSV file or a massive database, Python can handle it. It’s equally at home with structured and unstructured data.
- Automation: Once you’ve written your Python code, you can easily reuse it or tweak it for different datasets. This saves heaps of time in the long run.
- Community support: Python has a massive, friendly community. If you ever get stuck, there’s a good chance someone else has faced the same problem and shared a solution online.
- It’s free and open-source: Unlike some statistical software packages that cost an arm and a leg, Python is completely free to use.
- Jupyter Notebooks: These interactive documents let you mix code, visualisations, and narrative text all in one place. It’s brilliant for sharing your EDA process and findings with others.
So, whether you’re a data newbie or a seasoned pro, Python provides all the tools you need to dive into your data and start exploring. It’s like having a well-equipped laboratory right at your fingertips!
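To show what I mean by “just a few lines of code”, here’s a tiny sketch using one of the practice datasets that ships with seaborn (I’ve picked ‘tips’, but any of its bundled datasets would do):
import seaborn as sns
import matplotlib.pyplot as plt
# One line to load a practice dataset, one line to plot it
tips = sns.load_dataset('tips')
sns.histplot(data=tips, x='total_bill')
plt.show()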
I won’t go into setting up a Python environment, because frankly it can be either incredibly easy or really frustrating. It depends on loads of things, not least your operating system, but also what permissions you have and what other things are running on the computer. I personally find it much easier to get it all up and running in Linux and a bit more effort to do so with Windows, but your mileage may vary.
Loading and Examining Data
Let’s dive into some practical examples. We’ll use the Titanic dataset, which is a classic for learning data analysis.
import pandas as pd
import seaborn as sns
# Load the Titanic dataset
titanic = sns.load_dataset('titanic')
# View the first few rows
print(titanic.head())
# Get dataset info
titanic.info()
This code snippet loads the Titanic dataset and gives us a quick look at what we’re dealing with. The head() function shows us the first few rows, while info() provides a summary of the dataset.
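If you want a bit more than head() and info(), a couple of other quick checks are worth running at this stage - a short sketch:
# Summary statistics for the numeric columns
print(titanic.describe())
# How many rows and columns are we dealing with?
print(titanic.shape)
# And which columns, exactly?
print(titanic.columns.tolist())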
Data Cleaning and Preprocessing
Next, we’ll do some basic data cleaning:
# Check for missing values
print(titanic.isnull().sum())
# Fill missing values in 'age' column with the median
titanic['age'] = titanic['age'].fillna(titanic['age'].median())
# Create a new feature 'is_alone'
titanic['is_alone'] = (titanic['sibsp'] + titanic['parch'] == 0).astype(int)
Here, we’re checking for missing values, filling in missing ages with the median age, and creating a new feature to indicate if a passenger was travelling alone.
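It’s worth a quick sanity check that the cleaning actually did what we intended:
# 'age' should no longer have any missing values
print(titanic['age'].isnull().sum())
# How many passengers were travelling alone?
print(titanic['is_alone'].value_counts())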
Data Visualisation
Now for the fun part - visualising our data:
import matplotlib.pyplot as plt
# Histogram of passenger ages
plt.figure(figsize=(10, 6))
plt.hist(titanic['age'], bins=20)
plt.title('Distribution of Passenger Ages')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()
This code creates a histogram of passenger ages, giving us a visual representation of the age distribution on the Titanic.
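If you fancy taking it a step further, seaborn lets you split the same distribution by another column with barely any extra code - here’s a sketch splitting ages by survival:
import seaborn as sns
import matplotlib.pyplot as plt
# Age distribution, split by whether the passenger survived
plt.figure(figsize=(10, 6))
sns.histplot(data=titanic, x='age', hue='survived', bins=20, multiple='stack')
plt.title('Passenger Ages by Survival')
plt.show()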
Basic Statistical Analysis
Finally, let’s do some basic statistical analysis:
# Correlation between numeric columns (numeric_only skips text columns; needs pandas 1.5+)
correlation = titanic.corr(numeric_only=True)
print(correlation['survived'].sort_values(ascending=False))
# Survival rate by passenger class (observed=True keeps newer pandas quiet about categorical columns)
survival_rate = titanic.groupby('class', observed=True)['survived'].mean()
print(survival_rate)
This code calculates how each numeric column correlates with survival, then computes the survival rate by passenger class.
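If a wall of numbers doesn’t do it for you, the same correlation matrix can be drawn as a heatmap - a quick sketch:
import seaborn as sns
import matplotlib.pyplot as plt
# Visualise the correlation matrix as a colour-coded grid
plt.figure(figsize=(10, 8))
sns.heatmap(titanic.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlations Between Numeric Columns')
plt.show()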
Closing Thoughts
Exploratory Data Analysis is a crucial skill for any data professional, and Python makes it accessible and powerful. Whether you’re prepping for an interview or just looking to expand your skills, diving into a dataset and exploring it with Python is a brilliant way to learn.
Remember, the goal isn’t just to crunch numbers, but to tell a story with your data. As you practice EDA, try to think about what questions you’d ask if this were real-world data for a business. What insights could you draw? How could these findings inform decisions?
And don’t forget - the journey of learning in data science never really ends. There’s always a new technique to master or a new library to explore. So keep at it, stay curious, and happy data exploring!