Have you ever run into this: you're facing a pile of messy data and have no idea where to start? Or you've heard that Python is powerful for data analysis, but the entry barrier feels too high? Today I'd like to share the most practical tips and methods I've picked up for data analysis in Python.
As a Python data analysis enthusiast, I know how much good data processing matters. When I first started out I was confused too, but with steady practice I found that once you have the right tools and methods, data analysis isn't as hard as it looks.
Before we begin, let's meet the core Python library for data analysis: Pandas. Why Pandas? Because it's the Swiss Army knife of data analysis, able to handle almost every common data processing task.
import pandas as pd
import numpy as np

# Build a small sample dataset of employees
data = {
    'Name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu'],
    'Age': [25, 30, 35, 28],
    'Salary': [8000, 12000, 15000, 10000],
    'Department': ['Tech', 'Sales', 'Tech', 'Marketing']
}
df = pd.DataFrame(data)
See that? Just a few lines of code and we have a structured data table. That's the charm of Pandas.
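If you want to see the result for yourself, a couple of standard inspection calls will do; this is just a quick peek at the frame we built above:
print(df.head())   # first rows of the table
print(df.dtypes)   # data type of each column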
Data cleaning is the most important step in the analysis process. I often say, "garbage in, garbage out" - if the input data quality isn't good, the analysis results won't be reliable.
# Fill missing salaries with the column mean (our sample has none, so this is illustrative)
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
df = df.drop_duplicates()            # drop exact duplicate rows
df['Age'] = df['Age'].astype(int)    # ensure Age is stored as an integer
In real work I once ran into a case like this: in an e-commerce company's sales data, some product prices were abnormally high or low. The IQR method below lets you quickly find and handle such outliers:
# Interquartile range (IQR) bounds for the Salary column
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Keep only the rows whose salary falls inside the bounds
df = df[(df['Salary'] >= lower_bound) & (df['Salary'] <= upper_bound)]
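A small tip: before dropping anything, it can be worth inverting the mask and printing the suspect rows, so you can judge whether they are data errors or genuine extremes. A minimal sketch, meant to run before the final filtering line above:
# Review suspected outliers instead of silently dropping them (run this before filtering)
suspects = df[(df['Salary'] < lower_bound) | (df['Salary'] > upper_bound)]
print(f"Flagged {len(suspects)} suspect rows")
print(suspects)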
The most interesting part of data analysis is discovering patterns hidden in the data. Let's see how to perform some common analysis operations using Pandas.
print(df.describe())                                      # overall descriptive statistics
dept_salary = df.groupby('Department')['Salary'].mean()   # average salary per department
correlation = df['Salary'].corr(df['Age'])                # does salary track age?
Once, while analyzing a company's HR data, I noticed an interesting pattern: the average salary gap between departments was more than 50%. That prompted me to dig deeper into the reasons, as sketched below.
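To make the comparison concrete, here is a minimal sketch of how you might quantify that gap using the dept_salary series computed above (the 50% figure came from that HR dataset, not from our small sample df):
# Relative gap between the highest- and lowest-paid departments
gap = (dept_salary.max() - dept_salary.min()) / dept_salary.min()
print(f"The best-paid department earns {gap:.0%} more on average than the worst-paid one")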
Data visualization is the best way to tell a story. Through charts, we can intuitively show data characteristics and patterns.
import matplotlib.pyplot as plt
import seaborn as sns

# These font settings are only needed if your labels contain Chinese characters;
# SimHei may not be installed on every system
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Bar chart: average salary per department
plt.figure(figsize=(10, 6))
dept_salary.plot(kind='bar')
plt.title('Department Average Salary Comparison')
plt.xlabel('Department')
plt.ylabel('Average Salary')
plt.tight_layout()
plt.show()

# Histogram: how salaries are distributed
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='Salary', bins=20)
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
Once you've mastered the basics, you can try some more advanced analysis techniques. For example, we can use Pandas' advanced features for time series analysis:
# Simulate 100 days of sales as a random walk around 5000
dates = pd.date_range('20230101', periods=100)
sales = pd.Series(np.random.randn(100).cumsum() * 1000 + 5000, index=dates)

# 7-day and 30-day moving averages smooth out the daily noise
ma7 = sales.rolling(window=7).mean()
ma30 = sales.rolling(window=30).mean()

plt.figure(figsize=(12, 6))
plt.plot(sales.index, sales.values, label='Daily Sales')
plt.plot(ma7.index, ma7.values, label='7-day Moving Average')
plt.plot(ma30.index, ma30.values, label='30-day Moving Average')
plt.title('Sales Trend Analysis')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.tight_layout()
plt.show()
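Since sales has a DatetimeIndex, you can also aggregate it to coarser frequencies with resample; a minimal sketch (newer pandas versions prefer the 'ME' alias over 'M' for month-end):
# Aggregate daily sales to monthly totals
monthly = sales.resample('M').sum()   # use 'ME' on pandas >= 2.2 to avoid a FutureWarning
print(monthly.head())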
In practical work, I've summarized some very useful tips:
# Quick overview: column types, non-null counts, and missing values per column
print(df.info())
print(df.isnull().sum())

# Value counts for every column
for column in df.columns:
    print(f"\n{column}'s unique values:")
    print(df[column].value_counts())
def clean_data(df):
    """Data cleaning function"""
    df = df.copy()
    # Handle missing values
    df.fillna(df.mean(numeric_only=True), inplace=True)
    # Remove duplicate rows
    df.drop_duplicates(inplace=True)
    return df
def analyze_data(df):
    """Data analysis function"""
    results = {}
    # Basic statistics
    results['summary'] = df.describe()
    # Group statistics
    results['group_stats'] = df.groupby('Department')['Salary'].agg(['mean', 'count'])
    return results
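For completeness, here is how these two helpers might be called on the sample df from earlier; they assume the frame has numeric columns plus Department and Salary:
# Run the helpers end to end on the employee data
cleaned = clean_data(df)
report = analyze_data(cleaned)
print(report['summary'])
print(report['group_stats'])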
# Pivot table: average salary with departments as rows and ages as columns
pivot_table = pd.pivot_table(df,
                             values='Salary',
                             index='Department',
                             columns='Age',
                             aggfunc='mean',
                             fill_value=0)
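Pivoting on raw ages gives one column per distinct age, which gets unwieldy on real data. A common workaround is to bin the ages first; this is just a sketch using pd.cut, and the 'Age Group' column is something I'm adding for illustration:
# Bin ages into ranges so the pivot stays readable ('Age Group' is an illustrative helper column)
df['Age Group'] = pd.cut(df['Age'], bins=[20, 30, 40, 50], labels=['20-30', '30-40', '40-50'])
pivot_by_group = pd.pivot_table(df,
                                values='Salary',
                                index='Department',
                                columns='Age Group',
                                aggfunc='mean',
                                fill_value=0)
print(pivot_by_group)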
Let's look at a complete practical case. Suppose we need to analyze a company's sales data:
# Generate a reproducible synthetic sales dataset
np.random.seed(42)
n_records = 1000
dates = pd.date_range('20230101', periods=n_records)
products = ['Product A', 'Product B', 'Product C', 'Product D']
regions = ['East', 'North', 'South', 'West']
sales_data = {
    'Date': np.random.choice(dates, n_records),
    'Product': np.random.choice(products, n_records),
    'Region': np.random.choice(regions, n_records),
    'Sales Volume': np.random.randint(10, 100, n_records),
    'Unit Price': np.random.uniform(100, 1000, n_records)
}
sales_df = pd.DataFrame(sales_data)
sales_df['Sales Amount'] = sales_df['Sales Volume'] * sales_df['Unit Price']
# Reuse the cleaning helper defined above
sales_df = clean_data(sales_df)

# Total sales per product (best sellers first) and totals per region
product_sales = sales_df.groupby('Product')['Sales Amount'].sum().sort_values(ascending=False)
region_sales = sales_df.groupby('Region').agg({
    'Sales Amount': 'sum',
    'Sales Volume': 'sum'
}).round(2)

# Daily and monthly sales trends
daily_sales = sales_df.groupby('Date')['Sales Amount'].sum()
monthly_sales = sales_df.groupby(sales_df['Date'].dt.to_period('M'))['Sales Amount'].sum()
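Before jumping into charts, I like to print a couple of these aggregates as a quick sanity check:
# Quick sanity check on the aggregated results
print(product_sales)
print(region_sales)
print(monthly_sales.head())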
# Combine three views of the data into one figure
plt.figure(figsize=(15, 10))

# Bar chart: total sales per product
plt.subplot(2, 2, 1)
product_sales.plot(kind='bar')
plt.title('Product Sales Comparison')
plt.xticks(rotation=45)

# Pie chart: share of sales by region
plt.subplot(2, 2, 2)
plt.pie(region_sales['Sales Amount'], labels=region_sales.index, autopct='%1.1f%%')
plt.title('Regional Sales Proportion')

# Line chart spanning the bottom row: monthly trend
plt.subplot(2, 2, (3, 4))
monthly_sales.plot()
plt.title('Monthly Sales Trend')
plt.xlabel('Month')
plt.ylabel('Sales Amount')
plt.tight_layout()
plt.show()
In this article we've walked through Python data analysis from several angles: from basic data processing to advanced techniques, and from simple statistics to richer visualizations. Do you now have a deeper understanding of Python data analysis?
Remember, data analysis is not just a technique but also an art: it asks us to think creatively to uncover the value hidden in data. Do you have any data analysis experiences of your own to share, or difficulties you've hit while learning? Feel free to discuss in the comments.