Learn the essentials of Python's pandas library in this beginner's guide. Discover how to effortlessly manipulate and analyze data for your projects.
pip install pandas
import pandas as pd
# Series
s = pd.Series([1, 3, 5, 'a', 6, 8])
print(s)
0 1
1 3
2 5
3 a
4 6
5 8
dtype: object
# to check datatype
type(s)
pandas.core.series.Series
# DataFrame:
data = {'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 28, 22],
'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
df
| | Name | Age | City |
|---|-------|-----|----------------|
| 0 | John | 25 | New York |
| 1 | Alice | 28 | San Francisco |
| 2 | Bob | 22 | Los Angeles |
#to check datatype
type(df)
pandas.core.frame.DataFrame
import numpy as np
df1 = pd.DataFrame(np.arange(0,15).reshape(3,5), index= ['R1', 'R2', 'R3'], columns= ['C1', 'C2', 'C3', 'C4', 'C5'])
df1
| | C1 | C2 | C3 | C4 | C5 |
|----|----|----|----|----|----|
| R1 | 0 | 1 | 2 | 3 | 4 |
| R2 | 5 | 6 | 7 | 8 | 9 |
| R3 | 10 | 11 | 12 | 13 | 14 |
#Accessing the elements: Indexing and slicing
# Accessing a single column by label
print(df1['C1'])
R1 0
R2 5
R3 10
Name: C1, dtype: int64
We can’t access rows in the similar way as above, For accessing row we need to use ’loc’ discused below
# Here even the df1 is a dataframe data type, a single row or a colomn will be of data type series only,
type(df1['C1'])
pandas.core.series.Series
# Slicing multiple rows by label
print(df1['R1':'R2'])
C1 C2 C3 C4 C5
R1 0 1 2 3 4
R2 5 6 7 8 9
# Slicing multiple rows by position
df1[1:3]
| | C1 | C2 | C3 | C4 | C5 |
| --- | -- | -- | -- | -- | -- |
| R2 | 5 | 6 | 7 | 8 | 9 |
| R3 | 10 | 11 | 12 | 13 | 14 |
# Example
df1[df1['C1'] > 2]
| | C1 | C2 | C3 | C4 | C5 |
|------|----|----|----|----|----|
| **R2** | 5 | 6 | 7 | 8 | 9 |
| **R3** | 10 | 11 | 12 | 13 | 14 |
# Accessing a single row by label
df1.loc['R1']
C1 0
C2 1
C3 2
C4 3
C5 4
Name: R1, dtype: int64
# Accessing a specific element by label and column
df1.loc['R2', 'C3']
7
# Slicing by labels
print(df1.loc['R1':'R2'])
C1 C2 C3 C4 C5
R1 0 1 2 3 4
R2 5 6 7 8 9
# Accessing a single row by position
df1.iloc[1]
C1 5
C2 6
C3 7
C4 8
C5 9
Name: R2, dtype: int64
# Accessing a specific element by position
print(df1.iloc[1, 0])
5
# Slicing by position
print(df1.iloc[1:3])
C1 C2 C3 C4 C5
R2 5 6 7 8 9
R3 10 11 12 13 14
# Reading from CSV
df = pd.read_csv('jobs_in_data.csv') #Here we are using a sample .csv file, the file is provided at the bottom.
# Reading from Excel
# df = pd.read_excel('example.xlsx')
# Reading from SQL
# df = pd.read_sql('SELECT * FROM table', connection)
#Writing:
# Writing to CSV
#df.to_csv('output.csv', index=False)
# Writing to Excel
#df.to_excel('output.xlsx', index=False)
# Writing to SQL
#df.to_sql('table', connection, index=False, if_exists='replace')
df
| | work_year | job_title | job_category | salary_currency | salary | salary_in_usd | employee_residence | experience_level | employment_type | work_setting | company_location | company_size |
|---|-----------|-------------------------|---------------------------------|------------------|---------|---------------|---------------------|------------------|------------------|--------------|-------------------|--------------|
| 0 | 2023 | Data DevOps Engineer | Data Engineering | EUR | 88000 | 95012 | Germany | Mid-level | Full-time | Hybrid | Germany | L |
| 1 | 2023 | Data Architect | Data Architecture and Modeling | USD | 186000 | 186000 | United States | Senior | Full-time | In-person | United States | M |
| 2 | 2023 | Data Architect | Data Architecture and Modeling | USD | 81800 | 81800 | United States | Senior | Full-time | In-person | United States | M |
| 3 | 2023 | Data Scientist | Data Science and Research | USD | 212000 | 212000 | United States | Senior | Full-time | In-person | United States | M |
| 4 | 2023 | Data Scientist | Data Science and Research | USD | 93300 | 93300 | United States | Senior | Full-time | In-person | United States | M |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9350 | 2021 | Data Specialist | Data Management and Strategy | USD | 165000 | 165000 | United States | Senior | Full-time | Remote | United States | L |
| 9351 | 2020 | Data Scientist | Data Science and Research | USD | 412000 | 412000 | United States | Senior | Full-time | Remote | United States | L |
| 9352 | 2021 | Principal Data Scientist | Data Science and Research | USD | 151000 | 151000 | United States | Mid-level | Full-time | Remote | United States | L |
| 9353 | 2020 | Data Scientist | Data Science and Research | USD | 105000 | 105000 | United States | Entry-level | Full-time | Remote | United States | S |
| 9354 | 2020 | Business Data Analyst | Data Analysis | USD | 100000 | 100000 | United States | Entry-level | Contract | Remote | United States | L |
*Note: 9355 rows × 12 columns*
df.head() #Displays the first few rows of the DataFrame.
| | work_year | job_title | job_category | salary_currency | salary | salary_in_usd | employee_residence | experience_level | employment_type | work_setting | company_location | company_size |
|----|-----------|-------------------------|---------------------------------|------------------|--------|---------------|---------------------|------------------|------------------|--------------|------------------|--------------|
| 0 | 2023 | Data DevOps Engineer | Data Engineering | EUR | 88000 | 95012 | Germany | Mid-level | Full-time | Hybrid | Germany | L |
| 1 | 2023 | Data Architect | Data Architecture and Modeling | USD | 186000 | 186000 | United States | Senior | Full-time | In-person | United States | M |
| 2 | 2023 | Data Architect | Data Architecture and Modeling | USD | 81800 | 81800 | United States | Senior | Full-time | In-person | United States | M |
| 3 | 2023 | Data Scientist | Data Science and Research | USD | 212000 | 212000 | United States | Senior | Full-time | In-person | United States | M |
| 4 | 2023 | Data Scientist | Data Science and Research | USD | 93300 | 93300 | United States | Senior | Full-time | In-person | United States | M |
df.tail() #Displays the last few rows of the DataFrame.
| | work_year | job_title | job_category | salary_currency | salary | salary_in_usd | employee_residence | experience_level | employment_type | work_setting | company_location | company_size |
|--------|-----------|------------------------------|--------------------------------|------------------|--------|--------------|---------------------|-------------------|------------------|--------------|-------------------|--------------|
| 9350 | 2021 | Data Specialist | Data Management and Strategy | USD | 165000 | 165000 | United States | Senior | Full-time | Remote | United States | L |
| 9351 | 2020 | Data Scientist | Data Science and Research | USD | 412000 | 412000 | United States | Senior | Full-time | Remote | United States | L |
| 9352 | 2021 | Principal Data Scientist | Data Science and Research | USD | 151000 | 151000 | United States | Mid-level | Full-time | Remote | United States | L |
| 9353 | 2020 | Data Scientist | Data Science and Research | USD | 105000 | 105000 | United States | Entry-level | Full-time | Remote | United States | S |
| 9354 | 2020 | Business Data Analyst | Data Analysis | USD | 100000 | 100000 | United States | Entry-level | Contract | Remote | United States | L |
df.sample(2) #Displays a random sample of 5 rows from the DataFrame.
| | work_year | job_title | job_category | salary_currency | salary | salary_in_usd | employee_residence | experience_level | employment_type | work_setting | company_location | company_size |
|---|-----------|---------------------|----------------------------|-----------------|--------|---------------|---------------------|-------------------|------------------|--------------|-------------------|--------------|
| 1 | 2023 | Applied Scientist | Data Science and Research | USD | 172040 | 172040 | United States | Senior | Full-time | Remote | United States | M |
| 2 | 2023 | Data Analyst | Data Analysis | USD | 126000 | 126000 | United States | Senior | Full-time | In-person | United States | M |
Summary Statistics:
# Summary Statistics:
df.describe() #Generates descriptive statistics for numerical columns.
| | work_year | salary | salary_in_usd |
|--------------|--------------|--------------|---------------|
| count | 9355.000000 | 9355.000000 | 9355.000000 |
| mean | 2022.760449 | 149927.981293| 150299.495564 |
| std | 0.519470 | 63608.835387 | 63177.372024 |
| min | 2020.000000 | 14000.000000 | 15000.000000 |
| 25% | 2023.000000 | 105200.000000| 105700.000000 |
| 50% | 2023.000000 | 143860.000000| 143000.000000 |
| 75% | 2023.000000 | 187000.000000| 186723.000000 |
| max | 2023.000000 | 450000.000000| 450000.000000 |
Data Types:
# Datatype:
df.info() #Provides a concise summary of the DataFrame, including data types and non-null values.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9355 entries, 0 to 9354
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 work_year 9355 non-null int64
1 job_title 9355 non-null object
2 job_category 9355 non-null object
3 salary_currency 9355 non-null object
4 salary 9355 non-null int64
5 salary_in_usd 9355 non-null int64
6 employee_residence 9355 non-null object
7 experience_level 9355 non-null object
8 employment_type 9355 non-null object
9 work_setting 9355 non-null object
10 company_location 9355 non-null object
11 company_size 9355 non-null object
dtypes: int64(3), object(9)
memory usage: 877.2+ KB
# Detecting Missing Values
df.isnull() #or df.isna()
| work_year | job_title | job_category | salary_currency | salary | salary_in_usd | employee_residence | experience_level | employment_type | work_setting | company_location | company_size |
|-----------|-----------|--------------|-----------------|--------|---------------|--------------------|-------------------|------------------|--------------|------------------|--------------|
| False | False | False | False | False | False | False | False | False | False | False | False |
| False | False | False | False | False | False | False | False | False | False | False | False |
| False | False | False | False | False | False | False | False | False | False | False | False |
| False | False | False | False | False | False | False | False | False | False | False | False |
| False | False | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| False | False | False | False | False | False | False | False | False | False | False | False |
| False | False | False | False | False | False | False | False | False | False | False | False |
| False | False | False | False | False | False | False | False | False | False | False | False |
| False | False | False | False | False | False | False | False | False | False | False | False |
9355 rows × 12 columns
#Counting Missing Values:
df.isnull().sum()
work_year 0
job_title 0
job_category 0
salary_currency 0
salary 0
salary_in_usd 0
employee_residence 0
experience_level 0
employment_type 0
work_setting 0
company_location 0
company_size 0
dtype: int64
# Dropping Missing Values:
# Use dropna() to remove rows or columns with missing values.
# Dropping rows with any missing values
df.dropna(inplace=True)
# Dropping columns with any missing values
df.dropna(axis=1, inplace=True)
# Filling Missing Values:
# Use fillna(value) to replace missing values with a specific value.
# Filling missing values in salalry field with a specific value (e.g., mean)
df.fillna(value=df['salary'].mean(), inplace=True)
# Forward Fill (ffill) and Backward Fill (bfill):
# Use ffill() to fill missing values with the previous value and bfill() to fill with the next value.
# Forward fill missing values
df.ffill(inplace=True)
# Backward fill missing values
df.bfill(inplace=True)
# Grouping by 'Category' column
grouped_data = df.groupby('job_title')
# Calculating mean for each group
mean_values = grouped_data.mean() # After grouping, apply aggregation functions (e.g., mean, sum, count) to get summary statistics for each group.
print(mean_values)
work_year salary salary_in_usd
job_title
AI Architect 2023.000000 249000.000000 250328.000000
AI Developer 2022.944444 140500.000000 141140.888889
AI Engineer 2023.000000 169208.416667 171663.972222
AI Programmer 2022.800000 74000.000000 68817.400000
AI Research Engineer 2022.750000 67275.000000 73271.500000
... ... ... ...
Sales Data Analyst 2020.000000 60000.000000 60000.000000
Software Data Engineer 2023.000000 120000.000000 111627.666667
Staff Data Analyst 2021.000000 84999.333333 79917.000000
Staff Data Scientist 2020.500000 134500.000000 134500.000000
Staff Machine Learning Engineer 2021.000000 185000.000000 185000.000000
[125 rows x 3 columns]