A Beginners Guide to Pandas in Python

Learn the essentials of Python's pandas library in this beginner's guide. Discover how to effortlessly manipulate and analyze data for your projects.

A Beginners Guide to Pandas in Python

Introduction to Pandas: A Beginner’s Guide

What is Pandas?

  • Pandas is a powerful Python library for data manipulation and analysis. It provides easy-to-use data structures, high-performance data analysis tools, and various data cleaning and transformation capabilities.

Installation

  • Before you start using Pandas, you need to install it. Open your terminal or command prompt and run:
Python
pip install pandas

Importing Pandas

  • Once installed, you can import Pandas in your Python script or Jupyter notebook:
Python
import pandas as pd

Data Structures in Pandas

Series:

  • Series is like a single column of a spreadsheet
  • A one-dimensional labeled array capable of holding any data type.
  • It can be created from lists, arrays, or dictionaries.

DataFrame:

  • DataFrame is the entire spreadsheet
  • A two-dimensional table with labeled axes (rows and columns).
  • It is the most commonly used Pandas object, resembling a spreadsheet or SQL table.
  • DataFrames can be created from various data sources, including dictionaries, lists, NumPy arrays, and external files like CSV or Excel.
Python
# Series

s = pd.Series([1, 3, 5, 'a', 6, 8])

print(s)
0    1
1    3
2    5
3    a
4    6
5    8
dtype: object
Python
# to check datatype
type(s)
pandas.core.series.Series
Python
# DataFrame:

data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 28, 22],
        'City': ['New York', 'San Francisco', 'Los Angeles']}

df = pd.DataFrame(data)

df
|   | Name  | Age | City           |
|---|-------|-----|----------------|
| 0 | John  | 25  | New York       |
| 1 | Alice | 28  | San Francisco  |
| 2 | Bob   | 22  | Los Angeles    |
Python
#to check datatype

type(df)
pandas.core.frame.DataFrame

Using Numpy to create data for the dataframe

Python
import numpy as np

df1 = pd.DataFrame(np.arange(0,15).reshape(3,5), index= ['R1', 'R2', 'R3'], columns= ['C1', 'C2', 'C3', 'C4', 'C5'])

df1
|    | C1 | C2 | C3 | C4 | C5 |
|----|----|----|----|----|----|
| R1 | 0  | 1  | 2  | 3  | 4  |
| R2 | 5  | 6  | 7  | 8  | 9  |
| R3 | 10 | 11 | 12 | 13 | 14 |

Column Selection:

  • Accessing a single column: df[‘Column_Name’]
  • Accessing multiple columns: df[[‘Column1’, ‘Column2’]]
Python
#Accessing the elements: Indexing and slicing

# Accessing a single column by label
print(df1['C1'])  
R1     0
R2     5
R3    10
Name: C1, dtype: int64

We can’t access rows in the similar way as above, For accessing row we need to use ’loc’ discused below

Python
# Here even the df1 is a dataframe data type, a single row or a colomn will be of data type series only,  

type(df1['C1'])
pandas.core.series.Series
Python
# Slicing multiple rows by label
print(df1['R1':'R2'])  
 
    C1  C2  C3  C4  C5
R1   0   1   2   3   4
R2   5   6   7   8   9
Python
# Slicing multiple rows by position
df1[1:3]
|     | C1 | C2 | C3 | C4 | C5 |
| --- | -- | -- | -- | -- | -- |
| R2  | 5  | 6  | 7  | 8  | 9  |
| R3  | 10 | 11 | 12 | 13 | 14 |

Filtering Rows:

  • Using conditions to filter rows based on a specific criteria.
Python
# Example
df1[df1['C1'] > 2]
|      | C1 | C2 | C3 | C4 | C5 |
|------|----|----|----|----|----|
| **R2** |  5 |  6 |  7 |  8 |  9 |
| **R3** | 10 | 11 | 12 | 13 | 14 |

Indexing with loc and iloc

  • In Pandas, effective data manipulation often involves precise indexing. The loc and iloc attributes provide powerful tools for label-based and integer-location-based indexing in DataFrames, enabling you to access, slice, and manipulate data with ease.

loc - Label-based Indexing:

  • loc is used for accessing a group of rows and columns by labels or a boolean array.
Python
# Accessing a single row by label

df1.loc['R1']
C1    0
C2    1
C3    2
C4    3
C5    4
Name: R1, dtype: int64
Python
# Accessing a specific element by label and column

df1.loc['R2', 'C3']
7
Python
# Slicing by labels
print(df1.loc['R1':'R2'])
    C1  C2  C3  C4  C5
R1   0   1   2   3   4
R2   5   6   7   8   9

iloc - Integer-location based Indexing:

  • iloc is used for accessing a group of rows and columns by integer position.
Python
# Accessing a single row by position

df1.iloc[1]
C1    5
C2    6
C3    7
C4    8
C5    9
Name: R2, dtype: int64
Python
# Accessing a specific element by position

print(df1.iloc[1, 0])
5
Python
# Slicing by position

print(df1.iloc[1:3])
    C1  C2  C3  C4  C5
R2   5   6   7   8   9
R3  10  11  12  13  14

Reading and Writing Data

Reading Data

  • Pandas supports reading data from various file formats such as CSV, Excel, SQL, and more.

Writing Data

  • You can also write data back to these formats:
Python
# Reading from CSV
df = pd.read_csv('jobs_in_data.csv')  #Here we are using a sample .csv file, the file is provided at the bottom.

# Reading from Excel
# df = pd.read_excel('example.xlsx')

# Reading from SQL
# df = pd.read_sql('SELECT * FROM table', connection)

#Writing:

# Writing to CSV
#df.to_csv('output.csv', index=False)

# Writing to Excel
#df.to_excel('output.xlsx', index=False)

# Writing to SQL
#df.to_sql('table', connection, index=False, if_exists='replace')

df
|   | work_year | job_title               | job_category                    | salary_currency | salary  | salary_in_usd | employee_residence | experience_level | employment_type | work_setting | company_location | company_size |
|---|-----------|-------------------------|---------------------------------|------------------|---------|---------------|---------------------|------------------|------------------|--------------|-------------------|--------------|
| 0 | 2023      | Data DevOps Engineer     | Data Engineering                | EUR              | 88000   | 95012         | Germany             | Mid-level        | Full-time        | Hybrid       | Germany           | L            |
| 1 | 2023      | Data Architect          | Data Architecture and Modeling | USD              | 186000  | 186000        | United States       | Senior           | Full-time        | In-person    | United States      | M            |
| 2 | 2023      | Data Architect          | Data Architecture and Modeling | USD              | 81800   | 81800         | United States       | Senior           | Full-time        | In-person    | United States      | M            |
| 3 | 2023      | Data Scientist          | Data Science and Research       | USD              | 212000  | 212000        | United States       | Senior           | Full-time        | In-person    | United States      | M            |
| 4 | 2023      | Data Scientist          | Data Science and Research       | USD              | 93300   | 93300         | United States       | Senior           | Full-time        | In-person    | United States      | M            |
| ... | ...       | ...                     | ...                             | ...              | ...     | ...           | ...                 | ...              | ...              | ...          | ...               | ...          |
| 9350 | 2021      | Data Specialist         | Data Management and Strategy    | USD              | 165000  | 165000        | United States       | Senior           | Full-time        | Remote       | United States      | L            |
| 9351 | 2020      | Data Scientist          | Data Science and Research       | USD              | 412000  | 412000        | United States       | Senior           | Full-time        | Remote       | United States      | L            |
| 9352 | 2021      | Principal Data Scientist | Data Science and Research       | USD              | 151000  | 151000        | United States       | Mid-level        | Full-time        | Remote       | United States      | L            |
| 9353 | 2020      | Data Scientist          | Data Science and Research       | USD              | 105000  | 105000        | United States       | Entry-level      | Full-time        | Remote       | United States      | S            |
| 9354 | 2020      | Business Data Analyst    | Data Analysis                   | USD              | 100000  | 100000        | United States       | Entry-level      | Contract         | Remote       | United States      | L            |

*Note: 9355 rows × 12 columns*

Data Exploration in pandas

  • Data exploration is a crucial step in the data analysis process, and Pandas provides powerful tools to help you understand and analyze your dataset.

Viewing Data:

  • head(): Displays the first few rows of the DataFrame.
  • tail(): Displays the last few rows of the DataFrame.
  • sample(n): Displays a random sample of n rows from the DataFrame.
Python
df.head() #Displays the first few rows of the DataFrame.
|    | work_year | job_title               | job_category                    | salary_currency | salary | salary_in_usd | employee_residence | experience_level | employment_type | work_setting | company_location | company_size |
|----|-----------|-------------------------|---------------------------------|------------------|--------|---------------|---------------------|------------------|------------------|--------------|------------------|--------------|
| 0  | 2023      | Data DevOps Engineer    | Data Engineering                | EUR              | 88000  | 95012         | Germany             | Mid-level        | Full-time        | Hybrid       | Germany          | L            |
| 1  | 2023      | Data Architect          | Data Architecture and Modeling | USD              | 186000 | 186000        | United States       | Senior           | Full-time        | In-person    | United States    | M            |
| 2  | 2023      | Data Architect          | Data Architecture and Modeling | USD              | 81800  | 81800         | United States       | Senior           | Full-time        | In-person    | United States    | M            |
| 3  | 2023      | Data Scientist          | Data Science and Research       | USD              | 212000 | 212000        | United States       | Senior           | Full-time        | In-person    | United States    | M            |
| 4  | 2023      | Data Scientist          | Data Science and Research       | USD              | 93300  | 93300         | United States       | Senior           | Full-time        | In-person    | United States    | M            |
Python
df.tail() #Displays the last few rows of the DataFrame.
|        | work_year |           job_title           |          job_category          | salary_currency | salary | salary_in_usd | employee_residence | experience_level | employment_type | work_setting | company_location | company_size |
|--------|-----------|------------------------------|--------------------------------|------------------|--------|--------------|---------------------|-------------------|------------------|--------------|-------------------|--------------|
| 9350   | 2021      | Data Specialist              | Data Management and Strategy    | USD              | 165000 | 165000       | United States       | Senior            | Full-time        | Remote       | United States     | L            |
| 9351   | 2020      | Data Scientist               | Data Science and Research       | USD              | 412000 | 412000       | United States       | Senior            | Full-time        | Remote       | United States     | L            |
| 9352   | 2021      | Principal Data Scientist     | Data Science and Research       | USD              | 151000 | 151000       | United States       | Mid-level         | Full-time        | Remote       | United States     | L            |
| 9353   | 2020      | Data Scientist               | Data Science and Research       | USD              | 105000 | 105000       | United States       | Entry-level       | Full-time        | Remote       | United States     | S            |
| 9354   | 2020      | Business Data Analyst        | Data Analysis                   | USD              | 100000 | 100000       | United States       | Entry-level       | Contract         | Remote       | United States     | L            |
Python
df.sample(2) #Displays a random sample of 5 rows from the DataFrame.
|   | work_year | job_title          | job_category               | salary_currency | salary | salary_in_usd | employee_residence | experience_level | employment_type | work_setting | company_location | company_size |
|---|-----------|---------------------|----------------------------|-----------------|--------|---------------|---------------------|-------------------|------------------|--------------|-------------------|--------------|
| 1 | 2023      | Applied Scientist  | Data Science and Research  | USD             | 172040 | 172040        | United States       | Senior            | Full-time        | Remote       | United States     | M            |
| 2 | 2023      | Data Analyst       | Data Analysis              | USD             | 126000 | 126000        | United States       | Senior            | Full-time        | In-person    | United States     | M            |

Summary Statistics:

  • describe(): Generates descriptive statistics for numerical columns.
Python
# Summary Statistics:

df.describe() #Generates descriptive statistics for numerical columns.
|              | work_year    | salary       | salary_in_usd |
|--------------|--------------|--------------|---------------|
| count        | 9355.000000  | 9355.000000  | 9355.000000   |
| mean         | 2022.760449  | 149927.981293| 150299.495564 |
| std          | 0.519470     | 63608.835387 | 63177.372024  |
| min          | 2020.000000  | 14000.000000 | 15000.000000  |
| 25%          | 2023.000000  | 105200.000000| 105700.000000 |
| 50%          | 2023.000000  | 143860.000000| 143000.000000 |
| 75%          | 2023.000000  | 187000.000000| 186723.000000 |
| max          | 2023.000000  | 450000.000000| 450000.000000 |

Data Types:

  • info(): Provides a concise summary of the DataFrame, including data types and non-null values.
Python
# Datatype:

df.info()   #Provides a concise summary of the DataFrame, including data types and non-null values.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9355 entries, 0 to 9354
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           9355 non-null   int64 
 1   job_title           9355 non-null   object
 2   job_category        9355 non-null   object
 3   salary_currency     9355 non-null   object
 4   salary              9355 non-null   int64 
 5   salary_in_usd       9355 non-null   int64 
 6   employee_residence  9355 non-null   object
 7   experience_level    9355 non-null   object
 8   employment_type     9355 non-null   object
 9   work_setting        9355 non-null   object
 10  company_location    9355 non-null   object
 11  company_size        9355 non-null   object
dtypes: int64(3), object(9)
memory usage: 877.2+ KB

Handling Missing Data:

  • Handling missing data is an essential part of data cleaning and preprocessing. Pandas provides several methods to deal with missing values in a DataFrame.

Checking for Missing Data:

  • Use isnull() or isna() to check for missing values in the DataFrame.
Python
# Detecting Missing Values

df.isnull()     #or df.isna()
| work_year | job_title | job_category | salary_currency | salary | salary_in_usd | employee_residence | experience_level | employment_type | work_setting | company_location | company_size |
|-----------|-----------|--------------|-----------------|--------|---------------|--------------------|-------------------|------------------|--------------|------------------|--------------|
| False     | False     | False        | False           | False  | False         | False              | False             | False            | False        | False            | False        |
| False     | False     | False        | False           | False  | False         | False              | False             | False            | False        | False            | False        |
| False     | False     | False        | False           | False  | False         | False              | False             | False            | False        | False            | False        |
| False     | False     | False        | False           | False  | False         | False              | False             | False            | False        | False            | False        |
| False     | False     | False        | False           | False  | False         | False              | False             | False            | False        | False            | False        |
| ...       | ...       | ...          | ...             | ...    | ...           | ...                | ...               | ...              | ...          | ...              | ...          |
| False     | False     | False        | False           | False  | False         | False              | False             | False            | False        | False            | False        |
| False     | False     | False        | False           | False  | False         | False              | False             | False            | False        | False            | False        |
| False     | False     | False        | False           | False  | False         | False              | False             | False            | False        | False            | False        |
| False     | False     | False        | False           | False  | False         | False              | False             | False            | False        | False            | False        |

9355 rows × 12 columns
Python
#Counting Missing Values:

df.isnull().sum()
work_year             0
job_title             0
job_category          0
salary_currency       0
salary                0
salary_in_usd         0
employee_residence    0
experience_level      0
employment_type       0
work_setting          0
company_location      0
company_size          0
dtype: int64

Handling Missing Values

Python
# Dropping Missing Values:

# Use dropna() to remove rows or columns with missing values.

# Dropping rows with any missing values
df.dropna(inplace=True)

# Dropping columns with any missing values
df.dropna(axis=1, inplace=True)
Python
# Filling Missing Values:

# Use fillna(value) to replace missing values with a specific value.


# Filling missing values in salalry field with a specific value (e.g., mean)
df.fillna(value=df['salary'].mean(), inplace=True)
Python
# Forward Fill (ffill) and Backward Fill (bfill):

# Use ffill() to fill missing values with the previous value and bfill() to fill with the next value.


# Forward fill missing values
df.ffill(inplace=True)

# Backward fill missing values
df.bfill(inplace=True)

Grouping and Aggregation:

  • Grouping and aggregation are powerful operations in Pandas that allow you to group your data based on certain criteria and perform calculations on those groups.

Grouping Data:

  • Use groupby() to group data by a single column.
Python
# Grouping by 'Category' column
grouped_data = df.groupby('job_title')

# Calculating mean for each group

mean_values = grouped_data.mean() # After grouping, apply aggregation functions (e.g., mean, sum, count) to get summary statistics for each group.

print(mean_values)
                                   work_year         salary  salary_in_usd
job_title                                                                 
AI Architect                     2023.000000  249000.000000  250328.000000
AI Developer                     2022.944444  140500.000000  141140.888889
AI Engineer                      2023.000000  169208.416667  171663.972222
AI Programmer                    2022.800000   74000.000000   68817.400000
AI Research Engineer             2022.750000   67275.000000   73271.500000
...                                      ...            ...            ...
Sales Data Analyst               2020.000000   60000.000000   60000.000000
Software Data Engineer           2023.000000  120000.000000  111627.666667
Staff Data Analyst               2021.000000   84999.333333   79917.000000
Staff Data Scientist             2020.500000  134500.000000  134500.000000
Staff Machine Learning Engineer  2021.000000  185000.000000  185000.000000

[125 rows x 3 columns]