Learn the essentials of Python's pandas library in this beginner's guide. Discover how to effortlessly manipulate and analyze data for your projects.
 
  pip install pandasimport pandas as pd# Series
s = pd.Series([1, 3, 5, 'a', 6, 8])
print(s)0    1
1    3
2    5
3    a
4    6
5    8
dtype: object
# to check datatype
type(s)pandas.core.series.Series
# DataFrame:
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 28, 22],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
df|   | Name  | Age | City           |
|---|-------|-----|----------------|
| 0 | John  | 25  | New York       |
| 1 | Alice | 28  | San Francisco  |
| 2 | Bob   | 22  | Los Angeles    |
#to check datatype
type(df)pandas.core.frame.DataFrame
import numpy as np
df1 = pd.DataFrame(np.arange(0,15).reshape(3,5), index= ['R1', 'R2', 'R3'], columns= ['C1', 'C2', 'C3', 'C4', 'C5'])
df1|    | C1 | C2 | C3 | C4 | C5 |
|----|----|----|----|----|----|
| R1 | 0  | 1  | 2  | 3  | 4  |
| R2 | 5  | 6  | 7  | 8  | 9  |
| R3 | 10 | 11 | 12 | 13 | 14 |
#Accessing the elements: Indexing and slicing
# Accessing a single column by label
print(df1['C1'])  R1     0
R2     5
R3    10
Name: C1, dtype: int64
We can’t access rows in the similar way as above, For accessing row we need to use ’loc’ discused below
# Here even the df1 is a dataframe data type, a single row or a colomn will be of data type series only,  
type(df1['C1'])pandas.core.series.Series
# Slicing multiple rows by label
print(df1['R1':'R2'])  
     C1  C2  C3  C4  C5
R1   0   1   2   3   4
R2   5   6   7   8   9
# Slicing multiple rows by position
df1[1:3]|     | C1 | C2 | C3 | C4 | C5 |
| --- | -- | -- | -- | -- | -- |
| R2  | 5  | 6  | 7  | 8  | 9  |
| R3  | 10 | 11 | 12 | 13 | 14 |
# Example
df1[df1['C1'] > 2]|      | C1 | C2 | C3 | C4 | C5 |
|------|----|----|----|----|----|
| **R2** |  5 |  6 |  7 |  8 |  9 |
| **R3** | 10 | 11 | 12 | 13 | 14 |
# Accessing a single row by label
df1.loc['R1']C1    0
C2    1
C3    2
C4    3
C5    4
Name: R1, dtype: int64
# Accessing a specific element by label and column
df1.loc['R2', 'C3']7
# Slicing by labels
print(df1.loc['R1':'R2'])    C1  C2  C3  C4  C5
R1   0   1   2   3   4
R2   5   6   7   8   9
# Accessing a single row by position
df1.iloc[1]C1    5
C2    6
C3    7
C4    8
C5    9
Name: R2, dtype: int64
# Accessing a specific element by position
print(df1.iloc[1, 0])5
# Slicing by position
print(df1.iloc[1:3])    C1  C2  C3  C4  C5
R2   5   6   7   8   9
R3  10  11  12  13  14
# Reading from CSV
df = pd.read_csv('jobs_in_data.csv')  #Here we are using a sample .csv file, the file is provided at the bottom.
# Reading from Excel
# df = pd.read_excel('example.xlsx')
# Reading from SQL
# df = pd.read_sql('SELECT * FROM table', connection)
#Writing:
# Writing to CSV
#df.to_csv('output.csv', index=False)
# Writing to Excel
#df.to_excel('output.xlsx', index=False)
# Writing to SQL
#df.to_sql('table', connection, index=False, if_exists='replace')
df|   | work_year | job_title               | job_category                    | salary_currency | salary  | salary_in_usd | employee_residence | experience_level | employment_type | work_setting | company_location | company_size |
|---|-----------|-------------------------|---------------------------------|------------------|---------|---------------|---------------------|------------------|------------------|--------------|-------------------|--------------|
| 0 | 2023      | Data DevOps Engineer     | Data Engineering                | EUR              | 88000   | 95012         | Germany             | Mid-level        | Full-time        | Hybrid       | Germany           | L            |
| 1 | 2023      | Data Architect          | Data Architecture and Modeling | USD              | 186000  | 186000        | United States       | Senior           | Full-time        | In-person    | United States      | M            |
| 2 | 2023      | Data Architect          | Data Architecture and Modeling | USD              | 81800   | 81800         | United States       | Senior           | Full-time        | In-person    | United States      | M            |
| 3 | 2023      | Data Scientist          | Data Science and Research       | USD              | 212000  | 212000        | United States       | Senior           | Full-time        | In-person    | United States      | M            |
| 4 | 2023      | Data Scientist          | Data Science and Research       | USD              | 93300   | 93300         | United States       | Senior           | Full-time        | In-person    | United States      | M            |
| ... | ...       | ...                     | ...                             | ...              | ...     | ...           | ...                 | ...              | ...              | ...          | ...               | ...          |
| 9350 | 2021      | Data Specialist         | Data Management and Strategy    | USD              | 165000  | 165000        | United States       | Senior           | Full-time        | Remote       | United States      | L            |
| 9351 | 2020      | Data Scientist          | Data Science and Research       | USD              | 412000  | 412000        | United States       | Senior           | Full-time        | Remote       | United States      | L            |
| 9352 | 2021      | Principal Data Scientist | Data Science and Research       | USD              | 151000  | 151000        | United States       | Mid-level        | Full-time        | Remote       | United States      | L            |
| 9353 | 2020      | Data Scientist          | Data Science and Research       | USD              | 105000  | 105000        | United States       | Entry-level      | Full-time        | Remote       | United States      | S            |
| 9354 | 2020      | Business Data Analyst    | Data Analysis                   | USD              | 100000  | 100000        | United States       | Entry-level      | Contract         | Remote       | United States      | L            |
*Note: 9355 rows × 12 columns*
df.head() #Displays the first few rows of the DataFrame.|    | work_year | job_title               | job_category                    | salary_currency | salary | salary_in_usd | employee_residence | experience_level | employment_type | work_setting | company_location | company_size |
|----|-----------|-------------------------|---------------------------------|------------------|--------|---------------|---------------------|------------------|------------------|--------------|------------------|--------------|
| 0  | 2023      | Data DevOps Engineer    | Data Engineering                | EUR              | 88000  | 95012         | Germany             | Mid-level        | Full-time        | Hybrid       | Germany          | L            |
| 1  | 2023      | Data Architect          | Data Architecture and Modeling | USD              | 186000 | 186000        | United States       | Senior           | Full-time        | In-person    | United States    | M            |
| 2  | 2023      | Data Architect          | Data Architecture and Modeling | USD              | 81800  | 81800         | United States       | Senior           | Full-time        | In-person    | United States    | M            |
| 3  | 2023      | Data Scientist          | Data Science and Research       | USD              | 212000 | 212000        | United States       | Senior           | Full-time        | In-person    | United States    | M            |
| 4  | 2023      | Data Scientist          | Data Science and Research       | USD              | 93300  | 93300         | United States       | Senior           | Full-time        | In-person    | United States    | M            |
df.tail() #Displays the last few rows of the DataFrame.|        | work_year |           job_title           |          job_category          | salary_currency | salary | salary_in_usd | employee_residence | experience_level | employment_type | work_setting | company_location | company_size |
|--------|-----------|------------------------------|--------------------------------|------------------|--------|--------------|---------------------|-------------------|------------------|--------------|-------------------|--------------|
| 9350   | 2021      | Data Specialist              | Data Management and Strategy    | USD              | 165000 | 165000       | United States       | Senior            | Full-time        | Remote       | United States     | L            |
| 9351   | 2020      | Data Scientist               | Data Science and Research       | USD              | 412000 | 412000       | United States       | Senior            | Full-time        | Remote       | United States     | L            |
| 9352   | 2021      | Principal Data Scientist     | Data Science and Research       | USD              | 151000 | 151000       | United States       | Mid-level         | Full-time        | Remote       | United States     | L            |
| 9353   | 2020      | Data Scientist               | Data Science and Research       | USD              | 105000 | 105000       | United States       | Entry-level       | Full-time        | Remote       | United States     | S            |
| 9354   | 2020      | Business Data Analyst        | Data Analysis                   | USD              | 100000 | 100000       | United States       | Entry-level       | Contract         | Remote       | United States     | L            |
df.sample(2) #Displays a random sample of 5 rows from the DataFrame.|   | work_year | job_title          | job_category               | salary_currency | salary | salary_in_usd | employee_residence | experience_level | employment_type | work_setting | company_location | company_size |
|---|-----------|---------------------|----------------------------|-----------------|--------|---------------|---------------------|-------------------|------------------|--------------|-------------------|--------------|
| 1 | 2023      | Applied Scientist  | Data Science and Research  | USD             | 172040 | 172040        | United States       | Senior            | Full-time        | Remote       | United States     | M            |
| 2 | 2023      | Data Analyst       | Data Analysis              | USD             | 126000 | 126000        | United States       | Senior            | Full-time        | In-person    | United States     | M            |
Summary Statistics:
# Summary Statistics:
df.describe() #Generates descriptive statistics for numerical columns.|              | work_year    | salary       | salary_in_usd |
|--------------|--------------|--------------|---------------|
| count        | 9355.000000  | 9355.000000  | 9355.000000   |
| mean         | 2022.760449  | 149927.981293| 150299.495564 |
| std          | 0.519470     | 63608.835387 | 63177.372024  |
| min          | 2020.000000  | 14000.000000 | 15000.000000  |
| 25%          | 2023.000000  | 105200.000000| 105700.000000 |
| 50%          | 2023.000000  | 143860.000000| 143000.000000 |
| 75%          | 2023.000000  | 187000.000000| 186723.000000 |
| max          | 2023.000000  | 450000.000000| 450000.000000 |
Data Types:
# Datatype:
df.info()   #Provides a concise summary of the DataFrame, including data types and non-null values.<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9355 entries, 0 to 9354
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           9355 non-null   int64 
 1   job_title           9355 non-null   object
 2   job_category        9355 non-null   object
 3   salary_currency     9355 non-null   object
 4   salary              9355 non-null   int64 
 5   salary_in_usd       9355 non-null   int64 
 6   employee_residence  9355 non-null   object
 7   experience_level    9355 non-null   object
 8   employment_type     9355 non-null   object
 9   work_setting        9355 non-null   object
 10  company_location    9355 non-null   object
 11  company_size        9355 non-null   object
dtypes: int64(3), object(9)
memory usage: 877.2+ KB
# Detecting Missing Values
df.isnull()     #or df.isna()| work_year | job_title | job_category | salary_currency | salary | salary_in_usd | employee_residence | experience_level | employment_type | work_setting | company_location | company_size |
|-----------|-----------|--------------|-----------------|--------|---------------|--------------------|-------------------|------------------|--------------|------------------|--------------|
| False     | False     | False        | False           | False  | False         | False              | False             | False            | False        | False            | False        |
| False     | False     | False        | False           | False  | False         | False              | False             | False            | False        | False            | False        |
| False     | False     | False        | False           | False  | False         | False              | False             | False            | False        | False            | False        |
| False     | False     | False        | False           | False  | False         | False              | False             | False            | False        | False            | False        |
| False     | False     | False        | False           | False  | False         | False              | False             | False            | False        | False            | False        |
| ...       | ...       | ...          | ...             | ...    | ...           | ...                | ...               | ...              | ...          | ...              | ...          |
| False     | False     | False        | False           | False  | False         | False              | False             | False            | False        | False            | False        |
| False     | False     | False        | False           | False  | False         | False              | False             | False            | False        | False            | False        |
| False     | False     | False        | False           | False  | False         | False              | False             | False            | False        | False            | False        |
| False     | False     | False        | False           | False  | False         | False              | False             | False            | False        | False            | False        |
9355 rows × 12 columns
#Counting Missing Values:
df.isnull().sum()work_year             0
job_title             0
job_category          0
salary_currency       0
salary                0
salary_in_usd         0
employee_residence    0
experience_level      0
employment_type       0
work_setting          0
company_location      0
company_size          0
dtype: int64
# Dropping Missing Values:
# Use dropna() to remove rows or columns with missing values.
# Dropping rows with any missing values
df.dropna(inplace=True)
# Dropping columns with any missing values
df.dropna(axis=1, inplace=True)# Filling Missing Values:
# Use fillna(value) to replace missing values with a specific value.
# Filling missing values in salalry field with a specific value (e.g., mean)
df.fillna(value=df['salary'].mean(), inplace=True)# Forward Fill (ffill) and Backward Fill (bfill):
# Use ffill() to fill missing values with the previous value and bfill() to fill with the next value.
# Forward fill missing values
df.ffill(inplace=True)
# Backward fill missing values
df.bfill(inplace=True)# Grouping by 'Category' column
grouped_data = df.groupby('job_title')
# Calculating mean for each group
mean_values = grouped_data.mean() # After grouping, apply aggregation functions (e.g., mean, sum, count) to get summary statistics for each group.
print(mean_values)                                   work_year         salary  salary_in_usd
job_title                                                                 
AI Architect                     2023.000000  249000.000000  250328.000000
AI Developer                     2022.944444  140500.000000  141140.888889
AI Engineer                      2023.000000  169208.416667  171663.972222
AI Programmer                    2022.800000   74000.000000   68817.400000
AI Research Engineer             2022.750000   67275.000000   73271.500000
...                                      ...            ...            ...
Sales Data Analyst               2020.000000   60000.000000   60000.000000
Software Data Engineer           2023.000000  120000.000000  111627.666667
Staff Data Analyst               2021.000000   84999.333333   79917.000000
Staff Data Scientist             2020.500000  134500.000000  134500.000000
Staff Machine Learning Engineer  2021.000000  185000.000000  185000.000000
[125 rows x 3 columns]