How to Sum a Column Based on Another Column in Pandas DataFrame (GroupBy) and Retain Additional Column Values

In data analysis, a common task is to aggregate numerical values (e.g., summing sales revenue) based on categories (e.g., regions, products) while retaining additional context (e.g., salesperson, date). Pandas, the go-to Python library for data manipulation, simplifies this with its groupby functionality. However, retaining non-aggregated columns (e.g., names, dates) alongside summed values can be tricky—especially when categories have multiple entries for these columns.

This blog will guide you through summing a column based on another column using groupby, with a focus on retaining additional columns effectively. We’ll cover basic to advanced scenarios, common pitfalls, and practical examples to ensure you can apply these techniques to real-world datasets.

Table of Contents#

  1. Prerequisites
  2. Understanding the Problem
  3. Step 1: Create a Sample DataFrame
  4. Basic GroupBy Sum (Without Additional Columns)
  5. Retaining Additional Column Values: Key Scenarios
  6. Advanced Use Cases
  7. Common Pitfalls and How to Avoid Them
  8. Conclusion
  9. References

Prerequisites#

Before diving in, ensure you have:

  • Basic familiarity with Python and Pandas.
  • Pandas installed (run pip install pandas if not).
  • A Python environment (e.g., Jupyter Notebook, VS Code).

Understanding the Problem#

Suppose you have a sales DataFrame with columns: Region, Product, Salesperson, Sale_Date, and Revenue. You want to:

  • Sum the Revenue for each Region.
  • Retain additional columns like Salesperson or Sale_Date for context.

The challenge arises when a group (e.g., "North Region") has multiple entries for Salesperson (e.g., "Alice" and "Bob"). How do you decide which salesperson to retain? We’ll solve this with Pandas’ groupby and aggregation tools.

Step 1: Create a Sample DataFrame#

Let’s start with a sample DataFrame to demonstrate. Run this code to create it:

import pandas as pd
 
data = {
    "Region": ["North", "North", "South", "East", "East", "West"],
    "Product": ["Laptop", "Phone", "Laptop", "Phone", "Laptop", "Phone"],
    "Salesperson": ["Alice", "Alice", "Bob", "Charlie", "Charlie", "Diana"],
    "Sale_Date": ["2023-01-05", "2023-01-10", "2023-01-05", "2023-01-12", "2023-01-15", "2023-01-08"],
    "Revenue": [1200, 800, 1500, 900, 1300, 750]
}
 
df = pd.DataFrame(data)
print(df)

Output:

  Region  Product Salesperson   Sale_Date  Revenue
0  North   Laptop       Alice  2023-01-05     1200
1  North    Phone       Alice  2023-01-10      800
2  South   Laptop         Bob  2023-01-05     1500
3   East    Phone     Charlie  2023-01-12      900
4   East   Laptop     Charlie  2023-01-15     1300
5   West    Phone       Diana  2023-01-08      750

Basic GroupBy Sum (Without Additional Columns)#

First, let’s recall how to sum a column by another column without retaining extra columns. To sum Revenue by Region:

# Group by 'Region' and sum 'Revenue'
region_revenue = df.groupby("Region")["Revenue"].sum()
print(region_revenue)

Output:

Region
East     2200  # 900 + 1300
North    2000  # 1200 + 800
South    1500
West      750
Name: Revenue, dtype: int64

This returns a Series with Region as the index and summed Revenue. To convert it to a DataFrame, use reset_index():

region_revenue_df = df.groupby("Region")["Revenue"].sum().reset_index()
print(region_revenue_df)

Output:

  Region  Revenue
0   East     2200
1  North     2000
2  South     1500
3   West      750

But we’re missing Salesperson and Sale_Date! Let’s fix that.

Retaining Additional Column Values: Key Scenarios#

To retain columns like Salesperson, we need to handle two scenarios: when the column has unique values per group or non-unique values per group.

Scenario 1: Additional Columns Have Unique Values Per Group#

If each group (e.g., "North Region") has one unique value for the additional column (e.g., only "Alice" works in the North), include the column in the groupby statement.

Example: Sum Revenue by Region and retain Salesperson (assuming one salesperson per region):

# Check if 'Salesperson' is unique per 'Region'
print(df.groupby("Region")["Salesperson"].nunique())  # Output: North=1, South=1, East=1, West=1 → unique!
 
# Group by both 'Region' and 'Salesperson', then sum 'Revenue'
unique_retain = df.groupby(["Region", "Salesperson"])["Revenue"].sum().reset_index()
print(unique_retain)

Output:

  Region Salesperson  Revenue
0   East     Charlie     2200
1  North       Alice     2000
2  South         Bob     1500
3   West       Diana      750

Perfect! Since each region has one salesperson, grouping by both columns retains Salesperson.

Scenario 2: Additional Columns Have Non-Unique Values Per Group#

If a group has multiple values for the additional column (e.g., "North Region" has two salespeople: "Alice" and "Eve"), use aggregation functions like first(), last(), max(), or min() to pick which value to retain.

Example: Suppose we add a row where "North" has another salesperson, "Eve":

# Add a new row to make 'Salesperson' non-unique in North
new_row = {"Region": "North", "Product": "Tablet", "Salesperson": "Eve", "Sale_Date": "2023-01-18", "Revenue": 500}
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
print(df)  # Now North has Alice and Eve

Updated DataFrame:

  Region  Product Salesperson   Sale_Date  Revenue
0  North   Laptop       Alice  2023-01-05     1200
1  North    Phone       Alice  2023-01-10      800
2  South   Laptop         Bob  2023-01-05     1500
3   East    Phone     Charlie  2023-01-12      900
4   East   Laptop     Charlie  2023-01-15     1300
5   West    Phone       Diana  2023-01-08      750
6  North   Tablet        Eve  2023-01-18      500  # New row

Now, "North" has two salespeople: Alice and Eve. To retain Salesperson, use agg() with a function like first() (retain the first salesperson in the group):

# Aggregate: sum 'Revenue' and retain first 'Salesperson' per region
non_unique_retain = df.groupby("Region").agg(
    Total_Revenue=("Revenue", "sum"),
    Salesperson=("Salesperson", "first")  # Use 'first' to pick the first salesperson
).reset_index()
 
print(non_unique_retain)

Output:

  Region  Total_Revenue Salesperson
0   East           2200     Charlie
1  North           2500       Alice  # Sum: 1200 + 800 + 500 = 2500; first salesperson: Alice
2  South           1500         Bob
3   West            750       Diana

Other aggregation options:

  • last(): Retain the last salesperson in the group.
  • max()/min(): Retain the alphabetically first/last name.
  • Custom function: e.g., lambda x: ", ".join(x.unique()) to list all salespeople.

Example with last() and a custom function:

non_unique_retain_custom = df.groupby("Region").agg(
    Total_Revenue=("Revenue", "sum"),
    Last_Salesperson=("Salesperson", "last"),
    All_Salespeople=("Salesperson", lambda x: ", ".join(x.unique()))
).reset_index()
 
print(non_unique_retain_custom)

Output:

  Region  Total_Revenue Last_Salesperson All_Salespeople
0   East           2200          Charlie         Charlie
1  North           2500              Eve      Alice, Eve
2  South           1500              Bob             Bob
3   West            750            Diana           Diana

Advanced Use Cases#

Grouping by Multiple Columns#

You can group by multiple columns (e.g., Region and Product) and retain additional columns like Sale_Date.

Example: Sum Revenue by Region and Product, retain the first Sale_Date:

multi_group = df.groupby(["Region", "Product"]).agg(
    Total_Revenue=("Revenue", "sum"),
    First_Sale_Date=("Sale_Date", "first")
).reset_index()
 
print(multi_group)

Output:

  Region  Product  Total_Revenue First_Sale_Date
0   East   Laptop           1300      2023-01-15
1   East    Phone            900      2023-01-12
2  North   Laptop           1200      2023-01-05
3  North    Phone            800      2023-01-10
4  North   Tablet            500      2023-01-18
5  South   Laptop           1500      2023-01-05
6   West    Phone            750      2023-01-08

Named Aggregation for Clarity#

Pandas 0.25+ supports named aggregation (using pd.NamedAgg) for readability. Redo the previous example with named aggregation:

multi_group_named = df.groupby(["Region", "Product"]).agg(
    Total_Revenue=pd.NamedAgg(column="Revenue", aggfunc="sum"),
    First_Sale_Date=pd.NamedAgg(column="Sale_Date", aggfunc="first")
).reset_index()
 
print(multi_group_named)  # Same output as above

Handling Missing Values#

If your DataFrame has NaN values, use dropna() before grouping to avoid including NaN groups:

# Add a NaN to 'Region'
df_with_nan = df.copy()
df_with_nan.loc[0, "Region"] = pd.NA
 
# Drop NaN and group
clean_group = df_with_nan.dropna(subset=["Region"]).groupby("Region")["Revenue"].sum().reset_index()

Common Pitfalls and How to Avoid Them#

  1. Forgetting reset_index(): Results in a Series instead of a DataFrame. Always use reset_index() to get a clean DataFrame.
  2. Retaining non-unique columns without aggregation: Trying to retain Salesperson without groupby or agg() will throw an error (e.g., df.groupby("Region")["Revenue", "Salesperson"].sum()).
  3. Mismatched aggregation functions: Using sum() on a string column (e.g., Salesperson) causes errors. Only aggregate numeric columns with sum.
  4. Ignoring NaN groups: Use dropna() to exclude NaN values from groups.

Conclusion#

Summing a column based on another in Pandas while retaining additional columns is straightforward once you:

  • Identify if the additional columns have unique values per group (include in groupby).
  • Use aggregation functions (e.g., first(), last()) for non-unique columns.
  • Leverage reset_index() to convert Series to DataFrames.

With these techniques, you can efficiently aggregate data and retain context for better analysis.

References#