Pandas - Cumulative Sum By Group (cumsum)

2019-10-21
Pandas - Cumulative Sum By Group (cumsum)

Do It In Python - pandas

Generate a Random Data Frame

In order to show the cumulative sum in time sequence, the column: date is created and shuffled into a random sequence.

import pandas as pd
import numpy as np
# Set the seed to fix the result
np.random.seed(418)
# Genearte dataframe for demo purpose
N=3
df = pd.DataFrame({"item":['Apple','Orange','Lemon']*N,
                   "date":pd.date_range(pd.datetime.today().date(), periods=3*N, freq="D"),
                   "sale": np.random.randint(1,20,3*N)})
# Shuffle the sequence of date
np.random.shuffle(df.date.values)
index item date sale
0 Apple 2019-10-22 10
1 Orange 2019-10-26 17
2 Lemon 2019-10-25 10
3 Apple 2019-10-23 16
4 Orange 2019-10-27 9
5 Lemon 2019-10-24 7
6 Apple 2019-10-21 11
7 Orange 2019-10-28 3
8 Lemon 2019-10-29 14

Compute Cumulative Sum By Items

Bad Example

The goal is to compute the cumulative sum over date by different items. However, the index of the original data frame is not ordered in the desired sequence: date. The column: cum_sale is computed by the original index within a group: item. Take Apple for example, it is computed in the order of index: 0, 3, 6, which leads to the cumulative sum of 10, 26, 37 and is not in the chronological order.

df["cum_sale"]=df.groupby(['item'])['sale'].cumsum(axis=0)
index item date sale cum_sale
6 Apple 2019-10-21 11 37
0 Apple 2019-10-22 10 10
3 Apple 2019-10-23 16 26
5 Lemon 2019-10-24 7 17
2 Lemon 2019-10-25 10 10
8 Lemon 2019-10-29 14 31
1 Orange 2019-10-26 17 17
4 Orange 2019-10-27 9 26
7 Orange 2019-10-28 3 29

Good Example

Pandas operates sequence by row indexes. Thus, data frame first needs to be sorted by the desired order: date, followed by resetting the index to match the order of the variable of interest. Now, the output is summed up by group of item in the chronological order.

df=df.sort_values(['date']).reset_index(drop=True)
df["cum_sale"]=df.groupby(['item'])['sale'].cumsum(axis=0)
index item date sale cum_sale
0 Apple 2019-10-21 11 11
1 Apple 2019-10-22 10 21
2 Apple 2019-10-23 16 37
3 Lemon 2019-10-24 7 7
4 Lemon 2019-10-25 10 17
8 Lemon 2019-10-29 14 31
5 Orange 2019-10-26 17 17
6 Orange 2019-10-27 9 26
7 Orange 2019-10-28 3 29

Bonus: Do It In R - dplyr

library(dplyr)
df %>%
  group_by(item) %>%
  arrange(date) %>%
  mutate(cum_sale=cumsum(sale))
index item date sale cum_sale
6 Apple 2019-10-21 11 11
0 Apple 2019-10-22 10 21
3 Apple 2019-10-23 16 37
5 Lemon 2019-10-24 7 7
2 Lemon 2019-10-25 10 17
8 Lemon 2019-10-29 14 31
1 Orange 2019-10-26 17 17
4 Orange 2019-10-27 9 26
7 Orange 2019-10-28 3 29