Pandas - Cumulative Sum By Group (cumsum)

2019-10-21

Do It In Python - pandas

Generate a Random Data Frame

In order to show the cumulative sum in time sequence, the column: date is created and shuffled into a random sequence.

import pandas as pd
import numpy as np
# Set the seed to fix the result
np.random.seed(418)
# Genearte dataframe for demo purpose
N=3
df = pd.DataFrame({"item":['Apple','Orange','Lemon']*N,
                   "date":pd.date_range(pd.datetime.today().date(), periods=3*N, freq="D"),
                   "sale": np.random.randint(1,20,3*N)})
# Shuffle the sequence of date
np.random.shuffle(df.date.values)

index	item	date	sale
0	Apple	2019-10-22	10
1	Orange	2019-10-26	17
2	Lemon	2019-10-25	10
3	Apple	2019-10-23	16
4	Orange	2019-10-27	9
5	Lemon	2019-10-24	7
6	Apple	2019-10-21	11
7	Orange	2019-10-28	3
8	Lemon	2019-10-29	14

Compute Cumulative Sum By Items

Bad Example

The goal is to compute the cumulative sum over date by different items. However, the index of the original data frame is not ordered in the desired sequence: date. The column: cum_sale is computed by the original index within a group: item. Take Apple for example, it is computed in the order of index: 0, 3, 6, which leads to the cumulative sum of 10, 26, 37 and is not in the chronological order.

df["cum_sale"]=df.groupby(['item'])['sale'].cumsum(axis=0)

index	item	date	sale	cum_sale
6	Apple	2019-10-21	11	37
0	Apple	2019-10-22	10	10
3	Apple	2019-10-23	16	26
5	Lemon	2019-10-24	7	17
2	Lemon	2019-10-25	10	10
8	Lemon	2019-10-29	14	31
1	Orange	2019-10-26	17	17
4	Orange	2019-10-27	9	26
7	Orange	2019-10-28	3	29

Good Example

Pandas operates sequence by row indexes. Thus, data frame first needs to be sorted by the desired order: date, followed by resetting the index to match the order of the variable of interest. Now, the output is summed up by group of item in the chronological order.

df=df.sort_values(['date']).reset_index(drop=True)
df["cum_sale"]=df.groupby(['item'])['sale'].cumsum(axis=0)

index	item	date	sale	cum_sale
0	Apple	2019-10-21	11	11
1	Apple	2019-10-22	10	21
2	Apple	2019-10-23	16	37
3	Lemon	2019-10-24	7	7
4	Lemon	2019-10-25	10	17
8	Lemon	2019-10-29	14	31
5	Orange	2019-10-26	17	17
6	Orange	2019-10-27	9	26
7	Orange	2019-10-28	3	29

Bonus: Do It In R - `dplyr`

library(dplyr)
df %>%
  group_by(item) %>%
  arrange(date) %>%
  mutate(cum_sale=cumsum(sale))

index	item	date	sale	cum_sale
6	Apple	2019-10-21	11	11
0	Apple	2019-10-22	10	21
3	Apple	2019-10-23	16	37
5	Lemon	2019-10-24	7	7
2	Lemon	2019-10-25	10	17
8	Lemon	2019-10-29	14	31
1	Orange	2019-10-26	17	17
4	Orange	2019-10-27	9	26
7	Orange	2019-10-28	3	29

Pandas - Cumulative Sum By Group (cumsum)

Do It In Python - pandas

Generate a Random Data Frame

Compute Cumulative Sum By Items

Bad Example

Good Example

Bonus: Do It In R - dplyr

Bonus: Do It In R - `dplyr`