Do It In Python - pandas
Generate a Random Data Frame
In order to show the cumulative sum in time sequence, the column: date is created and shuffled into a random sequence.
import pandas as pd
import numpy as np
# Set the seed to fix the result
np.random.seed(418)
# Genearte dataframe for demo purpose
N=3
df = pd.DataFrame({"item":['Apple','Orange','Lemon']*N,
"date":pd.date_range(pd.datetime.today().date(), periods=3*N, freq="D"),
"sale": np.random.randint(1,20,3*N)})
# Shuffle the sequence of date
np.random.shuffle(df.date.values)
|
index
|
item
|
date
|
sale
|
|
0
|
Apple
|
2019-10-22
|
10
|
|
1
|
Orange
|
2019-10-26
|
17
|
|
2
|
Lemon
|
2019-10-25
|
10
|
|
3
|
Apple
|
2019-10-23
|
16
|
|
4
|
Orange
|
2019-10-27
|
9
|
|
5
|
Lemon
|
2019-10-24
|
7
|
|
6
|
Apple
|
2019-10-21
|
11
|
|
7
|
Orange
|
2019-10-28
|
3
|
|
8
|
Lemon
|
2019-10-29
|
14
|
Compute Cumulative Sum By Items
Bad Example
The goal is to compute the cumulative sum over date by different items. However, the index of the original data frame is not ordered in the desired sequence: date. The column: cum_sale is computed by the original index within a group: item. Take Apple for example, it is computed in the order of index: 0, 3, 6, which leads to the cumulative sum of 10, 26, 37 and is not in the chronological order.
df["cum_sale"]=df.groupby(['item'])['sale'].cumsum(axis=0)
|
index
|
item
|
date
|
sale
|
cum_sale
|
|
6
|
Apple
|
2019-10-21
|
11
|
37
|
|
0
|
Apple
|
2019-10-22
|
10
|
10
|
|
3
|
Apple
|
2019-10-23
|
16
|
26
|
|
5
|
Lemon
|
2019-10-24
|
7
|
17
|
|
2
|
Lemon
|
2019-10-25
|
10
|
10
|
|
8
|
Lemon
|
2019-10-29
|
14
|
31
|
|
1
|
Orange
|
2019-10-26
|
17
|
17
|
|
4
|
Orange
|
2019-10-27
|
9
|
26
|
|
7
|
Orange
|
2019-10-28
|
3
|
29
|
Good Example
Pandas operates sequence by row indexes. Thus, data frame first needs to be sorted by the desired order: date, followed by resetting the index to match the order of the variable of interest. Now, the output is summed up by group of item in the chronological order.
df=df.sort_values(['date']).reset_index(drop=True)
df["cum_sale"]=df.groupby(['item'])['sale'].cumsum(axis=0)
|
index
|
item
|
date
|
sale
|
cum_sale
|
|
0
|
Apple
|
2019-10-21
|
11
|
11
|
|
1
|
Apple
|
2019-10-22
|
10
|
21
|
|
2
|
Apple
|
2019-10-23
|
16
|
37
|
|
3
|
Lemon
|
2019-10-24
|
7
|
7
|
|
4
|
Lemon
|
2019-10-25
|
10
|
17
|
|
8
|
Lemon
|
2019-10-29
|
14
|
31
|
|
5
|
Orange
|
2019-10-26
|
17
|
17
|
|
6
|
Orange
|
2019-10-27
|
9
|
26
|
|
7
|
Orange
|
2019-10-28
|
3
|
29
|
Bonus: Do It In R - dplyr
library(dplyr)
df %>%
group_by(item) %>%
arrange(date) %>%
mutate(cum_sale=cumsum(sale))
|
index
|
item
|
date
|
sale
|
cum_sale
|
|
6
|
Apple
|
2019-10-21
|
11
|
11
|
|
0
|
Apple
|
2019-10-22
|
10
|
21
|
|
3
|
Apple
|
2019-10-23
|
16
|
37
|
|
5
|
Lemon
|
2019-10-24
|
7
|
7
|
|
2
|
Lemon
|
2019-10-25
|
10
|
17
|
|
8
|
Lemon
|
2019-10-29
|
14
|
31
|
|
1
|
Orange
|
2019-10-26
|
17
|
17
|
|
4
|
Orange
|
2019-10-27
|
9
|
26
|
|
7
|
Orange
|
2019-10-28
|
3
|
29
|