Do It In Python - pandas
Generate a Random Data Frame
In order to show the cumulative sum in time sequence, the column: date
is created and shuffled into a random sequence.
import pandas as pd
import numpy as np
# Set the seed to fix the result
np.random.seed(418)
# Genearte dataframe for demo purpose
N=3
df = pd.DataFrame({"item":['Apple','Orange','Lemon']*N,
"date":pd.date_range(pd.datetime.today().date(), periods=3*N, freq="D"),
"sale": np.random.randint(1,20,3*N)})
# Shuffle the sequence of date
np.random.shuffle(df.date.values)
index
|
item
|
date
|
sale
|
0
|
Apple
|
2019-10-22
|
10
|
1
|
Orange
|
2019-10-26
|
17
|
2
|
Lemon
|
2019-10-25
|
10
|
3
|
Apple
|
2019-10-23
|
16
|
4
|
Orange
|
2019-10-27
|
9
|
5
|
Lemon
|
2019-10-24
|
7
|
6
|
Apple
|
2019-10-21
|
11
|
7
|
Orange
|
2019-10-28
|
3
|
8
|
Lemon
|
2019-10-29
|
14
|
Compute Cumulative Sum By Items
Bad Example
The goal is to compute the cumulative sum over date by different items. However, the index
of the original data frame is not ordered in the desired sequence: date
. The column: cum_sale
is computed by the original index within a group: item
. Take Apple
for example, it is computed in the order of index: 0
, 3
, 6
, which leads to the cumulative sum of 10
, 26
, 37
and is not in the chronological order.
df["cum_sale"]=df.groupby(['item'])['sale'].cumsum(axis=0)
index
|
item
|
date
|
sale
|
cum_sale
|
6
|
Apple
|
2019-10-21
|
11
|
37
|
0
|
Apple
|
2019-10-22
|
10
|
10
|
3
|
Apple
|
2019-10-23
|
16
|
26
|
5
|
Lemon
|
2019-10-24
|
7
|
17
|
2
|
Lemon
|
2019-10-25
|
10
|
10
|
8
|
Lemon
|
2019-10-29
|
14
|
31
|
1
|
Orange
|
2019-10-26
|
17
|
17
|
4
|
Orange
|
2019-10-27
|
9
|
26
|
7
|
Orange
|
2019-10-28
|
3
|
29
|
Good Example
Pandas operates sequence by row indexes. Thus, data frame first needs to be sorted by the desired order: date
, followed by resetting the index to match the order of the variable of interest. Now, the output is summed up by group of item
in the chronological order.
df=df.sort_values(['date']).reset_index(drop=True)
df["cum_sale"]=df.groupby(['item'])['sale'].cumsum(axis=0)
index
|
item
|
date
|
sale
|
cum_sale
|
0
|
Apple
|
2019-10-21
|
11
|
11
|
1
|
Apple
|
2019-10-22
|
10
|
21
|
2
|
Apple
|
2019-10-23
|
16
|
37
|
3
|
Lemon
|
2019-10-24
|
7
|
7
|
4
|
Lemon
|
2019-10-25
|
10
|
17
|
8
|
Lemon
|
2019-10-29
|
14
|
31
|
5
|
Orange
|
2019-10-26
|
17
|
17
|
6
|
Orange
|
2019-10-27
|
9
|
26
|
7
|
Orange
|
2019-10-28
|
3
|
29
|
Bonus: Do It In R - dplyr
library(dplyr)
df %>%
group_by(item) %>%
arrange(date) %>%
mutate(cum_sale=cumsum(sale))
index
|
item
|
date
|
sale
|
cum_sale
|
6
|
Apple
|
2019-10-21
|
11
|
11
|
0
|
Apple
|
2019-10-22
|
10
|
21
|
3
|
Apple
|
2019-10-23
|
16
|
37
|
5
|
Lemon
|
2019-10-24
|
7
|
7
|
2
|
Lemon
|
2019-10-25
|
10
|
17
|
8
|
Lemon
|
2019-10-29
|
14
|
31
|
1
|
Orange
|
2019-10-26
|
17
|
17
|
4
|
Orange
|
2019-10-27
|
9
|
26
|
7
|
Orange
|
2019-10-28
|
3
|
29
|