What’s pipe?
Pipe is a method of pandas.DataFrame that applies a function, either an existing one from a package or one you define yourself, to the dataframe. It is one of the methods that enable method chaining: with pipe, several processing steps can be chained together instead of nested. Let’s look at an example to show its benefits.
In the pandas documentation, three functions, h(df), g(df, arg1=a), and f(df, arg2=b, arg3=c), are applied to df in this order. Written the usual way, the three calls are nested inside one another, so the code reads in the reverse of the order in which the functions run, and it is hard to tell at a glance which arguments belong to which function. With method chaining, the relationships among the operations are shown in a clearer format.
# Nested functions
f(g(h(df), arg1=a), arg2=b, arg3=c)
# Method chaining
(
    df.pipe(h)
    .pipe(g, arg1=a)
    .pipe(f, arg2=b, arg3=c)
)
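To make the equivalence concrete, here is a minimal sketch. The bodies of h, g, and f, the values of a, b, and c, and the column name x are hypothetical stand-ins (the pandas documentation leaves them abstract); the only point is that both spellings give the same result.

```python
import pandas as pd

# Hypothetical stand-ins for h, g and f.
def h(df):
    # Drop rows with missing values.
    return df.dropna()

def g(df, arg1):
    # Keep rows whose "x" is at least arg1.
    return df[df["x"] >= arg1]

def f(df, arg2, arg3):
    # Add a column computed as x * arg2 + arg3.
    return df.assign(y=df["x"] * arg2 + arg3)

a, b, c = 2, 10, 1
df = pd.DataFrame({"x": [1.0, 2.0, None, 4.0]})

# Nested calls: read inside-out (h, then g, then f).
nested = f(g(h(df), arg1=a), arg2=b, arg3=c)

# Chained calls: read top to bottom in execution order.
chained = (
    df.pipe(h)
    .pipe(g, arg1=a)
    .pipe(f, arg2=b, arg3=c)
)

# Both spellings produce the same dataframe.
assert nested.equals(chained)
```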
Let’s use online shopping as an example to show the different ways of combining multiple steps in a row. Five functions, add_to_cart, checkout, shipping, billing, and place_order, are used to complete a customer's transaction. The following two examples show common ways of calling multiple functions consecutively.
The first one relies heavily on nesting: without careful formatting it is hard to read, and it is hard to tell which argument belongs to which function. In my experience, the second one is even harder to work with, because every intermediate result needs a meaningful name or it becomes unrecognizable later. Those intermediate results are also often used only once and never again in the rest of the process. That’s why I wouldn’t choose this approach if I had an alternative.
# 1. Nested functions
place_order(
    billing(
        shipping(
            checkout(add_to_cart(items)),
            "address"),
        "credit_card"))
# 2. Save all the intermediate results
cart = add_to_cart(items)
new_order = checkout(cart)
shipping_info = shipping(new_order, "address")
billing_info = billing(shipping_info, "credit_card")
completed_order = place_order(billing_info)
For the same process, method chaining with pipe keeps the code readable and makes the arguments of each function call easy to recognize. It clearly shows the order of execution and the arguments without any nesting.
# Method chaining (the parentheses let the chain span several lines)
(
    items.pipe(add_to_cart)
    .pipe(checkout)
    .pipe(shipping, "address")
    .pipe(billing, "credit_card")
    .pipe(place_order)
)
This situation is common in data science, where data manipulation involves numerous steps. Oftentimes, the intermediate results are not important, since the goal of data manipulation is to end up with clean final data.
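Note that pipe is a pandas method, so the chained version only runs if items is a pandas object such as a DataFrame. The sketch below is one way to make the shopping example runnable. The function bodies and the columns item, price, and qty are assumptions made purely for illustration; the article only names the five functions.

```python
import pandas as pd

# Hypothetical implementations; only the function names come from the article.
def add_to_cart(items):
    # Keep only items with a positive quantity.
    return items[items["qty"] > 0]

def checkout(cart):
    # Compute a line total per item.
    return cart.assign(total=cart["price"] * cart["qty"])

def shipping(order, address):
    # Record where the order is shipped.
    return order.assign(ship_to=address)

def billing(order, payment_method):
    # Record how the order is paid.
    return order.assign(paid_with=payment_method)

def place_order(order):
    # Mark every line of the order as placed.
    return order.assign(status="placed")

items = pd.DataFrame({
    "item": ["book", "pen", "lamp"],
    "price": [12.0, 1.5, 30.0],
    "qty": [1, 4, 0],
})

completed_order = (
    items.pipe(add_to_cart)
    .pipe(checkout)
    .pipe(shipping, "address")
    .pipe(billing, "credit_card")
    .pipe(place_order)
)
print(completed_order)
```

Each step receives the dataframe returned by the previous one, so no intermediate variable has to be named.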
Examples
Let’s look into several examples to see how pipe really works. Here, a dataframe containing student name, subject and score is randomly generated.
import pandas as pd
import numpy as np
# Set seed
np.random.seed(520)
# Create a dataframe
df = pd.DataFrame({
    'name': ['Ted'] * 3 + ['Lisa'] * 3 + ['Sam'] * 3,
    'subject': ['math', 'physics', 'history'] * 3,
    'score': np.random.randint(60, 100, 9)
})
index | name | subject | score |
---|---|---|---|
0 | Ted | math | 87 |
1 | Ted | physics | 80 |
2 | Ted | history | 75 |
3 | Lisa | math | 79 |
4 | Lisa | physics | 78 |
5 | Lisa | history | 77 |
6 | Sam | math | 85 |
7 | Sam | physics | 61 |
8 | Sam | history | 88 |
To get rank by subject in a line
Return pandas dataframe
The goal is to get the rank of each score within its subject in one line and append it to the dataframe. A function, get_subject_rank, is created for this task. Passing the function to pipe returns a copy of the dataframe with the rank appended.
def get_subject_rank(input_df):
    # Copy to avoid overwriting the original dataframe
    input_df = input_df.copy()
    input_df['subject_rank'] = (input_df
                                .groupby(['subject'])['score']
                                .rank(ascending=False))
    return input_df
# pipe method
df.pipe(get_subject_rank)
index | name | subject | score | subject_rank |
---|---|---|---|---|
0 | Ted | math | 87 | 1 |
1 | Ted | physics | 80 | 1 |
2 | Ted | history | 75 | 3 |
3 | Lisa | math | 79 | 3 |
4 | Lisa | physics | 78 | 2 |
5 | Lisa | history | 77 | 2 |
6 | Sam | math | 85 | 2 |
7 | Sam | physics | 61 | 3 |
8 | Sam | history | 88 | 1 |
Return pandas series
Pipe returns whatever the function returns, so the output can be of any type. In the following example, the function returns a pandas series when df_or_not=False. When a function takes more than one argument, the additional arguments are specified in the pipe call, as the example also shows.
def get_subject_rank(input_df, df_or_not=True):
    # Copy to avoid overwriting the original dataframe
    input_df = input_df.copy()
    if df_or_not:
        input_df['subject_rank'] = (input_df
                                    .groupby(['subject'])['score']
                                    .rank(ascending=False))
        return input_df
    else:
        output_series = (input_df
                         .groupby(['subject'])['score']
                         .rank(ascending=False))
        return output_series
# pipe method - return arbitrary output
df.pipe(get_subject_rank, df_or_not=False)
## 0 1.0
## 1 1.0
## 2 3.0
## 3 3.0
## 4 2.0
## 5 2.0
## 6 2.0
## 7 3.0
## 8 1.0
## Name: score, dtype: float64
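Because pipe simply hands back whatever the function returns, a chain can keep going with Series methods after such a call. The sketch below hard-codes the scores from the table above (so it does not depend on the random seed) and uses a condensed variant of get_subject_rank; the variable name top_count is just for illustration.

```python
import pandas as pd

# Scores copied from the table above instead of being regenerated.
df = pd.DataFrame({
    'name': ['Ted'] * 3 + ['Lisa'] * 3 + ['Sam'] * 3,
    'subject': ['math', 'physics', 'history'] * 3,
    'score': [87, 80, 75, 79, 78, 77, 85, 61, 88],
})

def get_subject_rank(input_df, df_or_not=True):
    # Rank scores within each subject, highest score first.
    ranks = input_df.groupby('subject')['score'].rank(ascending=False)
    if df_or_not:
        return input_df.assign(subject_rank=ranks)
    return ranks

# pipe returns a Series here, so Series methods can follow in the chain.
top_count = (
    df.pipe(get_subject_rank, df_or_not=False)
    .eq(1)   # True where the student is first in the subject
    .sum()   # number of first places
)
print(top_count)
```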
Data is not the first argument
When a function is called through pipe, the dataframe or series that pipe is called on is passed as the function's first argument by default. Here is an example with add_score, a function that modifies scores. Its first argument, input_df, receives df, so there is no need to specify input_df in the pipe call.
def add_score(input_df, added_score):
    # Copy to avoid overwriting the original dataframe
    input_df = input_df.copy()
    input_df = input_df.assign(new_score=lambda x: x.score+added_score)
    return input_df
df.pipe(add_score, 2)
index | name | subject | score | new_score |
---|---|---|---|---|
0 | Ted | math | 87 | 89 |
1 | Ted | physics | 80 | 82 |
2 | Ted | history | 75 | 77 |
3 | Lisa | math | 79 | 81 |
4 | Lisa | physics | 78 | 80 |
5 | Lisa | history | 77 | 79 |
6 | Sam | math | 85 | 87 |
7 | Sam | physics | 61 | 63 |
8 | Sam | history | 88 | 90 |
Now the two arguments of add_score are swapped, so the data is the second argument of the function. In this case, a tuple, (function, "name of the data argument"), is passed to pipe to indicate which keyword argument should receive the data.
def add_score(added_score, input_df):
    # Copy to avoid overwriting the original dataframe
    input_df = input_df.copy()
    input_df = input_df.assign(new_score=lambda x: x.score+added_score)
    return input_df
df.pipe((add_score, "input_df"), 2)
index | name | subject | score | new_score |
---|---|---|---|---|
0 | Ted | math | 87 | 89 |
1 | Ted | physics | 80 | 82 |
2 | Ted | history | 75 | 77 |
3 | Lisa | math | 79 | 81 |
4 | Lisa | physics | 78 | 80 |
5 | Lisa | history | 77 | 79 |
6 | Sam | math | 85 | 87 |
7 | Sam | physics | 61 | 63 |
8 | Sam | history | 88 | 90 |
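As a rough illustration of what the tuple form does, the sketch below compares it with a direct call and with a lambda. The three-row dataframe is a shortened, made-up subset used only to keep the example small.

```python
import pandas as pd

# Small illustrative dataframe (not the full one from the article).
df = pd.DataFrame({
    'name': ['Ted', 'Lisa', 'Sam'],
    'score': [87, 79, 85],
})

def add_score(added_score, input_df):
    # assign returns a new dataframe, so the input is left untouched.
    return input_df.assign(new_score=lambda x: x.score + added_score)

# Tuple form: pipe passes df as the keyword argument named "input_df".
via_tuple = df.pipe((add_score, "input_df"), 2)

# Equivalent direct call.
direct = add_score(2, input_df=df)

# A lambda is an alternative when the tuple form feels opaque.
via_lambda = df.pipe(lambda d: add_score(2, d))

assert via_tuple.equals(direct)
assert via_tuple.equals(via_lambda)
```

The tuple form keeps the chain free of lambdas, while the lambda makes the argument order explicit; which one reads better is a matter of taste.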
Debug in method chaining
Some critics might be concerned that long chains are hard to debug because no intermediate results are returned. No worries! In this post, the author provides a great way to tackle this problem: decorators. A decorator is a function that extends the behavior of the wrapped function without explicitly modifying it.
Let’s look at actual examples. By using decorators and logging together, any property of the dataframe can be written to the log when specified in a decorator. Here, the shape and the columns are logged using log_shape and log_columns. The logged output is also printed below for reference.
Note: wraps is used to eliminate the side effects of decorators, so that the function's name, docstring, argument list, etc. are preserved after decoration.
from functools import wraps
import logging

# Assumed configuration: emit INFO messages in the "LEVEL - message" format of the logged output
logging.basicConfig(level=logging.INFO, format="%(levelname)s - %(message)s")

def log_shape(func):
    # Log the shape of the dataframe returned by func
    @wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        logging.info("%s,%s", func.__name__, result.shape)
        return result
    return wrapper

def log_columns(func):
    # Log the columns of the dataframe returned by func
    @wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        logging.info("%s,%s", func.__name__, result.columns)
        return result
    return wrapper

@log_columns
@log_shape
def get_subject_rank(input_df):
    input_df = input_df.copy()
    input_df['subject_rank'] = (input_df
                                .groupby(['subject'])['score']
                                .rank(ascending=False))
    return input_df

@log_columns
@log_shape
def add_score(input_df, added_score):
    input_df = input_df.copy()
    input_df = input_df.assign(new_score=lambda x: x.score+added_score)
    return input_df
(
df.pipe(get_subject_rank)
.pipe(add_score, 2)
)
The code above is adapted from Tom’s code.
INFO - get_subject_rank,(9, 4)
INFO - get_subject_rank,Index(['name', 'subject', 'score', 'subject_rank'], dtype='object')
INFO - add_score,(9, 5)
INFO - add_score,Index(['name', 'subject', 'score', 'subject_rank', 'new_score'], dtype='object')
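To see what wraps contributes, here is a minimal sketch comparing a decorator written with and without it. The names without_wraps, with_wraps, step_a, and step_b are made up for this illustration.

```python
from functools import wraps

def without_wraps(func):
    # The wrapper replaces func, so func's metadata is lost.
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

def with_wraps(func):
    # wraps copies func's name, docstring, etc. onto the wrapper.
    @wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@without_wraps
def step_a(input_df):
    """First step."""
    return input_df

@with_wraps
def step_b(input_df):
    """Second step."""
    return input_df

print(step_a.__name__, step_a.__doc__)  # wrapper None
print(step_b.__name__, step_b.__doc__)  # step_b Second step.
```

This matters when decorators are stacked: the outer decorator reads func.__name__ from the inner wrapper, so without wraps its log line would show wrapper instead of the real function name.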
Pipe is a flexible method for fitting customized functions into pandas operations. It is great that pandas has implemented many methods that enable method chaining during data manipulation. I enjoy exploring more possibilities and more efficient ways to play with data in pandas.
More information about why chaining (Python) and pipes (R) are useful for data scientists can be found in the article about method chaining by Tom Augspurger, one of the main contributors to pandas, and in the chapter about pipes in R for Data Science.