Why Assign?
Assign is a method in pandas.DataFrame capable of creating new columns or re-assigning existing columns. Note that the original columns will be overwritten when being re-assigned. This method returns a new object - a copy - with new columns added.
The assign method is efficient and versatile during a sequence of multiple operations, when used with method chaining. It eliminates the need of saving an intermediate DataFrame during the data manipulation process. Additionally, since assign returns a new object, the error SettingWithCopy
won’t occur. For someone who started their data manipulation journey in R like myself, assign has similar programming logic and process like pipeline - %>%
in R, which is another plus!
Examples
Generate a Dataframe
import pandas as pd
# Create a dataframe
df = pd.DataFrame({
'record': ["2019-09-10 COSTCO 78.20", "2019-09-30 KFC 8.21", "2019-10-01 SHELL 23.25"]
})
index | record |
---|---|
0 | 2019-09-10 COSTCO 78.20 |
1 | 2019-09-30 KFC 8.21 |
2 | 2019-10-01 SHELL 23.25 |
Hereby I demonstrate the process of adding new columns to a DataFrame during data cleaning. The goal is to extract the date, item and price from the transaction data and convert them into desired format.
Common Usage
Common usage to add columns into a DataFrame is to use indexing to create new columns. However, it requires lots of intermediate steps that create hurdles in reading & understanding code. Note that method chaining cannot be used here.
# Extract date by subsetting the first 10 letters
df['date'] = df.record.str.slice(start=0, stop=10)
# Convert date into datetime format
df['dateformat'] = pd.to_datetime(df['date'])
# Extract item
df['item'] = df.record.str.extract(r'([a-zA-Z]+)')
# Extract price
df['price'] = df.record.str.extract(r'(\d+\.\d+)')
# Convert string into float
df['price'] = df.price.astype('float')
df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 3 entries, 0 to 2
## Data columns (total 5 columns):
## record 3 non-null object
## date 3 non-null object
## dateformat 3 non-null datetime64[ns]
## item 3 non-null object
## price 3 non-null float64
## dtypes: datetime64[ns](1), float64(1), object(3)
## memory usage: 248.0+ bytes
index | record | date | dateformat | item | price |
---|---|---|---|---|---|
0 | 2019-09-10 COSTCO 78.20 | 2019-09-10 | 2019-09-10 | COSTCO | 78.20 |
1 | 2019-09-30 KFC 8.21 | 2019-09-30 | 2019-09-30 | KFC | 8.21 |
2 | 2019-10-01 SHELL 23.25 | 2019-10-01 | 2019-10-01 | SHELL | 23.25 |
Assign Method
One of the benefits of using assign together with method chaining is that intermediate results could be checked without assignment. By doing so, new operations would not contaminate the original DataFrame. It makes debugging much easier since there is no need to rerun the complete process from the beginning. The final object could be named as an object once all the operations are validated correctly.
Severals things to keep in mind while using assign. First, column names need to be unique in each assign operation. Next, within the same assign, it is possible to create new columns that depend on previously defined columns within the same assign - as shown in the following example. Lastly, method chaining makes the assign method even more powerful!
# Remeber to have () outside the operations if there are more than 2 methods chaining.
df = (df.assign(date=lambda x: x.record.str.slice(start=0, stop=10),
# convert the date into datetime format
dateformat=lambda x: pd.to_datetime(x.date),
item=lambda x: x.record.str.extract(r'([a-zA-Z]+)'),
price=lambda x: x.record.str.extract(r'(\d+\.\d+)'))
# the same column could not be used in one assign
# method chaining for another new column
.assign(price=lambda x: x.price.astype('float')))
df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 3 entries, 0 to 2
## Data columns (total 5 columns):
## record 3 non-null object
## date 3 non-null object
## dateformat 3 non-null datetime64[ns]
## item 3 non-null object
## price 3 non-null float64
## dtypes: datetime64[ns](1), float64(1), object(3)
## memory usage: 248.0+ bytes
index | record | date | dateformat | item | price |
---|---|---|---|---|---|
0 | 2019-09-10 COSTCO 78.20 | 2019-09-10 | 2019-09-10 | COSTCO | 78.20 |
1 | 2019-09-30 KFC 8.21 | 2019-09-30 | 2019-09-30 | KFC | 8.21 |
2 | 2019-10-01 SHELL 23.25 | 2019-10-01 | 2019-10-01 | SHELL | 23.25 |
Adding Columns with Apply Method
The apply method applies a function along an axis of the DataFrame. Thus, it is necessary to apply functions on a pandas.Series when only one column is of interest. However, the apply operation would be interrupted when operations are chained.
Here comes assign as the cure. Alternatively, the assign operation here can be chained as the 3rd assign in the previous example. In assign, the x in lambda function is referred to the DataFrame. Thus, the series of interest would need to be specified in the lambda function.
# Apply to a Series with lambda function
df['euro_price'] = df.price.apply(lambda x: x * 0.9)
# Using assing with lambda function
df = df.assign(euro_price=lambda x: x.price * 0.9)
index | record | date | dateformat | item | price | euro_price |
---|---|---|---|---|---|---|
0 | 2019-09-10 COSTCO 78.20 | 2019-09-10 | 2019-09-10 | COSTCO | 78.20 | 70.380 |
1 | 2019-09-30 KFC 8.21 | 2019-09-30 | 2019-09-30 | KFC | 8.21 | 7.389 |
2 | 2019-10-01 SHELL 23.25 | 2019-10-01 | 2019-10-01 | SHELL | 23.25 | 20.925 |
More info about why chaining (Python) & pipe (R) are useful for data scientists can be found in the chapter about pipes of R for data science.
Examples above are written for demonstration purpose. The method chaining process is my preferred way of data manipulation when using assign. I’ll share more about method chaining in pandas soon. Feel free to share your thoughts with me.