Pandas - Assign Method

2019-11-03
Assign Method in Pandas

Why Assign?

Assign is a method in pandas.DataFrame capable of creating new columns or re-assigning existing columns. Note that the original columns will be overwritten when being re-assigned. This method returns a new object - a copy - with new columns added.

The assign method is efficient and versatile during a sequence of multiple operations, when used with method chaining. It eliminates the need of saving an intermediate DataFrame during the data manipulation process. Additionally, since assign returns a new object, the error SettingWithCopy won’t occur. For someone who started their data manipulation journey in R like myself, assign has similar programming logic and process like pipeline - %>% in R, which is another plus!

Examples

Generate a Dataframe

import pandas as pd

# Create a dataframe
df = pd.DataFrame({
    'record': ["2019-09-10 COSTCO 78.20", "2019-09-30 KFC 8.21", "2019-10-01 SHELL 23.25"]
})
index record
0 2019-09-10 COSTCO 78.20
1 2019-09-30 KFC 8.21
2 2019-10-01 SHELL 23.25

Hereby I demonstrate the process of adding new columns to a DataFrame during data cleaning. The goal is to extract the date, item and price from the transaction data and convert them into desired format.

Common Usage

Common usage to add columns into a DataFrame is to use indexing to create new columns. However, it requires lots of intermediate steps that create hurdles in reading & understanding code. Note that method chaining cannot be used here.

# Extract date by subsetting the first 10 letters
df['date'] = df.record.str.slice(start=0, stop=10)
# Convert date into datetime format
df['dateformat'] = pd.to_datetime(df['date'])
# Extract item
df['item'] = df.record.str.extract(r'([a-zA-Z]+)')
# Extract price
df['price'] = df.record.str.extract(r'(\d+\.\d+)')
# Convert string into float
df['price'] = df.price.astype('float')

df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 3 entries, 0 to 2
## Data columns (total 5 columns):
## record        3 non-null object
## date          3 non-null object
## dateformat    3 non-null datetime64[ns]
## item          3 non-null object
## price         3 non-null float64
## dtypes: datetime64[ns](1), float64(1), object(3)
## memory usage: 248.0+ bytes
index record date dateformat item price
0 2019-09-10 COSTCO 78.20 2019-09-10 2019-09-10 COSTCO 78.20
1 2019-09-30 KFC 8.21 2019-09-30 2019-09-30 KFC 8.21
2 2019-10-01 SHELL 23.25 2019-10-01 2019-10-01 SHELL 23.25

Assign Method

One of the benefits of using assign together with method chaining is that intermediate results could be checked without assignment. By doing so, new operations would not contaminate the original DataFrame. It makes debugging much easier since there is no need to rerun the complete process from the beginning. The final object could be named as an object once all the operations are validated correctly.

Severals things to keep in mind while using assign. First, column names need to be unique in each assign operation. Next, within the same assign, it is possible to create new columns that depend on previously defined columns within the same assign - as shown in the following example. Lastly, method chaining makes the assign method even more powerful!

# Remeber to have () outside the operations if there are more than 2 methods chaining.
df = (df.assign(date=lambda x: x.record.str.slice(start=0, stop=10),
                # convert the date into datetime format
                dateformat=lambda x: pd.to_datetime(x.date),
                item=lambda x: x.record.str.extract(r'([a-zA-Z]+)'),
                price=lambda x: x.record.str.extract(r'(\d+\.\d+)'))
        # the same column could not be used in one assign
        # method chaining for another new column
        .assign(price=lambda x: x.price.astype('float')))

df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 3 entries, 0 to 2
## Data columns (total 5 columns):
## record        3 non-null object
## date          3 non-null object
## dateformat    3 non-null datetime64[ns]
## item          3 non-null object
## price         3 non-null float64
## dtypes: datetime64[ns](1), float64(1), object(3)
## memory usage: 248.0+ bytes
index record date dateformat item price
0 2019-09-10 COSTCO 78.20 2019-09-10 2019-09-10 COSTCO 78.20
1 2019-09-30 KFC 8.21 2019-09-30 2019-09-30 KFC 8.21
2 2019-10-01 SHELL 23.25 2019-10-01 2019-10-01 SHELL 23.25

Adding Columns with Apply Method

The apply method applies a function along an axis of the DataFrame. Thus, it is necessary to apply functions on a pandas.Series when only one column is of interest. However, the apply operation would be interrupted when operations are chained.

Here comes assign as the cure. Alternatively, the assign operation here can be chained as the 3rd assign in the previous example. In assign, the x in lambda function is referred to the DataFrame. Thus, the series of interest would need to be specified in the lambda function.

# Apply to a Series with lambda function
df['euro_price'] = df.price.apply(lambda x: x * 0.9)

# Using assing with lambda function
df = df.assign(euro_price=lambda x: x.price * 0.9)
index record date dateformat item price euro_price
0 2019-09-10 COSTCO 78.20 2019-09-10 2019-09-10 COSTCO 78.20 70.380
1 2019-09-30 KFC 8.21 2019-09-30 2019-09-30 KFC 8.21 7.389
2 2019-10-01 SHELL 23.25 2019-10-01 2019-10-01 SHELL 23.25 20.925

More info about why chaining (Python) & pipe (R) are useful for data scientists can be found in the chapter about pipes of R for data science.

Examples above are written for demonstration purpose. The method chaining process is my preferred way of data manipulation when using assign. I’ll share more about method chaining in pandas soon. Feel free to share your thoughts with me.