Pandas Conditional Creation of a Series/Dataframe Column


In this tutorial, we will learn how to add a new column to a Pandas dataframe based on certain conditions.

Solution 1: Using np.where()

If we have only two choices to select from, we can use np.where() to create a new column conditionally. The syntax is as follows:

df['new_column'] = np.where(condition, value_if_condition_true, value_if_condition_false)

Suppose we have a DataFrame that contains information about the sales of different products in different regions. We want to add a new column called "Sale Status" that is set to "High" if the sales are greater than 1000, otherwise "Low". We can use np.where() to create the new column "Sale Status" based on the sales values.

import pandas as pd
import numpy as np

df = pd.DataFrame({'Product': ['A', 'B', 'C', 'D'],
                   'Region': ['North', 'South', 'East', 'West'],
                   'Sales': [1200, 400, 900, 1100]})

df['Sale Status'] = np.where(df['Sales'] > 1000, 'High', 'Low')

print(df)

Output:

  Product Region  Sales Sale Status
0       A  North   1200        High
1       B  South    400         Low
2       C   East    900         Low
3       D   West   1100        High

If we have more than two conditions to select from, we can use np.select().

Here's an example where we have a DataFrame of employee information and we want to create a new column indicating their job level based on their salary:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'employee': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve'],
    'salary': [50000, 70000, 90000, 110000, 130000]
})

# Create new column indicating job level
conditions = [
    df['salary'] < 60000,
    (df['salary'] >= 60000) & (df['salary'] < 80000),
    (df['salary'] >= 80000) & (df['salary'] < 100000),
    df['salary'] >= 100000
]
choices = ['Junior', 'Mid-level', 'Senior', 'Executive']
df['job_level'] = np.select(conditions, choices, default='Unknown')

print(df)

In this example, we create four conditions based on salary ranges and assign a corresponding job level using np.select(). The default parameter is set to 'Unknown' so that if none of the conditions match, the job level will be 'Unknown'.

This code will output the following DataFrame:

 employee  salary  job_level
0    Alice   50000     Junior
1      Bob   70000  Mid-level
2  Charlie   90000     Senior
3     Dave  110000  Executive
4      Eve  130000  Executive

Solution 2: Using List Comprehension

List comprehension is another way to create another column conditionally. If we are working with object dtypes in columns, like in our example, list comprehensions typically outperform most other methods.

Here's how we can use list comprehension to create the Sale Status column in a previous example:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Product': ['A', 'B', 'C', 'D'],
                   'Region': ['North', 'South', 'East', 'West'],
                   'Sales': [1200, 400, 900, 1100]})
df['Sale Status'] = ['High' if x > 1000 else 'Low' for x in df['Sales']]
print(df)

Output:

 Product Region  Sales Sale Status
0       A  North   1200        High
1       B  South    400         Low
2       C   East    900         Low
3       D   West   1100        High

When to use each solution

When it comes to adding a conditional column to a pandas DataFrame, there are multiple ways to achieve the desired result, such as using np.where() or list comprehension. However, which one you choose will depend on the specific requirements of your task and the size of your data.

np.where() is a NumPy function that allows you to create a new column based on a condition. It is very fast and efficient, especially for large datasets, as it uses vectorization. This means that the operation is performed on an entire array rather than on individual elements.

On the other hand, list comprehension is another way to create a conditional column in pandas. It is a more flexible method than np.where() since it can handle more complex conditions and operations. List comprehension is especially useful when working with columns that have object dtype, like in the example provided in the question. It is also generally faster than other methods for small datasets. However, it can become slower as the size of the data increases.

Therefore, if you have a large dataset, and your condition is simple, such as in the example provided in the question, np.where() is likely to be the most efficient method to use. However, if you have a smaller dataset or a more complex condition, then list comprehension might be a better option to use. Ultimately, the best approach will depend on the specific requirements of your task and the characteristics of your data.

Conclusion

In conclusion, adding a conditional column to a Pandas dataframe can be achieved using various methods such as np.where() and list comprehension. np.where() is useful when there are only a few conditions to be met, while list comprehension is a good option when dealing with object data types in the columns. Understanding the pros and cons of these methods can help you choose the most efficient and effective solution for your specific use case. With this knowledge, you can easily add conditional columns to your Pandas dataframes and make data manipulation easier and more efficient.