Create new column based on values from other columns in Pandas


In this tutorial, we will learn how to add a new column to a Pandas DataFrame based on the values in other columns.

Solution 1: Using pandas apply()

The apply function in pandas is used to apply a function along a specific axis of a DataFrame. The basic syntax for using the apply function is as follows:

df.apply(function, axis=0)

Where df is the DataFrame, function is the function to apply to the DataFrame, and axis is the axis along which the function is applied. The default value for axis is 0, which means that the function is applied to each column of the DataFrame.

When apply is called, it passes each column (or each row, if axis=1) of the DataFrame to the specified function as a Series object. The function can perform any operation on that Series. If it returns a single value, the values are collected into a new Series with one entry per column (or per row).
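To make the difference between the two axes concrete, here is a minimal sketch. The DataFrame name demo and the lambda functions are illustrative choices, not part of the tutorial's own example:

```python
import pandas as pd

# small illustrative DataFrame (hypothetical data)
demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# axis=0 (the default): the function receives each column as a Series
col_max = demo.apply(lambda col: col.max())

# axis=1: the function receives each row as a Series
row_max = demo.apply(lambda row: row.max(), axis=1)

print(col_max)  # one value per column, indexed by column name
print(row_max)  # one value per row, indexed by row label
```

With axis=0 the result is indexed by column name; with axis=1 it is indexed by row label, which is why a row-wise result can be assigned directly as a new column.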

Here is an example of using the apply function to calculate the sum of each row in a DataFrame:

import pandas as pd

# create a DataFrame
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# define a function to calculate the sum of a Series object
def row_sum(series):
    return series.sum()

# apply the function to each row of the DataFrame
df['row_sum'] = df.apply(row_sum, axis=1)

print(df)

Output:

   a  b  c  row_sum
0  1  4  7       12
1  2  5  8       15
2  3  6  9       18

In this example, the row_sum function is defined to take a Series object and return the sum of the values in the Series. The apply function is then used to apply this function to each row of the DataFrame, using axis=1. The result is a new column in the DataFrame called row_sum, which contains the sum of each row.

Solution 2: Using numpy vectorize()

The vectorize() function in NumPy lets you apply a function to a NumPy array element-wise. It converts a function that takes a scalar argument and returns a scalar result into a function that can operate on arrays of any size or shape. For pandas data it is usually not preferred over apply, but it is worth knowing.

Here's how it works:

  1. You start by defining a function that you want to apply to the elements of a NumPy array. This function should take a scalar argument and return a scalar result.
  2. You then pass this function to the vectorize() function. The vectorize() function returns a new function that can be used to apply the original function to a NumPy array.
  3. You can then call the new function, passing in the NumPy array as the argument. The vectorize() function will automatically apply the original function to each element of the array, and return a new array containing the results.

Here's an example:

import numpy as np

def my_func(x):
    if x < 0:
        return 0
    else:
        return x ** 2

# create a NumPy array
arr = np.array([-1, 0, 1, 2, 3])

# apply the function to the array using vectorize
new_arr = np.vectorize(my_func)(arr)

# print the results
print(new_arr)

In this example, we define a function called my_func() that takes a scalar argument x and returns a scalar result. We then use vectorize() to create a new function that can apply my_func() to each element of a NumPy array. We apply this new function to the array arr, which contains the values [-1, 0, 1, 2, 3]. The result is a new array new_arr, which contains the values [0, 0, 1, 4, 9].

Note that vectorize() is provided mainly for convenience, not for performance. Internally it is essentially a Python-level loop that calls your function once per element, so it is typically no faster than writing the loop yourself and adds some overhead of its own. It is still a convenient way to apply a scalar function to an array when you don't want to write the loop.
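Because vectorize() calls the Python function once per element, its result should match an explicit loop. The sketch below checks this and, for comparison, shows one way to express the same rule with np.where, which works on whole arrays at once. The variable names via_vectorize, via_loop and via_where are illustrative:

```python
import numpy as np

def my_func(x):
    # same scalar rule as the example above: negatives become 0, others are squared
    return 0 if x < 0 else x ** 2

arr = np.array([-1, 0, 1, 2, 3])

# 1. np.vectorize wraps the scalar function
via_vectorize = np.vectorize(my_func)(arr)

# 2. an explicit Python loop over the elements
via_loop = np.array([my_func(x) for x in arr])

# 3. a whole-array expression: pick 0 where arr < 0, else arr squared
via_where = np.where(arr < 0, 0, arr ** 2)

print(via_vectorize)
print(np.array_equal(via_vectorize, via_loop))
print(np.array_equal(via_vectorize, via_where))
```

When a rule can be written as a whole-array expression like the third variant, that form is usually the faster choice on large arrays, though the actual difference depends on the data and the function.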

Example

Here's a fuller example that creates new columns based on values from other columns. It uses apply to compute a value from several columns row-wise, and vectorize to classify that value element-wise:

Let's say we have a dataframe with the following columns:

  • Name: The name of the person
  • Age: The age of the person
  • Gender: The gender of the person, either 'Male' or 'Female'
  • Weight: The weight of the person in pounds
  • Height: The height of the person in inches

We want to create a new column called BMI (Body Mass Index) based on the person's weight and height, and we also want to classify the person's BMI into categories based on the following criteria:

  • BMI < 18.5: 'Underweight'
  • 18.5 <= BMI < 25: 'Normal'
  • 25 <= BMI < 30: 'Overweight'
  • BMI >= 30: 'Obese'

Here's how we can do it:

import pandas as pd
import numpy as np

# Create the dataframe
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [23, 32, 45, 18, 27],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female'],
    'Weight': [130, 180, 200, 150, 120],
    'Height': [65, 70, 72, 68, 62]
})

# Define the function to calculate BMI
def calculate_bmi(row):
    height_m = row['Height'] / 39.37 # Convert height to meters
    weight_kg = row['Weight'] / 2.20462 # Convert weight to kilograms
    bmi = weight_kg / (height_m ** 2)
    return bmi

# Apply the function to create a new column called 'BMI'
df['BMI'] = df.apply(calculate_bmi, axis=1)

# Define the function to classify BMI into categories
def classify_bmi(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'

# Apply the function to create a new column called 'BMI Category'
df['BMI Category'] = np.vectorize(classify_bmi)(df['BMI'])

# Print the dataframe
print(df)

Output:

      Name  Age  Gender  Weight  Height        BMI BMI Category
0    Alice   23  Female     130      65  21.632849       Normal
1      Bob   32    Male     180      70  25.826973   Overweight
2  Charlie   45    Male     200      72  27.124522   Overweight
3    David   18    Male     150      68  22.807124       Normal
4      Eva   27  Female     120      62  21.948000       Normal

As you can see, we first defined a function calculate_bmi to calculate the BMI based on the person's weight and height. We then used the apply function to apply this function row-wise to the dataframe, creating a new column called 'BMI'.
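The BMI column can also be computed with vectorize by passing several columns at once, since the wrapped function then receives one element from each column per call. The following is a sketch under that assumption; the helper name bmi_from_scalars and the column names BMI_vectorize and BMI_columns are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Weight': [130, 180, 200, 150, 120],  # pounds
    'Height': [65, 70, 72, 68, 62],       # inches
})

def bmi_from_scalars(weight_lb, height_in):
    # hypothetical helper: takes two scalars instead of a whole row
    height_m = height_in / 39.37      # inches to meters
    weight_kg = weight_lb / 2.20462   # pounds to kilograms
    return weight_kg / height_m ** 2

# np.vectorize passes one element from each column to the helper per call
df['BMI_vectorize'] = np.vectorize(bmi_from_scalars)(df['Weight'], df['Height'])

# the same formula written directly as column arithmetic, with no custom function
df['BMI_columns'] = (df['Weight'] / 2.20462) / (df['Height'] / 39.37) ** 2

print(df)
```

Both columns should hold the same values as the apply-based BMI column. For a formula this simple, the plain column arithmetic is typically the most direct option.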

Next, we defined another function classify_bmi to classify the BMI into categories based on the criteria we defined. We used the vectorize function from NumPy to apply this function element-wise to the 'BMI' column, creating a new column called 'BMI Category'.
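As an alternative to a hand-written classification function, pandas provides pd.cut for binning numeric values into labeled ranges. This is a different technique from the one used above, sketched here with made-up BMI values chosen to sit on the category boundaries:

```python
import numpy as np
import pandas as pd

# illustrative BMI values, including the boundary cases 18.5, 25 and 30
bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0])

categories = pd.cut(
    bmi,
    bins=[-np.inf, 18.5, 25, 30, np.inf],
    labels=['Underweight', 'Normal', 'Overweight', 'Obese'],
    right=False,  # each bin includes its lower edge and excludes its upper edge
)

print(categories.tolist())
```

Setting right=False makes the bins left-inclusive, which matches the criteria listed earlier (for example, a BMI of exactly 25 counts as 'Overweight'). The result is a categorical Series rather than a column of plain strings.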

When to use each solution

Pandas apply and NumPy vectorize are both powerful tools for applying a function to an array-like object. Here are some general guidelines to help you decide when to use which:

Use pandas apply when:

  • You have a DataFrame or Series object and want to apply a function to one or more columns or rows of data
  • You want to apply a function that takes a pandas object as input or returns a pandas object as output
  • You want to apply a function that operates on a row or column of data, rather than an element-wise function

Use NumPy vectorize when:

  • You have a NumPy array (of any shape) and want to apply a function to each element in the array
  • You want to apply a function that takes a scalar input and returns a scalar output
  • You want to apply a function that operates element-wise, rather than on rows or columns of data

In general, if you are working with pandas objects (e.g., DataFrames or Series) and need to apply a function to rows or columns of data, use pandas apply. If you are working with NumPy arrays and need to apply a scalar function element-wise, NumPy vectorize is a convenient option.

Conclusion

Both the apply function in Pandas and the vectorize function in NumPy let you run a custom function over your data, but they differ in ways that make them suitable for different types of tasks.

apply is a flexible function that can apply any custom function to a DataFrame or Series, which makes it a good choice for more complex, row-wise or column-wise data manipulation. vectorize is more specialized: it wraps a scalar function so that it can be called element-wise on arrays. It is a convenience rather than an optimization, because the wrapped function is still called once per element in Python. For simple mathematical operations on large arrays, the built-in array operations of NumPy and pandas are generally much faster than either.

Overall, the choice between apply and vectorize will depend on the specific task at hand and the type of data being used. By understanding the strengths and limitations of each function, you can select the best tool for the job and ensure that your data manipulation tasks are completed accurately and efficiently.