Create new column based on values from other columns in Pandas
In this tutorial, we will learn how to add a new column to a Pandas dataframe based on the values in other columns.
Solution 1: Using pandas apply()
The apply function in pandas is used to apply a function along a specific axis of a DataFrame. The basic syntax for using the apply function is as follows:
df.apply(function, axis=0)
Where df is the DataFrame, function is the function to apply to the DataFrame, and axis is the axis along which the function is applied. The default value for axis is 0, which means that the function is applied to each column of the DataFrame.
When the apply function is called, it takes each column (or each row, if axis=1) of the DataFrame and passes it as a Series object to the function that was specified. The function can perform any operation on the Series and typically returns a single value, which becomes the result for that column (or row).
Here is an example of using the apply function to calculate the sum of each row in a DataFrame:
import pandas as pd
# create a DataFrame
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
# define a function to calculate the sum of a Series object
def row_sum(series):
    return series.sum()
# apply the function to each row of the DataFrame
df['row_sum'] = df.apply(row_sum, axis=1)
print(df)
Output:
   a  b  c  row_sum
0  1  4  7       12
1  2  5  8       15
2  3  6  9       18
In this example, the row_sum function is defined to take a Series object and return the sum of the values in the Series. The apply function is then used to apply this function to each row of the DataFrame, using axis=1. The result is a new column in the DataFrame called row_sum, which contains the sum of each row.
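To make the effect of the axis argument concrete, here is a minimal sketch that runs the same function with axis=0 and axis=1 on the DataFrame from the example above. The name series_sum is only illustrative, and the comparison with the built-in df.sum(axis=1) is included as a sanity check, not as part of the original example.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

def series_sum(series):
    # Receives one column (axis=0) or one row (axis=1) as a Series
    return series.sum()

# axis=0 (the default): one result per column
col_sums = df.apply(series_sum, axis=0)

# axis=1: one result per row
row_sums = df.apply(series_sum, axis=1)

# For a plain sum, the built-in DataFrame method gives the same row totals
builtin_row_sums = df.sum(axis=1)

print(col_sums.tolist())  # [6, 15, 24]
print(row_sums.tolist())  # [12, 15, 18]
```

With axis=0 the result is indexed by column name, while with axis=1 it is indexed by row label, which is why only the axis=1 result can be assigned directly as a new column.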
Solution 2: Using NumPy vectorize()
The vectorize() function in NumPy is a tool that allows you to apply a function to a NumPy array element-wise. It is a less common choice than apply for this task, but it is worth seeing. It converts a function that takes a scalar argument and returns a scalar result into a function that can operate on arrays of any size or shape.
Here's how it works:
- You start by defining a function that you want to apply to the elements of a NumPy array. This function should take a scalar argument and return a scalar result.
- You then pass this function to the vectorize() function. The vectorize() function returns a new function that can be used to apply the original function to a NumPy array.
- You can then call the new function, passing in the NumPy array as the argument. The new function will automatically apply the original function to each element of the array, and return a new array containing the results.
Here's an example:
import numpy as np
def my_func(x):
    if x < 0:
        return 0
    else:
        return x ** 2
# create a NumPy array
arr = np.array([-1, 0, 1, 2, 3])
# apply the function to the array using vectorize
new_arr = np.vectorize(my_func)(arr)
# print the results
print(new_arr)
In this example, we define a function called my_func() that takes a scalar argument x and returns a scalar result. We then use vectorize() to create a new function that can apply my_func() to each element of a NumPy array. We apply this new function to the array arr, which contains the values [-1, 0, 1, 2, 3]. The result is a new array new_arr, which contains the values [0, 0, 1, 4, 9].
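Note that the vectorize() function is provided primarily for convenience, not for performance. Internally it is essentially a Python-level loop that calls your function once per element, so it is usually no faster than writing the loop yourself, and for large arrays it is much slower than built-in NumPy array operations. However, it can be a convenient way to apply a function to an array in cases where you don't want to write a loop yourself.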
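As a rough illustration of that point, the sketch below computes the same result three ways: with np.vectorize, with an explicit Python loop, and with a whole-array expression using np.where. The variable names are illustrative, and np.where is shown only as one possible array-based alternative for this particular function.

```python
import numpy as np

def my_func(x):
    # Scalar in, scalar out: negative values become 0, others are squared
    if x < 0:
        return 0
    return x ** 2

arr = np.array([-1, 0, 1, 2, 3])

# 1) np.vectorize: convenient, but still calls my_func once per element
via_vectorize = np.vectorize(my_func)(arr)

# 2) Explicit Python loop: the same work, written out by hand
via_loop = np.array([my_func(x) for x in arr])

# 3) Array expression: the condition and the square are evaluated on whole arrays
via_where = np.where(arr < 0, 0, arr ** 2)

print(via_vectorize)  # [0 0 1 4 9]
```

All three produce the same values. Only the third avoids calling a Python function for every element, which is where the speed difference on large arrays tends to come from.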
Example
Here's a different example that demonstrates how to create a new column based on values from other columns and apply a function of multiple columns row-wise in Pandas, with both apply and vectorize:
Let's say we have a dataframe with the following columns:
- Name: The name of the person
- Age: The age of the person
- Gender: The gender of the person, either 'Male' or 'Female'
- Weight: The weight of the person in pounds
- Height: The height of the person in inches
We want to create a new column called BMI (Body Mass Index) based on the person's weight and height, and we also want to classify the person's BMI into categories based on the following criteria:
- BMI < 18.5: 'Underweight'
- 18.5 <= BMI < 25: 'Normal'
- 25 <= BMI < 30: 'Overweight'
- BMI >= 30: 'Obese'
Here's how we can do it:
import pandas as pd
import numpy as np
# Create the dataframe
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [23, 32, 45, 18, 27],
'Gender': ['Female', 'Male', 'Male', 'Male', 'Female'],
'Weight': [130, 180, 200, 150, 120],
'Height': [65, 70, 72, 68, 62]
})
# Define the function to calculate BMI
def calculate_bmi(row):
    height_m = row['Height'] / 39.37  # Convert height from inches to meters
    weight_kg = row['Weight'] / 2.20462  # Convert weight from pounds to kilograms
    bmi = weight_kg / (height_m ** 2)
    return bmi
# Apply the function to create a new column called 'BMI'
df['BMI'] = df.apply(calculate_bmi, axis=1)
# Define the function to classify BMI into categories
def classify_bmi(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'
# Apply the function to create a new column called 'BMI Category'
df['BMI Category'] = np.vectorize(classify_bmi)(df['BMI'])
# Print the dataframe
print(df)
Output:
      Name  Age  Gender  Weight  Height        BMI BMI Category
0    Alice   23  Female     130      65  21.632849       Normal
1      Bob   32    Male     180      70  25.826973   Overweight
2  Charlie   45    Male     200      72  27.124522   Overweight
3    David   18    Male     150      68  22.807124       Normal
4      Eva   27  Female     120      62  21.948000       Normal
As you can see, we first defined a function calculate_bmi to calculate the BMI based on the person's weight and height. We then used the apply function to apply this function row-wise to the dataframe, creating a new column called 'BMI'.
Next, we defined another function classify_bmi to classify the BMI into categories based on the criteria we defined. We used the vectorize function from NumPy to apply this function element-wise to the 'BMI' column, creating a new column called 'BMI Category'.
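The BMI calculation itself can also be done with vectorize, because np.vectorize accepts several array arguments and passes them to the function element by element. The following is a minimal sketch under that assumption. The helper bmi_from_scalars is a hypothetical name introduced here, using the same formula as calculate_bmi but taking two scalars instead of a row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Weight': [130, 180, 200, 150, 120],   # pounds
    'Height': [65, 70, 72, 68, 62],        # inches
})

def bmi_from_scalars(weight_lb, height_in):
    # Hypothetical helper: same formula as calculate_bmi, but on two scalars
    height_m = height_in / 39.37
    weight_kg = weight_lb / 2.20462
    return weight_kg / (height_m ** 2)

# Row-wise with apply: the function receives each row as a Series
bmi_apply = df.apply(
    lambda row: bmi_from_scalars(row['Weight'], row['Height']), axis=1
)

# Element-wise with np.vectorize: the two columns are passed as separate arguments
bmi_vectorize = np.vectorize(bmi_from_scalars)(df['Weight'], df['Height'])

df['BMI'] = bmi_vectorize
print(df)
```

Both approaches give the same BMI values. The difference is in how the function sees the data: apply hands it a whole row, while vectorize hands it one weight and one height at a time.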
When to use each solution
Pandas apply and NumPy vectorize are both powerful tools for applying a function to an array-like object. Here are some general guidelines to help you decide when to use which:
Use pandas apply when:
- You have a DataFrame or Series object and want to apply a function to one or more columns or rows of data
- You want to apply a function that takes a pandas object as input or returns a pandas object as output
- You want to apply a function that operates on a row or column of data, rather than an element-wise function
Use NumPy vectorize when:
- You have a NumPy array (or a single column of values) and want to apply a function to each element in the array
- You want to apply a function that takes a scalar input and returns a scalar output
- You want to apply a function that operates element-wise, rather than on rows or columns of data
In general, if you are working with pandas objects (e.g., DataFrames or Series) and need to apply a function to whole rows or columns of data, you should use pandas apply. If you have a function that maps one scalar to one scalar and need to apply it element-wise to an array or a column, you should use NumPy vectorize.
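A small sketch of the distinction, using made-up data: value_range needs a whole row or column at once, so it fits apply, while grade works on one scalar at a time, so it fits vectorize. The DataFrame, the function names, and the pass mark of 60 are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'math': [55, 80, 95], 'physics': [60, 72, 88]})

def value_range(series):
    # Needs the whole column (or row) at once, so it suits apply
    return series.max() - series.min()

def grade(score):
    # Works on one scalar at a time, so it suits np.vectorize
    return 'pass' if score >= 60 else 'fail'

# One value per column (axis=0 is the default)
column_ranges = scores.apply(value_range)

# One value per row, stored as a new column
scores['spread'] = scores.apply(value_range, axis=1)

# One value per element of the 'math' column
scores['math_grade'] = np.vectorize(grade)(scores['math'])

print(column_ranges.tolist())          # [40, 28]
print(scores['spread'].tolist())       # [5, 8, 7]
print(scores['math_grade'].tolist())   # ['fail', 'pass', 'pass']
```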
Conclusion
In conclusion, both the apply function in Pandas and the vectorize function in NumPy provide powerful tools for performing operations on data. However, they have some key differences that make them more suitable for different types of tasks.
apply is a flexible and versatile function that can be used for applying any custom function to a DataFrame or Series, making it a great choice for more complex data manipulation tasks. On the other hand, vectorize is more specialized: it is designed for element-wise operations with functions that take and return scalars. It is a convenience wrapper rather than an optimized routine, because it still calls the Python function once per element. It often has less overhead than apply with axis=1, but it is not a substitute for built-in array operations on large arrays of data.
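Performance depends on the data and the function, so it is worth measuring rather than assuming. The sketch below times three ways of computing BMI on synthetic data using the standard timeit module. The data, the row count, and the function names are assumptions made for illustration, and the timings will vary between machines.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data, generated only for this timing sketch
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    'Weight': rng.uniform(100, 250, n),  # pounds
    'Height': rng.uniform(58, 78, n),    # inches
})

def bmi(weight_lb, height_in):
    # Works on scalars and on whole arrays or Series alike
    return (weight_lb / 2.20462) / (height_in / 39.37) ** 2

def with_apply():
    return df.apply(lambda row: bmi(row['Weight'], row['Height']), axis=1)

def with_vectorize():
    return np.vectorize(bmi)(df['Weight'], df['Height'])

def with_columns():
    # Plain column arithmetic, with no per-element Python call
    return bmi(df['Weight'], df['Height'])

for fn in (with_apply, with_vectorize, with_columns):
    seconds = timeit.timeit(fn, number=3)
    print(f'{fn.__name__}: {seconds:.4f} s for 3 runs')
```

All three functions return the same values, so any difference in the printed times reflects only how the function is applied.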
Overall, the choice between apply and vectorize will depend on the specific task at hand and the type of data being used. By understanding the strengths and limitations of each function, you can select the best tool for the job and ensure that your data manipulation tasks are completed accurately and efficiently.