# Create new column based on values from other columns in Pandas

In this tutorial, we will learn how to add a new column to a Pandas dataframe based on the values in other columns.

# Solution 1: Using pandas `apply()`

The `apply` function in pandas is used to apply a function along a specific axis of a DataFrame. The basic syntax for using the `apply` function is as follows:

```
df.apply(function, axis=0)
```

Here, `df` is the DataFrame, `function` is the function to apply to the DataFrame, and `axis` is the axis along which the function is applied. The default value for `axis` is 0, which means that the function is applied to each column of the DataFrame.

When the `apply` function is called, it takes each column (or row, if `axis=1`) of the DataFrame and passes it as a Series object to the function that was specified. The function can then perform any operation on the Series and return a single value, which becomes the result for that column (or row).
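
To make the effect of `axis` concrete, here is a minimal sketch on a small made-up DataFrame. The column names and the use of `max()` are illustrative assumptions, not part of the examples in this tutorial.

```python
import pandas as pd

# Small illustrative DataFrame (values chosen only for this sketch)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# axis=0 (the default): the function receives each column as a Series,
# so the result has one entry per column
col_max = df.apply(lambda col: col.max())

# axis=1: the function receives each row as a Series,
# so the result has one entry per row
row_max = df.apply(lambda row: row.max(), axis=1)

print(col_max.tolist())  # [3, 6]
print(row_max.tolist())  # [4, 5, 6]
```

Because the `axis=1` result has one value per row, it is the form you can assign directly to a new column.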

Here is an example of using the `apply` function to calculate the sum of each row in a DataFrame:

```
import pandas as pd
# create a DataFrame
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
# define a function to calculate the sum of a Series object
def row_sum(series):
    return series.sum()
# apply the function to each row of the DataFrame
df['row_sum'] = df.apply(row_sum, axis=1)
print(df)
```

Output:

```
   a  b  c  row_sum
0  1  4  7       12
1  2  5  8       15
2  3  6  9       18
```

In this example, the `row_sum` function is defined to take a Series object and return the sum of the values in the Series. The `apply` function is then used to apply this function to each row of the DataFrame, using `axis=1`. The result is a new column in the DataFrame called `row_sum`, which contains the sum of each row.

# Solution 2: Using NumPy `vectorize()`

The `vectorize()` function in NumPy is a tool that allows you to apply a function to a NumPy array element-wise. Although `apply` is usually the preferred choice for DataFrames, `vectorize()` is an interesting alternative to know. It converts a function that takes a scalar argument and returns a scalar result into a function that can operate on arrays of any size or shape.

Here's how it works:

- You start by defining a function that you want to apply to the elements of a NumPy array. This function should take a scalar argument and return a scalar result.
- You then pass this function to the `vectorize()` function. The `vectorize()` function returns a new function that can be used to apply the original function to a NumPy array.
- You can then call the new function, passing in the NumPy array as the argument. The `vectorize()` function will automatically apply the original function to each element of the array, and return a new array containing the results.

Here's an example:

```
import numpy as np
def my_func(x):
    if x < 0:
        return 0
    else:
        return x ** 2
# create a NumPy array
arr = np.array([-1, 0, 1, 2, 3])
# apply the function to the array using vectorize
new_arr = np.vectorize(my_func)(arr)
# print the results
print(new_arr)
```

In this example, we define a function called `my_func()` that takes a scalar argument `x` and returns a scalar result. We then use `vectorize()` to create a new function that can apply `my_func()` to each element of a NumPy array. We apply this new function to the array `arr`, which contains the values `[-1, 0, 1, 2, 3]`. The result is a new array `new_arr`, which contains the values `[0, 0, 1, 4, 9]`.

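Note that the `vectorize()` function is provided mainly for convenience, not for performance. Internally it is essentially a Python loop, so it still calls your function once per element and adds some overhead of its own, which can make it slower than writing an explicit loop, especially for large arrays. However, it can be a convenient way to apply a function to an array in cases where you don't want to write a loop yourself.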
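
As a point of comparison, the same rule can often be written with NumPy's whole-array operations, which avoid the per-element function call. The sketch below reuses `my_func` from the example and swaps in `np.where` as one possible alternative; it is an illustration of the idea, not a benchmark.

```python
import numpy as np

def my_func(x):
    # Scalar version: negative inputs map to 0, everything else is squared
    if x < 0:
        return 0
    return x ** 2

arr = np.array([-1, 0, 1, 2, 3])

# Element-by-element: my_func is called once per element
via_vectorize = np.vectorize(my_func)(arr)

# Whole-array version of the same rule, with no Python-level function calls
via_where = np.where(arr < 0, 0, arr ** 2)

print(via_vectorize)  # [0 0 1 4 9]
print(via_where)      # [0 0 1 4 9]
```

Both calls give the same values, so the choice between them is about readability and speed rather than results.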

# Example

Here's a different example that demonstrates how to create a new column based on values from other columns and apply a function of multiple columns row-wise in Pandas, with both `apply` and `vectorize`:

Let's say we have a dataframe with the following columns:

- `Name`: The name of the person
- `Age`: The age of the person
- `Gender`: The gender of the person, either 'Male' or 'Female'
- `Weight`: The weight of the person in pounds
- `Height`: The height of the person in inches

We want to create a new column called `BMI` (Body Mass Index) based on the person's weight and height, and we also want to classify the person's BMI into categories based on the following criteria:

- `BMI < 18.5`: 'Underweight'
- `18.5 <= BMI < 25`: 'Normal'
- `25 <= BMI < 30`: 'Overweight'
- `BMI >= 30`: 'Obese'

Here's how we can do it:

```
import pandas as pd
import numpy as np

# Create the dataframe
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [23, 32, 45, 18, 27],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female'],
    'Weight': [130, 180, 200, 150, 120],
    'Height': [65, 70, 72, 68, 62]
})

# Define the function to calculate BMI
def calculate_bmi(row):
    height_m = row['Height'] / 39.37  # Convert height to meters
    weight_kg = row['Weight'] / 2.20462  # Convert weight to kilograms
    bmi = weight_kg / (height_m ** 2)
    return bmi

# Apply the function to create a new column called 'BMI'
df['BMI'] = df.apply(calculate_bmi, axis=1)

# Define the function to classify BMI into categories
def classify_bmi(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'

# Apply the function to create a new column called 'BMI Category'
df['BMI Category'] = np.vectorize(classify_bmi)(df['BMI'])

# Print the dataframe
print(df)
```

Output:

```
      Name  Age  Gender  Weight  Height        BMI BMI Category
0    Alice   23  Female     130      65  21.632849       Normal
1      Bob   32    Male     180      70  25.826973   Overweight
2  Charlie   45    Male     200      72  27.124522   Overweight
3    David   18    Male     150      68  22.807124       Normal
4      Eva   27  Female     120      62  21.948000       Normal
```

As you can see, we first defined a function `calculate_bmi` to calculate the BMI based on the person's weight and height. We then used the `apply` function to apply this function row-wise to the dataframe, creating a new column called 'BMI'.

Next, we defined another function `classify_bmi` to classify the BMI into categories based on the criteria we defined. We used the `vectorize` function from NumPy to apply this function element-wise to the 'BMI' column, creating a new column called 'BMI Category'.
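
`vectorize` can also handle the multi-column step: a function that takes several scalar arguments can be vectorized and called with one column per argument. The sketch below assumes a hypothetical helper named `bmi_from` (not part of the code above) that uses the same unit conversions as `calculate_bmi`, and a two-row subset of the data.

```python
import numpy as np
import pandas as pd

# Two rows from the example data (Alice and Bob)
df = pd.DataFrame({
    'Weight': [130, 180],  # pounds
    'Height': [65, 70],    # inches
})

def bmi_from(weight_lb, height_in):
    # Hypothetical scalar helper: same conversions as calculate_bmi
    height_m = height_in / 39.37
    weight_kg = weight_lb / 2.20462
    return weight_kg / (height_m ** 2)

# Each column is passed as a separate argument; values are paired element-wise
df['BMI'] = np.vectorize(bmi_from)(df['Weight'], df['Height'])

print(df)
```

This should produce the same BMI values as the row-wise `apply` version for the corresponding rows, without the function needing to know about column names.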

# When to use each solution

Pandas `apply` and NumPy `vectorize` are both powerful tools for applying a function to an array-like object. Here are some general guidelines to help you decide when to use which:

Use pandas `apply` when:

- You have a DataFrame or Series object and want to apply a function to one or more columns or rows of data
- You want to apply a function that takes a pandas object as input or returns a pandas object as output
- You want to apply a function that operates on a row or column of data, rather than an element-wise function
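
As an illustration of the second point, a function passed to `apply` with `axis=1` may return a Series, in which case the result is a DataFrame with one column per label. The function name `row_stats` and the column names `total` and `diff` are assumptions made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def row_stats(row):
    # Returning a Series makes apply build one output column per label
    return pd.Series({'total': row['a'] + row['b'],
                      'diff': row['b'] - row['a']})

# stats is a DataFrame with columns 'total' and 'diff'
stats = df.apply(row_stats, axis=1)

# Attach the new columns to the original data by index
result = df.join(stats)
print(result)
```

This is one way to create several new columns from a single row-wise pass.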

Use NumPy `vectorize` when:

- You have a NumPy array (1-dimensional or otherwise) and want to apply a function to each element in the array
- You want to apply a function that takes a scalar input and returns a scalar output
- You want to apply a function that operates element-wise, rather than on rows or columns of data

In general, if you are working with pandas objects (e.g., DataFrames or Series) and need to apply a function to rows or columns of data, you should use pandas `apply`. If you are working with NumPy arrays and need to apply a scalar function element-wise, you should use NumPy `vectorize`.

# Conclusion

In conclusion, both the `apply` function in Pandas and the `vectorize` function in NumPy provide powerful tools for performing operations on data. However, they have some key differences that make them more suitable for different types of tasks.

`apply` is a flexible and versatile function that can be used for applying any custom function to a DataFrame or Series, making it a great choice for more complex data manipulation tasks. On the other hand, `vectorize` is more specialized: it wraps a scalar function so that it can be called element-wise on arrays. It is a convenience rather than an optimization, since it still calls the function once per element; for simple mathematical operations on large arrays, built-in array arithmetic is usually the faster choice.

Overall, the choice between `apply` and `vectorize` will depend on the specific task at hand and the type of data being used. By understanding the strengths and limitations of each function, you can select the best tool for the job and ensure that your data manipulation tasks are completed accurately and efficiently.