Excel’s VLOOKUP with Python Code: An Easy Equivalent with Pandas
Excel is the world's most popular data analysis tool. One of its most popular functions is VLOOKUP – it lets Excel users easily match records across two tables to combine their values into one table.
While Excel and its functions can be useful, many people are switching to Python for data analysis and manipulation due to its power and flexibility compared to spreadsheets. However, for those making the transition, the lack of a built in VLOOKUP function in Python can be a barrier to adoption.
The rest of this blog will teach you how to implement the VLOOKUP function in Python. Alternatively, you can install the Mito Spreadsheet to use Excel formulas, like VLOOKUP directly in Python.
In the rest of this blog, we'll provide you with a quick guide to using Python's merge() function as a replacement for VLOOKUP, so you can get back to analyzing your data in Python using your Excel expertise.
What Is the VLOOKUP Function in Excel?
The VLOOKUP function in Excel is a powerful tool that allows you to look up data in a table and return corresponding values from another column. It stands for “vertical lookup” because it searches for a value in the leftmost column of a table and returns a value in the same row from a column to the right of the lookup column.
VLOOKUP is commonly used in Excel for tasks such as retrieving prices, calculating commissions, or matching customer names with customer IDs. It's a valuable tool for anyone who works with data in Excel, which is why it’s common for those moving to Python to want something similar.
What Is the merge() Function in Python?
The merge() function in Python is part of the Pandas library, which is a popular data manipulation and analysis tool. It works by combining two DataFrames on a common column, similar to a SQL join. Once the data is merged, you can perform a variety of operations on the merged data, such as filtering, sorting and grouping.
Setting Up the Data
Before we can do a VLOOKUP in Python, we need to set up our data. For this example, we'll use a simple table of sales data, but the same approach will work on whatever data set you’re working with.
To get started, import the data into a Pandas DataFrame.
import pandas as pd
student_records = pd.DataFrame({
'Name': ['Martha Wax', 'Nate Joy', 'Aaron Sand'],
'Year': [2016, 2016, 2016],
})
grades = pd.DataFrame({
'Name': ['aaron sand', 'martha wax', 'nate joy', 'jake jones'],
'Math': [88, 100, 89, 92],
'English': [92, 91, 90, 94],
'History': [95, 92, 91, 90],
'Science': [89, 88, 90, 87],
})
Using the Pandas merge() function for VLOOKUP
There are a few important discrepancies between the default behavior of Excel's VLOOKUP and Python's merge function that we need to pay special attention to:
- Excel VLOOKUP is case insensitive. Python's merge function is case sensitive.
- Excel VLOOKUP returns just one column after finding the match. Python's merge function returns all columns from the second dataframe after finding the match.
- Excel VLOOKUP returns just the first match. Python's merge function returns all matches.
Below, we will show how to handle each of these cases so Python's merge function performs just like Excel's VLOOKUP.
Handling Case Insensitivity
By default, Excel's VLOOKUP function is case insensitive, whereas Python's merge function is case-sensitive. For example, VLOOKUP treats "Mito" and "mito" equivalent and Python's merge does not.
Therefore, in order to replicate Excel's VLOOKUP function in Python, we first need to convert the merge key to lowercase. In the code below, we're converting the name to lowercase to do a case insensitive search:
# Convert the Name columns to lowercase for case insensitive merge
grades['Name'] = grades['Name'].str.lower()
student_records['Name'] = student_records['Name'].str.lower()
Returning Just One Column
If we are only interested in getting the Science grades for each student, then we need to drop the other columns from the second dataframe. Otherwise, Python's merge function will return all of the columns, unlike the VLOOKUP function.
To remove the columns that shouldn't be included in the final dataset, use pandas' .drop function.
# Drop the columns we don't want in the final dataframe
tmp_grades = grades.drop(['English', 'History'], axis=1)
Remove duplicate values
In order to replicate Excel's VLOOKUP function which only returns the first match that it finds, we need to remove rows that have duplicated merge keys from the second dataframe. In this example, since we're matching on the column "Name", if the grades dataframe has multiple records for the same name, we should drop one of the duplicated records from the dataframe.
# Drop duplicate values so we only get one match per value
tmp_grades = tmp_grades.drop_duplicates(subset=['Name'], keep='first')
Perform the merge
Now that we've prepped the data ignorer to handle the discrepancies between the Excel VLOOKUP function and the Python merge function, the last step is to perform the merge!
# Perform the merge to add the Science column to the student_records
student_records = student_records.merge(
tmp_grades,
left_on=['Name'],
right_on=['Name']
)
All together, that makes this code snippet:
import pandas as pd
student_records = pd.DataFrame({
'Name': ['Martha Wax', 'Nate Joy', 'Aaron Sand'],
'Year': [2016, 2016, 2016],
})
grades = pd.DataFrame({
'Name': ['aaron sand', 'martha wax', 'nate joy', 'jake jones'],
'Math': [88, 100, 89, 92],
'English': [92, 91, 90, 94],
'History': [95, 92, 91, 90],
'Science': [89, 88, 90, 87],
})
# Convert the Name columns to lowercase for case insensitive merge
grades['Name'] = grades['Name'].str.lower()
student_records['Name'] = student_records['Name'].str.lower()
# Drop the columns we don't want in the final dataframe
tmp_grades = grades.drop(['English', 'History'], axis=1)
# Drop duplicate values so we only get one match per value
tmp_grades = tmp_grades.drop_duplicates(subset=['Name'], keep='first')
# Perform the merge to add the Science column to the student_records
student_records = student_records.merge(
tmp_grades,
left_on=['Name'],
right_on=['Name']
)
Advanced VLOOKUP Techniques with Pandas
In the following section, we will explore some advanced techniques using the Pandas library to perform VLOOKUP-like operations. This is an approximation of the VLOOKUP function when the final argument is True
in order to do an approximate match. This includes handling cases where exact matches are not possible and leveraging the power of Python's fuzzy matching capabilities to find the closest matches. Let's dive in!
Notice in the dataframes below that names are not exact matches between the grades and student_records dataframes. For example, "Aaron Sand" is listed as "a-a-ron sand" in the other dataset.
from fuzzywuzzy import process
student_records = pd.DataFrame({
'Name': ['Martha Wax', 'Nate Joy', 'Aaron Sand'],
'Year': [2016, 2016, 2016],
})
grades = pd.DataFrame({
'Name': ['a-a-ron sand', 'martha wax', 'nathan joy', 'jacob jones'],
'Math': [88, 100, 89, 92],
'English': [92, 91, 90, 94],
'History': [95, 92, 91, 90],
'Science': [89, 88, 90, 87],
'Gym': [90, 92, 91, 90],
'Art': [91, 90, 94, 92]
})
# For each Name in student_records, find the best match in grades
student_records['best_match'] = student_records['Name'].apply(lambda x: process.extractOne(x, grades['Name'].tolist()))
student_records['best_match'] = student_records['best_match'].apply(lambda x: x[0])
# Drop Gym and Art columns from the grades DataFrame so they're not included in the final
# final student_record. Create a tmp_df so we don't modify the original dataframe.
tmp_grades = grades.drop(['Gym', 'Art'], axis=1)
# Drop duplicates from grades on the Name column to ensure we only
# return one result per student_record
tmp_grades = tmp_grades.drop_duplicates(subset=['Name'], keep='first')
# Add the columns Math, English, History, and Science to student_records merging on the best match
student_records = student_records.merge(tmp_grades, left_on='best_match', right_on='Name')
# Drop the best_match and Name_y column because we no longer need them
student_records = student_records.drop(['best_match', 'Name_y'], axis=1)
# Rename the Name_x column to Name
student_records = student_records.rename(columns={'Name_x': 'Name'})
Comparing Excel VLOOKUP and Pandas merge() for dataframes
VLOOKUP:
- Case insensitive: Excel's VLOOKUP does not differentiate between uppercase and lowercase characters, treating them as identical.
- Returns right-side columns: VLOOKUP only returns values from columns located to the right of the lookup column.
- Returns error for missing values: If VLOOKUP doesn't find a match, it returns an error.
- Merges on one column: VLOOKUP can only look up values based on a single column.
- Better for simple lookups: Excel’s VLOOKUP function is ideal for simple lookups as it returns just one value for each lookup. Unlike the pandas merge function which, by default, returns every column in the other dataset
- Faster with large data sets: When working with large datasets, VLOOKUP can execute faster than Pandas merge() function.
merge():
- Case sensitive: The merge() function in Pandas treats uppercase and lowercase characters as distinct, making it case sensitive.
- Returns any column: Unlike VLOOKUP, merge() can return values from any column, not just those on the right.
- Fills missing values with NaN: If merge() doesn't find a match, it fills the missing values with NaN (Not a Number).
- Merges on multiple columns: Pandas merge() function is capable of merging data based on multiple columns, providing more flexibility.
- Superior for complex manipulation: The merge() function is ideal for complex data manipulation tasks due to its flexibility and powerful capabilities.
- More powerful and flexible: The merge() function can handle more complex tasks compared to Excel's VLOOKUP, making it a more robust tool for data analysis.
Conclusion
In conclusion, for those who are accustomed to using Excel's VLOOKUP function but want to take advantage of the power of Python, Pandas' merge() function is an essential tool to have in your data analysis arsenal. Not only is it simple to use, but it also offers a level of flexibility and versatility that goes beyond the capabilities of Excel's VLOOKUP.
With a bit of practice, you can use merge() to easily manipulate data in Python, enabling you to analyze large data sets more efficiently than ever before. Whether you're a data analysis professional or just starting out, Pandas' merge() function is a must-have tool for working with data in Python. So, start exploring the many possibilities of merge() and take your data analysis to the next level!