Excelython — Part 5: Vlookup in Python
I can easily say that the breakthrough in my business development was when I learned VLOOKUP.
Surprisingly, as many people you know that are expert on using VLOOKUP formula, there almost the same amount of people that still are joining to different tables by searching values one by one!!!!!!!
In the previous part of our project we managed to import all the files that we are going to use, as dataframes in our Python Script. So now you see in the variables panel, 6 different dataframes as File_1 to File_6.
Project Step 2
As part of the preprocessing that we are going to do in order to extract the information that our project asks, we have to create a master panel that combines data from all the files.
As a base we are going to use File 1 and we are going to import on it, all the data that we need from other files.
Excel Step 1
In File 1, create a new column with name = “Warehouse Stock”.
Excel Step 2
In File 1, on the first cell bellow the new created column, let’s write a VLOOKUP formula that will get the Warehouse Stock, from File 2.
=VLOOKUP(A2;File 2!A:B;2;0)
Drag until the last row this formula and your new column will have the data you need.
Python Step 1
In order to do the same in Python, there is no VLOOKUP. There are various methods of combining tables. You can elevate your knowledge in these methods and techniques by reading the following article:
link: Merge, join, concatenate and compare
In our case now, this step it is again one line of code, where we define:
- The dataframe that we are going to use the resulted merge. This can happen as in Excel in the File_1, which we use it as base panel or you can have another name and create a new dataframe and call it ex. File_7.
- pd.merge( = the function of pandas that we want to do.
- File 1 = The first table where we want to get the data.
- File 2 = The target file that has the data we want.
- how = Define the way that the merge will happen based on the keys that we are going to set.
i) Keys = The common value in the tables that we want to join the tables with the desired value.
ex. What we want is the Warehouse Stock for each SKU. In both files the common data that we have is the SKU column. So the key = SKU.
6. key = SKU
So, the line of code that we are going to write is:
File_1 = pd.merge(File_1, File_2, how="left", on = "SKU")
The result is that the File_1 now has also the information of “Warehouse Stock”.
Job Done!!! This is the VLOOKUP in Python!
There are some things that you have to keep in mind for other scenarios:
- KEY has to be exactly the same value and type in both dataframes. It will not work if in one dataframe is number and in the other one is text.
- In case where the second table has columns that you don’t need in the first table, you have the options:
i) Create a new dataframe that has only the column of the KEY and the column of the requested value. In our example if in the File_2 the columns where like the picture below
The code to create a new dataframe will be:
Dataframe Indexingnew_File_2 = File_2.loc[:,["SKU","Warehouse Stock"]]
or
new_File_2 = File_2.iloc[:,[0,2]]
## Columns count in Python starts from zero, so with iloc you define ## that you need the column 0 = first and the column 2 = third
When you are indexing columns in a dataframe, it is much preferable to use the .loc method, in order to define names. If in the future you need to revise your script, believe me, you will not remember that column 2 was Warehouse Stock!!!
ii) You can drop the unwanted columns, if there are less than 5, otherwise you will end up with a code line that it will have more length than the full script!!!
So in the example with File_2 with extra column what we would do is the following:
File_2 = File_2.drop(["Some Other Column"], axis = 1)
By doing this we dropped from dataframe the unwanted column.
3. If you have a column with name = Test in File_1 and a column with name = Test in File_2, and accidentally you haven’t dropped it from File_2, then when you are finished with pd.merge in File_1 will be a column with name = Test_x and a column with name Test_y.