src: https://viso.ai/deep-learning/data-preprocessing-techniques-for-machine-learning-with-python/

Excelython — Part 8: Checkpoint

Yiannis Servos
3 min readMay 27, 2021

--

In Part 1, a lot of you didn’t have a clue about how easy Python is. Since then you are working on a business project and have learned several key methods in Python on playing with excel files and dataframes.

In this session we will do again some of the methods we learned so far, in order to finalize the preprocessing of our main panel(File_1) and to establish the knowledge we got so far.

In summary we have learned so far:

  1. How to open an Excel File in Python.
  2. How to create a new column in a Python DataFrame.
  3. How to do VLOOKUP in Python.
  4. How to write an IF statement in Python.
  5. How to do Pivot Tables in Python.

Project Step 5–6

In Part 7 we closed by having a DataFrame like this:

In order to finalize the preprocessing phase of our project, we need to get the values of Pending Orders, that they are in File 4 and we need to get the Cost Price per Product that exist in File 5.

Excel Step 1

In File 1 we will create a new column with name = “Pending Orders”. Then we will write a VLOOKUP formula that will get us the Pending Orders from File 4. Don’t forget that probably you don’t have pending orders for all of your products, so it is wise to use an IFERROR statement before VLOOKUP, so you will get zero in the rows you don’t have pending orders, instead of getting an #N/A.

=IFERROR(VLOOKUP(A2;’File 4'!A:B;2;0);0)

Drag the formula to the bottom and now you have a column with all the pending orders per product in File 1.

Excel Step 2

In File 1 we will create a new column with name = “Cost Price”. Again we will write a VLOOKUP formula that will get us the Cost Price from File 5. It is rare to have a product without cost but in any case we will use an IFERROR statement before VLOOKUP, so you will get zero in the rows you don’t have pending orders, instead of getting an #N/A, as we did before.

=IFERROR(VLOOKUP(A2;’File 5'!A:B;2;0);0)

Drag the formula to the bottom and now you have a column with the Cost Price per product in File 1.

Now File 1 will look like this.

Python Step 1

In Python we will perform just to merges with File_4 and File_5, on SKU and we are done.

File_1 = pd.merge(File_1, File_4, how="left", on = "SKU")
File_1 = pd.merge(File_1, File_5, how="left", on = "SKU")

In case you had a File_4, which had only the SKU’s with pending orders>0, this merge would give you some nan values in the File_1.

In order to replace the nan values in an entire dataframe or in a specific columns you use the .FILLNA() function.

You define the dataframe or the columns that contains nan values and inside the brackets you enter the value that you want to replace the nan value.

In our case i will show you how you would do the above in the entire File_4 and in the “Cost Price” column of File_5.

File_4 = File_4.fillna(0)
File_5["Cost Price"] = File_5["Cost Price"].fillna(0)

By the time we have already did the merge I am going to do .FILLNA() in the entire File_1.

File_1 = File_1.fillna(0)

Our File_1 dataframe now looks like this

Now, we are done on creating a panel that contains all the required data in order to move on the processing phase of our project, that we will do it in the next part.

Regards

--

--

Yiannis Servos

Experienced in Retail Category Management and enthusiast about technology and programming.