Reading Table Rows in Docx Using Python

Read tables from docx file to pandas DataFrames

For those of you who work in Data Analytics or Machine Learning (ML) and do your coding in Python, you would have already been using pandas, the data manipulation and analytics library, for your daily tasks. Pandas comes with a huge set of APIs and functions that allow importing data from diverse file formats such as comma-separated values, JSON, SQL and Microsoft Excel in the form of tables, which are known as DataFrames in the pandas world. Pandas supports various data manipulation operations such as selecting, filtering, merging, reshaping, data cleaning, preprocessing and data wrangling.

Not all the data required for building ML/DL pipelines is available in databases, CSV or Excel files. Sometimes data scientists and machine learning engineers need to read tables embedded in a Word document in docx format and process the data. However, the pandas library does not provide an out-of-the-box solution to read the tables in a document file. In the remainder of this blog, I am going to take you through a technique to achieve this with the help of another Python library named python-docx.

Setup and Installation

Before we move on to the code for extracting tables from docx to pandas DataFrames, let's quickly go through the initial setup and installation. I presume Python 3.x is already installed on your system. If not, Python can be installed from this link.

Install pandas & python-docx

          pip install pandas
          pip install python-docx

Sample Data

I created a simple Word document with some sample table data. It comprises three tables. The first table is a small catalog of old books with a single column header; the second table has no column header and holds the details of a particular person in a department. The third table contains sales data of various products across a quarter, with two column headers and the outer column headers merged. Here is a link to the document for you to download.

Reading docx file

First, import the pandas and python-docx libraries and use the Document function of python-docx to open and read the file.

          import pandas as pd
          from docx import Document

          document = Document("<<docx file path>>")

The "document" object comes with various attributes and methods.

Since our focus is to extract the data from tables in docx, we will use document.tables[<table_num>] to read a specific table in the file by passing the table number. Note that document.tables is a regular Python list, so indexing starts at 0.
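
Before picking a table number, it can help to see how many tables the document holds and how large each one is. The following is only a minimal sketch: the helper name `table_shapes` is hypothetical, and a stand-in object replaces the real document so the snippet runs without a docx file on disk.

```python
from types import SimpleNamespace


def table_shapes(document):
    """Return a (row_count, column_count) pair for each table in the document.

    `document` is expected to behave like a python-docx Document, i.e. expose
    a `tables` list whose items have `rows` and `columns` collections.
    """
    return [(len(table.rows), len(table.columns)) for table in document.tables]


# Stand-in object (an assumption for illustration only).
# With a real file you would pass Document("<<docx file path>>") instead.
fake_document = SimpleNamespace(
    tables=[
        SimpleNamespace(rows=[0] * 6, columns=[0] * 3),
        SimpleNamespace(rows=[0] * 5, columns=[0] * 2),
    ]
)

shapes = table_shapes(fake_document)
print(shapes)  # [(6, 3), (5, 2)]
```

The position of each pair in the returned list is the zero-based index to use with document.tables.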

Process the table data into a pandas DataFrame

Using the built-in attributes of the python-docx library, read each row of the table, retrieve the text from each cell, and create a Python list of lists with one inner list per row. Then convert that Python data structure to a pandas DataFrame.

          # table is one entry of document.tables, e.g. table = document.tables[0]
          data = [[cell.text for cell in row.cells] for row in table.rows]
          df = pd.DataFrame(data)
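
To try this step without a Word file, the sketch below wraps the same list comprehension in a helper and feeds it a stand-in table. The names `table_to_dataframe` and `make_fake_table` are hypothetical, and stripping whitespace from each cell is my own addition rather than part of the original steps.

```python
from types import SimpleNamespace

import pandas as pd


def table_to_dataframe(table):
    """Turn a python-docx style table into a DataFrame of raw cell text."""
    # .strip() is an extra assumption: it removes stray spaces around cell text.
    data = [[cell.text.strip() for cell in row.cells] for row in table.rows]
    return pd.DataFrame(data)


def make_fake_table(rows):
    """Build a stand-in exposing the same rows/cells/text attributes."""
    return SimpleNamespace(
        rows=[
            SimpleNamespace(cells=[SimpleNamespace(text=value) for value in row])
            for row in rows
        ]
    )


fake_table = make_fake_table([["Books", "Year"], ["The Pearl ", "1947"]])
df = table_to_dataframe(fake_table)
print(df.shape)       # (2, 2)
print(df.iloc[1, 0])  # The Pearl
```

At this point the header row is still ordinary data in row 0, and the column labels are just 0 and 1.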

Custom formatting on dataframe

The table(s) present in a docx file can come in various formats. Most often, a simple table has a single header row, or even no column header at all. It is also not rare to see tables with multiple header rows and merged cells. Depending on the scenario at hand, the initial DataFrame should be custom formatted to include the proper column indexes before we start with the data manipulation.

For the single header scenario, set the first row of the DataFrame as the column header and reset the row index.

          df = df.rename(columns=df.iloc[0]).drop(df.index[0]).reset_index(drop=True)
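
The single header step can be tried on a small hand-built frame. This is only an illustrative sketch; the sample rows are a subset of the books table shown later in this post.

```python
import pandas as pd

data = [
    ["Books", "Author", "Year"],
    ["The Pearl", "John Steinbeck", "1947"],
    ["On the Road", "Jack Kerouac", "1957"],
]
df = pd.DataFrame(data)

# Use the first row as column labels, drop it from the body, renumber rows.
df = df.rename(columns=df.iloc[0]).drop(df.index[0]).reset_index(drop=True)

print(list(df.columns))    # ['Books', 'Author', 'Year']
print(df.loc[0, "Books"])  # The Pearl
```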

For the two headers scenario, create a MultiIndex from the first two row values as column headers and reset the row index to start from 0.

          outside_col, inside_col = df.iloc[0], df.iloc[1]
          hier_index = pd.MultiIndex.from_tuples(list(zip(outside_col, inside_col)))
          df = pd.DataFrame(data, columns=hier_index).drop(df.index[[0, 1]]).reset_index(drop=True)
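
Here is a small worked sketch of the two header case on hand-built data. It assumes that a merged outer header cell repeats its text for every column it spans, and that the outer label above "Items" is an empty string; both are assumptions made for illustration.

```python
import pandas as pd

data = [
    ["", "Jan 2020", "Jan 2020", "Feb 2020", "Feb 2020"],
    ["Items", "Qty", "Price", "Qty", "Price"],
    ["Pens", "853", "50", "973", "50"],
    ["Books", "238", "30", "208", "35"],
]
raw = pd.DataFrame(data)

# Pair each outer label with the inner label below it.
hier_index = pd.MultiIndex.from_tuples(list(zip(raw.iloc[0], raw.iloc[1])))

# Keep only the body rows, renumber them, then attach the two-level header.
df = raw.iloc[2:].reset_index(drop=True)
df.columns = hier_index

print(df[("Jan 2020", "Qty")].tolist())  # ['853', '238']
print(list(df["Feb 2020"].columns))      # ['Qty', 'Price']
```

Selecting by the outer label alone, as in df["Feb 2020"], returns the sub-table of inner columns for that month.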

Packaging all the above steps into a function

It is always a good coding practice to bundle the logic into a custom function so that we can easily call and reuse the code as required. The read_docx_table function used in the rest of this post wraps all the above-mentioned logic and returns a pandas DataFrame once we pass the desired parameters: the document to be read, the table number, and the number of header rows the table has.
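
The original snippet is not reproduced in this text, so the following is a reconstruction rather than the author's exact code. It assumes the signature read_docx_table(document, table_num, nheader) seen in the calls below, and that table_num counts from 1 (the first table is requested with table_num=1). The `fake_document` helper is hypothetical and exists only so the sketch runs without a docx file.

```python
from types import SimpleNamespace

import pandas as pd


def read_docx_table(document, table_num=1, nheader=1):
    """Return table number `table_num` (counting from 1) as a DataFrame.

    nheader: 0 for no header row, 1 for a single header row,
             2 for two header rows (turned into a MultiIndex).
    """
    if table_num < 1:
        raise ValueError("table_num counts from 1")
    if nheader not in (0, 1, 2):
        raise ValueError("nheader must be 0, 1 or 2")

    table = document.tables[table_num - 1]  # document.tables is zero-based
    data = [[cell.text for cell in row.cells] for row in table.rows]
    df = pd.DataFrame(data)

    if nheader == 1:
        df = df.rename(columns=df.iloc[0]).drop(df.index[0])
    elif nheader == 2:
        columns = pd.MultiIndex.from_tuples(list(zip(df.iloc[0], df.iloc[1])))
        df = df.drop(df.index[[0, 1]])
        df.columns = columns
    return df.reset_index(drop=True)


def fake_document(*tables):
    """Build a stand-in for a python-docx Document from nested lists."""
    return SimpleNamespace(
        tables=[
            SimpleNamespace(
                rows=[
                    SimpleNamespace(cells=[SimpleNamespace(text=t) for t in row])
                    for row in rows
                ]
            )
            for rows in tables
        ]
    )


document = fake_document(
    [["Books", "Year"], ["The Pearl", "1947"]],
    [["First Name", "David"], ["Last Name", "Smith"]],
)
books = read_docx_table(document, table_num=1, nheader=1)
person = read_docx_table(document, table_num=2, nheader=0)
print(list(books.columns))  # ['Books', 'Year']
print(person.shape)         # (2, 2)
```

With a real file, replace the stand-in with Document("<<docx file path>>") from python-docx; the function body stays the same.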

Execute the function for various scenarios

Let's call the above function by passing the parameters and see if it works for the three scenarios for which it was written. Download the docx file I provided in the previous section and pass the parameters as follows. There are three sample tables in that document. The first table has one column header, the second table has no column header, and the third table has two column headers.

Single header table

          document = Document("PandasTableExtraction.docx")
          table_num = 1
          nheader = 1
          df = read_docx_table(document, table_num, nheader)
          print(df)

Output:

                          Books               Author  Year
          0  The Scarlet Letter  Nathaniel Hawthorne  1850
          1   Heart of Darkness        Joseph Conrad  1902
          2           The Pearl       John Steinbeck  1947
          3   Gulliver's Travel       Jonathan Swift  1726
          4         On the Road         Jack Kerouac  1957

No header table

          document = Document("PandasTableExtraction.docx")
          table_num = 2
          nheader = 0
          df = read_docx_table(document, table_num, nheader)
          print(df)

Output:

                       0      1
          0   First Name  David
          1    Last Name  Smith
          2   Department  Sales
          3  Blood Group     A+
          4       Gender   Male

Two header table

          document = Document("PandasTableExtraction.docx")
          table_num = 3
          nheader = 2
          df = read_docx_table(document, table_num, nheader)
          print(df)

Output:

                     Jan 2020       Feb 2020       Mar 2020
               Items      Qty Price      Qty Price      Qty Price
          0     Pens      853    50      973    50      953    55
          1    Books      238    30      208    35      268    40
          2  Binders      214    70      184    70      202    75
          3  Pencils     1810    10     1570    10     1765    10
          4  Erasers     1502     5     1307     7     2018     7

Data Manipulation

Now that we have the data available in a pandas DataFrame, we can perform all sorts of data manipulation and analysis operations that pandas supports. However, since we extracted the data from docx, all the column datatypes are inferred as string/object. Hence we may need to type cast certain numerical and date columns as required in order to effectively perform certain filters and operations on them.
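
When a column may contain cells that are not clean numbers, pd.to_numeric with errors="coerce" is one alternative to a plain astype(int). This is a sketch on made-up values (the "n/a" cell is invented for illustration and is not in the sample document).

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Books": ["Heart of Darkness", "The Pearl"],
        "Year": ["1902", "1947"],
        "Qty": ["853", "n/a"],  # hypothetical column with one unparseable cell
    }
)

# Clean digit strings become integers.
df["Year"] = pd.to_numeric(df["Year"], errors="coerce")

# Unparseable cells become NaN instead of raising an error.
df["Qty"] = pd.to_numeric(df["Qty"], errors="coerce")

print(df["Year"].tolist())        # [1902, 1947]
print(df["Qty"].isna().tolist())  # [False, True]
```

The trade-off is that a column containing NaN is stored as floats, so a follow-up decision is needed on how to handle the missing values.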

Ex: the first table extracted contains a Year column, and if we need to list the books published after 1900, we need to convert it to an int column first.

          df['Year'] = df['Year'].astype(int)
          print(df[df['Year'] > 1900])

Output:

                         Books          Author  Year
          1  Heart of Darkness   Joseph Conrad  1902
          2          The Pearl  John Steinbeck  1947
          4        On the Road    Jack Kerouac  1957

Ex: the third table extracted contains Qty and Price columns. Suppose we need to list the items that sold more than 1000 units at a price ≥ 10 during Jan 2020.

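          # Assumes the outer header label is 'Jan 2020' and the inner labels
          # are 'Qty' and 'Price', as in the two header output printed earlier.
          jan = df['Jan 2020'].astype(int)
          print(df[(jan['Qty'] > 1000) & (jan['Price'] >= 10)])

In the sample data, only the Pencils row (1810 units at a price of 10) satisfies both conditions.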

This is just an example to get you started with reading a table in a docx file and prepping it for further data analysis. Please go through the pandas data manipulation techniques available to experience the power of pandas. Here is a useful link.

Summary

In this tutorial, you learned how to open and read a docx file containing tables. You saw how to load the tables into pandas DataFrame objects, prep them by changing datatypes, and explore the data using pandas. Data analysis entails lots of trial and error. This experimentation can be challenging, but it also provides a lot of fun and learning!

Source: https://medium.com/@karthikeyan.eaganathan/read-tables-from-docx-file-to-pandas-dataframes-f7e409401370

Rowland Maidest