Reading Table Rows in Docx Using Python
Read tables from docx file to pandas DataFrames
For those of y'all who are working in the field of Data Analytics, Machine Learning (ML) and do their coding in python, would accept already been using Pandas, the data manipulation and analytics library for their daily tasks. Pandas comes upwards with huge set of APIs and functions that allow importing data from diverse file formats such as comma-separated values, JSON, SQL, Microsoft Excel in the form of tables which is chosen as DataFrames in the pandas world. Pandas allows various data manipulation operations such every bit selecting, filtering, merging, reshaping, data cleaning, preprocessing and data wrangling features.
Non all the data required for edifice the ML/DL pipelines would be bachelor in databases, csv or excel files and sometimes data scientists and car learning engineers may need to read some tables embedded within the Give-and-take document file in docx format and procedure the data. All the same, the pandas library is not providing an out-of-the-box solution to read the tables in the document file. In the remaining section of this weblog, I am going to take you through the technique to achieve this desired functionality with the help of another python library named python-docx.
Setup and Installation
Before we move on with the code for understanding the tabular array extraction table from docx to pandas dataframes, let'due south chop-chop go through the initial setup and installation. I presume python iii.x is already installed in your system. If not, python tin be installed from link.
Install pandas & python-docx
pip install pandas
pip install python-docx
Sample Data
I created a uncomplicated word certificate with some sample table data. Information technology comprises of three tables. Showtime table comprises of a small catalog of former books with a single column header; 2d tables comes up with no column header of a item person in a department. The third table contains sales data of various products spanned beyond a quarter with two column headers and outer column merged. Here is a link to the spreadsheet for you to download.
Reading docx file
Showtime import pandas and python-docx libraries and utilise Document method of python-docx to open and read the file
import pandas every bit pd
from docx import Certificate document = Document("<<docx file path>>")
The "certificate" object comes up with various attributes and methods.
Since our focus is to excerpt the data from tables in docx, nosotros would utilise document.tables[<table_num>] to read a specific tabular array nowadays in the file by passing the table number
Procedure the tabular array data to pandas dataframe
Using the in-congenital attributes of python-docx library, read each rows of the table and retrieve the text from each cells and create python list of list containing each row. Then convert that python data construction to pandas DataFrame.
information = [[cell.text for prison cell in row.cells] for row in tabular array.rows]
df = pd.DataFrame(information)
Custom formatting on dataframe
The tabular array(s) nowadays in docx tin be of various formats. More often than not, in case of simple table, there will be single header or even comes without a column header. Also, it is not rare to see some tables with multiple headers with merged cells. Based on the scenario in paw, the initial dataframe should exist custom formatted to include the proper column indexes before we showtime with the data manipulation.
For single header scenario, fix the beginning row of dataframe as column header and reset the row alphabetize.
df = df.rename(columns=df.iloc[0]).drib(df.alphabetize[0]).reset_index.(drop=True)
For two headers scenario, create the multi-index based on first ii row values as column headers and reset the row alphabetize to showtime from 0
outside_col, inside_col = df.iloc[0], df.iloc[1]
hier_index = pd.MultiIndex.from_tuples(list(null(outside_col, inside_col))) df = pd.DataFrame(data,columns=hier_index).drop(df.alphabetize[[0,one]] ).reset_index(drop=True)
Packaging all to a higher place steps into a role
Information technology is always a good coding practice to bundle the logic into a custom function so that nosotros can hands phone call and reuse the code as required. The below code snippet does all the above mentioned logic to recall yous with pandas DataFrame in one case we laissez passer the desired parameters like document to be read, table number and number of headers the table has got.
Execute the role for various scenarios
Let's call the in a higher place office by passing the parameters and see if it works for three scenarios for which it was written. Download the docx file I provided in previous section and pass the parameters every bit follows. There are three sample tables in that document. First table got one cavalcade header, second table contains no column reader while the 3rd tabular array got two column headers.
Single header tabular array
certificate = Document("PandasTableExtraction.docx")
table_num=ane
nheader=i
df = read_docx_table(document,table_num,nheader)
print(df) Output: Books Author Twelvemonth
0 The Crimson Alphabetic character Nathaniel Hawthorne 1850
i Eye of Darkness Joseph Conrad 1902
2 The Pearl John Steinbeck 1947
3 Gulliver'south Travel Jonathan Swift 1726
4 On the Road Jack Kerouac 1957
No header table
document = Document("PandasTableExtraction.docx")
table_num=2
nheader=0
df = read_docx_table(certificate,table_num,nheader)
print(df) Output:
0 ane
0 First Name David
ane Final Name Smith
two Department Sales
three Blood Group A+
4 Gender Male person
2 header table
document = Certificate("PandasTableExtraction.docx")
table_num=three
nheader=ii
df = read_docx_table(document,table_num,nheader)
impress(df) Output:
Jan 2020 Feb 2020 Mar 2020
Items Qty Toll Qty Price Qty Price
0 Pens 853 50 973 fifty 953 55
1 Books 238 30 208 35 268 40
2 Binders 214 70 184 70 202 75
iii Pencils 1810 10 1570 10 1765 10
iv Erasers 1502 5 1307 7 2018 vii
Information Manipulation
Now that we accept the data available in pandas DataFrame construction, we can perform all sorts of data manipulation and analysis operations that pandas supports. Nonetheless, since we extracted the data from docx, all the column datatypes would be inferred equally string/object. Hence we may need to blazon bandage sure numerical and date columns as required in lodge to effectively perform sure filters and operations on them.
Ex: the kickoff table extracted contains Yr column and if we need to list downwards the books published after 1900, nosotros need to convert it to int cavalcade first
df['Year'] = df['Year'].astype(int)
print(df[df['Year'] > 1900]) Output:
Books Writer Twelvemonth
ane Center of Darkness Joseph Conrad 1902
2 The Pearl John Steinbeck 1947
4 On the Road Jack Kerouac 1957
Ex: the third table extracted contains Qty and Cost columns and if we need to list downward the items which were sold > 1000 units and toll ≥10 during Jan 2020
df['Year'] = df['Yr'].astype(int)
print(df[df['Year'] > 1900]) Output:
Books Author Year
1 Heart of Darkness Joseph Conrad 1902
2 The Pearl John Steinbeck 1947
four On the Road Jack Kerouac 1957
This is just an example for you to go started in reading the tabular array in docx file and prep it to perform farther data analysis. Please go through the pandas data manipulation techniques available to get the experience of power of pandas. Hither is an useful link.
Summary
In this tutorial, you learned how to apply open and read docx file containing the tables. You saw how to load the tables to the pandas dataframe object and prep by irresolute datatypes and explore data using pandas. Information analysis entails lots of trial and error. This experimentation tin exist challenging, but also provides you a lot of fun and learning!
Reading Table Rows in Docx Using Python
Source: https://medium.com/@karthikeyan.eaganathan/read-tables-from-docx-file-to-pandas-dataframes-f7e409401370