Information Extraction from PDF

Japneet Singh Chawla
May 8, 2020


Introduction

Natural Language Processing involves collecting data from various sources, and one is not always lucky enough to get ready-made data. Often you have to extract the data yourself, and files are one such source.

In this post, I will be talking specifically about PDF files.

Getting the Guns ready

After some exploration on the internet, I came across a Python package, PyPDF2, which sounded like a good contender for what we want (text extraction), although it can do more than we need. The package can also be used to generate, decrypt, and merge PDF files. Its usage details are not that clear, though, which is why I thought of writing a post to explain it.

Installation

pip install PyPDF2

Reading the File and extracting Text

import PyPDF2

filename = 'complete path of your pdf file'

# Open the file in binary mode; the with block closes it afterwards
with open(filename, 'rb') as pdfFileObj:
    # Create a PDF reader object.
    # PyPDF2 2.x and later use PdfReader, pages and extract_text();
    # older releases named these PdfFileReader, getPage() and extractText().
    pdfReader = PyPDF2.PdfReader(pdfFileObj)
    text = ""
    # Loop over every page and collect its text
    for pageObj in pdfReader.pages:
        text += pageObj.extract_text()

print(text)

Using the above code, one can easily extract text from PDF files.
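If you need to do this for more than one file, it may help to wrap the steps in a small function. The following is only a sketch and not the one right way to do it: it assumes PyPDF2 2.0 or later, where the reader class is PdfReader and pages expose extract_text() (older releases spell these PdfFileReader, getPage and extractText). The names pages_to_text and extract_pdf_text are my own, made up for this example.

```python
def pages_to_text(pages):
    """Join the text of page-like objects that expose extract_text()."""
    parts = []
    for page in pages:
        # A page with no extractable text (for example a scanned image)
        # may yield an empty result, so fall back to "" defensively.
        parts.append(page.extract_text() or "")
    # Put a newline between pages so their text does not run together
    return "\n".join(parts)


def extract_pdf_text(path):
    """Return the text of every page in the PDF at `path` (hypothetical helper)."""
    # Imported here so pages_to_text stays usable without PyPDF2 installed.
    # Assumes PyPDF2 >= 2.0 (PdfReader / pages / extract_text).
    from PyPDF2 import PdfReader

    with open(path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        # Extract while the file is still open; pages are read on demand
        return pages_to_text(reader.pages)


# Usage, with a path of your own:
#     text = extract_pdf_text("complete path of your pdf file")
#     print(text)
```

Keeping the page-joining logic in its own function makes it easy to try out with dummy page objects, without needing a real PDF at hand.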

Also, have a look at some of my other posts related to NLP:

GROKs for Information Extraction

Using GROKs in Python

Document Vectorization

GloVE Vectorization

Word Vectorization

To read more about text extraction and other tech-related topics, do have a look at my blog.
