Data-scraping PDF-parsing python bot
As simple as it may sound, getting properly formatted, relevant data is never that easy. This is especially true for businesses that send reports as PDFs. In my case, our cotton gin’s website is designed to run reports on loads, quality, etc… Once you run the report, it looks like this:
There are a few problems with this when you are talking about data.
- It isn’t a CSV.
- There are things in the table other than the data we care about, e.g. summaries.
- It won’t update itself.
Oh, and this is the process to download this PDF:
So what’s the solution?
A Python bot, to be more specific. There are a few Python libraries that make this data scraping much easier: selenium and tabula. Selenium is a browser automation driver that you can tell to click on certain elements and enter text. Tabula is a PDF table parser that has Python bindings.
To start, install tabula and selenium with pip:
pip install tabula-py
pip install -U selenium
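A quick sanity check that both libraries import correctly never hurts; note that tabula-py wraps the Java tabula library, so a Java runtime also needs to be installed and on your PATH before it will run.
import selenium
import tabula

print(selenium.__version__)  # confirms the selenium install is visible to Python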
First off, you need to have the bot navigate your website, something like this.
This process is surprisingly simple to accomplish in selenium.
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
from time import sleep
import pandas as pd
import tabula
import time
import os

options = webdriver.ChromeOptions()
download_folder = "reports"
profile = {"plugins.plugins_list": [{"enabled": False,
                                     "name": "Chrome PDF Viewer"}],
           "download.default_directory": download_folder,
           "download.extensions_to_open": ""}
options.add_experimental_option("prefs", profile)
driver = webdriver.Chrome(chrome_options=options)
driver.get("http://cottonhost.com/96726/")
actions = ActionChains(driver)

# Log in
element = driver.find_element_by_name("LOGINBUTTON").click();
elem2 = driver.find_element_by_name('PRODID').send_keys("login")
elem3 = driver.find_element_by_name('PRODPASS').send_keys("password")
element = driver.find_element_by_name("LOGINBUTTON").click();
time.sleep(1);

# Filter to report
s2= Select(driver.find_element_by_xpath("/html/body/form/table/tbody/tr[3]/td[2]/select")).select_by_value('PRODLOADRPT')
s2= Select(driver.find_element_by_xpath("/html/body/form/table/tbody/tr[4]/td[2]/select")).select_by_value('W2017')
filter = driver.find_element_by_xpath("/html/body/form/input[1]").click();
time.sleep(1);
# Run Report
reportit = driver.find_element_by_xpath("/html/body/form[@id='FILTERS']/input[10]").click();
driver.switch_to_frame(driver.find_element_by_tag_name("iframe"))
pdf = driver.find_element_by_xpath("html/body/object[@type='application/pdf']");
# print (pdf);
print (pdf.get_attribute("data"))
driver.get(pdf.get_attribute("data"))
time.sleep(4);
driver.close()
Selenium
Although this may seem complex, at its core it is pretty simple. With selenium there are a few basic actions: navigating to a website, clicking, entering text, and reading from parts of a page. The profile setup at the top just configures Chrome to download PDFs into download_folder automatically instead of opening them in the viewer.
From there, you create a driver, which is just the object that drives Chrome. A call to driver.get("http://mywebsite.com") will navigate to the website you pass. Once on a website, you can locate elements in a variety of ways, documented here:
find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_tag_name
find_element_by_link_text
find_element_by_class_name
find_element_by_css_selector
find_element_by_partial_link_text
Once you have a driver.find_element_by_method('identifier'), you can chain an action onto the end such as .click() or .send_keys("hello world"), and with these actions you can easily log into a website and navigate around. Once you have navigated to your PDF, it should download automatically because of the profile that was set up. From there you can close the driver with driver.close(), which will close the Chrome window.
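To make that concrete, here is a minimal sketch of the same pattern against a hypothetical login page; the URL and the element names (user, pass, submit) are placeholders, not the ones from my gin’s site.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://example.com/login")  # navigate to the page

# locate the fields by name and interact with them (names are hypothetical)
driver.find_element_by_name("user").send_keys("myusername")
driver.find_element_by_name("pass").send_keys("mypassword")
driver.find_element_by_name("submit").click()

driver.close()  # closes the Chrome window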
Tabula
Now that the hard part is done and the PDF is downloaded, tabula is extremely easy to operate. The basic idea behind tabula is that it either auto-detects where the columns are (not ideal) or you set a bounding box and place the columns yourself. From there tabula takes over and extracts the table, along with any other information that got caught inside it. Here is the command for running tabula:
tabula.convert_into("report.pdf", "report.csv", output_format="csv", area=(72, 42, 590, 755), guess=False, options="--pages all --columns 35,55,71,89,115,159,180.2,200,271,322,379,487,520,593,618,647,671,687,710,732,773")
This command reads the PDF, goes through all pages, and outputs a CSV. To place the columns yourself, you need guess=False together with the --columns option, which takes the locations of the column boundaries in points. The area can also be set with area=(top, left, bottom, right), measured in points: the top of the table from the top of the page, its left edge from the left of the page, its bottom from the top of the page, and its right edge from the left of the page. Then when you run it, it spits out a perfect CSV!
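If you would rather work with the table in pandas than write a CSV, tabula-py can also read straight into a DataFrame. Here is a rough sketch using the same area and column settings; depending on your tabula-py version, read_pdf may return a single DataFrame or a list of them.
import tabula

# read the report directly into pandas instead of writing a CSV
df = tabula.read_pdf("report.pdf",
                     pages="all",
                     guess=False,
                     area=(72, 42, 590, 755),
                     columns=[35, 55, 71, 89, 115, 159, 180.2, 200, 271, 322, 379,
                              487, 520, 593, 618, 647, 671, 687, 710, 732, 773])
print(df)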
Here’s the script I have to go through the reports folder and convert the PDFs to CSVs:
for filename in os.listdir('reports'):
    if filename.endswith(".PDF"):
        print(filename)
        filestring = str(filename)
        filecsv = filestring[:-4] + '.csv'
        tabula.convert_into(os.path.join(download_folder, filename), filecsv,
                            output_format="csv", area=(72, 42, 590, 755), guess=False,
                            options="--pages all --columns 35,55,71,89,115,159,180.2,200,271,322,379,487,520,593,618,647,671,687,710,732,773")
It’s not bad, but it needs a little cleaning. After extracting the table, I have SQLAlchemy push it to my PostgreSQL database. In the database, a trigger function removes any rows where the Farm ID column doesn’t follow the pattern BOWLES(somevalue). This gets the data cleaned up for reporting. So once the data ends up in my database, it looks something like this:
Note: two different reports were used in this post; however, the process is similar for any report.
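For reference, the push step itself can be just pandas plus SQLAlchemy. Here is a minimal sketch; the connection string, the loads table name, and the exact "Farm ID" column label are placeholders, and the real cleanup happens in the database trigger, but the same rule is shown in pandas for illustration.
import pandas as pd
from sqlalchemy import create_engine

# placeholders: adjust the connection string, table name, and column label
engine = create_engine("postgresql://user:password@localhost:5432/gin")

df = pd.read_csv("report.csv")

# same cleanup rule the trigger enforces: keep only rows whose
# Farm ID looks like BOWLES(somevalue)
df = df[df["Farm ID"].astype(str).str.match(r"BOWLES\(.*\)")]

df.to_sql("loads", engine, if_exists="append", index=False)
Using if_exists="append" means each run simply adds the newly downloaded report rows to the same table.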