Scraping for files

November 12, 2018

I came across this great resource while trying to become better read on privacy-enhancing technologies, a fascinating area of CS/security research focused on building systems that protect their users' privacy. The site links to many interesting papers, but unfortunately there is no easy way to bulk-download them, and I want them all :-)

Here is a simple Python program I wrote to scrape the page for the PDFs and write them to a local directory. It is designed to be generic and expandable, so I'm sure it will come in handy again in the future. Maybe it will be useful for someone reading this too.

import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm


class Scraper:
    """A small bs4-based scraper that pulls links to files of an
    arbitrary filetype out of a web page."""

    def __init__(self, target):
        self.target = target
        self.content = requests.get(target).text
        self.soup = BeautifulSoup(self.content, 'html.parser')

    def scrape(self, extension):
        # Collect every href ending in the requested extension,
        # resolving relative links against the page URL.
        links = [urljoin(self.target, a['href'])
                 for a in self.soup.find_all('a', href=True)
                 if a['href'].endswith(extension)]
        return set(links)

    def get_and_write(self, entries):
        # Download each link into ./downloaded, skipping files we already have.
        os.makedirs('downloaded', exist_ok=True)
        for entry in entries:
            parsed = urlparse(entry)
            path = os.path.abspath(
                os.path.join('downloaded', os.path.basename(parsed.path)))
            if os.path.isfile(path):
                continue
            try:
                response = requests.get(entry, stream=True)
                with open(path, "wb") as f:
                    for data in tqdm(response.iter_content(chunk_size=1024)):
                        f.write(data)
            except requests.exceptions.RequestException:
                print("Connection Error")
                continue
        return "Done"


if __name__ == "__main__":
    url = "https://people.cs.umass.edu/~amir/CMPSCI691PT-Fall14-schedule.html"
    filetype = ".pdf"
    crawler = Scraper(url)
    crawler.get_and_write(crawler.scrape(filetype))
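
To point it at a different page or file type, only the two values in the __main__ block need to change. A quick sketch of what reuse looks like (the URL and extension here are placeholders, not a real reading list):

# Hypothetical reuse: grab PostScript files from some other page.
crawler = Scraper("https://example.com/reading-list.html")
crawler.get_and_write(crawler.scrape(".ps"))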