URL extractors are popular tools for everyone in the digital space, from marketers to SEO professionals, and they are also a staple of web scraping in the programming community. These scripts range from very simple ones (like the one in this tutorial) to advanced web crawlers used by industry leaders.
Let’s see how we can quickly build our own URL scraper using Python.
To follow this tutorial we will need two Python libraries: httplib2 and bs4.
If you don’t have them installed, please open “Command Prompt” (on Windows) or a terminal (on macOS/Linux) and install them using the following commands:
pip install httplib2
pip install bs4
Get HTML content from URL using Python
To begin this part, let’s first import the libraries we just installed:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer
Now, let’s decide on the URL that we would like to extract the links from. As an example, I will extract the links from the homepage of this blog https://pyshark.com/:
url = 'https://pyshark.com/'
Next, we will create an instance of a class that represents a client HTTP interface:
http = httplib2.Http()
We will need this instance in order to perform HTTP requests to the URLs we would like to extract links from.
Now we will need to perform the following HTTP request:
response, content = http.request(url)
An important note is that the .request() method returns a tuple: the first element is an instance of a Response class, and the second is the content of the body of the URL we are working with.
For this tutorial we will only need the content component of the tuple, which is the actual HTML of the webpage, returned as a bytes object.
Finding and extracting links from HTML using Python
At this point we have the HTML content of the URL we would like to extract links from. We are only one step away from getting all the information we need.
Let’s see how we can extract the needed information:
links = []

for link in BeautifulSoup(content, 'html.parser').find_all('a', href=True):
    links.append(link['href'])
To begin with, we create an empty list (links) that we will use to store the links we extract from the HTML content of the webpage.
Then, we create a BeautifulSoup() object and pass the HTML content to it, along with the parser to use ('html.parser'). This builds a nested representation of the HTML content that we can search.
As the final step, we need to discover the links within the HTML content of the webpage. To do this, we use the .find_all() method and tell it to return only the <a> tags that have an href attribute, i.e. the tags that are actually links.
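We imported SoupStrainer earlier but have not used it yet: it lets BeautifulSoup parse only the tags we care about, which can noticeably speed things up on large pages. A minimal sketch, using a small hard-coded HTML string in place of the downloaded content:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Sample HTML standing in for the downloaded page content
html = ('<html><body>'
        '<a href="https://pyshark.com/">Home</a>'
        '<p>some text</p>'
        '<a href="/about/">About</a>'
        '</body></html>')

# SoupStrainer restricts parsing to <a> tags that have an href attribute
only_links = SoupStrainer('a', href=True)
soup = BeautifulSoup(html, 'html.parser', parse_only=only_links)

links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)  # ['https://pyshark.com/', '/about/']
```

Everything outside the <a> tags is skipped during parsing, so the resulting soup contains only the links.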
Once the script discovers the URLs, it appends them to the links list we created earlier. To check what we found, simply print out the contents of the final list:
for link in links:
    print(link)
And we should see each URL printed out one by one.
This article introduced the basics of scraping links from web pages using the httplib2 and bs4 libraries and walked through a full working example.
Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Python Programming articles.
Originally published at https://pyshark.com on September 14, 2020.