May 27th, 2020
How to Scrape HTML Tables into Excel

Today we are going to see how to scrape table data from Wikipedia into an Excel sheet using Python and BeautifulSoup, simply and elegantly.

This article aims to get you started on real-world problem solving while keeping things super simple, so you get familiar with the tools and see practical results as fast as possible.

So the first thing we need is to make sure we have Python 3 installed. If not, just get Python 3 installed before you proceed.
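You can check from a terminal whether it is already there:

python3 --version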

Then you can install Beautiful Soup with:

pip3 install beautifulsoup4

We will also need the libraries requests, lxml and soupsieve to fetch the page, parse the HTML and use CSS selectors. Install them with:

pip3 install requests soupsieve lxml

Once installed, open an editor and type in:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

Now let's go to the Wikipedia List of English football champions page and inspect the data we can get.

This is how it looks.

Back to our code now. Let's try to get this data by pretending we are a browser, like this:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import csv


headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.11 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',
    'Accept-Encoding': 'identity'
}

url = 'https://en.wikipedia.org/wiki/List_of_English_football_champions'

# Fetch the page while pretending to be a regular browser
response = requests.get(url, headers=headers)

# Print the raw HTML so we can see what we fetched
print(response.content)

# Parse the HTML with the lxml parser
soup = BeautifulSoup(response.content, 'lxml')

Save this as scrapeTable.py.

If you run it

python3 scrapeTable.py

You will see the whole HTML page.

Now, let's use CSS selectors to get to the data we want. To do that, let's go back to Chrome and open the inspect tool. We need to get to all the table details, and we notice that the table with the class 'wikitable' holds them all together.
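As a quick optional sanity check, you can count how many 'wikitable' tables the page contains and peek at the first row of each, to confirm which index we need (the script below uses index 2):

tables = soup.select('.wikitable')
print(len(tables), "tables with class 'wikitable' found")
for i, t in enumerate(tables):
    first_row = t.select_one('tr')
    if first_row is not None:
        # Print the table index and the first 80 characters of its first row
        print(i, first_row.get_text(' ', strip=True)[:80])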

Let's write code that traverses the table rows and then the columns, like this:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import csv


headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.11 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',
    'Accept-Encoding': 'identity'
}

url = 'https://en.wikipedia.org/wiki/List_of_English_football_champions'

response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.content, 'lxml')


# The third table with class 'wikitable' holds the champions data
table = soup.select('.wikitable')[2]

# Walk the table row by row, then cell by cell
for trow in table.select('tr'):
    cnt = 0
    for tcell in trow.select('td'):
        if len(tcell.get_text().strip()) > 0:
            print(tcell.get_text().strip())
        cnt = cnt + 1
    # A full data row has more than 3 cells; print a separator after it
    if cnt > 3:
        print('---------------------------')

So when you run it, you get:

Bingo!! We got the table data.

Now, let's get this data into Excel by exporting it as a CSV file. For this, we will use the CSV module in Python:

# Start a fresh CSV file and write the header row
with open('winner.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["Year", "Champions", "Runners-up", "Third-place", "Top Goal Scorer", "Goals"])

That code creates the header row of the CSV.

And the following code collects each cell's text in a list and appends that list to the CSV once the row is finished:

    for tcell in trow.select('td'):
        if len(tcell.get_text().strip()) > 0:
            print(tcell.get_text().strip())
            row_data.append(tcell.get_text().strip())
        cnt = cnt + 1
    if cnt > 3:
        print('---------------------------')
        # Append the finished row to the CSV
        with open('winner.csv', 'a', newline='') as file:
            writer = csv.writer(file)
            writer.writerow(row_data)

Here is how it looks all together.

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import csv


headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.11 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',
    'Accept-Encoding': 'identity'
}

url = 'https://en.wikipedia.org/wiki/List_of_English_football_champions'

response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.content, 'lxml')

# Start a fresh CSV file and write the header row
with open('winner.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["Year", "Champions", "Runners-up", "Third-place", "Top Goal Scorer", "Goals"])

# The third table with class 'wikitable' holds the champions data
table = soup.select('.wikitable')[2]

# Walk the table row by row, then cell by cell
for trow in table.select('tr'):
    cnt = 0
    row_data = []
    for tcell in trow.select('td'):
        if len(tcell.get_text().strip()) > 0:
            print(tcell.get_text().strip())
            row_data.append(tcell.get_text().strip())
        cnt = cnt + 1
    if cnt > 3:
        print('---------------------------')
        # Append the finished row to the CSV
        with open('winner.csv', 'a', newline='') as file:
            writer = csv.writer(file)
            writer.writerow(row_data)

When run, it prints the results like earlier, and in addition creates a CSV file called winner.csv.

You can open the file in Excel to see the following data.
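If you would rather have a native .xlsx workbook instead of a CSV, a library such as pandas (with openpyxl installed) can convert the file. Here is a minimal sketch, assuming you install both with pip3 install pandas openpyxl:

import pandas as pd

# Read the CSV we just produced and save it as an Excel workbook
df = pd.read_csv('winner.csv')
df.to_excel('winner.xlsx', index=False)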

That was fun!

If you want to use this in production and scale to thousands of links, you will find that many websites block your IP fairly quickly. In that scenario, using a rotating proxy service to rotate IPs is almost a must. You can use a service like Proxies API to route your calls through a pool of millions of residential proxies.
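As an illustration only, requests can route a call through a proxy via its proxies argument; the endpoint and credentials below are placeholders for whatever your proxy provider gives you.

# Placeholder proxy endpoint and credentials - substitute your provider's values
proxies = {
    'http': 'http://username:password@proxy.example.com:8080',
    'https': 'http://username:password@proxy.example.com:8080',
}
response = requests.get(url, headers=headers, proxies=proxies)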

If you want to scale the crawling speed and don't want to set up your own infrastructure, you can use our cloud-based crawler crawltohell.com to easily crawl thousands of URLs at high speed from our network of crawlers.
