Since their inception, websites have been used to share information, whether it is a Wikipedia article, a YouTube channel, an Instagram account, or a Twitter handle.
But what if we want to get specific data programmatically?

Unfortunately, most of the time, website owners don't provide any API.
In that case, web scraping is often the only way to extract the data we need.
It makes the whole process of retrieving specific data straightforward.

This tutorial is a comprehensive guide to learning web scraping with the Python programming language.
First, I'll walk you through some basic examples to familiarize you with web scraping.

Later on, we'll use that knowledge to extract football match data from Livescore.cz.
I'm using pipenv for this tutorial, but you can use pip and venv, or conda.
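If Scrapy isn't installed in your environment yet, add it first:

pipenv install scrapy

or, with plain pip inside your virtual environment:

pip install scrapy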
Let's now create a new project named web_scraper using the Scrapy CLI.

If you are using pipenv like me, use:
pipenv run scrapy startproject web_scraper .
Otherwise, from your virtual environment, use:
scrapy startproject web_scraper .
First, we'll locate the logo of the Live Code Stream website inside the HTML.

The code
To get started, we need to create a new spider for this project.
We can do that either by creating a new file or by using the CLI.
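A minimal sketch of such a spider could look like the following; the class name, the spider name lcs, the start URL, and the XPath are assumptions that you should adapt to your own target page. Save it as a new file inside the project's spiders/ folder, for example as lcs.py (the file name itself is up to you).

import scrapy


class LiveCodeStreamSpider(scrapy.Spider):
    # "name" is how the spider is referred to on the command line
    name = "lcs"

    # Scrapy starts crawling from every URL in this list
    start_urls = ["https://livecodestream.dev/"]

    def parse(self, response):
        # Placeholder XPath -- paste the expression copied from the
        # browser's "Copy XPath" menu item here instead
        logo_src = response.xpath("//header//img/@src").get()

        # Yielding a dict makes Scrapy treat it as one scraped item
        yield {"logo": logo_src}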
Code explanation:
Apart from Scrapy's built-in selectors, you can even use some external libraries, like BeautifulSoup and lxml, if you prefer.

In your browser's developer tools, right-click the logo element, open the Copy submenu, and finally click the Copy XPath menu item.
Have a look at the screenshot below to understand it better.
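If you want to verify the copied XPath before putting it into the spider, you can test it in Scrapy's interactive shell. The URL and expression below are only placeholders; prefix the command with pipenv run if you use pipenv.

scrapy shell "https://livecodestream.dev/"
>>> response.xpath("//header//img/@src").get()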
These names (such as name, start_urls, and the parse() method) are pre-defined by the Scrapy library.

So, you must use them as they are.
Otherwise, the spider will not work as intended.
Launch the spider:
We are already inside the web_scraper folder in the command prompt.

Let's execute our spider and save the result into a new file, lcs.json, using the command below.
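Assuming the spider's name attribute is lcs, as in the sketch above, the command looks like this:

pipenv run scrapy crawl lcs -o lcs.json

or, without pipenv:

scrapy crawl lcs -o lcs.json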
The result we get will be well-structured in JSON format.
Results:
When the above command executes, we'll see a new file, lcs.json, in our project folder.

Football tournaments are organized frequently throughout the world.
There are several websites that provide a live feed of match results while they are being played.
But most of these websites don't offer any official API.
For example, let's have a look at the Livescore.cz website.
Let's create a new spider in our project to retrieve the tournament names.
Run the command below to let the spider crawl the home page of the Livescore.cz website.
The scraping result will then be saved to a new file called ls_t.json in JSON format.
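Assuming the spider is registered under the name livescore (as in the livescore.py file shown further down), the command looks like this:

pipenv run scrapy crawl livescore -o ls_t.json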
By now you know the drill.
This is what our web spider extracted on 18 November 2020 from Livescore.cz.
Remember that the output may change every day.
Create a new file inside /web_scraper/web_scraper/spiders/ and name it livescore.py.
Now, enter the code below in it.
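What follows is a minimal sketch of what livescore.py could contain; the row and class selectors are assumptions about Livescore.cz's markup and may need to be adjusted to the site's current structure.

import scrapy


class LiveScoreSpider(scrapy.Spider):
    name = "livescore"

    start_urls = ["https://www.livescore.cz/"]

    def parse(self, response):
        # The page lists tournaments and matches as table rows; the class
        # name checked below is an assumption about the site's markup
        for row in response.xpath("//tr"):
            tournament = row.xpath(
                './/td[contains(@class, "tournament")]//text()'
            ).get()
            if tournament:
                # This row is a tournament header, so extract its name
                yield {"tournament": tournament.strip()}
            # Otherwise the row describes an individual match, which this
            # example ignores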
The code structure of this file is the same as in our previous example.
Here, we just updated the parse() method with new functionality.
Basically, we extract all the relevant HTML elements from the page.
Then, we loop through them to determine whether each one is a tournament or a match.
If it is a tournament, we extract its name.
Run the example:
Type the following command inside the console and execute it.
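Again, assuming the spider name livescore from the sketch above:

pipenv run scrapy crawl livescore -o ls_t.json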
Conclusion
Data analysts often use web scraping because it helps them collect the data they need to make predictions.
We can even use it to monitor the prices of products.
In other words, web scraping has many use cases, and Python is fully capable of handling them.
So, what are you waiting for?
Try scraping your favorite websites now.