
How To Scrape Data from Any Website: 5 Code and No-Code Methods

Unlock the secrets of effective web scraping. Dive into informative articles, tips, and tutorials on web data extraction, and sharpen your scraping skills with us.
Julien Keraval
October 14, 2024
5 min

A wealth of valuable data and information is stored on websites. However, harvesting that data precisely and efficiently is not always simple. This is where web scraping tools come into play. And if you are not interested in paying for such tools, we have rounded up the 5 best free web scraping methods.

So, without any further ado, let’s dive in!

What is Web Scraping?

Web scraping is a method of extracting large amounts of data from a website using software. Because the process is automated, it is an efficient way to collect large volumes of data in either structured or unstructured form.

Individuals and businesses use this data extraction method for various purposes including:

  • Market research
  • Lead generation
  • Price monitoring

A specialized tool used for web scraping is referred to as a ‘Web Scraper’. It is designed to extract data quickly and accurately. The level of complexity and design of a web scraper might vary depending on the project.

Is Web Scraping Legal?

We often come across a common question: "Is web scraping legal?" The short answer is "yes": web scraping is legal if you are extracting data from a publicly available website. That said, it shouldn't be done in a way that raises concerns about how the data is extracted and used.

In addition, certain laws provide guidelines relevant to web scraping. These include:

  • Computer Fraud and Abuse Act (CFAA)
  • Digital Millennium Copyright Act
  • Contract Act
  • Data Protection Act
  • Anti-hacking Laws

Why Is Web Scraping Useful?

Scraping a website is getting more complicated with each passing day. However, with the availability of web scraping tools, extracting large-scale data is much easier. So, whether you run a well-established business or are still working to grow one, web scraping can be more than helpful.

To help you understand why web scraping is so useful, we have briefly discussed some of its most prominent benefits.

1. Allows Generating Quality Leads

Lead generation tends to be a tiresome task. However, with web scraping, generating quality leads won't take too long. With an efficient web scraping tool, you can scrape the most relevant data about your target audience.

For instance, you can scrape data using various filters such as company, job title, education, and demographics. Once you have the contact information of your target audience, it's time to start your marketing campaign.

2. Offers More Value to Your Customers

Customers are always willing to pay more if a product offers more value. With web scraping, it is possible to improve the quality of your product or services. For this purpose, you can scrape information about your customers and their feedback on your product's performance.

3. Makes It Easy to Monitor Your Competitors

It is essential to monitor the latest changes made to your competitor’s website. This is the area where web scraping can be helpful. For example, you can monitor what types of new products your competitor has launched.

You can also get valuable insights regarding your competitor’s audience or potential customers. This allows you to carve a new market strategy.

4. Helps with Making Investment Decisions

Investment decisions are usually complex. So, you need to collect and analyze the relevant information before reaching a decision. For this purpose, you can take advantage of web scraping to extract data and conduct analysis.

What Data Can We “Scrape” from the Web?

Technically, you can scrape almost any publicly available website. However, ethical and legal considerations mean you can't always do so. It is therefore worth understanding some general rules before you start scraping.

Some of these rules include:

  • Don’t scrape private data that needs a passcode or username to access.
  • Avoid copying or scraping data that is copyrighted.
  • Don't scrape data if doing so is explicitly prohibited by the site's Terms of Service (ToS); a related good habit is checking the site's robots.txt file, as in the sketch below.
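
For illustration, here is a minimal sketch, using only Python's standard library, that checks whether a site's robots.txt allows fetching a given URL. The target URL and the wildcard user agent are placeholders; substitute the page you actually intend to scrape.


#!/usr/bin/python3
from urllib.robotparser import RobotFileParser

# Placeholder target; replace with the site you intend to scrape
url = 'https://www.scrapin.io/blog'
robots_url = 'https://www.scrapin.io/robots.txt'

parser = RobotFileParser()
parser.set_url(robots_url)
parser.read()  # download and parse robots.txt

# can_fetch() reports whether the given user agent may fetch the URL
if parser.can_fetch('*', url):
    print('Allowed by robots.txt')
else:
    print('Disallowed by robots.txt')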

Free Web Scraping Methods

Web scraping has long been a popular way to extract valuable data. In addition to paid web scraping tools, you can also take advantage of free scraping methods.

To help you with this, here are some of the methods that you can use depending on your data extraction needs:

1. Manual Scraping with Upwork and Fiverr

If you are interested in manual data scraping, you can hire a freelancer via popular freelancing platforms like Upwork and Fiverr. These platforms help you find a web scraping expert depending on your data extraction needs.

Both Upwork and Fiverr promote their top-rated freelancers. So, you can easily find a seasoned web scraper offering online services. You can even find local web scrapers using these platforms.

2. Python Library – BeautifulSoup


BeautifulSoup is a Python library that allows you to scrape information from web pages. It works with an HTML or XML parser and provides Pythonic idioms for searching, navigating, and modifying the parse tree. Using this library, you can extract data from HTML and XML files.

You need the pip package manager to install BeautifulSoup on Linux or Windows. If you already have pip, just follow these simple steps:

Step 1: Open a terminal or command prompt.

Step 2: Run this command and wait for BeautifulSoup to install.


pip install beautifulsoup4

Note: BeautifulSoup doesn't parse documents itself; it relies on a separate parser. Python's built-in html.parser works out of the box, while third-party parsers such as lxml or html5lib need to be installed separately (for example, pip install lxml).

Step 3: This step involves the selection of a preferred parser library. You can choose from different options including html5lib, html.parser, or lxml.

Step 4: Verify the installation by importing and using the library in Python.
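
For example, this minimal check parses a small inline HTML snippet and prints its text, confirming that BeautifulSoup is importable and working:


#!/usr/bin/python3
from bs4 import BeautifulSoup

# Parse a tiny inline document to confirm BeautifulSoup is installed
soup = BeautifulSoup('<p>Hello, BeautifulSoup!</p>', 'html.parser')
print(soup.p.get_text())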

How to Scrape with BeautifulSoup

Below are the key steps to follow when scraping data with BeautifulSoup.

Step 1: Extract the HTML using the requests library:


#!/usr/bin/python3
import requests

url = 'https://www.scrapin.io'

data = requests.get(url)

print(data.text)

Step 2: Extract the content from the HTML using this script:


#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://www.scrapin.io'

data = requests.get(url)

my_data = []
html = BeautifulSoup(data.text, 'html.parser')

# The selectors below match this example site's markup;
# adapt them to the structure of the page you are scraping.
articles = html.select('a.post-card')

for article in articles:
    title = article.select('.card-title')[0].get_text()
    excerpt = article.select('.card-text')[0].get_text()
    pub_date = article.select('.card-footer small')[0].get_text()
    my_data.append({"title": title, "excerpt": excerpt, "pub_date": pub_date})

# Print the scraped records
pprint(my_data)

Step 3: Save the above code in a file named fetch.py, and run it using the following command:


python3 fetch.py

3. JavaScript Library - Puppeteer

Follow these steps to initialize your first Puppeteer scraper:

Step 1: To start with, create a folder for the project on your computer using the mkdir command:


mkdir first-puppeteer-scraper-example

Step 2: Now, initialize a Node.js project with a package.json file by running the npm init command:


npm init -y

Step 3: After running this command, a package.json file is created. Once you install Puppeteer in Step 4, it should look like this:


{
  "name": "first-puppeteer-scraper-example",
  "version": "1.0.0",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "puppeteer": "^19.6.2"
  },
  "type": "module",
  "devDependencies": {},
  "description": ""
}

Step 4: Here, you need to install the Puppeteer library. Use this command to install Puppeteer.


npm install puppeteer

Step 5: After the Puppeteer library is installed, you can scrape any web page using JavaScript. Since the package.json above declares "type": "module", the script uses an ES module import:


import puppeteer from 'puppeteer';

async function scrapeWebsite(url) {
    // Launch the browser
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Navigate to the URL
    await page.goto(url);

    // Scrape data - as an example, let's scrape the title of the page
    const pageTitle = await page.evaluate(() => {
        return document.title;
    });

    console.log(`Title of the page is: ${pageTitle}`);

    // Close the browser
    await browser.close();
}

// Replace 'https://example.com' with the URL you want to scrape
scrapeWebsite('https://example.com');
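
Save the script as index.js (the file name is just an example) and run it with Node.js:


node index.js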

4. Web Scraping Tool - Webscraper

Webscraper

You can scrape the web by using Webscraper. Just follow these steps:

Step 1: Install the Webscraper extension from the Chrome Web Store and open it (it appears as a tab in Chrome's Developer Tools).

Step 2: Open the site that you want to scrape and create a sitemap. If the data spans several pages, specify the start URLs using ranges.

For the examples listed below, you can use a range URL like “http://example.com/page/[1-3]”.

http://example.com/page/1

http://example.com/page/2

http://example.com/page/3

For the links listed below, use a range URL with zero padding like “http://example.com/page/[001-100]”.

http://example.com/page/001

http://example.com/page/002

http://example.com/page/003

For the link examples provided below, use a range URL with increment like “http://example.com/page/[0-100:10]”

http://example.com/page/0

http://example.com/page/10

http://example.com/page/20

Step 3: Once you have created a sitemap, the next step is to create selectors. These selectors are added in a tree-like structure.

Step 4: Inspect the selector tree in the Selector graph panel to confirm it matches the structure you expect.

Step 5: With this, you are all set to scrape your desired web page. Just open the Scrape panel and start web scraping.
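
For reference, a sitemap exported from the extension is a JSON document roughly like the one below. The field names reflect a typical export and may vary between extension versions; the selector itself is purely illustrative.


{
  "_id": "example-sitemap",
  "startUrl": ["http://example.com/page/[1-3]"],
  "selectors": [
    {
      "id": "title",
      "type": "SelectorText",
      "parentSelectors": ["_root"],
      "selector": "h1",
      "multiple": true
    }
  ]
}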

5. Web Scraping API – ScraperApi

Web scraping is easy with ScraperAPI. This API is built for hassle-free integration and customization. To enable JS rendering, IP geolocation, residential proxies, or rotating proxies, just add parameters such as render=true, country_code=us, or premium=true to your request.
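
For illustration, here is a minimal sketch using the Python requests library that sends a single request through the API endpoint with the rendering and geolocation parameters mentioned above. The API key and target URL are placeholders.


import requests

API_KEY = 'INSERT_API_KEY_HERE'  # placeholder API key

# render and country_code enable JS rendering and US geolocation;
# api_key and url are always required.
params = {
    'api_key': API_KEY,
    'url': 'http://quotes.toscrape.com/',
    'render': 'true',
    'country_code': 'us',
}

response = requests.get('http://api.scraperapi.com/', params=params)
print(response.status_code)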

Below are the steps to follow when you want to use ScraperAPI with the Python Request library:

Step 1: Send requests to ScraperAPI through its API endpoint (a Python SDK and proxy port are also available). The example below fetches each page via the endpoint and parses the quotes with BeautifulSoup:


import requests
from bs4 import BeautifulSoup

API_KEY = 'INSERT_API_KEY_HERE'

list_of_urls = ['http://quotes.toscrape.com/page/1/', 'http://quotes.toscrape.com/page/2/']

scraped_quotes = []

for url in list_of_urls:
    # Route each request through the ScraperAPI endpoint
    params = {'api_key': API_KEY, 'url': url}
    response = requests.get('http://api.scraperapi.com/', params=params)

    # Parse the data on a successful (200) response
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        quotes_sections = soup.find_all('div', class_="quote")

        # Loop through each quote block and extract the quote and author
        for quote_block in quotes_sections:
            quote = quote_block.find('span', class_='text').text
            author = quote_block.find('small', class_='author').text

            # Add the scraped data to the "scraped_quotes" list
            scraped_quotes.append({
                'quote': quote,
                'author': author
            })

print(scraped_quotes)

Step 2: Configure your code to automatically catch and retry failed requests returned by ScraperAPI. The example below wraps each request in a retry loop:


import requests
from bs4 import BeautifulSoup

API_KEY = 'INSERT_API_KEY_HERE'
NUM_RETRIES = 3

list_of_urls = ['http://quotes.toscrape.com/page/1/', 'http://quotes.toscrape.com/page/2/']

scraped_quotes = []

for url in list_of_urls:
    params = {'api_key': API_KEY, 'url': url}
    response = None

    # Retry the request up to NUM_RETRIES times
    for _ in range(NUM_RETRIES):
        try:
            response = requests.get('http://api.scraperapi.com/', params=params)
            if response.status_code in [200, 404]:
                # Exit the retry loop once the API returns a definitive response
                break
        except requests.exceptions.ConnectionError:
            response = None

    # Parse the data only on a successful (200) response
    if response is not None and response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        quotes_sections = soup.find_all('div', class_="quote")

        # Loop through each quote block and extract the quote and author
        for quote_block in quotes_sections:
            quote = quote_block.find('span', class_='text').text
            author = quote_block.find('small', class_='author').text

            # Add the scraped data to the "scraped_quotes" list
            scraped_quotes.append({
                'quote': quote,
                'author': author
            })

print(scraped_quotes)

Step 3: Scale up your scraping by spreading your requests across multiple concurrent threads:


import concurrent.futures

import requests
from bs4 import BeautifulSoup

API_KEY = 'INSERT_API_KEY_HERE'
NUM_RETRIES = 3
NUM_THREADS = 5

# Example list of URLs to scrape
list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]

# The scraped data is stored in this list
scraped_quotes = []

def scrape_url(url):
    params = {'api_key': API_KEY, 'url': url}
    response = None

    # Send the request to ScraperAPI and automatically retry failed requests
    for _ in range(NUM_RETRIES):
        try:
            response = requests.get('http://api.scraperapi.com/', params=params)
            if response.status_code in [200, 404]:
                # Exit the retry loop once the API returns a definitive response
                break
        except requests.exceptions.ConnectionError:
            response = None

    # Parse the data only on a successful (200) response
    if response is not None and response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        quotes_sections = soup.find_all('div', class_="quote")

        # Loop through each quote block and extract the quote and author
        for quote_block in quotes_sections:
            quote = quote_block.find('span', class_='text').text
            author = quote_block.find('small', class_='author').text

            # Add the scraped data to the "scraped_quotes" list
            scraped_quotes.append({
                'quote': quote,
                'author': author
            })

# Spread the requests across multiple concurrent threads
with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_url, list_of_urls)

print(scraped_quotes)

Limitations of Web Scraping

Before you start web scraping, it is worth understanding the limitations you might face. Here are a few of the most prominent ones:

  • Due to the dynamic nature of websites, it is hard for web scrapers to extract the required data using predefined logic and patterns.
  • Heavy use of JavaScript or AJAX on a website also makes scraping more challenging.
  • Anti-scraping software can detect scrapers and block their IP addresses.

How to Protect Your Website Against Web Scraping?

If you don't want others to scrape your website's data, we have got you covered. For your assistance, we have created a list of ways to protect your website against web scraping.

These include:

  • Control how often scrapers can visit by setting limits on connections and requests (a minimal rate-limiting sketch follows this list).
  • Hide valuable data by publishing it as images or in Flash format, which prevents scraping tools from reading it as structured text.
  • Use JavaScript or cookies to verify that visitors aren't scraping tools or automated applications.
  • Add CAPTCHAs to ensure that only humans can browse your site.
  • Identify and block scraping tools and traffic from malicious sources.
  • Update your HTML structure and tags frequently.
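
To illustrate the first point, here is a minimal sketch of per-IP request limiting. The choice of Flask, the 60-requests-per-minute threshold, and the in-memory counter are illustrative assumptions for a single-process demo; production setups usually rely on a reverse proxy or dedicated rate-limiting middleware instead.


import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)

# Illustrative threshold: at most 60 requests per rolling minute per IP
MAX_REQUESTS = 60
WINDOW_SECONDS = 60

# In-memory request log per client IP (demo only; not shared across processes)
request_log = defaultdict(deque)

@app.before_request
def rate_limit():
    now = time.time()
    history = request_log[request.remote_addr]

    # Drop timestamps that fall outside the rolling window
    while history and now - history[0] > WINDOW_SECONDS:
        history.popleft()

    if len(history) >= MAX_REQUESTS:
        abort(429)  # Too Many Requests

    history.append(now)

@app.route("/")
def index():
    return "Hello, human visitor!"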

Final Thoughts

The availability of free web scraping methods and tools opens up new opportunities for businesses with limited budgets, giving you access to valuable data about your target audience. Each of the methods above has its strengths and weaknesses, so it's up to you to choose the data collection process that best fits your web scraping needs.
