Scraping behind a login

Puppeteer and Playwright can be particularly useful when scraping data accessible only behind a login wall. This article shows a practical example of such a case.

# Scraping expenses on Amazon

For our example, we will log in to our Amazon account and scrape the price of each order placed in the previous year, then add the prices up to show our total Amazon expenditure over that period.
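The "adding them all up" step can be sketched on its own, independent of the browser. The helpers below are hypothetical (`parsePrice` and `totalSpent` are names chosen for illustration), and they assume scraped prices look like `$1,234.56`, with `.` as the decimal separator and `,` as the thousands separator. Marketplaces that format prices differently would need different parsing rules.

```typescript
// Hypothetical helper: turn a scraped price label such as "$1,234.56"
// into a number. Assumes "." is the decimal separator and "," groups
// thousands; a "1.234,56" style would need different rules.
function parsePrice(label: string, currencySymbol: string): number {
  const cleaned = label
    .replace(currencySymbol, "") // drop the first occurrence of the symbol
    .replace(/,/g, "")           // drop thousands separators
    .trim();
  const value = parseFloat(cleaned);
  if (isNaN(value)) {
    throw new Error(`Could not parse price from "${label}"`);
  }
  return value;
}

// Sum in integer cents so that floating-point rounding does not
// accumulate across many orders.
function totalSpent(labels: string[], currencySymbol: string): number {
  const cents = labels
    .map((label) => Math.round(parsePrice(label, currencySymbol) * 100))
    .reduce((sum, c) => sum + c, 0);
  return cents / 100;
}
```

Passing the currency symbol as a parameter keeps the same helpers usable across Amazon sites that display different currencies.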

A combination of UI automation and scraping will allow us to first log in to the platform, and then to retrieve the information about all our orders.
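As a rough illustration of that two-step flow, here is a minimal sketch rather than a finished script. It assumes a single-form login (Amazon's actual flow may split the email and password across pages), and every URL and selector is supplied by the caller because none of them come from this article. `PageLike`, `LoginConfig` and `scrapeOrderTotals` are hypothetical names; `PageLike` lists only the page methods the sketch uses, all of which Playwright's `Page` object provides.

```typescript
// Minimal structural type covering the page methods used below.
// Playwright's Page object provides methods with these names.
interface PageLike {
  goto(url: string): Promise<unknown>;
  fill(selector: string, value: string): Promise<void>;
  click(selector: string): Promise<void>;
  waitForSelector(selector: string): Promise<unknown>;
  locator(selector: string): { allInnerTexts(): Promise<string[]> };
}

// All values here are assumptions to be filled in for the target site.
interface LoginConfig {
  loginUrl: string;
  ordersUrl: string;
  username: string;
  password: string;
  selectors: {
    username: string;       // username or email input
    password: string;       // password input
    submit: string;         // sign-in button
    loggedInMarker: string; // element that only appears once logged in
    orderTotal: string;     // element holding each order's price
  };
}

async function scrapeOrderTotals(
  page: PageLike,
  config: LoginConfig
): Promise<string[]> {
  const { selectors } = config;

  // Step 1 (UI automation): fill in and submit the login form.
  await page.goto(config.loginUrl);
  await page.fill(selectors.username, config.username);
  await page.fill(selectors.password, config.password);
  await page.click(selectors.submit);
  // Wait for proof that the login succeeded before moving on.
  await page.waitForSelector(selectors.loggedInMarker);

  // Step 2 (scraping): open the orders page and read every price label.
  await page.goto(config.ordersUrl);
  await page.waitForSelector(selectors.orderTotal);
  return page.locator(selectors.orderTotal).allInnerTexts();
}
```

With Playwright, the `page` argument could come from `chromium.launch()` followed by `browser.newPage()`. Reading the credentials from environment variables rather than hard-coding them keeps them out of the script itself.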

WARNING

This example is only intended for learning purposes. Always make sure the website you are planning to scrape allows such behaviour.

When running a script like this, make sure to choose the right Amazon URL and currency.

TIP

Amazon's pages can change quite quickly. You might need to adjust the locators and/or the flow slightly for the script to work for you.

WARNING

Websites might restrict headless browser traffic in order to protect their users from fraud. Two-factor authentication (2FA), if enabled on the account, will also interfere with the script.

# Takeaways

1. We can scrape information available behind a login wall with Puppeteer and Playwright.
2. Some websites might not allow scraping. Always make sure you check their terms of service beforehand.

# Further reading

1. Basic scraping with Puppeteer and Playwright