Mastering Web Scraping with Scrapoxy: Unleash Your Data Extraction Wizardry!

Unlock the potential of web scraping with this session! 

1/ Building Web Scrapers - The Art Unveiled

2/ Proxy and Browser Farms Adventure

3/ Scrapoxy Orchestration - Elevate Your Scalability

4/ Protection Measures Disclosed

This concise session will immerse you in the world of web scraping.

#WebScraping #Proxy #ReverseEngineering 🕵️‍♂️

Video Summary and Transcription
The talk delves into the world of web scraping, focusing on the use of Scrapoxy, a proxy aggregator that enhances data extraction efficiency. It introduces various proxy types, including data center proxies and ISP proxies, which help in bypassing website protections. The speaker emphasizes the role of rotating residential proxies in mixing user activity to avoid detection. The video also highlights the importance of using JavaScript libraries like Axios and Cheerio for efficient text extraction and parsing. Playwright, a headless browser, is recommended for executing JavaScript and mimicking real browser behavior to bypass antibot systems. The talk also covers the challenges of handling time zone inconsistencies in web scraping, which can lead to requests being flagged by antibot systems. Finally, the discussion includes how Isabella utilizes web scraping to gather reviews from TrekkieReviews.com, aiming to analyze them with sentiment analysis for her travel tool.

This talk was presented at Node Congress 2024. Check out the latest edition of this JavaScript Conference.

FAQ

Who is Fabien Vauchelles?
Fabien Vauchelles is a web scraping enthusiast who works at Wiremind, a company specializing in revenue management within the transportation industry. He is also the creator of Scrapoxy, a free and open-source proxy aggregator.

Which cloud providers and proxy services does Scrapoxy support?
Scrapoxy supports major cloud providers such as AWS, Azure, GCP, and DigitalOcean. It also supports proxy services like Zyte, Rayobyte, IPRoyal, and many others.

What is Scrapoxy?
Scrapoxy is a free and open-source proxy aggregator created by Fabien Vauchelles. It allows users to manage and route traffic through various cloud providers and proxy services, supporting major providers like AWS, Azure, GCP, and DigitalOcean.

What is TrekkieReviews.com?
TrekkieReviews.com is a website where users can check out accommodations in any city. Users can search for a city and find a list of available accommodations, including detailed information like name, description, address, email, and reviews.

Which libraries are recommended for making requests and parsing HTML?
Fabien Vauchelles recommends using Axios, a JavaScript library for handling requests, and Cheerio, a JavaScript library for parsing HTML using CSS selectors.

What types of proxies are mentioned in the talk?
The different types of proxies mentioned are data center proxies, ISP (Internet Service Provider) proxies, and rotating residential proxies. Each type has its own advantages and use cases.

How does Scrapoxy help manage proxies?
Scrapoxy helps manage proxies by allowing users to add and configure various types of connectors, such as data center, ISP, and residential proxies. It automates the management of proxy instances, including starting and stopping them as needed.

Why use a headless browser like Playwright for web scraping?
A headless browser like Playwright is used in web scraping to execute JavaScript and mimic real browser behavior, which is essential for bypassing advanced antibot systems. Playwright is open source and maintained by Microsoft.

How do time zone inconsistencies affect web scraping?
Inconsistencies in time zones can trigger antibot systems to flag and block requests. For example, if the IP address time zone differs from the browser's time zone, the requests may be rejected. Adjusting the browser's time zone to match the IP address time zone can resolve this issue.

How does Isabella use web scraping?
Isabella, a final-year IT student, uses web scraping to collect public reviews from the website TrekkieReviews.com. She aims to analyze these reviews using large language models for sentiment analysis to create a tool that curates ultimate travel experiences.

1. Introduction to Web Scraping and Proxy Systems

Short description:

Hi, I'm Fabien Vauchelles. I've been deeply passionate about web scraping for years. I work at Wiremind, an amazing company specializing in revenue management within the transportation industry. I'm also the creator of Scrapoxy, a free and open-source proxy aggregator. It supports major cloud providers and proxy services. It's fully written in TypeScript with the NestJS and Angular frameworks.

Hi, I'm Fabien Vauchelles. I've been deeply passionate about web scraping for years. My enthusiasm led me to explore the fascinating world of proxy and antibot systems.

I work at Wiremind, an amazing company specializing in revenue management within the transportation industry. Our work at Wiremind involves handling millions of prices on a daily basis, which requires substantial investment in web scraping technologies.

I'm also the creator of Scrapoxy. Scrapoxy is a free and open-source proxy aggregator. It allows you to manage and route traffic through cloud providers and proxy services. It supports major cloud providers such as AWS, Azure, GCP, and DigitalOcean. It supports proxy services like Zyte, Rayobyte, IPRoyal, and many others. It's fully written in TypeScript with the NestJS and Angular frameworks.

2. Isabella's Journey to Web Scraping

Short description:

Before diving into this amazing product, let me share with you a little story. Isabella, a final-year student in IT school, noticed a gap in the market and realized she needed a vast amount of data to create her ultimate trip tool. She decided to focus on accommodations and made sure to consider all legal aspects. Now, let me introduce you to the website she chose to scrape, TrekkieReviews.com. It is your go-to spot for checking out accommodations in any city. Isabella is interested in analyzing reviews to see what people think about accommodations.

Before diving into this amazing product, let me share a little story. Enter Isabella. She's a final-year student in IT school. Isabella has a bright mind, a lot of energy, and a thirst for traveling. Every year, she embarks on a one-month backpacking journey to a random country. But here's the twist: planning these trips consumed her entire year in preparation for just one month of travel. Isabella couldn't help but notice a gap in the market. Why wasn't there a tool for this in a digital era pumped with AI? This could be her ticket to a successful business. She realized she needed a vast amount of data to create such a tool. This vast amount of data would train a large language model to curate her ultimate trip. However, she's a total newcomer to the web scraping industry. How could she collect massive amounts of data? To kick things off, she decided to focus all her efforts on accommodations.

However, Isabella is very careful in her approach to business. Before she starts scraping data, she makes sure to consider all the legal aspects. She knows it's important not to overwhelm the website by making too many requests too quickly. She also respects privacy: she only collects information that is already public, like reviews, and doesn't take any personal details like names. She doesn't sign the website's terms and conditions either, so she's free from any contract. Now that everything is clear, she is ready to collect the data. Let me introduce you to the website she chose to scrape, TrekkieReviews.com.

So what's TrekkieReviews.com all about? It is your go-to spot for checking out accommodations in any city you're interested in. Here is how it works. You simply enter the name of the city you want to explore in the search bar, and you will see a list of all available accommodations. Let's say that Isabella is dreaming of Paris: she will find 50 accommodations. If she clicks on one hotel, she will get all its information, like its name, description, address, email, and reviews. Isabella is interested in the reviews. It is all about analyzing those reviews to see what people think about the accommodations.

3. Deep Dive into Web Scraping

Short description:

With the latest large language models, Isabella can perform sentiment analysis and bypass website protections. By following requests with the Chrome inspector tool, Isabella focuses on extracting HTML content and uses libraries like Axios and Cheerio for efficient text extraction. Her spider enables her to retrieve hotel information in a structured way.

With the latest large language models, she can deep dive into sentiment analysis and extract the main feeling or issue. But wait, there is more. The website is also super secure: I've put in place different levels of protection to keep the place safe. And during this presentation, Isabella will try to bypass each protection one by one.

To better understand how requests are chained, let's open the Chrome inspector tool. Here's the scoop: when you land on a webpage, it's like getting a big package delivered to your doorstep. Inside this package, there's a whole bunch of stuff: HTML, CSS, JavaScript, and images. But let's keep it simple. We are only interested in the HTML. I dock the DevTools panel and make sure to preserve logs so we don't lose track of anything. Now let's follow the requests.

The first one is the search form, with the URL /level1. The second one is the list of hotels and pagination, with the URL /cities and a city parameter set to Paris. The last one is the detail of a hotel, with the URL /hotel/<id>. If we click on the Response tab, we will see the full HTML content from which to extract the hotel ratings. However, it can be complex to execute requests and parse HTML manually, so I want to spotlight two powerful libraries: Axios and Cheerio. Axios is a JavaScript library designed to handle requests. Cheerio is another JavaScript library, which parses HTML using CSS selectors, making text extraction a breeze. All right, let's break down our spider.

So if I open my spider, I've got a lot of information and several methods. The first one is a method to go to the homepage. After that, I can search hotels by city; here I'm searching Paris and retrieving the links of the hotels. Once I've got the links, I can fetch each hotel's details and extract the name, email, and rating. Let's run this spider. Running it, I get 50 items and can check the information in a structured way.
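
To make that flow concrete, here is a minimal sketch of such a spider using Axios and Cheerio. The base URL and CSS selectors are assumptions for illustration only; the talk does not show the real markup of the demo site.

```typescript
import axios from 'axios';
import * as cheerio from 'cheerio';

const BASE_URL = 'https://trekkie-reviews.com'; // hypothetical base URL for the demo site

interface Hotel {
    name: string;
    email: string;
    rating: string;
}

async function searchCity(city: string): Promise<string[]> {
    // List the hotels for a city and collect the links to their detail pages.
    const res = await axios.get(`${BASE_URL}/cities`, { params: { city } });
    const $ = cheerio.load(res.data);
    return $('a.hotel-link') // assumed selector
        .map((_, el) => $(el).attr('href') ?? '')
        .get()
        .filter((href) => href.length > 0);
}

async function getHotel(link: string): Promise<Hotel> {
    // Fetch one hotel detail page and extract name, email, and rating with CSS selectors.
    const res = await axios.get(`${BASE_URL}${link}`);
    const $ = cheerio.load(res.data);
    return {
        name: $('h1.hotel-name').text().trim(),   // assumed selectors
        email: $('a.hotel-email').text().trim(),
        rating: $('span.hotel-rating').text().trim(),
    };
}

(async () => {
    const links = await searchCity('Paris');
    const hotels: Hotel[] = [];
    for (const link of links) {
        hotels.push(await getHotel(link)); // sequential, one request at a time
    }
    console.log(hotels.length, 'items'); // expect 50 items for Paris
})();
```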

4. Increasing Security and Introducing Proxies

Short description:

I've got different information here: names, emails, and ratings. By identifying and modifying the request headers, I can bypass the website's spider detection before moving to the next security level. At level three, I encounter the issue of too many requests from the same IP address. To solve this, I introduce a proxy system that relays requests and masks the original source. Data center proxies, hosted on services like AWS, Azure, or GCP, are fast and reliable but can be easily detected by antibot solutions.

I've got different information here: names, emails, and ratings. It's perfect, so I can increase the security level. Let's move to level two. If I run the spider again, I get an "unknown browser" error. It is because the spider identified itself as Axios, and the website rejected it. If we open Chrome and check the request headers, we'll see that Chrome is sending insightful information, such as the user agent: Chrome identifies itself as Mozilla on Linux. There are also other headers associated with the user agent. I will add that information to the spider. I will set the user agent and say that it is Windows. I will also add the other information, such as the sec-ch-ua headers: sec-ch-ua, sec-ch-ua-mobile, and sec-ch-ua-platform. And if I run the spider again this time, I get my 50 items. It's perfect. Now let's move to the next level.
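
As a rough sketch, here is how those headers could be attached to the Axios client from the previous step. The header values below are illustrative, not authoritative; in practice you would copy them from a real browser request in the DevTools Network tab.

```typescript
import axios from 'axios';

// Axios instance that identifies itself like a desktop Chrome on Windows.
// The exact values are examples; copy the real ones from your own browser.
const client = axios.create({
    headers: {
        'User-Agent':
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
            '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'sec-ch-ua': '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'Accept-Language': 'en-US,en;q=0.9',
    },
});
```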

I'm moving to level three and running my spider. After some delay, the spider gets an error: too many requests. It is because I'm making a lot of requests from the same IP address. My laptop is sending all requests to the web server from the same IP, and the server rejects me because I'm making too many requests. I need many IP addresses to send the requests. That's where I want to introduce a proxy. What is a proxy? A proxy is a system running on the internet. It relays requests to a server, and the server believes that the request is coming from the proxy, not the real source. And of course, there are many types of proxies. The first type is the data center proxy. This kind of proxy runs on AWS, Azure, or GCP. It is the first serious kind of proxy that you can use. They are fast and reliable. However, they can be easily identified by antibot solutions.

5. Understanding ISP Proxies

Short description:

IP ranges are associated with autonomous system numbers like Amazon or Microsoft. To bypass Antibot solutions, ISP proxies can be used. These proxies rent IP addresses from clean autonomous system numbers, like mobile providers, and mix your activity with other IP addresses.

To explain: IP ranges are associated with autonomous system numbers (ASNs), and the name of the ASN can be Amazon or Microsoft, so the traffic can be easily identified by antibot solutions. But there is a trick to get around it, and it is called the ISP proxy: the Internet Service Provider proxy. Let's talk about ISP proxies and how they work. ISP proxies are set up in data centers, but they don't use IP addresses from those data centers. Instead, they rent IP addresses from a clean autonomous system number, like mobile providers such as Verizon. They get a bunch of IP addresses, and the proxy uses one of them. This means that when you are using the proxy, your activity gets mixed with everything else on those IP addresses, which keeps you hidden.

6. Rotating Residential Proxies and Scrapoxy

Short description:

The last type of proxy is the rotating residential proxy. It uses IP addresses from devices like laptops or mobile phones, which come from real users. Scrapoxy, the super proxy aggregator, is used to manage the proxy strategy. It offers a user-friendly UI and allows easy addition of connectors for different types of proxies. AWS connectors can be created with just a few clicks, and Scrapoxy handles all the installations and management of instances. The proxies can be monitored for status, metrics, and real IP address and geo information.

And there is the last type of proxy: the rotating residential proxy. The IP address comes from a device, which can be a laptop or a mobile phone. How does it work? When a developer wants to earn money from his application, he has three options. First, he can sell subscriptions, like monthly or annual subscriptions that unlock features. Second, he can add advertising, like having an ad at the bottom of the application. And third, he can share the bandwidth of the device, of course only with the user's agreement. That's where these IP addresses come from. This type of proxy is very powerful because the IP addresses come from real users, and of course, there are millions of endpoints available.

Now we will use Scrapoxy, the super proxy aggregator, to manage our proxy strategy. Starting Scrapoxy is very easy: it's just a Docker line to run. So I'm running that, and in a second, I've got Scrapoxy up and ready. I can go to the UI here and enter my credentials. I've got one project already created. On this project, I can very easily add a new connector. So I'm going to the marketplace, where I've got 100 connectors available. I can add data center connectors, ISP connectors, residential connectors, 4G mobile connectors, hardware connectors, whatever. If I want to add an AWS connector, I just click on create and enter my credentials. Scrapoxy will handle all the installation and start and stop the instances for you. You don't have to manage that anymore.

So let me demo that. I already created one AWS connector. If I start this connector, it will quickly create several proxies on AWS. Now I've got 10 instances on AWS. I can see their status and metrics, and I also have the real IP address of each proxy here, along with geo information. Here, the proxies are based in Dublin, and I can confirm that with the coverage map: every proxy is in Dublin.

7. Integrating Scrapoxy and Moving to Level 5

Short description:

Let's integrate Scrapoxy into the spiders by adding the proxy configuration. By using residential networks like ProxySeller, we can avoid the error of data centers being forbidden. After adding the connector to Scrapoxy, I can see all hotel information being collected by the spider. Now, let's move on to level 5.

So now let's integrate Scrapoxy into the spider. I will copy the username and go back to my spider. I can add the proxy here: localhost, port 8888, protocol HTTP, and my credentials. I need the password, so I go back, copy it, and add it. That's perfect. Let's restart the spider.
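
A minimal sketch of that proxy configuration in Axios might look like the following. The username and password are placeholders for the project credentials shown in the Scrapoxy UI, and the certificate handling is an assumption: since Scrapoxy intercepts HTTPS traffic, you typically either install its CA certificate or relax verification in the scraping client.

```typescript
import axios from 'axios';
import { Agent } from 'node:https';

// Route every request through the local Scrapoxy endpoint on port 8888.
// USERNAME and PASSWORD are placeholders for the project credentials.
const client = axios.create({
    proxy: {
        host: 'localhost',
        port: 8888,
        protocol: 'http',
        auth: { username: 'USERNAME', password: 'PASSWORD' },
    },
    // Assumption: skip TLS verification because Scrapoxy intercepts HTTPS;
    // installing the Scrapoxy CA certificate is the cleaner alternative.
    httpsAgent: new Agent({ rejectUnauthorized: false }),
});
```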

So now it's working and I've got my 50 items. Let's move to the next level, level 4. If I run the spider again, I get a big error: data centers are forbidden. It's because the antibot system detects the autonomous system number, which is AMAZON-02. So I need to use a residential network for that. Today, I will use the ProxySeller residential network. They provide data center proxies, ISP proxies, residential proxies, and mobile proxies. I already created ProxySeller credentials in Scrapoxy, and I will add a connector. First, I stop the AWS connector to avoid paying anything more. Then I create one connector: I need 50 proxies based in the US, and they start in a second. Yeah, that's perfect. So I'm checking the proxy list and can see that Scrapoxy has stopped the AWS instances, and I can see the new proxies from the residential network. If I go to the coverage map, I will see all the new proxies in a minute. Here I've got 50 proxies in the US.

So now I can go back to the spider. As you can see, I didn't touch the spider; I just added a connector to Scrapoxy. And if I run the spider again, I can see all the hotel information being collected. I've got my 50 items, it's perfect. Now I will move to the next level, level 5.

8. Using Playwright and Handling Fingerprint Errors

Short description:

I encountered an error with no fingerprint when running the spider. To solve this, I used Playwright, a headless browser that allows me to send requests with real browser information. By executing the fingerprint request and sending the information to the antibot, I can quickly make every request. However, antibot systems also look for consistency, which caused an error with inconsistent time zones when moving to the next level.

When I run the spider this time, I get a big error: no fingerprint. To understand that, I need to go back to my browser. On the browser, as you can see, there are a lot of POST requests: the browser is sending a lot of information to the antibot system. So let's check what kind of information I'm sending. I'm sending the platform type, the time zone, and the real user agent. We can no longer spoof this kind of information.

I need a real browser to send my requests instead of Axios, a browser that can execute JavaScript and be controlled by a script. I will use Playwright. Playwright is a headless browser, but we can also see the browser window, which is helpful for presentations. It can execute JavaScript, and it works with Chrome, Firefox, Edge, and Safari. It is open source and maintained by Microsoft. So let's see how we can adapt our spider.

Now I can create a Playwright script based on the previous spider. In the Playwright script, I've got the same methods: I go to the home page, I post the form and get the list of hotels, and I get the details of each hotel, extracting names, emails, and ratings. So if I run this spider, you will see a browser opening and going to the home page. I execute the fingerprint request and send all the information to the antibot, and now I can make every request very quickly. As you see, we cannot see the pages, because the spider only downloads content without rendering it. So I've got my 50 items. But of course, antibot systems are not only checking fingerprint information; they are also checking for consistency.
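
Here is a minimal sketch of what that Playwright version might look like. The URL, form fields, and selectors are assumptions for illustration, and the proxy settings reuse the placeholder Scrapoxy credentials from the previous step.

```typescript
import { chromium } from 'playwright';

(async () => {
    // Launch Chromium and route its traffic through Scrapoxy (placeholder credentials).
    const browser = await chromium.launch({
        proxy: { server: 'http://localhost:8888', username: 'USERNAME', password: 'PASSWORD' },
    });
    const context = await browser.newContext({ ignoreHTTPSErrors: true });
    const page = await context.newPage();

    // Visit the home page so the fingerprint scripts run like a real visit.
    await page.goto('https://trekkie-reviews.com'); // hypothetical URL

    // Submit the search form for Paris and collect the hotel links (assumed selectors).
    await page.fill('input[name="city"]', 'Paris');
    await page.click('button[type="submit"]');
    await page.waitForSelector('a.hotel-link');
    const links = await page.$$eval('a.hotel-link', (anchors) =>
        anchors.map((a) => (a as HTMLAnchorElement).href),
    );

    // Visit each detail page and extract name, email, and rating.
    for (const link of links) {
        await page.goto(link);
        const name = await page.textContent('h1.hotel-name');
        const email = await page.textContent('a.hotel-email');
        const rating = await page.textContent('span.hotel-rating');
        console.log({ name, email, rating });
    }

    await browser.close();
})();
```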

So if I move to the next level, level 6, and run the Playwright spider again, the spider connects to the home page and sends the fingerprint. But when I execute the other requests, I get a big error: inconsistent time zone. It is happening because we are sending the real time zone of the browser.

9. Adjusting Time Zone and Completing Requests

Short description:

The antibot compares the IP address time zone with the browser time zone, which caused an inconsistency. By changing the browser time zone to America/Chicago, I was able to execute all requests successfully using Playwright. Thank you!

So the antibot knows the real time zone of the IP address. The browser time zone is Europe/Paris, but we are using IP addresses from the US, so there is a five-hour difference between the two. I need to correct that. I cannot change the IP address's time zone, but I can change the time zone of the browser. To do that, I go to the settings here and change the time zone ID to America/Chicago.
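
In Playwright, the browser context can emulate a time zone via the timezoneId option, so the fingerprint stays consistent with the US-based proxies. A minimal sketch, assuming the same hypothetical setup as before:

```typescript
import { chromium } from 'playwright';

(async () => {
    const browser = await chromium.launch({
        proxy: { server: 'http://localhost:8888', username: 'USERNAME', password: 'PASSWORD' },
    });

    // Emulate a US time zone so the browser fingerprint matches
    // the US residential IP addresses used by the proxies.
    const context = await browser.newContext({
        timezoneId: 'America/Chicago',
        ignoreHTTPSErrors: true,
    });
    const page = await context.newPage();
    await page.goto('https://trekkie-reviews.com'); // hypothetical URL
    // ...the rest of the spider is unchanged...
    await browser.close();
})();
```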

If I run the spider again, it connects to the home page and sends the fingerprint information, and this time, the IP time zone is consistent with the browser time zone. I can execute all the requests. As you see, every request is executed by Playwright, and I've got my 50 items.

That's all for me. Thank you very much. Download Scrapoxy and join Wiremind.

Fabien Vauchelles
21 min
04 Apr, 2024

