1. Introduction to Web Scraping and Proxy Systems
Hi, I'm Fabien Vauchel. I've been deeply passionate about web scraping for years. I work at Wiremind, an amazing company specializing in revenue management within the transportation industry. I'm also the creator of Scrapoxy, a free and open-source proxy aggregator that supports major cloud providers and proxy services. It's written entirely in TypeScript with the NestJS and Angular frameworks.
Hi, I'm Fabien Vauchel. I've been deeply passionate about web scraping for years. My enthusiasm led me to explore the fascinating world of proxy and antibot systems.
I work at Wiremind, an amazing company specializing in revenue management within the transportation industry. Our work at Wiremind involves handling millions of prices on a daily basis, which requires substantial investment in web scraping technology.
I'm also the creator of Scrapoxy. Scrapoxy is a free and open-source proxy aggregator that lets you manage and route traffic through cloud providers and proxy services. It supports major cloud providers such as AWS, Azure, GCP, and DigitalOcean, as well as proxy services like Zyte, Rayobyte, IPRoyal, and many others. It's written entirely in TypeScript with the NestJS and Angular frameworks.
2. Isabella's Journey to Web Scraping
Before diving into this amazing product, let me share a little story. Isabella, a final-year student at an IT school, noticed a gap in the market and realized she needed a vast amount of data to create her ultimate trip tool. She decided to focus on accommodations and made sure to consider all the legal aspects. Now, let me introduce you to the website she chose to scrape, TrekkieReviews.com. It is your go-to spot for checking out accommodations in any city. Isabella is interested in analyzing the reviews to see what people think about accommodations.
Before diving into this product, let me share a little story. Enter Isabella. She's a final-year student at an IT school. Isabella has a bright mind, a lot of energy, and a thirst for traveling. Every year, she embarks on a one-month backpacking journey to a random country, and she plans every detail of it herself. But here's the twist: this level of planning consumed her entire year in preparation for just one month of travel. Isabella couldn't help but notice a gap in the market. Why wasn't there a tool for this, in a digital era pumped with AI? This could be her ticket to a successful business. She realized she needed a vast amount of data to create such a tool: data that would train a large language model to curate her ultimate trip. However, she's a total newcomer to the web scraping industry. How could she collect massive amounts of data? To kick things off, she decided to focus all her efforts on accommodations.
However, Isabella is very careful in her approach to business. Before she starts scraping data, she makes sure to consider all the legal aspects. She knows it's important not to overwhelm the website by making too many requests too quickly. She also respects privacy: she only collects information that is already public, like reviews, and doesn't take any personal details like names. She doesn't sign the website's terms and conditions either, so she's free from any contract. Now that everything is clear, she is ready to collect the data. Let me introduce you to the website she chose to scrape, TrekkieReviews.com.
So what's TrekkieReviews all about? It is your go-to spot for checking out accommodations in any city you're interested in. Here is how it works: you simply enter the name of the city you want to explore in the search bar, and you see a list of all available accommodations. Let's say Isabella is dreaming of Paris; she will find 50 accommodations. If she clicks on one hotel, she gets all its information: name, description, address, email, and reviews. Isabella is interested in the reviews. Her tool is all about analyzing those reviews to see what people think about each accommodation.
3. Deep Dive into Web Scraping
With the latest large language models, Isabella can perform sentiment analysis on the reviews; the real challenge is bypassing the website's protections. By following the requests in the Chrome inspector, Isabella focuses on extracting the HTML content and uses libraries like Axios and Cheerio for efficient text extraction. Her spider retrieves hotel information in a structured way.
With the latest large language models, she can dive deep into sentiment analysis and extract the main feeling or issue. But wait, there is more: the website is also super secure. I've put in place different levels of protection to keep the place safe, and during this presentation, Isabella will try to bypass each protection one by one.
To better understand how the requests are chained, let's open the Chrome inspector tool. Here's the scoop: when you land on a webpage, it's like getting a big package delivered to your doorstep. Inside this package, there's a whole bunch of stuff: HTML, CSS, JavaScript, and images. But let's keep it simple; we are only interested in the HTML. I dock the panel and make sure 'Preserve log' is checked so we don't lose track of anything. Now let's follow the requests.
The first one is the search form, at the URL /level1. The second one is the list of hotels with pagination, at the URL /cities with the parameter city set to Paris. The last one is the hotel detail, at the URL /hotel/{id}. If we click on the Response tab, you can see the full HTML content from which to extract the hotel ratings. However, it can be complex to execute requests and parse HTML manually, so I want to spotlight two powerful libraries: Axios and Cheerio. Axios is a JavaScript library designed to handle HTTP requests. Cheerio is another JavaScript library that parses HTML using CSS selectors, making text extraction a breeze. All right, let's break down our spider.
If I open my spider, I've got several pieces of information and several methods. The first one is a method to go to the homepage. After that, I can search hotels by city; here I go to Paris and retrieve the links of the hotels. Once I have the links, I can get each hotel's details and extract the name, email, and rating. Let's run this spider. When I run it, I get 50 items and can check the information in a structured way.
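As a reference, here is a minimal sketch of what such an Axios + Cheerio spider could look like. The base URL and CSS selectors are assumptions made up for illustration; the real markup of the demo site is not shown in the talk.

```typescript
import axios from 'axios';
import * as cheerio from 'cheerio';

const BASE_URL = 'https://trekkiereviews.com'; // hypothetical base URL

// Search for hotels in a city and return the links to each hotel page.
async function searchHotels(city: string): Promise<string[]> {
  const res = await axios.get(`${BASE_URL}/cities`, { params: { city } });
  const $ = cheerio.load(res.data);
  return $('a.hotel-link') // hypothetical selector
    .map((_, el) => $(el).attr('href') ?? '')
    .get();
}

// Fetch one hotel page and extract name, email, and rating.
async function getHotel(link: string) {
  const res = await axios.get(`${BASE_URL}${link}`);
  const $ = cheerio.load(res.data);
  return {
    name: $('h1.name').text().trim(),   // hypothetical selectors
    email: $('.email').text().trim(),
    rating: $('.rating').text().trim(),
  };
}

async function run() {
  const links = await searchHotels('Paris');
  for (const link of links) {
    console.log(await getHotel(link));
  }
}

run().catch(console.error);
```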
4. Increasing Security and Introducing Proxies
I've got different pieces of information here: names, emails, and ratings. By identifying the request headers a real browser sends and adding them to the spider, I can bypass the website's spider detection and then raise the security level again. Moving to level three, I run into the problem of too many requests coming from the same IP address. To solve this, I introduce a proxy system that relays requests and masks the original source. Data center proxies, hosted on services like AWS, Azure, or GCP, are fast and reliable but can be easily detected by antibot solutions.
I've got different pieces of information here: names, emails, and ratings. It's perfect, so I can increase the security level. Let's move to level two. If I run the spider now, I get an 'unknown browser' error. That's because the spider identified itself as Axios, and the website rejected it. If we open Chrome and check the request headers, we see that Chrome sends insightful information such as the User-Agent: Chrome identifies itself as Mozilla on Linux. There are also other headers associated with the user agent. I will add this information to the spider: I write the User-Agent and say that it is Windows, and I also add the client hint headers such as Sec-CH-UA, Sec-CH-UA-Mobile, and Sec-CH-UA-Platform. And when I run the spider this time, I get my 50 items. It's perfect. Now let's move to the next level.
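Before moving on, here is roughly what that header-spoofing step looks like in an Axios-based spider. The exact values are illustrative, modeled on what Chrome on Windows typically sends.

```typescript
import axios from 'axios';

// Headers copied in spirit from a real Chrome on Windows (values are illustrative).
const browserHeaders = {
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Sec-CH-UA': '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
  'Sec-CH-UA-Mobile': '?0',
  'Sec-CH-UA-Platform': '"Windows"',
};

// Every request made with this client now carries browser-like headers.
const client = axios.create({ headers: browserHeaders });
```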
I'm moving to level three and running my spider. The spider slows down and I get an error: too many requests. That's because I'm making a lot of requests from the same IP address. My laptop sends all the requests to the web server from the same IP, and the server rejects me for making too many requests. I need many IP addresses to send the requests from. That's where I want to introduce a proxy. What is a proxy? A proxy is a system running on the internet that relays requests to a server; the server believes the request is coming from the proxy, not from the real source. And of course, there are many types of proxies. The first type is the data center proxy. This kind of proxy runs on AWS, Azure, or GCP. It is the first serious kind of proxy you can use. They are fast and reliable; however, they can be easily identified by antibot solutions.
5. Understanding ISP Proxies
IP ranges are associated with autonomous system numbers, whose names can be Amazon or Microsoft. To bypass antibot solutions, ISP proxies can be used. These proxies rent IP addresses from clean autonomous system numbers, such as mobile providers, and mix your activity with the other traffic on those IP addresses.
To explain: IP ranges are associated with autonomous system numbers, and the name of the autonomous system can be Amazon or Microsoft, so the IP can be easily identified by antibot solutions. But there is a trick to get around it, called the ISP proxy: the Internet Service Provider proxy. Let's talk about ISP proxies and how they work. ISP proxies are set up in data centers, but they don't use IP addresses from those data centers. Instead, they rent IP addresses from a clean autonomous system number, like a mobile provider such as Verizon. They get a bunch of IP addresses, and the proxy uses one of them. This means that when you use the proxy, your activity gets mixed in with everything else on those IP addresses, which keeps you hidden.
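As a side note, you can check the autonomous system behind an IP address yourself. This small sketch uses the public ipinfo.io endpoint as one possible choice; any ASN lookup service works.

```typescript
import axios from 'axios';

// Return the organization / ASN behind an IP address.
// For an AWS instance this typically looks like "AS16509 Amazon.com, Inc.",
// which is exactly the signal an antibot uses to flag data center traffic.
async function lookupAsn(ip: string): Promise<string> {
  const res = await axios.get(`https://ipinfo.io/${ip}/json`);
  return res.data.org;
}

lookupAsn('52.95.110.1').then(console.log); // example IP, for illustration only
```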
6. Rotating Residential Proxies and ScrapOxy
The last type of proxy is the rotating residential proxy. It uses IP addresses from devices like laptops or mobile phones, which belong to real users. Scrapoxy, the super proxy aggregator, is used to manage the proxy strategy. It offers a user-friendly UI and makes it easy to add connectors for different types of proxies. An AWS connector can be created in a few clicks, and Scrapoxy handles all the installation and management of instances. Each proxy can be monitored for status, metrics, real IP address, and geo information.
And there is one last type of proxy: the rotating residential proxy. Here the IP address comes from a device, which can be a laptop or a mobile phone. How does it work? When a developer wants to earn money from his application, he has three options. First, he can sell subscriptions, like monthly or annual subscriptions that unlock features. Second, he can add advertising, like an ad at the bottom of the application. And third, he can share the bandwidth of the device, of course only with the user's agreement. That's where these IPs come from. This type of proxy is very powerful because the IP addresses come from real users, and there are millions of endpoints available.
Now we will use Scrapoxy, the super proxy aggregator, to manage our proxy strategy. Starting Scrapoxy is very easy; it's just a Docker command to run. So I run that, and in a second I've got Scrapoxy up and ready. I can go to the UI and enter my credentials, and I've got one project already created. On this project, I can very easily add a new connector. I go to the marketplace, where there are a hundred connectors available: data center connectors, ISP connectors, residential connectors, 4G mobile connectors, hardware connectors, whatever you need. If I want to add an AWS connector, I just click on create and enter my credentials. Scrapoxy will handle all the installation and start and stop the instances for you; you don't have to manage that anymore.
Let me demo that. I already created one AWS connector. If I start this connector, it will very quickly create several proxies on AWS. Now I've got 10 instances on AWS; I can see their status and metrics, and I also have each proxy's real IP address and geo information. Here, the proxies are based in Dublin, and I can confirm that with the coverage map: every proxy is in Dublin.
7. Integrating Scrapoxy and Moving to Level 5
Let's integrate Scrapoxy into the spider by adding the proxy configuration. When the website starts forbidding data center IPs, switching to a residential network like ProxySeller avoids the error. After adding the connector to Scrapoxy, without touching the spider, I can see all the hotel information being collected. Now, let's move on to level 5.
So now let's integrate Scrapoxy into the spider. I copy the username and go back to my spider. I add the proxy here: localhost, port 8888, protocol HTTP, and my credentials. I still need the password, so I go back, grab it, and add it. That's perfect. Let's restart the spider.
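In code, that proxy configuration looks roughly like this with Axios. Scrapoxy's local endpoint defaults to port 8888, and the credentials below are placeholders for the project's username and password; depending on how you handle Scrapoxy's certificate for HTTPS targets, you may also need to adjust TLS verification (see the Scrapoxy documentation).

```typescript
import axios from 'axios';

// Route every request through the local Scrapoxy endpoint.
// Username and password are placeholders for the project credentials.
const client = axios.create({
  proxy: {
    protocol: 'http',
    host: 'localhost',
    port: 8888,
    auth: {
      username: 'PROJECT_USERNAME',
      password: 'PROJECT_PASSWORD',
    },
  },
});

// The spider code itself does not change; only the transport does.
async function fetchPage(url: string): Promise<string> {
  const res = await client.get(url);
  return res.data;
}
```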
Now it's working and I've got my 50 items. Let's move to the next level, level 4. If I run the spider again, I get a big error: data centers are forbidden. That's because the antibot system detects the autonomous system number, which is AMAZON-02. So I need to use a residential network. Today I will use the ProxySeller residential network; they provide data center proxies, ISP proxies, residential proxies, and mobile proxies. I already created my ProxySeller credential in Scrapoxy, and now I will add a connector. First, I stop the AWS connector to avoid paying for anything more, then I create the new connector: I need 50 proxies based in the US, and they start in a second. That's perfect. Checking the proxy list, I can see that Scrapoxy has stopped the AWS instances, and I can see the new proxies from the residential network. If I go to the coverage map, within a minute I can see all the new proxies. Here I've got 50 proxies in the US.
Now I can go back to the spider. As you can see, I didn't touch the spider at all; I just added a connector to Scrapoxy. And if I run the spider again, all the hotel information is collected. I've got my 50 items; it's perfect. Now I will move to the next level, level 5.
8. Using Playwright and Handling Fingerprint Errors
I encountered a 'no fingerprint' error when running the spider. To solve this, I used Playwright, a headless browser that lets me send requests with real browser information. By executing the fingerprint request and sending that information to the antibot, I can then make every other request quickly. However, antibot systems also look for consistency, which caused an 'inconsistent time zone' error when moving to the next level.
When I run the spider this time, I get a big error: no fingerprint. To understand it, I need to go back to my browser. In the browser, as you can see, there are a lot of POST requests: the browser is sending a lot of information to the antibot system. So let's check what kind of information I'm sending: the platform type, the time zone, and the real user agent. We cannot spoof this kind of information anymore.
I need a real browser to send my requests instead of Axios, a browser that executes JavaScript and can be controlled by a script. I will use Playwright. Playwright is a headless browser automation tool (here it will also show its window, which is helpful for the presentation). It can execute JavaScript, it works with Chrome, Firefox, Edge, and Safari, and it is open source and maintained by Microsoft. So let's see how we can adapt our spider.
Now I can create a Playwright script based on the previous spider. In the Playwright script, I've got the same methods: I go to the home page, I post the form and get the list of hotels, and I get the details of each hotel, extracting names, emails, and ratings. If I run this spider, you will see a browser opening and going to the home page. It executes the fingerprint request and sends all the information to the antibot, and after that I can make every other request very quickly. As you see, we cannot see the pages because it only downloads the content without rendering it. So I've got my 50 items. But of course, antibot systems don't just check fingerprint information; they also check for consistency.
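Here is a minimal sketch of that Playwright version, again with illustrative URLs and selectors, and the same Scrapoxy placeholder credentials as before.

```typescript
import { chromium } from 'playwright';

async function run() {
  const browser = await chromium.launch({
    proxy: {
      server: 'http://localhost:8888',
      username: 'PROJECT_USERNAME',
      password: 'PROJECT_PASSWORD',
    },
  });
  const page = await browser.newPage();

  // Loading the home page lets the browser run the fingerprinting JavaScript
  // and send the POST requests the antibot expects.
  await page.goto('https://trekkiereviews.com');

  // Hypothetical selectors for the search form and result links.
  await page.fill('input[name="city"]', 'Paris');
  await page.click('button[type="submit"]');
  const links = await page.$$eval('a.hotel-link', (els) =>
    els.map((e) => e.getAttribute('href') ?? ''),
  );

  for (const link of links) {
    await page.goto(`https://trekkiereviews.com${link}`);
    console.log({
      name: await page.textContent('h1.name'),
      email: await page.textContent('.email'),
      rating: await page.textContent('.rating'),
    });
  }

  await browser.close();
}

run().catch(console.error);
```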
So if I move to the next level, level 6, and run the Playwright spider again, the spider connects to the home page and sends the fingerprint. But when I execute the other requests, I get a big error: inconsistent time zone. It happens because we are sending the browser's real time zone.
9. Adjusting Time Zone and Completing Requests
The antibot sees the browser's real time zone, which is inconsistent with the location of the proxies. By changing the browser time zone to America/Chicago, I was able to execute all requests successfully with Playwright. Thank you!
The antibot sees the real time zone: the browser's time zone is Europe/Paris, but we are using IP addresses from the US, so there is a five-hour difference. I need to correct that. I cannot change the time zone of the IP addresses, but I can change the time zone of the browser. To do that, I go to the settings here and change the time zone ID to America/Chicago.
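In Playwright this is essentially a one-line change: the time zone is set per browser context, so the fingerprint the antibot sees matches the US-based exit IPs. A minimal sketch:

```typescript
import { chromium } from 'playwright';

async function openPage() {
  const browser = await chromium.launch();
  // Align the browser's reported time zone with the proxies' region.
  const context = await browser.newContext({ timezoneId: 'America/Chicago' });
  return context.newPage();
}
```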
If I run the spider again, it connects to the home page and sends the fingerprint information. This time, the IP time zone is consistent with the browser time zone, and I can execute all the requests. As you see, every request is executed by Playwright, and I've got my 50 items.
That's all for me. Thank you very much. Download Scrapoxy and join Wiremind.