Mastering Web Scraping with Scrapoxy: Unleash Your Data Extraction Wizardry!

The video provides an in-depth look at web scraping using Scrapoxy, a proxy aggregator that supports major cloud providers like AWS, Azure, and GCP. It introduces Isabella, an IT student who scrapes TrekkieReviews.com to gather accommodation reviews using tools like Axios and Cheerio for efficient text extraction. The video discusses the importance of using different types of proxies such as data center proxies, ISP proxies, and rotating residential proxies to avoid detection. It also highlights the role of headless browsers like Playwright in bypassing advanced antibot systems. Scrapoxy simplifies proxy management by allowing easy addition and configuration of various proxies, ensuring efficient and secure web scraping.

From Author:

Unlock the potential of web scraping with this session! 

1/ Building Web Scrapers - The Art Unveiled

2/ Proxy and Browser Farms Adventure

3/ Scrapoxy Orchestration - Elevate Your Scalability

4/ Protection Measures Disclosed

This concise session will immerse you in the world of web scraping.

#WebScraping #Proxy #ReverseEngineering 🕵️‍♂️

This talk was presented at Node Congress 2024.

FAQ

Fabien Vauchelles is a web scraping enthusiast who works at Wiremind, a company specializing in revenue management within the transportation industry. He is also the creator of Scrapoxy, a free and open-source proxy aggregator.

Scrapoxy is a free and open-source proxy aggregator created by Fabien Vauchelles. It allows users to manage and route traffic through various cloud providers and proxy services, supporting major providers like AWS, Azure, GCP, and DigitalOcean.

Scrapoxy supports major cloud providers such as AWS, Azure, GCP, and DigitalOcean. It also supports proxy services like Zyte, Rayobyte, IPRoyal, and many others.

Isabella, a final-year IT student, uses web scraping to collect public reviews from the website TrekkieReviews.com. She aims to analyze these reviews using large language models for sentiment analysis to create a tool that curates ultimate travel experiences.

TrekkieReviews.com is a website where users can check out accommodations in any city. Users can search for a city and find a list of available accommodations, including detailed information like name, description, address, email, and reviews.

Fabien Vauchelles recommends using Axios, a JavaScript library for making HTTP requests, and Cheerio, a JavaScript library for parsing HTML using CSS selectors.

The different types of proxies mentioned are data center proxies, ISP (Internet Service Provider) proxies, and rotating residential proxies. Each type has its own advantages and use cases.

Scrapoxy helps manage proxies by allowing users to add and configure various types of connectors, such as data center, ISP, and residential proxies. It automates the management of proxy instances, including starting and stopping them as needed.
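
From the scraper's side, an aggregator like this is typically addressed as a single HTTP proxy endpoint. The sketch below is a minimal illustration of that idea, not Scrapoxy's documented setup: it turns a proxy URL into the object shape that Axios accepts for its `proxy` request option. The helper name `proxyConfigFromUrl`, the `localhost:8888` endpoint, and the `USERNAME`/`PASSWORD` credentials are all placeholders assumed for the example.

```typescript
// Shape of the `proxy` option accepted in an Axios request config.
interface ProxyConfig {
  protocol: string;
  host: string;
  port: number;
  auth?: { username: string; password: string };
}

// Hypothetical helper: parses a proxy URL into the shape above.
function proxyConfigFromUrl(proxyUrl: string): ProxyConfig {
  const url = new URL(proxyUrl);
  const config: ProxyConfig = {
    protocol: url.protocol.replace(":", ""),
    host: url.hostname,
    // URL.port is empty when the scheme's default port is used.
    port: Number(url.port) || (url.protocol === "https:" ? 443 : 80),
  };
  if (url.username) {
    config.auth = {
      username: decodeURIComponent(url.username),
      password: decodeURIComponent(url.password),
    };
  }
  return config;
}

// Placeholder endpoint and credentials (assumptions, not values from the talk).
const scrapoxy = proxyConfigFromUrl("http://USERNAME:PASSWORD@localhost:8888");
```

The resulting object could then be passed as the `proxy` field of a request, so that every call goes through the aggregator rather than directly to the target site.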

A headless browser like Playwright is used in web scraping to execute JavaScript and mimic real browser behavior, which is essential for bypassing advanced antibot systems. Playwright is open-source and maintained by Microsoft.

Inconsistencies in time zones can trigger antibot systems to flag and block requests. For example, if the IP address time zone differs from the browser's time zone, the requests may be rejected. Adjusting the browser's time zone to match the IP address time zone can resolve this issue.
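
The consistency check described above can be sketched as a comparison of UTC offsets. This is only an illustration under stated assumptions: the function names `utcOffsetMinutes` and `timezonesConsistent` are hypothetical, and the IP address time zone is assumed to come from some geolocation lookup that is not shown here.

```typescript
// Hypothetical helper: UTC offset (in minutes) of an IANA time zone at a given instant.
function utcOffsetMinutes(timeZone: string, at: Date = new Date()): number {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    hour12: false,
    year: "numeric",
    month: "numeric",
    day: "numeric",
    hour: "numeric",
    minute: "numeric",
    second: "numeric",
  }).formatToParts(at);
  const get = (type: string): number =>
    Number(parts.find((p) => p.type === type)?.value);
  // Rebuild the zone's wall-clock time as if it were UTC, then diff against the real instant.
  // `% 24` guards against runtimes that format midnight as hour 24.
  const asUtc = Date.UTC(
    get("year"), get("month") - 1, get("day"),
    get("hour") % 24, get("minute"), get("second"),
  );
  return Math.round((asUtc - at.getTime()) / 60000);
}

// True when both zones have the same UTC offset at the given instant.
function timezonesConsistent(
  ipTimeZone: string,
  browserTimeZone: string,
  at: Date = new Date(),
): boolean {
  return utcOffsetMinutes(ipTimeZone, at) === utcOffsetMinutes(browserTimeZone, at);
}
```

If the two disagree, the browser side is usually the easier one to change. Playwright, for example, accepts a `timezoneId` option when creating a browser context, which could be set to the time zone of the proxy's IP address.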

Fabien Vauchelles
21 min
04 Apr, 2024

Video Transcription

1. Introduction to Web Scraping and Proxy Systems

Short description:

Hi, I'm Fabien Vauchelles. I've been deeply passionate about web scraping for years. I work at Wiremind, an amazing company specializing in revenue management within the transportation industry. I'm also the creator of Scrapoxy, a free and open-source proxy aggregator. It supports major cloud providers and proxy services. It's fully written in TypeScript with the NestJS and Angular frameworks.

Hi, I'm Fabien Vauchelles. I've been deeply passionate about web scraping for years. My enthusiasm led me to explore the fascinating world of proxy and antibot systems.

I work at Wiremind, an amazing company specializing in revenue management within the transportation industry. Our work at Wiremind involves handling millions of prices on a daily basis, which requires substantial investment in web scraping technologies.

I'm also the creator of Scrapoxy. Scrapoxy is a free and open-source proxy aggregator. It allows you to manage and route traffic through cloud providers and proxy services. It supports major cloud providers such as AWS, Azure, GCP, and DigitalOcean. It supports proxy services like Zyte, Rayobyte, IPRoyal, and many others. It's fully written in TypeScript with the NestJS and Angular frameworks.

2. Isabella's Journey to Web Scraping

Short description:

Before diving into this amazing product, let me share with you a little story. Isabella, a final-year student in IT school, noticed a gap in the market and realized she needed a vast amount of data to create her ultimate trip tool. She decided to focus on accommodations and made sure to consider all legal aspects. Now, let me introduce you to the website she chose to scrape, TrekkieReviews.com. It is your go-to spot for checking out accommodations in any city. Isabella is interested in analyzing reviews to see what people think about accommodations.

Before diving into this amazing product, let me share with you a little story. Enter Isabella. She's a final-year student in IT school. Isabella has a bright mind, a lot of energy, and a thirst for traveling. Every year, she embarks on a one-month backpacking journey to a random country. But here's a twist. This level of planning consumed her entire year in preparation for just one month of travel. Isabella couldn't help but notice a gap in the market. Why wasn't there a tool for this in a digital era pumped with AI? This could be her ticket to a successful business. She realized she needed a vast amount of data to create such a tool. This vast amount of data will train a large language model to curate her ultimate trip. However, she's a total newcomer in the web scraping industry. How to collect massive amounts of data? To kick things off, she decided to focus all efforts on accommodations.

However, Isabella is very careful in her approach to business. Before she starts scraping data, she makes sure to consider all the legal aspects. She knows it's important not to overwhelm the website by making too many requests too quickly. She also respects privacy. She only collects information that is already public, like reviews, and doesn't take any personal details like names. She doesn't sign the website's terms and conditions either. She's free from any contract. Now that everything is clear, she is ready to collect the data. Let me introduce you to the website she chose to scrape, TrekkieReviews.com.
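
The point about not overwhelming the website could be sketched as a runner that executes requests one at a time with a pause between them. This is a minimal illustration, not code from the talk: `runPolitely`, `sleep`, and the delay value are hypothetical names and choices, and each task stands in for whatever request the scraper would make.

```typescript
// Sleep helper built on setTimeout.
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs tasks sequentially, waiting `delayMs` between consecutive tasks.
// `wait` is injectable so the pacing can be tested without real delays.
async function runPolitely<T>(
  tasks: Array<() => Promise<T>>,
  delayMs: number,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < tasks.length; i++) {
    // No pause before the first task, one pause before each later task.
    if (i > 0) await wait(delayMs);
    results.push(await tasks[i]());
  }
  return results;
}
```

Running tasks strictly in sequence keeps at most one request in flight, and the fixed pause caps the request rate. A suitable delay depends on the target site and is left as a parameter here.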

So what's TrekkieReviews all about? It is your go-to spot for checking out accommodations in any city you're interested in. Here is how it works. You simply enter the name of the city you want to explore in the search bar, and you will see a list of all available accommodations. Let's say that Isabella is dreaming of Paris. She will find 50 accommodations. If she clicks on one hotel, she will get all the information like its name, description, address, email, and reviews. Isabella is interested in reviews. It is all about analyzing those reviews to see what people think about accommodations.
