Friday, April 23, 2021

The Hard Days of a Web Scraper

The morning begins with checking mail, reviewing the analytics for my web scraping site, and going through new scraping orders. Then I dig into each technical task and build a strategy for obtaining the data: I analyze the donor sites for bot protection and for dynamic content.

What is a dynamic site?

A dynamic site, or JavaScript site, uses AJAX requests and consists of web pages whose content can change after loading. The final markup of such pages is generated when their code is processed by an interpreter, which in the browser is the JavaScript engine. In other words, after your browser receives the HTML, CSS, and JavaScript files from the server, it starts rendering the page, loading additional information (pictures, text) into separate blocks, and it can also change these blocks. As a result, after all these manipulations the HTML code changes beyond recognition.
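One rough way to check a donor site for dynamism is to look for text you can see in the browser inside the raw HTML the server returns. The sketch below uses only the Python standard library. The function names and the two sample pages are made up for illustration, and a real check would run against HTML downloaded from the actual site.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a reader would see in the raw, unrendered HTML."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def needs_rendering(raw_html, expected_text):
    """True if text seen in the browser is missing from the raw HTML."""
    return expected_text not in visible_text(raw_html)


# Hypothetical sample pages: the same price, served statically and via JavaScript.
static_page = "<html><body><h1>Price: 10 USD</h1></body></html>"
dynamic_page = (
    "<html><body><div id='app'></div>"
    "<script>document.getElementById('app').innerText = 'Price: 10 USD';</script>"
    "</body></html>"
)

print(needs_rendering(static_page, "Price: 10 USD"))
print(needs_rendering(dynamic_page, "Price: 10 USD"))
```

If the text is missing from the raw HTML, the page most likely builds it with JavaScript, and a plain HTTP request will not be enough.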

It is difficult to retrieve data from such sites, and processing their pages takes much longer because JavaScript needs time to render. To scrape such sites, you need a browser and Selenium.

What is Selenium?

Most often, this technology is used to automate the testing of web applications. However, with Selenium WebDriver you can automate any repetitive actions performed through the browser and, in our case, obtain the HTML code of web pages.
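Getting the rendered HTML with Selenium might look roughly like the sketch below. It assumes the `selenium` package and Chrome are installed. The helper names, the timeout values, and the idea of waiting for one CSS selector are my illustration choices, not the only way to do it.

```python
import time


def wait_for_rendered_html(driver, url, css_selector, timeout=10.0, poll=0.5):
    """Open url, wait until css_selector matches something, return the HTML."""
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the value of Selenium's By.CSS_SELECTOR constant.
        if driver.find_elements("css selector", css_selector):
            # page_source is the HTML after JavaScript has changed the page.
            return driver.page_source
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{css_selector!r} did not appear within {timeout}s")
        time.sleep(poll)


def scrape_with_chrome(url, css_selector):
    """Start headless Chrome, fetch the rendered page, and always quit."""
    # Imported here so the helper above can be tried without a browser.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        return wait_for_rendered_html(driver, url, css_selector)
    finally:
        driver.quit()
```

A call such as `scrape_with_chrome("https://example.com", "h1")` would return the page as the browser sees it. Waiting for a specific element matters here, because the blocks filled in by AJAX usually appear some time after the initial load.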

Selenium also lets you write scripts in several popular programming languages (there are official bindings for Python, Java, C#, Ruby, and JavaScript), but Python is used most often. Selenium WebDriver technology is a key component of many open source and proprietary automation tools. Selenium allows you to control the browser remotely, so you can build distributed setups consisting of many machines with different operating systems and browsers, and even run browsers in the cloud.
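Remote control could be sketched like this, assuming a Selenium Grid hub is already running on another machine and listening on Grid's default port 4444. The host name and both helper functions are hypothetical.

```python
def hub_url(host, port=4444):
    """Build the address of a Selenium Grid hub (4444 is Grid's default port)."""
    return f"http://{host}:{port}"


def open_remote_browser(host, browser="chrome"):
    """Start a browser session on a remote machine and return its driver."""
    # Imported here so this file can be loaded without selenium installed.
    from selenium import webdriver

    option_classes = {
        "chrome": webdriver.ChromeOptions,
        "firefox": webdriver.FirefoxOptions,
    }
    options = option_classes[browser]()
    # The script runs locally; the browser itself runs on the remote machine.
    return webdriver.Remote(command_executor=hub_url(host), options=options)
```

The driver returned by `open_remote_browser("grid.example.internal", "firefox")` is used exactly like a local one, so the scraping code does not need to know where the browser actually runs.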

Thus, I have introduced you to the difficulties of extracting data from websites. Finally, a link to a cool picture on my Instagram. Ask questions in the comments.
