Thursday, April 29, 2021

Social networks

A social network is an Internet platform where you can post information about yourself and exchange information, photos, messages, and various files with other users.

What social networks are there?

Facebook - a truly international platform with more than one billion users. It was founded in 2004.

Twitter - a social network that lets you share short messages (mainly news) and has over 330 million active users. It launched in 2006.

Dev.to - a social network for programmers, where I also created a profile. It was easy to do; the interface is very user-friendly and pleasant.

LinkedIn - this network is positioned more as a professional one and has more than 590 million users. It launched in 2003.

Komunity.io - a young professional network ("professional networking redefined"), somewhat similar to LinkedIn, where I also opened an account.

Instagram - a social network that lets you share photos and videos. It has over 850 million users and launched in 2010.

The influence of social networks on people's lives is enormous, and many people do not fully realize the scale of this phenomenon. Social networks are already the most popular activity on the Internet. Today, of the 100 most visited sites in the world, 25 are classic social networks, and another 60 have social features to one degree or another. More than 90% of companies around the world use social media in their work. About 85% of people trust information from social networks. Entire revolutions have even been organized through them. Social networks have become the very center of the modern Internet.

Monday, April 26, 2021

Regular expressions and XPath

Writing regular expressions is not an easy task. I confess that I used to dislike it: you do everything logically correctly, yet the regular expression does not give the desired result. Regular expressions are seldom used for web scraping; many people use XPath instead.

What is XPath?

XPath is a language for querying elements of an XML document or HTML code. It is designed to provide access to parts of an XML or HTML document by navigating its DOM tree.
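As a minimal sketch of what such a query looks like, the snippet below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup. The page fragment, class names and prices are made up for illustration; real-world HTML usually calls for a third-party parser such as lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a scraped page.
PAGE = """
<html>
  <body>
    <div class="product">
      <h2>Blue kettle</h2>
      <span class="price">24.90</span>
    </div>
    <div class="product">
      <h2>Red toaster</h2>
      <span class="price">31.50</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# ".//div[@class='product']" selects every <div class="product"> at any depth.
products = []
for node in root.findall(".//div[@class='product']"):
    name = node.findtext("h2")
    price = float(node.findtext("span[@class='price']"))
    products.append((name, price))

print(products)
```

Each path expression describes where an element sits in the tree, so the query keeps working even if the text inside the elements changes.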

Despite this, I still use regular expressions. You can read more about regular expressions in my blog, and here I will tell you why I decided to use regular expressions for parsing.

Sometimes the information sits inside a piece of JavaScript, and after you get it using XPath you still need to process it further. This is where regular expressions come in. So why not use regular expressions for the first pass of scraping, then run another regular expression over the result and, if necessary, apply a third pass? It seems convenient and logical to me.
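The case of data hidden inside JavaScript can be sketched roughly as follows. This is an illustration under assumptions, not my production code: the page, the variable name `productData` and its fields are invented, and the example relies on the captured object literal being valid JSON, which is not always true of real scripts.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical page: the interesting data lives inside a <script> element.
PAGE = """
<html>
  <head>
    <script>var productData = {"sku": "A-17", "price": 24.9, "inStock": true};</script>
  </head>
  <body><h1>Blue kettle</h1></body>
</html>
"""

# Step 1: an XPath query narrows the document down to the script text.
script_text = ET.fromstring(PAGE).findtext(".//script")

# Step 2: a regular expression cuts the object literal out of the JavaScript.
# The non-greedy pattern assumes the object contains no nested braces.
match = re.search(r"var\s+productData\s*=\s*(\{.*?\})\s*;", script_text)

# Step 3: the captured text is valid JSON in this example, so json can parse it.
product = json.loads(match.group(1)) if match else None

print(product)
```

XPath is of no help inside the script text, because to the parser it is just a string; that is the point where a regular expression takes over.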

You can visit my Zintro profile to find out more about me.

Friday, April 23, 2021

The hard days of a web scraper

The morning begins with a review of my mail, the analytics of my web scraping site, and new scraping orders. Then I dig into each technical specification and build a strategy for obtaining the data: I analyze the source sites for bot protection and dynamic content.

What is a dynamic site?

A dynamic site, or JavaScript site, uses AJAX requests and consists of web pages whose content changes after they load. The source code of such pages is generated when the HTML file is processed by the interpreter of some programming language. In other words, after your browser receives the HTML code, CSS and JavaScript files from the server, it starts rendering the page, loading additional information (pictures, text) into separate blocks, and it can also change these blocks. After all these manipulations, the HTML code changes beyond recognition.

It is difficult to retrieve data from such sites. Processing the pages takes much longer because JavaScript needs time to render. To scrape such sites, you need a browser and Selenium.

What is Selenium?

Most often, this technology is used to automate the testing of web applications. However, with the help of Selenium WebDriver you can automate any repetitive actions performed through the browser and, in our case, retrieve the HTML code of web pages.

Selenium also lets you write scripts in many popular programming languages, such as Java, C#, Ruby and JavaScript; for scraping, Python is a common choice. Selenium WebDriver is a key component of many open source and proprietary automation tools. Selenium allows you to control the browser remotely, so you can build distributed setups consisting of many machines with different operating systems and browsers, and even run browsers in the cloud.
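A minimal sketch of getting rendered HTML with Selenium WebDriver in Python might look like this. It assumes the `selenium` package and a Chrome browser are installed; the function name `fetch_rendered_html` and its parameters are my own illustration, not part of Selenium.

```python
def fetch_rendered_html(url, css_selector, timeout=10):
    """Return the page's HTML once an element matching css_selector exists."""
    # Selenium is imported inside the function so that this module can still
    # be loaded on a machine where the package is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until JavaScript has added the element we care about,
        # or raise a timeout error after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if loading or waiting failed.
        driver.quit()
```

Calling `fetch_rendered_html("https://example.com/", "h1")` should return the HTML as the browser sees it after rendering, which can then be handed to XPath or regular expressions. Waiting for a specific element is usually more reliable than sleeping for a fixed number of seconds.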

That is my introduction to the difficulties of extracting data from websites. Finally, here is a link to a cool picture on my Instagram. Ask questions in the comments.