Today we will do deep-web parsing: a program that steals content from someone else's site

Author: Aren Proger 10-12-2020

The official definition says that parsing is needed to collect and analyze data. In practice, in 9 out of 10 cases, parsers are written precisely to take someone else's content. The sources can be blog sites, from which articles are taken, groups and channels on Facebook or Telegram, from which posts are taken, or even the simplest adult sites, which are very often filled exclusively by parsers.

It may seem that writing your own parser is a very difficult task, but in fact everything turns out to be a billion times easier than it looks, especially if you already know CSS. As an example, let's write a simple parser in the Python programming language that parses the list of games on the stopgame site that have the "exclusive" rating.

The first thing to do is, of course, to open the stopgame site in the browser and go to the page with game reviews. As you can see, the site has a separate link for the page that shows the games rated "exclusive", which is exactly what we need. It is both good and bad that such a link exists; a little later I will explain why.

By a simple calculation, the page displays 20 games and there are only six pages, so 20 multiplied by 6 suggests there should be 120 games with this rating. If you count them yourself, however, there are only 114. The explanation is simple: it is a feature of the page navigation. Moving between pages changes the link simply by modifying the address, and if we inspect the DOM of the page, we see that the list of games is just a set of blocks with the class "lent-block".
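
The page arithmetic and the address pattern described above can be sketched in a few lines. This is only an illustration: the helper names pages_needed and page_urls are made up for this example, the numbers (20 per page, 114 in total) come from the count in the text, and the idea that the last page is simply not full is an assumption about why 114 is less than 120.

```python
import math

# Values taken from the text: 20 games per page, 114 games counted by hand.
PER_PAGE = 20
TOTAL_GAMES = 114

# Address prefix used by the parser code in this article; the page number is appended.
BASE_URL = "https://stopgame.ru/review/new/stopchoice/p"


def pages_needed(total, per_page):
    """Number of pages required to show `total` items, `per_page` at a time."""
    return math.ceil(total / per_page)


def page_urls(base, max_page):
    """Build the address of every page from 1 to max_page inclusive."""
    return [base + str(n) for n in range(1, max_page + 1)]


max_page = pages_needed(TOTAL_GAMES, PER_PAGE)
# Assumption: every page except the last one is full.
last_page_count = TOTAL_GAMES - (max_page - 1) * PER_PAGE

print(max_page)         # 6
print(last_page_count)  # 14
for url in page_urls(BASE_URL, max_page):
    print(url)
```

Under that assumption, six pages are still needed for 114 games, and the last one would hold 14 of them instead of 20.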

All that remains is to write this code:

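import requests
from bs4 import BeautifulSoup as BS

max_page = 6  # the review list has six pages
pages = []

# Download every page; only the number at the end of the address changes
for x in range(1, max_page + 1):
    pages.append(requests.get('https://stopgame.ru/review/new/stopchoice/p' + str(x)))

for r in pages:
    html = BS(r.content, 'html.parser')

    # Each game sits in a block with the class "lent-block".
    # This loop has to run inside the page loop, otherwise only the last page is parsed.
    for el in html.select('.lent-block'):
        title = el.select('.lent-title > a')
        if title:  # skip blocks that have no title link
            print(title[0].text)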
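
To see what the title extraction does without touching the network, here is a minimal offline sketch. It deliberately swaps BeautifulSoup for html.parser from the Python standard library, so it is not the method used above, only a rough imitation of it. The sample markup and the TitleParser class are invented for this example, and the real stopgame page may be structured differently.

```python
from html.parser import HTMLParser

# Made-up markup that imitates the structure described in the article.
SAMPLE_HTML = """
<div class="lent-block">
  <div class="lent-title"><a href="/game/one">Game One</a></div>
</div>
<div class="lent-block">
  <div class="lent-title"><a href="/game/two">Game Two</a></div>
</div>
"""


class TitleParser(HTMLParser):
    """Collects the text of the first link that follows each 'lent-title' element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False  # saw a 'lent-title' element, waiting for its link
        self._in_link = False   # currently inside that link
        self._text = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if "lent-title" in classes:
            self._in_title = True
        elif tag == "a" and self._in_title:
            self._in_link = True
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.titles.append("".join(self._text).strip())
            self._in_link = False
            self._in_title = False


parser = TitleParser()
parser.feed(SAMPLE_HTML)
print(parser.titles)  # ['Game One', 'Game Two']
```

This sketch is cruder than the '.lent-title > a' selector: it takes the next link after a "lent-title" element rather than checking that the link is a direct child, which is enough for the sample but may not be for a real page.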

To go deeper, explore the requests and BeautifulSoup libraries.