
Crawling a novel with Python: fun and practical

2022-02-26 09:05:45


1. First, import the relevant modules

import os
import requests
from bs4 import BeautifulSoup

# Declare the request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'
}

# Create the folder where the novel text is stored
if not os.path.exists('./novel'):
    os.mkdir('./novel/')
# Access the site and get the page data
response = requests.get('http://www.biquw.com/book/1/').text
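The folder check above can also be done in a single call. This is a minimal sketch, assuming the same `novel` folder name; it works inside a temporary directory (an addition for the example only) so that running it leaves nothing behind.

```python
import os
import tempfile

# Work inside a throwaway directory so the example does not touch the working directory.
with tempfile.TemporaryDirectory() as base:
    novel_dir = os.path.join(base, 'novel')

    # exist_ok=True makes the call a no-op if the folder already exists,
    # so no separate os.path.exists() check is needed.
    os.makedirs(novel_dir, exist_ok=True)
    os.makedirs(novel_dir, exist_ok=True)  # the second call does not raise

    created = os.path.isdir(novel_dir)

print(created)
```

In the crawler itself this would reduce to `os.makedirs('./novel', exist_ok=True)`.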

2. Send a request to the site and get the site data


# Rewrite the access code so the response is decoded with the detected encoding
response = requests.get('http://www.biquw.com/book/1/')
response.encoding = response.apparent_encoding

With the encoding set this way, the Chinese text in response.text decodes correctly.
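Why the encoding matters can be shown without any network access. The sketch below assumes the page bytes are GBK-encoded, which is an assumption about the site and not something stated here; the sample string is also made up. It decodes the same bytes once with ISO-8859-1, the charset `requests` falls back to for text responses that declare none, and once with the real charset.

```python
# A short Chinese string, encoded the way a GBK-served page would send it (assumed charset).
original = '小说章节'
raw_bytes = original.encode('gbk')

# Decoding with the wrong charset does not raise; it silently produces garbled text.
garbled = raw_bytes.decode('iso-8859-1')

# Decoding with the charset the bytes were written in recovers the text.
recovered = raw_bytes.decode('gbk')

print(garbled == original)
print(recovered == original)
```

Setting `response.encoding` before reading `response.text` plays the role of the second decode.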





3. After getting the data, extract the data from the page


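  1. First, open the browser
  2. Press F12 to open the developer tools
  3. Click the element selector
  4. Select the data you want on the page to locate its element
  5. Check the tag of the element that holds the data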
The inspector shows that the chapter data is stored in a tags. Each a tag's parent is an li, each li's parent is a ul, and the ul sits inside a div. To get the chapter data for the whole page, get that div first. The div has a class attribute, so we can select it by class, as the code shows:
# 'lxml' selects the lxml HTML parser; BeautifulSoup turns the HTML into Python objects we can navigate
soup = BeautifulSoup(response.text, 'lxml')
book_list = soup.find('div', class_='book_list').find_all('a')
# find_all returns a list of tags, so we can iterate over it to extract each chapter
for book in book_list:
    book_name = book.text
    # The link to the chapter detail page is in the href attribute of the a tag
    book_url = book['href']
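The div > ul > li > a nesting can be tried offline on a small piece of markup. This sketch deliberately swaps BeautifulSoup for the standard library's `html.parser` so it has no dependencies; the sample HTML and the `ChapterLinkParser` class are hypothetical and only mirror the structure described above.

```python
from html.parser import HTMLParser

# Hypothetical markup mirroring the nesting described above.
SAMPLE_HTML = """
<div class="nav"><a href="/">Home</a></div>
<div class="book_list">
  <ul>
    <li><a href="1.html">Chapter 1</a></li>
    <li><a href="2.html">Chapter 2</a></li>
  </ul>
</div>
"""


class ChapterLinkParser(HTMLParser):
    """Collect (title, href) pairs for <a> tags inside div.book_list."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0        # > 0 while we are inside the target div
        self.current_href = None  # href of the <a> currently being read
        self.current_text = []
        self.chapters = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            if self.div_depth > 0:
                self.div_depth += 1
            elif 'book_list' in (attrs.get('class') or '').split():
                self.div_depth = 1
        elif tag == 'a' and self.div_depth > 0:
            self.current_href = attrs.get('href')
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == 'a' and self.current_href is not None:
            title = ''.join(self.current_text).strip()
            self.chapters.append((title, self.current_href))
            self.current_href = None


parser = ChapterLinkParser()
parser.feed(SAMPLE_HTML)
for title, href in parser.chapters:
    print(title, href)
```

The link in the `nav` div is skipped because it is outside the `book_list` div, which is the same effect the class-based `find` has in the BeautifulSoup version.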

4. After getting the link to the novel's detail page, request the detail page and get the article data

book_info_html = requests.get('http://www.biquw.com/book/1/' + book_url, headers=headers)
book_info_html.encoding = book_info_html.apparent_encoding
soup = BeautifulSoup(book_info_html.text, 'lxml')
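Plain string concatenation only works if every href is relative to the book page. As a hedged alternative, `urllib.parse.urljoin` resolves relative, root-relative, and absolute hrefs alike. The three href values below are hypothetical examples, not values taken from the site.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.biquw.com/book/1/'

# Hypothetical href values of the kinds a chapter list might contain.
hrefs = [
    '12.html',                              # relative to the book page
    '/book/1/12.html',                      # relative to the site root
    'http://www.biquw.com/book/1/12.html',  # already absolute
]

resolved = [urljoin(BASE_URL, href) for href in hrefs]
for url in resolved:
    print(url)
```

All three forms resolve to the same chapter URL, so the request line could use `urljoin('http://www.biquw.com/book/1/', book_url)` in place of `+`.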

5. Parse the static page of the novel detail page

info = soup.find('div', id='htmlContent')

6. Download the data

with open('./novel/' + book_name + '.txt', 'a', encoding='utf-8') as f:
    # Assumption: write the text of the content div found in step 5
    f.write(info.text)
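Chapter titles can contain characters that are not valid in file names on some systems. The sketch below shows one way to guard against that; `safe_filename` and `save_chapter` are hypothetical helpers, and the title and text are made-up sample values written to a temporary folder.

```python
import re
import tempfile
from pathlib import Path


def safe_filename(title):
    """Hypothetical helper: replace characters that are invalid in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    return cleaned or 'untitled'


def save_chapter(folder, title, text):
    """Append chapter text to <folder>/<title>.txt and return the path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / (safe_filename(title) + '.txt')
    with open(path, 'a', encoding='utf-8') as f:
        f.write(text)
    return path


# Write into a temporary folder so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_chapter(Path(tmp) / 'novel', 'Chapter 1: What?', 'Sample text.')
    saved_name = saved.name
    saved_text = saved.read_text(encoding='utf-8')

print(saved_name)
```

In the crawler, the call would look like `save_chapter('./novel', book_name, info.text)`, assuming `info` is the content div from step 5.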


