Crawling images from mm131 with Python (updated)

2022-02-17 08:35:46

Crawling images with Python (LSP edition)

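Table of contents

Preface
I. What do you need?
II. Assignment template
- 1. Parsing the data at the URL (every crawler has to analyze its target URL; because this is an LSP URL, the analysis is not shown here)
- 2. Applying the template
Summary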

Preface

My teacher assigned a homework exercise on crawling Youth With You 2. Here I borrow that homework template and try it out on crawling images.

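I. What do you need?

You need basic Python. Use whatever software you are used to. One option is the Anaconda distribution, which lets you write code in the browser and means you do not have to install Python packages separately, which is convenient.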

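II. Assignment template

1. Parsing the data at the URL (every crawler has to analyze its target URL; because this is an LSP URL, the analysis is not shown here)

2. Applying the template

Step 1: fetch the part of the HTML you need from the URL.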

import json
import re
import requests
from bs4 import BeautifulSoup
import os
import datetime
# Date stamp used in the name of the JSON file, e.g. 20220217
today = datetime.date.today().strftime('%Y%m%d')
def crawl_wiki_data(n):
    """Fetch list pages 1..n and pass the body of each page on for parsing"""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    # Keep the prefix unchanged and build each page URL from it
    base_url = 'https://m.mm131.net/more.php?page='
    for page in range(1, int(n) + 1):
        url = base_url + str(page)
        print(url)
        response = requests.get(url, headers=headers)
        print(response.status_code)
        soup = BeautifulSoup(response.content, 'lxml')
        content = soup.find('body')
        parse_wiki_data(content)
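
The page URLs in this step are just the fixed prefix followed by the page number. As a minimal sketch of that step on its own, with no network access, the helper below builds the list of URLs; `page_urls` and `BASE_URL` are names I made up for illustration and are not part of the script.

```python
# Prefix copied from the script; page_urls is a hypothetical helper name.
BASE_URL = 'https://m.mm131.net/more.php?page='


def page_urls(n):
    """Return the list-page URLs for pages 1..n.

    n may be an int or the string returned by input().
    """
    return [BASE_URL + str(page) for page in range(1, int(n) + 1)]


if __name__ == '__main__':
    for url in page_urls('3'):
        print(url)
```

Building the list first makes it easy to check the URLs before any request is sent.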

Step 2: from that part of the HTML, extract the name to use for each directory and the link address of each gallery.

def parse_wiki_data(content):
    """
    Extract the name and link of each gallery from the page body and
    write them to a JSON file in C:/Users/19509/Desktop/python/girls/
    """
    girls = []
    bs = BeautifulSoup(str(content), 'lxml')
    all_article = bs.find_all('article')
    for article in all_article:
        title_link = article.find('a', class_="post-title-link")
        girl = {}
        # gallery name
        girl["name"] = title_link.text
        # gallery link: the site prefix is prepended to the href
        girl["link"] = "https://m.mm131.net" + title_link.get('href')
        girls.append(girl)
    # Dump the list directly; going through str() and swapping quote
    # characters breaks as soon as a name contains a quote
    with open('C:/Users/19509/Desktop/python/girls/'+today+'.json', 'w', encoding='UTF-8') as f:
        json.dump(girls, f, ensure_ascii=False)
    crawl_pic_urls()
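
A list of dictionaries can be handed to the `json` module as it is. The sketch below uses a made-up sample entry (the name and link are not real data) to show why converting `str(girls)` into JSON by swapping quote characters is fragile, while `json.dumps` round-trips the same data.

```python
import json

# Made-up sample entry; the apostrophe in the name is the interesting part.
girls = [{"name": "It's a test", "link": "https://m.mm131.net/example"}]


def via_str_replace(items):
    """Turn the repr of the list into JSON by swapping quote characters."""
    return json.loads(str(items).replace("'", '"'))


def via_json(items):
    """Serialize with json.dumps and parse the result back."""
    return json.loads(json.dumps(items, ensure_ascii=False))


try:
    via_str_replace(girls)
    replace_ok = True
except json.JSONDecodeError:
    # The apostrophe becomes a stray double quote, so the text is not valid JSON.
    replace_ok = False

roundtrip_ok = via_json(girls) == girls
```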

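Step 3: read the gallery links from the JSON file, crawl each gallery for its images, store the link to every image in a list, and pass that list to the next function, which downloads the images.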

def crawl_pic_urls():
    """
    Crawl the links to the images in each album
    """
    with open('C:/Users/19509/Desktop/python/girls/'+today+'.json', 'r', encoding='UTF-8') as file:
        json_array = json.load(file)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    for girl in json_array:
        name = girl['name']
        link = girl['link']
        # fetch the first page of the album
        response = requests.get(link, headers=headers)
        bs = BeautifulSoup(response.content, 'lxml')
        # the pager text holds two numbers; the second is used as the image count
        pic = bs.find('div', class_="paging").find('span', class_="rw").text
        pic = re.findall(r"\d+", pic)
        pic_number = int(pic[1]) + 1
        # link to the first image
        pic_url = bs.find('div', class_="post-content single-post-content").find('img').get('src')
        pic_urls = [pic_url]
        # the other links differ only in the character at index 33 of the first
        # link (the image number), so substitute it; this index is site-specific
        for m in range(2, pic_number):
            pic_urls.append(pic_url[:33] + str(m) + pic_url[34:])
        # image downloads use different headers: the album page is sent as Referer
        pic_headers = {"Referer": link,
                       "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 SLBrowser/7.0.0.9 SLBChan/25"}
        down_pic(name, pic_urls, pic_headers)
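
Two small text operations carry this step: pulling the image count out of the pager text, and deriving the other image links from the first one. The sketch below shows both without touching the network. The pager text `'1/45'` and the `img.example.com` link are made-up samples, and `page_count` and `numbered_urls` are hypothetical helper names. Instead of the fixed character position used in the script, this variant replaces the number in front of the file extension, which is an alternative I am assuming fits links of the form `.../1.jpg`.

```python
import re


def page_count(pager_text):
    """Return the second number in the pager text, taken as the image count."""
    numbers = re.findall(r"\d+", pager_text)
    return int(numbers[1])


def numbered_urls(first_url, total):
    """Build the links for images 1..total from the link to the first image."""
    urls = []
    for i in range(1, total + 1):
        # Replace the digits before the extension; \g<1> re-inserts the
        # captured extension such as ".jpg".
        urls.append(re.sub(r"\d+(\.\w+)$", rf"{i}\g<1>", first_url))
    return urls


if __name__ == '__main__':
    first = 'https://img.example.com/pic/5545/1.jpg'  # made-up sample link
    for url in numbered_urls(first, page_count('1/45'))[:3]:
        print(url)
```

Matching on the file name rather than on a character index keeps working if the host name or the album number changes length.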

Step 4: download and save the images.

def down_pic(name, pic_urls, headers):
    """Download images"""
    path = 'C:/Users/19509/Desktop/python/girls/' + 'pic/' + name + '/'
    if not os.path.exists(path):
        os.makedirs(path)
    for i, pic_url in enumerate(pic_urls):
        try:
            pic = requests.get(pic_url, headers=headers)
            # treat HTTP errors (403, 404, ...) as failures instead of
            # saving the error page as a .jpg file
            pic.raise_for_status()
            string = str(i+1) + '.jpg'
            with open(path + string, 'wb') as f:
                f.write(pic.content)
            print('Successfully downloaded the %s picture: %s' % (str(i+1), str(pic_url)))
        except Exception as e:
            # report the failure and move on to the next image
            print('Failed to download the %s image:%s' % (str(i+1), str(pic_url)))
            print(e)
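
The album name scraped from the page goes straight into the folder path, so a name containing a character that Windows does not allow in folder names would make the folder creation fail. A minimal sketch of one way to guard against that is below; `safe_dirname` is a hypothetical helper, not part of the script, and the fallback name `untitled` is an arbitrary choice.

```python
import re


def safe_dirname(name, replacement="_"):
    """Make a scraped title usable as a folder name on Windows.

    Replaces the characters \\ / : * ? " < > | and strips leading and
    trailing spaces and dots.
    """
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, name).strip(" .")
    # Fall back to a fixed name if nothing usable is left.
    return cleaned or "untitled"


if __name__ == '__main__':
    print(safe_dirname('a/b:c?'))
```

The result could then be used in place of the raw name when building the download path.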

Finally, write a function that prints the absolute path of every downloaded file, and a main block that runs all of the functions.

def show_pic_path(path):
    """Iterate over each image crawled and print the absolute path of all images"""
    pic_num = 0
    for (dirpath, dirnames, filenames) in os.walk(path):
        for filename in filenames:
            pic_num += 1
            print("%dth pic: %s" % (pic_num, os.path.join(dirpath, filename)))
    # print the total once, after every directory has been walked
    print("Total crawled lsp pictures %d pictures" % pic_num)
if __name__ == '__main__':
    n = input('How many pages do you want:')
    crawl_wiki_data(n)
    # Print the paths of the crawled pictures
    show_pic_path('C:/Users/19509/Desktop/python/girls/pic')
    print("All information crawled! Thanks")

Note: 'C:/Users/19509/Desktop/python/girls' is my directory, not yours, so you need to create your own directory with a corresponding girls folder and change the paths in the code to match.
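
If you would rather not create the folders by hand, they can be created from code. The sketch below is one way to do it with `pathlib`; `prepare_dirs` is a hypothetical helper, and the example runs in a temporary directory so it does not touch your real folders.

```python
import tempfile
from pathlib import Path


def prepare_dirs(base_dir):
    """Create base_dir and its pic/ subfolder if missing; return both paths."""
    base = Path(base_dir)
    pic_dir = base / 'pic'
    # parents=True also creates base itself; exist_ok avoids an error on reruns.
    pic_dir.mkdir(parents=True, exist_ok=True)
    return base, pic_dir


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        base, pic_dir = prepare_dirs(Path(tmp) / 'girls')
        print(base, pic_dir)
```

Pass your own directory to `prepare_dirs` once at startup, then use the same directory wherever the script has my path hard-coded.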

Summary

This article involves quite a few small details, for example that the headers used to download the images differ from the headers used earlier, and that a regular expression is used to extract the number of images in a set. One problem remains: I could not get the names of the image sets to come out in Chinese. I do not know whether anyone out there knows how to fix it!
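
One possible cause of the name problem, and this is only my assumption since I have not checked it against the site, is a character-set mismatch: the page bytes are in one encoding and get decoded as another. The sketch below reproduces that effect with a made-up title and GBK as the assumed page encoding.

```python
title = '图集标题'  # made-up sample title

# Bytes as a GBK-encoded page would send them (GBK is an assumption here).
raw = title.encode('gbk')

# Decoding with the wrong codec gives replacement characters, not the title.
wrong = raw.decode('utf-8', errors='replace')

# Decoding with the matching codec recovers the original text.
right = raw.decode('gbk')

if __name__ == '__main__':
    print(repr(wrong))
    print(right)
```

If this is what is happening, setting `response.encoding` to the page's actual encoding before reading `response.text` in requests would be one thing to try.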