AttributeError: 'NoneType' object has no attribute 'get'

2022-02-21 05:08:51

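This is a problem I ran into recently while writing a crawler for Zhihu:
AttributeError: 'NoneType' object has no attribute 'get'
The message means that the object is None, the empty object, so it has no get attribute. This typically happens when a lookup such as BeautifulSoup's find() matches nothing and returns None.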
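
As a minimal sketch of the error itself (not part of the original program), the snippet below calls .get() on None and then shows a guarded lookup. The variable names tag, message and value are made up for illustration, and tag is set to None by hand to stand in for a lookup that found nothing.

```python
# Stand-in for a lookup that matched nothing; BeautifulSoup's find() returns None in that case.
tag = None

try:
    # None has no .get() method, so this line raises AttributeError.
    tag.get("value")
except AttributeError as err:
    message = str(err)
    print(message)

# Checking for None before calling .get() avoids the exception.
value = tag.get("value") if tag is not None else None
print(value)
```

The guarded form leaves value as None instead of raising an exception, which lets the caller decide what to do when the tag is missing.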

The complete program is as follows.

#! /usr/bin/env python
# -*- coding:utf-8 -*-

from bs4 import BeautifulSoup
import requests
import time

def captcha(captcha_data):
    with open("captcha.jpg", "wb") as f:
        f.write(captcha_data)
    text = input("Please enter a verification code: ")
    # return the verification code entered by the user
    return text

def zhihuLogin():
    # Build a Session object that can hold page cookies
    sess = requests.Session()

    # Request headers
    headers = {"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

    # First fetch the login page: it carries the data that must be POSTed back (_xsrf), and the session records the page's cookies
    html = sess.get("https://www.zhihu.com/#signin", headers = headers).text

    # Parse the page with the lxml parser
    bs = BeautifulSoup(html, "lxml")

    # _xsrf protects against CSRF (cross-site request forgery), an attack that abuses the trust a website places in a logged-in user
    # The attacker disguises a request as coming from a user the site already trusts (via that user's cookies) in order to steal information or deceive the web server
    # To counter this, the site places a token string (described here as an MD5 string) in a hidden form field and uses it to check the user's cookie against the server-side session

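    # Find the input tag whose name attribute is _xsrf and read its value attribute
    # Zhihu no longer serves this field, so find() returns None and calling .get() on it raises the AttributeError
    # The lookup is therefore commented out, together with the "_xsrf" entry in data
    # _xsrf = bs.find("input", attrs={"name": "_xsrf"}).get("value")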

    # Build the captcha URL from the current UNIX timestamp in milliseconds
    captcha_url = "https://www.zhihu.com/captcha.gif?r=%d&type=login" % (time.time() * 1000)
    # Send a request for an image, get the image data stream
    captcha_data = sess.get(captcha_url, headers = headers).content
    # Get the text in the captcha, which needs to be entered manually
    text = captcha(captcha_data)

    data = {
        # "_xsrf" : _xsrf,
        "username" : "***",
        "password" : "***",
        "captcha" : text
    }

    # Send the POST data needed to log in and get the cookies after logging in (saved in sess)
    response = sess.post("https://www.zhihu.com/login/email", data = data, headers = headers)
    # print(response.text)

    # send a request with a cookie that has a login status to get the source code of the target page
    response = sess.get("https://www.zhihu.com/people/hrycici/activities", headers = headers)
    with open("my.html", "wb") as f:
        f.write(response.text.encode("utf-8"))

if __name__ == "__main__":
    zhihuLogin()



The program above uses _xsrf, which used to be part of Zhihu's login mechanism but appears to have been removed. Once that part is commented out, the program logs in to Zhihu without any problem.
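
An alternative to commenting the field out is to include _xsrf in the POST data only when the page actually contains it. The following is a rough sketch under that assumption, not the original author's method. The helper name build_login_data and its parameters are hypothetical, and xsrf_tag is assumed to be whatever bs.find("input", attrs={"name": "_xsrf"}) returned.

```python
def build_login_data(xsrf_tag, username, password, captcha_text):
    """Build the login form data, adding _xsrf only when the tag was found."""
    data = {
        "username": username,
        "password": password,
        "captcha": captcha_text,
    }
    # xsrf_tag is None when the hidden field is absent from the page,
    # so the key is added only when a tag was actually found.
    if xsrf_tag is not None:
        data["_xsrf"] = xsrf_tag.get("value")
    return data


# Field absent (the situation described above): no _xsrf key is sent.
print(build_login_data(None, "***", "***", "abcd"))

# Field present: a plain dict stands in for the parsed tag here, since both offer .get().
print(build_login_data({"value": "token"}, "***", "***", "abcd"))
```

Written this way, the same code keeps working whether or not the site serves the hidden field, at the cost of one extra check.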