Selenium + Python options for configuring the Chrome browser

2022-02-25 23:46:45

Options for configuring the Chrome browser with Selenium and Python.

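1. Background

When Selenium renders pages in a browser to crawl a website, it starts a plain, unmodified Chrome by default. In everyday browsing, however, we usually add plugins, extensions, proxies and other tools. Likewise, when crawling a site with Chrome, we may need to apply some special settings so that the browser behaves the way the crawler needs.
Commonly used settings include:
Disabling image and video loading: speeds up page loading.
Adding a proxy: used to get past a firewall to reach certain pages, and as a countermeasure against per-IP access limits.
Using a mobile user agent: gives access to mobile sites, which generally have weaker anti-crawling measures.
Adding extensions: lets the browser work like a normally configured one.
Setting the encoding: handles Chinese sites and prevents garbled text.
Blocking JavaScript execution.
...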

2. Environment

Python 3.6.1
OS: Windows 7
IDE: PyCharm
Chrome browser installed
chromedriver configured
Selenium 3.7.0

3. Chrome options

ChromeOptions is the class that configures Chrome's startup properties. Through it you can set the following Chrome parameters (this can be verified in the Selenium source code):
Set the location of the Chrome binary (binary_location)
Add startup arguments (add_argument)
Add extensions (add_extension, add_encoded_extension)
Add experimental options (add_experimental_option)
Set the debugger address (debugger_address)
The source code:

# . \Lib\site-packages\selenium\webdriver\chrome\options.py
class Options(object):

    def __init__(self):
        # Set the chrome binary location
        self._binary_location = ''
        # Add startup arguments
        self._arguments = []
        # Add extensions
        self._extension_files = []
        self._extensions = []
        # Add experimental setup parameters
        self._experimental_options = {}
        # Set the debugger address
        self._debugger_address = None

Usage example

# Set the default encoding to utf-8, i.e. Chinese

from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('lang=zh_CN.UTF-8')
driver = webdriver.Chrome(chrome_options = options)
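The kinds of settings listed in this section can also be applied in one place. The following is a minimal sketch and not part of the original article: `configure_options`, `start_chrome` and their parameter names are hypothetical, while `binary_location`, `add_argument`, `add_extension`, `add_experimental_option` and `debugger_address` are the members of the Options class named above. Selenium is imported lazily, so the helper can be exercised with any stand-in object that has the same members.

```python
def configure_options(options, binary_location=None, arguments=(),
                      extensions=(), prefs=None, debugger_address=None):
    """Apply several kinds of Chrome settings to an Options-like object.

    `options` is expected to be a selenium ChromeOptions instance; any object
    with the same attributes and methods also works (hypothetical helper).
    """
    if binary_location:
        # Location of the Chrome binary to launch
        options.binary_location = binary_location
    for argument in arguments:
        # Startup arguments such as 'lang=zh_CN.UTF-8'
        options.add_argument(argument)
    for path in extensions:
        # Packed extensions (.crx files)
        options.add_extension(path)
    if prefs:
        # Experimental option holding browser preferences
        options.add_experimental_option("prefs", prefs)
    if debugger_address:
        # Address of an already running Chrome to attach to
        options.debugger_address = debugger_address
    return options


def start_chrome(**settings):
    """Build the options and start Chrome (needs selenium and chromedriver)."""
    from selenium import webdriver  # imported lazily on purpose

    options = configure_options(webdriver.ChromeOptions(), **settings)
    # chrome_options is the Selenium 3.x keyword used throughout this article.
    return webdriver.Chrome(chrome_options=options)
```

Calling `start_chrome(arguments=['lang=zh_CN.UTF-8'])` would then correspond to the usage example above.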

4. Common settings

Official reference: https://sites.google.com/a/chromium.org/chromedriver/capabilities

4.1. Setting the encoding

# Set the default encoding to utf-8, i.e. Chinese

from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('lang=zh_CN.UTF-8')
driver = webdriver.Chrome(chrome_options = options)

4.2. Simulating a mobile device

A list of mobile user-agent strings: http://www.fynas.com/ua
Mobile versions of sites tend to have weaker anti-crawler measures, which is why this is useful.

# Emulate a mobile device by setting the user agent.
# The value is passed without inner quotes: add_argument does not go through a shell,
# so quotes would become part of the user-agent string itself.
# For example, to emulate the Android QQ browser
options.add_argument('--user-agent=MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1')

# Simulate iPhone 6
options.add_argument('--user-agent=Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1')
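Changing only the user agent leaves the window size and pixel ratio at desktop values. ChromeDriver also accepts a "mobileEmulation" experimental option that sets device metrics together with the user agent. The sketch below is an illustration rather than the article's method: `mobile_emulation` and `mobile_options` are hypothetical helper names, and the 375x667 viewport with pixel ratio 2.0 is an assumed approximation of an iPhone 6.

```python
# User agent taken from the iPhone 6 example above
IPHONE6_UA = ("Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) "
              "AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 "
              "Mobile/13B143 Safari/601.1")


def mobile_emulation(user_agent, width=375, height=667, pixel_ratio=2.0):
    """Build the dictionary for ChromeDriver's mobileEmulation option.

    The default metrics are assumed values for an iPhone 6 sized screen.
    """
    return {
        "deviceMetrics": {
            "width": width,
            "height": height,
            "pixelRatio": pixel_ratio,
        },
        "userAgent": user_agent,
    }


def mobile_options(user_agent=IPHONE6_UA):
    """Return ChromeOptions that emulate a mobile device (needs selenium)."""
    from selenium import webdriver  # imported lazily on purpose

    options = webdriver.ChromeOptions()
    options.add_experimental_option("mobileEmulation",
                                    mobile_emulation(user_agent))
    return options
```

The returned options can be passed to `webdriver.Chrome(chrome_options=...)` in the same way as in the other examples.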

4.3. Disabling image loading

Skipping image loading speeds up crawling.

# Disable loading of images
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

import configure  # assumed: the author's own settings module (window size, timeouts)

chrome_options = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}  # 2 = block
chrome_options.add_experimental_option("prefs", prefs)

# Start the browser and set the wait
browser = webdriver.Chrome(chrome_options=chrome_options)
# set_window_size takes (width, height); sized to the desktop resolution, mainly for captcha screenshots
browser.set_window_size(configure.windowWidth, configure.windowHeight)
wait = WebDriverWait(browser, timeout=configure.timeoutMain)
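The background section also lists blocking JavaScript as a common setting. A minimal sketch of a preference builder covering both cases follows. `content_prefs` is a hypothetical helper, and the JavaScript key is an assumption: it follows the same naming pattern as the image key used above and is not taken from the article.

```python
BLOCK = 2  # content-setting value used in the image example: 2 = block


def content_prefs(block_images=True, block_javascript=False):
    """Build a Chrome 'prefs' dictionary that blocks selected content types."""
    prefs = {}
    if block_images:
        # Key used in the article's example
        prefs["profile.managed_default_content_settings.images"] = BLOCK
    if block_javascript:
        # Assumption: JavaScript follows the same key pattern as images
        prefs["profile.managed_default_content_settings.javascript"] = BLOCK
    return prefs
```

The result would be passed in a single call such as `chrome_options.add_experimental_option("prefs", content_prefs(block_javascript=True))`.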

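4.4. Adding a proxy

When choosing a proxy, prefer a fixed (static) IP to keep the crawl stable. This deserves particular attention when adding a proxy to a Selenium crawler: if you chose Selenium for crawling in the first place, the site presumably has strong anti-crawling measures (otherwise you would simply use scrapy) and closely monitors consistency between pages, cookies, user state and so on. With dynamic anonymous IPs, each IP lives only a very short time (1 to 3 minutes).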

from selenium import webdriver
# Static IP: 102.23.1.105:2005
# Abuyun (Abu cloud) dynamic IP: http://D37EPSERV96VT4W2:[email protected]:9020
PROXY = "proxy_host:proxy_port"
options = webdriver.ChromeOptions()
desired_capabilities = options.to_capabilities()
desired_capabilities['proxy'] = {
    "httpProxy": PROXY,
    "ftpProxy": PROXY,
    "sslProxy": PROXY,
    "noProxy": None,
    "proxyType": "MANUAL",
    "class": "org.openqa.selenium.Proxy",
    "autodetect": False
}
driver = webdriver.Chrome(desired_capabilities = desired_capabilities)
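For a static proxy without credentials, Chrome's `--proxy-server` command-line switch is a shorter alternative to editing the capabilities. This is a sketch of that alternative, not the article's approach; `proxy_argument` is a hypothetical helper. As far as I know the switch takes no username or password, so an authenticated dynamic proxy would still need a different mechanism.

```python
def proxy_argument(host, port, scheme="http"):
    """Return a Chrome startup argument that routes traffic through a proxy."""
    port = int(port)
    if not host:
        raise ValueError("proxy host must not be empty")
    if not 0 < port < 65536:
        raise ValueError("proxy port out of range: %d" % port)
    # No quotes around the value: add_argument does not go through a shell.
    return "--proxy-server={}://{}:{}".format(scheme, host, port)
```

With the static IP from the comment above, `options.add_argument(proxy_argument("102.23.1.105", 2005))` adds `--proxy-server=http://102.23.1.105:2005`.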

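4.5. Setting browser options

Selenium normally opens a plain browser with no extensions, but sometimes you want to change browser settings, for example: always allow Flash globally by default, clear cookies, clear the cache, and so on.
Here is one idea for achieving this, using Chrome as the example:

chrome://settings/content

chrome://settings/privacy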

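4.6. Adding browser extensions

Selenium normally opens a plain browser with no extensions, but sometimes crawling needs the help of a few plugins, such as XPath Helper for parsing, translation tools, or tools that fetch extra information (for example, sales figures). So how do you start chromedriver with the plugins you need?
The following example loads the XPath Helper plugin in Chrome.

4.6.1. Download the plugin

XPath Helper can be downloaded here: http://download.csdn.net/download/gengliang123/9944202
The downloaded file has the .crx suffix.

4.6.2. Put the plugin path in the code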

# Add the xpath helper application

from selenium import webdriver
chrome_options = webdriver.ChromeOptions()

# Set up the application extension
extension_path = 'D:/extension/XPath-Helper_v2.0.2.crx'
chrome_options.add_extension(extension_path)

# Start the browser and set the wait
browser = webdriver.Chrome(chrome_options=chrome_options)
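A wrong extension path is easy to miss until the browser fails to start. The following sketch checks the path before it is handed to `add_extension`; `checked_crx_path` is a hypothetical helper and not part of Selenium.

```python
import os


def checked_crx_path(path):
    """Return the absolute path of a packed extension, or raise if unusable."""
    if not path.lower().endswith(".crx"):
        raise ValueError("expected a .crx file, got: " + path)
    if not os.path.isfile(path):
        raise FileNotFoundError("extension file not found: " + path)
    return os.path.abspath(path)
```

In the example above this would be used as `chrome_options.add_extension(checked_crx_path(extension_path))`.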

4.6.3. Result

4.6.4. Notes

First, load as few plugins as possible to keep crawling fast.
Second, there is an option that loads the entire configuration of a Chrome user profile, but it failed in my tests, as shown below.
Reference: http://blog.csdn.net/y100100/article/details/44061469
Reference: https://www.cnblogs.com/stonewang313/p/3938488.html
Reference: http://blog.csdn.net/liaojianqiu0115/article/details/78353267
To package an installed extension as a crx file yourself:
First, go to C:\Users\<your user name>\AppData\Local\Google\Chrome\User Data\Default\Extensions. The folders inside are the installed extensions (show hidden folders first, otherwise you will not find this directory).
The folder names are strings of seemingly random letters. My approach is to open them one by one and match the version number inside against the version shown on Chrome's Extensions page.
Alternatively, open Chrome's settings, click Extensions and enable Developer mode; each installed plugin then shows its ID, which matches the folder name of the plugin you want to package.
Click "Pack extension" and select the version-number folder under that ID folder (or a copy of it placed anywhere on your computer). A .crx file and a .pem file then appear at the same level as the version-number folder.
This crx file is what we need. (Following this method, however, I could not find such a crx file in my local directory and had to download it separately...)
With the preparation done, see the code.

# The first way.
# The chrome browser extensions are all under: C:\Users\Administrator\AppData\Local\Google\Chrome\User Data\Profile 2\Extensions\
chrome_options.add_argument("user-data-dir=C:/Users/Administrator/AppData/Local/Google/Chrome/User Data")

# Load all Chrome configurations, type chrome://version/ in the Chrome address bar to see your "profile path", and then call this configuration file when the browser starts, with the following code.
from selenium import webdriver
option = webdriver.ChromeOptions()
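# Raw string: in a normal Python 3 string literal, '\U' is an invalid escape and raises a SyntaxError
option.add_argument(r'--user-data-dir=C:\Users\Administrator\AppData\Local\Google\Chrome\User Data') # Set it to the user's own data directory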
driver = webdriver.Chrome(chrome_options=option)

# Error results
First, all browser windows, including those opened by itself, will be controlled.
Second, other actions do not work and crash.
Traceback (most recent call last):
  File "E:/PyCharmCode/taobaoProductSelenium/taobaoSelenium.py", line 40, in <module>
    # Start the browser and set the wait
  File "E:\Miniconda\lib\site-packages\selenium\webdriver\chrome\webdriver.py", line 69, in __init__
    desired_capabilities=desired_capabilities)
  File "E:\Miniconda\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 151, in __init__
    self.start_session(desired_capabilities, browser_profile)
  File "E:\Miniconda\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 240, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "E:\Miniconda\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 308, in execute
    self.error_handler.check_response(response)
  File "E:\Miniconda\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: crashed
  (Driver info: chromedriver=2.32.498550 (9dec58e66c31bcc53a9ce3c7226f0c1c5810906a),platform=Windows NT 6.1.7601 SP1 x86_64)
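
One commonly reported cause of this crash is that the profile directory is already in use by a normally running Chrome. The traceback above does not confirm that, so treat it as an assumption. Under that assumption, a workaround is to give the crawler a dedicated profile directory instead of the everyday one. `dedicated_profile_argument` below is a hypothetical helper.

```python
import tempfile


def dedicated_profile_argument(prefix="selenium-chrome-"):
    """Create an empty profile directory and return (argument, directory).

    The directory is not deleted automatically; remove it after the
    browser has been closed.
    """
    profile_dir = tempfile.mkdtemp(prefix=prefix)
    return "--user-data-dir=" + profile_dir, profile_dir
```

The first element of the returned pair is passed to `add_argument`. Settings and extensions in such a directory start out empty, so anything the crawler needs has to be configured through the options shown in the earlier sections.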

4.7. Suppressing the save-password popup on login

Recently, logging in to a website with Chrome always brings up the save-password prompt. It does not have to appear: to avoid it, you only need to set the relevant preferences when starting Chrome.

from time import sleep 
from selenium import webdriver 
from selenium.webdriver.common.by import By
options = webdriver.ChromeOptions() 
prefs = {} 
# Set these two parameters to avoid the password prompt popup
prefs["credentials_enable_service"] = False 
prefs["profile.password_manager_enabled"] = False 
options.add_experimental_option("prefs", prefs) 
browser = webdriver.Chrome(chrome_options=options) 
browser.get('https://www.baidu.com/')
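
Experimental options are stored in a dictionary keyed by option name (see `_experimental_options` in the source excerpt in section 3), so a second `add_experimental_option("prefs", ...)` call replaces the first one instead of adding to it. To combine these password preferences with the image preference from 4.3, merge the dictionaries first. A minimal sketch, with `merge_prefs` and `apply_prefs` as hypothetical helpers:

```python
# Preference groups taken from sections 4.3 and 4.7
IMAGE_PREFS = {"profile.managed_default_content_settings.images": 2}
PASSWORD_PREFS = {
    "credentials_enable_service": False,
    "profile.password_manager_enabled": False,
}


def merge_prefs(*groups):
    """Merge several preference dictionaries; later groups win on conflicts."""
    merged = {}
    for group in groups:
        merged.update(group)
    return merged


def apply_prefs(options, *groups):
    """Set all preference groups with a single 'prefs' experimental option."""
    options.add_experimental_option("prefs", merge_prefs(*groups))
    return options
```

For example, `apply_prefs(options, IMAGE_PREFS, PASSWORD_PREFS)` sets all three preferences at once.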

5. Other parameters

Reference: http://blog.csdn.net/liaojianqiu0115/article/details/78353267

5.1. Chrome address-bar commands

Entering the following commands in Chrome's address bar returns the corresponding page. They show memory status, browser status, network status, DNS server status, the plugin cache and so on. Note, however, that these commands change over time and are not guaranteed to keep working.
    about:version - shows the current version
    about:memory - shows the browser's local memory usage
    about:plugins - shows the installed plugins
    about:histograms - shows histograms of internal browser metrics
    about:dns - shows the DNS status
    about:cache - shows the cached pages
    about:gpu - shows whether hardware acceleration is available
    about:flags - enables experimental features. Opening it shows a warning: "Please be careful, these experiments may be risky". I wonder whether it would mess up my configuration!
    chrome://extensions/ - shows the installed extensions

5.2. Useful Chrome parameters

Some other useful Chrome parameters with short descriptions. Use them in the same way as the add_argument calls in 4.6.4 above; they can of course also be used from a shell.
    --user-data-dir="[PATH]"  Sets the path of the User Data folder, so that bookmarks and other user data can be stored on a partition other than the system partition.
    --disk-cache-dir="[PATH]"  Sets the cache path.
    --disk-cache-size=  Sets the cache size, in bytes.
    --first-run  Resets to the initial state, as on first run.
    --incognito  Starts in incognito mode.
    --disable-javascript  Disables JavaScript.
    --omnibox-popup-count="num"  Changes the number of entries in the address-bar popup menu to num. I changed it to 15.
    --user-agent="xxxxxxx"  Changes the Agent string in the HTTP request headers; the effect can be checked on the about:version page.
    --disable-plugins  Disables loading of all plugins, for speed. The effect can be checked on the about:plugins page.
    --disable-javascript  Disables JavaScript.
    --disable-java  Disables Java.
    --start-maximized  Maximizes the window on startup.
    --no-sandbox  Turns off sandbox mode.
    --single-process  Runs as a single process.
    --process-per-tab  Uses a separate process for each tab.
    --process-per-site  Uses a separate process for each site.
    --in-process-plugins  Does not run plugins in a separate process.
    --disable-popup-blocking  Disables popup blocking.
    --disable-plugins  Disables plugins.
    --disable-images  Disables images.
    --incognito  Starts in incognito mode.
    --enable-udd-profiles  Enables the account-switching menu.
    --proxy-pac-url  Uses a PAC proxy [via 1/2].
    --lang=zh-CN  Sets the language to Simplified Chinese.
    --disk-cache-dir  Customizes the cache directory.
    --disk-cache-size  Customizes the maximum cache size (in bytes).
    --media-cache-size  Customizes the maximum media cache size (in bytes).
    --bookmark-menu  Adds a bookmark button to the toolbar.
    --enable-sync  Enables bookmark sync.
    --single-process  Runs Google Chrome as a single process.
    --start-maximized  Maximizes Google Chrome on startup.
    --disable-java  Disables Java.
    --no-sandbox  Runs without the sandbox.
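
When several of these parameters are used together, it can help to keep them in one mapping and convert them to argument strings in a single step. The sketch below is illustrative only: `chrome_flags` is a hypothetical helper, and values are emitted without surrounding quotes because `add_argument` passes them to Chrome directly rather than through a shell.

```python
def chrome_flags(settings):
    """Convert a {name: value} mapping into Chrome startup arguments.

    True or None gives a bare switch, False drops the switch, and any
    other value gives '--name=value'. Leading dashes in names are optional.
    """
    flags = []
    for name, value in settings.items():
        name = name.lstrip("-")
        if value is False:
            continue  # switch explicitly turned off
        if value is None or value is True:
            flags.append("--" + name)
        else:
            flags.append("--{}={}".format(name, value))
    return flags
```

Each resulting string can then be passed to `options.add_argument`, for instance `for flag in chrome_flags({"lang": "zh-CN", "start-maximized": True}): options.add_argument(flag)`.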
