[Solved] BeautifulSoup getText from between <p>, not picking up subsequent paragraphs

2022-03-01 10:04:17

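Question

First off, I am a complete newbie when it comes to Python. However, I have written a piece of code that looks at an RSS feed, opens each link, and extracts the text from the article. This is what I have so far: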

from BeautifulSoup import BeautifulSoup
import feedparser
import re  # needed for the re.sub() calls below
import urllib

# Dictionaries
links = {}
titles = {}

# Variables
n = 0

rss_url = "feed://www.gfsc.gg/_layouts/GFSC/GFSCRSSFeed.aspx?Division=ALL&Article=All&Title=News&Type=doc&List=%7b66fa9b18-776a-4e91-9f80-30195001386c%7d%23%7b679e913e-6301-4bc4-9fd9-a788b926f565%7d%23%7b0e65f37f-1129-4c78-8f59-3db5f96409fd%7d%23%7bdd7c290d-5f17-43b7-b6fd-50089368e090%7d%23%7b4790a972-c55f-46a5-8020-396780eb8506%7d%23%7b6b67c085-7c25-458d-8a98-373e0ac71c52%7d%23%7be3b71b9c-30ce-47c0-8bfb-f3224e98b756%7d%23%7b25853d98-37d7-4ba2-83f9-78685f2070df%7d%23%7b14c41f90-c462-44cf-a773-878521aa007c%7d%23%7b7ceaf3bf-d501-4f60-a3e4-2af84d0e1528%7d%23%7baf17e955-96b7-49e9-ad8a-7ee0ac097f37%7d%23%7b3faca1d0-be40-445c-a577-c742c2d367a8%7d%23%7b6296a8d6-7cab-4609-b7f7-b6b7c3a264d6%7d%23%7b43e2b52d-e4f1-4628-84ad-0042d644deaf%7d"

# Parse the RSS feed
feed = feedparser.parse(rss_url)

# view the entire feed, one entry at a time
for post in feed.entries:
    # Create variables from posts
    link = post.link
    title = post.title
    # Add the link to the dictionary
    n += 1
    links[n] = link

for k,v in links.items():
    # Open RSS feed
    page = urllib.urlopen(v).read()
    page = str(page)
    soup = BeautifulSoup(page)

    # Find all of the text between paragraph tags and strip out the html
    page = soup.find('p').getText()

    # Strip ampersand codes and WATCH:
    page = re.sub(r'&\w+;','',page)
    page = re.sub('WATCH:','',page)

    # Print Page
    print(page)
    print(" ")

    # To stop after 3rd article, just whilst testing ** to be removed **
    if (k >= 3):
        break

This gives the following output:

>>> (executing lines 1 to 45 of "RSS_BeautifulSoup.py")
Total deposits held with Guernsey banks at the end of June 2012 increased 2.1% in sterling terms by £2.1 billion from the end of March 2012 level of £101 billion, up to £103.1 billion. This is 9.4% lower than the same time a year ago.  Total assets and liabilities increased by £2.9 billion to £131.2 billion representing a 2.3% increase over the quarter though this was 5.7% lower than the level a year ago.  The higher figures reflected the effects both of volume and exchange rate factors.

The net asset value of total funds under management and administration has increased over the quarter ended 30 June 2012 by £711 million (0.3%) to reach £270.8 billion.For the year since 30 June 2011, total net asset values decreased by £3.6 billion (1.3%).

The Commission has updated the warranties on the Form REG, Form QIF and Form FTL to take into account the Commission’s Guidance Notes on Personal Questionnaires and Personal Declarations.  In particular, the following warranty (varies slightly dependent on the application) has been inserted in the aforementioned forms,

>>>

The problem is that this is only the first paragraph of each article, but I need to show the entire article. Any help would be gratefully received.

Answer

You are almost there.

# Find all of the text between paragraph tags and strip out the html
page = soup.find('p').getText()

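Using find (as you have noticed) stops after the first result. If you want every paragraph, you need find_all (named findAll in BeautifulSoup 3, which the code in the question imports). It returns a list of tags, so call getText() on each one. If the pages are formatted consistently (I only looked over one of them), you could also use something like

soup.find('div',{'id':'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'})

to zero in on the body of the article.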
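
The two suggestions can be combined. The following is a minimal sketch, not the asker's final code: it assumes the bs4 package (BeautifulSoup 4) rather than the BeautifulSoup 3 import used in the question, and the html string is a made-up stand-in for a downloaded article page. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one downloaded article page (not real site content).
ARTICLE_DIV_ID = "ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField"
html = """
<html><body>
<p>Site navigation</p>
<div id="%s">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
</body></html>
""" % ARTICLE_DIV_ID

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag, which is why the
# question's code printed a single paragraph per article.
first_only = soup.find("p").get_text()

# find_all() returns every matching tag; join the text of each one.
all_paragraphs = "\n\n".join(p.get_text() for p in soup.find_all("p"))

# Narrow to the article container first, then collect its paragraphs,
# so paragraphs outside the article body are left out.
body = soup.find("div", {"id": ARTICLE_DIV_ID})
article = "\n\n".join(p.get_text() for p in body.find_all("p")) if body else ""

print(article)
```

Searching inside the container div first avoids picking up paragraphs from navigation or footer markup, but it depends on every page using the same id, which was only checked on one page.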
