Python library: beautifulsoup4

Installation

> pip install requests
> pip install beautifulsoup4

Module members

BeautifulSoup The BeautifulSoup object represents the document to be parsed as a nested data structure.
BeautifulStoneSoup
builder A module
builder_registry A bs4.builder.TreeBuilderRegistry class
CData
Comment
Counter
dammit A module
Declaration
DEFAULT_OUTPUT_ENCODING A string
Doctype
element A module
FeatureNotFound
formatter A module
GuessedAtParserWarning
MarkupResemblesLocatorWarning
NavigableString
os A module
PageElement
ParserRejectedMarkup
ProcessingInstruction
PYTHON_SPECIFIC_ENCODINGS A set
re A module (the standard library re module, imported by bs4)
ResultSet
_s
Script
_soup
SoupStrainer
StopParsing
Stylesheet
sys A module (the standard library sys module, imported by bs4)
Tag
TemplateString
traceback A module
UnicodeDammit
warnings A module
XMLParsedAsHTMLWarning
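
Several of the members listed above are node types that show up in a parsed tree. The following is a minimal sketch, assuming a small made-up HTML fragment, of how Tag, NavigableString and Comment appear side by side among the children of one element:

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical fragment: a string, a comment and a nested tag inside <p>.
soup = BeautifulSoup("<p>Hello<!-- a note --><b>world</b></p>", "html.parser")

# Collect the class name of every direct child of <p>.
kinds = [type(node).__name__ for node in soup.p.contents]
print(kinds)  # ['NavigableString', 'Comment', 'Tag']

# A Comment is a special kind of NavigableString, a Tag is not.
comment = soup.p.contents[1]
print(isinstance(comment, NavigableString), isinstance(soup.p.b, NavigableString))
```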

Members of the BeautifulSoup class

A BeautifulSoup object represents the document to be parsed as a nested data structure.
BeautifulSoup inherits from Tag which inherits from PageElement which inherits from object.
Most of the methods exposed in BeautifulSoup are inherited from those two classes.
A PageElement contains the navigational information for some part of the page (i.e. its current location in the parse tree).
A Tag represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
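
The inheritance chain described above can be checked at runtime. This is a small sketch that only relies on issubclass() and the class's __mro__ attribute; the exact list printed may differ between bs4 versions:

```python
from bs4 import BeautifulSoup
from bs4.element import PageElement, Tag

# Walk the method resolution order of BeautifulSoup and print the class names.
mro_names = [cls.__name__ for cls in BeautifulSoup.__mro__]
print(mro_names)

# The relationships named in the text.
print(issubclass(BeautifulSoup, Tag))
print(issubclass(Tag, PageElement))
```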
Because Tag implements __getattr__() (which calls find() under the hood), it's possible to get a reference to a tag like so:
doc = BeautifulSoup(htmlTxt, 'html.parser')
print(doc.title)
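
To make the equivalence with find() concrete, here is a short sketch on a made-up document (html_txt is an assumption, not part of the original notes). Attribute access returns the first matching tag only, and None when no such tag exists:

```python
from bs4 import BeautifulSoup

# Hypothetical input document.
html_txt = "<html><head><title>Demo</title></head><body><p>one</p><p>two</p></body></html>"
doc = BeautifulSoup(html_txt, "html.parser")

via_attribute = doc.title        # resolved through Tag.__getattr__()
via_find = doc.find("title")     # the explicit call

print(via_attribute is via_find)  # True: both return the same Tag object
print(doc.p.string)               # one  (only the first <p> is returned)
print(doc.nosuchtag)              # None (no exception for a missing tag)
```
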
The members of BeautifulSoup are:
_all_strings()
append()
ASCII_SPACES
childGenerator()
children
clear()
decode()
decode_contents()
_decode_markup Class method?
decompose()
decomposed
default
DEFAULT_BUILDER_FEATURES
DEFAULT_INTERESTING_STRING_TYPES
descendants
encode()
encode_contents()
endData()
extend()
extract()
_feed()
fetchNextSiblings()
fetchParents()
fetchPrevious()
fetchPreviousSiblings()
find()
_find_all()
findAll()
find_all()
findAllNext()
find_all_next()
findAllPrevious()
find_all_previous()
findChild()
findChildren()
findNext()
find_next()
findNextSibling()
find_next_sibling()
findNextSiblings()
find_next_siblings()
_find_one()
findParent()
find_parent()
findParents()
find_parents()
findPrevious()
find_previous()
findPreviousSibling()
find_previous_sibling()
findPreviousSiblings()
find_previous_siblings()
format_string()
formatter_for_name()
get()
get_attribute_list()
getText()
get_text()
handle_data()
handle_endtag()
handle_starttag()
has_attr()
has_key()
index()
insert()
insert_after()
insert_before()
is_empty_element
isSelfClosing
_is_xml
_last_descendant()
_lastRecursiveChild()
_linkage_fixer()
_markup_is_url Class method?
_markup_resembles_filename Class method?
new_string()
new_tag()
next
next_elements
nextGenerator()
nextSibling
nextSiblingGenerator()
next_siblings
NO_PARSER_SPECIFIED_WARNING
object_was_parsed()
parentGenerator()
parents
parserClass
popTag()
_popToTag()
prettify()
previous
previous_elements
previousGenerator()
previousSibling
previousSiblingGenerator()
previous_siblings
pushTag()
recursiveChildGenerator()
renderContents()
replaceWith()
replace_with()
replaceWithChildren()
replace_with_children()
reset()
ROOT_TAG_NAME
select()
select_one()
setup()
_should_pretty_print()
smooth()
string
string_container()
strings
stripped_strings
text
unwrap()
wrap()
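
A few of the members listed above cover most everyday use. The sketch below, on a made-up list of links, touches find_all(), find(), get(), select(), get_text() and stripped_strings; select() relies on the soupsieve package, which pip normally installs together with beautifulsoup4:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = """
<ul id="langs">
  <li class="compiled"><a href="/rust">Rust</a></li>
  <li class="scripted"><a href="/python">Python</a></li>
  <li class="scripted"><a>Perl</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns every match; get() returns None for a missing attribute.
hrefs = [a.get("href") for a in soup.find_all("a")]

# select() takes a CSS selector; get_text() concatenates the text below a tag.
scripted = [li.get_text() for li in soup.select("li.scripted")]

# stripped_strings yields all strings, skipping whitespace-only ones.
words = list(soup.stripped_strings)

# find() returns the first match; href=True means "has an href attribute".
first_with_href = soup.find("a", href=True)

print(hrefs)                    # ['/rust', '/python', None]
print(scripted)                 # ['Python', 'Perl']
print(words)                    # ['Rust', 'Python', 'Perl']
print(first_with_href["href"])  # /rust
```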

Simple examples

from   bs4 import BeautifulSoup
import requests

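# Download the page; .text is the response body decoded to a str.
html_text = requests.get('https://github.com/ReneNyffenegger/about-python/tree/master/libraries/BeautifulSoup/script.py').text

# Name the parser explicitly; omitting it triggers a GuessedAtParserWarning.
soup = BeautifulSoup(html_text, 'html.parser')

print("Title:          ", soup.title)
print("  .name:        ", soup.title.name)
print("  .string:      ", soup.title.string)
print("  .parent.name: ", soup.title.parent.name)

print()

print("Links:")

# a.string is None when a link has more than one child node;
# a.get('href') is None when the attribute is missing.
for a in soup.find_all('a'):
    print("  %-30s: %s" % (a.string, a.get('href')))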
Github repository about-python, path: /libraries/BeautifulSoup/script.py
from   bs4 import BeautifulSoup

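# Name the parser explicitly; omitting it triggers a GuessedAtParserWarning.
soup = BeautifulSoup(
  """<foo><c>text one<sub>ttt</sub>text two<sub>uuu</sub>text three</c></foo>""",
  'html.parser'
)


def descend(node, level):
    # Print every child of node, indented according to its depth in the tree.
    for child in node.contents:

        if child.name is not None:
           # A tag: print it and recurse into its children.
           print("  " * level, "<" + child.name+ ">")
           descend(child, level+1)
           print("  " * level, "</"+ child.name+ ">")
        else:
           # A string: it has no tag name.
           print("  " * level, " " + child.string)

descend(soup, 0)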
Github repository about-python, path: /libraries/BeautifulSoup/recursively.py
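
The same tree can also be walked without explicit recursion. This sketch is an alternative to the script above, not a replacement for it: it iterates over .descendants and derives the indentation from .parents. The helper depth() is a hypothetical name introduced here.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup(
    "<foo><c>text one<sub>ttt</sub>text two<sub>uuu</sub>text three</c></foo>",
    "html.parser",
)


def depth(node):
    # .parents ends with the BeautifulSoup object itself, hence the -1.
    return len(list(node.parents)) - 1


lines = []
# .descendants yields every tag and string below soup in document order.
for node in soup.descendants:
    if isinstance(node, NavigableString):
        label = str(node)
    else:
        label = "<" + node.name + ">"
    lines.append("  " * depth(node) + label)

print("\n".join(lines))
```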

ModuleNotFoundError: No module named 'beautifulsoup4'

Wrong:
>>> import beautifulsoup4
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'beautifulsoup4'
Better:
>>> import bs4
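
The error comes from the difference between the name used with pip (beautifulsoup4) and the name of the importable package (bs4). A small sketch, assuming beautifulsoup4 is installed, that checks both names with importlib.util.find_spec() without importing anything:

```python
import importlib.util

# Map each candidate name to whether Python can locate a module of that name.
importable = {
    name: importlib.util.find_spec(name) is not None
    for name in ("beautifulsoup4", "bs4")
}

for name, found in importable.items():
    print(name, "is importable" if found else "is not importable")
```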

See also

BeautifulSoup and requests