Stream Our Mistakes EP 004

In this episode, Matt walks us through scraping HTML from the web using the popular Python library Beautiful Soup.

Here's the code snippet from the session, including reference links:
# Created for Stream Our Mistakes
# https://streamourmistakes.blogspot.com/

# Reference:
# https://docs.python.org/3/library/urllib.request.html
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/

from bs4 import BeautifulSoup
import urllib.request

'''
# Local HTML to play with, taken from the documentation. Uncomment to enable.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
'''

# Get the HTML from the web; the with block closes the connection when done.
with urllib.request.urlopen('https://en.wikiquote.org/wiki/Aristotle') as f:
    html = f.read()

# Load the HTML into the parser.
soup = BeautifulSoup(html, 'html.parser')

# Show the whole document, pretty-printed.
# print(soup.prettify())

# Access a single element.
# print(soup.title)

# Find all <a> tags in the HTML doc and print some information.
links = soup.find_all('a')

for link in links:
    # get() returns None for <a> tags that have no href attribute.
    print(link.get('href'))

# Print the total number of <a> tags found.
print(len(links))
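
The loop in the snippet prints each `href` exactly as it appears in the page. In practice that can mean relative paths, in-page anchors, or `None` for `<a>` tags without an `href`. Here is a minimal sketch of one way to tidy that output using only the standard library. The helper name `absolute_links` and the sample values are made up for illustration and were not part of the session.

```python
from urllib.parse import urljoin, urldefrag


# Hypothetical helper (not from the session): normalize raw href values.
def absolute_links(hrefs, base_url):
    """Return absolute URLs without duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        # Skip missing hrefs (None or empty) and in-page anchors like '#Quotes'.
        if not href or href.startswith('#'):
            continue
        # Resolve relative paths against the page URL, then drop any fragment.
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


if __name__ == "__main__":
    base = 'https://en.wikiquote.org/wiki/Aristotle'
    # Illustrative sample values, standing in for link.get('href') results.
    sample = [None, '#Quotes', '/wiki/Plato', '/wiki/Plato#Quotes',
              'https://example.com/elsie', '//en.wikipedia.org/wiki/Aristotle']
    for url in absolute_links(sample, base):
        print(url)
```

With the snippet above, this could be called as `absolute_links((link.get('href') for link in links), 'https://en.wikiquote.org/wiki/Aristotle')`, assuming `links` holds the result of `soup.find_all('a')`.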