Broadcasts.com - "004 - HTML Scraping with Beautiful Soup" (Stream Our Mistakes)

Technology

004 - HTML Scraping with Beautiful Soup

Published: Dec. 23, 2017, 1:21 a.m.

Stream Our Mistakes EP 004

In this episode, Matt walks us through HTML/web scraping using the popular Python library Beautiful Soup.

Here's the code snippet from the session and links:

# Created for Stream Our Mistakes
# https://streamourmistakes.blogspot.com/

# Reference:
# https://docs.python.org/3/library/urllib.request.html
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/

from bs4 import BeautifulSoup
import urllib.request

'''
# Local HTML to play with, taken from the documentation. Uncomment to enable.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
'''

# Get the HTML from the web; the with block closes the connection when done.
with urllib.request.urlopen('https://en.wikiquote.org/wiki/Aristotle') as f:
    html = f.read()

# Load the HTML into the parser.
soup = BeautifulSoup(html, 'html.parser')

# Show the whole raw document, pretty-printed.
# print(soup.prettify())

# Access a single element.
# print(soup.title)

# Find all <a> tags in the HTML doc and print each href attribute.
links = soup.find_all('a')

for link in links:
    print(link.get('href'))

# Print the total number of <a> tags found.
print(len(links))
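
The commented-out `html_doc` sample in the snippet makes it possible to try the same calls without a network request. Here is a minimal sketch of that, assuming `bs4` is installed; the variable names `title`, `hrefs`, and `last` are illustrative and not from the session:

```python
from bs4 import BeautifulSoup

# Sample markup from the Beautiful Soup documentation (the same html_doc as in the snippet).
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Text of the <title> element.
title = soup.title.string

# href of every <a> tag, in document order.
hrefs = [link.get('href') for link in soup.find_all('a')]

# Look up a single tag by its id attribute and read its text.
last = soup.find(id='link3').get_text()

print(title)
print(hrefs)
print(last)
```

Working against a fixed string like this keeps the results stable while experimenting, since a live page can change between runs.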

links:
https://docs.python.org/3/library/urllib.request.html
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
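
On a real page such as the Wikiquote article used in the snippet, `link.get('href')` may return `None` for `<a>` tags that have no `href`, and many values are relative paths or in-page fragments rather than full URLs. Below is a minimal sketch of cleaning such a list with the standard library's `urllib.parse.urljoin`. The helper name `absolute_links` and the sample values in `raw` are hypothetical, chosen only for illustration:

```python
from urllib.parse import urljoin


def absolute_links(hrefs, base_url):
    """Resolve raw href values against base_url, skipping missing ones and in-page fragments."""
    result = []
    for href in hrefs:
        # Skip None, empty strings, and fragment-only links such as '#Quotes'.
        if not href or href.startswith('#'):
            continue
        result.append(urljoin(base_url, href))
    return result


# Hypothetical href values of the kind link.get('href') might return.
raw = [
    None,
    '#Quotes',
    '/wiki/Plato',
    'https://example.com/page',
    '//commons.wikimedia.org/wiki/Aristotle',
]
base = 'https://en.wikiquote.org/wiki/Aristotle'

for url in absolute_links(raw, base):
    print(url)
```

Passing the page's own URL as `base_url` lets `urljoin` fill in the scheme and host for relative and protocol-relative links while leaving absolute URLs unchanged.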

Subscribe to the podcast on Apple Podcasts, Google Play, and Stitcher.

matt
site: http://octon.io/
github: https://github.com/mmdempsey

eddyizm
site: http://eddyizm.com
twitter: http://twitter.com/eddyizm
github: https://github.com/eddyizm

perry
github: https://github.com/apk29

---
**youtube live broadcast:**
https://youtube.com/user/eddyizm/live

Subscribe to our channel and follow my Twitter feed to be notified of our next live broadcast, and feel free to leave us comments and suggestions on what you want to see.