goodthings4me.tistory.com
■ Python Crawling Practice - Summary #1
▷ Extracting the Naver main page
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.naver.com'
response = urllib.request.urlopen(url)
print(response)
# <http.client.HTTPResponse object at 0x000001B88E2FC880>

# read() consumes the response body, so call it once and keep the bytes
html = response.read()
print(html)

bsObj = BeautifulSoup(html, 'html.parser')  # parse the HTML
print(bsObj)
# Prints the same content shown by "View page source" on the Naver page
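- BeautifulSoup is a parsing library: it provides the functions needed to parse HTML and then extract data from it.
- If bs4 is not installed, the import fails with the error below; run pip install bs4 (which installs the beautifulsoup4 package) and try again.
- ModuleNotFoundError: No module named 'bs4'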
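One detail of the fetch step is worth a quick check: the object returned by urlopen is a byte stream, and a stream can only be read through once. The sketch below uses io.BytesIO as a stand-in for the HTTPResponse (an assumption, so that it runs without a network connection) to show that a second read() returns nothing.

```python
import io

# BytesIO stands in for the HTTPResponse returned by urlopen (an assumption
# made so this sketch runs offline); both are file-like byte streams.
fake_response = io.BytesIO(b"<html><body><p>hello</p></body></html>")

first_read = fake_response.read()   # consumes the whole stream
second_read = fake_response.read()  # nothing is left to read

print(first_read)
print(second_read)
```

This is why it is safer to store the result of read() in a variable and hand that variable to BeautifulSoup, rather than printing read() and then parsing the same response object.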
▷ Extracting 'Make Naver my start page' (네이버를 시작페이지로) from the top of the Naver main page
top_right = bsObj.find('div', {'class': 'service_area'})
# Narrows the many <div> tags down to the one to extract
print(top_right)
<div class="service_area">
<a class="link_set" data-clk="top.mkhome" href="https://help.naver.com/support/welcomePage/guide.help" id="NM_set_home_btn">네이버를 시작페이지로</a>
<i class="sa_bar"></i>
<a class="link_jrnaver" data-clk="top.jrnaver" href="https://jr.naver.com"><i class="ico_jrnaver"></i><span class="blind">쥬니어네이버</span></a>
<a class="link_happybin" data-clk="top.happybean" href="https://happybean.naver.com"><i class="ico_happybin"></i><span class="blind">해피빈</span></a>
</div>
print(type(top_right))
# <class 'bs4.element.Tag'>
first_a = top_right.find('a')
print(first_a.text)
# 네이버를 시작페이지로
print(type(first_a))
# <class 'bs4.element.Tag'>
first_a_all = top_right.find_all('a')
print(first_a_all)
[<a class="link_set" data-clk="top.mkhome" href="https://help.naver.com/support/welcomePage/guide.help" id="NM_set_home_btn">네이버를 시작페이지로</a>, <a class="link_jrnaver" data-clk="top.jrnaver" href="https://jr.naver.com"><i class="ico_jrnaver"></i><span class="blind">쥬니어네이버</span></a>, <a class="link_happybin" data-clk="top.happybean" href="https://happybean.naver.com"><i class="ico_happybin"></i><span class="blind">해피빈</span></a>]
print(type(first_a_all))
# <class 'bs4.element.ResultSet'>
for a_tag in first_a_all:
    print(a_tag)
<a class="link_set" data-clk="top.mkhome" href="https://help.naver.com/support/welcomePage/guide.help" id="NM_set_home_btn">네이버를 시작페이지로</a>
<a class="link_jrnaver" data-clk="top.jrnaver" href="https://jr.naver.com"><i class="ico_jrnaver"></i><span class="blind">쥬니어네이버</span></a>
<a class="link_happybin" data-clk="top.happybean" href="https://happybean.naver.com"><i class="ico_happybin"></i><span class="blind">해피빈</span></a>
result = [a_tag.text for a_tag in first_a_all]
# a_tag.get_text() gives the same result as a_tag.text
print(result)
# ['네이버를 시작페이지로', '쥬니어네이버', '해피빈']
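The difference between find() and find_all() can be tried without hitting the live site. The following is a minimal sketch that assumes a trimmed-down, hypothetical copy of the service_area markup; the variable names (sample_html, area, missing, and so on) are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup modeled on the service_area output above.
sample_html = """
<div class="service_area">
  <a class="link_set" href="https://help.naver.com/support/welcomePage/guide.help">네이버를 시작페이지로</a>
  <a class="link_jrnaver" href="https://jr.naver.com"><span class="blind">쥬니어네이버</span></a>
  <a class="link_happybin" href="https://happybean.naver.com"><span class="blind">해피빈</span></a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
area = soup.find("div", {"class": "service_area"})

first_link = area.find("a")      # a single Tag: the first match only
all_links = area.find_all("a")   # a ResultSet, which behaves like a list
missing = area.find("table")     # there is no <table>, so find() returns None

# get_text(strip=True) also trims surrounding whitespace from each text
texts = [a.get_text(strip=True) for a in all_links]

print(first_link.get_text())
print(len(all_links), texts)
print(missing)
```

Because find() returns None when nothing matches, it is worth checking the result before calling .text on it when crawling a page whose layout may change.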
▷ Extracting the Naver menu
ul = bsObj.find('ul', {'class': 'type_fix'})  # {'class': 'list_nav'} also works
# find() returns only the first matching <ul>
print(ul)
<ul class="list_nav type_fix">
<li class="nav_item">
<a class="nav" data-clk="svc.mail" href="https://mail.naver.com/"><i class="ico_mail"></i>메일</a>
</li>
<li class="nav_item"><a class="nav" data-clk="svc.cafe" href="https://section.cafe.naver.com/">카페</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.blog" href="https://section.blog.naver.com/">블로그</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.kin" href="https://kin.naver.com/">지식iN</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.shopping" href="https://shopping.naver.com/">쇼핑</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.pay" href="https://order.pay.naver.com/home">Pay</a></li>
<li class="nav_item">
<a class="nav" data-clk="svc.tvcast" href="https://tv.naver.com/"><i class="ico_tv"></i>TV</a>
</li>
</ul>
lis = ul.find_all('li')
print(lis)
[<li class="nav_item">
<a class="nav" data-clk="svc.mail" href="https://mail.naver.com/"><i class="ico_mail"></i>메일</a>
</li>, <li class="nav_item"><a class="nav" data-clk="svc.cafe" href="https://section.cafe.naver.com/">카페</a></li>, <li class="nav_item"><a class="nav" data-clk="svc.blog" href="https://section.blog.naver.com/">블로그</a></li>, <li class="nav_item"><a class="nav" data-clk="svc.kin" href="https://kin.naver.com/">지식iN</a></li>, <li class="nav_item"><a class="nav" data-clk="svc.shopping" href="https://shopping.naver.com/">쇼핑</a></li>, <li class="nav_item"><a class="nav" data-clk="svc.pay" href="https://order.pay.naver.com/home">Pay</a></li>, <li class="nav_item">
<a class="nav" data-clk="svc.tvcast" href="https://tv.naver.com/"><i class="ico_tv"></i>TV</a>
</li>]
print(len(lis)) # 7
for li in lis:
    a_tag = li.find('a')
    print(a_tag.text)
'''
메일
카페
블로그
지식iN
쇼핑
Pay
TV
'''
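The reason both 'type_fix' and 'list_nav' find the same <ul> is that class is a multi-valued attribute, and BeautifulSoup matches a tag if any one of its classes equals the search value. Here is a small sketch of that behavior, assuming a hypothetical two-list sample rather than the real Naver page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two <ul> tags that share the class "list_nav".
sample_html = """
<div class="group_nav">
  <ul class="list_nav type_fix"><li class="nav_item"><a class="nav" href="https://mail.naver.com/">메일</a></li></ul>
  <ul class="list_nav NM_FAVORITE_LIST"><li class="nav_item"><a class="nav" href="https://dict.naver.com/">사전</a></li></ul>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

by_type_fix = soup.find("ul", {"class": "type_fix"})
by_list_nav = soup.find("ul", {"class": "list_nav"})

# Both searches land on the first <ul>, because it carries both classes.
same_tag = by_type_fix is by_list_nav

# The class attribute is stored as a list of its individual values.
class_list = by_type_fix["class"]

# find_all() with the shared class returns every <ul> that has it.
list_nav_count = len(soup.find_all("ul", class_="list_nav"))

print(same_tag, class_list, list_nav_count)
```

Searching by the more specific class ('type_fix') is the safer choice when only the first menu is wanted, since 'list_nav' only works here because find() stops at the first match.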
group_ul = bsObj.find('div', {'class':'group_nav'})
ul_all = group_ul.find_all('ul')
print(ul_all)
[<ul class="list_nav type_fix">
<li class="nav_item">
<a class="nav" data-clk="svc.mail" href="https://mail.naver.com/"><i class="ico_mail"></i>메일</a>
</li>
<li class="nav_item"><a class="nav" data-clk="svc.cafe" href="https://section.cafe.naver.com/">카페</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.blog" href="https://section.blog.naver.com/">블로그</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.kin" href="https://kin.naver.com/">지식iN</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.shopping" href="https://shopping.naver.com/">쇼핑</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.pay" href="https://order.pay.naver.com/home">Pay</a></li>
<li class="nav_item">
<a class="nav" data-clk="svc.tvcast" href="https://tv.naver.com/"><i class="ico_tv"></i>TV</a>
</li>
</ul>, <ul class="list_nav NM_FAVORITE_LIST">
<li class="nav_item"><a class="nav" data-clk="svc.dic" href="https://dict.naver.com/">사전</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.news" href="https://news.naver.com/">뉴스</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.stock" href="https://finance.naver.com/">증권</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.land" href="https://land.naver.com/">부동산</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.map" href="https://map.naver.com/">지도</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.movie" href="https://movie.naver.com/">영화</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.vibe" href="https://vibe.naver.com/">VIBE</a>
<li class="nav_item"><a class="nav" data-clk="svc.book" href="https://book.naver.com/">책</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.webtoon" href="https://comic.naver.com/">웹툰</a></li>
</li></ul>, <ul class="list_nav type_empty" style="display: none;"></ul>]
for ul_tag in ul_all:
    for li in ul_tag.find_all('li'):
        a_tag = li.find('a')
        print(a_tag.text, a_tag['href'])
메일 https://mail.naver.com/
카페 https://section.cafe.naver.com/
블로그 https://section.blog.naver.com/
지식iN https://kin.naver.com/
쇼핑 https://shopping.naver.com/
Pay https://order.pay.naver.com/home
TV https://tv.naver.com/
사전 https://dict.naver.com/
뉴스 https://news.naver.com/
증권 https://finance.naver.com/
부동산 https://land.naver.com/
지도 https://map.naver.com/
영화 https://movie.naver.com/
VIBE https://vibe.naver.com/
책 https://book.naver.com/
웹툰 https://comic.naver.com/
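The nested loops above can also be written with a CSS selector. This is only an alternative sketch, not the approach used in the referenced book: it assumes a hypothetical, shortened sample of the menu markup and uses select() together with .get('href'), which returns None instead of raising an error when a link has no href.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened menu markup; the last link deliberately has no href.
sample_html = """
<div class="group_nav">
  <ul class="list_nav type_fix">
    <li class="nav_item"><a class="nav" href="https://mail.naver.com/">메일</a></li>
  </ul>
  <ul class="list_nav NM_FAVORITE_LIST">
    <li class="nav_item"><a class="nav" href="https://dict.naver.com/">사전</a></li>
    <li class="nav_item"><a class="nav">더보기</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

menu = []
# One selector replaces the find() / find_all() chain and the nested loops.
for a_tag in soup.select("div.group_nav li.nav_item > a.nav"):
    href = a_tag.get("href")   # None when the attribute is missing
    if href is None:
        continue               # skip links that have no target
    menu.append((a_tag.get_text(strip=True), href))

for name, link in menu:
    print(name, link)
```

Collecting the pairs into a list first, instead of printing inside the loop, makes it easy to save them to a file or process them later.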
[Reference] 한입에 크롤링 - Kyeongrok Kim