
[python] Python Crawling Practice - Extracting the Naver Main Page

by good4me 2020. 10. 23.

goodthings4me.tistory.com

 

■ Python Crawling Practice - Summary #1

▷ Extracting the Naver main page

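import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.naver.com'
response = urllib.request.urlopen(url)

print(response)
# <http.client.HTTPResponse object at 0x000001B88E2FC880>

# read() consumes the response stream, so call it once and keep the bytes;
# a second read() on the same response returns b''
html = response.read()
print(html)

bsObj = BeautifulSoup(html, 'html.parser')  # parse
print(bsObj)
# Prints the same content you see with "View page source" on the Naver page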
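  • BeautifulSoup is a parsing library: it provides the functions needed to parse HTML and then extract data from it
  • If bs4 is not installed, the import fails with the error below --> run pip install bs4 (the package itself is named beautifulsoup4) and try again
  • ModuleNotFoundError: No module named 'bs4'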
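
The parsing step can also be tried offline. The sketch below rests on assumptions: io.BytesIO stands in for the HTTP response (both are file-like objects whose read() consumes the stream), and the HTML snippet is made up rather than taken from Naver. It shows why the bytes should be read once and kept, since a second read() returns nothing.

```python
import io
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTTP response: a file-like byte stream
fake_response = io.BytesIO(b'<html><body><p id="msg">hello</p></body></html>')

raw = fake_response.read()       # first read returns the whole body
leftover = fake_response.read()  # the stream is already exhausted

soup = BeautifulSoup(raw, 'html.parser')
print(leftover)                            # b''
print(soup.find('p', {'id': 'msg'}).text)  # hello
```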


 

▷ Extracting the '네이버를 시작페이지로' (Set Naver as your start page) link from the top of the Naver main page

Page source for the '네이버를 시작페이지로' link

top_right = bsObj.find('div', {'class':'service_area'})
# Narrow the many <div> tags down to the one to extract
print(top_right)
<div class="service_area">
<a class="link_set" data-clk="top.mkhome" href="https://help.naver.com/support/welcomePage/guide.help" id="NM_set_home_btn">네이버를 시작페이지로</a>
<i class="sa_bar"></i>
<a class="link_jrnaver" data-clk="top.jrnaver" href="https://jr.naver.com"><i class="ico_jrnaver"></i><span class="blind">쥬니어네이버</span></a>
<a class="link_happybin" data-clk="top.happybean" href="https://happybean.naver.com"><i class="ico_happybin"></i><span class="blind">해피빈</span></a>
</div>
print(type(top_right))
# <class 'bs4.element.Tag'>

first_a = top_right.find('a')
print(first_a.text)
# 네이버를 시작페이지로
print(type(first_a))
# <class 'bs4.element.Tag'>
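
Page markup changes over time, so a class name that matched when this was written may no longer exist. find() returns None in that case, and chaining .find('a') on it raises AttributeError. A minimal sketch of guarding against that, using a shortened copy of the markup above and a made-up class name (no_such_class) for the miss:

```python
from bs4 import BeautifulSoup

html = ('<div class="service_area">'
        '<a class="link_set" href="https://help.naver.com/support/welcomePage/guide.help">'
        '네이버를 시작페이지로</a></div>')
soup = BeautifulSoup(html, 'html.parser')

found = soup.find('div', {'class': 'service_area'})
missing = soup.find('div', {'class': 'no_such_class'})  # hypothetical class name

print(type(found).__name__)  # Tag
print(missing)               # None

# Check for None before chaining further calls
link_text = missing.find('a').text if missing is not None else ''
print(repr(link_text))       # ''
```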

first_a_all = top_right.find_all('a')
print(first_a_all)
[<a class="link_set" data-clk="top.mkhome" href="https://help.naver.com/support/welcomePage/guide.help" id="NM_set_home_btn">네이버를 시작페이지로</a>, <a class="link_jrnaver" data-clk="top.jrnaver" href="https://jr.naver.com"><i class="ico_jrnaver"></i><span class="blind">쥬니어네이버</span></a>, <a class="link_happybin" data-clk="top.happybean" href="https://happybean.naver.com"><i class="ico_happybin"></i><span class="blind">해피빈</span></a>]
print(type(first_a_all))
# <class 'bs4.element.ResultSet'>

for a_tag in first_a_all:
    print(a_tag)
<a class="link_set" data-clk="top.mkhome" href="https://help.naver.com/support/welcomePage/guide.help" id="NM_set_home_btn">네이버를 시작페이지로</a>
<a class="link_jrnaver" data-clk="top.jrnaver" href="https://jr.naver.com"><i class="ico_jrnaver"></i><span class="blind">쥬니어네이버</span></a>
<a class="link_happybin" data-clk="top.happybean" href="https://happybean.naver.com"><i class="ico_happybin"></i><span class="blind">해피빈</span></a>
result = [a_tag.text for a_tag in first_a_all]
# a_tag.get_text() can be used instead of a_tag.text with the same result
print(result)
# ['네이버를 시작페이지로', '쥬니어네이버', '해피빈']
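
The same list comprehension can collect each link's URL alongside its text. This is a sketch on a trimmed copy of the markup printed above (data-clk attributes dropped for brevity); get('href') is used because it returns None for a missing attribute instead of raising KeyError.

```python
from bs4 import BeautifulSoup

html = '''<div class="service_area">
<a class="link_set" href="https://help.naver.com/support/welcomePage/guide.help" id="NM_set_home_btn">네이버를 시작페이지로</a>
<a class="link_jrnaver" href="https://jr.naver.com"><i class="ico_jrnaver"></i><span class="blind">쥬니어네이버</span></a>
<a class="link_happybin" href="https://happybean.naver.com"><i class="ico_happybin"></i><span class="blind">해피빈</span></a>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
links = soup.find('div', {'class': 'service_area'}).find_all('a')

# (text, href) pairs; strip=True trims surrounding whitespace from the text
pairs = [(a.get_text(strip=True), a.get('href')) for a in links]
for text, href in pairs:
    print(text, href)

print(links[0].get('title'))  # None, because the tag has no title attribute
```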

 

 

▷ Extracting the Naver menu

ul = bsObj.find('ul', {'class':'type_fix'})  # {'class':'list_nav'} also works
# find() returns only the first matching <ul>
print(ul)
<ul class="list_nav type_fix">
<li class="nav_item">
<a class="nav" data-clk="svc.mail" href="https://mail.naver.com/"><i class="ico_mail"></i>메일</a>
</li>
<li class="nav_item"><a class="nav" data-clk="svc.cafe" href="https://section.cafe.naver.com/">카페</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.blog" href="https://section.blog.naver.com/">블로그</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.kin" href="https://kin.naver.com/">지식iN</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.shopping" href="https://shopping.naver.com/">쇼핑</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.pay" href="https://order.pay.naver.com/home">Pay</a></li>
<li class="nav_item">
<a class="nav" data-clk="svc.tvcast" href="https://tv.naver.com/"><i class="ico_tv"></i>TV</a>
</li>
</ul>
lis = ul.find_all('li')
print(lis)
[<li class="nav_item">
<a class="nav" data-clk="svc.mail" href="https://mail.naver.com/"><i class="ico_mail"></i>메일</a>
</li>, <li class="nav_item"><a class="nav" data-clk="svc.cafe" href="https://section.cafe.naver.com/">카페</a></li>, <li class="nav_item"><a class="nav" data-clk="svc.blog" href="https://section.blog.naver.com/">블로그</a></li>, <li class="nav_item"><a class="nav" data-clk="svc.kin" href="https://kin.naver.com/">지식iN</a></li>, <li class="nav_item"><a class="nav" data-clk="svc.shopping" href="https://shopping.naver.com/">쇼핑</a></li>, <li class="nav_item"><a class="nav" data-clk="svc.pay" href="https://order.pay.naver.com/home">Pay</a></li>, <li class="nav_item">
<a class="nav" data-clk="svc.tvcast" href="https://tv.naver.com/"><i class="ico_tv"></i>TV</a>
</li>]
print(len(lis))  # 7
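
Either class name works in the find('ul', ...) call because class is a multi-valued attribute: BeautifulSoup stores it as a list and matches a filter against each value. A minimal sketch on a shortened copy of the menu markup (the trimming is mine); class_ is the keyword form of the same filter.

```python
from bs4 import BeautifulSoup

html = '''
<ul class="list_nav type_fix"><li class="nav_item"><a class="nav" href="https://mail.naver.com/">메일</a></li></ul>
<ul class="list_nav NM_FAVORITE_LIST"><li class="nav_item"><a class="nav" href="https://dict.naver.com/">사전</a></li></ul>
'''
soup = BeautifulSoup(html, 'html.parser')

by_fix = soup.find('ul', {'class': 'type_fix'})
by_nav = soup.find('ul', class_='list_nav')  # keyword form of the same filter

print(by_fix['class'])   # ['list_nav', 'type_fix']
print(by_fix is by_nav)  # True: both calls return the first matching <ul>
print(len(soup.find_all('ul', class_='list_nav')))  # 2
```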

for li in lis:
    a_tag = li.find('a')
    print(a_tag.text)
    

'''
메일
카페
블로그
지식iN
쇼핑
Pay
TV
'''
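
As an alternative to the find()/find_all() loop above, the same names can be collected with a CSS selector through select(). This is a different API from the one used in this post, shown only as a sketch on a shortened copy of the menu markup.

```python
from bs4 import BeautifulSoup

html = '''<div class="group_nav">
<ul class="list_nav type_fix">
<li class="nav_item"><a class="nav" href="https://mail.naver.com/"><i class="ico_mail"></i>메일</a></li>
<li class="nav_item"><a class="nav" href="https://section.cafe.naver.com/">카페</a></li>
</ul>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

# "ul.type_fix li.nav_item > a.nav": <a class="nav"> directly inside each menu item
names = [a.get_text(strip=True) for a in soup.select('ul.type_fix li.nav_item > a.nav')]
print(names)  # ['메일', '카페']
```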
group_ul = bsObj.find('div', {'class':'group_nav'})
ul_all = group_ul.find_all('ul')
print(ul_all)
[<ul class="list_nav type_fix">
<li class="nav_item">
<a class="nav" data-clk="svc.mail" href="https://mail.naver.com/"><i class="ico_mail"></i>메일</a>
</li>
<li class="nav_item"><a class="nav" data-clk="svc.cafe" href="https://section.cafe.naver.com/">카페</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.blog" href="https://section.blog.naver.com/">블로그</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.kin" href="https://kin.naver.com/">지식iN</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.shopping" href="https://shopping.naver.com/">쇼핑</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.pay" href="https://order.pay.naver.com/home">Pay</a></li>
<li class="nav_item">
<a class="nav" data-clk="svc.tvcast" href="https://tv.naver.com/"><i class="ico_tv"></i>TV</a>
</li>
</ul>, <ul class="list_nav NM_FAVORITE_LIST">
<li class="nav_item"><a class="nav" data-clk="svc.dic" href="https://dict.naver.com/">사전</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.news" href="https://news.naver.com/">뉴스</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.stock" href="https://finance.naver.com/">증권</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.land" href="https://land.naver.com/">부동산</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.map" href="https://map.naver.com/">지도</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.movie" href="https://movie.naver.com/">영화</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.vibe" href="https://vibe.naver.com/">VIBE</a>
<li class="nav_item"><a class="nav" data-clk="svc.book" href="https://book.naver.com/">책</a></li>
<li class="nav_item"><a class="nav" data-clk="svc.webtoon" href="https://comic.naver.com/">웹툰</a></li>
</li></ul>, <ul class="list_nav type_empty" style="display: none;"></ul>]
for ul_tag in ul_all:
    for li in ul_tag.find_all('li'):
        a_tag = li.find('a')  # look the link up once, then reuse it
        print(a_tag.text, a_tag['href'])
메일 https://mail.naver.com/
카페 https://section.cafe.naver.com/
블로그 https://section.blog.naver.com/
지식iN https://kin.naver.com/
쇼핑 https://shopping.naver.com/
Pay https://order.pay.naver.com/home
TV https://tv.naver.com/
사전 https://dict.naver.com/
뉴스 https://news.naver.com/
증권 https://finance.naver.com/
부동산 https://land.naver.com/
지도 https://map.naver.com/
영화 https://movie.naver.com/
VIBE https://vibe.naver.com/
책 https://book.naver.com/
웹툰 https://comic.naver.com/
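
In the printed ul_all above, the VIBE <li> has no closing tag of its own, so html.parser nests the 책 and 웹툰 items inside it. The loop still prints all of them because find_all() searches descendants at every depth by default. A minimal sketch on a shortened copy of that markup (the trimming is mine):

```python
from bs4 import BeautifulSoup

# Shortened copy of the favorites list; the VIBE <li> is never closed
html = ('<ul class="list_nav NM_FAVORITE_LIST">'
        '<li class="nav_item"><a class="nav" href="https://vibe.naver.com/">VIBE</a>'
        '<li class="nav_item"><a class="nav" href="https://book.naver.com/">책</a></li>'
        '<li class="nav_item"><a class="nav" href="https://comic.naver.com/">웹툰</a></li>'
        '</ul>')
ul = BeautifulSoup(html, 'html.parser').find('ul')

all_items = ul.find_all('li')                   # searches every depth
top_items = ul.find_all('li', recursive=False)  # direct children only

print(len(all_items), len(top_items))
print([li.find('a').text for li in all_items])
```

Passing recursive=False would skip the nested items without any warning, so the default is the safer choice for this page.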

 

[Reference] 한입에 크롤링 (Crawling in One Bite) - Kyeongrok Kim

 
