
Exploring Python Web Crawling #10

by good4me 2020. 11. 14.

goodthings4me.tistory.com

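■ Python crawling practice - Naver Webtoon

Use requests and BeautifulSoup to crawl the episode list of a Naver Webtoon title (m.comic.naver.com), then use the urllib, os, and datetime modules to build the directory and file names and save the images.

Open the Naver Webtoon title you want to download, then:

Split the full URL (get_url) into the base URL and its query parameters, and store them as url and a params dict.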
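
The split described above can also be done programmatically instead of by hand. The following is a minimal sketch using only the standard library; the helper name split_url and the sample get_url value are assumptions for illustration (the query string is rebuilt from the params used later in this post).

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl


def split_url(get_url):
    """Split a full URL into (base URL without the query, params dict)."""
    parts = urlsplit(get_url)
    # Rebuild the URL with an empty query string and an empty fragment
    base_url = urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))
    # parse_qsl returns (key, value) pairs; every value is a string
    params = dict(parse_qsl(parts.query))
    return base_url, params


# Assumed example URL for illustration
get_url = 'https://m.comic.naver.com/webtoon/list.nhn?titleId=746858&sortOrder=ASC'
url, params = split_url(get_url)
print(url)
print(params)
```

Note that the parsed values are strings ('746858'), whereas the code below writes titleId as an integer. requests.get() accepts both forms in params.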

Build a_tag_list (a list) with requests.get() and soup.select().
* soup.select() returns a list
* In tag.select(), the dot (.) in detail.nhn is escaped as \. because the attribute value is written without quotes
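
To see the two notes above without touching the network, here is a small sketch that runs select() on an inline HTML fragment. The fragment is made up for illustration and only mimics the class names used in this post; it is not the real page markup. It also shows a quoted attribute value as an alternative that, as far as I know, needs no escaping.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, loosely modelled on the selectors in this post
html = '''
<ul class="section_episode_list">
  <li class="item"><a href="/webtoon/detail.nhn?no=1"><strong class="name">Episode 1</strong></a></li>
  <li class="item"><a href="/webtoon/list.nhn?page=2">Next page</a></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# Unquoted attribute value: the dot must be escaped, and a raw string
# keeps the backslash intact on the Python side
escaped = soup.select(r'a[href*=detail\.nhn]')

# Quoted attribute value: no escaping is needed
quoted = soup.select('a[href*="detail.nhn"]')

print(isinstance(escaped, list))
print([a['href'] for a in escaped])
print([a.find(class_='name').text for a in quoted])
```

Both selectors match only the first link, because the second href does not contain detail.nhn.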

To get a complete URL, join url and the <a href> value with urljoin().
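
As a quick illustration of how urljoin() resolves different kinds of href values against the list page URL, consider the sketch below. The three href strings are assumed examples, not values copied from the page.

```python
from urllib.parse import urljoin

url = 'https://m.comic.naver.com/webtoon/list.nhn'

# Assumed href values for illustration
href_root_relative = '/webtoon/detail.nhn?titleId=746858&no=1'
href_relative = 'detail.nhn?titleId=746858&no=1'
href_absolute = 'https://example.com/other/page'

# A root-relative href replaces the whole path of the base URL
joined_root = urljoin(url, href_root_relative)
# A relative href replaces only the last path segment (list.nhn)
joined_relative = urljoin(url, href_relative)
# An absolute href is returned unchanged
joined_absolute = urljoin(url, href_absolute)

print(joined_root)
print(joined_relative)
print(joined_absolute)
```

The first two calls produce the same episode URL, which is why urljoin() is safer than plain string concatenation when the form of the href is not known in advance.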

To extract the episode title, call find() on each a_tag to locate the element whose class attribute is name.

Append each episode URL to the list (ep_url_list).
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import os
from datetime import datetime

url = 'https://m.comic.naver.com/webtoon/list.nhn'
params = {
    'titleId': 746858,  # 이번 생도 잘 부탁해 (webtoon title)
    'sortOrder': 'ASC',
}

html = requests.get(url, params=params).text
soup = BeautifulSoup(html, 'html.parser')

webtoon_title = soup.select('#ct .info_front .title')[0].text.strip()
ep_url_list = []
for tag in soup.select('.section_episode_list .item'):
    # Raw string so that \. reaches the CSS selector as an escaped dot
    a_tag_list = tag.select(r'a[href*=detail\.nhn]')
    for a_tag in a_tag_list:
        # print(a_tag)
        ep_url = urljoin(url, a_tag['href'])  # absolute episode URL
        ep_title = a_tag.find(class_='name').text  # alternative: a_tag.find('strong').text
        # print(ep_url, ep_title)
        ep_url_list.append(ep_url)


for ep in ep_url_list:
    print(ep)
    
[Output]

https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=1&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=2&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=3&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=4&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=5&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=6&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=7&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=8&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=9&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=10&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=11&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=12&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=13&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=14&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=15&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=16&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=17&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=18&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=19&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=20&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=21&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=22&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=23&week=sun&listSortOrder=ASC&listPage=1
https://m.comic.naver.com/webtoon/detail.nhn?titleId=746858&no=24&week=sun&listSortOrder=ASC&listPage=1

 


 

※ Image download (continuing from the code above)

cnt = 0
for ep_url in ep_url_list:

    request_headers = {
        # Adjacent string literals are concatenated, so no stray spaces end up in the header
        'User-Agent': ('Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'),
        'Referer': ep_url,
    }
    
    ymdhms = datetime.today().strftime('%Y%m%d%H%M%S')  # timestamp prefix for the file name
    
    html = requests.get(ep_url).text
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.select('#ct p img[data-src]'):
        img_url = tag['data-src']
        # Request the image with the headers defined above
        img_data = requests.get(img_url, headers=request_headers).content
        # img_name = img_url.split('/')[6]
        img_name = os.path.basename(img_url)
        img_path = os.path.join(webtoon_title, ymdhms + '_' + img_name)

        dir_path = os.path.dirname(img_path)  # parent directory
        if not os.path.exists(dir_path):  # create dir_path if it does not exist
            os.makedirs(dir_path)

        print(img_path)
        with open(img_path, 'wb') as f:
            f.write(img_data)
        
        
    cnt += 1
    if cnt == 1:  # stop after the first episode
        break


[Output]
이번 생도 잘 부탁해\20201111233304_417b65e75e8468bee5f3e52819fe8683_001.jpg
~
이번 생도 잘 부탁해\20201111233304_417b65e75e8468bee5f3e52819fe8683_054.jpg

## Files for episode 1 downloaded
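
The path-building and saving steps can be pulled out into small helpers and tried without downloading anything. This is only a sketch under stated assumptions: build_img_path and save_image are hypothetical helper names, the image URL and title are placeholders, and the bytes written are fake. Unlike the code above, it takes the file name from the URL path only (so a query string, if present, would not leak into the file name) and uses os.makedirs(..., exist_ok=True) in place of the explicit existence check.

```python
import os
import tempfile
from datetime import datetime
from urllib.parse import urlsplit


def build_img_path(base_dir, webtoon_title, img_url, now=None):
    """Return base_dir/<title>/<YYYYmmddHHMMSS>_<image file name>."""
    now = now or datetime.today()
    ymdhms = now.strftime('%Y%m%d%H%M%S')
    # Use only the path part of the URL, ignoring any query string
    img_name = os.path.basename(urlsplit(img_url).path)
    return os.path.join(base_dir, webtoon_title, ymdhms + '_' + img_name)


def save_image(img_path, img_data):
    """Create the parent directory if needed, then write the bytes."""
    os.makedirs(os.path.dirname(img_path), exist_ok=True)
    with open(img_path, 'wb') as f:
        f.write(img_data)


# Placeholder values for illustration; nothing is downloaded here
with tempfile.TemporaryDirectory() as tmp:
    path = build_img_path(
        tmp,
        'sample_title',
        'https://example.com/a/b/sample_001.jpg?type=q90',
        now=datetime(2020, 11, 11, 23, 33, 4),
    )
    save_image(path, b'fake image bytes')
    saved_name = os.path.basename(path)
    saved_ok = os.path.isfile(path)

print(saved_name)
print(saved_ok)
```

Passing now explicitly makes the file name reproducible, which is handy when testing. In the download loop the default (the current time) would be used instead.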

 

[Reference] askcompany.kr - 크롤링 차근차근 시작하기 (Getting Started with Crawling, Step by Step)

 
