
Downloading Images from a Tistory Blog

by good4me 2022. 5. 10.

goodthings4me.tistory.com

Downloading the images in a Tistory blog with Python crawling: this source code downloads the original images found on an individual blog post page.

 

 

Python code for downloading images from a Tistory blog

 

A program that downloads Tistory blog images together with the post text

[Python source code]

import requests
from bs4 import BeautifulSoup
from urllib import request
from PIL import Image
import os

headers = {
    # The HTTP header name is 'User-Agent'; 'user_agent' would be sent as an unrelated header.
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36'
}

url = 'https://goodthings4me.tistory.com/764'

response = requests.get(url, headers=headers)
if response.ok:
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.select_one('#content > div.inner > div.post-cover > div > h1').text
    print(title)

    # images = soup.find_all('figure', class_='imageblock')
    # images = soup.find_all('figure', class_='imageblock alignCenter')
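    images = soup.find_all('figure')
    print(len(images))

    # Create the output folder up front; urlretrieve() fails if ./img does not exist.
    os.makedirs('./img', exist_ok=True)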
    
    for i in range(len(images)):
        try:
            img_url = images[i].find('span')['data-url']
            print(img_url)

            # Use Pillow's Image to detect the image format
            imageObj = Image.open(requests.get(img_url, stream=True).raw)
            img_format = imageObj.format
            imge_size = imageObj.size
            print(f'img_url: {img_url}')
            print(f'img_format: {img_format}')
            print(f'imge_size: {imge_size}')
            print(f'image_basename): {os.path.basename(img_url)}')
            # Take the extension from the URL's file name; if there is none,
            # fall back to the format Pillow detected, then to 'jpg'.
            image_format = os.path.splitext(os.path.basename(img_url))[1].lstrip('.')
            if not image_format:
                image_format = img_format.lower() if img_format else 'jpg'

            request.urlretrieve(img_url, f'./img/{title[:10].strip()}_{i + 1}.{image_format}')
            print(f'#{i + 1} image downloaded..\n{title[:10].strip()}_{i + 1}.{image_format}')
            
        except Exception as e:
            print(f'Error.. #{i}: {e}')
            continue
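  • Each image file is named with the first part of the blog post title followed by a sequence number.
  • The images sit in <figure> elements. When the search was filtered by the class attribute, the number of images returned was off, because some figures use class='imageblock alignCenter' and others class='imageblock'. To work around this, the code calls find_all() on every <figure> tag and relies on exception handling so that only the figures with a 'data-url' attribute are processed.
  • The image extension (format) is usually part of the 'data-url' value. In case it is missing, pillow's Image.open() is used to detect the format. os.path.basename() extracts the image file name, and its extension becomes the extension of the downloaded file.
  • The download errors caught by the exception handler come from thumbnail images, such as those under 'related posts', that have no 'data-url' attribute.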
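
The naming and extension rules in the list above can be sketched as a small helper. This is only a sketch under stated assumptions, not the code used in this post: the function name build_image_filename and its parameters are hypothetical, and it assumes the extension should come from the URL path first, then from the format reported by Pillow, and finally default to jpg.

```python
import os
from urllib.parse import urlsplit


def build_image_filename(title, index, img_url, detected_format=None):
    """Return '<first 10 chars of title>_<index>.<ext>' (hypothetical helper)."""
    # Look only at the URL path so that a query string is ignored.
    basename = os.path.basename(urlsplit(img_url).path)
    ext = os.path.splitext(basename)[1].lstrip('.').lower()
    if not ext:
        # No extension in the URL: fall back to the detected format, then to 'jpg'.
        ext = detected_format.lower() if detected_format else 'jpg'
    return f'{title[:10].strip()}_{index}.{ext}'


# Illustrative calls with made-up URLs.
print(build_image_filename('Imjado bike trip', 1, 'https://example.com/dn/abc/img.png'))
print(build_image_filename('Imjado bike trip', 2, 'https://example.com/dn/abc/img', 'JPEG'))
```

Splitting the URL first keeps a trailing query string out of the extension, which a plain split('.') on the whole URL would not do.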

 

 

good4me.co.kr

 

[Execution result]

신안군에 가면 임자도 있지요 - 임자도 자전거길 여행 후기
27
https://blog.kakaocdn.net/dn/dVGI9z/btrBDmUvGCQ/kqy1yx7JR42OKPephiZDy1/img.png
img_url: https://blog.kakaocdn.net/dn/dVGI9z/btrBDmUvGCQ/kqy1yx7JR42OKPephiZDy1/img.png
img_format: PNG
imge_size: (1149, 862)  
image_basename): img.png
#1 image downloaded..   
신안군에 가면 임자_1.png
https://blog.kakaocdn.net/dn/KBi26/btrBk8VZWea/Ew0lwBiG0IZN1M2Wq2HxKK/img.png
img_url: https://blog.kakaocdn.net/dn/KBi26/btrBk8VZWea/Ew0lwBiG0IZN1M2Wq2HxKK/img.png
img_format: PNG
imge_size: (518, 500)   
image_basename): img.png
#2 image downloaded..   
신안군에 가면 임자_2.png
https://blog.kakaocdn.net/dn/bggGgp/btrBipEzQf1/FvTdmhbD38r8XmfnkokkiK/img.jpg
img_url: https://blog.kakaocdn.net/dn/bggGgp/btrBipEzQf1/FvTdmhbD38r8XmfnkokkiK/img.jpg
img_format: JPEG        
imge_size: (2016, 1134) 
image_basename): img.jpg
#3 image downloaded..   
신안군에 가면 임자_3.jpg
https://blog.kakaocdn.net/dn/AnBsZ/btrBhBx7mKl/s36xvj1cxSnbuGnGJA8i4K/img.jpg
img_url: https://blog.kakaocdn.net/dn/AnBsZ/btrBhBx7mKl/s36xvj1cxSnbuGnGJA8i4K/img.jpg
img_format: JPEG        
imge_size: (2016, 1134) 
image_basename): img.jpg
#4 image downloaded..   
신안군에 가면 임자_4.jpg
https://blog.kakaocdn.net/dn/tnAms/btrBjIDytFt/4aVrqY8d0whPvbkMoxT0rk/img.jpg
img_url: https://blog.kakaocdn.net/dn/tnAms/btrBjIDytFt/4aVrqY8d0whPvbkMoxT0rk/img.jpg
img_format: JPEG        
imge_size: (2016, 1134) 
image_basename): img.jpg
#5 image downloaded..   
신안군에 가면 임자_5.jpg
https://blog.kakaocdn.net/dn/cOFK7v/btrBiqDCvNU/CBlmkq4PogniZG3knNAes0/img.jpg
img_url: https://blog.kakaocdn.net/dn/cOFK7v/btrBiqDCvNU/CBlmkq4PogniZG3knNAes0/img.jpg
img_format: JPEG       
imge_size: (2016, 1134)
image_basename): img.jpg
#6 image downloaded..
신안군에 가면 임자_6.jpg
https://blog.kakaocdn.net/dn/0ydba/btrBm88Q5NR/dU5yocnglb0qg1naRb0X20/img.jpg
img_url: https://blog.kakaocdn.net/dn/0ydba/btrBm88Q5NR/dU5yocnglb0qg1naRb0X20/img.jpg
img_format: JPEG
imge_size: (2016, 1134)
image_basename): img.jpg
#7 image downloaded..
신안군에 가면 임자_7.jpg
https://blog.kakaocdn.net/dn/sMr0q/btrBjJ3yWfO/KW3QUvdKf1Ous6oPo6PehK/img.jpg
img_url: https://blog.kakaocdn.net/dn/sMr0q/btrBjJ3yWfO/KW3QUvdKf1Ous6oPo6PehK/img.jpg
img_format: JPEG
imge_size: (2016, 1134)
image_basename): img.jpg
#8 image downloaded..
신안군에 가면 임자_8.jpg
https://blog.kakaocdn.net/dn/be4tTF/btrBm8A1lgc/QCoHQ4H9kj0ND6koDv8rOK/img.jpg
img_url: https://blog.kakaocdn.net/dn/be4tTF/btrBm8A1lgc/QCoHQ4H9kj0ND6koDv8rOK/img.jpg
img_format: JPEG
imge_size: (2016, 1134)
image_basename): img.jpg
#9 image downloaded..
신안군에 가면 임자_9.jpg
https://blog.kakaocdn.net/dn/bI8c9U/btrBkDoslEk/K9wtXdEHjE0rFqVehzGVHk/img.jpg
img_url: https://blog.kakaocdn.net/dn/bI8c9U/btrBkDoslEk/K9wtXdEHjE0rFqVehzGVHk/img.jpg
img_format: JPEG
imge_size: (2016, 1134)
image_basename): img.jpg
#10 image downloaded..
신안군에 가면 임자_10.jpg
https://blog.kakaocdn.net/dn/y6FJF/btrBGr1nB2n/1quowUUr3BRvnTXwwkOOYk/img.png
img_url: https://blog.kakaocdn.net/dn/y6FJF/btrBGr1nB2n/1quowUUr3BRvnTXwwkOOYk/img.png
img_format: PNG
imge_size: (1148, 646)
image_basename): img.png
#11 image downloaded..
신안군에 가면 임자_11.png
https://blog.kakaocdn.net/dn/Lw24M/btrBDmUAzH0/OXoQDPWsYZP9KPUEhQy4ZK/img.png
img_url: https://blog.kakaocdn.net/dn/Lw24M/btrBDmUAzH0/OXoQDPWsYZP9KPUEhQy4ZK/img.png
img_format: PNG
imge_size: (1148, 646)
image_basename): img.png
#12 image downloaded..
신안군에 가면 임자_12.png
https://blog.kakaocdn.net/dn/HezEv/btrBEDuLxh6/VTQnLFvUpsXZYJ1ghi4Knk/img.png
img_url: https://blog.kakaocdn.net/dn/HezEv/btrBEDuLxh6/VTQnLFvUpsXZYJ1ghi4Knk/img.png
img_format: PNG
imge_size: (1148, 646)
image_basename): img.png
#13 image downloaded..
신안군에 가면 임자_13.png
https://blog.kakaocdn.net/dn/mHN7T/btrBEDheRKl/vawhKCiD4VOJGALq9JKQPk/img.png
img_url: https://blog.kakaocdn.net/dn/mHN7T/btrBEDheRKl/vawhKCiD4VOJGALq9JKQPk/img.png
img_format: PNG
imge_size: (1148, 646)
image_basename): img.png
#14 image downloaded..
신안군에 가면 임자_14.png
https://blog.kakaocdn.net/dn/Z7YVw/btrBF5jZRBs/gRl03tK4Gfu13MumO8rAs1/img.png
img_url: https://blog.kakaocdn.net/dn/Z7YVw/btrBF5jZRBs/gRl03tK4Gfu13MumO8rAs1/img.png
img_format: PNG
imge_size: (1148, 646)
image_basename): img.png
#15 image downloaded..
신안군에 가면 임자_15.png
https://blog.kakaocdn.net/dn/1mCOx/btrBEEf84H6/OXh6NjfI6it5osfkj4Jcmk/img.png
img_url: https://blog.kakaocdn.net/dn/1mCOx/btrBEEf84H6/OXh6NjfI6it5osfkj4Jcmk/img.png
img_format: PNG
imge_size: (905, 636)
image_basename): img.png
#16 image downloaded..
신안군에 가면 임자_16.png
https://blog.kakaocdn.net/dn/cnhuZB/btrBF5j1VGb/4okmF6e3JvKooll7Fz1lv0/img.png
img_url: https://blog.kakaocdn.net/dn/cnhuZB/btrBF5j1VGb/4okmF6e3JvKooll7Fz1lv0/img.png
img_format: PNG
imge_size: (1148, 646)
image_basename): img.png
#17 image downloaded..
신안군에 가면 임자_17.png
https://blog.kakaocdn.net/dn/bWWb68/btrBEDuNuFm/V7z46F123KfaGTkvMcs8yk/img.png
img_url: https://blog.kakaocdn.net/dn/bWWb68/btrBEDuNuFm/V7z46F123KfaGTkvMcs8yk/img.png
img_format: PNG
imge_size: (1148, 646)
image_basename): img.png
#18 image downloaded..
신안군에 가면 임자_18.png
https://blog.kakaocdn.net/dn/by1X2Q/btrBGNXtC2M/Maovh807ADVATKkltL4OK0/img.png
img_url: https://blog.kakaocdn.net/dn/by1X2Q/btrBGNXtC2M/Maovh807ADVATKkltL4OK0/img.png
img_format: PNG
imge_size: (1148, 646)
image_basename): img.png
#19 image downloaded..
신안군에 가면 임자_19.png
https://blog.kakaocdn.net/dn/A400N/btrBF5YDsmx/PWkuNrqKLzH7T4bVMbRk51/img.png
img_url: https://blog.kakaocdn.net/dn/A400N/btrBF5YDsmx/PWkuNrqKLzH7T4bVMbRk51/img.png
img_format: PNG
imge_size: (1148, 646)
image_basename): img.png
#20 image downloaded..
신안군에 가면 임자_20.png
https://blog.kakaocdn.net/dn/cd8tsa/btrBA1QWXEW/mVXOK9kygcoR8W5Bg8rRnK/img.png
img_url: https://blog.kakaocdn.net/dn/cd8tsa/btrBA1QWXEW/mVXOK9kygcoR8W5Bg8rRnK/img.png
img_format: PNG
imge_size: (1148, 646)
image_basename): img.png
#21 image downloaded..
신안군에 가면 임자_21.png
https://blog.kakaocdn.net/dn/b528yM/btrBFb5ToCv/a2OYF7e3yrQcM9rRYXz7N1/img.png
img_url: https://blog.kakaocdn.net/dn/b528yM/btrBFb5ToCv/a2OYF7e3yrQcM9rRYXz7N1/img.png
img_format: PNG
imge_size: (1148, 646)
image_basename): img.png
#22 image downloaded..
신안군에 가면 임자_22.png
https://blog.kakaocdn.net/dn/QlSLb/btrBGMdd5KZ/1fnBxiOvKfX3rHgBx45BdK/img.png
img_url: https://blog.kakaocdn.net/dn/QlSLb/btrBGMdd5KZ/1fnBxiOvKfX3rHgBx45BdK/img.png
img_format: PNG
imge_size: (1148, 646)
image_basename): img.png
#23 image downloaded..
신안군에 가면 임자_23.png
Error.. #23: 'NoneType' object is not subscriptable
Error.. #24: 'NoneType' object is not subscriptable
Error.. #25: 'NoneType' object is not subscriptable
Error.. #26: 'NoneType' object is not subscriptable
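
The four errors at the end of the log appear because find('span') returns None for a figure without a span, so indexing it with ['data-url'] raises an exception. Note also that the error lines print the zero-based index, while the download lines print the index plus one, which is why #23 shows up in both. As an alternative to catching the exception, a CSS selector can pick only the spans that carry a data-url attribute. The snippet below is a sketch on made-up HTML; the markup and URLs are illustrative assumptions, not copied from the actual page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two post images and one related-post thumbnail.
sample_html = """
<figure class="imageblock alignCenter"><span data-url="https://example.com/a/img.png"><img src="a.png"></span></figure>
<figure class="imageblock"><span data-url="https://example.com/b/img.jpg"><img src="b.jpg"></span></figure>
<figure class="thumbnail"><img src="thumb.jpg"></figure>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# 'figure span[data-url]' matches only spans inside a <figure> that have the attribute.
img_urls = [span['data-url'] for span in soup.select('figure span[data-url]')]

# enumerate(start=1) keeps the printed numbers consistent for every message.
for number, img_url in enumerate(img_urls, start=1):
    print(f'#{number}: {img_url}')
```

With this approach the thumbnail figure is never visited, so no error lines would be printed for it.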

 

 

Checking the folder in Windows Explorer confirms that 23 image files were downloaded.
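
The same check can also be done from Python instead of the file explorer. The following is a minimal sketch; the helper name count_images_by_extension is hypothetical, and it assumes the images were saved to the ./img folder used in the script above.

```python
from collections import Counter
from pathlib import Path


def count_images_by_extension(folder):
    """Count the files in `folder`, grouped by lowercase extension (hypothetical helper)."""
    return Counter(p.suffix.lower() for p in Path(folder).iterdir() if p.is_file())


# Only report if the download folder exists.
if Path('./img').is_dir():
    counts = count_images_by_extension('./img')
    print(sum(counts.values()), dict(counts))
```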

 
