
Python Crawling Practice - BeautifulSoup Documentation #2 (find_all, find, select, etc.)

by good4me 2020. 11. 8.


 

■ Python crawling: BeautifulSoup documentation notes #2

Covers find_all(), the methods that work like find_all() and find(), the CSS-selector methods select() and select_one(), modifying the parse tree, get_text(), encodings, and more.

### find_all()

## find_all() can filter by tag name, attributes, string (text), or any combination of these
from bs4 import BeautifulSoup
import re

html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find_all('b'))  ## find every b tag in the document
# [<b>The Dormouse's story</b>]

## using regular expressions
for tag in soup.find_all(re.compile('^b')):  ## every tag whose name starts with b
    print(tag.name)

# body
# b

for tag in soup.find_all(re.compile('t')):  ## every tag whose name contains t
    print(tag.name)

# html
# title

print(soup.find_all(['a', 'b']))  ## pass a list: every a tag and b tag
#[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, \
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, \
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


## passing find_all() a function that checks whether an attribute exists
## here: tags that have a class attribute but no id attribute

print(soup.p.has_attr('class'))
# True

print(soup.p.has_attr('id'))
# False

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id))  ## matches the p tags
#[<p class="title"><b>The Dormouse's story</b></p>, 
# <p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#and they lived at the bottom of a well.</p>, <p class="story">...</p>]


## a function passed for a specific attribute receives the attribute value, not the tag
def has_lacie(href):
    return href and re.compile('lacie').search(href)  ## match 'lacie' inside href

print(soup.find_all(href=has_lacie))  ## pass the function as the value of href
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]


def not_lacie(href):
    return href and not re.compile('lacie').search(href)

print(soup.find_all(href=not_lacie))  ## hrefs that exist but do not contain 'lacie'
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
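A tag-level filter function can also combine several conditions in one place. A minimal sketch with its own small, hypothetical markup (not the html_doc above):

```python
from bs4 import BeautifulSoup

html = '''<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie">Lacie</a>'''
soup = BeautifulSoup(html, 'html.parser')

def a_with_href_and_id(tag):
    # only a tags that carry both an href and an id attribute
    return tag.name == 'a' and tag.has_attr('href') and tag.has_attr('id')

result = soup.find_all(a_with_href_and_id)
print(result)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
```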


## a closer look at find_all()
## find_all(name, attrs, recursive, string, limit, **kwargs)

print(soup.find_all('title'))  ## every title tag (matching by name; strings are ignored)
# [<title>The Dormouse's story</title>]

print(soup.find_all('p', 'title'))  ## p tags whose class is title
# [<p class="title"><b>The Dormouse's story</b></p>]

print(soup.find_all('a'))
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.find_all(id = 'link2'))  ## tags whose id is link2
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

## tags whose href attribute value matches the regular expression
print(soup.find_all(href=re.compile("elsie")))  
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

print(soup.find_all(id=True))  ## every tag that has an id attribute
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


print(soup.find_all(href = re.compile('elsie'), id = 'link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]



## attribute names that can't be used as keyword arguments: HTML5 data-* attributes
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
#print(data_soup.find_all(data-foo="value"))  ## raises an error
# SyntaxError: keyword can't be an expression

## data-* attributes work when passed in a dict via the attrs argument
print(data_soup.find_all(attrs = {'data-foo':'value'}))
# [<div data-foo="value">foo!</div>]

## the HTML name attribute needs the same treatment, since it collides with find_all()'s name parameter
name_soup = BeautifulSoup('<input name="email"/>', 'html.parser')
print(name_soup.find_all(name="email"))
# []
print(name_soup.find_all(attrs={"name": "email"}))
# [<input name="email"/>]


print(soup.find_all('p', attrs={'class': 'title'}))
# [<p class="title"><b>The Dormouse's story</b></p>]
print(soup.find_all('p', {'class': 'title'}))
# [<p class="title"><b>The Dormouse's story</b></p>]

print(soup.find_all('a', class_='sister'))  ## same as find_all('a', {'class': 'sister'})
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.find_all(class_=re.compile("itl")))  ## tags whose class value contains 'itl'
# [<p class="title"><b>The Dormouse's story</b></p>]


## pass a function as the class value (class is not None and is 6 characters long)
def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

print(soup.find_all(class_=has_six_characters))
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


## multi-valued class attributes
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
print(css_soup.find_all('p', class_='strikeout'))
# [<p class="body strikeout"></p>]

print(css_soup.find_all('p', class_='body'))
# [<p class="body strikeout"></p>]

print(css_soup.find_all('p', class_="body strikeout"))
# [<p class="body strikeout"></p>]

print(css_soup.find_all('p', class_="strikeout body"))  ## exact string match fails: order differs from the markup
# []


## CSS selectors match multiple classes regardless of order
print(css_soup.select('p.strikeout.body'))
# [<p class="body strikeout"></p>]
print(css_soup.select('p.body.strikeout'))
# [<p class="body strikeout"></p>]
print(css_soup.select('p.strikeout'))
# [<p class="body strikeout"></p>]
print(css_soup.select('p.body'))
# [<p class="body strikeout"></p>]



### searching by the string argument instead of tag names
## find_all(name, attrs, recursive, string, limit, **kwargs)
print(soup.find_all(string='Elsie'))
# ['Elsie']

print(soup.find_all(string=['Tillie', 'Elsie', 'Lacie']))
# ['Elsie', 'Lacie', 'Tillie']

print(soup.find_all(string=re.compile('Dormouse')))
# ["The Dormouse's story", "The Dormouse's story"]

print(soup.find_all('a', string='Elsie'))  ## BeautifulSoup 4.4.0 and later
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

print(soup.find_all('a', text='Elsie'))  ## before 4.4.0, the argument was called text
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
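The string argument also accepts a function; unlike a tag filter, it receives each string rather than a tag. A small sketch with hypothetical markup:

```python
from bs4 import BeautifulSoup

html = '<p><b>The Dormouse\'s story</b></p><a id="link1">Elsie</a>'
soup = BeautifulSoup(html, 'html.parser')

def is_the_only_string_within_a_tag(s):
    # True when the string is the sole child of its parent tag
    return s == s.parent.string

result = soup.find_all(string=is_the_only_string_within_a_tag)
print(result)
# ["The Dormouse's story", 'Elsie']
```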


### using the limit argument (works like SQL's LIMIT keyword)
print(soup.find_all('a', limit = 2))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
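find() is effectively find_all() with limit=1, except that it returns the element itself instead of a list. A sketch with hypothetical markup showing the no-match behavior as well:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a id="link1">Elsie</a><a id="link2">Lacie</a></p>',
                     'html.parser')

print(soup.find('a'))               # the first a tag itself
print(soup.find_all('a', limit=1))  # the same tag, inside a one-element list

# no match: find() returns None, while find_all() returns an empty list
print(soup.find('b'))      # None
print(soup.find_all('b'))  # []
```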

### using the recursive argument
## find_all() searches all descendants; pass recursive=False to search only direct children
html_tag = '''
<html>
 <head>
  <title>The Dormouse's story</title>
 </head>
 <body>
  <div>
   <p>test</p>
  </div>
 </body>
</html>'''

soup = BeautifulSoup(html_tag, 'html.parser')
print(soup.html.find_all('title'))
# [<title>The Dormouse's story</title>]
print(soup.html.find_all('title', recursive = False))  ## title is not a direct child of html
# []

print(soup.body.find_all('p'))
# [<p>test</p>]
print(soup.body.find_all('p', recursive = False))  ## p is inside div, not a direct child of body
# []

 


 

### methods that work like find_all() and find()...

## find_parents(name, attrs, string, limit, **kwargs)
## find_parent(name, attrs, string, **kwargs)

from bs4 import BeautifulSoup
import re

html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

a_string = soup.find(string = 'Lacie')
print(a_string)
# Lacie

print(a_string.find_parents('a'))  ## the a-tag ancestors of the string 'Lacie', as a list
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

print(a_string.find_parent('p'))  ## the p-tag ancestor of the string 'Lacie'
#<p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#and they lived at the bottom of a well.</p>


## find_next_siblings() and find_next_sibling()

first_link = soup.a
print(first_link)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(first_link.find_next_siblings('a'))
#[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(first_link.find_next_sibling('a'))
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

first_story_paragraph = soup.find('p', 'story')
print(first_story_paragraph.find_next_sibling('p'))
# <p class="story">...</p>


## find_previous_siblings() and find_previous_sibling()
## find_all_next() and find_next()
## find_all_previous() and find_previous()
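The backward and forward variants listed above mirror the sibling methods already shown. A minimal sketch with hypothetical markup; note that previous siblings come back in reverse document order, and find_next() walks forward through the whole document, not just siblings:

```python
from bs4 import BeautifulSoup

html = ('<p><a id="link1">Elsie</a><a id="link2">Lacie</a>'
        '<a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, 'html.parser')

last_link = soup.find('a', id='link3')
# siblings before link3, nearest first
print([t['id'] for t in last_link.find_previous_siblings('a')])
# ['link2', 'link1']
print(last_link.find_previous_sibling('a')['id'])
# link2

# find_next() continues forward from link1 through the document
first_link = soup.find('a', id='link1')
print(first_link.find_next('a')['id'])
# link2
```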

 

### CSS selectors

## the select() and select_one() methods

from bs4 import BeautifulSoup
import re

html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.select('title'))  ## title tags; returns a list
# [<title>The Dormouse's story</title>]

print(soup.select('p:nth-of-type(3)'))  ## the third p tag among its siblings
# [<p class="story">...</p>]

print(soup.select('body a'))  ## a tags anywhere under body
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('html head title'))  # title tags under head under html
# [<title>The Dormouse's story</title>]

print(soup.select('head > title'))  # title tags that are direct children of head
# [<title>The Dormouse's story</title>]

print(soup.select('p > a:nth-of-type(2)'))  ## the second direct-child a tag of a p tag
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

print(soup.select('p > #link1'))  ## direct children of p whose id is link1
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

print(soup.select('body > a'))  ## a tags directly under body --> none, returns an empty list
# []


## sibling combinators ('+' and '~')
## A + B : selects a B immediately following A
## A ~ B : selects every B that follows A

print(soup.select('#link1 + .sister'))  ## the .sister immediately after #link1
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

print(soup.select('#link1 ~ .sister'))  ## every .sister after #link1
#[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('.sister'))  ## tags whose class is sister
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

## selector[attr~=value] : elements whose attribute contains the value as a whitespace-separated word
print(soup.select('[class ~= sister]'))
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


print(soup.select('#link1'))  ## the tag whose id is link1
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

print(soup.select("a#link2"))  ## an a tag whose id is link2
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

print(soup.select('#link1, #link2'))  ## tags whose id is link1 or link2
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

print(soup.select('a[href]'))  ## a tags that have an href attribute
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href="http://example.com/elsie"]'))  ## a tags whose href equals the value exactly
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

print(soup.select('a[href$="tillie"]'))  ## a tags whose href ends with tillie
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href*=".com/el"]'))  ## a tags whose href contains .com/el
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]


### select_one(): return the first matching tag
print(soup.select_one('.sister'))
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
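The two methods also differ on a miss, which matters when chaining attribute access. A sketch with hypothetical markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="story">...</p>', 'html.parser')

print(soup.select('.nosuch'))       # select() always returns a list: []
print(soup.select_one('.nosuch'))   # select_one() returns None on no match
print(soup.select_one('.story'))    # the first match
# <p class="story">...</p>
```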

 

### modifying the parse tree

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
b_tag = soup.b

b_tag.name = "blockquote"  ## change the tag name
b_tag['class'] = 'verybold'  ## add a class attribute
b_tag['id'] = 1  ## add an id attribute
print(b_tag)
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del b_tag['class']
del b_tag['id']
print(b_tag)
# <blockquote>Extremely bold</blockquote>

print(b_tag.string)
# Extremely bold
b_tag.string = 'New Bold'  ## replace the tag's contents
print(b_tag)
# <blockquote>New Bold</blockquote>


b_tag.append('...')  ## append content; passing a NavigableString() works too
print(b_tag)
# <blockquote>New Bold...</blockquote>


b_tag.extend([' and', ' Boldest'])  ## use extend() to append a list of items
print(b_tag)
# <blockquote>New Bold... and Boldest</blockquote>

print(b_tag.contents)
# ['New Bold', '...', ' and', ' Boldest']
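As noted above, NavigableString objects can be appended explicitly; Comment, a NavigableString subclass, works the same way. A small sketch with hypothetical markup:

```python
from bs4 import BeautifulSoup, NavigableString, Comment

soup = BeautifulSoup('<b></b>', 'html.parser')
tag = soup.b

tag.append(NavigableString(' Hello'))  # same effect as appending a plain str
tag.append(Comment('a comment'))       # rendered as an HTML comment
print(tag)
# <b> Hello<!--a comment--></b>
print(tag.contents)
# [' Hello', 'a comment']
```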


### adding a new tag: new_tag()
soup = BeautifulSoup('<b></b>', 'html.parser')
original_tag = soup.b
new_tag = soup.new_tag('a', href = 'http://example.com')
original_tag.append(new_tag)
print(original_tag)
# <b><a href="http://example.com"></a></b>


### emptying a tag's contents: clear()
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
soup.a.clear()
print(soup.a)
# <a href="http://example.com/"></a>


### extract(): remove a tag or string from the tree (and return it)
## then add the extracted content back with new_tag() and append()
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')

print(soup.a)
# <a href="http://example.com/">I linked to <i>example.com</i></a>
print(soup.i.parent)
# <a href="http://example.com/">I linked to <i>example.com</i></a>

i_tag = soup.i.extract()

print(soup.a)
# <a href="http://example.com/">I linked to </a>

print(i_tag)
# <i>example.com</i>

print(i_tag.parent)
# None
new_tag = soup.new_tag('i', attrs={'class': 'i_cls'})  ## pass class via attrs; new_tag() does not convert the class_ keyword
soup.a.append(new_tag)
print(soup.a)
# <a href="http://example.com/">I linked to <i class="i_cls"></i></a>

soup.i.append('example.com')  ## note the order in which strings are appended
print(soup.a)
# <a href="http://example.com/">I linked to <i class="i_cls">example.com</i></a>

## decompose(): destroy a tag and its contents completely
i_tag = soup.i.decompose()
print(soup.a)
# <a href="http://example.com/">I linked to </a>

print(i_tag)
# None
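Because decompose() returns None, keep a reference to the tag beforehand if you need to check on it afterwards; the .decomposed property (available since Beautiful Soup 4.9) reports whether a tag has been destroyed. A sketch with hypothetical markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a>one <i>two</i></a>', 'html.parser')
i_tag = soup.i      # keep a reference before destroying the tag
i_tag.decompose()   # returns None

print(soup.a)
# <a>one </a>
print(i_tag.decomposed)  # Beautiful Soup 4.9+ only
# True
```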


## insert(): add a string to a tag at a specific position
tag = soup.a
tag.insert(0, 'You, ')
print(soup.a)
# <a href="http://example.com/">You, I linked to </a>

tag.insert(2, '...')
print(soup.a)
# <a href="http://example.com/">You, I linked to ...</a>

i_tag = soup.new_tag('i')  ## create an i tag
i_tag.string = 'example.com'  ## set its string
soup.a.append(i_tag)  ## append the i tag to the a tag
print(soup.a)
# <a href="http://example.com/">You, I linked to ...<i>example.com</i></a>


### replace_with(): replace a page element
new_tag = soup.new_tag('b')
new_tag.string = 'example.net'
soup.a.i.replace_with(new_tag)  ## replace the i tag with the new b tag
print(soup.a)
# <a href="http://example.com/">You, I linked to ...<b>example.net</b></a>


### wrap(): wrap an element in a given tag
soup = BeautifulSoup('<p>I wish I was bold.</p>', 'html.parser')
soup.p.string.wrap(soup.new_tag('b'))  ## wrap the string in a b tag
print(soup)
# <p><b>I wish I was bold.</b></p>

soup.p.wrap(soup.new_tag('div'))  ## wrap the p tag in a div tag
print(soup)
# <div><p><b>I wish I was bold.</b></p></div>
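unwrap() is the inverse of wrap(): it removes the tag but keeps its contents in place. A small sketch with hypothetical markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><b>I wish I was bold.</b></p>', 'html.parser')
soup.b.unwrap()  # drop the b tag, keep its contents inside p
print(soup)
# <p>I wish I was bold.</p>
```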

 

### get_text(): human-readable text

## get_text() returns all the text under a document or tag as a single Unicode string
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')
print(soup.get_text())
# \nI linked to example.com\n

print(soup.get_text('|'))  ## join the extracted text pieces with '|'
# '\nI linked to |example.com|\n'

print(soup.get_text('|', strip = True))  ## join with '|' and strip whitespace and newlines from each piece
# I linked to|example.com

print(soup.i.get_text())
# example.com

str_strip = [text for text in soup.stripped_strings]
print(str_strip)
# ['I linked to', 'example.com']

 

### Encodings

  • Beautiful Soup uses the Unicode, Dammit library to detect a document's encoding and convert the HTML or XML to Unicode
  • The automatically detected encoding is available as the BeautifulSoup object's .original_encoding attribute
  • If you already know the document's encoding, pass it to the BeautifulSoup constructor via the from_encoding argument
  • If you don't know the exact encoding and Unicode, Dammit guesses wrong, rule encodings out with the exclude_encodings argument
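A sketch of from_encoding, assuming we already know the bytes are EUC-KR encoded (the markup here is hypothetical):

```python
from bs4 import BeautifulSoup

data = '<p>안녕하세요</p>'.encode('euc-kr')  # bytes in a known encoding
soup = BeautifulSoup(data, 'html.parser', from_encoding='euc-kr')

print(soup.p.string)
# 안녕하세요
print(soup.original_encoding)  # the encoding that was actually used
```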
markup = "<b>\N{SNOWMAN}</b>"
soup = BeautifulSoup(markup, 'html.parser')
print(soup.b)
# <b>☃</b>
print(soup.encode("utf-8"))
# b'<b>\xe2\x98\x83</b>'
print(soup.decode(pretty_print=True))  ## pretty-printed Unicode output (decode's first argument is pretty_print)
#<b>
# ☃
#</b>

markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"  ## bytes input, so the encoding is detected
soup = BeautifulSoup(markup, 'html.parser')
print(soup.h1)
# <h1>Sacré bleu!</h1>
print(soup.decode(pretty_print=True))
#<h1>
# Sacré bleu!
#</h1>
print(soup.original_encoding)
# utf-8

 

[Reference] Beautiful Soup 4.9.0 documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

 
