Python Crawling Practice - BeautifulSoup Documentation #2 (find_all, find, select, etc.) | Coding Practice / Learning to Code | 2020. 11. 8. 10:25
■ Python crawling: notes on the BeautifulSoup documentation, part 2
Covers find_all(), the methods that work like find_all() and find(), the CSS-selector methods select() and select_one(), modifying the parse tree, get_text(), encodings, and more.
### find_all()
## find_all() can match on a tag name, attributes, a string (text), or any combination of these
from bs4 import BeautifulSoup
import re

html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find_all('b'))  ## every <b> tag in the document
# [<b>The Dormouse's story</b>]

## Using regular expressions
for tag in soup.find_all(re.compile('^b')):  ## every tag whose name starts with 'b'
    print(tag.name)
# body
# b

for tag in soup.find_all(re.compile('t')):  ## every tag whose name contains 't'
    print(tag.name)
# html
# title

print(soup.find_all(['a', 'b']))  ## pass a list: every <a> and <b> tag
# [<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

## Passing find_all() a function that checks for an attribute
## e.g. tags that have a class attribute but no id attribute
print(soup.p.has_attr('class'))  # True
print(soup.p.has_attr('id'))     # False

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id))  ## finds the <p> tags
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>, <p class="story">...</p>]

## When a function is passed for a specific attribute, its argument is the
## attribute value, not the tag
def contains_lacie(href):
    return href and re.compile('lacie').search(href)  ## hrefs containing 'lacie'

print(soup.find_all(href=contains_lacie))  ## pass the function as the href value
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

def not_lacie(href):
    return href and not re.compile('lacie').search(href)  ## hrefs NOT containing 'lacie'

print(soup.find_all(href=not_lacie))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

## A closer look at find_all()
## find_all(name, attrs, recursive, string, limit, **kwargs)
print(soup.find_all('title'))  ## every <title> tag
# [<title>The Dormouse's story</title>]
print(soup.find_all('p', 'title'))  ## every <p> tag whose class is "title"
# [<p class="title"><b>The Dormouse's story</b></p>]
print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.find_all(id='link2'))  ## tags whose id is "link2"
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

## tags whose href attribute matches a regular expression
print(soup.find_all(href=re.compile("elsie")))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

print(soup.find_all(id=True))  ## every tag that has an id attribute
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.find_all(href=re.compile('elsie'), id='link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

## HTML5 data-* attributes can't be used as keyword argument names
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
# print(data_soup.find_all(data-foo="value"))  ## error
# SyntaxError: keyword can't be an expression

## data-* attributes can still be searched by passing them in a dict
## as the attrs argument
print(data_soup.find_all(attrs={'data-foo': 'value'}))
# [<div data-foo="value">foo!</div>]

## The HTML "name" attribute must be searched the same way, because "name"
## is already the first argument of find_all()
name_soup = BeautifulSoup('<input name="email"/>', 'html.parser')
print(name_soup.find_all(name="email"))
# []
print(name_soup.find_all(attrs={"name": "email"}))
# [<input name="email"/>]

print(soup.find_all('p', attrs={'class': 'title'}))
# [<p class="title"><b>The Dormouse's story</b></p>]
print(soup.find_all('p', {'class': 'title'}))
# [<p class="title"><b>The Dormouse's story</b></p>]

print(soup.find_all('a', class_='sister'))  ## same as find_all('a', {'class': 'sister'})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.find_all(class_=re.compile("itl")))  ## tags whose class value contains 'itl'
# [<p class="title"><b>The Dormouse's story</b></p>]

## Passing a function as the class value (class is not None and is 6 characters long)
def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

print(soup.find_all(class_=has_six_characters))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

## Searching tags with multiple class values
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
print(css_soup.find_all('p', class_='strikeout'))
# [<p class="body strikeout"></p>]
print(css_soup.find_all('p', class_='body'))
# [<p class="body strikeout"></p>]
print(css_soup.find_all('p', class_="body strikeout"))
# [<p class="body strikeout"></p>]
print(css_soup.find_all('p', class_="strikeout body"))  ## wrong order, no match
# []

## With a CSS selector the order doesn't matter
print(css_soup.select('p.strikeout.body'))
# [<p class="body strikeout"></p>]
print(css_soup.select('p.body.strikeout'))
# [<p class="body strikeout"></p>]
print(css_soup.select('p.strikeout'))
# [<p class="body strikeout"></p>]
print(css_soup.select('p.body'))
# [<p class="body strikeout"></p>]

### Searching with the string argument instead of a tag name
## find_all(name, attrs, recursive, string, limit, **kwargs)
print(soup.find_all(string='Elsie'))
# ['Elsie']
print(soup.find_all(string=['Tillie', 'Elsie', 'Lacie']))
# ['Elsie', 'Lacie', 'Tillie']
print(soup.find_all(string=re.compile('Dormouse')))
# ["The Dormouse's story", "The Dormouse's story"]
print(soup.find_all('a', string='Elsie'))  ## BeautifulSoup 4.4.0 and later
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
print(soup.find_all('a', text='Elsie'))  ## before 4.4.0 the argument was called text
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

### The limit argument (works like the LIMIT keyword in SQL)
print(soup.find_all('a', limit=2))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

### The recursive argument
## find_all() searches all descendants; pass recursive=False to search
## only direct children
html_tag = '''
<html>
 <head>
  <title>The Dormouse's story</title>
 </head>
 <body>
  <div>
   <p>test</p>
  </div>
 </body>
</html>'''
soup = BeautifulSoup(html_tag, 'html.parser')
print(soup.html.find_all('title'))
# [<title>The Dormouse's story</title>]
print(soup.html.find_all('title', recursive=False))  ## <title> is not a direct child of <html>
# []
print(soup.body.find_all('p'))
# [<p>test</p>]
print(soup.body.find_all('p', recursive=False))
# []
### Methods that work like find_all() and find()
## find_parents(name, attrs, string, limit, **kwargs)
## find_parent(name, attrs, string, **kwargs)
from bs4 import BeautifulSoup
import re

html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

a_string = soup.find(string='Lacie')
print(a_string)
# Lacie
print(a_string.find_parents('a'))  ## <a> ancestors of the string "Lacie", as a list
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
print(a_string.find_parent('p'))  ## the <p> ancestor of the string "Lacie"
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>

## find_next_siblings() and find_next_sibling()
first_link = soup.a
print(first_link)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(first_link.find_next_siblings('a'))
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(first_link.find_next_sibling('a'))
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
first_story_paragraph = soup.find('p', 'story')
print(first_story_paragraph.find_next_sibling('p'))
# <p class="story">...</p>

## find_previous_siblings() and find_previous_sibling()
## find_all_next() and find_next()
## find_all_previous() and find_previous()
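The remaining methods listed above work the same way, just searching in the other direction through the tree. A minimal sketch against the same html_doc (the variable names here are my own, not from the original post):

```python
from bs4 import BeautifulSoup

html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

last_link = soup.find('a', id='link3')
## find_previous_sibling(): the nearest earlier sibling matching the filter
print(last_link.find_previous_sibling('a'))
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

first_link = soup.a
## find_next(): the next matching element anywhere later in the document,
## not just among siblings
print(first_link.find_next('p'))
# <p class="story">...</p>

## find_all_previous(): every matching element parsed before this tag
print(soup.find('p', 'story').find_all_previous('p'))
# [<p class="title"><b>The Dormouse's story</b></p>]
```

Note that find_next()/find_previous() cross tag boundaries, while the sibling variants stay at the same level of the tree.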
### CSS selectors
## The select() and select_one() methods
from bs4 import BeautifulSoup
import re

html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.select('title'))  ## <title> tags, returned as a list
# [<title>The Dormouse's story</title>]
print(soup.select('p:nth-of-type(3)'))  ## the third <p> among its siblings
# [<p class="story">...</p>]
print(soup.select('body a'))  ## <a> tags anywhere below <body>
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select('html head title'))  ## <title> below <head> below <html>
# [<title>The Dormouse's story</title>]
print(soup.select('head > title'))  ## <title> tags that are direct children of <head>
# [<title>The Dormouse's story</title>]
print(soup.select('p > a:nth-of-type(2)'))  ## the second <a> directly under a <p>
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
print(soup.select('p > #link1'))  ## direct child of <p> with id "link1"
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
print(soup.select('body > a'))  ## <a> directly under <body>: none, empty list
# []

## Sibling combinators ('+' and '~')
## A + B : selects the B immediately following A
## A ~ B : selects every B following A
print(soup.select('#link1 + .sister'))  ## the .sister element right after #link1
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
print(soup.select('#link1 ~ .sister'))  ## every .sister element after #link1
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select('.sister'))  ## tags whose class is "sister"
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
## selector[attribute~=value] : matches elements whose attribute value,
## treated as a space-separated word list, contains the given word
print(soup.select('[class ~= sister]'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select('#link1'))  ## the tag whose id is "link1"
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
print(soup.select("a#link2"))  ## <a> tags whose id is "link2"
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
print(soup.select('#link1, #link2'))  ## tags whose id is "link1" or "link2"
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
print(soup.select('a[href]'))  ## <a> tags that have an href attribute
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select('a[href="http://example.com/elsie"]'))  ## href equal to the given value
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
print(soup.select('a[href$="tillie"]'))  ## <a> tags whose href ends with "tillie"
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select('a[href*=".com/el"]'))  ## <a> tags whose href contains ".com/el"
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

### select_one() : returns the first matching tag
print(soup.select_one('.sister'))
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
### Modifying the parse tree
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
b_tag = soup.b
b_tag.name = "blockquote"    ## change the tag name
b_tag['class'] = 'verybold'  ## add a class attribute
b_tag['id'] = 1              ## add an id attribute
print(b_tag)
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del b_tag['class']
del b_tag['id']
print(b_tag)
# <blockquote>Extremely bold</blockquote>

print(b_tag.string)
# Extremely bold
b_tag.string = 'New Bold'
print(b_tag)
# <blockquote>New Bold</blockquote>

b_tag.append('...')  ## append content; passing a NavigableString() also works
print(b_tag)
# <blockquote>New Bold...</blockquote>
b_tag.extend([' and', ' Boldest'])  ## use extend() to append a list
print(b_tag)
# <blockquote>New Bold... and Boldest</blockquote>
print(b_tag.contents)
# ['New Bold', '...', ' and', ' Boldest']

### Adding a new tag: new_tag()
soup = BeautifulSoup('<b></b>', 'html.parser')
original_tag = soup.b
new_tag = soup.new_tag('a', href='http://example.com')
original_tag.append(new_tag)
print(original_tag)
# <b><a href="http://example.com"></a></b>

### Emptying a tag's contents: clear()
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
soup.a.clear()
print(soup.a)
# <a href="http://example.com/"></a>

### extract() : removing a tag or string from the tree
## then adding a tag and its content back with new_tag() and append()
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
print(soup.a)
# <a href="http://example.com/">I linked to <i>example.com</i></a>
print(soup.i.parent)
# <a href="http://example.com/">I linked to <i>example.com</i></a>
i_tag = soup.i.extract()
print(soup.a)
# <a href="http://example.com/">I linked to </a>
print(i_tag)
# <i>example.com</i>
print(i_tag.parent)
# None

## Caution: new_tag() keeps keyword names literally, so class_ here creates an
## attribute actually named "class_"; to set a real class, use attrs={'class': 'i_cls'}
new_tag = soup.new_tag('i', class_='i_cls')
soup.a.append(new_tag)
print(soup.a)
# <a href="http://example.com/">I linked to <i class_="i_cls"></i></a>
soup.i.append('example.com')  ## mind the order when adding the string
print(soup.a)
# <a href="http://example.com/">I linked to <i class_="i_cls">example.com</i></a>

## decompose() : destroy a tag and its contents completely
i_tag = soup.i.decompose()
print(soup.a)
# <a href="http://example.com/">I linked to </a>
print(i_tag)  ## decompose() returns None
# None

## insert() : add a string to a tag at a given position
tag = soup.a
tag.insert(0, 'You, ')
print(soup.a)
# <a href="http://example.com/">You, I linked to </a>
tag.insert(2, '...')
print(soup.a)
# <a href="http://example.com/">You, I linked to ...</a>

i_tag = soup.new_tag('i')     ## create an <i> tag
i_tag.string = 'example.com'  ## set its string
soup.a.append(i_tag)          ## append it to the <a> tag
print(soup.a)
# <a href="http://example.com/">You, I linked to ...<i>example.com</i></a>

### replace_with() : replacing one page element with another
new_tag = soup.new_tag('b')
new_tag.string = 'example.net'
soup.a.i.replace_with(new_tag)  ## replace the <i> tag with the <b> tag
print(soup.a)
# <a href="http://example.com/">You, I linked to ...<b>example.net</b></a>

### wrap() : wrapping an element in another tag
soup = BeautifulSoup('<p>I wish I was bold.</p>', 'html.parser')
soup.p.string.wrap(soup.new_tag('b'))  ## wrap the string in a <b> tag
print(soup)
# <p><b>I wish I was bold.</b></p>
soup.p.wrap(soup.new_tag('div'))  ## wrap the <p> tag in a <div> tag
print(soup)
# <div><p><b>I wish I was bold.</b></p></div>
### get_text() : human-readable text
## get_text() returns all the text in a document or beneath a tag,
## as a single Unicode string
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')
print(soup.get_text())
# \nI linked to example.com\n
print(soup.get_text('|'))  ## join the extracted text pieces with '|'
# \nI linked to |example.com|\n
print(soup.get_text('|', strip=True))  ## join with '|', stripping whitespace and newlines
# I linked to|example.com
print(soup.i.get_text())
# example.com
str_strip = [text for text in soup.stripped_strings]
print(str_strip)
# ['I linked to', 'example.com']
### Encodings
- Beautiful Soup converts incoming HTML or XML to Unicode, using the bundled Unicode, Dammit library to detect the document's encoding
- The auto-detected encoding is exposed as the .original_encoding attribute of the BeautifulSoup object
- If you know the document's encoding in advance, you can pass it to the BeautifulSoup constructor as the from_encoding argument
- If you don't know the exact encoding but know that Unicode, Dammit is guessing wrong, you can rule out the bad guesses with the exclude_encodings argument
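The two constructor arguments above can be sketched as follows. This is a minimal example; the EUC-KR byte string is illustrative, not from the original post:

```python
from bs4 import BeautifulSoup

## A Korean byte string encoded as EUC-KR; without a hint the detector may guess wrong
markup = '<b>안녕하세요</b>'.encode('euc-kr')

## If the document's encoding is known, pass it via from_encoding
soup = BeautifulSoup(markup, 'html.parser', from_encoding='euc-kr')
print(soup.b.string)           # 안녕하세요
print(soup.original_encoding)  ## typically 'euc-kr', the encoding we passed in

## If only the wrong guesses are known, rule them out via exclude_encodings
soup2 = BeautifulSoup(markup, 'html.parser', exclude_encodings=['utf-8'])
print(soup2.original_encoding)  ## whatever the detector picks once utf-8 is excluded
```

With exclude_encodings the result still depends on the detector, so the second soup's original_encoding can vary by environment; the only guarantee is that it is not one of the excluded encodings.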
markup = "<b>\N{SNOWMAN}</b>"
soup = BeautifulSoup(markup, 'html.parser')
print(soup.b)
# <b>☃</b>
print(soup.encode("utf-8"))
# b'<b>\xe2\x98\x83</b>'
print(soup.decode(pretty_print=True))  ## decode() returns a Unicode string (pretty-printed here)
# <b>
#  ☃
# </b>

markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"  ## a UTF-8 byte string
soup = BeautifulSoup(markup, 'html.parser')
print(soup.h1)
# <h1>Sacré bleu!</h1>
print(soup.decode(pretty_print=True))
# <h1>
#  Sacré bleu!
# </h1>
print(soup.original_encoding)  ## bytes input, so the encoding was auto-detected
# utf-8
[Reference] Beautiful Soup 4.9.0 documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/