
[Python] Crawling a website with Scrapy


Crawl a website with scrapy

Written by Balthazar


In this article, we are going to see how to scrape information from a website, in particular, from all pages with a common URL pattern. We will see how to do that with Scrapy, a very powerful, and yet simple, scraping and web-crawling framework.

For example, you might be interested in scraping information about each article of a blog and storing that information in a database. To achieve this, we will see how to implement a simple spider using Scrapy, which will crawl the blog and store the extracted data in a MongoDB database.

We will assume that you have a working MongoDB server, and that you have installed the pymongo and scrapy Python packages, both installable with pip.

If you have never toyed around with Scrapy, you should first read this short tutorial.

First step, identify the URL pattern(s)

In this example, we’ll see how to extract the following information from each blog post:

  • title
  • author
  • tag
  • release date
  • url

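We’re lucky: all posts share the same URL pattern, a four-digit year and a two-digit month followed by a slug (the spider below matches it with r'\d{4}/\d{2}/\w+'). Links to the posts can be found on the paginated listing pages reachable from the site homepage.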

What we need is a spider which will follow all links following this pattern, scrape the required information from the target webpage, validate the data integrity, and populate a MongoDB collection.

Building the spider

We create a Scrapy project, following the instructions from their tutorial. We obtain the following project structure:

├── isbullshit
│   ├──
│   ├──
│   ├──
│   ├──
│   └── spiders
│       ├──
│       ├──
└── scrapy.cfg

We begin by defining, in the project's items module (imported below as isbullshit.items), the item structure which will contain the extracted information:

from scrapy.item import Item, Field

class IsBullshitItem(Item):
    title = Field()
    author = Field()
    tag = Field()
    date = Field()
    link = Field()

Now, let’s implement our spider, in the spiders directory:

# Import paths below are those of the Scrapy release this article targets;
# newer releases expose them as scrapy.spiders and scrapy.linkextractors.LinkExtractor.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from isbullshit.items import IsBullshitItem

class IsBullshitSpider(CrawlSpider):
    name = 'isbullshit'
    start_urls = [''] # urls from which the spider will start crawling
    rules = [
        # r'page/\d+' : regular expression for the paginated listing URLs
        Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True),
        # r'\d{4}/\d{2}/\w+' : regular expression for blog post URLs
        Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\w+']), callback='parse_blogpost'),
    ]

    def parse_blogpost(self, response):
        pass  # extraction logic is defined in the next section

Our spider inherits from CrawlSpider, which “provides a convenient mechanism for following links by defining a set of rules”. More info here.

We then define two simple rules:

  • Follow links whose URL matches r'page/\d+' (the paginated listing pages).
  • Extract information from pages whose URL matches r'\d{4}/\d{2}/\w+' (the blog posts), using the callback method parse_blogpost.
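
To see how the two rules split the site's URLs, here is a minimal sketch using only Python's re module. It assumes the patterns are applied as regular-expression searches against each link and that the first matching rule wins, which is how CrawlSpider treats its rules. The classify helper and the example.com URLs are hypothetical and only serve the illustration.

```python
import re

# The two patterns used in the spider's rules.
PAGINATION = re.compile(r'page/\d+')
BLOGPOST = re.compile(r'\d{4}/\d{2}/\w+')

def classify(url):
    """Return which rule, if any, would pick up this URL (hypothetical helper)."""
    if PAGINATION.search(url):
        return 'follow'
    if BLOGPOST.search(url):
        return 'parse_blogpost'
    return 'ignore'

# Hypothetical URLs, for illustration only.
samples = [
    'http://example.com/page/2',
    'http://example.com/2012/05/some-post',
    'http://example.com/about',
]
results = [classify(url) for url in samples]
for url, outcome in zip(samples, results):
    print(url, '->', outcome)
```

Trying a handful of real URLs from the target site this way is a cheap check that the patterns are neither too loose nor too strict before launching a crawl.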

Extracting the data

To extract the title, author, etc, from the HTML code, we’ll use the scrapy.selector.HtmlXPathSelector object, which uses the libxml2 HTML parser. If you’re not familiar with this object, you should read the XPathSelector documentation.

We’ll now define the extraction logic in the parse_blogpost method (I’ll only define it for the title and tag(s), since the logic is much the same for the other fields):

def parse_blogpost(self, response):
    hxs = HtmlXPathSelector(response)
    item = IsBullshitItem()
    # Extract title
    item['title'] = hxs.select('//header/h1/text()').extract() # XPath selector for title
    # Extract tag(s)
    item['tag'] = hxs.select("//header/div[@class='post-data']/p/a/text()").extract() # XPath selector for tag(s)
    return item

Note: To be sure of the XPath selectors you define, I’d advise you to use Firebug, Firefox Inspect, or equivalent, to inspect the HTML code of a page, and then test the selector in a Scrapy shell. This only works if the data sits at the same position in every page you crawl.
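
The same two selectors can also be tried offline on a saved snippet. The sketch below is not Scrapy's selector: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no text() step, so the text is read from each element instead) and requires well-formed markup. The SAMPLE markup and the extract_fields helper are assumptions made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup shaped like the pages the selectors target.
SAMPLE = """
<html><body>
<header>
  <h1>Some post title</h1>
  <div class="post-data"><p><a href="/tag/python">python</a> <a href="/tag/scrapy">scrapy</a></p></div>
</header>
</body></html>
"""

def extract_fields(markup):
    """Apply ElementTree equivalents of the title and tag selectors."""
    root = ET.fromstring(markup.strip())
    titles = [h1.text for h1 in root.findall(".//header/h1")]
    tags = [a.text for a in root.findall(".//header/div[@class='post-data']/p/a")]
    return {"title": titles, "tag": tags}

fields = extract_fields(SAMPLE)
print(fields)
```

Like extract() in the spider, each field comes back as a list, so a selector that matches nothing yields an empty list instead of an error.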

Store the results in MongoDB

Each time the parse_blogpost method returns an item, we want it to be sent to a pipeline which will validate the data and store everything in our Mongo collection.

First, we need to add a couple of things to the project settings:

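ITEM_PIPELINES = ['isbullshit.pipelines.MongoDBPipeline',]

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017  # MongoDB's default port
MONGODB_DB = "isbullshit"
MONGODB_COLLECTION = "blogposts"  # assumed name; the pipeline below reads this key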

Now that we’ve defined our pipeline, our MongoDB database and collection, we’re just left with the pipeline implementation. We just want to be sure that we do not have any missing data (e.g. a blog post without a title, author, etc.).

Here is our pipelines file (the isbullshit.pipelines module referenced in ITEM_PIPELINES):

import pymongo

from scrapy.exceptions import DropItem
from scrapy.conf import settings
from scrapy import log

# pymongo.Connection and scrapy.conf are the APIs of the releases this article
# targets; recent pymongo releases provide pymongo.MongoClient instead.
class MongoDBPipeline(object):
    def __init__(self):
        # Connect once, when the pipeline is instantiated
        connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        for field in item:
            # here we only check that the field's value is not empty
            # but we could do any crazy validation we want
            if not item[field]:
                raise DropItem("Missing %s of blogpost from %s" % (field, item.get('link')))
        # every populated field passed validation: store the item
        self.collection.insert(dict(item))
        log.msg("Item written to MongoDB database %s/%s" %
                (settings['MONGODB_DB'], settings['MONGODB_COLLECTION']),
                level=log.DEBUG, spider=spider)
        return item
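
The validation step can be exercised without Scrapy or MongoDB. The following is a minimal sketch under stated assumptions: plain dicts stand in for items, a local DropItem class stands in for scrapy.exceptions.DropItem, and the validate function and sample data are hypothetical.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch runs on its own."""

def validate(item):
    """Return the item unchanged, or raise DropItem naming the first empty field."""
    for field, value in item.items():
        if not value:
            raise DropItem("Missing %s of blogpost from %s" % (field, item.get('link')))
    return item

# Hypothetical items: extract() returns lists, so an unmatched selector gives [].
complete = {'title': ['A post'], 'tag': ['python'],
            'link': 'http://example.com/2012/05/a-post'}
incomplete = {'title': [], 'tag': ['python'],
              'link': 'http://example.com/2012/05/untitled'}

validate(complete)  # passes through untouched
try:
    validate(incomplete)
    dropped = None
except DropItem as exc:
    dropped = str(exc)
print(dropped)
```

The check has to look at each field's value; testing the field name alone would never fail, because field names are non-empty strings.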

Release the spider!

Now, all we have to do is change directory to the root of our project and execute

$ scrapy crawl isbullshit

The spider will then follow all links pointing to a blogpost, retrieve the post title, author name, date, etc, validate the extracted data, and store all that in a MongoDB collection if validation went well.

Pretty neat, hm?


This case is pretty simplistic: all URLs have a similar pattern and all links are hard-coded in the HTML; there is no JS involved. In the case where the links you want to reach are generated by JS, you’d probably want to check out Selenium. You could make the spider more complex by adding new rules, or more complicated regular expressions, but I just wanted to demo how Scrapy works, without getting into crazy regex explanations.

Also, be aware that sometimes there’s a thin line between playing with web-scraping and getting into trouble.

Finally, when toying with web-crawling, keep in mind that you might just flood the server with requests, which can sometimes get you IP-blocked :)

Please, don’t be a d*ick.
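
One way to keep a crawl polite is to slow the spider down in the project settings. The setting names below are real Scrapy settings, but the values are illustrative assumptions to tune per site, not recommendations from the original project.

```python
# Politeness settings (values are illustrative assumptions)
ROBOTSTXT_OBEY = True                 # honour the site's robots.txt
DOWNLOAD_DELAY = 2                    # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1    # one request at a time per domain
AUTOTHROTTLE_ENABLED = True           # let Scrapy adapt the delay to server latency
```

A longer delay makes the crawl slower but keeps the load on the target server low, which also lowers the odds of getting IP-blocked.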

See code on Github

The entire code of this project is hosted on Github. Help yourselves!
