
Scrapy advanced


Written by adam

March 16, 2020


Most of the tutorials on Scrapy focus on beginner code, so I decided to put a few tips and tricks together for more advanced Scrapy work.

1: Always try to discover AJAX JSON calls

Using the Network tab in Chrome Developer Tools, scroll through the AJAX calls and inspect each one's Response tab, keeping an eye out for JSON. Finding the AJAX calls gives a much quicker response, because you do not need to load all the HTML, CSS and other static files.
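As a quick illustration, here is a minimal spider that skips the HTML pages entirely and pulls from a JSON endpoint found in the Network tab. The URL, field names and paging key are all hypothetical placeholders; substitute whatever the site's real AJAX call uses.

import json

import scrapy


class ProductsJsonSpider(scrapy.Spider):
    # Hypothetical endpoint and field names; copy the real URL
    # straight from the Network tab's XHR requests.
    name = 'products_json'
    start_urls = ['https://www.example.com/api/products?page=1']

    def parse(self, response):
        # The body is pure JSON, so there is no HTML or CSS to download or parse.
        data = json.loads(response.text)
        for product in data['results']:
            yield {'name': product['name'], 'price': product['price']}

        # Paginate by changing the query parameter instead of following "next" links.
        if data.get('has_next'):
            next_page = data['page'] + 1
            yield scrapy.Request(
                f'https://www.example.com/api/products?page={next_page}',
                callback=self.parse,
            )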

2: Using Postman to investigate HTTP requests

Using Postman we can determine which requests require cookies and headers, and we can write descriptions of each request to build a better idea of how the website is structured.
An example would be deconstructing a website's search bar, as it normally relies on cookies or URL-encoded variables to perform the search.
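As an illustrative sketch, the findings from Postman can then be translated directly into a Scrapy request. The endpoint, form fields, header and cookie name below are all made up for the example; they stand in for whatever Postman showed the site actually needs.

import scrapy


class SearchSpider(scrapy.Spider):
    # Hypothetical spider replaying a search request deconstructed in Postman.
    name = 'site_search'

    def start_requests(self):
        yield scrapy.FormRequest(
            url='https://www.example.com/search',            # endpoint found in Postman
            formdata={'q': 'scrapy', 'page': '1'},           # url-encoded body copied from Postman
            headers={'X-Requested-With': 'XMLHttpRequest'},  # header the site expected
            cookies={'session_id': 'PASTE_VALUE_FROM_POSTMAN'},
            callback=self.parse_results,
        )

    def parse_results(self, response):
        # Selectors are placeholders; adjust to the real markup.
        for row in response.css('div.result'):
            yield {'title': row.css('a::text').get(),
                   'link': row.css('a::attr(href)').get()}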

3: Mongo NoSQL database

Mongo is unstructured, which suits web scraping because we can't always predict what data will come out, and it also lets us attach metadata to each item.
Using Items and a pipeline, inserting data into MongoDB can be done with one generic pipeline.
Mongo pipeline:

                
import pymongo


class MongoDB:
    def __init__(self, mongo_uri, mongo_db, stats):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # Pull the connection details in from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE'),
            stats=crawler.stats,
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert_one replaces the older, deprecated insert()
        self.db['collectionName'].insert_one(dict(item))
        return item

The above code is all you need: no hand-written insert statements, and no worrying about whether every field fits the strict schema a SQL database would impose.
One note: keep each field's type consistent, as that will help with step 5, data processing.
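To wire the pipeline up, it only needs to be registered in settings.py along with the two settings that from_crawler() reads. The module path, database name and URI below are placeholders for your own project:

# settings.py (illustrative values)
ITEM_PIPELINES = {
    'myproject.pipelines.MongoDB': 300,  # assumes the class lives in myproject/pipelines.py
}

MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'scraping'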

4: Web Tokens and how to handle them

CSRF: these tokens appear as hidden variables in forms, and Scrapy's FormRequest.from_response() will handle them.
x-csrf-token: these are sent in the request headers, and Scrapy will not always handle them.
x-xsrf-token: the token is sent in the cookies, and again Scrapy will handle these most of the time.
Both x-csrf-token and x-xsrf-token might need to be handled by hand if Splash or Selenium is not being used; a sketch of how that might look follows below.
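Here is a rough sketch of handling an x-csrf-token manually, assuming a site that embeds the token in a meta tag and expects it back as a request header. The URLs, selector and form fields are hypothetical.

import scrapy


class CsrfSpider(scrapy.Spider):
    # Hypothetical example of extracting a CSRF token and echoing it in a header.
    name = 'csrf_example'
    start_urls = ['https://www.example.com/login']

    def parse(self, response):
        # Pull the token out of the page; the exact selector depends on the site.
        token = response.css('meta[name="csrf-token"]::attr(content)').get()
        yield scrapy.FormRequest(
            url='https://www.example.com/api/search',
            headers={'x-csrf-token': token},   # token sent back as a request header
            formdata={'q': 'scrapy'},
            callback=self.parse_results,
        )

    def parse_results(self, response):
        self.logger.info('Search returned %d bytes', len(response.body))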

5: Data processing

One of the reasons you may want to scrape data is to analyse it, whether for business analytics or for insight into a problem. You have populated your data in MongoDB and now you want to visualise it.

MongoDB has a great querying system, and using matplotlib we can create bar charts, scatter plots and many other views to represent our data and gain insight, in only a few lines of code.
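A minimal sketch of that idea, assuming the pipeline above has been filling a collection called collectionName with items that each carry an author field (both names are just the placeholders used earlier):

import pymongo
import matplotlib.pyplot as plt

# Connect to the same database the pipeline wrote to.
client = pymongo.MongoClient('mongodb://localhost:27017')
collection = client['scraping']['collectionName']

# Let Mongo do the heavy lifting: count items per author with an aggregation.
pipeline = [
    {'$group': {'_id': '$author', 'count': {'$sum': 1}}},
    {'$sort': {'count': -1}},
    {'$limit': 10},
]
results = list(collection.aggregate(pipeline))

authors = [doc['_id'] for doc in results]
counts = [doc['count'] for doc in results]

# A simple bar chart of the top authors in the scraped data.
plt.bar(authors, counts)
plt.xticks(rotation=45, ha='right')
plt.ylabel('Number of scraped items')
plt.title('Top authors in the scraped data')
plt.tight_layout()
plt.show()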
