Most tutorials on Scrapy focus on beginner code, so I decided to put together a few tips and tricks for more advanced Scrapy use.
1: Always try to discover AJAX JSON calls
Using the Network tab in Chrome's developer tools, scroll through the AJAX calls and check the Response tab, keeping an eye out for JSON. Finding these AJAX calls allows for quicker responses, as you do not need to load all the HTML, CSS and other assets.
2: Using Postman to investigate HTTP requests
Using Postman we can determine which requests require cookies and headers, and also write descriptions of each request to build a better idea of the structure of the website.
3: Mongo NoSQL database
Mongo is unstructured, which suits web scraping: we can't always predict what data will come out, and it also allows us to add metadata. Using Items and a pipeline, inserting data into MongoDB can be done with a generic pipeline:
```python
import pymongo


class MongoDB:
    def __init__(self, mongo_uri, mongo_db, stats):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # Pull in information from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE'),
            stats=crawler.stats,
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert_one replaces the deprecated insert() method
        self.db['collectionName'].insert_one(dict(item))
        return item
```
The above code is all you need: no insert statements, and no worrying about whether every field fits the strict schema imposed by SQL databases.
One note: make sure each field keeps a consistent type, as that will help with the data processing in step 5.
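For the pipeline to run, it has to be activated in settings.py. A minimal sketch, assuming the pipeline class lives in myproject/pipelines.py; the URI, database name and priority value are placeholders to adjust:

```python
# settings.py — hypothetical project layout; the module path, URI and
# database name are assumptions, not values from the original post.
ITEM_PIPELINES = {
    "myproject.pipelines.MongoDB": 300,
}
MONGO_URI = "mongodb://localhost:27017"
MONGO_DATABASE = "scrapy_data"
```

The pipeline's from_crawler() reads MONGO_URI and MONGO_DATABASE from these settings, so nothing is hard-coded in the pipeline itself.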
4: Web Tokens and how to handle them
CSRF: these tokens live in hidden form fields, and Scrapy's FormRequest.from_response() will handle them.
x-csrf-token: these are sent in the request headers, and Scrapy will not always handle them.
x-xsrf-token: these tokens are sent in the cookies, and again Scrapy will handle them most of the time.
Both x-xsrf-token and x-csrf-token might need to be handled manually if Splash or Selenium is not being used.
5: Data processing
One of the reasons you may want to scrape data is to analyse it, whether for business analytics or for insight into a problem. You have populated your data in MongoDB and now you want to visualise it.
MongoDB has a great querying system, and with matplotlib we can create bar charts, scatter plots and many more to represent our data and gain insight, in only a few lines of code.