Why do I need this advice?
- Web scraping comes with some facts & limitations that you must be aware of
- A well-informed job description will put your developer in their comfort zone
- Your developer will not be able to fool you
- You will know the limitations in advance, along with ways to work around them
Web Automation Tips for buyers:
The following tips will help you identify the possibilities & limitations of scraping a website.
- Web scrapers use the target website’s source code (HTML) as hooks. Therefore, even a minor change in the HTML might stop your scraper. Ask your professional to use fewer HTML hooks. This way, your web-scraper script or application will keep working most of the time.
- Make sure your professional adheres to a naming convention, indents the code & writes the necessary comments within it. This practice helps both the same developer and any new developer quickly grasp the internal workings of a script.
- Most websites restrict access if the same IP address is used repeatedly to crawl them. If you use sequential IPs, the website might block the whole IP block. Make use of SOCKS or HTTP proxies from different IP blocks. Some websites might require you to use only residential IP addresses. Confirm with your proxy service provider whether they offer residential IP addresses.
- If your web scraper takes care of the target website's resources, chances are it can keep using the same IP address for a longer period of time. Delays between requests play a vital role here.
- Some websites use fingerprinting techniques to identify the same person. Make sure your developer knows how to handle this.
- Clock skew is another technique for identifying the same visitor. Chances are you will never get to this point of scraping, but if you do, you will have to use different computers with different internet service providers to perform your scraping task.
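To make a few of these tips concrete, here is a rough Python sketch of what you might expect from your developer: a single HTML hook, proxies rotated across different IP blocks, and a randomized delay between requests. It uses only the standard library. The proxy addresses are placeholders from reserved documentation ranges, and the `id="price"` hook and all function names are hypothetical, chosen only for illustration.

```python
import itertools
import random
import time
import urllib.request
from html.parser import HTMLParser

# Hypothetical proxies from different IP blocks (placeholder addresses, not real services).
PROXIES = [
    "http://203.0.113.10:8080",
    "http://198.51.100.20:8080",
    "http://192.0.2.30:8080",
]
_proxy_cycle = itertools.cycle(PROXIES)


def next_proxy():
    """Return the next proxy in round-robin order."""
    return next(_proxy_cycle)


def polite_delay(base=2.0, jitter=1.5):
    """Return a randomized pause in seconds, between base and base + jitter."""
    return base + random.uniform(0, jitter)


class PriceParser(HTMLParser):
    """Reads the text of the first element with id="price" (one hook only)."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        # The only hook is the id attribute, not the tag name or page layout.
        if self.price is None and dict(attrs).get("id") == "price":
            self._inside = True

    def handle_data(self, data):
        if self._inside and data.strip():
            self.price = data.strip()
            self._inside = False

    def handle_endtag(self, tag):
        self._inside = False


def extract_price(html):
    """Return the hooked text, or None if the hook is missing from the page."""
    parser = PriceParser()
    parser.feed(html)
    return parser.price


def fetch(url, proxy, timeout=10):
    """Download a page through the given HTTP proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(urls):
    """Fetch each URL through a different proxy, pausing between requests."""
    results = {}
    for url in urls:
        results[url] = extract_price(fetch(url, next_proxy()))
        time.sleep(polite_delay())
    return results
```

Because the parser depends on one `id` attribute only, a redesign of the surrounding page is less likely to break it, and a missing hook returns `None` instead of crashing the script. Fingerprinting and clock skew are not addressed by this sketch.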
My friend Bill Hess of PixelPrivacy has shed more light on the subject. Have a look at it and let me know if you like it.