Open Letter to ScrapingHub on Evaluating a Trial Project

Who is ScrapingHub and what do they do?
ScrapingHub is the company behind the Scrapy framework. The candidate's location determines whether they are hired as a full-time employee or on a contract basis; most employees work remotely. Their job listings page covers all open positions.

Regardless of my reservations about ScrapingHub, I do appreciate them for the following:

  • The Scrapy framework is amazing
  • ScrapingHub Cloud services are really good
  • Providing a reason why they rejected the candidate
  • Prompt communication during the whole process

Open Letter to ScrapingHub

I applied to your job ad for the Python Developer position. Within a week of applying online, I received an email saying that I had to submit a trial project. I completed the project over the weekend and submitted it on the next working day. Within a week I received a reply from my point of contact that the "Trial does not meet the required criteria to progress to the next stage of our hiring process on this occasion." One of my friends applied to the same job ad and was informed that his code was not PEP8 compliant.

ScrapingHub's Review of My Trial Project Submission

When dealing with data that is structured hierarchically, it is best if the call structure follows that hierarchy, because that makes understanding and maintenance easier. Single methods/functions with complex conditionals are difficult to understand and debug. The number of items in the job's output is below the known total, and field coverage is 100%, which is against requirements; taking more time to review the results would have helped with that.

My Reservations

  • When dealing with data that's structured hierarchically it is best if the call structure follows: Completely agree. However, I tried to google at least one example script that would help me here, but was not able to find one. I assumed it would be forgiven due to the extensive comments 🙂 But I take complete responsibility here (see the first sketch after this list for what I understand was expected)
  • Single methods/functions with complex conditionals are difficult to understand and debug.: Agreed, which is why I already mentioned this in the assumptions file and as inline comments in the spider itself. I distributed the logic into pipeline classes for most of the data filtering. Did you check it? While I was googling, I found a GitHub issue on the Scrapy project where it was recommended not to pass the response object into a pipeline class; I noted that in the assumptions file too, and it is the reason the image filtering stayed in the spider class (see the pipeline sketch after this list)
  • The number of items in the job's output is below the known total: That could only be known to you, because you have the database and you set up the website; it was not mentioned anywhere in the HTML. First, in order to know the total number of records I would have had to run the spider and collect all of the data. Second, the sample website listed individual items under multiple categories. Third, perhaps you forgot, it was a test project that is usually completed in 9-10 hours according to you, and you already mentioned that if it takes more than 16 hours, the candidate should stop and submit whatever they have. I completed the trial project within 10 hours, running it multiple times (each run could take up to 10 minutes) locally and on ScrapingHub Cloud (which has a limited number of free credits)
  • field coverage is 100%, which is against requirements: First of all, what is field coverage? And no, it is not mentioned in the requirements as-is; do you want me to make those requirements public?
  • taking more time to review the results would have helped with that.: Alright, then say so: modify your requirements so that they clearly state that for an unpaid trial (you already mentioned this) the candidate should put in as many hours as possible to get complete results. For your information, each run takes up to 10 minutes. I ran the spider multiple times with -L ERROR to check whether the run was clean, and I submitted the project only when it produced no errors
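
To illustrate the first point above, here is a minimal sketch of what I take "the call structure follows the hierarchy" to mean in Scrapy terms: one callback per level of the site (category → listing → item). The URLs, selectors, and field names below are hypothetical, not the ones from the trial project.

```python
import scrapy


class HierarchySketchSpider(scrapy.Spider):
    """One callback per level of the site, so the call structure
    mirrors the data hierarchy: categories -> listings -> items."""

    name = "hierarchy_sketch"
    start_urls = ["https://example.com/"]  # hypothetical site

    def parse(self, response):
        # Level 1: follow every category link.
        for href in response.css("ul.categories a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_category)

    def parse_category(self, response):
        # Level 2: follow every item link inside the category.
        for href in response.css("div.listing a.item::attr(href)").getall():
            yield response.follow(href, callback=self.parse_item)
        # Pagination stays at the same level of the hierarchy.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_category)

    def parse_item(self, response):
        # Level 3: extract the fields of a single item.
        yield {
            "title": response.css("h1::text").get(),
            "price": response.css("span.price::text").get(),
            "image": response.css("img.main::attr(src)").get(),
            "url": response.url,
        }
```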
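
And here is a minimal sketch of the pipeline split I described in the second and third points: dropping incomplete items and de-duplicating items that appear under several categories. Pipeline classes only receive the item and the spider, never the response object, which is why the image check had to stay in the spider. The required field names and the uniqueness key are assumptions; the classes would be enabled through the ITEM_PIPELINES setting.

```python
from scrapy.exceptions import DropItem


class RequiredFieldsPipeline:
    """Drop items that are missing required fields (hypothetical names)."""

    required = ("title", "price", "url")

    def process_item(self, item, spider):
        missing = [field for field in self.required if not item.get(field)]
        if missing:
            raise DropItem(f"Missing fields {missing} in {item.get('url')}")
        return item


class DuplicatesPipeline:
    """Drop items already seen, e.g. the same product listed under
    several categories (assuming the URL works as a unique key)."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        key = item.get("url")
        if key in self.seen:
            raise DropItem(f"Duplicate item: {key}")
        self.seen.add(key)
        return item
```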

Trial Acceptance Suggestions
If you are sincerely looking for good candidates, here are my suggestions:

  • Add a Best Practices section to Scrapy.org's documentation
  • It's good that you require an assumptions file to be submitted; do read it as well
  • Do not say that on average it takes 9-10 hours to complete the project, especially to a person who is going to write Scrapy spider code for the first time
  • The maximum 16 (or 20) hour limit does not work; clearly state that you are expecting a real Scrapy spider comparable to one written by your full-time developers who have spent years with you. This will help those who are hustlers and could write the finest piece of code
  • After submission of the trial project, if you think the candidate can improve or is willing to, give them a choice to either spend more (unpaid) time improving the code quality of the trial project or just let it go

General Suggestions

  • If your hiring process has to reject someone based upon their age or geographical location, at least do not waste their time: either do not reply at all, like the rest, or say it plainly in the email
  • Clearly mention the geographical locations in which you are hiring for full-time or ongoing contract jobs. Reference: ScrapingHub reviews on Glassdoor