Open Letter to ScrapingHub for Evaluating Trial Project

Who is ScrapingHub and what do they do?
ScrapingHub is the company behind the Scrapy framework. A candidate’s location determines whether they are hired as a full-time employee or on a contract basis; most employees work remotely. Their job listing page covers all open positions.

Regardless of my reservations about ScrapingHub, I do appreciate them for the following:

  • Scrapy framework is amazing
  • ScrapingHub Cloud services are really good
  • Providing the reason why a candidate was rejected
  • Prompt communication during the whole process

Open Letter to ScrapingHub

I applied to your job ad for the Python Developer position. Within a week of applying online, I received an email saying that I had to submit a trial project. I completed the project over the weekend and submitted it on the next working day. Within a week I received a reply from my point of contact that “Trial does not meet the required criteria to progress to the next stage of our hiring process on this occasion.” A friend of mine who applied to the same job ad was informed that his code was not PEP8 compliant.

Review from ScrapingHub of my Trial Project submission

When dealing with data that’s structured hierarchically it is best if the call structure follows, because that makes understanding and maintenance easier. Single methods/functions with complex conditionals are difficult to understand and debug. The number of items in the job’s output is below the known total, and field coverage is 100%, which is against requirements; taking more time to review the results would have helped with that.

My reservations

  • When dealing with data that’s structured hierarchically it is best if the call structure follows: Completely agree. However, I tried to google at least one script that would help me here, but was not able to find one. I assumed it would be forgiven due to extensive comments 🙂 But I take complete responsibility here
  • Single methods/functions with complex conditionals are difficult to understand and debug.: Agreed, which is why I already mentioned this in the assumptions file and as inline comments in the spider itself. I did distribute the logic across the pipeline classes for most of the data filtering. Did you guys check it? While googling, I found a GitHub issue page for the Scrapy project where it was recommended not to send the response object to a Pipeline class. I noted that in the assumptions file too; that is the reason the image was filtered in the spider class
  • The number of items in the job’s output is below the known total: That could only be known to you guys, because you have the database and you set up the website; it was not mentioned anywhere in the HTML. First, in order to know the total number of records I have to run the spider to collect the whole data. Second, the sample website had individual items within multiple categories. Third, perhaps you guys forgot, it was a test project that is usually completed in 9-10 hours according to you, and you already mentioned that if it takes more than 16 hours, the candidate should stop and submit whatever he has. I completed the trial project within 10 hours, running it multiple times (each run could take up to 10 minutes) locally and on ScrapingHub Cloud (which has a limited number of free credits)
  • field coverage is 100%, which is against requirements: First of all, what is field coverage? And no, it is not mentioned in the requirements as-is. Do you guys want me to make those requirements public?
  • taking more time to review the results would have helped with that.: Alright, then say it: modify your requirements so that they clearly state that for a non-paid trial (you guys already mentioned this) the candidate should put in as many hours as possible to find complete results. For your information, each execution takes up to 10 minutes. I ran the spider multiple times with -L ERROR to see if the result was complete, and I submitted the project only when it had no errors
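
The review talks about item count and field coverage without defining either. Below is a minimal sketch of how both could be checked on a job's output before submitting. It assumes that "field coverage" means the share of scraped items in which a field is non-empty (an assumption; the letter does not define the term). The function names and sample items are hypothetical and not part of Scrapy.

```python
from collections import Counter


def field_coverage(items, fields):
    """Return {field: fraction of items where the field is non-empty}."""
    total = len(items)
    filled = Counter()
    for item in items:
        for field in fields:
            # Treat missing keys, None and empty strings/containers as "not covered".
            if item.get(field) not in (None, "", [], {}):
                filled[field] += 1
    return {field: (filled[field] / total if total else 0.0) for field in fields}


def summarize(items, fields, expected_total=None):
    """Item count, per-field coverage and, if a total is known, the shortfall."""
    report = {"item_count": len(items), "coverage": field_coverage(items, fields)}
    if expected_total is not None:
        report["missing_items"] = max(expected_total - len(items), 0)
    return report


# Hypothetical items, shaped like the dicts a spider could yield.
sample = [
    {"name": "Chair", "price": "10.00", "image": "chair.jpg"},
    {"name": "Table", "price": "", "image": "table.jpg"},
    {"name": "Lamp", "price": "7.50", "image": None},
    {"name": "Desk", "price": "30.00", "image": "desk.jpg"},
]
print(summarize(sample, ["name", "price", "image"], expected_total=5))
```

Note that expected_total has to come from somewhere outside the spider, which is exactly the point raised above: without a published total, only the coverage part can be checked by the candidate.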

Trial Acceptance Suggestions
If you guys are sincerely looking for a resource, here are my suggestions:

  • Add Best Practices section under Scrapy.org’s documentation section
  • It’s good that you require an assumptions file to be submitted; do read it as well
  • Do not say that on average it takes 9-10 hours to complete the project, especially to a person who is going to write Scrapy spider code for the first time
  • The maximum 16 (or 20) hours limit does not work; clearly mention that you’re expecting a true Scrapy spider comparable to one written by your full-time developers who have spent years with you guys. This will help those who are hustlers and could write the finest piece of code
  • After submission of the trial project, if you guys think the candidate can improve or is willing to, give them a choice to either spend more time improving the code quality of the trial project (non-paid) or just let it go

General Suggestions

  • If your hiring process has to reject someone based on their age or geographical location, at least do not waste their time: either do not reply at all, like the rest, or say it out loud in the email
  • Clearly mention in which geographical locations you are hiring for a full-time or ongoing contract job. Reference: ScrapingHub reviews at Glassdoor

Skype Broadcast Message Tool – Replacement of Babel Fish

SkyMass is a message broadcast tool that sends a message to your Skype contact list. Ever since Babel Fish stopped working, and especially after the launch of Skype ver8, we’ve been struggling to find such a tool. When we were not able to find one, we built one for ourselves.

Features

  • Pause or Resume broadcast
  • A JSON-formatted config file stores your Skype credentials, message and other settings
  • Test before you start your broadcast campaign
  • If your broadcast fails mid-run, upon restarting it will continue from the contact after the last one a message was sent to
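
The resume feature can be pictured with a short sketch. SkyMass itself ships as an executable and its internals are not shown here, so everything below is an assumption for illustration: the id column name in the exported CSV, the idea of keeping a list of already-messaged contacts, and the function names are all hypothetical.

```python
import csv
import io


def load_contact_ids(csv_text, id_column="id"):
    """Read contact ids from exported CSV text (the column name is an assumption)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[id_column] for row in reader if row.get(id_column)]


def pending_contacts(contact_ids, already_sent):
    """Contacts that have not received the message yet, in original order."""
    sent = set(already_sent)
    return [cid for cid in contact_ids if cid not in sent]


def broadcast(contact_ids, already_sent, send):
    """Message each pending contact and record it right after a successful send.

    If send() raises mid-run, already_sent still holds everyone reached so far,
    so calling broadcast() again continues with the next contact.
    """
    for cid in pending_contacts(contact_ids, already_sent):
        send(cid)
        already_sent.append(cid)
    return already_sent


# Hypothetical data: the first run stopped after "alice".
contacts = load_contact_ids("id,display_name\nalice,Alice\nbob,Bob\ncarol,Carol\n")
sent_log = ["alice"]
broadcast(contacts, sent_log, send=lambda cid: print("sending to", cid))
```

In a real tool the sent log would live in a file so that it survives a restart; a plain list keeps the sketch short.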

Pre-Requisites

  • Export your Skype contact list from Skype’s web interface
  • Setup your config.json file
  • Run SkyMass as Administrator

Setup

  • Unzip skymass.zip anywhere on your computer
  • Make sure the config.json & contacts.csv files exist within the same directory where skymass.exe resides
  • If you want, set "testing": true in the config.json file

config.json Explained

{
	"username": "YOUR_SKYPE_USERNAME",
	"password": "YOUR_SKYPE_PASSWORD",
	"contact_csv_file": "contacts.csv",
	"message": "YOUR_BROADCAST_MESSAGE",
	"refresh_report_file": false,
	"testing": true,
	"wait_for_skype_open": 290,
	"wait_for_login":20
}
  1. Enter your Skype Username; make sure it’s enclosed in double quotes (") without any spaces
  2. Enter your Skype Password; make sure it’s enclosed in double quotes (") without any spaces
  3. contact_csv_file: Make sure the name of your exported contact (csv) file matches here
  4. message: The message you want to broadcast
  5. refresh_report_file: Set it to true if you want to empty the existing contact-message report file. This causes the message to be sent again every time SkyMass is restarted. It’s better to set it to false; this way you will not be spamming your contacts with the same message again & again
  6. testing: If you want to test the software first, set it to true; this way the software will run but will never send a message to your contacts
  7. wait_for_skype_open: If SkyMass is unable to start Skype, check the logs and increase this timer if required
  8. wait_for_login: If your Skype hangs after login due to a very large contact list, increase it accordingly
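
A typo in config.json (curly quotes, "true" written as a string, a missing key) is easy to make. Here is a minimal sketch of a pre-flight check, assuming the eight keys and value types shown in the sample above. The validate_config helper is hypothetical and not part of SkyMass.

```python
import json

# Key names and types as they appear in the sample config.json above.
EXPECTED_TYPES = {
    "username": str,
    "password": str,
    "contact_csv_file": str,
    "message": str,
    "refresh_report_file": bool,
    "testing": bool,
    "wait_for_skype_open": int,
    "wait_for_login": int,
}


def validate_config(text):
    """Parse config.json text and return a list of problems (empty if none)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in config:
            problems.append(f"missing key: {key}")
            continue
        value = config[key]
        # bool is a subclass of int in Python, so reject it for int fields explicitly.
        wrong_type = not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        )
        if wrong_type:
            problems.append(
                f"{key} should be {expected.__name__}, got {type(value).__name__}"
            )
    return problems


good = """{
    "username": "YOUR_SKYPE_USERNAME",
    "password": "YOUR_SKYPE_PASSWORD",
    "contact_csv_file": "contacts.csv",
    "message": "YOUR_BROADCAST_MESSAGE",
    "refresh_report_file": false,
    "testing": true,
    "wait_for_skype_open": 290,
    "wait_for_login": 20
}"""
print(validate_config(good))
print(validate_config(good.replace("true", '"true"')))
```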

Note:

  • This program searches for a window named Skype, so make sure no other window on the computer is named Skype.
  • This program works with the active Skype window, so you should run it on a machine that can be dedicated to this purpose
  • Do not use the computer during a run, otherwise this software may break and might click controls outside Skype
  • Use this software at your own risk; we cannot be held responsible for any loss of data or software/hardware failures

Update

version 1.1

  • Delays while opening Skype & after login are now configurable

version 1.0

  • Initial Release

Automate Skype Login using AutoIt

Since Microsoft launched Skype ver8.23.0.10, it has not been appreciated by the community, primarily due to its user interface and because it does not respect what Skype users want. Some of these people criticized Microsoft when it acquired GitHub, because they suspect Microsoft will do the same to GitHub as it did to Skype.

Major Changes in Skype 8

  • From an AutoIt automation point of view, we previously logged in to Skype using the _IEAttach() function, while now we log in using _UIA_getFirstObjectOfElement from the user-defined AutoIt library IUIAutomation MS framework
  • Another significant difference is that the desktop app no longer offers a way to export your contacts, nor are they stored in c:\Users\WIN_USERNAME\AppData\Roaming\Skype\SKYPE_USERNAME\main.db any longer. You can export your contacts only by logging in to the web interface.
  • The third change is that the Skype program now resides within the Microsoft directory instead of Skype’s own directory, e.g. C:\Program Files (x86)\Microsoft\Skype for Desktop\Skype.exe

Pre-requisites to automate Skype Login

Skype Login Automation AutoIt Script

#include "UIAWrappers.au3"
 
; #INDEX# =======================================================================================================================
; Title .........: Skype Login Automation
; AutoIt Version : 1.0
; Language ......: English
; Description ...: Login to Skype ver 8.23.0.10
; Author ........: Ahmed Shaikh Memon 
; Requirements...: AutoIt v3.3.12, Developed/Tested on Windows 7 Ultimate SP 1
; ===============================================================================================================================
 
; #CONSTANTS# ===================================================================================================================
Global Const $cDocument ="controltype:=Document"
; ===============================================================================================================================
 
; #LOCAL VARIABLES# ============================================================================================================
Local $Username	= "YOUR_SKYPE_USERNAME"
Local $Password	= "YOUR_SKYPE_PASSWORD"
; ===============================================================================================================================
 
;
; Start Skype
;
Local $ProgramFileDir
Switch @OSArch
	Case "X86" ; @OSArch reports "X86" (not "X32") on 32-bit Windows
		$ProgramFileDir = "Program Files"
	Case "X64"
		$ProgramFileDir = "Program Files (x86)"
EndSwitch
$ProgramFileDir = @HomeDrive & "\" & $ProgramFileDir
Run($ProgramFileDir & "\Microsoft\Skype for Desktop\Skype.exe")
Sleep(5000)
 
 
Local $bIsLoggedIn = False
 
; Is Skype running?
If Not WinExists("[Class:Chrome_WidgetWin_1]") Then
	ConsoleWrite("Unable to find Skype")
	Exit
EndIf
 
;
; Get Skype window object
;
Local $oChrome = _UIA_getFirstObjectOfElement($UIA_oDesktop,"class:=Chrome_WidgetWin_1", $treescope_children)
$oChrome.setfocus()
Sleep(1000)
 
;
; Click Proceed to login form
;
Local $oDocument = _UIA_getFirstObjectOfElement($oChrome, "controltype:=" & $UIA_DocumentControlTypeId, $treescope_subtree)
 
Local $oAnotherUser = _UIA_getObjectByFindAll($oDocument, "name:=Use another", $treescope_subtree)
If IsObj($oAnotherUser) Then
	_UIA_action($oAnotherUser,"leftclick")
	$bIsLoggedIn = True
Else
	Local $oSignInWithMS = _UIA_getObjectByFindAll($oDocument, "name:=Sign", $treescope_subtree)
	If IsObj($oSignInWithMS) Then
		_UIA_action($oSignInWithMS,"leftclick")
		$bIsLoggedIn = True
	EndIf
EndIf
 
If Not $bIsLoggedIn Then
	ConsoleWrite("Unable to login")
	Exit
Else
	Sleep(10000)
EndIf
 
;
; Write Username
;
Local $oDocument = _UIA_action($cDocument, "object")
Local $oElement = _UIA_getObjectByFindAll($oDocument, "controltype:=UIA_EditControlTypeId", $treescope_subtree)
 
If Not IsObj($oElement) Then
	ConsoleWrite("Unable to get Username element")
	Exit
EndIf
 
_UIA_action($oElement, "leftclick")
Send("^a")
Send($Username & "{ENTER}")
 
Sleep(5500)
 
;
; Write Password
;
Local $oDocument = _UIA_action($cDocument, "object")
Local $oElement = _UIA_getObjectByFindAll($oDocument, "controltype:=UIA_EditControlTypeId", $treescope_subtree)
 
If Not IsObj($oElement) Then
	ConsoleWrite("Unable to get Password element")
	Exit
EndIf
 
_UIA_action($oElement, "leftclick")
Send("^a")
Send($Password & "{ENTER}")
Sleep(5500)

Check out all the scripts and compile them with AutoIt.

Debugging
We can use _UIA_DumpThemAll($oChrome, $treescope_subtree) to find the name or element type id associated with a particular control on the Skype screen. Download the examples zip from the IUIAutomation MS framework.

Explanation

  • controltype:=Document in Global Const $cDocument is used to get the document object of Skype’s screens, which change during the login process.
  • $ProgramFileDir points to the Program Files directory that holds 32-bit applications on the host computer: Program Files on 32-bit Windows and Program Files (x86) on 64-bit Windows.
  • _UIA_getFirstObjectOfElement gives access to the Skype window by looking it up via its UI Automation class name, Chrome_WidgetWin_1. We found this name using the Inspect tool
  • $oDocument is used to find particular elements in the Skype app. We can list the UI Automation control type, name, index etc. of all available elements using _UIA_DumpThemAll($oChrome, $treescope_subtree) or _UIA_DumpThemAll($oDocument, $treescope_subtree)
  • _UIA_getObjectByFindAll($oDocument, "name:=Use another", $treescope_subtree) returns the element object (if it exists) that has the text “Use another account”. This is displayed on the Skype login screen if and only if you’ve previously logged in to Skype and logged out. Otherwise the screen displays Sign in
    Skype 8 Automation – Use Another Account
    Skype 8 Automation – Sign in with Microsoft
  • _UIA_action($oAnotherUser,"leftclick") triggers a left-click event on the Skype screen; this particular line clicks the “Use another account” link
  • _UIA_getObjectByFindAll($oDocument, "name:=Sign", $treescope_subtree) is added as a fail-safe. This piece of code takes over when no “Use another account” link is visible on the Skype login screen.
  • _UIA_getObjectByFindAll($oDocument, "controltype:=UIA_EditControlTypeId", $treescope_subtree) returns the Username text-field element.
  • _UIA_action($oElement, "leftclick")
    Send("^a")
    Send($Username & "{ENTER}")

    In a perfect world, we could have used just a single line:

    _UIA_action($oElement,"setvalue using keys", $Username & "{ENTER}")

    But setvalue using keys works in the following sequence:

    1. put the element into focus
    2. select all text within the element
    3. send text to the control using Send()

    setvalue using keys does not work as expected on the Skype screen, because as soon as Send() is called the element loses focus, so no text is written to the element in question.

    My three lines of code above, on the other hand, always work, as I am triggering a left-click on the element instead of setting focus.

    Personally I don’t like using Send(), as it requires the window to be active, and the keystrokes may end up in the wrong window; however, it seems necessary until we find a better solution.

Best practices implemented

  • Well-commented code
  • Use of @OSArch & @HomeDrive to build the architecture-specific directory
  • Saved two interactions by sending the {ENTER} keystroke along with the username & password
  • Used the generic condition controltype:=UIA_EditControlTypeId for _UIA_getObjectByFindAll to find the username & password elements. Deliberately did not use the name or title to find them

Windows Desktop Applications Automation using AutoIt

Write AutoIt Script Like a Pro

AutoIt is a language & framework to automate (mainly) Windows GUI application interactions. Although it has its own language, there are wrappers available for Python, C# etc. too.

This write-up targets developers who want to automate Windows desktop applications but either do not know the right tool or want to enhance their skills using AutoIt.

Just like with anything, there is either an informed or an amateur approach to solving a problem. Developers who have never written Windows GUI applications tend to write AutoIt code that depends too much on interactions with the active GUI, whereas we get the most out of AutoIt when we rely less on such interactions. The amateur approach creates real issues when we have to interact with multiple desktop applications within the same AutoIt script.

One solution to these issues is to use the _WinAPI_PostMessage function where applicable. We will primarily use this function to interact with menus; otherwise our interactions may occur on the wrong window.

Intent:

We want to automate Notepad:

  • change display font
  • write some text and
  • save file

Downloads:

GUI Inspection:

In order to control GUI elements we need the window name, the controls’ ClassNameNN & the automation IDs of the Notepad application. If you don’t know how to find them using the AutoIt Window Info and Inspect tools, watch this video (enable captions, if you want).

AutoIt Script

#include "WinAPI.au3"
#include "WindowsConstants.au3"
 
 
; #INDEX# =======================================================================================================================
; Title .........: Windows Desktop Applications Automation using AutoIt
; AutoIt Version : 3.3.14.2
; Description ...: Automate Notepad for changing display font, writing some text and save file
; Author(s) .....: Ahmed Shaikh Memon (ahmedshaikhm)
; ===============================================================================================================================
 
 
; #VARIABLES# ===================================================================================================================
Local $hWnd, $hWndFont, $hWndSaveAs
Global $sFontName = "Arial"
Global $sFileName = "tutorial_1"
 
; #CONSTANTS# ===================================================================================================================
Global Const $__IDM_FONT = 33
Global Const $__IDM_SAVE = 3
Global Const $__IDC_FONT_NAME = 1001
 
; Open Notepad
Run("notepad.exe")
Sleep(1000)
 
; Get Notepad's handle
$hWnd = WinGetHandle("[Class:Notepad]")
 
;
; Display Font Change
;
 
; Click Format > Font menu
_WinAPI_PostMessage($hWnd, $WM_COMMAND, $__IDM_FONT, 0)
Sleep(1000)
 
 
; Handle Font window
$hWndFont = WinGetHandle("Font")
Sleep(1000)
 
; Select display font
ControlSend($hWndFont, "", $__IDC_FONT_NAME, $sFontName)
Sleep(2000)
 
; Save display font
ControlClick($hWndFont, "", "Button5")
Sleep(1000)
 
;
; Write text in Notepad
;
ControlSend($hWnd, "", "Edit1","This is informed approach towards solving a problem")
 
;
; Save file
;
 
; Click File > Save menu
_WinAPI_PostMessage($hWnd, $WM_COMMAND, $__IDM_SAVE, 0)
Sleep(1000)
 
; Enter file name in the Save As dialog box
$hWndSaveAs = WinGetHandle("Save As")
ControlSend($hWndSaveAs, "", "Edit1", $sFileName)
Sleep(1000)
 
; Press the Save button on Save As prompt
ControlClick($hWndSaveAs, "", "Button1")
Sleep(1000)

This code is available as gist here

Explanation:

WinAPI.au3 & WindowsConstants.au3 contain the definitions and constants that help us send a particular command to a window without visible interaction.

The variables $hWnd, $hWndFont & $hWndSaveAs hold the handles of the Notepad application, the Font window & the Save As window, respectively, while the s in $sFontName & $sFileName indicates that they are strings. The advantage of following a naming convention is that everyone working on the same code can make an educated guess about what a variable holds. Another benefit is that even if you open the code after months, you can instantly recognize why a particular variable was used.

$__IDM_FONT & $__IDM_SAVE are the menu IDs for the menu items Font and Save, respectively. We identified them using the Inspect tool. $__IDC_FONT_NAME is the control ID of the text field for the font name.

Run("notepad.exe") is used to open the Notepad application. We can further improve it by using Run("notepad.exe", "", @SW_MINIMIZE), which makes sure that Notepad’s window is minimized when it opens. This is especially useful when an active window would disturb other applications or distract the person using the computer where this AutoIt script is executed.

WinGetHandle() is used so that we can send control actions or PostMessage() to just the related window.

_WinAPI_PostMessage() is a wrapper around the Win32 API function PostMessage(); it clicks the menu item without needing Notepad to be activated and visible. The function definition is within WinAPI.au3, which is why that file is included at the top. WindowsConstants.au3 contains the integer value of the $WM_COMMAND constant.

We’ve used the ControlSend() function with a particular window handle so that the keystrokes are delivered to that specific window, even if (for any reason) it is not active or visible (minimized or something). Send(), by contrast, just sends keystrokes to whichever window is active, so they might land on the wrong target.

Sleep() is called multiple times because it takes some milliseconds for a window to load or close; for example, the Font & Save As windows open after a short delay, and the same applies while the Font window is closing. If we don’t use these delays, the automation might not work properly.
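
Fixed delays are either longer than needed or, on a slow machine, too short. A common alternative is to poll for the condition with a timeout; inside AutoIt the built-in WinWait() function does this for windows. The idea itself is sketched below in Python, since AutoIt wrappers for Python exist. The wait_until helper and its parameters are hypothetical names chosen for illustration.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.1):
    """Poll condition() until it returns True or the timeout expires.

    Returns True as soon as the condition holds, False if it never did
    within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for "has the Font window appeared yet?": true on the third poll.
polls = {"count": 0}


def window_appeared():
    polls["count"] += 1
    return polls["count"] >= 3


print(wait_until(window_appeared, timeout=2.0, interval=0.05))
```

The script then proceeds as soon as the window is there, and a False return gives it a clear place to log the failure instead of clicking into nothing.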

Known Limitations:

The code is as brief as possible; there is room to enhance it and make it more professional:

  • We can open Notepad minimized
  • The Font & Save As windows will still be displayed by default even if Notepad is opened minimized, therefore we can use the WinSetState() function to explicitly minimize these windows
  • If the file already exists, we can add a few lines of code to test whether Notepad is asking to overwrite it and then perform the desired action
  • We can close Notepad after the automation using WinClose()
  • If your script is going to be deployed on a variety of Windows operating system versions and on machines with different hardware specifications, then add checks to see whether the intended action has executed successfully. For instance, to check whether Notepad has really opened after Run("Notepad.exe"), I would add something like:
    If Not WinExists($hWnd) Then
    	ConsoleWrite("Notepad is not opened" & @CRLF)
    EndIf

Code Execution

The following animated GIF shows what happens when you execute this code

Further Reading

Those who are interested in learning more should visit AutoIt’s official wiki & forum or search on StackOverflow; however, make sure that you come up with an intelligent (specific) question

Check out the Automate Skype Login using AutoIt post for a more practical example.

Credits

  1. Matt pointed me towards the use of PostMessage() menu IDs
  2. The AutoIt Forums have a bunch of helpful guys who respond promptly

6 Tips Before You Hire Web Scraping Services


Why do I need this advice?

  • There are some facts & limitations of web scraping that you must be aware of
  • An educated & informed job description will put your developer in their comfort zone
  • Your developer will not be able to fool you
  • You will know in advance the limitations & the ways to work around them

Web Automation Tips for buyers:

The following tips will help you identify the possibilities & limitations of scraping a website.

  1. Web scrapers use the target website’s source code (HTML) as hooks. Therefore even a minor change in the HTML might stop your scraper. Ask your professional to use fewer HTML hooks. This way, your web-scraper script or application will keep working (most of the time)
  2. Make sure the professional adheres to a naming convention, indents code & writes the necessary comments within the code. This practice helps both the same developer and a new one quickly grasp the internal workings of a script.
  3. Most websites restrict access if the same IP address is used to crawl their website. If you use sequential IPs, the website might block the whole IP block. Make use of SOCKS or HTTP proxies from different IP blocks. Some websites might require residential IP addresses. Confirm with your proxy service provider whether they offer residential IP addresses
  4. If your web scraper goes easy on the target website’s resources, chances are it can use the same IP address for a longer period of time. Delays play a vital role here
  5. Some websites use fingerprinting techniques to identify the same person; make sure your developer knows how to handle this.
  6. Clock skew is another technique to identify the same visitor. Chances are you will never get to this point of scraping, but if you do, you will have to use different computers with different internet service providers to perform your scraping task
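
Tips 3 and 4 can be made concrete with a small sketch: rotate through proxies from different IP blocks and wait a randomized delay between requests. It only plans the requests and does not fetch anything. The function names, the URLs and the proxy addresses (taken from the reserved documentation IP ranges) are hypothetical placeholders.

```python
import itertools
import random


def polite_delay(base_seconds=2.0, jitter=0.5, rng=random):
    """Delay of base_seconds scaled by a random factor in [1 - jitter, 1 + jitter]."""
    return base_seconds * rng.uniform(1.0 - jitter, 1.0 + jitter)


def plan_requests(urls, proxies, base_seconds=2.0, jitter=0.5, rng=random):
    """Pair every URL with the next proxy (round-robin) and a randomized delay."""
    proxy_cycle = itertools.cycle(proxies)
    return [
        (url, next(proxy_cycle), polite_delay(base_seconds, jitter, rng))
        for url in urls
    ]


# Placeholder proxies, one per IP block, so consecutive requests differ in origin.
proxies = [
    "http://192.0.2.10:8080",
    "http://198.51.100.20:8080",
    "http://203.0.113.30:8080",
]
urls = [f"https://example.com/page/{n}" for n in range(1, 6)]

for url, proxy, delay in plan_requests(urls, proxies):
    print(f"wait {delay:.1f}s, then fetch {url} via {proxy}")
```

A real scraper would sleep for the planned delay and hand the proxy to its HTTP client; the randomized wait keeps requests from arriving at perfectly regular intervals.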

My friend Bill Hess of PixelPrivacy has shed more light on the subject; have a look at it and let me know if you like it.