The task was to build a highly scalable, cost-effective Google search crawler. The challenges were:
1. Scalability: perform the maximum number of Google searches in the minimum time.
2. IP blocking: Google temporarily blocks an IP address that sends too many searches.
3. Minimal Cost.
I will walk through, step by step, how my solution evolved into a scalable crawler with rotating IP addresses and infrastructure costs as low as $0.021 for 5000 searches.
Initially, it seemed straightforward. I quickly wrote a Ruby script that performed a Google search for each query sequentially and used Nokogiri to parse the HTML response. This worked well as long as the number of searches stayed below roughly 500.
Once the number of search queries grew, searching Google sequentially no longer scaled. That part can be fixed by running the searches in parallel with a background job processor such as Sidekiq. But I was facing another, bigger problem.
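To illustrate the move from sequential to parallel searches: Sidekiq needs Redis and a separate worker process, so the sketch below swaps in Ruby's standard-library Queue and threads as a stand-in. It shows the same idea, a fixed pool of workers draining a shared queue of queries. The helper name run_in_parallel, the worker count, and the fake search block are assumptions for illustration, not the original code.

```ruby
# Stand-in for a Sidekiq worker pool using only the standard library.
# run_in_parallel is a hypothetical helper, not part of Sidekiq.
def run_in_parallel(queries, workers: 4)
  queue = Queue.new
  queries.each { |query| queue << query }
  queue.close # pop returns nil once the queue is closed and empty

  results = {}
  lock = Mutex.new

  threads = Array.new(workers) do
    Thread.new do
      while (query = queue.pop)
        result = yield(query) # the caller supplies the actual search
        lock.synchronize { results[query] = result }
      end
    end
  end

  threads.each(&:join)
  results
end

# Fake "search" so the example runs offline.
results = run_in_parallel(%w[ruby rails sidekiq]) { |q| "results for #{q}" }
puts results.length
```

Parallelism of this kind only shortens the wall-clock time. Every request still leaves from the same IP address, so it does nothing about the blocking described next.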
Google was blocking my IP address after approx. 500 queries.
It was impossible to scale the application with the approach I was following.
This was a Rails application running on AWS.
To tackle the challenge of getting blocked by Google, I used AWS Elastic IPs. I ran the Google searches as parallel Sidekiq jobs on a single AWS instance, and as soon as Google started blocking the instance's IP address, I would:
1. Allocate a new Elastic IP in my AWS account.
2. Disassociate the current Elastic IP from the instance.
3. Associate the newly allocated Elastic IP with the instance.
4. Release the previous Elastic IP.
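The four steps above could look roughly like this with the AWS SDK for Ruby. This is a sketch, not the original code: rotate_elastic_ip is a hypothetical helper, and the client is passed in as an argument (expected to be an Aws::EC2::Client from the aws-sdk-ec2 gem) so the logic can be read and exercised without AWS credentials.

```ruby
# Sketch of the rotation steps. `ec2` is expected to be an Aws::EC2::Client
# (gem "aws-sdk-ec2"); rotate_elastic_ip is a hypothetical helper name.
def rotate_elastic_ip(ec2, instance_id:, old_allocation_id:, old_association_id:)
  # 1. Allocate a new Elastic IP.
  new_address = ec2.allocate_address(domain: "vpc")

  # 2. Disassociate the blocked Elastic IP from the instance.
  ec2.disassociate_address(association_id: old_association_id)

  # 3. Associate the new Elastic IP with the instance.
  association = ec2.associate_address(
    instance_id: instance_id,
    allocation_id: new_address.allocation_id
  )

  # 4. Release the old Elastic IP back to AWS.
  ec2.release_address(allocation_id: old_allocation_id)

  # Return the new identifiers; they are the "old" ones for the next rotation.
  {
    public_ip: new_address.public_ip,
    allocation_id: new_address.allocation_id,
    association_id: association.association_id
  }
end

# Real usage (assumes the gem is installed and credentials/region are set):
#   require "aws-sdk-ec2"
#   rotate_elastic_ip(Aws::EC2::Client.new, instance_id: ..., ...)
```

Between steps 2 and 3 the instance has no Elastic IP attached, which is why every job on it has to pause during a rotation.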
This solved the problem of getting blocked by Google, but two problems remained. All the Sidekiq jobs had to stop and wait while the new Elastic IP was allocated and associated, and the number of Sidekiq jobs that can run on a single instance is still limited by its resources.
The most cost-effective, scalable solution I found for these problems is described below. Say there is a certain number of Google searches to run. I followed these steps:
1. Divide all the search queries into groups of 500 each (because Google blocks after approximately 500 queries).
2. Create a rake task that runs the Google searches for one group of search queries.
3. Dockerise the application and push the image to Docker Hub.
4. From a main AWS instance, spawn one t2.micro instance for each group of 500 search queries, and pass the queries to it in a user-data script that runs immediately after the instance is launched.
5. Each AWS instance was spawned with HashiCorp Terraform from a prebuilt Amazon Machine Image (AMI), which I created using HashiCorp Packer.
6. A user-data script is a shell script that you can supply to run tasks immediately after an AWS instance is launched. In my user-data script, I created a docker-compose file and ran the Docker build using it.
7. Once the Docker container of my application was up and running, the next task in the user-data script was to run the rake task for the Google search of all the search queries passed to the user-data script.
8. After the Google search for all the search queries was complete, the instance called an API on the main instance asking it to destroy the calling instance.
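Steps 1, 6, 7 and 8 can be sketched in Ruby as follows. This is an illustration under stated assumptions rather than the original code: the rake task name (search:run), the compose service name (app), the /opt/app directory (assumed to already contain a docker-compose.yml), and the callback URL are all hypothetical placeholders.

```ruby
require "shellwords"

BATCH_SIZE = 500 # approximate number of searches before Google blocks, per the text

# Render a user-data shell script for one batch of queries.
# Hypothetical names: /opt/app, service "app", rake task "search:run".
def user_data_for(queries, callback_url:)
  [
    "#!/bin/bash",
    # Quoted heredoc: the shell writes the queries verbatim, one per line.
    "cat > /tmp/queries.txt <<'QUERIES'",
    *queries,
    "QUERIES",
    "cd /opt/app",
    "docker-compose up -d --build",
    # -T disables the pseudo-TTY so the query file can be piped in on stdin.
    "docker-compose exec -T app bundle exec rake search:run < /tmp/queries.txt",
    # Ask the main instance to destroy this instance once the batch is done.
    "curl -X POST #{Shellwords.escape(callback_url)}"
  ].join("\n") + "\n"
end

# Step 1: split all queries into groups of at most 500.
queries = (1..1200).map { |i| "query #{i}" }
batches = queries.each_slice(BATCH_SIZE).to_a
puts batches.map(&:length).inspect # => [500, 500, 200]

puts user_data_for(batches.last.first(2), callback_url: "http://10.0.0.5/instances/2/destroy")
```

EC2 limits user data to 16 KB, so with long queries a batch of 500 may not fit inline. In that case the script would have to fetch its queries from somewhere else instead.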
This is how I spawned an AWS instance for each group of search queries. Each instance was spawned in parallel from its own Sidekiq job on the main instance, and when its searches finished, it called the main instance, which destroyed it using terraform destroy.
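One way a Sidekiq job on the main instance might drive Terraform is by shelling out to the CLI. The sketch below only builds the commands and does not run them. It assumes one working directory per batch, so that each instance gets its own state, and a Terraform variable named user_data_file. Both are hypothetical, since the post does not show its Terraform configuration.

```ruby
# Build (but do not run) the Terraform commands for one batch.
# The directory layout and the variable name user_data_file are hypothetical.
def terraform_commands(batch_dir, user_data_file)
  base = ["terraform", "-chdir=#{batch_dir}"]
  var  = ["-var", "user_data_file=#{user_data_file}"]
  {
    init:    base + ["init", "-input=false"],
    apply:   base + ["apply", "-auto-approve", "-input=false"] + var,
    destroy: base + ["destroy", "-auto-approve", "-input=false"] + var
  }
end

cmds = terraform_commands("batches/0", "user_data.sh")
puts cmds[:apply].join(" ")
# A job could execute a command with: system(*cmds[:apply])
```

The spawning job would run init and apply. The API endpoint that a finished instance calls would run destroy for the matching batch directory.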
Each t2.micro instance ran for about 10 minutes to complete 500 Google searches. A t2.micro instance costs $0.013 per hour, so 10 minutes of runtime, and therefore 500 Google searches, costs just over $0.0021 per instance.
So, if there are 5000 Google searches to be done, 10 instances are spawned and the total cost for these searches comes to about $0.021.
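The arithmetic behind these figures can be checked with a few lines of Ruby. The sketch reads the $0.013 price as an hourly rate and assumes billing is pro-rated for the 10 minutes actually used. The constant and method names are illustrative.

```ruby
HOURLY_RATE_USD       = 0.013 # t2.micro price from the text, read as per hour
MINUTES_PER_INSTANCE  = 10    # observed runtime for one batch
SEARCHES_PER_INSTANCE = 500

# Estimate instance count and cost for a given number of searches.
def crawl_cost(total_searches)
  instances = (total_searches.to_f / SEARCHES_PER_INSTANCE).ceil
  per_instance = HOURLY_RATE_USD * MINUTES_PER_INSTANCE / 60.0
  { instances: instances, per_instance: per_instance, total: instances * per_instance }
end

estimate = crawl_cost(5000)
puts estimate[:instances]                    # => 10
puts format("%.4f", estimate[:per_instance]) # => 0.0022
puts format("%.4f", estimate[:total])        # => 0.0217
```

The exact values are about $0.00217 per instance and $0.0217 in total. The figures quoted in the text are these numbers truncated rather than rounded.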