Getting Company URL from a list of Company Names using python

I have a list of company names in an Excel sheet, and I want to get the corresponding website URL for each one. I first tried the WEBSERVICE and FILTERXML functions with Excel VBA, but didn't get much of a result, so I switched to Python.

It turns out there is a Python module called googlesearch that helps achieve this. The steps are below:

Save the list of companies in a text file, one name per line.

Here is the code that achieves the task:

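from googlesearch import search

# Read the company names, one per line, skipping blank lines.
with open("your_file_name.txt", "r") as file:
    names = [line.strip() for line in file if line.strip()]

with open("yourfilenamewithlinks.txt", "w") as f:
    for query in names:
        print(query)
        # start=0 is the top result; stop=1 ends the search after one result.
        # pause=10 waits 10 seconds between HTTP requests.
        for result in search(query, tld="co.in", num=1, start=0, stop=1, pause=10):
            # Write one "name,url" line per company.
            f.write(query + "," + result + "\n")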

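In the terminal, install the library with pip. The tld, num, start, stop and pause arguments used above belong to the googlesearch module that ships in the google package (pip install google). The separately published googlesearch-python package also installs a module named googlesearch, but its search() takes different arguments. Then import the function:
from googlesearch import search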

Open the file in read mode.
Read the file stream, and split the stream by newline "\n". This will provide a list of names.
For each name in the list, perform a Google search.
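The reading step can be sketched as a small helper. This is only a sketch: the name read_company_names is made up for illustration, and unlike a plain split('\n') it also strips surrounding whitespace and skips blank lines, so that an empty line never becomes an empty search query.

```python
def read_company_names(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    with open(path, "r", encoding="utf-8") as handle:
        # Iterating over the file yields one line at a time.
        return [line.strip() for line in handle if line.strip()]
```

The with block closes the file automatically, even if reading fails part way through.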

The search yields the URLs it finds, one at a time.
An explanation of the parameters is as follows:
1. query is the text that you want to search for on Google.
2. tld stands for 'top level domain' and selects the Google domain the query is sent to, for example "com" for google.com or "co.in" for google.co.in. It does not restrict the results to sites under that domain. For reference, some common top-level domains are:
.net (an alternative to .com)
.org (typically for but not restricted to nonprofit organizations)
.gov (for government sites)
.edu (for educational institutions)
.mil (for military use)
.uk, .us, .au and others (country-specific domains)
3. num=n : request n results per page.
4. start=p : index of the first result to retrieve. Counting starts at 0, so start=1 skips the top result.
5. stop=m : last result to retrieve; the search ends once it has been reached.
6. pause=n : wait n seconds between HTTP requests. ** See note below.
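To show how these parameters combine when only the top hit is wanted, here is a minimal sketch. The helper name first_url is hypothetical, and the search function is passed in as an argument (search_fn) rather than imported, so the logic can be tried out without touching the network. It assumes search_fn accepts the same keyword arguments as the search() call used in this post.

```python
def first_url(query, search_fn, pause=10):
    """Return the first URL that search_fn produces for query, or None."""
    # start=0 asks for the top result, stop=1 ends the search after one hit.
    results = search_fn(query, tld="co.in", num=1, start=0, stop=1, pause=pause)
    # next() with a default avoids StopIteration when nothing comes back.
    return next(iter(results), None)
```

With the library installed, this would presumably be called as first_url("Some Company", search) after from googlesearch import search.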

For every result returned, write the query and the result, separated by a comma (",") and terminated by a newline ("\n"), to a new text file. This produces a text file that lists each company name alongside the URL found for it.
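One caveat with writing the comma by hand is that a company name which itself contains a comma (for example "Acme, Inc.") ends up split across two columns. A possible alternative, sketched here with Python's standard csv module, quotes such names automatically. The helper name write_rows and the sample row are made up for illustration.

```python
import csv
import io


def write_rows(rows, out):
    """Write (name, url) pairs to the file-like object out as CSV."""
    writer = csv.writer(out, lineterminator="\n")
    for name, url in rows:
        # Fields that contain a comma are wrapped in double quotes.
        writer.writerow([name, url])


# Illustrative data only; an in-memory buffer stands in for the output file.
buffer = io.StringIO()
write_rows([("Acme, Inc.", "https://example.com")], buffer)
print(buffer.getvalue())  # "Acme, Inc.",https://example.com
```

When writing to a real file instead of a buffer, the csv documentation recommends opening it with newline="".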

** Note

Go slow between searches. Keep to about 100 queries per day to honor Google's terms of service. Also, if you go too fast or at high concurrency, Google may block your IP from further searches.
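A daily cap like this can be enforced in code rather than by hand. The following is a minimal sketch, not part of the googlesearch module: the generator name throttled and its parameters are hypothetical, and the sleep function is injectable so the behaviour can be checked without actually waiting.

```python
import time


def throttled(queries, daily_limit=100, pause_seconds=0, sleep=time.sleep):
    """Yield at most daily_limit queries, optionally sleeping between them."""
    for index, query in enumerate(queries):
        if index >= daily_limit:
            # Cap reached: leave the remaining names for another day.
            break
        if index > 0 and pause_seconds > 0:
            # Extra delay on top of whatever pause the search call applies.
            sleep(pause_seconds)
        yield query
```

Looping over throttled(names) instead of the raw list would stop after 100 names by default.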

For volume retrievals, it is advisable to use Google's Custom Search JSON API in order to abide by its terms of service (ToS). After the first 100 free queries in a day, it charges $5 per 1000 queries, up to a limit of 10K queries per day, and it requires signing up for Google Cloud. Here is the link: https://developers.google.com/custom-search/v1/overview
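For reference, a request to that API might look roughly like the sketch below, which uses only the standard library. The function names are made up for illustration, and the API key and search engine ID (the cx parameter) are values you would have to obtain from your own Google Cloud and Programmable Search Engine setup. It assumes the JSON response carries its results in an "items" list whose entries have a "link" field.

```python
import json
import urllib.parse
import urllib.request

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_request_url(query, api_key, engine_id):
    """Build the request URL for one query, asking for a single result."""
    params = {"key": api_key, "cx": engine_id, "q": query, "num": 1}
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)


def first_link(response):
    """Return the link of the first item in a decoded response, or None."""
    items = response.get("items", [])
    return items[0]["link"] if items else None


def lookup(query, api_key, engine_id):
    """Send the request and return the top link (needs network and a valid key)."""
    with urllib.request.urlopen(build_request_url(query, api_key, engine_id)) as reply:
        return first_link(json.load(reply))
```

Keeping the URL building and the response parsing in separate functions makes both easy to check without spending any of the daily quota.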

