How I learned Python and developed a basic functional search engine in 15 days

You will never fulfill your destiny until you let go of the illusion of control.

– Master Oogway

Okay, so today marks my 15th day of learning the fundamentals of computation using Python and developing a basic functional search engine along the way. By "functional" I am in no way comparing what I have built to the likes of the Googles and Bings of the world, but even Google's search engine is based upon the same fundamental concepts I have used to build mine, just at a much, much larger scale and with many more servers dedicated solely to crawling and indexing data.

Contribution graph from my GitHub profile

Along the way I have covered topics like procedures, mutable data types like lists, string handling, string operations (like find(), len(), <string>[::-1]), link extraction, Backus-Naur Form (BNF) grammar, control statements, loops, multiple assignment, nested lists, list operations (like append(), max(), min(), index(), <value> in/not in <list>, pop()) and many more ideas, ranging from ethical programming to problem-solving procedure.

And after learning so much, I realized that a basic search engine was easier to build than I had previously imagined, because you only need a few concepts to build a web crawler (the most important part of a search engine), which is a program that collects all the hyperlinks from the web. The only other thing you need to build a basic search engine is an index, which is a list of keywords and URLs.

(For people unfamiliar with Python, a list is a concept similar to arrays in C/C++/Java, but it is also miles apart in other functionality. For example, arrays are always made up of homogeneous data types – an array of integers cannot contain characters or floating-point numbers – but a list like [1, 1.2, 'a', [3, 4]] is possible in Python, which makes lists a bazillion times more powerful than arrays in terms of usability.)

So, the very first step to building a web crawler is knowing about the anchor tag, which is used to create a hyperlink in an HTML document, and its syntax is –

<a href="URL"> link text </a>

The only part we should be interested in is –

<a href="

Because we have to extract the URL between the two quotes of the href attribute of <a>. So, how do we extract it? Simple: just use the find() function, which searches for a target string in a given search string. The syntax of find() is –

<search_string>.find(<target_string>)

The find() procedure also accepts one more parameter, which makes it search for the target string in the search string starting from position n:

<search_string>.find(<target_string>,n)

So now let's assume that any given web page's source code is stored in a string variable (e.g. page). Then we could use the following steps to extract the first hyperlink –

page = '<web page source code>'

start_position = page.find('<a href=')          # Will find the start of the anchor tag

start_quote = page.find('"', start_position)    # Will find the 1st quotation mark

end_quote = page.find('"', start_quote + 1)     # Will find the last quotation mark

url = page[start_quote + 1:end_quote]           # Will store the URL

And bang, we have extracted our first hyperlink from scratch just by playing around with strings. So, what's the next step? Extract all the hyperlinks from a given page using a while loop. But the story of the web crawler doesn't end there, as you have to extract all the hyperlinks from every web page your crawler comes across, and this requires a little more programming and 2 more variables – tocrawl (a list containing all the links still to be crawled by your web crawler) and crawled (a list of all the hyperlinks your web crawler has already crawled through). So let's make the most basic version of the web crawler, starting with the loop that extracts all the links from one page.

def web_crawler(seed):
    tocrawl = [seed]                              # As no pages have been crawled yet
    crawled = []                                  # List of crawled pages
    while tocrawl:                                # Will run till tocrawl is empty
        page = tocrawl.pop()                      # Extracts the last element of tocrawl
        if page not in crawled:                   # Skip pages already crawled
            content = get_page(page)              # Source code of page in content
            union(tocrawl, get_all_links(content))  # Set union of the two lists
            crawled.append(page)
    return crawled

So, basically, what happens in the above program is that the web_crawler procedure (or function) accepts seed as its input, where seed is the seed URL, and stores it in tocrawl, because the seed page is yet to be crawled by the program. Then the last element of tocrawl is popped and stored in a variable page – in the first iteration this is the seed URL itself. Provided the page hasn't been crawled already, the source code of its web page is stored in a variable content, and after that we do a set union of tocrawl and all the hyperlinks extracted from the popped URL (here union isn't a predefined function – I had to create it for this purpose alone, but its functionality is the same as that of set union). Then the popped page is appended to the crawled list. This process goes on till we have exhausted all the links in the tocrawl list, but it can also be restricted by using 2 parameters – max_depth and max_pages.

The last part of making a basic functional search engine is adding an index, which stores the keywords found in the content of each crawled page together with the URLs they appear on, so that any keyword can later be searched in the index.

Here’s the full program for doing so.

