You will never fulfill your destiny until you let go of the illusion of control.
– Master Oogway
Okay, so today marks the 15th day of learning the fundamentals of computation using Python, building a basic functional search engine along the way. By functional, I am in no way comparing what I have built to the Googles and Bings of the world, but even Google's search engine is built upon the same fundamental concepts I have used to build mine, just at a much, much larger scale and with many more servers dedicated solely to crawling and indexing data.
Along the way I have covered topics like procedures, mutable data types like lists, string handling, string operations (like find(), len(), <string>[::-1]), link extraction, Backus-Naur Form (BNF) grammar, control statements, loops, multiple assignment, nested lists, list operations (like append(), max(), min(), index(), <value> in/not in <list>, pop()) and many more ideas, ranging from ethical programming to problem-solving procedure.
And after learning so much, I realized that a basic search engine was easier to build than I had previously imagined, because you only need a few concepts to build a web crawler (the most important part of a search engine), which is a program that collects all the hyperlinks from the web. The only other thing you need for a basic search engine is an index, which is a list of keywords and the URLs they appear on.
(For people unfamiliar with Python, a list is a concept similar to arrays in C/C++/Java, but it is also miles apart in other functionality. For example, arrays are always made up of homogeneous data types: an array of integers cannot contain characters or floating point numbers, but a list like [1, 1.2, 'a', [3, 4]] is possible in Python, which makes lists far more powerful than arrays in terms of usability.)
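For instance, here is how such a mixed list behaves (a quick sketch you can try in the interpreter):

```python
# Lists can mix types, nest other lists, and grow dynamically.
mixed = [1, 1.2, 'a', [3, 4]]
print(mixed[3][1])        # prints 4: indexing into the nested list
mixed.append('anything')  # append() works regardless of the element's type
print(len(mixed))         # prints 5: the list grew by one element
```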
So, the very first step in building a web crawler is knowing about the anchor tag, which is used to create a hyperlink in an HTML document. Its syntax is:
<a href="URL"> link text </a>
The only part we should be interested in is href="URL", because we have to extract the URL between the two quotation marks of the href attribute of <a>. So, how do we extract it? Simple: just use the find() method, which searches for a target string inside a given search string and returns the position of the first match (or -1 if there is no match). Its syntax is <search string>.find(<target string>).
The find() procedure also accepts one more parameter, which tells it to search for the target string only from the nth position onwards: <search string>.find(<target string>, <start position>).
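To see both forms of find() in action (the example string here is just one I made up for illustration):

```python
page = '<a href="http://example.com">example</a>'
# find() returns the index of the first occurrence of the target string
print(page.find('href'))   # 3: 'href' starts at index 3
print(page.find('"'))      # 8: the opening quotation mark
print(page.find('"', 9))   # 27: the closing quotation mark, searching from index 9
print(page.find('xyz'))    # -1: find() returns -1 when the target is absent
```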
So now let's assume that any given web page's source code is stored in a string variable (e.g. page). Then we could use the following steps to extract the first hyperlink:
page = '<web page source code>'
start_position = page.find('<a href=')
start_quote = page.find('"', start_position) #Will find the opening quotation mark
end_quote = page.find('"', start_quote + 1) #Will find the closing quotation mark
url = page[start_quote + 1:end_quote] #Will store the URL
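Wrapping those same steps in a loop lets us pull out every link on a page, not just the first one. Here is a minimal sketch of the idea (the helper names get_next_target and get_all_links are my own choices):

```python
def get_next_target(page):
    # Returns the next URL on the page and the position to resume searching
    # from, or (None, 0) when there are no more links.
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

def get_all_links(page):
    # Repeatedly extracts the next link, then continues the search
    # in the remainder of the page.
    links = []
    while True:
        url, end_pos = get_next_target(page)
        if not url:
            break
        links.append(url)
        page = page[end_pos:]  # keep searching after the link just found
    return links
```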
And bang, we have extracted our first hyperlink from scratch, just by playing around with strings. So, what's the next step? Extract all hyperlinks from a given page using a while loop. But the story of the web crawler doesn't end here, as you have to extract all hyperlinks from every web page your web crawler comes across, and this requires a little more programming and two more variables: tocrawl (a list containing all the links still to be crawled by your web crawler) and crawled (a list of all the hyperlinks your web crawler has already crawled through). So let's make the most basic version of a web crawler. The seed page is the first page from which the crawling begins; it's denoted as seed in the following program:
def web_crawler(seed):
    tocrawl = [seed] #As no pages have been crawled yet
    crawled = [] #List of crawled pages
    while tocrawl: #Will run till tocrawl isn't empty
        page = tocrawl.pop() #Extracts last element of tocrawl
        if page not in crawled: #Skip pages already crawled, so we don't loop forever
            content = get_page(page) #Source code of page in content
            union(tocrawl, get_all_links(content)) #Will do set union of the two lists
            crawled.append(page)
    return crawled
So, basically, what happens in the above program is that the web_crawler procedure accepts seed as its input, where seed is the seed URL, and stores it in tocrawl, because the seed page is yet to be crawled. Then the last element of tocrawl is popped and stored in the variable page (in the first iteration, the last element of tocrawl is the seed URL itself). The next step is to store the source code of the popped page in the variable content, after which we take the set union of tocrawl and all the hyperlinks extracted from the popped page. (Here union isn't a predefined function; I had to create it for this purpose alone, but its functionality is the same as set union.) The popped page is then appended to the crawled list. This process goes on until we have exhausted all the links in the tocrawl list, but it can also be restricted by using two parameters: max_depth and max_pages.
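For completeness, here is roughly what my hand-rolled union looks like (a sketch of the idea, not necessarily the exact code):

```python
def union(a, b):
    # Appends to list a every element of list b that a doesn't already
    # contain, mimicking set union while keeping plain lists.
    for e in b:
        if e not in a:
            a.append(e)

tocrawl = ['http://a.com']
union(tocrawl, ['http://a.com', 'http://b.com'])
print(tocrawl)  # ['http://a.com', 'http://b.com']
```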
The last part of making a basic functional search engine is adding an index, which associates each keyword found in the crawled pages with the URLs of the pages containing it, so that any keyword can be searched for in the index.
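To give a feel for the shape of such an index, here is a minimal sketch using a list of [keyword, [list of URLs]] entries (add_to_index and lookup are names I've picked for illustration):

```python
def add_to_index(index, keyword, url):
    # If the keyword is already indexed, record one more URL for it;
    # otherwise create a fresh entry.
    for entry in index:
        if entry[0] == keyword:
            entry[1].append(url)
            return
    index.append([keyword, [url]])

def lookup(index, keyword):
    # Returns every URL recorded for the keyword, or an empty list.
    for entry in index:
        if entry[0] == keyword:
            return entry[1]
    return []

index = []
add_to_index(index, 'python', 'http://example.com')
print(lookup(index, 'python'))  # ['http://example.com']
```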
Here’s the full program for doing so.