The distinct parts of the engine
The Crawler is divided into several distinct parts.
The Job Manager builds a "job list" containing the next site to visit, the robot rules already known for it, and the pages to grab.
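As a rough illustration, a job list could be modelled as a small record holding those three items. This is only a sketch: the names JobList and build_job_list, the in-memory queue, and the fallback of grabbing "/" when no pages are known yet are all assumptions made for the example, not details of the actual engine.

```python
from dataclasses import dataclass, field


@dataclass
class JobList:
    """Hypothetical shape of a job list: one site per job."""
    site: str
    robot_rules: list = field(default_factory=list)  # rules known from earlier visits
    pages: list = field(default_factory=list)        # pages the Bot should grab


def build_job_list(queue, rules_by_site, pages_by_site):
    """Take the next site from the queue and attach what is already known about it."""
    site = queue.pop(0)
    return JobList(
        site=site,
        robot_rules=list(rules_by_site.get(site, [])),
        # Assumption: a site with no known pages starts at its root page.
        pages=list(pages_by_site.get(site, ["/"])),
    )
```

For example, build_job_list(["example.org"], {}, {}) would return a job for example.org with no known rules and the single page "/".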
The Bot receives the "job list" and first tries to grab the site's "robots.txt". This provides two important pieces of information about the site being visited: (1) whether the site is online and available, and (2) any crawl restrictions imposed by the "robots.txt". If the site is available, the Bot then grabs the known pages and passes them to the Extractor.
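The check described above can be sketched with Python's standard urllib.robotparser module. The function name pages_to_grab, the "ExampleBot" user agent, and the convention that None stands for an unreachable site are assumptions for this example; the network fetch itself is left out so that the sketch stays self-contained.

```python
from urllib.robotparser import RobotFileParser


def pages_to_grab(robots_txt, pages, user_agent="ExampleBot"):
    """Return the pages the Bot may grab, or [] if the site is unavailable.

    robots_txt is the text of the file, "" if the site answered without
    one, or None if the site could not be reached (assumed convention).
    """
    if robots_txt is None:
        # Site offline: nothing is grabbed on this visit.
        return []
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Keep only the pages the crawl restrictions allow for this user agent.
    return [page for page in pages if parser.can_fetch(user_agent, page)]
```

With the rules "User-agent: *" and "Disallow: /private/", a page under /private/ is filtered out while other pages are kept.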
The Extractor performs two major tasks: (1) build a list of the links found in the page, and (2) extract the text.
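A minimal sketch of these two tasks, using the standard html.parser module, might look as follows. The class and function names are hypothetical, and skipping the contents of script and style elements is an assumption about what counts as "text"; the real Extractor may treat pages differently.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects link targets and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return (links, text) for a page given as an HTML string."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)
```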
...
Using the "job list", the status of the current site is updated. New links are stored. Text extracted from the page is stored.
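The update step could be sketched like this. The in-memory CrawlStore, its field names, and the "visited" status value are placeholders chosen for the example; the storage actually used by the engine is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlStore:
    """Hypothetical in-memory stand-in for the crawler's storage."""
    site_status: dict = field(default_factory=dict)  # site -> status string
    known_links: set = field(default_factory=set)    # every link seen so far
    page_text: dict = field(default_factory=dict)    # (site, page) -> text


def record_page(store, site, page, links, text):
    """Update the site status, store unseen links, and store the page text.

    Returns the links that were new, in the order they were found.
    """
    store.site_status[site] = "visited"
    new_links = []
    for link in links:
        if link not in store.known_links:
            store.known_links.add(link)
            new_links.append(link)
    store.page_text[(site, page)] = text
    return new_links
```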
...
...
...
Please contact us if you would like to talk about the possibilities of this development.