Distributed Web Crawler
In this section, we build a simple HTML crawler that scrapes the content of the first HTML page returned from each of the given URLs. The example source code is available here.
Core
Suppose you have a spreadsheet of URL addresses in CSV format with two fields, as below:

```
number,url
[...],[...]
```
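The CSV loading later in this section is handled by the framework's `ioHelper.FromPath(...).NewCSVOperator().Fill` call; as a point of reference, equivalent logic can be sketched with the standard `encoding/csv` package. The rows below are purely illustrative, since the real spreadsheet's contents are not shown:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// loadURLs parses CSV text in the number,url layout and returns the
// url column, skipping the header row.
func loadURLs(csvText string) []string {
	records, err := csv.NewReader(strings.NewReader(csvText)).ReadAll()
	if err != nil {
		panic(err)
	}
	var urls []string
	for _, rec := range records[1:] {
		urls = append(urls, rec[1])
	}
	return urls
}

func main() {
	// Hypothetical rows standing in for the elided [...] entries above.
	data := "number,url\n1,https://example.com\n2,https://example.org\n"
	fmt.Println(loadURLs(data))
}
```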
To scrape the content from the given sources and collect all of the HTML text into a single text file, let's start with a simple handler:
```go
// The following example shows how to collect HTML contents from the given urls.
func ExampleJobHandler(w http.ResponseWriter, r *http.Request, bg *task.Background) {
	var (
		job  = task.MakeJob()
		path = "./data.csv"
		raw  = []struct {
			URL string
		}{}
		source = task.Collection{}
	)
	// Fill the raw struct slice from the CSV file.
	ioHelper.FromPath(path).NewCSVOperator().Fill(&raw)
	for _, r := range raw {
		source.Append(r.URL)
	}
	job.Tasks(
		&task.Task{task.SHORT,
			task.BASE, "exampleFunc",
			source,
			task.Collection{},
			task.NewTaskContext(struct{}{}), 0},
	)
	job.Stacks("core.ExampleTask.Mapper", "core.ExampleTask.Reducer")
	bg.Mount(job)
}
```
Next, we create a function that sends a request to each of the passed-in URLs and stores the responses in its result space:
```go
func ExampleFunc(source *task.Collection,
	result *task.Collection,
	context *task.TaskContext) bool {
	var text = task.Collection{}
	for _, n := range *source {
		resp, err := http.Get(n.(string))
		if err != nil {
			// Skip URLs that fail instead of aborting the whole crawl.
			continue
		}
		bytes, err := ioutil.ReadAll(resp.Body)
		// Always release the response body.
		resp.Body.Close()
		if err != nil {
			continue
		}
		text = append(text, task.Countable(string(bytes)))
	}
	*result = append(*result, text...)
	return true
}
```
Similarly, add a simple mapper that splits the tasks into three subsets:
```go
type SimpleMapper int

func (m *SimpleMapper) Map(inmaps map[int]*task.Task) (map[int]*task.Task, error) {
	// Slice the data source of the map into 3 separate segments.
	return taskHelper.Slice(inmaps, 3), nil
}
```
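`taskHelper.Slice` is framework-provided, but the underlying idea — splitting one data source into roughly equal contiguous segments — can be sketched with plain slices. `chunkInto` is a hypothetical helper written for illustration:

```go
package main

import "fmt"

// chunkInto splits src into n nearly equal contiguous segments,
// distributing any remainder across the leading segments.
func chunkInto(src []string, n int) [][]string {
	out := make([][]string, 0, n)
	size, rem := len(src)/n, len(src)%n
	start := 0
	for i := 0; i < n; i++ {
		end := start + size
		if i < rem {
			end++
		}
		out = append(out, src[start:end])
		start = end
	}
	return out
}

func main() {
	urls := []string{"a", "b", "c", "d", "e", "f", "g"}
	// Seven sources split across three workers.
	fmt.Println(chunkInto(urls, 3))
}
```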
Then add a reducer that writes the collected text to a specified file path:
```go
type SimpleReducer int

func (r *SimpleReducer) Reduce(maps map[int]*task.Task) (map[int]*task.Task, error) {
	var (
		sum  int
		text string
	)
	for _, s := range maps {
		sum += len((*s).Result)
		for _, r := range (*s).Result {
			text += r.(string)
		}
	}
	// Create the output file, checking the error instead of discarding it,
	// and close the file once the write completes.
	file, err := os.Create("./websites.txt")
	if err != nil {
		return maps, err
	}
	defer file.Close()
	if _, err := io.WriteString(file, text); err != nil {
		return maps, err
	}
	fmt.Printf("The sites visited: %v\n", sum)
	return maps, nil
}
```
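Two standard-library refinements are worth noting here: repeated `text +=` concatenation in a loop is quadratic in the total result size, while `strings.Builder` keeps it linear, and `os.WriteFile` collapses the create/write/close sequence into one call. A minimal sketch, with `concatResults` as a hypothetical helper:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// concatResults joins the per-task result strings; strings.Builder
// avoids reallocating the accumulated string on every append.
func concatResults(results []string) string {
	var b strings.Builder
	for _, r := range results {
		b.WriteString(r)
	}
	return b.String()
}

func main() {
	text := concatResults([]string{"<html>a</html>", "<html>b</html>"})
	// os.WriteFile creates (or truncates) the file and closes it for us.
	path := filepath.Join(os.TempDir(), "websites.txt")
	if err := os.WriteFile(path, []byte(text), 0o644); err != nil {
		panic(err)
	}
	fmt.Println(len(text))
}
```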
Now register the components and run:

```sh
go run main.go -mode=clbt
```