Distributed Web Crawler

In this section, we create a simple HTML crawler that scrapes the content of the first HTML page returned from each of the given URLs. The example source code is available here.

Core

Suppose you have a URL address spreadsheet in CSV format with two fields, as below:

number,url
[...],[...]
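
For illustration only, a hypothetical data.csv might contain entries such as:

number,url
1,https://example.com
2,https://example.org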

To scrape the content from the given sources and collect all of the HTML text into a single text file, let's start with a simple handler:


// The following example shows how to collect HTML contents from the given urls
func ExampleJobHandler(w http.ResponseWriter, r *http.Request, bg *task.Background) {
    var (
        job  = task.MakeJob()
        path = "./data.csv"
        raw  = []struct {
            URL string
        }{}
        source = task.Collection{}
    )

    ioHelper.FromPath(path).NewCSVOperator().Fill(&raw)

    for _, r := range raw {
        source.Append(r.URL)
    }

    job.Tasks(
        &task.Task{
            task.SHORT,                      // task type
            task.BASE,                       // task priority
            "exampleFunc",                   // registered name of the consumable function
            source,                          // source collection to consume
            task.Collection{},               // empty result collection
            task.NewTaskContext(struct{}{}), // task context
            0,                               // stage
        },
    )
    job.Stacks("core.ExampleTask.Mapper", "core.ExampleTask.Reducer")

    bg.Mount(job)
}
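
The Fill call above loads the CSV rows into the raw struct slice. To see roughly what that step does with only the standard library, here is a minimal sketch (readURLs is a hypothetical helper, not part of the framework's ioHelper):

package main

import (
    "encoding/csv"
    "fmt"
    "os"
)

// readURLs loads the url column from a two-field CSV file
// (number,url), skipping the header row.
func readURLs(path string) ([]string, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    records, err := csv.NewReader(f).ReadAll()
    if err != nil {
        return nil, err
    }

    var urls []string
    for i, rec := range records {
        if i == 0 || len(rec) < 2 {
            continue // skip the header and malformed rows
        }
        urls = append(urls, rec[1])
    }
    return urls, nil
}

func main() {
    urls, err := readURLs("./data.csv")
    if err != nil {
        fmt.Println("read failed:", err)
        return
    }
    fmt.Println(urls)
}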

Then we create a function that sends requests to the passed-in URLs and stores the responses in its result space:

func ExampleFunc(source *task.Collection,
    result *task.Collection,
    context *task.TaskContext) bool {
    var text = task.Collection{}

    for _, n := range *source {
        resp, err := http.Get(n.(string))
        if err != nil {
            // skip unreachable urls rather than aborting the whole batch
            continue
        }
        bytes, err := ioutil.ReadAll(resp.Body)
        resp.Body.Close()
        if err != nil {
            continue
        }
        text = append(text, task.Countable(string(bytes)))
    }

    *result = append(*result, text...)
    return true
}
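
Note that http.Get uses the default client, which has no timeout, so a single slow host can stall the task indefinitely. Below is a sketch of the same fetch step with a bounded http.Client (fetch is an illustrative helper, not part of the example's API):

package main

import (
    "fmt"
    "io"
    "net/http"
    "time"
)

// fetch downloads the body of a single url with a hard timeout,
// so that one slow host cannot stall the whole task.
func fetch(client *http.Client, url string) (string, error) {
    resp, err := client.Get(url)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }
    return string(body), nil
}

func main() {
    client := &http.Client{Timeout: 10 * time.Second}
    text, err := fetch(client, "https://example.com")
    if err != nil {
        fmt.Println("fetch failed:", err)
        return
    }
    fmt.Println(len(text), "bytes")
}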

Similarly, add a simple mapper that splits the task's data source into 3 subsets:

type SimpleMapper int

func (m *SimpleMapper) Map(inmaps map[int]*task.Task) (map[int]*task.Task, error) {
    // slice the data source of the map into 3 separate segments
    return taskHelper.Slice(inmaps, 3), nil
}
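
taskHelper.Slice presumably partitions the task's source into the requested number of segments. As a rough standalone illustration of that idea (sliceSegments is hypothetical and not the framework's implementation):

package main

import "fmt"

// sliceSegments splits a slice of items into n contiguous segments,
// spreading any remainder across the leading segments.
func sliceSegments(items []interface{}, n int) [][]interface{} {
    segments := make([][]interface{}, 0, n)
    size, rem := len(items)/n, len(items)%n
    start := 0
    for i := 0; i < n; i++ {
        end := start + size
        if i < rem {
            end++
        }
        segments = append(segments, items[start:end])
        start = end
    }
    return segments
}

func main() {
    urls := []interface{}{"a", "b", "c", "d", "e", "f", "g"}
    for i, seg := range sliceSegments(urls, 3) {
        fmt.Println(i, seg)
    }
}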

Then add a reducer that writes the collected text to a specified file path:

type SimpleReducer int

func (r *SimpleReducer) Reduce(maps map[int]*task.Task) (map[int]*task.Task, error) {
    var (
        sum  int
        text string
    )
    for _, s := range maps {
        sum += len(s.Result)
        for _, res := range s.Result {
            text += res.(string)
        }
    }
    file, err := os.Create("./websites.txt")
    if err != nil {
        return maps, err
    }
    defer file.Close()
    io.WriteString(file, text)
    fmt.Printf("The sites visited: %v\n", sum)
    return maps, nil
}
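
Since repeated string concatenation copies the accumulated text on every iteration, a large crawl may prefer a strings.Builder. A minimal sketch of that variant (writeAll is an illustrative helper):

package main

import (
    "fmt"
    "os"
    "strings"
)

// writeAll joins the collected pages with a strings.Builder and
// writes them to path in one call, avoiding quadratic copying.
func writeAll(path string, pages []string) (int, error) {
    var b strings.Builder
    for _, p := range pages {
        b.WriteString(p)
    }
    if err := os.WriteFile(path, []byte(b.String()), 0644); err != nil {
        return 0, err
    }
    return len(pages), nil
}

func main() {
    n, err := writeAll("./websites.txt", []string{"<html>a</html>", "<html>b</html>"})
    if err != nil {
        fmt.Println("write failed:", err)
        return
    }
    fmt.Printf("The sites visited: %v\n", n)
}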

Now register the components and run:

go run main.go -mode=clbt


