"HTML-structured-engine" Usage

Usage of "html-structured-engine"


Introduction

This page introduces common usage. For complex situations, you will need to understand some of the abstract concepts in the source code yourself. You can contact the webmaster if you need help.


Usage

There are two packages: browse and request. The 'browse' package parses HTML using regex. The 'request' package is called by 'browse' to fetch the HTML for a URL.

Package   Interface and class       Description
request   RequestInterface          Fetch HTML from a specific URL.
           └ SimpleRequester
          RequestContext            The result of fetching HTML.
browse    BrowseInterface           Parse HTML and get records using regex.
           └ Html2StructBrowser
          BrowseContext             The result of parsing.
          BrowseConfig              Configuration of how to get the HTML and how to parse it.
          BrowseListener            A callback interface to process the result.

Steps:

  1. Construct a BrowseConfig instance, then set a regex with capture groups and a group-number-to-name map.
  2. Implement and construct a BrowseListener callback instance to gather the results.
  3. Construct an Html2StructBrowser instance and call browse() to begin parsing.

Simple Demo

Just a simple demo to show the usage:

// Declare a BrowseConfig
BrowseConfig config = new BrowseConfig();
config.setUrl("http://www.baidu.com/");
config.setPattern("<script[^>]*>(.*?)</script>");
config.setGfmap("1=script");

// Implement a Listener
BrowseListener listener = new BrowseListener() {
    public void save(BrowseContext context) {
        System.out.println(context.getFields().get("script"));
    }
   
    public boolean beforeOpenURL(String url) {
        System.out.println("Going to open: " + url);
        return true;
    }
};

// Construct a Browser and call browse()
new Html2StructBrowser(config).browse(
        new BrowseContext(new SimpleRequester()), listener);

For other configurations of BrowseConfig, please see comments in the source code.
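
To make the role of the group-to-name map concrete, here is a minimal sketch that uses only java.util.regex, not the engine itself. The class GfmapSketch and the methods parseGfmap and firstRecord are hypothetical names made up for this illustration. The "number=name" entry format with "|" as separator is inferred from the examples on this page, and the DOTALL flag is an assumption; the engine may compile its pattern differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; this class is not part of html-structured-engine.
class GfmapSketch {

    // Parse a map string such as "1=Name|2=Tel" into {1=Name, 2=Tel}.
    // The format is inferred from the examples on this page.
    static Map<Integer, String> parseGfmap(String gfmap) {
        Map<Integer, String> map = new LinkedHashMap<>();
        for (String entry : gfmap.split("\\|")) {
            String[] parts = entry.split("=", 2);
            map.put(Integer.parseInt(parts[0].trim()), parts[1].trim());
        }
        return map;
    }

    // Return the named fields of the first match, or an empty map if nothing matches.
    // DOTALL is an assumption so that "." also matches line breaks.
    static Map<String, String> firstRecord(String html, String regex, String gfmap) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(html);
        if (m.find()) {
            for (Map.Entry<Integer, String> e : parseGfmap(gfmap).entrySet()) {
                fields.put(e.getValue(), m.group(e.getKey()));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<html><script type=\"text/javascript\">var a = 1;</script></html>";
        Map<String, String> fields =
                firstRecord(html, "<script[^>]*>(.*?)</script>", "1=script");
        System.out.println(fields.get("script")); // prints "var a = 1;"
    }
}
```

In the demo above, the value read with context.getFields().get("script") plays the same role as fields.get("script") in this sketch.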


Highly recommended tool

Composing a long regex by hand is usually complex, error-prone and slow. It is highly recommended to use the "Wildcard to regex tool" to generate the regex automatically. The wildcard expression itself can be produced by lightly editing the HTML source code. For example:

  1. Copy a piece of the HTML source:
    <td>First Name: Tom</td>
    <td>Tel: 12345678</td>
    
  2. Replace the dynamic parts of the HTML with wildcards, following the syntax of the "Wildcard to regex tool":
    <td>First Name: *{name:Name}</td>
    <td>Tel: *{name:Tel}</td>
    
  3. Get the regex:
    <td>First Name: ((?:(?!</td>\s+<td>Tel:)(?:.|\n))+)</td>\s+<td>Tel:((?:(?!</td>)(?:.|\n))+)</td>

    And the group number to group name map:

    1=Name|2=Tel
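
As a quick sanity check, the generated regex can be run against the sample HTML from step 1 with plain java.util.regex. This is only a sketch: the class GeneratedRegexCheck and the method extract are hypothetical names, and the engine is not involved. The regex is copied as printed above, with backslashes doubled for a Java string literal. As printed, it has no space after "Tel:", so group 2 keeps the leading space and the sketch trims it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; this class is not part of html-structured-engine.
class GeneratedRegexCheck {

    // The regex from step 3, copied as printed (backslashes doubled for Java).
    static final String REGEX =
            "<td>First Name: ((?:(?!</td>\\s+<td>Tel:)(?:.|\\n))+)</td>"
            + "\\s+<td>Tel:((?:(?!</td>)(?:.|\\n))+)</td>";

    // Return {group 1, group 2} of the first match, or null if nothing matches.
    static String[] extract(String html) {
        Matcher m = Pattern.compile(REGEX).matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String html = "<td>First Name: Tom</td>\n<td>Tel: 12345678</td>";
        String[] groups = extract(html);
        // Group numbers follow the map 1=Name|2=Tel.
        System.out.println("Name=" + groups[0]);        // prints "Name=Tom"
        System.out.println("Tel=" + groups[1].trim());  // prints "Tel=12345678"
    }
}
```

Each negative lookahead stops its capture group just before the literal text that follows the wildcard, which is why group 1 ends at "Tom" instead of running on to the second cell.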