"HTML-structured-engine" Usage
Introduction
This page introduces common usage. For complex situations, you will need to read the source code and understand some of its abstract concepts yourself. You can contact the webmaster if you need help.
Usage
There are two packages: browse and request. The 'browse' package parses HTML using regex. The 'request' package is called by 'browse' to fetch the HTML from a URL.
Package | Interface and class | Description |
---|---|---|
request | RequestInterface └ SimpleRequester | Fetches HTML from a specific URL. |
request | RequestContext | The result of fetching HTML. |
browse | BrowseInterface └ Html2StructBrowser | Parses HTML and gets records using regex. |
browse | BrowseContext | The result of parsing. |
browse | BrowseConfig | Configuration of how to get the HTML and how to parse it. |
browse | BrowseListener | A callback interface to process the result. |
Steps:
- Construct a BrowseConfig instance, then set a regex with capture groups and a map from group number to field name.
- Implement and construct a BrowseListener callback instance to gather the results.
- Construct an Html2StructBrowser instance and call browse() to begin parsing.
Simple Demo
Just a simple demo to show the usage:
    // Declare a BrowseConfig
    BrowseConfig config = new BrowseConfig();
    config.setUrl("http://www.baidu.com/");
    config.setPattern("<script[^>]*>(.*?)</script>");
    config.setGfmap("1=script");

    // Implement a listener
    BrowseListener listener = new BrowseListener() {
        // Called with the parse result; print the field named "script"
        public void save(BrowseContext context) {
            System.out.println(context.getFields().get("script"));
        }

        // Called before a URL is opened
        public boolean beforeOpenURL(String url) {
            System.out.println("Going to open: " + url);
            return true;
        }
    };

    // Construct a browser and call browse()
    new Html2StructBrowser(config).browse(
            new BrowseContext(new SimpleRequester()), listener);
For other configurations of BrowseConfig, please see comments in the source code.
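To make the roles of the pattern and the group-to-name map more concrete, here is a minimal, self-contained sketch that uses only `java.util.regex` from the JDK. It is not the engine's implementation: the class `FieldMapSketch` and the method `extractRecords` are hypothetical names, and it assumes that each regex match yields one record whose fields are keyed by the names in the map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of html-structured-engine.
class FieldMapSketch {

    // Returns one map per regex match, keyed by the name assigned to each group number.
    static List<Map<String, String>> extractRecords(String html, String regex,
                                                    Map<Integer, String> groupNames) {
        // DOTALL lets '.' also match line breaks (an assumption; the engine's flags may differ).
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(html);
        List<Map<String, String>> records = new ArrayList<>();
        while (m.find()) {
            Map<String, String> fields = new LinkedHashMap<>();
            for (Map.Entry<Integer, String> e : groupNames.entrySet()) {
                fields.put(e.getValue(), m.group(e.getKey()));
            }
            records.add(fields);
        }
        return records;
    }

    public static void main(String[] args) {
        String html = "<script>var a = 1;</script><p>text</p><script src=\"x.js\"></script>";
        Map<Integer, String> groupNames = new LinkedHashMap<>();
        groupNames.put(1, "script"); // same meaning as the map "1=script"
        for (Map<String, String> record
                : extractRecords(html, "<script[^>]*>(.*?)</script>", groupNames)) {
            System.out.println("script body: [" + record.get("script") + "]");
        }
    }
}
```

With this input the pattern matches twice, so two records are produced; the second has an empty "script" field because that tag has no body.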
Highly recommended tool
Composing a long regex by hand is tedious and error-prone. It is highly recommended to use the "Wildcard to regex tool" to generate the regex automatically. The wildcard expression itself can be produced by lightly editing the HTML source code. For example:
- Copy a piece of the HTML source code:

      <td>First Name: Tom</td> <td>Tel: 12345678</td>

- Replace the dynamic parts of the HTML with wildcards, following the syntax of the "Wildcard to regex tool":

      <td>First Name: *{name:Name}</td> <td>Tel: *{name:Tel}</td>

- Get the regex:

      <td>First Name: ((?:(?!</td>\s+<td>Tel:)(?:.|\n))+)</td>\s+<td>Tel:((?:(?!</td>)(?:.|\n))+)</td>

  And the map from group number to group name:

      1=Name|2=Tel
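The generated regex and map can be checked against the copied HTML with the JDK's `java.util.regex` before they are used in a BrowseConfig. The sketch below is only an illustration: `GfmapCheck`, `parseGfmap` and `extract` are hypothetical names, and it assumes the map string uses `|` between entries and `=` between group number and name, as in the output above. Because the regex as shown has no space after `Tel:`, the second group captures the leading space, so the sketch trims each value.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of html-structured-engine or the wildcard tool.
class GfmapCheck {

    // Parses "1=Name|2=Tel" into {1=Name, 2=Tel} (format assumed from the example above).
    static Map<Integer, String> parseGfmap(String gfmap) {
        Map<Integer, String> names = new LinkedHashMap<>();
        for (String entry : gfmap.split("\\|")) {
            String[] parts = entry.split("=", 2);
            names.put(Integer.parseInt(parts[0].trim()), parts[1].trim());
        }
        return names;
    }

    // Applies the regex to the first match and returns the trimmed group values by name.
    static Map<String, String> extract(String html, String regex, String gfmap) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(html);
        if (m.find()) {
            for (Map.Entry<Integer, String> e : parseGfmap(gfmap).entrySet()) {
                fields.put(e.getValue(), m.group(e.getKey()).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<td>First Name: Tom</td>\n<td>Tel: 12345678</td>";
        String regex = "<td>First Name: ((?:(?!</td>\\s+<td>Tel:)(?:.|\\n))+)</td>\\s+"
                     + "<td>Tel:((?:(?!</td>)(?:.|\\n))+)</td>";
        // prints {Name=Tom, Tel=12345678}
        System.out.println(extract(html, regex, "1=Name|2=Tel"));
    }
}
```

Running a check like this on a saved copy of the page is a quick way to confirm that the wildcard was placed correctly before fetching live URLs.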