Get the 5-day weather forecast for New York
Scrape data from a web page using "HTML-structured-engine"
Introduction
This article shows how to scrape data from a web page, using a real example: the 5-day weather forecast for New York at http://www.weather.com/weather/5-day/USNY0996. The regex is generated with the help of the "Wildcard to regex tool", so we do not have to write it by hand.
Steps
- Open a browser, go to http://www.weather.com/weather/5-day/USNY0996, and find the data we need to get.
- View the page source and copy the HTML source code of one day:
    <div class="wx-daypart">
      <h3>Today
        <span class="wx-label">May 28</span>
      </h3>
      <div class="wx-conditions">
        <img src="http://s.imwx.com/v.20120328.084156/img/wxicon/100/11.png" height="70" width="70" alt="Showers" class="wx-weather-icon">
        <p class="wx-temp"> 66<sup>°F</sup><span class="wx-label"></span></p>
        <p class="wx-temp-alt"> 60<sup>°F</sup><span class="wx-label"></span></p>
        <p class="wx-phrase">Showers</p>
      </div>
      <div class="wx-details wx-event-details-link">
        <dl>
          <dt>Chance of rain:</dt>
          <dd>60%</dd>
        </dl>
        <dl>
          <dt>Wind:</dt>
          <dd>
            ESE at 8 mph
          </dd>
        </dl>
        <div class="wx-more"><a href="/weather/today/USNY0996" from="5day_24Hour_details_1">Details</a></div>
      </div>
      <div class="wx-planmyday1 wx-plan-day wx-expand wx-clear"></div>
    </div>
- Replace the dynamic parts of the HTML with wildcards: use *{name:XXX} for content we want to capture, and a plain * for everything else that varies.
    <div class="wx-daypart">
      <h3>*
        <span class="wx-label">*{name:date}</span>
      </h3>
      <div class="wx-conditions">
        <img src="*{name:weather-icon}"*>
        <p class="wx-temp"> *{name:temp}<sup>*</p>
        <p class="wx-temp-alt"> *{name:temp-alt}<sup>*</p>
        <p class="wx-phrase">*{name:phrase}</p>
      </div>
      <div class="wx-details wx-event-details-link">
        <dl>
          <dt>Chance of rain:</dt>
          <dd>*{name:rain}</dd>
        </dl>
        <dl>
          <dt>Wind:</dt>
          <dd>
            *{name:wind}
          </dd>
        </dl>
        <div class="wx-more">*</div>
      </div>
      <div *></div>
    </div>
- Copy the pattern into the "Wildcard to regex tool". It produces the following regex and group map:
<div\s+class\="wx\-daypart">\s+<h3>(?:(?!\s+<span\s+class\="wx\-label">)(?:.|\n))+\s+<span\s+class\="wx\-label">((?:(?!</span>\s+</h3>\s+<div\s+class\="wx\-conditions">\s+<img\s+src\=")(?:.|\n))+)</span>\s+</h3>\s+<div\s+class\="wx\-conditions">\s+<img\s+src\="((?:(?!")(?:.|\n))+)"(?:(?!>\s+<p\s+class\="wx\-temp">\s+)(?:.|\n))+>\s+<p\s+class\="wx\-temp">\s+((?:(?!<sup>)(?:.|\n))+)<sup>(?:(?!</p>\s+<p\s+class\="wx\-temp\-alt">\s+)(?:.|\n))+</p>\s+<p\s+class\="wx\-temp\-alt">\s+((?:(?!<sup>)(?:.|\n))+)<sup>(?:(?!</p>\s+<p\s+class\="wx\-phrase">)(?:.|\n))+</p>\s+<p\s+class\="wx\-phrase">((?:(?!</p>\s+</div>\s+<div\s+class\="wx\-details\s+wx\-event\-details\-link">\s+<dl>\s+<dt>Chance\s+of\s+rain\:</dt>\s+<dd>)(?:.|\n))+)</p>\s+</div>\s+<div\s+class\="wx\-details\s+wx\-event\-details\-link">\s+<dl>\s+<dt>Chance\s+of\s+rain\:</dt>\s+<dd>((?:(?!</dd>\s+</dl>\s+<dl>\s+<dt>Wind\:</dt>\s+<dd>\s+)(?:.|\n))+)</dd>\s+</dl>\s+<dl>\s+<dt>Wind\:</dt>\s+<dd>\s+((?:(?!\s+</dd>\s+</dl>\s+<div\s+class\="wx\-more">)(?:.|\n))+)\s+</dd>\s+</dl>\s+<div\s+class\="wx\-more">(?:(?!</div>\s+</div>\s+<div\s+)(?:.|\n))+</div>\s+</div>\s+<div\s+(?:(?!></div>\s+</div>)(?:.|\n))+></div>\s+</div>
Group number to name:
    1=date|2=weather-icon|3=temp|4=temp-alt|5=phrase|6=rain|7=wind
- Copy the regex into Regex Match Tracer (it is free) to generate code: download and install Regex Match Tracer, then use the menu "Tools -> Export -> Java".
- Finally, the whole program is:
    // Needs java.util.Map, a test framework that provides @Test, and the
    // HTML-structured-engine classes (BrowseConfig, Html2StructBrowser, ...).
    @Test
    public void testBrowse() {
        BrowseConfig config = new BrowseConfig();
        config.setPattern(
            "<div\\s+class\\=\"wx\\-daypart\">\\s+<h3>(?:(?!\\s+"
            + "<span\\s+class\\=\"wx\\-label\">)(?:.|\\n))+\\s+<span\\s+class"
            + "\\=\"wx\\-label\">((?:(?!</span>\\s+</h3>\\s+<div\\s+class\\="
            + "\"wx\\-conditions\">\\s+<img\\s+src\\=\")(?:.|\\n))+)</span>\\s+"
            + "</h3>\\s+<div\\s+class\\=\"wx\\-conditions\">\\s+<img\\s+src\\="
            + "\"((?:(?!\")(?:.|\\n))+)\"(?:(?!>\\s+<p\\s+class\\=\"wx\\-temp\">"
            + "\\s+)(?:.|\\n))+>\\s+<p\\s+class\\=\"wx\\-temp\">\\s+((?:(?!<sup>)"
            + "(?:.|\\n))+)<sup>(?:(?!</p>\\s+<p\\s+class\\=\"wx\\-temp\\-alt\">\\s+)"
            + "(?:.|\\n))+</p>\\s+<p\\s+class\\=\"wx\\-temp\\-alt\">\\s+((?:(?!<sup>)"
            + "(?:.|\\n))+)<sup>(?:(?!</p>\\s+<p\\s+class\\=\"wx\\-phrase\">)"
            + "(?:.|\\n))+</p>\\s+<p\\s+class\\=\"wx\\-phrase\">((?:(?!</p>\\s+"
            + "</div>\\s+<div\\s+class\\=\"wx\\-details\\s+wx\\-event\\-details"
            + "\\-link\">\\s+<dl>\\s+<dt>Chance\\s+of\\s+rain\\:</dt>\\s+<dd>)"
            + "(?:.|\\n))+)</p>\\s+</div>\\s+<div\\s+class\\=\"wx\\-details\\s+wx"
            + "\\-event\\-details\\-link\">\\s+<dl>\\s+<dt>Chance\\s+of\\s+rain\\:"
            + "</dt>\\s+<dd>((?:(?!</dd>\\s+</dl>\\s+<dl>\\s+<dt>Wind\\:</dt>\\s+<dd>"
            + "\\s+)(?:.|\\n))+)</dd>\\s+</dl>\\s+<dl>\\s+<dt>Wind\\:</dt>\\s+<dd>"
            + "\\s+((?:(?!\\s+</dd>\\s+</dl>\\s+<div\\s+class\\=\"wx\\-more\">)"
            + "(?:.|\\n))+)\\s+</dd>\\s+</dl>\\s+<div\\s+class\\=\"wx\\-more\">"
            + "(?:(?!</div>\\s+</div>\\s+<div\\s+)(?:.|\\n))+</div>\\s+</div>\\s+"
            + "<div\\s+(?:(?!></div>\\s+</div>)(?:.|\\n))+></div>\\s+</div>"
        );
        config.setGfmap("1=date|2=weather-icon|3=temp|4=temp-alt|5=phrase|6=rain|7=wind");
        config.setLanguage("en"); // optional

        BrowseInterface browse = new Html2StructBrowser(config);
        browse.browse(new BrowseContext(new SimpleRequester()), new BrowseListener() {
            public void save(BrowseContext context) {
                System.out.println("------" + printRecord(context.getFields()));
            }

            public boolean beforeOpenURL(String url) {
                System.out.println("going to open: " + url);
                return true;
            }
        });
    }

    // Formats a record as "{ key: value, key: value }".
    static String printRecord(Map<String, Object> rec) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Object> e : rec.entrySet()) {
            if (sb.length() > 0) sb.append(", ");
            sb.append(e.getKey()).append(": ").append(e.getValue());
        }
        return "{ " + sb + " }";
    }
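Independently of the engine, the way a regex and a group map combine into a named record can be tried with plain `java.util.regex`. The class below is only an illustration: `GroupMapDemo`, `parseGroupMap`, and `extract` are hypothetical names, the pattern is a shortened fragment written in the style of the generated regex (rain and wind fields only), and reading the `N=name|...` map as "group number to field name" is an assumption about what the engine does internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupMapDemo {

    // Shortened pattern in the style of the generated regex (not the full one):
    // group 1 captures the rain chance, group 2 captures the wind.
    static final String REGEX =
        "<dt>Chance\\s+of\\s+rain\\:</dt>\\s+<dd>((?:(?!</dd>)(?:.|\\n))+)</dd>"
        + "\\s+</dl>\\s+<dl>\\s+<dt>Wind\\:</dt>\\s+<dd>"
        + "\\s+((?:(?!\\s+</dd>)(?:.|\\n))+)\\s+</dd>";

    // Parse a map string such as "1=rain|2=wind" into group number -> field name.
    static Map<Integer, String> parseGroupMap(String gfmap) {
        Map<Integer, String> names = new LinkedHashMap<>();
        for (String pair : gfmap.split("\\|")) {
            String[] kv = pair.split("=", 2);
            names.put(Integer.parseInt(kv[0].trim()), kv[1].trim());
        }
        return names;
    }

    // Return the named fields of the first match, or an empty map if nothing matches.
    static Map<String, String> extract(String html, String regex, String gfmap) {
        Map<String, String> record = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(html);
        if (m.find()) {
            for (Map.Entry<Integer, String> e : parseGroupMap(gfmap).entrySet()) {
                record.put(e.getValue(), m.group(e.getKey()));
            }
        }
        return record;
    }

    public static void main(String[] args) {
        String html =
            "<dl>\n<dt>Chance of rain:</dt>\n<dd>60%</dd>\n</dl>\n"
            + "<dl>\n<dt>Wind:</dt>\n<dd>\nESE at 8 mph\n</dd>\n</dl>";
        // prints {rain=60%, wind=ESE at 8 mph}
        System.out.println(extract(html, REGEX, "1=rain|2=wind"));
    }
}
```

The `(?:(?!X)(?:.|\n))+` construct consumes one character at a time as long as the text `X` does not start at the current position, which is why the wind value comes back without its surrounding line breaks.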
This way, we can scrape the data without writing the regex ourselves.
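For intuition about what a wildcard-to-regex conversion involves, here is a simplified sketch of a converter. It is not the "Wildcard to regex tool" itself: the class name `WildcardToRegex` and its fields are hypothetical, and it emits lazy `[\s\S]+?` wildcards rather than the negative lookaheads seen in the tool's output, so the resulting regex differs in form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WildcardToRegex {

    // A token is a named wildcard "*{name:xxx}", a plain "*", or a whitespace run.
    private static final Pattern TOKEN =
        Pattern.compile("\\*\\{name:([^}]+)\\}|\\*|\\s+");

    final String regex;     // generated regular expression
    final String groupMap;  // e.g. "1=date|2=phrase"

    WildcardToRegex(String wildcard) {
        StringBuilder out = new StringBuilder();
        List<String> names = new ArrayList<>();
        Matcher m = TOKEN.matcher(wildcard);
        int last = 0;
        while (m.find()) {
            // Literal text between two tokens is matched verbatim.
            if (m.start() > last) {
                out.append(Pattern.quote(wildcard.substring(last, m.start())));
            }
            if (m.group(1) != null) {
                names.add(m.group(1));
                out.append("([\\s\\S]+?)");      // named wildcard: capturing
            } else if (m.group().equals("*")) {
                out.append("(?:[\\s\\S]+?)");    // plain wildcard: non-capturing
            } else {
                out.append("\\s+");              // any whitespace run
            }
            last = m.end();
        }
        if (last < wildcard.length()) {
            out.append(Pattern.quote(wildcard.substring(last)));
        }
        // Number the named wildcards in order of appearance.
        StringBuilder map = new StringBuilder();
        for (int i = 0; i < names.size(); i++) {
            if (i > 0) map.append('|');
            map.append(i + 1).append('=').append(names.get(i));
        }
        regex = out.toString();
        groupMap = map.toString();
    }

    public static void main(String[] args) {
        WildcardToRegex w = new WildcardToRegex(
            "<h3>*\n<span class=\"wx-label\">*{name:date}</span>\n</h3>\n"
            + "<p class=\"wx-phrase\">*{name:phrase}</p>");
        String html = "<h3>Today\n  <span class=\"wx-label\">May 28</span>\n</h3>\n"
            + "<p class=\"wx-phrase\">Showers</p>";
        Matcher m = Pattern.compile(w.regex).matcher(html);
        System.out.println(w.groupMap);          // prints 1=date|2=phrase
        if (m.find()) {
            System.out.println(m.group(1) + " / " + m.group(2)); // prints May 28 / Showers
        }
    }
}
```

One limitation of this simplification: a lazy wildcard at the very end of a pattern would match only a single character, so a pattern should end with literal text, as the one in this article does.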