Get the 5-day weather forecast for New York
Scrape data from a web page using "HTML-structured-engine"
Introduction
This article shows how to scrape data from a web page using a real example: the 5-day weather forecast for New York at http://www.weather.com/weather/5-day/USNY0996. The regex is generated with the help of the "Wildcard to regex tool", so we do not have to write it by hand.
Steps
- Open a browser, go to http://www.weather.com/weather/5-day/USNY0996, and find the data we want to extract.
- View the page source and copy the HTML source code for one day:
<div class="wx-daypart"> <h3>Today <span class="wx-label">May 28</span> </h3> <div class="wx-conditions"> <img src="http://s.imwx.com/v.20120328.084156/img/wxicon/100/11.png" height="70" width="70" alt="Showers" class="wx-weather-icon"> <p class="wx-temp"> 66<sup>°F</sup><span class="wx-label"></span></p> <p class="wx-temp-alt"> 60<sup>°F</sup><span class="wx-label"></span></p> <p class="wx-phrase">Showers</p> </div> <div class="wx-details wx-event-details-link"> <dl> <dt>Chance of rain:</dt> <dd>60%</dd> </dl> <dl> <dt>Wind:</dt> <dd> ESE at 8 mph </dd> </dl> <div class="wx-more"><a href="/weather/today/USNY0996" from="5day_24Hour_details_1">Details</a></div> </div> <div class="wx-planmyday1 wx-plan-day wx-expand wx-clear"></div> </div>
- Replace the dynamic parts of the HTML with wildcards (use *{name:XXX} for content we want to capture and * for everything else):
<div class="wx-daypart"> <h3>* <span class="wx-label">*{name:date}</span> </h3> <div class="wx-conditions"> <img src="*{name:weather-icon}"*> <p class="wx-temp"> *{name:temp}<sup>*</p> <p class="wx-temp-alt"> *{name:temp-alt}<sup>*</p> <p class="wx-phrase">*{name:phrase}</p> </div> <div class="wx-details wx-event-details-link"> <dl> <dt>Chance of rain:</dt> <dd>*{name:rain}</dd> </dl> <dl> <dt>Wind:</dt> <dd> *{name:wind} </dd> </dl> <div class="wx-more">*</div> </div> <div *></div> </div>
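To make the wildcard syntax concrete, here is a minimal sketch of how such a pattern could be turned into a regex and a group map. This is not the actual "Wildcard to regex tool": the class WildcardSketch and its methods are hypothetical names, and the sketch uses lazy [\s\S]+? quantifiers rather than the lookahead-based groups that the tool emits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of HTML-structured-engine.
class WildcardSketch {
    // A token is a named wildcard "*{name:xyz}", a bare "*", or a run of whitespace.
    private static final Pattern TOKEN =
            Pattern.compile("\\*\\{name:([^}]+)\\}|\\*|\\s+");

    // Field names in capture-group order, filled in by toRegex.
    final List<String> names = new ArrayList<>();

    String toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        Matcher m = TOKEN.matcher(wildcard);
        int last = 0;
        while (m.find()) {
            // Literal text between two tokens is quoted so it matches verbatim.
            if (m.start() > last) {
                regex.append(Pattern.quote(wildcard.substring(last, m.start())));
            }
            if (m.group(1) != null) {
                names.add(m.group(1));
                regex.append("([\\s\\S]+?)");   // named wildcard: capture lazily
            } else if (m.group().equals("*")) {
                regex.append("[\\s\\S]+?");     // bare wildcard: skip lazily
            } else {
                regex.append("\\s+");           // any whitespace run
            }
            last = m.end();
        }
        if (last < wildcard.length()) {
            regex.append(Pattern.quote(wildcard.substring(last)));
        }
        return regex.toString();
    }

    // Builds a map string in the "1=date|2=phrase" style used in this article.
    String groupMap() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < names.size(); i++) {
            if (i > 0) sb.append('|');
            sb.append(i + 1).append('=').append(names.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        WildcardSketch sketch = new WildcardSketch();
        String regex = sketch.toRegex("<dl> <dt>Wind:</dt> <dd> *{name:wind} </dd> </dl>");
        Matcher m = Pattern.compile(regex)
                .matcher("<dl> <dt>Wind:</dt> <dd> ESE at 8 mph </dd> </dl>");
        if (m.find()) {
            System.out.println(sketch.groupMap() + " -> " + m.group(1));
        }
    }
}
```

Because whitespace in the pattern becomes \s+, the HTML may be laid out on one line or many without affecting the match.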
- Copy the pattern into the "Wildcard to regex tool", which produces the following regex and group map:
<div\s+class\="wx\-daypart">\s+<h3>(?:(?!\s+<span\s+class\="wx\-label">)(?:.|\n))+\s+<span\s+class\="wx\-label">((?:(?!</span>\s+</h3>\s+<div\s+class\="wx\-conditions">\s+<img\s+src\=")(?:.|\n))+)</span>\s+</h3>\s+<div\s+class\="wx\-conditions">\s+<img\s+src\="((?:(?!")(?:.|\n))+)"(?:(?!>\s+<p\s+class\="wx\-temp">\s+)(?:.|\n))+>\s+<p\s+class\="wx\-temp">\s+((?:(?!<sup>)(?:.|\n))+)<sup>(?:(?!</p>\s+<p\s+class\="wx\-temp\-alt">\s+)(?:.|\n))+</p>\s+<p\s+class\="wx\-temp\-alt">\s+((?:(?!<sup>)(?:.|\n))+)<sup>(?:(?!</p>\s+<p\s+class\="wx\-phrase">)(?:.|\n))+</p>\s+<p\s+class\="wx\-phrase">((?:(?!</p>\s+</div>\s+<div\s+class\="wx\-details\s+wx\-event\-details\-link">\s+<dl>\s+<dt>Chance\s+of\s+rain\:</dt>\s+<dd>)(?:.|\n))+)</p>\s+</div>\s+<div\s+class\="wx\-details\s+wx\-event\-details\-link">\s+<dl>\s+<dt>Chance\s+of\s+rain\:</dt>\s+<dd>((?:(?!</dd>\s+</dl>\s+<dl>\s+<dt>Wind\:</dt>\s+<dd>\s+)(?:.|\n))+)</dd>\s+</dl>\s+<dl>\s+<dt>Wind\:</dt>\s+<dd>\s+((?:(?!\s+</dd>\s+</dl>\s+<div\s+class\="wx\-more">)(?:.|\n))+)\s+</dd>\s+</dl>\s+<div\s+class\="wx\-more">(?:(?!</div>\s+</div>\s+<div\s+)(?:.|\n))+</div>\s+</div>\s+<div\s+(?:(?!></div>\s+</div>)(?:.|\n))+></div>\s+</div>
Group number to name:
1=date|2=weather-icon|3=temp|4=temp-alt|5=phrase|6=rain|7=wind
- Copy the regex into Regex Match Tracer (it's free) to generate code: download and install Regex Match Tracer, then use the menu "Tools -> Export -> Java".
- Finally, the whole program is:
@Test
public void testBrowse() {
    BrowseConfig config = new BrowseConfig();
    config.setUrl("http://www.weather.com/weather/5-day/USNY0996");
    config.setPattern(
        "<div\\s+class\\=\"wx\\-daypart\">\\s+<h3>(?:(?!\\s+" +
        "<span\\s+class\\=\"wx\\-label\">)(?:.|\\n))+\\s+<span\\s+class" +
        "\\=\"wx\\-label\">((?:(?!</span>\\s+</h3>\\s+<div\\s+class\\=" +
        "\"wx\\-conditions\">\\s+<img\\s+src\\=\")(?:.|\\n))+)</span>\\s+" +
        "</h3>\\s+<div\\s+class\\=\"wx\\-conditions\">\\s+<img\\s+src\\=" +
        "\"((?:(?!\")(?:.|\\n))+)\"(?:(?!>\\s+<p\\s+class\\=\"wx\\-temp\">" +
        "\\s+)(?:.|\\n))+>\\s+<p\\s+class\\=\"wx\\-temp\">\\s+((?:(?!<sup>)" +
        "(?:.|\\n))+)<sup>(?:(?!</p>\\s+<p\\s+class\\=\"wx\\-temp\\-alt\">\\s+)" +
        "(?:.|\\n))+</p>\\s+<p\\s+class\\=\"wx\\-temp\\-alt\">\\s+((?:(?!<sup>)" +
        "(?:.|\\n))+)<sup>(?:(?!</p>\\s+<p\\s+class\\=\"wx\\-phrase\">)" +
        "(?:.|\\n))+</p>\\s+<p\\s+class\\=\"wx\\-phrase\">((?:(?!</p>\\s+" +
        "</div>\\s+<div\\s+class\\=\"wx\\-details\\s+wx\\-event\\-details" +
        "\\-link\">\\s+<dl>\\s+<dt>Chance\\s+of\\s+rain\\:</dt>\\s+<dd>)" +
        "(?:.|\\n))+)</p>\\s+</div>\\s+<div\\s+class\\=\"wx\\-details\\s+wx" +
        "\\-event\\-details\\-link\">\\s+<dl>\\s+<dt>Chance\\s+of\\s+rain\\:" +
        "</dt>\\s+<dd>((?:(?!</dd>\\s+</dl>\\s+<dl>\\s+<dt>Wind\\:</dt>\\s+<dd>" +
        "\\s+)(?:.|\\n))+)</dd>\\s+</dl>\\s+<dl>\\s+<dt>Wind\\:</dt>\\s+<dd>" +
        "\\s+((?:(?!\\s+</dd>\\s+</dl>\\s+<div\\s+class\\=\"wx\\-more\">)" +
        "(?:.|\\n))+)\\s+</dd>\\s+</dl>\\s+<div\\s+class\\=\"wx\\-more\">" +
        "(?:(?!</div>\\s+</div>\\s+<div\\s+)(?:.|\\n))+</div>\\s+</div>\\s+" +
        "<div\\s+(?:(?!></div>\\s+</div>)(?:.|\\n))+></div>\\s+</div>");
    config.setGfmap("1=date|2=weather-icon|3=temp|4=temp-alt|5=phrase|6=rain|7=wind");
    config.setLanguage("en"); // optional

    BrowseInterface browse = new Html2StructBrowser(config);
    browse.browse(new BrowseContext(new SimpleRequester()), new BrowseListener() {
        public void save(BrowseContext context) {
            System.out.println("------" + printRecord(context.getFields()));
        }

        public boolean beforeOpenURL(String url) {
            System.out.println("going to open: " + url);
            return true;
        }
    });
}

static String printRecord(Map<String, Object> rec) {
    StringBuffer sb = new StringBuffer();
    for (Map.Entry<String, Object> e : rec.entrySet()) {
        if (sb.length() > 0) sb.append(", ");
        sb.append(e.getKey()).append(": ").append(e.getValue());
    }
    return "{ " + sb + " }";
}
So we can scrape the data without writing the regex ourselves.
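To illustrate what the regex and the group map do together, the following sketch applies a map in the "1=date|2=phrase" style using only java.util.regex. It is an assumption about the general idea, not the engine's actual implementation: GroupMapSketch, parseGroupMap, and extract are hypothetical names, the regex is a much shorter hand-written stand-in for the generated one, and the second day in the sample HTML is made-up data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of HTML-structured-engine.
class GroupMapSketch {
    // Parses a map such as "1=date|2=phrase" into group number -> field name.
    static Map<Integer, String> parseGroupMap(String gfmap) {
        Map<Integer, String> map = new LinkedHashMap<>();
        for (String entry : gfmap.split("\\|")) {
            String[] kv = entry.split("=", 2);
            map.put(Integer.parseInt(kv[0].trim()), kv[1].trim());
        }
        return map;
    }

    // Produces one record (field name -> captured text) per regex match.
    static List<Map<String, String>> extract(String html, String regex, String gfmap) {
        Map<Integer, String> names = parseGroupMap(gfmap);
        List<Map<String, String>> records = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(html);
        while (m.find()) {
            Map<String, String> rec = new LinkedHashMap<>();
            for (Map.Entry<Integer, String> e : names.entrySet()) {
                String value = m.group(e.getKey());
                rec.put(e.getValue(), value == null ? null : value.trim());
            }
            records.add(rec);
        }
        return records;
    }

    public static void main(String[] args) {
        // Simplified sample; the second day is invented for the demonstration.
        String html =
              "<h3>Today <span class=\"wx-label\">May 28</span> </h3>"
            + "<p class=\"wx-phrase\">Showers</p>"
            + "<h3>Tue <span class=\"wx-label\">May 29</span> </h3>"
            + "<p class=\"wx-phrase\">Sunny</p>";
        String regex =
              "<span\\s+class=\"wx-label\">([^<]+)</span>\\s*</h3>\\s*"
            + "<p\\s+class=\"wx-phrase\">([^<]+)</p>";
        for (Map<String, String> rec : extract(html, regex, "1=date|2=phrase")) {
            System.out.println(rec);
        }
    }
}
```

Keeping the names in a separate map lets the regex stay purely positional, which is presumably why the tool emits the two pieces side by side.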