When Screen Scraping became API calling – Gathering Oracle OpenWorld Session Catalog with Node by Lucas Jellema

image

A dataset with all sessions of the upcoming Oracle OpenWorld 2017 conference is nice to have – for experiments and demonstrations with many technologies. The session catalog is exposed at a website here.

With searching, filtering and scrolling, all available sessions can be inspected. If data is available in a browser, it can be retrieved programmatically and persisted locally in for example a JSON document. A typical approach for this is web scraping: having a server side program act like a browser, retrieve the HTML from the web site and query the data from the response. This process is described for example in this article – https://codeburst.io/an-introduction-to-web-scraping-with-node-js-1045b55c63f7 – for Node and the Cheerio library.

However, server side screen scraping of HTML will only be successful when the HTML is static. Dynamic HTML is constructed in the browser by executing JavaScript code that manipulates the browser DOM. If that is the mechanism behind a web site, server side scraping is at the very least considerably more complex (as it requires the server to emulate a modern web browser to a large degree). Selenium has been used in such cases – to provide a server side, programmatically accessible browser engine. Alternatively, screen scraping can also be performed inside the browser itself – as is supported for example by the Getsy library.

As you will find in this article – when server side scraping fails, client side scraping may be a much to complex solution. It is very well possible that the rich client web application is using a REST API that provides the data as a JSON document. An API that our server side program can also easily leverage. That turned out the case for the OOW 2017 website – so instead of complex HTML parsing and server side or even client side scraping, the challenge at hand resolves to nothing more than a little bit of REST calling. Read the complete article here.

About Jürgen Kress
As a middleware expert Jürgen works at Oracle EMEA Alliances and Channels, responsible for Oracle’s EMEA Fusion Middleware partner business. He is the founder of the Oracle SOA & BPM and the WebLogic Partner Communities and the global Oracle Partner Advisory Councils. With more than 5000 members from all over the world the Middleware Partner Community is the most successful and active community at Oracle. Jürgen manages the community with monthly newsletters, webcasts and conferences. He hosts his annual Fusion Middleware Partner Community Forums and the Fusion Middleware Summer Camps, where more than 200 partners get product updates, roadmap insights and hands-on trainings. Supplemented by many web 2.0 tools like twitter, discussion forums, online communities, blogs and wikis. For the SOA & Cloud Symposium by Thomas Erl, Jürgen is a member of the steering board. He is also a frequent speaker at conferences like the SOA & BPM Integration Days, JAX, UKOUG, OUGN, or OOP.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: