Architectural Patterns: The data hub

The data hub is a centralized location for retrieving, receiving, collating and distributing data. It gathers information from diverse sources, environments and locations, manipulates it, and optionally distributes it to another set of locations.

Scenario 1. Tidying up data replication between heterogeneous systems
Before (figure 1)

After (figure 2)

Adding one additional system to a data hub requires one more data feed into the hub and one distributed feed out. Adding another system without a data hub requires a new point to point connection to every existing system it exchanges data with. Not only is the quantity of transfers greater, but each interface is likely to be a bespoke piece of software, requiring more design, development and test effort.
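The scaling argument can be made concrete. Assuming every pair of systems exchanges data in both directions (an assumption for illustration; real topologies are usually sparser), a full point to point mesh of n systems needs n(n-1) directed feeds, while a hub needs only one feed in and one feed out per system:

```python
def point_to_point_feeds(n):
    """Directed feeds needed for a full mesh of n systems."""
    return n * (n - 1)

def data_hub_feeds(n):
    """One feed into the hub and one out of it per system."""
    return 2 * n

# With 6 systems, a full mesh needs 30 bespoke feeds; a hub needs 12.
mesh = point_to_point_feeds(6)
hub = data_hub_feeds(6)
```

The mesh grows quadratically and the hub linearly, which is why the hub only pays off once the number of systems passes a break-even point.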

Scenario 2. Global reporting from a distributed system
Before (figure 3)

After (figure 4)

To add global reporting to an additional system in figure 3 would require data going from all systems to the new reporting system as well as to the existing one. In figure 4, only one more data feed is required from the data hub to the additional reporting system.

Related Patterns

Publish and subscribe

Data hubs are filled by collecting or accepting data. A hub can be built around the concept of systems publishing data to the data hub, and subscribing to other sets of data that the hub creates and delivers to them.
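A minimal sketch of that publish-and-subscribe interaction is shown below. The class and method names are illustrative only, not from any particular product; delivery here is a synchronous callback, where a real hub would queue and batch.

```python
class DataHub:
    """Minimal publish/subscribe hub: systems publish named data sets,
    and other systems register callbacks for the sets they want delivered."""

    def __init__(self):
        self.subscribers = {}  # data set name -> list of delivery callbacks

    def subscribe(self, dataset, callback):
        self.subscribers.setdefault(dataset, []).append(callback)

    def publish(self, dataset, data):
        # Deliver the published data to every subscriber of this data set.
        for deliver in self.subscribers.get(dataset, []):
            deliver(data)

# Usage: an HR system publishes staff records; a reporting system subscribes.
hub = DataHub()
received = []
hub.subscribe("staff", received.append)
hub.publish("staff", {"name": "Nigel Leeming"})
```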

Master-slave

Related to publish and subscribe. If publish and subscribe is used, then the data hub is the slave. If the data hub requests and distributes the data, then it is the master.

Bridge

A bridge is often used to gather information from an external system without concern for the environment at the far end of the bridge.

Hub and spoke

The hub and spoke pattern is an essential ingredient of the data hub.

Point to point

A single point to point connection is more efficient to develop, deploy and operate, and gives a better real-time response than a data hub. Only when the number of point to point connections begins to increase rapidly is a data hub of any real benefit.

Business objects

A hub can gather and disperse information as chunks of hierarchical or relational data, or as business objects. A business object expressed in a predefined XML schema can be very useful, since any connected system can consume it without knowing the source environment.

Alternative patterns

Data replication

Replicated data has been used by many organizations to synchronize and update data in databases since their early days.

Search engine

A search engine is similar to a data hub in that it goes out and collects vast quantities of information. It then builds indexes into that information. In some cases, a representation of the original information can be rebuilt from the index in a simplified (usually HTML) format.

Databases and search engines can both hold text, and both build indexes into relational data. It is not uncommon for them to coexist. The search engine will find "stereo playback" with greater success when you type in "binaural output"; the database will do a better job of finding all unpaid invoices for customer X.

Both databases and search engines use caches to hold recently requested information. A repeated request can then be delivered from the cache rather than by requerying the source.

Data warehouse

A data warehouse is a specialized and extended data hub. It is specialized in that data is only drawn in, and extended in that the collected data is further refined and expanded into sets of information and data cubes relevant to many viewers.

Related antipatterns

Spaghetti junction

A spaghetti junction is formed when a number of systems are all linked to each other (see figure 1) in a point to point configuration.

Example 1

A business uses Peoplesoft HR for storing personnel data and business roles. System access roles are held in Active Directory or by groups of members identified in Exchange’s Global Address List (GAL). File stores are used along with Exchange public folders to hold documents, CVs and staff pictures on an office by office basis. Sun Financials is used for accounting, an ERP system with SQL storage is used to run the day-to-day business, and XML files hold data on all of the employees’ laptops for offline processing. The Autonomy search engine is used to index file servers, Exchange public folders and selected SQL data.

Peoplesoft feeds the GAL (Global Address List) in Exchange, and Exchange feeds email addresses and machine names into the SQL database. The SQL data feeds into Peoplesoft, Sun Financials and Exchange Public Folders, and is merged with information from Active Directory to create and update the offline files (figure 5).

Figure 6 shows how a data hub is used to gather data from Peoplesoft, Exchange and SQL Server. It then updates the source systems, and feeds into Sun Financials and the offline data files.

In such a system, it is likely that all feeds are batch processes, run once a day or once a week, depending upon how up-to-date the data needs to be in the distributed systems.

Before (figure 5)

After (figure 6)

Although the data hub means the same number of lines of communication in this case, extracting any one system for replacement with another will be an easier task. Consolidating source data or adding another system to the mix will also prove easier.

Example 2

A business decision application needs to gather information from heterogeneous systems, present it to the user, then record the result of the decision and deliver it back to some of the source systems.


Figure 7. Using a data hub to gather dispersed data as an aid to decision making

When to use a data hub

Refactoring

Scenario: Redeveloping an existing SQL based distributed system

Data is collected from remote databases by a nightly batch process. It is then refactored into a new database design, and redistributed to a number of global locations where reports pull data directly from the new database structure.

Two way bridges are then built between the new and old data environments, allowing development of new client applications to gradually replace older applications.

Globalization of data

At present there are many distributed systems in use throughout worldwide organisations. Many businesses are consolidating this distributed data into global systems. A data hub can help to phase the consolidation of data gathered from many locations and many environments.

Other uses

Methods

A data hub can have two methods for getting data to the viewer. It can hold data itself (collect) or point directly to data sources (refer). When referring, a call is made to the data hub, which opens a connection to the data source, retrieves the data and forwards it to the requester. Whether the data is held locally or looked up remotely is not an issue for the requester, but one that is decided upon pragmatically by the designer of the data hub.

Example data request showing how it is not necessary to know where the data is:

Person person = (Person)DataHub.getPerson("Nigel Leeming");

BusinessObject getPerson(string name)
{
     try
     {
          // The hub decides whether the object is held locally (collect)
          // or must be fetched from the remote source system (refer).
          DataConnection conn = new DataConnection();
          if (Person.isObjectLocal())
          {
               conn.openLocalConnection();
          }
          else
          {
               conn.openRemoteConnection();
          }
          BusinessObject result = conn.GetObject("Person", name);
          conn.close();
          return result;
     }
     catch
     {
          return null;
     }
}
Data collection methods

Database

Data from databases can be collected by data replication, ODBC, JDBC or ADO.
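The collection step amounts to querying the source and loading the rows into the hub's own store. The sketch below uses Python's sqlite3 module purely as a stand-in for an ODBC/JDBC/ADO connection; the table names are invented for illustration:

```python
import sqlite3

def collect_into_hub(source_conn, hub_conn):
    """Pull person rows from a source database into the hub's store.
    sqlite3 stands in here for an ODBC, JDBC or ADO connection."""
    rows = source_conn.execute("SELECT name, role FROM person").fetchall()
    hub_conn.executemany(
        "INSERT INTO hub_person (name, role) VALUES (?, ?)", rows)
    hub_conn.commit()
    return len(rows)

# Demonstration with in-memory databases standing in for remote sources.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE person (name TEXT, role TEXT)")
source.execute("INSERT INTO person VALUES ('Nigel Leeming', 'Architect')")
hub = sqlite3.connect(":memory:")
hub.execute("CREATE TABLE hub_person (name TEXT, role TEXT)")
copied = collect_into_hub(source, hub)
```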

APIs

Some systems, depending on the foresight of their designers, will provide a rich set of APIs through which information can be extracted. If the API is fronted by a web service, then information can be extracted in a uniform way with the minimum of effort.
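For instance, a web service front end typically returns a structured payload that the hub normalizes into records. The response shape below is hypothetical, assumed only for illustration:

```python
import json

def extract_people(api_response):
    """Normalize a (hypothetical) web service JSON payload into
    (name, role) records for loading into the hub."""
    payload = json.loads(api_response)
    return [(p["name"], p["role"]) for p in payload["people"]]

# A canned response standing in for the body of an HTTP call.
response = '{"people": [{"name": "Nigel Leeming", "role": "Architect"}]}'
records = extract_people(response)
```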

Binary or text files or streams

ASCII or binary files and streams can be transferred across most systems using file copy, FTP, email attachments, email bodies, or streamed through a telnet port or other serial or parallel interface.

There are many low-tech options for extracting information from printer or serial ports, which come on even the oldest and most decrepit systems. Failing that, it is still possible in some cases to wire up old microprocessors’ data buses to newer parallel interfaces, and simply read off the incoming data. I have used this technique more than once, although not for a long time and certainly not on a mainframe.

Screen scraping

Most older mainframe systems have VT connections, which are RS-232 serial interfaces. Using Telnet to emulate a manual worker, information can be entered as if from a keyboard, then read from output returned to the virtual screen. Many screen scraping products are available.
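The parsing half of screen scraping is usually a matter of slicing fields out of a fixed-width virtual screen. A minimal sketch, with column positions invented for illustration:

```python
def scrape_screen_line(line):
    """Pull fields from one line of an emulated terminal screen.
    The column layout (0-20 name, 20-30 department) is illustrative."""
    return {
        "name": line[0:20].strip(),
        "department": line[20:30].strip(),
    }

# One 30-column line as returned to the virtual screen.
screen_line = "LEEMING, NIGEL      IT        "
record = scrape_screen_line(screen_line)
```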

Reading input from paper

Character and handwriting recognition systems can process thousands of pages per hour to scan details from delivery notes, purchase orders, invoices etc. Some companies specialize in offering this as a service to their customers.

Manual input

Finally, manual imports from spreadsheets should not be overlooked, nor should manual input via a keyboard or speech recognition device, though these are the last and worst solutions.

Data redistribution methods

The interaction between a data hub and its source is not always a one way transfer. The data hub may also gather and present information upon which decisions are made. If the decision is to be recorded alongside the source data, it must be fed back into those systems.

Similarly, the data hub may be used to consolidate data, which is then fed back in a summarized form to the source systems, most often for reporting.

The methods for getting information out of the data hub and back to the source are largely the same as those for extracting information. However, complications can arise in transaction processing environments, or where updates to the same pieces of data can come from multiple sources.
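One pragmatic policy for the multiple-source complication is last-writer-wins, keyed on a timestamp. This is only one of several possible policies, sketched here under the assumption that source timestamps are comparable and reliable:

```python
def merge_updates(updates):
    """Resolve conflicting updates to the same record from multiple
    sources with a last-writer-wins policy.
    `updates` is a list of (record_id, timestamp, value) tuples."""
    latest = {}
    for record_id, timestamp, value in updates:
        if record_id not in latest or timestamp > latest[record_id][0]:
            latest[record_id] = (timestamp, value)
    return {rid: value for rid, (_, value) in latest.items()}

# Two sources update the same employee record; the later write wins.
updates = [
    ("emp42", 1, "old address"),   # from the HR system
    ("emp42", 2, "new address"),   # later update from the ERP system
]
merged = merge_updates(updates)
```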

Summary

The data hub is a simple concept. The difficulty lies in selling it to a sponsor, and implementing it with sufficient foresight. Point to point solutions are easier to develop and implement, and the returns from a data hub only become evident when reuse of connections, bridges and gathered data comes into play.