17. January 2017
Big Data Reference architectures are a key research topic in business information systems. They try to simplify software development by reusing architectural and software components. But reusability leads also to a trade-off in making reference architectures on a higher level to reuse it in many domains and applications. Or to concentrate them on a subject and hence easier to reuse. In this blog post we discuss, whether big data references architectures are really needed. Our hypothesis is that current big data reference architectures are not sufficient to provide real benefit for implementing big data projects.
In scientific literature there are a lot of different reference architectures for specific domains or problems as well as approaches for definition of the term reference architecture. According to (Dern 2006) reference architectures unify groups of existing information systems and differ to common IT architectures in the abstraction level and usage in an organization. A more technical perspective view on the term uses (Trefke 2012), as he states that software components and their interactions are the core of a reference architecture to map the functionalities of reference models. Special characteristics given by the problem or domain use case are abstracted and so generalized. In a similar way the widely used Rational Unified Process (RUP) for software development describes the term as a “predefined architectural pattern, or set of patterns” that are already instantiated and proven in business or technical settings (Reed 2002).
These definition approaches focus on software perspective of a reference architecture. In a broader view the “whole” system including processes, methods, etc. have to integrated in such an architecture. The definition of (Cloutier et al. 2010) includes these aspects and further argues that a reference architecture “capture the essence of existing architectures, and the vision of future needs and evolution to provide guidance to assist in developing new system architectures.” In addition, this work proposes two key principles for the design of reference architectures (Cloutier 2010):

The common denominator in the definitions in the definitions are the terms abstraction, proven, universality and domain. Abstraction means that specific aspects, e.g. less important constraints, of the formulated problem or use case are not incorporated in the solution architecture. Most of the definitions speak about proven concepts which are integrated. So only artefacts that solved a sub problem, were already implemented and especially evaluated in practical settings can be used. Proven concepts can also mean that new concepts are developed that are based on experiences in similar problem cases. The other two terms – universality and domain – are in opposition to each other. Every reference architecture is located in a domain, no matter if its focused on a problem or a business domain. But on the other hand one overall goal of reference architecture is universality, which means generalization of specific problems.
This conflict is still an open problem in research, because of the trade-off between the universality of a specific problem and the usability of a reference architecture. This manifesst itself in the ease of use of a reference architecture for an instantiation in a domain-specific problem. If the architecture is too universal, the reusability in other settings is quite good, but the instantiation in a specific use case needs lots of adaptions. But if the described reference architecture is too domain-driven and problem-specific the universality lacks and the scientific nature and research contribution is disputable. The benefit of such architecture models is that they are highly relevant in practice. Furthermore, for instantiation less adaption is needed.
In recent years’ software applications, organizations, business processes and hence the information systems being developed and used are getting more complex, e.g. in their scope and size. Also the environment of the information systems landscape in organisations become more dynamic. “Time-to-market” and “interoperability” are terms that describe key success factors not just for business companies; in healthcare, government, etc. these needs are highly relevant too (Cloutier et al. 2010). The goal of reference architectures is to decouple the effect that with increasing complexity of information systems the complexity of the development process increases.
First of all, reference architectures provide common taxonomies, grammars and architectural visions of the problem domain. These aspects help to simplify communication in complex project teams and in requirements engineering. Further, the documentation of large information systems on an appropriate abstraction level leads to better interoperability between different systems. It further helps to reduce efforts to adapt instantiations due to dynamics and changing environments. The most important goal for many companies for using reference architectures is the more efficient development process of software or information systems. The reusability of existing concepts (and in the best case implemented software modules) leads to significant decrease in development efforts. By the use and application of known and already proven patterns for the implementation, reference architectures mitigate risks that occur by implementing new systems from scratch.
Against this background various reference architectures were postulated in recent years. Most notable approaches are:
These are published as de facto standards for their particular problem domain, constructed by national standardization organisations (NIST) or international industry consortiums. Other well-known approaches are vendor-driven and more domain- respectively technology-specific than the aforementioned. Examples are Business intelligence reference architectures by:
Their intention concerning reusability is focused on deploying their products at different customers and to promote these products. Different perspectives take domain-specific reference architectures, like the:
These are examples for business-oriented reference architectures that address besides the technical aspects of the IT systems also functional aspects of the domain and organizational aspects. Due to this more detailed view, the instantiation requires less effort, but the reusability is more restricted then the SOA or Cloud Computing reference architectures.
The NIST provides another big data reference architecture (National Institute of Standards and Technology 2015). It also describes technical building blocks of big data systems, e.g. infrastructure components like storage, computing or network resources, but also surrounding components for the management of the resources, processing models or data organization. Despite it provides also functional architecture components e.g. for data collection, preparation, analytics or visualization, the abstraction level of this components – and the customer context – is quite high, while this means that it is highly reusable in many use cases the instantiation for a specific domain needs more effort. So the first key principle of Cloutier (2010) is broken, since a clear strategy description is missing. Further it only consists of a technical architecture, while customer context and business architecture are necessary RA building blocks (Cloutier 2010).
Nirmala and Raj (2016) describe different components of big data architectures, which are commonly used in organizations for big data analytics projects. They further outline several scenarios like text, stream or predictive analytics. But how these use cases relate to the technical components is missing. The approach further lacks in describing the functions of and interactions between the technical components as well as how the architecture can be used in different settings and problems.
Pääkkönen and Pakkala (2015) also present an approach for a big data reference architecture. It is derived from existing data analytics infrastructures of Twitter, Facebook and LinkedIn. It contains eight modules starting with data sources and storage up to the visualization and access to information. Further it provides mappings between this modules and existing big data technologies.
A more specific reference architecture for big data systems in the national security domain is proposed by Klein et al. (2016). In addition to the technical view, it contains modules of the NIST approach (2015), it integrates domain knowledge and use cases to the reference architecture and tries to include aspects of reference models. Hence it provides a technical and a customer context, they lack in details and doesn’t contain methods to close this gap. Further domain-specific aspects of the use case have no consequences for architecture components and functions or are not described in detail.
All of the presented approaches have modules for the collection, storage, integration & preparation, analysis and visualization of data, as well as the under- respectively overlying stakeholder, data provider and data consumer. There is little variation in this components and in the abstraction level of these components. The NIST approach is getting deep into technical details and describes sub-modules, such as:
On the downside, it lacks in domain and application details. Therefore, this approach can be seen as a meta reference architecture to derive domain-specific reference models for big data projects.
Compared with domain-driven reference models like SCOR or ARIS the introduced attempts provide a more technical-oriented RAs and are far from describing detailed processes and best practices for solving specific business problems. Standardized approaches like (Open Management Group 2011) or (NIST 2011) have more fine-grained descriptions of the architecture modules and include in certain scope guidance for the instantiation. By comparing the big data reference architectures with existing business intelligence reference architectures it is remarkable that they have many components in common. Examples are modules for data sources, data integration and preparation as well as for data analytics and visualization. Apparent differences between both areas concern the focus in business intelligence on several aspects of metadata and information management, instead big data reference architectures include data processing aspects like processing models, infrastructure and resource management.
So why do we need explicit reference architectures for big data? Here are some arguments against!
One argument on behalf a big data reference architecture is the non-consideration of real-time data like sensor data streams in business intelligence reference architectures. Despite, some BI RAs contain stream-based applications like business or complex event processing.
| Big Data Origin | Big Data target | 
|---|---|
| Science | Scientific discovery | 
| Telecom | New technologies | 
| Industrie | Manufactoring, process control, transport | 
| Business | Personal services, campaigns | 
| Living environment Cities | Living environment support | 
| Social media and Networks | Human Behaviour | 
| Healthcare | Healthcare support | 
To come back to our hypothesis of the beginning, we explained why we think that the mentioned approaches are not sufficient for supporting big data projects. What is the conclusion and what should be future research outcomes to solve this problem? Since Big Data reference architectures barely distinguish from existing business intelligence reference architectures it might be reasonable to combine both concepts. The approaches can be merged and could lead to a Business Analytics reference architectures (BIRA).
Based on this “Big data meta architecture model” domain-specific reference models could be derived and instantiated. Table 1 shows possible domains and use cases for such reference architectures. Forrester research already tried to integrate Big Data in their BIRA for business applications but did not argue why (Evelson & Yuhanna 2012). Another research direction would be to design specific BDRAs for each domain and use cases without merging BI aspects. This would result in a higher number but lean architectures, which probably are easier to instantiate.
