The rise of big data presents both big opportunities and big challenges to domains ranging from enterprises to sciences. The opportunities include better-informed business decisions, more efficient supply-chain management and resource allocation, more effective targeting of products and advertisements, better ways to “organize the world’s information,” and faster turnaround of scientific discoveries, among others.
The challenges are also tremendous. For one, more data comes in diverse forms: such as text, audio, video, OCR, and sensor data. While existing data management systems predominantly assume that data has rigid, precise semantics, increasingly more data (albeit valuable) contains imprecision or inconsistency. For another, the proliferation of ever-evolving algorithms to gain insights from data (in the name of machine learning, data mining, statistical analysis, and so on) can often be daunting to a developer with a particular dataset and specific goals: the developer not only has to keep up with the state of the art, but also must expend significant development effort in experimenting with different algorithms.
Many state-of-the-art approaches to both of these challenges are largely statistical and combine rich databases with software driven by statistical analysis and machine learning. Examples include Google’s Knowledge Graph, Apple’s Siri, IBM’s “Jeopardy!”-winning Watson system, and the recommendation systems of Amazon and Netflix. The success of these big-data analytics-driven systems, also known as trained systems, has captured the public imagination, and there is excitement in bringing such capabilities to other verticals such as enterprises, health care, sciences, and government. The complexity of such systems, however, means that building them is very challenging, even for Ph.D.-level computer scientists. If such systems are to have truly broad impact, building and maintaining them needs to become substantially easier, so that they can be turned into commodities that can be easily applied to different domains. Most of the research emphasis so far has been on individual algorithms for specific machine-learning tasks.
In contrast, the Hazy project (http://hazy.cs.wisc.edu) takes a systems approach with the hypothesis: The next breakthrough in data analysis may not be in individual algorithms, but in the ability to rapidly combine, deploy, and maintain existing algorithms. Toward that goal, Hazy’s research has focused on identifying and validating two broad categories of “common patterns” (also known as abstractions) in building trained systems: programming abstractions and infrastructure abstractions. Identifying, optimizing, and supporting such abstractions as primitives could make trained systems substantially easier to build. This can bring us a step closer to unleashing the full potential of big-data analytics in various domains.
Programming abstractions. To ensure that a trained-system platform is accessible to many developers, the programming interface must be small and composable to enhance productivity and enable developers to try many algorithms; the ability to integrate diverse data resources and formats requires the data model of the programming interface to be versatile. A combination of the relational data model and a probabilistic rule-based language such as Markov logic satisfies these criteria. Using this combination, we have developed several knowledge-based construction systems (namely, DeepDive, GeoDeepDive, and AncientText). Furthermore, our (open source) software stack has been downloaded thousands of times and used by different communities such as natural language processing, chemistry, and biostatistics.
Infrastructure abstractions. To build a trained-system platform that can accommodate many different algorithms and that scales to large volumes of data, it is crucial to find the invariants in applying individual algorithms, to have a clean interface between algorithms and systems, and to have a scalable data-management and memory-management subsystem. Using these principles, we developed a prototype system called Bismarck, which leverages the observation that many statistical-analysis algorithms behave as a user-defined aggregate in an RDBMS. The Bismarck approach to data analysis is resonated by commercial systems providers such as Oracle and EMC Greenplum. In addition, such infrastructure-level abstractions allow us to explore generic techniques for improving the scalability and efficiency of many algorithms.
Extraido de Arun Kumar, Feng Niu, Christopher Ré, Communications of the ACM, Vol. 56 No. 3, Pages 40-49
Artículo completo en http://cacm.acm.org/magazines/2013/3/161202-hazy-making-it-easier-to-build-and-maintain-big-data-analytics/fulltext#F1