Entanglement: Embarrassingly-scalable graphs

by Keith Flanagan

16:00 (40 min) in CT 7.01

Graph data structures are often used for data integration tasks because of their flexibility and schemaless operation. Multiple datasets can be joined together by linking data items via graph edges. However, many existing data integration tools are tightly coupled to a visualisation toolkit and therefore do not scale beyond a single machine. Such tools also often lack the ability to query data items in an ad hoc fashion; nodes cannot be indexed arbitrarily. When building huge graphs, failures will occur. Whether the cause is hardware, software, or data-format related, we need the ability to 'unintegrate' data as well as merge it.

Entanglement is a graph storage framework that builds on existing document data stores. It focuses on features and design patterns that are necessary for large-scale, graph-based data integration tasks: