Entanglement: Embarrassingly-scalable graphs
by Keith Flanagan
16:00 (40 min) in CT 7.01
Graph data structures are often used for data integration tasks due to their flexibility and schemaless operation. Multiple datasets can be joined together by linking data items via graph edges. However, many existing data integration tools are tightly coupled to a visualisation toolkit and therefore do not scale beyond a single machine. Such tools also often lack the ability to query data items in an ad-hoc fashion, because nodes cannot be indexed arbitrarily. When building huge graphs, failures will occur. Whether the cause lies in hardware, software, or data formats, we need the ability to 'unintegrate' as well as merge.
Entanglement is a graph storage framework that builds on existing document data stores. It focuses on features and design patterns that are necessary for large-scale, graph-based data integration tasks:
- scalability - billions of indexed graph entities partitioned over a cluster of machines,
- provenance and version control - every change to every node/edge is incremental and tracked over time; errors and failures can be backed out without impacting the rest of the graph,
- integration - datasets from different sources are placed in their own independent graphs; multiple user-defined views can then be composed by selecting which data sources to integrate,
- collaborative environment - multiple users or automated agents can share the same graph-browsing session through the use of distributed graph cursors and via command-driven Internet Relay Chat (IRC) sessions.
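
The provenance and integration points above can be illustrated with a minimal in-memory sketch. It is not Entanglement's API: the names `Revision`, `SourceGraph`, `compose_view`, `source_a`, `source_b` and the transaction ids are all hypothetical, edges are left out for brevity, and a real deployment would keep the revision log in a document store rather than a Python list. The idea shown is only that each source keeps its own append-only log of incremental changes, that a failed import can be backed out by its transaction id, and that a view is composed by choosing which sources to merge.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Set


@dataclass
class Revision:
    """One incremental change to a node, tagged with the import run that made it."""
    tx_id: str
    node_id: str
    props: Dict[str, Any]


@dataclass
class SourceGraph:
    """Hypothetical per-source graph, stored as an append-only revision log."""
    name: str
    log: List[Revision] = field(default_factory=list)
    backed_out: Set[str] = field(default_factory=set)

    def update(self, tx_id: str, node_id: str, **props: Any) -> None:
        # Changes are appended, never applied in place, so history is retained.
        self.log.append(Revision(tx_id, node_id, dict(props)))

    def back_out(self, tx_id: str) -> None:
        # 'Unintegrate': mark a transaction as void without deleting its log entries.
        self.backed_out.add(tx_id)

    def nodes(self) -> Dict[str, Dict[str, Any]]:
        # Replay the log in order, skipping revisions from backed-out transactions.
        state: Dict[str, Dict[str, Any]] = {}
        for rev in self.log:
            if rev.tx_id in self.backed_out:
                continue
            state.setdefault(rev.node_id, {}).update(rev.props)
        return state


def compose_view(graphs: Iterable[SourceGraph]) -> Dict[str, Dict[str, Any]]:
    """Merge the current state of the selected sources; later sources win on conflicts."""
    view: Dict[str, Dict[str, Any]] = {}
    for graph in graphs:
        for node_id, props in graph.nodes().items():
            view.setdefault(node_id, {}).update(props)
    return view


# Two independent sources describing the same node id.
source_a = SourceGraph("source_a")
source_a.update("import-1", "n1", label="gene")
source_a.update("import-2", "n1", label="corrupted")  # a bad import run

source_b = SourceGraph("source_b")
source_b.update("import-3", "n1", pathway="p1")

# Back out only the failed import; source_b and import-1 are unaffected.
source_a.back_out("import-2")

view = compose_view([source_a, source_b])
```

Because `back_out` only records the transaction id, the bad revision stays in the log for auditing, and composing a view from `[source_a]` alone simply leaves `source_b` out rather than requiring any deletion.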