visit
There are many ideas and considerations behind graph databases. This includes their use cases, advantages, and the trends behind this database model. There are also several real-world examples to dissect.
Sometimes requirements for one analysis can be extremely tricky. For example, you might have a query for Find all posts from users and count the number of posts up to 3 levels down[1]. Such an analytic problem is not purposely designed to create obstructions for data engineers, Indeed, these things are quite common in the current era of distributed computing.
Data model schemes and traditional queries
The traditional way to solve this query need is to build a relational data model for every user and store those relations in several tables. This is commonly done in a Relational database such as MySQL.This is a basic relational schema.
If we want to implement a query as noted, it is inevitable that there will be a huge amount of JOIN and the length of query can be extremely long. [1]SELECT T.directReportees AS directReportees, sum(T.count) AS count
FROM (
SELECT manager.pid AS directReportees, 0 AS count
FROM person_reportee manager
WHERE manager.pid = (SELECT id FROM person WHERE name = "fName lName")
UNION
SELECT manager.pid AS directReportees, count(manager.directly_manages) AS count
FROM person_reportee manager
WHERE manager.pid = (SELECT id FROM person WHERE name = "fName lName")
GROUP BY directReportees
UNION
SELECT manager.pid AS directReportees, count(reportee.directly_manages) AS count
FROM person_reportee manager
JOIN person_reportee reportee
ON manager.directly_manages = reportee.pid
WHERE manager.pid = (SELECT id FROM person WHERE name = "fName lName")
GROUP BY directReportees
UNION
SELECT manager.pid AS directReportees, count(L2Reportees.directly_manages) AS count
FROM person_reportee manager
JOIN person_reportee L1Reportees
ON manager.directly_manages = L1Reportees.pid
JOIN person_reportee L2Reportees
ON L1Reportees.directly_manages = L2Reportees.pid
WHERE manager.pid = (SELECT id FROM person WHERE name = "fName lName")
GROUP BY directReportees
) AS T
GROUP BY directReportees)
UNION
(SELECT T.directReportees AS directReportees, sum(T.count) AS count
FROM (
SELECT manager.directly_manages AS directReportees, 0 AS count
FROM person_reportee manager
WHERE manager.pid = (SELECT id FROM person WHERE name = "fName lName")
UNION
SELECT reportee.pid AS directReportees, count(reportee.directly_manages) AS count
FROM person_reportee manager
JOIN person_reportee reportee
ON manager.directly_manages = reportee.pid
WHERE manager.pid = (SELECT id FROM person WHERE name = "fName lName")
GROUP BY directReportees
UNION
SELECT depth1Reportees.pid AS directReportees,
count(depth2Reportees.directly_manages) AS count
FROM person_reportee manager
JOIN person_reportee L1Reportees
ON manager.directly_manages = L1Reportees.pid
JOIN person_reportee L2Reportees
ON L1Reportees.directly_manages = L2Reportees.pid
WHERE manager.pid = (SELECT id FROM person WHERE name = "fName lName")
GROUP BY directReportees
) AS T
GROUP BY directReportees)
UNION
(SELECT T.directReportees AS directReportees, sum(T.count) AS count
FROM(
SELECT reportee.directly_manages AS directReportees, 0 AS count
FROM person_reportee manager
JOIN person_reportee reportee
ON manager.directly_manages = reportee.pid
WHERE manager.pid = (SELECT id FROM person WHERE name = "fName lName")
GROUP BY directReportees
UNION
SELECT L2Reportees.pid AS directReportees, count(L2Reportees.directly_manages) AS
count
FROM person_reportee manager
JOIN person_reportee L1Reportees
ON manager.directly_manages = L1Reportees.pid
JOIN person_reportee L2Reportees
ON L1Reportees.directly_manages = L2Reportees.pid
WHERE manager.pid = (SELECT id FROM person WHERE name = "fName lName")
GROUP BY directReportees
) AS T
GROUP BY directReportees)
UNION
(SELECT L2Reportees.directly_manages AS directReportees, 0 AS count
FROM person_reportee manager
JOIN person_reportee L1Reportees
ON manager.directly_manages = L1Reportees.pid
JOIN person_reportee L2Reportees
ON L1Reportees.directly_manages = L2Reportees.pid
WHERE manager.pid = (SELECT id FROM person WHERE name = "fName lName")
)
Performance issues in traditional RDBMS
The most essential issue relating to such a query comes from the size of the problem. In our era there are scaling issues related to the nature of huge social network associations.The relation between users to users, users to items or items to items is exponential. Here are some basic facts for just a few user relationships. On Twitter, with an A following B relationship there are 500 million users. On Amazon, with an A buying B relationship, there are 120 million users. For AT&T’s cell phone network there are 100 million who-call-whom relationship users. [2]To resolve such queries, one can use a graph database instead of a RDBMS. The open source graph benchmarking has billions of nodes and edges listed below with O(TB) or even O(PB) amount of data: [3]Performance fine-tuning with indexing or caching and why it is not a best practice.
Indexing is used to help a SQL engine find and acquire data efficiently. Using indexing such as B-tree indexing, or hash indexing is one way to solve performance costs in a table join operation. For example, we can create a unique ID for every person and B-tree, which is a balanced tree, will sort all items by their unique ID. This is suitable for a range query and the performance cost to find a certain item can be limited to O (logN) where N is the number of indexed items.But indexing is not a panacea for all situations. If items update frequently or have many repeated items, the indexing will attempt to perform a huge number of space overhead due to indexing.Plus, disk IO in a large HDD disk will be critical for JOIN operations. This is even if one seek operation only costs a few milliseconds, and one B-tree indexing just needs 4 seek operations to traverse the whole indexing table. Why? Well, once the JOIN operations become larger, seek operations need to happen hundreds of times.Caching is designed to use the essence of the spatial locality of a dataset for a read intensive analytic scenario. There is a widely used architecture named lookaside cache architecture. Here is a simple demo with memcached and MySQL which was used in Facebook before the move from an RDBMS to a graph database: [5]Lacking relationships is the key failure of other kinds of database modeling [4]
Relational database works well if data is in codify paper forms and tabular structures. However, if we want to model the ad hoc relationships that will crop up in the real world, the RDBMS is poor at handling this.As mentioned, the relational data model becomes a burden with a very large number of join tables, with sparsely populated rows where there are billions of different tables across databases. With the growth of such a dataset and the increased number of JOIN operations needed, the performance of existing RDBMS are no longer a fit for these business requirements.Dealing with recursive queries such as the manager’s and employee’s report number will perform badly due to the computational and space complexity of recursively joining tables.Nodes, associations and graph modeling
So, it is true that traditional databases have implicit graph connections between different schemas. But as we analyze the semantic dependencies, such as A controls B, A buys B, as the data model defined, we need to check on the data table. We will see we are blind to these connections. If we want to make the connections visible before a check on the node, we need to define objects and their connections separately.The key difference between graph databases and other databases is to represent nodes and paths separately. For example, we can represent people, or managers, as separate nodes. Associations capture the users’ friendships, belongings, authorship of the checkin and comments and the binding between the checkin and its location and comments.We can add new nodes and new associations in a flexible way without compromising the existing network or migrating data (original data remains stateless). Based on this modeling technique, we can model the original social network problem in this way:-- Insert People
INSERT VERTEX person(ID, name) VALUES 1:(2020031601, ‘Jeff’);
INSERT VERTEX person(ID, name) VALUES 2:(2020031602, ‘A’);
INSERT VERTEX person(ID, name) VALUES 3:(2020031603, ‘B’);
INSERT VERTEX person(ID, name) VALUES 4:(2020031604, ‘C’);
-- Insert edge
INSERT EDGE manage (level_s, level_end) VALUES 1 -> 2: ('0', '1')
INSERT EDGE manage (level_s, level_end) VALUES 1 -> 3: ('0', '1')
INSERT EDGE manage (level_s, level_end) VALUES 1 -> 4: ('0', '1')
GO FROM 1 OVER manage YIELD manage.level_s as start_level, manage._dst AS personid
| GO FROM $personid OVER manage where manage.level_s < start_level + 3
YIELD SUM($$.person.id) AS TOTAL, $$.person.name AS list
MATCH (boss)-[:MANAGES*0..3]->(sub),
(sub)-[:MANAGES*1..3]->(personid)
WHERE boss.name = “Jeff”
RETURN sub.name AS list, count(personid) AS Total
Neo4j [4]
Neo4j is one of the best-known graph database, and widely adopted by well-known companies, such as by eBay, Microsoft and so on.Native graph processing
A graph database has native processing capabilities if it exhibits a property called . A database engine that utilizes index-free adjacency is one in which each node maintains direct reference to its adjacent nodes and each node will therefore act as a micro-index for other nearby nodes. This is much less taxing than using global indexes. It means that query times are independent of the total size of the graph and are instead simply proportional to the amount of the graph searched.In simpler terms, index-lookups could be O(logN) in algorithmic complexity versus O(1) for lookup immediately from key-value relationship and traverse m steps. If we use an index approach, it costs O(mlogn) which only costs O(m) with an index-free adjacency solution.As mentioned, relational databases perform poorly when we move away from modestly sized JOIN operations which mainly results from the number of index-lookups involved. In contrast to relational databases, where join-intensive query performance deteriorates as the dataset gets bigger, with a graph database performance tends to remain relatively constant, even as the dataset grows. This is because queries are localized to a portion of the graph.In-memory cache
For Neo4j 2.2, it uses an . The page cache is an LRU-K page-affined cache, meaning the cache divides each store into discrete regions, and then holds a fixed number of regions per store file.Pages are evicted from the cache based on a least frequently used (LFU) cache policy, nuanced by page popularity. That is, unpopular pages will be evicted from the cache in preference to popular pages, even if the latter have not been touched recently. This policy ensures a statistically optimal use of caching resources.Janus Graph [7]
itself does not focus on storage and analytics functionalities. The main purpose of this frame is to serve as a graph database engine which focuses on compact graph serialization, graph data modeling and efficient query execution.Supporting robust modular interfaces for data persistence, data indexing and client access is the main purpose of this frame. It can be a useful solution for companies that just want to use graph modeling as their architecture solutions.JanusGraph can use Cassandra, HBase and Berkeley DB as its storage adapter and use Elasticsearch, Solr or lucene for data indexing.Broadly speaking, applications can interact with JanusGraph in two ways:Nebula Graph [8]
Here is listed several bird’s eye views on the system design of , which is specially optimized for graph modeled data.Key value store graph processing
Nebula Graph adopted
vertexID + TagID
as keys for local storage and it stores out-key and in-key separately in different locations. This supports O(k) look up and partitioning ensures high availability in a large distributed cluster.In contrast to other storage designs, Nebula Graph natively supports distributed partitioning or sharding. This greatly increases the processing speed and fault tolerant capabilities.Shared-nothing distributed storage layer
To help migrating data and provide a modular level storage engine it uses its own kv store library. Thus it supports storage service for a third party kv store such as HBase and so on.Nebula Graph manages the distributed kv store in a unified scheduling way with meta service. All the partition distribution data and current machine status can be found in the meta service. Users can add or remove machines from a console and execute a balance plan. Plus, Multi-cluster Raft group with atomic CAS, and read-modify-write operations are fully used by raft serialization which enables immediate consistency.Stateless query layer
In the query layer, nGQL will be paired to Abstract Syntax Tree and be converted to LLVM IR. The IR code can be passed to an execution planner for edge-level parallel execution and Just In Time execution. The complied query will be stored and if a user adopted the same query again, it will reuse the cached commands with no need for further parsing.The Graph Database is the Proven Trend
Graph databases have already attracted the attention of many analysts and consulting companies:Graph analysis is possibly the single most effective competitive differentiator for organizations pursuing data-driven operations and decisions after the design of data capture. — Gartner
“Graph analysis is the true killer app for Big Data.” — ForresterThe current trend of graph database is the highest based on : [9]
Netflix Cloud Database engineering [10]:
Adobe [11]
There is a prejudice with new technologies that they may not suit large companies that may have many legacy systems. That it does not make sense to take the risk to change from an old stable system to a new and possibly unstable technology.But Adobe did this with some transfer from an old NoSQL database to Neo4j.The overhauling system’s name is Behance, which is a leading social media platform owned by Adobe and launched in 2005. It has more than 10 million members.Here people can share their creative works with millions of daily visitors.Previously published at