Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
NoSQL is a Premature Optimization (smoothspan.wordpress.com)
71 points by timf on July 23, 2011 | hide | past | favorite | 58 comments


This is a great post, however I think there is one caveat... NoSQL is a premature optimization for relational data. NoSQL isn't necessarily an optimization if your dataset is one that isn't relational, where you gain flexibility and get the scaling for free.

However, the truth is that most business datasets are relational, and because of that relational databases do an excellent job of storing and accessing that data in an efficient way. I completely agree that people who try to force what would be a simple relational database into a complicated NoSQL system to gain scaling in the future are making a huge mistake.


NoSQL is a premature optimization for relational data

I don't know about that either. Graph DBs have been around a while and are ideal for relational data (you don't have to mess with tables and joins because everything is explicitly joined), but Web developers are just starting to realize their power.

If I am developing a website that uses the Facebook Graph API (http://developers.facebook.com/docs/reference/api/) or the Twitter API to store my users' friends and followers (and many modern auth systems are doing this), a graph database is ideal because you can use a graph query language like Gremlin (https://github.com/tinkerpop/gremlin/wiki) to do all sorts of cool stuff, like find a user's friends of friends:

  g.v(user_id).outE('friend').inV.outE('friend').inV
    
This says for the graph database "g":

1. Start at the vertex for the user_id

2. Follow the outgoing edges labeled "friend" to the incoming vertices (these incoming vertices are the user's friends)

3. From the user's friends, follow the outgoing edges to the incoming vertices one more time to get their friends.

This is so much simpler than messing with joins in a relational database, and it's much more performant on large data sets.


Funny thing though, a well designed relational database will execute your graph queries faster. It won't do generic graph queries faster, but if your sql data model and code are optimized for the type of graph data you are storing, sql will almost always be significantly faster, orders of magnitude in some cases. Obviously not a data model argument, but a real world argument nevertheless. If you then need to scale further you will probably end up putting up the same caching layer between your database and the front end, but at least all your data will be in once place (the sql database).


a well designed relational database will execute your graph queries faster

Why do you say that? -- see "MySQL vs. Neo4j on a Large-Scale Graph Traversal" (http://markorodriguez.com/2011/02/18/mysql-vs-neo4j-on-a-lar...).

You can represent a graph in almost any data structure, including a relational database. But the difference between a graph database and everything else is that in a real graph database (like Neo4j), each node has an internal/local index for its adjacent nodes so it doesn't have do an external look up for each traversal step.

Watch this video on "The Graph Traversal Programming Pattern" to see what I'm talking about (http://vimeo.com/13213184).


From the article:

"However, no attempts have been made to optimize the Java VM, the SQL queries, etc"

Emphasis being on optimizing the sql. We have run tests comparing neo4j and postgres, and postgres comes out with greater throughput for our data set, where our database implementation was done people who know postgres extremely well. Where you will see especially great differences is aggregate queries, such as if you want to count the number of a certain type of connections coming into a set of nodes, and then sort these nodes by that number. A sql database is much better at stuff like that.


What were you using to query the graph?

Gremlin has significantly improved what you can do with graph aggregating and sorting:

  // count incoming friends for each node and sort by most friends
  m = [:] 
  g.V.inE('friend').outV.groupCount(m)   
  m.sort{}


This is so much simpler than messing with joins in a relational database

I disagree. I have no idea what you just said and I'm sure I could figure out the join faster than trying to figure it out.


In graph speak, vertices are nodes (or objects) in the graph and edges are the links between nodes -- it's really quite easy once you become familiar with it.

This script will return a list of all the user's friends:

   g.v(user_id).outE('friend').inV
All you're doing is starting with the user node v(user_id) and following the outgoing links that are labeled "friend" to the adjacent nodes (the "incoming vertices").

In a graph, each "edge" (each link) has an outgoing vertex and an incoming vertex. The outgoing vertex is where the edge starts from, and the incoming vertex is what the edge points to:

  outgoing_vertex -- edge("friend") --> incoming_vertex
Here's a more concrete example...

  james -- "friend" --> julie
In this example, "james" is the outgoing vertex, "julie" is the incoming vertex, and they are linked together by an edge labeled "friend".

Again "vertex" means the same thing as node so every user in the graph is a node (a vertex).


Thanks for the explanation.

it's really quite easy once you become familiar with it

This was really my point. Things are always simpler when you're familiar with it and I happen to be very familiar with SQL. I just don't think ease should be a highlighted feature of NoSQL.

Don't get me wrong though. I use a few NoSQL systems in my company. But I didn't pick them because they were easy to understand.


This is not a very useful statement, because there is no clear-cut border between relational and non relational data. You can map everything to a relational schema and to most nosql models.


People seem to forget the MS in RDMS.

It's true that anytime you write a function that maps some set to the set of logical values (true, false, null, maybe, ...) you're dealing with relational data. So while there is no clear cut border, at some point it becomes obvious that you would benefit from a Relational Data Management System. The danger is that your particular data set might out grow a particular management system.

When you switch to a "NoSql" solution all you're doing is deciding that you need a more purpose built managment system for your relational data.


It's not just the data, it's also what you do with it. I think a typical social network is a reasonably good fit to the relational model. But this is mostly because any given user cares only about his friends, and at most about the friends-of-friends. If you try to do something different, like computing eigenvector centrality of clusters of people, suddenly an RDBMS is inadequate and you want a graph database or something simpler like mapreduce.


> where you gain flexibility and get the scaling for free.

Is it really for free? While I tend to agree with your point, one of the wrinkles here is that the older SQL systems have been beaten to death in production. We've had all sorts of 'little problems' with some of these newer systems. And this is not just about runtime issues, the 'hardened product' benefits reveal themselves in more secondary things like education, troubleshooting, monitoring, add-on products, and backup.


>older SQL systems have been beaten to death in production

But they haven't. SQL Multi-master replication systems with no single points of failure really aren't that common, and that's what everyone really wants. Everything else imposes huge production costs.


I did not say what is offered is equivalent and no that is not "what everyone really wants".


What dataset isn't relational? Seriously. Even a basic chat software, you think Redis would be great for that. But what about when you want to relate users to messages?

Actually, I've never worked on any project where our data set only needed simple key value mapping and no relationships at all.


Full Disclosure: I work for Clustrix, though I don't speak for them in any official or unofficial way. These are my personal opinions.

I think the author of the blog completely missed the point of what Michael Stonebreaker (VoltDB), Jim Starkey (NimbusDB), and Paul Mikesell (Clustrix) said. All three companies make scalable, relational, ACID compliant, SQL databases. None of them are advocating using NoSQL.

Not to put words in their mouths, but they are saying you need what MySQL gives you. You should not have to move the relational and consistency logic to your application, that's the job of the database. The problem with MySQL is not that it is SQL, it is that MySQL doesn't scale. In the case of Clustrix their database is MySQL compatible, it uses the same library interfaces (e.g., MySQLdb in Python), the same command-line tools, and it can participate in a MySQL replication setup as master or slave. The difference is it actually scales, so you don't have to shard.

With MySQL, once you reach the performance limit of using massive hardware, lots of read slaves, and memcache, the only thing left is sharding. With sharding you are breaking up your database into lots of databases running on different machines. With separate databases you lose a lot of the relational ability, or you move it into the application. What the "NewSQL" databases are offering is the ability to keep a single database, and when you want to scale performance or capacity you just add another node to the DB cluster.

To those that say NoSQL is easier and more flexible than SQL, that entirely depends on your data. You should pick the database that fits your data, which might be multiple databases for different datasets within your organization. Facebook uses a lot of NoSQL, but they also use a lot of MySQL. They have obviously decided that some of their data works best in NoSQL, and some of it works best in a RDBMS no matter how hard it is to scale.

So build your app using whatever works best for you. If that's CouchDB and Redis, cool. If it's MySQL, go for it. Once you've proven that there is demand for your app, and it takes off, then deal with scaling.


I think MySQL NDB is worth mentioninig there.


Excellent post. This seriously needs to be reiterated to everyone who needs to choose a database.

Sharded MySQL means you're tying your architecture together with boogers and spit. But the problem isn't the relational model itself, and there's no good reason that model shouldn't be able to scale.


Wrote this comment on the OP (waiting for moderation):

NoSQL is not just about scale (although it can be). In my opinion the main benefit of giving up relational constraints is flexibility. Find me a relational system that can credibly do offline replication to mobile devices. Each of the options I’ve seen are full of gotchas. But if you replace the relational model with an MVCC document model, you can suddenly solve an entirely new class of problems.

Read about how Apache CouchDB is being used in rural Africa to bring collaborative data and web technologies to health clinics that don’t have reliable internet access: http://radar.oreilly.com/2011/03/couchdb-zambia-healthcare.h...

These same patterns are applicable to mobile connections in the 1st world.


I'm unsure what you're asking for, but isn't SQLlite suitable as it's used by many applications even on mobile?


He's not asking for anything. He's pointing out how a different data model allows for solving a different set of problems. The MVCC page on Wikipedia might give you a better background so that you're able to contrast the two models and determine for yourself where they are useful: http://en.wikipedia.org/wiki/Multiversion_concurrency_contro...


SQLite runs great on mobile, but it is a data island. The document model provides a foundation for synchronization and conflict detection that the relational model cannot. Specially it becomes very challenging in the relational model to sync a subset of data to each device, while allowing updates to data on any node, regardless of that node's network connection status. That is, with MVCC documents, you can push the partition tolerance of a system much further than with the relational, or even key/value models.


This seems like a complaint looking for a problem. NoSQL is a very broad term -- are you talking about Cassandra, MongoDB, CouchDB, Redis, Memcached, Tokyo Cabinet, neo4j, etc.?

Scalability is one piece of the puzzle; picking the best tool for the job is a much bigger piece. Work with what you are comfortable with, but realize that an RDBMS may not be the best way to deal with a graph. Logging is probably not best done through an RDBMS. We have tools that we can use to solve problems in a more efficient way, so we should use them.

RDBMS are not a panacea and neither are NoSQL database solutions. Use the right tool for the job.


One could argue that SQL is a premature optimization as well if all you need is a key/value store or a document store. Using SQL just in case you may need ad-hoc queries could be as harmful as using NoSQL just because you may need to scale.

Each application is different and you need to pick the appropriate tools.


And that would be a perfectly fine argument! Though I am not aware of anything significant that can be built without a relational db. Even the simplest college assignments about building an address book involve querying and referencing primary keys across tables.


You can do queries and reference other documents many NoSQL databases; look at the documentation for GAE's datastore: http://code.google.com/appengine/docs/java/datastore/queries...

Or for MongoDB: http://www.mongodb.org/display/DOCS/Advanced+Queries


That is relational though. In those cases, Redis and MongoDB must create some kind of key reference table like MySQL would.


It depends on what do you mean by relational. I mean, even the simplest key-value store has to have keys, so if you store the key of another record/document, you have a "relation."

But you can't set constraints that enforce such relations, which is the whole point of the relational model.


An address book is a fine example for a document database or key value store. You'd just need two buckets or collections in the key value store.

    user_to_contact_sets -> keyed on user's key
    contacts -> keyed on a contact id
For searching I'd use a real search engine and index the contact data as it is stored and deleted.

The beauty of using a schema-free KVS you can have a much more flexible contact document that can have embedded one-to-many objects that would require a number of extra tables in SQL for data that will only ever be associated to a single contact record.

Here's an example of a hypothetical KVS storing contact data:

    // API:
    db.store = function(bucket, key, data) { ... }
    db.fetch = function(bucket, key) { ... }
    search_engine.index = function(key, document) { ... }

    // Store a Contact record:
    var contact = {"id": "eric-moritz", // lexicographically ordered key
		   "name": "Eric Moritz",
		   "emails": [
		       {"type": "personal", "value": "eric@example.com"},
		       {"type": "work", "value": "workin@example.com"}],
		   "addresses": [
		       {"type": "home":
			"address1": "111 A ST",
			"city": "somewheresville",
			"state": "somestate",
			"postal-code": "90210"},
		       {"type": "work":
			"address1": "111 B ST",
			"city": "somewhereelseville",
			"state": "somestate",
			"postal-code": "90210"}],
		   "phone-numbers": [
		       {"type": "mobile", "value": "555-1212"},
		       {"type": "work", "value": "555-2323"}]
		  }
    // Fetch the user's contact set
    var contact_set = db.fetch("contact_sets", user_id);

    // If this contact is not already indexed, add it and store the set
    if(contact_set.indexOf(contact['id']) < 0) {
	contact_set.push(contact['id']);
	db.store("contact_sets", user_id, contact_set.sort());
    }

    // Store the contact into the contacts bucket
    var contact_key = user_id + ":" + contact['id'];
    db.store("contacts", contact_key, contact);

    // Index the document into the search engine
    search_engine.index(contact_key, contact)


That would be a fine argument, except that you almost always need more than a key/value store.

Even a basic chat software, you'll want to relate users to messages. Even session management.


I call bullshit.

Basically, Nosql is not just about scale but flexibility. Give me a schemaless engine anyday


Wrong from point 1. I use MongoDB and Redis because first is easier to prototype with ad hoc and second offers unique features one has to remap manually if SQL is used. In short, NoSQL requires less investment than SQL for my use cases.


Yea to a database newb MongoDB and even Redis look simpler to use than MySQL. I was judging from the basic setup, syntax, boilerplate, tutorials, etc. Mongo's syntax requires the smallest learning curve, it's the dot notation you're used to. But that's was a total n00b's perspective and I imagine people that don't dive too deep in MySQL feel similarly. Generally programmers are superficial and look for the easiest syntax.

For prototyping purposes I'd like to see a comparison of building a twitter clone in MySQL, MongoDB and there's already one in Redis http://redis.io/topics/twitter-clone MySQL will require the most boilerplate here, setting up schemas etc.


Definitely agree. Getting Rails up and running on MongoDB is a lot faster than with ActiveRecord on top of some SQL. Especially when your data format is changing quickly, it's nice to just make a few changes in the model rather than having to track migrations.


You realize that this has nothing to do with sql/nosql and everything to do with your chosen framework, right? If, you used, say, a framework that autogenerated itself from your schema at build time, then changing the database would be as simple as changing the database, which is trivial.

I think a lot of our perceptions about technologies are influenced by the things that sit between us and those technologies. A lot of the talk about this particular issue has nothing to do with Relational databases or key/value stores.


Either I didn't get your point or you don't know what what Rails migrations are. Migrations are just a way to "change the database" in a repeatable way.


Right, but the grandparent's complaint was that you had to change your model, then write migrations to match.

In other frameworks, you use your tool of choice to change the database schema, then build the project and you're done. All the "model" stuff regenerates automatically. Thus, schema/model changes have no pain associated with them whatsoever, so there's no particular advantage to using nosql vs sql.


With Rails' ActiveRecord the model introspects the database schema at run time and figures out field changes automatically. What it doesn't figure out is relations and you still have to specify them in both the database schema and the model. Since you're mentioning project build, I suppose you're referring to Java or C# frameworks. Are those frameworks able to generate code for relations as well? (I suppose this requires following naming conventions or setting referential constraints)


Yeah, conventions cover pretty much every case you tend to come across. Got a column called CompanyID that has a foreign key constraint over to the Company table? Perhaps this is a relation... Nice of Rails to make it acceptable to simply ask you to name your stuff in a consistent way so that things just work.

The added bonus of a build step is that you don't need to keep regenerating the same classes and SQL code every page load, but it's not really any different from the programmer's standpoint.


For a domain model SQL DDL is not as declarative as ActiveRecord.


Obviously, it also depends on exactly what kind of application you are building.

If you primarily want to persist objects and indices that don't have good or easy analogue in the SQL world, a NoSQL solution is good solution because you don't have to bother with the unneeded SQL overhead.

If you've doing something that maps reasonably well into the relational model, well, that's what you should use. Rational technology has been around a long time and has been scaled to huge levels. And it seems fairly clear a social network maps pretty well to a relational model.


I've been trying to tell my colleagues and higher-ups this for the last few weeks as I've planned out a new platform that "needs to scale." They've heard that some of our business partners are using MongoDB and want to know why I'm using Postgres.

The truth is that our traffic could wildly exceed our most hopeful projections and we still wouldn't nearly be in the position of orgs like Twitter, Facebook, or Netflix, stretching the limits of what an RDBMS can do without becoming a nightmare to maintain.


Sounds like my experience applying to companies who list every NoSQL option under the roof in the job description and yet looking at what they do, I can't figure out why they'd need all that.


Funny because MongoDB was not built for scalability


In theory you should use the right tool for the job and there shouldn't even be a discussion.

However in reality not all DB's are created equally. It comes down to this: do you want to spend a known amount of time hammering in a screw, or do you want to spend an unknown amount of time optimizing your state of the art, but unproven screwdriver.

This is important for people starting a business. You don't start a business to test out new technology, that's what weekend projects are for. They need to be focusing on the product and not spending time 'just getting it to work'.

Mysql/MC aren't sexy, but it takes a trivial amount of time to set up and tune them.

There are lots of caveats to this; the entire argument can be nit-picked to death. But I think it holds for the generic webapp template that just needs data to persist and then be served back to users.


A lot of the confusion and furor over NoSQL vs relational vs NewSQL is caused by the idea that these technologies demand any sort of mutual exclusion. At my shop we use Redis and MySQL and SQLite extensively and - to the best of my knowledge - they have never gotten into scuffles with each other, holy wars of relationality. In fact, this seems to be a proclivity unique to their users.

That being said, I think it is - wait for it - premature to call NoSQL an unwarranted early "optimization." For our real-time analytics app, we use Redis because we are just storing key (page name) value (hits). With millions of writes a day for relatively simple data, it is nice - and in no way premature - to run Redis on a medium-sized instance. Data retrieval is fast and after a year and a half we haven't had any problems.

Other apps, use MySQL exclusively. For our many-many forests of tag, entry, source, author, etc etc relational data, we rely on the easy oversight of MySQL and the rich ORMs that have gracefully evolved to cut through the jungle.

What's more, there are many applications that use both Redis and MySQL. My Silver gem https://github.com/tpm/silver wraps MySQL requests in a Redis caching layer to speed up queries. This is not terribly novel or that different than, say, memcached but it does what we need it to.

The point is that just like apps are as multiform as the imagination is competent, so should be the tools and strategies we use to create them and solve their problems. Yes, people will use NoSQL incorrectly. People use MySQL incorrectly and inefficiently more than not. Hell, you can use YAML in a bad way. Those aren't the cases we should focus on. We should focus on all the systems that do work.

I'd like to see, for every mind-thinky post about whether or not MySQL is a death trap or NoSQL is a nu-wave panacea, a post about how some company is making the technologies that they have chosen work for them. At the end of the day, there is no uniform march toward optimal global technology efficiency. Just the small victories of a startup that needs to record cat GPS movements, a newspaper that needs to analyze a million pages of leaks, a broker who needs to pour through SEC dumps. NoSQL, NewSQL, SQL, HithertoUnknownUnPostSQL: I say let em all in. Someone will find them useful.


What about lock-in though? What if you decide you do want some relationship or way of filtering the keys you're storing with Redis. After you already have it in Redis as keys/values.

What then?


Hopefully you abstracted as much of the database code as possible, so you can switch to (for example) MySQL without serious pain. This only gets tricky if you're using some of the more fancy features of Redis, like sorted sets, which are a bit more clunky to do in SQL.


Can someone explain to me what NoSQL equivalents are for concepts like primary key and general data consistency?

My other question is that I've read several places that NoSQL makes most sense for full-text documents. And yet I see company after company with little apparent full-text data boasting about their NoSQL use. I am especially aware of this as someone in the job market seeing terms like Hadoop and MongoDB just thrown around in the job description. How? Why?


There isn't a single model, but an eventually consistency is the most common: http://www.allthingsdistributed.com/2008/12/eventually_consi...


Isn't a key-value store (NoSQL) simpler to use than a relational database? (I haven't used one so I don't know for sure).

Hierarchical and network stores were popular before RDB was invented; although RDB was slower, the big advantage was that you could transform the data into whatever form you need (instead of having to use the specific hierarchy used to store it). At least, that was the key benefit according to my reading of Codd's 1970 paper.


Yeah, I can see this argument going either way. If you're avoiding SQL for NoSQL for scalability reasons, that might be premature optimization, especially if your data is pretty easily relational (and e.g. you have something approximating a schema).

But if you just need to persist a data model, don't care too much about the schema, and come up with the solution of, "I'll just serialize my hash table to a CSV flat file", that's just taking the ultra-simple solution, not really an optimization. If anything, pulling out the big iron of SQL when all you need to do is serialize a hashtable would be the premature optimization (or at least, premature architecting).


Yes, it's a much simpler solution. But it's an "optimization" in the sense that one has to work extra hard to avoid the ubiquity & tooling of SQL. Expanding effort to achieve better performance (whether true or perceived) is what is most associated with software optimization.


Awful post. His example for why NOSQL is "difficult" to start with, is a company with a huge RDBMS trying to transition. Had they started with NOSQL, it would have been easy to architect around that from the beginning, rather than cramming it in later. He then goes on to argue that it's much easier to transition to NOSQL later, which is the exact opposite of his example.

NOSQL isn't one catchall group. There are different tools there for different jobs. His arguments rest on the assumption that NOSQL is hard(er). If it's not hard, and it solves a problem, then it's not premature optimization, it's just optimization.

There are certainly situations where using a NOSQL database to start with gives you a competitive advantage and makes things easier down the road. This same argument is made every time a new technology comes onto the scene. People who made a living using something else will argue that learning the new thing is too hard, and isn't worth the time. If he doesn't want to bother with learning about new tools, that's fine, but it's silly for him to argue that learning them is pointless or a waste of time.


NoSQL is not only about scale, but also about flexibility?

the flexibility and schemalessness of NoSQL is actually a consequence of the scaling possibilities - it's hard to manage horizontal partitioning and maintain consistency between multiple tables with foreign keys and rollback transactions in a distributed environment; and easy when you have only one table with two fields: key and value.

when, like in mongodb, the value is actually a tree-like document, and you need to have an index on some of the element, you gradually return to schema, because this element should be in the same place/surrounding in all indexed values.

i am saying that flexibility is not why you should choose NoSQL, but actually the scalability that you might need later.

(and non-relational data is kept ok in blob fields or in files and indexed by lucene - if you want (i don't))


I found this article fairly uninformative.

His first point is 'all NoSQL is less mature'. Second point is 'there is no advantage to NoSQL except scale' Third point is 'you'll have time to scale later, which includes moving to NoSQL'

None of these arguments is novel. Developers pick NoSQL for other reasons as well.

I didn't find a pioneer tax to setting it up. It worked extremely well for non-uniform data. I found its design easier to handle for failover.

This entire debate is old. NoSQL is a tool. His arguments add nothing to the debate.


Fantastic post. Tells the truth.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: