Using Postgresql for Nosql things

(Andy Wootton) #1

My programming background was in file systems, not databases. I’ve done a bit of relational theory and SQL practice but I’m far from being a database expert. When I decided to start learning to write in a modern coding language I faced a huge set of technical hills that I needed to climb. I picked Python to avoid needing ‘to web’ straight away and when I discovered NoSQL databases, they sounded like a great way to dismiss the mountain of mapping objects onto the relational model. This morning I started to have doubts.

I see a database being in at least 2 levels, the database engine and the query language. This seems like having different shells atop a Unix/Linux kernel. Why hadn’t the NoSQL crowd built a new NoSQL front-end onto an existing DB rather than starting from scratch? Was there a fundamental mismatch at all levels? Did data have to be split between 2 different DBs? A wise man once told me that within an organisation, the word “database” had no plural. My personal experience is definitely that everything is connected, because human organisations are networks not trees.

I just found this:

Does anyone who is using NoSQL tech think this article is wrong? One of the comments says that Postgresql won’t scale to 30,000 machines. I’m not sure that I need to do that right now.

I’ll chuck this in for good measure

(Stuart Langridge) #2

'Cos the prevailing wisdom was that that was unnecessarily hard, and they have a point. Being a DBA for Postgres is hard; much more so than just trivially spinning up a MongoDB or CouchDB. And CouchDB can be sharded as many ways as you like just with an HTTP load balancer. On the other hand, there are good reasons to use Postgres; it’s got nearly two decades of reliable support behind it, and since what most people do with JSON document stores is “store a document; retrieve it by ID; retrieve documents which match this JSON query”, Postgres now implements that pretty well in the most recent versions. So I’m leaning towards using Postgres JSON in projects now.

(Richard Cunningham) #3

I think NoSQL is really NoMySQL, since many of the problems are particular to MySQL which was widely used by startups.

Initially sites like Facebook and Twitter had problems scaling writes on MySQL systems, which were not really scalable beyond a single machine (I heard around 2008/2009 Twitter had a single, replicated, database server with 64GB of RAM). The MySQL design seems heavily optimised for a mostly read database (e.g. a blog or CMS), rarely updated, which suited the web in the early days before everything got “social”. Social sites ended up not being able to use relational features, for example FriendFeed (acquired by Facebook) talked about this in the early days.

Smaller scale sites have looked to document databases, because it’s easier to work with and makes more sense than using a ORM. MongoDB seems to be most popular, which actually has many of the features of relational database, but it’s had number of issues with integrity and scaling. It’s for this reason that people have looked back SQL, but this time Postgres and the relatively new JSONP feature (not available when MongoDB started).

I’ve used MongoDB in the past, but I’ll probably use Postgres as document store for future projects. Personally, I think SQL databases are still a liability since queries you think might be O(1) end up being O(n) or O(n2) which you only find out when it’s difficult to change.

(Andy Wootton) #4

Thanks guys. I think I remember a bit in the middle too, where object databases started to make relational DBs add object features but they didn’t catch on for mainstream use. Interesting then that the second link says NoSQL databases still have an impedance mismatch with objects. Did object DBs arrive too early?

In a world where algorithms + data structures = programmes, hierarchies, networks then relations were used to model data structures.
Now objects are networks of close-coupled procedures + data, it feels like databases should be networks of objects too but we’ve gone to XML and JSON trees. Are we lost? Can we normalise objects? I have no idea what that question means but maybe someone will have.

(Richard Cunningham) #5

The relational model still has some merit if you want fully normalised data, however most NoSQL databases have traded this off for performance or scalability.

I remember looking at XML databases around 2002, which never seem to catch on. Also, there is LDAP and SNMP which are both also tree based. Relational is good for some things and tree based (JSON/XML/LDAP/SNMP) is suited to other things.

I’ve found one document per user works quite well (lists of 1000 or less, say), where I don’t need to query multiple documents - I think this is a better approach that having 1000 rows per user in a table and always querying with user=649 (or whatever)

(Jon) #6

I have to say I wonder whether some of the buzz around NoSQL is misplaced - though I say that as someone who has only dipped a little toe into the water, and so perhaps there is a “eureka” moment I need to acquire!

The relational model gives us safety - null constraints, foreign key constraints, check constraints, types, and the flexibility of triggers, schedulers, stored procedures, full text search - lots of things. The idea, at least originally, was that it was hard to put bad data in your database, and if you got the normalisation right, one item of data could not ‘disagree’ with another.

However all of the checks came with a speed penalty, but these days - as long as you have a good quality set of automated tests - perhaps you can use a NoSQL solution. The trick here, as I see it, is that you’re running your checks at build time rather than run time, and so it is faster.

Nevertheless, most people don’t actually need the speed increase - it is usually cheaper to set up a leader-follower pair using the existing relational database, or upgrade your database VPS, rather than re-engineer in NoSQL. I wonder too whether the type safety you get in, say, Postgres, means that if you’re poking in the wrong information, you get an error quickly, rather than wondering why it doesn’t work.

I’m not writing off the technology - I mean to give MongoDB a go myself - but I think the surge of popularity means less experienced programmers are losing the valuable benefits of a existing “boring” solutions.

(Richard Cunningham) #7

Perhaps, SSDs are really the only significant thing that has changed. VPSes have very small amounts of memory and upgrading is much more incrementally expensive than on physical servers (8GB RAM for a year on Digital Ocean costs $960 where as a 8GB DIMM it’s self costs ~$60).

With conventional HDD based VPS (a while ago) I had a lot of problems with some tables of a few million rows, where each user had a few thousand rows, MySQL didn’t handle the updates to that well. Adding indexes often made it worse because it’s the updating of indexes that takes the time. Even simple schema updates, such as adding a column, would lock the table for hours. MySQL would sometimes decide query that was usually fast, needed to be done a slow way by creating a temporary table. There are solutions to these problems, but they are not easy and you end up learning a lot about MySQL to work it out, which is very time consuming.

Having something that is at it’s base a key/value store has fewer surprises as the data size grows, because it shouldn’t actually slow down at all and it’s simple to understand.

(Stuart Langridge) #8

It does, but it also insists that your data has to be relational. That is: you’ve gotta put everything into third normal form, and woe betide you if there are optional fields or optional ownerships or circular data.

Imagine, for example, that I’m making a DB of the people on this forum and the electronic equipment they own. So everyone might own one or more phones, one or more TV set top boxes, one or more computers. Modelling that in JSON is easy:

{name:"Stuart", phones:[
  {brand:"Google",model:"Nexus 4",os:"Android"},
], tvs:[{model:"PC",software:"Kodi"}, {model:"Chromecast"}], 
computers: [
  {formfactor:"desktop", model:"custom", os:"Ubuntu"},
  {formfactor:"Dell", model:"M1330", os:"Ubuntu", cost: 1000.00}

Now model that in SQL. Note that custom desktops don’t have a cost; PCs used as TVs have a software attribute. You end up with one column per attribute you want to track, half of which are null, and every time you want to track a new attribute you have to ALTER TABLE. Tracking a new type of equipment (say, smartwatches) requires adding a whole new table. This sort of fluid data is a major league pain in the arse in SQL, and what people end up doing is creating a table called things with columns property and value and storing a Thing as a whole bunch of rows… at which point you have basically made yourself a slow crap NoSQL database because you’ve lost all the benefits of nulls, foreign keys, types, etc.

SQL is very, very good at modelling normalised data. If your data’s not all that normalised, then the structured part of Structured Query Language starts to hinder more than it helps.

(Omar Qureshi) #9

I’d consider myself to be a very experienced Postgres user and whilst the JSONB data type and plv8 are pretty awesome, the one thing that is a huge pain in the ass and MongoDB gets right is durability (there is plenty it gets wrong though)

There are various tools that EnterpriseDB and 2ndQuadrant offer you to make it a lot less painful, but you need to be aware this isn’t really baked into PG proper.

If you are going to go down this route (and you should!), I’d suggest using RDS to get started.

If you need help, just ping me and I’ll see what I can do

(Andy Wootton) #10

@rythie Did you mean “fully normalised”? If not why would you denormalise? I’ve known people claim to do it for efficiency but because of my lack of practical experience (other than system managing Oracle, which is EASY, except when Oracle lie. As @sil says, it’s DBA that’s the black art.) I was never sure how much denormalisation helped as they were people whose technical judgement I didn’t entirely trust. They may have been making excuses for poor design.

If I remember correctly, the point of the relational model is to allow flexibility of queries, based on the belief that it is always obvious what were entities and attributes. I always had doubts about that which were later confirmed by Michael (JSD) Jackson.

I’ve always found normalisation issues really difficult because I seem to automatically think about entitiies and attributes in a normalised form, when I’m able to identify the entities and I didn’t understand why other people didn’t have my difficulties but did have trouble with normalisation. Maybe I was taught that without realising that I’d learned, or perhaps my brain is wired oddly. I think now that I was intuitively identifying the difference between real world objects and those invented to solve a particular problem.

Sorry, I don’t understand your “O(1) end up being O(n) or O(n2)” jargon.

I keep saying, half in jest, that I’ve lost faith in hierarchies. I’m convinced that human societies are network shaped, with hierarchy imposed on top to limit complexity when comms was to ‘give a tanner to a village lad to deliver a letter’.

(Andy Wootton) #11

Yeah but: Is that difficulty in changing the tables a problem with the relational model or the SQL language which is based BASED ON COBOL and horrible?



just showing off that you were free to do horrible things?

That’s an excellent example of why we need an object model database though as I’d argue that computers, smart TVs and phones inherit from ‘computable device’.

I’m interested in this because I want to play with ‘concept maps’, having found mind-mapping too restrictive to represent the ideas for my writing. I’ll get back to you on why that turned out to be a bad idea.

(Richard Cunningham) #12

Yes I did mean “fully normalised” (fixed it now). To me the problem with normalised data can be slow to query. I used to have a thing where I wanted to query a table which was stored ordered by time (Innodb stores in index order), so it should be fast because I wanted it in time order. However, I’d have to join it to other tables to query by userid, which then re-ordered the data, often MySQL would make a temporary table and sort that - which was all pretty slow. The thing is the userid never changed, I found I could just store it in the row and make the query really quick. Also I could then use userid in the primary key, which made for better sorting on disk.

Another use-case is things like favourites/like counts - it doesn’t make sense to do a count on these everytime you do a SELECT, you can keep a total in the row and just be careful to update it anytime something changes.

Interested quote:
"Normalized tables are suitable for general-purpose querying. This means
any queries against these tables, including future queries whose details
cannot be anticipated, are supported. In contrast, tables that are not
normalized lend themselves to some types of queries, but not others."

Whilst I think normalisation is useful if you want users to write any query, most web apps don’t let people do that, they have fixed queries built in and the denormalisation is optimised for that use (and is therefore slower for other types of queries).

By O(1) and O(n), I mean Big O notation, and specifically how some queries get slower as databases get bigger and some do not, a request for a single row, with no joins, by primary key, for example, should always take the same time i.e. O(1). Where as totaling bank account table (or favourites/likes), will increase linearly with the number of transactions i.e. O(n)

(Andy Wootton) #13

RE: Big O. I’m familiar with the concept but not the notation. I’m very aware that my maths is/are(n) rusty but I don’t think I’ve ever heard of it before.

(Stuart Langridge) #14

Not in the slightest. Computers have a form factor and smartwatches don’t. Requiring a form factor to be stored for smartwatches is a sign that the SQL worms have eaten into your head :wink:

But you’re not proposing that, you’re proposing…


First, because that might apply to the above example but doesn’t apply to all examples – cars aren’t a computable device, f’rinstance. And even if you say “well, they’re all Devices, so there’s an object hierarchy there”, there’s little to nothing meaningfully shared between a car and a smartwatch other than maybe having a model and manufacturer. And even if you say that that’s OK and is what the hierarchy is for… object databases are a NIGHTMARE. We spent some time experimenting with them when they were the Next Big Thing in the legal industry, and they make me want to harm myself. There’s no sensible cross-framework object representation these days, for a start – back then, there was the dream of such a thing, from which came the pathway to hell that was CORBA, but now we’ve acknowledged that lightweight is better. At which point you say, OK, what’s my object storage format if it’s gonna be lightweight? And the answer is… JSON. So, you can pretend to yourself that you’re using an object database and just use MongoDB or Postgres :wink:

(Andy Wootton) #15

I can’t concentrate on any of this until I understand what ‘Form Factor Dell’ means. Do Dell servers have a proprietary server cab format? I thought that crap was over when IBM amazed everyone by licensing DEC’s format for their racks. So useful when you can pinch a spare disk from any cab in a datacentre in an emergency.

(Jon) #16

Thanks everyone for the discussion.

So, why have we just seen the explosion of NoSQL popularity in the last five years or so? @rythie offers an interesting link about how FriendFeed uses NoSQL techniques atop a MySQL database - is it just that technology has reached the power and capability that it can “do” big data?

Certainly, the physical world we wish to model has not just changed shape! (I’d argue it hasn’t changed size either - the desire to model extremely large datasets has always been there). Or perhaps it is related to my earlier point, that schemaless designs have been considered too risky until testing automation reached a good level of maturity?

(Richard Cunningham) #17

There is lot more data being generated than ever before. Back in the early 2000s when web databases were mostly CMSes or a list of products for sale, the databases were pretty small. Now sites like, Twitter, Facebook, Instagram and WhatsApp have tons of user generated data to deal with and they’ve ended up creating dedicated systems to deal with it. Also, we now have massive amounts of automatically created data such as geo location (e.g. moves & runkeeper) and web traffic (e.g. google analytics). Cloud storage keeps getting much cheaper, so it is possible to keep the data we would previously had to throw away.

In some cases people are using tools designed for massive scale systems, when really they don’t need them, but then often there is no financial downside to these technologies anyway. I think document style databases are more suited to continuous development than MySQL, due to the schema change issues I mentioned earlier.

(Stuart Langridge) #18

Ha! I get it, now. My mistake :slight_smile: formfactor:laptop, of course.

(Marc Cooper) #19

Just a note on your schema change issue with MySQL, of which I’ve suffered too in the past. This pretty much goes away in postgresql, where schema changes can be often be performed concurrently without table locks. (e.g. CREATE INDEX CONCURRENTLY). That’s not to say that it solves every problem, but the common cases are far less bothersome, ime.

(Marc Cooper) #20

Memories of MySQL woes are flooding back to me :-/ Ordering in MySQL queries can be a problem because indexes can’t be ordered (unless they’ve fixed that) while they can in postgresql. So, your problem probably wasn’t one cause by normalisation.

Regarding your likes/favourites counts problem: I tackled this recently by using redis. On first access (missing key on redis), the count is retrieved via SELECT and stored in redis. +1s are then applied to redis (in addition to db updates). (In my case, I set expires on the redis counts to ensure they were refreshed periodically in case of things got out of sync.)

I really like redis for these kind of “active cache”-type functions, especially the fact that they can be expired without any need to write additional clean-up code.