One would think that for a startup of sorts, where things changes fast and are unpredictable, NoSQL is the correct answer. And when things are stable and the shape of entities are known, going for SQL becomes a natural path.
There is also cases for having both, and there is cases for graph-oriented databases or even columnar-oriented ones such as duckdb.
Seems to me, with my very limited experience of course, everything leads to same boring fundamental issue: Rarely the issue lays on infrastructure, and is mostly bad design decisions and poor domain knowledge. Realistic, how many times the bottleneck is indeed the type of database versus the quality of the code and the imlementation of the system design?
NoSQL gains you no speed at all in redesigning your system. Instead, you trade a few hard to do tasks in data migration into an unsurmountable mess of data inconsistency bugs that you'll never actually get into the end of.
> is mostly bad design decisions and poor domain knowledge
Yes, using NoSQL to avoid data migrations is a bad design decision. Usually created by poor general knowledge.
Stop and go ask more questions until you have a better understanding of the problem.
To me write scaling is the main current advantage of KV and document DBs. They can generally do schema evolution fairly easily, but nowadays so can many SQL DBs, with semi-structured column types. Also, you need to keep in mind that KV and document DBs are (mostly) non-relational. The more relational your data, the less likely you are to actually benefit from using those DBs over a relational, SQL DB.
If you have either of those problems, you will know it very clearly.
Also, ironically, Postgres became one of the most scalable NoSQL bases out there, and one of the most flexible to use unstructured data too.
It may not have a very rigid schema, you may later add several other optional fields.
You need very large scale (as in no of concurrent accesses), you want to shard the data by e.g. location. But also, the data is not "critical", your highschool not being visible temporarily for certain users is not an issue.
You mostly use the whole dataset "at the same time", you don't do a lot of WHERE, JOIN on some nested value.
In every other case I would rather reach for postgres with a JSONB column.
A lot of the bespoke no-sql data stores really started to come to the forefront around 2010 or so. At that time, having 8 cpu cores and 10k rpm SAS spinning drives was a high end server. Today, we have well over 100 cores, with TBs of RAM and PCIe Gen 4/5 NVME storage (u.x) that is thousands of times faster and has a total cost lower than the servers from 2010 or so that your average laptop can outclass today.
You can vertically scale a traditional RDBMS like PostgreSQL to an extreme degree... Not to mention utilizing features like JSONB where you can have denormalized tables within a structured world. This makes it even harder to really justify using NoSQL/NewSQL databases. The main bottlenecks are easier to overcome if you relax normalization where necessary.
There's also the consideration of specialized databases or alternative databases where data is echo'd to for the purposes of logging, metrics or reporting. Not to mention, certain layers of appropriate caching, which can still be less complex than some multi-database approaches.
That said, I've leaned into avoiding breaking up a lot of microservices unless/until you need them... I'm also not opposed to combining CQRS style workflows if/when you do need micro services. Usually if you need them, you're either breaking off certain compute/logic workflows first where the async/queued nature lends itself to your needs. My limited experience with a heavy micro-service application combined with GraphQL was somewhat painful in that the infrastructure and orchestration weren't appropriately backed by dedicated teams leading to excess complexity and job duties for a project that would have scaled just fine in a more monolithic approach.
YMMV depending on your specific needs, of course. You can also have microservices call natural services that have better connection sharing heuristics depending again on your infrastructure and needs... I've got worker pools that mostly operate of a queue, perform heavy compute loads then interact with the same API service(s) as everything else.
You could also consider renting an Oracle DB. Yep! Consider some unintuitive facts:
• It can be cheaper to use Oracle than MongoDB. There are companies that have migrated away from Mongo to Oracle to save money. This idea violates some of HN's most sacred memes, but there you go. Cloud databases are things you always pay for, even if they're based on open source code.
• Oracle supports NoSQL features including the MongoDB protocol. You can use the Mongo GUI tools to view and edit your data. Starting with NoSQL is very easy as a consequence.
• But... it also has "JSON duality views". You start with a collection of JSON documents and the database not only works out your JSON schemas through data entropy analysis, but can also refactor your documents into relational tables behind the scenes whilst preserving the JSON/REST oriented view e.g. with optimistic locking using etags. Queries on JSON DVs become SQL queries that join tables behind the scenes so you get the benefits of both NoSQL and SQL worlds (i.e. updating a sub-object in one place updates it in all places cheaply).
• If your startup has viral growth you won't have db scaling issues because Oracle DBs scale horizontally, and have a bunch of other neat performance tricks like automatically adding indexes you forgot you needed, you can materialize views, there are high performance transactional message queues etc.
So you get a nice smooth scale-up and transition from ad hoc "stuff some json into the db and hope for the best" to well typed data with schemas and properly normalized forms that benefit from all the features of SQL.
I love Postgres, and am not a huge fan of Oracle as a corporation, but I can’t deny that their RDBMS has some truly astounding capabilities.
Even then, PostgreSQL and even MS-SQL are often decent alternatives for most use cases.
So yeah running a relational DB used to be quite high effort but it got a lot better over time.
TBF, I haven't had to use Oracle in about a decade at this point... so I'm not sure how well it competes... My experiences with the corporate entity itself leave a lot to be desired, let alone just getting setup/started with local connectivity has always been what I considered extremely painful vs common alternatives. MS-SQL was always really nice to get setup, but more recently has had a lot of difficulties, in particular with docker/dev instances and more under arm (mac) than alternatives.
I'm a pretty big fan of PG, which is, again, very widely available and supported.
I think PG doesn't have most of the features I named, I'm pretty sure it doesn't have integrated queues for example (SELECT FOR UPDATE SKIP LOCKED isn't an MQ system), but also, bear in mind the "postgres" cloud vendors sell is often not actually Postgres. They've forked it and are exploiting the weak trademark protection, so people can end up more locked in than they think. In the past one cloud even shipped a transaction isolation bug in something they were calling managed Postgres, that didn't exist upstream! So then you're stuck with both a single DB and a single cloud.
Local dev is the same as other DBs:
docker run -d --name <oracle-db> container-registry.oracle.com/database/free:latest
See https://container-registry.oracle.comWorks on Intel and ARM. I develop on an ARM Mac without issue. It starts up in a few seconds.
Cost isn't necessarily much lower. At one point I specced out a DB equivalent to what a managed Postgres would cost for OpenAI's reported workload:
> I knocked up an estimate using Azure's pricing calculator and the numbers they provide, assuming 5TB of data (under-estimate) and HA option. Even with a 1 year reservation @40% discount they'd be paying (list price) around $350k/month. For that amount you can rent a dedicated Oracle/ExaData cluster with 192 cores! That's got all kinds of fancy hardware optimizations like a dedicated intra-cluster replication network, RDMA between nodes, predicate pushdown etc. It's going to perform better, and have way more features that would relieve their operational headache.
Yes I meant it in this sense: "If you knock something up, you make it or build it very quickly, using whatever materials are available."
https://www.collinsdictionary.com/dictionary/english/knock-u...
So.
Yeah no sane person would be that stupid
Never!
NoSQL is the "correct" answer if your queries are KV oriented, while predictable performance and high availability are priority (true for most "control planes"). Don't think any well-designed system will usually need to "graduate" from NoSQL to SQL.
No, no it isn't. It never is. Just as building your house on a rubber foundation isn't the correct answer either. This is just cope. Unless your use cases don't care about losing data or data corruption at all, NoSQL isn't the correct answer.
This is by design, the idea is that scaling your application layer is easy but scaling your storage/db layer is not
Hence make the storage dumb and have the application do the joins and now your app scales right up
(But tbh I agree a lot of applications don’t reach the scale required to benefit from this)
check_empty_vhosts () {
# Check which vhost adapter doesn't have any VTD mapped
start_sqlite
tosql "SELECT l.vios_name,l.vadapter_name FROM vios_vadapter AS l
LEFT OUTER JOIN vios_wwn_disk_vadapter_vtd AS r
USING (vadapter_name,vios_name)
WHERE r.vadapter_name IS NULL AND
r.vios_name IS NULL AND
l.vadapter_name LIKE 'vhost%';"
endsql
getsql
stop_sqlite
}
check_empty_vhosts_sh () {
# same as above, but on the shell
join -v 1 -t , -1 1 -2 1 \
<(while IFS=, read vio host slot; do
if [[ $host == vhost* ]]; then
print ${vio}_$host,$slot
fi
done < $VIO_ADAPTER_SLOT | sort -t , -k 1)\
<(while IFS=, read vio vhost vtd disk; do
if [[ $vhost == vhost* ]]; then
print ${vio}_$vhost
fi
done < $VIO_VHOST_VTD_DISK | sort -t , -k 1)
}Edit: I just submitted a link to Joe Armstrong's Minimum Viable Programs article from 2014. If the response to my comment is about the enterprise and imaginary scaling problems, realize that those situations don't apply to some programming problems.
Why waste time screwing around with ad-hoc file reads, then?
I mean, what exactly are you buying by rolling your own?
What's special about SQLite is that it already solves most of the things you need for data persistence without adding the same kind of overhead or trade offs as Postgres or other persistence layers, and that it saves you from solving those problems yourself in your json text files...
Like by all means don't use SQLite in every project. I have projects where I just use files on the disk too. But it's kinda inane to pretend it's some kind of burdensome tool that adds so much overhead it's not worth it.
Battle-tested, extremely performant, easier to use than a homegrown alternative?
By all means, hack around and make your own pseudo-database file system. Sounds like a fun weekend project. It doesn't sound easier or better or less costly than using SQLite in a production app though.
What overhead?
SQLite is literally more performant than fread/fwrite.
“Virding's First Rule of Programming: Any sufficiently complicated concurrent program in another language contains an ad hoc informally-specified bug-ridden slow implementation of half of Erlang.”
Wait until you actually need it.
In this case, I feel like using the filesystem directly is the opposite: doing much more difficult programming and creating more complex code, in order to do less.
It depends on how you weigh the cost of the additional dependency that lets you write simpler code, of course, but I think in this case adding a SQLite dependency is a lower long-term maintenance burden than writing code to make atomic file writes.
The original post isn't about simplicity, though. It's about performance. They claim they achieved better performance by using the filesystem directly, which could (if they really need the extra performance) justify the extra challenge and code complexity.
Premature optimisation I believe that's called.
I've seen it play out many times in engineering over the years.
You are just mislabling good architecture as 'premature optimization'. So I will give you another platitude... "There is nothing so permanent as a temporary software solution"
SELECT \* from read_csv('example.csv');
Writing generally involves reading to an in-memory database, making whatever changes you want, then something like COPY new_table TO 'example.csv' (HEADER true, DELIMITER ',');you should be squashing bugs related to your business logic, not core data storage. Local data storage on your one horizontally-scaling box is a solved problem using SQLite. Not to mention atomic backups?
What is more likely, if you are making good decisions, is that you'll reach a point where the simple approach will fail to meet your needs. If you use the same attitude again and choose the simplest solution based on your _need_, you'll have concrete knowledge and constraints that you can redesign for.
e.g. worry about what makes your app unique. Data storage is not what makes your app unique. Outsource thinking about that to SQLite
Surely it does? Otherwise you cannot trust the interface point with SQLite and you're no further ahead. SQLite being flawless doesn't mean much if you screw things up before getting to it.
Regardless of whether most apps have enough requests per second to "need" a database for performance reasons, these are extremely important topics for any app used by a real business.