For about a month now I've been working on a suite of tools for dealing with JSON specifically written for the imagined audience of "for people who like CLIs or TUIs and have to deal with PILES AND PILES of JSON and care deeply about performance".
For me, I've been writing them just because it's an "itch". I like writing high performance/efficient software, and there's a few gaps that it bugged me they existed, that I knew I could fill.
I'm having fun and will be happy when I finish, regardless, but it would be so cool if it happened to solve a problem for someone else.
Working with this is pretty painful, so I convert the Pickled structure to other formats including JSON.
The file has always been prettified around ~500MB but as of recently expands to about 3GB I think because they’ve added extra regional parameters.
The file inflates to a large size because Pickle refcounts objects for deduping, whereas obviously that’s lost in JSON.
I care about speed and tools not choking on the large inputs so I use jaq for querying and instruction LLMs operating on the data to do the same.
You could probably do something similar for a faster jq.
> The query language is deliberately less expressive than jq's. jsongrep is a search tool, not a transformation tool-- it finds values but doesn't compute new ones. There are no filters, no arithmetic, no string interpolation.
Mind me asking what sorts of TB json files you work with? Seems excessively immense.
> Hadoop: bro
> Spark: bro
> hive: bro
> data team: bro
<https://adamdrake.com/command-line-tools-can-be-235x-faster-...>
Command-line Tools can be 235x Faster than your Hadoop Cluster (2014)
Conclusion: Hopefully this has illustrated some points about using and abusing tools like Hadoop for data processing tasks that can better be accomplished on a single machine with simple shell commands and tools.I'm sure there are reasons against switching to something more efficient–we've all been there–I'm just surprised.
I'm not OP,but structured JSON logs can easily result in humongous ndjson files, even with a modest fleet of servers over a not-very-long period of time.
I'd probably just shove it all into Postgres, but even a multi terabyte SQLite database seems more reasonable.
Even if it's once off, some people handle a lot of once-offs, that's exactly where you need good CLI tooling to support it.
Sure jq isn't exactly super slow, but I also have avoided it in pipelines where I just need faster throughput.
rg was insanely useful in a project I once got where they had about 5GB of source files, a lot of them auto-generated. And you needed to find stuff in there. People were using Notepad++ and waiting minutes for a query to find something in the haystack. rg returned results in seconds.
The comment I was replying to implied this was something more regular.
EDIT: why is this being downvoted? I didn't think I was rude. The person I responded to made a good point, I was just clarifying that it wasn't quite the situation I was asking about.