upvote
That's exactly what I found. Why these files should exist at all? Some other IDEs just have a bunch of highlighting rules based on regular expressions and have a folder of tiny XML grammar files instead of a folder of bloaty shared libraries.
reply
Because it's far more reliable to use proper parsers instead of a bunch of regular expressions. Most languages cannot be properly parsed with regexes.

Those files are compiled tree-sitter grammars, read up on why it exists and where it is used instead of me poorly regurgitating official documentation:

https://tree-sitter.github.io/tree-sitter

reply
Funny enough, they are less than 10MB when compressed. I guess they could use something like upx to compress these binaries.

The whole Linux release is 15mb, but it uncompresses to 16MB binary and 200MB grammars on disk.

Why do we need to have 40MB of Verilog grammars on disk when 99% of people don't use them?

reply
That would waste CPU time and introduce additional delays when opening files.

They could probably lazily install the grammars like neovim does, but as someone who doesn't have much faith in the reliability of internet infrastructure, I'll personally take it...

Just ran `:TSInstall all` in neovim out of curiosity, and the results were predictable:

  ~/.local/share/nvim/lazy/nvim-treesitter/parser
  files 309
  size 232M

  /usr/lib/helix/runtime/grammars
  files 246
  size 185M
If disk space is important for your use case, I guess filesystem compression would save far more than just compressing binaries with upx. btrfs+zstd handle those .so well:

  $ compsize ~/.local/share/nvim/lazy/nvim-treesitter/parser
  Type       Perc     Disk Usage   Uncompressed Referenced
  TOTAL       11%       26M         231M         231M

  $ compsize /usr/lib/helix/runtime/grammars
  Type       Perc     Disk Usage   Uncompressed Referenced
  TOTAL       12%       23M         184M         184M
reply
I mean, they could decompress it once when using a language for the first time. It will still be fully offline, but with a bit uncompressing.
reply
If this is a concern, why not compress at the filesystem level?
reply
For real parsing a proper compiler codebase (via a language server implementation) should be used. Writing something manually can't work properly, especially with languages like C++ and Rust with complex includes/imports and macros. Newer LSP editions support syntax-based highlighting/colorizing, but if some LSP implementation doesn't support it, using regexp-based fallback is mostly fine.
reply