Still a cool project, thanks for sharing.
I have wondered about having LLMs output machine code directly and skipping the compiler/assembler altogether. Then you'd just commit your spec/prompt and run it through the LLM to get your binary.
cmp dword [rax], 'XAUT'
jne .rxa_next
cmp dword [rax+4], 'HORI'
jne .rxa_next
cmp word [rax+8], 'TY'
jne .rxa_next
cmp byte [rax+10], '='
jne .rxa_next
; Found XAUTHORITY=path
Okay, this is setup code that only runs once at startup - but that would be a reason to optimize it for size and/or readability! REPE CMPSB exists, and may not be the fastest, but certainly the most compact and idiomatic way to compare strings. Or write a subroutine to do it!This pattern is used everywhere for copying or comparing strings, this was just one example of it.
There's a state variable that's used to keep track of whether the input is text to be displayed or part of a control sequence. It's a full 64 bits, probably not because we need 18 quintillion states? Here's how it is evaluated:
; Dispatch based on state
mov rcx, [vt_state]
cmp rcx, VT_ESC
je .vtp_esc
cmp rcx, VT_CSI
je .vtp_csi
cmp rcx, VT_CSI_PARAM
...
In total, there are 7 compares + conditional jumps, one after another. Compilers would generate a jump table for this, and a better option in assembly might be to make vt_state a pointer to the label we want to go to. Branch predictors nowadays can handle indirect jumps, and may actually have more trouble with such tightly clustered conditionals as seen in this code.This code is on the "slow" path, there's a faster one for 7-bit ASCII outside of control sequences, with a lengthy comment by Claude at the top on how it optimized this. Even this one starts with a bunch of conditionals though:
cmp qword [vt_state], 0 ; VT_NORMAL == 0
jne .vtp_loop_slow
cmp dword [utf8_remaining], 0
jne .vtp_loop_slow
cmp byte [pending_wrap], 0
jne .vtp_loop_slow
These could likely all be condensed into a single test or indirect jump via the state variable, by introducing just a few more states for UTF-8 decoding and wrap. Following this, here's a "useless use of TEST" (the subtraction already set the flags): mov rbx, [grid_cols]
sub rbx, [cursor_col] ; rbx = cells left on this row
test rbx, rbx
jle .vtp_loop_slow ; no room (or already past)
This also again shows the compulsive use of 64-bit registers and variables for values that should never be this big. It's not the "natural" data size on x86-64 at all, every such instruction requires an extra prefix byte.I freely confess that I'm a "Luddite", and was explicitly looking for bad (and obviously so) code, but this took me just a few minutes of scrolling through the nearly 20K lines in this file, so it should be somewhat representative of the whole.
rust can do that. You can run a hyper stripped down rust that was made for embedded devices specifically because those devices don't have room for a runtime.
And it apparently can. And very well.
One advantage seems to be that the complete asm file fits easily into CC context window.
well, I can respect that for sure