Just try GLM-OCR if you want to get started quickly. It has good layout recognition and good text recognition, and its developers actually tested it on Apple Silicon laptops. It works out of the box without the yak shaving I encountered with some other models. Chandra is even more accurate on text, but its layout bounding boxes are worse and it runs very slowly unless you can set up batched inference with vLLM on CUDA. (I tried to get batching to run with vllm-mlx so it could work entirely on macOS, but a day of yak shaving with Claude Opus's help went nowhere.)
If you just want to transcribe documents, you can also try end-to-end models like olmOCR 2. I need pipeline models that expose the inner details of document layout because I segment and restructure page contents for further processing. The end-to-end models just "magically" turn page scans into complete Markdown or HTML documents, which is more convenient for some uses but not mine.
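To make "segment and restructure" concrete, here is a minimal sketch of the kind of post-processing that only works when the model exposes layout. The `Block` type, its field names, and the label strings are all hypothetical, made up for illustration; they are not the actual output schema of GLM-OCR, Chandra, or any other model, and the top-to-bottom sort is a naive stand-in for real reading-order logic.

```python
from dataclasses import dataclass


@dataclass
class Block:
    # Hypothetical layout record; real models use their own schemas.
    label: str   # assumed layout class, e.g. "title", "text", "footer"
    bbox: tuple  # assumed (x0, y0, x1, y1) in page pixels
    text: str    # recognized text for this region


def reading_order(blocks, drop=("header", "footer")):
    """Drop unwanted regions, then sort top-to-bottom, left-to-right.

    Naive: assumes a single-column page. Multi-column layouts need more.
    """
    kept = [b for b in blocks if b.label not in drop]
    return sorted(kept, key=lambda b: (b.bbox[1], b.bbox[0]))


# Made-up blocks in arbitrary detection order.
page = [
    Block("footer", (50, 980, 750, 1000), "Page 3"),
    Block("text", (50, 120, 750, 400), "First paragraph."),
    Block("title", (50, 40, 750, 90), "Section heading"),
    Block("text", (50, 420, 750, 700), "Second paragraph."),
]

ordered = reading_order(page)
for block in ordered:
    print(block.label, block.text)
```

With an end-to-end model you only get the final Markdown or HTML, so a step like dropping footers or regrouping regions has to be done by parsing that output back apart, if it can be done at all.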