undefined

points

[-]

Nah, I know about VMA and it's a poor bandaid. I want a single-line malloc with zero care about usage flags and which only produces one single pointer value, because that's all that's needed in pretty much all of my use cases. VMA does not provide that.

And Vulkans unnecessary complexity doesn't stop at that issue, there are plenty of follow-up issues that I also have no intention of dealing with. Instead, I'll just use Cuda which doesn't bother me with useless complexity until I actually opt-in to it when it's time to optimize. Cuda allows to easily get stuff done first then check the more complex stuff to optimize, unlike Vulkan which unloads the entire complexity on you right from the start, before you have any chance to figure out what to do.

by nicebyte3 hours ago|

parent|

[-]

> I want a single-line malloc with zero care about usage flags and which only produces one single pointer value

That's not realistic on non-UMA systems. I doubt you want to go over PCIe every time you sample a texture, so the allocator has to know what you're allocating memory _for_. Even with CUDA you have to do that.

And even with unified memory, only the implementation knows exactly how much space is needed for a texture with a given format and configuration (e.g. due to different alignment requirements and such). "just" malloc-ing gpu memory sounds nice and would be nice, but given many vendors and many devices the complexity becomes irreducible. If your only use case is compute on nvidia chips, you shouldn't be using vulkan in the first place.

by m-schuetz3 hours ago|

parent|

[-]

> Even with CUDA you have to do that.

No you don't, cuMemAlloc(&ptr, size) will just give you device memory, and cuMemAllocHost will give you pinned host memory. The usage flags are entirely pointless. Why would UMA be necessary for this? There is a clear separation between device and host memory. And of course you'd use device memory for the texture data. Not sure why you're constructing a case where I'd fetch them from host over PCI, that's absurd.

> only the implementation knows exactly how much space is needed for a texture with a given format and configuration

OpenGL handles this trivially, and there is also no reason for a device malloc to not also work trivially with that. Let me create a texture handle, and give me a function that queries the size that I can feed to malloc. That's it. No heap types, no usage flags. You're making things more complicated than they need to be.

by nice_byte2 hours ago|

parent|

[-]

> No you don't, cuMemAlloc(&ptr, size) will just give you device memory, and cuMemAllocHost will give you pinned host memory.

that's exactly what i said. You have to explicitly allocate one or the other type of memory. I.e. you have to think about what you need this memory _for_. It's literally just usage flags with extra steps.

> Why would UMA be necessary for this?

UMA is necessary if you want to be able to "just allocate some memory without caring about usage flags". Which is something you're not doing with CUDA.

> OpenGL handles this trivially,

OpenGL also doesn't allow you to explicitly manage memory. But you were asking for an explicit malloc. So which one do you want, "just make me a texture" or "just give me a chunk of memory"?

> Let me create a texture handle, and give me a function that queries the size that I can feed to malloc. That's it. No heap types, no usage flags.

Sure, that's what VMA gives you (modulo usage flags, which as we had established you can't get rid of). Excerpt from some code:

``` VmaAllocationCreateInfo vma_alloc_info = { .usage = VMA_MEMORY_USAGE_GPU_ONLY, .requiredFlags = VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT};

VkImage img; VmaAllocation allocn; const VkResult create_alloc_vkerr = vmaCreateImage( vma_allocator, &vk_image_info, // <-- populated earlier with format, dimensions, etc. &vma_alloc_info, &img, &allocn, NULL); ```

Since i dont care about reslurce aliasing, that's the extent of "memory management" that i do in my rhi. The last time i had to think about different heap types or how to bind memory was approximately never.

by m-schuetz2 hours ago|

parent|

[-]

No, it's not usage flags with extra steps, it's less steps. It's explicitly saying you want device memory without any kind of magical guesswork of what your numerous potential combinations of usage flags may end up giving you. Just one simple device malloc.

Likewise, your claim about UMA makes zero sense. Device malloc gets you a pointer or handle to device memory, UMA has zero relation to that. The result can be unified, but there is no need for it to be.

Yeah, OpenGL does not do malloc. I'm flexible, I don't necessarily need malloc. What I want is a trivial way to allocate device memory, and Vulkan and VMA don't do that. OpenGL is also not the best example since it also uses usage flags in some cases, it's just a little less terrible than Vulkan when it comes to texture memory.

I find it fascinating how you're giving a bad VMA example and passing that of as exemplary. Like, why is there gpu-only and device-local. That vma alloc info as a whole is completely pointless because a theoretical vkMalloc should always give me device memory. I'm not going to allocate host memory for my 3d models.

by mirsadm2 hours ago|

parent|

[-]

But what if you want both on a shared memory system?

by m-schuetz1 hours ago|

parent|

[-]

No problem: Then you provide an optional more complex API that gives you additional control. That's the beautiful thing about Cuda, it has an easy API for the common case that suffices 99% of the time, and additional APIs for the complex case if you really need that. Instead of making you go through the complex API all the time.

by nice_byte1 hours ago|

parent|

prev|

[-]

> It's explicitly saying you want device memory

You are also explicitly saying that you want device memory by specifying DEVICE_LOCAL_BIT. There's no difference.

> Likewise, your claim about UMA makes zero sense. Device malloc gets you a pointer or handle to device memory,

It makes zero sense to you because we're talking past each other. I am saying that on systems without UMA you _have_ to care where your resources live. You _have_ to be able to allocate both on host and device.

> Like, why is there gpu-only and device-local.

Because there's such a thing as accessing GPU memory from the host. Hence, you _have_ to specify explicitly that no, only the GPU will try to access this GPU-local memory. And if you request host-visible GPU-local memory, you might not get more than around 256 megs unless your target system has ReBAR.

> a theoretical vkMalloc should always give me device memory.

No, because if that's the only way to allocate memory, how are you going to allocate staging buffers for the CPU to write to? In general, you can't give the copy engine a random host pointer and have it go to town. So, okay now we're back to vkDeviceMalloc and vkHostMalloc. But wait, there's this whole thing about device-local and host visible, so should we add another function? What about write-combined memory? Cache coherency? This is how you end up with a zillion flags.

This is the reason I keep bringing UMA up but you keep brushing it off.

by m-schuetz1 hours ago|

parent|

[-]

> You are also explicitly saying that you want device memory by specifying DEVICE_LOCAL_BIT. There's no difference.

There is. One is a simple malloc call, the other uses arguments with numerous combinations of usage flags which all end up doing exactly the same, so why do thy even exist.

> You _have_ to be able to allocate both on host and device.

cuMemAlloc and cuMemAllocHost, as mentioned before.

> Because there's such a thing as accessing GPU memory from the host

Never had the need for that, just cuMemcpyHtoD and DtoH the data. Of course host-mapped device memory can continue to exist as a separate, more cumbersome API. The 256MB limit is cute but apparently not relevant im Cuda where I've been memcpying buffers with GBs in size between host and device for years.

> No, because if that's the only way to allocate memory, how are you going to allocate staging buffers for the CPU to write to?

With the mallocHost counterpart.

cuMemAllocHost, so a theoretic vkMallocHost, gives you pinned host memory where you can prep data before sending it to device with cuMemcpyHtoD.

> This is how you end up with a zillion flags.

Apparently only if you insist on mapped/host-visible memory. This and usage flags never ever come up in Cuda where you just write to the host buffer and memcpy when done.

> This is the reason I keep bringing UMA up but you keep brushing it off.

Yes I think I now get why keep bringing up UMA - because you want to directly access buffers between host or device via pointers. That's great, but I don't have the need for that and I wouldn't trust the performance behaviour of that approach. I'll stick with memcpy which is fast, simple, has fairly clear performance behaviours and requires none of the nonsense you insist on being necessary. But what I want isn't either this or that approach, I want the simple approach in addition what exists now, so we can both have our cakes.

by MindSpunk46 minutes ago|

parent|

[-]

What exactly is the difference between these?

cuMemAlloc -> vmaAllocate + VMA_MEMORY_USAGE_GPU_ONLY

cuMemAllocHost -> vmaAllocate + VMA_MEMORY_USAGE_CPU_ONLY

It seems like the functionality is the same, just the memory usage is implicit in cuMemAlloc instead of being typed out? If it's that big of a deal write a wrapper function and be done with it?

Usage flags never come up in CUDA because everything is just a bag-of-bytes buffer. Vulkan needs to deal with render targets and textures too which historically had to be placed in special memory regions, and are still accessed through big blocks of fixed function hardware that are very much still relevant. And each of the ~6 different GPU vendors across 10+ years of generational iterations does this all differently and has different memory architectures and performance cliffs.

It's cumbersome, but can also be wrapped (i.e. VMA). Who cares if the "easy mode" comes in vulkan.h or vma.h, someone's got to implement it anyway. At least if it's in vma.h I can fix issues, unlike if we trusted all the vendors to do it right (they wont).

by nice_byte1 hours ago|

parent|

prev|

[-]

> I want the simple approach in addition what exists now, so we can both have our cakes.

The simple approach can be implemented on top of what Vulkan exposes currently.

In fact, it takes only a few lines to wrap that VMA snippet above and you never have to stare at those pesky structs again!

But Vulkan the API can't afford to be "like CUDA" because Vulkan is not a compute API for Nvidia GPUs. It has to balance a lot of things, that's the main reason it's so un-ergonomic (that's not to say there were no bad decisions made. Renderpasses were always a bad idea.)

by m-schuetz1 hours ago|

parent|

[-]

> In fact, it takes only a few lines to wrap that VMA snippet above and you never have to stare at those pesky structs again!

If it were just this issue, perhaps. But there are so many more unnecessary issues that I have no desire to deal with, so I just started software-rasterizing everything in Cuda instead. Which is way easier because Cuda always provides the simple API and makes complexity opt-in.