If DSPy is so great, why isn't anyone using it?

(skylarbpayne.com)

178 points

by sbpayne 5 hours ago

by deaux 4 hours ago

I don't see it at all.

> Typed I/O for every LLM call. Use Pydantic. Define what goes in and out.

Sure, not related to DSPy though, and completely tablestakes. Also not sure why the whole article assumes the only language in the world is Python.

> Separate prompts from code. Forces you to think about prompts as distinct things.

There's really no reason prompts must live in a file with a .md or .json or .txt extension rather than .py/.ts/.go/.., except if you indeed work at a company that decided it's a good idea to let random people change prod runtime behavior. If someone can think of a scenario where this is actually a good idea, feel free to enlighten me. I don't see how it's any more advisable than editing code in prod while it's running.

> Composable units. Every LLM call should be testable, mockable, chainable.

> Abstract model calls. Make swapping GPT-4 for Claude a one-line change.

And LiteLLM or `ai` (Vercel), the actually most used packages, aren't? You're comparing downloads with Langchain, probably the worst package to gain popularity of the last decade. It was just first to market, then after a short while most realized it's horrifically architected, and now it's just coasting on former name recognition while everyone who needs to get shit done uses something lighter like the above two.

> Eval infrastructure early. Day one. How will you know if a change helped?

Sure, to an extent. Outside of programming, most things where LLMs deliver actual value are very nondeterministic with no right answer. That's exactly what they offer. Plenty of which an LLM can't judge the quality of. Having basic evals is useful, but you can quickly run into their development taking more time than it's worth.

But above all.. the comments on this post immediately make clear that the biggest differentiator of DSPy is the prompt optimization. Yet this article doesn't mention that at all? Weird.

by alexjplant 2 hours ago

> Sure, not related to DSPy though, and completely tablestakes.

I agree but you'd be surprised at how many people will argue against static typing with a straight face. It's happened to me on at least three occasions that I can count and each time the usual suspects were trotted out: "it's quicker", "you should have tests to validate anyhow", "YOLO polymorphism is amazing", "Google writes Python so it's OK", etc.

It must be cultural as it always seems to be a specific subset of Python and ECMAScript devs making these arguments. I'm glad that type hints and Typescript are gaining traction as I fall firmly on the other side of this debate. The proliferation of LLM coding workflows has likely accelerated adoption since types provide such valuable local context to the models.

by callbacked 1 hour ago

> not sure why the whole article assumes the only language in the world is Python

https://github.com/ax-llm/ax (if you're in the typescript world)

by andyg_blog 4 hours ago

>the whole article assumes the only language in the world is Python.

This was my take as well.

My company recently started using Dspy, but you know what? We had to stand up an entire new repo in Python for it, because the vast majority of our code is not Python.

by sbpayne 4 hours ago

I think this is an important point! I am actually a big fan of doing what works in the language(s) you're already using.

For example: I don't use Dspy at work! And I'm working in a primarily dotnet stack, so we definitely don't use Dspy... But still, I see the same patterns seeping through that I think are important to understand.

And then there's a question of "how do we implement these patterns idiomatically and ergonomically in our codebase/language?"

by redwood 2 hours ago

Out of curiosity, what are you finding success with in dotnet land? My observation is that it's not clear when Semantic Kernel is recommended versus one of multiple other MSFT newly-branded creations

by sbpayne 2 hours ago

We have been using Agent Framework. I have also been eyeing LlmTornado. Personally, I find it hard in dotnet as a whole to implement the kind of abstractions I want in order to make AI stuff ergonomic to build.

I've been fiddling around with many prototypes to try to figure out the right way to do this, but it feels challenging; I'm not yet familiar enough with how to do this ergonomically and idiomatically in dotnet haha

by BoorishBears 2 hours ago

Why did you do that instead of using Liquid templates?

by sbpayne 4 hours ago

I think all of these things are table-stakes; yet I see that they are implemented/supported poorly across many companies. All I'm saying is there are some patterns here that are important, and it makes sense to enter into building AI systems understanding them (whether or not you use Dspy) :)

by PaulHoule 2 hours ago

I can say for 10 years I have been looking at general purpose frameworks like Dspy and even wrote one at work and they tend to be pretty bad, especially the one I wrote.

I agree with all the points that they list but I fear if I looked close at the code and how they did it I wouldn't stop cringing until I looked away. Frameworks like this tend to point out 10 concerns that you should be concerned about but aren't and make users learn a lot of new stuff to bend their work around your framework but they rarely get a clear understanding of what the concerns are, where exactly the value comes from the framework, etc.

That is, if you are trying to sell something you can do a lot better with something crazy and one-third-baked like OpenClaw, which will make your local Apple Store sell out of minis, than anything that rationally explains "you are going to have to invent all the stuff that is in this framework that looks like incomprehensible bloat to you right now." I mean, it is rational, it is true, but I can say empirically as a person-who-sells-things that it doesn't sell, in fact if you wanted me to make a magic charm that looks like it would sell things and make sure you don't sell anything it would be that.

by sbpayne 2 hours ago

yeah the point I want to get across is less "you should use Dspy" and more "understand Dspy, so you are intentionally implementing the capabilities you need"

Implementations are generally always going to be messy; and still I feel like not all the messiness is incidental. A lot of it is accidental :)

by persedes 3 hours ago

DSPy's advertising aside, imho it is a library only for optimizing an existing workflow/prompt and not for the use cases described there. Similar to how I would not write "production" code with sklearn :)

They themselves are turning into wrapper code for other libraries (e.g. the LLM abstraction which litellm handles for them).

Can also add:

Option 3: Use instructor + litellm (probably pydantic AI, but have not tried that yet)

Edit: As others pointed out their optimizing algorithms are very good (GEPA is great and lets you easily visualize / track the changes it makes to the prompt)

by prpl 3 hours ago

The sklearn comparison, to me, mirrors the insane amount of engineering that exists/existed to bring Jupyter notebooks to something more prod-worthy and reproducible. There’s always going to be re-engineering of these things; you don’t need to use the same tools for all use cases.

by persedes 3 hours ago

Hmm not quite what I meant. Sklearn has its place in every ML toolbox, I'll use it to experiment and train my model. However for deploying it, I can e.g. just grab the weights of the model and run it with numpy in production without needing the heavy dependencies that sklearn adds.

by hedgehog 3 hours ago

In my experience the behavior variation between models and providers is different enough that the "one-line swap" idea is only true for the simplest cases. I agree the prompt lifecycle is the same as code though. The compromise I'm at currently is to use text templates checked in with the rest of the code (Handlebars but it doesn't really matter) and enforce some structure with a wrapper that takes as inputs the template name + context data + output schema + target model, and internally papers over the behavioral differences I'm ok with ignoring.

I'm curious what other practitioners are doing.
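A minimal sketch of the kind of wrapper described above, assuming Python's `string.Template` as a stand-in for Handlebars and a caller-supplied `call_model` function in place of a real provider client. The names `TEMPLATES`, `validate`, `run_prompt`, and the schema format (field name to type) are all hypothetical; the provider-specific differences would live inside `call_model`.

```python
import json
from string import Template
from typing import Any, Callable

# Checked-in templates, keyed by name (in practice these would be files in the repo).
TEMPLATES = {
    "extract_company": Template(
        "Extract the company name from: $text\n"
        "Answer as JSON with the keys: $keys"
    ),
}


def validate(data: dict[str, Any], schema: dict[str, type]) -> dict[str, Any]:
    """Check that every schema field is present with the expected type."""
    for field, expected in schema.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected):
            raise ValueError(f"{field} should be {expected.__name__}")
    return data


def run_prompt(
    template_name: str,
    context: dict[str, Any],
    schema: dict[str, type],
    model: str,
    call_model: Callable[[str, str], str],
) -> dict[str, Any]:
    """Render the template, call the model, and validate the parsed output."""
    prompt = TEMPLATES[template_name].substitute(
        **context, keys=", ".join(schema)
    )
    raw = call_model(model, prompt)  # hypothetical: (model, prompt) -> JSON text
    return validate(json.loads(raw), schema)


def fake_model(model: str, prompt: str) -> str:
    """Stub standing in for a real provider call."""
    return json.dumps({"company": "Acme Corp"})


result = run_prompt(
    "extract_company",
    {"text": "Acme Corp reported record earnings."},
    {"company": str},
    model="any-model-name",
    call_model=fake_model,
)
```

Callers only ever see the template name, the context, the schema, and the model name, so swapping the target model stays a one-argument change even when the handling behind `call_model` differs.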

by dbreunig 3 hours ago

Model testing and swapping is one of the surprises people really appreciate DSPy for.

You're right: prompts are overfit to models. You can't just change the provider or target and know that you're giving it a fair shake. But if you have eval data and have been using a prompt optimizer with DSPy, you can try models with the one-line change followed by rerunning the prompt optimizer.

Dropbox just published a case study where they talk about this:

> At the same time, this experiment reinforced another benefit of the approach: iteration speed. Although gemma-3-12b was ultimately too weak for our highest-quality production judge paths, DSPy allowed us to reach that conclusion quickly and with measurable evidence. Instead of prolonged debate or manual trial and error, we could test the model directly against our evaluation framework and make a confident decision.

https://dropbox.tech/machine-learning/optimizing-dropbox-das...
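The workflow described here (swap the model, rerun against the eval data, compare) can be sketched without any framework. This is a minimal illustration and not DSPy's API: `EVAL_SET`, `exact_match`, `evaluate`, and the two stub "models" are hypothetical, and a real setup would rerun a prompt optimizer for each model before scoring it.

```python
from typing import Callable

# Tiny hand-labeled eval set: (input text, expected output).
EVAL_SET = [
    ("Acme Corp reported record earnings.", "Acme Corp"),
    ("Shares of Globex fell 3% today.", "Globex"),
    ("Initech announced a new CEO.", "Initech"),
]


def exact_match(expected: str, predicted: str) -> float:
    """Score 1.0 when the prediction matches after trimming and lowercasing."""
    return float(expected.strip().lower() == predicted.strip().lower())


def evaluate(predict: Callable[[str], str]) -> float:
    """Average metric score of one candidate over the whole eval set."""
    scores = [exact_match(expected, predict(text)) for text, expected in EVAL_SET]
    return sum(scores) / len(scores)


# Stubs standing in for two different models behind the same interface.
def model_a(text: str) -> str:
    return text.split(" reported")[0].split(" announced")[0]


def model_b(text: str) -> str:
    return text.split()[0]


candidates = {"model_a": model_a, "model_b": model_b}
scores = {name: evaluate(fn) for name, fn in candidates.items()}
best = max(scores, key=scores.get)
```

The point of the structure is that the decision comes from `scores`, measured on the same eval set for every candidate, rather than from debate or manual spot checks.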

by hedgehog 40 minutes ago

It's not just about fitting prompts to models, it's things like how web search works, how structured outputs are handled, various knobs like level of reasoning effort, etc. I don't think the DSPy approach is bad but it doesn't really solve those issues.

by persedes 2 hours ago

funnily enough the model switching is mostly thanks to litellm which dspy wraps around.

by MoonWalk 2 minutes ago

Inaccessible: "net::ERR_CERT_AUTHORITY_INVALID"

by BenGosub 16 minutes ago

About one and a half years ago I was an early adopter of DSPy and I had better results (compared to LlamaIndex) with structuring unstructured data just by putting it in DSPy models, before any optimization step whatsoever.

Also, IMO DSPy didn't take off because it requires preparing train and test datasets and that takes time and effort. Now with Gepa I expect things are getting very interesting, the optimizations can come just from descriptions.

IMO LangGraph is currently used a lot as an agent and RAG framework, DSPy doesn't have the same use case, even though there's overlap. And I think the monthly numbers don't do it justice, because what I see now is a lot of companies doing things wrongly.

by nkozyra 4 hours ago

> f"Extract the company name from: {text}"

I think one thing that's lost in all of the LLM tooling is that it's LLM-or-nothing and people have lost knowledge of other ML approaches that actually work just fine, like entity recognition.

I understand it's easier to just throw every problem at an LLM but there are things where off-the-shelf ML/NLP products work just as well without the latency or expense.

by roadside_picnic 3 hours ago

> like entity recognition

As someone who has done traditional NLP work as at least part of my job for the last 15 years, LLMs do offer a vastly superior NER solution over any previous NLP options.

I agree with your overall statement, that frequently people rush to grab an LLM when superior options already exist (classification is a big example, especially when the power of embeddings can be leveraged), but NER is absolutely a case where LLMs are the superior option (unless you have latency/cost requirements that force you to choose an inferior quality as the trade off, but your default should be an LLM today).

by mark_l_watson 1 hour ago

I agree! I used 'symbolic AI' for NLP starting in the early 1980s. Everything back then was so brittle, and very labor intensive.

by sbpayne 4 hours ago

Oh 100%! There are many problems (including this one!) that probably aren't best suited for an LLM. I was just trying to pick a really simple example that most people would follow.

by rao-v 3 hours ago

Is there a non-transformer based entity extraction solution that's not brittle? My understanding is that the cutting edge in entity extraction (e.g. spaCy) is just small BERT models, which rock for certain things, but don't have the world knowledge to handle typos / misspellings etc.

by swyx 2 hours ago

but then u run into edge cases with indirect references and entity recognition models arent smart enough to deal with them, and bitter lesson hits you again.

by sbpayne 1 hour ago

the bitter lesson comes for us all, unfortunately!

by Legend2440 2 hours ago

I don't think you realize how bad NLP was prior to transformers. Oldschool entity recognition was extremely brittle to the point that it basically didn't work.

CV too for that matter, object recognition before deep learning required a white background and consistent angles. Remember this XKCD from only 2014? https://xkcd.com/1425/

by nkozyra 1 hour ago

CV is a space where I would 100% agree with you. But - edge cases notwithstanding - there's not so much of a dropoff with NER that I would first go to an LLM.

by LudwigNagasena 2 hours ago

The article starts with the comparison of DSPy and LangChain monthly downloads and then wastes time comparing DSPy to hand-rolling basic infra, which is quite trivial in every barely mature setup.

I conjecture that the core value proposition of DSPy is its optimizer? Yet the article doesn't really touch it in any important way. How does it work? How would I integrate it into my production? Is it even worth it for usual use-cases? Adding a retry is not a problem, creating and maintaining an AI control plane is. LangChain provides services for observability, online and offline evaluation, prompt engineering, deployment, you name it.

by sbpayne 2 hours ago

You can see many people saying this in the comments :). I personally think this misses the core of what Dspy "is".

Dspy encourages you to write your code in a way that better enables optimization, yes (and provides direct abstractions for that). But this isn't in a sense unique to Dspy: you can get these same benefits by applying the right patterns.

And I find people constantly implementing these patterns without realizing it, and I think they could benefit from understanding Dspy a bit better to make better implementations :)

by memothon 4 hours ago

I think the real problem with using DSPy is that many of the problems people are trying to solve with LLMs (agents, chat) don't have an obvious path to evaluate. You have to really think carefully on how to build up a training and evaluation dataset that you can throw to DSPy to get it to optimize.

This takes a ton of upfront work and careful thinking. As soon as you move the goalposts of what you're trying to achieve you also have to update the training and evaluation dataset to cover that new use case.

This can actually get in the way of moving fast. Often teams are not trying to optimize their prompts but even trying to figure out what the set of questions and right answers should be!

by sbpayne 4 hours ago

Yeah, I think Dspy often does not really show its benefit until you have a good 'automated metric', which can be difficult to get to.

I think the unfortunate part is: the way it encourages you to structure your code is good for other reasons that might not be an 'acute' pain. And over time, it seems inevitable you'll end up building something that looks like it.

by memothon 4 hours ago

Yeah I agree with this. I will try to use it in earnest on my next project.

That metric is the key piece. I don't know the right way to build an automated metric for a lot of the systems I want to build that will stand the test of time.
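For extraction-style tasks, one common starting point is a field-level score rather than all-or-nothing exact match. A minimal sketch under that assumption: the `(example, pred, trace=None)` argument order mirrors the shape DSPy documents for metric functions, but plain dicts are used here instead of DSPy objects, and `normalize`, `field_accuracy`, and the sample data are hypothetical.

```python
from typing import Any, Optional


def normalize(value: Any) -> str:
    """Compare values case-insensitively and ignoring surrounding whitespace."""
    return str(value).strip().lower()


def field_accuracy(
    example: dict[str, Any], pred: dict[str, Any], trace: Optional[Any] = None
) -> float:
    """Fraction of expected fields that the prediction got right."""
    expected = example["expected"]
    if not expected:
        return 1.0
    correct = sum(
        1
        for field, value in expected.items()
        if field in pred and normalize(pred[field]) == normalize(value)
    )
    return correct / len(expected)


example = {
    "text": "Acme Corp hired Jane Doe as CFO in 2021.",
    "expected": {"company": "Acme Corp", "person": "Jane Doe", "year": 2021},
}
pred = {"company": "acme corp", "person": "J. Doe", "year": "2021"}
score = field_accuracy(example, pred)
```

A score like this will not stand the test of time on its own, but it is cheap to write, and the `normalize` step is the obvious place to encode decisions about what counts as "the same" answer as the goalposts move.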

by sbpayne 4 hours ago

To be clear: I don't know that I would recommend using it, exactly. I would just make sure you understand the lessons so you see how it best makes sense to apply to your project :)

by stephantul 4 hours ago

Mannnn, here I thought this was going to be an informative article! But it’s just a commercial for the author’s consulting business.

by sbpayne 4 hours ago

Oops! That's actually out of date from a prior template I had. I don't actually consult at the moment :). Removing!

by halb 4 hours ago

The author itself is probably ai-generated. The contact section in the blog is just placeholder values. I think the age of informative articles is gone

by CharlieDigital 4 hours ago

I work with the author; the author is definitely not AI generated.

by sbpayne 4 hours ago

This is definitely a mistake! What contact section are you referring to? The only references to contact I see in this post now are at the end where I linked to my X/LinkedIn profiles but those links look right to me?

by giorgioz 3 hours ago

Loved the article because I exactly hit the stages all up till the 5th! Thank you for making me see the whole picture and journey!

I think a problem with DSPy is that they don't know the concept of THE WHOLE PRODUCT: https://en.wikipedia.org/wiki/Whole_product

Look at https://mastra.ai/ and https://www.copilotkit.ai/ to see how more inviting their pages look. A company is not selling only the product itself but all the other things around the product = THE WHOLE PRODUCT

A similar concept in developer tools is "the docs are the product".

Also I'm a fullstack javascript engineer and I don't use Python. Docs usually have a switch for the language at the top. Stripe.com is famous for its docs and Developer Experience: https://docs.stripe.com/search#examples It's great to study other great products to get inspiration and copy the best traits that are relevant to your product as well.

by sbpayne 3 hours ago

The "whole product" idea here makes a lot of sense to me. I think this is often a big barrier to adoption for sure!

by TheTaytay 5 hours ago

I tried it in the past, one time “in earnest.” But when I discovered that none of my actual optimized prompts were extractable, I got cold feet and went a different route. The idea of needing to fully commit to a framework scares me. The idea of having a computer optimize a prompt as a compilation step makes a lot of sense, but treating the underlying output prompt as an opaque blob doesn’t. Some of my use cases were JUST far enough off the beaten path that dspy was confusing, which didn’t help. And lastly, I felt like committing to dspy meant that I would be shutting the door on any other framework or tool or prompting approach down the road.

I think I might have just misunderstood how to use it.

by sbpayne 5 hours ago

I don't know that you misunderstood. This is one of my biggest gripes with Dspy as well. I think it takes the "prompt is a parameter" concept a bit too far.

I highly recommend checking out this community plugin from Maxime, it helps "bridge the gap": https://github.com/dspy-community/dspy-template-adapter

by TheTaytay 2 minutes ago

Woah, that community plugin looks so much closer to what I initially thought (hoped) Dspy was!

by ndr 4 hours ago

It's not as ergonomic as they made it to be.

The fact that you have to bundle input+output signatures and everything is dynamically typed (sometimes into the args) just makes it annoying to use in codebases that have type annotations everywhere.

Plus their out of the box agent loop has been a joke for the longest time, and writing your own is feasible, but it's night and day compared with getting something done with pydantic-ai.

Too bad because it has a lot of nice things, I wish it were more popular.

by sbpayne 4 hours ago

Yeah! I can agree with this. There's some improved ergonomics to get here

by verdverm 4 hours ago

Have you looked at ADK? How does it compare? Does it even fit in the same place as Dspy?

https://google.github.io/adk-docs/

Disclaimer, I use ADK, haven't really looked at Dspy (though I have heard of it before). ADK certainly addresses all of the points you have in the post.

by sbpayne 4 hours ago

I personally haven't looked super closely at ADK. But I would love if someone more knowledgeable could do a sort of comparison. I imagine there are a lot of similar/shared ideas!

by verdverm 4 hours ago

There are dozens if not 100s of agent frameworks in use today, 1000s if you peruse /new. I'm curious what features will make for longevity. One thing about ADK is that it comes in four languages (Py, TS, Go, Java; so far), which means understanding can transfer over/between teams in larger orgs, and they can share the same backing services (like the db to persist sessions).

by panelcu 4 hours ago

https://www.tensorzero.com/docs has similar abstractions but doesn't require Python and doesn't require committing to the framework or a language. It's also pretty hard to onboard, but solves the same problems better and makes evaluating changes to models / prompts much easier to reason about.

by TheTaytay 2 minutes ago

Yes, I was more impressed with their decoupling of prompts from parameters!

by sbpayne 4 hours ago

I saw this some time ago! I personally have a distaste for external DSLs as I think it generally introduces complexity that I don't think is actually worthwhile, so I skipped over it. Also why I'm very "meh" on BAML.

by GabrielBianconi 3 hours ago

TensorZero works with the OpenAI SDK out of the box:

```
from openai import OpenAI

# Point the client to the TensorZero Gateway
client = OpenAI(base_url="http://localhost:3000/openai/v1", api_key="not-used")

response = client.chat.completions.create(
    # Call any model provider (or TensorZero function)
    model="tensorzero::model_name::anthropic::claude-sonnet-4-6",
    messages=[
        {
            "role": "user",
            "content": "Share a fun fact about TensorZero.",
        }
    ],
)
```

You can layer additional features only as needed (fallbacks, templates, A/B testing, etc).

by tcdent 2 hours ago

DSPy is cool from an integrated perspective but as someone who extensively develops agents, there have been two phases to the workflow that prevented me from adopting it:

1. Up until about six months ago, modifying prompts by hand and incorporating terminology with very specific intent and observing edge cases and essentially directing the LLM in a direction to the intended outcome was somewhat meticulous and also somewhat tricky. This is what the industry was commonly referring to as prompt engineering.

2. With the current state of SOTA models like Opus 4.6, the agent that is developing my applications alongside of me often has a more intelligent and/or generalized view of the system that we're creating.

We've reached a point in the industry where smaller models can accomplish tasks that were reserved for only the largest models. And now that we use the most intelligent models to create those systems, the feedback loop which was patterned by DSPy has essentially become adopted as part of my development workflow.

I can write an agent and a prompt as a first pass using an agentic coder, and then based on the observation of the performance of the agent by my agentic coder, continue to iterate on my prompts until I arrive at satisfactory results. This is further supported by all of the documentation, specifications, data structures, and other I/O aspects of the application that the agent integrates in which the coding agent can take into account when constructing and evaluating agentic systems.

So DSPy was certainly onto something, but the level of abstraction, at least in my personal use case, has moved up a layer instead of being integrated into the actual system.

by sbpayne 2 hours ago

I think many people have the same experience! And that's the point I'm trying to make. There are patterns here that are worth adopting, whether or not you're using Dspy :)

by sethkim 3 hours ago

We build a product that's somewhat similar in spirit to DSPy, but people come to us for different reasons than the OP listed here.

1) It's slow: you first have to get acquainted with DSPy and then get hand-labeled data for prompt optimization. This can be a slow process so it's important to just label cases that are ambiguous, not obvious.

2) They know that manual prompt engineering is brittle, and want a prompt that's optimized and robust against a model they're invoking, which DSPy offers. However, it's really the optimizer (ex. GEPA) doing the heavy-lifting.

3) They don't actually want a model or prompt at all. They want a task completed, reliably, and they want that task to not regress in performance. Ideally, the task keeps improving in production.

Curious if folks in this thread feel more of these pains than the ones in the article.
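One way to find the ambiguous cases mentioned in point 1 is to sample the model several times per input and send only the inputs where the samples disagree to a human labeler. A minimal sketch under that assumption; `disagreement`, `pick_for_labeling`, and the sampled outputs are hypothetical, and this is not a description of any particular product.

```python
from collections import Counter


def disagreement(samples: list[str]) -> float:
    """1 minus the share of the most common answer; 0.0 means full agreement."""
    counts = Counter(s.strip().lower() for s in samples)
    top = counts.most_common(1)[0][1]
    return 1.0 - top / len(samples)


def pick_for_labeling(
    sampled: dict[str, list[str]], budget: int, threshold: float = 0.0
) -> list[str]:
    """Return up to `budget` inputs above `threshold`, most ambiguous first."""
    scored = {text: disagreement(outs) for text, outs in sampled.items()}
    ambiguous = [t for t, s in scored.items() if s > threshold]
    return sorted(ambiguous, key=lambda t: scored[t], reverse=True)[:budget]


# Hypothetical repeated outputs for three inputs (e.g. sampled at temperature > 0).
sampled = {
    "ticket 1": ["billing", "billing", "billing", "billing"],
    "ticket 2": ["billing", "refund", "billing", "refund"],
    "ticket 3": ["bug", "bug", "bug", "feature"],
}
to_label = pick_for_labeling(sampled, budget=2)
```

Inputs the model answers consistently ("ticket 1" here) never reach the labeler, which keeps the hand-labeling budget focused on the cases that carry information.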

by sbpayne 3 hours ago

I think in some sense, this is the real thing everyone wants. Everything else is kind of an implementation detail! Would be really curious to see what you're building!

by sethkim 3 hours ago

Feel free to shoot me a note at seth@sutro.sh if you want to check it out!

by CraftingLinks 4 hours ago

I used dspy in production, then reverted the bloat as it literally gave me nothing of added value in practice but a lot of friction when i needed precise control over the context. Avoid!

by matusp 3 hours ago

I enjoy working with it. I mostly just use it to define the input and outputs more programmatically compared to raw prompts.

by alex7o 45 minutes ago

I have used baml before and that worked super well for me multiple times so I don't see a problem with that.

by benh247744 minutes ago

The adoption gap feels real. My experience is that developers don't trust AI outputs enough to build production workflows around them yet — the missing piece isn't better prompting frameworks, it's confidence signals that tell you when to trust the output.

by whinvik 2 hours ago

I don't get it. All these are provided by many different agent libs like langgraph, Pydantic AI etc. I thought DSPy was for prompt optimization but I could never wrap my head around that aspect since like Langchain, DSPy seems to hide stuff a bit too much.

So this article seems surprising since it emphasizes more the non prompt optimization aspects. If that was the selling point I would rather use something like Pydantic AI when I already use Pydantic for so much of the rest.

by sbpayne 2 hours ago

I think the reality is that prompt optimization is one of the only "legible benefits" (i.e. easy to understand why it's valuable).

But I think it misses the point of what Dspy "is". It's less that Dspy is about prompt optimization and more that, Dspy encourages you to design your systems in a way that better _enables_ optimization.

You can apply the same principles without Dspy too :)

by tech_hutch 46 minutes ago

I read the title as "If DarkSydePhil-y is so great, why isn't anyone using it?"

by pjmlp 4 hours ago

Never heard of it, that is already a reason.

by sbpayne 4 hours ago

hahaha this is true!

by deepsquirrelnet 3 hours ago

Good article, and I think the "evolution of every AI system" is spot on.

In my opinion, the reason people don't use DSPy is because DSPy aims to be a machine learning platform. And like the article says -- this feels different or hard to people who are not used to engineering with probabilistic outputs. But these days, many more people are programming with probability machines than ever before.

The absolute biggest time sink and 'here be dragons' of using LLMs is poke and hope prompt "engineering" without proper evaluation metrics.

> You don’t have to use DSPy. But you should build like someone who understands why it exists.

And this is the salient point, and I think it's very well stated. It's not about the framework per se, but about the methodology.

by sbpayne 3 hours ago

yeah this is the main point I wanted to get across! I rarely recommend people to use Dspy; but I think Dspy is often so polarizing that people "throw out the baby with the bathwater". They decide not to use Dspy, but also don't learn from the great ideas it has!

by Silamoth 1 hour ago

Am I the only one disappointed this was about some LLM slop and not digital signal processing? DSP is a well-established technical acronym, so I expected to hear about a new Python DSP library. Oh well.

by _andrei_ 3 hours ago

Almost all the points are not about what DSPy is mainly supposed to offer. What it's supposedly great at is automatic optimization; for everything else... who the hell puts Python in production just to make some API calls? There are "frameworks" available in all the better languages, but the constructs behind them are not that complicated. And why does DSPy even try to compete with LangChain/Graph/crap?

by sbpayne 2 hours ago

I think automatic optimization is valuable, but it's not what Dspy "is"; you can see this consistently through @lateinteraction's tweets.

And hopefully it's clear enough from the post: I'm not necessarily suggesting people use Dspy, just that there are important lessons to take with you, even if you don't use it :)

by lysecret 4 hours ago

Main reason to me is that it's layer upon layer on top of the base LLM calls with not so much to show for it. Also a lot of native features (for example, Gemini's native structured responses) aren't well supported.

by ijk 4 hours ago

This matches my experience with Dspy. I ended up removing it from our production codebase because, at the time, it didn't quite work as effectively as just using Pydantic and so forth.

The real killer feature is the prompt compilation; it's also the hardest to get to an effective place and I frequently found myself needing more control over the context than it would allow. This was a while ago, so things may have improved. But good evals are hard and the really fancy algorithms will burn a lot of tokens to optimize your prompts.

by sbpayne 4 hours ago

Yes! I have also felt this. I highly recommend taking a look at Maxime's template adapter: https://github.com/dspy-community/dspy-template-adapter

I think it solves some of this friction!

by love2read 2 hours ago

I really enjoyed this blog format. I think it explained the problem well in a way that made it immediately clear why the solution solved the problem when shown DSPy.

by sbpayne 2 hours ago

Thank you! Let me know if anything could be more clear, always something I can improve here I'm sure :)

by QuadmasterXLII 4 hours ago

If you find yourself adding a database because that's less painful than regular deployments from your version control, something is hair on fire levels of wrong with your CICD setup.

reply

upvote

by sbpayne4 hours ago|

[-]

I think this misunderstands the need for iteration! Maybe I could have written it more clearly :).

The reality is that you don't want to re-deploy for every prompt change, especially early on. You want to get a really tight feedback loop. If a prompt change requires a re-deploy, that is usually too slow. You don't have to use a database to solve this, but it's pretty common to see in my experience.

reply

upvote

by ijk4 hours ago|

[-]

I've been reaching for BAML when I really need prompt iteration at speed.

reply

upvote

by sbpayne5 hours ago|

[-]

I consistently hear great things from Dspy users. At the same time, it feels like adoption is always low.

Stranger still: it seems like every company I have worked with ends up building a half-baked version of Dspy.

reply

upvote

by CuriouslyC4 hours ago|

[-]

Two issues:

1. People don't want to switch frameworks, even though you can pull prompts generated by DSPy and use them elsewhere, it feels weird.

2. You need to do some up-front work to set up some of the optimizers which a lot of people are averse to.

reply

upvote

by brokensegue4 hours ago|

[-]

i've tried it a few times and it's never really helped as much as i expected. though i know they've released a couple times since I last tried it.

reply

upvote

by sbpayne4 hours ago|

[-]

Yeah, what I'm trying to get across here is that Dspy does not solve an immediate problem, which is why many feel this way and consequently why it doesn't have great adoption!

But on the other hand, I think people unintentionally end up re-implementing a lot of Dspy.

reply

upvote

by msp263 hours ago|

[-]

> Data extraction tasks are amongst the easiest to evaluate because there’s a known “right” answer.

Wrong. There can be a lot of subjectivity and pretending that some golden answer exists does more harm and narrows down the scope of what you can build.

My other main problem with data extraction tasks, and why I'm not satisfied with any of the existing eval tools, is that the schemas I write can change drastically as my understanding of the problem increases. Nothing really seems to handle that well; I mostly just resort to reading diffs of what happens when I change something and reading the input/output data very closely. Marimo is fantastic for anything visual like this btw.

Also there is a difference between: the problem in reality → the business model → your db/application schema → the schema you send to the LLM. And to actually improve your schema/prompt you have to be mindful of the entire problem stack and how you might separate things that are handled through post processing rather than by the LLM directly.
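To make the last two layers of that stack concrete, here is a minimal sketch, assuming a made-up invoice example. `LLMInvoice`, `Invoice`, and `to_application` are hypothetical names used only for illustration; the point is that the schema sent to the model can stay loose and string-based while normalization happens in post-processing.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass
class LLMInvoice:
    """Hypothetical schema sent to the LLM: plain strings, easy to produce."""
    vendor: str
    issued_on: str  # ISO date string, e.g. "2024-03-01"
    total: str      # human-formatted amount, e.g. "1,200.50"


@dataclass
class Invoice:
    """Hypothetical application schema: typed and normalized."""
    vendor: str
    issued_on: date
    total_cents: int


def to_application(raw: LLMInvoice) -> Invoice:
    """Normalize model output in code instead of asking the LLM to do it."""
    amount = Decimal(raw.total.replace(",", ""))
    return Invoice(
        vendor=raw.vendor.strip(),
        issued_on=date.fromisoformat(raw.issued_on),
        total_cents=int(amount * 100),
    )


# Example: a raw model response mapped into the application type.
invoice = to_application(LLMInvoice(" Acme ", "2024-03-01", "1,200.50"))
```

Keeping the two types separate means the application schema can change without rewriting the prompt-facing one, and vice versa.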

> Abstract model calls. Make swapping GPT-4 for Claude a one-line change.

And in practice random limitations like structured output API schema limits between providers can make this non-trivial. God I hate the Gemini API.

reply

upvote

by sbpayne3 hours ago|

[-]

This is very true! I could have been more careful/precise in how I worded this. I was really trying to just get across that it's in a sense easier than some tasks that can be much more open ended.

I'll think about how to word this better, thanks for the feedback!

reply

upvote

by sethkim3 hours ago|

[-]

This is extremely true. In fact, from what we see many/most of the problems to be solved with LLMs do not have ground-truth values; even hand-labeled data tends to be mostly subjective.

reply

upvote

by rco87863 hours ago|

[-]

I think they're just saying that data extraction tasks are easy to evaluate because for a given input text/file you can specify the exact structured output you expect from it.
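As a rough sketch of what that looks like, assuming exact match per field is an acceptable metric (which, per the parent comment, it often is not), an eval can be a few lines. The function names and the sample record are invented for illustration.

```python
from typing import Any


def field_accuracy(expected: dict[str, Any], actual: dict[str, Any]) -> float:
    """Fraction of expected fields whose extracted value matches exactly."""
    if not expected:
        return 1.0
    hits = sum(
        1 for key, value in expected.items()
        if key in actual and actual[key] == value
    )
    return hits / len(expected)


def evaluate(cases: list[tuple[dict[str, Any], dict[str, Any]]]) -> float:
    """Mean field accuracy over (expected, actual) pairs; 0.0 if no cases."""
    if not cases:
        return 0.0
    return sum(field_accuracy(e, a) for e, a in cases) / len(cases)


# Hypothetical labeled example and model output: one of three fields is wrong.
expected = {"invoice_id": "INV-17", "total": 120.5, "currency": "EUR"}
actual = {"invoice_id": "INV-17", "total": 120.5, "currency": "USD"}
score = field_accuracy(expected, actual)
```

Here `score` comes out to two thirds, since only `currency` differs from the expected output.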

reply

upvote

by Lerc4 hours ago|

[-]

If [programming_language] is so great, why isn't anyone using it?

For many of the same reasons. A plethora of alternatives, personal preference, weird ideology, appropriateness for the task, inertia, not-invented-here.

The list goes on.

reply

upvote

by jatins4 hours ago|

[-]

Would have been nice if the post actually showed how Dspy does the things that were handrolled

reply

upvote

by sbpayne4 hours ago|

[-]

This is great feedback! I'll work on an update tonight :)

reply

upvote

by LoganDark4 hours ago|

[-]

This article seemingly misses any explanation of what DSPy even is or why it's supposedly so complicated and unfamiliar. Supposedly it solves the problems illustrated in the article, but it isn't explained how.

reply

upvote

by sbpayne3 hours ago|

[-]

Great feedback! I took for granted that people reading would be familiar with what Dspy is. I'll try to add this in tonight to introduce folks better. Thank you!

reply

upvote

by simopa4 hours ago|

[-]

"Great engineers write bad AI code" made my day ;)

reply

upvote

by sbpayne4 hours ago|

[-]

hahaha this has just been my entire last few years of experience :)

reply

upvote

by tilt4 hours ago|

[-]

Curious what you think of https://github.com/pipevals/pipevals (author)

reply

upvote

by sbpayne4 hours ago|

[-]

I have never heard of this! I took a quick look. I think I'm definitely not in the right audience for a tool like this, as I am more comfortable just writing code. But I think putting a UI over things like this _forces_ the underlying system to be more declarative...

So in practice I imagine you get at a lot of the same ideas / benefits!

reply

upvote

by dzonga4 hours ago|

[-]

@sbpayne - very useful info and pricing page as well.

useful for upcoming consultants to learn how to price services too.

reply

upvote

by sbpayne4 hours ago|

[-]

Highly recommend following @jxnl on X for consulting / positioning / pricing

reply

upvote

by AIorNot2 hours ago|

[-]

I kind of like BAML https://boundaryml.com/ been using it in production

Edit: read the article, it's really good. That cycle of AI engineering progression is spot on.

reply

upvote

by tinyhouse4 hours ago|

[-]

A lot of these ideas, like Dspy and RLM (from the same people IIRC), are more marketing than solving a real problem.

reply

upvote

by sbpayne4 hours ago|

[-]

This is a surprising take to me! Would love to learn more about what you mean. I feel like the problems they solve seem so direct to me. For example: RLMs are an approach to long context problems. Not every problem is a good fit for RLMs for sure, but I can see some problems where I imagine it would work well!

reply

upvote

by TZubiri4 hours ago|

[-]

>"Stage 2: “Can we tweak the prompt without deploying?”

Are we playing philosophy here? If you move some part of the code from the repo and into a database, then changing that database is still part of the deployment, but now you've just given your versioning an identity crisis. Just put your prompts in your git repo and say no when someone requests that an anti-pattern be implemented.

reply

upvote

by sbpayne4 hours ago|

[-]

I think the core challenge here is that being able to (in "development") quickly change the prompt or other parameters and re-run the system to see how it changes is really valuable for making a tight iteration loop.

It's annoying/difficult in practice if this is strictly in code. I don't think a database is necessarily the way to go, but it's just a common pattern I see. And I really strongly believe this is more of a need for a "development time override" than the primary way to deploy to production, to be clear.
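A minimal sketch of such a development-time override, assuming the prompt's default stays in code and a local file can replace it only in development. `APP_ENV` and `PROMPT_OVERRIDE_FILE` are hypothetical variable names, not part of any library.

```python
import os
from pathlib import Path

# The default prompt lives in code, so it is versioned and deployed with the app.
DEFAULT_PROMPT = "Extract the invoice fields from the text below."


def load_prompt(default: str = DEFAULT_PROMPT) -> str:
    """Return the prompt, honoring a file override only in development.

    APP_ENV and PROMPT_OVERRIDE_FILE are hypothetical environment variables.
    """
    override = os.environ.get("PROMPT_OVERRIDE_FILE")
    if os.environ.get("APP_ENV") == "development" and override:
        path = Path(override)
        if path.is_file():
            # Re-read on every call so edits show up without a restart.
            return path.read_text(encoding="utf-8").strip()
    return default
```

In production the override is ignored, so git stays the single source of truth for what actually ships.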

reply

upvote

by markab214 hours ago|

[-]

I think the entire premise that prompting is the surface area for optimizing the application is fundamentally the wrong framing, in the same way that "a better CPAN will save CGI" was in 1998. It's solving the wrong problems now, and the limitations in context and model intelligence require a tool like Dspy.

The only thing I'd grab dspy for at this point is to automate the edges of the agentic pipeline that could be improved with RL patterns. But if that is true, you're really shorting yourself by giving your domain DSPY. You should be building your own RL learning loops.

My experience: If you find yourself reaching for a tool like Dspy, you might be sitting on a scenario where reinforcement learning approaches would help even further up the stack than your prompts, and you're probably missing where the real optimization win is. (Think bigger)

reply

upvote

by sbpayne4 hours ago|

[-]

Yeah, I find it hard to recommend Dspy. At the same time, I can't escape the observation that many companies are re-implementing a lot of parts of it. So I think it's important to at least learn from what Dspy is :)

reply

upvote

by villgax4 hours ago|

[-]

Nobody uses it except for maybe the weaviate developer advocates running those jupyter cells.

reply
