For example, you start with the raw coordinates of the snake in a snake game, but you can now calculate how many escape routes the snake has, and train on that.
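A minimal sketch of that kind of hand-engineered feature. The grid representation here (cells as `(x, y)` tuples, blocked cells in sets) is hypothetical, not from any particular snake implementation:

```python
def escape_routes(head, snake_body, walls, grid=(10, 10)):
    """Crude 'escape routes' feature: count the free cells
    adjacent to the snake's head.

    head       -- (x, y) of the head
    snake_body -- set of (x, y) cells occupied by the body
    walls      -- set of (x, y) blocked cells
    grid       -- (width, height) of the board
    """
    blocked = set(snake_body) | set(walls)
    w, h = grid

    def free(cell):
        x, y = cell
        return 0 <= x < w and 0 <= y < h and cell not in blocked

    # Check the four neighbors of the head.
    return sum(
        free((head[0] + dx, head[1] + dy))
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
    )
```

A richer version might flood-fill from each free neighbor to measure reachable area, but even this derived count is a far more learnable signal than raw coordinates.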
I’ve spent most of the last 20 years doing reinforcement learning, and it is exceptionally simple conceptually.
The challenge is data acquisition and the right kinds of process frameworks.
But non-structured data? Pretty pointless to hand off to a neural network imo.
Yes, a DT on raw pixel values, or a DT on raw time-series values, will in general be quite terrible.
That said, the convolutional structure is hard-coded in those neural nets; only the weights are learned. It is not that the network discovered on its own that convolutions are a good idea. So NNs too rely on human insight and structuring, upon which they can then build.
Normalising the numeric features to a common range has been adequate. This too is strictly not necessary for DTs, but pretty much mandatory for linear models. (DTs are very robust to scale differences among features; linear models are quite vulnerable to the same.)
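That normalisation can be as simple as a min-max rescale fit on the training columns. A sketch, not the exact preprocessing the comment refers to:

```python
def minmax_fit(columns):
    """Record per-feature (min, max) over a list of feature columns."""
    return [(min(col), max(col)) for col in columns]

def minmax_transform(columns, stats, eps=1e-12):
    """Rescale each feature to [0, 1] using the fitted stats.

    The eps guard keeps constant columns from dividing by zero.
    """
    return [
        [(v - lo) / max(hi - lo, eps) for v in col]
        for col, (lo, hi) in zip(columns, stats)
    ]
```

Fitting on the training set and reusing the same stats at inference time is the part that matters for the linear model; the DT's splits are unchanged by any monotone rescaling.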
One can think of each tree path from root to leaf as a data-driven formulation/synthesis of a higher-level feature built out of logical conjunctions (the 'AND' operation).
These auto-synthesized/discovered features are then OR-ed at the top. DTs are good at capturing multi-feature interactions that single-layer linear models can't.
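To make the AND/OR reading concrete, here's a toy tree (the splits are hypothetical) flattened into one conjunction per root-to-leaf path, with each label's paths then OR-ed together:

```python
# A toy decision tree as nested tuples:
# (feature, threshold, left_subtree, right_subtree), or a leaf label.
TREE = ("age", 30,
        ("income", 50, "reject", "accept"),
        "accept")

def paths_to_rules(tree, conds=()):
    """Yield (label, conjunction) for every root-to-leaf path."""
    if isinstance(tree, str):                      # leaf node
        yield tree, " AND ".join(conds) or "TRUE"
        return
    feat, thr, left, right = tree
    yield from paths_to_rules(left, conds + (f"{feat} <= {thr}",))
    yield from paths_to_rules(right, conds + (f"{feat} > {thr}",))

rules = list(paths_to_rules(TREE))
# The tree predicts "accept" iff any of these conjunctions holds:
accept = [c for lbl, c in rules if lbl == "accept"]
```

Here `accept` comes out as the disjunction `(age <= 30 AND income > 50) OR (age > 30)`: a small DNF formula, which a single linear layer cannot express.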
NNs certainly synthesize higher-level features, but what does not get highlighted enough is that the learning-theory-motivated AdaBoost algorithm and its derivatives do that too.
Their modus operandi is: "BYOWC, bring your own weak classifier; I will reweigh the data in such a way that your unmodified classifier will discover a higher-level feature on its own. Later you can combine these higher-level features linearly, by voting, or by averaging."
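That BYOWC recipe is short enough to write out in full. A bare-bones AdaBoost over decision stumps on a toy 1-D dataset (illustrative, not tuned; real libraries handle multi-class and regularization):

```python
import math

def best_stump(xs, ys, w):
    """Weak learner: a stump h(x) = s if x <= t else -s,
    chosen to minimise the *weighted* error on the data."""
    best = None
    for t in [x + 0.5 for x in xs]:
        for s in (+1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if (s if xi <= t else -s) != yi)
            if best is None or err < best[0]:
                best = (err, t, s)
    return best

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    w = [1.0 / n] * n                  # start with uniform weights
    model = []                         # list of (alpha, t, s)
    for _ in range(rounds):
        err, t, s = best_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, t, s))
        # Reweigh: mistakes gain weight, hits lose weight.
        w = [wi * math.exp(-alpha * yi * (s if xi <= t else -s))
             for xi, yi, wi in zip(xs, ys, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return model

def predict(model, x):
    """Final classifier: weighted vote of the stumps."""
    score = sum(a * (s if x <= t else -s) for a, t, s in model)
    return 1 if score >= 0 else -1

# No single stump can label the lone -1 correctly without errors,
# but three reweighed stumps together classify everything.
xs, ys = [1, 2, 3, 4, 5], [1, 1, -1, 1, 1]
model = adaboost(xs, ys)
```

Note the stump learner itself is never modified; only the data weights change between rounds, which is exactly the "bring your own weak classifier" contract.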
I personally favor differentiable models over trees, but have to give credit where it's due, DTs work great.
What leaves me intellectually unsatisfied about decision trees is that the space of trees cannot be searched in any nice principled way.
Or, to describe it in terms of feature discovery: in a DT there's no notion of progressing smoothly towards better high-level features that track the end goal at every step of the synthesis (in fact, greedy hill climbing hurts performance in the case of decision trees). DTs use a combinatorial search over partitionings of the feature space, essentially by trial and error.
Neural nets have a smooth way of devising these higher-level features informed by the end goal: roll infinitesimally in the direction of steepest progress. That's so much more satisfying than trial and error.
As for classification performance, where DTs have struggled is on cases where columns are sparse (low entropy to begin with).
Another weakness of DTs is the difficulty of achieving high runtime throughput, unless the DT is compiled into machine code. Walking a runtime tree structure with not-so-predictable branching probabilities doesn't run very fast on our standard machines. Compared to DNNs, though, this is a laughable complaint: DT throughput is one or two orders of magnitude higher.
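The "compile it" fix is mechanical: unroll the tree into nested conditionals and let the host compiler deal with it. A toy sketch (generating Python source for brevity; in practice you'd emit C or machine code):

```python
# Toy tree: (feature_index, threshold, left, right), or a leaf value.
TREE = (0, 5.0,
        (1, 2.0, "low", "mid"),
        "high")

def walk(tree, x):
    """Interpreted prediction: chases a pointer at every node."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

def compile_tree(tree):
    """Emit the same tree as nested if/else source, then exec it."""
    def emit(node, indent):
        pad = "    " * indent
        if not isinstance(node, tuple):
            return f"{pad}return {node!r}\n"
        f, t, left, right = node
        return (f"{pad}if x[{f}] <= {t}:\n" + emit(left, indent + 1)
                + f"{pad}else:\n" + emit(right, indent + 1))
    src = "def predict(x):\n" + emit(tree, 1)
    namespace = {}
    exec(src, namespace)
    return namespace["predict"]

predict = compile_tree(TREE)
```

The compiled function gives identical answers to the interpreter walk, but the node structure is now straight-line code instead of heap-allocated tuples chased at prediction time.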
Oblique Decision Trees, Model Trees (M5 trees, for example), Logistic Model Trees (LMT), or Hierarchical Mixtures of Experts (HME).
I mention restricted oblique trees in passing in my original comment. In my experience, oblique trees tend to add considerable complexity, and the others more so. Of course, whether the complexity is warranted will depend on the dataset.
The merit of what I used is in its simplicity. Any simple ML library would have a linear classifier and a tree learner.
Super easy to implement, train, maintain, and debug. A one-to-two-person team can handle this fine.
Could you explain what "equi-label regions having a partitioned structure" means?
Consider connected regions of the domain that share the same label, much like countries on a political map. The situation where this map has a short description in terms of recursive subdivision of space is what I am calling a partitioned structure. It's really rather tautological.
It turns out many datasets have such a fractal-like nature, but one where the partitioning needs to be cut off at a certain depth rather than continued to infinity.
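As a toy illustration of that cut-off recursion, here's a label function on the unit square defined purely by recursive halving down to a fixed depth (purely illustrative, not any real dataset):

```python
def region_label(x, y, depth=3):
    """Label a point of [0,1)^2 by recursive subdivision into halves,
    stopped after `depth` levels.

    Every cell of the resulting 2^depth x 2^depth grid carries one
    label, so a depth-limited axis-aligned tree describes the map
    exactly -- the 'partitioned structure, cut off at a depth'.
    """
    label = 0
    for d in range(1, depth + 1):
        cells = 2 ** d
        bit_x = int(x * cells) % 2     # which half of x at this level?
        bit_y = int(y * cells) % 2
        label ^= bit_x ^ bit_y         # fold into a 2-class parity
    return label
```

At depth 1 this is a plain 2x2 checkerboard; deeper depths refine it, and the label is constant inside each finest cell, which is exactly the kind of equi-label region a depth-limited DT carves out natively.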
Mine is a quick but effective practical hack, fuelled by a little bit of insight.