For example, you start with the raw coordinates of the snake in a snake game, but you can now calculate how many escape routes the snake has, and train on that.
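A minimal sketch of that kind of hand-engineered feature. The grid representation here (cells as `(x, y)` tuples, blocked cells in sets) is hypothetical, not from any particular snake implementation:

```python
def escape_routes(head, snake_body, walls, grid=(10, 10)):
    """Crude 'escape routes' feature: count the free cells
    adjacent to the snake's head.

    head       -- (x, y) of the head
    snake_body -- set of (x, y) cells occupied by the body
    walls      -- set of (x, y) blocked cells
    grid       -- (width, height) of the board
    """
    blocked = set(snake_body) | set(walls)
    w, h = grid

    def free(cell):
        x, y = cell
        return 0 <= x < w and 0 <= y < h and cell not in blocked

    # Check the four neighbors of the head.
    return sum(
        free((head[0] + dx, head[1] + dy))
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
    )
```

A richer version might flood-fill from each free neighbor to measure reachable area, but even this derived count is a far more learnable signal than raw coordinates.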
I’ve spent most of the last 20 years doing reinforcement learning, and it is exceptionally simple conceptually.
The challenge is data acquisition and the right kinds of process frameworks.
But non-structured data? Pretty pointless to hand off to a neural network imo.
Yes, a DT on raw pixel values, or a DT on raw time-series values, will in general be quite terrible.
That said, the convolutional structure is hard-coded in those neural nets; only the weights are learned. It is not that the network discovered on its own that convolutions are a good idea. So NNs too rely on human insight and structuring, upon which they can then build.
Normalising the numeric features to a common range has been adequate. This too is strictly not necessary for DTs, but pretty much mandatory for linear models. (DTs are very robust to scale differences among features; linear models are quite vulnerable to the same.)
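That normalisation can be as simple as a min-max rescale fit on the training columns. A sketch, not the exact preprocessing the comment refers to:

```python
def minmax_fit(columns):
    """Record per-feature (min, max) over a list of feature columns."""
    return [(min(col), max(col)) for col in columns]

def minmax_transform(columns, stats, eps=1e-12):
    """Rescale each feature to [0, 1] using the fitted stats.

    The eps guard keeps constant columns from dividing by zero.
    """
    return [
        [(v - lo) / max(hi - lo, eps) for v in col]
        for col, (lo, hi) in zip(columns, stats)
    ]
```

Fitting on the training set and reusing the same stats at inference time is the part that matters for the linear model; the DT's splits are unchanged by any monotone rescaling.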
One can think of each tree path from root to leaf as a data-driven formulation/synthesis of a higher-level feature built out of logical conjunctions (the 'AND' operation).
These auto-synthesized/discovered features are then OR-ed at the top. DTs are good at capturing multi-feature interactions that single-layer linear models can't.
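To make the AND/OR reading concrete, here's a toy tree (the splits are hypothetical) flattened into one conjunction per root-to-leaf path, with each label's paths then OR-ed together:

```python
# A toy decision tree as nested tuples:
# (feature, threshold, left_subtree, right_subtree), or a leaf label.
TREE = ("age", 30,
        ("income", 50, "reject", "accept"),
        "accept")

def paths_to_rules(tree, conds=()):
    """Yield (label, conjunction) for every root-to-leaf path."""
    if isinstance(tree, str):                      # leaf node
        yield tree, " AND ".join(conds) or "TRUE"
        return
    feat, thr, left, right = tree
    yield from paths_to_rules(left, conds + (f"{feat} <= {thr}",))
    yield from paths_to_rules(right, conds + (f"{feat} > {thr}",))

rules = list(paths_to_rules(TREE))
# The tree predicts "accept" iff any of these conjunctions holds:
accept = [c for lbl, c in rules if lbl == "accept"]
```

Here `accept` comes out as the disjunction `(age <= 30 AND income > 50) OR (age > 30)`: a small DNF formula, which a single linear layer cannot express.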
NNs certainly synthesize higher-level features, but what does not get highlighted enough is that the learning-theory-motivated AdaBoost algorithm and its derivatives do that too.
Their modus operandi is: "BYOWC, bring your own weak classifier; I will reweigh the data in such a way that your unmodified classifier will discover a higher-level feature on its own. Later you can combine these higher-level features linearly, by voting, or by averaging."
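That BYOWC recipe is short enough to write out in full. A bare-bones AdaBoost over decision stumps on a toy 1-D dataset (illustrative, not tuned; real libraries handle multi-class and regularization):

```python
import math

def best_stump(xs, ys, w):
    """Weak learner: a stump h(x) = s if x <= t else -s,
    chosen to minimise the *weighted* error on the data."""
    best = None
    for t in [x + 0.5 for x in xs]:
        for s in (+1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if (s if xi <= t else -s) != yi)
            if best is None or err < best[0]:
                best = (err, t, s)
    return best

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    w = [1.0 / n] * n                  # start with uniform weights
    model = []                         # list of (alpha, t, s)
    for _ in range(rounds):
        err, t, s = best_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, t, s))
        # Reweigh: mistakes gain weight, hits lose weight.
        w = [wi * math.exp(-alpha * yi * (s if xi <= t else -s))
             for xi, yi, wi in zip(xs, ys, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return model

def predict(model, x):
    """Final classifier: weighted vote of the stumps."""
    score = sum(a * (s if x <= t else -s) for a, t, s in model)
    return 1 if score >= 0 else -1

# No single stump can label the lone -1 correctly without errors,
# but three reweighed stumps together classify everything.
xs, ys = [1, 2, 3, 4, 5], [1, 1, -1, 1, 1]
model = adaboost(xs, ys)
```

Note the stump learner itself is never modified; only the data weights change between rounds, which is exactly the "bring your own weak classifier" contract.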
I personally favor differentiable models over trees, but have to give credit where it's due, DTs work great.
What leaves me intellectually unsatisfied about decision trees is that the space of trees cannot be searched in any nice principled way.
Or, to describe it in terms of feature discovery: in a DT there's no notion of progressing smoothly towards better high-level features that track the end goal at every step of the synthesis (in fact, greedy hill climbing hurts performance in the case of decision trees). DTs use a combinatorial search over partitionings of the feature space, essentially by trial and error.
Neural nets have a smooth way of devising these higher-level features informed by the end goal: roll infinitesimally in the direction of steepest progress. That's so much more satisfying than trial and error.
As for classification performance, where DTs have struggled is on cases where columns are sparse (low entropy to begin with).
Another weakness of DTs is the difficulty of achieving high runtime throughput, unless the DT is compiled into machine code. Walking a runtime tree structure with not-so-predictable branching probabilities doesn't run very fast on our standard machines. Compared to DNNs, though, this is a laughable complaint: DT throughput is one or two orders of magnitude higher.
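The "compile it" fix is mechanical: unroll the tree into nested conditionals and let the host compiler deal with it. A toy sketch (generating Python source for brevity; in practice you'd emit C or machine code):

```python
# Toy tree: (feature_index, threshold, left, right), or a leaf value.
TREE = (0, 5.0,
        (1, 2.0, "low", "mid"),
        "high")

def walk(tree, x):
    """Interpreted prediction: chases a pointer at every node."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

def compile_tree(tree):
    """Emit the same tree as nested if/else source, then exec it."""
    def emit(node, indent):
        pad = "    " * indent
        if not isinstance(node, tuple):
            return f"{pad}return {node!r}\n"
        f, t, left, right = node
        return (f"{pad}if x[{f}] <= {t}:\n" + emit(left, indent + 1)
                + f"{pad}else:\n" + emit(right, indent + 1))
    src = "def predict(x):\n" + emit(tree, 1)
    namespace = {}
    exec(src, namespace)
    return namespace["predict"]

predict = compile_tree(TREE)
```

The compiled function gives identical answers to the interpreter walk, but the node structure is now straight-line code instead of heap-allocated tuples chased at prediction time.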
Oblique Decision Trees, Model Trees (M5 trees, for example), Logistic Model Trees (LMT), or Hierarchical Mixtures of Experts (HME).
I mention restricted oblique trees in passing in my original comment. In my experience, oblique trees tend to add considerable complexity, and the others more so. Of course, whether the complexity is warranted will depend on the dataset.
The merit of what I used is in its simplicity. Any simple ML library would have a linear classifier and a tree learner.
Super easy to implement, train, maintain, and debug. A one-to-two-person team can handle this fine.
Could you explain what "equi-label regions having a partitioned structure" means?
Consider connected regions of the domain that share the same label, much like countries on a political map. The situation where this map has a short description in terms of recursive subdivision of space is what I am calling a partitioned structure. It's really rather tautological.
It turns out many datasets have such a fractal-like nature, but one where the partitioning needs to be cut off at a certain depth rather than continued to infinity.
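As a toy illustration of that cut-off recursion, here's a label function on the unit square defined purely by recursive halving down to a fixed depth (purely illustrative, not any real dataset):

```python
def region_label(x, y, depth=3):
    """Label a point of [0,1)^2 by recursive subdivision into halves,
    stopped after `depth` levels.

    Every cell of the resulting 2^depth x 2^depth grid carries one
    label, so a depth-limited axis-aligned tree describes the map
    exactly -- the 'partitioned structure, cut off at a depth'.
    """
    label = 0
    for d in range(1, depth + 1):
        cells = 2 ** d
        bit_x = int(x * cells) % 2     # which half of x at this level?
        bit_y = int(y * cells) % 2
        label ^= bit_x ^ bit_y         # fold into a 2-class parity
    return label
```

At depth 1 this is a plain 2x2 checkerboard; deeper depths refine it, and the label is constant inside each finest cell, which is exactly the kind of equi-label region a depth-limited DT carves out natively.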
Mine is a quick but effective practical hack, fuelled by a little bit of insight.