Introduction
We recently stumbled across the first ever Hacker News thread discussing Attention Is All You Need, the landmark paper that describes the transformer architecture. The work had come out in the summer of 2017, but it took a while for the wider AI community to begin to grasp its significance and build on it.
Most commenters at the time expressed interest or admiration. Some didn’t understand it. Others thought that the paper would probably be out of date in a few months. And one person asked what it could do for capsule networks.
But … whatever happened to capsule networks?
Gone but not forgotten
For those who don’t remember what was hot and what was not back in 2017, capsule networks or ‘capsnets’ were an attempt to improve the recognition and understanding of visual information. Convolutional neural networks (CNNs), the most popular architecture for image processing, computer vision and related fields, were good at recognizing features in images (like edges or textures), but historically struggled to understand the spatial relationships between features.
Proposed by Geoffrey Hinton and his collaborators at Google Brain, capsnets were designed to overcome this by capturing both the features themselves and the hierarchical relationships between them. Instead of relying on individual neurons, capsnets used clusters of neurons called ‘capsules’ that could output richer information about what they detected in an image, such as its size and position. And unlike traditional neural networks, where connections between layers are fixed, capsnets used ‘dynamic routing’ to let each capsule send its output to the most appropriate capsules in the next layer.
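To make the mechanism concrete, here is a minimal NumPy sketch of the routing-by-agreement procedure described in the original capsule networks paper. The shapes, function names, and toy data are our own illustration, not taken from any reference implementation.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Non-linearity that keeps a vector's orientation but bounds its length to [0, 1)."""
    norm2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def dynamic_routing(u_hat, num_iters=3):
    """u_hat: predictions from lower capsules, shape (num_in, num_out, dim_out)."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                            # routing logits, start uniform
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # softmax over output capsules
        s = (c[..., None] * u_hat).sum(axis=0)                 # weighted sum per output capsule
        v = squash(s)                                          # output capsule vectors
        b = b + (u_hat * v[None]).sum(axis=-1)                 # agreement strengthens the route
    return v

# Toy usage: 6 lower-level capsules each predicting 3 higher-level 8-D capsules.
rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 3, 8))
print(dynamic_routing(u_hat).shape)  # (3, 8)
```

The point to notice is that the routing coefficients are recomputed for every input at inference time, which is exactly where the extra computation discussed below comes from.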
By capturing these hierarchical relationships, capsnets were also expected to generalize better to new viewpoints of objects. Conceptually, the idea has something in common with today’s popular mixture-of-experts (MoE) models and multi-headed self-attention, which also route computation to subsets of a network or of the input tokens during inference, although lingering questions remain about how much specialization individual experts in MoE models actually exhibit.
Capsnets were quickly branded as “potentially revolutionary” and an “amazing breakthrough”, and inspired a wave of follow-up research.
But capsnets, while containing some interesting high-level ideas, did not emerge as a new frontier in deep learning for a few different reasons:
They were significantly more computationally intensive than CNNs. Dynamic routing, key to their performance, requires additional computation to determine how capsules should be connected.
This made it hard to scale them up to large datasets or deep architectures, unlike CNNs, which have been optimized extensively.
Capsnets were also harder to train, as they were very sensitive to their initial settings. Researchers had to invest more time carefully tuning their hyperparameters.
As a new technique, they also suffered from a standing start: CNNs offered practitioners a rich ecosystem of pre-trained models (e.g. ResNet, Inception), while no such ecosystem existed for capsnets.
All of these drawbacks might have been tolerable, but for one problem. While theoretically elegant, capsnets did not offer a real-world performance advantage over CNNs large enough to justify all this additional work on the part of researchers.
Many of the limitations of CNNs in 2017 were fixable with specific techniques (e.g. data augmentation) or improved organically, as better hardware and software made it easier to train larger models on ever larger datasets. CNNs may not, in a theoretical sense, have captured hierarchical relationships as well as capsnets, but they worked well enough in practice. Capsnets quickly became an unwieldy solution in search of a problem.
It’s not just the capsules
This isn’t designed to be a cheap retrospective shot at people exploring new computer vision techniques in 2017. Instead, we believe that capsnets and other near misses provide us with useful (reverse) indicators when it comes to judging the success of new and emerging methods.
To take another example, federated learning, introduced by Google in 2016, was seen by many as the future for a new kind of decentralized, privacy-preserving model of AI training. The idea was that models could be trained across decentralized servers or mobile devices, while the data was hosted locally. This would allow researchers working in sensitive domains like healthcare to access training data via partnerships, while reducing concerns about what they were ‘really doing’ with the data.
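The core training loop is simple to sketch. Below is a minimal federated-averaging-style round in NumPy; the linear model, client data, and function names are invented purely for illustration and do not come from any federated learning framework.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """Each client refines the global weights on its own data (toy linear regression)."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(global_w, clients):
    """Server sends weights out, clients train locally, server averages the results."""
    local_ws = [local_update(global_w, X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    return np.average(local_ws, axis=0, weights=sizes)  # weight by client dataset size

# Toy usage: three clients whose raw data never leaves their own entry.
rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])
clients = []
for n in (50, 80, 30):
    X = rng.normal(size=(n, 2))
    clients.append((X, X @ true_w + 0.1 * rng.normal(size=n)))

w = np.zeros(2)
for _ in range(10):
    w = federated_round(w, clients)
print(w)  # approaches [2, -1] without ever pooling the raw data
```

Everything that makes this hard in practice - stragglers, dropped devices, heterogeneous data, secure aggregation - sits outside this tidy loop, which is where the engineering effort described below goes.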
There was a surge of enthusiasm among researchers and the idea enjoyed significant popularity within government and healthcare circles, where it became seen as a magic solution to various regulatory and compliance challenges.
Again, theoretical elegance collided with the harsh reality of scaling in the real world. Managing distributed data sources, ensuring consistent updates, and handling asynchronous communication between devices required sophisticated infrastructure and significant engineering effort, while also being computationally expensive.
It also ran up against mundane constraints like bandwidth and latency, risking uneven learning and inconsistent model performance.
This made it all but impossible for the technique to scale across large datasets, especially compared with the simpler approach of encrypting data and applying common control policies. And, critically for a supposedly ‘trustless’ technique, non-technical decision-makers often didn’t trust that their data was really staying on their servers, or that it couldn’t be re-identified from the globally trained model.
It also turned out there wasn’t a clever fix to any of the regulatory and compliance issues that federated learning was meant to solve. All the same issues surrounding liability, data ownership, and cross-border data transfers remained unsolved.
These challenges have plagued other briefly voguish ideas, including:
Resurrecting symbolic AI for complex reasoning - promised more interpretable and structured reasoning capabilities compared to neural networks, but lacked the learning and generalization abilities of more modern approaches.
Neural Architecture Search - offered an automated way to design neural network architectures optimized for a given task instead of relying on manual design, but required very careful human design of the search space and often consumed significant computational resources only to rediscover, or make minor tweaks to, existing architectures.
Quantum machine learning - raised hopes of exponential speedups by leveraging quantum computation, but building practical quantum advantage has faced significant hardware and algorithmic hurdles.
It’s not just the software
We see the same patterns in the world of hardware, as governments and researchers alike look to break their dependence on GPUs. Over the past eight years or so, for example, there have been repeated false dawns for neuromorphic chips.
These chips aim to mimic the low-power operation of biological neural networks, significantly reducing energy consumption. Their parallel, distributed processing architecture is also meant to enable the faster processing of certain types of AI workloads and allow for greater scalability.
As a result, neuromorphic chips have attracted significant investment from corporations (e.g. Intel) and prominent figures in the AI world (e.g. Sam Altman), as well as financial support from governments. Despite the billions of dollars thrown at the problem, these efforts have yielded little of practical use so far. GPUs reign supreme, despite attempts to innovate them out of relevance.
As with the software examples above, theoretical elegance lost out to power, flexibility, and scalability. In the same way that capsule networks’ theoretically better grasp of hierarchies lost out to CNNs’ ‘good enough with scale’, the theoretically better adaptability of a neuromorphic chip loses out to the ‘good enough and much more powerful’ GPU.
But why was attention all we needed?
Our journey through the near-misses was prompted by reflecting on the early days of the transformer architecture, which has obviously been a phenomenal success. What was different here?
Firstly, it tackled a problem that couldn’t be fixed with scale. The most common pre-transformer method of processing and understanding language - recurrent neural networks (RNNs) - suffered from a genuine shortcoming. The fundamental issue with RNNs was their sequential, recurrent nature. As the sequence length increased, information from the earlier parts of the sequence had a harder and harder time influencing the later parts. This is known as the vanishing gradient problem: the gradients used to train the model effectively disappear as they propagate back through the many recurrent steps. And even where this was mitigated (gated variants like LSTMs helped), the model still had to compress everything it had seen into a fixed-size hidden state, struggling to learn what to keep and what to throw away. Throwing more compute or data at the problem couldn’t correct for the innate limitations of the architecture.
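The effect is easy to demonstrate numerically. The toy sketch below (our own illustration, with made-up dimensions and weights) pushes a gradient back through 100 steps of a simple tanh RNN by multiplying the per-step Jacobians.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
W = rng.normal(size=(d, d))
W *= 0.95 / np.linalg.norm(W, 2)     # scale so the largest singular value is 0.95

h = rng.normal(size=d)
grad = np.eye(d)                     # accumulates d h_T / d h_0
for t in range(1, 101):
    h = np.tanh(W @ h)
    jacobian = np.diag(1 - h ** 2) @ W   # Jacobian of one tanh recurrence step
    grad = jacobian @ grad
    if t % 25 == 0:
        print(t, np.linalg.norm(grad))   # norm collapses toward zero with depth
```

Because the recurrent weights are scaled so their largest singular value sits below one, the gradient norm shrinks geometrically, and the earliest tokens barely influence training.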
The transformer architecture offered a direct solution. By allowing the model to selectively focus on relevant parts of the input, regardless of their position in the sequence, transformers were able to overcome the limitation of vanishing gradients.
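A minimal sketch of scaled dot-product self-attention (random placeholder weights and inputs, single head, no masking) shows the contrast: every position attends to every other position in a single step, with no long chain of recurrent Jacobians in between.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Each output row is a weighted mix of the whole sequence - no recurrence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # pairwise relevance, all positions at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the sequence
    return weights @ V

rng = np.random.default_rng(1)
seq_len, d_model = 5, 16
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 16)
```

Note also that the score matrix for all positions is computed in one matrix multiplication, which is what makes the approach so friendly to parallel hardware.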
As well as offering a direct solution, and unlike many of the other ‘solutions’ above, the architecture scaled well. Transformers process different parts of the input at the same time rather than sequentially, which allows them to take full advantage of the parallel processing capabilities of modern hardware like GPUs.
Added to that, transformers are built from a collection of well-defined modules: the self-attention mechanism, feed-forward neural networks, and layer normalization. Without taking anything away from the architecture’s sophistication, this modular design makes it easier to implement.
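A rough, single-head NumPy sketch of how those modules compose into one encoder block is below. It is a deliberate simplification (no multi-head attention, no learned layer-norm parameters, no masking or positional encodings), intended only to show the modular structure.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def encoder_block(X, params):
    # Module 1: self-attention with a residual connection and layer norm
    X = layer_norm(X + attention(X, params["Wq"], params["Wk"], params["Wv"]))
    # Module 2: position-wise feed-forward network, again with residual + norm
    hidden = np.maximum(0, X @ params["W1"])          # ReLU
    return layer_norm(X + hidden @ params["W2"])

rng = np.random.default_rng(3)
d, d_ff = 16, 64
params = {k: rng.normal(size=s) * 0.1 for k, s in
          [("Wq", (d, d)), ("Wk", (d, d)), ("Wv", (d, d)),
           ("W1", (d, d_ff)), ("W2", (d_ff, d))]}
print(encoder_block(rng.normal(size=(5, d)), params).shape)  # (5, 16)
```

Stacking blocks like this, plus embeddings and positional information, gives the full encoder.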
This is not to say that the architecture was perfect in its original form. Self-attention scales quadratically with sequence length, driving up computational cost and memory requirements for long inputs. Refinements like FlashAttention overcame much of this in practice by restructuring the computation around the GPU memory hierarchy, computing the same attention with far less memory traffic, while other approaches (such as sparse or linear attention) trade away some exactness for further gains. We believe there are many more optimization tricks to be discovered as networks and chips (co)evolve.
Closing thoughts
What does this mini-history lesson teach us? Firstly, it’s incredibly difficult for any new technique or approach to gain lasting traction. When we published our recent 5-year retrospective on the State of AI Report, it was striking how progress in individual fields or disciplines would happen slowly and then very quickly.
As we saw above with both hardware and software - it’s not enough for something just to be clever. A new approach has to provide an enduring performance advantage that justifies researchers junking what they were doing before, sometimes transitioning away from well-developed ecosystems of tools and embracing a standing start. That’s a hard bar for anyone to clear.