I cannot deny my background as a database administrator. When I started delving into neural networks, I tried to understand how they could possibly function. The first thing I stumbled over was: where the heck is the data stored?
In the clouds? Under a secret stone at a British boarding school?
After all, this has implications not only for data storage but also for privacy, which came into play in a way I wasn’t aware of when I began this journey. Fundamentally, I wanted to find out how these neural networks, specifically Large Language Models like GPT-4, store the huge amounts of data they process during training, and whether they store that data at all.
Storage
It may come as a surprise that the inputs to and outputs from a neural network, such as texts, images, videos, and audio, can be broken down into vectors. But I was astonished when I first understood where neural networks store data. It became clear to me that the data – sentences, lexicons, images (when trained on images), primary literature, books, articles, blog posts, etc. – are not stored in plaintext, binary code, or a (structured) database, as I, a conventional IT professional with a database background, had imagined. Instead, the weights of the neural network represent the knowledge that has been absorbed: the information from the data is ‘stored’ in the parameters, the weights and biases, in a surprisingly different form.
The raw data are not stored within the neural network at all. Rather, the network learns and retains statistical patterns and relationships from the data during training. These are encoded in the network’s weights and biases, which represent the accumulated knowledge in a compressed and abstracted form instead of storing the data explicitly.
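To make this concrete, here is a minimal sketch of what ‘storage in parameters’ means, using a single hypothetical neuron trained on a made-up linear relationship. Nothing here resembles a real LLM, but the principle is the same: after training, the examples are gone and only a weight and a bias remain.

```python
# A single "neuron" learning the relationship y = 2x + 1 from a few examples.
# After training, the examples themselves are gone; what remains are two numbers,
# a weight and a bias. That is what 'storage' means in a neural network.

data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]  # (input, target) pairs

w, b = 0.0, 0.0        # the parameters: initially no "knowledge" at all
learning_rate = 0.01

for epoch in range(5000):
    for x, y_true in data:
        y_pred = w * x + b              # forward pass: the neuron's guess
        error = y_pred - y_true         # how far off the guess was
        w -= learning_rate * error * x  # gradient descent: adjust the weight ...
        b -= learning_rate * error      # ... and the bias to reduce the error

print(f"learned weight = {w:.2f}, bias = {b:.2f}")      # close to 2 and 1
print("prediction for x = 10:", round(w * 10 + b, 2))   # ~21, never seen in training
```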
Training
This last sentence reveals another important characteristic of neural networks: how they are trained. Training works through a rather simple mechanism, backpropagation combined with gradient descent (or one of its variants). The model makes a prediction about which word comes next. Another part of the training setup knows which word is actually supposed to be there. If the model makes the correct prediction, the existing structure is reinforced; if it makes the wrong prediction, the weights are adjusted and a new attempt is made. What this creates, during the initial training run, is a gigantic ‘guessing machine’. The model is adjusted so that the deviation between the ‘correct’ word, the word that really follows the current word (token) in the training data, and the word the network guessed is minimized. That deviation is almost never zero, almost never perfect.
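A toy sketch may help to illustrate this guess-and-correct loop. The following is not how GPT-style models are actually built (they use deep transformer networks); it is a hypothetical one-layer ‘next-word guesser’ on a made-up corpus, but it shows the mechanism: the guess is compared to the true next word, and the weights are nudged so the deviation shrinks.

```python
import numpy as np

# A toy next-token "guessing machine": given the current word, predict the next.
# The "correct" next word is always known from the corpus, so the deviation
# between guess and truth can be minimized by gradient descent.

corpus = "the cat sat on the mat the cat ate the fish".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, V))   # one weight per (current word, next word) pair

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

lr = 0.5
for epoch in range(200):
    for cur, nxt in zip(corpus[:-1], corpus[1:]):
        probs = softmax(W[idx[cur]])   # the model's guess as a probability distribution
        grad = probs.copy()
        grad[idx[nxt]] -= 1.0          # cross-entropy gradient: guess minus truth
        W[idx[cur]] -= lr * grad       # adjust the weights to shrink the deviation

probs = softmax(W[idx["the"]])
print({w: round(float(probs[idx[w]]), 2) for w in vocab})
```

Even after training, the guess following ‘the’ stays spread across ‘cat’, ‘mat’, and ‘fish’, because all three actually follow ‘the’ in the corpus: the deviation never reaches zero.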
Tokens
I just introduced another term and concept here: the token. It is, in a sense, what a byte is in traditional computing: the smallest unit the system works with. In neural networks, a ‘token’ refers to the basic unit of data the system processes. Unlike bytes in traditional computing, which represent data uniformly, tokens vary significantly depending on the type of data being processed. Here’s how tokens manifest across different data types:
Text Tokens: In text processing, tokens are usually words or parts of words. For example, the sentence ‘Hello, world!’ might be tokenized into [‘Hello’, ‘,’, ‘world’, ‘!’]. Advanced systems like BERT use subword tokenization, which might break down ‘unbelievable’ into [‘un’, ‘##believ’, ‘##able’] to handle unknown words more effectively and preserve meaning from known segments.
Of course there are other kinds of tokens, such as image and audio tokens; I will not cover them here. Tokens are the dissected pieces of data that neural networks analyze and learn from. They are not uniform or fixed in size like bytes in computing; rather, they are defined dynamically based on the characteristics of the data and the requirements of the processing task. This flexible approach allows neural networks to adapt and excel in a wide range of applications, from understanding human language to recognizing objects in images and sounds in audio tracks.
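To show what subword tokenization looks like in practice, here is a toy greedy tokenizer in the spirit of WordPiece, the scheme BERT uses. The vocabulary below is made up for illustration; real vocabularies are learned from data and contain tens of thousands of entries.

```python
# A toy greedy subword tokenizer: match the longest known piece from the left,
# marking continuation pieces with the '##' prefix, as BERT-style vocabularies do.

vocab = {"hello", ",", "world", "!", "un", "##believ", "##able"}

def tokenize_word(word):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece      # continuation pieces get the '##' marker
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]              # no known piece fits at this position
        start = end
    return tokens

print(tokenize_word("unbelievable"))  # ['un', '##believ', '##able']
print(tokenize_word("hello"))         # ['hello']
```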
To summarize, as the one and only Andrej Karpathy recently posted:
The history of computing is repeating in an echo, except replace computers that do precise arithmetic on bytes with computers that do statistical arithmetic on tokens.
What is it not good for?
Once I came to realize that, I admit I was a bit relieved: at least in the near future, nobody in their right mind would think about using AI to replace their (relational) database. From something as benign as your cooking recipes to your location coordinates, your bank account, or the doses of your prescription medications, you wouldn’t want a giant gambler (‘statistical arithmetic’, remember) to guess the balance of your account mostly correctly, would you? Companies should not think of replacing the meticulous storage of their records with AI just yet. I suspect the tax authorities in most countries would be less than amused, as would the customers whose orders the AI mixed up. In short, the 50+ years of RDBMS development spent on data consistency cannot be replaced just yet. On the contrary, those data warehouses and lakes are valuable: they just might be the next treasure to train the next model on.
So, faithful data storage is not the domain of AI, which could already have been clear from carefully reading my introduction to AI. But who reads that anyway? Still, AI can learn, can’t it? After all, these models are trained on data.
Catastrophic Forgetting
No, this is not about your keys the other day, or your wife’s birthday.
Catastrophic forgetting, a significant challenge in neural network training, refers to the phenomenon of a model losing, or ‘forgetting’, previously learned information upon learning new data. This happens because the network’s weights, adjusted to accommodate the new knowledge, can overwrite the information that encoded earlier learning, leading to a significant deterioration in performance on tasks the model was previously trained on.
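The effect is easy to reproduce in miniature. The following sketch reuses the single hypothetical neuron from above: it first learns one made-up task, is then trained only on a second, contradictory task, and afterwards can no longer solve the first one, because the same weight had to be rewritten.

```python
# A minimal illustration of catastrophic forgetting with a single neuron.
# Task A: learn y = 2x.  Task B (learned afterwards): learn y = -3x.
# The same weight has to serve both tasks, so learning B overwrites A.

def train(w, examples, lr=0.05, epochs=500):
    for _ in range(epochs):
        for x, y_true in examples:
            error = w * x - y_true
            w -= lr * error * x
    return w

task_a = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]     # y = 2x
task_b = [(1.0, -3.0), (2.0, -6.0), (3.0, -9.0)]  # y = -3x

w = train(0.0, task_a)
print("after task A, prediction for x=2:", round(w * 2, 2))   # ~4: task A learned

w = train(w, task_b)
print("after task B, prediction for x=2:", round(w * 2, 2))   # ~-6: task A 'forgotten'
```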
In AI you have to be very careful with absolutes, but up until now, the LLM will, for the most part, not learn from you directly. ChatGPT may appear to learn within the one session in which you asked it to answer you as Arnold Schwarzenegger in Terminator, but it will not learn that you have a preference for communicating with movie characters in any new chat.
Yes, I am aware of the customize feature of the front end; to me, this looks like a feature of the app rather than of the model. And yes, my ChatGPT notified me a couple of days ago that it will be learning across all sessions from now on. Let’s see how this goes.
But training an AI on additional new data would very often break the model: the new data would ‘delete’ previously trained knowledge to the point that the model becomes useless very quickly.
If you ever followed one of those tips (‘To train your own unique style, just never close this session or start a new one’), you may have noticed that the model keeps forgetting things you told it earlier and becomes less useful after a while.
This is catastrophic forgetting, and it is as bad as it sounds. Yes, maybe Google Research has, once again, made a breakthrough; at least the title is promising: ‘Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention’.
Can you be smart without memory?
The short answer seems to be yes. But it is not that simple. Let’s get back to the training and how much data was used.
The Link Between Compression, Abstraction, and Intelligence
Ray Solomonoff has illuminated how intelligence correlates with the ability to distill complex information into simpler, more manageable forms. Solomonoff’s algorithmic information theory suggests that the essence of intelligent behavior is the capacity to compress data efficiently, extracting underlying patterns that predict future occurrences.
Take, for example, the principle of Minimum Description Length (MDL) introduced by Jorma Rissanen. MDL posits that the best model for a dataset is the one that describes the data using the shortest description, effectively compressing the information. This principle mirrors how human cognition tends to streamline vast amounts of sensory information to form coherent, manageable understandings of the world.
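This is not MDL in its formal sense, but a quick sketch with a general-purpose compressor conveys the same intuition: data with an underlying pattern admits a far shorter description than patternless data of the same length.

```python
import random
import zlib

# Data containing a pattern compresses dramatically; patternless data does not.
patterned = ("the cat sat on the mat " * 500).encode()               # highly regular
random.seed(0)
noise = bytes(random.randrange(256) for _ in range(len(patterned)))  # no pattern at all

print("original size:        ", len(patterned))
print("patterned, compressed:", len(zlib.compress(patterned)))  # tiny: the pattern IS the description
print("random, compressed:   ", len(zlib.compress(noise)))      # essentially no shrinkage
```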
In neuroscience, similar processes are observed where neural pathways strengthen or diminish to efficiently store important information while discarding the irrelevant. This neural behavior is effectively a biological data compression algorithm, optimizing brain resources.
This intersection of compression, abstraction, and intelligence is not just theoretical; it directly influences how artificial intelligence systems are trained. The vast datasets used to train AI models are compressed into neural network weights during the learning process, as discussed earlier. These weights abstract the data’s complexity into usable intelligence, enabling AI systems to make predictions or understand language based on previously unseen data.
How much data anyway, how much compression?
You may find information suggesting that your LLM was trained on 45TB of data extracted from the internet. This was reduced to about 570GB after extensive filtering and preprocessing, equivalent to roughly 400 billion byte-pair-encoded tokens and representing roughly 60% of the total training data used. That alone indicates substantial compression. The remainder of the training, specifically the fine-tuning, used an additional 380GB, bringing the total processed data to about 950GB. Considering that the final GPT-3 model file is approximately 800GB, the compression and abstraction from the original terabytes of data down to a functional model demonstrate remarkable data efficiency, though not without loss of accuracy, since the compression is lossy rather than lossless.
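As a back-of-the-envelope check on those figures (which are commonly cited public estimates, not official numbers), the ratios look like this:

```python
# Rough ratios based on the figures cited above: 45 TB raw crawl,
# 570 GB after filtering, ~800 GB final model file.
raw_gb = 45.0 * 1000   # 45 TB expressed in GB
filtered_gb = 570.0
model_gb = 800.0

print(f"filtering kept {filtered_gb / raw_gb:.1%} of the raw crawl")        # ~1.3%
print(f"the model file is {model_gb / raw_gb:.1%} of the raw crawl size")   # ~1.8%
```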
Emergent
But it is undeniable: through what I’ll simply call ‘brute forcing’ (substantially increasing the model’s size by adding more parameters and enhancing the depth and width of the network layers), the network suddenly develops properties that were not explicitly programmed. How could it be otherwise, given that it was only trained on large datasets to understand language and generate natural-language output, as we can all witness live every day? Thus, we must recognize that it is at least possible to simulate intelligent behavior convincingly, so convincingly that significant benefits can be derived, even though critics point out that these networks do NOT really understand the world. But what does ‘understand’ really mean?
Moreover, these networks are only capable of learning from you to a limited extent, as discussed above. But in the optional stage 3, the second fine-tuning stage, models do learn from human feedback, for example when you pick an answer from a collection of possible answers. Furthermore, my references are primarily to the Large Language Models of the GPT series. As for Tesla’s AI network, which evidently can drive very well, I am not privy to specific details, as they are not public knowledge. Tesla releases a new version approximately every two to three weeks, each better than the last. We must therefore conclude that they have found a way to continuously improve this network, likely in the fine-tuning stage, and probably to train it with more and more data without the model breaking down.
Regarding LLaMA 3, released by Meta today, I cannot yet provide specific details.
So there is a way to evolve these models so they continuously learn. As described in ‘The Very Very Swift Evolution,’ this process can only be described as a race. This applies not only to the race for humanoid robots but also to the race in AI development towards models and networks that possess an ever-growing range of capabilities, capabilities that expand with each week that goes by. So even if there is only the ‘dumb’ learning on datasets, without continuous learning, memory, or even an understanding of the world, the results are staggering. And these restrictions are going away fast. Stay tuned for a lot more to come.
If you feel overwhelmed by this information and would like some consultation, or if you want to give your team and yourself an even more in-depth understanding of this topic, consider contacting me.