That AI was ImageGPT from OpenAI where GPT stands for Generative Pretrained Transformer. Now surely I am not the only one who immediately went crazy with ideas about what to find a “plausible solution” to next.
Some saw just yet another AI image generator. But… I saw an unfathomably big probability space reduced to a thin and focused slice. And what's more shocking: reduced to that thin slice by following a lot of very complex observed and learned rules. Notice how the position of the light source influenced the lighting in the resulting generated pedestrian scene. Notice the plausible transparency, length and perspective of the shadows. The plausible joint anatomy of furry cat paws. That is, if a cat ever wanted to hold a card. Heck, it even added a slight paw shadow over the card. But what really is the most important problem worth completing to move humans forward?
“Q: What is the result of 1+1 and why?”
“A: The result of 1+1 is 2 according to rules seen in arxivurl1 and arxivurl2”
Advantages of training on all scientific peer-reviewed sources:
Learning to arrive at conclusions by observing the thought process down to its basic elements (Arxiv), not just the conclusions themselves (Wikipedia).
Information quality. Information density. Smaller dataset. Faster training. Less memory. Ability to follow and focus attention on the chain of references and citation counts. Less bias. No leaks.
What will shockingly plausible solutions look like now? What hidden rules will AI observe and apply from all the human scientific data and knowledge it was trained on? Will it complete the equation with a term describing the pattern it observed within the data in some LHC paper? I proposed this for live TV and the net, and got access to GPT-3.
“With enough complexity there is …” — Yannic Kilcher
Now, short text like that activates only a part of the model's input and weights. So out of pure curiosity, I tried to find the maximum number of math equations I can ask GPT-3 to compute at once.
Think of equations as observed shadows of some complex structure projected to our limited plane of understanding.
Equations are just language too, with equally learnable rules.
The multidimensional nature of AI weights has so far most definitely proven able to observe and learn the hidden rules of these higher-dimensional constructs and to project new construct shadows / equations. Just as it did with ImageGPT (12 billion parameters) and the images of cats at the top of the article. For example, AI exploring math in an automated way has already found a new conjecture. If it can learn known math rules just by observation,
then it can learn unknown math rules as well.
“Mixture-of-experts-based models tend to significantly underperform monolithic (regular) models for the same number of parameters.” — EleutherAI FAQ
GPT-3 undeniably produces often fascinating answers on subjects that were frequently present in the training dataset. Heck, it even seems to have learned simple math, since it has seen just simple math.
But there lies a problem.
Wikipedia frequently contains only end results and conclusions.
Not the important process of how we arrived at them.
A model that observed multiple thought processes arriving at multiple conclusions is better than a model that just observed multiple conclusions. So what if we trained it on the whole of Arxiv.org, including the complex equations that often make up 50% of the content or more?
Update: We are getting there fast. Thanks to EleutherAI's excellent open-source GPT-3 clone model and dataset, “” was recently born, trained on a large chunk of Arxiv and DeepMind Math examples. Yay ;D
My approach to the dataset would be completely different from the currently common…
“Let's just throw a lot of random text at it.”
You would not start teaching a child high-school math without basic math first. If you teach it basic rules from one educational book, like an algebra book packed with new info on every line, the kid will learn it faster than if the kid just read random stories low on new info, where it would learn it eventually too, but it would take way longer. Unfortunately, that's how inefficiently current GPT models are trained.
If there is such a thing as new-information density, then not all training text is equal.
For example:
Say a text file containing 1 GB of just addition examples will teach it just addition, no matter how long it is. After the 5th occurrence, you are pretty much feeding it duplicate info and wasting CPU and memory reinforcing what is probably not that important. So if we reorder the dataset so that text with much higher new-information density comes first, reshuffle a little and repeat, then we should converge far faster and, more importantly, extract way higher-level information not decodable before, due to not having the required prior lower-level knowledge. Why?
In a sense, a book containing higher-level concepts like high-school math is comparable to a zip file: it requires you to possess a decoding dictionary / prior external know-how to decode its now compressed content. Even a single symbol or pair of symbols here often references complex and info-heavy prior knowledge. So new-info density and order are extremely important in order to be able to unpack, understand and extract this new higher-level information.
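To make the ordering idea concrete, here is a minimal sketch (my own illustration, not the author's pipeline) that uses compression ratio as a crude stand-in for new-information density and sorts documents densest-first; the per-document weight-change tracking suggested a bit further down would be a more faithful estimate.

```python
import gzip
from typing import List

def new_info_density(text: str) -> float:
    """Crude proxy for new-information density: how poorly the text compresses.

    Repetitive text (e.g. 1 GB of addition examples) compresses well and scores
    low; dense educational text compresses poorly and scores high. Tracking
    per-document loss / weight changes during training would be a better proxy.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(gzip.compress(raw)) / len(raw)

def curriculum_order(documents: List[str]) -> List[str]:
    """Sort documents densest-first, per the 'higher new-info density first' idea."""
    return sorted(documents, key=new_info_density, reverse=True)

if __name__ == "__main__":
    docs = [
        "1 + 1 = 2. " * 2000,  # near-duplicate arithmetic: low density
        "A group is a set G together with an associative binary operation, "
        "an identity element, and an inverse for every element.",  # dense new concepts
    ]
    for doc in curriculum_order(docs):
        print(f"{new_info_density(doc):.3f}  {doc[:40]!r}")
```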
Pass 1: train many epochs on just thousands of school educational books (but the densest sources of new info on the planet), where every line is pretty much new info. And do it in a very specific order, like in school: start by explaining and teaching English language basics, basic math, basic physics, basic chemistry, then the high-school variants of all of those.
Alternatively, when you do not have a large set of the desired educational books, just a random but big existing dataset, you can still implement this curriculum training and sort the documents in the dataset by estimating their new-information density, for example by tracking big weight changes and their frequency per document.
Pass 2: finetune on as many synthetic, random, computer-generated equations as you can.
I can't imagine a currently better candidate for this pass than the excellent . Here are training examples:
Answer: 4
Question: Calculate -841880142.544 + 411127.
Answer: -841469015.544
Question: Let x(g) = 9*g + 1. Let q(c) = 2*c + 1. Let f(i) = 3*i - 39. Let w(j) = q(x(j)). Calculate f(w(a)).
Answer: 54*a - 30
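If you want to experiment before grabbing a full dataset of DeepMind-Mathematics-style examples like the ones above, a throwaway generator in the same Question/Answer format is easy to sketch (hypothetical code of mine, not the actual dataset generator):

```python
import random

def linear_equation_example(rng: random.Random) -> str:
    """One synthetic 'Solve a*x + b = c for x' example with an integer answer."""
    a = rng.choice([n for n in range(-12, 13) if n != 0])
    x = rng.randint(-20, 20)        # the hidden solution
    b = rng.randint(-50, 50)
    c = a * x + b                   # constructed so the answer is exactly x
    return f"Question: Solve {a}*x + {b} = {c} for x.\nAnswer: {x}"

if __name__ == "__main__":
    rng = random.Random(0)
    print("\n".join(linear_equation_example(rng) for _ in range(3)))
```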
Pass 3: Now that the model has a firm grasp of mathematics and its computational units have formed, only now train on all arxiv papers and theory candidates containing complex math equations. Well, complex for humans, since our brain's context and memory buffers are limited.
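Put together, the three passes amount to nothing more exotic than staged fine-tuning; a skeleton might look like this (a hypothetical sketch with placeholder corpora and illustrative epoch counts, where `finetune` stands in for whatever training loop you already have):

```python
from typing import Callable, List

Corpus = List[str]
Trainer = Callable[[Corpus, int], None]   # (documents, epochs) -> updates the model in place

def run_curriculum(finetune: Trainer,
                   textbooks_in_school_order: Corpus,
                   synthetic_math: Corpus,
                   arxiv_papers: Corpus) -> None:
    # Pass 1: many epochs on a small, dense, carefully ordered set of
    # educational books (language basics -> basic math/physics/chemistry
    # -> their high-school variants).
    finetune(textbooks_in_school_order, 20)

    # Pass 2: fine-tune on as many synthetic computer-generated equations as possible.
    finetune(synthetic_math, 3)

    # Pass 3: only now train on the full arxiv corpus, complex equations included.
    finetune(arxiv_papers, 1)

if __name__ == "__main__":
    # Stub trainer so the sketch runs end to end; swap in a real training loop.
    def fake_finetune(corpus: Corpus, epochs: int) -> None:
        print(f"training {epochs} epoch(s) on {len(corpus)} document(s)")

    run_curriculum(fake_finetune,
                   textbooks_in_school_order=["english basics", "algebra textbook"],
                   synthetic_math=["Question: Solve 3*x + 1 = 10 for x.\nAnswer: 3"],
                   arxiv_papers=["an arxiv paper with equations"])
```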
The model should then, in theory, be able to interactively explain any part of why it came to the conclusion it arrived at. And what's more, like GPT-3, contemplate new variants. Papers themselves frequently propose other possible directions that were not explored due to time constraints. Potential that AI may definitely tap into due to its infinite scalability. Because sometimes even the smallest Spark can lead to a Big Fire.
And yes, I know. There is a lot everyone has on their plate nowadays. And I also know it's not like you flick a switch and you're done. But when you think about it more, it can use already existing text-only toolchains…
Indeed, that's how the 76 GB arxiv dataset was meanwhile born. See
I know, I know.
But as the Big Rocket Man would say:
We should dream BIG to be excited about every next morning.
Also published on .