Is Data Worth Nothing for Your LLM?
Maybe data is not the new oil. You can have all the content in the world and your LLM still sucks.
You can be the most popular social media platform, even the most popular website of 2021, surpassing Google.
You can collect new user-generated data at breakneck speed. And that is just the video data, with at least 25 images per second plus sound. On top of that video data you get textual data, descriptive data, usage data, user data and personal information.
And you are still unable to build a meaningful LLM?
Alex Heath’s reporting has uncovered that, in the race to build generative AI, ByteDance, the parent company of TikTok, is secretly using OpenAI’s tech to build a competing LLM:
TikTok’s entrancing “For You” feed made it an AI leader on the world stage. But that same company is now so behind in the generative AI race that it has been secretly using OpenAI’s technology to develop its own competing LLM, including for training and evaluating their model.
So is data worth nothing? Not quite: what counts is the quality of the data. Building high-quality training sets is crucial and cumbersome.