Just chatting

Engineers who build and train AI and machine learning models often can’t use real data as their training set because of privacy and security legislation.

So they have often generated fake data sets to train on instead.

My guess is that creating large fake data sets is a tedious business, and that as soon as engineers had a working large language model (GPT), they used it to create fake data sets for the next iteration of training.

I can’t prove this, but I know engineers, and they can’t resist a shortcut, especially if it’s sitting right there in front of them.

If I’m right, current LLMs have machine-generated fake data baked into their innards by training, and it can’t be unscrambled.

The opportunity here is to start from scratch using only real data sets, and never to ask for fake data to be generated. The resulting LLM will be worth a fortune as an AI detector.