Many questions come to mind with AI dominating the news cycle this past year. Besides the big “Will AI take over the world?” question, the ethics of AI training and how content is consumed are being widely debated right now. With this in mind, should you be thinking about protecting your online content? And if so, is there anything you can do about it now?
How are LLMs trained?
First, let’s cover how large language models (LLMs) like GPT-4 get the data they train on. These models are trained on a large array of text, including websites, books, articles, and other data collected widely from the internet. Training involves processing this dataset to learn language patterns, grammar, facts, and various forms of information. On a basic level, the models learn to predict the next word in a sentence, which enables them to generate coherent and contextually relevant text. The training data is usually preprocessed to remove personally identifiable information and to ensure a broad and diverse representation of language usage. Keep in mind that this preprocessing combines manual review with algorithms and automated processes, so it is not perfect.
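To make "predict the next word" concrete, here is a toy sketch. It is not how GPT-4 or any real LLM works internally; it only counts which word most often follows another in a tiny, made-up sentence. The function names and the sample text are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    candidates = following.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Hypothetical training text, standing in for "content collected from the web".
corpus = "the cat sat on the mat and the cat slept"
model = train_bigrams(corpus)

print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" only once
```

The point for content owners is that whatever text ends up in the training set shapes what the model later produces, which is why the question of what gets collected matters.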
Why should you protect your content?
The answer will be different for everyone, and it may be something you do not need to worry about. However, you may want to take action if your site has content that you want people to see freely but don’t want the next generation of LLMs to train on. This could include things like:
- Personal Information: Protect user-generated content, personal data, and sensitive information to ensure privacy and security. Remember, the preprocessing step that removes personal data from training sets is partly manual and prone to error.
- Proprietary Content: Guard your unique articles, research, reports, and proprietary data that offer a competitive edge. In my opinion, most of this should be protected through authentication, but I could see a case for some sites showing more than others.
- Creative Works: Safeguard your original images, videos, music, and written works to protect intellectual property rights.
- Confidential Business Information: Secure internal documents, strategies, and plans to prevent unauthorized use. This one is obvious, and such material shouldn’t be visible to unauthorized users, but I have seen people put private information on the internet and assume it was not “findable” because no page on the website linked to it. Not true!
- Paid or Premium Content: Protect content that is behind paywalls or subscription models to ensure only paying users can access it. Again, your code should verify that someone is authorized before showing them this type of content.
How can you protect your content?
Currently, there is no guaranteed way to keep LLMs specifically away from your website content. However, most models seem to honor the same robots.txt file that search engines use for indexing. Keep in mind that it is up to each crawler to honor it; it is akin to saying to a guest, “please don’t go into that room.” X and LinkedIn posts and comments are examples of content that the public can read but that is disallowed for search engine indexing and most LLM data harvesting.
With this in mind, here are some ways to protect the content on your website from being consumed by language models and other automated tools. Consider implementing the following best practices:
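As a rough sketch of what this looks like in practice, the example below holds a small robots.txt and uses Python's standard `urllib.robotparser` to show how a well-behaved crawler would interpret it. Some crawler operators publish user-agent names you can list, such as OpenAI's GPTBot and Common Crawl's CCBot; the `/drafts/` path, the `example.com` URLs, and the `is_allowed` helper are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask two AI-related crawlers to stay out entirely,
# and ask every other crawler to skip only the /drafts/ section.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /drafts/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a crawler that honors robots.txt may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/blog/post"))       # False
print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/blog/post"))    # True
print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/drafts/idea"))  # False
```

Note that this check runs on the crawler's side. Nothing in robots.txt stops a crawler that chooses to skip the check, which is exactly the "please don't go into that room" limitation.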
- robots.txt: Use this file to disallow crawlers from accessing specific parts of your site.
- Meta Tags: Implement meta tags such as
<meta name="robots" content="noindex, nofollow">
to prevent search engines from indexing and following links on specific pages.
- CAPTCHA: Use CAPTCHA challenges to ensure that only humans can access certain content.
- Content Fencing: Implement measures like paywalls or user authentication to restrict access to premium content.
- Terms of Service: Clearly state in your terms of service that unauthorized scraping or use of your content by automated systems is prohibited.
- Watermarking: For images and videos, use watermarks to protect against unauthorized use.
- Legal Notices: Include legal notices and copyright information to deter unauthorized use and provide grounds for enforcement.
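If you add the robots meta tag from the list above, it helps to confirm that it actually appears in the page you serve. The following is a minimal sketch using Python's standard `html.parser`; the class name, helper function, and sample page are hypothetical and only meant to illustrate the check.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for part in attr_map.get("content", "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots directives declared in an HTML document."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives


# Hypothetical page source; in practice this would be your rendered HTML.
page = """<html><head>
<meta name="robots" content="noindex, nofollow">
<title>Members notes</title>
</head><body>Hello</body></html>"""

print(sorted(robots_directives(page)))  # ['nofollow', 'noindex']
```

An empty result for a page you meant to hide is a sign the tag was dropped by a template or plugin somewhere along the way.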
None of these is a fail-safe way to protect your online content, so if you are really concerned, you should consider locking your content down behind a login or paywall. If you have content that is not sensitive but you don’t want it indexed or consumed by AI, you can try some of the suggestions above. If you are unsure how to implement any of this, or you have questions about AI and how it might impact you and your business, reach out and we can help you through it.