The Hollywood writers’ and actors’ strikes are currently at the centre of the fight over artificial intelligence (AI), as US-based writers and actors try to stop their work from being replaced by AI systems.
Here at PN, we’re wondering about our own resistance to the new digital overlords.
You may be surprised to hear that AI played a big role at the Global Investigative Journalism Network (GIJN) conference that was taking place in Gothenburg, Sweden, as we were putting together this issue. (The conference is a big deal: last year, it drew 1,668 participants from 144 countries.)
There were a lot of sessions at GIJN this year which focused on IT, especially on data techniques. These included: ‘Using Pivot Tables in Excel to Summarize Data’; ‘Introduction to Scraping Websites Using Data Miner’; ‘The Basics of Python’ (the programming language); and ‘So You Think You Can Google?’
There were at least nine sessions in which AI had a central or major role, including: ‘What Does AI Have to Do with Investigative Journalism? Everything!’; ‘How AI Can Save Small Newsrooms’; and ‘Open Source Reporting – Let’s Get Personal’. In that last workshop, online search expert Henk van Ess shared ‘brand new insights’ on ‘how to use AI to find synthetic drug dealers and how large language models can help to find out the secret address of a CEO.’
‘Large language models’ are generative AI systems like ChatGPT – which can create readable, human-like sentences in answer to any chosen question, based on a huge amount of text that has been scraped off the internet.
ChatGPT knows which words generally sit side-by-side on the internet, and it builds up new sentences by predicting the most likely word to come next in its answer to the user’s question.
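This ‘predict the next word’ idea can be illustrated with a deliberately tiny toy model. To be clear, this is not how ChatGPT actually works – real large language models use neural networks trained on billions of sentences – but counting which word most often follows which gives the flavour of it. The three-sentence ‘corpus’ below is invented for the example:

```python
# Toy next-word predictor: count which word most often follows each word.
# This is a crude sketch of the idea behind large language models, NOT
# the actual technique (which uses neural networks, not simple counts).
from collections import Counter, defaultdict

# A tiny invented corpus, split into individual words.
corpus = (
    "peace news reports on nonviolence . "
    "peace news reports on activism . "
    "peace movements report on activism ."
).split()

# For every word, count how often each other word comes straight after it.
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next(word):
    """Return the word most often seen after `word` in the corpus."""
    candidates = following[word]
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("peace"))    # 'news' – seen twice, vs 'movements' once
print(predict_next("reports"))  # 'on'
```

A system like ChatGPT does something loosely analogous at vastly greater scale, choosing among likely continuations word by word until a whole answer has been built up.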
Chinese-American science fiction author Ted Chiang is one of many people who don’t think that ‘artificial intelligence’ is the right label for this kind of computer work. He and many others suggest that a better name would be ‘applied statistics’.
One issue is that generative AI systems like ChatGPT or DALL-E (which generates images rather than words) are based on content scraped from thousands of websites and books without acknowledgement or compensation.
In August, OpenAI, the company behind ChatGPT and DALL-E, released a technical note on how to block its ‘crawlers’ from taking content off your website.
A website can include a ‘robots.txt’ file, which can ask specific web crawlers not to access the site. (A crawler does not have to obey this request.) OpenAI has given details on how to identify and block its ChatGPT-related crawler, which calls itself ‘GPTBot’, and has promised not to scrape content from websites that block it.
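In practice, following OpenAI’s published guidance, the request looks like this: two lines added to the site’s robots.txt file, naming OpenAI’s crawler (‘GPTBot’) and telling it to stay out of everything. (The comment lines are just explanation, and other crawlers are unaffected unless they are listed as well.)

```
# robots.txt – ask OpenAI's GPTBot crawler to stay away from the whole site
User-agent: GPTBot
Disallow: /
```

The same file can also carry separate entries for other crawlers, each with its own ‘User-agent’ line.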
However, putting the right words into your robots.txt file will only stop OpenAI from scraping your site in the future. The company is not allowing website owners to remove their content from the huge databases it has created in the past, the databases which ChatGPT, for example, was ‘trained’ on. (The OpenAI robots.txt wording will also not stop other AI companies from scraping your site.)
What should we do?
Within PN, there are different points of view on whether, from now on, to block OpenAI from scraping the PN site and its archive of past issues.
One perspective is that removing content about peace and radical nonviolence from AI databases would skew responses from ChatGPT and similar services even more towards militarist and violent approaches.
Another way of looking at things is that any (small) gains coming from PN’s archive being part of the dataset are more than offset by the point of principle: Big Tech shouldn’t be allowed to cannibalise this work as part of its drive to concentrate wealth and disempower workers.
As readers will know, our columnist Cath (desperately short of time) once asked ChatGPT to write a column for her, based on her past work for PN. It said the right kind of things, but it didn’t sound like her....