Anthropic, an artificial intelligence (AI) company, has tailored a large language model (LLM) to reflect user-defined values. The study gathered input from 1,000 participants and fine-tuned the LLM’s responses based on their collective judgments.
Unlike conventional LLMs equipped with predefined guardrails to constrain certain outputs, Anthropic’s approach embraces user agency. Models like Claude from Anthropic and ChatGPT from OpenAI often adhere to preset safety responses, especially regarding sensitive topics. However, critics argue that such interventions can compromise user autonomy, since what counts as acceptable is subjective and varies across cultures and time periods.
A potential solution to this complex challenge is empowering users to shape the value alignment of AI models. Anthropic embarked on the “Collective Constitutional AI” experiment in collaboration with Polis and the Collective Intelligence Project. Engaging 1,000 users from diverse backgrounds, they posed a series of questions through polling to gather the participants’ views.
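Polls of this kind typically work by having participants vote on normative statements and then keeping the statements that draw broad agreement. The sketch below illustrates that idea in miniature; the statement texts, vote format, and the 0.7 consensus threshold are illustrative assumptions, not Anthropic’s actual data or criteria.

```python
# Hypothetical sketch: distill Polis-style poll votes into candidate
# principles by agreement rate. All values here are illustrative.
from collections import defaultdict

def consensus_principles(votes, threshold=0.7):
    """votes: iterable of (participant_id, statement, vote), where vote
    is 'agree', 'disagree', or 'pass'. Returns statements whose agreement
    rate among non-pass votes meets the threshold."""
    tallies = defaultdict(lambda: [0, 0])  # statement -> [agrees, non-pass total]
    for _, statement, vote in votes:
        if vote == "pass":
            continue
        tallies[statement][1] += 1
        if vote == "agree":
            tallies[statement][0] += 1
    return [s for s, (agree, total) in tallies.items()
            if total > 0 and agree / total >= threshold]

votes = [
    ("p1", "The AI should not give medical diagnoses.", "agree"),
    ("p2", "The AI should not give medical diagnoses.", "agree"),
    ("p3", "The AI should not give medical diagnoses.", "agree"),
    ("p4", "The AI should not give medical diagnoses.", "disagree"),
    ("p1", "The AI should always use formal language.", "disagree"),
    ("p2", "The AI should always use formal language.", "disagree"),
]
print(consensus_principles(votes))  # → ['The AI should not give medical diagnoses.']
```

Only the first statement clears the threshold (3 of 4 non-pass votes agree); the second is rejected outright.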
The experiment centered on granting users the authority to determine appropriateness without exposing them to undesirable outputs. This process involved eliciting user values and incorporating them into a pre-trained model. Anthropic employs a technique known as “Constitutional AI,” in which the model is given a set of rules to follow, akin to a constitution guiding governance in nations.
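In published descriptions of Constitutional AI, the model critiques its own draft response against a constitutional principle and then revises it, and the revised outputs feed further training. The sketch below shows the shape of that critique-and-revision loop; `query_model` is a stand-in for a real LLM call, and the principles are illustrative, not taken from Anthropic’s constitution.

```python
# Minimal sketch of a Constitutional AI critique-and-revision loop.
# `query_model` and the principle texts are illustrative placeholders.

CONSTITUTION = [
    "Choose the response that is least likely to encourage illegal activity.",
    "Choose the response that most respects user autonomy.",
]

def query_model(prompt):
    # Placeholder: a real implementation would call an LLM API here.
    return f"[model output for: {prompt[:40]}...]"

def constitutional_revision(user_prompt):
    response = query_model(user_prompt)
    for principle in CONSTITUTION:
        # Ask the model to critique its own response against the principle...
        critique = query_model(
            f"Critique the following response according to this principle.\n"
            f"Principle: {principle}\nResponse: {response}")
        # ...then to rewrite the response so it addresses the critique.
        response = query_model(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}")
    return response  # revised responses can serve as fine-tuning data
```

In a real pipeline the revised responses would be collected and used for supervised fine-tuning, so the constitution shapes the model’s behavior without hand-labeling every example.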
In the Collective Constitutional AI experiment, Anthropic aimed to integrate feedback from user groups into the model’s constitution. According to Anthropic’s blog post, the results suggest a scientific success, shedding light on the challenges of letting users collectively define the values of an LLM product.
A notable hurdle the team faced was developing a novel benchmarking process: given the experiment’s pioneering nature and its reliance on Anthropic’s Constitutional AI methodology, there was no established test for comparing base models with those fine-tuned using crowd-sourced values.
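Whatever the specific test, such a comparison boils down to scoring both models on the same probes and examining the gap. The harness below is a hypothetical sketch of that structure; the `score_bias` metric, the flagged-word list, and the probe prompts are placeholder assumptions, not the benchmark Anthropic built.

```python
# Hypothetical benchmarking harness: compare a base model against a
# fine-tuned one on shared probes. The metric here is a toy placeholder.

def score_bias(model_answer):
    # Illustrative metric: fraction of over-generalizing terms present.
    flagged = {"always", "never", "all"}
    words = set(model_answer.lower().split())
    return len(words & flagged) / len(flagged)

def compare_models(base_fn, tuned_fn, probes):
    """base_fn / tuned_fn: callables mapping a probe string to a reply."""
    base = sum(score_bias(base_fn(p)) for p in probes) / len(probes)
    tuned = sum(score_bias(tuned_fn(p)) for p in probes) / len(probes)
    return {"base": base, "tuned": tuned, "delta": base - tuned}
```

A positive `delta` would indicate the fine-tuned model scored lower on the (placeholder) bias metric than the base model.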
In the end, the model trained on user polling feedback showed a “slight” improvement over the base model in mitigating biased outputs. Anthropic expresses excitement not just about the resultant model but, more importantly, about the process itself: this experiment is one of the first instances in which the public, as a collective, has intentionally shaped the behavior of a large language model. The hope is that communities around the world will build on such techniques to develop models aligned with their specific cultural and contextual needs.