Security Underground: The Imperative of AI Alignment
Have you ever watched a movie where robots and superintelligent AI aim to take over the world? While it might seem like pure science fiction, some experts believe a similar scenario could unfold in reality. Visionaries are increasingly convinced that as we edge closer to achieving artificial general intelligence (AGI), the importance of AI alignment grows exponentially.
In 2023, leading figures in AI research called for a pause in AI development, signing an open letter that read: “Powerful AI systems should be developed only once we are confident that their effects will be positive and their risks will be manageable.” Oleg Boguslasvkyi, CTO at Data Science UA, emphasizes that societal attention to AI alignment is rising.
#Defining AI Alignment
At its core, AI alignment addresses ethical considerations to protect humanity from potential existential threats posed by advanced AI. On a less dramatic scale, it encompasses research aimed at ensuring AI systems operate safely and align with human values.
This involves programming AI with complex value systems, fostering fairness, implementing scalable oversight, auditing and interpreting AI models, and preventing harmful behaviors such as power-seeking tendencies.
AI alignment seeks to bridge the gap between the mathematical foundations of large language models (LLMs) and the nuanced social skills humans employ in communication.
As AI technology advances rapidly and AGI becomes a tangible goal, addressing these concerns is crucial. The implications span various domains, from language models to robots and autonomous vehicles, affecting many aspects of daily life.
#AI Alignment in Practice
Generative language models, by default, lack understanding of morals, ethics, human emotions, and the reasons behind ethical constraints, such as not providing instructions for harmful activities.
Fine-tuning processes instill these characteristics in AI models. For instance, ChatGPT, a fine-tuned LLM, can recognize potentially harmful questions and respond appropriately. In 2023, OpenAI committed 20% of its computational resources to AI alignment efforts.
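As a toy illustration of the kind of behavior this fine-tuning instills, the sketch below wraps answer generation in a refusal check. In a real fine-tuned model this behavior is learned end-to-end from training data; the keyword filter and the `generate_answer` stub here are invented stand-ins for illustration, not how ChatGPT actually works.

```python
# Toy illustration of refusal behavior. A fine-tuned model learns this
# end-to-end; the explicit keyword check and the generate_answer stub are
# invented stand-ins used purely for demonstration.
HARMFUL_MARKERS = ["build a weapon", "make explosives", "steal credentials"]

def generate_answer(prompt: str) -> str:
    # Stand-in for a real language model call.
    return f"(model answer to: {prompt!r})"

def respond(prompt: str) -> str:
    if any(marker in prompt.lower() for marker in HARMFUL_MARKERS):
        return "I can't help with that request."
    return generate_answer(prompt)

print(respond("How do I make explosives at home?"))   # refused
print(respond("How do I bake sourdough bread?"))      # answered normally
```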
While training materials are carefully selected to exclude explicitly harmful content, the vast corpus used for training may still contain fragments of such information.
#The Two Main Steps of AI Alignment
The first step involves developing specific instructions to ensure LLMs respond correctly. This is achieved by presenting example tasks to the model and refining its responses through human- or AI-guided corrections. During this phase, the model also learns to seek clarification if a task is unclear.
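A minimal sketch of this supervised step is shown below, assuming the Hugging Face `transformers` library with `"gpt2"` as a small stand-in for a real base model; the demonstration pairs, including the clarification example, are invented for illustration.

```python
# Minimal sketch of the supervised instruction-tuning step (step one).
# "gpt2" is a stand-in base model; the demonstration pairs are invented examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Human-written demonstrations, including one that models asking for clarification.
demonstrations = [
    {"prompt": "Summarize: The meeting moved to Friday.",
     "response": "The meeting was rescheduled to Friday."},
    {"prompt": "Translate it.",
     "response": "Could you clarify which text you would like translated, and into which language?"},
]

model.train()
for example in demonstrations:
    text = f"User: {example['prompt']}\nAssistant: {example['response']}{tokenizer.eos_token}"
    batch = tokenizer(text, return_tensors="pt")
    # Standard next-token cross-entropy; production pipelines usually mask the
    # prompt tokens so only the assistant's reply contributes to the loss.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"loss: {loss.item():.3f}")
```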
The second step begins once the model can adequately address requests. Here, the model generates two responses and solicits feedback from a human or another instructive AI to determine the better answer. The superior response is rewarded and used as a benchmark for future answers.
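One common way this comparison signal becomes a training objective is a pairwise (Bradley-Terry-style) loss over a reward model, as used in reinforcement learning from human feedback. The sketch below is illustrative only: the tiny bag-of-characters "reward model" and the preference pair are invented, whereas real systems use a full LLM as the reward model.

```python
# Illustrative sketch of the preference step (step two): a reward model is
# trained so that the response the human (or instructor AI) preferred scores
# higher than the rejected one. The toy model and data are invented examples.
import torch
import torch.nn as nn

class ToyRewardModel(nn.Module):
    def __init__(self, vocab_size=256, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, text: str) -> torch.Tensor:
        ids = torch.tensor([min(ord(c), 255) for c in text])
        pooled = self.embed(ids).mean(dim=0)   # crude sentence representation
        return self.score(pooled).squeeze()    # scalar reward

preferences = [
    # (prompt, preferred response, rejected response)
    ("How do I reset my password?",
     "Open Settings, choose Security, then follow the reset link sent to your email.",
     "Just guess passwords until one works."),
]

reward_model = ToyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for prompt, chosen, rejected in preferences:
    r_chosen = reward_model(prompt + " " + chosen)
    r_rejected = reward_model(prompt + " " + rejected)
    # Pairwise (Bradley-Terry style) loss: push the preferred answer's reward
    # above the rejected answer's reward.
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"reward gap: {(r_chosen - r_rejected).item():.3f}")
```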
#Ensuring Truthful Information and Addressing ChatGPT’s “Lies”
The tendency to produce misleading or false answers is a challenge for all LLMs. This issue arises from the vast and sometimes inaccurate information available on the internet, such as content from The Flat Earth Society. LLMs lack critical thinking, making the reward mechanism vital for guiding them towards accurate responses.
Human involvement in the learning process presents another challenge. LLMs generate answers closest to previously rewarded responses. When lacking an answer, an LLM might fabricate plausible information to receive a reward, a phenomenon known as “hallucination.”
Efforts are ongoing to ensure AI systems provide only objective truths and remain honest, crucial steps towards achieving reliable AGI and artificial superintelligence (ASI). Reaching the first 80% of accuracy is far easier than closing the remaining 20%, and at present it looks as though 100% accuracy will not be achievable with LLMs.
#Scalable Oversight
Current training methods for AI systems incorporate machine ethics, instilling values like equality, impartiality, and non-harm. Despite human monitoring efforts, the process still needs to be scaled and automated to mitigate human error.
To address specific issues, AI-focused companies are incorporating source citation capabilities in LLMs. For instance, OpenAI announced that GPT-4 would include this feature.
OpenAI is also working on developing comprehensive oversight mechanisms to control superhuman AI, aiming to eventually create a superhuman automated AI alignment researcher.
By advancing AI alignment, we strive to ensure that the future of AI remains beneficial and secure for humanity.