AlphaCodium is a new system for solving coding competition problems
AlphaCodium is a new system for solving coding competition problems. It uses the CodeContests dataset, the same dataset created for AlphaCode, which I told you all about a while back. Instead of training a neural network for coding, the researchers built their system on top of an existing model, such as GPT-4. In addition to GPT-4 they also used DeepSeek, the Chinese LLM. Don't ask me why DeepSeek.
They describe the challenge:
"Code generation problems differ from common natural language problems -- they require matching the exact syntax of the target language, identifying happy paths and edge cases, paying attention to numerous small details in the problem spec, and addressing other code-specific issues and requirements. Hence, many of the optimizations and tricks that have been successful in natural language generation may not be effective for code tasks."
They criticize AlphaCode and describe their aspirations:
"AlphaCode was a code generation system developed by DeepMind, that utilizes a fine-tuned network specifically for competitive programming tasks. AlphaCode generates a very large number of possible solutions (up to 1M), that are then processed and clustered, and among them a small number (~ 10) is chosen and submitted. While the results of AlphaCode are impressive, the need to fine-tune a model specifically for code-oriented tasks, and the heavy computational brute-force-like load, makes it impractical for most real-life usages."
Instead, they looked for a way to prompt an LLM to reason through the problem iteratively. So they came up with a multi-stage process (a rough code sketch of the whole flow follows the steps below):
First, input the problem.
Next, encourage the LLM to reflect on the problem, which in practice means to describe the problem in bullet points, with special sections for the problem goal, inputs, outputs, rules, constraints. The LLM is challenged to describe any additional relevant details that may have appeared in the problem description.
Coding challenge problems usually come with examples they call "public tests" -- pairs of inputs and the outputs a correct solution must produce. The LLM is challenged to write an explanation for each public test, describing why the input leads to the output.
Only after this is the LLM challenged to create possible solutions.
The LLM is then asked to decide which solution to test.
This program is tested against the coding competition's auto-grader.
If it passes, you're done, but if not, there are additional steps.
The code is tested against public tests, and if it fails, the LLM sees the output and is challenged to fix the error.
If the code passes the public tests, the LLM is challenged to create additional tests. (!)
At this point the cycle repeats until either the LLM generates code that succeeds or the coding competition's try limit is exceeded.
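Putting the steps above together, here is a minimal sketch of the flow in Python. It is not AlphaCodium's actual code: it assumes two hypothetical helpers, llm(prompt) returning a string and run_tests(code, tests) returning a pass/fail flag plus failure output, and it treats test suites as opaque blobs handed to run_tests.

```python
# A rough sketch of the flow described above, not AlphaCodium's real implementation.
# Assumed helpers: llm(prompt) -> str, run_tests(code, tests) -> (passed, failure_output).

def alphacodium_style_flow(problem, public_tests, llm, run_tests, max_fix_rounds=5):
    # Stage 1: self-reflection on the problem in bullet points.
    reflection = llm("Describe the problem in bullet points: goal, inputs, "
                     "outputs, rules, constraints.\n\n" + problem)
    # Stage 2: explain why each public test's input leads to its output.
    test_reasoning = llm("Explain, for each public test, why the input leads to "
                         "the output.\n\n" + str(public_tests))
    # Stage 3: propose possible solutions, then pick one and implement it.
    candidates = llm("Propose two or three possible solutions.\n\n"
                     + reflection + "\n\n" + test_reasoning)
    code = llm("Choose the most promising solution and implement it in Python.\n\n"
               + candidates)
    # Stage 4: generate additional AI tests beyond the public ones.
    ai_tests = llm("Generate additional input/output tests for this problem.\n\n"
                   + reflection)

    # Iterative fixing: run the tests, feed failures back, and try again.
    for _ in range(max_fix_rounds):
        ok, failure = run_tests(code, public_tests)
        if ok:
            ok, failure = run_tests(code, ai_tests)
        if ok:
            return code  # passes everything we can check locally
        code = llm("The code below failed a test. Fix it.\n\nFailure:\n"
                   + failure + "\n\nCode:\n" + code)
    return code  # best effort after the fix budget is exhausted
```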
"When asking an LLM to reason about an issue, better results are obtained when demanding the output to be in bullet points format. Bullet points encourage an in-depth understanding of the problem, and force the model to divide the output into logical semantic sections, leading to improved results."
They also found that asking the LLM to produce its output in a structured format called YAML helped.
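To make the bullet-point and YAML advice concrete, here is a hedged illustration of what a structured self-reflection step could look like. The key names are my own guess at a schema, not the paper's, and llm is again an assumed helper.

```python
# Illustration only: ask for the reflection as YAML with fixed keys, then parse it.
import yaml  # third-party: pip install pyyaml

REFLECT_PROMPT = """Reflect on the problem below. Answer in YAML with the keys
goal, inputs, outputs, rules, constraints, each containing bullet points.

{problem}
"""

def self_reflect(llm, problem: str) -> dict:
    raw = llm(REFLECT_PROMPT.format(problem=problem))
    return yaml.safe_load(raw)  # structured sections are easy to check and reuse downstream
```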
"LLMs do better when generating a modular code:" "When clearly asking the model to: 'divide the generated code into small sub-functions, with meaningful names and functionality', we observe a better-produced code, with fewer bugs, and higher success rates for the iterative fixing stages."
"We add an extra step where, given the generated output, the model is asked to re-generate the same output, but correct it if needed. For example, given the generated AI tests as input, the model is asked to re-generate the same tests, while correcting wrong output, if exists. We found that this step of double validation, while encouraging the model to be critical and to reason, is more effective than asking a direct yes/no question: 'is this test correct?'"
"When we ask the model direct questions regarding complicated issues, we consistently see hallucinations and wrong answers. To counter this, we adopt a flow of gradual data accumulation, from easier tasks to harder ones."
"Even with double validation, some AI generated tests will be wrong. This makes iterations on them challenging -- when a test fails, how can we know if it is because the code is wrong, or because the test is wrong? When we ask the model directly 'who is wrong', we often see hallucinations, and may end up with a wrongly fixed code. To tackle this problem, we utilized a technique of test anchors."
The process is essentially: iterate first on the public tests, because we know they are correct; then iterate through the AI-generated tests one by one. If a test passes, add it to the list of "test anchors." If a test fails, assume it's because the code is incorrect and try to fix the code, but the fixed code must still pass all the tests on the list of "test anchors." In this manner the test anchors protect against breaking working code while "fixing" it.
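In code, the test-anchor loop might look roughly like this; the helper names run_tests and fix are illustrative, not taken from the AlphaCodium source.

```python
# A minimal sketch of the test-anchor loop. Assumed helpers:
# run_tests(code, tests) -> (passed, failure_output), fix(code, failure) -> code.
def iterate_with_anchors(code, public_tests, ai_tests, run_tests, fix):
    anchors = list(public_tests)               # public tests are trusted from the start
    for test in ai_tests:
        passed, failure = run_tests(code, [test])
        if passed:
            anchors.append(test)               # a passing AI test becomes a new anchor
            continue
        candidate = fix(code, failure)         # assume the code, not the test, is wrong
        if run_tests(candidate, anchors)[0]:   # accept a fix only if all anchors still pass
            code = candidate
    return code
```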
The researchers made note of several things that didn't work. They tried generating "execution traces" (the last 50 lines of code executed) and adding those to the prompt. That didn't help. They tried injecting multiple failed code solutions into the prompt instead of just one. That didn't help. They tried including information about the last change to the code (in git patch format) in the prompt. That didn't help. They tried doing a lot in one step with a large prompt, and that didn't help either.
What was the performance of the end result?
"GPT-4 accuracy increased from 19% with a single well-designed direct prompt to 44% with the AlphaCodium flow."
The paper has numbers showing AlphaCodium outperforming DeepMind's AlphaCode and another competitor called CodeChain. However, neither of those is open source, so they can only compare their system against the published results on the CodeContests dataset.
https://arxiv.org/abs/2401.08500
AlphaCodium is open source. Get the source code here: