Facebook researchers have developed a “neural transcompiler” named TransCoder AI, a system that converts source code from one high-level programming language (such as C++, Java, or Python) to another.
The source-to-source translator relies on unsupervised machine translation: it looks for previously undetected patterns in data sets without guiding labels and with minimal human intervention, which allows the researchers to train a fully unsupervised neural transcompiler.
TransCoder AI was trained on source code from GitHub, and the tool managed to translate functions between C++, Java, and Python with high accuracy.
Why is TransCoder AI needed?
Unlike traditional compilers, which translate source code from a high-level language to a lower-level one, TransCoder converts between programming languages that operate at a similar level of abstraction, such as from one high-level language to another. Migrating an existing codebase to a modern or more efficient language like Java or C++ requires expertise in both the source and target languages and is often costly. Using a transcompiler and manually adjusting the output source code may be a faster and cheaper solution than rewriting the entire codebase from scratch.
Testing and Accuracy of TransCoder
The Facebook researchers trained TransCoder on the GitHub public dataset, which contains over 2.8 million open source repositories, and focused on code translation at the function level. To validate TransCoder's performance, they used 852 parallel functions in all three languages (C++, Java, and Python) from GeeksforGeeks, an online platform that gathers coding problems and presents solutions.
In testing, TransCoder successfully understood the syntax specific to each language, learned data structures and their methods, and correctly aligned libraries across programming languages.
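To make the idea of aligning libraries across languages concrete, here is a small hand-written illustration. It is not actual TransCoder output: the Java snippet in the comment and the Python function `count_words` are both hypothetical, chosen only to show how a Java `HashMap` and its `getOrDefault` method correspond to a Python `dict` and its `get` method.

```python
# Hypothetical Java source (hand-written for illustration, not from the paper):
#
#   static Map<String, Integer> countWords(String[] words) {
#       Map<String, Integer> counts = new HashMap<>();
#       for (String w : words) {
#           counts.put(w, counts.getOrDefault(w, 0) + 1);
#       }
#       return counts;
#   }


def count_words(words):
    """Python equivalent: HashMap becomes dict, getOrDefault becomes dict.get."""
    counts = {}
    for w in words:
        # dict.get(key, default) plays the role of Map.getOrDefault(key, default).
        counts[w] = counts.get(w, 0) + 1
    return counts


print(count_words(["a", "b", "a"]))  # {'a': 2, 'b': 1}
```

A translator that only swapped syntax would leave `getOrDefault` in place, and the Python code would fail at run time. Mapping the method to its counterpart in the target language is what alignment means here.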
TransCoder achieved the following translation accuracy:
- C++ to Java at 74.8% accuracy.
- C++ to Python at 67.2% accuracy.
- Java to C++ at 91.6% accuracy.
- Java to Python at 68.7% accuracy.
- Python to Java at 56.1% accuracy.
- Python to C++ at 57.8% accuracy.
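These percentages are easier to interpret with a concrete scoring scheme in mind. The sketch below assumes a translation counts as correct when it returns the same outputs as the reference function on a set of test inputs. That criterion, and every name in the code (`behaves_the_same`, `accuracy`, the toy functions), is an assumption made for illustration, not the researchers' evaluation code.

```python
def behaves_the_same(reference, candidate, test_inputs):
    """Return True if candidate matches reference on every test input."""
    for args in test_inputs:
        expected = reference(*args)
        try:
            actual = candidate(*args)
        except Exception:
            # A translation that crashes counts as incorrect.
            return False
        if actual != expected:
            return False
    return True


def accuracy(results):
    """Percentage of translations judged correct."""
    return 100.0 * sum(results) / len(results)


# Hypothetical reference function and two hypothetical "translations" of it.
def reference_max(a, b):
    return a if a > b else b


def good_translation(a, b):
    return max(a, b)


def bad_translation(a, b):
    return a  # ignores b, so it is wrong whenever b > a


inputs = [(1, 2), (5, 3), (4, 4)]
results = [
    behaves_the_same(reference_max, good_translation, inputs),
    behaves_the_same(reference_max, bad_translation, inputs),
]
print(accuracy(results))  # 50.0
```

Under this scheme a translation is judged by its behavior, not by how closely its text matches a reference solution, so two differently written but equivalent functions both count as correct.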
The researchers noted that TransCoder still made mistakes, for example failing to account for certain variable types during generation. Even so, it outperformed frameworks that rely on rewrite rules manually built using expert knowledge.
According to the researchers, TransCoder can easily be generalized to any programming language, does not require any expert knowledge, and outperforms commercial solutions by a large margin. They add that the errors that occurred during testing can easily be fixed by adding simple constraints to the decoder to ensure that the generated functions are syntactically correct.
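As a rough illustration of enforcing syntactic correctness, the sketch below uses a simpler stand-in for decoder constraints: it filters a list of candidate translations after generation and keeps the first one that parses. It assumes the model produces several ranked Python candidates. The helper names and the candidate strings are hypothetical, while `ast.parse` comes from Python's standard library.

```python
import ast


def is_valid_python(source):
    """Return True if source parses as Python code."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


def first_valid_candidate(candidates):
    """Return the first syntactically valid candidate, or None if none parse."""
    for source in candidates:
        if is_valid_python(source):
            return source
    return None


# Hypothetical ranked outputs for one translated function.
candidates = [
    "def add(a, b) return a + b",        # missing colon: rejected
    "def add(a, b):\n    return a + b",  # parses: accepted
]
print(first_valid_candidate(candidates))
```

Filtering after the fact only discards invalid output. A constraint inside the decoder, as the researchers suggest, would prevent such output from being generated in the first place.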