EpiCoder: Encompassing Diversity and Complexity in Code Generation
- Yaoxiang Wang
- Haoling Li
- Xin Zhang
- Jie Wu
- Xiao Liu
- Wenxiang Hu
- Zhongxin Guo
- Yangyu Huang
- Ying Xin
- Yujiu Yang
- Jinsong Su
- Qi Chen
- Scarlett Li
ICML 2025
Existing methods for code generation use code snippets as seed data, which restricts the complexity and diversity of the synthesized data. In this paper, we introduce a novel feature tree-based synthesis framework that revolves around hierarchical code features derived from high-level abstractions of code. The feature tree is constructed from raw data and refined iteratively to increase the quantity and diversity of the extracted features, allowing it to capture more complex patterns and relationships within the code. By adjusting the depth and breadth of the sampled subtrees, our framework provides precise control over the complexity of the generated code, enabling functionality that ranges from function-level operations to multi-file scenarios. We fine-tuned widely used base models to obtain the EpiCoder series, which achieves state-of-the-art performance on multiple benchmarks at both the function and file levels. In particular, empirical evidence indicates that our approach shows significant potential for synthesizing repository-level code data. Our code and data are publicly available at https://github.com/microsoft/EpiCoder.
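To make the idea of controlling complexity through subtree depth and breadth more concrete, the following is a minimal sketch and not the EpiCoder implementation. The `FeatureNode` class, the `sample_subtree` function, the uniform random choice of children, and the toy feature names are all assumptions made for illustration; the actual framework may represent and sample features differently.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FeatureNode:
    """One node in a hypothetical feature tree (a named code feature)."""
    name: str
    children: list["FeatureNode"] = field(default_factory=list)


def sample_subtree(node, max_depth, max_breadth, rng):
    """Copy `node`, keeping at most `max_depth` levels below it and at most
    `max_breadth` uniformly chosen children per node (an assumed policy)."""
    if max_depth == 0 or not node.children:
        return FeatureNode(node.name)
    k = min(max_breadth, len(node.children))
    chosen = rng.sample(node.children, k)
    return FeatureNode(
        node.name,
        [sample_subtree(c, max_depth - 1, max_breadth, rng) for c in chosen],
    )


def depth(node):
    """Number of edges on the longest root-to-leaf path."""
    return 0 if not node.children else 1 + max(depth(c) for c in node.children)


def size(node):
    """Total number of nodes in the tree."""
    return 1 + sum(size(c) for c in node.children)


# Toy tree with invented feature names, for illustration only.
tree = FeatureNode("program", [
    FeatureNode("data processing", [
        FeatureNode("parsing", [FeatureNode("json"), FeatureNode("csv")]),
        FeatureNode("validation"),
    ]),
    FeatureNode("file operations", [FeatureNode("read"), FeatureNode("write")]),
    FeatureNode("networking", [FeatureNode("http request")]),
])

rng = random.Random(0)
# Shallow, narrow sample: a few top-level features, e.g. for a single function.
small = sample_subtree(tree, max_depth=1, max_breadth=2, rng=rng)
# Deep, wide sample: many nested features, e.g. for a multi-file task.
large = sample_subtree(tree, max_depth=3, max_breadth=3, rng=rng)
```

In this sketch, a small depth and breadth yields a handful of features that could seed a function-level task, while larger values yield a richer feature set that could seed a more complex, multi-file task.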