HARE: HumAn pRiors, a key to small language model Efficiency

Lingyun Zhang, Bin Jin, Gaojian Ge, Lunhui Liu, Xuewen Shen, Mingyong Wu, Houqian Zhang, Yongneng Jiang, Shiqi Chen, Shi Pu · June 17, 2024

Summary

The paper investigates the role of human priors in training small language models (SLMs) for resource-constrained environments. It highlights the limitations of web-scraped data and proposes a data construction principle that emphasizes high-quality, diverse, and consistent data. The authors introduce HARE-1.1B, an SLM trained under this principle, which outperforms state-of-the-art models on benchmark datasets. HARE-1.1B combines open-source data, semantically enriched synthetic data, and task-specific data, yielding better generalization and reduced data leakage. The study also addresses data decontamination, showing that HARE carries a lower risk of benchmark leakage than other models. HARE's results on the Open LLM Leaderboard and in API-calling tasks demonstrate its efficiency and practicality, particularly on mobile devices. The research suggests that incorporating human priors into model training can improve both performance and integrity, with future work directed at network architecture and regularization.
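To make the data-decontamination idea concrete, below is a minimal sketch of n-gram overlap filtering, a common way to check training data against benchmark test sets. This is an illustrative assumption, not the paper's actual procedure: the function names, the n-gram size, and the word-level tokenization are all hypothetical choices.

```python
# Hypothetical sketch of n-gram-based benchmark decontamination.
# The exact method, n-gram size, and thresholds used for HARE are not
# specified here; this only illustrates the general technique.

from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams for a piece of text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(
    training_docs: Iterable[str],
    benchmark_docs: Iterable[str],
    n: int = 13,
) -> List[str]:
    """Drop training documents sharing any n-gram with a benchmark document."""
    benchmark_grams: Set[Tuple[str, ...]] = set()
    for doc in benchmark_docs:
        benchmark_grams |= ngrams(doc, n)

    kept = []
    for doc in training_docs:
        if ngrams(doc, n).isdisjoint(benchmark_grams):
            kept.append(doc)
    return kept


if __name__ == "__main__":
    train = [
        "the quick brown fox jumps over the lazy dog near the river bank today",
        "an unrelated training sentence about small language model data quality",
    ]
    bench = [
        "the quick brown fox jumps over the lazy dog near the river bank today",
    ]
    clean = decontaminate(train, bench, n=8)
    print(len(clean), "training documents kept after decontamination")
```

The key design choice in filters of this kind is the n-gram length: short n-grams over-remove ordinary phrases, while very long ones miss lightly paraphrased benchmark text.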
