
Questions about the pre-training process #2

Open
echo-valor opened this issue Dec 6, 2023 · 2 comments

Comments

@echo-valor

I have read your team's paper carefully; very nice work! I still have three questions I'd like to ask.

Questions about the pre-training process:
Q1: Why does the first stage make a pass over only the high-quality pre-training data before training on a mix of high- and low-quality data? What is the rationale for this design?
Q2: After the training described above, there is another round of training on the high-quality data, so the high-quality data is effectively trained on for about three epochs over the whole process. Isn't that too much training? What is the intuition behind it?
Q3: I tried out the chat model your team has deployed, and it currently seems to refuse to answer in Chinese. How is this refusal to answer in particular languages implemented, and why is it necessary?
Instruction:你好
SeaLLMs:Sorry, the language you have asked is currently not supported. If you have questions in other supported languages, I'll be glad to help. Please also consider clearing the chat box for a better experience.

@IsakZhang
Contributor

Hi, thanks for your interest in our work!

A1 & A2: This is mainly a data-volume consideration. Multilingual pre-training data, and especially high-quality data for low-resource languages, is often quite limited. After exploratory experiments we settled on a three-stage pre-training scheme so as to make the fullest possible use of each language's high-quality data. Some of our experimental results also show that data quality matters a great deal (at certain stages more than quantity), so the repeated passes over the high-quality data are not a problem.
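For readers unfamiliar with staged data schedules, the sketch below illustrates the general idea of the three stages described above (high-quality only, then a high/low-quality mix, then high-quality again). The corpus names and mixing weights are placeholders for illustration, not the actual SeaLLMs configuration.

```python
# Hypothetical three-stage pre-training data schedule, mirroring the idea
# described above: stage 1 uses only high-quality data, stage 2 mixes
# high- and low-quality corpora, stage 3 returns to high-quality data.
# Corpus names and weights are made up for illustration.
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    mixture: dict  # corpus name -> sampling weight within this stage


STAGES = [
    Stage("stage1_high_quality_only", {"hq_multilingual": 1.0}),
    Stage("stage2_mixed", {"hq_multilingual": 0.5, "lq_webcrawl": 0.5}),
    Stage("stage3_high_quality_only", {"hq_multilingual": 1.0}),
]


def schedule(stages):
    """Yield (stage name, corpus, sampling fraction) in training order."""
    for stage in stages:
        total = sum(stage.mixture.values())
        for corpus, weight in stage.mixture.items():
            yield stage.name, corpus, weight / total


for stage_name, corpus, frac in schedule(STAGES):
    print(f"{stage_name}: draw {frac:.0%} of batches from {corpus}")
```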

A3: We are still testing the model's safety capabilities in Chinese, so Chinese responses are not supported for the time being.

@echo-valor
Author

Thanks for your reply. Is the refusal to answer in Chinese implemented through alignment? Would you be able to share details on this?
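For context only: the thread does not say how the refusal is implemented. Besides learning the behavior through alignment data, one hypothetical approach is a serving-time gate that checks the prompt's language and returns a canned refusal before the prompt reaches the model. The sketch below uses a crude CJK-character check and placeholder names; it is an illustration of that idea, not the actual SeaLLMs mechanism.

```python
# Hypothetical serving-time language gate; NOT the actual SeaLLMs mechanism,
# which is not described in this thread. The same behavior could instead be
# learned through alignment data rather than hard-coded like this.

REFUSAL = (
    "Sorry, the language you have asked is currently not supported. "
    "If you have questions in other supported languages, I'll be glad to help."
)


def contains_cjk(text: str) -> bool:
    """Crude check: True if the text contains CJK Unified Ideographs."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def respond(prompt: str, model_fn) -> str:
    """Refuse unsupported-language prompts before calling the model."""
    if contains_cjk(prompt):
        return REFUSAL
    return model_fn(prompt)


# Example: respond("你好", lambda p: "...") returns the canned refusal.
```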
