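Chinese Pre-trained ALBERT Models Are Here: A Small Model Tops GLUE, and the Base Model Is 10x Smaller and Twice as Fast
Masked Language Model Accuracy, Sentence-Order Prediction Accuracy & Training Time
Note: values marked ? will be replaced soon.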
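Configuration of Models
Implementation and Code Testing
Run the following command to test the main changes, including but not limited to factorized embedding parameterization, cross-layer parameter sharing, and the segment-continuity (sentence-order prediction) task.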
python test_changes.py
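As a rough illustration of the first change listed above, factorized embedding parameterization, the sketch below compares the parameter count of a single V x H embedding table with the factorized form (a V x E table followed by an E x H projection). The vocabulary size (21128), embedding size (128), and hidden size (768) are assumed values chosen for illustration, not read from this project's configuration files, and the function names are hypothetical.

```python
def full_embedding_params(vocab_size, hidden_size):
    """Parameters in a single vocab_size x hidden_size embedding table."""
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size, embedding_size, hidden_size):
    """Parameters in a vocab_size x embedding_size table plus an
    embedding_size x hidden_size projection (bias terms ignored)."""
    return vocab_size * embedding_size + embedding_size * hidden_size


if __name__ == "__main__":
    # Assumed sizes, for illustration only.
    V, E, H = 21128, 128, 768
    full = full_embedding_params(V, H)
    factorized = factorized_embedding_params(V, E, H)
    print(f"full: {full:,}  factorized: {factorized:,}  "
          f"ratio: {full / factorized:.1f}x")
```

The saving grows with the hidden size, because only the small E x H projection depends on H once the table is factorized.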
Pre-training
Generate tfrecords Files
Run the following command. The project ships with a sample text file (data/news_zh_1.txt).
bash create_pretrain_data.sh
If you have many text files, you can pass arguments to the script to generate multiple tfrecords files.
Run Pre-training on GPU/TPU
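When there are many input files, it can help to plan the input-to-output mapping before generating anything. The sketch below only builds that mapping and does not call the data-generation script, whose arguments are not documented here. The `tf_<name>.tfrecord` output naming is an assumption, chosen so the results would match a `./data/tf*.tfrecord` glob. The function name and the second input file are hypothetical.

```python
from pathlib import PurePosixPath


def plan_tfrecord_jobs(text_files, out_dir="data"):
    """Pair each input text file with a tf_<name>.tfrecord output path."""
    jobs = []
    for name in sorted(text_files):
        src = PurePosixPath(name)
        # Assumed naming scheme: data/news_zh_1.txt -> data/tf_news_zh_1.tfrecord
        dst = PurePosixPath(out_dir) / f"tf_{src.stem}.tfrecord"
        jobs.append((str(src), str(dst)))
    return jobs


if __name__ == "__main__":
    # Hypothetical inputs following the data/news_zh_1.txt naming pattern.
    inputs = ["data/news_zh_2.txt", "data/news_zh_1.txt"]
    for src, dst in plan_tfrecord_jobs(inputs):
        print(f"{src} -> {dst}")
```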
GPU:
export BERT_BASE_DIR=albert_config
nohup python3 run_pretraining.py --input_file=./data/tf*.tfrecord \
--output_dir=my_new_model_path --do_train=True --do_eval=True --bert_config_file=$BERT_BASE_DIR/albert_config_xxlarge.json \
--train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=76 \
--num_train_steps=125000 --num_warmup_steps=12500 --learning_rate=0.00176 \
--save_checkpoints_steps=2000 --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt &

TPU, add the following flags:
--use_tpu=True --tpu_name=grpc://10.240.1.66:8470 --tpu_zone=us-central1-a

Notes:
If you train from scratch, you can omit init_checkpoint.
If you continue training from an existing model, set BERT_BASE_DIR to its path and make sure that bert_config_file and init_checkpoint both point to the corresponding files.
For domain-specific pre-training, depending on the size of the data, you do not need to train for very long.
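To see how num_train_steps, num_warmup_steps, and learning_rate interact, the sketch below evaluates a BERT-style schedule: linear warmup followed by linear decay to zero. The text above does not say which schedule run_pretraining.py actually uses, so treat this as an assumption. The function name is hypothetical and the defaults are simply the values from the command.

```python
def learning_rate_at(step, base_lr=0.00176, num_train_steps=125000,
                     num_warmup_steps=12500):
    """Learning rate at a given step under an assumed BERT-style schedule."""
    if step < num_warmup_steps:
        # Linear warmup from 0 toward base_lr.
        return base_lr * step / num_warmup_steps
    # Linear decay of base_lr to 0 over the whole run.
    remaining = max(num_train_steps - step, 0)
    return base_lr * remaining / num_train_steps


if __name__ == "__main__":
    for step in (0, 6250, 12500, 62500, 125000):
        print(step, learning_rate_at(step))
```

With these values, warmup covers the first 10% of training (12500 of 125000 steps).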
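Fine-tuning on Downstream Tasks
Take fine-tuning an ALBERT model such as albert_base on the LCQMC task as an example. LCQMC is a corpus of colloquial Chinese sentence pairs, and the task is to predict whether the two sentences in a pair are semantically similar.
Download the LCQMC dataset, which contains training, validation, and test sets. The training set contains about 240,000 colloquial Chinese sentence pairs, each labeled 1 or 0: 1 means the two sentences are semantically similar, and 0 means they are not.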
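The sentence-pair format described above can be checked with a small parser before fine-tuning. This sketch assumes each line holds two sentences and a 0/1 label separated by tabs. That layout is an assumption about the downloaded files, and the two sample lines are made up for illustration rather than taken from LCQMC.

```python
from collections import Counter


def parse_pair_line(line):
    """Split one 'sentence1<TAB>sentence2<TAB>label' line into typed fields."""
    sentence1, sentence2, label = line.rstrip("\n").split("\t")
    if label not in ("0", "1"):
        raise ValueError(f"unexpected label: {label!r}")
    return sentence1, sentence2, int(label)


# Made-up sample lines in the assumed tab-separated layout.
SAMPLE = (
    "今天天气怎么样\t今天天气如何\t1\n"
    "今天天气怎么样\t明天你去哪里\t0\n"
)

if __name__ == "__main__":
    pairs = [parse_pair_line(line) for line in SAMPLE.splitlines()]
    # Count how many pairs carry each label.
    print(Counter(label for _, _, label in pairs))
```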
Run the following commands to fine-tune on the LCQMC dataset:
1. Clone this project:
git clone https://github.com/brightmart/albert_zh.git
2. Fine-tune by running the following command:
export BERT_BASE_DIR=./albert_large_zh
export TEXT_DIR=./lcqmc
nohup python3 run_classifier.py --task_name=lcqmc_pair --do_train=true --do_eval=true --data_dir=$TEXT_DIR --vocab_file=./albert_config/vocab.txt \
--bert_config_file=./albert_config/albert_config_large.json --max_seq_length=128 --train_batch_size=64 --learning_rate=2e-5 --num_train_epochs=3 \
--output_dir=albert_large_lcqmc_checkpoints --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt &

Note: you need to download the pre-trained Chinese ALBERT model and place it in the current project (the directory is assumed to be named albert_large_zh). You also need to download the LCQMC dataset and place it in the current project (the directory is assumed to be named lcqmc).
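For a sense of how long this fine-tuning run is, the sketch below estimates the number of optimizer steps from the figures given here: roughly 240,000 training pairs, a batch size of 64, and 3 epochs. The formula mirrors the usual BERT-style run_classifier.py convention. The 10% warmup proportion is an assumed default that is not stated in this document, and the function name is hypothetical.

```python
def fine_tuning_steps(num_examples, batch_size, num_epochs,
                      warmup_proportion=0.1):
    """Return (num_train_steps, num_warmup_steps) for a fine-tuning run."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    # warmup_proportion is an assumption, not a value from this document.
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


if __name__ == "__main__":
    # Approximate training-set size; the exact count depends on the download.
    steps, warmup = fine_tuning_steps(240000, batch_size=64, num_epochs=3)
    print(f"train steps: {steps}, warmup steps: {warmup}")
```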
QQ group for technical exchange and discussion: 836811304
If you have any questions, you can raise an issue or send me an email: brightmart@hotmail.com.
It is not yet clear how to use ALBERT with PyTorch; if you know how, please email us or open an issue.
You can also send a pull request to report your performance on your own task, or to add methods for loading the models in PyTorch, and so on.
If you have ideas for producing a better-performing pre-trained Chinese model, please let me know as well.
Research supported with Cloud TPUs from Google’s TensorFlow Research Cloud (TFRC)
Cite Us
Bright Liang Xu, albert_zh, (2019), GitHub repository, https://github.com/brightmart/albert_zh
Reference
1. ALBERT: A Lite BERT For Self-Supervised Learning Of Language Representations
2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
3. SpanBERT: Improving Pre-training by Representing and Predicting Spans
4. RoBERTa: A Robustly Optimized BERT Pretraining Approach
5. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes (LAMB)
6. LAMB Optimizer, TensorFlow version
7. Small Pre-trained Models Can Also Win 13 NLP Tasks: ALBERT's Three Major Changes Top the GLUE Benchmark