Chinese Pre-trained ALBERT Is Here: A Lite Model Tops GLUE, with the Base Version 10x Smaller and Twice as Fast

Masked Language Model & Sentence Prediction Accuracy, and Training Time

Note: ? will be replaced soon.

Model Parameters and Configuration

Implementation and Code Testing

Run the following command to test the main improvements, including but not limited to factorized embedding parameterization, cross-layer parameter sharing, and the sentence-order (segment coherence) prediction task:

    python test_changes.py
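
As a rough illustration of the first two points (this is a sketch, not the project's actual implementation in test_changes.py), the NumPy snippet below shows why factorizing the embedding matrix and sharing parameters across layers cut the parameter count; the vocabulary, embedding, hidden, and layer sizes are assumed values chosen for illustration.

    import numpy as np

    V, E, H, L = 21128, 128, 4096, 12   # assumed vocab, embedding, hidden, layer sizes

    # Factorized embedding parameterization: a V x E lookup plus an E x H projection
    # replaces a single V x H embedding matrix.
    word_embeddings = np.random.randn(V, E) * 0.02
    embedding_projection = np.random.randn(E, H) * 0.02
    print("factorized embedding params:", V * E + E * H, "vs unfactorized:", V * H)

    # Cross-layer parameter sharing: every layer reuses the same weights.
    shared_weight = np.random.randn(H, H) * 0.02
    hidden = word_embeddings[[10, 20, 30]] @ embedding_projection   # toy token ids
    for _ in range(L):
        hidden = np.tanh(hidden @ shared_weight)
    print("output shape:", hidden.shape)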

Pre-training

Generate tfrecords Files

Simply run the following command. The project includes a sample text file (data/news_zh_1.txt):

   bash create_pretrain_data.sh

If you have many text files, you can pass them in as arguments to generate multiple tfrecords files.
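
For example, a small driver script can loop over the files for you. This is only a sketch: it assumes create_pretrain_data.sh wraps a BERT-style create_pretraining_data.py with --input_file/--output_file/--vocab_file/--max_seq_length flags, so check the script in this repository for the exact flag names before using it.

    import glob
    import subprocess

    # Build one tfrecord file per raw text file so the outputs match the
    # ./data/tf*.tfrecord glob used by run_pretraining.py below.
    for i, txt_file in enumerate(sorted(glob.glob("data/*.txt"))):
        subprocess.run(
            ["python3", "create_pretraining_data.py",
             "--input_file", txt_file,
             "--output_file", "data/tf_examples_{}.tfrecord".format(i),
             "--vocab_file", "albert_config/vocab.txt",
             "--max_seq_length", "512"],
            check=True,
        )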

Run Pre-training on GPU/TPU

GPU:
    export BERT_BASE_DIR=albert_config
    nohup python3 run_pretraining.py --input_file=./data/tf*.tfrecord \
    --output_dir=my_new_model_path --do_train=True --do_eval=True --bert_config_file=$BERT_BASE_DIR/albert_config_xxlarge.json \
    --train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=76 \
    --num_train_steps=125000 --num_warmup_steps=12500 --learning_rate=0.00176 \
    --save_checkpoints_steps=2000 --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt &

For TPU, add the following flags:
    --use_tpu=True  --tpu_name=grpc://10.240.1.66:8470 --tpu_zone=us-central1-a

Note: if you are training from scratch, you can leave out init_checkpoint;
if you are continuing from an existing model, set the BERT_BASE_DIR path and make sure the bert_config_file and init_checkpoint arguments point to matching files;
for domain-specific pre-training, depending on the amount of data, you may not need to train for very long.
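
Before launching a long run, it can help to confirm that bert_config_file and init_checkpoint really do correspond. Here is a minimal sketch, assuming TensorFlow 1.x as used by this repository and a BERT-style config JSON with a hidden_size field:

    import json
    import tensorflow as tf

    checkpoint = "albert_config/bert_model.ckpt"                  # your init_checkpoint
    config = json.load(open("albert_config/albert_config_xxlarge.json"))

    # List the first few variables stored in the checkpoint and compare their
    # shapes with the hidden size declared in the config file.
    for name, shape in tf.train.list_variables(checkpoint)[:10]:
        print(name, shape)
    print("hidden_size in config:", config["hidden_size"])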

Fine-tuning on Downstream Tasks

Take fine-tuning albert_base on the LCQMC task as an example. LCQMC is a corpus of colloquial Chinese sentence pairs, used to train and evaluate prediction of the semantic similarity of a pair of sentences.

Download the LCQMC dataset, which contains training, dev, and test splits. The training set holds 240,000 colloquial Chinese sentence pairs labeled 1 or 0, where 1 means the two sentences are semantically similar and 0 means they are not.
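
A quick way to sanity-check the data is to print a few pairs. The sketch below assumes the common tab-separated layout sentence1<TAB>sentence2<TAB>label and a file named lcqmc/train.txt; adjust both to the copy you actually download.

    with open("lcqmc/train.txt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            sent_a, sent_b, label = line.rstrip("\n").split("\t")
            print(sent_a, "|", sent_b, "->", "similar" if label == "1" else "not similar")
            if i == 4:   # show the first five pairs only
                break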

Fine-tune on the LCQMC dataset by running the following commands:

1. Clone this project:

      git clone https://github.com/brightmart/albert_zh.git

2. Fine-tune by running the following command:

    export BERT_BASE_DIR=./albert_large_zh
    export TEXT_DIR=./lcqmc
    nohup python3 run_classifier.py --task_name=lcqmc_pair --do_train=true --do_eval=true --data_dir=$TEXT_DIR --vocab_file=./albert_config/vocab.txt \
    --bert_config_file=./albert_config/albert_config_large.json --max_seq_length=128 --train_batch_size=64   --learning_rate=2e-5  --num_train_epochs=3 \
    --output_dir=albert_large_lcqmc_checkpoints --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt &

Notice:
    You need to download the pre-trained Chinese ALBERT model and put it in the current project directory (assumed here to be named albert_large_zh); you also need to download the LCQMC dataset and put it in the current project directory (assumed here to be named lcqmc).
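
Once the job finishes, BERT-style run_classifier.py scripts normally write an eval_results.txt of "key = value" lines into --output_dir. Assuming this project does the same, a small sketch for reading the metrics:

    metrics = {}
    with open("albert_large_lcqmc_checkpoints/eval_results.txt", encoding="utf-8") as f:
        for line in f:
            key, _, value = line.partition("=")
            metrics[key.strip()] = float(value)
    print("eval_accuracy:", metrics.get("eval_accuracy"))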

Join us on QQ group 836811304 for technical discussion and questions.

If you have any questions, you can raise an issue or send me an email: brightmart@hotmail.com.

It is not yet clear how to use the PyTorch version of ALBERT; if you know how to do that, please email us or open an issue.

You can also send a pull request to report your performance on your task, or to add instructions on how to load the models in PyTorch, and so on.

If you have ideas for producing the best-performing pre-trained Chinese model, please let me know as well.

Research supported with Cloud TPUs from Google’s TensorFlow Research Cloud (TFRC)

Cite Us

Bright Liang Xu, albert_zh, (2019), GitHub repository, https://github.com/brightmart/albert_zh

References

1. ALBERT: A Lite BERT For Self-Supervised Learning Of Language Representations

2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

3. SpanBERT: Improving Pre-training by Representing and Predicting Spans

4. RoBERTa: A Robustly Optimized BERT Pretraining Approach

5. Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes (LAMB)

6. LAMB Optimizer, TensorFlow version

7. A Small Pre-trained Model Can Also Handle 13 NLP Tasks: ALBERT's Three Major Changes Top the GLUE Benchmark

Related articles:

Summary of the second-place solution for element recognition in the 法研杯 (CAIL) competition: multi-label classification in practice and a comparison of results

RoBERTa for Chinese: a large-scale Chinese pre-trained RoBERTa model

[NLP] A quick read of ALBERT