1. Transfer Learning Theory

Pretrained model: A pretrained model is usually a large model with a complex network structure and a large number of parameters, trained on a sufficiently large dataset. In NLP, pretrained models are typically language models: language-model training is unsupervised, so large-scale corpora can be used, and language models underlie many classic NLP tasks such as machine translation, text generation, and reading comprehension. Common pretrained models include BERT, GPT, RoBERTa, and Transformer-XL.
Fine-tuning: Starting from a given pretrained model, changing some of its parameters or adding new output structure to it, then training on a small dataset so that the whole model better fits a specific task.
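To make the idea concrete, below is a minimal sketch (not from the original text) of the fine-tuning pattern: reuse a pretrained encoder, attach a new task-specific output layer, and update only a small set of parameters. The encoder interface, hidden size, and label count are illustrative assumptions.

import torch
import torch.nn as nn

class FineTuneClassifier(nn.Module):
    # Assume `pretrained_encoder` maps a (batch, seq_len) tensor of token ids
    # to (batch, seq_len, hidden_size) features, e.g. a BERT-base-style encoder.
    def __init__(self, pretrained_encoder, hidden_size=768, num_labels=2):
        super().__init__()
        self.encoder = pretrained_encoder
        # Newly added output structure (the task "head").
        self.classifier = nn.Linear(hidden_size, num_labels)
        # Freeze the pretrained parameters; only the new head is trained,
        # one common (and cheap) form of fine-tuning.
        for p in self.encoder.parameters():
            p.requires_grad = False

    def forward(self, token_ids):
        features = self.encoder(token_ids)      # (batch, seq_len, hidden)
        sentence_vec = features.mean(dim=1)     # simple mean pooling
        return self.classifier(sentence_vec)    # (batch, num_labels)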
Fine-tuning script: the code file that implements the fine-tuning process. Such a script should include loading the pretrained model, selecting which parameters to fine-tune, and modifying the output structure. Since fine-tuning is itself a training process, it also requires hyperparameter settings and the choice of a loss function and optimizer, so a fine-tuning script usually covers the whole transfer-learning procedure.
Notes on fine-tuning scripts: In principle, fine-tuning scripts should be written by developers for their own task types. However, because the NLP task types currently studied (classification, extraction, generation) and their corresponding fine-tuning output structures are limited, and some fine-tuning recipes have been validated on many datasets, ready-made standard scripts can also be used.
Two transfer styles: The first is to use the pretrained model directly for the same task it was trained on, with no changes to parameters or model structure; such models work out of the box. This generally only suits universal tasks, e.g. the pretrained word-vector models in the fastText toolkit. To achieve this out-of-the-box effect, many pretrained-model authors also save the parts of a model structure as separate pretrained models and provide matching loading methods for specific goals.
The more mainstream transfer style exploits the pretrained model's ability to extract abstract features, then adapts it to different tasks through fine-tuning, i.e. training that updates a small subset of parameters. This style requires a small amount of labeled data for supervised learning.
Note on transfer styles: Direct use of a pretrained model was already covered in the fastText word-vector transfer material. The transfer-learning practice below focuses on transfer via fine-tuning.
2. Standard NLP Datasets

About the GLUE benchmark: GLUE was jointly released by New York University, the University of Washington, and Google. It covers different NLP task types; as of January 2020 it comprised 11 sub-datasets, and it has become a standard yardstick for progress in NLP research.
The GLUE collection contains the following datasets:

CoLA
SST-2
MRPC
STS-B
QQP
MNLI
SNLI
QNLI
RTE
WNLI
diagnostics (not yet finalized officially)

How to download the GLUE datasets:
Download script:
''' Script for downloading all GLUE data.'''
import os
import sys
import shutil
import argparse
import tempfile
import urllib.request
import zipfile

TASKS = ["CoLA", "SST", "MRPC", "QQP", "STS", "MNLI", "SNLI", "QNLI", "RTE", "WNLI", "diagnostic"]
TASK2PATH = {
    "CoLA": 'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data/CoLA.zip?alt=media&token=46d5e637-3411-4188-bc44-5809b5bfb5f4',
    "SST": 'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data/SST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8',
    "MRPC": 'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data/mrpc_dev_ids.tsv?alt=media&token=ec5c0836-31d5-48f4-b431-7480817f1adc',
    "QQP": 'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data/QQP.zip?alt=media&token=700c6acf-160d-4d89-81d1-de4191d02cb5',
    "STS": 'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data/STS-B.zip?alt=media&token=bddb94a7-8706-4e0d-a694-1109e12273b5',
    "MNLI": 'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data/MNLI.zip?alt=media&token=50329ea1-e339-40e2-809c-10c40afff3ce',
    "SNLI": 'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data/SNLI.zip?alt=media&token=4afcfbb2-ff0c-4b2d-a09a-dbf07926f4df',
    "QNLI": 'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data/QNLIv2.zip?alt=media&token=6fdcf570-0fc5-4631-8456-9505272d1601',
    "RTE": 'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data/RTE.zip?alt=media&token=5efa7e85-a0bb-4f19-8ea2-9e1840f077fb',
    "WNLI": 'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data/WNLI.zip?alt=media&token=068ad0a0-ded7-4bd7-99a5-5e00222e0faf',
    "diagnostic": 'https://storage.googleapis.com/mtl-sentence-representations.appspot.com/tsvsWithoutLabels/AX.tsv?GoogleAccessId=firebase-adminsdk-0khhl@mtl-sentence-representations.iam.gserviceaccount.com&Expires=2498860800&Signature=DuQ2CSPt2Yfre0C+iISrVYrIFaZH1Lc7hBVZDD4ZyR7fZYOMNOUGpi8QxBmTNOrNPjR3z1cggo7WXFfrgECP6FBJSsURv8Ybrue8Ypt/TPxbuJ0Xc2FhDi+arnecCBFO77RSbfuz+s95hRrYhTnByqu3U/YZPaj3tZt5QdfpH2IUROY8LiBXoXS46LE/gOQc/KN+A9SoscRDYsnxHfG0IjXGwHN+f88q6hOmAxeNPx6moDulUF6XMUAaXCSFU+nRO2RDL9CapWxj+Dl7syNyHhB7987hZ80B/wFkQ3MEs8auvt5XW1+d4aCU7ytgM69r8JDCwibfhZxpaa4gd50QXQ=='
}
MRPC_TRAIN = 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt'
MRPC_TEST = 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt'

def download_and_extract(task, data_dir):
    print("Downloading and extracting %s..." % task)
    data_file = "%s.zip" % task
    urllib.request.urlretrieve(TASK2PATH[task], data_file)
    with zipfile.ZipFile(data_file) as zip_ref:
        zip_ref.extractall(data_dir)
    os.remove(data_file)
    print("\tCompleted!")

def format_mrpc(data_dir, path_to_data):
    print("Processing MRPC...")
    mrpc_dir = os.path.join(data_dir, "MRPC")
    if not os.path.isdir(mrpc_dir):
        os.mkdir(mrpc_dir)
    if path_to_data:
        mrpc_train_file = os.path.join(path_to_data, "msr_paraphrase_train.txt")
        mrpc_test_file = os.path.join(path_to_data, "msr_paraphrase_test.txt")
    else:
        print("Local MRPC data not specified, downloading data from %s" % MRPC_TRAIN)
        mrpc_train_file = os.path.join(mrpc_dir, "msr_paraphrase_train.txt")
        mrpc_test_file = os.path.join(mrpc_dir, "msr_paraphrase_test.txt")
        urllib.request.urlretrieve(MRPC_TRAIN, mrpc_train_file)
        urllib.request.urlretrieve(MRPC_TEST, mrpc_test_file)
    assert os.path.isfile(mrpc_train_file), "Train data not found at %s" % mrpc_train_file
    assert os.path.isfile(mrpc_test_file), "Test data not found at %s" % mrpc_test_file
    urllib.request.urlretrieve(TASK2PATH["MRPC"], os.path.join(mrpc_dir, "dev_ids.tsv"))

    dev_ids = []
    with open(os.path.join(mrpc_dir, "dev_ids.tsv"), encoding="utf8") as ids_fh:
        for row in ids_fh:
            dev_ids.append(row.strip().split('\t'))

    with open(mrpc_train_file, encoding="utf8") as data_fh, \
         open(os.path.join(mrpc_dir, "train.tsv"), 'w', encoding="utf8") as train_fh, \
         open(os.path.join(mrpc_dir, "dev.tsv"), 'w', encoding="utf8") as dev_fh:
        header = data_fh.readline()
        train_fh.write(header)
        dev_fh.write(header)
        for row in data_fh:
            label, id1, id2, s1, s2 = row.strip().split('\t')
            if [id1, id2] in dev_ids:
                dev_fh.write("%s\t%s\t%s\t%s\t%s\n" % (label, id1, id2, s1, s2))
            else:
                train_fh.write("%s\t%s\t%s\t%s\t%s\n" % (label, id1, id2, s1, s2))

    with open(mrpc_test_file, encoding="utf8") as data_fh, \
         open(os.path.join(mrpc_dir, "test.tsv"), 'w', encoding="utf8") as test_fh:
        header = data_fh.readline()
        test_fh.write("index\t#1 ID\t#2 ID\t#1 String\t#2 String\n")
        for idx, row in enumerate(data_fh):
            label, id1, id2, s1, s2 = row.strip().split('\t')
            test_fh.write("%d\t%s\t%s\t%s\t%s\n" % (idx, id1, id2, s1, s2))
    print("\tCompleted!")

def download_diagnostic(data_dir):
    print("Downloading and extracting diagnostic...")
    if not os.path.isdir(os.path.join(data_dir, "diagnostic")):
        os.mkdir(os.path.join(data_dir, "diagnostic"))
    data_file = os.path.join(data_dir, "diagnostic", "diagnostic.tsv")
    urllib.request.urlretrieve(TASK2PATH["diagnostic"], data_file)
    print("\tCompleted!")
    return

def get_tasks(task_names):
    task_names = task_names.split(',')
    if "all" in task_names:
        tasks = TASKS
    else:
        tasks = []
        for task_name in task_names:
            assert task_name in TASKS, "Task %s not found!" % task_name
            tasks.append(task_name)
    return tasks

def main(arguments):
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_dir', help='directory to save data to', type=str, default='glue_data')
    parser.add_argument('--tasks', help='tasks to download data for as a comma separated string',
                        type=str, default='all')
    parser.add_argument('--path_to_mrpc',
                        help='path to directory containing extracted MRPC data, msr_paraphrase_train.txt and msr_paraphrase_text.txt',
                        type=str, default='')
    args = parser.parse_args(arguments)

    if not os.path.isdir(args.data_dir):
        os.mkdir(args.data_dir)
    tasks = get_tasks(args.tasks)

    for task in tasks:
        if task == 'MRPC':
            format_mrpc(args.data_dir, args.path_to_mrpc)
        elif task == 'diagnostic':
            download_diagnostic(args.data_dir)
        else:
            download_and_extract(task, args.data_dir)

if __name__ == '__main__':
    sys.exit(main(sys.argv[1:]))
Run the script to download all the datasets:
# assuming you have copied the code above into download_glue_data.py,
# run the script; a glue_data folder will appear in the same directory
python download_glue_data.py
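The script also accepts a comma-separated task list, so a subset can be fetched without downloading everything. A small sketch, assuming the file above is importable as download_glue_data (flag names are taken from the argparse setup in the script):

# equivalent to: python download_glue_data.py --data_dir glue_data --tasks CoLA,SST
import download_glue_data

# call the script's main() directly with an argument list
download_glue_data.main(['--data_dir', 'glue_data', '--tasks', 'CoLA,SST'])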
Output:
Downloading and extracting CoLA...
    Completed!
Downloading and extracting SST...
    Completed!
Processing MRPC...
Local MRPC data not specified, downloading data from https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt
    Completed!
Downloading and extracting QQP...
    Completed!
Downloading and extracting STS...
    Completed!
Downloading and extracting MNLI...
    Completed!
Downloading and extracting SNLI...
    Completed!
Downloading and extracting QNLI...
    Completed!
Downloading and extracting RTE...
    Completed!
Downloading and extracting WNLI...
    Completed!
Downloading and extracting diagnostic...
    Completed!

Sub-dataset formats and task types in the GLUE collection
CoLA dataset file layout
- CoLA/
    - dev.tsv
    - original/
    - test.tsv
    - train.tsv
File layout notes:
The files you will typically use are train.tsv, dev.tsv, and test.tsv: the training, validation, and test sets respectively. train.tsv and dev.tsv share the same format and both carry labels; test.tsv is unlabeled.

train.tsv sample:
...
gj04 1 She coughed herself awake as the leaf landed on her nose.
gj04 1 The worm wriggled onto the carpet.
gj04 1 The chocolate melted onto the carpet.
gj04 0 * The ball wriggled itself loose.
gj04 1 Bill wriggled himself loose.
bc01 1 The sinking of the ship to collect the insurance was very devious.
bc01 1 The ship's sinking was very devious.
bc01 0 * The ship's sinking to collect the insurance was very devious.
bc01 1 The testing of such drugs on oneself is too risky.
bc01 0 * This drug's testing on oneself is too risky.
...
train.tsv format notes:
train.tsv has 4 columns. The first column, e.g. gj04 or bc01, identifies the source publication of each sentence. The second column, 0 or 1, marks whether the sentence is grammatically acceptable: 0 means unacceptable, 1 means acceptable. The third column, '*', is the author's original acceptability annotation and carries the same meaning as the second column, with '*' marking an unacceptable sentence. The fourth column is the sentence whose grammaticality is being judged.
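A quick way to load the file for inspection — a sketch assuming pandas is installed and the file sits under glue_data/CoLA/:

import csv
import pandas as pd

# CoLA's train.tsv has no header row; name the four columns described above.
# quoting=csv.QUOTE_NONE keeps raw quotes in the sentences intact.
df = pd.read_csv("glue_data/CoLA/train.tsv", sep="\t", header=None,
                 names=["source", "label", "author_label", "sentence"],
                 quoting=csv.QUOTE_NONE)
print(df["label"].value_counts())   # class balance of acceptable vs. unacceptable
print(df.head())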
test.tsv sample:

index sentence
0 Bill whistled past the house.
1 The car honked its way down the road.
2 Bill pushed Harry off the sofa.
3 the kittens yawned awake and played.
4 I demand that the more John eats, the more he pay.
5 If John eats more, keep your mouth shut tighter, OK?
6 His expectations are always lower than mine are.
7 The sooner you call, the more carefully I will word the letter.
8 The more timid he feels, the more people he interviews without asking questions of.
9 Once Janet left, Fred became a lot crazier.
...
test.tsv format notes:
test.tsv has 2 columns: the first is the index of each example, the second the sentence to be tested.

CoLA task type:
Binary classification. Evaluation metric: MCC (Matthews correlation coefficient, a binary-classification metric suited to heavily imbalanced class distributions).
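For reference, MCC can be computed directly with scikit-learn; a small sketch with made-up labels:

from sklearn.metrics import matthews_corrcoef

# Toy labels illustrating MCC on an imbalanced binary problem;
# values range from -1 (total disagreement) to +1 (perfect prediction).
y_true = [1, 1, 1, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0]
print(matthews_corrcoef(y_true, y_pred))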
SST-2 dataset file layout

- SST-2/
    - dev.tsv
    - original/
    - test.tsv
    - train.tsv
File layout notes:
The files you will typically use are train.tsv, dev.tsv, and test.tsv: the training, validation, and test sets respectively. train.tsv and dev.tsv share the same format and both carry labels; test.tsv is unlabeled.

train.tsv sample:
sentence label
hide new secretions from the parental units 0
contains no wit , only labored gags 0
that loves its characters and communicates something rather beautiful about human nature 1
remains utterly satisfied to remain the same throughout 0
on the worst revenge-of-the-nerds clichés the filmmakers could dredge up 0
that 's far too tragic to merit such superficial treatment 0
demonstrates that the director of such hollywood blockbusters as patriot games can still turn out a small , personal film with an emotional wallop . 1
of saucy 1
a depressed fifteen-year-old 's suicidal poetry 0
...
train.tsv format notes:
train.tsv has 2 columns: the first is a sentiment-bearing review snippet; the second, 0 or 1, marks whether the review is negative or positive, with 0 negative and 1 positive.

test.tsv sample:
index sentence
0 uneasy mishmash of styles and genres .
1 this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .
2 by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .
3 director rob marshall went out gunning to make a great one .
4 lathan and diggs have considerable personal charm , and their screen rapport makes the old story seem new .
5 a well-made and often lovely depiction of the mysteries of friendship .
6 none of this violates the letter of behan 's book , but missing is its spirit , its ribald , full-throated humor .
7 although it bangs a very cliched drum at times , this crowd-pleaser 's fresh dialogue , energetic music , and good-natured spunk are often infectious .
8 it is not a mass-market entertainment but an uncompromising attempt by one artist to think about another .
9 this is junk food cinema at its greasiest .
...
test.tsv format notes:

test.tsv has 2 columns: the first is the index of each example, the second the sentence to be tested.
SST-2 task type:
Binary classification. Evaluation metric: ACC.

MRPC dataset file layout

- MRPC/
    - dev.tsv
    - test.tsv
    - train.tsv
    - dev_ids.tsv
    - msr_paraphrase_test.txt
    - msr_paraphrase_train.txt
File layout notes:
The files you will typically use are train.tsv, dev.tsv, and test.tsv: the training, validation, and test sets respectively. train.tsv and dev.tsv share the same format and both carry labels; test.tsv is unlabeled.

train.tsv sample:
Quality #1 ID #2 ID #1 String #2 String
1 702876 702977 Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence . Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .
0 2108705 2108831 Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion . Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .
1 1330381 1330521 They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added . On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .
0 3344667 3344648 Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 . Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 .
1 1236820 1236712 The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange . PG & E Corp. shares jumped $ 1.63 or 8 percent to $ 21.03 on the New York Stock Exchange on Friday .
1 738533 737951 Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier . With the scandal hanging over Stewart 's company , revenue the first quarter of the year dropped 15 percent from the same period a year earlier .
0 264589 264502 The Nasdaq had a weekly gain of 17.27 , or 1.2 percent , closing at 1,520.15 on Friday . The tech-laced Nasdaq Composite .IXIC rallied 30.46 points , or 2.04 percent , to 1,520.15 .
1 579975 579810 The DVD-CCA then appealed to the state Supreme Court . The DVD CCA appealed that decision to the U.S. Supreme Court .
...
train.tsv format notes:
train.tsv has 5 columns. The first column, 0 or 1, marks whether the two sentences share the same meaning: 0 means different, 1 means the same. The second and third columns are the ids of the two sentences, and the fourth and fifth columns are the sentence pair itself.

test.tsv sample:
index #1 ID #2 ID #1 String #2 String
0 1089874 1089925 PCCW 's chief operating officer , Mike Butcher , and Alex Arena , the chief financial officer , will report directly to Mr So . Current Chief Operating Officer Mike Butcher and Group Chief Financial Officer Alex Arena will report to So .
1 3019446 3019327 The world 's two largest automakers said their U.S. sales declined more than predicted last month as a late summer sales frenzy caused more of an industry backlash than expected . Domestic sales at both GM and No. 2 Ford Motor Co. declined more than predicted as a late summer sales frenzy prompted a larger-than-expected industry backlash .
2 1945605 1945824 According to the federal Centers for Disease Control and Prevention ( news - web sites ) , there were 19 reported cases of measles in the United States in 2002 . The Centers for Disease Control and Prevention said there were 19 reported cases of measles in the United States in 2002 .
3 1430402 1430329 A tropical storm rapidly developed in the Gulf of Mexico Sunday and was expected to hit somewhere along the Texas or Louisiana coasts by Monday night . A tropical storm rapidly developed in the Gulf of Mexico on Sunday and could have hurricane-force winds when it hits land somewhere along the Louisiana coast Monday night .
4 3354381 3354396 The company didn 't detail the costs of the replacement and repairs . But company officials expect the costs of the replacement work to run into the millions of dollars .
5 1390995 1391183 The settling companies would also assign their possible claims against the underwriters to the investor plaintiffs , he added . Under the agreement , the settling companies will also assign their potential claims against the underwriters to the investors , he added .
6 2201401 2201285 Air Commodore Quaife said the Hornets remained on three-minute alert throughout the operation . Air Commodore John Quaife said the security operation was unprecedented .
7 2453843 2453998 A Washington County man may have the countys first human case of West Nile virus , the health department said Friday . The countys first and only human case of West Nile this year was confirmed by health officials on Sept . 8 .
...
test.tsv format notes:

test.tsv has 5 columns: the first is the index of each example; the remaining columns have the same meaning as in train.tsv.
MRPC task type:
Sentence-pair binary classification. Evaluation metrics: ACC and F1.

STS-B dataset file layout
- STS-B/
    - dev.tsv
    - test.tsv
    - train.tsv
    - LICENSE.txt
    - readme.txt
    - original/
File layout notes:
The files you will typically use are train.tsv, dev.tsv, and test.tsv: the training, validation, and test sets respectively. train.tsv and dev.tsv share the same format and both carry labels; test.tsv is unlabeled.

train.tsv sample:
index genre filename year old_index source1 source2 sentence1 sentence2 score
0 main-captions MSRvid 2012test 0001 none none A plane is taking off. An air plane is taking off. 5.000
1 main-captions MSRvid 2012test 0004 none none A man is playing a large flute. A man is playing a flute. 3.800
2 main-captions MSRvid 2012test 0005 none none A man is spreading shreded cheese on a pizza. A man is spreading shredded cheese on an uncooked pizza. 3.800
3 main-captions MSRvid 2012test 0006 none none Three men are playing chess. Two men are playing chess. 2.600
4 main-captions MSRvid 2012test 0009 none none A man is playing the cello. A man seated is playing the cello. 4.250
5 main-captions MSRvid 2012test 0011 none none Some men are fighting. Two men are fighting. 4.250
6 main-captions MSRvid 2012test 0012 none none A man is smoking. A man is skating. 0.500
7 main-captions MSRvid 2012test 0013 none none The man is playing the piano. The man is playing the guitar. 1.600
8 main-captions MSRvid 2012test 0014 none none A man is playing on a guitar and singing. A woman is playing an acoustic guitar and singing. 2.200
9 main-captions MSRvid 2012test 0016 none none A person is throwing a cat on to the ceiling. A person throws a cat on the ceiling. 5.000
...
train.tsv format notes:
train.tsv has 10 columns: the first is the example index; the second the source genre of the sentence pair (e.g. main-captions means captions); the third the source file name; the fourth the year; the fifth the index in the original data; the sixth and seventh the original sources of the two sentences; the eighth and ninth the sentence pair of varying similarity; and the tenth the pair's similarity score, from low to high in the range [0, 5].

test.tsv sample:
index genre filename year old_index source1 source2 sentence1 sentence2
0 main-captions MSRvid 2012test 0024 none none A girl is styling her hair. A girl is brushing her hair.
1 main-captions MSRvid 2012test 0033 none none A group of men play soccer on the beach. A group of boys are playing soccer on the beach.
2 main-captions MSRvid 2012test 0045 none none One woman is measuring another woman's ankle. A woman measures another woman's ankle.
3 main-captions MSRvid 2012test 0063 none none A man is cutting up a cucumber. A man is slicing a cucumber.
4 main-captions MSRvid 2012test 0066 none none A man is playing a harp. A man is playing a keyboard.
5 main-captions MSRvid 2012test 0074 none none A woman is cutting onions. A woman is cutting tofu.
6 main-captions MSRvid 2012test 0076 none none A man is riding an electric bicycle. A man is riding a bicycle.
7 main-captions MSRvid 2012test 0082 none none A man is playing the drums. A man is playing the guitar.
8 main-captions MSRvid 2012test 0092 none none A man is playing guitar. A lady is playing the guitar.
9 main-captions MSRvid 2012test 0095 none none A man is playing a guitar. A man is playing a trumpet.
10 main-captions MSRvid 2012test 0096 none none A man is playing a guitar. A man is playing a trumpet.
...
test.tsv format notes:
test.tsv has 9 columns, matching the first 9 columns of train.tsv.

STS-B task type:
Sentence-pair multi-class classification / sentence-pair regression. Evaluation metric: Pearson-Spearman Corr.
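Both correlations are available in scipy; a small sketch with made-up similarity scores in the [0, 5] range described above:

from scipy.stats import pearsonr, spearmanr

# Toy gold scores and model predictions.
gold = [5.0, 3.8, 2.6, 4.25, 0.5]
pred = [4.7, 3.5, 3.0, 4.00, 1.0]
print("pearson: ", pearsonr(gold, pred)[0])    # linear correlation
print("spearman:", spearmanr(gold, pred)[0])   # rank correlation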
QQP dataset file layout

- QQP/
    - dev.tsv
    - original/
    - test.tsv
    - train.tsv
File layout notes:
The files you will typically use are train.tsv, dev.tsv, and test.tsv: the training, validation, and test sets respectively. train.tsv and dev.tsv share the same format and both carry labels; test.tsv is unlabeled.

train.tsv sample:
id qid1 qid2 question1 question2 is_duplicate
133273 213221 213222 How is the life of a math student? Could you describe your own experiences? Which level of prepration is enough for the exam jlpt5? 0
402555 536040 536041 How do I control my horny emotions? How do you control your horniness? 1
360472 364011 490273 What causes stool color to change to yellow? What can cause stool to come out as little balls? 0
150662 155721 7256 What can one do after MBBS? What do i do after my MBBS ? 1
183004 279958 279959 Where can I find a power outlet for my laptop at Melbourne Airport? Would a second airport in Sydney, Australia be needed if a high-speed rail link was created between Melbourne and Sydney? 0
119056 193387 193388 How not to feel guilty since I am Muslim and I'm conscious we won't have sex together? I don't beleive I am bulimic, but I force throw up atleast once a day after I eat something and feel guilty. Should I tell somebody, and if so who? 0
356863 422862 96457 How is air traffic controlled? How do you become an air traffic controller? 0
106969 147570 787 What is the best self help book you have read? Why? How did it change your life? What are the top self help books I should read? 1
...
train.tsv format notes:
train.tsv has 6 columns: the first is the example index; the second and third are the ids of question 1 and question 2; the fourth and fifth are the question pair to be judged for duplication; and the sixth is the label, 0 for not duplicate and 1 for duplicate.

test.tsv sample:
id question1 question2
0 Would the idea of Trump and Putin in bed together scare you, given the geopolitical implications? Do you think that if Donald Trump were elected President, he would be able to restore relations with Putin and Russia as he said he could, based on the rocky relationship Putin had with Obama and Bush?
1 What are the top ten Consumer-to-Consumer E-commerce online? What are the top ten Consumer-to-Business E-commerce online?
2 Why don't people simply 'Google' instead of asking questions on Quora? Why do people ask Quora questions instead of just searching google?
3 Is it safe to invest in social trade biz? Is social trade geniune?
4 If the universe is expanding then does matter also expand? If universe and space is expanding? Does that mean anything that occupies space is also expanding?
5 What is the plural of hypothesis? What is the plural of thesis?
6 What is the application form you need for launching a company? What is the application form you need for launching a company in Austria?
7 What is Big Theta? When should I use Big Theta as opposed to big O? Is O(Log n) close to O(n) or O(1)?
8 What are the health implications of accidentally eating a small quantity of aluminium foil? What are the implications of not eating vegetables?
...
test.tsv format notes:
test.tsv has 3 columns: the first is the index of each example; the second and third are the question pair to be tested.

QQP task type:
Sentence-pair binary classification. Evaluation metrics: ACC/F1.

(MNLI/SNLI) dataset file layout
- (MNLI/SNLI)/
    - dev_matched.tsv
    - dev_mismatched.tsv
    - original/
    - test_matched.tsv
    - test_mismatched.tsv
    - train.tsv
File layout notes:
The files you will typically use are train.tsv, dev_matched.tsv, dev_mismatched.tsv, test_matched.tsv, and test_mismatched.tsv: the training set, the validation set collected from the same sources as the training set, the validation set collected from different sources, the test set collected from the same sources, and the test set collected from different sources. train.tsv, dev_matched.tsv, and dev_mismatched.tsv share the same labeled format; test_matched.tsv and test_mismatched.tsv share the same unlabeled format.

train.tsv sample:
index promptID pairID genre sentence1_binary_parse sentence2_binary_parse sentence1_parse sentence2_parse sentence1 sentence2 label1 gold_label
0 31193 31193n government ( ( Conceptually ( cream skimming ) ) ( ( has ( ( ( two ( basic dimensions ) ) - ) ( ( product and ) geography ) ) ) . ) ) ( ( ( Product and ) geography ) ( ( are ( what ( make ( cream ( skimming work ) ) ) ) ) . ) ) (ROOT (S (NP (JJ Conceptually) (NN cream) (NN skimming)) (VP (VBZ has) (NP (NP (CD two) (JJ basic) (NNS dimensions)) (: -) (NP (NN product) (CC and) (NN geography)))) (. .))) (ROOT (S (NP (NN Product) (CC and) (NN geography)) (VP (VBP are) (SBAR (WHNP (WP what)) (S (VP (VBP make) (NP (NP (NN cream)) (VP (VBG skimming) (NP (NN work)))))))) (. .))) Conceptually cream skimming has two basic dimensions - product and geography. Product and geography are what make cream skimming work. neutral neutral
1 101457 101457e telephone ( you ( ( know ( during ( ( ( the season ) and ) ( i guess ) ) ) ) ( at ( at ( ( your level ) ( uh ( you ( ( ( lose them ) ( to ( the ( next level ) ) ) ) ( if ( ( if ( they ( decide ( to ( recall ( the ( the ( parent team ) ) ) ) ) ) ) ) ( ( the Braves ) ( decide ( to ( call ( to ( ( recall ( a guy ) ) ( from ( ( triple A ) ( ( ( then ( ( a ( double ( A guy ) ) ) ( ( goes up ) ( to ( replace him ) ) ) ) ) and ) ( ( a ( single ( A guy ) ) ) ( ( goes up ) ( to ( replace him ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ( You ( ( ( ( lose ( the things ) ) ( to ( the ( following level ) ) ) ) ( if ( ( the people ) recall ) ) ) . ) ) (ROOT (S (NP (PRP you)) (VP (VBP know) (PP (IN during) (NP (NP (DT the) (NN season)) (CC and) (NP (FW i) (FW guess)))) (PP (IN at) (IN at) (NP (NP (PRP$ your) (NN level)) (SBAR (S (INTJ (UH uh)) (NP (PRP you)) (VP (VBP lose) (NP (PRP them)) (PP (TO to) (NP (DT the) (JJ next) (NN level))) (SBAR (IN if) (S (SBAR (IN if) (S (NP (PRP they)) (VP (VBP decide) (S (VP (TO to) (VP (VB recall) (NP (DT the) (DT the) (NN parent) (NN team)))))))) (NP (DT the) (NNPS Braves)) (VP (VBP decide) (S (VP (TO to) (VP (VB call) (S (VP (TO to) (VP (VB recall) (NP (DT a) (NN guy)) (PP (IN from) (NP (NP (RB triple) (DT A)) (SBAR (S (S (ADVP (RB then)) (NP (DT a) (JJ double) (NNP A) (NN guy)) (VP (VBZ goes) (PRT (RP up)) (S (VP (TO to) (VP (VB replace) (NP (PRP him))))))) (CC and) (S (NP (DT a) (JJ single) (NNP A) (NN guy)) (VP (VBZ goes) (PRT (RP up)) (S (VP (TO to) (VP (VB replace) (NP (PRP him)))))))))))))))))))))))))))) (ROOT (S (NP (PRP You)) (VP (VBP lose) (NP (DT the) (NNS things)) (PP (TO to) (NP (DT the) (JJ following) (NN level))) (SBAR (IN if) (S (NP (DT the) (NNS people)) (VP (VBP recall))))) (. .))) you know during the season and i guess at at your level uh you lose them to the next level if if they decide to recall the the parent team the Braves decide to call to recall a guy from triple A then a double A guy goes up to replace him and a single A guy goes up to replace him You lose the things to the following level if the people recall. entailment entailment
2 134793 134793e fiction ( ( One ( of ( our number ) ) ) ( ( will ( ( ( carry out ) ( your instructions ) ) minutely ) ) . ) ) ( ( ( A member ) ( of ( my team ) ) ) ( ( will ( ( execute ( your orders ) ) ( with ( immense precision ) ) ) ) . ) ) (ROOT (S (NP (NP (CD One)) (PP (IN of) (NP (PRP$ our) (NN number)))) (VP (MD will) (VP (VB carry) (PRT (RP out)) (NP (PRP$ your) (NNS instructions)) (ADVP (RB minutely)))) (. .))) (ROOT (S (NP (NP (DT A) (NN member)) (PP (IN of) (NP (PRP$ my) (NN team)))) (VP (MD will) (VP (VB execute) (NP (PRP$ your) (NNS orders)) (PP (IN with) (NP (JJ immense) (NN precision))))) (. .))) One of our number will carry out your instructions minutely. A member of my team will execute your orders with immense precision. entailment entailment
3 37397 37397e fiction ( ( How ( ( ( do you ) know ) ? ) ) ( ( All this ) ( ( ( is ( their information ) ) again ) . ) ) ) ( ( This information ) ( ( belongs ( to them ) ) . ) ) (ROOT (S (SBARQ (WHADVP (WRB How)) (SQ (VBP do) (NP (PRP you)) (VP (VB know))) (. ?)) (NP (PDT All) (DT this)) (VP (VBZ is) (NP (PRP$ their) (NN information)) (ADVP (RB again))) (. .))) (ROOT (S (NP (DT This) (NN information)) (VP (VBZ belongs) (PP (TO to) (NP (PRP them)))) (. .))) How do you know? All this is their information again. This information belongs to them. entailment entailment
...
train.tsv format notes:
train.tsv has 12 columns: the first is the example index; the second and third are two different kinds of ids for the sentence pair; the fourth is the genre of the pair; the fifth and sixth are binary-parse representations of the two sentences; the seventh and eighth are parse representations with part-of-speech tags; the ninth and tenth are the raw sentence pair; and the eleventh and twelfth are labels from two annotation standards, which here are always identical. There are three label types: neutral means the sentences neither contradict nor entail each other, entailment means an entailment relation holds, and contradiction means the sentences contradict each other.

test_matched.tsv sample:
index promptID pairID genre sentence1_binary_parse sentence2_binary_parse sentence1_parse sentence2_parse sentence1 sentence2
0 31493 31493 travel ( ( ( ( ( ( ( ( Hierbas , ) ( ans seco ) ) , ) ( ans dulce ) ) , ) and ) frigola ) ( ( ( are just ) ( ( a ( few names ) ) ( worth ( ( keeping ( a look-out ) ) for ) ) ) ) . ) ) ( Hierbas ( ( is ( ( a name ) ( worth ( ( looking out ) for ) ) ) ) . ) ) (ROOT (S (NP (NP (NNS Hierbas)) (, ,) (NP (NN ans) (NN seco)) (, ,) (NP (NN ans) (NN dulce)) (, ,) (CC and) (NP (NN frigola))) (VP (VBP are) (ADVP (RB just)) (NP (NP (DT a) (JJ few) (NNS names)) (PP (JJ worth) (S (VP (VBG keeping) (NP (DT a) (NN look-out)) (PP (IN for))))))) (. .))) (ROOT (S (NP (NNS Hierbas)) (VP (VBZ is) (NP (NP (DT a) (NN name)) (PP (JJ worth) (S (VP (VBG looking) (PRT (RP out)) (PP (IN for))))))) (. .))) Hierbas, ans seco, ans dulce, and frigola are just a few names worth keeping a look-out for. Hierbas is a name worth looking out for.
1 92164 92164 government ( ( ( The extent ) ( of ( the ( behavioral effects ) ) ) ) ( ( would ( ( depend ( in ( part ( on ( ( the structure ) ( of ( ( ( the ( individual ( account program ) ) ) and ) ( any limits ) ) ) ) ) ) ) ) ( on ( accessing ( the funds ) ) ) ) ) . ) ) ( ( Many people ) ( ( would ( be ( very ( unhappy ( to ( ( loose control ) ( over ( their ( own money ) ) ) ) ) ) ) ) ) . ) ) (ROOT (S (NP (NP (DT The) (NN extent)) (PP (IN of) (NP (DT the) (JJ behavioral) (NNS effects)))) (VP (MD would) (VP (VB depend) (PP (IN in) (NP (NP (NN part)) (PP (IN on) (NP (NP (DT the) (NN structure)) (PP (IN of) (NP (NP (DT the) (JJ individual) (NN account) (NN program)) (CC and) (NP (DT any) (NNS limits)))))))) (PP (IN on) (S (VP (VBG accessing) (NP (DT the) (NNS funds))))))) (. .))) (ROOT (S (NP (JJ Many) (NNS people)) (VP (MD would) (VP (VB be) (ADJP (RB very) (JJ unhappy) (PP (TO to) (NP (NP (JJ loose) (NN control)) (PP (IN over) (NP (PRP$ their) (JJ own) (NN money)))))))) (. .))) The extent of the behavioral effects would depend in part on the structure of the individual account program and any limits on accessing the funds. Many people would be very unhappy to loose control over their own money.
2 9662 9662 government ( ( ( Timely access ) ( to information ) ) ( ( is ( in ( ( the ( best interests ) ) ( of ( ( ( both GAO ) and ) ( the agencies ) ) ) ) ) ) . ) ) ( It ( ( ( is ( in ( ( everyone 's ) ( best interest ) ) ) ) ( to ( ( have access ) ( to ( information ( in ( a ( timely manner ) ) ) ) ) ) ) ) . ) ) (ROOT (S (NP (NP (JJ Timely) (NN access)) (PP (TO to) (NP (NN information)))) (VP (VBZ is) (PP (IN in) (NP (NP (DT the) (JJS best) (NNS interests)) (PP (IN of) (NP (NP (DT both) (NNP GAO)) (CC and) (NP (DT the) (NNS agencies))))))) (. .))) (ROOT (S (NP (PRP It)) (VP (VBZ is) (PP (IN in) (NP (NP (NN everyone) (POS 's)) (JJS best) (NN interest))) (S (VP (TO to) (VP (VB have) (NP (NN access)) (PP (TO to) (NP (NP (NN information)) (PP (IN in) (NP (DT a) (JJ timely) (NN manner))))))))) (. .))) Timely access to information is in the best interests of both GAO and the agencies. It is in everyone's best interest to have access to information in a timely manner.
3 5991 5991 travel ( ( Based ( in ( ( the ( Auvergnat ( spa town ) ) ) ( of Vichy ) ) ) ) ( , ( ( the ( French government ) ) ( often ( ( ( ( proved ( more zealous ) ) ( than ( its masters ) ) ) ( in ( ( ( suppressing ( civil liberties ) ) and ) ( ( drawing up ) ( anti-Jewish legislation ) ) ) ) ) . ) ) ) ) ) ( ( The ( French government ) ) ( ( passed ( ( anti-Jewish laws ) ( aimed ( at ( helping ( the Nazi ) ) ) ) ) ) . ) ) (ROOT (S (PP (VBN Based) (PP (IN in) (NP (NP (DT the) (NNP Auvergnat) (NN spa) (NN town)) (PP (IN of) (NP (NNP Vichy)))))) (, ,) (NP (DT the) (JJ French) (NN government)) (ADVP (RB often)) (VP (VBD proved) (NP (JJR more) (NNS zealous)) (PP (IN than) (NP (PRP$ its) (NNS masters))) (PP (IN in) (S (VP (VP (VBG suppressing) (NP (JJ civil) (NNS liberties))) (CC and) (VP (VBG drawing) (PRT (RP up)) (NP (JJ anti-Jewish) (NN legislation))))))) (. .))) (ROOT (S (NP (DT The) (JJ French) (NN government)) (VP (VBD passed) (NP (NP (JJ anti-Jewish) (NNS laws)) (VP (VBN aimed) (PP (IN at) (S (VP (VBG helping) (NP (DT the) (JJ Nazi)))))))) (. .))) Based in the Auvergnat spa town of Vichy, the French government often proved more zealous than its masters in suppressing civil liberties and drawing up anti-Jewish legislation. The French government passed anti-Jewish laws aimed at helping the Nazi.
...
test_matched.tsv format notes:
test_matched.tsv has 10 columns, matching the first 10 columns of train.tsv.

(MNLI/SNLI) task type:
Sentence-pair multi-class classification. Evaluation metric: ACC.

(QNLI/RTE/WNLI) dataset file layout
* The QNLI, RTE, and WNLI datasets share essentially the same layout.
- (QNLI/RTE/WNLI)/
    - dev.tsv
    - test.tsv
    - train.tsv
File layout notes:
The files you will typically use are train.tsv, dev.tsv, and test.tsv: the training, validation, and test sets respectively. train.tsv and dev.tsv share the same format and both carry labels; test.tsv is unlabeled.

QNLI train.tsv sample:
index question sentence label
0 When did the third Digimon series begin? Unlike the two seasons before it and most of the seasons that followed, Digimon Tamers takes a darker and more realistic approach to its story featuring Digimon who do not reincarnate after their deaths and more complex character development in the original Japanese. not_entailment
1 Which missile batteries often have individual launchers several kilometres from one another? When MANPADS is operated by specialists, batteries may have several dozen teams deploying separately in small sections; self-propelled air defence guns may deploy in pairs. not_entailment
2 What two things does Popper argue Tarski's theory involves in an evaluation of truth? He bases this interpretation on the fact that examples such as the one described above refer to two things: assertions and the facts to which they refer. entailment
3 What is the name of the village 9 miles north of Calafat where the Ottoman forces attacked the Russians? On 31 December 1853, the Ottoman forces at Calafat moved against the Russian force at Chetatea or Cetate, a small village nine miles north of Calafat, and engaged them on 6 January 1854. entailment
4 What famous palace is located in London? London contains four World Heritage Sites: the Tower of London; Kew Gardens; the site comprising the Palace of Westminster, Westminster Abbey, and St Margaret's Church; and the historic settlement of Greenwich (in which the Royal Observatory, Greenwich marks the Prime Meridian, 0° longitude, and GMT). not_entailment
5 When is the term 'German dialects' used in regard to the German language? When talking about the German language, the term German dialects is only used for the traditional regional varieties. entailment
6 What was the name of the island the English traded to the Dutch in return for New Amsterdam? At the end of the Second Anglo-Dutch War, the English gained New Amsterdam (New York) in North America in exchange for Dutch control of Run, an Indonesian island. entailment
7 How were the Portuguese expelled from Myanmar? From the 1720s onward, the kingdom was beset with repeated Meithei raids into Upper Myanmar and a nagging rebellion in Lan Na. not_entailment
8 What does the word 'customer' properly apply to? The bill also required rotation of principal maintenance inspectors and stipulated that the word "customer" properly applies to the flying public, not those entities regulated by the FAA. entailment
...
RTE train.tsv sample:
index sentence1 sentence2 label
0 No Weapons of Mass Destruction Found in Iraq Yet. Weapons of Mass Destruction Found in Iraq. not_entailment
1 A place of sorrow, after Pope John Paul II died, became a place of celebration, as Roman Catholic faithful gathered in downtown Chicago to mark the installation of new Pope Benedict XVI. Pope Benedict XVI is the new leader of the Roman Catholic Church. entailment
2 Herceptin was already approved to treat the sickest breast cancer patients, and the company said, Monday, it will discuss with federal regulators the possibility of prescribing the drug for more breast cancer patients. Herceptin can be used to treat breast cancer. entailment
3 Judie Vivian, chief executive at ProMedica, a medical service company that helps sustain the 2-year-old Vietnam Heart Institute in Ho Chi Minh City (formerly Saigon), said that so far about 1,500 children have received treatment. The previous name of Ho Chi Minh City was Saigon. entailment
4 A man is due in court later charged with the murder 26 years ago of a teenager whose case was the first to be featured on BBC One's Crimewatch. Colette Aram, 16, was walking to her boyfriend's house in Keyworth, Nottinghamshire, on 30 October 1983 when she disappeared. Her body was later found in a field close to her home. Paul Stewart Hutchinson, 50, has been charged with murder and is due before Nottingham magistrates later. Paul Stewart Hutchinson is accused of having stabbed a girl. not_entailment
5 Britain said, Friday, that it has barred cleric, Omar Bakri, from returning to the country from Lebanon, where he was released by police after being detained for 24 hours. Bakri was briefly detained, but was released. entailment
6 Nearly 4 million children who have at least one parent who entered the U.S. illegally were born in the United States and are U.S. citizens as a result, according to the study conducted by the Pew Hispanic Center. That's about three quarters of the estimated 5.5 million children of illegal immigrants inside the United States, according to the study. About 1.8 million children of undocumented immigrants live in poverty, the study found. Three quarters of U.S. illegal immigrants have children. not_entailment
7 Like the United States, U.N. officials are also dismayed that Aristide killed a conference called by Prime Minister Robert Malval in Port-au-Prince in hopes of bringing all the feuding parties together. Aristide had Prime Minister Robert Malval murdered in Port-au-Prince. not_entailment
8 WASHINGTON -- A newly declassified narrative of the Bush administration's advice to the CIA on harsh interrogations shows that the small group of Justice Department lawyers who wrote memos authorizing controversial interrogation techniques were operating not on their own but with direction from top administration officials, including then-Vice President Dick Cheney and national security adviser Condoleezza Rice. At the same time, the narrative suggests that then-Defense Secretary Donald H. Rumsfeld and then-Secretary of State Colin Powell were largely left out of the decision-making process. Dick Cheney was the Vice President of Bush. entailment
WNLI train.tsv sample:
index sentence1 sentence2 label
0 I stuck a pin through a carrot. When I pulled the pin out, it had a hole. The carrot had a hole. 1
1 John couldn't see the stage with Billy in front of him because he is so short. John is so short. 1
2 The police arrested all of the gang members. They were trying to stop the drug trade in the neighborhood. The police were trying to stop the drug trade in the neighborhood. 1
3 Steve follows Fred's example in everything. He influences him hugely. Steve influences him hugely. 0
4 When Tatyana reached the cabin, her mother was sleeping. She was careful not to disturb her, undressing and climbing back into her berth. mother was careful not to disturb her, undressing and climbing back into her berth. 0
5 George got free tickets to the play, but he gave them to Eric, because he was particularly eager to see it. George was particularly eager to see it. 0
6 John was jogging through the park when he saw a man juggling watermelons. He was very impressive. John was very impressive. 0
7 I couldn't put the pot on the shelf because it was too tall. The pot was too tall. 1
8 We had hoped to place copies of our newsletter on all the chairs in the auditorium, but there were simply not enough of them. There were simply not enough copies of the newsletter. 1
(QNLI/RTE/WNLI) train.tsv format notes:
train.tsv has 4 columns: the first is the example index; the second and third are the sentence pair to be judged for entailment; the fourth marks whether the pair is an entailment, with 0/not_entailment meaning no entailment and 1/entailment meaning entailment.

QNLI test.tsv sample:
index question sentence
0 What organization is devoted to Jihad against Israel? For some decades prior to the First Palestine Intifada in 1987, the Muslim Brotherhood in Palestine took a "quiescent" stance towards Israel, focusing on preaching, education and social services, and benefiting from Israel's "indulgence" to build up a network of mosques and charitable organizations.
1 In what century was the Yarrow-Schlick-Tweedy balancing system used? In the late 19th century, the Yarrow-Schlick-Tweedy balancing 'system' was used on some marine triple expansion engines.
2 The largest brand of what store in the UK is located in Kingston Park? Close to Newcastle, the largest indoor shopping centre in Europe, the MetroCentre, is located in Gateshead.
3 What does the IPCC rely on for research? In principle, this means that any significant new evidence or events that change our understanding of climate science between this deadline and publication of an IPCC report cannot be included.
4 What is the principle about relating spin and space variables? Thus in the case of two fermions there is a strictly negative correlation between spatial and spin variables, whereas for two bosons (e.g. quanta of electromagnetic waves, photons) the correlation is strictly positive.
5 Which network broadcasted Super Bowl 50 in the U.S.? CBS broadcast Super Bowl 50 in the U.S., and charged an average of $5 million for a 30-second commercial during the game.
6 What did the museum acquire from the Royal College of Science? To link this to the rest of the museum, a new entrance building was constructed on the site of the former boiler house, the intended site of the Spiral, between 1978 and 1982.
7 What is the name of the old north branch of the Rhine? From Wijk bij Duurstede, the old north branch of the Rhine is called Kromme Rijn ("Bent Rhine") past Utrecht, first Leidse Rijn ("Rhine of Leiden") and then, Oude Rijn ("Old Rhine").
8 What was one of Luther's most personal writings? It remains in use today, along with Luther's hymns and his translation of the Bible.
...
(RTE/WNLI) test.tsv sample:
index sentence1 sentence2
0 Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight. Horses ran away when Maude and Dora came in sight.
1 Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight. Horses ran away when the trains came in sight.
2 Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight. Horses ran away when the puffs came in sight.
3 Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight. Horses ran away when the roars came in sight.
4 Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight. Horses ran away when the whistles came in sight.
5 Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight. Horses ran away when the horses came in sight.
6 Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they saw a train coming. Maude and Dora saw a train coming.
7 Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they saw a train coming. The trains saw a train coming.
8 Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they saw a train coming. The puffs saw a train coming.
...
(QNLI/RTE/WNLI) test.tsv format notes:
test.tsv has 3 columns: the first is the index of each example; the second and third are the sentence pair to be judged for entailment.

(QNLI/RTE/WNLI) task type:
Sentence-pair binary classification. Evaluation metric: ACC.

3. Common Pretrained Models in NLP

Popular pretrained models in NLP today:

BERT
GPT
GPT-2
Transformer-XL
XLNet
XLM
RoBERTa
DistilBERT
ALBERT
T5
XLM-RoBERTa

BERT and its variants:
bert-base-uncased: 12-layer encoder, 768-dim output, 12 self-attention heads, 110M parameters; trained on lower-cased English text.
bert-large-uncased: 24-layer encoder, 1024-dim output, 16 self-attention heads, 340M parameters; trained on lower-cased English text.
bert-base-cased: 12-layer encoder, 768-dim output, 12 self-attention heads, 110M parameters; trained on case-sensitive English text.
bert-large-cased: 24-layer encoder, 1024-dim output, 16 self-attention heads, 340M parameters; trained on case-sensitive English text.
bert-base-multilingual-uncased: 12-layer encoder, 768-dim output, 12 self-attention heads, 110M parameters; trained on lower-cased text in 102 languages.
bert-large-multilingual-uncased: 24-layer encoder, 1024-dim output, 16 self-attention heads, 340M parameters; trained on lower-cased text in 102 languages.
bert-base-chinese: 12-layer encoder, 768-dim output, 12 self-attention heads, 110M parameters; trained on simplified and traditional Chinese text.

GPT:
openai-gpt: 12-layer encoder, 768-dim output, 12 self-attention heads, 110M parameters; trained by OpenAI on English corpora.

GPT-2 and its variants:
gpt2: 12-layer encoder, 768-dim output, 12 self-attention heads, 117M parameters; trained on OpenAI's GPT-2 English corpus.
gpt2-xl: 48-layer encoder, 1600-dim output, 25 self-attention heads, 1558M parameters; trained on the large OpenAI GPT-2 English corpus.

Transformer-XL:
transfo-xl-wt103: 18-layer encoder, 1024-dim output, 16 self-attention heads, 257M parameters; trained on the wikitext-103 English corpus.

XLNet and its variants:

xlnet-base-cased: 12-layer encoder, 768-dim output, 12 self-attention heads, 110M parameters; trained on English corpora.
xlnet-large-cased: 24-layer encoder, 1024-dim output, 16 self-attention heads, 340M parameters; trained on English corpora.

XLM:
xlm-mlm-en-2048: 12-layer encoder, 2048-dim output, 16 self-attention heads; trained on English text.

RoBERTa and its variants:
roberta-base: 12-layer encoder, 768-dim output, 12 self-attention heads, 125M parameters; trained on English text.
roberta-large: 24-layer encoder, 1024-dim output, 16 self-attention heads, 355M parameters; trained on English text.

DistilBERT and its variants:
distilbert-base-uncased: a distilled (compressed) model based on bert-base-uncased; 6-layer encoder, 768-dim output, 12 self-attention heads, 66M parameters.
distilbert-base-multilingual-cased: a distilled (compressed) model based on bert-base-multilingual-cased; 6-layer encoder, 768-dim output, 12 self-attention heads, 66M parameters.

ALBERT:
albert-base-v1: 12-layer encoder, 768-dim output, 12 self-attention heads, 11M parameters (ALBERT shares parameters across layers); trained on English text.
albert-base-v2: 12-layer encoder, 768-dim output, 12 self-attention heads, 11M parameters; trained on English text with more data and a longer training run than v1.

T5 and its variants:
t5-small: 6-layer encoder, 512-dim output, 8 self-attention heads, 60M parameters; trained on the C4 corpus.
t5-base: 12-layer encoder, 768-dim output, 12 self-attention heads, 220M parameters; trained on the C4 corpus.
t5-large: 24-layer encoder, 1024-dim output, 16 self-attention heads, 770M parameters; trained on the C4 corpus.

XLM-RoBERTa and its variants:
xlm-roberta-base: 12-layer encoder, 768-dim output, 8 self-attention heads, 125M parameters; trained on 2.5TB of text covering 100 languages.
xlm-roberta-large: 24-layer encoder, 1024-dim output, 16 self-attention heads, 355M parameters; trained on 2.5TB of text covering 100 languages.

Notes on pretrained models:
All of the pretrained models and variants above are transformer-based; they differ only in structural choices such as how neurons are connected, the number of encoder layers, and the number of attention heads. Most of these choices are justified by performance on standard datasets rather than by theory, so as users we do not need a deep theoretical verdict on each design. It is enough to try every applicable model on our own target data and keep the one that performs best.

4. Loading and Using Pretrained Models

Tools for loading and using pretrained models:
Here we use the torch.hub tool to load and use models. The pretrained models are provided by huggingface, a leading NLP research team.

Steps for loading and using a pretrained model:
Step 1: pick the pretrained model to load and install the dependencies.
Step 2: load the pretrained model's tokenizer.
Step 3: load the pretrained model, with or without a head.
Step 4: run the model to get outputs.

Step 1: pick the pretrained model to load and install the dependencies

The loadable models are those listed in the section Common Pretrained Models in NLP above. Here we assume a Chinese-text task, so the model to load is BERT's Chinese model: bert-base-chinese. Before loading models with the tool, install the required packages:

pip install tqdm boto3 requests regex sentencepiece sacremoses

Step 2: load the pretrained model's tokenizer
import torch

# source of the pretrained models
source = 'huggingface/pytorch-transformers'
# which part of the model to load; here, the tokenizer
part = 'tokenizer'
# name of the pretrained model to load
model_name = 'bert-base-chinese'
tokenizer = torch.hub.load(source, part, model_name)

Step 3: load the pretrained model, with or without a head

When loading a pretrained model we can choose a version with or without a head. The 'head' is the model's task-specific output layer; loading the headless model amounts to using the model purely as a feature extractor for the input text. When loading a model with a head, three head types are available: modelWithLMHead (language-model head), modelForSequenceClassification (classification head), and modelForQuestionAnswering (question-answering head). Each head type makes the pretrained model emit tensors of a specific shape; with the classification head, for example, the output is a tensor of size (1, 2), used to decide a classification result.
# load the headless pretrained model
part = 'model'
model = torch.hub.load(source, part, model_name)

# load the pretrained model with a language-model head
part = 'modelWithLMHead'
lm_model = torch.hub.load(source, part, model_name)

# load the pretrained model with a classification head
part = 'modelForSequenceClassification'
classification_model = torch.hub.load(source, part, model_name)

# load the pretrained model with a question-answering head
part = 'modelForQuestionAnswering'
qa_model = torch.hub.load(source, part, model_name)

Step 4: run the model to get outputs

Using the headless model:
# the input Chinese text
input_text = "人生該如何起頭"
# map the text to ids with the tokenizer
indexed_tokens = tokenizer.encode(input_text)
# print the mapped result
print("indexed_tokens:", indexed_tokens)
# wrap the ids in a tensor and feed them to the headless pretrained model
tokens_tensor = torch.tensor([indexed_tokens])
# run the headless pretrained model
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor)
print("headless model output:", encoded_layers)
print("headless model output shape:", encoded_layers.shape)
Output:
# the tokenizer's mapping: 101 and 102 are the start/end markers,
# and each number in between maps one character of "人生該如何起頭".
indexed_tokens: [101, 782, 4495, 6421, 1963, 862, 6629, 1928, 102]
headless model output: tensor([[[ 0.5421, 0.4526, -0.0179, ..., 1.0447, -0.1140, 0.0068], [-0.1343, 0.2785, 0.1602, ..., -0.0345, -0.1646, -0.2186], [ 0.9960, -0.5121, -0.6229, ..., 1.4173, 0.5533, -0.2681], ..., [ 0.0115, 0.2150, -0.0163, ..., 0.6445, 0.2452, -0.3749], [ 0.8649, 0.4337, -0.1867, ..., 0.7397, -0.2636, 0.2144], [-0.6207, 0.1668, 0.1561, ..., 1.1218, -0.0985, -0.0937]]])
# the output shape is 1x9x768: each character is now represented by a
# 768-dim vector, and we can build on this encoding with custom follow-up
# steps, e.g. our own fine-tuning network producing the final output.
headless model output shape: torch.Size([1, 9, 768])
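As the comment above suggests, the 1x9x768 encoding can feed a custom fine-tuning network. A minimal sketch reusing encoded_layers from this example (the mean-pooling choice and the two-class head are assumptions for illustration):

import torch.nn as nn

# mean-pool the nine 768-dim token vectors into one sentence vector,
# then map it to two classes with a freshly initialized linear head
pooled = encoded_layers.mean(dim=1)   # (1, 768)
head = nn.Linear(768, 2)              # the only part that would be trained
logits = head(pooled)                 # (1, 2)
print(logits.shape)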
Using the model with the language-model head:
# run the pretrained model with the language-model head
with torch.no_grad():
    lm_output = lm_model(tokens_tensor)
print("LM-head model output:", lm_output)
print("LM-head model output shape:", lm_output[0].shape)
Output:
LM-head model output: (tensor([[[ -7.9706, -7.9119, -7.9317, ..., -7.2174, -7.0263, -7.3746], [ -8.2097, -8.1810, -8.0645, ..., -7.2349, -6.9283, -6.9856], [-13.7458, -13.5978, -12.6076, ..., -7.6817, -9.5642, -11.9928], ..., [ -9.0928, -8.6857, -8.4648, ..., -8.2368, -7.5684, -10.2419], [ -8.9458, -8.5784, -8.6325, ..., -7.0547, -5.3288, -7.8077], [ -8.4154, -8.5217, -8.5379, ..., -6.7102, -5.9782, -7.6909]]]),)
# the output shape is 1x9x21128: each position is now a 21128-dim vector
# of vocabulary scores. As with the headless model, we can build custom
# follow-up steps on this, e.g. our own fine-tuning network.
LM-head model output shape: torch.Size([1, 9, 21128])
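The 21128 dimensions correspond to the Chinese BERT vocabulary, so taking an argmax per position and mapping the ids back through the tokenizer shows which token the LM head scores highest at each slot. A quick sketch reusing lm_output and tokenizer from above:

# for each of the 9 positions, pick the vocabulary id with the highest
# score and convert it back into a readable token
predicted_ids = torch.argmax(lm_output[0], dim=-1)[0].tolist()
print(tokenizer.convert_ids_to_tokens(predicted_ids))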
Using the model with the classification head:
# run the pretrained model with the classification head
with torch.no_grad():
    classification_output = classification_model(tokens_tensor)
print("classification-head model output:", classification_output)
print("classification-head model output shape:", classification_output[0].shape)
Output:
classification-head model output: (tensor([[-0.0649, -0.1593]]),)
# the output shape is 1x2, directly usable for binary text classification
classification-head model output shape: torch.Size([1, 2])
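Turning the raw 1x2 logits into a decision takes one softmax; note that this head is freshly initialized, so the probabilities are meaningless until the model is fine-tuned:

# softmax normalizes the two logits into class probabilities;
# argmax picks the predicted class index
probs = torch.softmax(classification_output[0], dim=-1)
print("probs:", probs)
print("pred :", torch.argmax(probs, dim=-1).item())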
Using the model with the question-answering head:
# When using the model with the QA head, the input must be a sentence pair:
# the first sentence states a fact, and the second asks a question about it.
# The QA model returns two tensors; the index of the maximum value in each
# marks the start and end positions of the answer within the text.
input_text1 = "我家的小狗是黑色的"
input_text2 = "我家的小狗是什么顏色的呢?"

# map the two sentences together
indexed_tokens = tokenizer.encode(input_text1, input_text2)
print("sentence-pair indexed_tokens:", indexed_tokens)

# output: [101, 2769, 2157, 4638, 2207, 4318, 3221, 7946, 5682, 4638, 102, 2769, 2157, 4638, 2207, 4318, 3221, 784, 720, 7582, 5682, 4638, 1450, 136, 102]

# use 0 and 1 to distinguish the first and second sentence
segments_ids = [0]*11 + [1]*14

# convert to tensors
segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])

# run the pretrained model with the QA head
with torch.no_grad():
    start_logits, end_logits = qa_model(tokens_tensor, token_type_ids=segments_tensors)
print("QA-head model output:", (start_logits, end_logits))
print("QA-head model output shapes:", (start_logits.shape, end_logits.shape))
Output:
sentence-pair indexed_tokens: [101, 2769, 2157, 4638, 2207, 4318, 3221, 7946, 5682, 4638, 102, 2769, 2157, 4638, 2207, 4318, 3221, 784, 720, 7582, 5682, 4638, 1450, 136, 102]
QA-head model output: (tensor([[ 0.2574, -0.0293, -0.8337, -0.5135, -0.3645, -0.2216, -0.1625, -0.2768, -0.8368, -0.2581, 0.0131, -0.1736, -0.5908, -0.4104, -0.2155, -0.0307, -0.1639, -0.2691, -0.4640, -0.1696, -0.4943, -0.0976, -0.6693, 0.2426, 0.0131]]), tensor([[-0.3788, -0.2393, -0.5264, -0.4911, -0.7277, -0.5425, -0.6280, -0.9800, -0.6109, -0.2379, -0.0042, -0.2309, -0.4894, -0.5438, -0.6717, -0.5371, -0.1701, 0.0826, 0.1411, -0.1180, -0.4732, -0.1541, 0.2543, 0.2163, -0.0042]]))
# the output is two tensors of shape 1x25, distributions over the combined
# length of the two sentences: the index of the maximum in the first tensor
# marks where the answer starts, and the index of the maximum in the second
# marks where it ends.
QA-head model output shapes: (torch.Size([1, 25]), torch.Size([1, 25]))
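To recover the actual answer text, take the argmax of each tensor and slice the token sequence between those indices. A sketch reusing indexed_tokens, start_logits, end_logits, and tokenizer from above (with this untrained head the extracted span is not yet meaningful; after fine-tuning it would read "黑 色" for this example):

# indices of the most likely answer start and end positions
start = torch.argmax(start_logits, dim=-1).item()
end = torch.argmax(end_logits, dim=-1).item()
# map the ids in that span back to readable tokens
answer_ids = indexed_tokens[start:end + 1]
print(tokenizer.convert_ids_to_tokens(answer_ids))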
5. Transfer Learning in Practice

Task-specific fine-tuning scripts:
The huggingface team provides fine-tuning scripts for the task types in the GLUE collection; at their core, these scripts fine-tune the model's final fully connected layer. Simple parameter settings select the GLUE task type (e.g. CoLA for single-sentence binary classification, MRPC for sentence-pair binary classification, STS-B for sentence-pair multi-class classification) and the pretrained model to fine-tune.

Steps for using a task-specific fine-tuning script:
Step 1: download the fine-tuning script.
Step 2: configure the script's parameters.
Step 3: run it and check the results.

Step 1: download the fine-tuning script

# clone huggingface's transformers repository
git clone https://github.com/huggingface/transformers.git
# enter the transformers folder
cd transformers
# install the transformers package for python, since the fine-tuning scripts are .py files
pip install .
# the current version may differ from the one used in this course, so also run:
pip install transformers==2.3.0
# enter the directory holding the fine-tuning scripts and list it
cd examples
ls
# run_glue.py is the fine-tuning script for the GLUE task types
Note:
Because of version changes to run_glue.py, copy the code from http://git.itcast.cn/Stephen/AI-key-file/blob/master/run_glue.py and overwrite the original contents.

Step 2: configure the script's parameters

Create a run_glue.sh file alongside run_glue.py with the following content:

# define DATA_DIR: the path to the fine-tuning data; here we use the data in glue_data
export DATA_DIR="../../glue_data"
# define SAVE_DIR: the model save path; we save into bert_finetuning_test under the current directory
export SAVE_DIR="./bert_finetuning_test/"

# run the fine-tuning script with python
# --model_type: the model type to fine-tune; options include BERT, XLNET, XLM, roBERTa, distilBERT, ALBERT
# --model_name_or_path: the concrete model or variant; we fine-tune on English data, so bert-base-uncased
# --task_name: the task type, e.g. MRPC for sentence-pair binary classification
# --do_train: train with the fine-tuning script
# --do_eval: evaluate with the fine-tuning script
# --data_dir: path holding the training and validation sets; train.tsv and dev.tsv under it are picked up automatically
# --max_seq_length: maximum input length; longer inputs are truncated, shorter ones padded
# --learning_rate: learning rate
# --num_train_epochs: number of training epochs
# --output_dir $SAVE_DIR: where the fine-tuned model is saved
# --overwrite_output_dir: clear the save path and rewrite it when training again
python run_glue.py \
  --model_type BERT \
  --model_name_or_path bert-base-uncased \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --data_dir $DATA_DIR/MRPC/ \
  --max_seq_length 128 \
  --learning_rate 2e-5 \
  --num_train_epochs 1.0 \
  --output_dir $SAVE_DIR \
  --overwrite_output_dir

Step 3: run it and check the results
# Run with the sh command
sh run_glue.sh
Output:
# The script finally prints the model's evaluation results:
01/05/2020 23:59:53 - INFO - __main__ -   Saving features into cached file ../../glue_data/MRPC/cached_dev_bert-base-uncased_128_mrpc
01/05/2020 23:59:53 - INFO - __main__ - ***** Running evaluation *****
01/05/2020 23:59:53 - INFO - __main__ -   Num examples = 408
01/05/2020 23:59:53 - INFO - __main__ -   Batch size = 8
Evaluating: 100%|█| 51/51 [00:23<00:00, 2.20it/s]
01/06/2020 00:00:16 - INFO - __main__ - ***** Eval results *****
01/06/2020 00:00:16 - INFO - __main__ -   acc = 0.7671568627450981
01/06/2020 00:00:16 - INFO - __main__ -   acc_and_f1 = 0.8073344506341863
01/06/2020 00:00:16 - INFO - __main__ -   f1 = 0.8475120385232745
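For MRPC, the acc_and_f1 metric is simply the mean of accuracy and F1, which you can verify against the log above:

# acc_and_f1 for MRPC is the mean of accuracy and F1
acc = 0.7671568627450981
f1 = 0.8475120385232745
print((acc + f1) / 2)  # 0.8073344506341863, matching the log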
View the contents of $SAVE_DIR:
added_tokens.json
checkpoint-450  checkpoint-400  checkpoint-350  checkpoint-300  checkpoint-250
checkpoint-200  checkpoint-150  checkpoint-100  checkpoint-50
pytorch_model.bin
training_args.bin
config.json
special_tokens_map.json
vocab.txt
eval_results.txt
tokenizer_config.json
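As the file descriptions below note, the .bin files here are ordinary PyTorch serializations, so you can peek inside them directly. A minimal sketch, assuming the $SAVE_DIR layout shown above:

import torch

# Inspect the saved model parameters
state_dict = torch.load("./bert_finetuning_test/pytorch_model.bin", map_location="cpu")
print(list(state_dict.keys())[:5])   # first few parameter tensor names

# Inspect the training hyperparameters saved by the script
train_args = torch.load("./bert_finetuning_test/training_args.bin")
print(train_args)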
File descriptions:
pytorch_model.bin holds the model parameters and can be loaded and inspected with torch.load;
training_args.bin holds the hyperparameters used during training, such as batch_size and epochs, and can likewise be inspected with torch.load;
config.json is the model configuration file, holding settings such as the number of attention heads and the number of encoder layers; it describes a standard architecture (e.g. bert, xlnet) and is generally not modified;
added_tokens.json records the numeric ids of custom tokens added in code at training time, i.e. vocabulary added via the tokenizer's add_tokens method;
special_tokens_map.json stores special tokens that carry a particular meaning, such as separators, mapping each special character to its meaning, so special characters appearing in the text are first mapped to their meanings, which are in turn mapped to ids via add_tokens;
checkpoint-*: model parameter files saved every fixed number of steps (checkpoint files).

Steps for using a model fine-tuned with the script:
Step 1: create an account at https://huggingface.co/join
Step 2: log in with transformers-cli in the server terminal
Step 3: upload the model with transformers-cli and inspect it
Step 4: load and use the model via torch.hub

Step 1: create an account at https://huggingface.co/join

# If the site is unreachable due to network issues, a default account is provided:
username: ItcastAI
password: ItcastAI

Step 2: log in with transformers-cli in the server terminal
# Log in on the server where the model was fine-tuned,
# using the username and password registered above
# default username: ItcastAI
# default password: ItcastAI
$ transformers-cli login

Step 3: upload the model with transformers-cli and inspect it
# Upload the model with the transformers-cli upload command,
# pointing it at the fine-tuned model's directory
$ transformers-cli upload ./bert_finetuning_test/

# Inspect the upload
$ transformers-cli ls

Filename                                              LastModified             ETag                               Size
----------------------------------------------------- ------------------------ ---------------------------------- ---------
bert_finetuning_test/added_tokens.json                2020-01-05T17:39:57.000Z "99914b932bd37a50b983c5e7c90ae93b"         2
bert_finetuning_test/checkpoint-400/config.json       2020-01-05T17:26:49.000Z "74d53ea41e5acb6d60496bc195d82a42"       684
bert_finetuning_test/checkpoint-400/training_args.bin 2020-01-05T17:26:47.000Z "b3273519c2b2b1cb2349937279880f50"      1207
bert_finetuning_test/checkpoint-450/config.json       2020-01-05T17:15:42.000Z "74d53ea41e5acb6d60496bc195d82a42"       684
bert_finetuning_test/checkpoint-450/pytorch_model.bin 2020-01-05T17:15:58.000Z "077cc0289c90b90d6b662cce104fe4ef" 437982584
bert_finetuning_test/checkpoint-450/training_args.bin 2020-01-05T17:15:40.000Z "b3273519c2b2b1cb2349937279880f50"      1207
bert_finetuning_test/config.json                      2020-01-05T17:28:50.000Z "74d53ea41e5acb6d60496bc195d82a42"       684
bert_finetuning_test/eval_results.txt                 2020-01-05T17:28:56.000Z "67d2d49a96afc4308d33bfcddda8a7c5"        81
bert_finetuning_test/pytorch_model.bin                2020-01-05T17:28:59.000Z "d46a8ccfb8f5ba9ecee70cef8306679e" 437982584
bert_finetuning_test/special_tokens_map.json          2020-01-05T17:28:54.000Z "8b3fb1023167bb4ab9d70708eb05f6ec"       112
bert_finetuning_test/tokenizer_config.json            2020-01-05T17:28:52.000Z "0d7f03e00ecb582be52818743b50e6af"        59
bert_finetuning_test/training_args.bin                2020-01-05T17:28:48.000Z "b3273519c2b2b1cb2349937279880f50"      1207
bert_finetuning_test/vocab.txt                        2020-01-05T17:39:55.000Z "64800d5d8528ce344256daf115d4965e"    231508

Step 4: load and use the model via torch.hub; for more details see section 2.4, Loading and Using Pretrained Models
# If you have used huggingface transformers before, clear ~/.cache first
import torch

source = 'huggingface/pytorch-transformers'
# Select which part of the model to load; here it is the tokenizer
part = 'tokenizer'

#############################################
# Name of the pretrained model to load:
# your own model name in the form "username/model_name",
# e.g. 'ItcastAI/bert_finetuning_test'
model_name = 'ItcastAI/bert_finetuning_test'
#############################################

tokenizer = torch.hub.load(source, part, model_name)
model = torch.hub.load(source, 'modelForSequenceClassification', model_name)

index = tokenizer.encode("Talk is cheap", "Please show me your code!")
# 102 is the numeric id of BERT's separator (end) token
mark = 102
# Find the index of the first 102, i.e. the separator of the sentence pair
k = index.index(mark)
# Segment id list of 0s and 1s: positions with 0 belong to the first sentence, 1 to the second
segments_ids = [0]*(k + 1) + [1]*(len(index) - k - 1)
# Convert to tensors
tokens_tensor = torch.tensor([index])
segments_tensors = torch.tensor([segments_ids])

# Evaluate without gradient tracking
with torch.no_grad():
    # Run the model to obtain the prediction
    result = model(tokens_tensor, token_type_ids=segments_tensors)
    # Print the prediction and its tensor shape
    print(result)
    print(result[0].shape)
Output:
(tensor([[-0.0181, 0.0263]]),)
torch.Size([1, 2])

Two ways of transfer learning via fine-tuning:
Type 1: fine-tune the pretrained model with a task-specific fine-tuning script, followed by a predefined network with an output head that produces the result.
Type 2: load the pretrained model directly to obtain feature representations of the input text, followed by a custom network that is fine-tuned to produce the result.
Note: all of the hands-on demonstrations below work on Chinese text.

Type 1 in practice:
Fine-tune a Chinese pretrained model with the fine-tuning script for SST-2, a binary text classification task, followed by a predefined network with a classification head; the goal is to predict the sentiment of a sentence.
Prepare a Chinese hotel-review sentiment corpus with the same layout as the SST-2 dataset, where label 0 marks a negative review and label 1 a positive one.
The corpus lives in cn_data/, a sibling of glue_data/; its SST-2 directory contains train.tsv and dev.tsv.
train.tsv
sentence	label
早餐不好,服務(wù)不到位,晚餐無西餐,早餐晚餐相同,房間條件不好,餐廳不分吸煙區(qū).房間不分有無煙房.	0
去的時候 ,酒店大廳和餐廳在裝修,感覺大廳有點(diǎn)擠.由于餐廳裝修本來該享受的早飯,也沒有享受(他們是8點(diǎn)開始每個房間送,但是我時間來不及了)不過前臺服務(wù)員態(tài)度好!	1
有很長時間沒有在西藏大廈住了,以前去北京在這里住的較多。這次住進(jìn)來發(fā)現(xiàn)換了液晶電視,但網(wǎng)絡(luò)不是很好,他們自己說是收費(fèi)的原因造成的。其它還好。	1
非常好的地理位置,住的是豪華海景房,打開窗戶就可以看見棧橋和海景。記得很早以前也住過,現(xiàn)在重新裝修了??偟膩碚f比較滿意,以后還會住	1
交通很方便,房間小了一點(diǎn),但是干凈整潔,很有香港的特色,性價比較高,推薦一下哦	1
酒店的裝修比較陳舊,房間的隔音,主要是衛(wèi)生間的隔音非常差,只能算是一般的	0
酒店有點(diǎn)舊,房間比較小,但酒店的位子不錯,就在海邊,可以直接去游泳。8樓的海景打開窗戶就是海。如果想住在熱鬧的地帶,這里不是一個很好的選擇,不過威海城市真的比較小,打車還是相當(dāng)便宜的。晚上酒店門口出租車比較少。	1
位置很好,走路到文廟、清涼寺5分鐘都用不了,周邊公交車很多很方便,就是出租車不太愛去(老城區(qū)路窄愛堵車),因為是老賓館所以設(shè)施要陳舊些,	1
酒店設(shè)備一般,套房里臥室的不能上網(wǎng),要到客廳去。	0
dev.tsv
sentence	label
房間里有電腦,雖然房間的條件略顯簡陋,但環(huán)境、服務(wù)還有飯菜都還是很不錯的。如果下次去無錫,我還是會選擇這里的。	1
我們是5月1日通過攜程網(wǎng)入住的,條件是太差了,根本達(dá)不到四星級的標(biāo)準(zhǔn),所有的東西都很陳舊,衛(wèi)生間水龍頭用完竟關(guān)不上,浴缸的漆面都掉了,估計是十年前的四星級吧,總之下次是不會入住了。	0
離火車站很近很方便。住在東樓標(biāo)間,相比較在九江住的另一家酒店,房間比較大。衛(wèi)生間設(shè)施略舊。服務(wù)還好。10元中式早餐也不錯,很豐富,居然還有青菜肉片湯。	1
坐落在香港的老城區(qū),可以體驗香港居民生活,門口交通很方便,如果時間不緊,坐叮當(dāng)車很好呀!周圍有很多小餐館,早餐就在中遠(yuǎn)后面的南北嚼吃的,東西很不錯。我們定的大床房,挺安靜的,總體來說不錯。前臺結(jié)賬沒有銀聯(lián)!	1
酒店前臺服務(wù)差,對待客人不熱情。號稱攜程沒有預(yù)定。感覺是客人在求他們,我們一定得住。這樣的賓館下次不會入??!	0
價格確實(shí)比較高,而且還沒有早餐提供。	1
是一家很實(shí)惠的酒店,交通方便,房間也寬敞,晚上沒有電話騷擾,住了兩次,有一次?。担埃狈块g,洗澡間排水不暢通,也許是個別問題.服務(wù)質(zhì)量很好,剛?cè)胱r沒有調(diào)好寬帶,服務(wù)員很快就幫忙解決了.	1
位置非常好,就在西街的街口,但是卻鬧中取靜,環(huán)境很清新優(yōu)雅。	1
房間應(yīng)該超出30平米,是HK同級酒店中少有的大;重裝之后,設(shè)備也不錯.	1
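If your raw reviews live in some other form, a small hypothetical helper can write them into this SST-2-style tsv layout (the column names and the cn_data/SST-2 path match what the fine-tuning script expects; the raw_reviews list is a made-up stand-in for your own data):

import os
import pandas as pd

# Hypothetical raw data: (text, label) pairs; label 0 = negative, 1 = positive
raw_reviews = [("早餐不好,服務(wù)不到位", 0), ("交通很方便,性價比較高", 1)]

os.makedirs("./cn_data/SST-2", exist_ok=True)
df = pd.DataFrame(raw_reviews, columns=["sentence", "label"])
# Tab-separated with a "sentence label" header row, as in SST-2
df.to_csv("./cn_data/SST-2/train.tsv", sep="\t", index=False)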
Create a run_cn.sh file in the same directory as run_glue.py and write the following into it:
# Define DATA_DIR: the path of the fine-tuning data
export DATA_DIR="../../cn_data"
# Define SAVE_DIR: the model save path; we save the model in the bert_cn_finetuning directory under the current one
export SAVE_DIR="./bert_cn_finetuning/"

# Run the fine-tuning script with python
# --model_type: BERT
# --model_name_or_path: bert-base-chinese
# --task_name: SST-2, the binary sentence classification task
# --do_train: train with the fine-tuning script
# --do_eval: evaluate with the fine-tuning script
# --data_dir: "../../cn_data/SST-2/"; train.tsv and dev.tsv are looked up automatically under this path
# --max_seq_length: 128, the maximum input sentence length
# --output_dir $SAVE_DIR: "./bert_cn_finetuning/", the save path of the fine-tuned model
python run_glue.py \
  --model_type BERT \
  --model_name_or_path bert-base-chinese \
  --task_name SST-2 \
  --do_train \
  --do_eval \
  --data_dir $DATA_DIR/SST-2/ \
  --max_seq_length 128 \
  --learning_rate 2e-5 \
  --num_train_epochs 1.0 \
  --output_dir $SAVE_DIR
Run and check the results
# Run with the sh command
sh run_cn.sh
Output:
# The script finally prints the model's evaluation results; accuracy reaches 0.88.
01/06/2020 14:22:36 - INFO - __main__ -   Saving features into cached file ../../cn_data/SST-2/cached_dev_bert-base-chinese_128_sst-2
01/06/2020 14:22:36 - INFO - __main__ - ***** Running evaluation *****
01/06/2020 14:22:36 - INFO - __main__ -   Num examples = 1000
01/06/2020 14:22:36 - INFO - __main__ -   Batch size = 8
Evaluating: 100%|████████████| 125/125 [00:56<00:00, 2.20it/s]
01/06/2020 14:23:33 - INFO - __main__ - ***** Eval results *****
01/06/2020 14:23:33 - INFO - __main__ -   acc = 0.88
View the contents of $SAVE_DIR:
added_tokens.json
checkpoint-350  checkpoint-300  checkpoint-250  checkpoint-200
checkpoint-150  checkpoint-100  checkpoint-50
pytorch_model.bin
training_args.bin
config.json
special_tokens_map.json
vocab.txt
eval_results.txt
tokenizer_config.json
Upload the model with transformers-cli:
# default username: ItcastAI
# default password: ItcastAI
$ transformers-cli login

# Upload the model with the transformers-cli upload command,
# pointing it at the fine-tuned model's directory
$ transformers-cli upload ./bert_cn_finetuning/
Load and use the model via torch.hub:
import torch

source = 'huggingface/pytorch-transformers'
# The model name is 'ItcastAI/bert_cn_finetuning'
model_name = 'ItcastAI/bert_cn_finetuning'

tokenizer = torch.hub.load(source, 'tokenizer', model_name)
model = torch.hub.load(source, 'modelForSequenceClassification', model_name)

def get_label(text):
    index = tokenizer.encode(text)
    tokens_tensor = torch.tensor([index])
    # Evaluate without gradient tracking
    with torch.no_grad():
        # Run the model to obtain the prediction
        result = model(tokens_tensor)
    predicted_label = torch.argmax(result[0]).item()
    return predicted_label

if __name__ == "__main__":
    # text = "早餐不好,服務(wù)不到位,晚餐無西餐,早餐晚餐相同,房間條件不好"
    text = "房間應(yīng)該超出30平米,是HK同級酒店中少有的大;重裝之后,設(shè)備也不錯."
    print("Input text:", text)
    print("Predicted label:", get_label(text))
Output:
Input text: 早餐不好,服務(wù)不到位,晚餐無西餐,早餐晚餐相同,房間條件不好
Predicted label: 0
Input text: 房間應(yīng)該超出30平米,是HK同級酒店中少有的大;重裝之后,設(shè)備也不錯.
Predicted label: 1

Type 2 in practice:
Load the pretrained model directly to obtain feature representations of the input text, followed by a custom network that is fine-tuned to produce the result.
The corpus and the goal are the same as in the Type 1 demonstration.
Load the pretrained model directly to obtain feature representations of the input text:
import torch
# For truncating/padding sentences to a fixed length
from keras.preprocessing import sequence

source = 'huggingface/pytorch-transformers'
# Use the pretrained Chinese BERT model directly
model_name = 'bert-base-chinese'

# Obtain the trained bert-base-chinese model via torch.hub
model = torch.hub.load(source, 'model', model_name)
# Obtain the matching tokenizer, which maps each Chinese character to a number
tokenizer = torch.hub.load(source, 'tokenizer', model_name)

# Normalized sentence length
cutlen = 32

def get_bert_encode(text):
    """
    description: encode Chinese text with bert-base-chinese
    :param text: the text to encode
    :return: the text's tensor representation produced by BERT
    """
    # First map each character with the tokenizer.
    # Note: BERT's tokenizer wraps the result with start and end markers, 101 and 102.
    # These matter when encoding multiple segments, but serve no purpose here,
    # so [1:-1] slices them off.
    indexed_tokens = tokenizer.encode(text[:cutlen])[1:-1]
    # Truncate/pad the mapped sentence to the normalized length
    indexed_tokens = sequence.pad_sequences([indexed_tokens], cutlen)
    # Convert the list into a tensor
    tokens_tensor = torch.LongTensor(indexed_tokens)
    # Do not track gradients
    with torch.no_grad():
        # Run the model to obtain the hidden-layer output
        encoded_layers, _ = model(tokens_tensor)
    # The hidden output is a 3D tensor whose outermost dimension is 1; [0] removes it
    encoded_layers = encoded_layers[0]
    return encoded_layers
Usage:
if __name__ == "__main__":
    text = "早餐不好,服務(wù)不到位,晚餐無西餐,早餐晚餐相同,房間條件不好"
    encoded_layers = get_bert_encode(text)
    print(encoded_layers)
    print(encoded_layers.shape)
Output:
tensor([[-1.2282,  1.0551, -0.7953,  ...,  2.3363, -0.6413,  0.4174],
        [-0.9769,  0.8361, -0.4328,  ...,  2.1668, -0.5845,  0.4836],
        [-0.7990,  0.6181, -0.1424,  ...,  2.2845, -0.6079,  0.5288],
        ...,
        [ 0.9514,  0.5972,  0.3120,  ...,  1.8408, -0.1362, -0.1206],
        [ 0.1250,  0.1984,  0.0484,  ...,  1.2302, -0.1905,  0.3205],
        [ 0.2651,  0.0228,  0.1534,  ...,  1.0159, -0.3544,  0.1479]])

torch.Size([32, 768])
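One caveat: get_bert_encode above leans on keras.preprocessing.sequence.pad_sequences, which may not be available in newer Keras releases. A minimal pure-PyTorch replacement for that single call might look like the sketch below (pad_to_cutlen is a made-up helper name; it mirrors pad_sequences' default 'pre' padding and truncation):

import torch

def pad_to_cutlen(indexed_tokens, cutlen=32):
    # Truncate from the front if too long, then left-pad with zeros to cutlen,
    # matching pad_sequences' defaults (padding='pre', truncating='pre')
    tokens = indexed_tokens[-cutlen:]
    padded = [0] * (cutlen - len(tokens)) + tokens
    return torch.LongTensor([padded])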
Define a custom single-layer fully connected network as the fine-tuning network:
As a rule of thumb from practical experience, the total number of parameters in the custom fine-tuning network should be between 0.5x and 10x the number of training samples; this helps the model converge within a reasonable amount of time.

import torch.nn as nn

class Net(nn.Module):
    """The fine-tuning network."""
    def __init__(self, char_size=32, embedding_size=768):
        """
        :param char_size: number of characters in an input sentence, i.e. the normalized sentence length 32
        :param embedding_size: character embedding dimension; the Chinese BERT model uses 768
        """
        super(Net, self).__init__()
        # Store char_size and embedding_size
        self.char_size = char_size
        self.embedding_size = embedding_size
        # Instantiate a fully connected layer
        self.fc1 = nn.Linear(char_size * embedding_size, 2)

    def forward(self, x):
        # Reshape the input tensor to match the next layer's expected input
        x = x.view(-1, self.char_size * self.embedding_size)
        # Apply the fully connected layer
        x = self.fc1(x)
        return x
Usage:
if __name__ == "__main__":
    # A randomly initialized input
    x = torch.randn(1, 32, 768)
    # Instantiate the network with default parameters
    net = Net()
    nr = net(x)
    print(nr)
Output:
tensor([[0.3279, 0.2519]], grad_fn=<ReluBackward0>)
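The parameter-count rule of thumb above can be checked directly for this head; a quick sketch using the Net class just defined:

# Count the trainable parameters of the fine-tuning head
net = Net()
total = sum(p.numel() for p in net.parameters())
print(total)  # 32*768*2 weights + 2 biases = 49154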
Build batch generators for the training and validation data:
import pandas as pd
from collections import Counter
from functools import reduce
from sklearn.utils import shuffle

def data_loader(train_data_path, valid_data_path, batch_size):
    """
    description: load data from persisted files
    :param train_data_path: path of the training data
    :param valid_data_path: path of the validation data
    :param batch_size: batch size for the training and validation sets
    :return: training data generator, validation data generator,
             number of training samples, number of validation samples
    """
    # Read the tsv data with pandas and drop the first row (the column names)
    train_data = pd.read_csv(train_data_path, header=None, sep="\t").drop([0])
    valid_data = pd.read_csv(valid_data_path, header=None, sep="\t").drop([0])

    # Print the number of positive and negative samples in both sets
    print("Positive/negative sample counts in the training set:")
    print(dict(Counter(train_data[1].values)))
    print("Positive/negative sample counts in the validation set:")
    print(dict(Counter(valid_data[1].values)))

    # The validation set must contain at least one full batch
    if len(valid_data) < batch_size:
        raise ValueError("Batch size or split not match!")

    def _loader_generator(data):
        """
        description: yield batches of the training/validation data
        :param data: the training or validation data
        :return: a generator over batches of data and labels
        """
        # Walk over the dataset in steps of batch_size
        for batch in range(0, len(data), batch_size):
            # Tensor lists for this batch
            batch_encoded = []
            batch_labels = []
            # Shuffle the data, take a batch_size slice, and iterate item by item
            for item in shuffle(data.values.tolist())[batch: batch + batch_size]:
                # Encode with the Chinese BERT model
                encoded = get_bert_encode(item[0])
                # Collect each encoded sample
                batch_encoded.append(encoded)
                # Collect the corresponding label
                batch_labels.append([int(item[1])])
            # Use reduce to concatenate the lists into the tensors the model expects;
            # encoded has shape (batch_size*max_len, embedding_size)
            encoded = reduce(lambda x, y: torch.cat((x, y), dim=0), batch_encoded)
            labels = torch.tensor(reduce(lambda x, y: x + y, batch_labels))
            # Yield the data and labels as a generator
            yield (encoded, labels)

    # Apply _loader_generator to the training and validation sets,
    # and also return the number of samples in each
    return _loader_generator(train_data), _loader_generator(valid_data), len(train_data), len(valid_data)
Usage:
if __name__ == "__main__":
    train_data_path = "./cn_data/SST-2/train.tsv"
    valid_data_path = "./cn_data/SST-2/dev.tsv"
    batch_size = 16
    train_data_labels, valid_data_labels, \
        train_data_len, valid_data_len = data_loader(train_data_path, valid_data_path, batch_size)
    print(next(train_data_labels))
    print(next(valid_data_labels))
    print("train_data_len:", train_data_len)
    print("valid_data_len:", valid_data_len)
Output:
Positive/negative sample counts in the training set:
{'0': 1518, '1': 1442}
Positive/negative sample counts in the validation set:
{'1': 518, '0': 482}
(tensor([[[-0.8328,  0.9376, -1.2489,  ...,  1.8594, -0.4636, -0.1682],
          [-0.9798,  0.5113, -0.9868,  ...,  1.5500, -0.1934,  0.2521],
          [-0.7574,  0.3086, -0.6031,  ...,  1.8467, -0.2507,  0.3916],
          ...,
          [ 0.0064,  0.2321,  0.3785,  ...,  0.3376,  0.4748, -0.1272],
          [-0.3175,  0.4018, -0.0377,  ...,  0.6030,  0.2916, -0.4172],
          [-0.6154,  1.0439,  0.2921,  ...,  0.5048, -0.0983,  0.0061]]]),
 tensor([0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0]))
(tensor([[[-0.1611,  0.9182, -0.3419,  ...,  0.6323, -0.2013,  0.0184],
          [-0.1224,  0.7706, -0.2386,  ...,  0.7925,  0.0444,  0.2160],
          [-0.0301,  0.6867, -0.1510,  ...,  0.9140,  0.0308,  0.2611],
          ...,
          [ 0.3662, -0.4925,  1.2332,  ...,  0.7741, -0.1007, -0.3099],
          [-0.0932, -0.8494,  0.6586,  ...,  0.1235, -0.3152, -0.1635],
          [ 0.5306, -0.5510,  0.3105,  ...,  1.2631, -0.5882, -0.1133]]]),
 tensor([1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0]))
train_data_len: 2960
valid_data_len: 1000
Write the training and validation functions:
import torch.optim as optim

def train(train_data_labels):
    """
    description: training function; updates the model parameters and accumulates loss and accuracy
    :param train_data_labels: generator over training data and labels
    :return: accumulated per-batch average losses and the count of correct labels over the whole pass
    """
    # Initialize the running loss and running correct-label count
    train_running_loss = 0.0
    train_running_acc = 0.0
    # Iterate over the generator, updating the model once per batch
    for train_tensor, train_labels in train_data_labels:
        # Reset the optimizer's gradients for this batch
        optimizer.zero_grad()
        # Run the fine-tuning network
        train_outputs = net(train_tensor)
        # Compute the batch's average loss
        train_loss = criterion(train_outputs, train_labels)
        # Accumulate it into train_running_loss
        train_running_loss += train_loss.item()
        # Backpropagate the loss
        train_loss.backward()
        # Update the model parameters
        optimizer.step()
        # Accumulate the number of correctly predicted labels, for computing accuracy later
        train_running_acc += (train_outputs.argmax(1) == train_labels).sum().item()
    return train_running_loss, train_running_acc

def valid(valid_data_labels):
    """
    description: validation function; evaluates the model on held-out data, accumulating loss and accuracy
    :param valid_data_labels: generator over validation data and labels
    :return: accumulated per-batch average losses and the count of correct labels over the whole pass
    """
    # Initialize the running loss and running correct-label count
    valid_running_loss = 0.0
    valid_running_acc = 0.0
    # Iterate over the validation generator
    for valid_tensor, valid_labels in valid_data_labels:
        # Do not track gradients
        with torch.no_grad():
            # Run the fine-tuning network
            valid_outputs = net(valid_tensor)
            # Compute the batch's average loss
            valid_loss = criterion(valid_outputs, valid_labels)
            # Accumulate it into valid_running_loss
            valid_running_loss += valid_loss.item()
            # Accumulate the number of correctly predicted labels, for computing accuracy later
            valid_running_acc += (valid_outputs.argmax(1) == valid_labels).sum().item()
    return valid_running_loss, valid_running_acc
Run the training and save the model:
if __name__ == "__main__":
    # Data paths
    train_data_path = "./cn_data/SST-2/train.tsv"
    valid_data_path = "./cn_data/SST-2/dev.tsv"
    # Cross-entropy loss
    criterion = nn.CrossEntropyLoss()
    # SGD optimizer
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
    # Number of training epochs
    epochs = 4
    # Batch size
    batch_size = 16
    # Train for the specified number of epochs
    for epoch in range(epochs):
        # Print the epoch number
        print("Epoch:", epoch + 1)
        # Obtain fresh training/validation generators and the sample counts from the data loader
        train_data_labels, valid_data_labels, train_data_len, \
            valid_data_len = data_loader(train_data_path, valid_data_path, batch_size)
        # Train
        train_running_loss, train_running_acc = train(train_data_labels)
        # Validate
        valid_running_loss, valid_running_acc = valid(valid_data_labels)
        # Compute the epoch's average loss: train_running_loss and valid_running_loss are
        # sums of per-batch average losses, so multiplying by batch_size gives the total
        # loss, and dividing by the sample count gives the per-sample average
        train_average_loss = train_running_loss * batch_size / train_data_len
        valid_average_loss = valid_running_loss * batch_size / valid_data_len
        # train_running_acc and valid_running_acc are counts of correctly predicted labels,
        # so dividing by the sample count gives the epoch's accuracy
        train_average_acc = train_running_acc / train_data_len
        valid_average_acc = valid_running_acc / valid_data_len
        # Print the epoch's training/validation loss and accuracy
        print("Train Loss:", train_average_loss, "|", "Train Acc:", train_average_acc)
        print("Valid Loss:", valid_average_loss, "|", "Valid Acc:", valid_average_acc)
    print('Finished Training')
    # Save path
    MODEL_PATH = './BERT_net.pth'
    # Save the model parameters
    torch.save(net.state_dict(), MODEL_PATH)
    print('Finished Saving')
Output:
Epoch: 1
Train Loss: 2.144986984236597 | Train Acc: 0.7347972972972973
Valid Loss: 2.1898122818128902 | Valid Acc: 0.704
Epoch: 2
Train Loss: 1.3592962406135032 | Train Acc: 0.8435810810810811
Valid Loss: 1.8816152956699324 | Valid Acc: 0.784
Epoch: 3
Train Loss: 1.5507876996199943 | Train Acc: 0.8439189189189189
Valid Loss: 1.8626576719331536 | Valid Acc: 0.795
Epoch: 4
Train Loss: 0.7825378059198299 | Train Acc: 0.9081081081081082
Valid Loss: 2.121698483480899 | Valid Acc: 0.803
Finished Training
Finished Saving
Load the model and use it:
if __name__ == "__main__":
    MODEL_PATH = './BERT_net.pth'
    # Load the saved model parameters
    net.load_state_dict(torch.load(MODEL_PATH))
    # text = "酒店設(shè)備一般,套房里臥室的不能上網(wǎng),要到客廳去。"
    text = "房間應(yīng)該超出30平米,是HK同級酒店中少有的大;重裝之后,設(shè)備也不錯."
    print("Input text:", text)
    with torch.no_grad():
        output = net(get_bert_encode(text))
        # Take the index of the maximum value in output as the predicted label
        print("Predicted label:", torch.argmax(output).item())
Output:
Input text: 房間應(yīng)該超出30平米,是HK同級酒店中少有的大;重裝之后,設(shè)備也不錯.
Predicted label: 1
Input text: 酒店設(shè)備一般,套房里臥室的不能上網(wǎng),要到客廳去。
Predicted label: 0
