1 Star 0 Fork 0

modelee / dpr-ctx_encoder-multiset-base

加入 Gitee
与超过 1200万 开发者一起发现、参与优秀开源项目,私有仓库也完全免费 :)
免费加入
该仓库未声明开源许可证文件(LICENSE),使用请关注具体项目描述及其代码上游依赖。
克隆/下载
贡献代码
同步代码
取消
提示: 由于 Git 不支持空文件夾,创建文件夹后会生成空的 .keep 文件
Loading...
README
CC-BY-4.0
language license tags datasets inference
en
cc-by-nc-4.0
dpr
nq_open
false

dpr-ctx_encoder-multiset-base

Table of Contents

Model Details

Model Description: Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. dpr-ctx_encoder-multiset-base is the context encoder trained using the Natural Questions (NQ) dataset, TriviaQA, WebQuestions (WQ), and CuratedTREC (TREC).

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
model = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
input_ids = tokenizer("Hello, is my dog cute ?", return_tensors="pt")["input_ids"]
embeddings = model(input_ids).pooler_output

Uses

Direct Use

dpr-ctx_encoder-multiset-base, dpr-question_encoder-multiset-base, and dpr-reader-multiset-base can be used for the task of open-domain question answering.

Misuse and Out-of-scope Use

The model should not be used to intentionally create hostile or alienating environments for people. In addition, the set of DPR models was not trained to be factual or true representations of people or events, and therefore using the models to generate such content is out-of-scope for the abilities of this model.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware this section may contain content that is disturbing, offensive, and can propogate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model can include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.

Training

Training Data

This model was trained using the following datasets:

Training Procedure

The training procedure is described in the associated paper:

Given a collection of M text passages, the goal of our dense passage retriever (DPR) is to index all the passages in a low-dimensional and continuous space, such that it can retrieve efficiently the top k passages relevant to the input question for the reader at run-time.

Our dense passage retriever (DPR) uses a dense encoder EP(·) which maps any text passage to a d- dimensional real-valued vectors and builds an index for all the M passages that we will use for retrieval. At run-time, DPR applies a different encoder EQ(·) that maps the input question to a d-dimensional vector, and retrieves k passages of which vectors are the closest to the question vector.

The authors report that for encoders, they used two independent BERT (Devlin et al., 2019) networks (base, un-cased) and use FAISS (Johnson et al., 2017) during inference time to encode and index passages. See the paper for further details on training, including encoders, inference, positive and negative passages, and in-batch negatives.

Evaluation

The following evaluation information is extracted from the associated paper.

Testing Data, Factors and Metrics

The model developers report the performance of the model on five QA datasets, using the top-k accuracy (k ∈ {20, 100}). The datasets were NQ, TriviaQA, WebQuestions (WQ), CuratedTREC (TREC), and SQuAD v1.1.

Results

Top 20 Top 100
NQ TriviaQA WQ TREC SQuAD NQ TriviaQA WQ TREC SQuAD
79.4 78.8 75.0 89.1 51.6 86.0 84.7 82.9 93.9 67.6

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019). We present the hardware type and based on the associated paper.

  • Hardware Type: 8 32GB GPUs
  • Hours used: Unknown
  • Cloud Provider: Unknown
  • Compute Region: Unknown
  • Carbon Emitted: Unknown

Technical Specifications

See the associated paper for details on the modeling architecture, objective, compute infrastructure, and training details.

Citation Information

  @inproceedings{karpukhin-etal-2020-dense,
    title = "Dense Passage Retrieval for Open-Domain Question Answering",
    author = "Karpukhin, Vladimir and Oguz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.550",
    doi = "10.18653/v1/2020.emnlp-main.550",
    pages = "6769--6781",
}

Model Card Authors

This model card was written by the team at Hugging Face.

--- language: en license: cc-by-nc-4.0 tags: - dpr datasets: - nq_open inference: false --- # `dpr-ctx_encoder-multiset-base` ## Table of Contents - [Model Details](#model-details) - [How To Get Started With the Model](#how-to-get-started-with-the-model) - [Uses](#uses) - [Risks, Limitations and Biases](#risks-limitations-and-biases) - [Training](#training) - [Evaluation](#evaluation-results) - [Environmental Impact](#environmental-impact) - [Technical Specifications](#technical-specifications) - [Citation Information](#citation-information) - [Model Card Authors](#model-card-authors) ## Model Details **Model Description:** [Dense Passage Retrieval (DPR)](https://github.com/facebookresearch/DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. `dpr-ctx_encoder-multiset-base` is the context encoder trained using the [Natural Questions (NQ) dataset](https://huggingface.co/datasets/nq_open), [TriviaQA](https://huggingface.co/datasets/trivia_qa), [WebQuestions (WQ)](https://huggingface.co/datasets/web_questions), and [CuratedTREC (TREC)](https://huggingface.co/datasets/trec). - **Developed by:** See [GitHub repo](https://github.com/facebookresearch/DPR) for model developers - **Model Type:** BERT-based encoder - **Language(s):** [CC-BY-NC-4.0](https://github.com/facebookresearch/DPR/blob/main/LICENSE), also see [Code of Conduct](https://github.com/facebookresearch/DPR/blob/main/CODE_OF_CONDUCT.md) - **License:** English - **Related Models:** - [`dpr-question_encoder-multiset-base`](https://huggingface.co/facebook/dpr-question_encoder-multiset-base) - [`dpr-reader-multiset-base`](https://huggingface.co/facebook/dpr-reader-multiset-base) - [`dpr-question-encoder-single-nq-base`](https://huggingface.co/facebook/dpr-question_encoder-single-nq-base) - [`dpr-reader-single-nq-base`](https://huggingface.co/facebook/dpr-reader-single-nq-base) - [`dpr-ctx_encoder-single-nq-base`](https://huggingface.co/facebook/dpr-ctx_encoder-single-nq-base) - **Resources for more information:** - [Research Paper](https://arxiv.org/abs/2004.04906) - [GitHub Repo](https://github.com/facebookresearch/DPR) - [Hugging Face DPR docs](https://huggingface.co/docs/transformers/main/en/model_doc/dpr) - [BERT Base Uncased Model Card](https://huggingface.co/bert-base-uncased) ## How to Get Started with the Model Use the code below to get started with the model. ```python from transformers import DPRContextEncoder, DPRContextEncoderTokenizer tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-multiset-base") model = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-multiset-base") input_ids = tokenizer("Hello, is my dog cute ?", return_tensors="pt")["input_ids"] embeddings = model(input_ids).pooler_output ``` ## Uses #### Direct Use `dpr-ctx_encoder-multiset-base`, [`dpr-question_encoder-multiset-base`](https://huggingface.co/facebook/dpr-question_encoder-multiset-base), and [`dpr-reader-multiset-base`](https://huggingface.co/facebook/dpr-reader-multiset-base) can be used for the task of open-domain question answering. #### Misuse and Out-of-scope Use The model should not be used to intentionally create hostile or alienating environments for people. In addition, the set of DPR models was not trained to be factual or true representations of people or events, and therefore using the models to generate such content is out-of-scope for the abilities of this model. ## Risks, Limitations and Biases **CONTENT WARNING: Readers should be aware this section may contain content that is disturbing, offensive, and can propogate historical and current stereotypes.** Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model can include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups. ## Training #### Training Data This model was trained using the following datasets: - **[Natural Questions (NQ) dataset](https://huggingface.co/datasets/nq_open)** ([Lee et al., 2019](https://aclanthology.org/P19-1612/); [Kwiatkowski et al., 2019](https://aclanthology.org/Q19-1026/)) - **[TriviaQA](https://huggingface.co/datasets/trivia_qa)** ([Joshi et al., 2017](https://aclanthology.org/P17-1147/)) - **[WebQuestions (WQ)](https://huggingface.co/datasets/web_questions)** ([Berant et al., 2013](https://aclanthology.org/D13-1160/)) - **[CuratedTREC (TREC)](https://huggingface.co/datasets/trec)** ([Baudiš & Šedivý, 2015](https://www.aminer.cn/pub/599c7953601a182cd263079b/reading-wikipedia-to-answer-open-domain-questions)) #### Training Procedure The training procedure is described in the [associated paper](https://arxiv.org/pdf/2004.04906.pdf): > Given a collection of M text passages, the goal of our dense passage retriever (DPR) is to index all the passages in a low-dimensional and continuous space, such that it can retrieve efficiently the top k passages relevant to the input question for the reader at run-time. > Our dense passage retriever (DPR) uses a dense encoder EP(·) which maps any text passage to a d- dimensional real-valued vectors and builds an index for all the M passages that we will use for retrieval. At run-time, DPR applies a different encoder EQ(·) that maps the input question to a d-dimensional vector, and retrieves k passages of which vectors are the closest to the question vector. The authors report that for encoders, they used two independent BERT ([Devlin et al., 2019](https://aclanthology.org/N19-1423/)) networks (base, un-cased) and use FAISS ([Johnson et al., 2017](https://arxiv.org/abs/1702.08734)) during inference time to encode and index passages. See the paper for further details on training, including encoders, inference, positive and negative passages, and in-batch negatives. ## Evaluation The following evaluation information is extracted from the [associated paper](https://arxiv.org/pdf/2004.04906.pdf). #### Testing Data, Factors and Metrics The model developers report the performance of the model on five QA datasets, using the top-k accuracy (k ∈ {20, 100}). The datasets were [NQ](https://huggingface.co/datasets/nq_open), [TriviaQA](https://huggingface.co/datasets/trivia_qa), [WebQuestions (WQ)](https://huggingface.co/datasets/web_questions), [CuratedTREC (TREC)](https://huggingface.co/datasets/trec), and [SQuAD v1.1](https://huggingface.co/datasets/squad). #### Results | | Top 20 | | | | | Top 100| | | | | |:----:|:------:|:---------:|:--:|:----:|:-----:|:------:|:---------:|:--:|:----:|:-----:| | | NQ | TriviaQA | WQ | TREC | SQuAD | NQ | TriviaQA | WQ | TREC | SQuAD | | | 79.4 | 78.8 |75.0| 89.1 | 51.6 | 86.0 | 84.7 |82.9| 93.9 | 67.6 | ## Environmental Impact Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). We present the hardware type and based on the [associated paper](https://arxiv.org/abs/2004.04906). - **Hardware Type:** 8 32GB GPUs - **Hours used:** Unknown - **Cloud Provider:** Unknown - **Compute Region:** Unknown - **Carbon Emitted:** Unknown ## Technical Specifications See the [associated paper](https://arxiv.org/abs/2004.04906) for details on the modeling architecture, objective, compute infrastructure, and training details. ## Citation Information ```bibtex @inproceedings{karpukhin-etal-2020-dense, title = "Dense Passage Retrieval for Open-Domain Question Answering", author = "Karpukhin, Vladimir and Oguz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau", booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)", month = nov, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.emnlp-main.550", doi = "10.18653/v1/2020.emnlp-main.550", pages = "6769--6781", } ``` ## Model Card Authors This model card was written by the team at Hugging Face.

简介

暂无描述 展开 收起
CC-BY-4.0
取消

发行版

暂无发行版

贡献者

全部

近期动态

加载更多
不能加载更多了
马建仓 AI 助手
尝试更多
代码解读
代码找茬
代码优化
1
https://gitee.com/modelee/dpr-ctx_encoder-multiset-base.git
git@gitee.com:modelee/dpr-ctx_encoder-multiset-base.git
modelee
dpr-ctx_encoder-multiset-base
dpr-ctx_encoder-multiset-base
main

搜索帮助

344bd9b3 5694891 D2dac590 5694891