Dialogue Research, Tencent AI Lab

News

We are now recruiting interns for our project "Multimodal Intelligent Human-Computer Interaction".
Please contact us by sending your CV to chatbot@tencent.com.

Publications

Yu Cao, Wei Bi, Meng Fang, Shuming Shi, and Dacheng Tao.
A Model-Agnostic Data Manipulation Method for Persona-Based Dialogue Generation. ACL 2022.
[pdf][code][bib]
Zhiyong Wu, Wei Bi, Xiang Li, Lingpeng Kong, and Ben Kao.
Lexical Knowledge Internalization for Neural Dialog Generation. ACL 2022.
[pdf][code][bib]
Chao Zhao, Wenlin Yao, Dian Yu, Kaiqiang Song, Dong Yu, and Jianshu Chen.
Learning-by-Narrating: Narrative Pre-training for Zero-Shot Dialogue Comprehension. ACL 2022 (Short).
[pdf][code][bib]
Qingyang Wu, Zhou Yu, Kun Xu, Eric Xing, and Pengtao Xie.
On the Generation of Medical Dialogs for COVID-19. ACL 2021.
[pdf][code][bib]
Zhangming Chan, Lemao Liu, Juntao Li, Haisong Zhang, Dongyan Zhao, Shuming Shi, and Rui Yan.
Enhancing the Open-Domain Dialogue Evaluation in Latent Space. ACL 2021 (Findings).
[pdf][code][bib]
Jiannan Xiang, Yahui Liu, Deng Cai, Huayang Li, Defu Lian, and Lemao Liu.
Assessing Dialogue Systems with Distribution Distances. ACL 2021 (Findings, Short).
[pdf][code][bib]
Xuefeng Bai, Yulong Chen, Linfeng Song, and Yue Zhang.
Semantic Representation for Dialogue Modeling. ACL 2021.
[pdf][code][bib]
Han Wu, Kun Xu, Linfeng Song, Lifeng Jin, Haisong Zhang, and Linqi Song.
Domain-Adaptive Pretraining Methods for Dialogue Understanding. ACL 2021 (Short).
[pdf][code][bib]
Jun Gao, Wei Bi, Ruifeng Xu, and Shuming Shi.
REAM♯: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation. ACL 2021 (Findings).
[pdf][code][bib]
Kun Xu, Han Wu, Linfeng Song, Haisong Zhang, Linqi Song and Dong Yu.
Conversational Semantic Role Labeling. TASLP.
[pdf] [code] [bib]
Xiaoyang Wang, Chen Li, Jianqiao Zhao, and Dong Yu.
NaturalConv: A Chinese Dialogue Dataset Towards Multi-Turn Topic-Driven Conversation. AAAI 2021.
[pdf] [code] [bib]
Zhiliang Tian, Wei Bi, Zihan Zhang, Dongkyu Lee, Yiping Song, and Nevin Zhang.
Learning from My Friends: Few-Shot Personalized Conversation Systems via Social Networks. AAAI 2021.
[pdf] [code] [bib]
Wei Wang, Piji Li, and Haitao Zheng.
Generating Diversified Comments via Reader-Aware Topic Modeling and Saliency Detection. AAAI 2021.
[pdf] [code] [bib]
Changying Hao, Liang Pang, Yanyan Lan, Yan Wang, Jiafeng Guo, and Xueqi Cheng.
Sketch and Customize: A Counterfactual Story Generator. AAAI 2021.
[pdf] [code] [bib]
Kun Xu, Haochen Tan, Linfeng Song, Han Wu, Haisong Zhang, Linqi Song and Dong Yu.
Semantic Role Labeling Guided Multi-turn Dialogue ReWriter. EMNLP 2020.
[pdf] [code] [bib]
Lingzhi Wang, Jing Li, Xingshan Zeng, Haisong Zhang and Kam-Fai Wong.
Continuity of Topic, Interaction, and Query: Learning to Quote in Online Conversations. EMNLP 2020.
[pdf] [code] [bib]
Haoyu Song, Yan Wang, Wei-Nan Zhang, Zhengyu Zhao, Ting Liu, Xiaojiang Liu.
Profile Consistency Identification for Open-domain Dialogue Agents. EMNLP 2020.
[pdf] [code] [bib]
Zibo Lin, Deng Cai, Yan Wang, Xiaojiang Liu, Haitao Zheng and Shuming Shi.
The World is not Binary: Learning to Rank with Grayscale Data for Dialogue Response Selection. EMNLP 2020.
[pdf] [code] [bib]
Yu Cao, Wei Bi, Meng Fang, Dacheng Tao.
Pretrained Language Models for Dialogue Generation with Multiple Input Sources. EMNLP 2020.
[pdf] [code] [bib]
Yifan Gao, Piji Li, Wei Bi, Xiaojiang Liu, Michael Lyu and Irwin King.
Dialogue Generation on Infrequent Sentence Functions via Structured Meta-Learning. EMNLP 2020.
[pdf] [code] [bib]
Zhiliang Tian, Wei Bi, Dongkyu Lee, Lanqing Xue, Yiping Song, Xiaojiang Liu, and Nevin L. Zhang.
Response-Anticipated Memory for On-Demand Knowledge Integration in Response Generation. ACL 2020.
[pdf] [code] [bib]
Haoyu Song, Yan Wang, Wei-Nan Zhang, Xiaojiang Liu, and Ting Liu.
Generate, Delete and Rewrite: A Three-Stage Framework for Improving Persona Consistency of Dialogue Generation. ACL 2020.
[pdf] [code] [bib]
Yiping Song, Zequn Liu, Wei Bi, Rui Yan, and Ming Zhang.
Learning to Customize Model Structures for Few-shot Dialogue Generation Tasks. ACL 2020.
[pdf] [code] [bib]
Xin Li, Piji Li, Wei Bi, Xiaojiang Liu, and Wai Lam.
Relevance-Promoting Language Model for Short-Text Conversation. AAAI 2020.
[pdf] [code] [bib]
Jian Wang, Junhao Liu, Wei Bi, Xiaojiang Liu, Kejing He, Ruifeng Xu, and Min Yang.
Improving Knowledge-aware Dialogue Generation via Knowledge Base Question Answering. AAAI 2020.
[pdf] [code] [bib]
Jun Gao, Wei Bi, Xiaojiang Liu, Junhui Li, Guodong Zhou, and Shuming Shi.
A Discrete CVAE for Response Generation on Short-Text Conversation. EMNLP 2019.
[pdf] [code] [bib]
Deng Cai, Yan Wang, Wei Bi, Zhaopeng Tu, Xiaojiang Liu, and Shuming Shi.
Retrieval-Guided Dialogue Response Generation via a Matching-to-Generation Framework. EMNLP 2019.
[pdf] [code] [bib]
Zhufeng Pan, Kun Bai, Yan Wang, Lianqiang Zhou, and Xiaojiang Liu.
Improving Open-Domain Dialogue Systems via Multi-Turn Incomplete Utterance Restoration. EMNLP 2019.
[pdf] [code] [bib]
Wenhu Chen, Jianshu Chen, Pengda Qin, Xifeng Yan, and William Yang Wang.
Semantically Conditioned Dialog Response Generation via Hierarchical Disentangled Self-Attention. ACL 2019.
[pdf] [code] [bib]
Wei Bi, Jun Gao, Xiaojiang Liu, Shuming Shi.
Fine-Grained Sentence Functions for Short-Text Conversation. ACL 2019.
[pdf] [code] [bib]
Yifan Gao, Piji Li, Irwin King, and Michael R. Lyu.
Interconnected Question Generation with Coreference Alignment and Conversation Flow Modeling. ACL 2019.
[pdf] [code] [bib]
Lisong Qiu, Juntao Li, Wei Bi, Dongyan Zhao, and Rui Yan.
Are Training Samples Correlated? Learning to Generate Dialogue Responses with Multiple References. ACL 2019.
[pdf] [code] [bib]
Zhiliang Tian, Wei Bi, Xiaopeng Li, and Nevin L. Zhang.
Learning to Abstract for Memory-Augmented Conversational Response Generation. ACL 2019.
[pdf] [code] [bib]
Deng Cai, Yan Wang, Wei Bi, Zhaopeng Tu, Xiaojiang Liu, Wai Lam, and Shuming Shi.
Skeleton-to-Response: Dialogue Generation Guided by Retrieval Memory. NAACL 2019.
[pdf] [code] [bib]
Jun Gao, Wei Bi, Xiaojiang Liu, Junhui Li, Shuming Shi.
Generating Multiple Diverse Responses for Short-Text Conversation. AAAI 2019.
[pdf] [code] [bib]
Yahui Liu, Wei Bi, Jun Gao, Xiaojiang Liu, Jian Yao and Shuming Shi.
Towards Less Generic Responses in Neural Conversation Models: A Statistical Re-weighting Method. EMNLP 2018.
[pdf] [code] [bib] [dataset available soon]
Jiachen Du, Wenjie Li, Yulan He, Ruifeng Xu, Lidong Bing, and Xuan Wang.
Variational Autoregressive Decoder for Neural Response Generation. EMNLP 2018.
[pdf] [code] [bib]
Yan Wang, Xiaojiang Liu, Shuming Shi.
Deep Neural Solver for Math Word Problems. EMNLP 2017.
[pdf] [code] [bib]

Datasets

NaturalConv Dataset for Dialogue
This is the NaturalConv dataset for the paper "NaturalConv: A Chinese Dialogue Dataset Towards Multi-turn Topic-driven Conversation". The dataset contains human-annotated conversations grounded in Chinese news articles. The dialogues are natural and are not limited to the content of the grounding document. Please refer to our paper for more details.
NaturalConv: A Chinese Dialogue Dataset Towards Multi-turn Topic-driven Conversation. AAAI 2021.
Please kindly cite our paper if you find this dataset useful.
https://arxiv.org/abs/2103.02548
[license] [dataset download]
Math23K Dataset
Math23K is a dataset created for math word problem solving. It contains 23,162 Chinese problems crawled from the internet. Refer to our paper for more details:
Deep Neural Solver for Math Word Problems. EMNLP 2017.
Please kindly cite our paper if this paper and the dataset are helpful.
[dataset download]
Profile Consistency Dataset for Dialogue
This is the KvPI dataset for the paper "Profile Consistency Identification for Open-domain Dialogue Agents". The KvPI dataset provides fine-grained consistency labels between key-value profiles (such as gender, location, and constellation) and dialogue responses, and also offers single-turn dialogues with corresponding profiles. We hope this dataset can help train dialogue agents to be more consistent. Please refer to our paper for more details.
Profile Consistency Identification for Open-domain Dialogue Agents. EMNLP 2020.
Please kindly cite our paper if this paper and the KvPI dataset are helpful.
https://arxiv.org/abs/2009.09680
[dataset download]
Grayscale Dataset for Dialogue
This is the dataset for the paper "Grayscale Data Construction and Multi-Level Ranking Objective for Dialogue Response Selection". The dataset is a supplement to the existing Douban and Ubuntu corpora. Besides the ground-truth response and the random responses given in the original corpora, two kinds of grayscale-labeled responses (retrieved and generated) are automatically constructed to partly simulate the real-world scenarios of retrieval-based chatbots. Please refer to our paper for more details.
Grayscale Data Construction and Multi-Level Ranking Objective for Dialogue Response Selection. ArXiv:2004.02421
Please kindly cite our paper if this paper and the dataset are helpful.
https://arxiv.org/abs/2004.02421
[dataset download]
Gender-Specific Chat
This dataset is used in the paper "Stylistic Dialogue Generation via Information-Guided Reinforcement Learning Strategy". The entire dataset contains two different datasets:
- Gender Classification Dataset: A human-annotated dataset which assigns each response a gender preference.
- Gender-Specific Dialogue Dataset: A dataset annotated by a gender classifier (with an accuracy of 91.7%). It is the first large-scale single-turn dialogue dataset with gender preference.
Refer to our paper for more details.
Stylistic Dialogue Generation via Information-Guided Reinforcement Learning Strategy. ArXiv:2004.02202
Please kindly cite our paper if this paper and the dataset are helpful.
https://arxiv.org/abs/2004.02202
[dataset download]
Retrieval Generation Chat
This is the dataset for the paper "Retrieval-guided Dialogue Response Generation via a Matching-to-Generation Framework". It contains about 550K single-turn query-response pairs, together with the top-10 retrievals for each from an open-domain chitchat API: https://ai.qq.com/product/nlpchat.shtml.
Refer to our paper for more details.
Retrieval-Guided Dialogue Response Generation via a Matching-to-Generation Framework. EMNLP 2019.
Please kindly cite our paper if this paper and the dataset are helpful.
[dataset download]
Restoration-200K Dataset
The Restoration-200K dataset has about 200K open-domain multi-turn conversations, each of which contains six utterances. An annotation team was hired to (1) label whether an utterance is related to its context or not, and (2) restore an incomplete utterance to a complete and context-free form based on its context.
Refer to our paper for more details: Improving Open-Domain Dialogue Systems via Multi-Turn Incomplete Utterance Restoration. EMNLP 2019.
Please kindly cite our paper if this paper and the dataset are helpful.
[dataset download]
Chinese Sentence Function Datasets
We create a new Short-Text Conversation dataset with manually annotated SEntence FUNctions (STC-Sefun), in which each sentence segment in the query-response pairs is labeled with its sentence functions. Besides the four major sentence functions, we further decompose each of them into fine-grained sentence functions according to their different purposes indicated in conversations.
Refer to our paper for more details: Fine-Grained Sentence Functions for Short-Text Conversation. ACL 2019
Please kindly cite our paper if this paper and the dataset are helpful.
[dataset download]
Weibo Conversation Datasets
This is a benchmark dataset (Shang, Lu, and Li 2015), which we pre-processed to obtain high-quality data pairs. We perform the following pre-processing steps:
- Hash tags and some special tokens such as “Zhuan (forward)” in the post/response are removed;
- Irrelevant pairs are removed, which are detected by a Twitter LDA model (Zhao et al. 2011). We use a subset of posts to train the LDA model and each post/response is represented with its topic-distribution vector. By computing the cosine similarity between the post and the response with their topic-distribution vectors, we filter out those pairs with similarities lower than a threshold.
- To handle the out-of-vocabulary (OOV) problem, we keep the top 50,000 most frequent words and set the rest as UNK.
Refer to our paper for more details: Generating Multiple Diverse Responses for Short-Text Conversation. AAAI 2019.
Please kindly cite our paper if this paper and the dataset are helpful.
[dataset download]
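The pre-processing steps listed above for the Weibo data can be sketched roughly as follows. This is only an illustrative sketch, not the code used to build the released dataset: the function names, the `#...#` hashtag pattern, the special-token list, and the `topic_vector` callable (a stand-in for the trained Twitter LDA model) are all assumptions, and the similarity threshold is left as a parameter because its value is not stated here.

```python
import math
import re
from collections import Counter

UNK = "UNK"


def clean_text(text, special_tokens=("Zhuan",)):
    """Drop hashtags and special tokens (the patterns here are assumed)."""
    text = re.sub(r"#[^#]+#", " ", text)  # assumed Weibo-style #topic# hashtags
    for token in special_tokens:
        text = text.replace(token, " ")
    return " ".join(text.split())


def cosine(u, v):
    """Cosine similarity between two topic-distribution vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def filter_pairs(pairs, topic_vector, threshold):
    """Keep (post, response) pairs whose topic similarity is not below threshold.

    topic_vector is a hypothetical callable mapping a text to its
    topic-distribution vector, e.g. inference with a trained LDA model.
    """
    return [
        (post, resp)
        for post, resp in pairs
        if cosine(topic_vector(post), topic_vector(resp)) >= threshold
    ]


def build_vocab(token_lists, max_size=50000):
    """Return the set of the max_size most frequent words."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok for tok, _ in counts.most_common(max_size)}


def replace_oov(tokens, vocab):
    """Map every word outside the vocabulary to UNK."""
    return [tok if tok in vocab else UNK for tok in tokens]


# Toy usage with hand-written topic vectors instead of a real LDA model.
toy_topics = {
    "post a": [0.9, 0.1],
    "resp a": [0.8, 0.2],
    "resp b": [0.1, 0.9],
}
toy_pairs = [("post a", "resp a"), ("post a", "resp b")]
kept = filter_pairs(toy_pairs, toy_topics.get, threshold=0.5)

toy_vocab = build_vocab([["a", "b", "a"], ["a", "c", "b"]], max_size=2)
masked = replace_oov(["a", "c"], toy_vocab)
```

In the toy usage, only the pair whose topic vectors point in a similar direction survives the filter, and the rare word `c` is mapped to `UNK` because the vocabulary is capped at two words.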

Projects

Multimodal Intelligent Human-Computer Interaction 多模态智能人机交互
The ultimate goal of an intelligent human-computer interaction system is to make interaction between humans and machines as easy and natural as interaction between humans. As we approach this goal, we can gradually enable a range of application scenarios, such as virtual secretaries in the office, companion robot pets at home, virtual players in games, in-vehicle virtual assistants, and social bots that are personalized for each user.