Does It Make Sense? And why? A Pilot Study for Sense Making and Explanation#
Cunxiang Wang, Shuailong Liang, Yue Zhang, Xiaonan Li and Tian Gao
Introducing common sense to natural language understanding systems has received increasing research attention. It remains a fundamental question how to evaluate whether a system has the capability of sense making. Existing benchmarks measure commonsense knowledge indirectly and without explanation. In this paper, we release a benchmark to directly test whether a system can differentiate natural language statements that make sense from those that do not make sense. In addition, a system is asked to identify the most crucial reason why a statement does not make sense. We evaluate models trained over large-scale language modeling tasks as well as human performance, showing that there are different challenges for system sense making.
Example#
Statement 1: He put a turkey into the fridge.
Statement 2: He put an elephant into the fridge. (violates common sense)
Input: two lexically similar sentences (sent0, sent1)
Output: a label (0 or 1, indicating that sent0 or sent1 violates common sense)

Statement: He put an elephant into the fridge.
Reason:
A: An elephant is much bigger than a fridge. (correct explanation)
B: Elephants are usually white, and fridges are usually white too.
C: An elephant cannot eat a fridge.
Input: a statement that violates common sense and three candidate explanations.
Output: A, B, or C (the most reasonable explanation)
Implementation#
Sense-Making#
The sentence with the lower score is taken as the one that makes sense.
- BERT: for each token, replace it with the mask token and read the predicted distribution from the model's output logits
- GPT-2: predict each token from its prefix, which directly gives $\mathrm{P}\left(w_i \mid w_1 \ldots w_{i-1}\right)$
$$ \begin{aligned} & \operatorname{score}_{\mathrm{BERT}}=\left(p_{w_1} * p_{w_2} * \ldots * p_{w_n}\right)^{(-1 / n)}= \\ & \quad\left(\prod_{i=1}^n P\left(w_i \mid w_1 \ldots w_{i-1} w_{i+1} \ldots w_n\right)\right)^{-1 / n} \\ & \quad \operatorname{score}_{\mathrm{GPT}}=\left(p_{w_1} * p_{w_2} * \ldots * p_{w_n}\right)^{(-1 / n)}= \\ & \quad\left(\prod_{i=1}^n P\left(w_i \mid w_1 \ldots w_{i-1}\right)\right)^{-1 / n}=P\left(w_1 \ldots w_n\right)^{-1 / n} \end{aligned} $$
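Concretely, the score is the inverse geometric mean of the token probabilities (i.e., the per-token perplexity), so a lower score means the model finds the sentence more plausible. A minimal numeric sketch with made-up probabilities:

```python
import math

def inverse_geo_mean_score(token_probs):
    """score = (p_1 * p_2 * ... * p_n) ** (-1/n), the per-token perplexity."""
    n = len(token_probs)
    # sum logs instead of multiplying to avoid underflow on long sentences
    log_score = -sum(math.log(p) for p in token_probs) / n
    return math.exp(log_score)

# hypothetical per-token probabilities for the two example statements
probs_turkey = [0.2, 0.5, 0.4, 0.3]     # "He put a turkey into the fridge."
probs_elephant = [0.2, 0.5, 0.01, 0.3]  # "He put an elephant into the fridge."

# the lower-scoring statement is predicted to be the one that makes sense
turkey_makes_sense = inverse_geo_mean_score(probs_turkey) < inverse_geo_mean_score(probs_elephant)
```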
- ELMo:

| Criterion | Basis for the decision | Formula |
|---|---|---|
| L2 norm comparison | Euclidean length of the ELMo sentence vector | np.linalg.norm(emb) |
| Cosine similarity with the zero vector | directional agreement between the sentence vector and the zero vector | $\cos (\theta)=(\mathrm{emb} \cdot \theta) /(\|\mathrm{emb}\|\|\theta\|)$ |
| Bi-LSTM cosine similarity | directional agreement of the forward/backward LSTM vectors | $\cos (\mathrm{fw}, \mathrm{bw})$ |
| Bi-LSTM Euclidean distance | geometric distance between the forward/backward LSTM vectors | $\|\mathrm{fw}-\mathrm{bw}\|_2$ |
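The two Bi-LSTM heuristics in the table can be sketched with NumPy; `fw` and `bw` below are hypothetical pooled forward/backward LSTM state vectors, not real ELMo outputs:

```python
import numpy as np

def fw_bw_agreement(fw, bw):
    """Heuristic plausibility signals from forward/backward LSTM vectors."""
    cos = float(np.dot(fw, bw) / (np.linalg.norm(fw) * np.linalg.norm(bw)))
    dist = float(np.linalg.norm(fw - bw))  # ||fw - bw||_2
    return cos, dist

# toy vectors: nearly aligned directions -> high cosine, small distance
fw = np.array([1.0, 2.0, 0.5])
bw = np.array([0.9, 1.8, 0.6])
cos, dist = fw_bw_agreement(fw, bw)
```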
Reason Explanation#
- BERT/GPT-2: choose the explanation that maximizes the probability of the combined (statement + reason) sentence
- ELMo: L2 norm / cosine similarity
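A minimal sketch of the BERT/GPT-2 selection rule: score each statement-plus-option concatenation and keep the lowest-scoring (most probable) one. `stub_score` here is a hypothetical stand-in for a real LM scorer:

```python
def pick_explanation(statement, options, sentence_score):
    """Return the option label whose concatenation with the statement gets
    the lowest (perplexity-style) score, i.e., the highest LM probability."""
    return min(options, key=lambda label: sentence_score(f"{statement} {options[label]}"))

def stub_score(sentence):
    # hypothetical scores standing in for a real LM: lower = more probable
    if "bigger" in sentence:
        return 12.0
    if "white" in sentence:
        return 45.0
    return 38.0

options = {
    "A": "An elephant is much bigger than a fridge.",
    "B": "Elephants are usually white, and fridges are usually white too.",
    "C": "An elephant cannot eat a fridge.",
}
best = pick_explanation("He put an elephant into the fridge.", options, stub_score)
# best == "A"
```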
```python
import torch
import torch.nn.functional as F

# map to ids
tokens = bert_tokenizer.tokenize(sentence)
token_ids = bert_tokenizer.convert_tokens_to_ids(tokens)
input_ids = [bert_tokenizer.cls_token_id] + token_ids + [bert_tokenizer.sep_token_id]
seq_len = len(input_ids)

# log score = -1/n * sum(log p_wi)
log_probs = []
mask_token_id = bert_tokenizer.mask_token_id
for i in range(1, seq_len - 1):
    # mask the i-th token (id w_i)
    masked_input_ids = input_ids.copy()
    masked_input_ids[i] = mask_token_id
    masked_input_ids_tensor = torch.tensor([masked_input_ids]).to(self.device)
    with torch.no_grad():
        outputs = bert_model(masked_input_ids_tensor)
    logits = outputs.logits  # (batch_size, seq_len, vocab_size)
    mask_logits = logits[0, i]  # (vocab_size,)
    probs = F.softmax(mask_logits, dim=-1)
    w_i = input_ids[i]
    p_wi = probs[w_i]
    log_probs.append(torch.log(p_wi + 1e-10))  # avoid log(0)
log_score = -torch.tensor(log_probs).mean()
score = torch.exp(log_score)
```
Reproduction#
Results obtained with the method above:
| Model | Sen-Making | Explanation |
|---|---|---|
| BERT | 69.57% | 37.41% |
| ELMo | 62% | 40% |
| GPT-2 | 69.97% | 36.07% |
Results reported in the paper:
| Model | Sen-Making | Explanation |
|---|---|---|
| Random | 50.0% | 33.3% |
| ELMo | 69.4% | 33.4% |
| BERT | 70.1% | 45.6% |
| fine-tuned ELMo | 74.1% | 34.6% |
| Human Performance | 99.1% | 97.8% |
Improvements#
Knowledge Graph (KG)#
KG integration mainly involves:
- Data augmentation (verbalizing triples as natural-language descriptions)
- Structural fusion (KG embeddings (concatenation, weighting, learned attention), GNNs)
- Training optimization (KG alignment loss, multi-task learning, knowledge-based reward RL)
- Instruction tuning (question expansion)
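The data-augmentation bullet (verbalizing triples) can be sketched as template filling; the relation templates below are illustrative, not taken from any specific system:

```python
# hypothetical templates per relation; real systems use richer verbalizers
TEMPLATES = {
    "IsA": "{h} is a kind of {t}.",
    "UsedFor": "{h} is used for {t}.",
    "AtLocation": "{h} can be found at {t}.",
}

def verbalize(triple):
    """Turn a (head, relation, tail) KG triple into a natural-language sentence."""
    h, r, t = triple
    template = TEMPLATES.get(r, "{h} is related to {t}.")
    return template.format(h=h.replace("_", " "), t=t.replace("_", " "))

facts = [("elephant", "IsA", "large_animal"), ("fridge", "UsedFor", "storing_food")]
context = " ".join(verbalize(tr) for tr in facts)
```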
Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering
Jinheon Baek, Alham Fikri Aji and Amir Saffari

NER + Concat + Score
```python
import requests

def get_all_conceptnet_relations(entities):
    relations = []
    if len(entities) < 2:
        return relations
    for i in range(len(entities)):
        for j in range(i + 1, len(entities)):
            start_node = entities[i].lower().replace(" ", "_")
            end_node = entities[j].lower().replace(" ", "_")
            url = (f"http://api.conceptnet.io/query?node1=/c/en/{start_node}"
                   f"&node2=/c/en/{end_node}&rel=/r/RelatedTo")
            try:
                response = requests.get(url, timeout=5)
                data = response.json()
                if data["edges"]:
                    for edge in data["edges"]:
                        relation = edge["rel"]["label"]
                        relations.append(f"{end_node} is related to {start_node} via {relation}.")
            except Exception as e:
                print(f"ConceptNet error for {start_node} and {end_node}: {e}")
                relations.append(f"{end_node} is related to {start_node} via related to.")
    return relations
```
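The "Concat" step then prepends the retrieved relations to the question before querying the LM; the exact prompt wording below is an assumption:

```python
def build_knowledge_prompt(relations, question):
    """Prepend retrieved ConceptNet facts to the question (the 'Concat' step)."""
    context = "\n".join(relations)
    return f"Below are facts that might be relevant:\n{context}\nQuestion: {question}\nAnswer:"

relations = ["fridge is related to turkey via RelatedTo."]
prompt = build_knowledge_prompt(relations, "Which statement makes sense?")
```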
Results: 61.08%, 40.08%
How much a knowledge graph helps depends heavily on how tightly the commonsense questions are linked to the knowledge base; the gain here is limited and most likely smaller than CoT's.
GraphGPT: Graph Instruction Tuning for Large Language Models
Jiabin Tang, Yuhao Yang and Wei Wei

KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning
Bill Yuchen Lin, Xinyue Chen, Jamin Chen and Xiang Ren
On the CommonsenseQA dataset, KagNet combined with BERT-LARGE achieved the state of the art at the time (May 2019), with 58.9% accuracy on the official test set.

Contextually embeds concepts, encoding multi-hop relations along paths
Hierarchical attention selects the important paths and concept pairs
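The path-level attention can be sketched as a softmax over path scores that pools hypothetical path embeddings into one vector:

```python
import numpy as np

def attend_over_paths(path_embs, query):
    """Softmax attention over multi-hop path embeddings: the weights indicate
    which paths matter for the query, and the pooled vector summarizes them."""
    scores = path_embs @ query                # (num_paths,)
    weights = np.exp(scores - scores.max())   # subtract max for stability
    weights = weights / weights.sum()         # softmax
    pooled = weights @ path_embs              # (dim,)
    return weights, pooled

paths = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy path embeddings
query = np.array([2.0, 0.0])
weights, pooled = attend_over_paths(paths, query)
```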
Chain of Thought (CoT)#
Large Language Models are Zero-Shot Reasoners
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo and Yusuke Iwasawa
Zero-shot-CoT
```python
def sen_making_prompt(s1, s2):
    return f"""### Instruction:
Which of the following two statements makes more sense?
Statement A: {s1}
Statement B: {s2}
Let's think step by step.
### Response:
"""
```

Large Language Models Can Self-Improve
Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu and Jiawei Han
Input stage: the language model is fed exemplar questions and answers with detailed reasoning steps, plus the target question, and generates candidate answers.
Inference and voting stage: the model decodes multiple reasoning paths for the target question and aggregates them by answer via majority voting; all paths that produce the majority answer are selected as reliable training samples.
Self-training stage: the selected reasoning paths serve as new training data.
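The voting and selection stages can be sketched with `collections.Counter`:

```python
from collections import Counter

def select_reliable_paths(samples):
    """samples: list of (reasoning_path, answer) pairs from multi-path decoding.
    Returns the majority answer and the paths that produced it."""
    counts = Counter(answer for _, answer in samples)
    majority_answer, _ = counts.most_common(1)[0]
    reliable = [path for path, answer in samples if answer == majority_answer]
    return majority_answer, reliable

samples = [("path 1 ... so B", "B"), ("path 2 ... so B", "B"), ("path 3 ... so A", "A")]
answer, reliable = select_reliable_paths(samples)
# answer == "B"; the two agreeing paths become new training data
```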

Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery and Denny Zhou
