Advances in Information Extraction [4]: New Challenges

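[Abstract] Industry knowledge graphs are the foundation of cognitive-intelligence applications in industry. In most vertical domains today, building the schema of an industry knowledge graph depends heavily on domain experts, which is labor-intensive and slow. Meanwhile, information extraction performs poorly when large-scale supervised data is unavailable. Together, these factors limit the deployment of industry knowledge graphs and reduce their acceptance. This article reviews and analyzes recent technical progress on these schema-construction and low-resource extraction difficulties, including our own practice in semi-automatic schema construction. It also analyzes and discusses frontier techniques for document-level information extraction, namely Document AI and language models for long structured inputs, in the hope of offering some inspiration and help to fellow researchers.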

[Excerpted from] A 10,000-Character Survey: Recent Advances in Industry Knowledge Graph Construction (万字综述：行业知识图谱构建最新进展)

Authors: 李晶阳 [1], 牛广林 [2], 唐呈光 [1], 余海洋 [1], 李杨 [1], 付彬 [1], 孙健 [1]

Affiliations: Alibaba DAMO Academy, 小蜜 Conversational AI team [1]; School of Computer Science, Beihang University [2]

New Challenges

1 The Difficulty of Document-Level Information Extraction

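In real projects, beyond extracting entities and relations from sentences and paragraphs, we also face the new challenge of extracting information from entire documents. The two figures below are screenshots of PDF documents related to insurance contracts. Processing such documents involves two tasks: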

(1) Document structure extraction

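Semi-structured documents like those in the example figures are abundant in many vertical industries. A major new challenge is to parse the data according to the hierarchical structure of the document content itself, and then to derive and organize the knowledge graph schema from that hierarchy. Industry documents come in many formats, including PDF, Word, and TXT. PDFs further divide into standard PDFs, searchable PDFs, and scanned PDFs, and Word documents come in differing versions. Layout within a document varies even more: single-column or double-column, horizontally or vertically typeset (the latter being rarer), with prominent or inconspicuous headings. Some segments, such as headings, are valuable, while others, such as notes, matter less. On top of this, the large number of embedded tables and images causes recognition confusion and other problems.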

(2) Information extraction under a given schema

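Given a knowledge graph schema, the task is to extract specific information from such documents, for example the insurable age of a policy. The variety of document formats and industry wording, together with cross-references inside documents, makes extracting such information directly very difficult. For instance, in the first document the "投保范围" (coverage scope) clause corresponds to the insurable age, whereas in the second document the actual content of "投保年龄" (insurable age) refers to Section 10.1 of that document. Extracting such information well requires document-level semantic understanding and logical reasoning.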
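
The two difficulties named above, differing headings for the same schema field and values that live in another section, can be illustrated with a toy sketch. It is not the authors' method: the section dictionaries, the alias set, the `SECTION_REF` pattern, and the function names are all hypothetical, and it naively treats any clause containing a section reference as a pure pointer to that section.

```python
import re

# Hypothetical pattern for cross-references such as "see Section 10.1".
SECTION_REF = re.compile(r"[Ss]ection\s+(\d+(?:\.\d+)*)")


def follow_references(sections, text, max_hops=3):
    """Replace a clause that points elsewhere with the clause it points to."""
    for _ in range(max_hops):  # bounded hops so circular references terminate
        match = SECTION_REF.search(text)
        if match is None or match.group(1) not in sections:
            return text
        text = sections[match.group(1)][1]
    return text


def extract_field(sections, heading_aliases):
    """Return the body of the first clause whose heading is a known alias."""
    for title, body in sections.values():
        if title in heading_aliases:
            return follow_references(sections, body)
    return None


# Toy documents: section number -> (heading, body). All content is invented.
doc_a = {"2": ("Coverage scope", "Ages 18 to 60.")}
doc_b = {
    "3": ("Insured age", "See Section 10.1."),
    "10.1": ("Age limits", "From 30 days to 65 years old."),
}
AGE_ALIASES = {"Coverage scope", "Insured age"}

age_a = extract_field(doc_a, AGE_ALIASES)  # value stated directly in the clause
age_b = extract_field(doc_b, AGE_ALIASES)  # value reached via the cross-reference
```

Even this toy version presupposes that the document has already been split into numbered sections and that the heading aliases are known, which is exactly what real documents do not provide for free.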

2 Frontier Research

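Facing the challenge of document-level information extraction, we find that two recently emerged classes of techniques could potentially be combined into a solution. Each is briefly introduced below: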

2.1 Document AI

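For the document-level extraction task described above, the first problem is data parsing: how to extract the data in a document according to its original structure. This involves several subtasks, including reading documents from multiple sources, identifying segments/paragraphs, and determining the relations between segments/paragraphs. Clearly, the visual information (layout information) of such documents is crucial to data parsing.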

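Document Intelligence (also called Document AI) is the research field dedicated to analyzing the layout and internal structure of documents [69]. It aims to decompose a document, or an image of a document, into separate regions (Physical Layout), and to determine the role of each region (such as heading or paragraph) and the relations between regions (Logical Structure), such as heading-subheading and heading-content relations. Models from the Document AI field can therefore be used to address the structured document-data extraction problem described above.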
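
To make the distinction concrete, the following minimal sketch assumes the Physical Layout step has already produced a flat, reading-order list of regions with a role and, for headings, a depth. It then derives the Logical Structure (heading-subheading and heading-content relations) with a simple stack. The `Region` class, its fields, and the two-role label set are assumptions for illustration, not part of any Document AI model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Region:
    text: str
    role: str        # "title" or "paragraph" (assumed label set)
    level: int = 0   # heading depth for titles; unused for paragraphs
    children: List["Region"] = field(default_factory=list)


def build_logical_structure(regions):
    """Nest a flat, reading-order list of regions under their headings."""
    root = Region(text="<document>", role="title", level=0)
    stack = [root]  # chain of currently open headings, outermost first
    for region in regions:
        if region.role == "title":
            # Close headings that are at the same depth or deeper.
            while len(stack) > 1 and stack[-1].level >= region.level:
                stack.pop()
            stack[-1].children.append(region)  # heading-subheading relation
            stack.append(region)
        else:
            stack[-1].children.append(region)  # heading-content relation
    return root


# Invented example regions, as a layout model might emit them.
regions = [
    Region("1 Coverage", "title", 1),
    Region("1.1 Insured age", "title", 2),
    Region("Ages 18 to 60.", "paragraph"),
    Region("2 Exclusions", "title", 1),
    Region("Pre-existing conditions are excluded.", "paragraph"),
]
root = build_logical_structure(regions)
```

The sketch relies on heading depth being known in advance. In real documents that depth has to be inferred from layout cues such as font size, numbering, and indentation, which is where the hard part of the problem lies.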

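However, existing state-of-the-art Document AI models, such as LayoutLM (see the figure below) [70], are mainly used for structured recognition of receipt and invoice content. The most advanced dataset is DocBank [71], a Document AI training set built automatically from the correspondence between a large number of paper PDFs on arXiv and their LaTeX source. It only identifies regions within papers, such as Abstract, Introduction, caption, and table, and lacks annotation of the logical structure between regions, which is crucial for structuring the information in the documents described above.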

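In this area, therefore, both large-scale dataset construction and joint extraction models that combine Physical Layout and Logical Structure remain rare in the literature, and both urgently need more attention and in-depth research.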

2.2 Language Models for Long Structured Inputs

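Document-based information extraction can only be done well if the representation of each segment incorporates information about the document as a whole. Therefore, once the documents above have been effectively structured, how to encode and represent both the global and local information of such structured data is the second major challenge we face.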

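Recently proposed language models for encoding long inputs (on the order of 10,000 to 100,000 characters) may address this challenge effectively. Specifically, the ETC model [72] (see the figure below) uses a global-local attention mechanism to produce pretrained representations of long, structured text, and its effectiveness was verified on key-phrase extraction from hierarchically structured web-page data. However, ETC [72] is still not well designed for encoding text with multi-level structure, and its sparse global-local attention also suffers from information loss.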
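
The sparsity behind global-local attention can be illustrated with a simplified mask. This is a sketch of the general idea only, not ETC's actual implementation: it places a few global tokens in front of the long input, lets global tokens attend everywhere, and restricts each long token to the global tokens plus a local window. The function name and the `num_global`, `num_long`, and `radius` parameters are assumptions.

```python
def global_local_mask(num_global, num_long, radius):
    """Return mask[i][j] == True where token i may attend to token j.

    Positions 0..num_global-1 are global tokens; the rest are long-input tokens.
    """
    n = num_global + num_long
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < num_global or j < num_global:
                # Any pair that involves a global token is allowed.
                mask[i][j] = True
            else:
                # Long-to-long attention is limited to a local window.
                mask[i][j] = abs(i - j) <= radius
    return mask


mask = global_local_mask(num_global=2, num_long=6, radius=1)
allowed = sum(sum(row) for row in mask)  # attended pairs out of 8 * 8
```

With a fixed number of global tokens and a fixed radius, the number of allowed pairs grows linearly with the input length rather than quadratically. Two distant long tokens can then exchange information only indirectly through the global tokens, which is one way to read the information-loss concern raised above.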

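Therefore, how to build on the document structure described above and design new architectures on top of these recent long-input language models, so that they encode both the structural and textual information of documents more effectively, likewise urgently needs more attention and in-depth research.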

3 Summary

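This section introduced the new challenge of document-level information extraction that we face in real business scenarios, along with two technical modules as potential solutions: Document AI and long-input language models. In summary, neither Document AI nor these language models can be applied directly to the extraction task at hand, and no systematic solution yet exists for the document-level extraction challenge described above. This research direction therefore deserves more attention and study from researchers and engineers.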

4. Conclusion and Outlook

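Focusing on industry knowledge graph construction, this article introduced and analyzed the techniques and latest progress in schema construction, entity recognition, and relation extraction. It also presented the new challenge of document-level information extraction that we have encountered, and analyzed and discussed frontier progress in Document AI and language models for long structured documents with respect to this challenge.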

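As knowledge graphs continue to develop and mature as a foundation for cognition, their applications are spreading from the internet into all kinds of vertical industries. Efficient graph construction based on each industry's knowledge will be key to applying knowledge graphs in the ToB market. From our perspective, industry knowledge graph construction shows the following trends: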

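Automated schema construction:
The field of industry knowledge graph construction will develop an effective set of standards and specifications for schema construction, giving automatic schema-construction algorithms clear optimization targets and sound architectural references. As NLP advances rapidly, the current weaknesses in information extraction and in abstraction and integration that schema construction relies on will also improve greatly. The human-to-machine effort ratio in industry schema construction will therefore shift from 7:3 to 5:5, then 3:7, and perhaps eventually to fully automatic schema construction.
Unified and low-resource information extraction:
Information extraction solutions will lean increasingly toward comprehensive extraction that covers entities, relations, events, and other information in a unified way, which is bound to bring major changes to model architecture design and data-engineering pipelines. Meanwhile, beyond the continued development of large-scale language models, techniques for generating implicit data resources and for incorporating explicit industry prior knowledge will keep maturing. All of this will push low-resource information extraction models toward becoming the mainstream solution.
From sentence level and paragraph level to document level:
Because of the large scale of its data, the structured and global nature of its knowledge, and the multimodality of its content, document-level information extraction is bound to attract more and more research. End-to-end graph construction pipelines that take large-scale structured, or even unstructured, industry documents as input and directly output industry knowledge in graph form will thus become popular.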

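Finally, we hope this survey offers some inspiration and help for readers' research, and we thank readers for their patient reading. If there are any omissions or inaccuracies in this article, we welcome your corrections.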

References

1. Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2018. FewRel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In Proceedings of EMNLP, pages 4803--4809.

2. Tianyu Gao, Xu Han, Hao Zhu, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2019. FewRel 2.0: Towards more challenging few-shot relation classification. In Proceedings of EMNLP-IJCNLP, pages 6251--6256.

3. https://github.com/gabrielStanovsky/oie-benchmark

4. 《知识图谱: 方法,实践与应用》，王昊奋 / 漆桂林 / 陈华钧主编，电子工业出版社, 2019.

5. Yates, A.; Banko, M.; Broadhead, M.; Cafarella, M.; Etzioni, O.; and Soderland, S. 2007. TextRunner: Open information extraction on the web. In Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), 25--26.

6. Diego Marcheggiani and Ivan Titov. 2016. Discrete-state variational autoencoders for joint discovery and factorization of relations. Transactions of ACL.

7. Elsahar, H., Demidova, E., Gottschalk, S., Gravier, C., & Laforest, F. (2017, May). Unsupervised open relation extraction. In European Semantic Web Conference (pp. 12-16). Springer, Cham.

8. Wu, R., Yao, Y., Han, X., Xie, R., Liu, Z., Lin, F., ... & Sun, M. (2019, November). Open relation extraction: Relational knowledge transfer from supervised data to unsupervised data. In EMNLP-IJCNLP (pp. 219-228).

9. Stanovsky, G., Michael, J., Zettlemoyer, L., & Dagan, I. (2018, June). Supervised open information extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (pp. 885-895).

10. Zhan, J., & Zhao, H. (2020, April). Span model for open information extraction on accurate corpus. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 05, pp. 9523-9530).

11. Cui, L., Wei, F., & Zhou, M. (2018). Neural open information extraction. arXiv preprint arXiv:1805.04270.

12. Sameer Pradhan, Mitchell P. Marcus, Martha Palmer, Lance A. Ramshaw, Ralph M. Weischedel, and Nianwen Xue, editors. 2011. Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, CoNLL 2011, Portland, Oregon, USA, June 23-24, 2011. ACL.

13. Gina-Anne Levow. 2006. The third international Chinese language processing bakeoff: Word segmentation and named entity recognition. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pages 108--117, Sydney, Australia. Association for Computational Linguistics.

14. Nanyun Peng and Mark Dredze. 2015. Named entity recognition for Chinese social media with jointly trained embeddings. In EMNLP. pages 548--554.

15. Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning, CoNLL 2003, Held in cooperation with HLT-NAACL 2003, Edmonton, Canada, May 31 - June 1, 2003, pages 142--147.

16. George R Doddington, Alexis Mitchell, Mark A Przybocki, Stephanie M Strassel, Lance A Ramshaw, and Ralph M Weischedel. 2005. The automatic content extraction (ACE) program: tasks, data, and evaluation. In LREC, 2:1.

17. Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 143--152, Sofia, Bulgaria. Association for Computational Linguistics.

18. 阮彤, 王梦婕, 王昊奋, & 胡芳槐. (2016). 垂直知识图谱的构建与应用研究. 知识管理论坛(3).

19. Wu, T.; Qi, G.; Li, C.; Wang, M. A Survey of Techniques for Constructing Chinese Knowledge Graphs and Their Applications. Sustainability 2018, 10, 3245.

20. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, 2493-2537.

21. Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

22. Strubell, E., Verga, P., Belanger, D., & McCallum, A. (2017). Fast and accurate entity recognition with iterated dilated convolutions. arXiv preprint arXiv:1702.02098.

23. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

24. Zhang, Y., & Yang, J. (2018). Chinese ner using lattice lstm. arXiv preprint arXiv:1805.02023.

25. Gui, T., Ma, R., Zhang, Q., Zhao, L., Jiang, Y. G., & Huang, X. (2019, August). CNN-Based Chinese NER with Lexicon Rethinking. In IJCAI (pp. 4982-4988).

26. Li, X., Yan, H., Qiu, X., & Huang, X. (2020). FLAT: Chinese NER Using Flat-Lattice Transformer. arXiv preprint arXiv:2004.11795.

27. Li, X., Feng, J., Meng, Y., Han, Q., Wu, F., & Li, J. (2019). A unified mrc framework for named entity recognition. arXiv preprint arXiv:1910.11476.

28. Yuchen Lin, B., Lee, D. H., Shen, M., Moreno, R., Huang, X., Shiralkar, P., & Ren, X. (2020). TriggerNER: Learning with Entity Triggers as Explanations for Named Entity Recognition. arXiv, arXiv-2004.

29. Zhang, X., Jiang, Y., Peng, H., Tu, K., & Goldwasser, D. (2017). Semi-supervised structured prediction with neural CRF autoencoder. Association for Computational Linguistics (ACL).

30. Chen, M., Tang, Q., Livescu, K., & Gimpel, K. (2019). Variational sequential labelers for semisupervised learning. arXiv preprint arXiv:1906.09535.

31. Chen, J., Wang, Z., Tian, R., Yang, Z., & Yang, D. (2020). Local Additivity Based Data Augmentation for Semi-supervised NER. arXiv preprint arXiv:2010.01677.

32. Lakshmi Narayan, P. (2019). Exploration of Noise Strategies in Semi-supervised Named Entity Classification.

33. Alejandro Metke-Jimenez and Sarvnaz Karimi. 2015. Concept extraction to identify adverse drug reactions in medical forums: A comparison of algorithms. CoRR abs/1504.06936.

34. Xiang Dai, Sarvnaz Karimi, Ben Hachey, Cécile Paris. An Effective Transition-based Model for Discontinuous NER. ACL 2020: 5860-5870

35. Wei Lu and Dan Roth. 2015. Joint mention extraction and classification with mention hypergraphs. In Conference on Empirical Methods in Natural Language Processing, pages 857--867, Lisbon, Portugal.

36. Walker, C., Strassel, S., Medero, J., and Maeda, K. 2005. ACE 2005 multilingual training corpus. Linguistic Data Consortium.

37. Szpakowicz, S. 2009. Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, pages 94--99. Association for Computational Linguistics.

38. Zhang, Yuhao and Zhong, Victor and Chen, Danqi and Angeli, Gabor and Manning, Christopher D. 2017. Position-aware Attention and Supervised Data Improve Slot Filling. In Proceedings of EMNLP. Pages 35-45.

39. Riedel, S., Yao, L., and McCallum, A. 2010. Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 148-163. Springer.

40. Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. 2019. DocRED: A large-scale document-level relation extraction dataset. In Proceedings of ACL, pages 764--777.

41. Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING, pages 2335--2344.

42. Linlin Wang, Zhu Cao, Gerard De Melo, and Zhiyuan Liu. 2016. Relation classification via multi-level attention cnns. In Proceedings of ACL, pages 1298--1307.

43. Dongxu Zhang and Dong Wang. 2015. Relation classification via recurrent neural network. arXiv preprint arXiv:1508.01006.

44. Xu, Y., Mou, L., Li, G., Chen, Y., Peng, H., and Jin, Z. 2015. Classifying relations via long short term memory networks along shortest dependency paths. In proceedings of EMNLP, pages 1785--1794.

45. Shanchan Wu and Yifan He. 2019. Enriching pre-trained language model with entity information for relation classification.

46. Zhao, Y., Wan, H., Gao, J., and Lin, Y. 2019. Improving relation classification by entity pair graph. In Asian Conference on Machine Learning, pages 1156--1171.

47. Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of ACL-IJCNLP, pages 1003--1011.

48. Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of EMNLP, pages 455--465.

49. Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of EMNLP, pages 1753--1762.

50. Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of ACL, pages 2124--2133.

51. Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of EMNLP, pages 2205--2215.

52. Guoliang Ji, Kang Liu, Shizhu He, Jun Zhao, et al. 2017. Distant supervision for relation extraction with sentence-level attention and entity descriptions. In AAAI, pages 3060--3066.

53. Bordes A, Usunier N, Garcia-Duran A, et al. 2013. Translating embeddings for modeling multi-relational data. Advances in neural information processing systems. pages 2787-2795.

54. Xu Han, Pengfei Yu, Zhiyuan Liu, Maosong Sun, and Peng Li. 2018. Hierarchical relation extraction with coarse-to-fine grained attention. In Proceedings of EMNLP, pages 2236--2245.

55. Ningyu Zhang, Shumin Deng, Zhanlin Sun, Guanying Wang, Xi Chen, Wei Zhang, and Huajun Chen. 2019. Long-tail relation extraction via knowledge graph embeddings and graph convolution networks. In Proceedings of NAACL-HLT, pages 3016--3025.

56. Qin, P., Xu, W., and Wang, W. Y. 2018b. Robust distant supervision relation extraction via deep reinforcement learning. arXiv preprint arXiv:1805.09927.

57. Xiangrong Zeng, Shizhu He, Kang Liu, and Jun Zhao. 2018. Large scaled relation extraction with reinforcement learning. In Proceedings of AAAI, pages 5658--5665.

58. Jun Feng, Minlie Huang, Li Zhao, Yang Yang, and Xiaoyan Zhu. 2018. Reinforcement learning for relation classification from noisy data. In Proceedings of AAAI, pages 5779--5786.

59. Yi Wu, David Bamman, and Stuart Russell. 2017. Adversarial training for relation extraction. In Proceedings of EMNLP, pages 1778--1783.

60. Pengda Qin, Weiran Xu, William Yang Wang. 2018. DSGAN: Generative Adversarial Training for Distant Supervision Relation Extraction. In Proceedings of ACL, pages 496--505.

61. Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of ACL, pages 2895--2905.

62. Meng Qu, Tianyu Gao, Louis-Pascal Xhonneux, Jian Tang. 2020. Few-shot Relation Extraction via Bayesian Meta-learning on Task Graphs. In Proceedings of ICML.

63. Suncong Zheng, Feng Wang, Hongyun Bao, Yuexing Hao, Peng Zhou, Bo Xu. 2017. Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1227--1236.

64. Wei, Zhepei and Su, Jianlin and Wang, Yue and Tian, Yuan and Chang, Yi. 2020. A Novel Cascade Binary Tagging Framework for Relational Triple Extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1476--1488.

65. Luan, Y., Wadden, D., He, L., Shah, A., Ostendorf, M., & Hajishirzi, H. (2019). A general framework for information extraction using dynamic span graphs. arXiv preprint arXiv:1904.03296.

66. Wadden, D., Wennberg, U., Luan, Y., & Hajishirzi, H. (2019). Entity, relation, and event extraction with contextualized span representations. arXiv preprint arXiv:1909.03546.

67. Sahu, S. K., et al. 2019. Inter-sentence Relation Extraction with Document-level Graph Convolutional Neural Network. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics:4309--4316.

68. Liu, B., Gao, H., Qi, G., Duan, S., Wu, T., & Wang, M. (2019, April). Adversarial Discriminative Denoising for Distant Supervision Relation Extraction. In International Conference on Database Systems for Advanced Applications (pp. 282-286). Springer, Cham.

69. Namboodiri, A. M., & Jain, A. K. (2007). Document structure and layout analysis. In Digital Document Processing (pp. 29-48). Springer, London.

70. Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., & Zhou, M. (2020, August). LayoutLM: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1192-1200).

71. Li, M., Xu, Y., Cui, L., Huang, S., Wei, F., Li, Z., & Zhou, M. (2020). DocBank: A Benchmark Dataset for Document Layout Analysis. arXiv preprint arXiv:2006.01038.

72. Ainslie, J., Ontanon, S., Alberti, C., Cvicek, V., Fisher, Z., Pham, P., ... & Yang, L. (2020, November). ETC: Encoding Long and Structured Inputs in Transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 268-284).

73. Tang, J., Lu, Y., Lin, H., Han, X., Sun, L., Xiao, X., & Wu, H. (2020, November). Syntactic and Semantic-driven Learning for Open Information Extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings (pp. 782-792).