cs.CL - 2023-08-24

Text Similarity from Image Contents using Statistical and Semantic Analysis Techniques

paper_url: http://arxiv.org/abs/2308.12842
repo_url: None
paper_authors: Sagar Kulkarni, Sharvari Govilkar, Dhiraj Amin
for: This paper mainly targets the problem of plagiarism in image content, such as figures, graphs, and tables, and proposes a system to detect plagiarism in such content.
methods: The proposed system uses a combination of statistical algorithms, including Jaccard and Cosine, and semantic algorithms, such as LSA, BERT, and WordNet, to detect plagiarism in image content.
results: Adding semantic algorithms (LSA, BERT, WordNet) to the statistical ones (Jaccard, Cosine) is reported to detect plagiarism in image content more efficiently and accurately, demonstrating the system's effectiveness in this area.

Abstract
Plagiarism detection is one of the most researched areas in the Natural Language Processing (NLP) community. A good plagiarism detection system covers all the NLP methods, including semantics, named entities, paraphrases, etc., and produces detailed plagiarism reports. Detection of cross-lingual plagiarism requires deep knowledge of various advanced methods and algorithms to perform effective text similarity checking. Nowadays plagiarists are also becoming more skilled at hiding their identity to avoid being caught. They evade detection with techniques like paraphrasing, synonym replacement, mismatching citations, and translating from one language to another. Image Content Plagiarism Detection (ICPD) has gained importance, utilizing advanced image content processing to identify instances of plagiarism and ensure the integrity of image content. The issue of plagiarism extends beyond textual content, as images such as figures, graphs, and tables can also be plagiarized. However, image content plagiarism detection remains an unaddressed challenge. Therefore, there is a critical need to develop methods and systems for detecting plagiarism in image content. In this paper, a system has been implemented to detect plagiarism from the contents of images such as figures, graphs, tables, etc. Adding semantic algorithms such as LSA, BERT, and WordNet to statistical algorithms such as Jaccard and Cosine yielded better performance, detecting plagiarism efficiently and accurately.
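The abstract names Jaccard and Cosine as its statistical measures without defining them. Below is a minimal sketch of both. It assumes the text has already been extracted from the figure or table (e.g., by OCR) and uses plain whitespace tokenization. The tokenizer and example strings are illustrative assumptions, not the paper's preprocessing.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenization (an assumption; real systems may stem or drop stopwords)."""
    return text.lower().split()


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A & B| / |A | B| over the sets of tokens."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cosine(a: str, b: str) -> float:
    """Cosine similarity between raw term-frequency vectors."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


# Hypothetical text already extracted from two figures.
src = "accuracy of the model on test data"
sus = "model accuracy on the test data"
print(round(jaccard(src, sus), 3), round(cosine(src, sus), 3))
```

Jaccard ignores word counts and order, while cosine on term frequencies rewards repeated shared terms. Neither captures synonyms or paraphrase, which is the gap the semantic methods (LSA, BERT, WordNet) are meant to fill.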

Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities

paper_url: http://arxiv.org/abs/2308.12833
repo_url: None
paper_authors: Maximilian Mozes, Xuanli He, Bennett Kleinberg, Lewis D. Griffin
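for: This study aims to raise awareness of the safety and security issues of large language models (LLMs), including misuse for fraud, impersonation, and malware generation.
methods: The study surveys existing, predominantly scientific, work on identifying and mitigating threats and vulnerabilities of LLMs. It also proposes a taxonomy describing the relationships among threats, prevention measures, and vulnerabilities arising from imperfect prevention measures.
results: By raising developers' and practitioners' awareness of LLM security issues, the study hopes to improve the reliability of LLMs with respect to security.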

Abstract
Spurred by the recent rapid increase in the development and distribution of large language models (LLMs) across industry and academia, much recent work has drawn attention to safety- and security-related threats and vulnerabilities of LLMs, including in the context of potentially criminal activities. Specifically, it has been shown that LLMs can be misused for fraud, impersonation, and the generation of malware; while other authors have considered the more general problem of AI alignment. It is important that developers and practitioners alike are aware of security-related problems with such models. In this paper, we provide an overview of existing - predominantly scientific - efforts on identifying and mitigating threats and vulnerabilities arising from LLMs. We present a taxonomy describing the relationship between threats caused by the generative capabilities of LLMs, prevention measures intended to address such threats, and vulnerabilities arising from imperfect prevention measures. With our work, we hope to raise awareness of the limitations of LLMs in light of such security concerns, among both experienced developers and novel users of such technologies.

WavMark: Watermarking for Audio Generation

paper_url: http://arxiv.org/abs/2308.12770
repo_url: None
paper_authors: Guangyu Chen, Yu Wu, Shujie Liu, Tao Liu, Xiaoyong Du, Furu Wei
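for: This paper proposes a new audio watermarking framework to improve the robustness and reliability of audio watermarks.
methods: The framework imperceptibly encodes up to 32 bits of watermark in a 1-second audio snippet. It can also combine multiple watermark segments for greater robustness and capacity.
results: With 10- to 20-second host audio, the framework achieves an average bit error rate (BER) of 0.48% across ten common attacks. The authors describe this as a reduction of over 2800% in BER compared with the state-of-the-art watermarking tool.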

Abstract
Recent breakthroughs in zero-shot voice synthesis have enabled imitating a speaker's voice using just a few seconds of recording while maintaining a high level of realism. Alongside its potential benefits, this powerful technology introduces notable risks, including voice fraud and speaker impersonation. Unlike the conventional approach of solely relying on passive methods for detecting synthetic data, watermarking presents a proactive and robust defence mechanism against these looming risks. This paper introduces an innovative audio watermarking framework that encodes up to 32 bits of watermark within a mere 1-second audio snippet. The watermark is imperceptible to human senses and exhibits strong resilience against various attacks. It can serve as an effective identifier for synthesized voices and holds potential for broader applications in audio copyright protection. Moreover, this framework boasts high flexibility, allowing for the combination of multiple watermark segments to achieve heightened robustness and expanded capacity. Utilizing 10 to 20-second audio as the host, our approach demonstrates an average Bit Error Rate (BER) of 0.48% across ten common attacks, a remarkable reduction of over 2800% in BER compared to the state-of-the-art watermarking tool. See https://aka.ms/wavmark for demos of our work.
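The headline metric is the bit error rate: the fraction of embedded watermark bits that are decoded incorrectly. A minimal sketch for a single 32-bit payload follows. The payload and the one-bit "attack" are hypothetical. The reported 0.48% is an average over ten attacks and longer host audio, which this toy example does not reproduce.

```python
import random


def bit_error_rate(embedded: list[int], decoded: list[int]) -> float:
    """Fraction of watermark bits that were decoded incorrectly."""
    if len(embedded) != len(decoded):
        raise ValueError("payloads must have the same length")
    errors = sum(e != d for e, d in zip(embedded, decoded))
    return errors / len(embedded)


# Hypothetical 32-bit payload, matching the capacity quoted for a 1-second snippet.
rng = random.Random(0)
payload = [rng.randint(0, 1) for _ in range(32)]

# Simulate an attack that flips exactly one decoded bit.
decoded = payload.copy()
decoded[5] ^= 1

print(f"BER = {bit_error_rate(payload, decoded):.3%}")  # one flipped bit out of 32
```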

Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion

paper_url: http://arxiv.org/abs/2308.12734
repo_url: None
paper_authors: Jordan J. Bird, Ahmad Lotfi
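for: This study addresses deepfake voice conversion, which enables voice cloning and real-time conversion of one person's voice into another's, and aims at real-time detection of AI-generated speech.
methods: The study creates the DEEP-VOICE dataset with Retrieval-based Voice Conversion. It contains real speech from eight well-known figures and their speech converted into one another's voices. Framing the task as binary classification (real vs. AI-generated), the study applies t-tests to temporal audio features and finds significantly different distributions for real and AI-generated speech.
results: An Extreme Gradient Boosting model achieves an average classification accuracy of 99.3% and classifies one second of speech in about 0.004 milliseconds, i.e., in real time. All data are released publicly for future research on AI speech detection.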

Abstract
There are growing implications surrounding generative AI in the speech domain that enable voice cloning and real-time voice conversion from one individual to another. This technology poses a significant ethical threat and could lead to breaches of privacy and misrepresentation, thus there is an urgent need for real-time detection of AI-generated speech for DeepFake Voice Conversion. To address the above emerging issues, the DEEP-VOICE dataset is generated in this study, comprised of real human speech from eight well-known figures and their speech converted to one another using Retrieval-based Voice Conversion. Presenting as a binary classification problem of whether the speech is real or AI-generated, statistical analysis of temporal audio features through t-testing reveals that there are significantly different distributions. Hyperparameter optimisation is implemented for machine learning models to identify the source of speech. Following the training of 208 individual machine learning models over 10-fold cross validation, it is found that the Extreme Gradient Boosting model can achieve an average classification accuracy of 99.3% and can classify speech in real-time, at around 0.004 milliseconds given one second of speech. All data generated for this study is released publicly for future research on AI speech detection.
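The abstract reports that t-tests on temporal audio features show significantly different distributions for real and converted speech. Below is a minimal sketch of the t statistic for one feature. It assumes Welch's unequal-variance variant, which the abstract does not specify, and uses toy feature values that are not from DEEP-VOICE. The standard library has no t-distribution CDF, so a p-value would need something like `scipy.stats.ttest_ind(..., equal_var=False)`.

```python
import math
import statistics


def welch_t(sample_a: list[float], sample_b: list[float]) -> float:
    """Welch's t statistic for two independent samples with possibly unequal variances."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    # statistics.variance uses the n - 1 (sample) denominator.
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    se = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / se


# Hypothetical values of one temporal feature for real vs. converted clips (toy numbers).
real = [2.1, 2.3, 2.0, 2.2, 2.4]
fake = [2.6, 2.8, 2.7, 2.9, 2.5]
print(round(welch_t(real, fake), 2))
```

A large absolute t (here the means differ by five standard errors) is what motivates using such features as classifier inputs.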

Harnessing the Power of David against Goliath: Exploring Instruction Data Generation without Using Closed-Source Models

paper_url: http://arxiv.org/abs/2308.12711
repo_url: None
paper_authors: Yue Wang, Xinrui Wang, Juntao Li, Jinxiong Chang, Qishen Zhang, Zhongyi Liu, Guannan Zhang, Min Zhang
for: This paper aims to explore alternative methods for generating high-quality instruction data for training large language models, without relying on closed-source models.
methods: The authors investigate various existing instruction generation methods and integrate the most efficient variant with two novel strategies to enhance quality.
results: The generated instruction data outperforms Alpaca, a method reliant on closed-source models, as demonstrated by evaluation results from two benchmarks and the GPT-4 model.

Abstract
Instruction tuning is instrumental in enabling Large Language Models (LLMs) to follow user instructions to complete various open-domain tasks. The success of instruction tuning depends on the availability of high-quality instruction data. Owing to the exorbitant cost and substandard quality of human annotation, recent works have been deeply engaged in the exploration of the utilization of powerful closed-source models to generate instruction data automatically. However, these methods carry potential risks arising from the usage requirements of powerful closed-source models, which strictly forbid the utilization of their outputs to develop machine learning models. To deal with this problem, in this work, we explore alternative approaches to generate high-quality instruction data that do not rely on closed-source models. Our exploration includes an investigation of various existing instruction generation methods, culminating in the integration of the most efficient variant with two novel strategies to enhance the quality further. Evaluation results from two benchmarks and the GPT-4 model demonstrate the effectiveness of our generated instruction data, which can outperform Alpaca, a method reliant on closed-source models. We hope that more progress can be achieved in generating high-quality instruction data without using closed-source models.

From Chatter to Matter: Addressing Critical Steps of Emotion Recognition Learning in Task-oriented Dialogue

paper_url: http://arxiv.org/abs/2308.12648
repo_url: None
paper_authors: Shutong Feng, Nurul Lubis, Benjamin Ruppik, Christian Geishauser, Michael Heck, Hsien-chin Lin, Carel van Niekerk, Renato Vukovic, Milica Gašić
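for: Improving emotion recognition in conversations (ERC) for human-like conversational agents, specifically in task-oriented dialogues (ToDs).
methods: The framework turns an ERC model built for chit-chat dialogues into a task-oriented one by addressing three aspects: data, features, and objective. First, two ways of augmenting rare emotions are proposed. Second, dialogue states are used as auxiliary features to incorporate information about the user's goal. Finally, a multi-aspect emotion definition is used to design a multi-task learning objective and an emotion-distance weighted loss function.
results: On EmoWOZ, a large-scale dataset of user emotion in ToDs, the framework significantly improves a range of chit-chat ERC models. The best model is also tested for predicting user satisfaction on other ToD datasets, where a comparison with supervised baselines shows strong zero-shot capability.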

Abstract
Emotion recognition in conversations (ERC) is a crucial task for building human-like conversational agents. While substantial efforts have been devoted to ERC for chit-chat dialogues, the task-oriented counterpart is largely left unattended. Directly applying chit-chat ERC models to task-oriented dialogues (ToDs) results in suboptimal performance as these models overlook key features such as the correlation between emotions and task completion in ToDs. In this paper, we propose a framework that turns a chit-chat ERC model into a task-oriented one, addressing three critical aspects: data, features and objective. First, we devise two ways of augmenting rare emotions to improve ERC performance. Second, we use dialogue states as auxiliary features to incorporate key information from the goal of the user. Lastly, we leverage a multi-aspect emotion definition in ToDs to devise a multi-task learning objective and a novel emotion-distance weighted loss function. Our framework yields significant improvements for a range of chit-chat ERC models on EmoWOZ, a large-scale dataset for user emotion in ToDs. We further investigate the generalisability of the best resulting model to predict user satisfaction in different ToD datasets. A comparison with supervised baselines shows a strong zero-shot capability, highlighting the potential usage of our framework in wider scenarios.
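The abstract does not give the formula for the emotion-distance weighted loss. One illustrative way to make confusions between distant emotions cost more is to add the expected emotion distance under the predicted distribution to the cross-entropy. This is a hypothetical sketch, not the paper's definition. The three-emotion subset, the distance matrix, and `alpha` are all assumptions.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distance_weighted_loss(logits: list[float], target: int,
                           dist: list[list[float]], alpha: float = 1.0) -> float:
    """Cross-entropy plus alpha times the expected emotion distance to the target.

    dist[i][j] is a hypothetical distance between emotions i and j (0 on the diagonal).
    """
    probs = softmax(logits)
    ce = -math.log(probs[target])
    expected_dist = sum(p * dist[k][target] for k, p in enumerate(probs))
    return ce + alpha * expected_dist


# Hypothetical emotions: 0 = neutral, 1 = satisfied, 2 = dissatisfied.
dist = [[0, 1, 1],
        [1, 0, 2],
        [1, 2, 0]]
# Same cross-entropy, but confusing "satisfied" with "dissatisfied" is penalized more.
loss_neutral_error = distance_weighted_loss([2.0, 1.0, 0.0], 1, dist)
loss_dissatisfied_error = distance_weighted_loss([0.0, 1.0, 2.0], 1, dist)
print(round(loss_neutral_error, 3), round(loss_dissatisfied_error, 3))
```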

Probabilistic Method of Measuring Linguistic Productivity

paper_url: http://arxiv.org/abs/2308.12643
repo_url: None
paper_authors: Sergei Monakhov
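for: This paper proposes a new measure of linguistic productivity. The measure assesses an affix's ability to coin new complex words without depending directly on token frequency.
methods: Linguistic productivity is treated as the probability that an affix combines with a randomly sampled base. The approach has three advantages:
- Token frequency does not dominate the measure; it only influences the sampling of bases.
- Rather than counting attested word types with the affix, the method simulates the construction of such types and checks whether they are attested in a corpus.
- The corpus-based, randomised design gives true neologisms and long-established words an equal chance of being selected.
results: Evaluated on English and Russian data, the results suggest that growing productivity manifests as an increasing number of types. This happens in two stages: high-frequency items increase first, and only then do low-frequency items increase.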

Abstract
In this paper I propose a new way of measuring linguistic productivity that objectively assesses the ability of an affix to be used to coin new complex words and, unlike other popular measures, is not directly dependent upon token frequency. Specifically, I suggest that linguistic productivity may be viewed as the probability of an affix to combine with a random base. The advantages of this approach include the following. First, token frequency does not dominate the productivity measure but naturally influences the sampling of bases. Second, we are not just counting attested word types with an affix but rather simulating the construction of these types and then checking whether they are attested in the corpus. Third, a corpus-based approach and randomised design assure that true neologisms and words coined long ago have equal chances to be selected. The proposed algorithm is evaluated both on English and Russian data. The obtained results provide some valuable insights into the relation of linguistic productivity to the number of types and tokens. It looks like burgeoning linguistic productivity manifests itself in an increasing number of types. However, this process unfolds in two stages: first comes the increase in high-frequency items, and only then follows the increase in low-frequency items.
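The abstract's idea can be sketched as a Monte Carlo estimate: sample bases, attach the affix, and count how often the result is attested in the corpus. The sketch makes several assumptions that the abstract does not specify:
- bases are sampled in proportion to their token frequency (one reading of "naturally influences the sampling of bases");
- words are formed by naive concatenation;
- the base list and corpus below are hypothetical.

```python
import random


def productivity(affix: str, base_freqs: dict[str, int], corpus_types: set[str],
                 n_samples: int = 1000, seed: int = 0) -> float:
    """Estimate P(base + affix is attested) for a randomly sampled base.

    base_freqs: candidate base -> token frequency, used as sampling weights (assumption).
    corpus_types: set of word types attested in the corpus.
    Word formation is naive concatenation (assumption; real morphology needs
    spelling rules such as 'happy' + '-ness' -> 'happiness').
    """
    rng = random.Random(seed)
    bases = list(base_freqs)
    weights = [base_freqs[b] for b in bases]
    sampled = rng.choices(bases, weights=weights, k=n_samples)
    attested = sum((base + affix) in corpus_types for base in sampled)
    return attested / n_samples


# Hypothetical data: 'tableness' is not attested, the other three derivations are.
base_freqs = {"kind": 50, "dark": 30, "blue": 5, "table": 40}
corpus_types = {"kindness", "darkness", "blueness"}
print(productivity("ness", base_freqs, corpus_types))
```

With these weights the expected value is (50 + 30 + 5) / 125 = 0.68, and the sampled estimate should fall close to it.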

PromptMRG: Diagnosis-Driven Prompts for Medical Report Generation

paper_url: http://arxiv.org/abs/2308.12604
repo_url: None
paper_authors: Haibo Jin, Haoxuan Che, Yi Lin, Hao Chen
for: The paper aims to improve the accuracy of automatic medical report generation (MRG) by proposing a novel framework called PromptMRG.
methods: PromptMRG uses an encoder-decoder architecture with an extra disease classification branch. It incorporates cross-modal feature enhancement and an adaptive logit-adjusted loss to address the challenges of precise clinical understanding and disease imbalance.
results: The proposed method achieves state-of-the-art clinical efficacy performance on two MRG benchmarks, demonstrating its effectiveness in improving the accuracy of MRG.

Abstract
Automatic medical report generation (MRG) is of great research value as it has the potential to relieve radiologists from the heavy burden of report writing. Despite recent advancements, accurate MRG remains challenging due to the need for precise clinical understanding and the identification of clinical findings. Moreover, the imbalanced distribution of diseases makes the challenge even more pronounced, as rare diseases are underrepresented in training data, making their diagnostic performance unreliable. To address these challenges, we propose diagnosis-driven prompts for medical report generation (PromptMRG), a novel framework that aims to improve the diagnostic accuracy of MRG with the guidance of diagnosis-aware prompts. Specifically, PromptMRG is based on encoder-decoder architecture with an extra disease classification branch. When generating reports, the diagnostic results from the classification branch are converted into token prompts to explicitly guide the generation process. To further improve the diagnostic accuracy, we design cross-modal feature enhancement, which retrieves similar reports from the database to assist the diagnosis of a query image by leveraging the knowledge from a pre-trained CLIP. Moreover, the disease imbalanced issue is addressed by applying an adaptive logit-adjusted loss to the classification branch based on the individual learning status of each disease, which overcomes the barrier of text decoder's inability to manipulate disease distributions. Experiments on two MRG benchmarks show the effectiveness of the proposed method, where it obtains state-of-the-art clinical efficacy performance on both datasets.
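PromptMRG's loss is an *adaptive* logit adjustment driven by each disease's learning status, and its details are not in the abstract. For orientation, below is a sketch of the standard, static logit-adjusted cross-entropy from the long-tail learning literature, for a single softmax classifier. The class priors and `tau` are illustrative assumptions. During training, each logit is shifted by `tau * log(prior)`, so rare classes are handicapped and the network must learn larger raw logits for them. At inference the raw, unadjusted logits are used.

```python
import math


def logit_adjusted_ce(logits: list[float], target: int,
                      class_priors: list[float], tau: float = 1.0) -> float:
    """Softmax cross-entropy computed on logits shifted by tau * log(prior)."""
    adjusted = [z + tau * math.log(p) for z, p in zip(logits, class_priors)]
    m = max(adjusted)
    log_norm = m + math.log(sum(math.exp(a - m) for a in adjusted))  # stable log-sum-exp
    return log_norm - adjusted[target]


logits = [1.0, 2.0, 3.0]
uniform = logit_adjusted_ce(logits, 0, [1 / 3, 1 / 3, 1 / 3])
skewed = logit_adjusted_ce(logits, 0, [0.1, 0.45, 0.45])  # class 0 is rare
print(round(uniform, 3), round(skewed, 3))
```

With uniform priors the shift is the same constant for every class, so the loss reduces to plain cross-entropy. When the target class is rare, the loss on it grows, which pushes the model to compensate.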

A Small and Fast BERT for Chinese Medical Punctuation Restoration

paper_url: http://arxiv.org/abs/2308.12568
repo_url: None
paper_authors: Tongtao Ling, Chen Liao, Zhipeng Yu, Lei Chen, Shilei Huang, Yi Liu
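for: Restoring punctuation in automatic speech recognition (ASR) output so that dictated clinical reports are precise and understandable.
methods: Following the "pretraining and fine-tuning" paradigm, the authors distill a fast, lightweight pre-trained model for Chinese medical punctuation restoration. The distillation uses supervised contrastive learning and a novel auxiliary pre-training task (Punctuation Mark Prediction).
results: Experiments show that the model reaches 95% of the performance of the state-of-the-art Chinese RoBERTa with only 10% of its model size.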

Abstract
In clinical dictation, utterances produced by automatic speech recognition (ASR) without explicit punctuation marks may lead to misunderstanding of dictated reports. To give a precise and understandable clinical report with ASR, automatic punctuation restoration is required. Considering a practical scenario, we propose a fast and light pre-trained model for Chinese medical punctuation restoration based on the 'pretraining and fine-tuning' paradigm. In this work, we distill pre-trained models by incorporating supervised contrastive learning and a novel auxiliary pre-training task (Punctuation Mark Prediction) to make them well-suited for punctuation restoration. Our experiments on various distilled models reveal that our model can achieve 95% of the performance of the state-of-the-art Chinese RoBERTa with 10% of its model size.
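Punctuation restoration is commonly framed as sequence labeling over unpunctuated ASR output. Each character is tagged with the punctuation mark that should follow it. Training pairs can then be derived from already-punctuated text by stripping the marks. A minimal sketch of that labeling step follows. The label set and the `O` tag are assumptions, and the abstract does not say this is exactly how the paper labels its data.

```python
# Hypothetical label set: common Chinese full-width punctuation marks.
PUNCTUATION = set("，。、；：？！")


def make_punct_labels(text: str) -> tuple[list[str], list[str]]:
    """Split punctuated text into characters and the mark that follows each one.

    'O' marks a character that is not followed by punctuation.
    """
    chars: list[str] = []
    labels: list[str] = []
    for ch in text:
        if ch in PUNCTUATION:
            if labels:  # attach the mark to the preceding character
                labels[-1] = ch
        else:
            chars.append(ch)
            labels.append("O")
    return chars, labels


chars, labels = make_punct_labels("患者发热，咳嗽三天。")
print(list(zip(chars, labels)))
```

A classifier over `chars` then predicts `labels`, and re-inserting the predicted marks after each character restores the punctuated report.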

CARE: Co-Attention Network for Joint Entity and Relation Extraction

paper_url: http://arxiv.org/abs/2308.12531
repo_url: None
paper_authors: Wenjun Kong, Yamei Xia
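for: Improving joint information extraction, i.e., extracting entities and relations together.
methods: A co-attention network learns separate representations for entities and relations to avoid feature confusion. Its core component is a co-attention module that captures two-way interaction between the subtasks. This lets the model use entity information for relation prediction and vice versa.
results: Extensive experiments on three joint entity-relation extraction benchmarks (NYT, WebNLG, and SciERC) show that the model outperforms existing baselines.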

Abstract
Joint entity and relation extraction is the fundamental task of information extraction, consisting of two subtasks: named entity recognition and relation extraction. Most existing joint extraction methods suffer from issues of feature confusion or inadequate interaction between two subtasks. In this work, we propose a Co-Attention network for joint entity and Relation Extraction (CARE). Our approach involves learning separate representations for each subtask, aiming to avoid feature overlap. At the core of our approach is the co-attention module that captures two-way interaction between two subtasks, allowing the model to leverage entity information for relation prediction and vice versa, thus promoting mutual enhancement. Extensive experiments on three joint entity-relation extraction benchmark datasets (NYT, WebNLG and SciERC) show that our proposed model achieves superior performance, surpassing existing baseline models.
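The co-attention idea can be sketched as two cross-attention passes: entity features attend over relation features and vice versa, with each context added back residually. This is a generic, simplified sketch, not CARE's exact module. It omits the learned query/key/value projections and multiple heads a real implementation would use, and the feature vectors are hypothetical.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention; keys and values are the same vectors (simplification)."""
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys_values]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[i] for wj, v in zip(w, keys_values)) for i in range(d)])
    return outputs, weights


def co_attention(entity_feats, relation_feats):
    """Two-way interaction: each subtask attends over the other's features, then adds the context."""
    ent_ctx, ent_w = attend(entity_feats, relation_feats)
    rel_ctx, rel_w = attend(relation_feats, entity_feats)
    new_ent = [[a + b for a, b in zip(x, c)] for x, c in zip(entity_feats, ent_ctx)]
    new_rel = [[a + b for a, b in zip(x, c)] for x, c in zip(relation_feats, rel_ctx)]
    return new_ent, new_rel, ent_w, rel_w


# Hypothetical 2-d features: 3 entity-side tokens, 2 relation-side tokens.
entities = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
relations = [[0.5, 0.5], [1.0, -1.0]]
new_ent, new_rel, ent_w, rel_w = co_attention(entities, relations)
print(new_ent, new_rel)
```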

Large Language Model as Autonomous Decision Maker

paper_url: http://arxiv.org/abs/2308.12519
repo_url: None
paper_authors: Yining Ye, Xin Cong, Yujia Qin, Yankai Lin, Zhiyuan Liu, Maosong Sun
for: This paper aims to enable large language models (LLMs) to make autonomous decisions by endowing them with self-judgment ability. This allows them to explore and judge decision steps based on their values and utilities.
methods: The proposed approach, JuDec, uses an Elo-based Self-Judgment Mechanism. It assigns Elo scores to decision steps via pairwise comparisons between solutions and guides the decision-searching process toward the optimal solution.
results: Experimental results on the ToolBench dataset show that JuDec improves Pass Rate by over 10% on diverse tasks. It offers higher-quality solutions and reduces costs (ChatGPT API calls), demonstrating its effectiveness and efficiency.

Abstract
While large language models (LLMs) exhibit impressive language understanding and in-context learning abilities, their decision-making ability still heavily relies on the guidance of task-specific expert knowledge when solving real-world tasks. To unleash the potential of LLMs as autonomous decision makers, this paper presents an approach JuDec to endow LLMs with the self-judgment ability, enabling LLMs to achieve autonomous judgment and exploration for decision making. Specifically, in JuDec, Elo-based Self-Judgment Mechanism is designed to assign Elo scores to decision steps to judge their values and utilities via pairwise comparisons between two solutions and then guide the decision-searching process toward the optimal solution accordingly. Experimental results on the ToolBench dataset demonstrate JuDec's superiority over baselines, achieving over 10% improvement in Pass Rate on diverse tasks. It offers higher-quality solutions and reduces costs (ChatGPT API calls), highlighting its effectiveness and efficiency.
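The Elo mechanism can be illustrated with the standard Elo update: each pairwise comparison moves the winner's rating up and the loser's down, in proportion to how surprising the outcome was. The following parts are assumptions: K = 32, the initial rating of 1500, and the step names. The comparison outcomes, which in JuDec would come from the LLM judging two solutions, are hard-coded stubs. How JuDec propagates scores through its search is not shown.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0) -> tuple[float, float]:
    """Return updated ratings of A and B after one pairwise comparison."""
    s_a = 1.0 if a_wins else 0.0
    e_a = expected_score(r_a, r_b)
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))


# Hypothetical candidate decision steps, all starting at 1500.
ratings = {"step_a": 1500.0, "step_b": 1500.0, "step_c": 1500.0}

# Hypothetical judgments (first, second, first_wins), standing in for LLM comparisons.
judgments = [("step_a", "step_b", True), ("step_c", "step_a", False), ("step_b", "step_c", True)]

for first, second, first_wins in judgments:
    ratings[first], ratings[second] = elo_update(ratings[first], ratings[second], first_wins)

best = max(ratings, key=ratings.get)  # the step a search would expand next
print(best, {name: round(r, 1) for name, r in ratings.items()})
```

Each update conserves the total rating. A win over a higher-rated opponent earns more than a win over a lower-rated one, which is what lets a few comparisons rank many candidate steps.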

MultiPA: a multi-task speech pronunciation assessment system for a closed and open response scenario

paper_url: http://arxiv.org/abs/2308.12490
repo_url: None
paper_authors: Yu-Wen Chen, Zhou Yu, Julia Hirschberg
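for: Developing a multi-task speech pronunciation assessment system that works in both closed and open response scenarios, giving a more precise and holistic assessment of pronunciation skills.
methods: MultiPA uses multi-task learning to assess pronunciation at both the sentence and word level. Compared with Kaldi-based systems, it has simpler format requirements and better compatibility with other neural network models.
results: Experiments show that MultiPA achieves comparable performance in closed response scenarios and more robust performance when used directly for open responses.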

Abstract
The design of automatic speech pronunciation assessment can be categorized into closed and open response scenarios, each with strengths and limitations. A system with the ability to function in both scenarios can cater to diverse learning needs and provide a more precise and holistic assessment of pronunciation skills. In this study, we propose a Multi-task Pronunciation Assessment model called MultiPA. MultiPA provides an alternative to Kaldi-based systems in that it has simpler format requirements and better compatibility with other neural network models. Compared with previous open response systems, MultiPA provides a wider range of evaluations, encompassing assessments at both the sentence and word-level. Our experimental results show that MultiPA achieves comparable performance when working in closed response scenarios and maintains more robust performance when directly used for open responses.

American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers

paper_url: http://arxiv.org/abs/2308.12477
repo_url: https://github.com/dell-research-harvard/americanstories
paper_authors: Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Zejiang Shen, Luca D’Amico-Wong, Quan Le, Pablo Querubin, Leander Heldring
for: The paper aims to extract full article texts from newspaper images in the Library of Congress's public domain Chronicling America collection. The goal is to provide high-quality data for pre-training a large language model and to improve access to historical information.
methods: The paper develops a novel deep learning pipeline for extracting full article texts from newspaper images, including layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes. To achieve high scalability, the pipeline is built with efficient architectures designed for mobile phones.
results: The resulting American Stories dataset provides high-quality, structured article texts that can be used for pre-training a large language model, topic classification, detection of reproduced content, and news story clustering. The dataset also facilitates innovation in multimodal layout analysis models and other multimodal applications.

Abstract
Existing full text datasets of U.S. public domain newspapers do not recognize the often complex layouts of newspaper scans, and as a result the digitized content scrambles texts from articles, headlines, captions, advertisements, and other layout regions. OCR quality can also be low. This study develops a novel, deep learning pipeline for extracting full article texts from newspaper images and applies it to the nearly 20 million scans in Library of Congress's public domain Chronicling America collection. The pipeline includes layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes. To achieve high scalability, it is built with efficient architectures designed for mobile phones. The resulting American Stories dataset provides high quality data that could be used for pre-training a large language model to achieve better understanding of historical English and historical world knowledge. The dataset could also be added to the external database of a retrieval-augmented language model to make historical information - ranging from interpretations of political events to minutiae about the lives of people's ancestors - more widely accessible. Furthermore, structured article texts facilitate using transformer-based methods for popular social science applications like topic classification, detection of reproduced content, and news story clustering. Finally, American Stories provides a massive silver quality dataset for innovating multimodal layout analysis models and other multimodal applications.
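The last pipeline stage, associating article texts that span multiple bounding boxes, can be pictured as grouping OCR'd boxes by article and joining them in reading order. The sketch below assumes that an upstream model has already assigned each box an article id. It also assumes a simple column-then-top-to-bottom reading order, a hypothetical heuristic that is not the paper's learned association model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TextBox:
    article_id: int  # hypothetical output of an article-association model
    column: int      # layout column index, left to right
    y: float         # top coordinate on the page (smaller = higher)
    text: str        # OCR output for this box


def assemble_articles(boxes: list[TextBox]) -> dict[int, str]:
    """Join boxes of the same article in a simple reading order (column, then top-to-bottom)."""
    grouped: dict[int, list[TextBox]] = defaultdict(list)
    for box in boxes:
        grouped[box.article_id].append(box)
    return {
        art: " ".join(b.text for b in sorted(group, key=lambda b: (b.column, b.y)))
        for art, group in grouped.items()
    }


# Hypothetical boxes from one page, listed out of reading order.
boxes = [
    TextBox(1, 1, 10.0, "continued here."),
    TextBox(1, 0, 500.0, "The council met"),
    TextBox(2, 0, 50.0, "FOR SALE: one plow."),
    TextBox(1, 0, 520.0, "on Tuesday and"),
]
articles = assemble_articles(boxes)
print(articles)
```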

Evolution of ESG-focused DLT Research: An NLP Analysis of the Literature

paper_url: http://arxiv.org/abs/2308.12420
repo_url: None
paper_authors: Walter Hernandez, Kamil Tylinski, Alastair Moore, Niall Roche, Nikhil Vadgama, Horst Treiblmaier, Jiangbo Shangguan, Paolo Tasca, Jiahua Xu
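for: Providing a machine-learning-driven systematic literature review methodology for investigating the diverse components of Distributed Ledger Technologies (DLTs), with a special emphasis on their ESG aspects.
methods: A pre-trained transformer-based language model is fine-tuned for Named Entity Recognition (NER) on the authors' own labeled dataset. The literature review is then conducted via named entities and temporal graph analysis.
results: The fine-tuned model distills the corpus to 505 key papers on the Environmental, Sustainability, and Governance (ESG) aspects of DLT. The study also presents a first-of-its-kind NER dataset of 54,808 named entities for DLT- and ESG-related exploration.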

Abstract
Distributed Ledger Technologies (DLTs) have rapidly evolved, necessitating comprehensive insights into their diverse components. However, a systematic literature review that emphasizes the Environmental, Sustainability, and Governance (ESG) components of DLT remains lacking. To bridge this gap, we selected 107 seed papers to build a citation network of 63,083 references and refined it to a corpus of 24,539 publications for analysis. Then, we labeled the named entities in 46 papers according to twelve top-level categories derived from an established technology taxonomy and enhanced the taxonomy by pinpointing DLT's ESG elements. Leveraging transformer-based language models, we fine-tuned a pre-trained language model for a Named Entity Recognition (NER) task using our labeled dataset. We used our fine-tuned language model to distill the corpus to 505 key papers, facilitating a literature review via named entities and temporal graph analysis on DLT evolution in the context of ESG. Our contributions are a methodology to conduct a machine learning-driven systematic literature review in the DLT field, placing a special emphasis on ESG aspects. Furthermore, we present a first-of-its-kind NER dataset, composed of 54,808 named entities, designed for DLT and ESG-related explorations.
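Building a citation network from seed papers is a form of snowballing: a breadth-first expansion along references. Below is a minimal sketch that records the depth at which each paper is reached. The reference lookup and paper ids are hypothetical. The abstract does not state the expansion depth, nor the criteria used to refine 63,083 references down to 24,539 publications, so neither is modeled here.

```python
from collections import deque


def expand_citation_network(seeds: list[str], references: dict[str, list[str]],
                            max_depth: int = 1) -> dict[str, int]:
    """Breadth-first snowballing from seed papers along their references.

    references maps a paper id to the ids it cites (hypothetical lookup).
    Returns paper id -> depth at which it was first reached (seeds have depth 0).
    """
    depth = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        paper = queue.popleft()
        if depth[paper] == max_depth:
            continue  # do not expand beyond the requested depth
        for ref in references.get(paper, ()):
            if ref not in depth:
                depth[ref] = depth[paper] + 1
                queue.append(ref)
    return depth


# Hypothetical reference lists.
refs = {"S1": ["A", "B"], "S2": ["B", "C"], "A": ["D"]}
print(expand_citation_network(["S1", "S2"], refs))
```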

Toward American Sign Language Processing in the Real World: Data, Tasks, and Methods

paper_url: http://arxiv.org/abs/2308.12419
repo_url: None
paper_authors: Bowen Shi
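for: This thesis studies automatic sign language processing in natural (in-the-wild) settings, using signing videos collected from the Internet.
methods: The thesis introduces new large-scale ASL datasets together with new tasks and methods. Most chapters focus on fingerspelling, an important component of sign language that has received little attention in prior work.
results: The thesis proposes an end-to-end approach based on iterative attention that recognizes fingerspelling directly from raw video. It further shows that a Conformer-based network jointly modeling handshape and mouthing brings performance close to human level. In addition, it proposes two tasks for real-world applications: fingerspelling detection and search.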

Abstract
Sign language, which conveys meaning through gestures, is the chief means of communication among deaf people. Recognizing sign language in natural settings presents significant challenges due to factors such as lighting, background clutter, and variations in signer characteristics. In this thesis, I study automatic sign language processing in the wild, using signing videos collected from the Internet. This thesis contributes new datasets, tasks, and methods. Most chapters of this thesis address tasks related to fingerspelling, an important component of sign language that has not been widely studied in prior work. I present three new large-scale in-the-wild ASL datasets: ChicagoFSWild, ChicagoFSWild+, and OpenASL. Using ChicagoFSWild and ChicagoFSWild+, I address fingerspelling recognition, which consists of transcribing fingerspelling sequences into text. I propose an end-to-end approach based on iterative attention that allows recognition from raw video without explicit hand detection. I further show that a Conformer-based network jointly modeling handshape and mouthing can bring performance close to that of humans. Next, I propose two tasks for building real-world fingerspelling-based applications: fingerspelling detection and search. For fingerspelling detection, I introduce a suite of evaluation metrics and a new detection model trained via multi-task learning. To address the problem of searching for fingerspelled keywords in raw sign language videos, we propose a novel method that jointly localizes fingerspelling segments and matches them to text. Finally, I describe a benchmark for large-vocabulary open-domain sign language translation based on OpenASL. To address the challenges of sign language translation in realistic settings, we propose a set of techniques including sign search as a pretext task for pre-training and fusion of mouthing and handshape features.
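Fingerspelling recognition outputs a letter string, so it is usually scored by comparing that string to a reference transcription. The abstract does not name its metric; the sketch below assumes letter accuracy, 1 − (S + D + I)/N, where S, D, and I are the substitutions, deletions, and insertions in a minimum edit alignment against the N-letter reference. The function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # delete a reference letter
                curr[j - 1] + 1,         # insert a hypothesis letter
                prev[j - 1] + (r != h),  # substitute (free if letters match)
            )
        prev = curr
    return prev[-1]


def letter_accuracy(ref: str, hyp: str) -> float:
    """Letter accuracy = 1 - edit_distance / len(ref); can be negative."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted letter out of seven.
    print(round(letter_accuracy("CHICAGO", "CHICAKO"), 3))  # 0.857
```

Because insertions count as errors, a hypothesis much longer than the reference can drive the score below zero, which is why the docstring does not promise a value in [0, 1].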

Vision Transformer Adapters for Generalizable Multitask Learning

paper_url: http://arxiv.org/abs/2308.12372
repo_url: https://github.com/IVRL/VTAGML
paper_authors: Deblina Bhattacharjee, Sabine Süsstrunk, Mathieu Salzmann
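for: This paper proposes multitask adapters for vision transformers that learn generalizable task affinities, which transfer to new tasks and domains without retraining or fine-tuning.
methods: The adapters are built on an off-the-shelf vision transformer backbone and integrate multiple dense vision tasks, learning generalizable task affinities through a task-adapted attention mechanism that combines gradient-based and attention-based task similarities.
results: The authors show that the method solves multiple dense vision tasks simultaneously in a parameter-efficient manner and achieves higher performance in zero-shot task transfer, unsupervised domain adaptation, and generalization to novel domains without fine-tuning.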

Abstract
We introduce the first multitasking vision transformer adapters that learn generalizable task affinities which can be applied to novel tasks and domains. Integrated into an off-the-shelf vision transformer backbone, our adapters can simultaneously solve multiple dense vision tasks in a parameter-efficient manner, unlike existing multitasking transformers that are parametrically expensive. In contrast to concurrent methods, we do not require retraining or fine-tuning whenever a new task or domain is added. We introduce a task-adapted attention mechanism within our adapter framework that combines gradient-based task similarities with attention-based ones. The learned task affinities generalize to the following settings: zero-shot task transfer, unsupervised domain adaptation, and generalization without fine-tuning to novel domains. We demonstrate that our approach outperforms not only the existing convolutional neural network-based multitasking methods but also the vision transformer-based ones. Our project page is at \url{https://ivrl.github.io/VTAGML}.
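The abstract says the task-adapted attention combines gradient-based task similarities with attention-based ones but gives no formula. The sketch below illustrates the idea under two assumptions that are not from the paper: gradient-based affinity is the cosine similarity between per-task gradients on shared parameters, and the two affinity matrices are blended with a fixed, hypothetical weight `alpha`.

```python
import math
from typing import List

Matrix = List[List[float]]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def gradient_affinity(task_grads: List[List[float]]) -> Matrix:
    """Pairwise cosine similarity of per-task gradients on shared weights."""
    return [[cosine(gi, gj) for gj in task_grads] for gi in task_grads]


def combine(grad_aff: Matrix, attn_aff: Matrix, alpha: float = 0.5) -> Matrix:
    """Blend the two affinity matrices: alpha * grad + (1 - alpha) * attn.

    `alpha` is a hypothetical mixing weight, not a parameter from the paper.
    """
    return [
        [alpha * g + (1 - alpha) * a for g, a in zip(grow, arow)]
        for grow, arow in zip(grad_aff, attn_aff)
    ]


if __name__ == "__main__":
    # Toy gradients for three dense tasks (e.g. depth, normals, segmentation).
    grads = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
    attn = [[1.0, 0.6, 0.2], [0.6, 1.0, 0.3], [0.2, 0.3, 1.0]]
    for row in combine(gradient_affinity(grads), attn):
        print([round(x, 2) for x in row])
```

Conflicting gradients (negative cosine) pull an affinity down even when the attention maps look similar, which is the kind of complementary signal a combined measure is meant to capture.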

Prompt2Model: Generating Deployable Models from Natural Language Instructions

paper_url: http://arxiv.org/abs/2308.12261
repo_url: https://github.com/neulab/prompt2model
paper_authors: Vijay Viswanathan, Chenyang Zhao, Amanda Bertsch, Tongshuang Wu, Graham Neubig
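for: This paper investigates how to use a prompt to train special-purpose NLP models, improving their performance and ease of deployment.
methods: The paper uses a multi-step process: retrieval of existing datasets and pretrained models, dataset generation using LLMs, and supervised fine-tuning on these retrieved and generated datasets.
results: On three tasks, the paper shows that, given the same few-shot prompt, Prompt2Model trains models that outperform the strong LLM gpt-3.5-turbo by an average of 20% while being up to 700 times smaller. It also shows that this data can be used to obtain reliable estimates of model performance, helping developers assess model reliability before deployment.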

Abstract
Large language models (LLMs) enable system builders today to create competent NLP systems through prompting, where they only need to describe the task in natural language and provide a few examples. However, in other ways, LLMs are a step backward from traditional special-purpose NLP models; they require extensive computational resources for deployment and can be gated behind APIs. In this paper, we propose Prompt2Model, a general-purpose method that takes a natural language task description like the prompts provided to LLMs, and uses it to train a special-purpose model that is conducive to deployment. This is done through a multi-step process of retrieval of existing datasets and pretrained models, dataset generation using LLMs, and supervised fine-tuning on these retrieved and generated datasets. Over three tasks, we demonstrate that given the same few-shot prompt as input, Prompt2Model trains models that outperform the results of a strong LLM, gpt-3.5-turbo, by an average of 20% while being up to 700 times smaller. We also show that this data can be used to obtain reliable estimates of model performance, enabling model developers to assess model reliability before deployment. Prompt2Model is available open-source at https://github.com/neulab/prompt2model.
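The first and last data-handling steps of the pipeline (retrieving relevant datasets for a prompt, then merging retrieved and generated examples into one training set) can be sketched as follows. This is not Prompt2Model's retriever or API: the Jaccard word-overlap ranking is a simple stand-in, and the catalog descriptions and function names are hypothetical. LLM-based generation and fine-tuning are out of scope here.

```python
import re
from typing import Dict, List, Set, Tuple

Example = Tuple[str, str]  # (input text, output text)


def tokens(text: str) -> Set[str]:
    """Lower-cased word set used for a crude lexical similarity."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve_datasets(prompt: str, catalog: Dict[str, str], top_k: int = 1) -> List[str]:
    """Rank dataset names by word overlap between the prompt and each description."""
    ranked = sorted(catalog, key=lambda name: (-jaccard(prompt, catalog[name]), name))
    return ranked[:top_k]


def build_training_set(retrieved: List[Example], generated: List[Example]) -> List[Example]:
    """Merge retrieved and generated examples, dropping exact duplicates in order."""
    seen: Set[Example] = set()
    merged: List[Example] = []
    for ex in retrieved + generated:
        if ex not in seen:
            seen.add(ex)
            merged.append(ex)
    return merged


if __name__ == "__main__":
    # Toy catalog; descriptions are written for this example.
    catalog = {
        "squad": "extractive question answering over wikipedia paragraphs",
        "sst2": "binary sentiment classification of movie reviews",
    }
    prompt = "Answer a question about a Wikipedia paragraph"
    print(retrieve_datasets(prompt, catalog))  # ['squad']
```

A lexical score keeps the sketch dependency-free; a dense embedding retriever would be the more robust choice for matching paraphrased task descriptions.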

How to Protect Copyright Data in Optimization of Large Language Models?

paper_url: http://arxiv.org/abs/2308.12247
repo_url: None
paper_authors: Timothy Chu, Zhao Song, Chiwun Yang
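for: This study aims to prevent large language models (LLMs) from generating copyrighted data.
methods: The study proposes viewing LLM training and optimization as a softmax regression problem and develops a method for performing softmax regression efficiently.
results: The study shows that this method prevents the regression function from generating copyrighted data, establishing a theoretical method for training LLMs that avoids generating copyrighted data.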

Abstract
Large language models (LLMs) and generative AI have played a transformative role in computer research and applications. Controversy has arisen as to whether these models output copyrighted data, which can occur if the data the models are trained on is copyrighted. LLMs are built on the transformer neural network architecture, which in turn relies on a mathematical computation called Attention that uses the softmax function. In this paper, we show that large language model training and optimization can be seen as a softmax regression problem. We then establish a method of efficiently performing softmax regression, in a way that prevents the regression function from generating copyrighted data. This establishes a theoretical method of training large language models in a way that avoids generating copyrighted data.
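To make "softmax regression" concrete, here is a minimal sketch of the plain problem, assuming the objective L(x) = ½‖softmax(Ax) − b‖², a formulation used in related theoretical work but not spelled out in the abstract. It shows only what is being optimized; it does not implement the paper's copyright-avoiding modification, which the abstract does not describe. The gradient uses the softmax Jacobian diag(f) − f fᵀ with f = softmax(Ax).

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def softmax(z: Vector) -> Vector:
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def matvec(A: Matrix, x: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def loss(A: Matrix, x: Vector, b: Vector) -> float:
    """L(x) = 0.5 * || softmax(Ax) - b ||^2."""
    f = softmax(matvec(A, x))
    return 0.5 * sum((fi - bi) ** 2 for fi, bi in zip(f, b))


def grad(A: Matrix, x: Vector, b: Vector) -> Vector:
    """Gradient A^T (diag(f) - f f^T)(f - b), where f = softmax(Ax)."""
    f = softmax(matvec(A, x))
    r = [fi - bi for fi, bi in zip(f, b)]
    fr = sum(fi * ri for fi, ri in zip(f, r))
    g = [fi * ri - fi * fr for fi, ri in zip(f, r)]  # Jacobian-vector product
    return [sum(A[i][j] * g[i] for i in range(len(A))) for j in range(len(x))]


def fit(A: Matrix, b: Vector, steps: int = 1000, lr: float = 1.0) -> Vector:
    """Plain gradient descent from x = 0."""
    x = [0.0] * len(A[0])
    for _ in range(steps):
        x = [xj - lr * gj for xj, gj in zip(x, grad(A, x, b))]
    return x


if __name__ == "__main__":
    A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    b = [0.2, 0.3, 0.5]  # a target probability vector
    x = fit(A, b)
    print(loss(A, [0.0, 0.0], b), "->", loss(A, x, b))
```

Plain gradient descent is used for readability; the paper's contribution concerns performing this regression efficiently under an additional copyright constraint, which this sketch leaves out.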
