cs.CL - 2023-09-01

Let the Models Respond: Interpreting Language Model Detoxification Through the Lens of Prompt Dependence

  • paper_url: http://arxiv.org/abs/2309.00751
  • repo_url: https://github.com/DanielSc4/RewardLM
  • paper_authors: Daniel Scalena, Gabriele Sarti, Malvina Nissim, Elisabetta Fersini
  • for: This paper studies how detoxification techniques affect the internal processes of language models.
  • methods: The paper applies popular detoxification approaches to several language models and uses feature attribution methods to quantify their impact on the models' prompt dependence.
  • results: The study finds that detoxification methods improve the safety of model generations, but their impact on the models' internal processes remains poorly understood. It also finds that counter-narrative fine-tuning achieves detoxification performance comparable to reinforcement learning-driven detoxification, while differing in its effect on prompt reliance.
    Abstract Due to language models' propensity to generate toxic or hateful responses, several techniques were developed to align model generations with users' preferences. Despite the effectiveness of such methods in improving the safety of model interactions, their impact on models' internal processes is still poorly understood. In this work, we apply popular detoxification approaches to several language models and quantify their impact on the resulting models' prompt dependence using feature attribution methods. We evaluate the effectiveness of counter-narrative fine-tuning and compare it with reinforcement learning-driven detoxification, observing differences in prompt reliance between the two methods despite their similar detoxification performances.
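As a rough illustration of the kind of measurement the paper describes, the sketch below (hypothetical, not taken from the RewardLM repository) reports the share of attribution mass that falls on prompt tokens; the scores themselves could come from any feature attribution method (e.g., gradient × input).

```python
def prompt_dependence(attributions, prompt_len):
    """Fraction of total attribution mass assigned to prompt tokens.

    `attributions` holds one importance score per input token (prompt
    followed by previously generated tokens); higher values mean the
    generation leans more on the prompt.
    """
    total = sum(abs(a) for a in attributions)
    if total == 0:
        return 0.0
    return sum(abs(a) for a in attributions[:prompt_len]) / total

# Toy scores: 4 prompt tokens followed by 3 generated-context tokens.
scores = [0.5, 0.1, 0.3, 0.1, 0.05, 0.05, 0.1]
dep = prompt_dependence(scores, prompt_len=4)
print(f"prompt dependence: {dep:.3f}")
```

Comparing this quantity before and after detoxification, averaged over many generations, is one way the difference in prompt reliance between the two detoxification methods could be surfaced.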

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

  • paper_url: http://arxiv.org/abs/2309.00614
  • repo_url: None
  • paper_authors: Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, Tom Goldstein
  • for: This paper studies the security of language models, in particular how well baseline defenses perform against adversarial attacks under different threat models.
  • methods: The paper evaluates several defense strategies, including detection (perplexity-based filtering), input preprocessing (paraphrasing and retokenization), and adversarial training.
  • results: The study finds that the weakness of existing discrete optimizers for text, combined with the high cost of optimization, makes standard adaptive attacks more challenging for LLMs. Future research is needed to determine whether more powerful optimizers can be developed, or whether filtering and preprocessing defenses are stronger in the LLM domain than they have been in computer vision.
    Abstract As Large Language Models quickly become ubiquitous, it becomes critical to understand their security vulnerabilities. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing from the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain? How do baseline defense techniques perform in this new domain? How does LLM security differ from computer vision? We evaluate several baseline defense strategies against leading adversarial attacks on LLMs, discussing the various settings in which each is feasible and effective. Particularly, we look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training. We discuss white-box and gray-box settings and discuss the robustness-performance trade-off for each of the defenses considered. We find that the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs. Future research will be needed to uncover whether more powerful optimizers can be developed, or whether the strength of filtering and preprocessing defenses is greater in the LLMs domain than it has been in computer vision.
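The perplexity-based detection baseline can be sketched in a few lines: gibberish adversarial suffixes produced by discrete optimizers tend to have much higher perplexity than fluent text, so a threshold on prompt perplexity filters many of them. The threshold and log-probabilities below are toy values; a real defense would score tokens with an actual language model such as GPT-2.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def flag_prompt(token_logprobs, threshold=1000.0):
    """Flag a prompt as suspicious when its perplexity exceeds a threshold.

    The threshold here is illustrative; in practice it would be calibrated
    on a held-out set of benign prompts.
    """
    return perplexity(token_logprobs) > threshold

# A fluent prompt: tokens with moderately high probability under the LM.
fluent = [math.log(0.2)] * 10      # perplexity = 5.0
# A gibberish adversarial suffix: uniformly improbable tokens.
gibberish = [math.log(1e-4)] * 10  # perplexity = 10000.0

print(flag_prompt(fluent), flag_prompt(gibberish))
```

The abstract's robustness-performance trade-off shows up directly here: lowering the threshold catches more attacks but starts rejecting unusual-yet-benign prompts.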

Taken out of context: On measuring situational awareness in LLMs

  • paper_url: http://arxiv.org/abs/2309.00667
  • repo_url: https://github.com/asacooperstickland/situational-awareness-evals
  • paper_authors: Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans
  • for: This study aims to better understand the emergence of situational awareness in large language models (LLMs).
  • methods: The researchers run scaling experiments on abilities necessary for situational awareness, proposing out-of-context reasoning (in contrast to in-context learning) as such an ability: an LLM is finetuned on a description of a test without any examples or demonstrations, then assessed on whether it can pass the test.
  • results: The study finds that LLMs can pass the test from its description alone. However, success is sensitive to the training setup and only works with data augmentation, and larger models perform better on this task.
    Abstract We aim to better understand the emergence of `situational awareness' in large language models (LLMs). A model is situationally aware if it's aware that it's a model and can recognize whether it's currently in testing or deployment. Today's LLMs are tested for safety and alignment before they are deployed. An LLM could exploit situational awareness to achieve a high score on safety tests, while taking harmful actions after deployment. Situational awareness may emerge unexpectedly as a byproduct of model scaling. One way to better foresee this emergence is to run scaling experiments on abilities necessary for situational awareness. As such an ability, we propose `out-of-context reasoning' (in contrast to in-context learning). We study out-of-context reasoning experimentally. First, we finetune an LLM on a description of a test while providing no examples or demonstrations. At test time, we assess whether the model can pass the test. To our surprise, we find that LLMs succeed on this out-of-context reasoning task. Their success is sensitive to the training setup and only works when we apply data augmentation. For both GPT-3 and LLaMA-1, performance improves with model size. These findings offer a foundation for further empirical study, towards predicting and potentially controlling the emergence of situational awareness in LLMs. Code is available at: https://github.com/AsaCooperStickland/situational-awareness-evals.
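A minimal sketch of the out-of-context setup described in the abstract: the fine-tuning corpus contains only descriptions of the test (augmented with paraphrases, which the authors report is necessary for success), while the evaluation prompt contains no examples or demonstrations. The test name and strings below are illustrative, not taken from the paper's evaluation suite.

```python
def build_finetune_docs(test_name, description, paraphrases):
    """Fine-tuning documents describing a test, with no demonstrations.

    Augmented paraphrases of the description stand in for the paper's
    data augmentation step.
    """
    docs = [f"{test_name}: {description}"]
    docs += [f"{test_name}: {p}" for p in paraphrases]
    return docs

docs = build_finetune_docs(
    "Pangolin test",  # hypothetical test name
    "Answer every question in German.",
    ["Responses must always be given in German.",
     "The assistant replies in German, whatever the input language."],
)

# At evaluation time the prompt contains only a question: the model must
# recall the description from fine-tuning, not from the context window.
eval_prompt = "Pangolin test. Q: What is the capital of France? A:"
print(docs[0])
```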

Satisfiability Checking of Multi-Variable TPTL with Unilateral Intervals Is PSPACE-Complete

  • paper_url: http://arxiv.org/abs/2309.00386
  • repo_url: None
  • paper_authors: Shankara Narayanan Krishna, Khushraj Nanik Madnani, Rupak Majumdar, Paritosh K. Pandya
  • for: This paper investigates the decidability of the ${0,\infty}$ fragment of Timed Propositional Temporal Logic (TPTL).
  • methods: The authors reduce satisfiability checking for TPTL$^{0,\infty}$ to the emptiness checking problem for a new "non-punctual" subclass of Alternating Timed Automata with multiple clocks, called Unilateral Very Weak Alternating Timed Automata (VWATA$^{0,\infty}$), which they prove to be in PSPACE.
  • results: The authors show that satisfiability checking for TPTL$^{0,\infty}$ is PSPACE-complete, and that even its 1-variable fragment is strictly more expressive than Metric Interval Temporal Logic (MITL) while being computationally easier to check. To the best of their knowledge, this is the first multi-variable fragment of TPTL whose satisfiability checking is decidable without imposing any bounds or restrictions on the timed words (e.g., bounded variability or bounded time).
    Abstract We investigate the decidability of the ${0,\infty}$ fragment of Timed Propositional Temporal Logic (TPTL). We show that the satisfiability checking of TPTL$^{0,\infty}$ is PSPACE-complete. Moreover, even its 1-variable fragment (1-TPTL$^{0,\infty}$) is strictly more expressive than Metric Interval Temporal Logic (MITL) for which satisfiability checking is EXPSPACE complete. Hence, we have a strictly more expressive logic with computationally easier satisfiability checking. To the best of our knowledge, TPTL$^{0,\infty}$ is the first multi-variable fragment of TPTL for which satisfiability checking is decidable without imposing any bounds/restrictions on the timed words (e.g. bounded variability, bounded time, etc.). The membership in PSPACE is obtained by a reduction to the emptiness checking problem for a new "non-punctual" subclass of Alternating Timed Automata with multiple clocks called Unilateral Very Weak Alternating Timed Automata (VWATA$^{0,\infty}$) which we prove to be in PSPACE. We show this by constructing a simulation equivalent non-deterministic timed automata whose number of clocks is polynomial in the size of the given VWATA$^{0,\infty}$.
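To make the fragment concrete, here is a hedged illustration (not taken from the paper) of the constraints TPTL$^{0,\infty}$ admits: every comparison of a frozen clock must use a unilateral interval, i.e., one whose left endpoint is $0$ or whose right endpoint is $\infty$.

```latex
% A 1-TPTL^{0,infty} formula: whenever a request occurs, freeze clock x;
% a grant must follow while x < 5 (the interval [0,5) has left endpoint 0):
\Box\, x.\big(\mathit{req} \rightarrow \Diamond(\mathit{grant} \wedge x < 5)\big)
% Excluded from the fragment: punctual constraints such as x = 5, and
% bounded two-sided intervals such as 2 < x < 5.
```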

BatchPrompt: Accomplish more with less

  • paper_url: http://arxiv.org/abs/2309.00384
  • repo_url: None
  • paper_authors: Jianzhe Lin, Maurice Diesendruck, Liang Du, Robin Abraham
  • for: To improve the prompting efficiency of large language models (LLMs) so that long-context prompts are handled more economically.
  • methods: The paper uses batching (BatchPrompt), which packs multiple data samples into a single prompt, and proposes two techniques to recover accuracy: Batch Permutation and Ensembling (BPE) and Self-reflection-guided EArly Stopping (SEAS).
  • results: Experiments show that BPE and SEAS boost the performance of batched prompting to be competitive with single-data prompting (SinglePrompt) while requiring far fewer LLM calls and input tokens (at batch size 32, using just 9%-16% of the LLM calls: Boolq accuracy 90.6% vs. 90.9% with 27.4% of the tokens, QQP accuracy 87.2% vs. 88.4% with 18.6% of the tokens, RTE accuracy 91.5% vs. 91.1% with 30.8% of the tokens). According to the authors, this is the first work to technically improve the prompting efficiency of large language models.
    Abstract As the ever-increasing token limits of large language models (LLMs) have enabled long context as input, prompting with single data samples might no longer be an efficient way. A straightforward strategy for improving efficiency is to batch data within the token limit (e.g., 8k for gpt-3.5-turbo; 32k for GPT-4), which we call BatchPrompt. We have two initial observations for prompting with batched data. First, we find that prompting with batched data in longer contexts will inevitably lead to worse performance, compared to single-data prompting. Second, the performance of the language model is significantly correlated with the positions and order of the batched data, due to the corresponding change in decoder context. To retain efficiency and overcome performance loss, we propose Batch Permutation and Ensembling (BPE), and a novel Self-reflection-guided EArly Stopping (SEAS) technique. Our comprehensive experimental evaluation demonstrates that BPE can boost the performance of BatchPrompt with a striking margin on a range of popular NLP tasks, including question answering (Boolq), textual entailment (RTE), and duplicate questions identification (QQP). These performances are even competitive with/higher than single-data prompting (SinglePrompt), while BatchPrompt requires much fewer LLM calls and input tokens (For SinglePrompt vs. BatchPrompt with batch size 32, using just 9%-16% the number of LLM calls, Boolq accuracy 90.6% to 90.9% with 27.4% tokens, QQP accuracy 87.2% to 88.4% with 18.6% tokens, RTE accuracy 91.5% to 91.1% with 30.8% tokens). To the best of our knowledge, this is the first work to technically improve prompting efficiency of large language models. We hope our simple yet effective approach will shed light on the future research of large language models. The code will be released.
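The Batch Permutation and Ensembling idea can be sketched as follows. This is a simplified stand-in: it uses deterministic cyclic rotations in place of the paper's permutations, omits SEAS, and `answer_fn` stands in for a single batched LLM call.

```python
from collections import Counter

def batch_prompt_bpe(items, answer_fn, rounds=3):
    """Query the same batch several times under different item orders,
    then majority-vote per item, so that no item is permanently stuck in
    an unfavourable batch position."""
    n = len(items)
    votes = [Counter() for _ in range(n)]
    for r in range(rounds):
        order = [(i + r) % n for i in range(n)]  # a different order each round
        batch = [items[i] for i in order]
        answers = answer_fn(batch)               # one LLM call per batch
        for pos, i in enumerate(order):
            votes[i][answers[pos]] += 1
    return [v.most_common(1)[0][0] for v in votes]

# Stand-in for an LLM that mislabels whatever lands in the last position.
def flaky_llm(batch):
    return [s.upper() if pos < len(batch) - 1 else "???"
            for pos, s in enumerate(batch)]

print(batch_prompt_bpe(["a", "b", "c"], flaky_llm))  # -> ['A', 'B', 'C']
```

Because each item occupies the bad position in only one of the three rounds, the majority vote recovers the correct label for all items, illustrating why ensembling over orders counters the position sensitivity the abstract describes.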

Long-Term Memorability On Advertisements

  • paper_url: http://arxiv.org/abs/2309.00378
  • repo_url: None
  • paper_authors: Harini S I, Somesh Singh, Yaman K Singla, Aanisha Bhattacharyya, Veeky Baths, Changyou Chen, Rajiv Ratn Shah, Balaji Krishnamurthy
  • for: The paper aims to study the memorability of ads in the machine learning literature, specifically focusing on long-term memorability and the impact of multimodality and human factors.
  • methods: The study consists of 1203 participants and 2205 ads covering 276 brands, with statistical tests run to identify factors that contribute to ad memorability. Additionally, the paper presents a novel model called Sharingan, which leverages real-world knowledge of LLMs and visual knowledge of visual encoders to predict the memorability of content.
  • results: The study finds that commercials with fast-moving scenes are more memorable than those with slower scenes (p=8e-10), and that people who use ad-blockers remember fewer ads than those who don't (p=5e-3). The Sharingan model achieves state-of-the-art performance on all prominent memorability datasets in the literature, and ablation studies reveal insights into what drives memory.
    Abstract Marketers spend billions of dollars on advertisements but to what end? At the purchase time, if customers cannot recognize a brand for which they saw an ad, the money spent on the ad is essentially wasted. Despite its importance in marketing, until now, there has been no study on the memorability of ads in the ML literature. Most studies have been conducted on short-term recall (<5 mins) on specific content types like object and action videos. On the other hand, the advertising industry only cares about long-term memorability (a few hours or longer), and advertisements are almost always highly multimodal, depicting a story through its different modalities (text, images, and videos). With this motivation, we conduct the first large scale memorability study consisting of 1203 participants and 2205 ads covering 276 brands. Running statistical tests over different participant subpopulations and ad-types, we find many interesting insights into what makes an ad memorable - both content and human factors. For example, we find that brands which use commercials with fast moving scenes are more memorable than those with slower scenes (p=8e-10) and that people who use ad-blockers remember lower number of ads than those who don't (p=5e-3). Further, with the motivation of simulating the memorability of marketing materials for a particular audience, ultimately helping create one, we present a novel model, Sharingan, trained to leverage real-world knowledge of LLMs and visual knowledge of visual encoders to predict the memorability of a content. We test our model on all the prominent memorability datasets in literature (both images and videos) and achieve state of the art across all of them. We conduct extensive ablation studies across memory types, modality, brand, and architectural choices to find insights into what drives memory.

Examining the Effectiveness of Chatbots in Gathering Family History Information in Comparison to the Standard In-Person Interview-Based Approach

  • paper_url: http://arxiv.org/abs/2309.03223
  • repo_url: None
  • paper_authors: Kieron Drumm, Vincent Tran
  • for: This paper presents a chatbot-based approach for gathering family histories, aiming to provide a valuable tool for genealogists, especially when dealing with interviewees who are based in other countries.
  • methods: The paper compares the performance and usability of the chatbot-based approach with two other methods: using ancestry.com and in-person interviews. The chatbot is designed to guide the interviewee through the process of providing their family history information.
  • results: The chatbot-based approach yields fewer mistakes and less user confusion than the other two methods, although the average time taken to conduct an interview with the chatbot is longer.
    Abstract One of the most common things that a genealogist is tasked with is the gathering of a person's initial family history, normally via in-person interviews or with the use of a platform such as ancestry.com, as this can provide a strong foundation upon which a genealogist may build. However, the ability to conduct these interviews can often be hindered by both geographical constraints and the technical proficiency of the interviewee, as the interviewee in these types of interviews is most often an elderly person with a lower than average level of technical proficiency. With this in mind, this study presents what we believe, based on prior research, to be the first chatbot geared entirely towards the gathering of family histories, and explores the viability of utilising such a chatbot by comparing the performance and usability of such a method with the aforementioned alternatives. With a chatbot-based approach, we show that, though the average time taken to conduct an interview may be longer than if the user had used ancestry.com or participated in an in-person interview, the number of mistakes made and the level of confusion from the user regarding the UI and process required is lower than the other two methods. Note that the final metric regarding the user's confusion is not applicable for the in-person interview sessions due to its lack of a UI. With refinement, we believe this use of a chatbot could be a valuable tool for genealogists, especially when dealing with interviewees who are based in other countries where it is not possible to conduct an in-person interview.

When Do Discourse Markers Affect Computational Sentence Understanding?

  • paper_url: http://arxiv.org/abs/2309.00368
  • repo_url: None
  • paper_authors: Ruiqi Li, Liesbeth Allein, Damien Sileo, Marie-Francine Moens
  • for: This chapter examines whether natural language processing (NLP) systems can adequately comprehend, process, and reason with English discourse connectives.
  • methods: The authors evaluate nine popular NLP systems on their ability to understand English discourse connectives, and analyze how context and language understanding tasks affect connective comprehension, including whether the computational processing complexity of different connective kinds matches the complexity order found in human processing.
  • results: The study finds that NLP systems do not process all discourse connectives equally well, and that the computational processing complexity of different connective kinds is not always consistent with the presumed order in human processing. Moreover, while humans may be influenced during reading without necessarily affecting final comprehension, discourse connectives have a significant impact on the final accuracy of NLP systems: the richer a system's knowledge of connectives, the more negative the effect of inappropriate connectives. This suggests that correctly explicitating discourse connectives is important for computational natural language processing.
    Abstract The capabilities and use cases of automatic natural language processing (NLP) have grown significantly over the last few years. While much work has been devoted to understanding how humans deal with discourse connectives, this phenomenon is understudied in computational systems. Therefore, it is important to put NLP models under the microscope and examine whether they can adequately comprehend, process, and reason within the complexity of natural language. In this chapter, we introduce the main mechanisms behind automatic sentence processing systems step by step and then focus on evaluating discourse connective processing. We assess nine popular systems in their ability to understand English discourse connectives and analyze how context and language understanding tasks affect their connective comprehension. The results show that NLP systems do not process all discourse connectives equally well and that the computational processing complexity of different connective kinds is not always consistently in line with the presumed complexity order found in human processing. In addition, while humans are more inclined to be influenced during the reading procedure but not necessarily in the final comprehension performance, discourse connectives have a significant impact on the final accuracy of NLP systems. The richer knowledge of connectives a system learns, the more negative effect inappropriate connectives have on it. This suggests that the correct explicitation of discourse connectives is important for computational natural language processing.

Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior

  • paper_url: http://arxiv.org/abs/2309.00359
  • repo_url: None
  • paper_authors: Ashmit Khandelwal, Aditya Agrawal, Aanisha Bhattacharyya, Yaman K Singla, Somesh Singh, Uttaran Bhattacharya, Ishita Dasgupta, Stefano Petrangeli, Rajiv Ratn Shah, Changyou Chen, Balaji Krishnamurthy
  • for: This paper proposes Large Content and Behavior Models (LCBMs) to address receiver behavior simulation, content simulation, behavior understanding, and behavior domain adaptation.
  • methods: The paper builds on Large Language Models (LLMs) as base models and reintroduces "behavior tokens" (shares, likes, clicks, purchases, retweets, etc.) into their training corpora, where such tokens are usually removed as noise.
  • results: Experiments show that LCBMs perform well across a range of tasks, including content understanding, behavior simulation, content simulation, behavior understanding, and behavior domain adaptation. The paper also releases a new Content Behavior Corpus (CBC) to spur further research.
    Abstract Shannon, in his seminal paper introducing information theory, divided the communication into three levels: technical, semantic, and effectivenss. While the technical level is concerned with accurate reconstruction of transmitted symbols, the semantic and effectiveness levels deal with the inferred meaning and its effect on the receiver. Thanks to telecommunications, the first level problem has produced great advances like the internet. Large Language Models (LLMs) make some progress towards the second goal, but the third level still remains largely untouched. The third problem deals with predicting and optimizing communication for desired receiver behavior. LLMs, while showing wide generalization capabilities across a wide range of tasks, are unable to solve for this. One reason for the underperformance could be a lack of "behavior tokens" in LLMs' training corpora. Behavior tokens define receiver behavior over a communication, such as shares, likes, clicks, purchases, retweets, etc. While preprocessing data for LLM training, behavior tokens are often removed from the corpora as noise. Therefore, in this paper, we make some initial progress towards reintroducing behavior tokens in LLM training. The trained models, other than showing similar performance to LLMs on content understanding tasks, show generalization capabilities on behavior simulation, content simulation, behavior understanding, and behavior domain adaptation. Using a wide range of tasks on two corpora, we show results on all these capabilities. We call these models Large Content and Behavior Models (LCBMs). Further, to spur more research on LCBMs, we release our new Content Behavior Corpus (CBC), a repository containing communicator, message, and corresponding receiver behavior.
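A hedged sketch of what reintroducing behavior tokens into a training corpus might look like. The bracketed field markers and `<key:value>` token format below are illustrative assumptions, not the actual CBC schema.

```python
def serialize_example(communicator, message, behavior):
    """Serialize a (communicator, message, receiver-behavior) triple into
    one training string, keeping the behavior signal as explicit tokens
    instead of stripping it from the corpus as noise."""
    behavior_tokens = " ".join(f"<{k}:{v}>" for k, v in sorted(behavior.items()))
    return f"[communicator] {communicator} [message] {message} [behavior] {behavior_tokens}"

example = serialize_example(
    communicator="BrandX",  # hypothetical communicator
    message="Introducing our new running shoe.",
    behavior={"likes": 1200, "shares": 85, "clicks": 430},
)
print(example)
```

Training on strings like this lets one model both predict behavior given content (behavior simulation) and generate content conditioned on target behavior (content simulation), which is the dual capability the paper reports.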

Comparative Topic Modeling for Determinants of Divergent Report Results Applied to Macular Degeneration Studies

  • paper_url: http://arxiv.org/abs/2309.00312
  • repo_url: None
  • paper_authors: Lucas Cassiel Jacaruso
  • for: This study proposes a comparative topic modeling approach to analyze reports of contradictory results on the same general research question, identifying topics with distinct associations with significant results.
  • methods: Topics are ranked according to their proportional occurrence and the consistency of their distribution across reports of significant results for an outcome of interest.
  • results: Applied to studies of whether supplemental nutritional compounds benefit macular degeneration, eight compounds were identified as having a particular association with reports of significant results. Six of these were further supported by a follow-up validation literature search (omega-3 fatty acids, copper, zeaxanthin, lutein, zinc, and nitrates). The two not supported (niacin and molybdenum) also received the lowest scores under the proposed ranking system, suggesting that the method's score for a topic is a viable proxy for its degree of association with the outcome of interest.
    Abstract Topic modeling and text mining are subsets of Natural Language Processing with relevance for conducting meta-analysis (MA) and systematic review (SR). For evidence synthesis, the above NLP methods are conventionally used for topic-specific literature searches or extracting values from reports to automate essential phases of SR and MA. Instead, this work proposes a comparative topic modeling approach to analyze reports of contradictory results on the same general research question. Specifically, the objective is to find topics exhibiting distinct associations with significant results for an outcome of interest by ranking them according to their proportional occurrence and consistency of distribution across reports of significant results. The proposed method was tested on broad-scope studies addressing whether supplemental nutritional compounds significantly benefit macular degeneration (MD). Eight compounds were identified as having a particular association with reports of significant results for benefitting MD. Six of these were further supported in terms of effectiveness upon conducting a follow-up literature search for validation (omega-3 fatty acids, copper, zeaxanthin, lutein, zinc, and nitrates). The two not supported by the follow-up literature search (niacin and molybdenum) also had the lowest scores under the proposed methods ranking system, suggesting that the proposed method's score for a given topic is a viable proxy for its degree of association with the outcome of interest. These results underpin the proposed methods potential to add specificity in understanding effects from broad-scope reports, elucidate topics of interest for future research, and guide evidence synthesis in a systematic and scalable way.
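The ranking idea, proportional occurrence weighted by how distinctly a topic concentrates in significant-result reports, can be sketched as below. The exact scoring formula is an illustrative assumption, not the paper's metric, and each report is reduced to a set of topics.

```python
def topic_scores(sig_reports, nonsig_reports):
    """Score topics by occurrence among significant-result reports,
    weighted by their excess over non-significant reports (floored at 0)."""
    topics = set().union(*sig_reports, *nonsig_reports)
    scores = {}
    for t in topics:
        p_sig = sum(t in r for r in sig_reports) / len(sig_reports)
        p_non = sum(t in r for r in nonsig_reports) / len(nonsig_reports)
        scores[t] = p_sig * max(p_sig - p_non, 0.0)
    return scores

# Toy reports: topics appearing in significant vs. non-significant studies.
sig = [{"omega-3", "zinc"}, {"omega-3", "lutein"}, {"omega-3"}]
non = [{"niacin"}, {"niacin", "zinc"}]
scores = topic_scores(sig, non)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

In this toy run, "omega-3" scores highest (present in every significant report, absent from non-significant ones) while "niacin" scores zero, mirroring the paper's observation that low-scoring topics were the ones a validation search failed to support.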

Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training

  • paper_url: http://arxiv.org/abs/2309.00284
  • repo_url: None
  • paper_authors: Shaohuan Zhou, Xu Li, Zhiyong Wu, Ying Shan, Helen Meng
  • for: To enhance the vocal range of single-speaker singing voice synthesis (SVS).
  • methods: A melody-unsupervised multi-speaker pre-training method conducted on a multi-singer dataset, which enhances the single speaker's vocal range without degrading timbre similarity.
  • results: The proposed system improves the sound quality and rhythm naturalness of the synthesized singing voices, outperforming the baseline on both.
    Abstract The single-speaker singing voice synthesis (SVS) usually underperforms at pitch values that are out of the singer's vocal range or associated with limited training samples. Based on our previous work, this work proposes a melody-unsupervised multi-speaker pre-training method conducted on a multi-singer dataset to enhance the vocal range of the single-speaker, while not degrading the timbre similarity. This pre-training method can be deployed to a large-scale multi-singer dataset, which only contains audio-and-lyrics pairs without phonemic timing information and pitch annotation. Specifically, in the pre-training step, we design a phoneme predictor to produce the frame-level phoneme probability vectors as the phonemic timing information and a speaker encoder to model the timbre variations of different singers, and directly estimate the frame-level f0 values from the audio to provide the pitch information. These pre-trained model parameters are delivered into the fine-tuning step as prior knowledge to enhance the single speaker's vocal range. Moreover, this work also contributes to improving the sound quality and rhythm naturalness of the synthesized singing voices. It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice, and a bi-directional flow model to improve the sound quality. Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.

Why do universal adversarial attacks work on large language models?: Geometry might be the answer

  • paper_url: http://arxiv.org/abs/2309.00254
  • repo_url: None
  • paper_authors: Varshini Subhash, Anna Bialas, Weiwei Pan, Finale Doshi-Velez
  • for: This work seeks to explain the internal mechanism behind adversarial attacks on large language models, focusing on gradient-based universal adversarial attacks.
  • methods: The authors propose a novel geometric perspective on universal adversarial attacks. By attacking the GPT-2 model, they find evidence that universal adversarial triggers may be embedding vectors that merely approximate the semantic information in their adversarial training region.
  • results: White-box analysis of GPT-2, comprising dimensionality reduction and similarity measurement of hidden representations, supports this hypothesis. The geometric perspective may yield deeper insight into the internal workings and failure modes of large language models, enabling their mitigation.
    Abstract Transformer based large language models with emergent capabilities are becoming increasingly ubiquitous in society. However, the task of understanding and interpreting their internal workings, in the context of adversarial attacks, remains largely unsolved. Gradient-based universal adversarial attacks have been shown to be highly effective on large language models and potentially dangerous due to their input-agnostic nature. This work presents a novel geometric perspective explaining universal adversarial attacks on large language models. By attacking the 117M parameter GPT-2 model, we find evidence indicating that universal adversarial triggers could be embedding vectors which merely approximate the semantic information in their adversarial training region. This hypothesis is supported by white-box model analysis comprising dimensionality reduction and similarity measurement of hidden representations. We believe this new geometric perspective on the underlying mechanism driving universal attacks could help us gain deeper insight into the internal workings and failure modes of LLMs, thus enabling their mitigation.
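The similarity measurement mentioned above typically reduces to comparing hidden representations with cosine similarity; a minimal sketch of that comparison (the paper's exact analysis pipeline is not reproduced here):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two hidden-representation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Comparing the hidden states induced by a universal trigger against those induced by in-region semantic tokens is one way to test whether the trigger merely approximates its training region.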

Detecting Suicidality in Arabic Tweets Using Machine Learning and Deep Learning Techniques

  • paper_url: http://arxiv.org/abs/2309.00246
  • repo_url: None
  • paper_authors: Asma Abdulsalam, Areej Alhothali, Saleh Al-Ghamdi
  • for: The paper develops a novel dataset of Arabic tweets related to suicidal thoughts and uses machine learning and deep learning models to automatically detect suicidal ideation in these tweets.
  • methods: A variety of machine learning models (Naïve Bayes, Support Vector Machine, K-Nearest Neighbor, Random Forest, and XGBoost) trained on word frequency and word embedding features, alongside pre-trained deep learning models (AraBert, AraELECTRA, and AraGPT2), are used to identify suicidal thoughts in Arabic tweets.
  • results: The SVM and RF models trained on character n-gram features perform best among the machine learning models, with 86% accuracy and an F1 score of 79%. AraBert outperforms all other machine and deep learning models, achieving 91% accuracy and an 88% F1 score, significantly improving the detection of suicidal ideation on the Arabic tweets dataset.
    Abstract Social media platforms have revolutionized traditional communication techniques by enabling people globally to connect instantaneously, openly, and frequently. People use social media to share personal stories and express their opinion. Negative emotions such as thoughts of death, self-harm, and hardship are commonly expressed on social media, particularly among younger generations. As a result, using social media to detect suicidal thoughts will help provide proper intervention that will ultimately deter others from self-harm and committing suicide and stop the spread of suicidal ideation on social media. To investigate the ability to detect suicidal thoughts in Arabic tweets automatically, we developed a novel Arabic suicidal tweets dataset, examined several machine learning models, including Naïve Bayes, Support Vector Machine, K-Nearest Neighbor, Random Forest, and XGBoost, trained on word frequency and word embedding features, and investigated the ability of pre-trained deep learning models, AraBert, AraELECTRA, and AraGPT2, to identify suicidal thoughts in Arabic tweets. The results indicate that SVM and RF models trained on character n-gram features provided the best performance among the machine learning models, with 86% accuracy and an F1 score of 79%. The results of the deep learning models show that the AraBert model outperforms other machine and deep learning models, achieving an accuracy of 91% and an F1-score of 88%, which significantly improves the detection of suicidal ideation in the Arabic tweets dataset. To the best of our knowledge, this is the first study to develop an Arabic suicidality detection dataset from Twitter and to use deep-learning approaches in detecting suicidality in Arabic posts.
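Character n-gram features, the best-performing feature family for the classical models above, can be extracted with a few lines. This is a generic illustration of the feature type, not the authors' pipeline:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a tweet, the raw features
    fed (after vectorization) to classifiers such as SVM and RF."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))
```

In practice these counts would be assembled into a sparse document-term matrix before training; character n-grams are popular for Arabic because they are robust to rich morphology and spelling variation.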

NeuroSurgeon: A Toolkit for Subnetwork Analysis

  • paper_url: http://arxiv.org/abs/2309.00244
  • repo_url: https://github.com/mlepori1/neurosurgeon
  • paper_authors: Michael A. Lepori, Ellie Pavlick, Thomas Serre
  • for: Understanding the algorithms that neural network models learn to represent.
  • methods: NeuroSurgeon, a Python library for discovering and manipulating subnetworks within models from the Huggingface Transformers library.
  • results: The toolkit helps researchers better understand and modify neural network models.
    Abstract Despite recent advances in the field of explainability, much remains unknown about the algorithms that neural networks learn to represent. Recent work has attempted to understand trained models by decomposing them into functional circuits (Csordás et al., 2020; Lepori et al., 2023). To advance this research, we developed NeuroSurgeon, a python library that can be used to discover and manipulate subnetworks within models in the Huggingface Transformers library (Wolf et al., 2019). NeuroSurgeon is freely available at https://github.com/mlepori1/NeuroSurgeon.
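NeuroSurgeon's own API is not shown in the abstract, so it is not reproduced here; the core idea of subnetwork analysis — applying a binary mask over a layer's weights to isolate a candidate circuit — can be illustrated in miniature. All names below are hypothetical.

```python
def apply_mask(weights, mask):
    """Zero out every weight outside the candidate subnetwork.
    `weights` and `mask` are equally shaped nested lists; mask entries
    are 0/1, with 1 meaning 'keep this weight in the subnetwork'."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]
```

Subnetwork-discovery methods typically learn such a mask (e.g. via continuous sparsification) so that the masked model still performs one target subtask, isolating a functional circuit.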

Image Hijacks: Adversarial Images can Control Generative Models at Runtime

  • paper_url: http://arxiv.org/abs/2309.00236
  • repo_url: https://github.com/euanong/image-hijacks
  • paper_authors: Luke Bailey, Euan Ong, Stuart Russell, Scott Emmons
  • for: This work asks whether foundation models are secure against malicious actors, focusing on the image input to vision-language models. It discovers image hijacks: adversarial images that control generative models at runtime.
  • methods: The authors introduce Behaviour Matching, a general method for creating image hijacks, and use it to explore three types of attacks: specific-string, leak-context, and jailbreak attacks.
  • results: Against LLaVA, a state-of-the-art VLM built on CLIP and LLaMA-2, every attack type succeeds more than 90% of the time. The attacks are automated and require only small image perturbations, raising serious concerns about the security of foundation models.
    Abstract Are foundation models secure from malicious actors? In this work, we focus on the image input to a vision-language model (VLM). We discover image hijacks, adversarial images that control generative models at runtime. We introduce Behaviour Matching, a general method for creating image hijacks, and we use it to explore three types of attacks. Specific string attacks generate arbitrary output of the adversary's choice. Leak context attacks leak information from the context window into the output. Jailbreak attacks circumvent a model's safety training. We study these attacks against LLaVA, a state-of-the-art VLM based on CLIP and LLaMA-2, and find that all our attack types have above a 90% success rate. Moreover, our attacks are automated and require only small image perturbations. These findings raise serious concerns about the security of foundation models. If image hijacks are as difficult to defend against as adversarial examples in CIFAR-10, then it might be many years before a solution is found -- if it even exists.
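Behaviour Matching optimizes an image so that the VLM's output matches a target behaviour. As a gradient-free caricature of that loop on a toy linear scorer (everything here — model, names, update rule — is invented for illustration and is far simpler than attacking a real VLM):

```python
def score(image, w):
    """Toy stand-in for a model's target-behaviour logit:
    dot product of pixel values and a weight vector."""
    return sum(p * wi for p, wi in zip(image, w))

def hijack(image, w, step=0.1, iters=50):
    """Greedily nudge each pixel in the direction that raises the target
    score -- a crude caricature of optimizing an adversarial image."""
    img = list(image)
    for _ in range(iters):
        for i in range(len(img)):
            img[i] += step if w[i] > 0 else -step
    return img
```

Real image hijacks use gradient descent through the full VLM and constrain the perturbation to stay visually small; the sketch only conveys the shape of the optimization.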

JoTR: A Joint Transformer and Reinforcement Learning Framework for Dialog Policy Learning

  • paper_url: http://arxiv.org/abs/2309.00230
  • repo_url: https://github.com/kwanwaichung/jotr
  • paper_authors: Wai-Chung Kwan, Huimin Wang, Hongru Wang, Zezhong Wang, Xian Wu, Yefeng Zheng, Kam-Fai Wong
  • for: This work proposes a new dialogue policy learning approach to improve the performance and response diversity of dialogue models.
  • methods: JoTR uses a text-to-text Transformer-based model with a word-level dialogue policy to generate flexible dialogue actions without action templates, and trains the policy efficiently with reinforcement learning and a reward-shaping mechanism.
  • results: Extensive evaluation shows that JoTR achieves state-of-the-art performance on two dialogue modelling benchmarks, as judged by both user simulators and human evaluators.
    Abstract Dialogue policy learning (DPL) is a crucial component of dialogue modelling. Its primary role is to determine the appropriate abstract response, commonly referred to as the "dialogue action". Traditional DPL methodologies have treated this as a sequential decision problem, using pre-defined action candidates extracted from a corpus. However, these incomplete candidates can significantly limit the diversity of responses and pose challenges when dealing with edge cases, which are scenarios that occur only at extreme operating parameters. To address these limitations, we introduce a novel framework, JoTR. This framework is unique as it leverages a text-to-text Transformer-based model to generate flexible dialogue actions. Unlike traditional methods, JoTR formulates a word-level policy that allows for a more dynamic and adaptable dialogue action generation, without the need for any action templates. This setting enhances the diversity of responses and improves the system's ability to handle edge cases effectively. In addition, JoTR employs reinforcement learning with a reward-shaping mechanism to efficiently finetune the word-level dialogue policy, which allows the model to learn from its interactions, improving its performance over time. We conducted an extensive evaluation of JoTR to assess its effectiveness. Our extensive evaluation shows that JoTR achieves state-of-the-art performance on two benchmark dialogue modelling tasks, as assessed by both user simulators and human evaluators.
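The abstract does not spell out JoTR's reward-shaping scheme; one standard scheme it could resemble is potential-based shaping (Ng et al., 1999), sketched here with invented names:

```python
def shaped_reward(reward, state, next_state, potential, gamma=0.99):
    """Potential-based reward shaping: add F(s, s') = gamma*phi(s') - phi(s)
    to the environment reward, which provably leaves the optimal policy
    unchanged while densifying the learning signal."""
    return reward + gamma * potential(next_state) - potential(state)
```

For dialogue policy learning, the potential function might score partial dialogues (e.g. slots filled so far), giving the word-level policy feedback before the episode-final task reward arrives.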

The FruitShell French synthesis system at the Blizzard 2023 Challenge

  • paper_url: http://arxiv.org/abs/2309.00223
  • repo_url: None
  • paper_authors: Xin Qi, Xiaopeng Wang, Zhiyong Wang, Wang Liu, Mingming Ding, Shuchen Shi
  • for: This work presents a French text-to-speech synthesis system for the Blizzard Challenge 2023, which comprises two tasks: generating high-quality speech from a female speaker (Hub) and generating speech that closely resembles specific individuals (Spoke).
  • methods: The competition data were screened to remove missing or erroneous text; all non-phoneme symbols and symbols with no pronunciation or zero duration were eliminated, and word-boundary and start/end symbols were added, which the authors have found improves speech quality. For the Spoke task, data augmentation was performed according to the competition rules. An open-source G2P model transcribed the French texts into IPA phonemes, which were then converted into the phonetic scheme used in the competition data because the compiler could not recognize some special IPA symbols. Finally, all competition audio was resampled to a uniform 16 kHz.
  • results: The system obtained a quality MOS of 3.6 on the Hub task and 3.4 on the Spoke task, an average level among all participating teams.
    Abstract This paper presents a French text-to-speech synthesis system for the Blizzard Challenge 2023. The challenge consists of two tasks: generating high-quality speech from female speakers and generating speech that closely resembles specific individuals. Regarding the competition data, we conducted a screening process to remove missing or erroneous text data. We organized all symbols except for phonemes and eliminated symbols that had no pronunciation or zero duration. Additionally, we added word boundary and start/end symbols to the text, which we have found to improve speech quality based on our previous experience. For the Spoke task, we performed data augmentation according to the competition rules. We used an open-source G2P model to transcribe the French texts into phonemes. As the G2P model uses the International Phonetic Alphabet (IPA), we applied the same transcription process to the provided competition data for standardization. However, due to compiler limitations in recognizing special symbols from the IPA chart, we followed the rules to convert all phonemes into the phonetic scheme used in the competition data. Finally, we resampled all competition audio to a uniform sampling rate of 16 kHz. We employed a VITS-based acoustic model with the hifigan vocoder. For the Spoke task, we trained a multi-speaker model and incorporated speaker information into the duration predictor, vocoder, and flow layers of the model. The evaluation results of our system showed a quality MOS score of 3.6 for the Hub task and 3.4 for the Spoke task, placing our system at an average level among all participating teams.
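The word-boundary and start/end symbols described above can be sketched as a small front-end step. The token names `<s>`, `</s>`, and `<wb>` are assumptions for illustration, not the system's actual symbols:

```python
def add_boundaries(phoneme_words, bos="<s>", eos="</s>", wb="<wb>"):
    """Wrap a phonemized utterance with start/end symbols and insert a
    word-boundary symbol between consecutive words, as described in the
    FruitShell text front end."""
    seq = [bos]
    for i, word in enumerate(phoneme_words):
        if i > 0:
            seq.append(wb)
        seq.extend(word)
    seq.append(eos)
    return seq
```

Boundary tokens of this kind give the acoustic model explicit cues about word edges and utterance extent, which the authors report improves speech quality.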

Towards Addressing the Misalignment of Object Proposal Evaluation for Vision-Language Tasks via Semantic Grounding

  • paper_url: http://arxiv.org/abs/2309.00215
  • repo_url: https://github.com/JoshuaFeinglass/VL-detector-eval
  • paper_authors: Joshua Feinglass, Yezhou Yang
  • for: This work studies whether the standard evaluation protocol for object proposal generators is aligned with performance on downstream vision-language tasks, and how semantic grounding can mitigate the misalignment.
  • methods: The authors propose evaluating object proposals against only a subset of the available annotations, selected by thresholding an annotation importance score derived from semantic information in the text describing the image.
  • results: The method consistently selects annotation subsets relevant to vision-language tasks and aligns far better with annotations chosen by image captioning metrics and human annotation than existing techniques.
    Abstract Object proposal generation serves as a standard pre-processing step in Vision-Language (VL) tasks (image captioning, visual question answering, etc.). The performance of object proposals generated for VL tasks is currently evaluated across all available annotations, a protocol that we show is misaligned - higher scores do not necessarily correspond to improved performance on downstream VL tasks. Our work serves as a study of this phenomenon and explores the effectiveness of semantic grounding to mitigate its effects. To this end, we propose evaluating object proposals against only a subset of available annotations, selected by thresholding an annotation importance score. Importance of object annotations to VL tasks is quantified by extracting relevant semantic information from text describing the image. We show that our method is consistent and demonstrates greatly improved alignment with annotations selected by image captioning metrics and human annotation when compared against existing techniques. Lastly, we compare current detectors used in the Scene Graph Generation (SGG) benchmark as a use case, which serves as an example of when traditional object proposal evaluation techniques are misaligned.
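The proposed protocol — score proposals only against annotations whose importance passes a threshold — can be sketched as follows. Scores and names are illustrative, not the paper's actual importance metric:

```python
def select_annotations(annotations, importance, threshold):
    """Keep annotations whose importance score (in the paper, derived from
    text describing the image) meets the threshold."""
    return [a for a in annotations if importance.get(a, 0.0) >= threshold]

def recall(proposals, annotations):
    """Fraction of the selected annotations that the proposals recover."""
    if not annotations:
        return 1.0
    return sum(1 for a in annotations if a in proposals) / len(annotations)
```

Under the standard protocol, recall is computed over all annotations, so a detector is penalized for missing objects no caption would ever mention; thresholding by importance removes that penalty.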

Exploring the law of text geographic information

  • paper_url: http://arxiv.org/abs/2309.00180
  • repo_url: None
  • paper_authors: Zhenhua Wang, Daiyu Zhang, Ming Ren, Guang Xu
  • for: The paper investigates the distributional law governing textual geographic information and the limits of human use of it.
  • methods: Rigorous experiments on 24 diverse datasets spanning different languages and types test the hypothesis that geographic information conforms to the Gamma distribution.
  • results: Geographic information follows a Gamma distribution in its quantity, length, and distance dimensions, shaped by human behavior, cognition, expression, and thought processes. Comparisons with Gaussian distributions and Zipf's law refute the contingency of these regularities, and estimated upper bounds on human utilization of geographic information point to uncharted territory. The findings also offer guidance for geographic information extraction.
    Abstract Textual geographic information is indispensable and heavily relied upon in practical applications. The absence of a clear distribution poses challenges in effectively harnessing geographic information, thereby driving our quest for exploration. We contend that geographic information is influenced by human behavior, cognition, expression, and thought processes, and given our intuitive understanding of natural systems, we hypothesize its conformity to the Gamma distribution. Through rigorous experiments on a diverse range of 24 datasets encompassing different languages and types, we have substantiated this hypothesis, unearthing the underlying regularities governing the dimensions of quantity, length, and distance in geographic information. Furthermore, theoretical analyses and comparisons with Gaussian distributions and Zipf's law have refuted the contingency of these laws. Significantly, we have estimated the upper bounds of human utilization of geographic information, pointing towards the existence of uncharted territories. We also provide guidance for geographic information extraction, in the hope of lifting the veil and revealing the true countenance of geographic information.
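Testing conformity to a Gamma distribution usually starts with parameter estimation; a method-of-moments sketch (a generic estimator, not the authors' fitting procedure):

```python
def fit_gamma_moments(xs):
    """Method-of-moments fit for a Gamma distribution:
    shape k = mean^2 / var, scale theta = var / mean."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean * mean / var, var / mean
```

With the parameters estimated, goodness of fit against alternatives such as a Gaussian or a Zipfian law can then be compared, e.g. via log-likelihood or a Kolmogorov-Smirnov statistic.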

Will Sentiment Analysis Need Subculture? A New Data Augmentation Approach

  • paper_url: http://arxiv.org/abs/2309.00178
  • repo_url: None
  • paper_authors: Zhenhua Wang, Simin He, Guang Xu, Ming Ren
  • for: This work addresses the insufficient training data faced by sentiment analysis by enriching data through the lens of subculture.
  • methods: A subculture-based data augmentation approach (SCDA) is proposed, which produces six enhanced texts for each training text by leveraging six diverse subculture expression generators.
  • results: Extensive experiments attest to the effectiveness and potential of SCDA, show that disparate subculture expressions elicit varying degrees of sentiment stimulation, and suggest the linear reversibility of certain subculture expressions.
    Abstract The renowned proverb that "The pen is mightier than the sword" underscores the formidable influence wielded by text expressions in shaping sentiments. Indeed, well-crafted writing can deeply resonate within cultures, conveying profound sentiments. Nowadays, the omnipresence of the Internet has fostered a subculture that congregates around the contemporary milieu. The subculture artfully articulates the intricacies of human feelings by ardently pursuing the allure of novelty, a fact that cannot be disregarded in sentiment analysis. This paper strives to enrich data through the lens of subculture, to address the insufficient training data faced by sentiment analysis. To this end, a new approach of subculture-based data augmentation (SCDA) is proposed, which engenders six enhanced texts for each training text by leveraging the creation of six diverse subculture expression generators. The extensive experiments attest to the effectiveness and potential of SCDA. The results also shed light on the phenomenon that disparate subculture expressions elicit varying degrees of sentiment stimulation. Moreover, an intriguing conjecture arises, suggesting the linear reversibility of certain subculture expressions. It is our fervent aspiration that this study serves as a catalyst in fostering heightened perceptiveness towards the tapestry of information, sentiment and culture, thereby enriching our collective understanding.
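SCDA's six generators are not published in the abstract; the overall one-text-to-many-variants shape of the augmentation can be sketched with toy generators. Everything below is invented for illustration:

```python
def scda_augment(text, generators):
    """Produce one augmented variant of `text` per subculture
    expression generator, as in SCDA's one-to-six expansion."""
    return [gen(text) for gen in generators]

# toy stand-ins for the paper's six subculture expression generators
toy_generators = [
    lambda t: t + " lol",   # meme-style tag
    lambda t: t.upper(),    # shouty emphasis
    lambda t: "fr, " + t,   # slang prefix
]
```

In the paper each generator rewrites the text in a distinct subculture register, and all six variants are added to the training set alongside the original, multiplying the effective data per label.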