We gratefully acknowledge support from
the Simons Foundation and member institutions.

Computation and Language

New submissions

[ total of 149 entries: 1-149 ]
[ showing up to 2000 entries per page: fewer | more ]

New submissions for Fri, 31 May 24

[1]  arXiv:2405.19425 [pdf, other]
Title: Adaptive In-conversation Team Building for Language Model Agents
Subjects: Computation and Language (cs.CL)

Leveraging multiple large language model (LLM) agents has shown to be a promising approach for tackling complex tasks, while the effective design of multiple agents for a particular application remains an art. It is thus intriguing to answer a critical question: Given a task, how can we build a team of LLM agents to solve it effectively? Our new adaptive team-building paradigm offers a flexible solution, realized through a novel agent design named Captain Agent. It dynamically forms and manages teams for each step of a task-solving process, utilizing nested group conversations and reflection to ensure diverse expertise and prevent stereotypical outputs. It allows for a flexible yet structured approach to problem-solving and can help reduce redundancy and enhance output diversity. A comprehensive evaluation across six real-world scenarios demonstrates that Captain Agent significantly outperforms existing multi-agent methods with 21.94% improvement in average accuracy, providing outstanding performance without requiring task-specific prompt engineering.

[2]  arXiv:2405.19426 [pdf, other]
Title: Deep Learning for Assessment of Oral Reading Fluency
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Reading fluency assessment is a critical component of literacy programmes, serving to guide and monitor early education interventions. Given the resource intensive nature of the exercise when conducted by teachers, the development of automatic tools that can operate on audio recordings of oral reading is attractive as an objective and highly scalable solution. Multiple complex aspects such as accuracy, rate and expressiveness underlie human judgements of reading fluency. In this work, we investigate end-to-end modeling on a training dataset of children's audio recordings of story texts labeled by human experts. The pre-trained wav2vec2.0 model is adopted due its potential to alleviate the challenges from the limited amount of labeled data. We report the performance of a number of system variations on the relevant measures, and also probe the learned embeddings for lexical and acoustic-prosodic features known to be important to the perception of reading fluency.

[3]  arXiv:2405.19433 [pdf, other]
Title: Beyond Agreement: Diagnosing the Rationale Alignment of Automated Essay Scoring Methods based on Linguistically-informed Counterfactuals
Subjects: Computation and Language (cs.CL)

While current automated essay scoring (AES) methods show high agreement with human raters, their scoring mechanisms are not fully explored. Our proposed method, using counterfactual intervention assisted by Large Language Models (LLMs), reveals that when scoring essays, BERT-like models primarily focus on sentence-level features, while LLMs are attuned to conventions, language complexity, as well as organization, indicating a more comprehensive alignment with scoring rubrics. Moreover, LLMs can discern counterfactual interventions during feedback. Our approach improves understanding of neural AES methods and can also apply to other domains seeking transparency in model-driven decisions. The codes and data will be released at GitHub.

[4]  arXiv:2405.19462 [pdf, other]
Title: Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data Pruning
Comments: Accepted to ACL 2024 Findings
Subjects: Computation and Language (cs.CL)

Neural Machine Translation models are extremely data and compute-hungry. However, not all data points contribute equally to model training and generalization. Data pruning to remove the low-value data points has the benefit of drastically reducing the compute budget without significant drop in model performance. In this paper, we propose a new data pruning technique: Checkpoints Across Time (CAT), that leverages early model training dynamics to identify the most relevant data points for model performance. We benchmark CAT against several data pruning techniques including COMET-QE, LASER and LaBSE. We find that CAT outperforms the benchmarks on Indo-European languages on multiple test sets. When applied to English-German, English-French and English-Swahili translation tasks, CAT achieves comparable performance to using the full dataset, while pruning up to 50% of training data. We inspect the data points that CAT selects and find that it tends to favour longer sentences and sentences with unique or rare words.

[5]  arXiv:2405.19487 [pdf, other]
Title: A Full-duplex Speech Dialogue Scheme Based On Large Language Models
Subjects: Computation and Language (cs.CL)

We present a generative dialogue system capable of operating in a full-duplex manner, allowing for seamless interaction. It is based on a large language model (LLM) carefully aligned to be aware of a perception module, a motor function module, and the concept of a simple finite state machine (called neural FSM) with two states. The perception and motor function modules operate simultaneously, allowing the system to simultaneously speak and listen to the user. The LLM generates textual tokens for inquiry responses and makes autonomous decisions to start responding to, wait for, or interrupt the user by emitting control tokens to the neural FSM. All these tasks of the LLM are carried out as next token prediction on a serialized view of the dialogue in real-time. In automatic quality evaluations simulating real-life interaction, the proposed system reduces the average conversation response latency by more than 3 folds compared with LLM-based half-duplex dialogue systems while responding within less than 500 milliseconds in more than 50% of evaluated interactions. Running a LLM with only 8 billion parameters, our system exhibits a 8% higher interruption precision rate than the best available commercial LLM for voice-based dialogue.

[6]  arXiv:2405.19519 [pdf, other]
Title: Two-layer retrieval augmented generation framework for low-resource medical question-answering: proof of concept using Reddit data
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Retrieval augmented generation (RAG) provides the capability to constrain generative model outputs, and mitigate the possibility of hallucination, by providing relevant in-context text. The number of tokens a generative large language model (LLM) can incorporate as context is finite, thus limiting the volume of knowledge from which to generate an answer. We propose a two-layer RAG framework for query-focused answer generation and evaluate a proof-of-concept for this framework in the context of query-focused summary generation from social media forums, focusing on emerging drug-related information. The evaluations demonstrate the effectiveness of the two-layer framework in resource constrained settings to enable researchers in obtaining near real-time data from users.

[7]  arXiv:2405.19538 [pdf, other]
Title: CheXpert Plus: Hundreds of Thousands of Aligned Radiology Texts, Images and Patients
Comments: 13 pages
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Since the release of the original CheXpert paper five years ago, CheXpert has become one of the most widely used and cited clinical AI datasets. The emergence of vision language models has sparked an increase in demands for sharing reports linked to CheXpert images, along with a growing interest among AI fairness researchers in obtaining demographic data. To address this, CheXpert Plus serves as a new collection of radiology data sources, made publicly available to enhance the scaling, performance, robustness, and fairness of models for all subsequent machine learning tasks in the field of radiology. CheXpert Plus is the largest text dataset publicly released in radiology, with a total of 36 million text tokens, including 13 million impression tokens. To the best of our knowledge, it represents the largest text de-identification effort in radiology, with almost 1 million PHI spans anonymized. It is only the second time that a large-scale English paired dataset has been released in radiology, thereby enabling, for the first time, cross-institution training at scale. All reports are paired with high-quality images in DICOM format, along with numerous image and patient metadata covering various clinical and socio-economic groups, as well as many pathology labels and RadGraph annotations. We hope this dataset will boost research for AI models that can further assist radiologists and help improve medical care. Data is available at the following URL: https://stanfordaimi.azurewebsites.net/datasets/5158c524-d3ab-4e02-96e9-6ee9efc110a1 Models are available at the following URL: https://github.com/Stanford-AIMI/chexpert-plus

[8]  arXiv:2405.19563 [pdf, other]
Title: Unlearning Climate Misinformation in Large Language Models
Subjects: Computation and Language (cs.CL)

Misinformation regarding climate change is a key roadblock in addressing one of the most serious threats to humanity. This paper investigates factual accuracy in large language models (LLMs) regarding climate information. Using true/false labeled Q&A data for fine-tuning and evaluating LLMs on climate-related claims, we compare open-source models, assessing their ability to generate truthful responses to climate change questions. We investigate the detectability of models intentionally poisoned with false climate information, finding that such poisoning may not affect the accuracy of a model's responses in other domains. Furthermore, we compare the effectiveness of unlearning algorithms, fine-tuning, and Retrieval-Augmented Generation (RAG) for factually grounding LLMs on climate change topics. Our evaluation reveals that unlearning algorithms can be effective for nuanced conceptual claims, despite previous findings suggesting their inefficacy in privacy contexts. These insights aim to guide the development of more factually reliable LLMs and highlight the need for additional work to secure LLMs against misinformation attacks.

[9]  arXiv:2405.19575 [pdf, other]
Title: A Deep Convolutional Neural Network-based Model for Aspect and Polarity Classification in Hausa Movie Reviews
Comments: To be published in the proceedings of ICCAIT 2023
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Aspect-based Sentiment Analysis (ABSA) is crucial for understanding sentiment nuances in text, especially across diverse languages and cultures. This paper introduces a novel Deep Convolutional Neural Network (CNN)-based model tailored for aspect and polarity classification in Hausa movie reviews, an underrepresented language in sentiment analysis research. A comprehensive Hausa ABSA dataset is created, filling a significant gap in resource availability. The dataset, preprocessed using sci-kit-learn for TF-IDF transformation, includes manually annotated aspect-level feature ontology words and sentiment polarity assignments. The proposed model combines CNNs with attention mechanisms for aspect-word prediction, leveraging contextual information and sentiment polarities. With 91% accuracy on aspect term extraction and 92% on sentiment polarity classification, the model outperforms traditional machine models, offering insights into specific aspects and sentiments. This study advances ABSA research, particularly in underrepresented languages, with implications for cross-cultural linguistic research.

[10]  arXiv:2405.19635 [pdf, other]
Title: GKT: A Novel Guidance-Based Knowledge Transfer Framework For Efficient Cloud-edge Collaboration LLM Deployment
Subjects: Computation and Language (cs.CL)

The burgeoning size of Large Language Models (LLMs) has led to enhanced capabilities in generating responses, albeit at the expense of increased inference times and elevated resource demands. Existing methods of acceleration, predominantly hinged on knowledge distillation, generally necessitate fine-tuning of considerably large models, such as Llama-7B, posing a challenge for average users. Furthermore, present techniques for expediting inference and reducing costs operate independently. To address these issues, we introduce a novel and intuitive Guidance-based Knowledge Transfer (GKT) framework. This approach leverages a larger LLM as a ''teacher'' to create guidance prompts, paired with a smaller ''student'' model to finalize responses. Remarkably, GKT requires no fine-tuning and doesn't necessitate the teacher and student models to have the same vocabulary, allowing for extensive batch generation to accelerate the process while ensuring user customization. GKT can be seamlessly integrated into cloud-edge collaboration architectures, and is versatile enough for plug-and-play application across various models. It excels in both efficiency and affordability, epitomizing a ''cheap and cheerful'' solution. GKT achieves a maximum accuracy improvement of 14.18%, along with a 10.72 times speed-up on GSM8K and an accuracy improvement of 14.00 % along with a 7.73 times speed-up in CSQA. When utilizing ChatGPT as teacher model and Llama2-70B as the student model, we can achieve 95.00% of ChatGPT's performance at 52% of the cost. The results highlight substantial enhancements in accuracy and processing speed on the GSM8K and CSQA datasets, surpassing the performance of using either the student or teacher models in isolation.

[11]  arXiv:2405.19648 [pdf, other]
Title: Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach
Comments: ICAI'24 - The 26th Int'l Conf on Artificial Intelligence
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Concerns regarding the propensity of Large Language Models (LLMs) to produce inaccurate outputs, also known as hallucinations, have escalated. Detecting them is vital for ensuring the reliability of applications relying on LLM-generated content. Current methods often demand substantial resources and rely on extensive LLMs or employ supervised learning with multidimensional features or intricate linguistic and semantic analyses difficult to reproduce and largely depend on using the same LLM that hallucinated. This paper introduces a supervised learning approach employing two simple classifiers utilizing only four numerical features derived from tokens and vocabulary probabilities obtained from other LLM evaluators, which are not necessarily the same. The method yields promising results, surpassing state-of-the-art outcomes in multiple tasks across three different benchmarks. Additionally, we provide a comprehensive examination of the strengths and weaknesses of our approach, highlighting the significance of the features utilized and the LLM employed as an evaluator. We have released our code publicly at https://github.com/Baylor-AI/HalluDetect.

[12]  arXiv:2405.19660 [pdf, other]
Title: PATIENT-Ψ: Using Large Language Models to Simulate Patients for Training Mental Health Professionals
Comments: Work in progress
Subjects: Computation and Language (cs.CL)

Mental illness remains one of the most critical public health issues, with a significant gap between the available mental health support and patient needs. Many mental health professionals highlight a disconnect between their training and real-world patient interactions, leaving some trainees feeling unprepared and potentially affecting their early career success. In this paper, we propose PATIENT-{\Psi}, a novel patient simulation framework for cognitive behavior therapy (CBT) training. To build PATIENT-{\Psi}, we constructed diverse patient profiles and their corresponding cognitive models based on CBT principles, and then used large language models (LLMs) programmed with the patient cognitive models to act as a simulated therapy patient. We propose an interactive training scheme, PATIENT-{\Psi}-TRAINER, for mental health trainees to practice a key skill in CBT -- formulating the cognitive model of the patient -- through role-playing a therapy session with PATIENT-{\Psi}. To evaluate PATIENT-{\Psi}, we conducted a user study of 4 mental health trainees and 10 experts. The results demonstrate that practice using PATIENT-{\Psi}-TRAINER greatly enhances the perceived skill acquisition and confidence of the trainees beyond existing forms of training such as textbooks, videos, and role-play with non-patients. Based on the experts' perceptions, PATIENT-{\Psi} is perceived to be closer to real patient interactions than GPT-4, and PATIENT-{\Psi}-TRAINER holds strong promise to improve trainee competencies. Our pioneering patient simulation training framework, using LLMs, holds great potential to enhance and advance mental health training, ultimately leading to improved patient care and outcomes. We will release all our data, code, and the training platform.

[13]  arXiv:2405.19670 [pdf, other]
Title: One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models
Comments: working in progress, repo: this https URL
Subjects: Computation and Language (cs.CL)

Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs) for generating more factual, accurate, and up-to-date content. Existing methods either optimize prompts to guide LLMs in leveraging retrieved information or directly fine-tune the LLMs to adapt to RAG scenarios. Although fine-tuning can yield better performance, it often compromises the LLMs' general generation capabilities by modifying their parameters. This limitation poses challenges in practical applications, especially when LLMs are already deployed, as parameter adjustments may affect their original functionality. To address this, we propose a novel method that involves learning scalable and pluggable virtual tokens for RAG. By maintaining the LLMs' original parameters and fine-tuning only the embeddings of these pluggable tokens, our approach not only enhances LLMs' performance but also preserves their general generation capacities. Furthermore, we design several training strategies to improve the scalability, flexibility, and generalizability of our method. Comprehensive experiments across nine question-answering tasks demonstrate the superiority of our approach.

[14]  arXiv:2405.19701 [pdf, other]
Title: Significance of Chain of Thought in Gender Bias Mitigation for English-Dravidian Machine Translation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Gender bias in machine translation (MT) systems poses a significant challenge to achieving accurate and inclusive translations. This paper examines gender bias in machine translation systems for languages such as Telugu and Kannada from the Dravidian family, analyzing how gender inflections affect translation accuracy and neutrality using Google Translate and ChatGPT. It finds that while plural forms can reduce bias, individual-centric sentences often maintain the bias due to historical stereotypes. The study evaluates the Chain of Thought processing, noting significant bias mitigation from 80% to 4% in Telugu and from 40% to 0% in Kannada. It also compares Telugu and Kannada translations, emphasizing the need for language specific strategies to address these challenges and suggesting directions for future research to enhance fairness in both data preparation and prompts during inference.

[15]  arXiv:2405.19715 [pdf, other]
Title: SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Speculative decoding reduces the inference latency of a target large language model via utilizing a smaller and faster draft model. Its performance depends on a hyperparameter K -- the candidate length, i.e., the number of candidate tokens for the target model to verify in each round. However, previous methods often use simple heuristics to choose K, which may result in sub-optimal performance. We study the choice of the candidate length K and formulate it as a Markov Decision Process. We theoretically show that the optimal policy of this Markov decision process takes the form of a threshold policy, i.e., the current speculation should stop and be verified when the probability of getting a rejection exceeds a threshold value. Motivated by this theory, we propose SpecDec++, an enhanced version of speculative decoding that adaptively determines the candidate length on the fly. We augment the draft model with a trained acceptance prediction head to predict the conditional acceptance probability of the candidate tokens. SpecDec++ will stop the current speculation when the predicted probability that at least one token gets rejected exceeds a threshold. We implement SpecDec++ and apply it to the llama-2-chat 7B & 70B model pair. Our adaptive method achieves a 2.04x speedup on the Alpaca dataset (an additional 7.2% improvement over the baseline speculative decoding). On the GSM8K and HumanEval datasets, our method achieves a 2.26x speedup (9.4% improvement) and 2.23x speedup (11.1% improvement), respectively.

[16]  arXiv:2405.19737 [pdf, other]
Title: Beyond Imitation: Learning Key Reasoning Steps from Dual Chain-of-Thoughts in Reasoning Distillation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

As Large Language Models (LLMs) scale up and gain powerful Chain-of-Thoughts (CoTs) reasoning abilities, practical resource constraints drive efforts to distill these capabilities into more compact Smaller Language Models (SLMs). We find that CoTs consist mainly of simple reasoning forms, with a small proportion ($\approx 4.7\%$) of key reasoning steps that truly impact conclusions. However, previous distillation methods typically involve supervised fine-tuning student SLMs only on correct CoTs data produced by teacher LLMs, resulting in students struggling to learn the key reasoning steps, instead imitating the teacher's reasoning forms and making errors or omissions on these steps. To address these issues, drawing an analogy to human learning, where analyzing mistakes according to correct solutions often reveals the crucial steps leading to successes or failures, we propose mistak\textbf{E}-\textbf{D}riven key reason\textbf{I}ng step distilla\textbf{T}ion (\textbf{EDIT}), a novel method that further aids SLMs learning key reasoning steps rather than mere simple fine-tuning. Firstly, to expose these crucial steps in CoTs, we design specific prompts to generate dual CoTs data with similar reasoning paths but divergent conclusions. Then, we apply the minimum edit distance algorithm on the dual CoTs data to locate these key steps and optimize the likelihood of these steps. Extensive experiments validate the effectiveness of EDIT across both in-domain and out-of-domain benchmark reasoning datasets. Further analysis shows that EDIT can generate high-quality CoTs with more correct key reasoning steps. Notably, we also explore how different mistake patterns affect performance and find that EDIT benefits more from logical errors than from knowledge or mathematical calculation errors in dual CoTs\footnote{Code can be found at \url{https://github.com/C-W-D/EDIT}}.

[17]  arXiv:2405.19740 [pdf, other]
Title: PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations
Comments: 23 pages, 12 figures, 10 tables
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Expert-designed close-ended benchmarks serve as vital tools in assessing the knowledge capacity of large language models (LLMs). Despite their widespread use, concerns have mounted regarding their reliability due to limited test scenarios and an unavoidable risk of data contamination. To rectify this, we present PertEval, a toolkit devised for in-depth probing of LLMs' knowledge capacity through knowledge-invariant perturbations. These perturbations employ human-like restatement techniques to generate on-the-fly test samples from static benchmarks, meticulously retaining knowledge-critical content while altering irrelevant details. Our toolkit further includes a suite of transition analyses that compare performance on raw vs. perturbed test sets to precisely assess LLMs' genuine knowledge capacity. Six state-of-the-art LLMs are re-evaluated using PertEval. Results reveal significantly inflated performance of the LLMs on raw benchmarks, including an absolute 21% overestimation for GPT-4. Additionally, through a nuanced response pattern analysis, we discover that PertEval retains LLMs' uncertainty to specious knowledge, potentially being resolved through rote memorization and leading to inflated performance. We also find that the detailed transition analyses by PertEval could illuminate weaknesses in existing LLMs' knowledge mastery and guide the development of refinement. Given these insights, we posit that PertEval can act as an essential tool that, when applied alongside any close-ended benchmark, unveils the true knowledge capacity of LLMs, marking a significant step toward more trustworthy LLM evaluation.

[18]  arXiv:2405.19744 [pdf, other]
Title: X-Instruction: Aligning Language Model in Low-resource Languages with Self-curated Cross-lingual Instructions
Comments: ACL 2024. Our codes, data and model weights are available at this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Large language models respond well in high-resource languages like English but struggle in low-resource languages. It may arise from the lack of high-quality instruction following data in these languages. Directly translating English samples into these languages can be a solution but unreliable, leading to responses with translation errors and lacking language-specific or cultural knowledge. To address this issue, we propose a novel method to construct cross-lingual instruction following samples with instruction in English and response in low-resource languages. Specifically, the language model first learns to generate appropriate English instructions according to the natural web texts in other languages as responses. The candidate cross-lingual instruction tuning samples are further refined and diversified. We have employed this method to build a large-scale cross-lingual instruction tuning dataset on 10 languages, namely X-Instruction. The instruction data built using our method incorporate more language-specific knowledge compared with the naive translation method. Experimental results have shown that the response quality of the model tuned on X-Instruction greatly exceeds the model distilled from a powerful teacher model, reaching or even surpassing the ones of ChatGPT. In addition, we find that models tuned on cross-lingual instruction following samples can follow the instruction in the output language without further tuning.

[19]  arXiv:2405.19763 [pdf, other]
Title: Enhancing Reinforcement Learning with Label-Sensitive Reward for Natural Language Understanding
Comments: Accept at ACL2024 Main
Subjects: Computation and Language (cs.CL)

Recent strides in large language models (LLMs) have yielded remarkable performance, leveraging reinforcement learning from human feedback (RLHF) to significantly enhance generation and alignment capabilities. However, RLHF encounters numerous challenges, including the objective mismatch issue, leading to suboptimal performance in Natural Language Understanding (NLU) tasks. To address this limitation, we propose a novel Reinforcement Learning framework enhanced with Label-sensitive Reward (RLLR) to amplify the performance of LLMs in NLU tasks. By incorporating label-sensitive pairs into reinforcement learning, our method aims to adeptly capture nuanced label-sensitive semantic features during RL, thereby enhancing natural language understanding. Experiments conducted on five diverse foundation models across eight tasks showcase promising results. In comparison to Supervised Fine-tuning models (SFT), RLLR demonstrates an average performance improvement of 1.54%. Compared with RLHF models, the improvement averages at 0.69%. These results reveal the effectiveness of our method for LLMs in NLU tasks. Code and data available at: https://github.com/MagiaSN/ACL2024_RLLR.

[20]  arXiv:2405.19778 [pdf, other]
Title: Enhancing Consistency and Role-Specific Knowledge Capturing by Rebuilding Fictional Character's Persona
Comments: preprint
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

With the recent introduction of Assistants API, it is expected that document-based language models will be actively used in various domains, especially Role-playing. However, a key challenge lies in utilizing protagonist's persona: Assistants API often fails to achieve with its search because the information extraction part is different each time and it often omits important information such as protagonist's backstory or relationships. It is hard to maintain a consistent persona simply by using the persona document as input to the Assistants API. To address the challenge of achieving stable persona consistency, we propose CharacterGPT, a novel persona reconstruction framework to alleviate the shortcomings of the Assistants API. Our method involves Character Persona Training (CPT), an effective persona rebuilding process that updates the character persona by extracting the character's traits from given summary of the novel for each character as if the story in a novel progresses. In our experiments, we ask each character to take the Big Five Inventory personality test in various settings and analyze the results. To assess whether it can think outside the box, we let each character generate short novels. Extensive experiments and human evaluation demonstrate that CharacterGPT presents new possibilities for role-playing agent research.

[21]  arXiv:2405.19787 [pdf, other]
Title: From Symbolic Tasks to Code Generation: Diversification Yields Better Task Performers
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)

Instruction tuning -- tuning large language models on instruction-output pairs -- is a promising technique for making models better adapted to the real world. Yet, the key factors driving the model's capability to understand and follow instructions not seen during training remain under-explored. Our investigation begins with a series of synthetic experiments within the theoretical framework of a Turing-complete algorithm called Markov algorithm, which allows fine-grained control over the instruction-tuning data. Generalization and robustness with respect to the training distribution emerge once a diverse enough set of tasks is provided, even though very few examples are provided for each task. We extend these initial results to a real-world application scenario of code generation and find that a more diverse instruction set, extending beyond code-related tasks, improves the performance of code generation. Our observations suggest that a more diverse semantic space for instruction-tuning sets greatly improves the model's ability to follow instructions and perform tasks.

[22]  arXiv:2405.19793 [pdf, other]
Title: PDDLEGO: Iterative Planning in Textual Environments
Comments: In *SEM 2024
Subjects: Computation and Language (cs.CL)

Planning in textual environments have been shown to be a long-standing challenge even for current models. A recent, promising line of work uses LLMs to generate a formal representation of the environment that can be solved by a symbolic planner. However, existing methods rely on a fully-observed environment where all entity states are initially known, so a one-off representation can be constructed, leading to a complete plan. In contrast, we tackle partially-observed environments where there is initially no sufficient information to plan for the end-goal. We propose PDDLEGO that iteratively construct a planning representation that can lead to a partial plan for a given sub-goal. By accomplishing the sub-goal, more information is acquired to augment the representation, eventually achieving the end-goal. We show that plans produced by few-shot PDDLEGO are 43% more efficient than generating plans end-to-end on the Coin Collector simulation, with strong performance (98%) on the more complex Cooking World simulation where end-to-end LLMs fail to generate coherent plans (4%).

[23]  arXiv:2405.19795 [pdf, other]
Title: SLM as Guardian: Pioneering AI Safety with Small Language Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Most prior safety research of large language models (LLMs) has focused on enhancing the alignment of LLMs to better suit the safety requirements of humans. However, internalizing such safeguard features into larger models brought challenges of higher training cost and unintended degradation of helpfulness. To overcome such challenges, a modular approach employing a smaller LLM to detect harmful user queries is regarded as a convenient solution in designing LLM-based system with safety requirements.
In this paper, we leverage a smaller LLM for both harmful query detection and safeguard response generation. We introduce our safety requirements and the taxonomy of harmfulness categories, and then propose a multi-task learning mechanism fusing the two tasks into a single model. We demonstrate the effectiveness of our approach, providing on par or surpassing harmful query detection and safeguard response performance compared to the publicly available LLMs.

[24]  arXiv:2405.19799 [pdf, other]
Title: Unsupervised Mutual Learning of Dialogue Discourse Parsing and Topic Segmentation
Subjects: Computation and Language (cs.CL)

The advancement of large language models (LLMs) has propelled the development of dialogue systems. Unlike the popular ChatGPT-like assistant model, which only satisfies the user's preferences, task-oriented dialogue systems have also faced new requirements and challenges in the broader business field. They are expected to provide correct responses at each dialogue turn, at the same time, achieve the overall goal defined by the task. By understanding rhetorical structures and topic structures via topic segmentation and discourse parsing, a dialogue system may do a better planning to achieve both objectives. However, while both structures belong to discourse structure in linguistics, rhetorical structure and topic structure are mostly modeled separately or with one assisting the other in the prior work. The interaction between these two structures has not been considered for joint modeling and mutual learning. Furthermore, unsupervised learning techniques to achieve the above are not well explored. To fill this gap, we propose an unsupervised mutual learning framework of two structures leveraging the global and local connections between them. We extend the topic modeling between non-adjacent discourse units to ensure global structural relevance with rhetorical structures. We also incorporate rhetorical structures into the topic structure through a graph neural network model to ensure local coherence consistency. Finally, we utilize the similarity between the two fused structures for mutual learning. The experimental results demonstrate that our methods outperform all strong baselines on two dialogue rhetorical datasets (STAC and Molweni), as well as dialogue topic datasets (Doc2Dial and TIAGE).

[25]  arXiv:2405.19831 [pdf, other]
Title: Just Rewrite It Again: A Post-Processing Method for Enhanced Semantic Similarity and Privacy Preservation of Differentially Private Rewritten Text
Comments: 10 pages, 2 figures, 2 tables. Accepted to ARES 2024 (IWAPS)
Subjects: Computation and Language (cs.CL)

The study of Differential Privacy (DP) in Natural Language Processing often views the task of text privatization as a $\textit{rewriting}$ task, in which sensitive input texts are rewritten to hide explicit or implicit private information. In order to evaluate the privacy-preserving capabilities of a DP text rewriting mechanism, $\textit{empirical privacy}$ tests are frequently employed. In these tests, an adversary is modeled, who aims to infer sensitive information (e.g., gender) about the author behind a (privatized) text. Looking to improve the empirical protections provided by DP rewriting methods, we propose a simple post-processing method based on the goal of aligning rewritten texts with their original counterparts, where DP rewritten texts are rewritten $\textit{again}$. Our results shown that such an approach not only produces outputs that are more semantically reminiscent of the original inputs, but also texts which score on average better in empirical privacy evaluations. Therefore, our approach raises the bar for DP rewriting methods in their empirical privacy evaluations, providing an extra layer of protection against malicious adversaries.

[26]  arXiv:2405.19842 [pdf, other]
Title: Improve Student's Reasoning Generalizability through Cascading Decomposed CoTs Distillation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Large language models (LLMs) exhibit enhanced reasoning at larger scales, driving efforts to distill these capabilities into smaller models via teacher-student learning. Previous works simply fine-tune student models on teachers' generated Chain-of-Thoughts (CoTs) data. Although these methods enhance in-domain (IND) reasoning performance, they struggle to generalize to out-of-domain (OOD) tasks. We believe that the widespread spurious correlations between questions and answers may lead the model to preset a specific answer which restricts the diversity and generalizability of its reasoning process. In this paper, we propose Cascading Decomposed CoTs Distillation (CasCoD) to address these issues by decomposing the traditional single-step learning process into two cascaded learning steps. Specifically, by restructuring the training objectives -- removing the answer from outputs and concatenating the question with the rationale as input -- CasCoD's two-step learning process ensures that students focus on learning rationales without interference from the preset answers, thus improving reasoning generalizability. Extensive experiments demonstrate the effectiveness of CasCoD on both IND and OOD benchmark reasoning datasets. Code can be found at https://github.com/C-W-D/CasCoD.

[27]  arXiv:2405.19846 [pdf, other]
Title: Quest: Query-centric Data Synthesis Approach for Long-context Scaling of Large Language Model
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Large language models, initially pre-trained with a limited context length, can better handle longer texts by continuing training on a corpus with extended contexts. However, obtaining effective long-context data is challenging due to the scarcity and uneven distribution of long documents across different domains. To address this issue, we propose a Query-centric data synthesis method, abbreviated as Quest. Quest is an interpretable method based on the observation that documents retrieved by similar queries are relevant but low-redundant, thus well-suited for synthesizing long-context data. The method is also scalable and capable of constructing large amounts of long-context data. Using Quest, we synthesize a long-context dataset up to 128k context length, significantly outperforming other data synthesis methods on multiple long-context benchmark datasets. In addition, we further verify that the Quest method is predictable through scaling law experiments, making it a reliable solution for advancing long-context models.

[28]  arXiv:2405.19856 [pdf, other]
Title: DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories
Comments: Accepted by the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). arXiv admin note: substantial text overlap with arXiv:2404.00599, arXiv:2401.06401
Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)

How to evaluate the coding abilities of Large Language Models (LLMs) remains an open question. We find that existing benchmarks are poorly aligned with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs.
To address the knowledge gap, we propose a new benchmark named DevEval, which has three advances. (1) DevEval aligns with real-world repositories in multiple dimensions, e.g., code distributions and dependency distributions. (2) DevEval is annotated by 13 developers and contains comprehensive annotations (e.g., requirements, original repositories, reference code, and reference dependencies). (3) DevEval comprises 1,874 testing samples from 117 repositories, covering 10 popular domains (e.g., Internet, Database). Based on DevEval, we propose repository-level code generation and evaluate 8 popular LLMs on DevEval (e.g., gpt-4, gpt-3.5, StarCoder 2, DeepSeek Coder, CodeLLaMa). Our experiments reveal these LLMs' coding abilities in real-world code repositories. For example, in our experiments, the highest Pass@1 of gpt-4-turbo is only 53.04%. We also analyze LLMs' failed cases and summarize their shortcomings. We hope DevEval can facilitate the development of LLMs in real code repositories. DevEval, prompts, and LLMs' predictions have been released.

[29]  arXiv:2405.19874 [pdf, other]
Title: Is In-Context Learning Sufficient for Instruction Following in LLMs?
Comments: Preprint. Code at this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

In-context learning (ICL) allows LLMs to learn from examples without changing their weights, which is a particularly promising capability for long-context LLMs that can potentially learn from many examples. Recently, Lin et al. (2024) proposed URIAL, a method using only three in-context examples to align base LLMs, achieving non-trivial instruction following performance. In this work, we show that, while effective, ICL alignment with URIAL still underperforms compared to instruction fine-tuning on established benchmarks such as MT-Bench and AlpacaEval 2.0 (LC), especially with more capable base LMs. Unlike for tasks such as classification, translation, or summarization, adding more ICL demonstrations for long-context LLMs does not systematically improve instruction following performance. To address this limitation, we derive a greedy selection approach for ICL examples that noticeably improves performance, yet without bridging the gap to instruction fine-tuning. Finally, we provide a series of ablation studies to better understand the reasons behind the remaining gap, and we show how some aspects of ICL depart from the existing knowledge and are specific to the instruction tuning setting. Overall, our work advances the understanding of ICL as an alignment technique. We provide our code at https://github.com/tml-epfl/icl-alignment.

[30]  arXiv:2405.19958 [pdf, other]
Title: Multi-Aspect Controllable Text Generation with Disentangled Counterfactual Augmentation
Comments: Accepted in the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Multi-aspect controllable text generation aims to control the generated texts in attributes from multiple aspects (e.g., "positive" from sentiment and "sport" from topic). For ease of obtaining training samples, existing works neglect attribute correlations formed by the intertwining of different attributes. Particularly, the stereotype formed by imbalanced attribute correlations significantly affects multi-aspect control. In this paper, we propose MAGIC, a new multi-aspect controllable text generation method with disentangled counterfactual augmentation. We alleviate the issue of imbalanced attribute correlations during training using counterfactual feature vectors in the attribute latent space by disentanglement. During inference, we enhance attribute correlations by target-guided counterfactual augmentation to further improve multi-aspect control. Experiments show that MAGIC outperforms state-of-the-art baselines in both imbalanced and balanced attribute correlation scenarios. Our source code and data are available at https://github.com/nju-websoft/MAGIC.

[31]  arXiv:2405.19967 [pdf, other]
Title: Improved Out-of-Scope Intent Classification with Dual Encoding and Threshold-based Re-Classification
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Detecting out-of-scope user utterances is essential for task-oriented dialogues and intent classification. Current methodologies face difficulties with the unpredictable distribution of outliers and often rely on assumptions about data distributions. We present the Dual Encoder for Threshold-Based Re-Classification (DETER) to address these challenges. This end-to-end framework efficiently detects out-of-scope intents without requiring assumptions on data distributions or additional post-processing steps. The core of DETER utilizes dual text encoders, the Universal Sentence Encoder (USE) and the Transformer-based Denoising AutoEncoder (TSDAE), to generate user utterance embeddings, which are classified through a branched neural architecture. Further, DETER generates synthetic outliers using self-supervision and incorporates out-of-scope phrases from open-domain datasets. This approach ensures a comprehensive training set for out-of-scope detection. Additionally, a threshold-based re-classification mechanism refines the model's initial predictions. Evaluations on the CLINC-150, Stackoverflow, and Banking77 datasets demonstrate DETER's efficacy. Our model outperforms previous benchmarks, increasing up to 13% and 5% in F1 score for known and unknown intents on CLINC-150 and Stackoverflow, and 16% for known and 24% % for unknown intents on Banking77. The source code has been released at https://github.com/Hossam-Mohammed-tech/Intent\_Classification\_OOS.

[32]  arXiv:2405.20053 [pdf, other]
Title: Would I Lie To You? Inference Time Alignment of Language Models using Direct Preference Heads
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Pre-trained Language Models (LMs) exhibit strong zero-shot and in-context learning capabilities; however, their behaviors are often difficult to control. By utilizing Reinforcement Learning from Human Feedback (RLHF), it is possible to fine-tune unsupervised LMs to follow instructions and produce outputs that reflect human preferences. Despite its benefits, RLHF has been shown to potentially harm a language model's reasoning capabilities and introduce artifacts such as hallucinations where the model may fabricate facts. To address this issue we introduce Direct Preference Heads (DPH), a fine-tuning framework that enables LMs to learn human preference signals through an auxiliary reward head without directly affecting the output distribution of the language modeling head. We perform a theoretical analysis of our objective function and find strong ties to Conservative Direct Preference Optimization (cDPO). Finally we evaluate our models on GLUE, RACE, and the GPT4All evaluation suite and demonstrate that our method produces models which achieve higher scores than those fine-tuned with Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO) alone.

[33]  arXiv:2405.20079 [pdf, other]
Title: Student Answer Forecasting: Transformer-Driven Answer Choice Prediction for Language Learning
Comments: Accepted as a poster paper at EDM 2024: 17th International Conference on Educational Data Mining in Atlanta, USA
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)

Intelligent Tutoring Systems (ITS) enhance personalized learning by predicting student answers to provide immediate and customized instruction. However, recent research has primarily focused on the correctness of the answer rather than the student's performance on specific answer choices, limiting insights into students' thought processes and potential misconceptions. To address this gap, we present MCQStudentBert, an answer forecasting model that leverages the capabilities of Large Language Models (LLMs) to integrate contextual understanding of students' answering history along with the text of the questions and answers. By predicting the specific answer choices students are likely to make, practitioners can easily extend the model to new answer choices or remove answer choices for the same multiple-choice question (MCQ) without retraining the model. In particular, we compare MLP, LSTM, BERT, and Mistral 7B architectures to generate embeddings from students' past interactions, which are then incorporated into a finetuned BERT's answer-forecasting mechanism. We apply our pipeline to a dataset of language learning MCQ, gathered from an ITS with over 10,000 students to explore the predictive accuracy of MCQStudentBert, which incorporates student interaction patterns, in comparison to correct answer prediction and traditional mastery-learning feature-based approaches. This work opens the door to more personalized content, modularization, and granular support.

[34]  arXiv:2405.20089 [pdf, other]
Title: The Fine-Tuning Paradox: Boosting Translation Quality Without Sacrificing LLM Abilities
Comments: Accepted to ACL 2024 (long, main)
Subjects: Computation and Language (cs.CL)

Fine-tuning large language models (LLMs) for machine translation has shown improvements in overall translation quality. However, it is unclear what is the impact of fine-tuning on desirable LLM behaviors that are not present in neural machine translation models, such as steerability, inherent document-level translation abilities, and the ability to produce less literal translations. We perform an extensive translation evaluation on the LLaMA and Falcon family of models with model size ranging from 7 billion up to 65 billion parameters. Our results show that while fine-tuning improves the general translation quality of LLMs, several abilities degrade. In particular, we observe a decline in the ability to perform formality steering, to produce technical translations through few-shot examples, and to perform document-level translation. On the other hand, we observe that the model produces less literal translations after fine-tuning on parallel data. We show that by including monolingual data as part of the fine-tuning data we can maintain the abilities while simultaneously enhancing overall translation quality. Our findings emphasize the need for fine-tuning strategies that preserve the benefits of LLMs for machine translation.

[35]  arXiv:2405.20092 [pdf, other]
Title: Divide-and-Conquer Meets Consensus: Unleashing the Power of Functions in Code Generation
Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)

Despite recent progress made by large language models in code generation, they still struggle with programs that meet complex requirements. Recent work utilizes plan-and-solve decomposition to decrease the complexity and leverage self-tests to refine the generated program. Yet, planning deep-inside requirements in advance can be challenging, and the tests need to be accurate to accomplish self-improvement. To this end, we propose FunCoder, a code generation framework incorporating the divide-and-conquer strategy with functional consensus. Specifically, FunCoder recursively branches off sub-functions as smaller goals during code generation, represented by a tree hierarchy. These sub-functions are then composited to attain more complex objectives. Additionally, we designate functions via a consensus formed by identifying similarities in program behavior, mitigating error propagation. FunCoder outperforms state-of-the-art methods by +9.8% on average in HumanEval, MBPP, xCodeEval and MATH with GPT-3.5 and GPT-4. Moreover, our method demonstrates superiority on smaller models: With FunCoder, StableCode-3b surpasses GPT-3.5 by +18.6% and achieves 97.7% of GPT-4's performance on HumanEval. Further analysis reveals that our proposed dynamic function decomposition is capable of handling complex requirements, and the functional consensus prevails over self-testing in correctness evaluation.

[36]  arXiv:2405.20131 [pdf, other]
Title: Language Models Need Inductive Biases to Count Inductively
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Counting is a fundamental example of generalization, whether viewed through the mathematical lens of Peano's axioms defining the natural numbers or the cognitive science literature for children learning to count. The argument holds for both cases that learning to count means learning to count infinitely. While few papers have tried to distill transformer "reasoning" to the simplest case of counting, investigating length generalization does occur throughout the literature. In the "train short, test long" paradigm of NLP, length refers to the training sentence length. In formal language recognition, length refers to the input sequence length, or the maximum stack size induced by a pushdown automata. In general problem solving, length refers to the number of hops in a deductive reasoning chain or the recursion depth. For all cases, counting is central to task success. And crucially, generalizing counting inductively is central to success on OOD instances. This work provides extensive empirical results on training language models to count. We experiment with architectures ranging from RNNs, Transformers, State-Space Models and RWKV. We present carefully-designed task formats, auxiliary tasks and positional embeddings to avoid limitations in generalization with OOD-position and OOD-vocabulary. We find that while traditional RNNs trivially achieve inductive counting, Transformers have to rely on positional embeddings to count out-of-domain. As counting is the basis for many arguments concerning the expressivity of Transformers, our finding calls for the community to reexamine the application scope of primitive functions defined in formal characterizations. Finally, modern RNNs also largely underperform traditional RNNs in generalizing counting inductively. We discuss how design choices that enable parallelized training of modern RNNs cause them to lose merits of a recurrent nature.

[37]  arXiv:2405.20139 [pdf, other]
Title: GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Knowledge Graphs (KGs) represent human-crafted factual knowledge in the form of triplets (head, relation, tail), which collectively form a graph. Question Answering over KGs (KGQA) is the task of answering natural questions grounding the reasoning to the information provided by the KG. Large Language Models (LLMs) are the state-of-the-art models for QA tasks due to their remarkable ability to understand natural language. On the other hand, Graph Neural Networks (GNNs) have been widely used for KGQA as they can handle the complex graph information stored in the KG. In this work, we introduce GNN-RAG, a novel method for combining language understanding abilities of LLMs with the reasoning abilities of GNNs in a retrieval-augmented generation (RAG) style. First, a GNN reasons over a dense KG subgraph to retrieve answer candidates for a given question. Second, the shortest paths in the KG that connect question entities and answer candidates are extracted to represent KG reasoning paths. The extracted paths are verbalized and given as input for LLM reasoning with RAG. In our GNN-RAG framework, the GNN acts as a dense subgraph reasoner to extract useful graph information, while the LLM leverages its natural language processing ability for ultimate KGQA. Furthermore, we develop a retrieval augmentation (RA) technique to further boost KGQA performance with GNN-RAG. Experimental results show that GNN-RAG achieves state-of-the-art performance in two widely used KGQA benchmarks (WebQSP and CWQ), outperforming or matching GPT-4 performance with a 7B tuned LLM. In addition, GNN-RAG excels on multi-hop and multi-entity questions outperforming competing approaches by 8.9--15.5% points at answer F1.

[38]  arXiv:2405.20145 [pdf, other]
Title: Heidelberg-Boston @ SIGTYP 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical Transformers
Comments: Accepted for publication at the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP (SIGTYP-WS) 2024; 11 pages, 1 figure, 9 tables
Subjects: Computation and Language (cs.CL)

Historical languages present unique challenges to the NLP community, with one prominent hurdle being the limited resources available in their closed corpora. This work describes our submission to the constrained subtask of the SIGTYP 2024 shared task, focusing on PoS tagging, morphological tagging, and lemmatization for 13 historical languages. For PoS and morphological tagging we adapt a hierarchical tokenization method from Sun et al. (2023) and combine it with the advantages of the DeBERTa-V3 architecture, enabling our models to efficiently learn from every character in the training data. We also demonstrate the effectiveness of character-level T5 models on the lemmatization task. Pre-trained from scratch with limited data, our models achieved first place in the constrained subtask, nearly reaching the performance levels of the unconstrained task's winner. Our code is available at https://github.com/bowphs/SIGTYP-2024-hierarchical-transformers

[39]  arXiv:2405.20163 [pdf, other]
Title: Reasoning about concepts with LLMs: Inconsistencies abound
Comments: 15 pages, 5 figures, 3 tables
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

The ability to summarize and organize knowledge into abstract concepts is key to learning and reasoning. Many industrial applications rely on the consistent and systematic use of concepts, especially when dealing with decision-critical knowledge. However, we demonstrate that, when methodically questioned, large language models (LLMs) often display and demonstrate significant inconsistencies in their knowledge. Computationally, the basic aspects of the conceptualization of a given domain can be represented as Is-A hierarchies in a knowledge graph (KG) or ontology, together with a few properties or axioms that enable straightforward reasoning. We show that even simple ontologies can be used to reveal conceptual inconsistencies across several LLMs. We also propose strategies that domain experts can use to evaluate and improve the coverage of key domain concepts in LLMs of various sizes. In particular, we have been able to significantly enhance the performance of LLMs of various sizes with openly available weights using simple knowledge-graph (KG) based prompting strategies.

[40]  arXiv:2405.20175 [pdf, other]
Title: InstructionCP: A fast approach to transfer Large Language Models into target language
Comments: 10 pages, 1 figure
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

The rapid development of large language models (LLMs) in recent years has largely focused on English, resulting in models that respond exclusively in English. To adapt these models to other languages, continual pre-training (CP) is often employed, followed by supervised fine-tuning (SFT) to maintain conversational abilities. However, CP and SFT can reduce a model's ability to filter harmful content. We propose Instruction Continual Pre-training (InsCP), which integrates instruction tags into the CP process to prevent loss of conversational proficiency while acquiring new languages. Our experiments demonstrate that InsCP retains conversational and Reinforcement Learning from Human Feedback (RLHF) abilities. Empirical evaluations on language alignment, reliability, and knowledge benchmarks confirm the efficacy of InsCP. Notably, this approach requires only 0.1 billion tokens of high-quality instruction-following data, thereby reducing resource consumption.

[41]  arXiv:2405.20179 [pdf, other]
Title: Robo-Instruct: Simulator-Augmented Instruction Alignment For Finetuning CodeLLMs
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Large language models (LLMs) have shown great promise at generating robot programs from natural language given domain-specific robot application programming interfaces (APIs). However, the performance gap between proprietary LLMs and smaller open-weight LLMs remains wide. This raises a question: Can we fine-tune smaller open-weight LLMs for generating domain-specific robot programs to close the performance gap with proprietary LLMs? While Self-Instruct is a promising solution by generating a diverse set of training data, it cannot verify the correctness of these programs. In contrast, a robot simulator with a well-defined world can identify execution errors but limits the diversity of programs that it can verify. In this work, we introduce Robo-Instruct, which brings the best of both worlds -- it promotes the diversity of Self-Instruct while providing the correctness of simulator-based checking. Robo-Instruct introduces RoboSim to synthesize a consistent world state on the fly by inferring properties relevant to the program being checked, and simulating actions accordingly. Furthermore, the instructions and programs generated by Self-Instruct may be subtly inconsistent -- such as the program missing a step implied by the instruction. Robo-Instruct further addresses this with InstAlign, an instruction-program alignment procedure that revises the task instruction to reflect the actual results of the generated program. Given a few seed task descriptions and the robot APIs, Robo-Instruct is capable of generating a training dataset using only a small open-weight model. This dataset can then be used to fine-tune small open-weight language models, enabling them to match or even exceed the performance of several proprietary LLMs, such as GPT-3.5-Turbo and Gemini-Pro.

[42]  arXiv:2405.20192 [pdf, other]
Title: TAIA: Large Language Models are Out-of-Distribution Data Learners
Comments: 25 pages
Subjects: Computation and Language (cs.CL)

Fine-tuning on task-specific question-answer pairs is a predominant method for enhancing the performance of instruction-tuned large language models (LLMs) on downstream tasks. However, in certain specialized domains, such as healthcare or harmless content generation, it is nearly impossible to obtain a large volume of high-quality data that matches the downstream distribution. To improve the performance of LLMs in data-scarce domains with domain-mismatched data, we re-evaluated the Transformer architecture and discovered that not all parameter updates during fine-tuning contribute positively to downstream performance. Our analysis reveals that within the self-attention and feed-forward networks, only the fine-tuned attention parameters are particularly beneficial when the training set's distribution does not fully align with the test set. Based on this insight, we propose an effective inference-time intervention method: \uline{T}raining \uline{A}ll parameters but \uline{I}nferring with only \uline{A}ttention (\trainallInfAttn). We empirically validate \trainallInfAttn using two general instruction-tuning datasets and evaluate it on seven downstream tasks involving math, reasoning, and knowledge understanding across LLMs of different parameter sizes and fine-tuning techniques. Our comprehensive experiments demonstrate that \trainallInfAttn achieves superior improvements compared to both the fully fine-tuned model and the base model in most scenarios, with significant performance gains. The high tolerance of \trainallInfAttn to data mismatches makes it resistant to jailbreaking tuning and enhances specialized tasks using general data.

[43]  arXiv:2405.20204 [pdf, other]
Title: Jina CLIP: Your CLIP Model Is Also Your Text Retriever
Comments: 4 pages, ICML2024 workshop submission
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve the state-of-the-art performance on both text-image and text-text retrieval tasks.

[44]  arXiv:2405.20215 [pdf, other]
Title: TS-Align: A Teacher-Student Collaborative Framework for Scalable Iterative Finetuning of Large Language Models
Subjects: Computation and Language (cs.CL)

Mainstream approaches to aligning large language models (LLMs) heavily rely on human preference data, particularly when models require periodic updates. The standard process for iterative alignment of LLMs involves collecting new human feedback for each update. However, the data collection process is costly and challenging to scale. To address this issue, we introduce the "TS-Align" framework, which fine-tunes a policy model using pairwise feedback data automatically mined from its outputs. This automatic mining process is efficiently accomplished through the collaboration between a large-scale teacher model and a small-scale student model. The policy fine-tuning process can be iteratively repeated using on-policy generations within our proposed teacher-student collaborative framework. Through extensive experiments, we demonstrate that our final aligned policy outperforms the base policy model with an average win rate of 69.7% across seven conversational or instruction-following datasets. Furthermore, we show that the ranking capability of the teacher is effectively distilled into the student through our pipeline, resulting in a small-scale yet effective reward model for policy model alignment.

[45]  arXiv:2405.20245 [pdf, other]
Title: Retrieval Augmented Structured Generation: Business Document Information Extraction As Tool Use
Comments: Accepted by IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), 2024
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Business Document Information Extraction (BDIE) is the problem of transforming a blob of unstructured information (raw text, scanned documents, etc.) into a structured format that downstream systems can parse and use. It has two main tasks: Key-Information Extraction (KIE) and Line Items Recognition (LIR). In this paper, we argue that BDIE is best modeled as a Tool Use problem, where the tools are these downstream systems. We then present Retrieval Augmented Structured Generation (RASG), a novel general framework for BDIE that achieves state of the art (SOTA) results on both KIE and LIR tasks on BDIE benchmarks.
The contributions of this paper are threefold: (1) We show, with ablation benchmarks, that Large Language Models (LLMs) with RASG are already competitive with or surpasses current SOTA Large Multimodal Models (LMMs) without RASG on BDIE benchmarks. (2) We propose a new metric class for Line Items Recognition, General Line Items Recognition Metric (GLIRM), that is more aligned with practical BDIE use cases compared to existing metrics, such as ANLS*, DocILE, and GriTS. (3) We provide a heuristic algorithm for backcalculating bounding boxes of predicted line items and tables without the need for vision encoders. Finally, we claim that, while LMMs might sometimes offer marginal performance benefits, LLMs + RASG is oftentimes superior given real-world applications and constraints of BDIE.

[46]  arXiv:2405.20252 [pdf, other]
Title: Towards Hierarchical Multi-Agent Workflows for Zero-Shot Prompt Optimization
Subjects: Computation and Language (cs.CL)

Large language models (LLMs) have shown great progress in responding to user questions, allowing for a multitude of diverse applications. Yet, the quality of LLM outputs heavily depends on the prompt design, where a good prompt might enable the LLM to answer a very challenging question correctly. Therefore, recent works have developed many strategies for improving the prompt, including both manual crafting and in-domain optimization. However, their efficacy in unrestricted scenarios remains questionable, as the former depends on human design for specific questions and the latter usually generalizes poorly to unseen scenarios. To address these problems, we give LLMs the freedom to design the best prompts according to themselves. Specifically, we include a hierarchy of LLMs, first constructing a prompt with precise instructions and accurate wording in a hierarchical manner, and then using this prompt to generate the final answer to the user query. We term this pipeline Hierarchical Multi-Agent Workflow, or HMAW. In contrast with prior works, HMAW imposes no human restriction and requires no training, and is completely task-agnostic while capable of adjusting to the nuances of the underlying task. Through both quantitative and qualitative experiments across multiple benchmarks, we verify that despite its simplicity, the proposed approach can create detailed and suitable prompts, further boosting the performance of current LLMs.

[47]  arXiv:2405.20253 [pdf, other]
Title: Evaluating Large Language Model Biases in Persona-Steered Generation
Comments: Accepted to Findings of ACL 2024. Code and data available at this https URL
Subjects: Computation and Language (cs.CL)

The task of persona-steered text generation requires large language models (LLMs) to generate text that reflects the distribution of views that an individual fitting a persona could have. People have multifaceted personas, but prior work on bias in LLM-generated opinions has only explored multiple-choice settings or one-dimensional personas. We define an incongruous persona as a persona with multiple traits where one trait makes its other traits less likely in human survey data, e.g. political liberals who support increased military spending. We find that LLMs are 9.7% less steerable towards incongruous personas than congruous ones, sometimes generating the stereotypical stance associated with its demographic rather than the target stance. Models that we evaluate that are fine-tuned with Reinforcement Learning from Human Feedback (RLHF) are more steerable, especially towards stances associated with political liberals and women, but present significantly less diverse views of personas. We also find variance in LLM steerability that cannot be predicted from multiple-choice opinion evaluation. Our results show the importance of evaluating models in open-ended text generation, as it can surface new LLM opinion biases. Moreover, such a setup can shed light on our ability to steer models toward a richer and more diverse range of viewpoints.

[48]  arXiv:2405.20267 [pdf, other]
Title: Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions
Subjects: Computation and Language (cs.CL)

As LLMs evolve on a daily basis, there is an urgent need for a trustworthy evaluation method that can provide robust evaluation results in a timely fashion. Currently, as static benchmarks are prone to contamination concerns, users tend to trust human voting platforms, such as Chatbot Arena. However, human annotations require extensive manual efforts. To provide an automatic, robust, and trustworthy evaluation framework, we innovatively propose the Auto-Arena of LLMs, which automates the entire evaluation process with LLM agents. Firstly, an examiner LLM devises queries. Then, a pair of candidate LLMs engage in a multi-round peer-battle around the query, during which the LLM's true performance gaps become visible. Finally, a committee of LLM judges collectively discuss and determine the winner, which alleviates bias and promotes fairness. In our extensive experiment on the 17 newest LLMs, Auto-Arena shows the highest correlation with human preferences, providing a promising alternative to human evaluation platforms.

[49]  arXiv:2405.20269 [pdf, ps, other]
Title: IsraParlTweet: The Israeli Parliamentary and Twitter Resource
Comments: Presented at LREC-COLING 2024
Subjects: Computation and Language (cs.CL)

We introduce IsraParlTweet, a new linked corpus of Hebrew-language parliamentary discussions from the Knesset (Israeli Parliament) between the years 1992-2023 and Twitter posts made by Members of the Knesset between the years 2008-2023, containing a total of 294.5 million Hebrew tokens. In addition to raw text, the corpus contains comprehensive metadata on speakers and Knesset sessions as well as several linguistic annotations. As a result, IsraParlTweet can be used to conduct a wide variety of quantitative and qualitative analyses and provide valuable insights into political discourse in Israel.

[50]  arXiv:2405.20274 [pdf, other]
Title: ROAST: Review-level Opinion Aspect Sentiment Target Joint Detection
Comments: arXiv admin note: text overlap with arXiv:2309.13297
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Aspect-Based Sentiment Analysis (ABSA) has experienced tremendous expansion and diversity due to various shared tasks spanning several languages and fields and organized via SemEval workshops and Germeval. Nonetheless, a few shortcomings still need to be addressed, such as the lack of low-resource language evaluations and the emphasis on sentence-level analysis. To thoroughly assess ABSA techniques in the context of complete reviews, this research presents a novel task, Review-Level Opinion Aspect Sentiment Target (ROAST). ROAST seeks to close the gap between sentence-level and text-level ABSA by identifying every ABSA constituent at the review level. We extend the available datasets to enable ROAST, addressing the drawbacks noted in previous research by incorporating low-resource languages, numerous languages, and a variety of topics. Through this effort, ABSA research will be able to cover more ground and get a deeper comprehension of the task and its practical application in a variety of languages and domains (https://github.com/RiTUAL-UH/ROAST-ABSA).

[51]  arXiv:2405.20285 [pdf, other]
Title: Who Writes the Review, Human or AI?
Subjects: Computation and Language (cs.CL)

With the increasing use of Artificial Intelligence in Natural Language Processing, concerns have been raised regarding the detection of AI-generated text in various domains. This study aims to investigate this issue by proposing a methodology to accurately distinguish AI-generated and human-written book reviews. Our approach utilizes transfer learning, enabling the model to identify generated text across different topics while improving its ability to detect variations in writing style and vocabulary. To evaluate the effectiveness of the proposed methodology, we developed a dataset consisting of real book reviews and AI-generated reviews using the recently proposed Vicuna open-source language model. The experimental results demonstrate that it is feasible to detect the original source of text, achieving an accuracy rate of 96.86%. Our efforts are oriented toward the exploration of the capabilities and limitations of Large Language Models in the context of text identification. Expanding our knowledge in these aspects will be valuable for effectively navigating similar models in the future and ensuring the integrity and authenticity of human-generated content.

[52]  arXiv:2405.20304 [pdf, other]
Title: Group Robust Preference Optimization in Reward-free RLHF
Comments: Preprint
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Adapting large language models (LLMs) for specific tasks usually involves fine-tuning through reinforcement learning with human feedback (RLHF) on preference data. While these data often come from diverse labelers' groups (e.g., different demographics, ethnicities, company teams, etc.), traditional RLHF approaches adopt a "one-size-fits-all" approach, i.e., they indiscriminately assume and optimize a single preference model, thus not being robust to unique characteristics and needs of the various groups. To address this limitation, we propose a novel Group Robust Preference Optimization (GRPO) method to align LLMs to individual groups' preferences robustly. Our approach builds upon reward-free direct preference optimization methods, but unlike previous approaches, it seeks a robust policy which maximizes the worst-case group performance. To achieve this, GRPO adaptively and sequentially weights the importance of different groups, prioritizing groups with worse cumulative loss. We theoretically study the feasibility of GRPO and analyze its convergence for the log-linear policy class. By fine-tuning LLMs with GRPO using diverse group-based global opinion data, we significantly improved performance for the worst-performing groups, reduced loss imbalances across groups, and improved probability accuracies compared to non-robust baselines.

[53]  arXiv:2405.20314 [pdf, ps, other]
Title: S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs
Subjects: Computation and Language (cs.CL)

Speculative decoding (SD) has attracted a significant amount of research attention due to the substantial speedup it can achieve for LLM inference. However, despite the high speedups they offer, speculative decoding methods often achieve optimal performance on high-end devices or with a substantial GPU memory overhead. Given limited memory and the necessity of quantization, a high-performing model on a high-end GPU can slow down by up to 7 times. To this end, we propose Skippy Simultaneous Speculative Decoding (or S3D), a cost-effective self-speculative SD method based on simultaneous multi-token decoding and mid-layer skipping. When compared against recent effective open-source SD systems, our method has achieved one of the top performance-memory ratios while requiring minimal architecture changes and training data. Leveraging our memory efficiency, we created a smaller yet more effective SD model based on Phi-3. It is 1.4 to 2 times faster than the quantized EAGLE model and operates in half-precision while using less VRAM.

[54]  arXiv:2405.20315 [pdf, other]
Title: ANAH: Analytical Annotation of Hallucinations in Large Language Models
Comments: Accepted by ACL 2024
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Reducing the `$\textit{hallucination}$' problem of Large Language Models (LLMs) is crucial for their wide applications. A comprehensive and fine-grained measurement of the hallucination is the first key step for the governance of this issue but is under-explored in the community. Thus, we present $\textbf{ANAH}$, a bilingual dataset that offers $\textbf{AN}$alytical $\textbf{A}$nnotation of $\textbf{H}$allucinations in LLMs within Generative Question Answering. Each answer sentence in our dataset undergoes rigorous annotation, involving the retrieval of a reference fragment, the judgment of the hallucination type, and the correction of hallucinated content. ANAH consists of ~12k sentence-level annotations for ~4.3k LLM responses covering over 700 topics, constructed by a human-in-the-loop pipeline. Thanks to the fine granularity of the hallucination annotations, we can quantitatively confirm that the hallucinations of LLMs progressively accumulate in the answer and use ANAH to train and evaluate hallucination annotators. We conduct extensive experiments on studying generative and discriminative annotators and show that, although current open-source LLMs have difficulties in fine-grained hallucination annotation, the generative annotator trained with ANAH can surpass all open-source LLMs and GPT-3.5, obtain performance competitive with GPT-4, and exhibits better generalization ability on unseen questions.

[55]  arXiv:2405.20318 [pdf, other]
Title: CausalQuest: Collecting Natural Causal Questions for AI Agents
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

Humans have an innate drive to seek out causality. Whether fuelled by curiosity or specific goals, we constantly question why things happen, how they are interconnected, and many other related phenomena. To develop AI agents capable of addressing this natural human quest for causality, we urgently need a comprehensive dataset of natural causal questions. Unfortunately, existing datasets either contain only artificially-crafted questions that do not reflect real AI usage scenarios or have limited coverage of questions from specific sources. To address this gap, we present CausalQuest, a dataset of 13,500 naturally occurring questions sourced from social networks, search engines, and AI assistants. We formalize the definition of causal questions and establish a taxonomy for finer-grained classification. Through a combined effort of human annotators and large language models (LLMs), we carefully label the dataset. We find that 42% of the questions humans ask are indeed causal, with the majority seeking to understand the causes behind given effects. Using this dataset, we train efficient classifiers (up to 2.85B parameters) for the binary task of identifying causal questions, achieving high performance with F1 scores of up to 0.877. We conclude with a rich set of future research directions that can build upon our data and models.

[56]  arXiv:2405.20335 [pdf, other]
Title: Xwin-LM: Strong and Scalable Alignment Practice for LLMs
Subjects: Computation and Language (cs.CL)

In this work, we present Xwin-LM, a comprehensive suite of alignment methodologies for large language models (LLMs). This suite encompasses several key techniques, including supervised finetuning (SFT), reward modeling (RM), rejection sampling finetuning (RS), and direct preference optimization (DPO). The key components are as follows: (1) Xwin-LM-SFT, models initially finetuned with high-quality instruction data; (2) Xwin-Pair, a large-scale, multi-turn preference dataset meticulously annotated using GPT-4; (3) Xwin-RM, reward models trained on Xwin-Pair, developed at scales of 7B, 13B, and 70B parameters; (4) Xwin-Set, a multiwise preference dataset in which each prompt is linked to 64 unique responses generated by Xwin-LM-SFT and scored by Xwin-RM; (5) Xwin-LM-RS, models finetuned with the highest-scoring responses from Xwin-Set; (6) Xwin-LM-DPO, models further optimized on Xwin-Set using the DPO algorithm. Our evaluations on AlpacaEval and MT-bench demonstrate consistent and significant improvements across the pipeline, demonstrating the strength and scalability of Xwin-LM. The repository https://github.com/Xwin-LM/Xwin-LM will be continually updated to foster community research.

Cross-lists for Fri, 31 May 24

[57]  arXiv:2405.19342 (cross-list from cs.SD) [pdf, other]
Title: Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice Assistants
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Recent works demonstrate that voice assistants do not perform equally well for everyone, but research on demographic robustness of speech technologies is still scarce. This is mainly due to the rarity of large datasets with controlled demographic tags. This paper introduces the Sonos Voice Control Bias Assessment Dataset, an open dataset composed of voice assistant requests for North American English in the music domain (1,038 speakers, 166 hours, 170k audio samples, with 9,040 unique labelled transcripts) with a controlled demographic diversity (gender, age, dialectal region and ethnicity). We also release a statistical demographic bias assessment methodology, at the univariate and multivariate levels, tailored to this specific use case and leveraging spoken language understanding metrics rather than transcription accuracy, which we believe is a better proxy for user experience. To demonstrate the capabilities of this dataset and statistical method to detect demographic bias, we consider a pair of state-of-the-art Automatic Speech Recognition and Spoken Language Understanding models. Results show statistically significant differences in performance across age, dialectal region and ethnicity. Multivariate tests are crucial to shed light on mixed effects between dialectal region, gender and age.

[58]  arXiv:2405.19343 (cross-list from cs.SD) [pdf, other]
Title: Luganda Speech Intent Recognition for IoT Applications
Comments: Presented as a conference paper at ICLR 2024/AfricaNLP
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

The advent of Internet of Things (IoT) technology has generated massive interest in voice-controlled smart homes. While many voice-controlled smart home systems are designed to understand and support widely spoken languages like English, speakers of low-resource languages like Luganda may need more support. This research project aimed to develop a Luganda speech intent classification system for IoT applications to integrate local languages into smart home environments. The project uses hardware components such as Raspberry Pi, Wio Terminal, and ESP32 nodes as microcontrollers. The Raspberry Pi processes Luganda voice commands, the Wio Terminal is a display device, and the ESP32 nodes control the IoT devices. The ultimate objective of this work was to enable voice control using Luganda, which was accomplished through a natural language processing (NLP) model deployed on the Raspberry Pi. The NLP model utilized Mel Frequency Cepstral Coefficients (MFCCs) as acoustic features and a Convolutional Neural Network (Conv2D) architecture for speech intent classification. A dataset of Luganda voice commands was curated for this purpose and this has been made open-source. This work addresses the localization challenges and linguistic diversity in IoT applications by incorporating Luganda voice commands, enabling users to interact with smart home devices without English proficiency, especially in regions where local languages are predominant.

[59]  arXiv:2405.19534 (cross-list from cs.LG) [pdf, other]
Title: Preference Learning Algorithms Do Not Learn Preference Rankings
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Preference learning algorithms (e.g., RLHF and DPO) are frequently used to steer LLMs to produce generations that are more preferred by humans, but our understanding of their inner workings is still limited. In this work, we study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs, measured via $\textit{ranking accuracy}$. Surprisingly, we find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets. We furthermore derive the $\textit{idealized ranking accuracy}$ that a preference-tuned LLM would achieve if it optimized the DPO or RLHF objective perfectly. We demonstrate that existing models exhibit a significant $\textit{alignment gap}$ -- $\textit{i.e.}$, a gap between the observed and idealized ranking accuracies. We attribute this discrepancy to the DPO objective, which is empirically and theoretically ill-suited to fix even mild ranking errors in the reference model, and derive a simple and efficient formula for quantifying the difficulty of learning a given preference datapoint. Finally, we demonstrate that ranking accuracy strongly correlates with the empirically popular win rate metric when the model is close to the reference model used in the objective, shedding further light on the differences between on-policy (e.g., RLHF) and off-policy (e.g., DPO) preference learning algorithms.

[60]  arXiv:2405.19561 (cross-list from cs.AI) [pdf, other]
Title: Quo Vadis ChatGPT? From Large Language Models to Large Knowledge Models
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

The startling success of ChatGPT and other large language models (LLMs) using transformer-based generative neural network architecture in applications such as natural language processing and image synthesis has many researchers excited about potential opportunities in process systems engineering (PSE). The almost human-like performance of LLMs in these areas is indeed very impressive, surprising, and a major breakthrough. Their capabilities are very useful in certain tasks, such as writing first drafts of documents, code writing assistance, text summarization, etc. However, their success is limited in highly scientific domains as they cannot yet reason, plan, or explain due to their lack of in-depth domain knowledge. This is a problem in domains such as chemical engineering as they are governed by fundamental laws of physics and chemistry (and biology), constitutive relations, and highly technical knowledge about materials, processes, and systems. Although purely data-driven machine learning has its immediate uses, the long-term success of AI in scientific and engineering domains would depend on developing hybrid AI systems that use first principles and technical knowledge effectively. We call these hybrid AI systems Large Knowledge Models (LKMs), as they will not be limited to only NLP-based techniques or NLP-like applications. In this paper, we discuss the challenges and opportunities in developing such systems in chemical engineering.

[61]  arXiv:2405.19562 (cross-list from cs.CY) [pdf, other]
Title: Selective Explanations
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)

Feature attribution methods explain black-box machine learning (ML) models by assigning importance scores to input features. These methods can be computationally expensive for large ML models. To address this challenge, there has been increasing efforts to develop amortized explainers, where a machine learning model is trained to predict feature attribution scores with only one inference. Despite their efficiency, amortized explainers can produce inaccurate predictions and misleading explanations. In this paper, we propose selective explanations, a novel feature attribution method that (i) detects when amortized explainers generate low-quality explanations and (ii) improves these explanations using a technique called explanations with initial guess. Our selective explanation method allows practitioners to specify the fraction of samples that receive explanations with initial guess, offering a principled way to bridge the gap between amortized explainers and their high-quality counterparts.

[62]  arXiv:2405.19592 (cross-list from cs.LG) [pdf, other]
Title: Why Larger Language Models Do In-context Learning Differently?
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Large language models (LLM) have emerged as a powerful tool for AI, with the key ability of in-context learning (ICL), where they can perform well on unseen tasks based on a brief series of task examples without necessitating any adjustments to the model parameters. One recent interesting mysterious observation is that models of different scales may have different ICL behaviors: larger models tend to be more sensitive to noise in the test context. This work studies this observation theoretically aiming to improve the understanding of LLM and ICL. We analyze two stylized settings: (1) linear regression with one-layer single-head linear transformers and (2) parity classification with two-layer multiple attention heads transformers (non-linear data and non-linear model). In both settings, we give closed-form optimal solutions and find that smaller models emphasize important hidden features while larger ones cover more hidden features; thus, smaller models are more robust to noise while larger ones are more easily distracted, leading to different ICL behaviors. This sheds light on where transformers pay attention to and how that affects ICL. Preliminary experimental results on large base and chat models provide positive support for our analysis.

[63]  arXiv:2405.19597 (cross-list from cs.LG) [pdf, other]
Title: SVFT: Parameter-Efficient Fine-Tuning with Singular Vectors
Comments: 17 pages, 5 figures, 14 tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Popular parameter-efficient fine-tuning (PEFT) methods, such as LoRA and its variants, freeze pre-trained model weights \(W\) and inject learnable matrices \(\Delta W\). These \(\Delta W\) matrices are structured for efficient parameterization, often using techniques like low-rank approximations or scaling vectors. However, these methods typically show a performance gap compared to full fine-tuning. Although recent PEFT methods have narrowed this gap, they do so at the cost of additional learnable parameters. We propose SVFT, a simple approach that fundamentally differs from existing methods: the structure imposed on \(\Delta W\) depends on the specific weight matrix \(W\). Specifically, SVFT updates \(W\) as a sparse combination of outer products of its singular vectors, training only the coefficients (scales) of these sparse combinations. This approach allows fine-grained control over expressivity through the number of coefficients. Extensive experiments on language and vision benchmarks show that SVFT recovers up to 96% of full fine-tuning performance while training only 0.006 to 0.25% of parameters, outperforming existing methods that only recover up to 85% performance using 0.03 to 0.8% of the trainable parameter budget.

[64]  arXiv:2405.19616 (cross-list from cs.AI) [pdf, other]
Title: Easy Problems That LLMs Get Wrong
Comments: AutogenAI Ltd. Associated code at this https URL
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

We introduce a comprehensive Linguistic Benchmark designed to evaluate the limitations of Large Language Models (LLMs) in domains such as logical reasoning, spatial intelligence, and linguistic understanding, among others. Through a series of straightforward questions, it uncovers the significant limitations of well-regarded models to perform tasks that humans manage with ease. It also highlights the potential of prompt engineering to mitigate some errors and underscores the necessity for better training methodologies. Our findings stress the importance of grounding LLMs with human reasoning and common sense, emphasising the need for human-in-the-loop for enterprise applications. We hope this work paves the way for future research to enhance the usefulness and reliability of new models.

[65]  arXiv:2405.19716 (cross-list from cs.CV) [pdf, other]
Title: Enhancing Large Vision Language Models with Self-Training on Image Comprehension
Comments: 19 pages, 14 figures, 6 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the perception capability of the model to understand image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings to alleviate the need for labeled data by leveraging model's own generation. However, effective self-training remains a challenge regarding the unique visual perception and reasoning capability of LVLMs. To address this, we introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension. First, the model self-constructs a preference dataset for image descriptions using unlabeled images. Preferred responses are generated through a step-by-step prompt, while dis-preferred responses are generated from either corrupted images or misleading prompts. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data and append its self-generated image descriptions to the prompts. We validate the effectiveness of STIC across seven different benchmarks, demonstrating substantial performance gains of 4.0% on average while using 70% less supervised fine-tuning data than the current method. Further studies investigate various components of STIC and highlight its potential to leverage vast quantities of unlabeled images for self-training. Code and data are made publicly available.

[66]  arXiv:2405.19732 (cross-list from cs.CV) [pdf, other]
Title: Two Optimizers Are Better Than One: LLM Catalyst for Enhancing Gradient-Based Optimization
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Learning a skill generally relies on both practical experience by doer and insightful high-level guidance by instructor. Will this strategy also work well for solving complex non-convex optimization problems? Here, a common gradient-based optimizer acts like a disciplined doer, making locally optimal update at each step. Recent methods utilize large language models (LLMs) to optimize solutions for concrete problems by inferring from natural language instructions, akin to a high-level instructor. In this paper, we show that these two optimizers are complementary to each other, suggesting a collaborative optimization approach. The gradient-based optimizer and LLM-based optimizer are combined in an interleaved manner. We instruct LLMs using task descriptions and timely optimization trajectories recorded during gradient-based optimization. Inferred results from LLMs are used as restarting points for the next stage of gradient optimization. By leveraging both the locally rigorous gradient-based optimizer and the high-level deductive LLM-based optimizer, our combined optimization method consistently yields improvements over competitive baseline prompt tuning methods. Our results demonstrate the synergistic effect of conventional gradient-based optimization and the inference ability of LLMs. The code is released at https://github.com/guozix/LLM-catalyst.

[67]  arXiv:2405.19782 (cross-list from cs.SE) [pdf, other]
Title: Dataflow-Guided Retrieval Augmentation for Repository-Level Code Completion
Comments: Accepted in the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)

Recent years have witnessed the deployment of code language models (LMs) in various code intelligence tasks such as code completion. Yet, it is challenging for pre-trained LMs to generate correct completions in private repositories. Previous studies retrieve cross-file context based on import relations or text similarity, which is insufficiently relevant to completion targets. In this paper, we propose a dataflow-guided retrieval augmentation approach, called DraCo, for repository-level code completion. DraCo parses a private repository into code entities and establishes their relations through an extended dataflow analysis, forming a repo-specific context graph. Whenever triggering code completion, DraCo precisely retrieves relevant background knowledge from the repo-specific context graph and generates well-formed prompts to query code LMs. Furthermore, we construct a large Python dataset, ReccEval, with more diverse completion targets. Our experiments demonstrate the superior accuracy and applicable efficiency of DraCo, improving code exact match by 3.43% and identifier F1-score by 3.27% on average compared to the state-of-the-art approach.

[68]  arXiv:2405.19877 (cross-list from cs.AI) [pdf, other]
Title: KNOW: A Real-World Ontology for Knowledge Capture with Large Language Models
Authors: Arto Bendiken
Comments: 5 pages, 1 figure
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

We present KNOW--the Knowledge Navigator Ontology for the World--the first ontology designed to capture everyday knowledge to augment large language models (LLMs) in real-world generative AI use cases such as personal AI assistants. Our domain is human life, both its everyday concerns and its major milestones. We have limited the initial scope of the modeled concepts to only established human universals: spacetime (places, events) plus social (people, groups, organizations). The inclusion criteria for modeled concepts are pragmatic, beginning with universality and utility. We compare and contrast previous work such as Schema.org and Cyc--as well as attempts at a synthesis of knowledge graphs and language models--noting how LLMs already encode internally much of the commonsense tacit knowledge that took decades to capture in the Cyc project. We also make available code-generated software libraries for the 12 most popular programming languages, enabling the direct use of ontology concepts in software engineering. We emphasize simplicity and developer experience in promoting AI interoperability.

[69]  arXiv:2405.19954 (cross-list from cs.CR) [pdf, other]
Title: GenKubeSec: LLM-Based Kubernetes Misconfiguration Detection, Localization, Reasoning, and Remediation
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

A key challenge associated with Kubernetes configuration files (KCFs) is that they are often highly complex and error-prone, leading to security vulnerabilities and operational setbacks. Rule-based (RB) tools for KCF misconfiguration detection rely on static rule sets, making them inherently limited and unable to detect newly-discovered misconfigurations. RB tools also suffer from misdetection, since mistakes are likely when coding the detection rules. Recent methods for detecting and remediating KCF misconfigurations are limited in terms of their scalability and detection coverage, or due to the fact that they have high expertise requirements and do not offer automated remediation along with misconfiguration detection. Novel approaches that employ LLMs in their pipeline rely on API-based, general-purpose, and mainly commercial models. Thus, they pose security challenges, have inconsistent classification performance, and can be costly. In this paper, we propose GenKubeSec, a comprehensive and adaptive, LLM-based method, which, in addition to detecting a wide variety of KCF misconfigurations, also identifies the exact location of the misconfigurations and provides detailed reasoning about them, along with suggested remediation. When empirically compared with three industry-standard RB tools, GenKubeSec achieved equivalent precision (0.990) and superior recall (0.999). When a random sample of KCFs was examined by a Kubernetes security expert, GenKubeSec's explanations as to misconfiguration localization, reasoning and remediation were 100% correct, informative and useful. To facilitate further advancements in this domain, we share the unique dataset we collected, a unified misconfiguration index we developed for label standardization, our experimentation code, and GenKubeSec itself as an open-source tool.

[70]  arXiv:2405.20003 (cross-list from cs.LG) [pdf, other]
Title: Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Uncertainty quantification in Large Language Models (LLMs) is crucial for applications where safety and reliability are important. In particular, uncertainty can be used to improve the trustworthiness of LLMs by detecting factually incorrect model responses, commonly called hallucinations. Critically, one should seek to capture the model's semantic uncertainty, i.e., the uncertainty over the meanings of LLM outputs, rather than uncertainty over lexical or syntactic variations that do not affect answer correctness. To address this problem, we propose Kernel Language Entropy (KLE), a novel method for uncertainty estimation in white- and black-box LLMs. KLE defines positive semidefinite unit trace kernels to encode the semantic similarities of LLM outputs and quantifies uncertainty using the von Neumann entropy. It considers pairwise semantic dependencies between answers (or semantic clusters), providing more fine-grained uncertainty estimates than previous methods based on hard clustering of answers. We theoretically prove that KLE generalizes the previous state-of-the-art method called semantic entropy and empirically demonstrate that it improves uncertainty quantification performance across multiple natural language generation datasets and LLM architectures.

[71]  arXiv:2405.20101 (cross-list from cs.SD) [pdf, other]
Title: Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Most speech self-supervised learning (SSL) models are trained with a pretext task which consists in predicting missing parts of the input signal, either future segments (causal prediction) or segments masked anywhere within the input (non-causal prediction). Learned speech representations can then be efficiently transferred to downstream tasks (e.g., automatic speech or speaker recognition). In the present study, we investigate the use of a speech SSL model for speech inpainting, that is reconstructing a missing portion of a speech signal from its surrounding context, i.e., fulfilling a downstream task that is very similar to the pretext task. To that purpose, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role of a decoder. In particular, we propose two solutions to match the HuBERT output with the HiFiGAN input, by freezing one and fine-tuning the other, and vice versa. Performance of both approaches was assessed in single- and multi-speaker settings, for both informed and blind inpainting configurations (i.e., the position of the mask is known or unknown, respectively), with different objective metrics and a perceptual evaluation. Performances show that if both solutions allow to correctly reconstruct signal portions up to the size of 200ms (and even 400ms in some cases), fine-tuning the SSL encoder provides a more accurate signal reconstruction in the single-speaker setting case, while freezing it (and training the neural vocoder instead) is a better strategy when dealing with multi-speaker data.

[72]  arXiv:2405.20172 (cross-list from cs.SD) [pdf, other]
Title: Iterative Feature Boosting for Explainable Speech Emotion Recognition
Comments: Published in: 2023 International Conference on Machine Learning and Applications (ICMLA)
Journal-ref: 2023 International Conference on Machine Learning and Applications (ICMLA), Jacksonville, FL, USA, 2023, pp. 543-549
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

In speech emotion recognition (SER), using predefined features without considering their practical importance may lead to high dimensional datasets, including redundant and irrelevant information. Consequently, high-dimensional learning often results in decreasing model accuracy while increasing computational complexity. Our work underlines the importance of carefully considering and analyzing features in order to build efficient SER systems. We present a new supervised SER method based on an efficient feature engineering approach. We pay particular attention to the explainability of results to evaluate feature relevance and refine feature sets. This is performed iteratively through feature evaluation loop, using Shapley values to boost feature selection and improve overall framework performance. Our approach allows thus to balance the benefits between model performance and transparency. The proposed method outperforms human-level performance (HLP) and state-of-the-art machine learning methods in emotion recognition on the TESS dataset.

[73]  arXiv:2405.20213 (cross-list from cs.AI) [pdf, other]
Title: PostDoc: Generating Poster from a Long Multimodal Document Using Deep Submodular Optimization
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

A poster from a long input document can be considered as a one-page easy-to-read multimodal (text and images) summary presented on a nice template with good design elements. Automatic transformation of a long document into a poster is a very less studied but challenging task. It involves content summarization of the input document followed by template generation and harmonization. In this work, we propose a novel deep submodular function which can be trained on ground truth summaries to extract multimodal content from the document and explicitly ensures good coverage, diversity and alignment of text and images. Then, we use an LLM based paraphraser and propose to generate a template with various design aspects conditioned on the input content. We show the merits of our approach through extensive automated and human evaluations.

[74]  arXiv:2405.20271 (cross-list from cs.LG) [pdf, other]
Title: ETHER: Efficient Finetuning of Large-Scale Models with Hyperplane Reflections
Comments: Accepted to ICML 2024. Code available at this https URL
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Parameter-efficient finetuning (PEFT) has become ubiquitous to adapt foundation models to downstream task requirements while retaining their generalization ability. However, the amount of additionally introduced parameters and compute for successful adaptation and hyperparameter searches can explode quickly, especially when deployed at scale to serve numerous individual requests. To ensure effective, parameter-efficient, and hyperparameter-robust adaptation, we propose the ETHER transformation family, which performs Efficient fineTuning via HypErplane Reflections. By design, ETHER transformations require a minimal number of parameters, are less likely to deteriorate model performance, and exhibit robustness to hyperparameter and learning rate choices. In particular, we introduce ETHER and its relaxation ETHER+, which match or outperform existing PEFT methods with significantly fewer parameters ($\sim$$10$-$100$ times lower than LoRA or OFT) across multiple image synthesis and natural language tasks without exhaustive hyperparameter tuning. Finally, we investigate the recent emphasis on Hyperspherical Energy retention for adaptation and raise questions on its practical utility. The code is available at https://github.com/mwbini/ether.

[75]  arXiv:2405.20309 (cross-list from cs.LG) [pdf, other]
Title: Large Language Models Can Self-Improve At Web Agent Tasks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Training models to act as agents that can effectively navigate and perform actions in a complex environment, such as a web browser, has typically been challenging due to lack of training data. Large language models (LLMs) have recently demonstrated some capability to navigate novel environments as agents in a zero-shot or few-shot fashion, purely guided by natural language instructions as prompts. Recent research has also demonstrated LLMs have the capability to exceed their base performance through self-improvement, i.e. fine-tuning on data generated by the model itself. In this work, we explore the extent to which LLMs can self-improve their performance as agents in long-horizon tasks in a complex environment using the WebArena benchmark. In WebArena, an agent must autonomously navigate and perform actions on web pages to achieve a specified objective. We explore fine-tuning on three distinct synthetic training data mixtures and achieve a 31\% improvement in task completion rate over the base model on the WebArena benchmark through a self-improvement procedure. We additionally contribute novel evaluation metrics for assessing the performance, robustness, capabilities, and quality of trajectories of our fine-tuned agent models to a greater degree than simple, aggregate-level benchmark scores currently used to measure self-improvement.

[76]  arXiv:2405.20341 (cross-list from cs.LG) [pdf, other]
Title: From Zero to Hero: Cold-Start Anomaly Detection
Comments: ACL 2024. Our code is available at this https URL
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

When first deploying an anomaly detection system, e.g., to detect out-of-scope queries in chatbots, there are no observed data, making data-driven approaches ineffective. Zero-shot anomaly detection methods offer a solution to such "cold-start" cases, but unfortunately they are often not accurate enough. This paper studies the realistic but underexplored cold-start setting where an anomaly detection model is initialized using zero-shot guidance, but subsequently receives a small number of contaminated observations (namely, that may include anomalies). The goal is to make efficient use of both the zero-shot guidance and the observations. We propose ColdFusion, a method that effectively adapts the zero-shot anomaly detector to contaminated observations. To support future development of this new setting, we propose an evaluation suite consisting of evaluation protocols and metrics.

Replacements for Fri, 31 May 24

[77]  arXiv:1708.09157 (replaced) [pdf, other]
Title: Cross-lingual, Character-Level Neural Morphological Tagging
Comments: Published as a conference paper at EMNLP 2017; Fixed minor typos and cleaned up formatting
Subjects: Computation and Language (cs.CL)
[78]  arXiv:1811.01331 (replaced) [pdf, other]
Title: Elastic CRFs for Open-ontology Slot Filling
Comments: 5 pages
Subjects: Computation and Language (cs.CL)
[79]  arXiv:2112.07783 (replaced) [pdf, ps, other]
Title: Online antisemitism across platforms
Authors: Tom De Smedt
Comments: 6 pages
Journal-ref: Textgain Technical Reports 7 (2021)
Subjects: Computation and Language (cs.CL)
[80]  arXiv:2205.15744 (replaced) [pdf, other]
Title: EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning
Comments: This work is a multilingual extension of arXiv:2105.13856. This work has been accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing (DOI: 10.1109/TASLP.2024.3402064). Copyright has been transferred
Subjects: Computation and Language (cs.CL)
[81]  arXiv:2212.01032 (replaced) [pdf, other]
Title: Systematic Analysis for Pretrained Language Model Priming for Parameter-Efficient Fine-tuning
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[82]  arXiv:2305.12392 (replaced) [pdf, other]
Title: PiVe: Prompting with Iterative Verification Improving Graph-based Generative Capability of LLMs
Comments: Our code and data are at this https URL (accepted to ACL 2024 Findings)
Subjects: Computation and Language (cs.CL)
[83]  arXiv:2305.13067 (replaced) [pdf, other]
Title: Distilling Robustness into Natural Language Inference Models with Domain-Targeted Augmentation
Authors: Joe Stacey, Marek Rei
Comments: Accepted at ACL Findings 2024
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
[84]  arXiv:2306.16092 (replaced) [pdf, other]
Title: Chatlaw: A Multi-Agent Collaborative Legal Assistant with Knowledge Graph Enhanced Mixture-of-Experts Large Language Model
Subjects: Computation and Language (cs.CL)
[85]  arXiv:2309.05804 (replaced) [pdf, other]
Title: Hi Model, generating 'nice' instead of 'good' is not as bad as generating 'rice'! Towards Context and Semantic Infused Dialogue Generation Loss Function and Evaluation Metric
Subjects: Computation and Language (cs.CL)
[86]  arXiv:2309.08952 (replaced) [pdf, other]
Title: Cross-Lingual Knowledge Editing in Large Language Models
Comments: Accepted to ACL 2024 main conference
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[87]  arXiv:2310.08975 (replaced) [pdf, other]
Title: ChatKBQA: A Generate-then-Retrieve Framework for Knowledge Base Question Answering with Fine-tuned Large Language Models
Comments: Accepted by Findings of ACL 2024
Journal-ref: ACL 2024
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[88]  arXiv:2310.12942 (replaced) [pdf, other]
Title: On the Representational Capacity of Recurrent Neural Language Models
Comments: Added requirement for non-negative probabilities to definitions 2.3 and 3.1, fixed typos
Journal-ref: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7011-7034
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
[89]  arXiv:2311.04044 (replaced) [pdf, other]
Title: PrivLM-Bench: A Multi-level Privacy Evaluation Benchmark for Language Models
Comments: To appear at ACL 2024
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
[90]  arXiv:2311.15316 (replaced) [pdf, other]
Title: Sibyl: Sensible Empathetic Dialogue Generation with Visionary Commonsense Knowledge
Comments: Work in progress
Subjects: Computation and Language (cs.CL)
[91]  arXiv:2312.10160 (replaced) [pdf, other]
Title: Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning
Comments: ACL 2024 Findings
Subjects: Computation and Language (cs.CL)
[92]  arXiv:2312.17267 (replaced) [pdf, other]
Title: Enhancing Low-Resource Relation Representations through Multi-View Decoupling
Comments: Accepted to AAAI 2024
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[93]  arXiv:2401.02415 (replaced) [pdf, other]
Title: LLaMA Pro: Progressive LLaMA with Block Expansion
Comments: Accepted by ACL 2024, Main Conference
Subjects: Computation and Language (cs.CL)
[94]  arXiv:2401.06102 (replaced) [pdf, other]
Title: Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
Comments: ICML 2024 (to appear)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[95]  arXiv:2401.09967 (replaced) [pdf, other]
Title: Sketch-Guided Constrained Decoding for Boosting Blackbox Large Language Models without Logit Access
Subjects: Computation and Language (cs.CL)
[96]  arXiv:2401.10189 (replaced) [pdf, other]
Title: Chem-FINESE: Validating Fine-Grained Few-shot Entity Extraction through Text Reconstruction
Comments: 16 pages. Accepted by Findings of the Association for Computational Linguistics: EACL 2024. Code and resources are available at this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[97]  arXiv:2401.14016 (replaced) [pdf, other]
Title: Towards Uncertainty-Aware Language Agent
Comments: Our code and data are at this https URL (accepted to ACL 2024 Findings). arXiv admin note: text overlap with arXiv:2310.05915
Subjects: Computation and Language (cs.CL)
[98]  arXiv:2402.01349 (replaced) [pdf, other]
Title: Beyond the Answers: Reviewing the Rationality of Multiple Choice Question Answering for the Evaluation of Large Language Models
Comments: 17 pages, 8 figures
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[99]  arXiv:2402.03271 (replaced) [pdf, other]
Title: Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models
Comments: Update Results
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[100]  arXiv:2402.06967 (replaced) [pdf, other]
Title: Instruct Once, Chat Consistently in Multiple Rounds: An Efficient Tuning Framework for Dialogue
Comments: Accepted by ACL 2024
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[101]  arXiv:2402.10466 (replaced) [pdf, other]
Title: Large Language Models as Zero-shot Dialogue State Tracker through Function Calling
Comments: ACL 2024 Main. Code available at: this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[102]  arXiv:2402.11505 (replaced) [pdf, other]
Title: Federated Fine-tuning of Large Language Models under Heterogeneous Tasks and Client Resources
Comments: 19 pages, 13 tables, 9 figures
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[103]  arXiv:2402.12052 (replaced) [pdf, other]
Title: Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs
Comments: Accepted by ACL 2024 main conference. Repo: this https URL
Subjects: Computation and Language (cs.CL)
[104]  arXiv:2402.12174 (replaced) [pdf, other]
Title: BIDER: Bridging Knowledge Inconsistency for Efficient Retrieval-Augmented LLMs via Key Supporting Evidence
Comments: Accepted by ACL 2024 Findings
Subjects: Computation and Language (cs.CL)
[105]  arXiv:2402.12786 (replaced) [pdf, other]
Title: Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations
Comments: Accepted by ACL 2024
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[106]  arXiv:2402.13494 (replaced) [pdf, other]
Title: GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis
Comments: Accepted to ACL 2024 Main
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
[107]  arXiv:2402.14700 (replaced) [pdf, other]
Title: Unveiling Linguistic Regions in Large Language Models
Comments: Accepted by ACL 2024. Camera-Ready Version
Subjects: Computation and Language (cs.CL)
[108]  arXiv:2402.14800 (replaced) [pdf, other]
Title: Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models
Comments: Mixture-of-Experts Large Language Models, ACL2024
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[109]  arXiv:2402.14808 (replaced) [pdf, other]
Title: RelayAttention for Efficient Large Language Model Serving with Long System Prompts
Comments: accepted by the ACL 2024 main conference
Subjects: Computation and Language (cs.CL)
[110]  arXiv:2402.15159 (replaced) [pdf, other]
Title: Machine Unlearning of Pre-trained Large Language Models
Comments: ACL 2024 main. Code and data at this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
[111]  arXiv:2403.03893 (replaced) [pdf, other]
Title: From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[112]  arXiv:2403.07088 (replaced) [pdf, other]
Title: SPA: Towards A Computational Friendly Cloud-Base and On-Devices Collaboration Seq2seq Personalized Generation
Comments: 15 pages, second version of SPA(Side Plugin Adaption)
Subjects: Computation and Language (cs.CL)
[113]  arXiv:2403.09919 (replaced) [pdf, other]
Title: Recurrent Drafter for Fast Speculative Decoding in Large Language Models
Comments: 11 pages, 6 figures
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
[114]  arXiv:2403.11904 (replaced) [pdf, other]
Title: CICLe: Conformal In-Context Learning for Largescale Multi-Class Food Risk Classification
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
[115]  arXiv:2403.15112 (replaced) [pdf, other]
Title: Text clustering with LLM embeddings
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[116]  arXiv:2403.17760 (replaced) [pdf, other]
Title: Constructions Are So Difficult That Even Large Language Models Get Them Right for the Wrong Reasons
Comments: LREC-COLING 2024
Subjects: Computation and Language (cs.CL)
[117]  arXiv:2403.18350 (replaced) [pdf, other]
Title: Evaluation of Semantic Search and its Role in Retrieved-Augmented-Generation (RAG) for Arabic Language
Subjects: Computation and Language (cs.CL)
[118]  arXiv:2404.00242 (replaced) [pdf, other]
Title: DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference
Comments: Update DeFT-v2. DeFT-v1 was accepted by ICLR'24 AGI Workshop ( this https URL ). Code will be released soon
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[119]  arXiv:2404.04530 (replaced) [pdf, other]
Title: A Morphology-Based Investigation of Positional Encodings
Comments: Work in Progress
Subjects: Computation and Language (cs.CL)
[120]  arXiv:2404.12715 (replaced) [pdf, other]
Title: Ensemble Learning for Heterogeneous Large Language Models with Deep Parallel Collaboration
Comments: 16 pages, 9 figures, 9 tables
Subjects: Computation and Language (cs.CL)
[121]  arXiv:2405.11577 (replaced) [pdf, other]
Title: A Multi-Perspective Analysis of Memorization in Large Language Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[122]  arXiv:2405.15525 (replaced) [pdf, other]
Title: Sparse Matrix in Large Language Model Fine-tuning
Comments: 14 pages
Subjects: Computation and Language (cs.CL)
[123]  arXiv:2405.15924 (replaced) [pdf, other]
Title: SLIDE: A Framework Integrating Small and Large Language Models for Open-Domain Dialogues Evaluation
Comments: Accepted by ACL2024 Findings
Subjects: Computation and Language (cs.CL)
[124]  arXiv:2405.16295 (replaced) [pdf, other]
Title: Comparative Analysis of Open-Source Language Models in Summarizing Medical Text Data
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
[125]  arXiv:2405.17732 (replaced) [pdf, other]
Title: C$^{3}$Bench: A Comprehensive Classical Chinese Understanding Benchmark for Large Language Models
Subjects: Computation and Language (cs.CL)
[126]  arXiv:2405.17743 (replaced) [pdf, other]
Title: ORLM: Training Large Language Models for Optimization Modeling
Comments: Work in progress
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
[127]  arXiv:2405.17935 (replaced) [pdf, other]
Title: Tool Learning with Large Language Models: A Survey
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[128]  arXiv:2405.18719 (replaced) [pdf, other]
Title: Contextual Position Encoding: Learning to Count What's Important
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[129]  arXiv:2405.19327 (replaced) [pdf, other]
[130]  arXiv:2305.14521 (replaced) [pdf, ps, other]
Title: Few-shot Adaptation to Distribution Shifts By Mixing Source and Target Embeddings
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
[131]  arXiv:2310.07968 (replaced) [pdf, other]
Title: Think, Act, and Ask: Open-World Interactive Personalized Robot Navigation
Comments: Video URL: this https URL
Subjects: Robotics (cs.RO); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
[132]  arXiv:2310.12815 (replaced) [pdf, other]
Title: Formalizing and Benchmarking Prompt Injection Attacks and Defenses
Comments: To appear in USENIX Security Symposium 2024
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
[133]  arXiv:2402.01869 (replaced) [pdf, other]
Title: InferCept: Efficient Intercept Support for Augmented Large Language Model Inference
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
[134]  arXiv:2402.02446 (replaced) [pdf, other]
Title: LQER: Low-Rank Quantization Error Reconstruction for LLMs
Comments: Accepted at ICML2024
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
[135]  arXiv:2402.05140 (replaced) [pdf, other]
Title: Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[136]  arXiv:2402.06782 (replaced) [pdf, other]
Title: Debating with More Persuasive LLMs Leads to More Truthful Answers
Comments: For code please check: this https URL
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[137]  arXiv:2402.07865 (replaced) [pdf, other]
Title: Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models
Comments: Published at ICML 2024. 22 pages, 11 figures. Training code and models: this https URL Evaluation code: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
[138]  arXiv:2402.18496 (replaced) [pdf, other]
Title: Language Models Represent Beliefs of Self and Others
Comments: project page: this https URL
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[139]  arXiv:2403.01643 (replaced) [pdf, other]
Title: You Need to Pay Better Attention: Rethinking the Mathematics of Attention Mechanism
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
[140]  arXiv:2403.04280 (replaced) [pdf, other]
Title: A New Benchmark for Evaluating Automatic Speech Recognition in the Arabic Call Domain
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[141]  arXiv:2403.15401 (replaced) [pdf, ps, other]
Title: Large Language Model for Mental Health: A Systematic Review
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[142]  arXiv:2404.07972 (replaced) [pdf, other]
Title: OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Comments: 51 pages, 21 figures
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[143]  arXiv:2404.09385 (replaced) [pdf, other]
Title: A Large-Scale Evaluation of Speech Foundation Models
Comments: The extended journal version for SUPERB and SUPERB-SG. Published in IEEE/ACM TASLP. The Arxiv version is preferred
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Signal Processing (eess.SP)
[144]  arXiv:2405.09983 (replaced) [pdf, other]
Title: Zero-Shot Hierarchical Classification on the Common Procurement Vocabulary Taxonomy
Comments: Full-length version of the short paper accepted at COMPSAC 2024
Journal-ref: COMPSAC 2024
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
[145]  arXiv:2405.12107 (replaced) [pdf, other]
Title: Imp: Highly Capable Large Multimodal Models for Mobile Devices
Comments: fix some typos and correct a few number in the tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
[146]  arXiv:2405.15143 (replaced) [pdf, other]
Title: Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation Models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[147]  arXiv:2405.15973 (replaced) [pdf, other]
Title: Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
Comments: 15 pages, 8 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
[148]  arXiv:2405.16510 (replaced) [pdf, other]
Title: Meta-Task Planning for Language Agents
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
[149]  arXiv:2405.17503 (replaced) [pdf, other]
Title: Code Repair with LLMs gives an Exploration-Exploitation Tradeoff
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
[ total of 149 entries: 1-149 ]
[ showing up to 2000 entries per page: fewer | more ]

Disable MathJax (What is MathJax?)

Links to: arXiv, form interface, find, cs, recent, 2405, contact, help  (Access key information)