We gratefully acknowledge support from
the Simons Foundation and member institutions.

Information Retrieval

New submissions

[ total of 12 entries: 1-12 ]
[ showing up to 2000 entries per page: fewer | more ]

New submissions for Mon, 13 May 24

[1]  arXiv:2405.06129 [pdf, ps, other]
Title: Narrative to Trajectory (N2T+): Extracting Routes of Life or Death from Human Trafficking Text Corpora
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Climate change and political unrest in certain regions of the world are imposing extreme hardship on many communities and are forcing millions of vulnerable populations to abandon their homelands and seek refuge in safer lands. As international laws are not fully set to deal with the migration crisis, people are relying on networks of exploiting smugglers to escape the devastation in order to live in stability. During the smuggling journey, migrants can become victims of human trafficking if they fail to pay the smuggler and may be forced into coerced labor. Government agencies and anti-trafficking organizations try to identify the trafficking routes based on stories of survivors in order to gain knowledge and help prevent such crimes. In this paper, we propose a system called Narrative to Trajectory (N2T+), which extracts trajectories of trafficking routes. N2T+ uses Data Science and Natural Language Processing techniques to analyze trafficking narratives, automatically extract relevant location names, disambiguate possible name ambiguities, and plot the trafficking route on a map. In a comparative evaluation we show that the proposed multi-dimensional approach offers significantly higher geolocation detection than other state of the art techniques.

[2]  arXiv:2405.06130 [pdf, ps, other]
Title: Creating Geospatial Trajectories from Human Trafficking Text Corpora
Subjects: Information Retrieval (cs.IR)

Human trafficking is a crime that affects the lives of millions of people across the globe. Traffickers exploit the victims through forced labor, involuntary sex, or organ harvesting. Migrant smuggling could also be seen as a form of human trafficking when the migrant fails to pay the smuggler and is forced into coerced activities. Several news agencies and anti-trafficking organizations have reported trafficking survivor stories that include the names of locations visited along the trafficking route. Identifying such routes can provide knowledge that is essential to preventing such heinous crimes. In this paper we propose a Narrative to Trajectory (N2T) information extraction system that analyzes reported narratives, extracts relevant information through the use of Natural Language Processing (NLP) techniques, and applies geospatial augmentation in order to automatically plot trajectories of human trafficking routes. We evaluate N2T on human trafficking text corpora and demonstrate that our approach of utilizing data preprocessing and augmenting database techniques with NLP libraries outperforms existing geolocation detection methods.

[3]  arXiv:2405.06138 [pdf, other]
Title: Seasonality Patterns in 311-Reported Foodborne Illness Cases and Machine Learning-Identified Indications of Foodborne Illnesses from Yelp Reviews, New York City, 2022-2023
Comments: Paper counterpart to flash talk presented at 8th Annual Conference of the UConn Center for mHealth and Social Media, Advancing Public Health and Science with Artificial Intelligence
Subjects: Information Retrieval (cs.IR)

Restaurants are critical venues at which to investigate foodborne illness outbreaks due to shared sourcing, preparation, and distribution of foods. Formal channels to report illness after food consumption, such as 311, New York City's non-emergency municipal service platform, are underutilized. Given this, online social media platforms serve as abundant sources of user-generated content that provide critical insights into the needs of individuals and populations. We extracted restaurant reviews and metadata from Yelp to identify potential outbreaks of foodborne illness in connection with consuming food from restaurants. Because the prevalence of foodborne illnesses may increase in warmer months as higher temperatures breed more favorable conditions for bacterial growth, we aimed to identify seasonal patterns in foodborne illness reports from 311 and identify seasonal patterns of foodborne illness from Yelp reviews for New York City restaurants using a Hierarchical Sigmoid Attention Network (HSAN). We found no evidence of significant bivariate associations between any variables of interest. Given the inherent limitations of relying solely on user-generated data for public health insights, it is imperative to complement these sources with other data streams and insights from subject matter experts. Future investigations should involve conducting these analyses at more granular spatial and temporal scales to explore the presence of such differences or associations.

[4]  arXiv:2405.06460 [pdf, other]
Title: ProCIS: A Benchmark for Proactive Retrieval in Conversations
Subjects: Information Retrieval (cs.IR)

The field of conversational information seeking, which is rapidly gaining interest in both academia and industry, is changing how we interact with search engines through natural language interactions. Existing datasets and methods are mostly evaluating reactive conversational information seeking systems that solely provide response to every query from the user. We identify a gap in building and evaluating proactive conversational information seeking systems that can monitor a multi-party human conversation and proactively engage in the conversation at an opportune moment by retrieving useful resources and suggestions. In this paper, we introduce a large-scale dataset for proactive document retrieval that consists of over 2.8 million conversations. We conduct crowdsourcing experiments to obtain high-quality and relatively complete relevance judgments through depth-k pooling. We also collect annotations related to the parts of the conversation that are related to each document, enabling us to evaluate proactive retrieval systems. We introduce normalized proactive discounted cumulative gain (npDCG) for evaluating these systems, and further provide benchmark results for a wide range of models, including a novel model we developed for this task. We believe that the developed dataset, called ProCIS, paves the path towards developing proactive conversational information seeking systems.

Cross-lists for Mon, 13 May 24

[5]  arXiv:2405.06211 (cross-list from cs.CL) [pdf, other]
Title: A Survey on RAG Meets LLMs: Towards Retrieval-Augmented Large Language Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

As one of the most advanced techniques in AI, Retrieval-Augmented Generation (RAG) techniques can offer reliable and up-to-date external knowledge, providing huge convenience for numerous tasks. Particularly in the era of AI-generated content (AIGC), the powerful capacity of retrieval in RAG in providing additional knowledge enables retrieval-augmented generation to assist existing generative AI in producing high-quality outputs. Recently, large Language Models (LLMs) have demonstrated revolutionary abilities in language understanding and generation, while still facing inherent limitations, such as hallucinations and out-of-date internal knowledge. Given the powerful abilities of RAG in providing the latest and helpful auxiliary information, retrieval-augmented large language models have emerged to harness external and authoritative knowledge bases, rather than solely relying on the model's internal knowledge, to augment the generation quality of LLMs. In this survey, we comprehensively review existing research studies in retrieval-augmented large language models (RA-LLMs), covering three primary technical perspectives: architectures, training strategies, and applications. As the preliminary knowledge, we briefly introduce the foundations and recent advances of LLMs. Then, to illustrate the practical significance of RAG for LLMs, we categorize mainstream relevant work by application areas, detailing specifically the challenges of each and the corresponding capabilities of RA-LLMs. Finally, to deliver deeper insights, we discuss current limitations and several promising directions for future research.

[6]  arXiv:2405.06242 (cross-list from cs.CR) [pdf, other]
Title: Impedance vs. Power Side-channel Vulnerabilities: A Comparative Study
Subjects: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

In recent times, impedance side-channel analysis has emerged as a potent strategy for adversaries seeking to extract sensitive information from computing systems. It leverages variations in the intrinsic impedance of a chip's internal structure across different logic states. In this study, we conduct a comparative analysis between the newly explored impedance side channel and the well-established power side channel. Through experimental evaluation, we investigate the efficacy of these two side channels in extracting the cryptographic key from the Advanced Encryption Standard (AES) and analyze their performance. Our results indicate that impedance analysis demonstrates a higher potential for cryptographic key extraction compared to power side-channel analysis. Moreover, we identify scenarios where power side-channel analysis does not yield satisfactory results, whereas impedance analysis proves to be more robust and effective. This work not only underscores the significance of impedance side-channel analysis in enhancing cryptographic security but also emphasizes the necessity for a deeper understanding of its mechanisms and implications.

Replacements for Mon, 13 May 24

[7]  arXiv:2309.13063 (replaced) [pdf, other]
Title: Using Large Language Models to Generate, Validate, and Apply User Intent Taxonomies
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[8]  arXiv:2309.14891 (replaced) [pdf, other]
Title: RE-SORT: Removing Spurious Correlation in Multilevel Interaction for CTR Prediction
Comments: 15 pages, 7 figures
Subjects: Information Retrieval (cs.IR)
[9]  arXiv:2401.14678 (replaced) [pdf, other]
Title: Prompt-enhanced Federated Content Representation Learning for Cross-domain Recommendation
Comments: 11 pages, 3 figures, accepted by WWW 2024
Subjects: Information Retrieval (cs.IR)
[10]  arXiv:2404.14851 (replaced) [pdf, other]
Title: From Matching to Generation: A Survey on Generative Information Retrieval
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[11]  arXiv:2201.02797 (replaced) [pdf, other]
Title: A Unified Review of Deep Learning for Automated Medical Coding
Comments: ACM Computing Surveys
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
[12]  arXiv:2405.00982 (replaced) [pdf, other]
Title: On the Evaluation of Machine-Generated Reports
Comments: 12 pages, 4 figures, accepted at SIGIR 2024 as perspective paper
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
[ total of 12 entries: 1-12 ]
[ showing up to 2000 entries per page: fewer | more ]

Disable MathJax (What is MathJax?)

Links to: arXiv, form interface, find, cs, recent, 2405, contact, help  (Access key information)