ACTION selected a group of 8 students (high school, undergraduate, and graduate) for an 8-week research experience at UCSB. 2025 is the second year of ACTION's summer program. 

Student researchers joined with participants from other undergraduate summer programs, including McNair Scholars and groups managed by the Center for Engineering and Science Partnerships (CSEP). All students had opportunities to attend skills development seminars and to network with other programs. 

After an 8-week experience, ACTION was able to show four posters at the 2025 Undergraduate Research Symposium. Three of the projects were mentored by a postdoctoral scholar. The fourth was mentored by an ACTION faculty member. 

ACTION UCSB Participants

COVERTLENS: Uncovering Concealed Harmful Behaviors in LLMs via Activation Analysis

Megan Gross, San Jose State University
Arjun Chopra, Cal Poly SLO
Saqif Ayaan Sudheer, UC Santa Barbara
Brian Lee, University of Chicago
Yigitcan Kaya (Mentor), UC Santa Barbara

Large language models (LLMs) can reliably encode their outputs on demand (e.g., in Morse code), which is exploited by adversaries to trigger harmful behavior (e.g., leaking sensitive information, including passwords) while concealing the model’s responses. Our research develops tools to detect and prevent such attacks by analyzing LLMs’ internal behavior. We explore four approaches drawn from recent research.

First, we use steering vectors to capture internal activation patterns that distinguish benign from malicious requests. These patterns support both detection and intervention by steering the model toward safe behavior. Applying checks across early and late LLM layers effectively exposes various encoding behaviors but struggles to generalize to unseen malicious requests.

Second, we apply sparse autoencoders (SAEs) to decompose LLM activations into prominent features (e.g., one that activates for “Golden Gate Bridge”). We identified useful features for both malicious and benign prompts, but our results highlight challenges in isolating features consistent with malicious prompts and steering the model away from them.

Third, inspired by recent work, we prompt another LLM to explain the intent behind user inputs and extract its internal representations to train a classifier on a prompt dataset. We achieve 100% accuracy for regular malicious prompts and 85% for more ambiguous ones, showing a promising direction.

Finally, we explore algorithmically decoding encoded LLM inputs or outputs. We prompt another LLM to analyze adversarial requests and generate a decoder script (e.g., translating Morse to plain text). Once decoded, standard safety filters can more reliably block malicious requests.

 

Poster group 1: Megan Gross, Arjun Chopra, Brian Lee

REDACTBENCH: A Formal Framework for LLM-based Confidential Information Redaction

Saqif Ayaan Sudheer, UC Santa Barbara
Aditya Singh, Silver Creek High School
Pedro Gonzalez, CSU San Bernardino
Yigitcan Kaya (Mentor), UC Santa Barbara

Protecting sensitive information is essential for governments, companies, and institutions. When documents contain both public and private content, redactions ensure only appropriate information is shared. For example, U.S. agencies must redact state secrets before releasing documents under the Freedom of Information Act (FOIA), and companies often redact internal documents to enable safe collaboration.

Today, redactions are typically done by experts interpreting complex, domain-specific rules, an expensive and error-prone process. These rules vary widely across domains, from financial reports to energy data and military procurement. Our work explores how large language models (LLMs) can automate this task in challenging real-world settings, going beyond simple removal of personal data.

To tackle this challenge, we first define the task of LLM-based redaction. We focus on scenarios where certain types of information, such as revenue figures or product designs, are marked as confidential based on domain-specific guidelines. Given such a guideline and a document, we prompt an LLM (the Redactor) to remove only the confidential content while leaving the rest untouched. To assess redaction quality, we introduce two additional components: an Inferer that tries to recover redacted content (indicating potential leaks), and a Utility model that checks whether the remaining content remains informative. Together, these components treat redaction as a careful balance between confidentiality and utility.

To evaluate our framework systematically, we develop a synthetic document pipeline that takes custom document or information types and generates realistic documents and guidelines, enabling controlled, end-to-end evaluations of our framework.

 

Poster group 2: Saqif Ayaan and Pedro Gonzalez

HALLUCINATOR: Measuring the Susceptibility of LLMs to Falsehoods via Adversarial Prompting

Anoushka Sawant, San Jose State University
Yigitcan Kaya (Mentor), UC Santa Barbara

Large language models (LLMs) are increasingly used to generate answers in settings where accuracy and safety matter, from education to customer support. Yet these models can be manipulated to produce harmful or misleading content. In this work, we explore how algorithmically generated prompts can push LLMs to output conspiracy theories, falsehoods, and biased statements. For example, by appending a carefully crafted suffix, using algorithms such as Greedy Coordinate Gradient (GCG), to a question like “Tell me about Barack Obama,” we can cause a model to respond with conspiracy theories such as “Obama was not born in the U.S.” or “Obama is the first Muslim president.”

We use such adversarial prompts to study how easily different LLMs can be made to produce misinformation. Our method provides a systematic way to measure a model’s vulnerability: the shorter or simpler the attack needed to induce a falsehood, the more susceptible the model is. Using this approach, we uncover patterns in how models trained by different organizations, and in different countries, respond to prompts about topics like race, gender, sexuality, and politics. Some models are significantly more resistant, while others quickly give in to false or harmful narratives.

Our findings show that adversarial prompting can serve as a powerful tool for probing hidden weaknesses in LLMs, revealing not just accidental errors but systematic vulnerabilities to misinformation and bias. These insights can help developers design safer, more trustworthy AI systems.

 

Poster group 3: Anoushka Sawant

Prioritized League Self-Play for Dots-and-Boxes

Erik Feng, UC Santa Barbara
João P. Hespanha (Mentor), UC Santa Barbara

We study reinforcement learning for the two‑player game Dots‑and‑Boxes using self‑play with Deep Q‑Networks (DQNs). Our model operates on the OpenSpiel environment with a multi‑layer perceptron and legal‑action masking, and is trained with Double‑DQN targets, target‑network synchronization, and a ring replay buffer. To make progress measurable across training, we construct a “league” of frozen snapshots and evaluate head‑to‑head performance using Elo ratings computed from round‑robin matches among snapshots. We compare against two baselines: (i) a uniform‑random policy and (ii) a simple heuristic (“safe‑greedy”) that captures boxes when available and otherwise avoids creating immediate captures for the opponent. We also outline a prioritized league schedule, adapted from prioritized fictitious self‑play, where the opponent for each episode is sampled from the league with a probability that favors near‑even matchups, aiming to stabilize learning and reduce overfitting to any single snapshot. We report win rates versus the baselines and Elo trajectories over training on a 5-by-5 board. This setup provides a compact, reproducible benchmark for multi‑agent credit assignment and opponent sampling, and it supports ablations over network size, exploration, and league prioritization that we will present in the poster..

 

Photo of Poster 4: Erik Feng

The 8-week research experience concluded with a debrief over a group lunch and a promise to run again in 2026. 

If you are interested in interning with the ACTION AI Institute, be sure to check our Education & Outreach page for an application starting in November 2025. 

 

Photo from final lunch and debrief