Yuling Gu
yuling.gu@nyu.edu
NOTE: my allenai email is no longer in use
UPDATES
Upcoming/current activities:
- September 2025:
Starting my PhD studies! :)
Highlights of past activities:
- May 2025: Attended NAACL 2025 to present OLMES!
- December 2024: Submitted PhD applications for Fall 2025 admission. Attended NeurIPS 2024. Delighted to be part of TÜLU 3 and OLMo 2 releases, working on evaluations!
- August 2024: Attended ACL 2024, check out the following papers that I'm part of:
(1) "OLMo: Accelerating the Science of Language Models"
(2) "Digital Socrates: Evaluating LLMs through Explanation Critiques"
(3) "PROC2PDDL: Open-Domain Planning Representations from Texts"
- December 2023: Attended EMNLP 2023 to share our work "What Makes it Ok to Set a Fire? Iterative Self-distillation of Contexts and Rationales for Disambiguating Defeasible Social and Moral Situations" & "Robust Tooling and New Resources for Large Language Model Evaluation via Catwalk"
- July 2023: Presented our work "Do language models have coherent mental models of everyday things?" in person at ACL 2023!
- May 2023: Our paper "Do language models have coherent mental models of everyday things?" got accepted to ACL 2023!
- May 2023: Our paper "Can AI language models replace human participants?" has been published in Trends in Cognitive Sciences!
- Late November - Early December 2022: Attended NeurIPS and EMNLP 2022! Virtually :)
- October 2022: Our paper "One Venue, Two Conferences: The Separation of Chinese and American Citation Networks" got accepted to the AI Cultures Workshop at NeurIPS 2022!
- October 2022: Our paper on "DREAM-FLUTE" got accepted to the Figurative Language Processing Workshop at EMNLP 2022!
- August 2022: We developed "DREAM-FLUTE" during a 3-day Hackathon at AI2 and it achieved (joint) first place for the Figurative Language Understanding Shared Task at EMNLP 2022!
- July 2022: Presented our work in person at NAACL 2022!
- April 2022: Our paper "DREAM: Improving Situational QA by First Elaborating the Situation" got accepted to NAACL 2022!
- April 2022: Joined the Aristo team at Allen Institute for AI as a Predoctoral Young Investigator!
- March 2022: Graduated from UW with a perfect GPA!
- Summer 2021: Research Intern on the Aristo team at Allen Institute for AI.
- Early August 2021: Presented (virtually) at the Unimplicit workshop (at ACL-IJCNLP 2021).
- Late October 2020: Presented (virtually) at Interspeech 2020.
- Late September 2020: Joined UW to begin my graduate studies!
- July 2020: My paper on Singaporean children's speech got accepted at Interspeech 2020!
- May 2020: Graduated summa cum laude from NYU!
- Early December 2019: San Diego, California for 178th Meeting of the Acoustical Society of America (2 Poster Presentations)
- Late October 2019: NYU CAS alumni-student debate. The Motion: "The Benefits of the Development of Artificial Intelligence Outweigh the Harms."
- Early October 2019: Orlando, FL for the Grace Hopper Celebration (1 of only 4 undergraduate representatives from Courant Institute of Mathematical Sciences)
- Late July 2019: Florence, Italy for ACL 2019 (Poster presentation)
Other interesting things:
- My undergraduate honors thesis advisor at NYU, Prof. Ernest Davis, published Rebooting AI: Building Artificial Intelligence We can Trust. Check it out!
RESEARCH
Projects
- Excited about pushing machine intelligence toward real-world impact, starting a new project soon ...
- My work has exposed fundamental gaps in AI's reasoning: from DREAM on situational understanding, to mental models of everyday things, to SimpleToM on Theory of Mind. I also developed Digital Socrates, a novel framework for automatically and systematically evaluating reasoning chains in large language models by applying Socratic questioning principles. Driving transparency and accessibility in AI research, I contributed to major open-source advances like OLMo, OLMo 2, TÜLU 3, OLMoE, and OLMES. I led the development of OLMES: A Standard for Language Model Evaluations, a fully documented, practical, open standard for reproducible language model evaluations that is actively used to advance various projects at Ai2 and beyond, including research on consistent ranking of models, scaling laws, and building newer open-source models.
- When people answer questions about a specific situation, cognitive science suggests that they form a mental picture of that situation before answering. We train a new model, DREAM, to build such scene elaborations in a dataset-neutral way. We then demonstrate that using DREAM's scene elaborations as additional context improves answer accuracy across different downstream QA systems and on different end-tasks. Our approach is question-agnostic and leaves end-task QA models unchanged, making it easily portable to other QA models and suggesting exciting opportunities for further improving and exploiting scene elaborations to better solve new problems.
- Supervised by Prof. Ernest Davis. Use various classifiers, word and sentence representations, as well as linguistics theories to automatically detect temporal relations implicitly conveyed in texts (different levels: from single event description to multiple sentences); Analyze the performance of Transformer-based state-of-the-art models in detecting implicit meaning from a psycholinguistics perspective.
- Supervised by Prof. Ralph Grishman. Investigate the contribution of information from dependency parsing, Named Entity (NE) tagging, and Part Of Speech (POS) tagging in event extraction, beyond a baseline that uses pretrained BERT sentence representation.
- Supervised by Prof. Ralph Grishman. Experiment with different classifiers, together with grammatical linguistics insights, to automatically distinguish prepositional phrases as adjuncts or arguments (achieved 88% accurate prediction of the adjunct/argument distinction using linguistics theories alone).
- Supervised by Prof. Adam Meyers. Refine the English Termolator's distributional metrics; Further develop the Chinese Termolator; Integrate the past 5 years' developments to unify the two systems (my contributions: https://github.com/yulinggu-cs/ChineseTermolator2020, integrated into the full system in July 2020).
- Supervised by Prof. Ernest Davis. Look into English-Chinese Machine Translation failures; Design Winograd schemas and compile pronoun disambiguation problems; Toward Annotating Commonsense Inferences in Text (TACIT) annotation.
- Characterizing Singaporean, American, and British English acoustic and pronunciation patterns in children's speech using unsupervised clustering (supervised by Dr. Nancy F. Chen); Chinese tone perception in Singaporean and native Chinese Mandarin speakers; Investigating tone in whispered Mandarin (jointly supervised by Dr. Boon Pang Lim and Dr. Nancy F. Chen).
PhD Student: NYU’s Center for Data Science and NYU Langone Medical Center (September 2025 - present)
Predoctoral Young Investigator: Aristo team, Allen Institute for Artificial Intelligence (April 2022 - July 2025)
Research Intern: Aristo team, Allen Institute for Artificial Intelligence (Summer & Fall 2021)
Research assistant: Courant Institute of Mathematical Sciences, NYU (Summer 2018 - Spring 2020)
Honors Thesis Project: Detecting Event Duration in Text (Spring 2019 - Spring 2020)
Can dependency parsing help event extraction in text? (Fall 2019 - Spring 2020)
Integrated Customization Environment for Information Extraction (ICE) (Summer 2019)
Termolator: A terminology extraction system (Summer 2018 - Fall 2018)
Independent study project: Commonsense Reasoning (Summer 2018)
Research Intern: Human Language Technology Group (Winter 2014 - Spring 2021)
Institute for Infocomm Research, A*STAR, Singapore, Singapore
Other work experience
Courant Institute of Mathematical Sciences (CIMS), NYU
Grader for Artificial Intelligence course under Professor Ernest Davis (Fall 2019)
Grader for Basic Algorithms course under Professor Victor Shoup (Spring 2019)
PUBLICATIONS
- For the most updated list of papers, please refer to my Google Scholar page!
- David Heineman, Valentin Hofmann, Ian Magnusson, Yuling Gu, Noah A Smith, Hannaneh Hajishirzi, Kyle Lo and Jesse Dodge (2025). “Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation”. arXiv. [Paper] [Code]
- Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg, David Wadden, Christopher Wilhelm, Michael Wilson, Luke Zettlemoyer, Ali Farhadi, Noah A. Smith and Hannaneh Hajishirzi (2025). “2 OLMo 2 Furious”. COLM 2025. [Paper] (see paper for various links to code and data)
- Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi and Hannaneh Hajishirzi (2025). “TÜLU 3: Pushing Frontiers in Open Language Model Post-Training”. COLM 2025. [Paper] [Code] [Eval]
- Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A Smith, Pang Wei Koh, Amanpreet Singh and Hannaneh Hajishirzi (2025). “OLMoE: Open Mixture-of-Experts Language Models”. ICLR 2025. [Paper] [Code]
- Yuling Gu, Oyvind Tafjord, Bailey Kuehl, Dany Haddad, Jesse Dodge and Hannaneh Hajishirzi (2025). “OLMES: A Standard for Language Model Evaluations”. Findings of NAACL 2025. [Paper] [Code]
- Yuling Gu, Oyvind Tafjord, Hyunwoo Kim, Jared Moore, Ronan Le Bras, Peter Clark and Yejin Choi (2024). “SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs”. arXiv. [arXiv] [Dataset]
- Wenlong Zhao, Debanjan Mondal, Niket Tandon, Danica Dillion, Kurt Gray and Yuling Gu (2024). “WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models”. LREC-COLING 2024. [Paper] [Dataset & Code]
- Tianyi Zhang, Li Zhang, Zhaoyi Hou, Ziyu Wang, Yuling Gu, Peter Clark, Chris Callison-Burch and Niket Tandon (2024). “PROC2PDDL: Open-Domain Planning Representations from Texts”. The 2nd Workshop on Natural Language Reasoning and Structured Explanations, ACL 2024. [Paper] [Dataset & Code]
- Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith and Hannaneh Hajishirzi (2024). “OLMo: Accelerating the Science of Language Models”. ACL 2024. [Paper] [Website]
- Yuling Gu, Oyvind Tafjord and Peter Clark (2024). “Digital Socrates: Evaluating LLMs through Explanation Critiques”. ACL 2024. [Paper] [Dataset & Model]
- Kavel Rao, Liwei Jiang, Valentina Pyatkin, Yuling Gu, Niket Tandon, Nouha Dziri, Faeze Brahman and Yejin Choi (2023). “What Makes it Ok to Set a Fire? Iterative Self-distillation of Contexts and Rationales for Disambiguating Defeasible Social and Moral Situations”. Findings of EMNLP 2023. [Paper] [Dataset]
- Yuling Gu, Bhavana Dalvi Mishra and Peter Clark (2023). “Do language models have coherent mental models of everyday things?”. ACL 2023. [Paper] [Dataset & Code]
- Danica Dillion, Niket Tandon, Yuling Gu and Kurt Gray (2023). “Can AI language models replace human participants?”. Trends in Cognitive Sciences. [Paper]
- Yuling Gu (2022). “Measure More, Question More: Experimental Studies on Transformer-based Language Models and Complement Coercion”. arXiv. [arXiv]
- Bingchen Zhao*, Yuling Gu*, Jessica Zosa Forde and Naomi Saphra (2022). “One Venue, Two Conferences: The Separation of Chinese and American Citation Networks”. AI Cultures Workshop at NeurIPS 2022. [arXiv]
- Yuling Gu, Yao Fu, Valentina Pyatkin, Ian Magnusson, Bhavana Dalvi and Peter Clark (2022). “Just-DREAM-about-it: Figurative Language Understanding with DREAM-FLUTE”. The Third Workshop on Figurative Language Processing, EMNLP 2022. [Paper] [Dataset & Model]
- Yuling Gu, Bhavana Dalvi Mishra and Peter Clark (2022). “DREAM: Improving Situational QA by First Elaborating the Situation”. NAACL 2022. [Paper] [Dataset & Model]
- Yuling Gu and Nancy F. Chen (2022). “Large-Scale Acoustic Characterization of Singaporean Children's English Pronunciation”. arXiv. [arXiv]
- Yuling Gu (2021). “Transformer-based language models and complement coercion: Experimental studies". The First Workshop on Understanding Implicit and Underspecified Language at ACL-IJCNLP 2021. [Underline link] [Poster]
- Yuling Gu and Nancy F. Chen (2020). “Characterization of Singaporean Children's English: Comparisons to American and British Counterparts using Archetypal Analysis”. Interspeech 2020. [Paper]
- Yuling Gu and Nancy F. Chen (2019). “Large-scale acoustic characterization of mid-low vowels across American, British, and Singaporean children". The Journal of the Acoustical Society of America, Volume 146, Issue 4. 178th Meeting of the Acoustical Society of America. [Abstract] [Poster]
- Yuling Gu and Nancy F. Chen (2019). “Acoustic characterization of Singaporean children’s English with American and British counterparts: A case study on approximants". The Journal of the Acoustical Society of America, Volume 146, Issue 4. 178th Meeting of the Acoustical Society of America. [Abstract] [Poster]
- Yuling Gu and Nancy F. Chen (2019). “Acoustic Characterization of Singaporean Children’s English: Comparisons to American and British Counterparts”. Widening NLP workshop at ACL 2019. [Abstract]
- Yuling Gu, Boon Pang Lim and Nancy F. Chen (2016). “Perception of tone in whispered Mandarin sentences: the case for Singapore Mandarin”. Interspeech 2016. [Paper]
PERSONAL
Always excited to travel, explore new things and reach out for the skies!