Evaluating progress of LLMs on scientific problem-solving
AI

Programmatic and model-based evaluations: Tasks in CURIE are varied and have ground-truth annotations in heterogeneous forms, e.g., as JSON, LaTeX equations, YAML files, or free-form text. Evaluating free-form generation is challenging because answers are often descriptive, and even when a format is specified, as in most of our cases, the response to each […]

A novel benchmark for evaluating cross-lingual knowledge transfer in LLMs

Data creation and verification: To construct ECLeKTic, we started by selecting Wikipedia articles that exist in only a single language, for each of 12 languages: English, French, German, Hebrew, Hindi, Indonesian, Italian, Japanese, Korean, Mandarin Chinese, Portuguese, and Spanish. These pages are often based on topics most salient to speakers of that language, but they

The evolution of graph learning

Graph algorithms (the pre–deep learning era): Initial work in graph analysis often focused on developing methods to better understand the structure of graphs. These methods aimed to uncover hidden patterns, properties, and relationships within graphs (e.g., community structures or centrality within a network) and were concerned with gaining insights into the graph’s overall organization and meaning.
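
As a small illustration of the kind of structural measure mentioned above, here is a minimal sketch of degree centrality, one of the simplest classical centrality scores. The toy edge list, the node names, and the helper `degree_centrality` are hypothetical and chosen only for this example; the post itself does not specify any particular algorithm or implementation.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return {node: degree / (n - 1)} for an undirected simple graph.

    `edges` is an iterable of (u, v) pairs. This helper is illustrative
    only and is not taken from the original post.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops in this simple sketch
        neighbors[u].add(v)
        neighbors[v].add(u)

    n = len(neighbors)
    if n <= 1:
        return {node: 0.0 for node in neighbors}
    # Normalize by the maximum possible degree in a simple graph, n - 1.
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical toy network: "a" is connected to three of the four other nodes.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("d", "e")]
scores = degree_centrality(edges)
most_central = max(scores, key=scores.get)
```

With this toy graph, `most_central` is the node with the most neighbors. Other classical measures (betweenness, closeness, eigenvector centrality) capture different notions of importance and would need more machinery than this sketch.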

A 100-AV Highway Deployment – The Berkeley Artificial Intelligence Research Blog
AI

We deployed 100 reinforcement learning (RL)-controlled cars into rush-hour highway traffic to smooth congestion and reduce fuel consumption for everyone. Our goal is to tackle “stop-and-go” waves, those frustrating slowdowns and speedups that usually have no clear cause but lead to congestion and significant energy waste. To train efficient

Load balancing with random job arrivals

Cluster management systems, such as Google’s Borg, run hundreds of thousands of jobs across tens of thousands of machines with the goal of achieving high utilization via effective load balancing, efficient task placement, and machine sharing. Load balancing is the process of distributing network traffic or computational workloads across multiple servers or computing resources, and

Loss of Pulse Detection on the Google Pixel Watch 3

Acknowledgements: The research described here is joint work across Google Research, Google Health, Google DeepMind, and partnering teams, including Consumer Health Research, Personal Safety, quality, regulatory, and clinical operations. The following researchers contributed to this work: Kamal Shah, Anran Wang, Yiwen Chen, Jitender Munjal, Sumeet Chhabra, Anthony Stange, Enxun Wei, Tuan Phan, Tracy Giest, Beszel

Free Local RAG Scraper for Custom GPTs and Assistants • AI Blog
AI

This web scraper runs entirely in your browser and is perfect for creating training data for AI models. It works by reading the website’s sitemap.xml file, making it particularly well-suited for modern platforms like Squarespace and Shopify that automatically generate sitemaps. The scraper preserves the structure of your content, including headings, paragraphs, lists, and tables,

Generating synthetic data with differentially private LLM inference

Due to challenges in generating text while maintaining DP and computational efficiency, prior work focused on generating a small number of data points (<10) to be used for in-context learning. We show that it’s possible to generate two to three orders of magnitude more data while preserving quality and privacy by solving issues related to

Advancing AMIE for longitudinal disease management

A two-agent architecture for enhanced reasoning: Our work addresses this challenge with a novel approach based on the interplay of two LLM-driven agents, which resembles how human clinicians tackle management problems. The Dialogue Agent is user-facing and equipped to respond rapidly based on its current understanding of the patient. This agent handles the
