
NASA-IBM Collaboration Develops INDUS Large Language Models for Advanced Science Research



Five orange stars connected in a V-like shape with blue lines, like a diagram of the constellation of Indus. Each of the stars is labeled with one of the NASA Science Mission Directorate divisions: astrophysics, Earth science, heliophysics, planetary science, and biological and physical sciences.
Named for the southern sky constellation, INDUS (stylized in all caps) is a comprehensive suite of large language models supporting five science domains.
NASA

By Derek Koehl

Collaborations with private, non-federal partners through Space Act Agreements are a key component in the work done by NASA’s Interagency Implementation and Advanced Concepts Team (IMPACT). A collaboration with International Business Machines (IBM) has produced INDUS, a comprehensive suite of large language models (LLMs) tailored for the domains of Earth science, biological and physical sciences, heliophysics, planetary sciences, and astrophysics and trained using curated scientific corpora drawn from diverse data sources.

INDUS contains two types of models: encoders and sentence transformers. Encoders convert natural language text into numeric representations that can be processed by the LLM. The INDUS encoders were trained on a corpus of 60 billion tokens encompassing astrophysics, planetary science, Earth science, heliophysics, and biological and physical sciences data. The custom tokenizer developed by the IMPACT-IBM collaborative team improves on generic tokenizers by recognizing scientific terms like “biomarkers” and “phosphorylated.” Over half of the 50,000-word vocabulary contained in INDUS is unique to the specific scientific domains used for its training. The INDUS encoder models were used to fine-tune the sentence transformer models on approximately 268 million text pairs, including titles/abstracts and questions/answers.
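To see why a domain vocabulary matters, consider how a subword tokenizer splits a scientific term it has not stored whole. The sketch below is a toy, not the INDUS tokenizer: the greedy longest-match splitting rule, the "##" continuation marker, and both vocabularies are assumptions made for illustration, since the article does not describe the tokenizer's algorithm.

```python
def greedy_subword_tokenize(word, vocab):
    """Split `word` into the longest matching vocabulary pieces, left to right.

    Pieces that continue a word carry a leading "##" (a WordPiece-style
    convention, assumed here for illustration only).
    Returns ["[UNK]"] if some part of the word cannot be covered.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabularies: the "generic" one only knows fragments,
# while the "domain" one also stores the scientific terms whole.
GENERIC_VOCAB = {"phosph", "##ory", "##lated", "bio", "##mark", "##ers"}
DOMAIN_VOCAB = GENERIC_VOCAB | {"phosphorylated", "biomarkers"}

for word in ("phosphorylated", "biomarkers"):
    print(word)
    print("  generic:", greedy_subword_tokenize(word, GENERIC_VOCAB))
    print("  domain: ", greedy_subword_tokenize(word, DOMAIN_VOCAB))
```

With the generic vocabulary each term breaks into three fragments; with the domain vocabulary each stays a single token, so the encoder can learn one representation for the whole term.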

By providing INDUS with domain-specific vocabulary, the IMPACT-IBM team achieved superior performance over open, non-domain-specific LLMs on a benchmark for biomedical tasks, a scientific question-answering benchmark, and Earth science entity recognition tests. By designing for diverse linguistic tasks and retrieval-augmented generation, INDUS is able to process researcher questions, retrieve relevant documents, and generate answers to the questions. For latency-sensitive applications, the team developed smaller, faster versions of both the encoder and sentence transformer models.
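The retrieval step of retrieval-augmented generation can be sketched as embedding the question and every passage, then ranking passages by cosine similarity. In the sketch below, `embed` is a toy bag-of-words stand-in for a sentence transformer, and the passages and question are invented examples; none of this is the INDUS code or API.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence-transformer: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, passages, top_k=2):
    """Return the `top_k` passages most similar to the question."""
    query_vector = embed(question)
    scored = [(cosine_similarity(query_vector, embed(p)), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_k]]


# Hypothetical mini-corpus spanning three of the science domains.
PASSAGES = [
    "solar wind speed measured by heliophysics missions",
    "sea surface temperature trends in earth science data",
    "protein phosphorylation in spaceflight biology experiments",
]

top = retrieve("what are sea surface temperature trends", PASSAGES, top_k=1)
print(top)
```

In a real pipeline the retrieved passages would then be passed to a generative model as context for answering the question; a learned sentence embedding replaces the word counts used here.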

Validation tests demonstrate that INDUS excels in retrieving relevant passages from the science corpora in response to a NASA-curated test set of about 400 questions. IBM researcher Bishwaranjan Bhattacharjee commented on the overall approach: “We achieved superior performance by not only having a custom vocabulary but also a large specialized corpus for training the encoder model and a good training strategy. For the smaller, faster versions, we used neural architecture search to obtain a model architecture and knowledge distillation to train it with supervision of the larger model.”
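Knowledge distillation, as mentioned in the quote, trains a small "student" model to match the outputs of a larger "teacher." The sketch below shows the classic soft-target form of the loss: a KL divergence between temperature-softened output distributions. This formulation, the temperature value, and the example logits are assumptions for illustration; the article does not say which distillation objective the team used.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    A higher temperature flattens both distributions so the student also
    learns from the teacher's relative preferences among unlikely classes.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )
    return kl * temperature ** 2


# Hypothetical logits for a three-class output.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

The loss is zero when the student reproduces the teacher exactly and grows as their predictions diverge, which is the supervision signal the smaller model is trained to minimize.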

NASA Chief Scientist Kate Calvin gives remarks in a NASA employee town hall on how the agency is using and developing Artificial Intelligence (AI) tools to advance missions and research, Wednesday, May 22, 2024, at the NASA Headquarters Mary W. Jackson Building in Washington. The INDUS suite of models will help facilitate the agency’s AI goals.
NASA/Bill Ingalls

INDUS was also evaluated using data from NASA’s Biological and Physical Sciences (BPS) Division. Dr. Sylvain Costes, the NASA BPS project manager for Open Science, discussed the benefits of incorporating INDUS: “Integrating INDUS with the Open Science Data Repository (OSDR) Application Programming Interface (API) enabled us to develop and trial a chatbot that offers more intuitive search capabilities for navigating individual datasets. We are currently exploring ways to improve OSDR’s internal curation data system by leveraging INDUS to enhance our curation team’s productivity and reduce the manual effort required daily.”

At the NASA Goddard Earth Sciences Data and Information Services Center (GES-DISC), the INDUS model was fine-tuned using labeled data from domain experts to categorize publications specifically citing GES-DISC data into applied research areas. According to NASA principal data scientist Dr. Armin Mehrabian, this fine-tuning “significantly improves the identification and retrieval of publications that reference GES-DISC datasets, which aims to improve the user journey in finding their required datasets.” Furthermore, the INDUS encoder models are integrated into the GES-DISC knowledge graph, supporting a variety of other projects, including the dataset recommendation system and GES-DISC GraphRAG.

Kaylin Bugbee, team lead of NASA’s Science Discovery Engine (SDE), spoke to the benefit INDUS offers to existing applications: “Large language models are rapidly changing the search experience. The Science Discovery Engine, a unified, insightful search interface for all of NASA’s open science data and information, has prototyped integrating INDUS into its search engine. Initial results have shown that INDUS improved the accuracy and relevancy of the returned results.”

INDUS enhances scientific research by providing researchers with improved access to vast amounts of specialized knowledge. INDUS can understand complex scientific concepts and reveal new research directions based on existing data. It also enables researchers to extract relevant information from a wide array of sources, improving efficiency. Aligned with NASA and IBM’s commitment to open and transparent artificial intelligence, the INDUS models are openly available on Hugging Face. For the benefit of the scientific community, the team has released the developed models and will release the benchmark datasets that span named entity recognition for climate change, extractive QA for Earth science, and information retrieval for multiple domains. The INDUS encoder models are adaptable for science domain applications, and the INDUS retriever models support information retrieval in RAG applications.

A paper on INDUS, “INDUS: Effective and Efficient Language Models for Scientific Applications,” is available on arxiv.org.

Learn more about the Science Discovery Engine here.

Details

Last Updated
Jun 24, 2024


  • Similar Topics

    • By NASA
      5 min read
      How NASA Science Data Defends Earth from Asteroids
      Artist’s impression of NASA’s DART mission, which collided with the asteroid Dimorphos in 2022 to test planetary defense techniques. Open science data practices help researchers identify asteroids that pose a hazard to Earth, opening the possibility for deflection should an impact threat be identified.
      NASA/Johns Hopkins APL/Steve Gribben
      The asteroid 2024 YR4 made headlines in February with the news that it had a chance of hitting Earth on Dec. 22, 2032, as determined by an analysis from NASA’s Center for Near Earth Object Studies (CNEOS) at the agency’s Jet Propulsion Laboratory in Southern California. The probability of collision peaked at over 3% on Feb. 18 — the highest ever recorded for an object of its size. This sparked concerns about the damage the asteroid might do should it hit Earth.
      New data collected in the following days lowered the probability to well under 1%, and 2024 YR4 is no longer considered a potential Earth impactor. However, the event underscored the importance of surveying asteroid populations to reveal possible threats to Earth. Sharing scientific data widely allows scientists to determine the risk posed by the near-Earth asteroid population and increases the chances of identifying future asteroid impact hazards in NASA science data.
      “The planetary defense community realizes the value of making data products available to everyone,” said James “Gerbs” Bauer, the principal investigator for NASA’s Planetary Data System Small Bodies Node at the University of Maryland in College Park, Maryland.
      How Scientists Spot Asteroids That Could Hit Earth
      Professional scientists and citizen scientists worldwide play a role in tracking asteroids. The Minor Planet Center, which is housed at the Smithsonian Astrophysical Observatory in Cambridge, Massachusetts, collects and verifies vast numbers of asteroid and comet position observations submitted from around the globe. NASA’s Small Bodies Node distributes the data from the Minor Planet Center for anyone who wants to access and use it.
      A near-Earth object (NEO) is an asteroid or comet whose orbit brings it within 120 million miles of the Sun, which means it can circulate through Earth’s orbital neighborhood. If a newly discovered object looks like it might be an NEO, information about the object appears on the Minor Planet Center’s NEO Confirmation Page. Members of the planetary science community, whether or not they are professional scientists, are encouraged to follow up on these objects to discover where they’re heading.
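The 120-million-mile figure above corresponds to the conventional NEO threshold of a perihelion (closest approach to the Sun) under 1.3 astronomical units. The short check below converts that threshold to miles and applies it; the function names and the sample perihelion values are hypothetical and do not describe any specific asteroid.

```python
KM_PER_AU = 149_597_870.7       # IAU definition of the astronomical unit
KM_PER_MILE = 1.609344          # international mile
NEO_PERIHELION_LIMIT_AU = 1.3   # conventional near-Earth object threshold


def au_to_million_miles(distance_au):
    """Convert a distance in astronomical units to millions of miles."""
    return distance_au * KM_PER_AU / KM_PER_MILE / 1e6


def is_near_earth_object(perihelion_au):
    """True if the orbit's closest approach to the Sun is inside the limit."""
    return perihelion_au < NEO_PERIHELION_LIMIT_AU


limit_million_miles = au_to_million_miles(NEO_PERIHELION_LIMIT_AU)
print(f"NEO limit: about {limit_million_miles:.1f} million miles")

# Hypothetical perihelion distances, for illustration only.
for perihelion_au in (0.9, 1.25, 2.4):
    print(perihelion_au, is_near_earth_object(perihelion_au))
```

The conversion gives roughly 120.8 million miles, which the article rounds to 120 million.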
      The asteroid 2024 YR4 as viewed on January 27, 2025. The image was taken by the Magdalena Ridge 2.4m telescope, one of the largest telescopes in NASA’s Planetary Defense network. Asteroid position information from observations such as this one is shared through the Minor Planet Center and NASA’s Small Bodies Node to help scientists pinpoint the chances of asteroids colliding with Earth.
      NASA/Magdalena Ridge 2.4m telescope/New Mexico Institute of Technology/Ryan
      When an asteroid’s trajectory looks concerning, CNEOS alerts NASA’s Planetary Defense Coordination Office at NASA Headquarters in Washington, which manages NASA’s ongoing effort to protect Earth from dangerous asteroids. NASA’s Planetary Defense Coordination Office also coordinates the International Asteroid Warning Network (IAWN), which is the worldwide collaboration of asteroid observers and modelers.
      Orbit analysis centers such as CNEOS perform finer calculations to nail down the probability of an asteroid colliding with Earth. The open nature of the data allows the community to collaborate and compare, ensuring the most accurate determinations possible.
      How NASA Discovered Risks of Asteroid 2024 YR4
      The asteroid 2024 YR4 was initially discovered by the NASA-funded ATLAS (Asteroid Terrestrial-impact Last Alert System) survey, which aims to discover potentially hazardous asteroids. Scientists studied additional data about the asteroid from different observatories funded by NASA and from other telescopes across the IAWN.
      At first, 2024 YR4 had a broad uncertainty in its future trajectory that passed over Earth. As the planetary defense community collected more observations, the range of possibilities for the asteroid’s future position on Dec. 22, 2032 clustered over Earth, raising the apparent chances of collision. However, with the addition of even more data points, the cluster of possibilities eventually moved off Earth.
      This visualization from NASA’s Center for Near Earth Object Studies shows the evolution of the risk corridor for asteroid 2024 YR4, using data from observations made up to Feb. 23, 2025. Each yellow dot represents the asteroid’s possible location on Dec. 22, 2032. As the range of possible locations narrowed, the dots at first converged on Earth, before skewing away harmlessly.
      NASA/JPL/CNEOS
      Having multiple streams of data available for analysis helps scientists quickly learn more about NEOs. This sometimes involves using data from observatories that are mainly used for astrophysics or heliophysics surveys, rather than for tracking asteroids.
      “The planetary defense community both benefits from and is beneficial to the larger planetary and astronomy related ecosystem,” said Bauer, who is also a research professor in the Department of Astronomy at the University of Maryland. “Much of the NEO survey data can also be used for searching astrophysical transients like supernova events. Likewise, astrophysical sky surveys produce data of interest to the planetary defense community.”
      How Does NASA Stop Asteroids From Hitting Earth?
      In 2022, NASA’s DART (Double Asteroid Redirection Test) mission successfully impacted the asteroid Dimorphos, shortening the time it takes to orbit its companion asteroid Didymos by 33 minutes. Didymos had no chance of hitting Earth, but the DART mission’s success means that NASA has a tested technique to consider when addressing a potential future asteroid impact threat.
      Artist’s impression of NASA’s upcoming NEO Surveyor mission, which will search for potentially hazardous near-Earth objects. The mission will follow open data practices to improve the chances of identifying dangerous asteroids.
      NASA/JPL-Caltech
      To increase the chances of discovering asteroid threats to Earth well in advance, NASA is working on a new space-based observatory, NEO Surveyor, which will be the first spacecraft specifically designed to look for asteroids and comets that pose a hazard to Earth. The mission is expected to launch in the fall of 2027, and the data it collects will be available to everyone through NASA archives.
      “Many of the NEOs that pose a risk to Earth remain to be found,” Bauer said. “An asteroid impact has a very low likelihood at any given time, but consequences could be high, and open science is an important component to being vigilant.”
      For more information about NASA’s approach to sharing science data, visit:
      https://science.nasa.gov/open-science.
      By Lauren Leese 
      Web Content Strategist for the Office of the Chief Science Data Officer 
      Details
      Last Updated Apr 10, 2025
      Related Terms: Open Science, Planetary Defense
    • By NASA
      4 min read
      Science Launching on SpaceX's 32nd Cargo Resupply Mission to the Space Station
      NASA and SpaceX are launching the company’s 32nd commercial resupply services mission to the International Space Station later this month, bringing a host of new research to the orbiting laboratory. Aboard the SpaceX Dragon spacecraft are experiments focused on vision-based navigation, spacecraft air quality, materials for drug and product manufacturing, and advancing plant growth with less reliance on photosynthesis.
      This and other research conducted aboard the space station advances future space exploration, including missions to the Moon and Mars, and provides many benefits to humanity.
      Investigations traveling to the space station include:
      Robotic spacecraft guidance
      Smartphone Video Guidance Sensor-2 (SVGS-2) uses the space station’s Astrobee robots to demonstrate using a vision-based sensor developed by NASA to control a formation flight of small satellites. Based on a previous in-space demonstration of the technology, this investigation is designed to refine the maneuvers of multiple robots and integrate the information with spacecraft systems.
      Potential benefits of this technology include improved accuracy and reliability of systems for guidance, navigation, and control that could be applied to docking crewed spacecraft in orbit and remotely operating multiple robots on the lunar or Martian surface.
      Two of the space station’s Astrobee robots are used to test a vision-based guidance system for Smartphone Video Guidance Sensor (SVGS).
      NASA
      Protection from particles
      During spaceflight, especially long-duration missions, concentrations of airborne particles must be kept within ranges safe for crew health and hardware performance. The Aerosol Monitors investigation tests three different air quality monitors in space to determine which is best suited to protect crew health and ensure mission success. The investigation also tests a device for distinguishing between smoke and dust. Aboard the space station, the presence of dust can cause false smoke alarms that require crew member response. Reducing false alarms could save valuable crew time while continuing to protect astronaut safety.
      Better materials, better drugs
      The DNA Nano Therapeutics-Mission 2 produces a special type of molecule formed by DNA-inspired, customizable building blocks known as Janus base nanomaterials. It also evaluates how well the materials reduce joint inflammation and whether they can help regenerate cartilage lost due to arthritis. These materials are less toxic, more stable, and more compatible with living tissues than current drug delivery technologies.
      Environmental influences such as gravity can affect the quality of these materials and delivery systems. In microgravity, they are larger and have greater uniformity and structural integrity. This investigation could help identify the best formulations and methods for cost-effective in-space production. These nanomaterials also could be used to create novel systems targeting therapy delivery that improves patient outcomes with fewer side effects.
      Stem cells grown along the Janus base nanomaterials (JBNs) made aboard the International Space Station.
      University of Connecticut
      Next-generation pharmaceutical nanostructures
      The newest Industrial Crystallization Cassette (ADSEP-ICC) investigation adds capabilities to an existing protein crystallization facility. The cassette can process more sample types, including tiny gold particles used in devices that detect cancer and other diseases or in targeted drug delivery systems. Microgravity makes it possible to produce larger and more uniform gold particles, which improves their use in research and real-life applications of technologies related to human health.
      Helping plants grow
      Rhodium USAFA NIGHT examines how tomato plants respond to microgravity and whether a carbon dioxide replacement can reduce how much space-grown plants depend on photosynthesis. Because photosynthesis needs light, which requires spacecraft power to generate, alternatives would reduce energy use. The investigation also examines whether using supplements increases plant growth on the space station, which has been observed in preflight testing on Earth. In future plant production facilities aboard spacecraft or on celestial bodies, supplements could come from available organic materials such as waste.
      Understanding how plants adapt to microgravity could help grow food during long-duration space missions or harsh environments on Earth.
      Hardware for the Rhodium Plant LIFE, which was the first in a series used to study how space affects plant growth.
      NASA
      Atomic clocks in space
      An ESA (European Space Agency) investigation, Atomic Clock Ensemble in Space (ACES), examines fundamental physics concepts such as Einstein’s theory of relativity using two next-generation atomic clocks operated in microgravity. Results have applications to scientific measurement studies, the search for dark matter, and fundamental physics research that relies on highly accurate atomic clocks in space. The experiment also tests a technology for synchronizing clocks worldwide using global navigation satellite networks.
      An artist’s concept shows the Atomic Clock Ensemble in Space hardware mounted on the Earth-facing side of the space station’s exterior.
      ESA
      Download high-resolution photos and videos of the research mentioned in this article.
    • By Space Force
      The discussion was part of the 40th Space Symposium, held by the Space Foundation to drive conversations on data, partnerships and innovation across the space industry.

    • By NASA
      3 min read
      NASA Science Supports Data Literacy for K-12 Students
      Data – and our ability to understand and use it – shapes nearly every aspect of our world, from decisions in our lives to the skills we need in the workplace and more. All of us, as either producers or consumers of data, will experience how it can be used to problem-solve and think critically as we navigate the world around us. For that reason, Data Science has become an increasingly essential and growing field that combines the collection, organization, analysis, interpretation, and sharing of data in virtually every area of life. As more data become openly available, our Data Science skills will be of increasing importance. And yet, there is a widening gap between what students learn in school and the skills they will need to succeed in a data-driven world. The integration of Data Science into K-12 education opens doors to higher education and high-paying careers, and it empowers learners to eventually participate in the creation of new knowledge and understanding of our world. At least 29 states have reported some level of data science implementation at the K-12 level, including standard or framework adoption, course piloting, and educator professional learning.
      In February 2025, the first-ever Data Science Education K-12: Research to Practice Conference (DS4E) took place in San Antonio, TX. A number of representatives from NASA’s Science Activation program and other NASA partners attended and presented along with over 250 educators, researchers, and school leaders from across the nation. Science Activation projects share a passion for helping people of all ages and backgrounds connect with NASA science experts, content, experiences, and learning resources, and the AEROKATS & ROVER Education Network (AREN); Place-Based Learning to Advance Connections, Education, and Stewardship (PLACES); Global Learning and Observations to Benefit the Environment (GLOBE) Mission Earth; and My NASA Data teams did just that. Their presentations at the conference included:
      “BYOD – Build or Bring Your Own Data: Developing K-12 Datasets” (PLACES)
      “Using NASA Data Resources as a Tool to Support Storytelling with Data in K-12 Education” (My NASA Data)
      “Place-Based Data Literacy: Real People, Real Places, Real Data” (AREN)
      Conference participants expressed interest in learning more about NASA assets, including data and subject matter experts. Stemming from their participation in this first DS4E, several Science Activation teams (PLACES in particular) are collaborating to potentially host regional events next year under the umbrella of this effort. This is a wonderful example of how Science Activation project teams help lead the charge in the advancement of key Science, Technology, Engineering, and Mathematics (STEM) fields, such as Data Science, to activate minds and promote a deeper understanding of our world and beyond.
      Learn more about how Science Activation connects NASA science experts, real content, and experiences with community leaders to do science in ways that activate minds and promote deeper understanding of our world and beyond: https://science.nasa.gov/learn
      Data Science Education K-12 Research to Practice Conference
      Details
      Last Updated Apr 09, 2025
      Editor: NASA Science Editorial Team
      Related Terms: Science Activation, Earth Science, Grades 5 – 8 for Educators, Grades 9-12 for Educators, Grades K – 4 for Educators, Opportunities For Educators to Get Involved, Opportunities For Researchers to Get Involved