Of course, computers can be used to play grandmaster-level chess, but can they make scientific discoveries? Researchers from the US Department of Energy's Lawrence Berkeley National Laboratory (Berkeley Lab) have shown that an algorithm with no training in materials science can scan the text of millions of papers and uncover new scientific knowledge.
A group led by Anubhav Jain, a scientist in Berkeley Lab's Energy Storage & Distributed Resources Division, collected 3.3 million abstracts of published materials science papers and fed them into an algorithm called Word2vec. By analyzing relationships between words, the algorithm was able to predict discoveries of new thermoelectric materials years in advance and to suggest as-yet-unknown materials as candidate thermoelectrics.
"Without telling it anything about materials science, it learned concepts like the periodic table and the crystal structure of metals," said Jain. "That hinted at the potential of the technique. But probably the most interesting thing we figured out is that you can use this algorithm to address gaps in materials research, things that people should study but haven't studied so far."
The results were published July 3 in the journal Nature. The lead author of the study, "Unsupervised word embeddings capture latent knowledge from materials science literature," is Vahe Tshitoyan, a Berkeley Lab postdoctoral fellow now working at Google. Along with Jain, Berkeley Lab scientists Kristin Persson and Gerbrand Ceder helped lead the study.
"The paper establishes that text mining of scientific literature can uncover hidden knowledge, and that pure text-based extraction can establish basic scientific knowledge," said Ceder, who also holds an appointment in UC Berkeley's Department of Materials Science and Engineering.
Tshitoyan said the project was motivated by the difficulty of making sense of the overwhelming amount of published studies. "In every field of research, there are 100 years of past research literature, and every week dozens more studies come out," he said.
King - queen + man = ?
The team collected 3.3 million abstracts from papers published in more than 1,000 journals between 1922 and 2018. Word2vec took each of the approximately 500,000 distinct words in these abstracts and turned each one into a 200-dimensional vector, or an array of 200 numbers.
"What's important is not each number in itself, but using the numbers to see how words are related to one another," said Jain, who leads a group working on the discovery and design of new materials for energy applications using a mix of theory, computation, and data mining. "For example, you can subtract vectors using standard vector math. Other researchers have shown that if you train the algorithm on non-scientific text sources and take the vector that results from 'king minus queen,' you get the same result as 'man minus woman.' It figures out the relationship without you telling it anything about it."
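The vector arithmetic Jain describes can be sketched with a few hand-picked toy vectors. Real Word2vec embeddings are 200-dimensional and learned from text; the 2-D vectors and the `analogy` helper below are illustrative assumptions, not the paper's code:

```python
import numpy as np

# Hand-picked 2-D toy "embeddings" on (maleness, royalty) axes.
# Learned embeddings would place words this way automatically.
vectors = {
    "king":  np.array([ 1.0,  1.0]),
    "queen": np.array([-1.0,  1.0]),
    "man":   np.array([ 1.0,  0.0]),
    "woman": np.array([-1.0,  0.0]),
    "apple": np.array([ 0.0, -1.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))  # -> queen
```

With these vectors, king - man + woman lands exactly on queen's position, so the nearest-neighbor lookup recovers the analogy.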
Likewise, the algorithm was able to learn the meaning of scientific terms and concepts such as the crystal structure of metals based simply on the positions of the words in the abstracts and their co-occurrence with other words. For example, just as it could solve the equation "king - queen + man," it could figure out that for the equation "ferromagnetic - NiFe + IrMn," the answer would be "antiferromagnetic."
Word2vec was even able to learn the relationships between elements on the periodic table when the vector for each chemical element was projected onto two dimensions.
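Projecting 200-dimensional embeddings onto two dimensions for plotting is commonly done with principal component analysis. A minimal sketch, assuming PCA via SVD; the element names are real, but the random vectors below are stand-ins for the learned embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for learned element embeddings: 10 elements x 200 dimensions.
elements = ["Li", "Na", "K", "Be", "Mg", "Ca", "F", "Cl", "Br", "I"]
X = rng.normal(size=(len(elements), 200))

# PCA: center the data, then project onto the top two right-singular vectors.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T  # shape (10, 2): one (x, y) point per element

for name, (x, y) in zip(elements, coords):
    print(f"{name}: ({x:+.2f}, {y:+.2f})")
```

With actual embeddings, chemically similar elements (e.g. the alkali metals) end up clustered together in the resulting scatter plot.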
Predicting discoveries years in advance
So if Word2vec is so smart, could it predict novel thermoelectric materials? A good thermoelectric material can efficiently convert heat to electricity and is made of elements that are safe, abundant, and easy to produce. The Berkeley Lab team took the top thermoelectric candidates suggested by the algorithm, which ranked each compound by the similarity of its word vector to the vector for the word "thermoelectric." Then they ran calculations to verify the algorithm's predictions.
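The ranking step, scoring each compound by the cosine similarity of its vector to the vector for "thermoelectric," can be sketched as follows. The compound vectors here are synthetic stand-ins constructed so that known thermoelectrics land near the target; in the actual study the embeddings come from training on the abstracts:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(42)
thermoelectric = rng.normal(size=8)

# Synthetic embeddings: compounds that co-occur with "thermoelectric" in the
# literature would sit near its vector; an unrelated compound sits far away.
compounds = {
    "Bi2Te3": thermoelectric + 0.1 * rng.normal(size=8),
    "PbTe":   thermoelectric + 0.3 * rng.normal(size=8),
    "NaCl":   rng.normal(size=8),
}

ranked = sorted(compounds,
                key=lambda name: cosine(compounds[name], thermoelectric),
                reverse=True)
print(ranked)  # known thermoelectrics rank above the unrelated compound
```

In the study, every candidate compound mentioned in the corpus was scored this way, and the top-ranked ones were passed on to the verification calculations.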
Of the top 10 predictions, all had computed power factors slightly higher than the average of known thermoelectrics; the top three candidates had power factors above the 95th percentile of known thermoelectrics.
Next, they tested whether the algorithm could have made predictions "ahead of time" by feeding it only abstracts published up to, say, the year 2000. Of its top predictions, a significant number appeared in later studies, four times more than if materials had simply been chosen at random. For example, three of the top five predictions trained using data up to 2008 have since been discovered, and the remaining two contain rare or toxic elements.
The results were surprising. "I honestly didn't expect the algorithm to be so predictive of future results," Jain said. "I had thought maybe the algorithm could be descriptive of what people had done before, but not come up with these different connections. I was pretty surprised when I saw not only the predictions, but also the reasoning behind the predictions, things like the half-Heusler structure, which is a really hot crystal structure for thermoelectrics these days."
He added: "This study shows that if this algorithm had been in place earlier, some materials could have been discovered years in advance." Along with the study, the researchers released the top 50 thermoelectric materials predicted by the algorithm. They will also release the word embeddings needed for people to build their own applications if they want to search for, say, a better topological insulator material.
Next, Jain said, the team is working on a smarter, more powerful search engine that lets researchers search abstracts in a more useful way.
The study was funded by the Toyota Research Institute. Other study co-authors are the Berkeley Lab researchers John Dagdelen, Leigh Weston, Alexander Dunn and Ziqin Rong and UC Berkeley researcher Olga Kononova.
Vahe Tshitoyan et al., "Unsupervised word embeddings capture latent knowledge from materials science literature," Nature (2019). DOI: 10.1038/s41586-019-1335-8, https://nature.com/articles/s41586-019-1335-8
With a little training, machine learning algorithms can uncover hidden scientific knowledge (2019, July 3)
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.